2013-10-16 18:16:17

by Benjamin Segall

Subject: [PATCH 0/5] Fix several races in CFS_BANDWIDTH

We've found several races in CFS_BANDWIDTH that cause crashes or stuck tasks.

---

Ben Segall (4):
sched: Fix race on toggling cfs_bandwidth_used
sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
sched: Fix hrtimer_cancel/rq->lock deadlock
sched: Avoid throttle_cfs_rq racing with period_timer stopping

Paul Turner (1):
sched: Guarantee new group-entities always have weight


kernel/sched/core.c | 9 ++++++++-
kernel/sched/debug.c | 8 ++++++++
kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++++---------------
kernel/sched/sched.h | 3 ++-
4 files changed, 53 insertions(+), 17 deletions(-)



2013-10-16 18:16:20

by Benjamin Segall

Subject: [PATCH 2/5] sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining

hrtimer_expires_remaining does not take internal hrtimer locks and thus
must be guarded against concurrent __hrtimer_start_range_ns (but
returning HRTIMER_RESTART is safe). Use cfs_b->lock to make it safe.

Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/fair.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 67c9ecf..b0eba47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3294,7 +3294,13 @@ static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
/* how long we wait to gather additional slack before distributing */
static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC;

-/* are we near the end of the current quota period? */
+/*
+ * Are we near the end of the current quota period?
+ *
+ * Requires cfs_b->lock for hrtimer_expires_remaining to be safe against the
+ * hrtimer base being cleared by __hrtimer_start_range_ns. This is possibly
+ * incorrect if we ever built with CPU_HOTPLUG.
+ */
static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
{
struct hrtimer *refresh_timer = &cfs_b->period_timer;
@@ -3370,10 +3376,12 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
u64 expires;

/* confirm we're still not at a refresh boundary */
- if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
+ raw_spin_lock(&cfs_b->lock);
+ if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
+ raw_spin_unlock(&cfs_b->lock);
return;
+ }

- raw_spin_lock(&cfs_b->lock);
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) {
runtime = cfs_b->runtime;
cfs_b->runtime = 0;

2013-10-16 18:16:34

by Benjamin Segall

Subject: [PATCH 4/5] sched: Guarantee new group-entities always have weight

From: Paul Turner <[email protected]>

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
early use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

put_prev_task(group_cfs_rq(a[0]), t1)
put_prev_entity(..., t1)
check_cfs_rq_runtime(group_cfs_rq(a[0]))
throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

enqueue_task_fair(rq[0], t2)
enqueue_entity(group_cfs_rq(b[0]), t2)
enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
account_entity_enqueue(group_cfs_rq(b[0]), t2)
update_cfs_shares(group_cfs_rq(b[0]))
< skipped because b is part of a throttled hierarchy >
enqueue_entity(group_cfs_rq(a[0]), b[0])
...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.

Signed-off-by: Paul Turner <[email protected]>
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc44cc3..424c294 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7207,7 +7207,8 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
se->cfs_rq = parent->my_q;

se->my_q = cfs_rq;
- update_load_set(&se->load, 0);
+ /* guarantee group entities always have weight */
+ update_load_set(&se->load, NICE_0_LOAD);
se->parent = parent;
}

2013-10-16 18:17:01

by Benjamin Segall

Subject: [PATCH 1/5] sched: Fix race on toggling cfs_bandwidth_used

When we transition cfs_bandwidth_used to false, any currently
throttled groups will incorrectly return false from cfs_rq_throttled.
While tg_set_cfs_bandwidth will unthrottle them eventually, currently
running code (including at least dequeue_task_fair and
distribute_cfs_runtime) will cause errors.

Fix this by turning off cfs_bandwidth_used only after unthrottling all
cfs_rqs.

Tested: toggle bandwidth back and forth on a loaded cgroup. Caused
crashes in minutes without the patch, hasn't crashed with it.

Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/core.c | 9 ++++++++-
kernel/sched/fair.c | 16 +++++++++-------
kernel/sched/sched.h | 3 ++-
3 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c3feeb..2d2ac236 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7433,7 +7433,12 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)

runtime_enabled = quota != RUNTIME_INF;
runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
- account_cfs_bandwidth_used(runtime_enabled, runtime_was_enabled);
+ /*
+ * If we need to toggle cfs_bandwidth_used, off->on must occur
+ * before making related changes, and on->off must occur afterwards
+ */
+ if (runtime_enabled && !runtime_was_enabled)
+ cfs_bandwidth_usage_inc();
raw_spin_lock_irq(&cfs_b->lock);
cfs_b->period = ns_to_ktime(period);
cfs_b->quota = quota;
@@ -7459,6 +7464,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
unthrottle_cfs_rq(cfs_rq);
raw_spin_unlock_irq(&rq->lock);
}
+ if (runtime_was_enabled && !runtime_enabled)
+ cfs_bandwidth_usage_dec();
out_unlock:
mutex_unlock(&cfs_constraints_mutex);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4aa0b10..67c9ecf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2854,13 +2854,14 @@ static inline bool cfs_bandwidth_used(void)
return static_key_false(&__cfs_bandwidth_used);
}

-void account_cfs_bandwidth_used(int enabled, int was_enabled)
+void cfs_bandwidth_usage_inc(void)
{
- /* only need to count groups transitioning between enabled/!enabled */
- if (enabled && !was_enabled)
- static_key_slow_inc(&__cfs_bandwidth_used);
- else if (!enabled && was_enabled)
- static_key_slow_dec(&__cfs_bandwidth_used);
+ static_key_slow_inc(&__cfs_bandwidth_used);
+}
+
+void cfs_bandwidth_usage_dec(void)
+{
+ static_key_slow_dec(&__cfs_bandwidth_used);
}
#else /* HAVE_JUMP_LABEL */
static bool cfs_bandwidth_used(void)
@@ -2868,7 +2869,8 @@ static bool cfs_bandwidth_used(void)
return true;
}

-void account_cfs_bandwidth_used(int enabled, int was_enabled) {}
+void cfs_bandwidth_usage_inc(void) {}
+void cfs_bandwidth_usage_dec(void) {}
#endif /* HAVE_JUMP_LABEL */

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d69cb32..d73761f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1334,7 +1334,8 @@ extern void print_rt_stats(struct seq_file *m, int cpu);
extern void init_cfs_rq(struct cfs_rq *cfs_rq);
extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);

-extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
+extern void cfs_bandwidth_usage_inc(void);
+extern void cfs_bandwidth_usage_dec(void);

#ifdef CONFIG_NO_HZ_COMMON
enum rq_nohz_flag_bits {

2013-10-16 18:22:53

by Benjamin Segall

Subject: [PATCH 3/5] sched: Fix hrtimer_cancel/rq->lock deadlock

__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock.

Fix this by ensuring that cfs_b->timer_active is cleared only if the
_latest_ call to do_sched_cfs_period_timer is returning as idle. Then
__start_cfs_bandwidth can just call hrtimer_try_to_cancel and wait for
that to succeed or timer_active == 1.

Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/fair.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0eba47..fc44cc3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3234,6 +3234,13 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
if (idle)
goto out_unlock;

+ /*
+ * if we have relooped after returning idle once, we need to update our
+ * status as actually running, so that other cpus doing
+ * __start_cfs_bandwidth will stop trying to cancel us.
+ */
+ cfs_b->timer_active = 1;
+
__refill_cfs_bandwidth_runtime(cfs_b);

if (!throttled) {
@@ -3502,11 +3509,11 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
* (timer_active==0 becomes visible before the hrtimer call-back
* terminates). In either case we ensure that it's re-programmed
*/
- while (unlikely(hrtimer_active(&cfs_b->period_timer))) {
+ while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
+ hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
+ /* bounce the lock to allow do_sched_cfs_period_timer to run */
raw_spin_unlock(&cfs_b->lock);
- /* ensure cfs_b->lock is available while we wait */
- hrtimer_cancel(&cfs_b->period_timer);
-
+ cpu_relax();
raw_spin_lock(&cfs_b->lock);
/* if someone else restarted the timer then we're done */
if (cfs_b->timer_active)

2013-10-16 18:41:49

by Peter Zijlstra

Subject: Re: [PATCH 2/5] sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining

On Wed, Oct 16, 2013 at 11:16:17AM -0700, Ben Segall wrote:
> + * Requires cfs_b->lock for hrtimer_expires_remaining to be safe against the
> + * hrtimer base being cleared by __hrtimer_start_range_ns. This is possibly
> + * incorrect if we ever built with CPU_HOTPLUG.

Do clarify; most people build with hotplug enabled; due to wanting
suspend/resume stuff etc..

2013-10-16 18:46:47

by Benjamin Segall

Subject: [PATCH 5/5] sched: Avoid throttle_cfs_rq racing with period_timer stopping

throttle_cfs_rq doesn't check to make sure that period_timer is running,
and while update_curr/assign_cfs_runtime does, a concurrently running
period_timer on another cpu could cancel itself between this cpu's
update_curr and throttle_cfs_rq. If there are no other cfs_rqs running
in the tg to restart the timer, this causes the cfs_rq to be stranded
forever.

Fix this by calling __start_cfs_bandwidth in throttle if the timer is
inactive.

Also add some sched_debug lines for cfs_bandwidth.

Tested: make a run/sleep task in a cgroup, loop switching the cgroup
between 1ms/100ms quota and unlimited, checking for timer_active=0 and
throttled=1 as a failure. With the throttle_cfs_rq change commented out
this fails; with the full patch it passes.

Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/debug.c | 8 ++++++++
kernel/sched/fair.c | 2 ++
2 files changed, 10 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e6ba5e3..5c34d18 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -229,6 +229,14 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
atomic_read(&cfs_rq->tg->runnable_avg));
#endif
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ SEQ_printf(m, " .%-30s: %d\n", "tg->cfs_bandwidth.timer_active",
+ cfs_rq->tg->cfs_bandwidth.timer_active);
+ SEQ_printf(m, " .%-30s: %d\n", "throttled",
+ cfs_rq->throttled);
+ SEQ_printf(m, " .%-30s: %d\n", "throttle_count",
+ cfs_rq->throttle_count);
+#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 424c294..2dc6fbe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3121,6 +3121,8 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
cfs_rq->throttled_clock = rq_clock(rq);
raw_spin_lock(&cfs_b->lock);
list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+ if (!cfs_b->timer_active)
+ __start_cfs_bandwidth(cfs_b);
raw_spin_unlock(&cfs_b->lock);
}

2013-10-16 19:12:56

by Benjamin Segall

Subject: Re: [PATCH 2/5] sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining

Peter Zijlstra <[email protected]> writes:

> On Wed, Oct 16, 2013 at 11:16:17AM -0700, Ben Segall wrote:
>> + * Requires cfs_b->lock for hrtimer_expires_remaining to be safe against the
>> + * hrtimer base being cleared by __hrtimer_start_range_ns. This is possibly
>> + * incorrect if we ever built with CPU_HOTPLUG.
>
> Do clarify; most people build with hotplug enabled; due to wanting
> suspend/resume stuff etc..

Whoops, that comment turned out to be incorrect anyway - I was worried
about migrate_timers, but it is safe as it is. The following should be
accurate:

+ * Requires cfs_b->lock for hrtimer_expires_remaining to be safe against the
+ * hrtimer base being cleared by __hrtimer_start_range_ns. In the case of
+ * migrate_hrtimers, base is never cleared, so we are fine.

2013-10-16 22:01:34

by Peter Zijlstra

Subject: Re: [PATCH 4/5] sched: Guarantee new group-entities always have weight

On Wed, Oct 16, 2013 at 11:16:27AM -0700, Ben Segall wrote:
> From: Paul Turner <[email protected]>
>
> Currently, group entity load-weights are initialized to zero. This
> admits some races with respect to the first time they are re-weighted in
> early use. ( Let g[x] denote the se for "g" on cpu "x". )
>
> Suppose that we have root->a and that a enters a throttled state,
> immediately followed by a[0]->t1 (the only task running on cpu[0])
> blocking:
>
> put_prev_task(group_cfs_rq(a[0]), t1)
> put_prev_entity(..., t1)
> check_cfs_rq_runtime(group_cfs_rq(a[0]))
> throttle_cfs_rq(group_cfs_rq(a[0]))
>
> Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
> time:
>
> enqueue_task_fair(rq[0], t2)
> enqueue_entity(group_cfs_rq(b[0]), t2)
> enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
> account_entity_enqueue(group_cfs_rq(b[0]), t2)
> update_cfs_shares(group_cfs_rq(b[0]))
> < skipped because b is part of a throttled hierarchy >
> enqueue_entity(group_cfs_rq(a[0]), b[0])
> ...
>
> We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
> which violates invariants in several code-paths. Eliminate the
> possibility of this by initializing group entity weight.
>
> Signed-off-by: Paul Turner <[email protected]>
> ---
> kernel/sched/fair.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fc44cc3..424c294 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7207,7 +7207,8 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> se->cfs_rq = parent->my_q;
>
> se->my_q = cfs_rq;
> - update_load_set(&se->load, 0);
> + /* guarantee group entities always have weight */
> + update_load_set(&se->load, NICE_0_LOAD);
> se->parent = parent;
> }

Hurm.. this gives new groups a massive weight; nr_cpus * NICE_0. ISTR
there being some issues with this; or was that on the wakeup path where
a task woke on a cpu whose group entity had '0' load because it used to
run on another cpu -- I can't remember.
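
[For scale: tg->se[cpu] exists per CPU, so with this change a freshly created
group starts out carrying roughly nr_cpus * NICE_0_LOAD of aggregate weight --
on a 64-CPU machine, about 64x the weight of a single nice-0 task -- until the
first enqueue on each CPU re-weights it via update_cfs_shares.]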

But please do expand how this isn't a problem. I suppose for the regular
cgroup case, group creation is a rare event so nobody cares, but
autogroups can come and go far too quickly I think.

2013-10-16 22:06:20

by Peter Zijlstra

Subject: Re: [PATCH 5/5] sched: Avoid throttle_cfs_rq racing with period_timer stopping

On Wed, Oct 16, 2013 at 11:16:32AM -0700, Ben Segall wrote:
> +#ifdef CONFIG_CFS_BANDWIDTH
> + SEQ_printf(m, " .%-30s: %d\n", "tg->cfs_bandwidth.timer_active",
> + cfs_rq->tg->cfs_bandwidth.timer_active);
> + SEQ_printf(m, " .%-30s: %d\n", "throttled",
> + cfs_rq->throttled);
> + SEQ_printf(m, " .%-30s: %d\n", "throttle_count",
> + cfs_rq->throttle_count);
> +#endif

Not really your fault; but that function just cries for something like:

#define SEQ_cfs_rq(m, member) \
SEQ_printf(m, " .%-30s: %ld\n", #member, (long)cfs_rq->member)
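
[For illustration only -- assuming print_cfs_rq()'s local variable is still
named cfs_rq, as in the patch above -- the three new CFS_BANDWIDTH lines
would then collapse to something like:]

#ifdef CONFIG_CFS_BANDWIDTH
	SEQ_cfs_rq(m, tg->cfs_bandwidth.timer_active);
	SEQ_cfs_rq(m, throttled);
	SEQ_cfs_rq(m, throttle_count);
#endif

[The macro stringifies the member expression, so the labels come out as
"tg->cfs_bandwidth.timer_active", "throttled" and "throttle_count", matching
the hand-written SEQ_printf calls they would replace.]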

2013-10-16 22:17:17

by Paul Turner

Subject: Re: [PATCH 4/5] sched: Guarantee new group-entities always have weight

On Wed, Oct 16, 2013 at 3:01 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 16, 2013 at 11:16:27AM -0700, Ben Segall wrote:
>> From: Paul Turner <[email protected]>
>>
>> Currently, group entity load-weights are initialized to zero. This
>> admits some races with respect to the first time they are re-weighted in
>> early use. ( Let g[x] denote the se for "g" on cpu "x". )
>>
>> Suppose that we have root->a and that a enters a throttled state,
>> immediately followed by a[0]->t1 (the only task running on cpu[0])
>> blocking:
>>
>> put_prev_task(group_cfs_rq(a[0]), t1)
>> put_prev_entity(..., t1)
>> check_cfs_rq_runtime(group_cfs_rq(a[0]))
>> throttle_cfs_rq(group_cfs_rq(a[0]))
>>
>> Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
>> time:
>>
>> enqueue_task_fair(rq[0], t2)
>> enqueue_entity(group_cfs_rq(b[0]), t2)
>> enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
>> account_entity_enqueue(group_cfs_rq(b[0]), t2)
>> update_cfs_shares(group_cfs_rq(b[0]))
>> < skipped because b is part of a throttled hierarchy >
>> enqueue_entity(group_cfs_rq(a[0]), b[0])
>> ...
>>
>> We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
>> which violates invariants in several code-paths. Eliminate the
>> possibility of this by initializing group entity weight.
>>
>> Signed-off-by: Paul Turner <[email protected]>
>> ---
>> kernel/sched/fair.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fc44cc3..424c294 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7207,7 +7207,8 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
>> se->cfs_rq = parent->my_q;
>>
>> se->my_q = cfs_rq;
>> - update_load_set(&se->load, 0);
>> + /* guarantee group entities always have weight */
>> + update_load_set(&se->load, NICE_0_LOAD);
>> se->parent = parent;
>> }
>
> Hurm.. this gives new groups a massive weight; nr_cpus * NICE_0. ISTR
> there being some issues with this; or was that on the wakeup path where
> a task woke on a cpu whose group entity had '0' load because it used to
> run on another cpu -- I can't remember.

This could also arbitrarily be MIN_WEIGHT.

I don't think it actually matters, in practice the set of conditions
for this weight to ever see use are very specific (e.g. the race
described above). Otherwise it's always going to be re-initialized (on
first actual enqueue) to the right value. (NICE_0_LOAD seemed to make
sense since this is what you'd "expect" for a new, single thread,
autogroup/group.)

>
> But please do expand how this isn't a problem. I suppose for the regular
> cgroup case, group creation is a rare event so nobody cares, but
> autogroups can come and go far too quickly I think.
>

This isn't typically a problem since the first enqueue will almost
universally set a weight.

An alternative considered was to remove the throttled-hierarchy check
in the re-weight; however it seemed better to make this actually an
invariant (e.g. an entity ALWAYS has some weight).
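
[For reference, the throttled-hierarchy check being discussed is the early
return in update_cfs_shares(); a rough sketch of that alternative against
fair.c of this era -- not something actually posted in this series -- would
look roughly like:]

 	se = tg->se[cpu_of(rq_of(cfs_rq))];
-	if (!se || throttled_hierarchy(cfs_rq))
+	if (!se)
 		return;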

2013-10-16 22:22:36

by Peter Zijlstra

Subject: Re: [PATCH 4/5] sched: Guarantee new group-entities always have weight

On Wed, Oct 16, 2013 at 03:16:44PM -0700, Paul Turner wrote:
> > Hurm.. this gives new groups a massive weight; nr_cpus * NICE_0. ISTR
> > there being some issues with this; or was that on the wakeup path where
> > a task woke on a cpu whose group entity had '0' load because it used to
> > run on another cpu -- I can't remember.
>
> This could also arbitrarily be MIN_WEIGHT.
>
> I don't think it actually matters, in practice the set of conditions
> for this weight to ever see use are very specific (e.g. the race
> described above). Otherwise it's always going to be re-initialized (on
> first actual enqueue) to the right value. (NICE_0_LOAD seemed to make
> sense since this is what you'd "expect" for a new, single thread,
> autogroup/group.)

Fair enough. Thanks!

2013-10-16 22:40:55

by Benjamin Segall

Subject: Re: [PATCH 4/5] sched: Guarantee new group-entities always have weight

Peter Zijlstra <[email protected]> writes:

> On Wed, Oct 16, 2013 at 11:16:27AM -0700, Ben Segall wrote:
>> From: Paul Turner <[email protected]>
>>
>> Currently, group entity load-weights are initialized to zero. This
>> admits some races with respect to the first time they are re-weighted in
>> early use. ( Let g[x] denote the se for "g" on cpu "x". )
>>
>> Suppose that we have root->a and that a enters a throttled state,
>> immediately followed by a[0]->t1 (the only task running on cpu[0])
>> blocking:
>>
>> put_prev_task(group_cfs_rq(a[0]), t1)
>> put_prev_entity(..., t1)
>> check_cfs_rq_runtime(group_cfs_rq(a[0]))
>> throttle_cfs_rq(group_cfs_rq(a[0]))
>>
>> Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
>> time:
>>
>> enqueue_task_fair(rq[0], t2)
>> enqueue_entity(group_cfs_rq(b[0]), t2)
>> enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
>> account_entity_enqueue(group_cfs_rq(b[0]), t2)
>> update_cfs_shares(group_cfs_rq(b[0]))
>> < skipped because b is part of a throttled hierarchy >
>> enqueue_entity(group_cfs_rq(a[0]), b[0])
>> ...
>>
>> We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
>> which violates invariants in several code-paths. Eliminate the
>> possibility of this by initializing group entity weight.
>>
>> Signed-off-by: Paul Turner <[email protected]>
>> ---
>> kernel/sched/fair.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fc44cc3..424c294 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7207,7 +7207,8 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
>> se->cfs_rq = parent->my_q;
>>
>> se->my_q = cfs_rq;
>> - update_load_set(&se->load, 0);
>> + /* guarantee group entities always have weight */
>> + update_load_set(&se->load, NICE_0_LOAD);
>> se->parent = parent;
>> }
>
> Hurm.. this gives new groups a massive weight; nr_cpus * NICE_0. ISTR
> there being some issues with this; or was that on the wakeup path where
> a task woke on a cpu whose group entity had '0' load because it used to
> run on another cpu -- I can't remember.
>
> But please do expand how this isn't a problem. I suppose for the regular
> cgroup case, group creation is a rare event so nobody cares, but
> autogroups can come and go far too quickly I think.

I wouldn't expect this to be a problem in the common case because the
first enqueue onto one of the new group's tg->cfs_rq[cpu] will
cause an update_cfs_shares(tg->cfs_rq[cpu]), which will correct it (and
this is before the new group gets to enqueue_entity(... tg->se[cpu], ...) or
anything, so placement shouldn't be an issue). I don't think anything
cares about the weights of a !on_rq se, so it shouldn't be an issue
until enqueue.

Now, that said, in the racing case Paul wrote up, the update_cfs_shares
could get skipped, and unthrottle wouldn't fix the weight either, so
you'd wind up with the wrong weight until another enqueue/dequeue or
tick with it as current happened. I suppose this could be fixed by doing
an update_cfs_shares on unthrottle (or just removing the restriction on
update_cfs_shares, if it seems to be more trouble than it's worth).
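
[Purely as a sketch of that first option -- hypothetical, not part of this
series, and assuming unthrottle_cfs_rq()'s existing re-enqueue walk up the
entity hierarchy -- the re-weight could be folded into that walk, roughly:]

	for_each_sched_entity(se) {
		struct cfs_rq *qcfs_rq = cfs_rq_of(se);

		if (!se->on_rq)
			enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
		/* hypothetical: re-weight now that the hierarchy is unthrottled */
		update_cfs_shares(qcfs_rq);
	}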

It's possible the old walk_tg_tree based and ratelimited computation of
h_load might have had issues, but the new code looks safe since it won't
ratelimit, and in order to do an h_load computation you'll need a task
in the group, and that requires enqueue_entity->update_cfs_shares.

Subject: [tip:sched/core] sched: Fix race on toggling cfs_bandwidth_used

Commit-ID: 1ee14e6c8cddeeb8a490d7b54cd9016e4bb900b4
Gitweb: http://git.kernel.org/tip/1ee14e6c8cddeeb8a490d7b54cd9016e4bb900b4
Author: Ben Segall <[email protected]>
AuthorDate: Wed, 16 Oct 2013 11:16:12 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 29 Oct 2013 12:02:19 +0100

sched: Fix race on toggling cfs_bandwidth_used

When we transition cfs_bandwidth_used to false, any currently
throttled groups will incorrectly return false from cfs_rq_throttled.
While tg_set_cfs_bandwidth will unthrottle them eventually, currently
running code (including at least dequeue_task_fair and
distribute_cfs_runtime) will cause errors.

Fix this by turning off cfs_bandwidth_used only after unthrottling all
cfs_rqs.

Tested: toggle bandwidth back and forth on a loaded cgroup. Caused
crashes in minutes without the patch, hasn't crashed with it.

Signed-off-by: Ben Segall <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/20131016181611.22647.80365.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 9 ++++++++-
kernel/sched/fair.c | 16 +++++++++-------
kernel/sched/sched.h | 3 ++-
3 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7c61f31..450a34b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7436,7 +7436,12 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)

runtime_enabled = quota != RUNTIME_INF;
runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
- account_cfs_bandwidth_used(runtime_enabled, runtime_was_enabled);
+ /*
+ * If we need to toggle cfs_bandwidth_used, off->on must occur
+ * before making related changes, and on->off must occur afterwards
+ */
+ if (runtime_enabled && !runtime_was_enabled)
+ cfs_bandwidth_usage_inc();
raw_spin_lock_irq(&cfs_b->lock);
cfs_b->period = ns_to_ktime(period);
cfs_b->quota = quota;
@@ -7462,6 +7467,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
unthrottle_cfs_rq(cfs_rq);
raw_spin_unlock_irq(&rq->lock);
}
+ if (runtime_was_enabled && !runtime_enabled)
+ cfs_bandwidth_usage_dec();
out_unlock:
mutex_unlock(&cfs_constraints_mutex);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 813dd61..ebd187f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2845,13 +2845,14 @@ static inline bool cfs_bandwidth_used(void)
return static_key_false(&__cfs_bandwidth_used);
}

-void account_cfs_bandwidth_used(int enabled, int was_enabled)
+void cfs_bandwidth_usage_inc(void)
{
- /* only need to count groups transitioning between enabled/!enabled */
- if (enabled && !was_enabled)
- static_key_slow_inc(&__cfs_bandwidth_used);
- else if (!enabled && was_enabled)
- static_key_slow_dec(&__cfs_bandwidth_used);
+ static_key_slow_inc(&__cfs_bandwidth_used);
+}
+
+void cfs_bandwidth_usage_dec(void)
+{
+ static_key_slow_dec(&__cfs_bandwidth_used);
}
#else /* HAVE_JUMP_LABEL */
static bool cfs_bandwidth_used(void)
@@ -2859,7 +2860,8 @@ static bool cfs_bandwidth_used(void)
return true;
}

-void account_cfs_bandwidth_used(int enabled, int was_enabled) {}
+void cfs_bandwidth_usage_inc(void) {}
+void cfs_bandwidth_usage_dec(void) {}
#endif /* HAVE_JUMP_LABEL */

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ffc7087..4e650ac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1352,7 +1352,8 @@ extern void print_rt_stats(struct seq_file *m, int cpu);
extern void init_cfs_rq(struct cfs_rq *cfs_rq);
extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);

-extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
+extern void cfs_bandwidth_usage_inc(void);
+extern void cfs_bandwidth_usage_dec(void);

#ifdef CONFIG_NO_HZ_COMMON
enum rq_nohz_flag_bits {

Subject: [tip:sched/core] sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining

Commit-ID: db06e78cc13d70f10877e0557becc88ab3ad2be8
Gitweb: http://git.kernel.org/tip/db06e78cc13d70f10877e0557becc88ab3ad2be8
Author: Ben Segall <[email protected]>
AuthorDate: Wed, 16 Oct 2013 11:16:17 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 29 Oct 2013 12:02:20 +0100

sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining

hrtimer_expires_remaining does not take internal hrtimer locks and thus
must be guarded against concurrent __hrtimer_start_range_ns (but
returning HRTIMER_RESTART is safe). Use cfs_b->lock to make it safe.

Signed-off-by: Ben Segall <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/20131016181617.22647.73829.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebd187f..897d977 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3285,7 +3285,13 @@ static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
/* how long we wait to gather additional slack before distributing */
static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC;

-/* are we near the end of the current quota period? */
+/*
+ * Are we near the end of the current quota period?
+ *
+ * Requires cfs_b->lock for hrtimer_expires_remaining to be safe against the
+ * hrtimer base being cleared by __hrtimer_start_range_ns. In the case of
+ * migrate_hrtimers, base is never cleared, so we are fine.
+ */
static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
{
struct hrtimer *refresh_timer = &cfs_b->period_timer;
@@ -3361,10 +3367,12 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
u64 expires;

/* confirm we're still not at a refresh boundary */
- if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
+ raw_spin_lock(&cfs_b->lock);
+ if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
+ raw_spin_unlock(&cfs_b->lock);
return;
+ }

- raw_spin_lock(&cfs_b->lock);
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) {
runtime = cfs_b->runtime;
cfs_b->runtime = 0;

Subject: [tip:sched/core] sched: Fix hrtimer_cancel()/rq->lock deadlock

Commit-ID: 927b54fccbf04207ec92f669dce6806848cbec7d
Gitweb: http://git.kernel.org/tip/927b54fccbf04207ec92f669dce6806848cbec7d
Author: Ben Segall <[email protected]>
AuthorDate: Wed, 16 Oct 2013 11:16:22 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 29 Oct 2013 12:02:21 +0100

sched: Fix hrtimer_cancel()/rq->lock deadlock

__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock.

Fix this by ensuring that cfs_b->timer_active is cleared only if the
_latest_ call to do_sched_cfs_period_timer is returning as idle. Then
__start_cfs_bandwidth can just call hrtimer_try_to_cancel and wait for
that to succeed or timer_active == 1.

Signed-off-by: Ben Segall <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/20131016181622.22647.16643.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 897d977..f6308cb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3225,6 +3225,13 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
if (idle)
goto out_unlock;

+ /*
+ * if we have relooped after returning idle once, we need to update our
+ * status as actually running, so that other cpus doing
+ * __start_cfs_bandwidth will stop trying to cancel us.
+ */
+ cfs_b->timer_active = 1;
+
__refill_cfs_bandwidth_runtime(cfs_b);

if (!throttled) {
@@ -3493,11 +3500,11 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
* (timer_active==0 becomes visible before the hrtimer call-back
* terminates). In either case we ensure that it's re-programmed
*/
- while (unlikely(hrtimer_active(&cfs_b->period_timer))) {
+ while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
+ hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
+ /* bounce the lock to allow do_sched_cfs_period_timer to run */
raw_spin_unlock(&cfs_b->lock);
- /* ensure cfs_b->lock is available while we wait */
- hrtimer_cancel(&cfs_b->period_timer);
-
+ cpu_relax();
raw_spin_lock(&cfs_b->lock);
/* if someone else restarted the timer then we're done */
if (cfs_b->timer_active)

Subject: [tip:sched/core] sched: Avoid throttle_cfs_rq() racing with period_timer stopping

Commit-ID: f9f9ffc237dd924f048204e8799da74f9ecf40cf
Gitweb: http://git.kernel.org/tip/f9f9ffc237dd924f048204e8799da74f9ecf40cf
Author: Ben Segall <[email protected]>
AuthorDate: Wed, 16 Oct 2013 11:16:32 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 29 Oct 2013 12:02:32 +0100

sched: Avoid throttle_cfs_rq() racing with period_timer stopping

throttle_cfs_rq() doesn't check to make sure that period_timer is running,
and while update_curr/assign_cfs_runtime does, a concurrently running
period_timer on another cpu could cancel itself between this cpu's
update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running
in the tg to restart the timer, this causes the cfs_rq to be stranded
forever.

Fix this by calling __start_cfs_bandwidth() in throttle if the timer is
inactive.

(Also add some sched_debug lines for cfs_bandwidth.)

Tested: make a run/sleep task in a cgroup, loop switching the cgroup
between 1ms/100ms quota and unlimited, checking for timer_active=0 and
throttled=1 as a failure. With the throttle_cfs_rq() change commented out
this fails; with the full patch it passes.

Signed-off-by: Ben Segall <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/debug.c | 8 ++++++++
kernel/sched/fair.c | 2 ++
2 files changed, 10 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e6ba5e3..5c34d18 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -229,6 +229,14 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
atomic_read(&cfs_rq->tg->runnable_avg));
#endif
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ SEQ_printf(m, " .%-30s: %d\n", "tg->cfs_bandwidth.timer_active",
+ cfs_rq->tg->cfs_bandwidth.timer_active);
+ SEQ_printf(m, " .%-30s: %d\n", "throttled",
+ cfs_rq->throttled);
+ SEQ_printf(m, " .%-30s: %d\n", "throttle_count",
+ cfs_rq->throttle_count);
+#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0923ab2..41c02b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3112,6 +3112,8 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
cfs_rq->throttled_clock = rq_clock(rq);
raw_spin_lock(&cfs_b->lock);
list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+ if (!cfs_b->timer_active)
+ __start_cfs_bandwidth(cfs_b);
raw_spin_unlock(&cfs_b->lock);
}

Subject: [tip:sched/core] sched: Guarantee new group-entities always have weight

Commit-ID: 0ac9b1c21874d2490331233b3242085f8151e166
Gitweb: http://git.kernel.org/tip/0ac9b1c21874d2490331233b3242085f8151e166
Author: Paul Turner <[email protected]>
AuthorDate: Wed, 16 Oct 2013 11:16:27 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 29 Oct 2013 12:02:23 +0100

sched: Guarantee new group-entities always have weight

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
early use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

put_prev_task(group_cfs_rq(a[0]), t1)
put_prev_entity(..., t1)
check_cfs_rq_runtime(group_cfs_rq(a[0]))
throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

enqueue_task_fair(rq[0], t2)
enqueue_entity(group_cfs_rq(b[0]), t2)
enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
account_entity_enqueue(group_cfs_rq(b[0]), t2)
update_cfs_shares(group_cfs_rq(b[0]))
< skipped because b is part of a throttled hierarchy >
enqueue_entity(group_cfs_rq(a[0]), b[0])
...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.

Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f6308cb..0923ab2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7198,7 +7198,8 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
se->cfs_rq = parent->my_q;

se->my_q = cfs_rq;
- update_load_set(&se->load, 0);
+ /* guarantee group entities always have weight */
+ update_load_set(&se->load, NICE_0_LOAD);
se->parent = parent;
}