2015-04-03 12:41:45

by Konstantin Khlebnikov

Subject: [PATCH RFC] sched/fair: fix sudden expiration of cfs quota in put_prev_task()

pick_next_task_fair() must be sure that there is at least one runnable
task before calling put_prev_task(), but put_prev_task() can expire the
last remains of the cfs quota and throttle all currently runnable tasks.
As a result pick_next_task_fair() cannot find a next task and crashes.

This patch leaves 1 in ->runtime_remaining when the current assignment
expires and tries to refill it right after that. In the worst case the
task will be scheduled once and throttled at the end of the slice.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
kernel/sched/fair.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..91785d077db4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3447,11 +3447,12 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);

- /* if the deadline is ahead of our clock, nothing to do */
- if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
+ /* nothing to expire */
+ if (cfs_rq->runtime_remaining <= 0)
return;

- if (cfs_rq->runtime_remaining < 0)
+ /* if the deadline is ahead of our clock, nothing to do */
+ if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
return;

/*
@@ -3469,8 +3470,14 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;
} else {
- /* global deadline is ahead, expiration has passed */
- cfs_rq->runtime_remaining = 0;
+ /*
+ * Global deadline is ahead, expiration has passed.
+ *
+ * Do not expire runtime completely. Otherwise put_prev_task()
+ * can throttle all tasks when we already checked nr_running or
+ * put_prev_entity() can throttle already chosen next entity.
+ */
+ cfs_rq->runtime_remaining = 1;
}
}

@@ -3480,7 +3487,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
cfs_rq->runtime_remaining -= delta_exec;
expire_cfs_rq_runtime(cfs_rq);

- if (likely(cfs_rq->runtime_remaining > 0))
+ if (likely(cfs_rq->runtime_remaining > 1))
return;

/*


2015-04-03 12:51:23

by Konstantin Khlebnikov

Subject: Re: [PATCH RFC] sched/fair: fix sudden expiration of cfs quota in put_prev_task()

On 03.04.2015 15:41, Konstantin Khlebnikov wrote:
> Pick_next_task_fair() must be sure that here is at least one runnable
> task before calling put_prev_task(), but put_prev_task() can expire
> last remains of cfs quota and throttle all currently runnable tasks.
> As a result pick_next_task_fair() cannot find next task and crashes.

Kernel crash looks like this:

<1>[50288.719491] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000038
<1>[50288.719538] IP: [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.719567] PGD 0
<4>[50288.719578] Oops: 0000 [#1] SMP
<4>[50288.719594] Modules linked in: vhost_net macvtap macvlan vhost
8021q mrp garp ip6table_filter ip6_tables nf_conntrack_ipv4
nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4
xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables
bridge stp llc netconsole configfs x86_pkg_temp_thermal intel_powerclamp
coretemp kvm_intel kvm mgag200 crc32_pclmul ghash_clmulni_intel
aesni_intel ablk_helper cryptd lrw ttm gf128mul drm_kms_helper drm
glue_helper aes_x86_64 i2c_algo_bit sysimgblt sysfillrect i2c_core
sb_edac edac_core syscopyarea microcode ipmi_si ipmi_msghandler lpc_ich
ioatdma dca mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp
e1000e ptp pps_core ahci libahci raid10 raid456 async_pq async_xor xor
async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0
multipath<4>[50288.719956] linear
<4>[50288.719964] CPU: 27 PID: 11505 Comm: kvm Not tainted 3.18.10-7 #7
<4>[50288.719987] Hardware name:
<4>[50288.720015] task: ffff880036acbaa0 ti: ffff8808445f8000 task.ti:
ffff8808445f8000
<4>[50288.720041] RIP: 0010:[<ffffffff81097b8c>] [<ffffffff81097b8c>]
set_next_entity+0x1c/0x80
<4>[50288.720072] RSP: 0018:ffff8808445fbbb8 EFLAGS: 00010086
<4>[50288.720091] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
000000000000bcb8
<4>[50288.720116] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff88107fd72af0
<4>[50288.720141] RBP: ffff8808445fbbd8 R08: 0000000000000000 R09:
0000000000000001
<4>[50288.720165] R10: 0000000000000000 R11: 0000000000000001 R12:
0000000000000000
<4>[50288.720190] R13: 0000000000000000 R14: ffff880b6f250030 R15:
ffff88107fd72af0
<4>[50288.720214] FS: 00007f55467fc700(0000) GS:ffff88107fd60000(0000)
knlGS:ffff8802175e0000
<4>[50288.720242] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[50288.720262] CR2: 0000000000000038 CR3: 0000000324ede000 CR4:
00000000000427e0
<4>[50288.720287] Stack:
<4>[50288.720296] ffff88107fd72a80 ffff88107fd72a80 0000000000000000
0000000000000000
<4>[50288.720327] ffff8808445fbc68 ffffffff8109ead8 ffff880800000000
ffffffffa1438990
<4>[50288.720357] ffff880b6f250000 0000000000000000 0000000000012a80
ffff880036acbaa0
<4>[50288.720388] Call Trace:
<4>[50288.720402] [<ffffffff8109ead8>] pick_next_task_fair+0x88/0x5d0
<4>[50288.720429] [<ffffffffa1438990>] ?
vmx_fpu_activate.part.63+0x90/0xb0 [kvm_intel]
<4>[50288.720457] [<ffffffff81096b95>] ? sched_clock_cpu+0x85/0xc0
<4>[50288.720479] [<ffffffff816b5b99>] __schedule+0xf9/0x7d0
<4>[50288.720500] [<ffffffff816bb210>] ? reboot_interrupt+0x80/0x80
<4>[50288.720522] [<ffffffff816b630a>] _cond_resched+0x2a/0x40
<4>[50288.720549] [<ffffffffa03dd8c5>] __vcpu_run+0xd35/0xf30 [kvm]
<4>[50288.720573] [<ffffffff81075fc7>] ? __set_task_blocked+0x37/0x80
<4>[50288.720595] [<ffffffff8109387e>] ? try_to_wake_up+0x21e/0x360
<4>[50288.720622] [<ffffffffa03ddb65>]
kvm_arch_vcpu_ioctl_run+0xa5/0x220 [kvm]
<4>[50288.720650] [<ffffffffa03c48b2>] kvm_vcpu_ioctl+0x2c2/0x620 [kvm]
<4>[50288.720675] [<ffffffff811c01c6>] do_vfs_ioctl+0x86/0x4f0
<4>[50288.720697] [<ffffffff810d14a2>] ? SyS_futex+0x142/0x1a0
<4>[50288.720717] [<ffffffff811c06c1>] SyS_ioctl+0x91/0xb0
<4>[50288.720737] [<ffffffff816ba489>] system_call_fastpath+0x12/0x17
<4>[50288.720758] Code: c7 47 60 00 00 00 00 45 31 c0 e9 0c ff ff ff 66
66 66 66 90 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 48 89 f3 4c
89 6d f8 <44> 8b 4e 38 49 89 fc 45 85 c9 74 17 4c 8d 6e 10 4c 39 6f 30 74
<1>[50288.722636] RIP [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.723533] RSP <ffff8808445fbbb8>
<4>[50288.724406] CR2: 0000000000000038

In pick_next_task_fair() cfs_rq->nr_running was non-zero, but after
put_prev_task(rq, prev) the kernel cannot find any task to schedule next.

It crashes from time to time on a strange libvirt/kvm setup where
cfs_quota is set at two levels: at the parent cgroup which contains kvm,
and at the per-vcpu child cgroups.

This patch isn't verified yet,
but I haven't found any other possible reason for that crash.

>
> This patch leaves 1 in ->runtime_remaining when current assignation
> expires and tries to refill it right after that. In the worst case
> task will be scheduled once and throttled at the end of slice.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> ---
> kernel/sched/fair.c | 19 +++++++++++++------
> 1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce18f3c097a..91785d077db4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3447,11 +3447,12 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> {
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>
> - /* if the deadline is ahead of our clock, nothing to do */
> - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> + /* nothing to expire */
> + if (cfs_rq->runtime_remaining <= 0)
> return;
>
> - if (cfs_rq->runtime_remaining < 0)
> + /* if the deadline is ahead of our clock, nothing to do */
> + if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> return;
>
> /*
> @@ -3469,8 +3470,14 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> /* extend local deadline, drift is bounded above by 2 ticks */
> cfs_rq->runtime_expires += TICK_NSEC;
> } else {
> - /* global deadline is ahead, expiration has passed */
> - cfs_rq->runtime_remaining = 0;
> + /*
> + * Global deadline is ahead, expiration has passed.
> + *
> + * Do not expire runtime completely. Otherwise put_prev_task()
> + * can throttle all tasks when we already checked nr_running or
> + * put_prev_entity() can throttle already chosen next entity.
> + */
> + cfs_rq->runtime_remaining = 1;
> }
> }
>
> @@ -3480,7 +3487,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> cfs_rq->runtime_remaining -= delta_exec;
> expire_cfs_rq_runtime(cfs_rq);
>
> - if (likely(cfs_rq->runtime_remaining > 0))
> + if (likely(cfs_rq->runtime_remaining > 1))
> return;
>
> /*
>


--
Konstantin

2015-04-06 22:45:30

by Benjamin Segall

Subject: Re: [PATCH RFC] sched/fair: fix sudden expiration of cfs quota in put_prev_task()

Konstantin Khlebnikov <[email protected]> writes:

> Pick_next_task_fair() must be sure that here is at least one runnable
> task before calling put_prev_task(), but put_prev_task() can expire
> last remains of cfs quota and throttle all currently runnable tasks.
> As a result pick_next_task_fair() cannot find next task and crashes.
>
> This patch leaves 1 in ->runtime_remaining when current assignation
> expires and tries to refill it right after that. In the worst case
> task will be scheduled once and throttled at the end of slice.
>

I don't think expire_cfs_rq_runtime is the problem. What I believe
happens is this:

/prev/some_task is running, calls schedule() with nr_running == 2.
pick_next's first do/while loop does update_curr(/) and picks /next, and
the next iteration just sees check_cfs_rq_runtime(/next), and thus does
goto simple. However, there is now only /prev/some_task runnable, and it
hasn't checked the entire prev hierarchy for throttling, thus leading to
the crash.

This would require that check_cfs_rq_runtime(/next) return true despite
being on_rq though, which iirc is not supposed to happen (note that we
do not call update_curr(/next), and it would do nothing if we did,
because /next isn't part of the current thread's hierarchy). However,
this /can/ happen if runtime has just been (re)enabled on /next, because
tg_set_cfs_bandwidth sets runtime_remaining to 0, not 1.

The idea was that each rq would grab runtime when they were scheduled
(pick_next_task_fair didn't ever look at throttling info), so this was
fine with the old code, but is a problem now. I think it would be
sufficient to just initialize to 1 in tg_set_cfs_bandwidth. The arguably
more precise option would be to only check_cfs_rq_runtime if
cfs_rq->curr is set, but the code is slightly less pretty.

Could you check this patch to see if it works (or the trivial
tg_set_bandwidth runtime_remaining = 1 patch)?

---8<----->8---

From f12fa8e981bf1d87cbbc30951bdf27e70c803e25 Mon Sep 17 00:00:00 2001
From: Ben Segall <[email protected]>
Date: Mon, 6 Apr 2015 15:28:10 -0700
Subject: [PATCH] sched: prevent throttle in early pick_next_task_fair

The first call to check_cfs_rq_runtime in pick_next_task_fair is only supposed
to trigger when cfs_rq is still an ancestor of prev. However, it was able to
trigger on tgs that had just had bandwidth toggled, because tg_set_cfs_bandwidth
set runtime_remaining to 0, and check_cfs_rq_runtime doesn't check the global
pool.

Fix this by only calling check_cfs_rq_runtime if we are still in prev's
ancestry, as evidenced by cfs_rq->curr.

Reported-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/fair.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee595ef..5cb52e9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5038,18 +5038,21 @@ again:
* entity, update_curr() will update its vruntime, otherwise
* forget we've ever seen it.
*/
- if (curr && curr->on_rq)
- update_curr(cfs_rq);
- else
- curr = NULL;
+ if (curr) {
+ if (curr->on_rq)
+ update_curr(cfs_rq);
+ else
+ curr = NULL;

- /*
- * This call to check_cfs_rq_runtime() will do the throttle and
- * dequeue its entity in the parent(s). Therefore the 'simple'
- * nr_running test will indeed be correct.
- */
- if (unlikely(check_cfs_rq_runtime(cfs_rq)))
- goto simple;
+ /*
+ * This call to check_cfs_rq_runtime() will do the
+ * throttle and dequeue its entity in the parent(s).
+ * Therefore the 'simple' nr_running test will indeed
+ * be correct.
+ */
+ if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+ goto simple;
+ }

se = pick_next_entity(cfs_rq, curr);
cfs_rq = group_cfs_rq(se);
--
2.2.0.rc0.207.ga3a616c

2015-04-07 12:53:13

by Peter Zijlstra

Subject: Re: [PATCH RFC] sched/fair: fix sudden expiration of cfs quota in put_prev_task()

On Fri, Apr 03, 2015 at 03:51:17PM +0300, Konstantin Khlebnikov wrote:
> On 03.04.2015 15:41, Konstantin Khlebnikov wrote:
> >Pick_next_task_fair() must be sure that here is at least one runnable
> >task before calling put_prev_task(), but put_prev_task() can expire
> >last remains of cfs quota and throttle all currently runnable tasks.
> >As a result pick_next_task_fair() cannot find next task and crashes.
>
> Kernel crash looks like this:
>
> <1>[50288.719491] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> <1>[50288.719538] IP: [<ffffffff81097b8c>] set_next_entity+0x1c/0x80

> <4>[50288.720388] Call Trace:
> <4>[50288.720402] [<ffffffff8109ead8>] pick_next_task_fair+0x88/0x5d0
> <4>[50288.720479] [<ffffffff816b5b99>] __schedule+0xf9/0x7d0

Which set_next_entity() is that? There are 3 in pick_next_task_fair().

I have a vague suspicion it's in the 'simple' code, please verify.

The thinking is that if it was the 'complex' pick_next_entity()
returning NULL we'd have exploded elsewhere, the cfs_rq iteration
would've wandered off into random memory and most likely exploded on
cfs_rq->curr.

Which would also suggest the check_cfs_rq_runtime() thing works just
fine; it sent us to the simple code.

> >This patch leaves 1 in ->runtime_remaining when current assignation
> >expires and tries to refill it right after that. In the worst case
> >task will be scheduled once and throttled at the end of slice.

Which is a strange approach. If pick_next_task_fair() is borken, we
should fix that, no?

In any case, it appears to me that: 606dba2e2894 ("sched: Push
put_prev_task() into pick_next_task()") inverted the ->nr_running and
put_prev_task() statements.

If the above set_next_entity() is indeed the simple one, does the below
cure things?

---
kernel/sched/fair.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26eb7218..df72d61138a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5176,12 +5176,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
simple:
cfs_rq = &rq->cfs;
#endif
+ put_prev_task(rq, prev);

if (!cfs_rq->nr_running)
goto idle;

- put_prev_task(rq, prev);
-
do {
se = pick_next_entity(cfs_rq, NULL);
set_next_entity(cfs_rq, se);

2015-04-07 13:48:11

by Peter Zijlstra

Subject: Re: [PATCH RFC] sched/fair: fix sudden expiration of cfs quota in put_prev_task()

On Tue, Apr 07, 2015 at 02:52:51PM +0200, Peter Zijlstra wrote:
> If the above set_next_entity() is indeed the simple one, does the below
> cure things?
>
> ---
> kernel/sched/fair.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fdae26eb7218..df72d61138a8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5176,12 +5176,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> simple:
> cfs_rq = &rq->cfs;
> #endif
> + put_prev_task(rq, prev);
>
> if (!cfs_rq->nr_running)
> goto idle;
>
> - put_prev_task(rq, prev);
> -
> do {
> se = pick_next_entity(cfs_rq, NULL);
> set_next_entity(cfs_rq, se);

Bah, that's broken because if we end up going idle pick_next_task_idle()
is going to do put_prev_task() again.

Lemme think a bit more on that.

2015-04-07 15:13:00

by Peter Zijlstra

Subject: Re: [PATCH RFC] sched/fair: fix sudden expiration of cfs quota in put_prev_task()

On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote:
> Lemme think a bit more on that.

So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from
empty queue") something like this would be called for.

---
kernel/sched/fair.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26eb7218..1e47f816c976 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
return p;
simple:
cfs_rq = &rq->cfs;
+
+ if (cfs_bandwidth_used()) {
+ /*
+ * We may dequeue prev's cfs_rq in put_prev_task().
+ * So, we update time before cfs_rq->nr_running check.
+ */
+ if (prev->on_rq)
+ update_curr(cfs_rq);
+
+ se = &prev->se;
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ check_cfs_rq_runtime(cfs_rq);
+ }
+ }
#endif

if (!cfs_rq->nr_running)

2015-04-07 15:32:54

by Konstantin Khlebnikov

Subject: Re: [PATCH RFC] sched/fair: fix sudden expiration of cfs quota in put_prev_task()

On 07.04.2015 18:12, Peter Zijlstra wrote:
> On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote:
>> Lemme think a bit more on that.
>
> So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from
> empty queue") something like this would be called for.
>
> ---
> kernel/sched/fair.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fdae26eb7218..1e47f816c976 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> return p;
> simple:
> cfs_rq = &rq->cfs;

We don't need this if prev isn't from fair_sched_class:
first "goto simple" should go directly to "if (!cfs_rq->nr_running)".

> +
> + if (cfs_bandwidth_used()) {
> + /*
> + * We may dequeue prev's cfs_rq in put_prev_task().
> + * So, we update time before cfs_rq->nr_running check.
> + */
> + if (prev->on_rq)
> + update_curr(cfs_rq);
> +
> + se = &prev->se;
> + for_each_sched_entity(se) {

Probably update_curr() should be inside the loop too?

> + cfs_rq = cfs_rq_of(se);
> + check_cfs_rq_runtime(cfs_rq);
> + }
> + }
> #endif
>
> if (!cfs_rq->nr_running)
>


--
Konstantin

2015-04-07 15:43:18

by Peter Zijlstra

Subject: Re: [PATCH RFC] sched/fair: fix sudden expiration of cfs quota in put_prev_task()

On Tue, Apr 07, 2015 at 06:32:47PM +0300, Konstantin Khlebnikov wrote:
> On 07.04.2015 18:12, Peter Zijlstra wrote:
> >On Tue, Apr 07, 2015 at 03:47:58PM +0200, Peter Zijlstra wrote:
> >>Lemme think a bit more on that.
> >
> >So going by 734ff2a71f9e ("sched/rt: Fix picking RT and DL tasks from
> >empty queue") something like this would be called for.
> >
> >---
> > kernel/sched/fair.c | 15 +++++++++++++++
> > 1 file changed, 15 insertions(+)
> >
> >diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >index fdae26eb7218..1e47f816c976 100644
> >--- a/kernel/sched/fair.c
> >+++ b/kernel/sched/fair.c
> >@@ -5175,6 +5175,21 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> > return p;
> > simple:
> > cfs_rq = &rq->cfs;
>
> We don't need this if prev isn't from fair_sched_class:
> first "goto simple" should go directly to "if (!cfs_rq->nr_running)".

Just add that to the test here:

> >+ if (cfs_bandwidth_used()) {
> >+ /*
> >+ * We may dequeue prev's cfs_rq in put_prev_task().
> >+ * So, we update time before cfs_rq->nr_running check.
> >+ */
> >+ if (prev->on_rq)
> >+ update_curr(cfs_rq);
> >+
> >+ se = &prev->se;
> >+ for_each_sched_entity(se) {
>
> Probably update_curr() should be inside loop too?

Ah, yes, you're right.

> >+ cfs_rq = cfs_rq_of(se);
> >+ check_cfs_rq_runtime(cfs_rq);
> >+ }
> >+ }
> > #endif
> >
> > if (!cfs_rq->nr_running)
> >
>
>
> --
> Konstantin

2015-04-07 15:53:46

by Konstantin Khlebnikov

Subject: Re: [PATCH RFC] sched/fair: fix sudden expiration of cfs quota in put_prev_task()

On 07.04.2015 01:45, [email protected] wrote:
> Konstantin Khlebnikov <[email protected]> writes:
>
>> Pick_next_task_fair() must be sure that here is at least one runnable
>> task before calling put_prev_task(), but put_prev_task() can expire
>> last remains of cfs quota and throttle all currently runnable tasks.
>> As a result pick_next_task_fair() cannot find next task and crashes.
>>
>> This patch leaves 1 in ->runtime_remaining when current assignation
>> expires and tries to refill it right after that. In the worst case
>> task will be scheduled once and throttled at the end of slice.
>>
>
> I don't think expire_cfs_rq_runtime is the problem. What I believe
> happens is this:
>
> /prev/some_task is running, calls schedule() with nr_running == 2.
> pick_next's first do/while loop does update_curr(/) and picks /next, and
> the next iteration just sees check_cfs_rq_runtime(/next), and thus does
> goto simple. However, there is now only /prev/some_task runnable, and it
> hasn't checked the entire prev hierarchy for throttling, thus leading to
> the crash.
>
> This would require that check_cfs_rq_runtime(/next) return true despite
> being on_rq though, which iirc is not supposed to happen (note that we
> do not call update_curr(/next), and it would do nothing if we did,
> because /next isn't part of the current thread's hierarchy). However,
> this /can/ happen if runtime has just been (re)enabled on /next, because
> tg_set_cfs_bandwidth sets runtime_remaining to 0, not 1.

Yeah, this seems possible too.

> The idea was that each rq would grab runtime when they were scheduled
> (pick_next_task_fair didn't ever look at throttling info), so this was
> fine with the old code, but is a problem now. I think it would be
> sufficient to just initialize to 1 in tg_set_cfs_bandwidth. The arguably
> more precise option would be to only check_cfs_rq_runtime if
> cfs_rq->curr is set, but the code is slightly less pretty.

Probably this code should call something like
account_cfs_rq_runtime(cfs_rq, 0);

>
> Could you check this patch to see if it works (or the trivial
> tg_set_bandwidth runtime_remaining = 1 patch)?

I'm not sure that I'll see this bug again: we're replacing this setup
with something different.

>
> ---8<----->8---
>
> From f12fa8e981bf1d87cbbc30951bdf27e70c803e25 Mon Sep 17 00:00:00 2001
> From: Ben Segall <[email protected]>
> Date: Mon, 6 Apr 2015 15:28:10 -0700
> Subject: [PATCH] sched: prevent throttle in early pick_next_task_fair
>
> The first call to check_cfs_rq_runtime in pick_next_task_fair is only supposed
> to trigger when cfs_rq is still an ancestor of prev. However, it was able to
> trigger on tgs that had just had bandwidth toggled, because tg_set_cfs_bandwidth
> set runtime_remaining to 0, and check_cfs_rq_runtime doesn't check the global
> pool.
>
> Fix this by only calling check_cfs_rq_runtime if we are still in prev's
> ancestry, as evidenced by cfs_rq->curr.
>
> Reported-by: Konstantin Khlebnikov <[email protected]>
> Signed-off-by: Ben Segall <[email protected]>
> ---
> kernel/sched/fair.c | 25 ++++++++++++++-----------
> 1 file changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ee595ef..5cb52e9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5038,18 +5038,21 @@ again:
> * entity, update_curr() will update its vruntime, otherwise
> * forget we've ever seen it.
> */
> - if (curr && curr->on_rq)
> - update_curr(cfs_rq);
> - else
> - curr = NULL;
> + if (curr) {
> + if (curr->on_rq)
> + update_curr(cfs_rq);
> + else
> + curr = NULL;
>
> - /*
> - * This call to check_cfs_rq_runtime() will do the throttle and
> - * dequeue its entity in the parent(s). Therefore the 'simple'
> - * nr_running test will indeed be correct.
> - */
> - if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> - goto simple;
> + /*
> + * This call to check_cfs_rq_runtime() will do the
> + * throttle and dequeue its entity in the parent(s).
> + * Therefore the 'simple' nr_running test will indeed
> + * be correct.
> + */
> + if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> + goto simple;
> + }
>
> se = pick_next_entity(cfs_rq, curr);
> cfs_rq = group_cfs_rq(se);
>


--
Konstantin

Subject: [tip:sched/core] sched/fair: Prevent throttling in early pick_next_task_fair()

Commit-ID: 54d27365cae88fbcc853b391dcd561e71acb81fa
Gitweb: http://git.kernel.org/tip/54d27365cae88fbcc853b391dcd561e71acb81fa
Author: Ben Segall <[email protected]>
AuthorDate: Mon, 6 Apr 2015 15:28:10 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 7 Jun 2015 15:57:44 +0200

sched/fair: Prevent throttling in early pick_next_task_fair()

The optimized task selection logic optimistically selects a new task
to run without first doing a full put_prev_task(). This is so that we
can avoid a put/set on the common ancestors of the old and new task.

Similarly, we should only call check_cfs_rq_runtime() to throttle
eligible groups if they're part of the common ancestry, otherwise it
is possible to end up with no eligible task in the simple task
selection.

Imagine:

                /root
        /prev           /next
        /A              /B

If our optimistic selection ends up throttling /next, we goto simple
and our put_prev_task() ends up throttling /prev, after which we're
going to bug out in set_next_entity() because there aren't any tasks
left.

Avoid this scenario by only throttling common ancestors.

Reported-by: Mohammed Naser <[email protected]>
Reported-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
[ munged Changelog ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: 678d5718d8d0 ("sched/fair: Optimize cgroup pick_next_task_fair()")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d4632f..84ada05 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5322,18 +5322,21 @@ again:
* entity, update_curr() will update its vruntime, otherwise
* forget we've ever seen it.
*/
- if (curr && curr->on_rq)
- update_curr(cfs_rq);
- else
- curr = NULL;
+ if (curr) {
+ if (curr->on_rq)
+ update_curr(cfs_rq);
+ else
+ curr = NULL;

- /*
- * This call to check_cfs_rq_runtime() will do the throttle and
- * dequeue its entity in the parent(s). Therefore the 'simple'
- * nr_running test will indeed be correct.
- */
- if (unlikely(check_cfs_rq_runtime(cfs_rq)))
- goto simple;
+ /*
+ * This call to check_cfs_rq_runtime() will do the
+ * throttle and dequeue its entity in the parent(s).
+ * Therefore the 'simple' nr_running test will indeed
+ * be correct.
+ */
+ if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+ goto simple;
+ }

se = pick_next_entity(cfs_rq, curr);
cfs_rq = group_cfs_rq(se);