2018-04-15 13:33:22

by Nicholas Piggin

[permalink] [raw]
Subject: [RFC PATCH] kernel/sched/core: busy wait before going idle

This is a quick hack for comments, but I've always wondered --
if we have short-term polling idle states in cpuidle for performance
-- why not skip the context switch and entry into all the idle states,
and just wait for a bit to see if something wakes up again.

It's not uncommon to see various going-to-idle work in kernel profiles.
This might be a way to reduce that (as well as the cost of switching
registers and the kernel stack to the idle thread). This can be an
important path for single-thread request-response throughput.

tbench bandwidth seems to be improved (the numbers aren't too stable,
but they pretty consistently show some gain). 10-20% would be a pretty
nice gain for such workloads.

clients      1      2      4      8     16    128
vanilla    232    467    823   1819   3218   9065
patched    310    503    962   2465   3743   9820

---
kernel/sched/core.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8afd6086f23..30a0b13edfa5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3404,6 +3404,7 @@ static void __sched notrace __schedule(bool preempt)
struct rq_flags rf;
struct rq *rq;
int cpu;
+ bool do_idle_spin = true;

cpu = smp_processor_id();
rq = cpu_rq(cpu);
@@ -3428,6 +3429,7 @@ static void __sched notrace __schedule(bool preempt)
rq_lock(rq, &rf);
smp_mb__after_spinlock();

+idle_spin_end:
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
@@ -3437,6 +3439,32 @@ static void __sched notrace __schedule(bool preempt)
if (unlikely(signal_pending_state(prev->state, prev))) {
prev->state = TASK_RUNNING;
} else {
+ /*
+ * Busy wait before switching to idle thread. This
+ * is marked unlikely because we're idle so jumping
+ * out of line doesn't matter too much.
+ */
+ if (unlikely(do_idle_spin && rq->nr_running == 1)) {
+ u64 start;
+
+ do_idle_spin = false;
+
+ rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
+ rq_unlock_irq(rq, &rf);
+
+ spin_begin();
+ start = local_clock();
+ while (!need_resched() && prev->state &&
+ !signal_pending_state(prev->state, prev)) {
+ spin_cpu_relax();
+ if (local_clock() - start > 1000000)
+ break;
+ }
+ spin_end();
+
+ rq_lock_irq(rq, &rf);
+ goto idle_spin_end;
+ }
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
prev->on_rq = 0;

--
2.17.0



2018-04-20 07:46:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> This is a quick hack for comments, but I've always wondered --
> if we have short-term polling idle states in cpuidle for performance
> -- why not skip the context switch and entry into all the idle states,
> and just wait for a bit to see if something wakes up again.

Is that context switch so expensive?

And what kernel did you test on? We recently merged a bunch of patches
from Rafael that avoided disabling the tick for short idle predictions.
This also has performance improvements for such workloads. Did your
kernel include those?

> It's not uncommon to see various going-to-idle work in kernel profiles.
> This might be a way to reduce that (and just the cost of switching
> registers and kernel stack to idle thread). This can be an important
> path for single thread request-response throughput.

So I feel that _if_ we do a spin here, it should only be long enough to
amortize the schedule switch context.

However, doing busy waits here has the downside that the 'idle' time is
not in fact fed into the cpuidle predictor.

2018-04-20 09:03:34

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

On Fri, 20 Apr 2018 09:44:56 +0200
Peter Zijlstra <[email protected]> wrote:

> On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> > This is a quick hack for comments, but I've always wondered --
> > if we have short-term polling idle states in cpuidle for performance
> > -- why not skip the context switch and entry into all the idle states,
> > and just wait for a bit to see if something wakes up again.
>
> Is that context switch so expensive?

I guess relatively much more than taking one branch mispredict on the
loop exit when the task wakes. 10s of cycles vs 1000s?

> And what kernel did you test on? We recently merged a bunch of patches
> from Rafael that avoided disabling the tick for short idle predictions.
> This also has performance improvements for such workloads. Did your
> kernel include those?

Yes, that actually improved profiles quite a lot, but these numbers were
with those changes. I'll try to find some fast disks or network and get
some more interesting numbers.

> > It's not uncommon to see various going-to-idle work in kernel profiles.
> > This might be a way to reduce that (and just the cost of switching
> > registers and kernel stack to idle thread). This can be an important
> > path for single thread request-response throughput.
>
> So I feel that _if_ we do a spin here, it should only be long enough to
> amortize the schedule switch context.
>
> However, doing busy waits here has the downside that the 'idle' time is
> not in fact fed into the cpuidle predictor.

That's why I cc'ed Rafael :)

Yes, the latency in my hack is probably too long, but I think if we
did this, the cpuidle predictor could become involved here. There is
no fundamental reason it has to wait for the idle task to be context
switched for that... it's already become involved in core scheduler
code.

Thanks,
Nick

2018-04-20 11:00:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

On Fri, Apr 20, 2018 at 07:01:47PM +1000, Nicholas Piggin wrote:
> On Fri, 20 Apr 2018 09:44:56 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> > > This is a quick hack for comments, but I've always wondered --
> > > if we have short-term polling idle states in cpuidle for performance
> > > -- why not skip the context switch and entry into all the idle states,
> > > and just wait for a bit to see if something wakes up again.
> >
> > Is that context switch so expensive?
>
> I guess relatively much more than taking one branch mispredict on the
> loop exit when the task wakes. 10s of cycles vs 1000s?

Sure, just wondering how much. And I'm assuming you're looking at Power
here, right?

> > And what kernel did you test on? We recently merged a bunch of patches
> > from Rafael that avoided disabling the tick for short idle predictions.
> > This also has performance improvements for such workloads. Did your
> > kernel include those?
>
> Yes that actually improved profiles quite a lot, but these numbers were
> with those changes. I'll try to find some fast disks or network and get
> some more interesting numbers.

OK, good that you have those patches in. That ensures you're not trying
to fix something that's possibly already addressed elsewhere.

> > > It's not uncommon to see various going-to-idle work in kernel profiles.
> > > This might be a way to reduce that (and just the cost of switching
> > > registers and kernel stack to idle thread). This can be an important
> > > path for single thread request-response throughput.
> >
> > So I feel that _if_ we do a spin here, it should only be long enough to
> > amortize the schedule switch context.
> >
> > However, doing busy waits here has the downside that the 'idle' time is
> > not in fact fed into the cpuidle predictor.
>
> That's why I cc'ed Rafael :)
>
> Yes the latency in my hack is probably too long, but I think if we did
> this, the cpuidle predictor could become involved here. There is no
> fundamental reason it has to wait for the idle task to be context
> switched for that... it's already become involved in core scheduler
> code.

Yes, cpuidle/cpufreq are getting more and more integrated, so there is
no objection from that point.

Growing multiple 'idle' points otoh is a little dodgy and could cause
some maintenance issues.

Of course, this loop would have the same idle-duration problems as the
poll_state.c one. We should probably use that code. Also, do we want to
ask the estimator before doing this? If it predicts a very long idle
time, spinning here is just wasting cycles.

2018-04-20 12:30:23

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

On Fri, 20 Apr 2018 12:58:27 +0200
Peter Zijlstra <[email protected]> wrote:

> On Fri, Apr 20, 2018 at 07:01:47PM +1000, Nicholas Piggin wrote:
> > On Fri, 20 Apr 2018 09:44:56 +0200
> > Peter Zijlstra <[email protected]> wrote:
> >
> > > On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> > > > This is a quick hack for comments, but I've always wondered --
> > > > if we have short-term polling idle states in cpuidle for performance
> > > > -- why not skip the context switch and entry into all the idle states,
> > > > and just wait for a bit to see if something wakes up again.
> > >
> > > Is that context switch so expensive?
> >
> > I guess relatively much more than taking one branch mispredict on the
> > loop exit when the task wakes. 10s of cycles vs 1000s?
>
> Sure, just wondering how much. And I'm assuming you're looking at Power
> here, right?

Well I'll try to get more numbers.

Yes, talking about POWER. It trails x86 on context switches by a bit,
but they are similar orders of magnitude. My Skylake is doing ~1900
cycles for a syscall + context switch with a distro kernel; POWER9 is
~2500.

> > > And what kernel did you test on? We recently merged a bunch of patches
> > > from Rafael that avoided disabling the tick for short idle predictions.
> > > This also has performance improvements for such workloads. Did your
> > > kernel include those?
> >
> > Yes that actually improved profiles quite a lot, but these numbers were
> > with those changes. I'll try to find some fast disks or network and get
> > some more interesting numbers.
>
> OK, good that you have those patches in. That ensures you're not trying
> to fix something that's possibly already addressed elsewhere.

Yep.

>
> > > > It's not uncommon to see various going-to-idle work in kernel profiles.
> > > > This might be a way to reduce that (and just the cost of switching
> > > > registers and kernel stack to idle thread). This can be an important
> > > > path for single thread request-response throughput.
> > >
> > > So I feel that _if_ we do a spin here, it should only be long enough to
> > > amortize the schedule switch context.
> > >
> > > However, doing busy waits here has the downside that the 'idle' time is
> > > not in fact fed into the cpuidle predictor.
> >
> > That's why I cc'ed Rafael :)
> >
> > Yes the latency in my hack is probably too long, but I think if we did
> > this, the cpuidle predictor could become involved here. There is no
> > fundamental reason it has to wait for the idle task to be context
> > switched for that... it's already become involved in core scheduler
> > code.
>
> Yes, cpuidle/cpufreq are getting more and more integrated so there is
> no objection from that point.
>
> Growing multiple 'idle' points otoh is a little dodgy and could cause
> some maintenance issues.

Right, it should be done a bit better than my patch, which is
just a hack.

> Of course, this loop would have the same idle-duration problems as the
> poll_state.c one. We should probably use that code. Also, do we want to
> ask the estimator before doing this? If it predicts a very long idle
> time, spinning here is just wasting cycles.

I would say so, yes. I think if we did go this route, it should
take over the existing polling idle states, so it would make
sense to control it in a similar way.

(Unless polling idle is the only state available, of course; then we
need to switch to it eventually. And we must switch immediately in
cases like do_task_dead, etc.)

Anyway I'll wait for the merge window to settle and try to get some
more numbers. I basically just wanted to see if there were any
fundamental problems with the concept.

Thanks,
Nick

2018-04-23 10:19:10

by Pavankumar Kondeti

[permalink] [raw]
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

Hi Nick,

On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> This is a quick hack for comments, but I've always wondered --
> if we have short-term polling idle states in cpuidle for performance
> -- why not skip the context switch and entry into all the idle states,
> and just wait for a bit to see if something wakes up again.
>
> It's not uncommon to see various going-to-idle work in kernel profiles.
> This might be a way to reduce that (and just the cost of switching
> registers and kernel stack to idle thread). This can be an important
> path for single thread request-response throughput.
>
> tbench bandwidth seems to be improved (the numbers aren't too stable
> but they pretty consistently show some gain). 10-20% would be a pretty
> nice gain for such workloads
>
> clients      1      2      4      8     16    128
> vanilla    232    467    823   1819   3218   9065
> patched    310    503    962   2465   3743   9820
>

<snip>

> +idle_spin_end:
> /* Promote REQ to ACT */
> rq->clock_update_flags <<= 1;
> update_rq_clock(rq);
> @@ -3437,6 +3439,32 @@ static void __sched notrace __schedule(bool preempt)
> if (unlikely(signal_pending_state(prev->state, prev))) {
> prev->state = TASK_RUNNING;
> } else {
> + /*
> + * Busy wait before switching to idle thread. This
> + * is marked unlikely because we're idle so jumping
> + * out of line doesn't matter too much.
> + */
> + if (unlikely(do_idle_spin && rq->nr_running == 1)) {
> + u64 start;
> +
> + do_idle_spin = false;
> +
> + rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
> + rq_unlock_irq(rq, &rf);
> +
> + spin_begin();
> + start = local_clock();
> + while (!need_resched() && prev->state &&
> + !signal_pending_state(prev->state, prev)) {
> + spin_cpu_relax();
> + if (local_clock() - start > 1000000)
> + break;
> + }

Couple of comments/questions.

When an RT task is doing this busy loop,

(1) need_resched() may not be set even if a fair/normal task is enqueued on
this CPU.

(2) Any lower prio RT task waking up on this CPU may migrate to another CPU
thinking this CPU is busy with higher prio RT task.

> + spin_end();
> +
> + rq_lock_irq(rq, &rf);
> + goto idle_spin_end;
> + }
> deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> prev->on_rq = 0;
>
> --
> 2.17.0
>

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-04-24 05:46:17

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

On Mon, 23 Apr 2018 15:47:40 +0530
Pavan Kondeti <[email protected]> wrote:

> Hi Nick,
>
> On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> > This is a quick hack for comments, but I've always wondered --
> > if we have short-term polling idle states in cpuidle for performance
> > -- why not skip the context switch and entry into all the idle states,
> > and just wait for a bit to see if something wakes up again.
> >
> > It's not uncommon to see various going-to-idle work in kernel profiles.
> > This might be a way to reduce that (and just the cost of switching
> > registers and kernel stack to idle thread). This can be an important
> > path for single thread request-response throughput.
> >
> > tbench bandwidth seems to be improved (the numbers aren't too stable
> > but they pretty consistently show some gain). 10-20% would be a pretty
> > nice gain for such workloads
> >
> > clients      1      2      4      8     16    128
> > vanilla    232    467    823   1819   3218   9065
> > patched    310    503    962   2465   3743   9820
> >
>
> <snip>
>
> > +idle_spin_end:
> > /* Promote REQ to ACT */
> > rq->clock_update_flags <<= 1;
> > update_rq_clock(rq);
> > @@ -3437,6 +3439,32 @@ static void __sched notrace __schedule(bool preempt)
> > if (unlikely(signal_pending_state(prev->state, prev))) {
> > prev->state = TASK_RUNNING;
> > } else {
> > + /*
> > + * Busy wait before switching to idle thread. This
> > + * is marked unlikely because we're idle so jumping
> > + * out of line doesn't matter too much.
> > + */
> > + if (unlikely(do_idle_spin && rq->nr_running == 1)) {
> > + u64 start;
> > +
> > + do_idle_spin = false;
> > +
> > + rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
> > + rq_unlock_irq(rq, &rf);
> > +
> > + spin_begin();
> > + start = local_clock();
> > + while (!need_resched() && prev->state &&
> > + !signal_pending_state(prev->state, prev)) {
> > + spin_cpu_relax();
> > + if (local_clock() - start > 1000000)
> > + break;
> > + }
>
> Couple of comments/questions.
>
> When an RT task is doing this busy loop,
>
> (1) need_resched() may not be set even if a fair/normal task is enqueued on
> this CPU.

This is true; it should probably spin on nr_running == 1. Good catch.

>
> (2) Any lower prio RT task waking up on this CPU may migrate to another CPU
> thinking this CPU is busy with higher prio RT task.

Also true. If we completely replaced the polling idle states with a
spin here, this would not be acceptable, and it would take quite a
lot more work to interact with load calculations etc.

On the other hand, if it is a much smaller spin, on the order of
context switch latency, that could be considered part of the cost
of context switching for the purposes of load balancing, and *maybe*
not much else is needed.

Thanks,
Nick