2014-02-11 20:24:20

by Andy Lutomirski

[permalink] [raw]
Subject: Too many rescheduling interrupts (still!)

Rumor has it that Linux 3.13 was supposed to get rid of all the silly
rescheduling interrupts. It doesn't, although it does seem to have
improved the situation.

A small number of reschedule interrupts appear to be due to a race:
both resched_task and wake_up_idle_cpu do, essentially:

set_tsk_need_resched(t);
smb_mb();
if (!tsk_is_polling(t))
smp_send_reschedule(cpu);

The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
is too quick (which isn't surprising if it was in C0 or C1), then it
could *clear* TS_POLLING before tsk_is_polling is read.

Is there a good reason that TIF_NEED_RESCHED is in thread->flags and
TS_POLLING is in thread->status? Couldn't both of these be in the
same field in something like struct rq? That would allow a real
atomic op here.

The more serious issue is that AFAICS default_wake_function is
completely missing the polling check. It goes through
ttwu_queue_remote, which unconditionally sends an interrupt.

--Andy


2014-02-11 21:21:36

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Too many rescheduling interrupts (still!)

On Tue, 11 Feb 2014, Andy Lutomirski wrote:

Just adding Peter for now, as I'm too tired to grok the issue right
now.

> Rumor has it that Linux 3.13 was supposed to get rid of all the silly
> rescheduling interrupts. It doesn't, although it does seem to have
> improved the situation.
>
> A small number of reschedule interrupts appear to be due to a race:
> both resched_task and wake_up_idle_cpu do, essentially:
>
> set_tsk_need_resched(t);
> smb_mb();
> if (!tsk_is_polling(t))
> smp_send_reschedule(cpu);
>
> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> is too quick (which isn't surprising if it was in C0 or C1), then it
> could *clear* TS_POLLING before tsk_is_polling is read.
>
> Is there a good reason that TIF_NEED_RESCHED is in thread->flags and
> TS_POLLING is in thread->status? Couldn't both of these be in the
> same field in something like struct rq? That would allow a real
> atomic op here.
>
> The more serious issue is that AFAICS default_wake_function is
> completely missing the polling check. It goes through
> ttwu_queue_remote, which unconditionally sends an interrupt.
>
> --Andy
>

2014-02-11 22:34:34

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Too many rescheduling interrupts (still!)

On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <[email protected]> wrote:
> On Tue, 11 Feb 2014, Andy Lutomirski wrote:
>
> Just adding Peter for now, as I'm too tired to grok the issue right
> now.
>
>> Rumor has it that Linux 3.13 was supposed to get rid of all the silly
>> rescheduling interrupts. It doesn't, although it does seem to have
>> improved the situation.
>>
>> A small number of reschedule interrupts appear to be due to a race:
>> both resched_task and wake_up_idle_cpu do, essentially:
>>
>> set_tsk_need_resched(t);
>> smb_mb();
>> if (!tsk_is_polling(t))
>> smp_send_reschedule(cpu);
>>
>> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
>> is too quick (which isn't surprising if it was in C0 or C1), then it
>> could *clear* TS_POLLING before tsk_is_polling is read.
>>
>> Is there a good reason that TIF_NEED_RESCHED is in thread->flags and
>> TS_POLLING is in thread->status? Couldn't both of these be in the
>> same field in something like struct rq? That would allow a real
>> atomic op here.
>>
>> The more serious issue is that AFAICS default_wake_function is
>> completely missing the polling check. It goes through
>> ttwu_queue_remote, which unconditionally sends an interrupt.

There would be an extra benefit of moving the resched-related bits to
some per-cpu structure: it would allow lockless wakeups.
ttwu_queue_remote, and probably all of the other reschedule-a-cpu
functions, could do something like:

if (...) {
old = atomic_read(resched_flags(cpu));
while(true) {
if (old & RESCHED_NEED_RESCHED)
return;
if (!(old & RESCHED_POLLING)) {
smp_send_reschedule(cpu);
return;
}
new = old | RESCHED_NEED_RESCHED;
old = atomic_cmpxchg(resched_flags(cpu), old, new);
}
}

The point being that, with the current location of the flags, either
an interrupt needs to be sent or something needs to be done to prevent
rq->curr from disappearing. (It probably doesn't matter if the
current task changes, because TS_POLLING will be clear, but what if
the task goes away entirely?)

All that being said, it looks like ttwu_queue_remote doesn't actually
work if the IPI isn't sent. The attached patch appears to work (and
reduces total rescheduling IPIs by a large amount for my workload),
but I don't really think it's worthy of being applied...

--Andy


Attachments:
0001-sched-Try-to-avoid-sending-an-IPI-in-ttwu_queue_remo.patch (2.08 kB)

2014-02-12 10:13:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Too many rescheduling interrupts (still!)

On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <[email protected]> wrote:
> >> A small number of reschedule interrupts appear to be due to a race:
> >> both resched_task and wake_up_idle_cpu do, essentially:
> >>
> >> set_tsk_need_resched(t);
> >> smb_mb();
> >> if (!tsk_is_polling(t))
> >> smp_send_reschedule(cpu);
> >>
> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> >> is too quick (which isn't surprising if it was in C0 or C1), then it
> >> could *clear* TS_POLLING before tsk_is_polling is read.

Yeah we have the wrong default for the idle loops.. it should default to
polling and only switch to !polling at the very last moment if it really
needs an interrupt to wake.

Changing this requires someone (probably me again :/) to audit all arch
cpu idle drivers/functions.

> >> Is there a good reason that TIF_NEED_RESCHED is in thread->flags and
> >> TS_POLLING is in thread->status? Couldn't both of these be in the
> >> same field in something like struct rq? That would allow a real
> >> atomic op here.

I don't see the value of an atomic op there; but many archs already have
this, grep for TIF_POLLING.

> >> The more serious issue is that AFAICS default_wake_function is
> >> completely missing the polling check. It goes through
> >> ttwu_queue_remote, which unconditionally sends an interrupt.

Yah, because it does more than just wake the CPU; at the time we didn't
have a generic idle path, we could cure things now though.

> There would be an extra benefit of moving the resched-related bits to
> some per-cpu structure: it would allow lockless wakeups.
> ttwu_queue_remote, and probably all of the other reschedule-a-cpu
> functions, could do something like:
>
> if (...) {
> old = atomic_read(resched_flags(cpu));
> while(true) {
> if (old & RESCHED_NEED_RESCHED)
> return;
> if (!(old & RESCHED_POLLING)) {
> smp_send_reschedule(cpu);
> return;
> }
> new = old | RESCHED_NEED_RESCHED;
> old = atomic_cmpxchg(resched_flags(cpu), old, new);
> }
> }

That looks hideously expensive.. for no apparent reason.

Sending that IPI isn't _that_ bad, esp if we get the false-positive
window smaller than it is now (its far too wide because of the wrong
default state).

> The point being that, with the current location of the flags, either
> an interrupt needs to be sent or something needs to be done to prevent
> rq->curr from disappearing. (It probably doesn't matter if the
> current task changes, because TS_POLLING will be clear, but what if
> the task goes away entirely?)

It can't we're holding its rq->lock.

> All that being said, it looks like ttwu_queue_remote doesn't actually
> work if the IPI isn't sent. The attached patch appears to work (and
> reduces total rescheduling IPIs by a large amount for my workload),
> but I don't really think it's worthy of being applied...

We can do something similar though; we can move sched_ttwu_pending()
into the generic idle loop, right next to set_preempt_need_resched().

2014-02-12 15:49:31

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Too many rescheduling interrupts (still!)

On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
>> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <[email protected]> wrote:
>> >> A small number of reschedule interrupts appear to be due to a race:
>> >> both resched_task and wake_up_idle_cpu do, essentially:
>> >>
>> >> set_tsk_need_resched(t);
>> >> smb_mb();
>> >> if (!tsk_is_polling(t))
>> >> smp_send_reschedule(cpu);
>> >>
>> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
>> >> is too quick (which isn't surprising if it was in C0 or C1), then it
>> >> could *clear* TS_POLLING before tsk_is_polling is read.
>
> Yeah we have the wrong default for the idle loops.. it should default to
> polling and only switch to !polling at the very last moment if it really
> needs an interrupt to wake.

I might be missing something, but won't that break the scheduler? If
tsk_is_polling always returns true on mwait-capable systems, then
other cpus won't be able to use the polling bit to distinguish between
the idle state (where setting need_resched is enough) and the non-idle
state (where the IPI is needed to preempt whatever task is running).

Since rq->lock is held, the resched calls could check the rq state
(curr == idle, maybe) to distinguish these cases.


>
>> There would be an extra benefit of moving the resched-related bits to
>> some per-cpu structure: it would allow lockless wakeups.
>> ttwu_queue_remote, and probably all of the other reschedule-a-cpu
>> functions, could do something like:
>>
>> if (...) {
>> old = atomic_read(resched_flags(cpu));
>> while(true) {
>> if (old & RESCHED_NEED_RESCHED)
>> return;
>> if (!(old & RESCHED_POLLING)) {
>> smp_send_reschedule(cpu);
>> return;
>> }
>> new = old | RESCHED_NEED_RESCHED;
>> old = atomic_cmpxchg(resched_flags(cpu), old, new);
>> }
>> }
>
> That looks hideously expensive.. for no apparent reason.
>
> Sending that IPI isn't _that_ bad, esp if we get the false-positive
> window smaller than it is now (its far too wide because of the wrong
> default state).
>
>> The point being that, with the current location of the flags, either
>> an interrupt needs to be sent or something needs to be done to prevent
>> rq->curr from disappearing. (It probably doesn't matter if the
>> current task changes, because TS_POLLING will be clear, but what if
>> the task goes away entirely?)
>
> It can't we're holding its rq->lock.

Exactly. AFAICT the only reason that any of this code holds rq->lock
(especially ttwu_queue_remote, which I seem to call a few thousand
times per second) is because the only way to make a cpu reschedule
involves playing with per-task flags. If the flags were per-rq or
per-cpu instead, then rq->lock wouldn't be needed. If this were all
done locklessly, then I think either a full cmpxchg or some fairly
careful use of full barriers would be needed, but I bet that cmpxchg
is still considerably faster than a spinlock plus a set_bit.

>
>> All that being said, it looks like ttwu_queue_remote doesn't actually
>> work if the IPI isn't sent. The attached patch appears to work (and
>> reduces total rescheduling IPIs by a large amount for my workload),
>> but I don't really think it's worthy of being applied...
>
> We can do something similar though; we can move sched_ttwu_pending()
> into the generic idle loop, right next to set_preempt_need_resched().

Oh, right -- either the IPI or the idle code is guaranteed to happen
soon. (But wouldn't setting TS_POLLING always break this, too?)

--Andy

2014-02-12 15:59:55

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: Too many rescheduling interrupts (still!)

2014-02-12 11:13 GMT+01:00 Peter Zijlstra <[email protected]>:
> On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
>> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <[email protected]> wrote:
>> >> A small number of reschedule interrupts appear to be due to a race:
>> >> both resched_task and wake_up_idle_cpu do, essentially:
>> >>
>> >> set_tsk_need_resched(t);
>> >> smb_mb();
>> >> if (!tsk_is_polling(t))
>> >> smp_send_reschedule(cpu);
>> >>
>> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
>> >> is too quick (which isn't surprising if it was in C0 or C1), then it
>> >> could *clear* TS_POLLING before tsk_is_polling is read.
>
> Yeah we have the wrong default for the idle loops.. it should default to
> polling and only switch to !polling at the very last moment if it really
> needs an interrupt to wake.
>
> Changing this requires someone (probably me again :/) to audit all arch
> cpu idle drivers/functions.

Looking at wake_up_idle_cpu(), we set need_resched and send the IPI.
On the other end, the CPU wakes up, exits the idle loop and even goes
to the scheduler while there is probably no task to schedule.

I wonder if this is all necessary. All we need is the timer to be
handled by the dynticks code to re-evaluate the next tick. So calling
irq_exit() -> tick_nohz_irq_exit() from the scheduler_ipi() should be
enough.

We could use a specific flag set before smp_send_reschedule() and read
in scheduler_ipi() entry to check if we need irq_entry()/irq_exit().

2014-02-12 16:39:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Too many rescheduling interrupts (still!)

On Wed, Feb 12, 2014 at 07:49:07AM -0800, Andy Lutomirski wrote:
> On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
> >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <[email protected]> wrote:
> >> >> A small number of reschedule interrupts appear to be due to a race:
> >> >> both resched_task and wake_up_idle_cpu do, essentially:
> >> >>
> >> >> set_tsk_need_resched(t);
> >> >> smb_mb();
> >> >> if (!tsk_is_polling(t))
> >> >> smp_send_reschedule(cpu);
> >> >>
> >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> >> >> is too quick (which isn't surprising if it was in C0 or C1), then it
> >> >> could *clear* TS_POLLING before tsk_is_polling is read.
> >
> > Yeah we have the wrong default for the idle loops.. it should default to
> > polling and only switch to !polling at the very last moment if it really
> > needs an interrupt to wake.
>
> I might be missing something, but won't that break the scheduler?

for the idle task.. all other tasks will have it !polling.

But note how the current generic idle loop does:

if (!current_clr_polling_and_test()) {
...
if (cpuidle_idle_call())
arch_cpu_idle();
...
}

This means that it still runs a metric ton of code, right up to the
mwait with !polling, and then at the mwait we switch it back to polling.

Completely daft.

> Since rq->lock is held, the resched calls could check the rq state
> (curr == idle, maybe) to distinguish these cases.

Not enough; but I'm afraid I confused you with the above.

My suggestion was really more that we should call into the cpuidle/arch
idle code with polling set, and only right before we hit hlt/wfi/etc..
should we clear the polling bit.

> > It can't we're holding its rq->lock.
>
> Exactly. AFAICT the only reason that any of this code holds rq->lock
> (especially ttwu_queue_remote, which I seem to call a few thousand
> times per second) is because the only way to make a cpu reschedule
> involves playing with per-task flags. If the flags were per-rq or
> per-cpu instead, then rq->lock wouldn't be needed. If this were all
> done locklessly, then I think either a full cmpxchg or some fairly
> careful use of full barriers would be needed, but I bet that cmpxchg
> is still considerably faster than a spinlock plus a set_bit.

Ahh, that's what you're saying. Yes we should be able to do something
clever there.

Something like the below is I think as close as we can come without
major surgery and moving TIF_NEED_RESCHED and POLLING into a per-cpu
variable.

I might have messed it up though; brain seems to have given out for the
day :/

---
kernel/sched/core.c | 17 +++++++++++++----
kernel/sched/idle.c | 21 +++++++++++++--------
kernel/sched/sched.h | 5 ++++-
3 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb9764fbc537..a5b64040c21d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -529,7 +529,7 @@ void resched_task(struct task_struct *p)
}

/* NEED_RESCHED must be visible before we test polling */
- smp_mb();
+ smp_mb__after_clear_bit();
if (!tsk_is_polling(p))
smp_send_reschedule(cpu);
}
@@ -1476,12 +1476,15 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
}

#ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+void sched_ttwu_pending(void)
{
struct rq *rq = this_rq();
struct llist_node *llist = llist_del_all(&rq->wake_list);
struct task_struct *p;

+ if (!llist)
+ return;
+
raw_spin_lock(&rq->lock);

while (llist) {
@@ -1536,8 +1539,14 @@ void scheduler_ipi(void)

static void ttwu_queue_remote(struct task_struct *p, int cpu)
{
- if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
- smp_send_reschedule(cpu);
+ struct rq *rq = cpu_rq(cpu);
+
+ if (llist_add(&p->wake_entry, &rq->wake_list)) {
+ set_tsk_need_resched(rq->idle);
+ smp_mb__after_clear_bit();
+ if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle)
+ smp_send_reschedule(cpu);
+ }
}

bool cpus_share_cache(int this_cpu, int that_cpu)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 14ca43430aee..bd8ed2d2f2f7 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -105,19 +105,24 @@ static void cpu_idle_loop(void)
} else {
local_irq_enable();
}
- __current_set_polling();
}
arch_cpu_idle_exit();
- /*
- * We need to test and propagate the TIF_NEED_RESCHED
- * bit here because we might not have send the
- * reschedule IPI to idle tasks.
- */
- if (tif_need_resched())
- set_preempt_need_resched();
}
+
+ /*
+ * We must clear polling before running sched_ttwu_pending().
+ * Otherwise it becomes possible to have entries added in
+ * ttwu_queue_remote() and still not get an IPI to process
+ * them.
+ */
+ __current_clr_polling();
+
+ set_preempt_need_resched();
+ sched_ttwu_pending();
+
tick_nohz_idle_exit();
schedule_preempt_disabled();
+ __current_set_polling();
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1bf34c257d3b..b59dbdb135d8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1157,9 +1157,10 @@ extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;

-
#ifdef CONFIG_SMP

+extern void sched_ttwu_pending(void)
+
extern void update_group_power(struct sched_domain *sd, int cpu);

extern void trigger_load_balance(struct rq *rq);
@@ -1170,6 +1171,8 @@ extern void idle_exit_fair(struct rq *this_rq);

#else /* CONFIG_SMP */

+static inline void sched_ttwu_pending(void) { }
+
static inline void idle_balance(int cpu, struct rq *rq)
{
}

2014-02-12 16:44:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Too many rescheduling interrupts (still!)

On Wed, Feb 12, 2014 at 04:59:52PM +0100, Frederic Weisbecker wrote:
> 2014-02-12 11:13 GMT+01:00 Peter Zijlstra <[email protected]>:
> > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
> >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <[email protected]> wrote:
> >> >> A small number of reschedule interrupts appear to be due to a race:
> >> >> both resched_task and wake_up_idle_cpu do, essentially:
> >> >>
> >> >> set_tsk_need_resched(t);
> >> >> smb_mb();
> >> >> if (!tsk_is_polling(t))
> >> >> smp_send_reschedule(cpu);
> >> >>
> >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> >> >> is too quick (which isn't surprising if it was in C0 or C1), then it
> >> >> could *clear* TS_POLLING before tsk_is_polling is read.
> >
> > Yeah we have the wrong default for the idle loops.. it should default to
> > polling and only switch to !polling at the very last moment if it really
> > needs an interrupt to wake.
> >
> > Changing this requires someone (probably me again :/) to audit all arch
> > cpu idle drivers/functions.
>
> Looking at wake_up_idle_cpu(), we set need_resched and send the IPI.
> On the other end, the CPU wakes up, exits the idle loop and even goes
> to the scheduler while there is probably no task to schedule.
>
> I wonder if this is all necessary. All we need is the timer to be
> handled by the dynticks code to re-evaluate the next tick. So calling
> irq_exit() -> tick_nohz_irq_exit() from the scheduler_ipi() should be
> enough.

No no, the idea was to NOT send IPIs. So falling out of idle by writing
TIF_NEED_RESCHED and having the idle loop fixup the timers on its way
back to idle is what you want.

2014-02-12 17:46:45

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: Too many rescheduling interrupts (still!)

On Wed, Feb 12, 2014 at 05:43:56PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 12, 2014 at 04:59:52PM +0100, Frederic Weisbecker wrote:
> > 2014-02-12 11:13 GMT+01:00 Peter Zijlstra <[email protected]>:
> > > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
> > >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <[email protected]> wrote:
> > >> >> A small number of reschedule interrupts appear to be due to a race:
> > >> >> both resched_task and wake_up_idle_cpu do, essentially:
> > >> >>
> > >> >> set_tsk_need_resched(t);
> > >> >> smb_mb();
> > >> >> if (!tsk_is_polling(t))
> > >> >> smp_send_reschedule(cpu);
> > >> >>
> > >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> > >> >> is too quick (which isn't surprising if it was in C0 or C1), then it
> > >> >> could *clear* TS_POLLING before tsk_is_polling is read.
> > >
> > > Yeah we have the wrong default for the idle loops.. it should default to
> > > polling and only switch to !polling at the very last moment if it really
> > > needs an interrupt to wake.
> > >
> > > Changing this requires someone (probably me again :/) to audit all arch
> > > cpu idle drivers/functions.
> >
> > Looking at wake_up_idle_cpu(), we set need_resched and send the IPI.
> > On the other end, the CPU wakes up, exits the idle loop and even goes
> > to the scheduler while there is probably no task to schedule.
> >
> > I wonder if this is all necessary. All we need is the timer to be
> > handled by the dynticks code to re-evaluate the next tick. So calling
> > irq_exit() -> tick_nohz_irq_exit() from the scheduler_ipi() should be
> > enough.
>
> No no, the idea was to NOT send IPIs. So falling out of idle by writing
> TIF_NEED_RESCHED and having the idle loop fixup the timers on its way
> back to idle is what you want.

Ok but if the target is idle, dynticks and not polling, we don't have the choice
but to send an IPI, right? I'm talking about this kind of case.

2014-02-12 18:15:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Too many rescheduling interrupts (still!)

On Wed, Feb 12, 2014 at 06:46:39PM +0100, Frederic Weisbecker wrote:
> Ok but if the target is idle, dynticks and not polling, we don't have the choice
> but to send an IPI, right? I'm talking about this kind of case.

Yes; but Andy doesn't seem concerned with such hardware (!x86).

Anything x86 (except ancient stuff) is effectively polling and wakes up
from the TIF_NEED_RESCHED write.

2014-02-12 18:20:05

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Too many rescheduling interrupts (still!)

On Wed, Feb 12, 2014 at 8:39 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Feb 12, 2014 at 07:49:07AM -0800, Andy Lutomirski wrote:
>> On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra <[email protected]> wrote:
>> Exactly. AFAICT the only reason that any of this code holds rq->lock
>> (especially ttwu_queue_remote, which I seem to call a few thousand
>> times per second) is because the only way to make a cpu reschedule
>> involves playing with per-task flags. If the flags were per-rq or
>> per-cpu instead, then rq->lock wouldn't be needed. If this were all
>> done locklessly, then I think either a full cmpxchg or some fairly
>> careful use of full barriers would be needed, but I bet that cmpxchg
>> is still considerably faster than a spinlock plus a set_bit.
>
> Ahh, that's what you're saying. Yes we should be able to do something
> clever there.
>
> Something like the below is I think as close as we can come without
> major surgery and moving TIF_NEED_RESCHED and POLLING into a per-cpu
> variable.
>
> I might have messed it up though; brain seems to have given out for the
> day :/
>
> ---
> kernel/sched/core.c | 17 +++++++++++++----
> kernel/sched/idle.c | 21 +++++++++++++--------
> kernel/sched/sched.h | 5 ++++-
> 3 files changed, 30 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fb9764fbc537..a5b64040c21d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -529,7 +529,7 @@ void resched_task(struct task_struct *p)
> }
>
> /* NEED_RESCHED must be visible before we test polling */
> - smp_mb();
> + smp_mb__after_clear_bit();
> if (!tsk_is_polling(p))
> smp_send_reschedule(cpu);
> }
> @@ -1476,12 +1476,15 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
> }
>
> #ifdef CONFIG_SMP
> -static void sched_ttwu_pending(void)
> +void sched_ttwu_pending(void)
> {
> struct rq *rq = this_rq();
> struct llist_node *llist = llist_del_all(&rq->wake_list);
> struct task_struct *p;
>
> + if (!llist)
> + return;
> +
> raw_spin_lock(&rq->lock);
>
> while (llist) {
> @@ -1536,8 +1539,14 @@ void scheduler_ipi(void)
>
> static void ttwu_queue_remote(struct task_struct *p, int cpu)
> {
> - if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
> - smp_send_reschedule(cpu);
> + struct rq *rq = cpu_rq(cpu);
> +
> + if (llist_add(&p->wake_entry, &rq->wake_list)) {
> + set_tsk_need_resched(rq->idle);
> + smp_mb__after_clear_bit();
> + if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle)
> + smp_send_reschedule(cpu);
> + }

At the very least this needs a comment pointing out that rq->lock is
intentionally not taken. This makes my brain hurt a little :)

> }
>
> bool cpus_share_cache(int this_cpu, int that_cpu)
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 14ca43430aee..bd8ed2d2f2f7 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -105,19 +105,24 @@ static void cpu_idle_loop(void)
> } else {
> local_irq_enable();
> }
> - __current_set_polling();
> }
> arch_cpu_idle_exit();
> - /*
> - * We need to test and propagate the TIF_NEED_RESCHED
> - * bit here because we might not have send the
> - * reschedule IPI to idle tasks.
> - */
> - if (tif_need_resched())
> - set_preempt_need_resched();
> }
> +
> + /*
> + * We must clear polling before running sched_ttwu_pending().
> + * Otherwise it becomes possible to have entries added in
> + * ttwu_queue_remote() and still not get an IPI to process
> + * them.
> + */
> + __current_clr_polling();
> +
> + set_preempt_need_resched();
> + sched_ttwu_pending();
> +
> tick_nohz_idle_exit();
> schedule_preempt_disabled();
> + __current_set_polling();

I wonder if this side has enough barriers to make this work.


I'll see if I have a few free minutes (yeah right!) to try out the
major surgery approach. I think I can do it without even cmpxchg.
Basically, there would be a percpu variable idlepoll_state with three
values: IDLEPOLL_NOT_POLLING, IDLEPOLL_WOKEN, and IDLEPOLL_POLLING.

The polling idle code does:

idlepoll_state = IDLEPOLL_POLLING;
smp_mb();
check for ttwu and need_resched;
mwait, poll, or whatever until idlepoll_state != IDLEPOLL_POLLING;
idlepoll_state = IDLEPOLL_NOT_POLLING;
smp_mb();
check for ttwu and need_resched;

The idle non-polling code does:


idlepoll_state = IDLEPOLL_NOT_POLLING;
smp_mb();
check for ttwu and need_resched;
wait for interrupt;
idlepoll_state = IDLEPOLL_NOT_POLLING;
smp_mb();
check for ttwu and need_resched;

The IPI handler does:
if (xchg(&idlepoll_state, IDLEPOLL_NOT_POLLING) == IDLEPOLL_WOKEN))
set need_resched;

The wakeup code does:

if (xchg(&idlepoll_state, IDLEPOLL_WOKEN) == IDLEPOLL_NOT_POLLING))
smp_send_reschedule(cpu);

or even:

if (idlepoll_state == IDLEPOLL_NOT_POLLING || xchg(&idlepoll_state,
IDLEPOLL_WOKEN) == IDLEPOLL_NOT_POLLING))
smp_send_reschedule(cpu);

I'll mull on this. If I'm right, this should be faster than the
current code for !x86, too -- waking up a CPU becomes a read of a
single shared cacheline + smp_send_reschedule.

I don't know how the !dynticks case works well enough to know whether
this will hurt performance. I assume the reason that TIF_NEED_RESCHED
exists in the first place is so that the kernel exit code can check a
single thing for all slow-path work including rescheduling.

--Andy

2014-02-12 20:17:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Too many rescheduling interrupts (still!)

On Wed, Feb 12, 2014 at 10:19:42AM -0800, Andy Lutomirski wrote:
> > static void ttwu_queue_remote(struct task_struct *p, int cpu)
> > {
> > - if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
> > - smp_send_reschedule(cpu);
> > + struct rq *rq = cpu_rq(cpu);
> > +
> > + if (llist_add(&p->wake_entry, &rq->wake_list)) {
> > + set_tsk_need_resched(rq->idle);
> > + smp_mb__after_clear_bit();
> > + if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle)
> > + smp_send_reschedule(cpu);
> > + }
>
> At the very least this needs a comment pointing out that rq->lock is
> intentionally not taken. This makes my brain hurt a little :)

Oh absolutely; I wanted to write one, but couldn't get a straight story
so gave up for now.

> > + /*
> > + * We must clear polling before running sched_ttwu_pending().
> > + * Otherwise it becomes possible to have entries added in
> > + * ttwu_queue_remote() and still not get an IPI to process
> > + * them.
> > + */
> > + __current_clr_polling();
> > +
> > + set_preempt_need_resched();
> > + sched_ttwu_pending();
> > +
> > tick_nohz_idle_exit();
> > schedule_preempt_disabled();
> > + __current_set_polling();
>
> I wonder if this side has enough barriers to make this work.

sched_ttwu_pending() does xchg() as first op and thereby orders itself
against the clr_polling.


I'll need a fresh brain for your proposal.. will read it again in the
morning.

2014-02-13 01:40:23

by Andy Lutomirski

[permalink] [raw]
Subject: [RFC] sched: Add a new lockless wake-from-idle implementation

This is a strawman proposal to simplify the idle implementation, eliminate
a race

Benefits over current code:
- ttwu_queue_remote doesn't use an IPI unless needed
- The diffstat should speak for itself :)
- Less racy. Spurious IPIs are possible, but only in narrow windows or
when two wakeups occur in rapid succession.
- Seems to work (?)

Issues:
- Am I doing the percpu stuff right?
- Needs work on non-x86 architectures
- The !CONFIG_SMP case needs to be checked
- Is "idlepoll" a good name for the new code? It doesn't have *that*
much to do with the idle state. Maybe cpukick?

If this turns out okay, TIF_NEED_RESCHED could possibly be deleted as well.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/mwait.h | 22 +++----
arch/x86/include/asm/thread_info.h | 2 -
arch/x86/kernel/apm_32.c | 12 ----
include/linux/idlepoll.h | 85 ++++++++++++++++++++++++++
include/linux/preempt.h | 5 --
include/linux/sched.h | 120 -------------------------------------
kernel/cpu/idle.c | 60 ++++++++++++-------
kernel/sched/core.c | 93 +++++++++++-----------------
8 files changed, 167 insertions(+), 232 deletions(-)
create mode 100644 include/linux/idlepoll.h

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 1da25a5..8addd29 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -2,6 +2,7 @@
#define _ASM_X86_MWAIT_H

#include <linux/sched.h>
+#include <linux/idlepoll.h>

#define MWAIT_SUBSTATE_MASK 0xf
#define MWAIT_CSTATE_MASK 0xf
@@ -42,18 +43,17 @@ static inline void __mwait(unsigned long eax, unsigned long ecx)
*/
static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
{
- if (!current_set_polling_and_test()) {
- if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
- mb();
- clflush((void *)&current_thread_info()->flags);
- mb();
- }
-
- __monitor((void *)&current_thread_info()->flags, 0, 0);
- if (!need_resched())
- __mwait(eax, ecx);
+ idlepoll_begin_poll();
+ if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
+ mb();
+ clflush(idlepoll_poll_ptr());
+ mb();
}
- current_clr_polling();
+
+ __monitor(idlepoll_poll_ptr(), 0, 0);
+ if (idlepoll_keep_polling())
+ __mwait(eax, ecx);
+ idlepoll_end_poll();
}

#endif /* _ASM_X86_MWAIT_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e1940c0..caa3f63 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -234,8 +234,6 @@ static inline struct thread_info *current_thread_info(void)
* have to worry about atomic accesses.
*/
#define TS_COMPAT 0x0002 /* 32bit syscall active (64BIT)*/
-#define TS_POLLING 0x0004 /* idle task polling need_resched,
- skip sending interrupt */
#define TS_RESTORE_SIGMASK 0x0008 /* restore signal mask in do_signal() */

#ifndef __ASSEMBLY__
diff --git a/arch/x86/kernel/apm_32.c b/arch/x86/kernel/apm_32.c
index 3ab0343..5848744 100644
--- a/arch/x86/kernel/apm_32.c
+++ b/arch/x86/kernel/apm_32.c
@@ -841,24 +841,12 @@ static int apm_do_idle(void)
u32 eax;
u8 ret = 0;
int idled = 0;
- int polling;
int err = 0;

- polling = !!(current_thread_info()->status & TS_POLLING);
- if (polling) {
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we
- * test NEED_RESCHED:
- */
- smp_mb();
- }
if (!need_resched()) {
idled = 1;
ret = apm_bios_call_simple(APM_FUNC_IDLE, 0, 0, &eax, &err);
}
- if (polling)
- current_thread_info()->status |= TS_POLLING;

if (!idled)
return 0;
diff --git a/include/linux/idlepoll.h b/include/linux/idlepoll.h
new file mode 100644
index 0000000..072bc7e
--- /dev/null
+++ b/include/linux/idlepoll.h
@@ -0,0 +1,85 @@
+/*
+ * idlepoll.h: definitions and inlines for lockless wake-from-idle.
+ *
+ * Copyright (c) 2014 Andy Lutomirski ([email protected])
+ */
+#ifndef __IDLEPOLL_H
+#define __IDLEPOLL_H
+
+#include <linux/percpu-defs.h>
+#include <asm/atomic.h>
+
+DECLARE_PER_CPU_ALIGNED(atomic_t, idlepoll_state);
+
+#define IDLEPOLL_NOT_POLLING 0 /* wakers must send IPI */
+#define IDLEPOLL_POLLING 1 /* ok to wake using idlepoll_state */
+#define IDLEPOLL_KICKED 2 /* __idlepoll_kicked will be called soon */
+
+void __idlepoll_kicked(void);
+
+/**
+ * idlepoll_poll_ptr - Address to use for hardware idle polling
+ *
+ * On architectures with an idle-until-an-address-is-written operations,
+ * this returns the address to use.
+ */
+static inline void *idlepoll_poll_ptr(void)
+{
+ return this_cpu_ptr(&idlepoll_state);
+}
+
+/**
+ * idlepoll_begin_poll - Address to use for hardware idle polling
+ *
+ * If an idle implementation polls, call this before polling.
+ */
+static inline void idlepoll_begin_poll(void)
+{
+ BUG_ON(atomic_read(this_cpu_ptr(&idlepoll_state)) == IDLEPOLL_POLLING);
+
+ /*
+ * Mark us polling. Note: we don't particularly care what
+ * what the previous value was. If it was IDLEPOLL_KICKED, then
+ * an IPI is on its way.
+ */
+ atomic_set(this_cpu_ptr(&idlepoll_state), IDLEPOLL_POLLING);
+
+ /*
+ * No barrier here: we assume that whoever calls us has enough
+ * barriers to notice idlepoll_state changing.
+ */
+}
+
+/**
+ * idlepoll_end_poll - Address to use for hardware idle polling
+ *
+ * If an idle implementation polls, call this after polling.
+ */
+static inline void idlepoll_end_poll(void)
+{
+ /*
+ * It's possible to be in IDLEPOLL_NOT_POLLING: an IPI
+ * could have come in and beaten us to the punch.
+ */
+ if (atomic_xchg(this_cpu_ptr(&idlepoll_state), IDLEPOLL_NOT_POLLING) ==
+ IDLEPOLL_KICKED)
+ __idlepoll_kicked();
+}
+
+/**
+ * idlepoll_keep_polling - Should a polling idle implementation stop polling?
+ *
+ * If an idle implementation polls, it should check this before pausing
+ * or asking hardware to wait.
+ */
+static inline bool idlepoll_keep_polling(void)
+{
+ return atomic_read(this_cpu_ptr(&idlepoll_state)) == IDLEPOLL_POLLING
+ && !test_preempt_need_resched();
+}
+
+#ifdef CONFIG_SMP
+void idlepoll_kick_cpu(int cpu);
+#endif
+
+#endif
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index de83b4e..04a3f39 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -138,11 +138,6 @@ do { \
do { \
set_preempt_need_resched(); \
} while (0)
-#define preempt_fold_need_resched() \
-do { \
- if (tif_need_resched()) \
- set_preempt_need_resched(); \
-} while (0)

#ifdef CONFIG_PREEMPT_NOTIFIERS

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec..508df16 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2661,126 +2661,6 @@ static inline int spin_needbreak(spinlock_t *lock)
#endif
}

-/*
- * Idle thread specific functions to determine the need_resched
- * polling state. We have two versions, one based on TS_POLLING in
- * thread_info.status and one based on TIF_POLLING_NRFLAG in
- * thread_info.flags
- */
-#ifdef TS_POLLING
-static inline int tsk_is_polling(struct task_struct *p)
-{
- return task_thread_info(p)->status & TS_POLLING;
-}
-static inline void __current_set_polling(void)
-{
- current_thread_info()->status |= TS_POLLING;
-}
-
-static inline bool __must_check current_set_polling_and_test(void)
-{
- __current_set_polling();
-
- /*
- * Polling state must be visible before we test NEED_RESCHED,
- * paired by resched_task()
- */
- smp_mb();
-
- return unlikely(tif_need_resched());
-}
-
-static inline void __current_clr_polling(void)
-{
- current_thread_info()->status &= ~TS_POLLING;
-}
-
-static inline bool __must_check current_clr_polling_and_test(void)
-{
- __current_clr_polling();
-
- /*
- * Polling state must be visible before we test NEED_RESCHED,
- * paired by resched_task()
- */
- smp_mb();
-
- return unlikely(tif_need_resched());
-}
-#elif defined(TIF_POLLING_NRFLAG)
-static inline int tsk_is_polling(struct task_struct *p)
-{
- return test_tsk_thread_flag(p, TIF_POLLING_NRFLAG);
-}
-
-static inline void __current_set_polling(void)
-{
- set_thread_flag(TIF_POLLING_NRFLAG);
-}
-
-static inline bool __must_check current_set_polling_and_test(void)
-{
- __current_set_polling();
-
- /*
- * Polling state must be visible before we test NEED_RESCHED,
- * paired by resched_task()
- *
- * XXX: assumes set/clear bit are identical barrier wise.
- */
- smp_mb__after_clear_bit();
-
- return unlikely(tif_need_resched());
-}
-
-static inline void __current_clr_polling(void)
-{
- clear_thread_flag(TIF_POLLING_NRFLAG);
-}
-
-static inline bool __must_check current_clr_polling_and_test(void)
-{
- __current_clr_polling();
-
- /*
- * Polling state must be visible before we test NEED_RESCHED,
- * paired by resched_task()
- */
- smp_mb__after_clear_bit();
-
- return unlikely(tif_need_resched());
-}
-
-#else
-static inline int tsk_is_polling(struct task_struct *p) { return 0; }
-static inline void __current_set_polling(void) { }
-static inline void __current_clr_polling(void) { }
-
-static inline bool __must_check current_set_polling_and_test(void)
-{
- return unlikely(tif_need_resched());
-}
-static inline bool __must_check current_clr_polling_and_test(void)
-{
- return unlikely(tif_need_resched());
-}
-#endif
-
-static inline void current_clr_polling(void)
-{
- __current_clr_polling();
-
- /*
- * Ensure we check TIF_NEED_RESCHED after we clear the polling bit.
- * Once the bit is cleared, we'll get IPIs with every new
- * TIF_NEED_RESCHED and the IPI handler, scheduler_ipi(), will also
- * fold.
- */
- smp_mb(); /* paired with resched_task() */
-
- preempt_fold_need_resched();
-}
-
static __always_inline bool need_resched(void)
{
return unlikely(tif_need_resched());
diff --git a/kernel/cpu/idle.c b/kernel/cpu/idle.c
index 277f494..5f772f5 100644
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -6,12 +6,14 @@
#include <linux/tick.h>
#include <linux/mm.h>
#include <linux/stackprotector.h>
+#include <linux/idlepoll.h>

#include <asm/tlb.h>

#include <trace/events/power.h>

static int __read_mostly cpu_idle_force_poll;
+DEFINE_PER_CPU(atomic_t, idlepoll_state) ____cacheline_aligned;

void cpu_idle_poll_ctrl(bool enable)
{
@@ -44,13 +46,39 @@ static inline int cpu_idle_poll(void)
rcu_idle_enter();
trace_cpu_idle_rcuidle(0, smp_processor_id());
local_irq_enable();
- while (!tif_need_resched())
+
+ idlepoll_begin_poll();
+ while (idlepoll_keep_polling())
cpu_relax();
+ idlepoll_end_poll();
+
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
rcu_idle_exit();
return 1;
}

+#ifdef CONFIG_SMP
+DEFINE_PER_CPU_ALIGNED(atomic_t, idlepoll_state);
+EXPORT_SYMBOL_GPL(idlepoll_state);
+
+/**
+ * idlepoll_kick_cpu - Cause a target cpu to call __idlepoll_kicked
+ *
+ * This implies a strong enough barrier that any writes before
+ * idlepoll_kick_cpu will be visible before __idlepoll_kicked is called.
+ *
+ * Do NOT call this on the current cpu. Call from any context: no
+ * locks are required or taken.
+ */
+void idlepoll_kick_cpu(int cpu)
+{
+ if (atomic_xchg(per_cpu_ptr(&idlepoll_state, cpu), IDLEPOLL_KICKED) ==
+ IDLEPOLL_NOT_POLLING)
+ smp_send_reschedule(cpu);
+}
+EXPORT_SYMBOL_GPL(idlepoll_kick_cpu);
+#endif
+
/* Weak implementations for optional arch specific functions */
void __weak arch_cpu_idle_prepare(void) { }
void __weak arch_cpu_idle_enter(void) { }
@@ -65,12 +93,13 @@ void __weak arch_cpu_idle(void)
/*
* Generic idle loop implementation
*/
+
static void cpu_idle_loop(void)
{
while (1) {
tick_nohz_idle_enter();

- while (!need_resched()) {
+ while (!test_preempt_need_resched()) {
check_pgt_cache();
rmb();

@@ -92,30 +121,16 @@ static void cpu_idle_loop(void)
if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
cpu_idle_poll();
} else {
- if (!current_clr_polling_and_test()) {
- stop_critical_timings();
- rcu_idle_enter();
- arch_cpu_idle();
- WARN_ON_ONCE(irqs_disabled());
- rcu_idle_exit();
- start_critical_timings();
- } else {
- local_irq_enable();
- }
- __current_set_polling();
+ stop_critical_timings();
+ rcu_idle_enter();
+ arch_cpu_idle();
+ WARN_ON_ONCE(irqs_disabled());
+ rcu_idle_exit();
+ start_critical_timings();
}
arch_cpu_idle_exit();
}

- /*
- * Since we fell out of the loop above, we know
- * TIF_NEED_RESCHED must be set, propagate it into
- * PREEMPT_NEED_RESCHED.
- *
- * This is required because for polling idle loops we will
- * not have had an IPI to fold the state for us.
- */
- preempt_set_need_resched();
tick_nohz_idle_exit();
schedule_preempt_disabled();
}
@@ -138,7 +153,6 @@ void cpu_startup_entry(enum cpuhp_state state)
*/
boot_init_stack_canary();
#endif
- __current_set_polling();
arch_cpu_idle_prepare();
cpu_idle_loop();
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b46131e..b771c46 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -73,6 +73,7 @@
#include <linux/init_task.h>
#include <linux/binfmts.h>
#include <linux/context_tracking.h>
+#include <linux/idlepoll.h>

#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -504,6 +505,18 @@ static inline void init_hrtick(void)
}
#endif /* CONFIG_SCHED_HRTICK */

+void resched_cpu(int cpu)
+{
+#ifdef CONFIG_SMP
+ if (cpu == smp_processor_id())
+ set_preempt_need_resched();
+ else
+ idlepoll_kick_cpu(cpu);
+#else
+ set_preempt_need_resched();
+#endif
+}
+
/*
* resched_task - mark a task 'to be rescheduled now'.
*
@@ -513,36 +526,16 @@ static inline void init_hrtick(void)
*/
void resched_task(struct task_struct *p)
{
- int cpu;
-
lockdep_assert_held(&task_rq(p)->lock);

+ /* XXX: this is probably unnecessary. */
if (test_tsk_need_resched(p))
return;

+ /* XXX: so is this. */
set_tsk_need_resched(p);

- cpu = task_cpu(p);
- if (cpu == smp_processor_id()) {
- set_preempt_need_resched();
- return;
- }
-
- /* NEED_RESCHED must be visible before we test polling */
- smp_mb();
- if (!tsk_is_polling(p))
- smp_send_reschedule(cpu);
-}
-
-void resched_cpu(int cpu)
-{
- struct rq *rq = cpu_rq(cpu);
- unsigned long flags;
-
- if (!raw_spin_trylock_irqsave(&rq->lock, flags))
- return;
- resched_task(cpu_curr(cpu));
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+ resched_cpu(task_cpu(p));
}

#ifdef CONFIG_SMP
@@ -586,32 +579,10 @@ unlock:
*/
static void wake_up_idle_cpu(int cpu)
{
- struct rq *rq = cpu_rq(cpu);
-
if (cpu == smp_processor_id())
return;

- /*
- * This is safe, as this function is called with the timer
- * wheel base lock of (cpu) held. When the CPU is on the way
- * to idle and has not yet set rq->curr to idle then it will
- * be serialized on the timer wheel base lock and take the new
- * timer into account automatically.
- */
- if (rq->curr != rq->idle)
- return;
-
- /*
- * We can set TIF_RESCHED on the idle task of the other CPU
- * lockless. The worst case is that the other CPU runs the
- * idle task through an additional NOOP schedule()
- */
- set_tsk_need_resched(rq->idle);
-
- /* NEED_RESCHED must be visible before we test polling */
- smp_mb();
- if (!tsk_is_polling(rq->idle))
- smp_send_reschedule(cpu);
+ idlepoll_kick_cpu(cpu);
}

static bool wake_up_full_nohz_cpu(int cpu)
@@ -1493,19 +1464,23 @@ static void sched_ttwu_pending(void)
raw_spin_unlock(&rq->lock);
}

+void __idlepoll_kicked(void)
+{
+ set_preempt_need_resched();
+ sched_ttwu_pending();
+}
+EXPORT_SYMBOL_GPL(__idlepoll_kicked);
+
void scheduler_ipi(void)
{
- /*
- * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
- * TIF_NEED_RESCHED remotely (for the first time) will also send
- * this IPI.
- */
- preempt_fold_need_resched();
+ irq_enter();

- if (llist_empty(&this_rq()->wake_list)
- && !tick_nohz_full_cpu(smp_processor_id())
- && !got_nohz_idle_kick())
- return;
+ if (atomic_xchg(this_cpu_ptr(&idlepoll_state), IDLEPOLL_NOT_POLLING) ==
+ IDLEPOLL_KICKED)
+ __idlepoll_kicked();
+
+ if (!tick_nohz_full_cpu(smp_processor_id()) && !got_nohz_idle_kick())
+ goto out;

/*
* Not all reschedule IPI handlers call irq_enter/irq_exit, since
@@ -1520,9 +1495,7 @@ void scheduler_ipi(void)
* however a fair share of IPIs are still resched only so this would
* somewhat pessimize the simple resched case.
*/
- irq_enter();
tick_nohz_full_check();
- sched_ttwu_pending();

/*
* Check if someone kicked us for doing the nohz idle load balance.
@@ -1531,13 +1504,15 @@ void scheduler_ipi(void)
this_rq()->idle_balance = 1;
raise_softirq_irqoff(SCHED_SOFTIRQ);
}
+
+out:
irq_exit();
}

static void ttwu_queue_remote(struct task_struct *p, int cpu)
{
if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
- smp_send_reschedule(cpu);
+ idlepoll_kick_cpu(cpu);
}

bool cpus_share_cache(int this_cpu, int that_cpu)
--
1.8.5.3

2014-02-13 09:38:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] sched: Add a new lockless wake-from-idle implementation


* Andy Lutomirski <[email protected]> wrote:

> This is a strawman proposal to simplify the idle implementation, eliminate
> a race
>
> Benefits over current code:
> - ttwu_queue_remote doesn't use an IPI unless needed

Cool.

> - The diffstat should speak for itself :)

Neat!

> - Less racy. Spurious IPIs are possible, but only in narrow windows or
> when two wakeups occur in rapid succession.
> - Seems to work (?)
>
> Issues:
> - Am I doing the percpu stuff right?
> - Needs work on non-x86 architectures

Absolutely, and with the least amount of disruption possible, as
people are not very good at testing 'all' of them.

> - The !CONFIG_SMP case needs to be checked

Which also happens to be the default for half of all non-x86 arches.

> - Is "idlepoll" a good name for the new code? It doesn't have *that*
> much to do with the idle state. Maybe cpukick?

'cpukick', hands down.

> If this turns out okay, TIF_NEED_RESCHED could possibly be deleted as well.

Cool ...

Thanks,

Ingo

2014-02-13 14:49:09

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] sched: Add a new lockless wake-from-idle implementation

2014-02-13 2:40 GMT+01:00 Andy Lutomirski <[email protected]>:
> This is a strawman proposal to simplify the idle implementation, eliminate
> a race

Please describe the race in question.

>
> Benefits over current code:
> - ttwu_queue_remote doesn't use an IPI unless needed
> - The diffstat should speak for itself :)

Actually referring to the diffstat alone sounds dangerous here. Sure
this simplifies the code, I'm all for negative diffstat, and probably
it avoids a few IPIs, but this comes at the cost of added non-cheap
atomic_xchg() calls in some critical fastpath like resched_task()
path, the timer enqueue path and the inner idle loop.

So it's not like this all comes for free. I'm not saying we don't want
it but at least some serious measurements is needed.

2014-02-13 14:50:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] sched: Add a new lockless wake-from-idle implementation

On Wed, Feb 12, 2014 at 05:40:12PM -0800, Andy Lutomirski wrote:
> This is a strawman proposal to simplify the idle implementation, eliminate
> a race
>
> Benefits over current code:
> - ttwu_queue_remote doesn't use an IPI unless needed
> - The diffstat should speak for itself :)
> - Less racy. Spurious IPIs are possible, but only in narrow windows or
> when two wakeups occur in rapid succession.
> - Seems to work (?)
>
> Issues:
> - Am I doing the percpu stuff right?
> - Needs work on non-x86 architectures
> - The !CONFIG_SMP case needs to be checked
> - Is "idlepoll" a good name for the new code? It doesn't have *that*
> much to do with the idle state. Maybe cpukick?
>
> If this turns out okay, TIF_NEED_RESCHED could possibly be deleted as well.

No, we can't do away with that; its used in some fairly critical paths
(return to userspace) and adding a second cacheline load there would be
unfortunate.

I also don't really like how the polling state is an atomic; its a cpu
local property.

Now given we can't get rid of TIF_NEED_RESCHED, and we need an atomic op
on a remote cacheline anyhow; the simplest solution would be to convert
all TS_POLLING users to TIF_POLLING_NRFLAG and use an atomic_or_return()
like construct to do:

atomic_or_return(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG

and avoid the IPI if that is false.

Something a little like this; it does require a lot of auditing; but it
boots on my x86_64.

---
arch/x86/include/asm/mwait.h | 19 ++++----
arch/x86/include/asm/thread_info.h | 4 +-
arch/x86/kernel/process.c | 8 ++--
include/linux/sched.h | 90 +++-----------------------------------
kernel/sched/core.c | 70 ++++++++++++++++++-----------
kernel/sched/idle.c | 40 ++++++++---------
kernel/sched/sched.h | 3 ++
7 files changed, 88 insertions(+), 146 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 1da25a5f96f9..d6a7b7cdc478 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -42,18 +42,15 @@ static inline void __mwait(unsigned long eax, unsigned long ecx)
*/
static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
{
- if (!current_set_polling_and_test()) {
- if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
- mb();
- clflush((void *)&current_thread_info()->flags);
- mb();
- }
-
- __monitor((void *)&current_thread_info()->flags, 0, 0);
- if (!need_resched())
- __mwait(eax, ecx);
+ if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
+ mb();
+ clflush((void *)&current_thread_info()->flags);
+ mb();
}
- current_clr_polling();
+
+ __monitor((void *)&current_thread_info()->flags, 0, 0);
+ if (!need_resched())
+ __mwait(eax, ecx);
}

#endif /* _ASM_X86_MWAIT_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e1940c06ed02..1f2fe575cfca 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -88,6 +88,7 @@ struct thread_info {
#define TIF_FORK 18 /* ret_from_fork */
#define TIF_NOHZ 19 /* in adaptive nohz mode */
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
+#define TIF_POLLING_NRFLAG 21
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
@@ -111,6 +112,7 @@ struct thread_info {
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_FORK (1 << TIF_FORK)
#define _TIF_NOHZ (1 << TIF_NOHZ)
+#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
@@ -234,8 +236,6 @@ static inline struct thread_info *current_thread_info(void)
* have to worry about atomic accesses.
*/
#define TS_COMPAT 0x0002 /* 32bit syscall active (64BIT)*/
-#define TS_POLLING 0x0004 /* idle task polling need_resched,
- skip sending interrupt */
#define TS_RESTORE_SIGMASK 0x0008 /* restore signal mask in do_signal() */

#ifndef __ASSEMBLY__
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 4505e2a950d8..298c002629c6 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -306,9 +306,11 @@ void arch_cpu_idle(void)
*/
void default_idle(void)
{
- trace_cpu_idle_rcuidle(1, smp_processor_id());
- safe_halt();
- trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
+ if (!current_clr_polling_and_test()) {
+ trace_cpu_idle_rcuidle(1, smp_processor_id());
+ safe_halt();
+ trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
+ }
}
#ifdef CONFIG_APM_MODULE
EXPORT_SYMBOL(default_idle);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c49a2585ff7d..b326abe95978 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2692,85 +2692,26 @@ static inline int spin_needbreak(spinlock_t *lock)
* thread_info.status and one based on TIF_POLLING_NRFLAG in
* thread_info.flags
*/
-#ifdef TS_POLLING
-static inline int tsk_is_polling(struct task_struct *p)
-{
- return task_thread_info(p)->status & TS_POLLING;
-}
-static inline void __current_set_polling(void)
-{
- current_thread_info()->status |= TS_POLLING;
-}
-
-static inline bool __must_check current_set_polling_and_test(void)
-{
- __current_set_polling();
-
- /*
- * Polling state must be visible before we test NEED_RESCHED,
- * paired by resched_task()
- */
- smp_mb();
-
- return unlikely(tif_need_resched());
-}
-
-static inline void __current_clr_polling(void)
-{
- current_thread_info()->status &= ~TS_POLLING;
-}
-
-static inline bool __must_check current_clr_polling_and_test(void)
-{
- __current_clr_polling();
-
- /*
- * Polling state must be visible before we test NEED_RESCHED,
- * paired by resched_task()
- */
- smp_mb();
-
- return unlikely(tif_need_resched());
-}
-#elif defined(TIF_POLLING_NRFLAG)
+#ifdef CONFIG_SMP
static inline int tsk_is_polling(struct task_struct *p)
{
return test_tsk_thread_flag(p, TIF_POLLING_NRFLAG);
}

-static inline void __current_set_polling(void)
+static inline void current_set_polling(void)
{
set_thread_flag(TIF_POLLING_NRFLAG);
}

-static inline bool __must_check current_set_polling_and_test(void)
-{
- __current_set_polling();
-
- /*
- * Polling state must be visible before we test NEED_RESCHED,
- * paired by resched_task()
- *
- * XXX: assumes set/clear bit are identical barrier wise.
- */
- smp_mb__after_clear_bit();
-
- return unlikely(tif_need_resched());
-}
-
-static inline void __current_clr_polling(void)
+static inline void current_clr_polling(void)
{
clear_thread_flag(TIF_POLLING_NRFLAG);
}

static inline bool __must_check current_clr_polling_and_test(void)
{
- __current_clr_polling();
+ current_clr_polling();

- /*
- * Polling state must be visible before we test NEED_RESCHED,
- * paired by resched_task()
- */
smp_mb__after_clear_bit();

return unlikely(tif_need_resched());
@@ -2778,34 +2719,15 @@ static inline bool __must_check current_clr_polling_and_test(void)

#else
static inline int tsk_is_polling(struct task_struct *p) { return 0; }
-static inline void __current_set_polling(void) { }
-static inline void __current_clr_polling(void) { }
+static inline void current_set_polling(void) { }
+static inline void current_clr_polling(void) { }

-static inline bool __must_check current_set_polling_and_test(void)
-{
- return unlikely(tif_need_resched());
-}
static inline bool __must_check current_clr_polling_and_test(void)
{
return unlikely(tif_need_resched());
}
#endif

-static inline void current_clr_polling(void)
-{
- __current_clr_polling();
-
- /*
- * Ensure we check TIF_NEED_RESCHED after we clear the polling bit.
- * Once the bit is cleared, we'll get IPIs with every new
- * TIF_NEED_RESCHED and the IPI handler, scheduler_ipi(), will also
- * fold.
- */
- smp_mb(); /* paired with resched_task() */
-
- preempt_fold_need_resched();
-}
-
static __always_inline bool need_resched(void)
{
return unlikely(tif_need_resched());
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb9764fbc537..19de9a1cc96c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -504,45 +504,63 @@ static inline void init_hrtick(void)
}
#endif /* CONFIG_SCHED_HRTICK */

-/*
- * resched_task - mark a task 'to be rescheduled now'.
- *
- * On UP this means the setting of the need_resched flag, on SMP it
- * might also involve a cross-CPU call to trigger the scheduler on
- * the target CPU.
- */
-void resched_task(struct task_struct *p)
+static void __resched_cpu(int cpu)
{
- int cpu;
+ struct rq *rq = cpu_rq(cpu);
+ struct task_struct *p;
+ struct thread_info *ti;
+ unsigned long val, old, new;
+ bool ipi;

- lockdep_assert_held(&task_rq(p)->lock);
+ rcu_read_lock();
+ do {
+ p = ACCESS_ONCE(rq->curr);
+ ti = task_thread_info(p);
+
+ val = ti->flags;
+ for (;;) {
+ new = val | _TIF_NEED_RESCHED;
+ old = cmpxchg(&ti->flags, val, new);
+ if (old == val)
+ break;
+ val = old;
+ }

- if (test_tsk_need_resched(p))
- return;
+ ipi = !(old & _TIF_POLLING_NRFLAG);

- set_tsk_need_resched(p);
+ } while (p != ACCESS_ONCE(rq->curr));
+ rcu_read_unlock();

- cpu = task_cpu(p);
+ if (ipi)
+ smp_send_reschedule(cpu);
+}
+
+void resched_cpu(int cpu)
+{
if (cpu == smp_processor_id()) {
+ set_tsk_need_resched(current);
set_preempt_need_resched();
return;
}

- /* NEED_RESCHED must be visible before we test polling */
- smp_mb();
- if (!tsk_is_polling(p))
- smp_send_reschedule(cpu);
+ __resched_cpu(cpu);
}

-void resched_cpu(int cpu)
+/*
+ * resched_task - mark a task 'to be rescheduled now'.
+ *
+ * On UP this means the setting of the need_resched flag, on SMP it
+ * might also involve a cross-CPU call to trigger the scheduler on
+ * the target CPU.
+ */
+void resched_task(struct task_struct *p)
{
- struct rq *rq = cpu_rq(cpu);
- unsigned long flags;
+ lockdep_assert_held(&task_rq(p)->lock);

- if (!raw_spin_trylock_irqsave(&rq->lock, flags))
+ if (test_tsk_need_resched(p))
return;
- resched_task(cpu_curr(cpu));
- raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+ resched_cpu(task_cpu(p));
}

#ifdef CONFIG_SMP
@@ -1476,7 +1494,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
}

#ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+void sched_ttwu_pending(void)
{
struct rq *rq = this_rq();
struct llist_node *llist = llist_del_all(&rq->wake_list);
@@ -2689,7 +2707,7 @@ static void __sched __schedule(void)
update_rq_clock(rq);

next = pick_next_task(rq, prev);
- clear_tsk_need_resched(prev);
+ clear_tsk_need_resched(next);
clear_preempt_need_resched();
rq->skip_clock_update = 0;

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 14ca43430aee..03748737473e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -11,6 +11,7 @@
#include <asm/tlb.h>

#include <trace/events/power.h>
+#include "sched.h"

static int __read_mostly cpu_idle_force_poll;

@@ -71,6 +72,8 @@ static void cpu_idle_loop(void)
while (1) {
tick_nohz_idle_enter();

+ current_set_polling();
+
while (!need_resched()) {
check_pgt_cache();
rmb();
@@ -93,29 +96,27 @@ static void cpu_idle_loop(void)
if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
cpu_idle_poll();
} else {
- if (!current_clr_polling_and_test()) {
- stop_critical_timings();
- rcu_idle_enter();
- if (cpuidle_idle_call())
- arch_cpu_idle();
- if (WARN_ON_ONCE(irqs_disabled()))
- local_irq_enable();
- rcu_idle_exit();
- start_critical_timings();
- } else {
+ stop_critical_timings();
+ rcu_idle_enter();
+ if (cpuidle_idle_call())
+ arch_cpu_idle();
+ if (WARN_ON_ONCE(irqs_disabled()))
local_irq_enable();
- }
- __current_set_polling();
+ rcu_idle_exit();
+ start_critical_timings();
}
arch_cpu_idle_exit();
- /*
- * We need to test and propagate the TIF_NEED_RESCHED
- * bit here because we might not have send the
- * reschedule IPI to idle tasks.
- */
- if (tif_need_resched())
- set_preempt_need_resched();
}
+
+ current_clr_polling();
+
+ /*
+ * We need to test and propagate the TIF_NEED_RESCHED bit here
+ * because we might not have send the reschedule IPI to idle
+ * tasks.
+ */
+ set_preempt_need_resched();
+ sched_ttwu_pending();
tick_nohz_idle_exit();
schedule_preempt_disabled();
}
@@ -138,7 +139,6 @@ void cpu_startup_entry(enum cpuhp_state state)
*/
boot_init_stack_canary();
#endif
- __current_set_polling();
arch_cpu_idle_prepare();
cpu_idle_loop();
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1bf34c257d3b..269b1a4e0bdf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1160,6 +1160,7 @@ extern const struct sched_class idle_sched_class;

#ifdef CONFIG_SMP

+extern void sched_ttwu_pending(void);
extern void update_group_power(struct sched_domain *sd, int cpu);

extern void trigger_load_balance(struct rq *rq);
@@ -1170,6 +1171,8 @@ extern void idle_exit_fair(struct rq *this_rq);

#else /* CONFIG_SMP */

+static inline void sched_ttwu_pending(void) { }
+
static inline void idle_balance(int cpu, struct rq *rq)
{
}


2014-02-13 17:07:32

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC] sched: Add a new lockless wake-from-idle implementation

On Thu, Feb 13, 2014 at 6:50 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Feb 12, 2014 at 05:40:12PM -0800, Andy Lutomirski wrote:
>> This is a strawman proposal to simplify the idle implementation, eliminate
>> a race
>>
>> Benefits over current code:
>> - ttwu_queue_remote doesn't use an IPI unless needed
>> - The diffstat should speak for itself :)
>> - Less racy. Spurious IPIs are possible, but only in narrow windows or
>> when two wakeups occur in rapid succession.
>> - Seems to work (?)
>>
>> Issues:
>> - Am I doing the percpu stuff right?
>> - Needs work on non-x86 architectures
>> - The !CONFIG_SMP case needs to be checked
>> - Is "idlepoll" a good name for the new code? It doesn't have *that*
>> much to do with the idle state. Maybe cpukick?
>>
>> If this turns out okay, TIF_NEED_RESCHED could possibly be deleted as well.
>
> No, we can't do away with that; its used in some fairly critical paths
> (return to userspace) and adding a second cacheline load there would be
> unfortunate.
>
> I also don't really like how the polling state is an atomic; its a cpu
> local property.

Your patch also makes polling state be an atomic (albeit one that
isn't changed remotely).

>
> Now given we can't get rid of TIF_NEED_RESCHED, and we need an atomic op
> on a remote cacheline anyhow; the simplest solution would be to convert
> all TS_POLLING users to TIF_POLLING_NRFLAG and use an atomic_or_return()
> like construct to do:
>
> atomic_or_return(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG
>
> and avoid the IPI if that is false.
>
> Something a little like this; it does require a lot of auditing; but it
> boots on my x86_64.

Hmm.

Yours is certainly a simpler change than mine. I don't see anything
obviously wrong with it. There are plenty of weird cases in which one
cpu schedules while another cpu is in the new cmpxchg look, but I
suspect that the worst that can happen is that a spurious wakeup later
on.

My patch (assuming that all the kinks get worked out) will probably be
faster -- there's neither an rcu lock nor a cmpxchg. I'm not
personally inclined to fix up every arch's idle routine, though.

On the subject of major surgery, though: there are very few places in
the kernel where TIF_NEED_RESCHED gets set. With something like my
patch applied, I think that there is no code at all that sets any
other task's TIF_NEED_RESCHED. That suggests that all
set_tsk_need_resched callers could just call into the scheduler
directly. If so, the change could probably delete a whole lot of
assembly code, and every kernel exit would get faster.

--Andy

2014-02-13 19:59:00

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC] sched: Add a new lockless wake-from-idle implementation

On Thu, Feb 13, 2014 at 6:50 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Feb 12, 2014 at 05:40:12PM -0800, Andy Lutomirski wrote:
>> This is a strawman proposal to simplify the idle implementation, eliminate
>> a race
>>
>> Benefits over current code:
>> - ttwu_queue_remote doesn't use an IPI unless needed
>> - The diffstat should speak for itself :)
>> - Less racy. Spurious IPIs are possible, but only in narrow windows or
>> when two wakeups occur in rapid succession.
>> - Seems to work (?)
>>
>> Issues:
>> - Am I doing the percpu stuff right?
>> - Needs work on non-x86 architectures
>> - The !CONFIG_SMP case needs to be checked
>> - Is "idlepoll" a good name for the new code? It doesn't have *that*
>> much to do with the idle state. Maybe cpukick?
>>
>> If this turns out okay, TIF_NEED_RESCHED could possibly be deleted as well.
>
> No, we can't do away with that; its used in some fairly critical paths
> (return to userspace) and adding a second cacheline load there would be
> unfortunate.
>
> I also don't really like how the polling state is an atomic; its a cpu
> local property.
>
> Now given we can't get rid of TIF_NEED_RESCHED, and we need an atomic op
> on a remote cacheline anyhow; the simplest solution would be to convert
> all TS_POLLING users to TIF_POLLING_NRFLAG and use an atomic_or_return()
> like construct to do:
>
> atomic_or_return(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG
>
> and avoid the IPI if that is false.
>
> Something a little like this; it does require a lot of auditing; but it
> boots on my x86_64.
>
> ---
> arch/x86/include/asm/mwait.h | 19 ++++----
> arch/x86/include/asm/thread_info.h | 4 +-
> arch/x86/kernel/process.c | 8 ++--
> include/linux/sched.h | 90 +++-----------------------------------
> kernel/sched/core.c | 70 ++++++++++++++++++-----------
> kernel/sched/idle.c | 40 ++++++++---------
> kernel/sched/sched.h | 3 ++
> 7 files changed, 88 insertions(+), 146 deletions(-)

This patch changes the default from !polling to polling, right? If
so, doesn't it break every driver that fails to *clear* the polling
flag? (My patch is about to have the same problem.)

>
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index 1da25a5f96f9..d6a7b7cdc478 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -42,18 +42,15 @@ static inline void __mwait(unsigned long eax, unsigned long ecx)
> */
> static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
> {
> - if (!current_set_polling_and_test()) {
> - if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> - mb();
> - clflush((void *)&current_thread_info()->flags);
> - mb();
> - }
> -
> - __monitor((void *)&current_thread_info()->flags, 0, 0);
> - if (!need_resched())
> - __mwait(eax, ecx);
> + if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> + mb();
> + clflush((void *)&current_thread_info()->flags);
> + mb();
> }
> - current_clr_polling();
> +
> + __monitor((void *)&current_thread_info()->flags, 0, 0);
> + if (!need_resched())
> + __mwait(eax, ecx);
> }
>
> #endif /* _ASM_X86_MWAIT_H */
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index e1940c06ed02..1f2fe575cfca 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -88,6 +88,7 @@ struct thread_info {
> #define TIF_FORK 18 /* ret_from_fork */
> #define TIF_NOHZ 19 /* in adaptive nohz mode */
> #define TIF_MEMDIE 20 /* is terminating due to OOM killer */
> +#define TIF_POLLING_NRFLAG 21
> #define TIF_IO_BITMAP 22 /* uses I/O bitmap */
> #define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
> #define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
> @@ -111,6 +112,7 @@ struct thread_info {
> #define _TIF_IA32 (1 << TIF_IA32)
> #define _TIF_FORK (1 << TIF_FORK)
> #define _TIF_NOHZ (1 << TIF_NOHZ)
> +#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
> #define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
> #define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
> #define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
> @@ -234,8 +236,6 @@ static inline struct thread_info *current_thread_info(void)
> * have to worry about atomic accesses.
> */
> #define TS_COMPAT 0x0002 /* 32bit syscall active (64BIT)*/
> -#define TS_POLLING 0x0004 /* idle task polling need_resched,
> - skip sending interrupt */
> #define TS_RESTORE_SIGMASK 0x0008 /* restore signal mask in do_signal() */
>
> #ifndef __ASSEMBLY__
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 4505e2a950d8..298c002629c6 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -306,9 +306,11 @@ void arch_cpu_idle(void)
> */
> void default_idle(void)
> {
> - trace_cpu_idle_rcuidle(1, smp_processor_id());
> - safe_halt();
> - trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
> + if (!current_clr_polling_and_test()) {
> + trace_cpu_idle_rcuidle(1, smp_processor_id());
> + safe_halt();
> + trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
> + }
> }
> #ifdef CONFIG_APM_MODULE
> EXPORT_SYMBOL(default_idle);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c49a2585ff7d..b326abe95978 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2692,85 +2692,26 @@ static inline int spin_needbreak(spinlock_t *lock)
> * thread_info.status and one based on TIF_POLLING_NRFLAG in
> * thread_info.flags
> */
> -#ifdef TS_POLLING
> -static inline int tsk_is_polling(struct task_struct *p)
> -{
> - return task_thread_info(p)->status & TS_POLLING;
> -}
> -static inline void __current_set_polling(void)
> -{
> - current_thread_info()->status |= TS_POLLING;
> -}
> -
> -static inline bool __must_check current_set_polling_and_test(void)
> -{
> - __current_set_polling();
> -
> - /*
> - * Polling state must be visible before we test NEED_RESCHED,
> - * paired by resched_task()
> - */
> - smp_mb();
> -
> - return unlikely(tif_need_resched());
> -}
> -
> -static inline void __current_clr_polling(void)
> -{
> - current_thread_info()->status &= ~TS_POLLING;
> -}
> -
> -static inline bool __must_check current_clr_polling_and_test(void)
> -{
> - __current_clr_polling();
> -
> - /*
> - * Polling state must be visible before we test NEED_RESCHED,
> - * paired by resched_task()
> - */
> - smp_mb();
> -
> - return unlikely(tif_need_resched());
> -}
> -#elif defined(TIF_POLLING_NRFLAG)
> +#ifdef CONFIG_SMP
> static inline int tsk_is_polling(struct task_struct *p)
> {
> return test_tsk_thread_flag(p, TIF_POLLING_NRFLAG);
> }
>
> -static inline void __current_set_polling(void)
> +static inline void current_set_polling(void)
> {
> set_thread_flag(TIF_POLLING_NRFLAG);
> }
>
> -static inline bool __must_check current_set_polling_and_test(void)
> -{
> - __current_set_polling();
> -
> - /*
> - * Polling state must be visible before we test NEED_RESCHED,
> - * paired by resched_task()
> - *
> - * XXX: assumes set/clear bit are identical barrier wise.
> - */
> - smp_mb__after_clear_bit();
> -
> - return unlikely(tif_need_resched());
> -}
> -
> -static inline void __current_clr_polling(void)
> +static inline void current_clr_polling(void)
> {
> clear_thread_flag(TIF_POLLING_NRFLAG);
> }
>
> static inline bool __must_check current_clr_polling_and_test(void)
> {
> - __current_clr_polling();
> + current_clr_polling();
>
> - /*
> - * Polling state must be visible before we test NEED_RESCHED,
> - * paired by resched_task()
> - */
> smp_mb__after_clear_bit();
>
> return unlikely(tif_need_resched());
> @@ -2778,34 +2719,15 @@ static inline bool __must_check current_clr_polling_and_test(void)
>
> #else
> static inline int tsk_is_polling(struct task_struct *p) { return 0; }
> -static inline void __current_set_polling(void) { }
> -static inline void __current_clr_polling(void) { }
> +static inline void current_set_polling(void) { }
> +static inline void current_clr_polling(void) { }
>
> -static inline bool __must_check current_set_polling_and_test(void)
> -{
> - return unlikely(tif_need_resched());
> -}
> static inline bool __must_check current_clr_polling_and_test(void)
> {
> return unlikely(tif_need_resched());
> }
> #endif
>
> -static inline void current_clr_polling(void)
> -{
> - __current_clr_polling();
> -
> - /*
> - * Ensure we check TIF_NEED_RESCHED after we clear the polling bit.
> - * Once the bit is cleared, we'll get IPIs with every new
> - * TIF_NEED_RESCHED and the IPI handler, scheduler_ipi(), will also
> - * fold.
> - */
> - smp_mb(); /* paired with resched_task() */
> -
> - preempt_fold_need_resched();
> -}
> -
> static __always_inline bool need_resched(void)
> {
> return unlikely(tif_need_resched());
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fb9764fbc537..19de9a1cc96c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -504,45 +504,63 @@ static inline void init_hrtick(void)
> }
> #endif /* CONFIG_SCHED_HRTICK */
>
> -/*
> - * resched_task - mark a task 'to be rescheduled now'.
> - *
> - * On UP this means the setting of the need_resched flag, on SMP it
> - * might also involve a cross-CPU call to trigger the scheduler on
> - * the target CPU.
> - */
> -void resched_task(struct task_struct *p)
> +static void __resched_cpu(int cpu)
> {
> - int cpu;
> + struct rq *rq = cpu_rq(cpu);
> + struct task_struct *p;
> + struct thread_info *ti;
> + unsigned long val, old, new;
> + bool ipi;
>
> - lockdep_assert_held(&task_rq(p)->lock);
> + rcu_read_lock();
> + do {
> + p = ACCESS_ONCE(rq->curr);
> + ti = task_thread_info(p);
> +
> + val = ti->flags;
> + for (;;) {
> + new = val | _TIF_NEED_RESCHED;
> + old = cmpxchg(&ti->flags, val, new);
> + if (old == val)
> + break;
> + val = old;
> + }
>
> - if (test_tsk_need_resched(p))
> - return;
> + ipi = !(old & _TIF_POLLING_NRFLAG);
>
> - set_tsk_need_resched(p);
> + } while (p != ACCESS_ONCE(rq->curr));
> + rcu_read_unlock();
>
> - cpu = task_cpu(p);
> + if (ipi)
> + smp_send_reschedule(cpu);
> +}
> +
> +void resched_cpu(int cpu)
> +{
> if (cpu == smp_processor_id()) {
> + set_tsk_need_resched(current);
> set_preempt_need_resched();
> return;
> }
>
> - /* NEED_RESCHED must be visible before we test polling */
> - smp_mb();
> - if (!tsk_is_polling(p))
> - smp_send_reschedule(cpu);
> + __resched_cpu(cpu);
> }
>
> -void resched_cpu(int cpu)
> +/*
> + * resched_task - mark a task 'to be rescheduled now'.
> + *
> + * On UP this means the setting of the need_resched flag, on SMP it
> + * might also involve a cross-CPU call to trigger the scheduler on
> + * the target CPU.
> + */
> +void resched_task(struct task_struct *p)
> {
> - struct rq *rq = cpu_rq(cpu);
> - unsigned long flags;
> + lockdep_assert_held(&task_rq(p)->lock);
>
> - if (!raw_spin_trylock_irqsave(&rq->lock, flags))
> + if (test_tsk_need_resched(p))
> return;
> - resched_task(cpu_curr(cpu));
> - raw_spin_unlock_irqrestore(&rq->lock, flags);
> +
> + resched_cpu(task_cpu(p));
> }
>
> #ifdef CONFIG_SMP
> @@ -1476,7 +1494,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
> }
>
> #ifdef CONFIG_SMP
> -static void sched_ttwu_pending(void)
> +void sched_ttwu_pending(void)
> {
> struct rq *rq = this_rq();
> struct llist_node *llist = llist_del_all(&rq->wake_list);
> @@ -2689,7 +2707,7 @@ static void __sched __schedule(void)
> update_rq_clock(rq);
>
> next = pick_next_task(rq, prev);
> - clear_tsk_need_resched(prev);
> + clear_tsk_need_resched(next);
> clear_preempt_need_resched();
> rq->skip_clock_update = 0;
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 14ca43430aee..03748737473e 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -11,6 +11,7 @@
> #include <asm/tlb.h>
>
> #include <trace/events/power.h>
> +#include "sched.h"
>
> static int __read_mostly cpu_idle_force_poll;
>
> @@ -71,6 +72,8 @@ static void cpu_idle_loop(void)
> while (1) {
> tick_nohz_idle_enter();
>
> + current_set_polling();
> +
> while (!need_resched()) {
> check_pgt_cache();
> rmb();
> @@ -93,29 +96,27 @@ static void cpu_idle_loop(void)
> if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> cpu_idle_poll();
> } else {
> - if (!current_clr_polling_and_test()) {
> - stop_critical_timings();
> - rcu_idle_enter();
> - if (cpuidle_idle_call())
> - arch_cpu_idle();
> - if (WARN_ON_ONCE(irqs_disabled()))
> - local_irq_enable();
> - rcu_idle_exit();
> - start_critical_timings();
> - } else {
> + stop_critical_timings();
> + rcu_idle_enter();
> + if (cpuidle_idle_call())
> + arch_cpu_idle();
> + if (WARN_ON_ONCE(irqs_disabled()))
> local_irq_enable();
> - }
> - __current_set_polling();
> + rcu_idle_exit();
> + start_critical_timings();
> }
> arch_cpu_idle_exit();
> - /*
> - * We need to test and propagate the TIF_NEED_RESCHED
> - * bit here because we might not have send the
> - * reschedule IPI to idle tasks.
> - */
> - if (tif_need_resched())
> - set_preempt_need_resched();
> }
> +
> + current_clr_polling();
> +
> + /*
> + * We need to test and propagate the TIF_NEED_RESCHED bit here
> + * because we might not have send the reschedule IPI to idle
> + * tasks.
> + */
> + set_preempt_need_resched();
> + sched_ttwu_pending();
> tick_nohz_idle_exit();
> schedule_preempt_disabled();
> }
> @@ -138,7 +139,6 @@ void cpu_startup_entry(enum cpuhp_state state)
> */
> boot_init_stack_canary();
> #endif
> - __current_set_polling();
> arch_cpu_idle_prepare();
> cpu_idle_loop();
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1bf34c257d3b..269b1a4e0bdf 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1160,6 +1160,7 @@ extern const struct sched_class idle_sched_class;
>
> #ifdef CONFIG_SMP
>
> +extern void sched_ttwu_pending(void);
> extern void update_group_power(struct sched_domain *sd, int cpu);
>
> extern void trigger_load_balance(struct rq *rq);
> @@ -1170,6 +1171,8 @@ extern void idle_exit_fair(struct rq *this_rq);
>
> #else /* CONFIG_SMP */
>
> +static inline void sched_ttwu_pending(void) { }
> +
> static inline void idle_balance(int cpu, struct rq *rq)
> {
> }
>
>
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-02-13 20:35:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC] sched: Add a new lockless wake-from-idle implementation

On Thu, Feb 13, 2014 at 12:16 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Feb 13, 2014 at 09:07:10AM -0800, Andy Lutomirski wrote:
>> > I also don't really like how the polling state is an atomic; its a cpu
>> > local property.
>>
>> Your patch also makes polling state be an atomic (albeit one that
>> isn't changed remotely).
>
> Yah, sorry for that, changed the email (and code about) a number of
> times before posting :/
>
>> On the subject of major surgery, though: there are very few places in
>> the kernel where TIF_NEED_RESCHED gets set. With something like my
>> patch applied, I think that there is no code at all that sets any
>> other task's TIF_NEED_RESCHED. That suggests that all
>> set_tsk_need_resched callers could just call into the scheduler
>> directly.
>
> One of the main callers would be the timer tick for local preemption;
> that's from interrupt context, can't call schedule() there, really needs
> to be the interrupt return path.
>
>> If so, the change could probably delete a whole lot of
>> assembly code, and every kernel exit would get faster.
>
> We already need to load that word for all kinds of other things; like
> delivering signals, so testing the one bit in the return path is
> basically free.
>
> Anyway; after all this mucking about I finally remembered Venki once
> attempted something similar:
>
> https://lkml.org/lkml/2012/2/6/357
>
> How about something like this?

This looks remarkably similar to my version, except that it's just for
mwait. Hmm.

My updated patch is currently completely broken. I'll fix it up and
send it out for comparison.

--Andy

>
> ---
> arch/x86/include/asm/mwait.h | 33 ++++++++++++++++--------
> arch/x86/kernel/process.c | 2 ++
> arch/x86/kernel/smp.c | 61 ++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 83 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index 1da25a5f96f9..cb7bb8bb6617 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -1,6 +1,7 @@
> #ifndef _ASM_X86_MWAIT_H
> #define _ASM_X86_MWAIT_H
>
> +#include <linux/percpu.h>
> #include <linux/sched.h>
>
> #define MWAIT_SUBSTATE_MASK 0xf
> @@ -15,6 +16,14 @@
>
> #define MWAIT_ECX_INTERRUPT_BREAK 0x1
>
> +#define MWAIT_IPI_ENABLED 0x01
> +#define MWAIT_IPI_RESCHED 0x02
> +#define MWAIT_IPI_SINGLE 0x04
> +
> +extern void mwait_intercept_handler(void);
> +
> +DECLARE_PER_CPU_ALIGNED(unsigned int, mwait_ipi);
> +
> static inline void __monitor(const void *eax, unsigned long ecx,
> unsigned long edx)
> {
> @@ -42,18 +51,20 @@ static inline void __mwait(unsigned long eax, unsigned long ecx)
> */
> static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
> {
> - if (!current_set_polling_and_test()) {
> - if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> - mb();
> - clflush((void *)&current_thread_info()->flags);
> - mb();
> - }
> -
> - __monitor((void *)&current_thread_info()->flags, 0, 0);
> - if (!need_resched())
> - __mwait(eax, ecx);
> + unsigned int *ptr = this_cpu_ptr(&mwait_ipi);
> + unsigned int old = xchg(ptr, MWAIT_IPI_ENABLED);
> +
> + WARN_ON_ONCE(old);
> +
> + if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> + mb();
> + clflush((void *)ptr);
> + mb();
> }
> - current_clr_polling();
> +
> + __monitor((void *)ptr, 0, 0);
> + if (!(*ptr & ~MWAIT_IPI_ENABLED))
> + __mwait(eax, ecx);
> }
>
> #endif /* _ASM_X86_MWAIT_H */
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 4505e2a950d8..00afb2b676b8 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -28,6 +28,7 @@
> #include <asm/fpu-internal.h>
> #include <asm/debugreg.h>
> #include <asm/nmi.h>
> +#include <asm/mwait.h>
>
> /*
> * per-CPU TSS segments. Threads are completely 'soft' on Linux,
> @@ -286,6 +287,7 @@ void arch_cpu_idle_enter(void)
> void arch_cpu_idle_exit(void)
> {
> __exit_idle();
> + mwait_intercept_handler();
> }
>
> void arch_cpu_idle_dead(void)
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 7c3a5a61f2e4..4b078a8d6b83 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -23,6 +23,8 @@
> #include <linux/interrupt.h>
> #include <linux/cpu.h>
> #include <linux/gfp.h>
> +#include <linux/sched.h>
> +#include <linux/smp.h>
>
> #include <asm/mtrr.h>
> #include <asm/tlbflush.h>
> @@ -31,6 +33,8 @@
> #include <asm/apic.h>
> #include <asm/nmi.h>
> #include <asm/trace/irq_vectors.h>
> +#include <asm/mwait.h>
> +
> /*
> * Some notes on x86 processor bugs affecting SMP operation:
> *
> @@ -113,6 +117,56 @@
> static atomic_t stopping_cpu = ATOMIC_INIT(-1);
> static bool smp_no_nmi_ipi = false;
>
> +DEFINE_PER_CPU_ALIGNED(unsigned int, mwait_ipi);
> +EXPORT_PER_CPU_SYMBOL_GPL(mwait_ipi);
> +
> +static bool mwait_intercept(int cpu, int ipi)
> +{
> + u32 *ptr = &per_cpu(mwait_ipi, cpu);
> + u32 val, new, old;
> +
> + if (!static_cpu_has(X86_FEATURE_MWAIT))
> + return false;
> +
> + val = *ptr;
> + if (!(val & MWAIT_IPI_ENABLED))
> + return false;
> +
> + for (;;) {
> + new = val | ipi;
> + old = cmpxchg(ptr, val, new);
> + if (old == val)
> + break;
> + val = old;
> + }
> +
> + if (!(old & MWAIT_IPI_ENABLED))
> + return false;
> +
> + return true;
> +}
> +
> +void mwait_intercept_handler(void)
> +{
> + unsigned int val, *ptr;
> +
> + if (!static_cpu_has(X86_FEATURE_MWAIT))
> + return;
> +
> + ptr = this_cpu_ptr(&mwait_ipi);
> + val = xchg(ptr, 0);
> +
> + if (!(val & ~MWAIT_IPI_ENABLED))
> + return;
> +
> + local_irq_disable();
> + if (val & MWAIT_IPI_RESCHED)
> + scheduler_ipi();
> + if (val & MWAIT_IPI_SINGLE)
> + generic_smp_call_function_single_interrupt();
> + local_irq_enable();
> +}
> +
> /*
> * this function sends a 'reschedule' IPI to another CPU.
> * it goes straight through and wastes no time serializing
> @@ -124,12 +178,15 @@ static void native_smp_send_reschedule(int cpu)
> WARN_ON(1);
> return;
> }
> - apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
> +
> + if (!mwait_intercept(cpu, MWAIT_IPI_RESCHED))
> + apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
> }
>
> void native_send_call_func_single_ipi(int cpu)
> {
> - apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
> + if (!mwait_intercept(cpu, MWAIT_IPI_SINGLE))
> + apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
> }
>
> void native_send_call_func_ipi(const struct cpumask *mask)



--
Andy Lutomirski
AMA Capital Management, LLC

2014-02-13 20:49:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] sched: Add a new lockless wake-from-idle implementation

On Thu, Feb 13, 2014 at 09:07:10AM -0800, Andy Lutomirski wrote:
> > I also don't really like how the polling state is an atomic; its a cpu
> > local property.
>
> Your patch also makes polling state be an atomic (albeit one that
> isn't changed remotely).

Yah, sorry for that, changed the email (and code about) a number of
times before posting :/

> On the subject of major surgery, though: there are very few places in
> the kernel where TIF_NEED_RESCHED gets set. With something like my
> patch applied, I think that there is no code at all that sets any
> other task's TIF_NEED_RESCHED. That suggests that all
> set_tsk_need_resched callers could just call into the scheduler
> directly.

One of the main callers would be the timer tick for local preemption;
that's from interrupt context, can't call schedule() there, really needs
to be the interrupt return path.

> If so, the change could probably delete a whole lot of
> assembly code, and every kernel exit would get faster.

We already need to load that word for all kinds of other things; like
delivering signals, so testing the one bit in the return path is
basically free.

Anyway; after all this mucking about I finally remembered Venki once
attempted something similar:

https://lkml.org/lkml/2012/2/6/357

How about something like this?

---
arch/x86/include/asm/mwait.h | 33 ++++++++++++++++--------
arch/x86/kernel/process.c | 2 ++
arch/x86/kernel/smp.c | 61 ++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 1da25a5f96f9..cb7bb8bb6617 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -1,6 +1,7 @@
#ifndef _ASM_X86_MWAIT_H
#define _ASM_X86_MWAIT_H

+#include <linux/percpu.h>
#include <linux/sched.h>

#define MWAIT_SUBSTATE_MASK 0xf
@@ -15,6 +16,14 @@

#define MWAIT_ECX_INTERRUPT_BREAK 0x1

+#define MWAIT_IPI_ENABLED 0x01
+#define MWAIT_IPI_RESCHED 0x02
+#define MWAIT_IPI_SINGLE 0x04
+
+extern void mwait_intercept_handler(void);
+
+DECLARE_PER_CPU_ALIGNED(unsigned int, mwait_ipi);
+
static inline void __monitor(const void *eax, unsigned long ecx,
unsigned long edx)
{
@@ -42,18 +51,20 @@ static inline void __mwait(unsigned long eax, unsigned long ecx)
*/
static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
{
- if (!current_set_polling_and_test()) {
- if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
- mb();
- clflush((void *)&current_thread_info()->flags);
- mb();
- }
-
- __monitor((void *)&current_thread_info()->flags, 0, 0);
- if (!need_resched())
- __mwait(eax, ecx);
+ unsigned int *ptr = this_cpu_ptr(&mwait_ipi);
+ unsigned int old = xchg(ptr, MWAIT_IPI_ENABLED);
+
+ WARN_ON_ONCE(old);
+
+ if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
+ mb();
+ clflush((void *)ptr);
+ mb();
}
- current_clr_polling();
+
+ __monitor((void *)ptr, 0, 0);
+ if (!(*ptr & ~MWAIT_IPI_ENABLED))
+ __mwait(eax, ecx);
}

#endif /* _ASM_X86_MWAIT_H */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 4505e2a950d8..00afb2b676b8 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -28,6 +28,7 @@
#include <asm/fpu-internal.h>
#include <asm/debugreg.h>
#include <asm/nmi.h>
+#include <asm/mwait.h>

/*
* per-CPU TSS segments. Threads are completely 'soft' on Linux,
@@ -286,6 +287,7 @@ void arch_cpu_idle_enter(void)
void arch_cpu_idle_exit(void)
{
__exit_idle();
+ mwait_intercept_handler();
}

void arch_cpu_idle_dead(void)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 7c3a5a61f2e4..4b078a8d6b83 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,8 @@
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/gfp.h>
+#include <linux/sched.h>
+#include <linux/smp.h>

#include <asm/mtrr.h>
#include <asm/tlbflush.h>
@@ -31,6 +33,8 @@
#include <asm/apic.h>
#include <asm/nmi.h>
#include <asm/trace/irq_vectors.h>
+#include <asm/mwait.h>
+
/*
* Some notes on x86 processor bugs affecting SMP operation:
*
@@ -113,6 +117,56 @@
static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;

+DEFINE_PER_CPU_ALIGNED(unsigned int, mwait_ipi);
+EXPORT_PER_CPU_SYMBOL_GPL(mwait_ipi);
+
+static bool mwait_intercept(int cpu, int ipi)
+{
+ u32 *ptr = &per_cpu(mwait_ipi, cpu);
+ u32 val, new, old;
+
+ if (!static_cpu_has(X86_FEATURE_MWAIT))
+ return false;
+
+ val = *ptr;
+ if (!(val & MWAIT_IPI_ENABLED))
+ return false;
+
+ for (;;) {
+ new = val | ipi;
+ old = cmpxchg(ptr, val, new);
+ if (old == val)
+ break;
+ val = old;
+ }
+
+ if (!(old & MWAIT_IPI_ENABLED))
+ return false;
+
+ return true;
+}
+
+void mwait_intercept_handler(void)
+{
+ unsigned int val, *ptr;
+
+ if (!static_cpu_has(X86_FEATURE_MWAIT))
+ return;
+
+ ptr = this_cpu_ptr(&mwait_ipi);
+ val = xchg(ptr, 0);
+
+ if (!(val & ~MWAIT_IPI_ENABLED))
+ return;
+
+ local_irq_disable();
+ if (val & MWAIT_IPI_RESCHED)
+ scheduler_ipi();
+ if (val & MWAIT_IPI_SINGLE)
+ generic_smp_call_function_single_interrupt();
+ local_irq_enable();
+}
+
/*
* this function sends a 'reschedule' IPI to another CPU.
* it goes straight through and wastes no time serializing
@@ -124,12 +178,15 @@ static void native_smp_send_reschedule(int cpu)
WARN_ON(1);
return;
}
- apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
+
+ if (!mwait_intercept(cpu, MWAIT_IPI_RESCHED))
+ apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
}

void native_send_call_func_single_ipi(int cpu)
{
- apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
+ if (!mwait_intercept(cpu, MWAIT_IPI_SINGLE))
+ apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
}

void native_send_call_func_ipi(const struct cpumask *mask)

2014-02-14 01:38:58

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC] sched: Add a new lockless wake-from-idle implementation

On Thu, Feb 13, 2014 at 11:58 AM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Feb 13, 2014 at 6:50 AM, Peter Zijlstra <[email protected]> wrote:
>> On Wed, Feb 12, 2014 at 05:40:12PM -0800, Andy Lutomirski wrote:
>>> This is a strawman proposal to simplify the idle implementation, eliminate
>>> a race
>>>
>>> Benefits over current code:
>>> - ttwu_queue_remote doesn't use an IPI unless needed
>>> - The diffstat should speak for itself :)
>>> - Less racy. Spurious IPIs are possible, but only in narrow windows or
>>> when two wakeups occur in rapid succession.
>>> - Seems to work (?)
>>>
>>> Issues:
>>> - Am I doing the percpu stuff right?
>>> - Needs work on non-x86 architectures
>>> - The !CONFIG_SMP case needs to be checked
>>> - Is "idlepoll" a good name for the new code? It doesn't have *that*
>>> much to do with the idle state. Maybe cpukick?
>>>
>>> If this turns out okay, TIF_NEED_RESCHED could possibly be deleted as well.
>>
>> No, we can't do away with that; its used in some fairly critical paths
>> (return to userspace) and adding a second cacheline load there would be
>> unfortunate.
>>
>> I also don't really like how the polling state is an atomic; its a cpu
>> local property.
>>
>> Now given we can't get rid of TIF_NEED_RESCHED, and we need an atomic op
>> on a remote cacheline anyhow; the simplest solution would be to convert
>> all TS_POLLING users to TIF_POLLING_NRFLAG and use an atomic_or_return()
>> like construct to do:
>>
>> atomic_or_return(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG
>>
>> and avoid the IPI if that is false.
>>
>> Something a little like this; it does require a lot of auditing; but it
>> boots on my x86_64.
>>
>> ---
>> arch/x86/include/asm/mwait.h | 19 ++++----
>> arch/x86/include/asm/thread_info.h | 4 +-
>> arch/x86/kernel/process.c | 8 ++--
>> include/linux/sched.h | 90 +++-----------------------------------
>> kernel/sched/core.c | 70 ++++++++++++++++++-----------
>> kernel/sched/idle.c | 40 ++++++++---------
>> kernel/sched/sched.h | 3 ++
>> 7 files changed, 88 insertions(+), 146 deletions(-)
>
> This patch changes the default from !polling to polling, right? If
> so, doesn't it break every driver that fails to *clear* the polling
> flag? (My patch is about to have the same problem.)

I'm playing with making polling be the default and I'm seeing all
kinds of erratic behavior. Not sure what's going on. Adding printks
just causes *different* weird behavior. Sigh.

So no updated patch from me today.

--Andy

2014-02-14 20:01:41

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC] sched: Add a new lockless wake-from-idle implementation

On Thu, Feb 13, 2014 at 6:50 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Feb 12, 2014 at 05:40:12PM -0800, Andy Lutomirski wrote:
>> This is a strawman proposal to simplify the idle implementation, eliminate
>> a race
>>
>> Benefits over current code:
>> - ttwu_queue_remote doesn't use an IPI unless needed
>> - The diffstat should speak for itself :)
>> - Less racy. Spurious IPIs are possible, but only in narrow windows or
>> when two wakeups occur in rapid succession.
>> - Seems to work (?)
>>
>> Issues:
>> - Am I doing the percpu stuff right?
>> - Needs work on non-x86 architectures
>> - The !CONFIG_SMP case needs to be checked
>> - Is "idlepoll" a good name for the new code? It doesn't have *that*
>> much to do with the idle state. Maybe cpukick?
>>
>> If this turns out okay, TIF_NEED_RESCHED could possibly be deleted as well.
>
> No, we can't do away with that; its used in some fairly critical paths
> (return to userspace) and adding a second cacheline load there would be
> unfortunate.
>
> I also don't really like how the polling state is an atomic; its a cpu
> local property.
>
> Now given we can't get rid of TIF_NEED_RESCHED, and we need an atomic op
> on a remote cacheline anyhow; the simplest solution would be to convert
> all TS_POLLING users to TIF_POLLING_NRFLAG and use an atomic_or_return()
> like construct to do:
>
> atomic_or_return(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG
>
> and avoid the IPI if that is false.
>
> Something a little like this; it does require a lot of auditing; but it
> boots on my x86_64.

On further consideration, I think I like this approach. It's a
simpler change than mine and it appears to work (unlike mine, and I
still haven't figured out what I'm doing wrong). If anyone wants to
get rid of the cmpxchg loop, a trick like would likely work well built
on top of this.

That being said, I can't really test this, because I can't seem to
boot any recent 3.14-based kernel on real hardware, and they die
before they produce any output. They work fine in QEMU.

--Andy

>
> ---
> arch/x86/include/asm/mwait.h | 19 ++++----
> arch/x86/include/asm/thread_info.h | 4 +-
> arch/x86/kernel/process.c | 8 ++--
> include/linux/sched.h | 90 +++-----------------------------------
> kernel/sched/core.c | 70 ++++++++++++++++++-----------
> kernel/sched/idle.c | 40 ++++++++---------
> kernel/sched/sched.h | 3 ++
> 7 files changed, 88 insertions(+), 146 deletions(-)
>
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index 1da25a5f96f9..d6a7b7cdc478 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -42,18 +42,15 @@ static inline void __mwait(unsigned long eax, unsigned long ecx)
> */
> static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
> {
> - if (!current_set_polling_and_test()) {
> - if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> - mb();
> - clflush((void *)&current_thread_info()->flags);
> - mb();
> - }
> -
> - __monitor((void *)&current_thread_info()->flags, 0, 0);
> - if (!need_resched())
> - __mwait(eax, ecx);
> + if (static_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> + mb();
> + clflush((void *)&current_thread_info()->flags);
> + mb();
> }
> - current_clr_polling();
> +
> + __monitor((void *)&current_thread_info()->flags, 0, 0);
> + if (!need_resched())
> + __mwait(eax, ecx);
> }
>
> #endif /* _ASM_X86_MWAIT_H */
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index e1940c06ed02..1f2fe575cfca 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -88,6 +88,7 @@ struct thread_info {
> #define TIF_FORK 18 /* ret_from_fork */
> #define TIF_NOHZ 19 /* in adaptive nohz mode */
> #define TIF_MEMDIE 20 /* is terminating due to OOM killer */
> +#define TIF_POLLING_NRFLAG 21
> #define TIF_IO_BITMAP 22 /* uses I/O bitmap */
> #define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
> #define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
> @@ -111,6 +112,7 @@ struct thread_info {
> #define _TIF_IA32 (1 << TIF_IA32)
> #define _TIF_FORK (1 << TIF_FORK)
> #define _TIF_NOHZ (1 << TIF_NOHZ)
> +#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
> #define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
> #define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
> #define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
> @@ -234,8 +236,6 @@ static inline struct thread_info *current_thread_info(void)
> * have to worry about atomic accesses.
> */
> #define TS_COMPAT 0x0002 /* 32bit syscall active (64BIT)*/
> -#define TS_POLLING 0x0004 /* idle task polling need_resched,
> - skip sending interrupt */
> #define TS_RESTORE_SIGMASK 0x0008 /* restore signal mask in do_signal() */
>
> #ifndef __ASSEMBLY__
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 4505e2a950d8..298c002629c6 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -306,9 +306,11 @@ void arch_cpu_idle(void)
> */
> void default_idle(void)
> {
> - trace_cpu_idle_rcuidle(1, smp_processor_id());
> - safe_halt();
> - trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
> + if (!current_clr_polling_and_test()) {
> + trace_cpu_idle_rcuidle(1, smp_processor_id());
> + safe_halt();
> + trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
> + }
> }
> #ifdef CONFIG_APM_MODULE
> EXPORT_SYMBOL(default_idle);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c49a2585ff7d..b326abe95978 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2692,85 +2692,26 @@ static inline int spin_needbreak(spinlock_t *lock)
> * thread_info.status and one based on TIF_POLLING_NRFLAG in
> * thread_info.flags
> */
> -#ifdef TS_POLLING
> -static inline int tsk_is_polling(struct task_struct *p)
> -{
> - return task_thread_info(p)->status & TS_POLLING;
> -}
> -static inline void __current_set_polling(void)
> -{
> - current_thread_info()->status |= TS_POLLING;
> -}
> -
> -static inline bool __must_check current_set_polling_and_test(void)
> -{
> - __current_set_polling();
> -
> - /*
> - * Polling state must be visible before we test NEED_RESCHED,
> - * paired by resched_task()
> - */
> - smp_mb();
> -
> - return unlikely(tif_need_resched());
> -}
> -
> -static inline void __current_clr_polling(void)
> -{
> - current_thread_info()->status &= ~TS_POLLING;
> -}
> -
> -static inline bool __must_check current_clr_polling_and_test(void)
> -{
> - __current_clr_polling();
> -
> - /*
> - * Polling state must be visible before we test NEED_RESCHED,
> - * paired by resched_task()
> - */
> - smp_mb();
> -
> - return unlikely(tif_need_resched());
> -}
> -#elif defined(TIF_POLLING_NRFLAG)
> +#ifdef CONFIG_SMP
> static inline int tsk_is_polling(struct task_struct *p)
> {
> return test_tsk_thread_flag(p, TIF_POLLING_NRFLAG);
> }
>
> -static inline void __current_set_polling(void)
> +static inline void current_set_polling(void)
> {
> set_thread_flag(TIF_POLLING_NRFLAG);
> }
>
> -static inline bool __must_check current_set_polling_and_test(void)
> -{
> - __current_set_polling();
> -
> - /*
> - * Polling state must be visible before we test NEED_RESCHED,
> - * paired by resched_task()
> - *
> - * XXX: assumes set/clear bit are identical barrier wise.
> - */
> - smp_mb__after_clear_bit();
> -
> - return unlikely(tif_need_resched());
> -}
> -
> -static inline void __current_clr_polling(void)
> +static inline void current_clr_polling(void)
> {
> clear_thread_flag(TIF_POLLING_NRFLAG);
> }
>
> static inline bool __must_check current_clr_polling_and_test(void)
> {
> - __current_clr_polling();
> + current_clr_polling();
>
> - /*
> - * Polling state must be visible before we test NEED_RESCHED,
> - * paired by resched_task()
> - */
> smp_mb__after_clear_bit();
>
> return unlikely(tif_need_resched());
> @@ -2778,34 +2719,15 @@ static inline bool __must_check current_clr_polling_and_test(void)
>
> #else
> static inline int tsk_is_polling(struct task_struct *p) { return 0; }
> -static inline void __current_set_polling(void) { }
> -static inline void __current_clr_polling(void) { }
> +static inline void current_set_polling(void) { }
> +static inline void current_clr_polling(void) { }
>
> -static inline bool __must_check current_set_polling_and_test(void)
> -{
> - return unlikely(tif_need_resched());
> -}
> static inline bool __must_check current_clr_polling_and_test(void)
> {
> return unlikely(tif_need_resched());
> }
> #endif
>
> -static inline void current_clr_polling(void)
> -{
> - __current_clr_polling();
> -
> - /*
> - * Ensure we check TIF_NEED_RESCHED after we clear the polling bit.
> - * Once the bit is cleared, we'll get IPIs with every new
> - * TIF_NEED_RESCHED and the IPI handler, scheduler_ipi(), will also
> - * fold.
> - */
> - smp_mb(); /* paired with resched_task() */
> -
> - preempt_fold_need_resched();
> -}
> -
> static __always_inline bool need_resched(void)
> {
> return unlikely(tif_need_resched());
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fb9764fbc537..19de9a1cc96c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -504,45 +504,63 @@ static inline void init_hrtick(void)
> }
> #endif /* CONFIG_SCHED_HRTICK */
>
> -/*
> - * resched_task - mark a task 'to be rescheduled now'.
> - *
> - * On UP this means the setting of the need_resched flag, on SMP it
> - * might also involve a cross-CPU call to trigger the scheduler on
> - * the target CPU.
> - */
> -void resched_task(struct task_struct *p)
> +static void __resched_cpu(int cpu)
> {
> - int cpu;
> + struct rq *rq = cpu_rq(cpu);
> + struct task_struct *p;
> + struct thread_info *ti;
> + unsigned long val, old, new;
> + bool ipi;
>
> - lockdep_assert_held(&task_rq(p)->lock);
> + rcu_read_lock();
> + do {
> + p = ACCESS_ONCE(rq->curr);
> + ti = task_thread_info(p);
> +
> + val = ti->flags;
> + for (;;) {
> + new = val | _TIF_NEED_RESCHED;
> + old = cmpxchg(&ti->flags, val, new);
> + if (old == val)
> + break;
> + val = old;
> + }
>
> - if (test_tsk_need_resched(p))
> - return;
> + ipi = !(old & _TIF_POLLING_NRFLAG);
>
> - set_tsk_need_resched(p);
> + } while (p != ACCESS_ONCE(rq->curr));
> + rcu_read_unlock();
>
> - cpu = task_cpu(p);
> + if (ipi)
> + smp_send_reschedule(cpu);
> +}
> +
> +void resched_cpu(int cpu)
> +{
> if (cpu == smp_processor_id()) {
> + set_tsk_need_resched(current);
> set_preempt_need_resched();
> return;
> }
>
> - /* NEED_RESCHED must be visible before we test polling */
> - smp_mb();
> - if (!tsk_is_polling(p))
> - smp_send_reschedule(cpu);
> + __resched_cpu(cpu);
> }
>
> -void resched_cpu(int cpu)
> +/*
> + * resched_task - mark a task 'to be rescheduled now'.
> + *
> + * On UP this means the setting of the need_resched flag, on SMP it
> + * might also involve a cross-CPU call to trigger the scheduler on
> + * the target CPU.
> + */
> +void resched_task(struct task_struct *p)
> {
> - struct rq *rq = cpu_rq(cpu);
> - unsigned long flags;
> + lockdep_assert_held(&task_rq(p)->lock);
>
> - if (!raw_spin_trylock_irqsave(&rq->lock, flags))
> + if (test_tsk_need_resched(p))
> return;
> - resched_task(cpu_curr(cpu));
> - raw_spin_unlock_irqrestore(&rq->lock, flags);
> +
> + resched_cpu(task_cpu(p));
> }
>
> #ifdef CONFIG_SMP
> @@ -1476,7 +1494,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
> }
>
> #ifdef CONFIG_SMP
> -static void sched_ttwu_pending(void)
> +void sched_ttwu_pending(void)
> {
> struct rq *rq = this_rq();
> struct llist_node *llist = llist_del_all(&rq->wake_list);
> @@ -2689,7 +2707,7 @@ static void __sched __schedule(void)
> update_rq_clock(rq);
>
> next = pick_next_task(rq, prev);
> - clear_tsk_need_resched(prev);
> + clear_tsk_need_resched(next);
> clear_preempt_need_resched();
> rq->skip_clock_update = 0;
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 14ca43430aee..03748737473e 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -11,6 +11,7 @@
> #include <asm/tlb.h>
>
> #include <trace/events/power.h>
> +#include "sched.h"
>
> static int __read_mostly cpu_idle_force_poll;
>
> @@ -71,6 +72,8 @@ static void cpu_idle_loop(void)
> while (1) {
> tick_nohz_idle_enter();
>
> + current_set_polling();
> +
> while (!need_resched()) {
> check_pgt_cache();
> rmb();
> @@ -93,29 +96,27 @@ static void cpu_idle_loop(void)
> if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> cpu_idle_poll();
> } else {
> - if (!current_clr_polling_and_test()) {
> - stop_critical_timings();
> - rcu_idle_enter();
> - if (cpuidle_idle_call())
> - arch_cpu_idle();
> - if (WARN_ON_ONCE(irqs_disabled()))
> - local_irq_enable();
> - rcu_idle_exit();
> - start_critical_timings();
> - } else {
> + stop_critical_timings();
> + rcu_idle_enter();
> + if (cpuidle_idle_call())
> + arch_cpu_idle();
> + if (WARN_ON_ONCE(irqs_disabled()))
> local_irq_enable();
> - }
> - __current_set_polling();
> + rcu_idle_exit();
> + start_critical_timings();
> }
> arch_cpu_idle_exit();
> - /*
> - * We need to test and propagate the TIF_NEED_RESCHED
> - * bit here because we might not have send the
> - * reschedule IPI to idle tasks.
> - */
> - if (tif_need_resched())
> - set_preempt_need_resched();
> }
> +
> + current_clr_polling();
> +
> + /*
> + * We need to test and propagate the TIF_NEED_RESCHED bit here
> + * because we might not have send the reschedule IPI to idle
> + * tasks.
> + */
> + set_preempt_need_resched();
> + sched_ttwu_pending();
> tick_nohz_idle_exit();
> schedule_preempt_disabled();
> }
> @@ -138,7 +139,6 @@ void cpu_startup_entry(enum cpuhp_state state)
> */
> boot_init_stack_canary();
> #endif
> - __current_set_polling();
> arch_cpu_idle_prepare();
> cpu_idle_loop();
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1bf34c257d3b..269b1a4e0bdf 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1160,6 +1160,7 @@ extern const struct sched_class idle_sched_class;
>
> #ifdef CONFIG_SMP
>
> +extern void sched_ttwu_pending(void);
> extern void update_group_power(struct sched_domain *sd, int cpu);
>
> extern void trigger_load_balance(struct rq *rq);
> @@ -1170,6 +1171,8 @@ extern void idle_exit_fair(struct rq *this_rq);
>
> #else /* CONFIG_SMP */
>
> +static inline void sched_ttwu_pending(void) { }
> +
> static inline void idle_balance(int cpu, struct rq *rq)
> {
> }
>
>
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-02-14 20:17:53

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC] sched: Add a new lockless wake-from-idle implementation

On Fri, Feb 14, 2014 at 12:01 PM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Feb 13, 2014 at 6:50 AM, Peter Zijlstra <[email protected]> wrote:
>> On Wed, Feb 12, 2014 at 05:40:12PM -0800, Andy Lutomirski wrote:
>>> This is a strawman proposal to simplify the idle implementation, eliminate
>>> a race
>>>
>>> Benefits over current code:
>>> - ttwu_queue_remote doesn't use an IPI unless needed
>>> - The diffstat should speak for itself :)
>>> - Less racy. Spurious IPIs are possible, but only in narrow windows or
>>> when two wakeups occur in rapid succession.
>>> - Seems to work (?)
>>>
>>> Issues:
>>> - Am I doing the percpu stuff right?
>>> - Needs work on non-x86 architectures
>>> - The !CONFIG_SMP case needs to be checked
>>> - Is "idlepoll" a good name for the new code? It doesn't have *that*
>>> much to do with the idle state. Maybe cpukick?
>>>
>>> If this turns out okay, TIF_NEED_RESCHED could possibly be deleted as well.
>>
>> No, we can't do away with that; its used in some fairly critical paths
>> (return to userspace) and adding a second cacheline load there would be
>> unfortunate.
>>
>> I also don't really like how the polling state is an atomic; its a cpu
>> local property.
>>
>> Now given we can't get rid of TIF_NEED_RESCHED, and we need an atomic op
>> on a remote cacheline anyhow; the simplest solution would be to convert
>> all TS_POLLING users to TIF_POLLING_NRFLAG and use an atomic_or_return()
>> like construct to do:
>>
>> atomic_or_return(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG
>>
>> and avoid the IPI if that is false.
>>
>> Something a little like this; it does require a lot of auditing; but it
>> boots on my x86_64.
>
> On further consideration, I think I like this approach. It's a
> simpler change than mine and it appears to work (unlike mine, and I
> still haven't figured out what I'm doing wrong). If anyone wants to
> get rid of the cmpxchg loop, a trick like would likely work well built
> on top of this.
>
> That being said, I can't really test this, because I can't seem to
> boot any recent 3.14-based kernel on real hardware, and they die
> before they produce any output. They work fine in QEMU.

Either you have a bug or I rebased it wrong. With the attached
rebased version, I hit WARN_ON_ONCE(irqs_disabled()) in
cpu_idle_loop() on a semi-regular basis when I boot with:

virtme-runkernel arch/x86/boot/bzImage -smp 2 -cpu host

It also sometimes hangs after starting a shell.

(shameless plug: this is virtme from
https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git )

--Andy


Attachments:
peterz2.patch (11.24 kB)

2014-02-14 21:19:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] sched: Add a new lockless wake-from-idle implementation

On Fri, Feb 14, 2014 at 12:17:27PM -0800, Andy Lutomirski wrote:
> Either you have a bug or I rebased it wrong. With the attached
> rebased version, I hit WARN_ON_ONCE(irqs_disabled()) in
> cpu_idle_loop() on a semi-regular basis when I boot with:

Its got a bug; I got it too, but it did boot and build the next kernel.

I didn't get any time today to look at things; I'll try and prod at
things a bit more next week.