2013-09-10 13:23:14

by Peter Zijlstra

Subject: [PATCH 2/7] sched: Add NEED_RESCHED to the preempt_count

In order to combine the preemption and need_resched test we need to
fold the need_resched information into the preempt_count value.

We keep the existing TIF_NEED_RESCHED infrastructure in place but at 3
sites test it and fold its value into preempt_count; namely:

- resched_task() when setting TIF_NEED_RESCHED on the current task
- scheduler_ipi(); when resched_task() sets TIF_NEED_RESCHED on a
  remote task it follows it up with a reschedule IPI,
  and we can modify the cpu-local preempt_count from there.
- cpu_idle_loop() for when resched_task() found tsk_is_polling().

We use an inverted bitmask to indicate need_resched so that a 0 means
both need_resched and !atomic.
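
To illustrate, a minimal user-space sketch of the convention (plain C;
the names and values are illustrative, not the kernel code): a single
decrement of the combined word hitting 0 signals both conditions at
once.

#include <stdio.h>
#include <stdbool.h>

#define PREEMPT_NEED_RESCHED	0x80000000u	/* MSB, stored inverted */

/* A fresh count has the bit set, meaning "no resched needed". */
static unsigned int count = PREEMPT_NEED_RESCHED;

static void set_need_resched(void)   { count &= ~PREEMPT_NEED_RESCHED; }
static void clear_need_resched(void) { count |=  PREEMPT_NEED_RESCHED; }

static void preempt_disable(void)    { count++; }

static bool preempt_enable(void)
{
	/*
	 * One decrement; the word reaches 0 only when the preempt count
	 * part is 0 *and* the inverted NEED_RESCHED bit is clear, i.e.
	 * we both may and must reschedule.
	 */
	return --count == 0;
}

int main(void)
{
	preempt_disable();
	set_need_resched();
	printf("%d\n", preempt_enable());	/* 1: reschedule */

	preempt_disable();
	clear_need_resched();
	printf("%d\n", preempt_enable());	/* 0: no resched needed */
	return 0;
}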

Also remove the barrier() in preempt_enable() between
preempt_enable_no_resched() and preempt_check_resched() to avoid
having to reload the preemption value and allow the compiler to use
the flags of the previous decrement. I couldn't come up with any sane
reason for this barrier() to be there as preempt_enable_no_resched()
already has a barrier() before doing the decrement.

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/preempt.h | 42 +++++++++++++++++++++++++++++++++++++-----
include/linux/sched.h | 2 +-
include/linux/thread_info.h | 1 +
kernel/context_tracking.c | 2 +-
kernel/cpu/idle.c | 10 ++++++++++
kernel/sched/core.c | 18 ++++++++++++++----
6 files changed, 64 insertions(+), 11 deletions(-)

--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -10,9 +10,19 @@
#include <linux/linkage.h>
#include <linux/list.h>

+/*
+ * We use the MSB mostly because it's available; see <linux/hardirq.h> for
+ * the other bits.
+ */
+#define PREEMPT_NEED_RESCHED 0x80000000
+
+/*
+ * We mask the PREEMPT_NEED_RESCHED bit so as not to confuse all current users
+ * that think a non-zero value indicates we cannot preempt.
+ */
static __always_inline int preempt_count(void)
{
- return current_thread_info()->preempt_count;
+ return current_thread_info()->preempt_count & ~PREEMPT_NEED_RESCHED;
}

static __always_inline int *preempt_count_ptr(void)
@@ -20,6 +30,30 @@ static __always_inline int *preempt_coun
return &current_thread_info()->preempt_count;
}

+/*
+ * We fold the NEED_RESCHED bit into the preempt count such that
+ * preempt_enable() can decrement and test for needing to reschedule with a
+ * single instruction.
+ *
+ * We invert the actual bit, so that when the decrement hits 0 we know we both
+ * need to resched (the bit is cleared) and can resched (no preempt count).
+ */
+
+static __always_inline void set_preempt_need_resched(void)
+{
+ *preempt_count_ptr() &= ~PREEMPT_NEED_RESCHED;
+}
+
+static __always_inline void clear_preempt_need_resched(void)
+{
+ *preempt_count_ptr() |= PREEMPT_NEED_RESCHED;
+}
+
+static __always_inline bool test_preempt_need_resched(void)
+{
+ return !(*preempt_count_ptr() & PREEMPT_NEED_RESCHED);
+}
+
#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER)
extern void add_preempt_count(int val);
extern void sub_preempt_count(int val);
@@ -37,7 +71,7 @@ asmlinkage void preempt_schedule(void);

#define preempt_check_resched() \
do { \
- if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
+ if (unlikely(!*preempt_count_ptr())) \
preempt_schedule(); \
} while (0)

@@ -47,7 +81,7 @@ void preempt_schedule_context(void);

#define preempt_check_resched_context() \
do { \
- if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
+ if (unlikely(!*preempt_count_ptr())) \
preempt_schedule_context(); \
} while (0)
#else
@@ -83,7 +117,6 @@ do { \
#define preempt_enable() \
do { \
preempt_enable_no_resched(); \
- barrier(); \
preempt_check_resched(); \
} while (0)

@@ -111,7 +144,6 @@ do { \
#define preempt_enable_notrace() \
do { \
preempt_enable_no_resched_notrace(); \
- barrier(); \
preempt_check_resched_context(); \
} while (0)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2405,7 +2405,7 @@ static inline int signal_pending_state(l

static inline int need_resched(void)
{
- return unlikely(test_thread_flag(TIF_NEED_RESCHED));
+ return unlikely(test_preempt_need_resched());
}

/*
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -104,6 +104,7 @@ static inline int test_ti_thread_flag(st
#define test_thread_flag(flag) \
test_ti_thread_flag(current_thread_info(), flag)

+#define test_need_resched() test_thread_flag(TIF_NEED_RESCHED)
#define set_need_resched() set_thread_flag(TIF_NEED_RESCHED)
#define clear_need_resched() clear_thread_flag(TIF_NEED_RESCHED)

--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -115,7 +115,7 @@ void __sched notrace preempt_schedule_co
{
enum ctx_state prev_ctx;

- if (likely(!preemptible()))
+ if (likely(preempt_count() || irqs_disabled()))
return;

/*
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -106,6 +106,13 @@ static void cpu_idle_loop(void)
current_set_polling();
}
arch_cpu_idle_exit();
+ /*
+ * We need to test and propagate the TIF_NEED_RESCHED
+ * bit here because we might not have sent the
+ * reschedule IPI to idle tasks.
+ */
+ if (test_need_resched())
+ set_preempt_need_resched();
}
tick_nohz_idle_exit();
schedule_preempt_disabled();
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -526,8 +526,10 @@ void resched_task(struct task_struct *p)
set_tsk_need_resched(p);

cpu = task_cpu(p);
- if (cpu == smp_processor_id())
+ if (cpu == smp_processor_id()) {
+ set_preempt_need_resched();
return;
+ }

/* NEED_RESCHED must be visible before we test polling */
smp_mb();
@@ -1411,6 +1413,14 @@ static void sched_ttwu_pending(void)

void scheduler_ipi(void)
{
+ /*
+ * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
+ * TIF_NEED_RESCHED remotely (for the first time) will also send
+ * this IPI.
+ */
+ if (test_need_resched())
+ set_preempt_need_resched();
+
if (llist_empty(&this_rq()->wake_list)
&& !tick_nohz_full_cpu(smp_processor_id())
&& !got_nohz_idle_kick())
@@ -2445,6 +2455,7 @@ static void __sched __schedule(void)
put_prev_task(rq, prev);
next = pick_next_task(rq);
clear_tsk_need_resched(prev);
+ clear_preempt_need_resched();
rq->skip_clock_update = 0;

if (likely(prev != next)) {
@@ -2531,7 +2542,7 @@ asmlinkage void __sched notrace preempt_
* If there is a non-zero preempt_count or interrupts are disabled,
* we do not want to preempt the current task. Just return..
*/
- if (likely(!preemptible()))
+ if (likely(preempt_count() || irqs_disabled()))
return;

do {
@@ -2556,11 +2567,10 @@ EXPORT_SYMBOL(preempt_schedule);
*/
asmlinkage void __sched preempt_schedule_irq(void)
{
- struct thread_info *ti = current_thread_info();
enum ctx_state prev_state;

/* Catch callers which need to be fixed */
- BUG_ON(ti->preempt_count || !irqs_disabled());
+ BUG_ON(preempt_count() || !irqs_disabled());

prev_state = exception_enter();



2013-09-11 02:00:01

by Andy Lutomirski

Subject: Re: [PATCH 2/7] sched: Add NEED_RESCHED to the preempt_count

On 09/10/2013 06:08 AM, Peter Zijlstra wrote:
> In order to combine the preemption and need_resched test we need to
> fold the need_resched information into the preempt_count value.
>
> We keep the existing TIF_NEED_RESCHED infrastructure in place but at 3
> sites test it and fold its value into preempt_count; namely:
>
> - resched_task() when setting TIF_NEED_RESCHED on the current task
> - scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a
> remote task it follows it up with a reschedule IPI
> and we can modify the cpu local preempt_count from
> there.
> - cpu_idle_loop() for when resched_task() found tsk_is_polling().


It looks like the intel_idle code can get confused if TIF_NEED_RESCHED
is set but the preempt resched bit is not -- the need_resched call
between monitor and mwait won't notice TIF_NEED_RESCHED.

Is this condition possible?

--Andy

2013-09-11 08:25:49

by Peter Zijlstra

Subject: Re: [PATCH 2/7] sched: Add NEED_RESCHED to the preempt_count

On Tue, Sep 10, 2013 at 06:59:57PM -0700, Andy Lutomirski wrote:
> On 09/10/2013 06:08 AM, Peter Zijlstra wrote:
> > In order to combine the preemption and need_resched test we need to
> > fold the need_resched information into the preempt_count value.
> >
> > We keep the existing TIF_NEED_RESCHED infrastructure in place but at 3
> > sites test it and fold its value into preempt_count; namely:
> >
> > - resched_task() when setting TIF_NEED_RESCHED on the current task
> > - scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a
> > remote task it follows it up with a reschedule IPI
> > and we can modify the cpu local preempt_count from
> > there.
> > - cpu_idle_loop() for when resched_task() found tsk_is_polling().
>
>
> It looks like the intel_idle code can get confused if TIF_NEED_RESCHED
> is set but the preempt resched bit is not -- the need_resched call
> between monitor and mwait won't notice TIF_NEED_RESCHED.
>
> Is this condition possible?

Ah indeed, I'll have to go find all idle loops out there it seems. Now
if only they were easy to spot :/

I was hoping the generic idle thing was good enough, apparently not
quite. Thanks for spotting that.

2013-09-11 11:06:51

by Peter Zijlstra

Subject: Re: [PATCH 2/7] sched: Add NEED_RESCHED to the preempt_count

On Wed, Sep 11, 2013 at 10:25:30AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 10, 2013 at 06:59:57PM -0700, Andy Lutomirski wrote:

> > It looks like the intel_idle code can get confused if TIF_NEED_RESCHED
> > is set but the preempt resched bit is not -- the need_resched call
> > between monitor and mwait won't notice TIF_NEED_RESCHED.
> >
> > Is this condition possible?
>
> Ah indeed, I'll have to go find all idle loops out there it seems. Now
> if only they were easy to spot :/
>
> I was hoping the generic idle thing was good enough, apparently not
> quite. Thanks for spotting that.

OK, and the reason I didn't notice is that the entire TS_POLLING thing
is completely wrecked so we'll get the interrupt anyway.

The below might fix it.. it boots, but then it already did so that
doesn't say much.

Mike does it fix the funny you saw?

---
Subject: sched, idle: Fix the idle polling state logic
From: Peter Zijlstra <[email protected]>
Date: Wed Sep 11 12:43:13 CEST 2013

Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
regressed several workloads and caused excessive reschedule
interrupts.

The patch in question failed to notice that the x86 code had an
inverted sense of the polling state versus the new generic code (x86:
default polling, generic: default !polling).

Fix the two prominent x86 mwait based idle drivers and introduce a few
new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
usage).

Also switch the idle routines to test_need_resched(), which is an
immediate TIF_NEED_RESCHED test, as opposed to need_resched(), which
will end up being slightly different.
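
To make the ordering concrete, here is a minimal user-space sketch of
the polling handshake (C11 atomics stand in for smp_mb(); the names are
illustrative, not the kernel helpers): the waker publishes NEED_RESCHED
and only skips the IPI if it then sees the polling bit, while the idle
side publishes the polling bit and only then tests NEED_RESCHED, so
with both full barriers at least one side observes the other's store.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool polling;		/* idle CPU advertises it is polling */
static atomic_bool need_resched;	/* waker requests a reschedule */
static atomic_int  ipis_sent;

/* Idle side: rough analogue of current_set_polling_and_test(). */
static bool set_polling_and_test(void)
{
	atomic_store(&polling, true);
	/* polling must be visible before we test need_resched */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load(&need_resched);
}

/* Waker side: rough analogue of resched_task() on a remote task. */
static void resched_remote(void)
{
	atomic_store(&need_resched, true);
	/* need_resched must be visible before we test polling */
	atomic_thread_fence(memory_order_seq_cst);
	if (!atomic_load(&polling))
		atomic_fetch_add(&ipis_sent, 1);	/* would send the IPI */
}

int main(void)
{
	/* Waker runs first: the target is not polling, so an IPI goes out. */
	resched_remote();

	/* Idle side then arms polling and immediately notices the request. */
	printf("idle sees need_resched: %d\n", set_polling_and_test());
	printf("ipis sent: %d\n", atomic_load(&ipis_sent));

	/*
	 * Had the idle side set the polling bit first, the waker would have
	 * seen it and skipped the IPI; either way the wakeup is not lost.
	 */
	return 0;
}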

Reported-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
arch/x86/kernel/process.c | 6 +--
drivers/acpi/processor_idle.c | 46 +++++-------------------
drivers/idle/intel_idle.c | 2 -
include/linux/sched.h | 78 ++++++++++++++++++++++++++++++++++++++----
kernel/cpu/idle.c | 9 ++--
5 files changed, 89 insertions(+), 52 deletions(-)

--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -391,9 +391,9 @@ static void amd_e400_idle(void)
* The switch back from broadcast mode needs to be
* called with interrupts disabled.
*/
- local_irq_disable();
- clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
- local_irq_enable();
+ local_irq_disable();
+ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+ local_irq_enable();
} else
default_idle();
}
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -119,17 +119,10 @@ static struct dmi_system_id processor_po
*/
static void acpi_safe_halt(void)
{
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we
- * test NEED_RESCHED:
- */
- smp_mb();
- if (!need_resched()) {
+ if (!test_need_resched()) {
safe_halt();
local_irq_disable();
}
- current_thread_info()->status |= TS_POLLING;
}

#ifdef ARCH_APICTIMER_STOPS_ON_C3
@@ -737,6 +730,11 @@ static int acpi_idle_enter_c1(struct cpu
if (unlikely(!pr))
return -EINVAL;

+ if (cx->entry_method == ACPI_CSTATE_FFH) {
+ if (current_set_polling_and_test())
+ return -EINVAL;
+ }
+
lapic_timer_state_broadcast(pr, cx, 1);
acpi_idle_do_entry(cx);

@@ -790,18 +788,9 @@ static int acpi_idle_enter_simple(struct
if (unlikely(!pr))
return -EINVAL;

- if (cx->entry_method != ACPI_CSTATE_FFH) {
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we test
- * NEED_RESCHED:
- */
- smp_mb();
-
- if (unlikely(need_resched())) {
- current_thread_info()->status |= TS_POLLING;
+ if (cx->entry_method == ACPI_CSTATE_FFH) {
+ if (current_set_polling_and_test())
return -EINVAL;
- }
}

/*
@@ -819,9 +808,6 @@ static int acpi_idle_enter_simple(struct

sched_clock_idle_wakeup_event(0);

- if (cx->entry_method != ACPI_CSTATE_FFH)
- current_thread_info()->status |= TS_POLLING;
-
lapic_timer_state_broadcast(pr, cx, 0);
return index;
}
@@ -858,18 +844,9 @@ static int acpi_idle_enter_bm(struct cpu
}
}

- if (cx->entry_method != ACPI_CSTATE_FFH) {
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we test
- * NEED_RESCHED:
- */
- smp_mb();
-
- if (unlikely(need_resched())) {
- current_thread_info()->status |= TS_POLLING;
+ if (cx->entry_method == ACPI_CSTATE_FFH) {
+ if (current_set_polling_and_test())
return -EINVAL;
- }
}

acpi_unlazy_tlb(smp_processor_id());
@@ -915,9 +892,6 @@ static int acpi_idle_enter_bm(struct cpu

sched_clock_idle_wakeup_event(0);

- if (cx->entry_method != ACPI_CSTATE_FFH)
- current_thread_info()->status |= TS_POLLING;
-
lapic_timer_state_broadcast(pr, cx, 0);
return index;
}
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -359,7 +359,7 @@ static int intel_idle(struct cpuidle_dev
if (!(lapic_timer_reliable_states & (1 << (cstate))))
clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);

- if (!need_resched()) {
+ if (!current_set_polling_and_test()) {

__monitor((void *)&current_thread_info()->flags, 0, 0);
smp_mb();
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2471,34 +2471,98 @@ static inline int tsk_is_polling(struct
{
return task_thread_info(p)->status & TS_POLLING;
}
-static inline void current_set_polling(void)
+static inline void __current_set_polling(void)
{
current_thread_info()->status |= TS_POLLING;
}

-static inline void current_clr_polling(void)
+static inline bool __must_check current_set_polling_and_test(void)
+{
+ __current_set_polling();
+
+ /*
+ * Polling state must be visible before we test NEED_RESCHED,
+ * paired by resched_task()
+ */
+ smp_mb();
+
+ return unlikely(test_need_resched());
+}
+
+static inline void __current_clr_polling(void)
{
current_thread_info()->status &= ~TS_POLLING;
- smp_mb__after_clear_bit();
+}
+
+static inline bool __must_check current_clr_polling_and_test(void)
+{
+ __current_clr_polling();
+
+ /*
+ * Polling state must be visible before we test NEED_RESCHED,
+ * paired by resched_task()
+ */
+ smp_mb();
+
+ return unlikely(test_need_resched());
}
#elif defined(TIF_POLLING_NRFLAG)
static inline int tsk_is_polling(struct task_struct *p)
{
return test_tsk_thread_flag(p, TIF_POLLING_NRFLAG);
}
-static inline void current_set_polling(void)
+
+static inline void __current_set_polling(void)
{
set_thread_flag(TIF_POLLING_NRFLAG);
}

-static inline void current_clr_polling(void)
+static inline bool __must_check current_set_polling_and_test(void)
+{
+ __current_set_polling();
+
+ /*
+ * Polling state must be visible before we test NEED_RESCHED,
+ * paired by resched_task()
+ *
+ * XXX: assumes set/clear bit are identical barrier wise.
+ */
+ smp_mb__after_clear_bit();
+
+ return unlikely(test_need_resched());
+}
+
+static inline void __current_clr_polling(void)
{
clear_thread_flag(TIF_POLLING_NRFLAG);
}
+
+static inline bool __must_check current_clr_polling_and_test(void)
+{
+ __current_clr_polling();
+
+ /*
+ * Polling state must be visible before we test NEED_RESCHED,
+ * paired by resched_task()
+ */
+ smp_mb__after_clear_bit();
+
+ return unlikely(test_need_resched());
+}
+
#else
static inline int tsk_is_polling(struct task_struct *p) { return 0; }
-static inline void current_set_polling(void) { }
-static inline void current_clr_polling(void) { }
+static inline void __current_set_polling(void) { }
+static inline void __current_clr_polling(void) { }
+
+static inline bool __must_check current_set_polling_and_test(void)
+{
+ return unlikely(test_need_resched());
+}
+static inline bool __must_check current_clr_polling_and_test(void)
+{
+ return unlikely(test_need_resched());
+}
#endif

/*
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -44,7 +44,7 @@ static inline int cpu_idle_poll(void)
rcu_idle_enter();
trace_cpu_idle_rcuidle(0, smp_processor_id());
local_irq_enable();
- while (!need_resched())
+ while (!test_need_resched())
cpu_relax();
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
rcu_idle_exit();
@@ -92,8 +92,7 @@ static void cpu_idle_loop(void)
if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
cpu_idle_poll();
} else {
- current_clr_polling();
- if (!need_resched()) {
+ if (!current_clr_polling_and_test()) {
stop_critical_timings();
rcu_idle_enter();
arch_cpu_idle();
@@ -103,7 +102,7 @@ static void cpu_idle_loop(void)
} else {
local_irq_enable();
}
- current_set_polling();
+ __current_set_polling();
}
arch_cpu_idle_exit();
/*
@@ -136,7 +135,7 @@ void cpu_startup_entry(enum cpuhp_state
*/
boot_init_stack_canary();
#endif
- current_set_polling();
+ __current_set_polling();
arch_cpu_idle_prepare();
cpu_idle_loop();
}

2013-09-11 11:15:09

by Peter Zijlstra

Subject: Re: [PATCH 2/7] sched: Add NEED_RESCHED to the preempt_count


Prompted by a question from Mike I updated the Changelog to explain why
we need to keep TIF_NEED_RESCHED.

---
Subject: sched: Add NEED_RESCHED to the preempt_count
From: Peter Zijlstra <[email protected]>
Date: Wed Aug 14 14:55:31 CEST 2013

In order to combine the preemption and need_resched test we need to
fold the need_resched information into the preempt_count value.

Since the NEED_RESCHED flag is set across CPUs this needs to be an
atomic operation; however, we very much want to avoid making
preempt_count atomic, so we keep the existing TIF_NEED_RESCHED
infrastructure in place but at 3 sites test it and fold its value into
preempt_count; namely:

- resched_task() when setting TIF_NEED_RESCHED on the current task
- scheduler_ipi(); when resched_task() sets TIF_NEED_RESCHED on a
  remote task it follows it up with a reschedule IPI,
  and we can modify the cpu-local preempt_count from there.
- cpu_idle_loop() for when resched_task() found tsk_is_polling().

We use an inverted bitmask to indicate need_resched so that a 0 means
both need_resched and !atomic.

Also remove the barrier() in preempt_enable() between
preempt_enable_no_resched() and preempt_check_resched() to avoid
having to reload the preemption value and allow the compiler to use
the flags of the previous decrement. I couldn't come up with any sane
reason for this barrier() to be there as preempt_enable_no_resched()
already has a barrier() before doing the decrement.

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/preempt.h | 42 +++++++++++++++++++++++++++++++++++++-----
include/linux/sched.h | 2 +-
include/linux/thread_info.h | 1 +
kernel/context_tracking.c | 2 +-
kernel/cpu/idle.c | 10 ++++++++++
kernel/sched/core.c | 18 ++++++++++++++----
6 files changed, 64 insertions(+), 11 deletions(-)

--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -10,9 +10,19 @@
#include <linux/linkage.h>
#include <linux/list.h>

+/*
+ * We use the MSB mostly because it's available; see <linux/hardirq.h> for
+ * the other bits.
+ */
+#define PREEMPT_NEED_RESCHED 0x80000000
+
+/*
+ * We mask the PREEMPT_NEED_RESCHED bit so as not to confuse all current users
+ * that think a non-zero value indicates we cannot preempt.
+ */
static __always_inline int preempt_count(void)
{
- return current_thread_info()->preempt_count;
+ return current_thread_info()->preempt_count & ~PREEMPT_NEED_RESCHED;
}

static __always_inline int *preempt_count_ptr(void)
@@ -20,6 +30,30 @@ static __always_inline int *preempt_coun
return &current_thread_info()->preempt_count;
}

+/*
+ * We fold the NEED_RESCHED bit into the preempt count such that
+ * preempt_enable() can decrement and test for needing to reschedule with a
+ * single instruction.
+ *
+ * We invert the actual bit, so that when the decrement hits 0 we know we both
+ * need to resched (the bit is cleared) and can resched (no preempt count).
+ */
+
+static __always_inline void set_preempt_need_resched(void)
+{
+ *preempt_count_ptr() &= ~PREEMPT_NEED_RESCHED;
+}
+
+static __always_inline void clear_preempt_need_resched(void)
+{
+ *preempt_count_ptr() |= PREEMPT_NEED_RESCHED;
+}
+
+static __always_inline bool test_preempt_need_resched(void)
+{
+ return !(*preempt_count_ptr() & PREEMPT_NEED_RESCHED);
+}
+
#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER)
extern void add_preempt_count(int val);
extern void sub_preempt_count(int val);
@@ -37,7 +71,7 @@ asmlinkage void preempt_schedule(void);

#define preempt_check_resched() \
do { \
- if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
+ if (unlikely(!*preempt_count_ptr())) \
preempt_schedule(); \
} while (0)

@@ -47,7 +81,7 @@ void preempt_schedule_context(void);

#define preempt_check_resched_context() \
do { \
- if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
+ if (unlikely(!*preempt_count_ptr())) \
preempt_schedule_context(); \
} while (0)
#else
@@ -83,7 +117,6 @@ do { \
#define preempt_enable() \
do { \
preempt_enable_no_resched(); \
- barrier(); \
preempt_check_resched(); \
} while (0)

@@ -111,7 +144,6 @@ do { \
#define preempt_enable_notrace() \
do { \
preempt_enable_no_resched_notrace(); \
- barrier(); \
preempt_check_resched_context(); \
} while (0)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2405,7 +2405,7 @@ static inline int signal_pending_state(l

static inline int need_resched(void)
{
- return unlikely(test_thread_flag(TIF_NEED_RESCHED));
+ return unlikely(test_preempt_need_resched());
}

/*
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -104,6 +104,7 @@ static inline int test_ti_thread_flag(st
#define test_thread_flag(flag) \
test_ti_thread_flag(current_thread_info(), flag)

+#define test_need_resched() test_thread_flag(TIF_NEED_RESCHED)
#define set_need_resched() set_thread_flag(TIF_NEED_RESCHED)
#define clear_need_resched() clear_thread_flag(TIF_NEED_RESCHED)

--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -115,7 +115,7 @@ void __sched notrace preempt_schedule_co
{
enum ctx_state prev_ctx;

- if (likely(!preemptible()))
+ if (likely(preempt_count() || irqs_disabled()))
return;

/*
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -106,6 +106,13 @@ static void cpu_idle_loop(void)
current_set_polling();
}
arch_cpu_idle_exit();
+ /*
+ * We need to test and propagate the TIF_NEED_RESCHED
+ * bit here because we might not have sent the
+ * reschedule IPI to idle tasks.
+ */
+ if (test_need_resched())
+ set_preempt_need_resched();
}
tick_nohz_idle_exit();
schedule_preempt_disabled();
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -526,8 +526,10 @@ void resched_task(struct task_struct *p)
set_tsk_need_resched(p);

cpu = task_cpu(p);
- if (cpu == smp_processor_id())
+ if (cpu == smp_processor_id()) {
+ set_preempt_need_resched();
return;
+ }

/* NEED_RESCHED must be visible before we test polling */
smp_mb();
@@ -1411,6 +1413,14 @@ static void sched_ttwu_pending(void)

void scheduler_ipi(void)
{
+ /*
+ * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
+ * TIF_NEED_RESCHED remotely (for the first time) will also send
+ * this IPI.
+ */
+ if (test_need_resched())
+ set_preempt_need_resched();
+
if (llist_empty(&this_rq()->wake_list)
&& !tick_nohz_full_cpu(smp_processor_id())
&& !got_nohz_idle_kick())
@@ -2445,6 +2455,7 @@ static void __sched __schedule(void)
put_prev_task(rq, prev);
next = pick_next_task(rq);
clear_tsk_need_resched(prev);
+ clear_preempt_need_resched();
rq->skip_clock_update = 0;

if (likely(prev != next)) {
@@ -2531,7 +2542,7 @@ asmlinkage void __sched notrace preempt_
* If there is a non-zero preempt_count or interrupts are disabled,
* we do not want to preempt the current task. Just return..
*/
- if (likely(!preemptible()))
+ if (likely(preempt_count() || irqs_disabled()))
return;

do {
@@ -2556,11 +2567,10 @@ EXPORT_SYMBOL(preempt_schedule);
*/
asmlinkage void __sched preempt_schedule_irq(void)
{
- struct thread_info *ti = current_thread_info();
enum ctx_state prev_state;

/* Catch callers which need to be fixed */
- BUG_ON(ti->preempt_count || !irqs_disabled());
+ BUG_ON(preempt_count() || !irqs_disabled());

prev_state = exception_enter();

2013-09-11 13:35:11

by Mike Galbraith

Subject: Re: [PATCH 2/7] sched: Add NEED_RESCHED to the preempt_count

On Wed, 2013-09-11 at 13:06 +0200, Peter Zijlstra wrote:
> On Wed, Sep 11, 2013 at 10:25:30AM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 10, 2013 at 06:59:57PM -0700, Andy Lutomirski wrote:
>
> > > It looks like the intel_idle code can get confused if TIF_NEED_RESCHED
> > > is set but the preempt resched bit is not -- the need_resched call
> > > between monitor and mwait won't notice TIF_NEED_RESCHED.
> > >
> > > Is this condition possible?
> >
> > Ah indeed, I'll have to go find all idle loops out there it seems. Now
> > if only they were easy to spot :/
> >
> > I was hoping the generic idle thing was good enough, apparently not
> > quite. Thanks for spotting that.
>
> OK, and the reason I didn't notice is that the entire TS_POLLING thing
> is completely wrecked so we'll get the interrupt anyway.
>
> The below might fix it.. it boots, but then it already did so that
> doesn't say much.
>
> Mike does it fix the funny you saw?

Yup, modulo test_need_resched() not existing in master.

E5620 pipe-test

v3.7.10                 578.5 KHz
v3.7.10-nothrottle      366.7 KHz
v3.8.13                 468.3 KHz
v3.9.11                 462.0 KHz
v3.10.4                 419.4 KHz
v3.11-rc3-4-g36f571e    400.1 KHz
master-9013-gbf83e61    553.5 KHz

I still have your not quite complete hrtimer patch in there, lest the
menu governor munch any improvement, as well as my throttle-nohz patch,
so it seems we may be a _bit_ down vs 3.7, but reschedule_interrupt did
crawl back under its rock. While I haven't yet booted the Q6600 box,
the core2 Toshiba Satellite lappy is using mwait_idle_with_hints() in
master.

Thanks,

-Mike


2013-09-11 16:35:31

by Andy Lutomirski

Subject: Re: [PATCH 2/7] sched: Add NEED_RESCHED to the preempt_count

On Wed, Sep 11, 2013 at 4:06 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Sep 11, 2013 at 10:25:30AM +0200, Peter Zijlstra wrote:
>> On Tue, Sep 10, 2013 at 06:59:57PM -0700, Andy Lutomirski wrote:
>
>> > It looks like the intel_idle code can get confused if TIF_NEED_RESCHED
>> > is set but the preempt resched bit is not -- the need_resched call
>> > between monitor and mwait won't notice TIF_NEED_RESCHED.
>> >
>> > Is this condition possible?
>>
>> Ah indeed, I'll have to go find all idle loops out there it seems. Now
>> if only they were easy to spot :/
>>
>> I was hoping the generic idle thing was good enough, apparently not
>> quite. Thanks for spotting that.
>
> OK, and the reason I didn't notice is that the entire TS_POLLING thing
> is completely wrecked so we'll get the interrupt anyway.
>

I bet that this improves cross-cpu wakeup latency, too -- the old code
would presumably wake up the cpu and then immediately interrupt it.

It might be nice to rename one or both of need_resched and
test_need_resched, though -- the difference is somewhat inscrutable.

--Andy

2013-09-11 18:06:01

by Peter Zijlstra

Subject: Re: [PATCH 2/7] sched: Add NEED_RESCHED to the preempt_count

On Wed, Sep 11, 2013 at 09:35:08AM -0700, Andy Lutomirski wrote:
> I bet that this improves cross-cpu wakeup latency, too -- the old code
> would presumably wake up the cpu and then immediately interrupt it.

Yeah, it's what clued Mike in to there being a problem.

> It might be nice to rename one or both of need_resched and
> test_need_resched, though -- the difference is somewhat inscrutable.

Agreed, I've just been unable to come up with anything sane.

The best I could come up with is renaming
{set,clear,test}_need_resched() to {}_tif_need_resched(), so we then end
up with test_tif_need_resched(), which is slightly more different from
need_resched().

Ideas?

2013-09-11 18:08:07

by Andy Lutomirski

Subject: Re: [PATCH 2/7] sched: Add NEED_RESCHED to the preempt_count

On Wed, Sep 11, 2013 at 11:05 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Sep 11, 2013 at 09:35:08AM -0700, Andy Lutomirski wrote:
>> I bet that this improves cross-cpu wakeup latency, too -- the old code
>> would presumably wake up the cpu and then immediately interrupt it.
>
> Yeah,. its what clued Mike in to there being a problem.
>
>> It might be nice to rename one or both of need_resched and
>> test_need_resched, though -- the difference is somewhat inscrutable.
>
> I agreed, I've just been unable to come up with anything sane.
>
> The best I could come up with is renaming
> {set,clear,test}_need_resched() to {}_tif_need_resched(), so we then end
> up with test_tif_need_resched(), which is slightly more different from
> need_resched().

Nothing great. That's at least better than the current state, I think.

--Andy

2013-09-12 06:03:59

by Mike Galbraith

Subject: Re: [PATCH 2/7] sched: Add NEED_RESCHED to the preempt_count

On Wed, 2013-09-11 at 15:34 +0200, Mike Galbraith wrote:
> On Wed, 2013-09-11 at 13:06 +0200, Peter Zijlstra wrote:

> > Mike does it fix the funny you saw?
>
> Yup, modulo test_need_resched() not existing in master.

...

> while I haven't yet booted Q6600 box, core2
> Toshiba Satellite lappy is using mwait_idle_with_hints() in master.

Heh. Core2 lappy uses mwait_idle_with_hints()... unless you boot with
processor.max_cstate=1 so it doesn't use the piece-of-shite hpet. Q6600
always has idle woes, seemingly because the box doesn't do a deep enough
cstate to make something (acpi?) happy, despite mwait_idle() having
worked fine for years.

Seems I need to find the cstate dependency spot, and rap it upside the
head if I want my core2 boxen to regain happy-camper status.

-Mike