2012-02-23 00:37:12

by Venkatesh Pallipadi

[permalink] [raw]
Subject: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

smp_call_function_single and ttwu_queue_remote sends unconditional IPI
to target CPU. However, if the target CPU is in mwait based idle, we can
do IPI-less wakeups using the magical powers of monitor-mwait.
Doing this has certain advantages:
* Lower overhead on Async IPI send path. Measurements on Westmere based
systems show savings on "no wait" smp_call_function_single with idle
target CPU (as measured on the sender side).
local socket smp_call_func cost goes from ~1600 to ~1200 cycles
remote socket smp_call_func cost goes from ~2000 to ~1800 cycles
* Avoiding actual interrupts shows a measurable reduction (10%) in system
non-idle cycles and cache-references with micro-benchmark sending IPI from
one CPU to all the other mostly idle CPUs in the system.
* On a mostly idle system, turbostat shows a tiny decrease in C0(active) time
and a corresponding increase in C6 state (Each row being 10min avg)
%c0 %c1 %c6
Before
Run 1 1.49 2.88 95.63
Run 2 1.48 2.89 95.63
Run 3 1.50 2.91 95.59
After
Run 1 1.28 2.38 96.33
Run 2 1.30 2.44 96.26
Run 3 1.31 2.45 96.24

* As a bonus, we can avoid sched/call IPI overhead altogether in a special case.
When CPU Y has woken up CPU X (which can take 50-100us to actually wakeup
from a deep idle state) and CPU Z wants to send IPI to CPU X in this period.
It can get it for free.

We started looking at this with one of our workloads where system is partially
busy and we noticed some kernel hotspots in find_next_bit and
default_send_IPI_mask_sequence_phys coming from sched wakeup (futex wakeups)
and networking call functions.

Thanks to Suresh for the suggestion of using TIF flags instead of
having a new percpu state variable and complicated update logic.

Notes:
* This only helps when target CPU is idle. When it is busy we will still send
IPI as before.
* Do we need some accounting for these wakeups exported for powertop?
* We can also eliminate TS_POLLING flag in favor of this. But, that will have
a lot more touchpoints and better done as a standlone change.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---

Changes since previous versions:
* RFC https://lkml.org/lkml/2012/2/6/357
Moved the changes into arch specific code as per PeterZ suggestion
Got rid of new per cpu state logic in favor of TIF flag bits

arch/x86/include/asm/ipiless_poke.h | 82 +++++++++++++++++++++++++++++++++++
arch/x86/include/asm/thread_info.h | 4 ++
arch/x86/kernel/acpi/cstate.c | 7 ++-
arch/x86/kernel/process_32.c | 2 +
arch/x86/kernel/process_64.c | 2 +
arch/x86/kernel/smp.c | 16 +++++++
include/linux/sched.h | 2 +
kernel/sched/core.c | 13 ++++++
8 files changed, 126 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/include/asm/ipiless_poke.h

diff --git a/arch/x86/include/asm/ipiless_poke.h b/arch/x86/include/asm/ipiless_poke.h
new file mode 100644
index 0000000..58670c7
--- /dev/null
+++ b/arch/x86/include/asm/ipiless_poke.h
@@ -0,0 +1,82 @@
+#ifndef _ASM_X86_IPILESS_POKE_H
+#define _ASM_X86_IPILESS_POKE_H
+
+#include <linux/sched.h>
+#include <asm/thread_info.h>
+
+#ifdef CONFIG_SMP
+
+DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
+
+/*
+ * Use 2 bits in idle_task's thead info flags:
+ * TIF_IPILESS_IDLE marks enter to and exit from idle states with ipiless
+ * wakeup capability.
+ * TIF_IPI_PENDING set by IPI source CPU if it finds that the IPI target CPU
+ * is in TIF_IPILESS_IDLE state (and TIF_IPI_PENDING is not already set).
+ * Setting of TIF_IPI_PENDING bit brings the target CPU out of idle state.
+ */
+
+static inline void ipiless_idle_enter(void)
+{
+ set_thread_flag(TIF_IPILESS_IDLE);
+}
+
+static inline void ipiless_idle_exit(void)
+{
+ clear_thread_flag(TIF_IPILESS_IDLE);
+}
+
+static inline int is_ipi_pending(void)
+{
+ return unlikely(test_thread_flag(TIF_IPI_PENDING));
+}
+
+static inline int need_wakeup(void)
+{
+ return need_resched() || is_ipi_pending();
+}
+
+static inline void ipiless_pending_work(void)
+{
+ if (is_ipi_pending()) {
+ clear_thread_flag(TIF_IPI_PENDING);
+ local_bh_disable();
+ local_irq_disable();
+ generic_smp_call_function_single_interrupt();
+ __scheduler_ipi();
+ local_irq_enable();
+ local_bh_enable();
+ }
+}
+
+static inline int ipiless_magic_poke(int cpu)
+{
+ int val;
+ atomic_t *idle_flag = per_cpu(idle_task_ti_flags, cpu);
+
+ val = atomic_read(idle_flag);
+ if (unlikely(val & _TIF_IPI_PENDING))
+ return 1;
+
+ if (!(val & _TIF_IPILESS_IDLE))
+ return 0;
+
+ if (val == atomic_cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))
+ return 1;
+
+ return 0;
+}
+
+#else
+static inline void ipiless_pending_work(void) { }
+static inline void ipiless_idle_enter(void) { }
+static inline void ipiless_idle_exit(void) { }
+
+static inline int need_wakeup(void)
+{
+ return need_resched();
+}
+#endif
+
+#endif /* _ASM_X86_IPILESS_POKE_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index bc817cd..f5cd1b8 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -95,6 +95,8 @@ struct thread_info {
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
#define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */
+#define TIF_IPILESS_IDLE 29 /* IPIless idle bit */
+#define TIF_IPI_PENDING 30 /* IPI pending on the CPU */

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
@@ -116,6 +118,8 @@ struct thread_info {
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
#define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_IPILESS_IDLE (1 << TIF_IPILESS_IDLE)
+#define _TIF_IPI_PENDING (1 << TIF_IPI_PENDING)

/* work to do in syscall_trace_enter() */
#define _TIF_WORK_SYSCALL_ENTRY \
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index f50e7fb..50833ed 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -14,6 +14,7 @@
#include <acpi/processor.h>
#include <asm/acpi.h>
#include <asm/mwait.h>
+#include <asm/ipiless_poke.h>

/*
* Initialize bm_flags based on the CPU cache properties
@@ -161,15 +162,17 @@ EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
*/
void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
{
- if (!need_resched()) {
+ ipiless_idle_enter();
+ if (!need_wakeup()) {
if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
clflush((void *)&current_thread_info()->flags);

__monitor((void *)&current_thread_info()->flags, 0, 0);
smp_mb();
- if (!need_resched())
+ if (!need_wakeup())
__mwait(ax, cx);
}
+ ipiless_idle_exit();
}

void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 324cd72..a963e98 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -44,6 +44,7 @@
#include <asm/system.h>
#include <asm/ldt.h>
#include <asm/processor.h>
+#include <asm/ipiless_poke.h>
#include <asm/i387.h>
#include <asm/desc.h>
#ifdef CONFIG_MATH_EMULATION
@@ -116,6 +117,7 @@ void cpu_idle(void)
if (cpuidle_idle_call())
pm_idle();
start_critical_timings();
+ ipiless_pending_work();
}
rcu_idle_exit();
tick_nohz_idle_exit();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 753e803..93b2e5f 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -44,6 +44,7 @@
#include <asm/processor.h>
#include <asm/i387.h>
#include <asm/mmu_context.h>
+#include <asm/ipiless_poke.h>
#include <asm/prctl.h>
#include <asm/desc.h>
#include <asm/proto.h>
@@ -153,6 +154,7 @@ void cpu_idle(void)
has already called exit_idle. But some idle
loops can be woken up without interrupt. */
__exit_idle();
+ ipiless_pending_work();
}

tick_nohz_idle_exit();
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index a8ff227..e66a4c8 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
#include <asm/mtrr.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
+#include <asm/ipiless_poke.h>
#include <asm/proto.h>
#include <asm/apic.h>
#include <asm/nmi.h>
@@ -109,6 +110,14 @@
* about nothing of note with C stepping upwards.
*/

+DEFINE_PER_CPU(atomic_t *, idle_task_ti_flags);
+
+void __cpuinit thread_idle_state_setup(struct task_struct *idle, int cpu)
+{
+ per_cpu(idle_task_ti_flags, cpu) =
+ (atomic_t *)(&(task_thread_info(idle)->flags));
+}
+
/*
* this function sends a 'reschedule' IPI to another CPU.
* it goes straight through and wastes no time serializing
@@ -120,11 +129,18 @@ static void native_smp_send_reschedule(int cpu)
WARN_ON(1);
return;
}
+
+ if (ipiless_magic_poke(cpu))
+ return;
+
apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
}

void native_send_call_func_single_ipi(int cpu)
{
+ if (ipiless_magic_poke(cpu))
+ return;
+
apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d379a6..e07ca62 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2298,9 +2298,11 @@ extern char *get_task_comm(char *to, struct task_struct *tsk);

#ifdef CONFIG_SMP
void scheduler_ipi(void);
+void __scheduler_ipi(void);
extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
#else
static inline void scheduler_ipi(void) { }
+static inline void __scheduler_ipi(void) { }
static inline unsigned long wait_task_inactive(struct task_struct *p,
long match_state)
{
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5255c9d..1558316 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1451,6 +1451,14 @@ static void sched_ttwu_pending(void)
raw_spin_unlock(&rq->lock);
}

+void __scheduler_ipi(void)
+{
+ if (llist_empty(&this_rq()->wake_list))
+ return;
+
+ sched_ttwu_pending();
+}
+
void scheduler_ipi(void)
{
if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
@@ -4827,6 +4835,10 @@ void __cpuinit init_idle_bootup_task(struct task_struct *idle)
idle->sched_class = &idle_sched_class;
}

+void __cpuinit __weak thread_idle_state_setup(struct task_struct *idle, int cpu)
+{
+}
+
/**
* init_idle - set up an idle thread for a given CPU
* @idle: task in question
@@ -4869,6 +4881,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)

/* Set the preempt count _outside_ the spinlocks! */
task_thread_info(idle)->preempt_count = 0;
+ thread_idle_state_setup(idle, cpu);

/*
* The idle tasks have their own, simple scheduling class:
--
1.7.7.3


2012-02-23 07:50:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1


* Venkatesh Pallipadi <[email protected]> wrote:

> * Do we need some accounting for these wakeups exported for powertop?

If then tracepoints.

> * We can also eliminate TS_POLLING flag in favor of this. But, that will have
> a lot more touchpoints and better done as a standlone change.

Should most definitely be done for this series to be acceptble -
as a preparatory patch in the series, with the feature at the
end of the series.

> +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);

That's ugly, we should access the idle task's ti flags directly.

To have efficient percpu access to the idle threads another
clean-up is needed: we should turn idle_thread_array into a
full-structure PER_CPU area.

For that we need a small variant of fork_idle(), which does not
dup the init thread - pretty trivial.

fork_idle() should also make sure it does not schedule the child
thread: thus we'd also be able to further simplify smpboot.c and
get rid of all that extremely ugly 'struct create_idle'
gymnastics in smpboot.c.

Thanks,

Ingo

2012-02-23 09:09:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Thu, 2012-02-23 at 08:50 +0100, Ingo Molnar wrote:
> * Venkatesh Pallipadi <[email protected]> wrote:

> > * We can also eliminate TS_POLLING flag in favor of this. But, that will have
> > a lot more touchpoints and better done as a standlone change.
>
> Should most definitely be done for this series to be acceptble -
> as a preparatory patch in the series, with the feature at the
> end of the series.

I don't think you can, TS_POLLING still works for idle=poll, whereas
this new stuff requires monitor/mwait.

2012-02-23 09:31:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Wed, 2012-02-22 at 16:36 -0800, Venkatesh Pallipadi wrote:

> Changes since previous versions:

> Moved the changes into arch specific code as per PeterZ suggestion

You failed:

> include/linux/sched.h | 2 +
> kernel/sched/core.c | 13 ++++++



> diff --git a/arch/x86/include/asm/ipiless_poke.h b/arch/x86/include/asm/ipiless_poke.h
> new file mode 100644
> index 0000000..58670c7
> --- /dev/null
> +++ b/arch/x86/include/asm/ipiless_poke.h
> @@ -0,0 +1,82 @@
> +#ifndef _ASM_X86_IPILESS_POKE_H
> +#define _ASM_X86_IPILESS_POKE_H
> +
> +#include <linux/sched.h>
> +#include <asm/thread_info.h>
> +
> +#ifdef CONFIG_SMP
> +
> +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
> +
> +/*
> + * Use 2 bits in idle_task's thead info flags:
> + * TIF_IPILESS_IDLE marks enter to and exit from idle states with ipiless
> + * wakeup capability.
> + * TIF_IPI_PENDING set by IPI source CPU if it finds that the IPI target CPU
> + * is in TIF_IPILESS_IDLE state (and TIF_IPI_PENDING is not already set).
> + * Setting of TIF_IPI_PENDING bit brings the target CPU out of idle state.
> + */
> +
> +static inline void ipiless_idle_enter(void)
> +{
> + set_thread_flag(TIF_IPILESS_IDLE);
> +}
> +
> +static inline void ipiless_idle_exit(void)
> +{
> + clear_thread_flag(TIF_IPILESS_IDLE);
> +}
> +
> +static inline int is_ipi_pending(void)
> +{
> + return unlikely(test_thread_flag(TIF_IPI_PENDING));
> +}
> +
> +static inline int need_wakeup(void)
> +{
> + return need_resched() || is_ipi_pending();
> +}
> +
> +static inline void ipiless_pending_work(void)
> +{
> + if (is_ipi_pending()) {
> + clear_thread_flag(TIF_IPI_PENDING);
> + local_bh_disable();
> + local_irq_disable();

That local_bh_disable() is completely superfluous, disabling IRQs
already kills bh.

> + generic_smp_call_function_single_interrupt();
> + __scheduler_ipi();

Why not scheduler_ipi()?

Also, you could keep a pending vector bitmask in a per-cpu variable to
avoid having to call all handlers. not sure that's worth it for just
two, but something to keep in mind for if/when this starts expanding.

> + local_irq_enable();
> + local_bh_enable();
> + }
> +}

Why doesn't ipiless_idle_exit() call this? That would keep it isolated
to just those idle routines that actually use mwait instead of bothering
the generic idle paths with this.

> +static inline int ipiless_magic_poke(int cpu)
> +{
> + int val;
> + atomic_t *idle_flag = per_cpu(idle_task_ti_flags, cpu);

What's this atomic_t nonsense about? thread_info::flags is __u32,
casting it to atomic_t is complete rubbish.

> +
> + val = atomic_read(idle_flag);

The __u32 version would look like: val = ACCESS_ONCE(*idle_flag);

> + if (unlikely(val & _TIF_IPI_PENDING))
> + return 1;
> +
> + if (!(val & _TIF_IPILESS_IDLE))
> + return 0;
> +
> + if (val == atomic_cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))

The __u32 version would look like:

if (val == cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))

Bonus win for being shorter!

> + return 1;
> +
> + return 0;
> +}
> +
> +#else
> +static inline void ipiless_pending_work(void) { }
> +static inline void ipiless_idle_enter(void) { }
> +static inline void ipiless_idle_exit(void) { }
> +
> +static inline int need_wakeup(void)
> +{
> + return need_resched();
> +}
> +#endif
> +
> +#endif /* _ASM_X86_IPILESS_POKE_H */


> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index a8ff227..e66a4c8 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -27,6 +27,7 @@
> #include <asm/mtrr.h>
> #include <asm/tlbflush.h>
> #include <asm/mmu_context.h>
> +#include <asm/ipiless_poke.h>
> #include <asm/proto.h>
> #include <asm/apic.h>
> #include <asm/nmi.h>
> @@ -109,6 +110,14 @@
> * about nothing of note with C stepping upwards.
> */
>
> +DEFINE_PER_CPU(atomic_t *, idle_task_ti_flags);
> +
> +void __cpuinit thread_idle_state_setup(struct task_struct *idle, int cpu)
> +{
> + per_cpu(idle_task_ti_flags, cpu) =
> + (atomic_t *)(&(task_thread_info(idle)->flags));
> +}

As Ingo already pointed out, its the architecture that actually sets up
the idle threads, so putting callbacks into the generic code to find
them is kinda backwards.

Again, *yuck* at that atomic_t business.


> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5255c9d..1558316 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1451,6 +1451,14 @@ static void sched_ttwu_pending(void)
> raw_spin_unlock(&rq->lock);
> }
>
> +void __scheduler_ipi(void)
> +{
> + if (llist_empty(&this_rq()->wake_list))
> + return;
> +
> + sched_ttwu_pending();
> +}

FAIL!! It should be 100% identical to a normal IPI, that calls
scheduler_ipi() so this should too.

> void scheduler_ipi(void)
> {
> if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> @@ -4827,6 +4835,10 @@ void __cpuinit init_idle_bootup_task(struct task_struct *idle)
> idle->sched_class = &idle_sched_class;
> }
>
> +void __cpuinit __weak thread_idle_state_setup(struct task_struct *idle, int cpu)
> +{
> +}
> +
> /**
> * init_idle - set up an idle thread for a given CPU
> * @idle: task in question
> @@ -4869,6 +4881,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
>
> /* Set the preempt count _outside_ the spinlocks! */
> task_thread_info(idle)->preempt_count = 0;
> + thread_idle_state_setup(idle, cpu);

I suggest you put this in smpboot.c someplace ;-)

2012-02-23 19:34:13

by Venkatesh Pallipadi

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Wed, 2012-02-22 at 16:36 -0800, Venkatesh Pallipadi wrote:
>
> > Changes since previous versions:
>
> > ? Moved the changes into arch specific code as per PeterZ suggestion
>
> You failed:
>
> > ?include/linux/sched.h ? ? ? ? ? ? ? | ? ?2 +
> > ?kernel/sched/core.c ? ? ? ? ? ? ? ? | ? 13 ++++++
>
>
>
> > diff --git a/arch/x86/include/asm/ipiless_poke.h
> > b/arch/x86/include/asm/ipiless_poke.h
> > new file mode 100644
> > index 0000000..58670c7
> > --- /dev/null
> > +++ b/arch/x86/include/asm/ipiless_poke.h
> > @@ -0,0 +1,82 @@
> > +#ifndef _ASM_X86_IPILESS_POKE_H
> > +#define _ASM_X86_IPILESS_POKE_H
> > +
> > +#include <linux/sched.h>
> > +#include <asm/thread_info.h>
> > +
> > +#ifdef CONFIG_SMP
> > +
> > +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
> > +
> > +/*
> > + * Use 2 bits in idle_task's thead info flags:
> > + * TIF_IPILESS_IDLE marks enter to and exit from idle states with
> > ipiless
> > + * wakeup capability.
> > + * TIF_IPI_PENDING set by IPI source CPU if it finds that the IPI
> > target CPU
> > + * is in TIF_IPILESS_IDLE state (and TIF_IPI_PENDING is not already
> > set).
> > + * Setting of TIF_IPI_PENDING bit brings the target CPU out of idle
> > state.
> > + */
> > +
> > +static inline void ipiless_idle_enter(void)
> > +{
> > + ? ? set_thread_flag(TIF_IPILESS_IDLE);
> > +}
> > +
> > +static inline void ipiless_idle_exit(void)
> > +{
> > + ? ? clear_thread_flag(TIF_IPILESS_IDLE);
> > +}
> > +
> > +static inline int is_ipi_pending(void)
> > +{
> > + ? ? return unlikely(test_thread_flag(TIF_IPI_PENDING));
> > +}
> > +
> > +static inline int need_wakeup(void)
> > +{
> > + ? ? return need_resched() || is_ipi_pending();
> > +}
> > +
> > +static inline void ipiless_pending_work(void)
> > +{
> > + ? ? if (is_ipi_pending()) {
> > + ? ? ? ? ? ? clear_thread_flag(TIF_IPI_PENDING);
> > + ? ? ? ? ? ? local_bh_disable();
> > + ? ? ? ? ? ? local_irq_disable();
>
> That local_bh_disable() is completely superfluous, disabling IRQs
> already kills bh.

The reason for local_bh_disable/enable was to check for potential
softirqs after these interrupt.

>
> > + ? ? ? ? ? ? generic_smp_call_function_single_interrupt();
> > + ? ? ? ? ? ? __scheduler_ipi();
>
> Why not scheduler_ipi()?

Was trying to avoid irq_enter/exit. As the work here is done in idle
thread context, I though we could avoid enter/exit. Also, if we need
it, we should ideally do it once across scheduler and smp_call work
together instead of in scheduler_ipi().

>
> Also, you could keep a pending vector bitmask in a per-cpu variable to
> avoid having to call all handlers. not sure that's worth it for just
> two, but something to keep in mind for if/when this starts expanding.
>

Agree. For anything more than two, it will be better to have an
additional bitmask.

> > + ? ? ? ? ? ? local_irq_enable();
> > + ? ? ? ? ? ? local_bh_enable();
> > + ? ? }
> > +}
>
> Why doesn't ipiless_idle_exit() call this? That would keep it isolated
> to just those idle routines that actually use mwait instead of bothering
> the generic idle paths with this.

ipiless_idle_exit is called in the inner most part of idle entry exit.
In mwait case we may not even have interrupts enabled at that time and
there is code that does idle residency timing etc which will get
impacted if we start doing the pending ipi work there. Doing it at
higher level along with things like enter-exit tickless felt nicer.

>
> > +static inline int ipiless_magic_poke(int cpu)
> > +{
> > + ? ? int val;
> > + ? ? atomic_t *idle_flag = per_cpu(idle_task_ti_flags, cpu);
>
> What's this atomic_t nonsense about? thread_info::flags is __u32,
> casting it to atomic_t is complete rubbish.
>

I have to say I was not a big fan of that typecast as well. I just did
not know about ACCESS_ONCE() interface.
Thanks for the tips below. This is the first thing I am going to
change in this patch.

> > +
> > + ? ? val = atomic_read(idle_flag);
>
> The __u32 version would look like: val = ACCESS_ONCE(*idle_flag);
>
> > + ? ? if (unlikely(val & _TIF_IPI_PENDING))
> > + ? ? ? ? ? ? return 1;
> > +
> > + ? ? if (!(val & _TIF_IPILESS_IDLE))
> > + ? ? ? ? ? ? return 0;
> > +
> > + ? ? if (val == atomic_cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))
>
> The __u32 version would look like:
>
> ?if (val == cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))
>
> Bonus win for being shorter!
>
> > + ? ? ? ? ? ? return 1;
> > +
> > + ? ? return 0;
> > +}
> > +
> > +#else
> > +static inline void ipiless_pending_work(void) { }
> > +static inline void ipiless_idle_enter(void) { }
> > +static inline void ipiless_idle_exit(void) { }
> > +
> > +static inline int need_wakeup(void)
> > +{
> > + ? ? return need_resched();
> > +}
> > +#endif
> > +
> > +#endif /* _ASM_X86_IPILESS_POKE_H */
>
>
> > diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> > index a8ff227..e66a4c8 100644
> > --- a/arch/x86/kernel/smp.c
> > +++ b/arch/x86/kernel/smp.c
> > @@ -27,6 +27,7 @@
> > ?#include <asm/mtrr.h>
> > ?#include <asm/tlbflush.h>
> > ?#include <asm/mmu_context.h>
> > +#include <asm/ipiless_poke.h>
> > ?#include <asm/proto.h>
> > ?#include <asm/apic.h>
> > ?#include <asm/nmi.h>
> > @@ -109,6 +110,14 @@
> > ? * ? about nothing of note with C stepping upwards.
> > ? */
> >
> > +DEFINE_PER_CPU(atomic_t *, idle_task_ti_flags);
> > +
> > +void __cpuinit thread_idle_state_setup(struct task_struct *idle, int
> > cpu)
> > +{
> > + ? ? per_cpu(idle_task_ti_flags, cpu) =
> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? (atomic_t
> > *)(&(task_thread_info(idle)->flags));
> > +}
>
> As Ingo already pointed out, its the architecture that actually sets up
> the idle threads, so putting callbacks into the generic code to find
> them is kinda backwards.

Yes. I need to spend a bit more time looking at cleanups Ingo suggested.

>
> Again, *yuck* at that atomic_t business.

They are gone now...

>
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 5255c9d..1558316 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1451,6 +1451,14 @@ static void sched_ttwu_pending(void)
> > ? ? ? raw_spin_unlock(&rq->lock);
> > ?}
> >
> > +void __scheduler_ipi(void)
> > +{
> > + ? ? if (llist_empty(&this_rq()->wake_list))
> > + ? ? ? ? ? ? return;
> > +
> > + ? ? sched_ttwu_pending();
> > +}
>
> FAIL!! It should be 100% identical to a normal IPI, that calls
> scheduler_ipi() so this should too.
>
> > ?void scheduler_ipi(void)
> > ?{
> > ? ? ? if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> > @@ -4827,6 +4835,10 @@ void __cpuinit init_idle_bootup_task(struct
> > task_struct *idle)
> > ? ? ? idle->sched_class = &idle_sched_class;
> > ?}
> >
> > +void __cpuinit __weak thread_idle_state_setup(struct task_struct *idle,
> > int cpu)
> > +{
> > +}
> > +
> > ?/**
> > ? * init_idle - set up an idle thread for a given CPU
> > ? * @idle: task in question
> > @@ -4869,6 +4881,7 @@ void __cpuinit init_idle(struct task_struct *idle,
> > int cpu)
> >
> > ? ? ? /* Set the preempt count _outside_ the spinlocks! */
> > ? ? ? task_thread_info(idle)->preempt_count = 0;
> > + ? ? thread_idle_state_setup(idle, cpu);
>
> I suggest you put this in smpboot.c someplace ;-)

Thanks,
Venki

2012-02-23 20:03:50

by Venkatesh Pallipadi

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

[ Resending without the ugly email client formatting ]

On Wed, Feb 22, 2012 at 11:50 PM, Ingo Molnar <[email protected]> wrote:
>
> * Venkatesh Pallipadi <[email protected]> wrote:
>
>> * Do we need some accounting for these wakeups exported for powertop?
>
> If then tracepoints.
>
>> * We can also eliminate TS_POLLING flag in favor of this. But, that will have
>> ? a lot more touchpoints and better done as a standlone change.
>
> Should most definitely be done for this series to be acceptble -
> as a preparatory patch in the series, with the feature at the
> end of the series.
>
OK. Will look at TS_POLLING part and likely include it in the next resend.

>> +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
>
> That's ugly, we should access the idle task's ti flags directly.
>
> To have efficient percpu access to the idle threads another
> clean-up is needed: we should turn idle_thread_array into a
> full-structure PER_CPU area.
>
> For that we need a small variant of fork_idle(), which does not
> dup the init thread - pretty trivial.
>
> fork_idle() should also make sure it does not schedule the child
> thread: thus we'd also be able to further simplify smpboot.c and
> get rid of all that extremely ugly 'struct create_idle'
> gymnastics in smpboot.c.
>

Hmm. Not being very familiar with that code, I will have to take a closer
look at this potential cleanup...

Thanks,
Venki

> Thanks,
>
> ? ? ? ?Ingo

2012-02-23 20:04:48

by Venkatesh Pallipadi

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

[ Resending without the ugly email client formatting ]

On Thu, Feb 23, 2012 at 1:08 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-02-23 at 08:50 +0100, Ingo Molnar wrote:
>> * Venkatesh Pallipadi <[email protected]> wrote:
>
>> > * We can also eliminate TS_POLLING flag in favor of this. But, that will have
>> > ? a lot more touchpoints and better done as a standlone change.
>>
>> Should most definitely be done for this series to be acceptble -
>> as a preparatory patch in the series, with the feature at the
>> end of the series.
>
> I don't think you can, TS_POLLING still works for idle=poll, whereas
> this new stuff requires monitor/mwait.
>

Even poll_idle should be able to use this ipiless_wakeup. It just have
to poll for an additional IPI_PENDING bit along with need_resched.

2012-02-24 05:42:09

by Yong Zhang

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > Why not scheduler_ipi()?
>
> Was trying to avoid irq_enter/exit. As the work here is done in idle
> thread context, I though we could avoid enter/exit.

It seems we could not.
At least RCU need it, see commit c5d753a55, otherwise we will get
warning like 'RCU used illegally from extended quiescent state!'

Thanks,
Yong

2012-02-24 06:14:10

by Yong Zhang

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > Why not scheduler_ipi()?
> >
> > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > thread context, I though we could avoid enter/exit.
>
> It seems we could not.
> At least RCU need it, see commit c5d753a55, otherwise we will get
> warning like 'RCU used illegally from extended quiescent state!'

[Off topic]
This remind me that we should have moved the irq_enter()/irq_exit() to
each arch's related irq handler.
see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2

So Peter, is there someone alread on it? or it still worth doing now?

Thanks,
Yong

2012-02-26 01:33:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > Why not scheduler_ipi()?
> >
> > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > thread context, I though we could avoid enter/exit.
>
> It seems we could not.
> At least RCU need it, see commit c5d753a55, otherwise we will get
> warning like 'RCU used illegally from extended quiescent state!'

If the use is tracing, then Steven Rostedt's patchset plus use of his
_rcuidle() tracing variants handles this:

https://lkml.org/lkml/2012/2/7/231

If this is instead algorithmic use of RCU, a set of patches I have queued
up for 3.4 will be required.

Thanx, Paul

2012-02-27 08:38:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:

> This remind me that we should have moved the irq_enter()/irq_exit() to
> each arch's related irq handler.
> see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
>
> So Peter, is there someone alread on it? or it still worth doing now?

trouble was that some people didn't feel comfortable adding that
overhead to plain resched IPIs that didn't need to do anything. So I
more or less let it sit for a while.

2012-02-27 08:46:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Thu, 2012-02-23 at 11:34 -0800, Venki Pallipadi wrote:

> > > + local_bh_disable();
> > > + local_irq_disable();
> >
> > That local_bh_disable() is completely superfluous, disabling IRQs
> > already kills bh.
>
> The reason for local_bh_disable/enable was to check for potential
> softirqs after these interrupt.

Why is that needed? We never wrap local_irq_disable() in bh muck?


> > Why doesn't ipiless_idle_exit() call this? That would keep it isolated
> > to just those idle routines that actually use mwait instead of bothering
> > the generic idle paths with this.
>
> ipiless_idle_exit is called in the inner most part of idle entry exit.
> In mwait case we may not even have interrupts enabled at that time and
> there is code that does idle residency timing etc which will get
> impacted if we start doing the pending ipi work there. Doing it at
> higher level along with things like enter-exit tickless felt nicer.

But but but, an actual interrupt can be handled before the instruction
after mwait, so why would this be a problem?

2012-02-27 09:07:01

by Yong Zhang

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > > Why not scheduler_ipi()?
> > >
> > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > thread context, I though we could avoid enter/exit.
> >
> > It seems we could not.
> > At least RCU need it, see commit c5d753a55, otherwise we will get
> > warning like 'RCU used illegally from extended quiescent state!'
>
> If the use is tracing, then Steven Rostedt's patchset plus use of his
> _rcuidle() tracing variants handles this:
>
> https://lkml.org/lkml/2012/2/7/231
>
> If this is instead algorithmic use of RCU, a set of patches I have queued
> up for 3.4 will be required.

scheduler_ipi() doing more than tracing. Will look at your patches :)

Thanks,
Yong

2012-02-27 09:08:32

by Yong Zhang

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, Feb 27, 2012 at 09:38:01AM +0100, Peter Zijlstra wrote:
> On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:
>
> > This remind me that we should have moved the irq_enter()/irq_exit() to
> > each arch's related irq handler.
> > see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
> >
> > So Peter, is there someone alread on it? or it still worth doing now?
>
> trouble was that some people didn't feel comfortable adding that
> overhead to plain resched IPIs that didn't need to do anything.

Any ideas on where the plain resched IPIs come from?

Thanks,
Yong

> So I
> more or less let it sit for a while.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Only stand for myself

2012-02-27 09:31:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, 2012-02-27 at 17:08 +0800, Yong Zhang wrote:
> On Mon, Feb 27, 2012 at 09:38:01AM +0100, Peter Zijlstra wrote:
> > On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:
> >
> > > This remind me that we should have moved the irq_enter()/irq_exit() to
> > > each arch's related irq handler.
> > > see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
> > >
> > > So Peter, is there someone alread on it? or it still worth doing now?
> >
> > trouble was that some people didn't feel comfortable adding that
> > overhead to plain resched IPIs that didn't need to do anything.
>
> Any ideas on where the plain resched IPIs come from?

remote wakeups that don't queue I guess.

IIRC I looked at it a while back and while not in the majority there
were still a few (I only added counters, didn't look where they came
from).

However since 518cd623 there'd be a lot more I guess..

Feel free to investigate. Also, some non-sched users of the resched ipi
exist, eg. KVM uses it to kick a remote vcpu out from guest mode.

2012-02-27 09:51:22

by Yong Zhang

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, Feb 27, 2012 at 10:30:58AM +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-27 at 17:08 +0800, Yong Zhang wrote:
> > On Mon, Feb 27, 2012 at 09:38:01AM +0100, Peter Zijlstra wrote:
> > > On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:
> > >
> > > > This remind me that we should have moved the irq_enter()/irq_exit() to
> > > > each arch's related irq handler.
> > > > see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
> > > >
> > > > So Peter, is there someone alread on it? or it still worth doing now?
> > >
> > > trouble was that some people didn't feel comfortable adding that
> > > overhead to plain resched IPIs that didn't need to do anything.
> >
> > Any ideas on where the plain resched IPIs come from?
>
> remote wakeups that don't queue I guess.
>
> IIRC I looked at it a while back and while not in the majority there
> were still a few (I only added counters, didn't look where they came
> from).
>
> However since 518cd623 there'd be a lot more I guess..
>
> Feel free to investigate. Also, some non-sched users of the resched ipi
> exist, eg. KVM uses it to kick a remote vcpu out from guest mode.

Thanks for the detail.

Will try to find if there is something interesting.

Thanks,
Yong

2012-02-27 17:09:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, Feb 27, 2012 at 05:06:46PM +0800, Yong Zhang wrote:
> On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > > > Why not scheduler_ipi()?
> > > >
> > > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > > thread context, I though we could avoid enter/exit.
> > >
> > > It seems we could not.
> > > At least RCU need it, see commit c5d753a55, otherwise we will get
> > > warning like 'RCU used illegally from extended quiescent state!'
> >
> > If the use is tracing, then Steven Rostedt's patchset plus use of his
> > _rcuidle() tracing variants handles this:
> >
> > https://lkml.org/lkml/2012/2/7/231
> >
> > If this is instead algorithmic use of RCU, a set of patches I have queued
> > up for 3.4 will be required.
>
> scheduler_ipi() doing more than tracing. Will look at your patches :)

Ah! The key question is whether or not the code in question is called
both from idle and from non-idle. This will be easiest if the code is
called only from idle, in which case you should only need this one:

https://lkml.org/lkml/2012/2/3/498

Thanx, Paul

2012-02-27 18:17:49

by Venkatesh Pallipadi

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, Feb 27, 2012 at 12:45 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-02-23 at 11:34 -0800, Venki Pallipadi wrote:
>
>> > > + ? ? ? ? ? ? local_bh_disable();
>> > > + ? ? ? ? ? ? local_irq_disable();
>> >
>> > That local_bh_disable() is completely superfluous, disabling IRQs
>> > already kills bh.
>>
>> The reason for local_bh_disable/enable was to check for potential
>> softirqs after these interrupt.
>
> Why is that needed? We never wrap local_irq_disable() in bh muck?
>

For normal interrupts, we check for bh on interrutp return. For idle
loop no interrupt case, we will have to call bh handler explicitly as
otherwise we will go back to idle until need_resched().
local_bh_enable() handles this bh call here.

>> > Why doesn't ipiless_idle_exit() call this? That would keep it isolated
>> > to just those idle routines that actually use mwait instead of bothering
>> > the generic idle paths with this.
>>
>> ipiless_idle_exit is called in the inner most part of idle entry exit.
>> In mwait case we may not even have interrupts enabled at that time and
>> there is code that does idle residency timing etc which will get
>> impacted if we start doing the pending ipi work there. Doing it at
>> higher level along with things like enter-exit tickless felt nicer.
>
> But but but, an actual interrupt can be handled before the instruction
> after mwait, so why would this be a problem?
>

With most commom usage of mwait(), interrupts are disabled on entry
and exit from mwait(). There is a special flag that brings CPU out of
mwait on interrupt, without actually handling the interrupt. We do
C-state timing in acpi idle/intel idle and enable interrupts and thats
where interrupt will get handled. My concern is calling the interrupt
and bh immediatley after mwait exit will inflate C-state residency
timings.

Thanks,
Venki

2012-02-28 07:13:13

by Yong Zhang

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, Feb 27, 2012 at 09:05:27AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 27, 2012 at 05:06:46PM +0800, Yong Zhang wrote:
> > On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > > > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > > > > Why not scheduler_ipi()?
> > > > >
> > > > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > > > thread context, I though we could avoid enter/exit.
> > > >
> > > > It seems we could not.
> > > > At least RCU need it, see commit c5d753a55, otherwise we will get
> > > > warning like 'RCU used illegally from extended quiescent state!'
> > >
> > > If the use is tracing, then Steven Rostedt's patchset plus use of his
> > > _rcuidle() tracing variants handles this:
> > >
> > > https://lkml.org/lkml/2012/2/7/231
> > >
> > > If this is instead algorithmic use of RCU, a set of patches I have queued
> > > up for 3.4 will be required.
> >
> > scheduler_ipi() doing more than tracing. Will look at your patches :)
>
> Ah! The key question is whether or not the code in question is called
> both from idle and from non-idle.

In fact before this patch from Venki, the only call site of scheduler_ipi()
is resched irq handler. Then Venki introduce __scheduler_ipi()(which avoid
irq_enter()/irq_exit()) into cpu_idle(). So the answer is yes.

But when I was testing this patch, I didn't see explicit warning on
illegal rcu usage. The reason maybe 1) there are no much rcu dereference
in scheduler_ipi(), but we indeed do tracing in it; 2) rq->lock provide
some kind of protection.
Maybe I'm overstraining, but it is potential danger.

But anyway, it's not an issue anymore since Venki removed __scheduler_ipi()
in his latest version.

> This will be easiest if the code is
> called only from idle, in which case you should only need this one:
>
> https://lkml.org/lkml/2012/2/3/498

Hmm... Yeah, RCU_NONIDLE() could survive IMHO :)

Thanks,
Yong

2012-02-28 13:07:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Tue, Feb 28, 2012 at 03:12:55PM +0800, Yong Zhang wrote:
> On Mon, Feb 27, 2012 at 09:05:27AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 27, 2012 at 05:06:46PM +0800, Yong Zhang wrote:
> > > On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > > > > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > > > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > > > > > Why not scheduler_ipi()?
> > > > > >
> > > > > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > > > > thread context, I though we could avoid enter/exit.
> > > > >
> > > > > It seems we could not.
> > > > > At least RCU need it, see commit c5d753a55, otherwise we will get
> > > > > warning like 'RCU used illegally from extended quiescent state!'
> > > >
> > > > If the use is tracing, then Steven Rostedt's patchset plus use of his
> > > > _rcuidle() tracing variants handles this:
> > > >
> > > > https://lkml.org/lkml/2012/2/7/231
> > > >
> > > > If this is instead algorithmic use of RCU, a set of patches I have queued
> > > > up for 3.4 will be required.
> > >
> > > scheduler_ipi() doing more than tracing. Will look at your patches :)
> >
> > Ah! The key question is whether or not the code in question is called
> > both from idle and from non-idle.
>
> In fact before this patch from Venki, the only call site of scheduler_ipi()
> is resched irq handler. Then Venki introduce __scheduler_ipi()(which avoid
> irq_enter()/irq_exit()) into cpu_idle(). So the answer is yes.

Ah, that explains why I didn't see it in my testing. ;-)

> But when I was testing this patch, I didn't see explicit warning on
> illegal rcu usage. The reason maybe 1) there are no much rcu dereference
> in scheduler_ipi(), but we indeed do tracing in it; 2) rq->lock provide
> some kind of protection.
> Maybe I'm overstraining, but it is potential danger.

Did you have CONFIG_PROVE_RCU=y when testing?

> But anyway, it's not an issue anymore since Venki removed __scheduler_ipi()
> in his latest version.

OK.

> > This will be easiest if the code is
> > called only from idle, in which case you should only need this one:
> >
> > https://lkml.org/lkml/2012/2/3/498
>
> Hmm... Yeah, RCU_NONIDLE() could survive IMHO :)

Seems like it might be needed sooner rather than later...

Thanx, Paul

2012-02-29 06:40:46

by Yong Zhang

[permalink] [raw]
Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Tue, Feb 28, 2012 at 05:05:52AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 28, 2012 at 03:12:55PM +0800, Yong Zhang wrote:
> > On Mon, Feb 27, 2012 at 09:05:27AM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 27, 2012 at 05:06:46PM +0800, Yong Zhang wrote:
> > > > On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> > > > > On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > > > > > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > > > > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > > > > > > Why not scheduler_ipi()?
> > > > > > >
> > > > > > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > > > > > thread context, I though we could avoid enter/exit.
> > > > > >
> > > > > > It seems we could not.
> > > > > > At least RCU need it, see commit c5d753a55, otherwise we will get
> > > > > > warning like 'RCU used illegally from extended quiescent state!'
> > > > >
> > > > > If the use is tracing, then Steven Rostedt's patchset plus use of his
> > > > > _rcuidle() tracing variants handles this:
> > > > >
> > > > > https://lkml.org/lkml/2012/2/7/231
> > > > >
> > > > > If this is instead algorithmic use of RCU, a set of patches I have queued
> > > > > up for 3.4 will be required.
> > > >
> > > > scheduler_ipi() doing more than tracing. Will look at your patches :)
> > >
> > > Ah! The key question is whether or not the code in question is called
> > > both from idle and from non-idle.
> >
> > In fact before this patch from Venki, the only call site of scheduler_ipi()
> > is resched irq handler. Then Venki introduce __scheduler_ipi()(which avoid
> > irq_enter()/irq_exit()) into cpu_idle(). So the answer is yes.
>
> Ah, that explains why I didn't see it in my testing. ;-)
>
> > But when I was testing this patch, I didn't see explicit warning on
> > illegal rcu usage. The reason maybe 1) there are no much rcu dereference
> > in scheduler_ipi(), but we indeed do tracing in it; 2) rq->lock provide
> > some kind of protection.
> > Maybe I'm overstraining, but it is potential danger.
>
> Did you have CONFIG_PROVE_RCU=y when testing?

Yeah.

> zgrep PROVE /proc/config.gz
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_REPEATEDLY=y

>
> > But anyway, it's not an issue anymore since Venki removed __scheduler_ipi()
> > in his latest version.
>
> OK.
>
> > > This will be easiest if the code is
> > > called only from idle, in which case you should only need this one:
> > >
> > > https://lkml.org/lkml/2012/2/3/498
> >
> > Hmm... Yeah, RCU_NONIDLE() could survive IMHO :)
>
> Seems like it might be needed sooner rather than later...

;-)

Thanks,
Yong