LinuxLists.cc - [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Thu, 2012-02-23 at 08:50 +0100, Ingo Molnar wrote:
> * Venkatesh Pallipadi <[email protected]> wrote:

> > * We can also eliminate TS_POLLING flag in favor of this. But, that will have
> > a lot more touchpoints and better done as a standlone change.
>
> Should most definitely be done for this series to be acceptble -
> as a preparatory patch in the series, with the feature at the
> end of the series.

I don't think you can, TS_POLLING still works for idle=poll, whereas
this new stuff requires monitor/mwait.

2012-02-23 09:31:05

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Wed, 2012-02-22 at 16:36 -0800, Venkatesh Pallipadi wrote:

> Changes since previous versions:

> Moved the changes into arch specific code as per PeterZ suggestion

You failed:

> include/linux/sched.h | 2 +
> kernel/sched/core.c | 13 ++++++

> diff --git a/arch/x86/include/asm/ipiless_poke.h b/arch/x86/include/asm/ipiless_poke.h
> new file mode 100644
> index 0000000..58670c7
> --- /dev/null
> +++ b/arch/x86/include/asm/ipiless_poke.h
> @@ -0,0 +1,82 @@
> +#ifndef _ASM_X86_IPILESS_POKE_H
> +#define _ASM_X86_IPILESS_POKE_H
> +
> +#include <linux/sched.h>
> +#include <asm/thread_info.h>
> +
> +#ifdef CONFIG_SMP
> +
> +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
> +
> +/*
> + * Use 2 bits in idle_task's thead info flags:
> + * TIF_IPILESS_IDLE marks enter to and exit from idle states with ipiless
> + * wakeup capability.
> + * TIF_IPI_PENDING set by IPI source CPU if it finds that the IPI target CPU
> + * is in TIF_IPILESS_IDLE state (and TIF_IPI_PENDING is not already set).
> + * Setting of TIF_IPI_PENDING bit brings the target CPU out of idle state.
> + */
> +
> +static inline void ipiless_idle_enter(void)
> +{
> + set_thread_flag(TIF_IPILESS_IDLE);
> +}
> +
> +static inline void ipiless_idle_exit(void)
> +{
> + clear_thread_flag(TIF_IPILESS_IDLE);
> +}
> +
> +static inline int is_ipi_pending(void)
> +{
> + return unlikely(test_thread_flag(TIF_IPI_PENDING));
> +}
> +
> +static inline int need_wakeup(void)
> +{
> + return need_resched() || is_ipi_pending();
> +}
> +
> +static inline void ipiless_pending_work(void)
> +{
> + if (is_ipi_pending()) {
> + clear_thread_flag(TIF_IPI_PENDING);
> + local_bh_disable();
> + local_irq_disable();

That local_bh_disable() is completely superfluous, disabling IRQs
already kills bh.

> + generic_smp_call_function_single_interrupt();
> + __scheduler_ipi();

Why not scheduler_ipi()?

Also, you could keep a pending vector bitmask in a per-cpu variable to
avoid having to call all handlers. not sure that's worth it for just
two, but something to keep in mind for if/when this starts expanding.

> + local_irq_enable();
> + local_bh_enable();
> + }
> +}

Why doesn't ipiless_idle_exit() call this? That would keep it isolated
to just those idle routines that actually use mwait instead of bothering
the generic idle paths with this.

> +static inline int ipiless_magic_poke(int cpu)
> +{
> + int val;
> + atomic_t *idle_flag = per_cpu(idle_task_ti_flags, cpu);

What's this atomic_t nonsense about? thread_info::flags is __u32,
casting it to atomic_t is complete rubbish.

> +
> + val = atomic_read(idle_flag);

The __u32 version would look like: val = ACCESS_ONCE(*idle_flag);

> + if (unlikely(val & _TIF_IPI_PENDING))
> + return 1;
> +
> + if (!(val & _TIF_IPILESS_IDLE))
> + return 0;
> +
> + if (val == atomic_cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))

The __u32 version would look like:

if (val == cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))

Bonus win for being shorter!

> + return 1;
> +
> + return 0;
> +}
> +
> +#else
> +static inline void ipiless_pending_work(void) { }
> +static inline void ipiless_idle_enter(void) { }
> +static inline void ipiless_idle_exit(void) { }
> +
> +static inline int need_wakeup(void)
> +{
> + return need_resched();
> +}
> +#endif
> +
> +#endif /* _ASM_X86_IPILESS_POKE_H */

> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index a8ff227..e66a4c8 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -27,6 +27,7 @@
> #include <asm/mtrr.h>
> #include <asm/tlbflush.h>
> #include <asm/mmu_context.h>
> +#include <asm/ipiless_poke.h>
> #include <asm/proto.h>
> #include <asm/apic.h>
> #include <asm/nmi.h>
> @@ -109,6 +110,14 @@
> * about nothing of note with C stepping upwards.
> */
>
> +DEFINE_PER_CPU(atomic_t *, idle_task_ti_flags);
> +
> +void __cpuinit thread_idle_state_setup(struct task_struct *idle, int cpu)
> +{
> + per_cpu(idle_task_ti_flags, cpu) =
> + (atomic_t *)(&(task_thread_info(idle)->flags));
> +}

As Ingo already pointed out, its the architecture that actually sets up
the idle threads, so putting callbacks into the generic code to find
them is kinda backwards.

Again, *yuck* at that atomic_t business.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5255c9d..1558316 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1451,6 +1451,14 @@ static void sched_ttwu_pending(void)
> raw_spin_unlock(&rq->lock);
> }
>
> +void __scheduler_ipi(void)
> +{
> + if (llist_empty(&this_rq()->wake_list))
> + return;
> +
> + sched_ttwu_pending();
> +}

FAIL!! It should be 100% identical to a normal IPI, that calls
scheduler_ipi() so this should too.

> void scheduler_ipi(void)
> {
> if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> @@ -4827,6 +4835,10 @@ void __cpuinit init_idle_bootup_task(struct task_struct *idle)
> idle->sched_class = &idle_sched_class;
> }
>
> +void __cpuinit __weak thread_idle_state_setup(struct task_struct *idle, int cpu)
> +{
> +}
> +
> /**
> * init_idle - set up an idle thread for a given CPU
> * @idle: task in question
> @@ -4869,6 +4881,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
>
> /* Set the preempt count _outside_ the spinlocks! */
> task_thread_info(idle)->preempt_count = 0;
> + thread_idle_state_setup(idle, cpu);

I suggest you put this in smpboot.c someplace ;-)

2012-02-23 19:34:13

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Wed, 2012-02-22 at 16:36 -0800, Venkatesh Pallipadi wrote:
>
> > Changes since previous versions:
>
> > ? Moved the changes into arch specific code as per PeterZ suggestion
>
> You failed:
>
> > ?include/linux/sched.h ? ? ? ? ? ? ? | ? ?2 +
> > ?kernel/sched/core.c ? ? ? ? ? ? ? ? | ? 13 ++++++
>
>
>
> > diff --git a/arch/x86/include/asm/ipiless_poke.h
> > b/arch/x86/include/asm/ipiless_poke.h
> > new file mode 100644
> > index 0000000..58670c7
> > --- /dev/null
> > +++ b/arch/x86/include/asm/ipiless_poke.h
> > @@ -0,0 +1,82 @@
> > +#ifndef _ASM_X86_IPILESS_POKE_H
> > +#define _ASM_X86_IPILESS_POKE_H
> > +
> > +#include <linux/sched.h>
> > +#include <asm/thread_info.h>
> > +
> > +#ifdef CONFIG_SMP
> > +
> > +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
> > +
> > +/*
> > + * Use 2 bits in idle_task's thead info flags:
> > + * TIF_IPILESS_IDLE marks enter to and exit from idle states with
> > ipiless
> > + * wakeup capability.
> > + * TIF_IPI_PENDING set by IPI source CPU if it finds that the IPI
> > target CPU
> > + * is in TIF_IPILESS_IDLE state (and TIF_IPI_PENDING is not already
> > set).
> > + * Setting of TIF_IPI_PENDING bit brings the target CPU out of idle
> > state.
> > + */
> > +
> > +static inline void ipiless_idle_enter(void)
> > +{
> > + ? ? set_thread_flag(TIF_IPILESS_IDLE);
> > +}
> > +
> > +static inline void ipiless_idle_exit(void)
> > +{
> > + ? ? clear_thread_flag(TIF_IPILESS_IDLE);
> > +}
> > +
> > +static inline int is_ipi_pending(void)
> > +{
> > + ? ? return unlikely(test_thread_flag(TIF_IPI_PENDING));
> > +}
> > +
> > +static inline int need_wakeup(void)
> > +{
> > + ? ? return need_resched() || is_ipi_pending();
> > +}
> > +
> > +static inline void ipiless_pending_work(void)
> > +{
> > + ? ? if (is_ipi_pending()) {
> > + ? ? ? ? ? ? clear_thread_flag(TIF_IPI_PENDING);
> > + ? ? ? ? ? ? local_bh_disable();
> > + ? ? ? ? ? ? local_irq_disable();
>
> That local_bh_disable() is completely superfluous, disabling IRQs
> already kills bh.

The reason for local_bh_disable/enable was to check for potential
softirqs after these interrupt.

>
> > + ? ? ? ? ? ? generic_smp_call_function_single_interrupt();
> > + ? ? ? ? ? ? __scheduler_ipi();
>
> Why not scheduler_ipi()?

Was trying to avoid irq_enter/exit. As the work here is done in idle
thread context, I though we could avoid enter/exit. Also, if we need
it, we should ideally do it once across scheduler and smp_call work
together instead of in scheduler_ipi().

>
> Also, you could keep a pending vector bitmask in a per-cpu variable to
> avoid having to call all handlers. not sure that's worth it for just
> two, but something to keep in mind for if/when this starts expanding.
>

Agree. For anything more than two, it will be better to have an
additional bitmask.

> > + ? ? ? ? ? ? local_irq_enable();
> > + ? ? ? ? ? ? local_bh_enable();
> > + ? ? }
> > +}
>
> Why doesn't ipiless_idle_exit() call this? That would keep it isolated
> to just those idle routines that actually use mwait instead of bothering
> the generic idle paths with this.

ipiless_idle_exit is called in the inner most part of idle entry exit.
In mwait case we may not even have interrupts enabled at that time and
there is code that does idle residency timing etc which will get
impacted if we start doing the pending ipi work there. Doing it at
higher level along with things like enter-exit tickless felt nicer.

>
> > +static inline int ipiless_magic_poke(int cpu)
> > +{
> > + ? ? int val;
> > + ? ? atomic_t *idle_flag = per_cpu(idle_task_ti_flags, cpu);
>
> What's this atomic_t nonsense about? thread_info::flags is __u32,
> casting it to atomic_t is complete rubbish.
>

I have to say I was not a big fan of that typecast as well. I just did
not know about ACCESS_ONCE() interface.
Thanks for the tips below. This is the first thing I am going to
change in this patch.

> > +
> > + ? ? val = atomic_read(idle_flag);
>
> The __u32 version would look like: val = ACCESS_ONCE(*idle_flag);
>
> > + ? ? if (unlikely(val & _TIF_IPI_PENDING))
> > + ? ? ? ? ? ? return 1;
> > +
> > + ? ? if (!(val & _TIF_IPILESS_IDLE))
> > + ? ? ? ? ? ? return 0;
> > +
> > + ? ? if (val == atomic_cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))
>
> The __u32 version would look like:
>
> ?if (val == cmpxchg(idle_flag, val, val | _TIF_IPI_PENDING))
>
> Bonus win for being shorter!
>
> > + ? ? ? ? ? ? return 1;
> > +
> > + ? ? return 0;
> > +}
> > +
> > +#else
> > +static inline void ipiless_pending_work(void) { }
> > +static inline void ipiless_idle_enter(void) { }
> > +static inline void ipiless_idle_exit(void) { }
> > +
> > +static inline int need_wakeup(void)
> > +{
> > + ? ? return need_resched();
> > +}
> > +#endif
> > +
> > +#endif /* _ASM_X86_IPILESS_POKE_H */
>
>
> > diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> > index a8ff227..e66a4c8 100644
> > --- a/arch/x86/kernel/smp.c
> > +++ b/arch/x86/kernel/smp.c
> > @@ -27,6 +27,7 @@
> > ?#include <asm/mtrr.h>
> > ?#include <asm/tlbflush.h>
> > ?#include <asm/mmu_context.h>
> > +#include <asm/ipiless_poke.h>
> > ?#include <asm/proto.h>
> > ?#include <asm/apic.h>
> > ?#include <asm/nmi.h>
> > @@ -109,6 +110,14 @@
> > ? * ? about nothing of note with C stepping upwards.
> > ? */
> >
> > +DEFINE_PER_CPU(atomic_t *, idle_task_ti_flags);
> > +
> > +void __cpuinit thread_idle_state_setup(struct task_struct *idle, int
> > cpu)
> > +{
> > + ? ? per_cpu(idle_task_ti_flags, cpu) =
> > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? (atomic_t
> > *)(&(task_thread_info(idle)->flags));
> > +}
>
> As Ingo already pointed out, its the architecture that actually sets up
> the idle threads, so putting callbacks into the generic code to find
> them is kinda backwards.

Yes. I need to spend a bit more time looking at cleanups Ingo suggested.

>
> Again, *yuck* at that atomic_t business.

They are gone now...

>
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 5255c9d..1558316 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1451,6 +1451,14 @@ static void sched_ttwu_pending(void)
> > ? ? ? raw_spin_unlock(&rq->lock);
> > ?}
> >
> > +void __scheduler_ipi(void)
> > +{
> > + ? ? if (llist_empty(&this_rq()->wake_list))
> > + ? ? ? ? ? ? return;
> > +
> > + ? ? sched_ttwu_pending();
> > +}
>
> FAIL!! It should be 100% identical to a normal IPI, that calls
> scheduler_ipi() so this should too.
>
> > ?void scheduler_ipi(void)
> > ?{
> > ? ? ? if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> > @@ -4827,6 +4835,10 @@ void __cpuinit init_idle_bootup_task(struct
> > task_struct *idle)
> > ? ? ? idle->sched_class = &idle_sched_class;
> > ?}
> >
> > +void __cpuinit __weak thread_idle_state_setup(struct task_struct *idle,
> > int cpu)
> > +{
> > +}
> > +
> > ?/**
> > ? * init_idle - set up an idle thread for a given CPU
> > ? * @idle: task in question
> > @@ -4869,6 +4881,7 @@ void __cpuinit init_idle(struct task_struct *idle,
> > int cpu)
> >
> > ? ? ? /* Set the preempt count _outside_ the spinlocks! */
> > ? ? ? task_thread_info(idle)->preempt_count = 0;
> > + ? ? thread_idle_state_setup(idle, cpu);
>
> I suggest you put this in smpboot.c someplace ;-)

Thanks,
Venki

2012-02-23 20:03:50

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

[ Resending without the ugly email client formatting ]

On Wed, Feb 22, 2012 at 11:50 PM, Ingo Molnar <[email protected]> wrote:
>
> * Venkatesh Pallipadi <[email protected]> wrote:
>
>> * Do we need some accounting for these wakeups exported for powertop?
>
> If then tracepoints.
>
>> * We can also eliminate TS_POLLING flag in favor of this. But, that will have
>> ? a lot more touchpoints and better done as a standlone change.
>
> Should most definitely be done for this series to be acceptble -
> as a preparatory patch in the series, with the feature at the
> end of the series.
>
OK. Will look at TS_POLLING part and likely include it in the next resend.

>> +DECLARE_PER_CPU(atomic_t *, idle_task_ti_flags);
>
> That's ugly, we should access the idle task's ti flags directly.
>
> To have efficient percpu access to the idle threads another
> clean-up is needed: we should turn idle_thread_array into a
> full-structure PER_CPU area.
>
> For that we need a small variant of fork_idle(), which does not
> dup the init thread - pretty trivial.
>
> fork_idle() should also make sure it does not schedule the child
> thread: thus we'd also be able to further simplify smpboot.c and
> get rid of all that extremely ugly 'struct create_idle'
> gymnastics in smpboot.c.
>

Hmm. Not being very familiar with that code, I will have to take a closer
look at this potential cleanup...

Thanks,
Venki

> Thanks,
>
> ? ? ? ?Ingo

2012-02-23 20:04:48

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

[ Resending without the ugly email client formatting ]

On Thu, Feb 23, 2012 at 1:08 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-02-23 at 08:50 +0100, Ingo Molnar wrote:
>> * Venkatesh Pallipadi <[email protected]> wrote:
>
>> > * We can also eliminate TS_POLLING flag in favor of this. But, that will have
>> > ? a lot more touchpoints and better done as a standlone change.
>>
>> Should most definitely be done for this series to be acceptble -
>> as a preparatory patch in the series, with the feature at the
>> end of the series.
>
> I don't think you can, TS_POLLING still works for idle=poll, whereas
> this new stuff requires monitor/mwait.
>

Even poll_idle should be able to use this ipiless_wakeup. It just have
to poll for an additional IPI_PENDING bit along with need_resched.

2012-02-24 05:42:09

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > Why not scheduler_ipi()?
>
> Was trying to avoid irq_enter/exit. As the work here is done in idle
> thread context, I though we could avoid enter/exit.

It seems we could not.
At least RCU need it, see commit c5d753a55, otherwise we will get
warning like 'RCU used illegally from extended quiescent state!'

Thanks,
Yong

2012-02-24 06:14:10

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > Why not scheduler_ipi()?
> >
> > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > thread context, I though we could avoid enter/exit.
>
> It seems we could not.
> At least RCU need it, see commit c5d753a55, otherwise we will get
> warning like 'RCU used illegally from extended quiescent state!'

[Off topic]
This remind me that we should have moved the irq_enter()/irq_exit() to
each arch's related irq handler.
see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2

So Peter, is there someone alread on it? or it still worth doing now?

Thanks,
Yong

2012-02-26 01:33:00

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > Why not scheduler_ipi()?
> >
> > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > thread context, I though we could avoid enter/exit.
>
> It seems we could not.
> At least RCU need it, see commit c5d753a55, otherwise we will get
> warning like 'RCU used illegally from extended quiescent state!'

If the use is tracing, then Steven Rostedt's patchset plus use of his
_rcuidle() tracing variants handles this:

https://lkml.org/lkml/2012/2/7/231

If this is instead algorithmic use of RCU, a set of patches I have queued
up for 3.4 will be required.

Thanx, Paul

2012-02-27 08:38:11

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:

> This remind me that we should have moved the irq_enter()/irq_exit() to
> each arch's related irq handler.
> see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
>
> So Peter, is there someone alread on it? or it still worth doing now?

trouble was that some people didn't feel comfortable adding that
overhead to plain resched IPIs that didn't need to do anything. So I
more or less let it sit for a while.

2012-02-27 08:46:07

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Thu, 2012-02-23 at 11:34 -0800, Venki Pallipadi wrote:

> > > + local_bh_disable();
> > > + local_irq_disable();
> >
> > That local_bh_disable() is completely superfluous, disabling IRQs
> > already kills bh.
>
> The reason for local_bh_disable/enable was to check for potential
> softirqs after these interrupt.

Why is that needed? We never wrap local_irq_disable() in bh muck?

> > Why doesn't ipiless_idle_exit() call this? That would keep it isolated
> > to just those idle routines that actually use mwait instead of bothering
> > the generic idle paths with this.
>
> ipiless_idle_exit is called in the inner most part of idle entry exit.
> In mwait case we may not even have interrupts enabled at that time and
> there is code that does idle residency timing etc which will get
> impacted if we start doing the pending ipi work there. Doing it at
> higher level along with things like enter-exit tickless felt nicer.

But but but, an actual interrupt can be handled before the instruction
after mwait, so why would this be a problem?

2012-02-27 09:07:01

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > > Why not scheduler_ipi()?
> > >
> > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > thread context, I though we could avoid enter/exit.
> >
> > It seems we could not.
> > At least RCU need it, see commit c5d753a55, otherwise we will get
> > warning like 'RCU used illegally from extended quiescent state!'
>
> If the use is tracing, then Steven Rostedt's patchset plus use of his
> _rcuidle() tracing variants handles this:
>
> https://lkml.org/lkml/2012/2/7/231
>
> If this is instead algorithmic use of RCU, a set of patches I have queued
> up for 3.4 will be required.

scheduler_ipi() doing more than tracing. Will look at your patches :)

Thanks,
Yong

2012-02-27 09:08:32

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, Feb 27, 2012 at 09:38:01AM +0100, Peter Zijlstra wrote:
> On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:
>
> > This remind me that we should have moved the irq_enter()/irq_exit() to
> > each arch's related irq handler.
> > see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
> >
> > So Peter, is there someone alread on it? or it still worth doing now?
>
> trouble was that some people didn't feel comfortable adding that
> overhead to plain resched IPIs that didn't need to do anything.

Any ideas on where the plain resched IPIs come from?

Thanks,
Yong

> So I
> more or less let it sit for a while.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Only stand for myself

2012-02-27 09:31:11

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, 2012-02-27 at 17:08 +0800, Yong Zhang wrote:
> On Mon, Feb 27, 2012 at 09:38:01AM +0100, Peter Zijlstra wrote:
> > On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:
> >
> > > This remind me that we should have moved the irq_enter()/irq_exit() to
> > > each arch's related irq handler.
> > > see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
> > >
> > > So Peter, is there someone alread on it? or it still worth doing now?
> >
> > trouble was that some people didn't feel comfortable adding that
> > overhead to plain resched IPIs that didn't need to do anything.
>
> Any ideas on where the plain resched IPIs come from?

remote wakeups that don't queue I guess.

IIRC I looked at it a while back and while not in the majority there
were still a few (I only added counters, didn't look where they came
from).

However since 518cd623 there'd be a lot more I guess..

Feel free to investigate. Also, some non-sched users of the resched ipi
exist, eg. KVM uses it to kick a remote vcpu out from guest mode.

2012-02-27 09:51:22

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, Feb 27, 2012 at 10:30:58AM +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-27 at 17:08 +0800, Yong Zhang wrote:
> > On Mon, Feb 27, 2012 at 09:38:01AM +0100, Peter Zijlstra wrote:
> > > On Fri, 2012-02-24 at 14:13 +0800, Yong Zhang wrote:
> > >
> > > > This remind me that we should have moved the irq_enter()/irq_exit() to
> > > > each arch's related irq handler.
> > > > see: http://marc.info/?l=linux-kernel&m=130709505700821&w=2
> > > >
> > > > So Peter, is there someone alread on it? or it still worth doing now?
> > >
> > > trouble was that some people didn't feel comfortable adding that
> > > overhead to plain resched IPIs that didn't need to do anything.
> >
> > Any ideas on where the plain resched IPIs come from?
>
> remote wakeups that don't queue I guess.
>
> IIRC I looked at it a while back and while not in the majority there
> were still a few (I only added counters, didn't look where they came
> from).
>
> However since 518cd623 there'd be a lot more I guess..
>
> Feel free to investigate. Also, some non-sched users of the resched ipi
> exist, eg. KVM uses it to kick a remote vcpu out from guest mode.

Thanks for the detail.

Will try to find if there is something interesting.

Thanks,
Yong

2012-02-27 17:09:10

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, Feb 27, 2012 at 05:06:46PM +0800, Yong Zhang wrote:
> On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > > > Why not scheduler_ipi()?
> > > >
> > > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > > thread context, I though we could avoid enter/exit.
> > >
> > > It seems we could not.
> > > At least RCU need it, see commit c5d753a55, otherwise we will get
> > > warning like 'RCU used illegally from extended quiescent state!'
> >
> > If the use is tracing, then Steven Rostedt's patchset plus use of his
> > _rcuidle() tracing variants handles this:
> >
> > https://lkml.org/lkml/2012/2/7/231
> >
> > If this is instead algorithmic use of RCU, a set of patches I have queued
> > up for 3.4 will be required.
>
> scheduler_ipi() doing more than tracing. Will look at your patches :)

Ah! The key question is whether or not the code in question is called
both from idle and from non-idle. This will be easiest if the code is
called only from idle, in which case you should only need this one:

https://lkml.org/lkml/2012/2/3/498

Thanx, Paul

2012-02-27 18:17:49

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, Feb 27, 2012 at 12:45 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-02-23 at 11:34 -0800, Venki Pallipadi wrote:
>
>> > > + ? ? ? ? ? ? local_bh_disable();
>> > > + ? ? ? ? ? ? local_irq_disable();
>> >
>> > That local_bh_disable() is completely superfluous, disabling IRQs
>> > already kills bh.
>>
>> The reason for local_bh_disable/enable was to check for potential
>> softirqs after these interrupt.
>
> Why is that needed? We never wrap local_irq_disable() in bh muck?
>

For normal interrupts, we check for bh on interrutp return. For idle
loop no interrupt case, we will have to call bh handler explicitly as
otherwise we will go back to idle until need_resched().
local_bh_enable() handles this bh call here.

>> > Why doesn't ipiless_idle_exit() call this? That would keep it isolated
>> > to just those idle routines that actually use mwait instead of bothering
>> > the generic idle paths with this.
>>
>> ipiless_idle_exit is called in the inner most part of idle entry exit.
>> In mwait case we may not even have interrupts enabled at that time and
>> there is code that does idle residency timing etc which will get
>> impacted if we start doing the pending ipi work there. Doing it at
>> higher level along with things like enter-exit tickless felt nicer.
>
> But but but, an actual interrupt can be handled before the instruction
> after mwait, so why would this be a problem?
>

With most commom usage of mwait(), interrupts are disabled on entry
and exit from mwait(). There is a special flag that brings CPU out of
mwait on interrupt, without actually handling the interrupt. We do
C-state timing in acpi idle/intel idle and enable interrupts and thats
where interrupt will get handled. My concern is calling the interrupt
and bh immediatley after mwait exit will inflate C-state residency
timings.

Thanks,
Venki

2012-02-28 07:13:13

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Mon, Feb 27, 2012 at 09:05:27AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 27, 2012 at 05:06:46PM +0800, Yong Zhang wrote:
> > On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > > > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > > > > Why not scheduler_ipi()?
> > > > >
> > > > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > > > thread context, I though we could avoid enter/exit.
> > > >
> > > > It seems we could not.
> > > > At least RCU need it, see commit c5d753a55, otherwise we will get
> > > > warning like 'RCU used illegally from extended quiescent state!'
> > >
> > > If the use is tracing, then Steven Rostedt's patchset plus use of his
> > > _rcuidle() tracing variants handles this:
> > >
> > > https://lkml.org/lkml/2012/2/7/231
> > >
> > > If this is instead algorithmic use of RCU, a set of patches I have queued
> > > up for 3.4 will be required.
> >
> > scheduler_ipi() doing more than tracing. Will look at your patches :)
>
> Ah! The key question is whether or not the code in question is called
> both from idle and from non-idle.

In fact before this patch from Venki, the only call site of scheduler_ipi()
is resched irq handler. Then Venki introduce __scheduler_ipi()(which avoid
irq_enter()/irq_exit()) into cpu_idle(). So the answer is yes.

But when I was testing this patch, I didn't see explicit warning on
illegal rcu usage. The reason maybe 1) there are no much rcu dereference
in scheduler_ipi(), but we indeed do tracing in it; 2) rq->lock provide
some kind of protection.
Maybe I'm overstraining, but it is potential danger.

But anyway, it's not an issue anymore since Venki removed __scheduler_ipi()
in his latest version.

> This will be easiest if the code is
> called only from idle, in which case you should only need this one:
>
> https://lkml.org/lkml/2012/2/3/498

Hmm... Yeah, RCU_NONIDLE() could survive IMHO :)

Thanks,
Yong

2012-02-28 13:07:12

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v1

On Tue, Feb 28, 2012 at 03:12:55PM +0800, Yong Zhang wrote:
> On Mon, Feb 27, 2012 at 09:05:27AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 27, 2012 at 05:06:46PM +0800, Yong Zhang wrote:
> > > On Sat, Feb 25, 2012 at 05:32:53PM -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 24, 2012 at 01:41:50PM +0800, Yong Zhang wrote:
> > > > > On Thu, Feb 23, 2012 at 11:34:11AM -0800, Venki Pallipadi wrote:
> > > > > > On Thu, Feb 23, 2012 at 1:30 AM, Peter Zijlstra <[email protected]> wrote:
> > > > > > > Why not scheduler_ipi()?
> > > > > >
> > > > > > Was trying to avoid irq_enter/exit. As the work here is done in idle
> > > > > > thread context, I though we could avoid enter/exit.
> > > > >
> > > > > It seems we could not.
> > > > > At least RCU need it, see commit c5d753a55, otherwise we will get
> > > > > warning like 'RCU used illegally from extended quiescent state!'
> > > >
> > > > If the use is tracing, then Steven Rostedt's patchset plus use of his
> > > > _rcuidle() tracing variants handles this:
> > > >
> > > > https://lkml.org/lkml/2012/2/7/231
> > > >
> > > > If this is instead algorithmic use of RCU, a set of patches I have queued
> > > > up for 3.4 will be required.
> > >
> > > scheduler_ipi() doing more than tracing. Will look at your patches :)
> >
> > Ah! The key question is whether or not the code in question is called
> > both from idle and from non-idle.
>
> In fact before this patch from Venki, the only call site of scheduler_ipi()
> is resched irq handler. Then Venki introduce __scheduler_ipi()(which avoid
> irq_enter()/irq_exit()) into cpu_idle(). So the answer is yes.

Ah, that explains why I didn't see it in my testing. ;-)

> But when I was testing this patch, I didn't see explicit warning on
> illegal rcu usage. The reason maybe 1) there are no much rcu dereference
> in scheduler_ipi(), but we indeed do tracing in it; 2) rq->lock provide
> some kind of protection.
> Maybe I'm overstraining, but it is potential danger.

Did you have CONFIG_PROVE_RCU=y when testing?

> But anyway, it's not an issue anymore since Venki removed __scheduler_ipi()
> in his latest version.

OK.

> > This will be easiest if the code is
> > called only from idle, in which case you should only need this one:
> >
> > https://lkml.org/lkml/2012/2/3/498
>
> Hmm... Yeah, RCU_NONIDLE() could survive IMHO :)

Seems like it might be needed sooner rather than later...

Thanx, Paul

2012-02-29 06:40:46