2011-05-24 18:12:30

by Marc Zyngier

[permalink] [raw]
Subject: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

Peter,

I've experienced all kind of lock-ups on ARM SMP platforms recently, and
finally tracked it down to the following patch:

e4a52bcb9a18142d79e231b6733cabdbf2e67c1f [sched: Remove rq->lock from the first half of ttwu()].

Even on moderate load, the machine locks up, often silently, and
sometimes with a few messages like:
INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 1, t=12002 jiffies)

Another side effect of this patch is that the load average is always 0,
whatever load I throw at the system.

Reverting the sched changes up to that patch (included) gives me a
working system again, which happily survives parallel kernel
compilations without complaining.

My knowledge of the scheduler being rather limited, I haven't been able
to pinpoint the exact problem (though it probably have something to do
with __ARCH_WANT_INTERRUPTS_ON_CTXSW being defined on ARM). The enclosed
patch somehow papers over the load average problem, but the system ends
up locking up anyway:

diff --git a/kernel/sched.c b/kernel/sched.c
index d3ade54..5ab43c4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2526,8 +2526,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* to spin on ->on_cpu if p is current, since that would
* deadlock.
*/
- if (p == current)
+ if (p == current) {
+ p->sched_contributes_to_load = !!task_contributes_to_load(p);
+ p->state = TASK_WAKING;
+ if (p->sched_class->task_waking)
+ p->sched_class->task_waking(p);
goto out_activate;
+ }
#endif
cpu_relax();
}

I'd be happy to test any patch you may have.

Cheers,

M.
--
Reality is an implementation detail.


2011-05-24 21:29:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Tue, 2011-05-24 at 19:13 +0100, Marc Zyngier wrote:
> Peter,
>
> I've experienced all kind of lock-ups on ARM SMP platforms recently, and
> finally tracked it down to the following patch:
>
> e4a52bcb9a18142d79e231b6733cabdbf2e67c1f [sched: Remove rq->lock from the first half of ttwu()].
>
> Even on moderate load, the machine locks up, often silently, and
> sometimes with a few messages like:
> INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 1, t=12002 jiffies)
>
> Another side effect of this patch is that the load average is always 0,
> whatever load I throw at the system.
>
> Reverting the sched changes up to that patch (included) gives me a
> working system again, which happily survives parallel kernel
> compilations without complaining.
>
> My knowledge of the scheduler being rather limited, I haven't been able
> to pinpoint the exact problem (though it probably have something to do
> with __ARCH_WANT_INTERRUPTS_ON_CTXSW being defined on ARM). The enclosed
> patch somehow papers over the load average problem, but the system ends
> up locking up anyway:

Hurm.. I'll try and make x86 use __ARCH_WANT_INTERRUPTS_ON_CTXSW, IIRC
Ingo once said that that is possible and try to see if I can reproduce.
No clear ideas atm.

Thanks for reporting.

2011-05-24 21:39:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM


* Peter Zijlstra <[email protected]> wrote:

> On Tue, 2011-05-24 at 19:13 +0100, Marc Zyngier wrote:
> > Peter,
> >
> > I've experienced all kind of lock-ups on ARM SMP platforms recently, and
> > finally tracked it down to the following patch:
> >
> > e4a52bcb9a18142d79e231b6733cabdbf2e67c1f [sched: Remove rq->lock from the first half of ttwu()].
> >
> > Even on moderate load, the machine locks up, often silently, and
> > sometimes with a few messages like:
> > INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 1, t=12002 jiffies)
> >
> > Another side effect of this patch is that the load average is always 0,
> > whatever load I throw at the system.
> >
> > Reverting the sched changes up to that patch (included) gives me a
> > working system again, which happily survives parallel kernel
> > compilations without complaining.
> >
> > My knowledge of the scheduler being rather limited, I haven't been able
> > to pinpoint the exact problem (though it probably have something to do
> > with __ARCH_WANT_INTERRUPTS_ON_CTXSW being defined on ARM). The enclosed
> > patch somehow papers over the load average problem, but the system ends
> > up locking up anyway:
>
> Hurm.. I'll try and make x86 use __ARCH_WANT_INTERRUPTS_ON_CTXSW, IIRC
> Ingo once said that that is possible and try to see if I can reproduce.
> No clear ideas atm.

Yes, should be possible to just disable it on x86 - no further tricks needed.
It's been a long time since i tested that though.

Thanks,

Ingo

2011-05-25 12:22:36

by Marc Zyngier

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Tue, 2011-05-24 at 23:39 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, 2011-05-24 at 19:13 +0100, Marc Zyngier wrote:
> > > Peter,
> > >
> > > I've experienced all kind of lock-ups on ARM SMP platforms recently, and
> > > finally tracked it down to the following patch:
> > >
> > > e4a52bcb9a18142d79e231b6733cabdbf2e67c1f [sched: Remove rq->lock from the first half of ttwu()].
> > >
> > > Even on moderate load, the machine locks up, often silently, and
> > > sometimes with a few messages like:
> > > INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 1, t=12002 jiffies)
> > >
> > > Another side effect of this patch is that the load average is always 0,
> > > whatever load I throw at the system.
> > >
> > > Reverting the sched changes up to that patch (included) gives me a
> > > working system again, which happily survives parallel kernel
> > > compilations without complaining.
> > >
> > > My knowledge of the scheduler being rather limited, I haven't been able
> > > to pinpoint the exact problem (though it probably have something to do
> > > with __ARCH_WANT_INTERRUPTS_ON_CTXSW being defined on ARM). The enclosed
> > > patch somehow papers over the load average problem, but the system ends
> > > up locking up anyway:
> >
> > Hurm.. I'll try and make x86 use __ARCH_WANT_INTERRUPTS_ON_CTXSW, IIRC
> > Ingo once said that that is possible and try to see if I can reproduce.
> > No clear ideas atm.
>
> Yes, should be possible to just disable it on x86 - no further tricks needed.
> It's been a long time since i tested that though.

I can confirm this is SMP only. UP is fine. SMP+nosmp locks up as well.

M.
--
Reality is an implementation detail.

2011-05-25 17:09:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Tue, 2011-05-24 at 23:32 +0200, Peter Zijlstra wrote:
> On Tue, 2011-05-24 at 19:13 +0100, Marc Zyngier wrote:
> > Peter,
> >
> > I've experienced all kind of lock-ups on ARM SMP platforms recently, and
> > finally tracked it down to the following patch:
> >
> > e4a52bcb9a18142d79e231b6733cabdbf2e67c1f [sched: Remove rq->lock from the first half of ttwu()].
> >
> > Even on moderate load, the machine locks up, often silently, and
> > sometimes with a few messages like:
> > INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 1, t=12002 jiffies)
> >
> > Another side effect of this patch is that the load average is always 0,
> > whatever load I throw at the system.
> >
> > Reverting the sched changes up to that patch (included) gives me a
> > working system again, which happily survives parallel kernel
> > compilations without complaining.
> >
> > My knowledge of the scheduler being rather limited, I haven't been able
> > to pinpoint the exact problem (though it probably have something to do
> > with __ARCH_WANT_INTERRUPTS_ON_CTXSW being defined on ARM). The enclosed
> > patch somehow papers over the load average problem, but the system ends
> > up locking up anyway:
>
> Hurm.. I'll try and make x86 use __ARCH_WANT_INTERRUPTS_ON_CTXSW, IIRC
> Ingo once said that that is possible and try to see if I can reproduce.
> No clear ideas atm.

So I checked out that particular commit and build with the below patch
on-top. grep __ARCH_WANT /proc/sched_debug did indeed return those
strings so I'm assuming CPP did its magic and I'm indeed running a
kernel that enables IRQs around context switches.

The sad news however is that a make -j8 (on a dual core) seems to result
in a kernel image, not an oops.

Ooh, shiny, whilst typing this I got an NMI-watchdog error reporting me
that CPU1 got stuck in try_to_wake_up(), so it looks like I can indeed
reproduce some funnies.

/me goes dig in.

---
arch/x86/include/asm/system.h | 2 ++
kernel/sched_debug.c | 7 +++++++
2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/system.h b/arch/x86/include/asm/system.h
index 12569e6..56103bb 100644
--- a/arch/x86/include/asm/system.h
+++ b/arch/x86/include/asm/system.h
@@ -10,6 +10,8 @@
#include <linux/kernel.h>
#include <linux/irqflags.h>

+#define __ARCH_WANT_INTERRUPTS_ON_CTXSW
+
/* entries in ARCH_DLINFO: */
#if defined(CONFIG_IA32_EMULATION) || !defined(CONFIG_X86_64)
# define AT_VECTOR_SIZE_ARCH 2
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 3669bec6..18b4ace 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -335,6 +335,13 @@ static int sched_debug_show(struct seq_file *m, void *v)
(int)strcspn(init_utsname()->version, " "),
init_utsname()->version);

+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+ SEQ_printf(m, "__ARCH_WANT_INTERRUPTS_ON_CTXSW\n");
+#endif
+#ifdef __ARCH_WANT_UNLOCKED_CTXSW
+ SEQ_printf(m, "__ARCH_WANT_UNLOCKED_CTXSW\n");
+#endif
+
#define P(x) \
SEQ_printf(m, "%-40s: %Ld\n", #x, (long long)(x))
#define PN(x) \

2011-05-25 21:16:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Wed, 2011-05-25 at 19:08 +0200, Peter Zijlstra wrote:
> Ooh, shiny, whilst typing this I got an NMI-watchdog error reporting me
> that CPU1 got stuck in try_to_wake_up(), so it looks like I can indeed
> reproduce some funnies.
>
> /me goes dig in.

Does the below make your ARM box happy again?

It restores the old ttwu behaviour for this case and seems to not mess
up my x86 with __ARCH_WANT_INTERRUPTS_ON_CTXSW.

Figuring out why the existing condition failed and writing a proper
changelog requires a mind that is slightly less deprived of sleep and I
shall attempt that tomorrow -- provided this does indeed work for you.

---
diff --git a/kernel/sched.c b/kernel/sched.c
index 2d12893..6976eac 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2573,7 +2573,19 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu)
if (!next)
smp_send_reschedule(cpu);
}
-#endif
+
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+static void ttwu_activate_remote(struct task_struct *p, int wake_flags)
+{
+ struct rq *rq = __task_rq_lock(p);
+
+ ttwu_activate(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
+ ttwu_do_wakeup(rq, p, wake_flags);
+
+ __task_rq_unlock(rq);
+}
+#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
+#endif /* CONFIG_SMP */

static void ttwu_queue(struct task_struct *p, int cpu)
{
@@ -2630,18 +2642,11 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
while (p->on_cpu) {
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
- /*
- * If called from interrupt context we could have landed in the
- * middle of schedule(), in this case we should take care not
- * to spin on ->on_cpu if p is current, since that would
- * deadlock.
- */
- if (p == current) {
- ttwu_queue(p, cpu);
- goto stat;
- }
-#endif
+ ttwu_activate_remote(p, wake_flags);
+ goto stat;
+#else
cpu_relax();
+#endif
}
/*
* Pairs with the smp_wmb() in finish_lock_switch().

2011-05-26 07:29:26

by Yong Zhang

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, May 26, 2011 at 5:15 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2011-05-25 at 19:08 +0200, Peter Zijlstra wrote:
>> Ooh, shiny, whilst typing this I got an NMI-watchdog error reporting me
>> that CPU1 got stuck in try_to_wake_up(), so it looks like I can indeed
>> reproduce some funnies.
>>
>> /me goes dig in.
>
> Does the below make your ARM box happy again?
>
> It restores the old ttwu behaviour for this case and seems to not mess
> up my x86 with __ARCH_WANT_INTERRUPTS_ON_CTXSW.
>
> Figuring out why the existing condition failed

Seems 'current' will change before/after switch_to since it's derived from
sp register.
So that means if interrupt come before we switch sp, 'p == current' will
catch it, but if interrupt comes after we switch sp, we will lose a wake up.

Thanks,
Yong

> and writing a proper
> changelog requires a mind that is slightly less deprived of sleep and I
> shall attempt that tomorrow -- provided this does indeed work for you.
>
> ---
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 2d12893..6976eac 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2573,7 +2573,19 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu)
>        if (!next)
>                smp_send_reschedule(cpu);
>  }
> -#endif
> +
> +#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> +static void ttwu_activate_remote(struct task_struct *p, int wake_flags)
> +{
> +       struct rq *rq = __task_rq_lock(p);
> +
> +       ttwu_activate(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
> +       ttwu_do_wakeup(rq, p, wake_flags);
> +
> +       __task_rq_unlock(rq);
> +}
> +#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
> +#endif /* CONFIG_SMP */
>
>  static void ttwu_queue(struct task_struct *p, int cpu)
>  {
> @@ -2630,18 +2642,11 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>         */
>        while (p->on_cpu) {
>  #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> -               /*
> -                * If called from interrupt context we could have landed in the
> -                * middle of schedule(), in this case we should take care not
> -                * to spin on ->on_cpu if p is current, since that would
> -                * deadlock.
> -                */
> -               if (p == current) {
> -                       ttwu_queue(p, cpu);
> -                       goto stat;
> -               }
> -#endif
> +               ttwu_activate_remote(p, wake_flags);
> +               goto stat;
> +#else
>                cpu_relax();
> +#endif
>        }
>        /*
>         * Pairs with the smp_wmb() in finish_lock_switch().
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



--
Only stand for myself

2011-05-26 10:55:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 15:29 +0800, Yong Zhang wrote:
> > Figuring out why the existing condition failed
>
> Seems 'current' will change before/after switch_to since it's derived from
> sp register.
> So that means if interrupt come before we switch sp, 'p == current' will
> catch it, but if interrupt comes after we switch sp, we will lose a wake up.

Well, loosing a wakeup isn't the problem here (although it would be a
problem), the immediate problem is that we're getting stuck
(life-locked) in that while (p->on_cpu) loop.

But yes, I think that explains it, if the interrupts hits
context_switch() after current was changed but before clearing
p->on_cpu, we would life-lock in interrupt context.

Now we could of course go add in_interrupt() checks there, but that
would make this already fragile path more interesting, so I think I'll
stick with the proposed patch -- again provided it actually works.

Marc, any word on that?

2011-05-26 11:02:00

by Marc Zyngier

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 12:32 +0200, Peter Zijlstra wrote:
> On Thu, 2011-05-26 at 15:29 +0800, Yong Zhang wrote:
> > > Figuring out why the existing condition failed
> >
> > Seems 'current' will change before/after switch_to since it's derived from
> > sp register.
> > So that means if interrupt come before we switch sp, 'p == current' will
> > catch it, but if interrupt comes after we switch sp, we will lose a wake up.
>
> Well, loosing a wakeup isn't the problem here (although it would be a
> problem), the immediate problem is that we're getting stuck
> (life-locked) in that while (p->on_cpu) loop.
>
> But yes, I think that explains it, if the interrupts hits
> context_switch() after current was changed but before clearing
> p->on_cpu, we would life-lock in interrupt context.
>
> Now we could of course go add in_interrupt() checks there, but that
> would make this already fragile path more interesting, so I think I'll
> stick with the proposed patch -- again provided it actually works.
>
> Marc, any word on that?

The box is currently building kernels in a loop (using -j64...). So far,
so good. Oh, and that fixed the load-average thing as well.

Oh wait (my turn...):
INFO: task gcc:10030 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

One of my ssh sessions is locking up periodically, and it generally
feels a bit sluggish.

M.
--
Reality is an implementation detail.

2011-05-26 11:33:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 12:02 +0100, Marc Zyngier wrote:

> The box is currently building kernels in a loop (using -j64...). So far,
> so good. Oh, and that fixed the load-average thing as well.

OK, great.

> Oh wait (my turn...):
> INFO: task gcc:10030 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>
> One of my ssh sessions is locking up periodically, and it generally
> feels a bit sluggish.

The good news is that I can indeed confirm that, I somehow failed to
notice that last night. I simply put the machine to build kernels and
walked off, only to come back 30 minutes or so later to see it was still
happily chugging along.

Further good news is that by disabling
__ARCH_WANT_INTERRUPTS_ON_CTXSW again it goes away, so it must be
something funny with the relatively little code under that directive.

The bad news is of course that I've got a little more head-scratching to
do, will keep you informed.

2011-05-26 12:22:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 13:32 +0200, Peter Zijlstra wrote:
>
> The bad news is of course that I've got a little more head-scratching to
> do, will keep you informed.

OK, that wasn't too hard.. (/me crosses fingers and prays Marc doesn't
find more funnies ;-).

Does the below cure all woes?

---
Subject: sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW
From: Peter Zijlstra <[email protected]>
Date: Thu May 26 14:21:33 CEST 2011

Marc reported that e4a52bcb9 (sched: Remove rq->lock from the first
half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few
__ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu()
code was suspect.

Yong found that the interrupt could hit hits after context_switch() changes
current but before it clears p->on_cpu, if that interrupt were to
attempt a wake-up of p we would indeed find ourselves spinning in IRQ
context.

Sort this by reverting to the old behaviour for this situation and
perform a full remote wake-up.

Cc: Frank Rowand <[email protected]>
Cc: Yong Zhang <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Reported-by: Marc Zyngier <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched.c | 37 ++++++++++++++++++++++++++++---------
1 file changed, 28 insertions(+), 9 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2573,7 +2573,26 @@ static void ttwu_queue_remote(struct tas
if (!next)
smp_send_reschedule(cpu);
}
-#endif
+
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+static int ttwu_activate_remote(struct task_struct *p, int wake_flags)
+{
+ struct rq *rq;
+ int ret = 0;
+
+ rq = __task_rq_lock(p);
+ if (p->on_cpu) {
+ ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+ ttwu_do_wakeup(rq, p, wake_flags);
+ ret = 1;
+ }
+ __task_rq_unlock(rq);
+
+ return ret;
+
+}
+#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
+#endif /* CONFIG_SMP */

static void ttwu_queue(struct task_struct *p, int cpu)
{
@@ -2631,17 +2650,17 @@ try_to_wake_up(struct task_struct *p, un
while (p->on_cpu) {
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
/*
- * If called from interrupt context we could have landed in the
- * middle of schedule(), in this case we should take care not
- * to spin on ->on_cpu if p is current, since that would
- * deadlock.
+ * In case the architecture enables interrupts in
+ * context_switch(), we cannot busy wait, since that
+ * would lead to live-locks when an interrupt hits and
+ * tries to wake up @prev. So bail and do a complete
+ * remote wakeup.
*/
- if (p == current) {
- ttwu_queue(p, cpu);
+ if (ttwu_activate_remote(p, wake_flags))
goto stat;
- }
-#endif
+#else
cpu_relax();
+#endif
}
/*
* Pairs with the smp_wmb() in finish_lock_switch().

2011-05-26 12:26:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM


* Peter Zijlstra <[email protected]> wrote:

> Sort this by reverting to the old behaviour for this situation and
> perform a full remote wake-up.

Btw., ARM should consider switching most of its subarchitectures to
!__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during context
switches is silly and now expensive as well.

Thanks,

Ingo

2011-05-26 12:32:22

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > Sort this by reverting to the old behaviour for this situation and
> > perform a full remote wake-up.
>
> Btw., ARM should consider switching most of its subarchitectures to
> !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during context
> switches is silly and now expensive as well.

Not going to happen. The reason we do it is because most of the CPUs
have to (slowly) flush their caches during switch_mm(), and to have
IRQs off over the cache flush means that we lose IRQs.

So it's not silly at all, bit a technical requirement imposed by the
cache architecture.

If it's become expensive through development, it suggests that the
development did not take account of the issues we have on ARM.

2011-05-26 12:38:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 13:31 +0100, Russell King - ARM Linux wrote:
> On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > Sort this by reverting to the old behaviour for this situation and
> > > perform a full remote wake-up.
> >
> > Btw., ARM should consider switching most of its subarchitectures to
> > !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during context
> > switches is silly and now expensive as well.
>
> Not going to happen. The reason we do it is because most of the CPUs
> have to (slowly) flush their caches during switch_mm(), and to have
> IRQs off over the cache flush means that we lose IRQs.
>
> So it's not silly at all, bit a technical requirement imposed by the
> cache architecture.
>
> If it's become expensive through development, it suggests that the
> development did not take account of the issues we have on ARM.

Its not more expensive than it was before this patch series, and the
case in question is relatively rare (guesstimate, lacking measurements)
so ARM should benefit from most of the optimization provided.

2011-05-26 12:50:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM


* Russell King - ARM Linux <[email protected]> wrote:

> On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > Sort this by reverting to the old behaviour for this situation
> > > and perform a full remote wake-up.
> >
> > Btw., ARM should consider switching most of its subarchitectures
> > to !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during
> > context switches is silly and now expensive as well.
>
> Not going to happen. The reason we do it is because most of the
> CPUs have to (slowly) flush their caches during switch_mm(), and to
> have IRQs off over the cache flush means that we lose IRQs.

How much time does that take on contemporary ARM hardware, typically
(and worst-case)?

Thanks,

Ingo

2011-05-26 13:37:30

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, May 26, 2011 at 02:50:07PM +0200, Ingo Molnar wrote:
>
> * Russell King - ARM Linux <[email protected]> wrote:
>
> > On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
> > >
> > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > Sort this by reverting to the old behaviour for this situation
> > > > and perform a full remote wake-up.
> > >
> > > Btw., ARM should consider switching most of its subarchitectures
> > > to !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during
> > > context switches is silly and now expensive as well.
> >
> > Not going to happen. The reason we do it is because most of the
> > CPUs have to (slowly) flush their caches during switch_mm(), and to
> > have IRQs off over the cache flush means that we lose IRQs.
>
> How much time does that take on contemporary ARM hardware, typically
> (and worst-case)?

I can't give you precise figures because it really depends on the hardware
and how stuff is setup. All I can do is give you examples from platforms
I have here running which rely upon this.

Some ARM CPUs have to read 32K of data into the data cache in order to
ensure that any dirty data is flushed out. Others have to loop over the
cache segments/entries, cleaning and invalidating each one (that's 8 x 64
for ARM920 so 512 interations).

If my userspace program is correct, then it looks like StrongARM takes
about 700us to read 32K of data into the cache.

Measuring the context switches per second on the same machine (using an
old version of the Byte Benchmarks) gives about 904 context switches per
second (equating to 1.1ms per switch), so this figure looks about right.

Same CPU but different hardware gives 698 context switches per second -
about 1.4ms per switch. With IRQs enabled, its possible to make this
work but you have to read 64K of data instead, which would double the
ctxt switch latency here.

On an ARM920 machine, running the same program gives around 2476 per
second, which is around 400us per switch.

Your typical 16550A with a 16-byte FIFO running at 115200 baud will fill
from completely empty to overrun in 1.1ms. Realistically, you'll start
getting overruns well below that because of the FIFO thresholds - which
may be trigger an IRQ at half-full. So 600us.

This would mean 16550A's would be entirely unusable with StrongARM, with
an overrun guaranteed at every context switch.

This is not the whole story: if you have timing sensitive peripherals
like UARTs, then 1.1ms - 700us doesn't sound that bad, until you start
considering other IRQ load which can lock out servicing those peripherals
while other interrupt handlers are running.

So all in all, having IRQs off for the order of 700us over a context
switch is a complete non-starter of an idea.

2011-05-26 14:45:43

by Catalin Marinas

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On 26 May 2011 13:50, Ingo Molnar <[email protected]> wrote:
> * Russell King - ARM Linux <[email protected]> wrote:
>
>> On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
>> >
>> > * Peter Zijlstra <[email protected]> wrote:
>> >
>> > > Sort this by reverting to the old behaviour for this situation
>> > > and perform a full remote wake-up.
>> >
>> > Btw., ARM should consider switching most of its subarchitectures
>> > to !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during
>> > context switches is silly and now expensive as well.
>>
>> Not going to happen. ?The reason we do it is because most of the
>> CPUs have to (slowly) flush their caches during switch_mm(), and to
>> have IRQs off over the cache flush means that we lose IRQs.
>
> How much time does that take on contemporary ARM hardware, typically
> (and worst-case)?

On newer ARMv6 and ARMv7 hardware, we no longer flush the caches at
context switch as we got VIPT (or PIPT-like) caches.

But modern ARM processors use something called ASID to tag the TLB
entries and we are limited to 256. The switch_mm() code checks for
whether we ran out of them to restart the counting. This ASID
roll-over event needs to be broadcast to the other CPUs and issuing
IPIs with the IRQs disabled isn't always safe. Of course, we could
briefly re-enable them at the ASID roll-over time but I'm not sure
what the expectations of the code calling switch_mm() are.

--
Catalin

2011-05-26 14:55:44

by Marc Zyngier

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 14:21 +0200, Peter Zijlstra wrote:
> On Thu, 2011-05-26 at 13:32 +0200, Peter Zijlstra wrote:
> >
> > The bad news is of course that I've got a little more head-scratching to
> > do, will keep you informed.
>
> OK, that wasn't too hard.. (/me crosses fingers and prays Marc doesn't
> find more funnies ;-).
>
> Does the below cure all woes?

So far so good. The box just went through it's two first iterations of
kernel building without a sweat, carried on, and still feels snappy
enough.

Thanks for having fixed that quickly!

> ---
> Subject: sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW
> From: Peter Zijlstra <[email protected]>
> Date: Thu May 26 14:21:33 CEST 2011
>
> Marc reported that e4a52bcb9 (sched: Remove rq->lock from the first
> half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few
> __ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu()
> code was suspect.
>
> Yong found that the interrupt could hit hits after context_switch() changes
> current but before it clears p->on_cpu, if that interrupt were to
> attempt a wake-up of p we would indeed find ourselves spinning in IRQ
> context.
>
> Sort this by reverting to the old behaviour for this situation and
> perform a full remote wake-up.
>
> Cc: Frank Rowand <[email protected]>
> Cc: Yong Zhang <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Reported-by: Marc Zyngier <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>

Tested-by: Marc Zyngier <[email protected]>

M.
--
Reality is an implementation detail.

2011-05-26 15:46:59

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On 05/26, Peter Zijlstra wrote:
>
> static void ttwu_queue(struct task_struct *p, int cpu)
> {
> @@ -2631,17 +2650,17 @@ try_to_wake_up(struct task_struct *p, un
> while (p->on_cpu) {
> #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> /*
> - * If called from interrupt context we could have landed in the
> - * middle of schedule(), in this case we should take care not
> - * to spin on ->on_cpu if p is current, since that would
> - * deadlock.
> + * In case the architecture enables interrupts in
> + * context_switch(), we cannot busy wait, since that
> + * would lead to live-locks when an interrupt hits and
> + * tries to wake up @prev. So bail and do a complete
> + * remote wakeup.
> */
> - if (p == current) {
> - ttwu_queue(p, cpu);
> + if (ttwu_activate_remote(p, wake_flags))

Stupid question. Can't we fix this problem if we do

- if (p == current)
+ if (cpu == raw_smp_processor_id())

?

I forgot the rules... but iirc task_cpu(p) can't be changed under us?

Oleg.

2011-05-26 15:56:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 17:45 +0200, Oleg Nesterov wrote:
> Stupid question. Can't we fix this problem if we do
>
> - if (p == current)
> + if (cpu == raw_smp_processor_id())
>
> ?
>
> I forgot the rules... but iirc task_cpu(p) can't be changed under us?

Easy enough to test.. brain gave out again,. hold on ;-)

2011-05-26 16:05:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 17:59 +0200, Peter Zijlstra wrote:
> On Thu, 2011-05-26 at 17:45 +0200, Oleg Nesterov wrote:
> > Stupid question. Can't we fix this problem if we do
> >
> > - if (p == current)
> > + if (cpu == raw_smp_processor_id())
> >
> > ?
> >
> > I forgot the rules... but iirc task_cpu(p) can't be changed under us?
>
> Easy enough to test.. brain gave out again,. hold on ;-)

The below seems to run all-right so far, I'll let it run for a while.

---
arch/x86/include/asm/system.h | 2 ++
kernel/sched.c | 3 ++-
kernel/sched_debug.c | 7 +++++++
3 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/system.h b/arch/x86/include/asm/system.h
index c2ff2a1..2c597e8 100644
--- a/arch/x86/include/asm/system.h
+++ b/arch/x86/include/asm/system.h
@@ -10,6 +10,8 @@
#include <linux/kernel.h>
#include <linux/irqflags.h>

+#define __ARCH_WANT_INTERRUPTS_ON_CTXSW
+
/* entries in ARCH_DLINFO: */
#if defined(CONFIG_IA32_EMULATION) || !defined(CONFIG_X86_64)
# define AT_VECTOR_SIZE_ARCH 2
diff --git a/kernel/sched.c b/kernel/sched.c
index 2d12893..f3627e5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2636,7 +2636,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* to spin on ->on_cpu if p is current, since that would
* deadlock.
*/
- if (p == current) {
+ if (cpu == smp_processor_id()) {
+ p->sched_contributes_to_load = 0;
ttwu_queue(p, cpu);
goto stat;
}
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index a6710a1..f0ff1de 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -332,6 +332,13 @@ static int sched_debug_show(struct seq_file *m, void *v)
(int)strcspn(init_utsname()->version, " "),
init_utsname()->version);

+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+ SEQ_printf(m, "__ARCH_WANT_INTERRUPTS_ON_CTXSW\n");
+#endif
+#ifdef __ARCH_WANT_UNLOCKED_CTXSW
+ SEQ_printf(m, "__ARCH_WANT_UNLOCKED_CTXSW\n");
+#endif
+
#define P(x) \
SEQ_printf(m, "%-40s: %Ld\n", #x, (long long)(x))
#define PN(x) \

2011-05-26 16:19:45

by Marc Zyngier

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 18:09 +0200, Peter Zijlstra wrote:
> On Thu, 2011-05-26 at 17:59 +0200, Peter Zijlstra wrote:
> > On Thu, 2011-05-26 at 17:45 +0200, Oleg Nesterov wrote:
> > > Stupid question. Can't we fix this problem if we do
> > >
> > > - if (p == current)
> > > + if (cpu == raw_smp_processor_id())
> > >
> > > ?
> > >
> > > I forgot the rules... but iirc task_cpu(p) can't be changed under us?
> >
> > Easy enough to test.. brain gave out again,. hold on ;-)
>
> The below seems to run all-right so far, I'll let it run for a while.

Doesn't look very good here. The serial console basically locks up as
soon as the system gets busy, even if the kernel compilation seem to
progress at a decent pace.

M.

--
Reality is an implementation detail.

2011-05-26 16:21:50

by Marc Zyngier

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 18:09 +0200, Peter Zijlstra wrote:
> On Thu, 2011-05-26 at 17:59 +0200, Peter Zijlstra wrote:
> > On Thu, 2011-05-26 at 17:45 +0200, Oleg Nesterov wrote:
> > > Stupid question. Can't we fix this problem if we do
> > >
> > > - if (p == current)
> > > + if (cpu == raw_smp_processor_id())
> > >
> > > ?
> > >
> > > I forgot the rules... but iirc task_cpu(p) can't be changed under us?
> >
> > Easy enough to test.. brain gave out again,. hold on ;-)
>
> The below seems to run all-right so far, I'll let it run for a while.

Just got the following:

INFO: task kjournald:904 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D c0284c88 0 904 2 0x00000000
[<c0284c88>] (schedule+0x54c/0x620) from [<c0284dd8>] (io_schedule+0x7c/0xac)
[<c0284dd8>] (io_schedule+0x7c/0xac) from [<c00dc1c0>] (sleep_on_buffer+0x8/0x10)
[<c00dc1c0>] (sleep_on_buffer+0x8/0x10) from [<c02854ac>] (__wait_on_bit+0x54/0x9c)
[<c02854ac>] (__wait_on_bit+0x54/0x9c) from [<c028556c>] (out_of_line_wait_on_bit+0x78/0x84)
[<c028556c>] (out_of_line_wait_on_bit+0x78/0x84) from [<c014bc4c>] (journal_commit_transaction+0x734/0x13f4)
[<c014bc4c>] (journal_commit_transaction+0x734/0x13f4) from [<c014f598>] (kjournald+0xb8/0x210)
[<c014f598>] (kjournald+0xb8/0x210) from [<c0066320>] (kthread+0x80/0x88)
[<c0066320>] (kthread+0x80/0x88) from [<c002ea14>] (kernel_thread_exit+0x0/0x8)

M.
--
Reality is an implementation detail.

2011-05-26 16:28:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 17:20 +0100, Marc Zyngier wrote:
>
> Doesn't look very good here. The serial console basically locks up as
> soon as the system gets busy, even if the kernel compilation seem to
> progress at a decent pace.


OK, I'll leave the one that worked queued up for this release. If we can
come up with a better alternative we can try for the next release, that
should give us ample time to test things and get us a working kernel
now ;-)

2011-05-26 17:07:01

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On 05/26, Peter Zijlstra wrote:
>
> @@ -2636,7 +2636,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> * to spin on ->on_cpu if p is current, since that would
> * deadlock.
> */
> - if (p == current) {
> + if (cpu == smp_processor_id()) {
> + p->sched_contributes_to_load = 0;
> ttwu_queue(p, cpu);

Btw. I do not pretend I really understand se->vruntime, but in this
case we are doing enqueue_task() without ->task_waking(), however we
pass ENQUEUE_WAKING. Is it correct?

Oleg.

2011-05-26 17:15:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 19:04 +0200, Oleg Nesterov wrote:
> On 05/26, Peter Zijlstra wrote:
> >
> > @@ -2636,7 +2636,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > * to spin on ->on_cpu if p is current, since that would
> > * deadlock.
> > */
> > - if (p == current) {
> > + if (cpu == smp_processor_id()) {
> > + p->sched_contributes_to_load = 0;
> > ttwu_queue(p, cpu);
>
> Btw. I do not pretend I really understand se->vruntime, but in this
> case we are doing enqueue_task() without ->task_waking(), however we
> pass ENQUEUE_WAKING. Is it correct?

No its not, that's the thing that I got wrong the first time and caused
these pauses.


2011-05-26 17:21:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 19:17 +0200, Peter Zijlstra wrote:
> On Thu, 2011-05-26 at 19:04 +0200, Oleg Nesterov wrote:
> > On 05/26, Peter Zijlstra wrote:
> > >
> > > @@ -2636,7 +2636,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > > * to spin on ->on_cpu if p is current, since that would
> > > * deadlock.
> > > */
> > > - if (p == current) {
> > > + if (cpu == smp_processor_id()) {
> > > + p->sched_contributes_to_load = 0;
> > > ttwu_queue(p, cpu);
> >
> > Btw. I do not pretend I really understand se->vruntime, but in this
> > case we are doing enqueue_task() without ->task_waking(), however we
> > pass ENQUEUE_WAKING. Is it correct?
>
> No its not, that's the thing that I got wrong the first time and caused
> these pauses.

We'd end up with something like the below, which isn't too different
from what I've now got queued.

It has the extra cpu == smp_processor_id() check, but I'm not sure this
whole case is worth the trouble. I could go stick some counters in to
verify how often all this happens I guess.

---
arch/x86/include/asm/system.h | 2 ++
kernel/sched.c | 14 +++++++++++---
kernel/sched_debug.c | 7 +++++++
3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/system.h b/arch/x86/include/asm/system.h
index c2ff2a1..2c597e8 100644
--- a/arch/x86/include/asm/system.h
+++ b/arch/x86/include/asm/system.h
@@ -10,6 +10,8 @@
#include <linux/kernel.h>
#include <linux/irqflags.h>

+#define __ARCH_WANT_INTERRUPTS_ON_CTXSW
+
/* entries in ARCH_DLINFO: */
#if defined(CONFIG_IA32_EMULATION) || !defined(CONFIG_X86_64)
# define AT_VECTOR_SIZE_ARCH 2
diff --git a/kernel/sched.c b/kernel/sched.c
index 2d12893..e4f7a9f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2636,9 +2636,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* to spin on ->on_cpu if p is current, since that would
* deadlock.
*/
- if (p == current) {
- ttwu_queue(p, cpu);
- goto stat;
+ if (cpu == smp_processor_id()) {
+ struct rq *rq;
+
+ rq = __task_rq_lock(p);
+ if (p->on_cpu) {
+ ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+ ttwu_do_wakeup(rq, p, wake_flags);
+ __task_rq_unlock(rq);
+ goto stat;
+ }
+ __task_rq_unlock(rq);
}
#endif
cpu_relax();
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index a6710a1..f0ff1de 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -332,6 +332,13 @@ static int sched_debug_show(struct seq_file *m, void *v)
(int)strcspn(init_utsname()->version, " "),
init_utsname()->version);

+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+ SEQ_printf(m, "__ARCH_WANT_INTERRUPTS_ON_CTXSW\n");
+#endif
+#ifdef __ARCH_WANT_UNLOCKED_CTXSW
+ SEQ_printf(m, "__ARCH_WANT_UNLOCKED_CTXSW\n");
+#endif
+
#define P(x) \
SEQ_printf(m, "%-40s: %Ld\n", #x, (long long)(x))
#define PN(x) \

2011-05-26 17:51:17

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On 05/26, Peter Zijlstra wrote:
>
> It has the extra cpu == smp_processor_id() check, but I'm not sure this
> whole case is worth the trouble.

Agreed, this case is very unlikely. Perhaps it makes the code more clear
though, up to you.

But, if we keep this check,

> @@ -2636,9 +2636,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> * to spin on ->on_cpu if p is current, since that would
> * deadlock.
> */
> - if (p == current) {
> - ttwu_queue(p, cpu);
> - goto stat;
> + if (cpu == smp_processor_id()) {
> + struct rq *rq;
> +
> + rq = __task_rq_lock(p);
> + if (p->on_cpu) {
> + ttwu_activate(rq, p, ENQUEUE_WAKEUP);
> + ttwu_do_wakeup(rq, p, wake_flags);
> + __task_rq_unlock(rq);

then why we re-check ->on_cpu? Just curious.

Oleg.

2011-05-27 07:01:40

by Yong Zhang

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Fri, May 27, 2011 at 1:23 AM, Peter Zijlstra <[email protected]> wrote:
>
> We'd end up with something like the below, which isn't too different
> from what I've now got queued.
>
> It has the extra cpu == smp_processor_id() check, but I'm not sure this
> whole case is worth the trouble. I could go stick some counters in to
> verify how often all this happens I guess.
>
> ---
>  arch/x86/include/asm/system.h |    2 ++
>  kernel/sched.c                |   14 +++++++++++---
>  kernel/sched_debug.c          |    7 +++++++
>  3 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/system.h b/arch/x86/include/asm/system.h
> index c2ff2a1..2c597e8 100644
> --- a/arch/x86/include/asm/system.h
> +++ b/arch/x86/include/asm/system.h
> @@ -10,6 +10,8 @@
>  #include <linux/kernel.h>
>  #include <linux/irqflags.h>
>
> +#define __ARCH_WANT_INTERRUPTS_ON_CTXSW
> +
>  /* entries in ARCH_DLINFO: */
>  #if defined(CONFIG_IA32_EMULATION) || !defined(CONFIG_X86_64)
>  # define AT_VECTOR_SIZE_ARCH 2
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 2d12893..e4f7a9f 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2636,9 +2636,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>                 * to spin on ->on_cpu if p is current, since that would
>                 * deadlock.
>                 */
> -               if (p == current) {
> -                       ttwu_queue(p, cpu);
> -                       goto stat;
> +               if (cpu == smp_processor_id()) {
> +                       struct rq *rq;
> +
> +                       rq = __task_rq_lock(p);
> +                       if (p->on_cpu) {

As Oleg has said, I also think we don't need this check.

> +                               ttwu_activate(rq, p, ENQUEUE_WAKEUP);
> +                               ttwu_do_wakeup(rq, p, wake_flags);

And the difference with ttwu_queue() is ttwu_queue() calls
ttwu_activate() with another flag ENQUEUE_WAKING, so if
we call ->task_waking() before ttwu_queue(), I guess it will work
too.
But I like this version, because we call ->task_waking() and
ttwu_activate() on the local cpu, that means the calculations on
vruntime in that two functions are accumulated into noop.

Thanks,
Yong

> +                               __task_rq_unlock(rq);
> +                               goto stat;
> +                       }
> +                       __task_rq_unlock(rq);
>                }
>  #endif
>                cpu_relax();
> diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
> index a6710a1..f0ff1de 100644
> --- a/kernel/sched_debug.c
> +++ b/kernel/sched_debug.c
> @@ -332,6 +332,13 @@ static int sched_debug_show(struct seq_file *m, void *v)
>                (int)strcspn(init_utsname()->version, " "),
>                init_utsname()->version);
>
> +#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> +       SEQ_printf(m, "__ARCH_WANT_INTERRUPTS_ON_CTXSW\n");
> +#endif
> +#ifdef __ARCH_WANT_UNLOCKED_CTXSW
> +       SEQ_printf(m, "__ARCH_WANT_UNLOCKED_CTXSW\n");
> +#endif
> +
>  #define P(x) \
>        SEQ_printf(m, "%-40s: %Ld\n", #x, (long long)(x))
>  #define PN(x) \
>
>
>



--
Only stand for myself

2011-05-27 08:00:54

by Marc Zyngier

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Thu, 2011-05-26 at 18:32 +0200, Peter Zijlstra wrote:
> On Thu, 2011-05-26 at 17:20 +0100, Marc Zyngier wrote:
> >
> > Doesn't look very good here. The serial console basically locks up as
> > soon as the system gets busy, even if the kernel compilation seem to
> > progress at a decent pace.
>
>
> OK, I'll leave the one that worked queued up for this release. If we can
> come up with a better alternative we can try for the next release, that
> should give us ample time to test things and get us a working kernel
> now ;-)

Agreed. The board has been compiling kernels for over 15 hours now, and
doesn't show any sign of deadlock. Yet ;-).

So until someone comes up with a much better approach, let's keep this
one. I'm of course happy to continue testing stuff though.

Cheers,

M.
--
Reality is an implementation detail.

2011-05-27 12:06:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM


* Catalin Marinas <[email protected]> wrote:

> > How much time does that take on contemporary ARM hardware,
> > typically (and worst-case)?
>
> On newer ARMv6 and ARMv7 hardware, we no longer flush the caches at
> context switch as we got VIPT (or PIPT-like) caches.
>
> But modern ARM processors use something called ASID to tag the TLB
> entries and we are limited to 256. The switch_mm() code checks for
> whether we ran out of them to restart the counting. This ASID
> roll-over event needs to be broadcast to the other CPUs and issuing
> IPIs with the IRQs disabled isn't always safe. Of course, we could
> briefly re-enable them at the ASID roll-over time but I'm not sure
> what the expectations of the code calling switch_mm() are.

The expectations are to have irqs off (we are holding the runqueue
lock if !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so that's not workable i
suspect.

But in theory we could drop the rq lock and restart the scheduler
task-pick and balancing sequence when the ARM TLB tag rolls over. So
instead of this fragile and assymetric method we'd have a
straightforward retry-in-rare-cases method.

That means some modifications to switch_mm() but should be solvable.

That would make ARM special only in so far that it's one of the few
architectures that signal 'retry task pickup' via switch_mm() - it
would use the stock scheduler otherwise and we could remove
__ARCH_WANT_INTERRUPTS_ON_CTXSW and perhaps even
__ARCH_WANT_UNLOCKED_CTXSW altogether.

I'd suggest doing this once modern ARM chips get so widespread that
you can realistically induce a ~700 usecs irqs-off delays on old,
virtual-cache ARM chips. Old chips would likely use old kernels
anyway, right?

Thanks,

Ingo

2011-05-27 15:23:12

by Santosh Shilimkar

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

Peter,

On 5/26/2011 10:53 PM, Peter Zijlstra wrote:
> On Thu, 2011-05-26 at 19:17 +0200, Peter Zijlstra wrote:
>> On Thu, 2011-05-26 at 19:04 +0200, Oleg Nesterov wrote:
>>> On 05/26, Peter Zijlstra wrote:
>>>>
>>>> @@ -2636,7 +2636,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>>> * to spin on ->on_cpu if p is current, since that would
>>>> * deadlock.
>>>> */
>>>> - if (p == current) {
>>>> + if (cpu == smp_processor_id()) {
>>>> + p->sched_contributes_to_load = 0;
>>>> ttwu_queue(p, cpu);
>>>
>>> Btw. I do not pretend I really understand se->vruntime, but in this
>>> case we are doing enqueue_task() without ->task_waking(), however we
>>> pass ENQUEUE_WAKING. Is it correct?
>>
>> No its not, that's the thing that I got wrong the first time and caused
>> these pauses.
>
> We'd end up with something like the below, which isn't too different
> from what I've now got queued.
>
> It has the extra cpu == smp_processor_id() check, but I'm not sure this
> whole case is worth the trouble. I could go stick some counters in to
> verify how often all this happens I guess.
>
Are you planning send version of this patch for stable .39
too ?

Regards
Santosh

2011-05-27 15:28:44

by Marc Zyngier

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Fri, 2011-05-27 at 20:53 +0530, Santosh Shilimkar wrote:
> Peter,
>
> On 5/26/2011 10:53 PM, Peter Zijlstra wrote:
> > On Thu, 2011-05-26 at 19:17 +0200, Peter Zijlstra wrote:
> >> On Thu, 2011-05-26 at 19:04 +0200, Oleg Nesterov wrote:
> >>> On 05/26, Peter Zijlstra wrote:
> >>>>
> >>>> @@ -2636,7 +2636,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >>>> * to spin on ->on_cpu if p is current, since that would
> >>>> * deadlock.
> >>>> */
> >>>> - if (p == current) {
> >>>> + if (cpu == smp_processor_id()) {
> >>>> + p->sched_contributes_to_load = 0;
> >>>> ttwu_queue(p, cpu);
> >>>
> >>> Btw. I do not pretend I really understand se->vruntime, but in this
> >>> case we are doing enqueue_task() without ->task_waking(), however we
> >>> pass ENQUEUE_WAKING. Is it correct?
> >>
> >> No its not, that's the thing that I got wrong the first time and caused
> >> these pauses.
> >
> > We'd end up with something like the below, which isn't too different
> > from what I've now got queued.
> >
> > It has the extra cpu == smp_processor_id() check, but I'm not sure this
> > whole case is worth the trouble. I could go stick some counters in to
> > verify how often all this happens I guess.
> >
> Are you planning send version of this patch for stable .39
> too ?

.39 is fine, as the ttwu() changes only appeared in mainline during the
current merge window.

Cheers,

M.
--
Reality is an implementation detail.

2011-05-27 15:30:28

by Santosh Shilimkar

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On 5/27/2011 8:59 PM, Marc Zyngier wrote:
> On Fri, 2011-05-27 at 20:53 +0530, Santosh Shilimkar wrote:
[..]

>> Are you planning send version of this patch for stable .39
>> too ?
>
> .39 is fine, as the ttwu() changes only appeared in mainline during the
> current merge window.
>
Thanks Marc for info.

Regards
Santosh

2011-05-27 17:55:33

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Fri, May 27, 2011 at 02:06:29PM +0200, Ingo Molnar wrote:
> I'd suggest doing this once modern ARM chips get so widespread that
> you can realistically induce a ~700 usecs irqs-off delays on old,
> virtual-cache ARM chips. Old chips would likely use old kernels
> anyway, right?

Not necessarily. I have rather a lot of legacy hardware (it outweighs
the more modern stuff) and that legacy hardware is _loads_ more useful
than the modern stuff in that it can actually do stuff like run a network
(such as running kerberos servers, httpd, mtas, etc). Modern ARM
machines typically don't have ways to attach mass storage to them which
make them hellishly limited for such applications.

I'm planning to continue using my old machines, and continue to upgrade
their kernels, especially in order to keep up to date with security
issues.

2011-05-27 19:41:53

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Fri, 27 May 2011, Ingo Molnar wrote:

> I'd suggest doing this once modern ARM chips get so widespread that
> you can realistically induce a ~700 usecs irqs-off delays on old,
> virtual-cache ARM chips. Old chips would likely use old kernels
> anyway, right?

Those "old" ARM chips are still largely produced and fitted in new
designs, and using latest kernels. They are unlikely to fade away
before a couple years.


Nicolas

2011-05-27 20:53:14

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Fri, May 27, 2011 at 02:06:29PM +0200, Ingo Molnar wrote:
> The expectations are to have irqs off (we are holding the runqueue
> lock if !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so that's not workable i
> suspect.

Just a thought, but we _might_ be able to avoid a lot of this hastle if
we had a new arch hook in finish_task_switch(), after finish_lock_switch()
returns but before the old MM is dropped.

For the new ASID-based switch_mm(), we currently do this:

1. check ASID validity
2. flush branch predictor
3. set reserved ASID value
4. set new page tables
5. set new ASID value

This will be shortly changed to:

1. check ASID validity
2. flush branch predictor
3. set swapper_pg_dir tables
4. set new ASID value
5. set new page tables

We could change switch_mm() to only do:

1. flush branch predictor
2. set swapper_pg_dir tables
3. check ASID validity
4. set new ASID value

At this point, we have no user mappings, and so nothing will be using the
ASID at this point. Then in a new post-finish_lock_switch() arch hook:

5. check whether we need to do flushing as a result of ASID change
6. set new page tables

I think this may simplify the ASID code. It needs prototyping out,
reviewing and testing, but I think it may work.

And I think it may also be workable with the CPUs which need to flush
the caches on context switches - we can postpone their page table
switch to this new arch hook too, which will mean we wouldn't require
__ARCH_WANT_INTERRUPTS_ON_CTXSW on ARM at all.

Any thoughts (if you've followed what I'm going on about) ?

2011-05-28 13:09:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Fri, 2011-05-27 at 21:52 +0100, Russell King - ARM Linux wrote:
> On Fri, May 27, 2011 at 02:06:29PM +0200, Ingo Molnar wrote:
> > The expectations are to have irqs off (we are holding the runqueue
> > lock if !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so that's not workable i
> > suspect.
>
> Just a thought, but we _might_ be able to avoid a lot of this hastle if
> we had a new arch hook in finish_task_switch(), after finish_lock_switch()
> returns but before the old MM is dropped.

I'd be more than willing to provide this.

> For the new ASID-based switch_mm(), we currently do this:
>
> 1. check ASID validity
> 2. flush branch predictor
> 3. set reserved ASID value
> 4. set new page tables
> 5. set new ASID value
>
> This will be shortly changed to:
>
> 1. check ASID validity
> 2. flush branch predictor
> 3. set swapper_pg_dir tables
> 4. set new ASID value
> 5. set new page tables
>
> We could change switch_mm() to only do:
>
> 1. flush branch predictor
> 2. set swapper_pg_dir tables
> 3. check ASID validity
> 4. set new ASID value
>
> At this point, we have no user mappings, and so nothing will be using the
> ASID at this point. Then in a new post-finish_lock_switch() arch hook:
>
> 5. check whether we need to do flushing as a result of ASID change
> 6. set new page tables
>
> I think this may simplify the ASID code. It needs prototyping out,
> reviewing and testing, but I think it may work.
>
> And I think it may also be workable with the CPUs which need to flush
> the caches on context switches - we can postpone their page table
> switch to this new arch hook too, which will mean we wouldn't require
> __ARCH_WANT_INTERRUPTS_ON_CTXSW on ARM at all.
>
> Any thoughts (if you've followed what I'm going on about) ?

Yeah, definitely worth a try, you mentioned on IRC the problem of
detecting if switch_mm() happened in the new arch hook. Since
switch_mm() gets a @next pointer we can set a TIF flag there and have
the new arch hook test for that and conditionally perform the required
work.

Now, supposing we can get ARM to not rely on
__ARCH_WANT_INTERRUPTS_ON_CTXSW anymore, there's only microblaze left,
Michal, would a similar scheme work for you? If so we can fully
deprecate and remove this exception from the scheduler (yay!).


2011-05-29 09:51:47

by Catalin Marinas

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Fri, May 27, 2011 at 01:06:29PM +0100, Ingo Molnar wrote:
> * Catalin Marinas <[email protected]> wrote:
> > > How much time does that take on contemporary ARM hardware,
> > > typically (and worst-case)?
> >
> > On newer ARMv6 and ARMv7 hardware, we no longer flush the caches at
> > context switch as we got VIPT (or PIPT-like) caches.
> >
> > But modern ARM processors use something called ASID to tag the TLB
> > entries and we are limited to 256. The switch_mm() code checks for
> > whether we ran out of them to restart the counting. This ASID
> > roll-over event needs to be broadcast to the other CPUs and issuing
> > IPIs with the IRQs disabled isn't always safe. Of course, we could
> > briefly re-enable them at the ASID roll-over time but I'm not sure
> > what the expectations of the code calling switch_mm() are.
>
> The expectations are to have irqs off (we are holding the runqueue
> lock if !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so that's not workable i
> suspect.
>
> But in theory we could drop the rq lock and restart the scheduler
> task-pick and balancing sequence when the ARM TLB tag rolls over. So
> instead of this fragile and assymetric method we'd have a
> straightforward retry-in-rare-cases method.

During switch_mm(), we check whether the task being scheduled in has an
old ASID and acquire a lock for a global ASID variable. If two CPUs do
the context switching at the same time, one of them would get stuck on
cpu_asid_lock. If on the other CPU we get an ASID roll-over, we have to
broadcast it to the other CPUs via IPI. But one of the other CPUs is
stuck on cpu_asid_lock with interrupts disabled and we get a deadlock.

An option could be to drop cpu_asid_lock and use some atomic operations
for the global ASID tracking variable but it needs some thinking. The
ASID tag requirements are that it should be unique across all the CPUs
in the system and two threads sharing the same mm must have the same
ASID (hence the IPI to the other CPUs).

Maybe Russell's idea to move the page table setting outside in some post
task-switch hook would be easier to implement.

--
Catalin

2011-05-29 10:21:32

by Catalin Marinas

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Fri, May 27, 2011 at 09:52:40PM +0100, Russell King - ARM Linux wrote:
> On Fri, May 27, 2011 at 02:06:29PM +0200, Ingo Molnar wrote:
> > The expectations are to have irqs off (we are holding the runqueue
> > lock if !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so that's not workable i
> > suspect.
>
> Just a thought, but we _might_ be able to avoid a lot of this hastle if
> we had a new arch hook in finish_task_switch(), after finish_lock_switch()
> returns but before the old MM is dropped.
...
> We could change switch_mm() to only do:
>
> 1. flush branch predictor
> 2. set swapper_pg_dir tables
> 3. check ASID validity
> 4. set new ASID value

If we find that we ran out of ASIDs, we can't reset it across all the
other CPUs at this point as we have interrupts disabled. So here we
assume that we don't need to reset the ASIDs.

> At this point, we have no user mappings, and so nothing will be using the
> ASID at this point. Then in a new post-finish_lock_switch() arch hook:
>
> 5. check whether we need to do flushing as a result of ASID change
> 6. set new page tables

Can we actually not move points 1, 3 and 4 to the
post-finish_lock_switch() hook as well? We don't really care what's in
the ASID as long as we don't have any user mappings. The same goes for
the branch predictor (which may be wrongly placed already). This would
make the switch_mm() relatively simple and move the check_context() and
cpu_switch_mm() to the post-switch hook.

On A15, the ASID is part of TTBR0 so we set both of them at the same
time in the post-switch hook.

To avoid extra per-thread flags, we could set a per-cpu variable in
switch_mm() so that we know what to switch the page tables to in the
post-switch hook.

So I think this is feasible but it needs some intensive testing.

--
Catalin

2011-05-29 10:27:37

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Sun, May 29, 2011 at 11:21:19AM +0100, Catalin Marinas wrote:
> To avoid extra per-thread flags, we could set a per-cpu variable in
> switch_mm() so that we know what to switch the page tables to in the
> post-switch hook.

Why do we need to add more per-cpu stuff when we already have easy access
to the thread flags?

2011-05-29 12:02:00

by Catalin Marinas

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Sunday, 29 May 2011, Russell King - ARM Linux <[email protected]> wrote:
> On Sun, May 29, 2011 at 11:21:19AM +0100, Catalin Marinas wrote:
>> To avoid extra per-thread flags, we could set a per-cpu variable in
>> switch_mm() so that we know what to switch the page tables to in the
>> post-switch hook.
>
> Why do we need to add more per-cpu stuff when we already have easy access
> to the thread flags?

It could work, I was thinking that we only get an mm structure in the
post-switch hook.

BTW, we currently have a per-cpu current_mm variable in context.c
because switch_mm() is called before switch_to() and the CPU may
receive an IPI to reset the ASID in this interval. But we can remove
it entirely if we set the ASID in the post-switch hook and run the
main switch code with interrupts disabled.


--
Catalin

2011-05-29 13:19:45

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Sun, May 29, 2011 at 01:01:58PM +0100, Catalin Marinas wrote:
> On Sunday, 29 May 2011, Russell King - ARM Linux <[email protected]> wrote:
> > On Sun, May 29, 2011 at 11:21:19AM +0100, Catalin Marinas wrote:
> >> To avoid extra per-thread flags, we could set a per-cpu variable in
> >> switch_mm() so that we know what to switch the page tables to in the
> >> post-switch hook.
> >
> > Why do we need to add more per-cpu stuff when we already have easy access
> > to the thread flags?
>
> It could work, I was thinking that we only get an mm structure in the
> post-switch hook.

No. What we get is the mm structure for the _previous_ task which
was running if the previous task was a lazy-tlb task. Otherwise it
will be NULL.

What we do get is the 'next' task and 'next' thread by virtue of the
fact that it has become the 'current' task - so current and
current_thread_info() both point at what switch_mm() regarded as the
'next' task/thread.

> BTW, we currently have a per-cpu current_mm variable in context.c
> because switch_mm() is called before switch_to() and the CPU may
> receive an IPI to reset the ASID in this interval. But we can remove
> it entirely if we set the ASID in the post-switch hook and run the
> main switch code with interrupts disabled.

Unconvinced. If we move the ASID update to the post-switch hook, then
we have the opposite problem - an IPI can sneak in between the dropping
of the IRQ disabling and the post-switch hook. This could mean that
we end up racing to update the hardware ASID value instead (we may
have read the ASID value from the mm struct, interrupt occurs, changes
the ASID value, returns, we program the old ASID value.)

2011-05-29 21:21:38

by Catalin Marinas

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On 29 May 2011 14:19, Russell King - ARM Linux <[email protected]> wrote:
> On Sun, May 29, 2011 at 01:01:58PM +0100, Catalin Marinas wrote:
>> BTW, we currently have a per-cpu current_mm variable in context.c
>> because switch_mm() is called before switch_to() and the CPU may
>> receive an IPI to reset the ASID in this interval. But we can remove
>> it entirely if we set the ASID in the post-switch hook and run the
>> main switch code with interrupts disabled.
>
> Unconvinced. ?If we move the ASID update to the post-switch hook, then
> we have the opposite problem - an IPI can sneak in between the dropping
> of the IRQ disabling and the post-switch hook. ?This could mean that
> we end up racing to update the hardware ASID value instead (we may
> have read the ASID value from the mm struct, interrupt occurs, changes
> the ASID value, returns, we program the old ASID value.)

Please note that we have this problem already, that's why Will posted
the patch to disable the interrupts around cpu_switch_mm(). With this
fix, even if cpu_switch_mm() happens in the post-switch hook, you
don't really have any problem. In the worst case you set the same
TTBR0 twice and maybe the first time with the old ASID followed
immediately by the setting of the new ASID (with the corresponding TLB
flushing). But that's all happening before getting to user space.

To my original point of getting rid of current_mm - in the post-switch
hook this would be equivalent to current->mm so no need for the
per-cpu variable.

--
Catalin

2011-05-31 11:08:40

by Michal Simek

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

Peter Zijlstra wrote:
> On Fri, 2011-05-27 at 21:52 +0100, Russell King - ARM Linux wrote:
>> On Fri, May 27, 2011 at 02:06:29PM +0200, Ingo Molnar wrote:
>>> The expectations are to have irqs off (we are holding the runqueue
>>> lock if !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so that's not workable i
>>> suspect.
>> Just a thought, but we _might_ be able to avoid a lot of this hastle if
>> we had a new arch hook in finish_task_switch(), after finish_lock_switch()
>> returns but before the old MM is dropped.
>
> I'd be more than willing to provide this.
>
>> For the new ASID-based switch_mm(), we currently do this:
>>
>> 1. check ASID validity
>> 2. flush branch predictor
>> 3. set reserved ASID value
>> 4. set new page tables
>> 5. set new ASID value
>>
>> This will be shortly changed to:
>>
>> 1. check ASID validity
>> 2. flush branch predictor
>> 3. set swapper_pg_dir tables
>> 4. set new ASID value
>> 5. set new page tables
>>
>> We could change switch_mm() to only do:
>>
>> 1. flush branch predictor
>> 2. set swapper_pg_dir tables
>> 3. check ASID validity
>> 4. set new ASID value
>>
>> At this point, we have no user mappings, and so nothing will be using the
>> ASID at this point. Then in a new post-finish_lock_switch() arch hook:
>>
>> 5. check whether we need to do flushing as a result of ASID change
>> 6. set new page tables
>>
>> I think this may simplify the ASID code. It needs prototyping out,
>> reviewing and testing, but I think it may work.
>>
>> And I think it may also be workable with the CPUs which need to flush
>> the caches on context switches - we can postpone their page table
>> switch to this new arch hook too, which will mean we wouldn't require
>> __ARCH_WANT_INTERRUPTS_ON_CTXSW on ARM at all.
>>
>> Any thoughts (if you've followed what I'm going on about) ?
>
> Yeah, definitely worth a try, you mentioned on IRC the problem of
> detecting if switch_mm() happened in the new arch hook. Since
> switch_mm() gets a @next pointer we can set a TIF flag there and have
> the new arch hook test for that and conditionally perform the required
> work.
>
> Now, supposing we can get ARM to not rely on
> __ARCH_WANT_INTERRUPTS_ON_CTXSW anymore, there's only microblaze left,
> Michal, would a similar scheme work for you? If so we can fully
> deprecate and remove this exception from the scheduler (yay!).

Hi,

please correct me if I am wrong but this is workaround just for ARM.
I am not aware that we need to do anything with caches. I enabled that options
after our discussion (http://lkml.org/lkml/2009/12/3/204) because of problems
with lockdep. I will look if I can remove that option but it will be necessary
to do some changes in code. switch_to should be called with irq OFF right?

Michal


Michal






--
Michal Simek, Ing. (M.Eng)
PetaLogix - Linux Solutions for a Reconfigurable World
w: http://www.petalogix.com p: +61-7-30090663,+42-0-721842854 f: +61-7-30090663

2011-05-31 13:22:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Tue, 2011-05-31 at 13:08 +0200, Michal Simek wrote:
>
> please correct me if I am wrong but this is workaround just for ARM.
> I am not aware that we need to do anything with caches. I enabled that options
> after our discussion (http://lkml.org/lkml/2009/12/3/204) because of problems
> with lockdep. I will look if I can remove that option but it will be necessary
> to do some changes in code. switch_to should be called with irq OFF right?

Hmm, so the problem was that interrupts got enabled on microblaze (or
lockdep thought they were), so we need to figure out why that is so
instead of ensuring that it is so.

/me goes poke about in the microblaze code..

So on fork() the child ip gets set to ret_from_fork(), then when we wake
the child we'll eventually schedule to it. So we get a context switch
like X -> child.

Then X calls schedule()->context_switch()->switch_to() which will
continue at ret_from_fork()->schedule_tail()->finish_task_switch()->
finish_lock_switch()->spin_acquire(&rq->lock.depmap..)

Now the lockdep report says that at that point interrupts were enabled,
and I can't quite see how that would happen, we go into switch_to() with
interrupts disabled (assuming !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so the
whole ret_from_fork()->... path should run with interrupts disabled as
well.

I can't find where it would have enabled IRQs. Maybe the current
microblaze code doesn't suffer this, or I simply missed it in the
entry.S magic -- its not like I can actually read microblaze asm well.

Does it still explode like back then, if so, can you see where it
enables IRQs?

2011-05-31 13:37:38

by Michal Simek

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

Peter Zijlstra wrote:
> On Tue, 2011-05-31 at 13:08 +0200, Michal Simek wrote:
>> please correct me if I am wrong but this is workaround just for ARM.
>> I am not aware that we need to do anything with caches. I enabled that options
>> after our discussion (http://lkml.org/lkml/2009/12/3/204) because of problems
>> with lockdep. I will look if I can remove that option but it will be necessary
>> to do some changes in code. switch_to should be called with irq OFF right?
>
> Hmm, so the problem was that interrupts got enabled on microblaze (or
> lockdep thought they were), so we need to figure out why that is so
> instead of ensuring that it is so.
>
> /me goes poke about in the microblaze code..
>
> So on fork() the child ip gets set to ret_from_fork(), then when we wake
> the child we'll eventually schedule to it. So we get a context switch
> like X -> child.
>
> Then X calls schedule()->context_switch()->switch_to() which will
> continue at ret_from_fork()->schedule_tail()->finish_task_switch()->
> finish_lock_switch()->spin_acquire(&rq->lock.depmap..)
>
> Now the lockdep report says that at that point interrupts were enabled,
> and I can't quite see how that would happen, we go into switch_to() with
> interrupts disabled (assuming !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so the
> whole ret_from_fork()->... path should run with interrupts disabled as
> well.
>
> I can't find where it would have enabled IRQs. Maybe the current
> microblaze code doesn't suffer this, or I simply missed it in the
> entry.S magic -- its not like I can actually read microblaze asm well.
>
> Does it still explode like back then, if so, can you see where it
> enables IRQs?

I briefly looked at it and it probably come from copy_thread function (process.c
- line: childregs->msr |= MSR_IE;)
When context switch happen, childregs->msr value is loaded to MSR (machine
status register) which caused that IE is enabled ( entry.S:~977 lwi r12, r11,
CC_MSR; mts rmsr, r12)

NOTE: MSR stores flags for IE, i/d-cache ON/OFF, virtual memory/user mode etc.

This is no problem if context switch is done with irq on. But maybe there is
another place which is causing some problems.

Where exactly should be IRQ reenable after context switch?

I would like to also check some things.
1. When schedule should be called from arch specific code?
Currently we are calling schedule after syscall/exception/interrupt happen.
Is there any place where schedule should/shouldn't be called?

2. For syscall and exception handling - interrupt is ON but it is only masked.
When schedule is called from that any code has to enable IRQ if generic code
doesn't do that. Not sure if it does.

Michal




--
Michal Simek, Ing. (M.Eng)
w: http://www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

2011-05-31 13:52:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Tue, 2011-05-31 at 15:37 +0200, Michal Simek wrote:

> I briefly looked at it and it probably come from copy_thread function (process.c
> - line: childregs->msr |= MSR_IE;)
> When context switch happen, childregs->msr value is loaded to MSR (machine
> status register) which caused that IE is enabled ( entry.S:~977 lwi r12, r11,
> CC_MSR; mts rmsr, r12)
>
> NOTE: MSR stores flags for IE, i/d-cache ON/OFF, virtual memory/user mode etc.
>
> This is no problem if context switch is done with irq on. But maybe there is
> another place which is causing some problems.

Ahh, no wonder I didn't find that ;-)

> Where exactly should be IRQ reenable after context switch?

the tail end of finish_lock_switch(), where it does:
raw_spin_unlock_irq(&rq->lock).

> I would like to also check some things.
> 1. When schedule should be called from arch specific code?
> Currently we are calling schedule after syscall/exception/interrupt happen.
> Is there any place where schedule should/shouldn't be called?

It should be called on the return to userspace path when
TIF_NEED_RESCHED is set. It should not be called from non-preemptible
contexts like non-zero preempt_count or IRQ-disabled.

[ with the exception of CONFIG_PREEMPT which calls preempt_schedule()
which checks both those things ]

> 2. For syscall and exception handling - interrupt is ON but it is only masked.

I'm having trouble understanding: on but masked.

> When schedule is called from that any code has to enable IRQ if generic code
> doesn't do that. Not sure if it does.

generic code isn't supposed to call schedule() with IRQs disabled (and
doesn't afaik)

2011-05-31 14:08:55

by Michal Simek

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

Peter Zijlstra wrote:
> On Tue, 2011-05-31 at 15:37 +0200, Michal Simek wrote:
>
>> I briefly looked at it and it probably come from copy_thread function (process.c
>> - line: childregs->msr |= MSR_IE;)
>> When context switch happen, childregs->msr value is loaded to MSR (machine
>> status register) which caused that IE is enabled ( entry.S:~977 lwi r12, r11,
>> CC_MSR; mts rmsr, r12)
>>
>> NOTE: MSR stores flags for IE, i/d-cache ON/OFF, virtual memory/user mode etc.
>>
>> This is no problem if context switch is done with irq on. But maybe there is
>> another place which is causing some problems.
>
> Ahh, no wonder I didn't find that ;-)

:-)

>
>> Where exactly should be IRQ reenable after context switch?
>
> the tail end of finish_lock_switch(), where it does:
> raw_spin_unlock_irq(&rq->lock).

ok - I see.

>
>> I would like to also check some things.
>> 1. When schedule should be called from arch specific code?
>> Currently we are calling schedule after syscall/exception/interrupt happen.
>> Is there any place where schedule should/shouldn't be called?
>
> It should be called on the return to userspace path when
> TIF_NEED_RESCHED is set.

Yes, we do that. (PTO + PT_MODE stores if return is to kernel or user space)

It should not be called from non-preemptible
> contexts like non-zero preempt_count or IRQ-disabled.

Is this even when the return is to userspace?

PREEMPT is not well tested feature but maybe it is right time to do so.
There is only small part of code (ifdef CONFIG_PREEMPT) when irq happen and
there is return to the kernel. Is this correct?

>
> [ with the exception of CONFIG_PREEMPT which calls preempt_schedule()
> which checks both those things ]

This is called only when IRQ happen right? We call preempt_schedule_irq because
irq are off and IRQ is ON by rtid below IRQ_return label.


>
>> 2. For syscall and exception handling - interrupt is ON but it is only masked.
>
> I'm having trouble understanding: on but masked.

Interrupt can't happen because some masking bits are setup. If you call
irgs_disabled() or others you will get that IRQ is ON but can't happen.

>
>> When schedule is called from that any code has to enable IRQ if generic code
>> doesn't do that. Not sure if it does.
>
> generic code isn't supposed to call schedule() with IRQs disabled (and
> doesn't afaik)

OK. Which means I have to disable IRQ before schedule is called. Is that correct?

Michal

--
Michal Simek, Ing. (M.Eng)
w: http://www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

2011-05-31 14:30:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

On Tue, 2011-05-31 at 16:08 +0200, Michal Simek wrote:
> Peter Zijlstra wrote:

> >> I would like to also check some things.
> >> 1. When schedule should be called from arch specific code?
> >> Currently we are calling schedule after syscall/exception/interrupt happen.
> >> Is there any place where schedule should/shouldn't be called?
> >
> > It should be called on the return to userspace path when
> > TIF_NEED_RESCHED is set.
>
> Yes, we do that. (PTO + PT_MODE stores if return is to kernel or user space)
>
> It should not be called from non-preemptible
> > contexts like non-zero preempt_count or IRQ-disabled.
>
> Is this even when the return is to userspace?

Well, return to userspace should have preempt_count == 0 and IRQs
enabled, right?

> PREEMPT is not well tested feature but maybe it is right time to do so.
> There is only small part of code (ifdef CONFIG_PREEMPT) when irq happen and
> there is return to the kernel. Is this correct?

I think so, never looked too closely, Ingo?

> > [ with the exception of CONFIG_PREEMPT which calls preempt_schedule()
> > which checks both those things ]
>
> This is called only when IRQ happen right? We call preempt_schedule_irq because
> irq are off and IRQ is ON by rtid below IRQ_return label.

Ah, there's also preempt_schedule_irq(), which can be called with
IRQs-disabled, not sure about the rules there though, Ingo?

> >
> >> 2. For syscall and exception handling - interrupt is ON but it is only masked.
> >
> > I'm having trouble understanding: on but masked.
>
> Interrupt can't happen because some masking bits are setup. If you call
> irgs_disabled() or others you will get that IRQ is ON but can't happen.

Ah, we generally ignore that state and only rely on state modified by
local_irq_enable/disable(), eg. your MSR_IE bit.

> >> When schedule is called from that any code has to enable IRQ if generic code
> >> doesn't do that. Not sure if it does.
> >
> > generic code isn't supposed to call schedule() with IRQs disabled (and
> > doesn't afaik)
>
> OK. Which means I have to disable IRQ before schedule is called. Is that correct?

Hum, I might have mis-understood. No, schedule() assumes IRQs are
enabled and will disable IRQs itself quite early:

raw_spin_lock_irq(&rq->lock);

2011-06-06 10:29:12

by Pavel Machek

[permalink] [raw]
Subject: Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM

Hi!

> The expectations are to have irqs off (we are holding the runqueue
> lock if !__ARCH_WANT_INTERRUPTS_ON_CTXSW), so that's not workable i
> suspect.
>
> But in theory we could drop the rq lock and restart the scheduler
> task-pick and balancing sequence when the ARM TLB tag rolls over. So
> instead of this fragile and assymetric method we'd have a
> straightforward retry-in-rare-cases method.
>
> That means some modifications to switch_mm() but should be solvable.
>
> That would make ARM special only in so far that it's one of the few
> architectures that signal 'retry task pickup' via switch_mm() - it
> would use the stock scheduler otherwise and we could remove
> __ARCH_WANT_INTERRUPTS_ON_CTXSW and perhaps even
> __ARCH_WANT_UNLOCKED_CTXSW altogether.
>
> I'd suggest doing this once modern ARM chips get so widespread that
> you can realistically induce a ~700 usecs irqs-off delays on old,
> virtual-cache ARM chips. Old chips would likely use old kernels
> anyway, right?

Not really. Equivalent machines (400g) with new chips do not exist, so
many people are stuck with Sharp Zaurus, and track latest kernels,
because support is actually improving there.

(And yes, I'd like to keep using bluetooth CF card with 16550-style
chip.)
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html