2013-09-19 19:51:15

by Frederic Weisbecker

Subject: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

Thomas,

Please consider this patchset for pulling from:

git://github.com/fweisbec/linux-dynticks.git
irq/core-v2

HEAD: 539b9cde35b473483c722de110133cd757015947

It fixes stack overruns reported by Benjamin Herrenschmidt:
http://lkml.kernel.org/r/1378330796.4321.50.camel%40pasglop

And Paul Mackerras gave feedback here:
http://lkml.kernel.org/r/20130918065101.GA22060@drongo

Of course the fix probably comes at the expense of a performance
hit (stack switches, cache misses, etc.) when softirqs are processed
at the end of interrupts, although I haven't tried to measure that.

Thanks.

---
Frederic Weisbecker (3):
irq: Consolidate do_softirq() arch-overridden implementations
irq: Execute softirq on its own stack on irq exit
irq: Comment on the use of inline stack for ksoftirqd

arch/metag/kernel/irq.c | 56 ++++++++++++++++++-------------------------
arch/parisc/kernel/irq.c | 17 ++-----------
arch/powerpc/kernel/irq.c | 17 +------------
arch/s390/kernel/irq.c | 52 ++++++++++++++++------------------------
arch/sh/kernel/irq.c | 60 +++++++++++++++++++---------------------------
arch/sparc/kernel/irq_64.c | 31 ++++++++----------------
arch/x86/kernel/irq_32.c | 34 ++++++++++----------------
arch/x86/kernel/irq_64.c | 18 +++-----------
include/linux/interrupt.h | 11 +++++++++
kernel/softirq.c | 13 +++++-----
10 files changed, 115 insertions(+), 194 deletions(-)

--
1.8.3.1


2013-09-19 19:51:21

by Frederic Weisbecker

Subject: [PATCH 1/3] irq: Consolidate do_softirq() arch-overridden implementations

All arch-overridden implementations of do_softirq() share the following
common code: disable irqs, check whether softirqs are pending, then
execute __do_softirq() on a specific stack.

Consolidate the common parts such that archs only worry about the
stack switch.
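
For reference, the resulting generic do_softirq() boils down to the
following (this just restates the kernel/softirq.c hunk below):

    asmlinkage void do_softirq(void)
    {
            __u32 pending;
            unsigned long flags;

            if (in_interrupt())
                    return;

            local_irq_save(flags);

            pending = local_softirq_pending();

            if (pending)
                    do_softirq_own_stack();

            local_irq_restore(flags);
    }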

Reported-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Tested-by: Paul Mackerras <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: James Hogan <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Andrew Morton <[email protected]>
---
arch/metag/kernel/irq.c | 56 ++++++++++++++++++-------------------------
arch/parisc/kernel/irq.c | 17 ++-----------
arch/powerpc/kernel/irq.c | 17 +------------
arch/s390/kernel/irq.c | 52 ++++++++++++++++------------------------
arch/sh/kernel/irq.c | 60 +++++++++++++++++++---------------------------
arch/sparc/kernel/irq_64.c | 31 ++++++++----------------
arch/x86/kernel/irq_32.c | 34 ++++++++++----------------
arch/x86/kernel/irq_64.c | 18 +++-----------
include/linux/interrupt.h | 11 +++++++++
kernel/softirq.c | 7 ++----
10 files changed, 110 insertions(+), 193 deletions(-)

diff --git a/arch/metag/kernel/irq.c b/arch/metag/kernel/irq.c
index 2a2c9d5..12a12b4 100644
--- a/arch/metag/kernel/irq.c
+++ b/arch/metag/kernel/irq.c
@@ -159,44 +159,34 @@ void irq_ctx_exit(int cpu)

extern asmlinkage void __do_softirq(void);

-asmlinkage void do_softirq(void)
+void do_softirq_own_stack(void)
{
- unsigned long flags;
struct thread_info *curctx;
union irq_ctx *irqctx;
u32 *isp;

- if (in_interrupt())
- return;
-
- local_irq_save(flags);
-
- if (local_softirq_pending()) {
- curctx = current_thread_info();
- irqctx = softirq_ctx[smp_processor_id()];
- irqctx->tinfo.task = curctx->task;
-
- /* build the stack frame on the softirq stack */
- isp = (u32 *) ((char *)irqctx + sizeof(struct thread_info));
-
- asm volatile (
- "MOV D0.5,%0\n"
- "SWAP A0StP,D0.5\n"
- "CALLR D1RtP,___do_softirq\n"
- "MOV A0StP,D0.5\n"
- :
- : "r" (isp)
- : "memory", "cc", "D1Ar1", "D0Ar2", "D1Ar3", "D0Ar4",
- "D1Ar5", "D0Ar6", "D0Re0", "D1Re0", "D0.4", "D1RtP",
- "D0.5"
- );
- /*
- * Shouldn't happen, we returned above if in_interrupt():
- */
- WARN_ON_ONCE(softirq_count());
- }
-
- local_irq_restore(flags);
+ curctx = current_thread_info();
+ irqctx = softirq_ctx[smp_processor_id()];
+ irqctx->tinfo.task = curctx->task;
+
+ /* build the stack frame on the softirq stack */
+ isp = (u32 *) ((char *)irqctx + sizeof(struct thread_info));
+
+ asm volatile (
+ "MOV D0.5,%0\n"
+ "SWAP A0StP,D0.5\n"
+ "CALLR D1RtP,___do_softirq\n"
+ "MOV A0StP,D0.5\n"
+ :
+ : "r" (isp)
+ : "memory", "cc", "D1Ar1", "D0Ar2", "D1Ar3", "D0Ar4",
+ "D1Ar5", "D0Ar6", "D0Re0", "D1Re0", "D0.4", "D1RtP",
+ "D0.5"
+ );
+ /*
+ * Shouldn't happen, we returned above if in_interrupt():
+ */
+ WARN_ON_ONCE(softirq_count());
}
#endif

diff --git a/arch/parisc/kernel/irq.c b/arch/parisc/kernel/irq.c
index 2e6443b..ef59276 100644
--- a/arch/parisc/kernel/irq.c
+++ b/arch/parisc/kernel/irq.c
@@ -499,22 +499,9 @@ static void execute_on_irq_stack(void *func, unsigned long param1)
*irq_stack_in_use = 1;
}

-asmlinkage void do_softirq(void)
+void do_softirq_own_stack(void)
{
- __u32 pending;
- unsigned long flags;
-
- if (in_interrupt())
- return;
-
- local_irq_save(flags);
-
- pending = local_softirq_pending();
-
- if (pending)
- execute_on_irq_stack(__do_softirq, 0);
-
- local_irq_restore(flags);
+ execute_on_irq_stack(__do_softirq, 0);
}
#endif /* CONFIG_IRQSTACKS */

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index c69440c..7d0da88 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -601,7 +601,7 @@ void irq_ctx_init(void)
}
}

-static inline void do_softirq_onstack(void)
+void do_softirq_own_stack(void)
{
struct thread_info *curtp, *irqtp;
unsigned long saved_sp_limit = current->thread.ksp_limit;
@@ -623,21 +623,6 @@ static inline void do_softirq_onstack(void)
set_bits(irqtp->flags, &curtp->flags);
}

-void do_softirq(void)
-{
- unsigned long flags;
-
- if (in_interrupt())
- return;
-
- local_irq_save(flags);
-
- if (local_softirq_pending())
- do_softirq_onstack();
-
- local_irq_restore(flags);
-}
-
irq_hw_number_t virq_to_hw(unsigned int virq)
{
struct irq_data *irq_data = irq_get_irq_data(virq);
diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index 8ac2097..bb27a26 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -157,39 +157,29 @@ int arch_show_interrupts(struct seq_file *p, int prec)
/*
* Switch to the asynchronous interrupt stack for softirq execution.
*/
-asmlinkage void do_softirq(void)
+void do_softirq_own_stack(void)
{
- unsigned long flags, old, new;
-
- if (in_interrupt())
- return;
-
- local_irq_save(flags);
-
- if (local_softirq_pending()) {
- /* Get current stack pointer. */
- asm volatile("la %0,0(15)" : "=a" (old));
- /* Check against async. stack address range. */
- new = S390_lowcore.async_stack;
- if (((new - old) >> (PAGE_SHIFT + THREAD_ORDER)) != 0) {
- /* Need to switch to the async. stack. */
- new -= STACK_FRAME_OVERHEAD;
- ((struct stack_frame *) new)->back_chain = old;
-
- asm volatile(" la 15,0(%0)\n"
- " basr 14,%2\n"
- " la 15,0(%1)\n"
- : : "a" (new), "a" (old),
- "a" (__do_softirq)
- : "0", "1", "2", "3", "4", "5", "14",
- "cc", "memory" );
- } else {
- /* We are already on the async stack. */
- __do_softirq();
- }
+ unsigned long old, new;
+
+ /* Get current stack pointer. */
+ asm volatile("la %0,0(15)" : "=a" (old));
+ /* Check against async. stack address range. */
+ new = S390_lowcore.async_stack;
+ if (((new - old) >> (PAGE_SHIFT + THREAD_ORDER)) != 0) {
+ /* Need to switch to the async. stack. */
+ new -= STACK_FRAME_OVERHEAD;
+ ((struct stack_frame *) new)->back_chain = old;
+ asm volatile(" la 15,0(%0)\n"
+ " basr 14,%2\n"
+ " la 15,0(%1)\n"
+ : : "a" (new), "a" (old),
+ "a" (__do_softirq)
+ : "0", "1", "2", "3", "4", "5", "14",
+ "cc", "memory" );
+ } else {
+ /* We are already on the async stack. */
+ __do_softirq();
}
-
- local_irq_restore(flags);
}

/*
diff --git a/arch/sh/kernel/irq.c b/arch/sh/kernel/irq.c
index 063af10..50ecd48 100644
--- a/arch/sh/kernel/irq.c
+++ b/arch/sh/kernel/irq.c
@@ -149,47 +149,37 @@ void irq_ctx_exit(int cpu)
hardirq_ctx[cpu] = NULL;
}

-asmlinkage void do_softirq(void)
+void do_softirq_own_stack(void)
{
- unsigned long flags;
struct thread_info *curctx;
union irq_ctx *irqctx;
u32 *isp;

- if (in_interrupt())
- return;
-
- local_irq_save(flags);
+ curctx = current_thread_info();
+ irqctx = softirq_ctx[smp_processor_id()];
+ irqctx->tinfo.task = curctx->task;
+ irqctx->tinfo.previous_sp = current_stack_pointer;
+
+ /* build the stack frame on the softirq stack */
+ isp = (u32 *)((char *)irqctx + sizeof(*irqctx));
+
+ __asm__ __volatile__ (
+ "mov r15, r9 \n"
+ "jsr @%0 \n"
+ /* switch to the softirq stack */
+ " mov %1, r15 \n"
+ /* restore the thread stack */
+ "mov r9, r15 \n"
+ : /* no outputs */
+ : "r" (__do_softirq), "r" (isp)
+ : "memory", "r0", "r1", "r2", "r3", "r4",
+ "r5", "r6", "r7", "r8", "r9", "r15", "t", "pr"
+ );

- if (local_softirq_pending()) {
- curctx = current_thread_info();
- irqctx = softirq_ctx[smp_processor_id()];
- irqctx->tinfo.task = curctx->task;
- irqctx->tinfo.previous_sp = current_stack_pointer;
-
- /* build the stack frame on the softirq stack */
- isp = (u32 *)((char *)irqctx + sizeof(*irqctx));
-
- __asm__ __volatile__ (
- "mov r15, r9 \n"
- "jsr @%0 \n"
- /* switch to the softirq stack */
- " mov %1, r15 \n"
- /* restore the thread stack */
- "mov r9, r15 \n"
- : /* no outputs */
- : "r" (__do_softirq), "r" (isp)
- : "memory", "r0", "r1", "r2", "r3", "r4",
- "r5", "r6", "r7", "r8", "r9", "r15", "t", "pr"
- );
-
- /*
- * Shouldn't happen, we returned above if in_interrupt():
- */
- WARN_ON_ONCE(softirq_count());
- }
-
- local_irq_restore(flags);
+ /*
+ * Shouldn't happen, we returned above if in_interrupt():
+ */
+ WARN_ON_ONCE(softirq_count());
}
#else
static inline void handle_one_irq(unsigned int irq)
diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
index d4840ce..666193f 100644
--- a/arch/sparc/kernel/irq_64.c
+++ b/arch/sparc/kernel/irq_64.c
@@ -698,30 +698,19 @@ void __irq_entry handler_irq(int pil, struct pt_regs *regs)
set_irq_regs(old_regs);
}

-void do_softirq(void)
+void do_softirq_own_stack(void)
{
- unsigned long flags;
-
- if (in_interrupt())
- return;
-
- local_irq_save(flags);
+ void *orig_sp, *sp = softirq_stack[smp_processor_id()];

- if (local_softirq_pending()) {
- void *orig_sp, *sp = softirq_stack[smp_processor_id()];
-
- sp += THREAD_SIZE - 192 - STACK_BIAS;
-
- __asm__ __volatile__("mov %%sp, %0\n\t"
- "mov %1, %%sp"
- : "=&r" (orig_sp)
- : "r" (sp));
- __do_softirq();
- __asm__ __volatile__("mov %0, %%sp"
- : : "r" (orig_sp));
- }
+ sp += THREAD_SIZE - 192 - STACK_BIAS;

- local_irq_restore(flags);
+ __asm__ __volatile__("mov %%sp, %0\n\t"
+ "mov %1, %%sp"
+ : "=&r" (orig_sp)
+ : "r" (sp));
+ __do_softirq();
+ __asm__ __volatile__("mov %0, %%sp"
+ : : "r" (orig_sp));
}

#ifdef CONFIG_HOTPLUG_CPU
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 4186755..1036f03 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -149,35 +149,25 @@ void irq_ctx_init(int cpu)
cpu, per_cpu(hardirq_ctx, cpu), per_cpu(softirq_ctx, cpu));
}

-asmlinkage void do_softirq(void)
+void do_softirq_own_stack(void)
{
- unsigned long flags;
struct thread_info *curctx;
union irq_ctx *irqctx;
u32 *isp;

- if (in_interrupt())
- return;
-
- local_irq_save(flags);
-
- if (local_softirq_pending()) {
- curctx = current_thread_info();
- irqctx = __this_cpu_read(softirq_ctx);
- irqctx->tinfo.task = curctx->task;
- irqctx->tinfo.previous_esp = current_stack_pointer;
-
- /* build the stack frame on the softirq stack */
- isp = (u32 *) ((char *)irqctx + sizeof(*irqctx));
+ curctx = current_thread_info();
+ irqctx = __this_cpu_read(softirq_ctx);
+ irqctx->tinfo.task = curctx->task;
+ irqctx->tinfo.previous_esp = current_stack_pointer;

- call_on_stack(__do_softirq, isp);
- /*
- * Shouldn't happen, we returned above if in_interrupt():
- */
- WARN_ON_ONCE(softirq_count());
- }
+ /* build the stack frame on the softirq stack */
+ isp = (u32 *) ((char *)irqctx + sizeof(*irqctx));

- local_irq_restore(flags);
+ call_on_stack(__do_softirq, isp);
+ /*
+ * Shouldn't happen, we returned above if in_interrupt():
+ */
+ WARN_ON_ONCE(softirq_count());
}

bool handle_irq(unsigned irq, struct pt_regs *regs)
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index d04d3ec..8dd8c96 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -91,20 +91,8 @@ bool handle_irq(unsigned irq, struct pt_regs *regs)

extern void call_softirq(void);

-asmlinkage void do_softirq(void)
+void do_softirq_own_stack(void)
{
- __u32 pending;
- unsigned long flags;
-
- if (in_interrupt())
- return;
-
- local_irq_save(flags);
- pending = local_softirq_pending();
- /* Switch to interrupt stack */
- if (pending) {
- call_softirq();
- WARN_ON_ONCE(softirq_count());
- }
- local_irq_restore(flags);
+ call_softirq();
+ WARN_ON_ONCE(softirq_count());
}
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5e865b5..c9e831d 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -19,6 +19,7 @@

#include <linux/atomic.h>
#include <asm/ptrace.h>
+#include <asm/irq.h>

/*
* These correspond to the IORESOURCE_IRQ_* defines in
@@ -374,6 +375,16 @@ struct softirq_action

asmlinkage void do_softirq(void);
asmlinkage void __do_softirq(void);
+
+#ifdef __ARCH_HAS_DO_SOFTIRQ
+void do_softirq_own_stack(void);
+#else
+static inline void do_softirq_own_stack(void)
+{
+ __do_softirq();
+}
+#endif
+
extern void open_softirq(int nr, void (*action)(struct softirq_action *));
extern void softirq_init(void);
extern void __raise_softirq_irqoff(unsigned int nr);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 53cc09c..36112cf 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -29,7 +29,6 @@
#define CREATE_TRACE_POINTS
#include <trace/events/irq.h>

-#include <asm/irq.h>
/*
- No shared variables, all the data are CPU local.
- If a softirq needs serialization, let it serialize itself
@@ -283,7 +282,7 @@ restart:
tsk_restore_flags(current, old_flags, PF_MEMALLOC);
}

-#ifndef __ARCH_HAS_DO_SOFTIRQ
+

asmlinkage void do_softirq(void)
{
@@ -298,13 +297,11 @@ asmlinkage void do_softirq(void)
pending = local_softirq_pending();

if (pending)
- __do_softirq();
+ do_softirq_own_stack();

local_irq_restore(flags);
}

-#endif
-
/*
* Enter an interrupt context.
*/
--
1.8.3.1

2013-09-19 19:51:25

by Frederic Weisbecker

Subject: [PATCH 3/3] irq: Comment on the use of inline stack for ksoftirqd

Ksoftirqd shouldn't need a softirq stack since it executes in a
kernel thread whose call stack is only beginning at this stage.

Let's add a comment about that for clarity.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: James Hogan <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Andrew Morton <[email protected]>
---
kernel/softirq.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 8c8f08b..1b0f48e 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -759,6 +759,10 @@ static void run_ksoftirqd(unsigned int cpu)
{
local_irq_disable();
if (local_softirq_pending()) {
+ /*
+ * Execute softirq on inline stack, as we are not deep in
+ * the task stack here.
+ */
__do_softirq();
rcu_note_context_switch(cpu);
local_irq_enable();
--
1.8.3.1

2013-09-19 19:51:42

by Frederic Weisbecker

Subject: [PATCH 2/3] irq: Execute softirq on its own stack on irq exit

When a softirq executes in irq_exit(), it can contribute to
random, complicated and deep stack scenarios involving task calls,
hw interrupt calls and softirq handler calls, with further irqs
then interrupting the softirq and digging deeper still with their
own handlers.

Softirqs executing inline on the hw interrupt stack can favour
stack overflows in such circumstances, as has been reported on
powerpc where a task -> irq -> softirq -> irq chain can end up
forming a huge calltrace on the single kernel stack.

So if there are softirqs pending on hardirq exit, let's execute them
on the softirq stack to minimize this.

Reported-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Tested-by: Paul Mackerras <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: James Hogan <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Andrew Morton <[email protected]>
---
kernel/softirq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 36112cf..8c8f08b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -326,7 +326,7 @@ void irq_enter(void)
static inline void invoke_softirq(void)
{
if (!force_irqthreads)
- __do_softirq();
+ do_softirq_own_stack();
else
wakeup_softirqd();
}
--
1.8.3.1

2013-09-20 00:02:22

by Linus Torvalds

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Thu, Sep 19, 2013 at 2:51 PM, Frederic Weisbecker <[email protected]> wrote:
>
> It fixes stack overruns reported by Benjamin Herrenschmidt:
> http://lkml.kernel.org/r/1378330796.4321.50.camel%40pasglop

So I don't really dislike this patch-series, but isn't "irq_exit()"
(which calls the new softirq_on_stack()) already running in the
context of the irq stack? And it's run at the very end of the irq
processing, so the irq stack should be empty too at that point.

So switching to *another* empty stack sounds really sad. No? Taking
more cache misses etc, instead of using the already empty - but
cache-hot - stack that we already have.

I'm assuming that the problem is that since we're already on the irq
stack, if *another* irq comes in, now that *other* irq doesn't get yet
another irq stack page. And I'm wondering whether we shouldn't just
fix that (hopefully unlikely) case instead? So instead of having a
softirq stack, we'd have just an extra irq stack for the case where
the original irq stack is already in use.

Hmm?

Linus

2013-09-20 01:56:30

by Benjamin Herrenschmidt

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Thu, 2013-09-19 at 19:02 -0500, Linus Torvalds wrote:
> On Thu, Sep 19, 2013 at 2:51 PM, Frederic Weisbecker <[email protected]> wrote:
> >
> > It fixes stack overruns reported by Benjamin Herrenschmidt:
> > http://lkml.kernel.org/r/1378330796.4321.50.camel%40pasglop
>
> So I don't really dislike this patch-series, but isn't "irq_exit()"
> (which calls the new softirq_on_stack()) already running in the
> context of the irq stack?

Not on powerpc, and afaik not on i386 from my quick look at
handle_irq() in irq_32.c ... maybe x86_64 already calls do_IRQ
on the irq stack?

Also, irq and softirq are (somewhat on purpose) different stacks

> And it's run at the very end of the irq
> processing, so the irq stack should be empty too at that point.
> So switching to *another* empty stack sounds really sad. No? Taking
> more cache misses etc, instead of using the already empty - but
> cache-hot - stack that we already have.
>
> I'm assuming that the problem is that since we're already on the irq
> stack, if *another* irq comes in, now that *other* irq doesn't get yet
> another irq stack page. And I'm wondering whether we shouldn't just
> fix that (hopefully unlikely) case instead? So instead of having a
> softirq stack, we'd have just an extra irq stack for the case where
> the original irq stack is already in use.

Well, actually, in the crash we observed we weren't already on the irq
stack.

We could try to change powerpc to switch stack before calling do_IRQ but
that would be fairly invasive for various reasons (a significant change
of our assembly entry code) unless we do it as a kind of wrapper around
do_IRQ (and thus keep the actual interrupt frame on the main kernel
stack).

I'll look into hacking something up along those lines, it might be the
best approach for a RHEL7 fix anyway.

Ben.

> Hmm?
>
> Linus

2013-09-20 11:03:35

by Thomas Gleixner

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Thu, 19 Sep 2013, Linus Torvalds wrote:

> On Thu, Sep 19, 2013 at 2:51 PM, Frederic Weisbecker <[email protected]> wrote:
> >
> > It fixes stack overruns reported by Benjamin Herrenschmidt:
> > http://lkml.kernel.org/r/1378330796.4321.50.camel%40pasglop
>
> So I don't really dislike this patch-series, but isn't "irq_exit()"
> (which calls the new softirq_on_stack()) already running in the
> context of the irq stack? And it's run at the very end of the irq
> processing, so the irq stack should be empty too at that point.

Right, but most of the implementations are braindamaged.

irq_enter();
handle_irq_on_hardirq_stack();
irq_exit();

instead of doing:

switch_stack()
irq_enter()
handle_irq()
irq_exit()
restore_stack()

So in the case of softirq processing (the likely case) we end up doing:

switch_to_hardirq_stack()
...
restore_original_stack()
switch_to_softirq_stack()
...
restore_original_stack()

Two avoidable stack switch operations for no gain.

> I'm assuming that the problem is that since we're already on the irq
> stack, if *another* irq comes in, now that *other* irq doesn't get yet
> another irq stack page. And I'm wondering whether we shouldn't just
> fix that (hopefully unlikely) case instead? So instead of having a
> softirq stack, we'd have just an extra irq stack for the case where
> the original irq stack is already in use.

Why not have a single irq_stack large enough to accommodate interrupt
handling during softirq processing? We have no interrupt nesting so
the maximum stack depth necessary is

max(softirq_stack_usage) + max(irq_stack_usage)

Today we allocate THREAD_SIZE_ORDER for the hard and the soft context,
so allocating 2 * THREAD_SIZE_ORDER should be sufficient.
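
A minimal sketch of that suggestion (hypothetical helper and variable
names): since stack allocations are power-of-two sized, going one page
order higher doubles the stack:

    static DEFINE_PER_CPU(struct thread_info *, irq_ctx);

    static void __init alloc_irq_stack(int cpu)
    {
            /* One per-cpu stack, 2 * THREAD_SIZE, shared by hard and soft irqs. */
            per_cpu(irq_ctx, cpu) = (struct thread_info *)
                    __get_free_pages(GFP_KERNEL, THREAD_SIZE_ORDER + 1);
    }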

Thanks,

tglx

2013-09-20 11:12:14

by Peter Zijlstra

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Fri, Sep 20, 2013 at 01:03:17PM +0200, Thomas Gleixner wrote:
> On Thu, 19 Sep 2013, Linus Torvalds wrote:
>
> > On Thu, Sep 19, 2013 at 2:51 PM, Frederic Weisbecker <[email protected]> wrote:
> > >
> > > It fixes stack overruns reported by Benjamin Herrenschmidt:
> > > http://lkml.kernel.org/r/1378330796.4321.50.camel%40pasglop
> >
> > So I don't really dislike this patch-series, but isn't "irq_exit()"
> > (which calls the new softirq_on_stack()) already running in the
> > context of the irq stack? And it's run at the very end of the irq
> > processing, so the irq stack should be empty too at that point.
>
> Right, but most of the implementations are braindamaged.
>
> irq_enter();
> handle_irq_on_hardirq_stack();
> irq_exit();

I was only just staring at i386 and found it did exactly that. It had to
jump through preempt_count hoops to make that work and obviously I
hadn't test-built the preempt patches on i386.

2013-09-20 16:32:22

by Frederic Weisbecker

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Fri, Sep 20, 2013 at 01:03:17PM +0200, Thomas Gleixner wrote:
> On Thu, 19 Sep 2013, Linus Torvalds wrote:
>
> > On Thu, Sep 19, 2013 at 2:51 PM, Frederic Weisbecker <[email protected]> wrote:
> > >
> > > It fixes stack overruns reported by Benjamin Herrenschmidt:
> > > http://lkml.kernel.org/r/1378330796.4321.50.camel%40pasglop
> >
> > So I don't really dislike this patch-series, but isn't "irq_exit()"
> > (which calls the new softirq_on_stack()) already running in the
> > context of the irq stack? And it's run at the very end of the irq
> > processing, so the irq stack should be empty too at that point.
>
> Right, but most of the implementations are braindamaged.
>
> irq_enter();
> handle_irq_on_hardirq_stack();
> irq_exit();
>
> instead of doing:
>
> switch_stack()
> irq_enter()
> handle_irq()
> irq_exit()
> restore_stack()
>
> So in the case of softirq processing (the likely case) we end up doing:
>
> switch_to_hardirq_stack()
> ...
> restore_original_stack()
> switch_to_softirq_stack()
> ...
> restore_original_stack()
>
> Two avoidable stack switch operations for no gain.
>
> > I'm assuming that the problem is that since we're already on the irq
> > stack, if *another* irq comes in, now that *other* irq doesn't get yet
> > another irq stack page. And I'm wondering whether we shouldn't just
> > fix that (hopefully unlikely) case instead? So instead of having a
> > softirq stack, we'd have just an extra irq stack for the case where
> > the original irq stack is already in use.
>
> Why not have a single irq_stack large enough to accommodate interrupt
> handling during softirq processing? We have no interrupt nesting so
> the maximum stack depth necessary is
>
> max(softirq_stack_usage) + max(irq_stack_usage)
>
> Today we allocate THREAD_SIZE_ORDER for the hard and the soft context,
> so allocating 2 * THREAD_SIZE_ORDER should be sufficient.

Looks good to me.

Now just for clarity, what do we then do with inline softirq executions: on local_bh_enable()
for example, or explicit calls to do_softirq() other than irq exit?

Should we keep the current switch to a different softirq stack? If we have a generic irq stack
(used for both hard and soft) that is big enough, perhaps we can also switch to this
generic irq stack for inline softirq executions? After all there is not much point in keeping
a separate stack for that: it results in cache misses if the inline softirq is interrupted by
a hardirq, and inlined softirqs can't happen in hardirq context, so there shouldn't be much risk of overruns.

Or am I missing something?

>
> Thanks,
>
> tglx

2013-09-20 17:30:22

by Thomas Gleixner

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Fri, 20 Sep 2013, Frederic Weisbecker wrote:

> Now just for clarity, what do we then do with inline softirq
> executions: on local_bh_enable() for example, or explicit calls to
> do_softirq() other than irq exit? Should we keep the current switch
> to a different softirq stack? If we have a generic irq stack (used
> for both hard and soft) that is big enough, perhaps we can also
> switch to this generic irq stack for inline softirq executions?
> After all there is not much point in keeping a separate stack for
> that: it results in cache misses if the inline softirq is
> interrupted by a hardirq, and inlined softirqs can't happen in
> hardirq context, so there shouldn't be much risk of overruns.

We can use the same irqstack for this because from the irqstack point
of view, that's the same as if softirqs get executed from
irq_exit().

Thanks,

tglx

2013-09-20 18:37:39

by Frederic Weisbecker

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Fri, Sep 20, 2013 at 07:30:14PM +0200, Thomas Gleixner wrote:
> On Fri, 20 Sep 2013, Frederic Weisbecker wrote:
>
> > Now just for clarity, what do we then do with inline softirq
> > executions: on local_bh_enable() for example, or explicit calls to
> > do_softirq() other than irq exit? Should we keep the current switch
> > to a different softirq stack? If we have a generic irq stack (used
> > for both hard and soft) that is big enough, perhaps we can also
> > switch to this generic irq stack for inline softirq executions?
> > After all there is not much point in keeping a separate stack for
> > that: it results in cache misses if the inline softirq is
> > interrupted by a hardirq, and inlined softirqs can't happen in
> > hardirq context, so there shouldn't be much risk of overruns.
>
> We can use the same irqstack for this because from the irqstack point
> of view, that's the same as if softirqs get executed from
> irq_exit().

Ok, so I see that's what x86-64 is doing. But x86-32 seems to be using different
stacks for hard and soft irqs for no real reason (except maybe to avoid overruns
if a hardirq interrupts the softirq). And x86-32 only switches to the hardirq
stack for the handler. So probably we can use the same stack for the whole thing
and extend it to before irq_enter() and after irq_exit(), like you suggested.

BTW, we still want the 1st patch of my series I think, as it simply consolidates
existing code. Using the same stack for hard and soft irqs is independent of that.

If you're ok with it, would you agree to apply it?

Thanks.

2013-09-20 22:14:08

by Linus Torvalds

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Fri, Sep 20, 2013 at 9:26 AM, Frederic Weisbecker <[email protected]> wrote:
>
> > Now just for clarity, what do we then do with inline softirq executions: on local_bh_enable()
> for example, or explicit calls to do_softirq() other than irq exit?

If we do a softirq because it was pending and we did a
"local_bh_enable()" in normal code, we need a new stack. The
"local_bh_enable()" may be pretty deep in the callchain on a normal
process stack, so I think it would be safest to switch to a separate
stack for softirq handling.

So you have a few different cases:

- irq_exit(). The irq stack is by definition empty (assuming
itq_exit() is done on the irq stack), so doing softirq in that context
should be fine. However, that assumes that if we get *another*
interrupt, then we'll switch stacks again, so this does mean that we
need two irq stacks. No, irq's don't nest, but if we run softirq on
the first irq stack, the other irq *can* nest that softirq.

- process context doing local_bh_enable, and a bh became pending
while it was disabled. See above: this needs a stack switch. Which
stack to use is open, again assuming that a hardirq coming in will
switch to yet another stack.

Hmm?

Linus

2013-09-21 00:54:25

by Benjamin Herrenschmidt

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Fri, 2013-09-20 at 11:26 -0500, Frederic Weisbecker wrote:

> Looks good to me.
>
> Now just for clarity, what do we then do with inline softirq executions: on local_bh_enable()
> for example, or explicit calls to do_softirq() other than irq exit?

We cannot make the irq stack larger than the normal stacks on ppc due to
how we get to the thread info (by masking low bits of the stack
pointer). However, our stacks are pretty big too (16k), so it might be
ok to run the softirqs on the irq stack; it's just a matter of us
doing the switch before do_IRQ rather than later when calling the
handlers.

I think we basically copied what x86 originally did, but we can
definitely change that.

> Should we keep the current switch to a different softirq stack? If we have a generic irq stack
> (used for both hard and soft) that is big enough, perhaps we can also switch to this
> generic irq stack for inline softirq executions? After all there is not much point in keeping
> a separate stack for that: it results in cache misses if the inline softirq is interrupted by
> a hardirq, and inlined softirqs can't happen in hardirq context, so there shouldn't be much risk of overruns.
>
> Or am I missing other things?

Originally IRQs could nest so we really wanted separate stacks,
especially since softirq is where network processing happens and that
can go fairly deep.

But that's not the case anymore so I suppose it makes some sense...

Ben.

> >
> > Thanks,
> >
> > tglx

2013-09-21 00:57:12

by Benjamin Herrenschmidt

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Fri, 2013-09-20 at 13:11 +0200, Peter Zijlstra wrote:
> On Fri, Sep 20, 2013 at 01:03:17PM +0200, Thomas Gleixner wrote:
> > On Thu, 19 Sep 2013, Linus Torvalds wrote:
> >
> > > On Thu, Sep 19, 2013 at 2:51 PM, Frederic Weisbecker <[email protected]> wrote:
> > > >
> > > > It fixes stack overruns reported by Benjamin Herrenschmidt:
> > > > http://lkml.kernel.org/r/1378330796.4321.50.camel%40pasglop
> > >
> > > So I don't really dislike this patch-series, but isn't "irq_exit()"
> > > (which calls the new softirq_on_stack()) already running in the
> > > context of the irq stack? And it's run at the very end of the irq
> > > processing, so the irq stack should be empty too at that point.
> >
> > Right, but most of the implementations are braindamaged.
> >
> > irq_enter();
> > handle_irq_on_hardirq_stack();
> > irq_exit();
>
> I was only just staring at i386 and found it did exactly that. It had to
> jump through preempt_count hoops to make that work and obviously I
> hadn't test-build the preempt patches on i386.

Right and powerpc does the switch even later when calling the individual
handlers.

Ben.


2013-09-21 07:47:31

by Ingo Molnar

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix


* Linus Torvalds <[email protected]> wrote:

> On Fri, Sep 20, 2013 at 9:26 AM, Frederic Weisbecker <[email protected]> wrote:
> >
> > Now just for clarity, what do we then do with inline softirq
> > executions: on local_bh_enable() for example, or explicit calls to
> > do_softirq() other than irq exit?
>
> If we do a softirq because it was pending and we did a
> "local_bh_enable()" in normal code, we need a new stack. The
> "local_bh_enable()" may be pretty deep in the callchain on a normal
> process stack, so I think it would be safest to switch to a separate
> stack for softirq handling.
>
> So you have a few different cases:
>
> - irq_exit(). The irq stack is by definition empty (assuming irq_exit()
> is done on the irq stack), so doing softirq in that context should be
> fine. However, that assumes that if we get *another* interrupt, then
> we'll switch stacks again, so this does mean that we need two irq
> stacks. No, irq's don't nest, but if we run softirq on the first irq
> stack, the other irq *can* nest that softirq.
>
> - process context doing local_bh_enable, and a bh became pending while
> it was disabled. See above: this needs a stack switch. Which stack to
> use is open, again assuming that a hardirq coming in will switch to yet
> another stack.
>
> Hmm?

I'd definitely argue in favor of never letting unknown-size stacks nest
(i.e. to always switch if we start a new context on top of a non-trivial
stack).

Known (small) size stack nesting is not real stack nesting, it's just a
somewhat unusual (and faster) way of stack switching.

Thanks,

Ingo

2013-09-21 18:58:14

by Frederic Weisbecker

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

2013/9/20 Linus Torvalds <[email protected]>:
> On Fri, Sep 20, 2013 at 9:26 AM, Frederic Weisbecker <[email protected]> wrote:
>>
>> Now just for clarity, what do we then do with inline softirq executions: on local_bh_enable()
>> for example, or explicit calls to do_softirq() other than irq exit?
>
> If we do a softirq because it was pending and we did a
> "local_bh_enable()" in normal code, we need a new stack. The
> "local_bh_enable()" may be pretty deep in the callchain on a normal
> process stack, so I think it would be safest to switch to a separate
> stack for softirq handling.

Right.

>
> So you have a few different cases:
>
> - irq_exit(). The irq stack is by definition empty (assuming
> irq_exit() is done on the irq stack), so doing softirq in that context
> should be fine. However, that assumes that if we get *another*
> interrupt, then we'll switch stacks again, so this does mean that we
> need two irq stacks. No, irq's don't nest, but if we run softirq on
> the first irq stack, the other irq *can* nest that softirq.

Well, most archs don't define __ARCH_IRQ_EXIT_IRQS_DISABLED. That doesn't
mean that the majority of them actually run irq_exit() with irqs enabled in
practice. But there may theoretically be some where hardirqs can nest
without even the help of softirqs.

So it's quite possible to run softirqs on a hardirq stack that is not empty.

Now certainly what needs to be fixed then is archs that don't have
__ARCH_IRQ_EXIT_IRQS_DISABLED
or archs that have any other significant opportunity to nest interrupts.

>
> - process context doing local_bh_enable, and a bh became pending
> while it was disabled. See above: this needs a stack switch. Which
> stack to use is open, again assuming that a hardirq coming in will
> switch to yet another stack.

Right. Now if we do as Thomas suggested, we can have a common irq
stack that is big enough for hard and soft irqs. After all there should
never be more than two or three nesting irq contexts:
hardirq->softirq->hardirq, softirq->hardirq, ...

At least if we put aside the insane archs that can nest irqs somehow.

2013-09-21 21:47:10

by Benjamin Herrenschmidt

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Sat, 2013-09-21 at 13:58 -0500, Frederic Weisbecker wrote:

> Now certainly what needs to be fixed then is archs that don't have
> __ARCH_IRQ_EXIT_IRQS_DISABLED
> > or archs that have any other significant opportunity to nest interrupts.

Interesting. I notice we don't define it on powerpc but we don't enable
IRQs in do_IRQ either... our path is very similar to x86 in this regard,
the only thing that can cause them to become enabled would be if a
driver interrupt handler did local_irq_enable().

It used to be fairly common for drivers to do spin_unlock_irq() which
would unconditionally re-enable. Did we add WARNs or lockdep logic to
catch these nowadays?

> > - process context doing local_bh_enable, and a bh became pending
> > while it was disabled. See above: this needs a stack switch. Which
> > stack to use is open, again assuming that a hardirq coming in will
> > switch to yet another stack.
>
> Right. Now if we do like Thomas suggested, we can have a common irq
> stack that is big enough for hard and softirqs. After all there should
> never be more than two or three nesting irq contexts:
> hardirq->softirq->hardirq, softirq->hardirq, ...
>
> At least if we put aside the unsane archs that can nest irqs somehow.

I really don't like the "larger" irq stack ... probably because I can't
make it work easily :-) See my previous comment about how we get to
thread_info on ppc.

What I *can* do that would help, I suppose, would be to switch to the irq
stack before irq_enter/exit, which would at least mean that softirqs would
run from the top of the irq stack, which is better than the current
situation.

In fact I'll whip up a quick fix and see if that might be enough of a band
aid for RHEL7.

Cheers,
Ben.

2013-09-21 23:27:09

by Frederic Weisbecker

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Sun, Sep 22, 2013 at 07:45:01AM +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2013-09-21 at 13:58 -0500, Frederic Weisbecker wrote:
>
> > Now certainly what needs to be fixed then is archs that don't have
> > __ARCH_IRQ_EXIT_IRQS_DISABLED
> > or archs that have any other significant opportunity to nest interrupts.
>
> Interesting. I notice we don't define it on powerpc

Yeah, x86 doesn't define it either. In fact few archs do.

> but we don't enable
> IRQs in do_IRQ either... our path is very similar to x86 in this regard,
> the only thing that can cause them to become enabled would be if a
> driver interrupt handler did local_irq_enable().
>
> It used to be fairly common for drivers to do spin_unlock_irq() which
> would unconditionally re-enable. Did we add WARNs or lockdep logic to
> catch these nowadays ?

Right there is a check in handle_irq_event_percpu() that warns if the handler
exits with irqs enabled.

And irq_exit() also warns when (__ARCH_IRQ_EXIT_IRQS_DISABLED && !irqs_disabled())
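
For reference, the handler check looks roughly like this (trimmed from
kernel/irq/handle.c):

    res = action->handler(irq, action->dev_id);

    /* Catch drivers that re-enabled interrupts behind our back. */
    if (WARN_ONCE(!irqs_disabled(),
                  "irq %u handler %pF enabled interrupts\n",
                  irq, action->handler))
            local_irq_disable();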

>
> > > - process context doing local_bh_enable, and a bh became pending
> > > while it was disabled. See above: this needs a stack switch. Which
> > > stack to use is open, again assuming that a hardirq coming in will
> > > switch to yet another stack.
> >
> > Right. Now if we do like Thomas suggested, we can have a common irq
> > stack that is big enough for hard and softirqs. After all there should
> > never be more than two or three nesting irq contexts:
> > hardirq->softirq->hardirq, softirq->hardirq, ...
> >
> > At least if we put aside the unsane archs that can nest irqs somehow.
>
> I really don't like the "larger" irq stack ... probably because I can't
> make it work easily :-) See my previous comment about how we get to
> thread_info on ppc.
>
> What I *can* do that would help I suppose would be to switch to the irq
> stack before irq_enter/exit which would at least mean that softirq would
> run from the top of the irq stack which is better than the current
> situation.

Yeah, I think doing this should solve the biggest part of the problem on ppc.
You'll at least ensure that task stacks and softirq/irq stacks are split up.

>
> In fact I'll whip up a quick fix and see if that might be enough of a band
> aid for RHEL7.
>
> Cheers,
> Ben.
>

2013-09-22 02:03:46

by H. Peter Anvin

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On 09/21/2013 02:45 PM, Benjamin Herrenschmidt wrote:
>
> I really don't like the "larger" irq stack ... probably because I can't
> make it work easily :-) See my previous comment about how we get to
> thread_info on ppc.
>

For the record, I intend to remove thread_info from the stack on x86 and
instead merge it with task_struct as a single structure pointed to with
a percpu variable.

-hpa

2013-09-22 04:41:14

by Benjamin Herrenschmidt

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Sat, 2013-09-21 at 19:01 -0700, H. Peter Anvin wrote:
> On 09/21/2013 02:45 PM, Benjamin Herrenschmidt wrote:
> >
> > I really don't like the "larger" irq stack ... probably because I can't
> > make it work easily :-) See my previous comment about how we get to
> > thread_info on ppc.
> >
>
> For the record, I intend to remove thread_info from the stack on x86 and
> instead merge it with task_struct as a single structure pointed to with
> a percpu variable.

Last I looked, our per-cpu codegen was pretty poor... but then we have
this "PACA" (somewhat arch-specific per-cpu blob that is separate from
the rest of per-cpu because of a mix of historical reasons and the fact
that it has to be allocated in a specific part of memory at boot time)
which we point to directly via a GPR, so we could point to it via PACA.

How do you do your per-cpu on x86? On powerpc we struggle because we
try to dedicate a register (r13) to this PACA (the per-cpu offset hangs
off it), but we constantly run into issues where gcc copies r13 to
another register and then indexes off that, even across
preempt_enable/disable sections, or worse, such as saving/restoring from
the stack. We can't seem to get the compiler to treat it appropriately
as volatile.

Ben.

2013-09-22 05:01:21

by Benjamin Herrenschmidt

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Sun, 2013-09-22 at 14:39 +1000, Benjamin Herrenschmidt wrote:
> How do you do your per-cpu on x86? On powerpc we struggle because we
> try to dedicate a register (r13) to this PACA (the per-cpu offset hangs
> off it), but we constantly run into issues where gcc copies r13 to
> another register and then indexes off that, even across
> preempt_enable/disable sections, or worse, such as saving/restoring from
> the stack. We can't seem to get the compiler to treat it appropriately
> as volatile.

Also, do you have a half-decent way of getting to per-cpu from asm?

Cheers,
Ben.

2013-09-22 16:24:35

by Peter Zijlstra

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Sun, Sep 22, 2013 at 02:41:01PM +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2013-09-22 at 14:39 +1000, Benjamin Herrenschmidt wrote:
> > How do you do your per-cpu on x86?

We use a segment offset. Something like:

inc %gs:var;

would be a per-cpu increment. The actual memory location used for the
memop is the variable address + GS offset.

And our GS offset is per cpu and points to the base of the per cpu
segment for that cpu.
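
Simplified sketch of the C side (the real macros in
arch/x86/include/asm/percpu.h dispatch on the operand size; the name
below is made up):

    /* Roughly what this_cpu_add(var, val) expands to for a 4-byte
     * variable on x86-64: a single insn, no explicit address math. */
    #define my_cpu_add_4(pcp, val) \
            asm("addl %1, %%gs:%0" : "+m" (pcp) : "ri" (val))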

> Also, do you have a half-decent way of getting to per-cpu from asm?

Yes, see above :-)

Assuming we repurpose r13 as per-cpu base, you could do the whole
this_cpu_* stuff which is locally atomic -- ie. safe against IRQs and
preemption as:

loop:
lwarx rt, var, r13
inc rt
stwcx rt, var, r13
bne- loop

Except, I think your ll/sc pair is actually slower than doing:

local_irq_save(flags)
var++;
local_irq_restore(flags)

Esp. with the lazy irq disable you have.

And I'm fairly sure using them as generic per cpu accessors isn't sane,
but I'm not sure PPC64 has other memops with implicit addition like
that.

As to the problem of GCC moving r13 about, some archs have some
exceptions in the register allocator and leave some registers alone.
IIRC MIPS has this and uses one of those (istr there's 2) for the
per cpu base address.



2013-09-22 17:48:54

by H. Peter Anvin

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On 09/22/2013 09:24 AM, Peter Zijlstra wrote:
>
> As to the problem of GCC moving r13 about, some archs have some
> exceptions in the register allocator and leave some registers alone.
> IIRC MIPS has this and uses one of those (istr there's 2) for the
> per cpu base address.
>

You can force gcc to leave a register alone with the command line option
-ffixed-r13.

-hpa

2013-09-22 21:58:18

by Benjamin Herrenschmidt

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Sun, 2013-09-22 at 18:24 +0200, Peter Zijlstra wrote:
> On Sun, Sep 22, 2013 at 02:41:01PM +1000, Benjamin Herrenschmidt wrote:
> > On Sun, 2013-09-22 at 14:39 +1000, Benjamin Herrenschmidt wrote:
> > > How do you do your per-cpu on x86?
>
> We use a segment offset. Something like:
>
> inc %gs:var;
>
> would be a per-cpu increment. The actual memory location used for the
> memop is the variable address + GS offset.
>
> And our GS offset is per cpu and points to the base of the per cpu
> segment for that cpu.
>
> > Also, do you have a half-decent way of getting to per-cpu from asm?
>
> Yes, see above :-)

And gcc makes no stupid assumptions that this gs doesn't change? That's
the main problem we have with using r13 for PACA.

> Assuming we repurpose r13 as per-cpu base, you could do the whole
> this_cpu_* stuff which is locally atomic -- ie. safe against IRQs and
> preemption as:
>
> loop:
> lwarx rt, var, r13
> inc rt
> stwcx rt, var, r13
> bne- loop
>
> Except, I think your ll/sc pair is actually slower than doing:
>
> local_irq_save(flags)
> var++;
> local_irq_restore(flags)
>
> Esp. with the lazy irq disable you have.

Right, for local atomics we really don't want to use reservations, they
have to go to the L2 (and flush the corresponding L1 line along the
way).

> And I'm fairly sure using them as generic per cpu accessors isn't sane,
> but I'm not sure PPC64 has other memops with implicit addition like
> that.

We have no memop with implicit addition today, so for generic counters
we do need reservation. For other per-cpus such as thread info etc...
the problem is more how quickly we can get to the per-cpu data.

> As to the problem of GCC moving r13 about, some archs have some
> exceptions in the register allocator and leave some registers alone.
> IIRC MIPS has this and uses one of those (istr there's 2) for the
> per cpu base address.

Right, not sure if there's anything we can do short of getting gcc
modified and relying on that new change (which would be problematic).
I've been trying to get straight answers from gcc folks about what we
can or cannot rely upon and never got any.

r13 is actually the TLS, but we can't use it as such in the kernel as
gcc *will* assume it doesn't change (though we could use it for current
and have our per-cpu offset hanging off that, one level of indirection
is better than nothing, I suppose).

The problem boils down to gcc not having a concept that a global
register variable can be volatile.

Cheers,
Ben.


2013-09-22 22:01:09

by Benjamin Herrenschmidt

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Sun, 2013-09-22 at 10:47 -0700, H. Peter Anvin wrote:
> On 09/22/2013 09:24 AM, Peter Zijlstra wrote:
> >
> > As to the problem of GCC moving r13 about, some archs have some
> > exceptions in the register allocator and leave some registers alone.
> > IIRC MIPS has this and uses one of those (istr there's 2) for the
> > per cpu base address.
> >
>
> You can force gcc to leave a register alone with the command line option
> -ffixed-r13.

Haven't tried that in a while but iirc, from the discussions I had with
the gcc folks, it didn't work the way we wanted.

Basically, it still assumes that the register is not going to change at
random points; I can't have something like

register volatile unsigned long per_cpu_offset asm("r13")

It will barf on the "volatile", and if I don't have it, it will make
assumptions that r13 doesn't change, and thus might copy its value
into another register across preempt_enable/disable.

I think I need another session with the gcc folks on that one.

Cheers,
Ben.

2013-09-22 22:22:54

by Linus Torvalds

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Sun, Sep 22, 2013 at 2:56 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
> On Sun, 2013-09-22 at 18:24 +0200, Peter Zijlstra wrote:
>>
>> We use a segment offset. Something like:
>>
>> inc %gs:var;
>>
>
> And gcc makes no stupid assumptions that this gs doesn't change? That's
> the main problem we have with using r13 for PACA.

Since gcc doesn't really know about segment registers at all (modulo
%fs as TLS on x86), we do everything like that using inline asm.

It's not *too* painful if you have a number of macro helpers to build
up all the different versions.

And r13 isn't volatile if you are preempt-safe, so I'm wondering if
you could just make the preempt disable code mark %r13 as modified
("+r"). Then gcc won't ever cache r13 across one of those. And if you
don't have preemption disabled, then you cannot do multiple ops using
%r13 anyway, since on a load-store architecture it might change even
between the load and store, so a per-cpu "add" operation *has* to
cache the %r13 value in *another* register anyway, because using
memory ops with just an offset off %r13 would be buggy.
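
A sketch of that trick (hypothetical helper): an empty asm that lists
r13 as a read-write ("+r") operand, so gcc has to assume the value
changed and discard any cached copy:

    static inline void clobber_r13(void)
    {
            register unsigned long r13 asm("r13");

            /* No instructions: "+r" marks r13 as read and rewritten. */
            asm volatile("" : "+r" (r13));
    }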

So I don't think this is a gcc issue. gcc can't fix those kinds of problems.

Personally, I'd suggest something like:

- the paca stuff is just insane. Try to get rid of it.

- use %r13 for the per-thread thread-info pointer instead. A
per-thread pointer is *not* volatile like the per-cpu base is.

- Now you can make the per-cpu offset be loaded off the per-thread
pointer (update it at context switch). gcc knows to not cache it
across function calls, since it's a memory access. Use ACCESS_ONCE()
or something to make sure it's only loaded once for the cpu offset
ops.

Alternatively, make %r13 point to the percpu side, but make sure that
you always use an asm accessor to fetch the value. In particular, I
think you need to make __my_cpu_offset be an inline asm that fetches
%r13 into some other register. Otherwise you can never get it right.
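
A minimal sketch of such an accessor, assuming %r13 has been
repurposed as the per-cpu base:

    static inline unsigned long __my_cpu_offset(void)
    {
            unsigned long off;

            /* Fetch r13 through asm so gcc never caches the raw register. */
            asm ("mr %0,13" : "=r" (off));
            return off;
    }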

Linus

2013-09-22 22:39:21

by Benjamin Herrenschmidt

Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Sun, 2013-09-22 at 15:22 -0700, Linus Torvalds wrote:
> On Sun, Sep 22, 2013 at 2:56 PM, Benjamin Herrenschmidt
> <[email protected]> wrote:
> > On Sun, 2013-09-22 at 18:24 +0200, Peter Zijlstra wrote:
> >>
> >> We use a segment offset. Something like:
> >>
> >> inc %gs:var;
> >>
> >
> > And gcc makes no stupid assumptions that this gs doesn't change ? That's
> > the main problem we have with using r13 for PACA.
>
> Since gcc doesn't really know about segment registers at all (modulo
> %fs as TLS on x86), we do everything like that using inline asm.
>
> It's not *too* painful if you have a number of macro helpers to build
> up all the different versions.
>
> And r13 isn't volatile if you are preempt-safe, so I'm wondering if
> you could just make the preempt disable code mark %r13 as modified
> ("+r"). Then gcc won't ever cache r13 across one of those. And if you
> don't have preemption disabled, then you cannot do multiple ops using
> %r13 anyway, since on a load-store architecture it might change even
> between the load and store, so a per-cpu "add" operation *has* to
> cache the %r13 value in *another* register anyway, because using
> memory ops with just an offset off %r13 would be buggy.

The only other one is our soft-disable of irqs which is a byte off r13,
so technically a simple:

li r0,0
stb r0,PACASOFTIRQEN(r13)

Is a perfectly valid implementation of local_irq_disable() ;-) But that
sort of thing I think we can handle as special cases.

Right, using an r13 clobber in a dummy asm in things like
preempt_disable and local_irq_disable is so far the most interesting
option left on our table.

> So I don't think this is a gcc issue. gcc can't fix those kinds of problems.

Well, it would help if we had a way to tell gcc that a global register
variable is volatile. But it doesn't understand that concept.

> Personally, I'd suggest something like:
>
> - the paca stuff is just insane. Try to get rid of it.

I wish I could... it digs deep. The problem is fundamentally that with
the MMU off (aka "real" mode), under a hypervisor, we only have access to
a portion of the partition memory (it's a fake real mode if you want; on
older systems it's implemented as a physically contiguous chunk of memory
reserved by the hypervisor for the partition, and as such cannot be
arbitrarily large, while newer CPUs support something smarter but still
with some limitations).

Combine that with the fact that all interrupts are delivered in real
mode, and you get the picture. KVM "userspace" (the variant that
emulates the guest as a user process) makes it worse as it needs to
maintain much more state in a place that is "real mode accessible".

We *might* be able to repurpose r13 as our per-cpu offset and move a
chunk of stuff from the paca to per-cpu variables, but we will still
need "something" akin to the PACA for the guts of the exception
entry/exit and KVM.

> - use %r13 for the per-thread thread-info pointer instead. A
> per-thread pointer is *not* volatile like the per-cpu base is.

Yes, use it as a TLS, that would be an option I've been considering. It
could also be the task struct but thread_info might be slightly hotter.

That's what I had in mind when I mentioned using it as a TLS for
"current".

> - Now you can make the per-cpu offset be loaded off the per-thread
> pointer (update it at context switch). gcc knows to not cache it
> across function calls, since it's a memory access. Use ACCESS_ONCE()
> or something to make sure it's only loaded once for the cpu offset
> ops.

I have seen gcc completely disregard ACCESS_ONCE in the past though, at
least in the context of PACA relative accesses...

> Alternatively, make %r13 point to the percpu side, but make sure that
> you always use an asm accessor to fetch the value. In particular, I
> think you need to make __my_cpu_offset be an inline asm that fetches
> %r13 into some other register. Otherwise you can never get it right.

Yes, that's my plan B, at least for the PACA, but it would work for per-cpu
as well. Use a get_paca/put_paca style construct which fetches it into
another register (and does the preempt enable/disable while at it), and
for the handful of cases where we do want direct and faster access, such
as manipulating the soft-irq flag, an inline asm load or store, which is
what we do today.
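
A minimal sketch of that accessor style (the names follow the ones
floated above, but the exact form here is hypothetical):

static inline struct paca_struct *get_paca(void)
{
        struct paca_struct *paca;

        preempt_disable();
        /* Copy r13 into an ordinary GPR so gcc never caches r13 itself */
        asm volatile("mr %0,13" : "=r" (paca));
        return paca;
}

#define put_paca()      preempt_enable()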

Also with LE, we still have a bit of potential for tweaking the ABI and
the compiler, so that's something I want to put on the table.

Cheers,
Ben.

2013-09-23 04:37:34

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: [PATCH] powerpc/irq: Run softirqs off the top of the irq stack

Nowadays, irq_exit() calls __do_softirq() pretty much directly
instead of calling do_softirq(), which switches to the dedicated
softirq stack.

This has led to observed stack overflows on powerpc, since we call
irq_enter() and irq_exit() outside of the scope that switches to
the irq stack.

This fixes it by moving the stack switching up a level, making
irq_enter() and irq_exit() run off the irq stack.

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---

This is the "band aid" discussed so far for the stack overflow
problem for powerpc. I assume sparc and i386 at least need
something similar (I had a quick look at ARM and it doesn't
seem to have irq stacks at all).

Unless there are objections in the next day or so, I intend to send that to
Linus and to -stable ASAP.

arch/powerpc/include/asm/irq.h | 4 +-
arch/powerpc/kernel/irq.c | 99 ++++++++++++++++++++++--------------------
arch/powerpc/kernel/misc_32.S | 9 ++--
arch/powerpc/kernel/misc_64.S | 10 ++---
4 files changed, 62 insertions(+), 60 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index 0e40843..41f13ce 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -69,9 +69,9 @@ extern struct thread_info *softirq_ctx[NR_CPUS];

extern void irq_ctx_init(void);
extern void call_do_softirq(struct thread_info *tp);
-extern int call_handle_irq(int irq, void *p1,
- struct thread_info *tp, void *func);
+extern void call_do_irq(struct pt_regs *regs, struct thread_info *tp);
extern void do_IRQ(struct pt_regs *regs);
+extern void __do_irq(struct pt_regs *regs);

int irq_choose_cpu(const struct cpumask *mask);

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index c69440c..0c9646f 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -443,46 +443,7 @@ void migrate_irqs(void)

static inline void handle_one_irq(unsigned int irq)
{
- struct thread_info *curtp, *irqtp;
- unsigned long saved_sp_limit;
- struct irq_desc *desc;
-
- desc = irq_to_desc(irq);
- if (!desc)
- return;
-
- /* Switch to the irq stack to handle this */
- curtp = current_thread_info();
- irqtp = hardirq_ctx[smp_processor_id()];
-
- if (curtp == irqtp) {
- /* We're already on the irq stack, just handle it */
- desc->handle_irq(irq, desc);
- return;
- }
-
- saved_sp_limit = current->thread.ksp_limit;
-
- irqtp->task = curtp->task;
- irqtp->flags = 0;
-
- /* Copy the softirq bits in preempt_count so that the
- * softirq checks work in the hardirq context. */
- irqtp->preempt_count = (irqtp->preempt_count & ~SOFTIRQ_MASK) |
- (curtp->preempt_count & SOFTIRQ_MASK);

- current->thread.ksp_limit = (unsigned long)irqtp +
- _ALIGN_UP(sizeof(struct thread_info), 16);
-
- call_handle_irq(irq, desc, irqtp, desc->handle_irq);
- current->thread.ksp_limit = saved_sp_limit;
- irqtp->task = NULL;
-
- /* Set any flag that may have been set on the
- * alternate stack
- */
- if (irqtp->flags)
- set_bits(irqtp->flags, &curtp->flags);
}

static inline void check_stack_overflow(void)
@@ -501,9 +462,9 @@ static inline void check_stack_overflow(void)
#endif
}

-void do_IRQ(struct pt_regs *regs)
+void __do_irq(struct pt_regs *regs)
{
- struct pt_regs *old_regs = set_irq_regs(regs);
+ struct irq_desc *desc;
unsigned int irq;

irq_enter();
@@ -519,18 +480,64 @@ void do_IRQ(struct pt_regs *regs)
*/
irq = ppc_md.get_irq();

- /* We can hard enable interrupts now */
+ /* We can hard enable interrupts now to allow perf interrupts */
may_hard_irq_enable();

/* And finally process it */
- if (irq != NO_IRQ)
- handle_one_irq(irq);
- else
+ if (unlikely(irq == NO_IRQ))
__get_cpu_var(irq_stat).spurious_irqs++;
+ else {
+ desc = irq_to_desc(irq);
+ if (likely(desc))
+ desc->handle_irq(irq, desc);
+ }

trace_irq_exit(regs);

irq_exit();
+}
+
+void do_IRQ(struct pt_regs *regs)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+ struct thread_info *curtp, *irqtp;
+ unsigned long saved_sp_limit;
+
+ /* Switch to the irq stack to handle this */
+ curtp = current_thread_info();
+ irqtp = hardirq_ctx[raw_smp_processor_id()];
+
+ /* Already there ? */
+ if (unlikely(curtp == irqtp)) {
+ __do_irq(regs);
+ set_irq_regs(old_regs);
+ return;
+ }
+
+ /* Adjust the stack limit */
+ saved_sp_limit = current->thread.ksp_limit;
+ current->thread.ksp_limit = (unsigned long)irqtp +
+ _ALIGN_UP(sizeof(struct thread_info), 16);
+
+
+ /* Prepare the thread_info in the irq stack */
+ irqtp->task = curtp->task;
+ irqtp->flags = 0;
+
+ /* Copy the preempt_count so that the [soft]irq checks work. */
+ irqtp->preempt_count = curtp->preempt_count;
+
+ /* Switch stack and call */
+ call_do_irq(regs, irqtp);
+
+ /* Restore stack limit */
+ current->thread.ksp_limit = saved_sp_limit;
+ irqtp->task = NULL;
+
+ /* Copy back updates to the thread_info */
+ if (irqtp->flags)
+ set_bits(irqtp->flags, &curtp->flags);
+
set_irq_regs(old_regs);
}

@@ -592,12 +599,10 @@ void irq_ctx_init(void)
memset((void *)softirq_ctx[i], 0, THREAD_SIZE);
tp = softirq_ctx[i];
tp->cpu = i;
- tp->preempt_count = 0;

memset((void *)hardirq_ctx[i], 0, THREAD_SIZE);
tp = hardirq_ctx[i];
tp->cpu = i;
- tp->preempt_count = HARDIRQ_OFFSET;
}
}

diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index 777d999..7da3882 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -47,13 +47,12 @@ _GLOBAL(call_do_softirq)
mtlr r0
blr

-_GLOBAL(call_handle_irq)
+_GLOBAL(call_do_irq)
mflr r0
stw r0,4(r1)
- mtctr r6
- stwu r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r5)
- mr r1,r5
- bctrl
+ stwu r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r4)
+ mr r1,r4
+ bl __do_irq
lwz r1,0(r1)
lwz r0,4(r1)
mtlr r0
diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
index 971d7e7..e59caf8 100644
--- a/arch/powerpc/kernel/misc_64.S
+++ b/arch/powerpc/kernel/misc_64.S
@@ -40,14 +40,12 @@ _GLOBAL(call_do_softirq)
mtlr r0
blr

-_GLOBAL(call_handle_irq)
- ld r8,0(r6)
+_GLOBAL(call_do_irq)
mflr r0
std r0,16(r1)
- mtctr r8
- stdu r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r5)
- mr r1,r5
- bctrl
+ stdu r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r4)
+ mr r1,r4
+ bl .__do_irq
ld r1,0(r1)
ld r0,16(r1)
mtlr r0

2013-09-23 04:41:40

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Sun, 2013-09-22 at 07:45 +1000, Benjamin Herrenschmidt wrote:
> What I *can* do that would help I suppose would be to switch to the irq
> stack before irq_enter/exit which would at least mean that softirq would
> run from the top of the irq stack which is better than the current
> situation.
>
> In fact I'll whip up a quick fix to see if that might be enough of a band
> aid for RHEL7.

OK I've done that, it seems to work so far. Heads up guys: i386 and sparc
at least seem to need the same treatment. I haven't looked at others except
ARM which doesn't seem to have irq stacks to begin with.

We can also instead apply Fred's series to put back the switch to the
softirq stack since this is actually a regression, but then, arguably,
making sure irq_exit() is called off the irq stack is better and means
we do one instead of two stack switches.

Fred: Maybe revert partially under an arch #define/Kconfig so we can get
the best of both worlds ?

Cheers,
Ben.

2013-09-23 05:01:13

by David Miller

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

From: Benjamin Herrenschmidt <[email protected]>
Date: Mon, 23 Sep 2013 14:40:34 +1000

> Heads up guys: i386 and sparc at least seem to need the same
> treatment.

I'll take a look, thanks for the heads up.

2013-09-23 07:56:42

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [PATCH] powerpc/irq: Run softirqs off the top of the irq stack

Hi Ben,

On Mon, 23 Sep 2013 14:35:58 +1000 Benjamin Herrenschmidt <[email protected]> wrote:
>
> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index c69440c..0c9646f 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -443,46 +443,7 @@ void migrate_irqs(void)
>
> static inline void handle_one_irq(unsigned int irq)
> {
> - struct thread_info *curtp, *irqtp;
> - unsigned long saved_sp_limit;
> - struct irq_desc *desc;
> -
> - desc = irq_to_desc(irq);
> - if (!desc)
> - return;
> -
> - /* Switch to the irq stack to handle this */
> - curtp = current_thread_info();
> - irqtp = hardirq_ctx[smp_processor_id()];
> -
> - if (curtp == irqtp) {
> - /* We're already on the irq stack, just handle it */
> - desc->handle_irq(irq, desc);
> - return;
> - }
> -
> - saved_sp_limit = current->thread.ksp_limit;
> -
> - irqtp->task = curtp->task;
> - irqtp->flags = 0;
> -
> - /* Copy the softirq bits in preempt_count so that the
> - * softirq checks work in the hardirq context. */
> - irqtp->preempt_count = (irqtp->preempt_count & ~SOFTIRQ_MASK) |
> - (curtp->preempt_count & SOFTIRQ_MASK);
>
> - current->thread.ksp_limit = (unsigned long)irqtp +
> - _ALIGN_UP(sizeof(struct thread_info), 16);
> -
> - call_handle_irq(irq, desc, irqtp, desc->handle_irq);
> - current->thread.ksp_limit = saved_sp_limit;
> - irqtp->task = NULL;
> -
> - /* Set any flag that may have been set on the
> - * alternate stack
> - */
> - if (irqtp->flags)
> - set_bits(irqtp->flags, &curtp->flags);
> }

This function ends up as a single blank line ...

> @@ -519,18 +480,64 @@ void do_IRQ(struct pt_regs *regs)
> */
> irq = ppc_md.get_irq();
>
> - /* We can hard enable interrupts now */
> + /* We can hard enable interrupts now to allow perf interrupts */
> may_hard_irq_enable();
>
> /* And finally process it */
> - if (irq != NO_IRQ)
> - handle_one_irq(irq);

then you remove the only call, so why not just remove the function
completely?

--
Cheers,
Stephen Rothwell [email protected]



2013-09-23 10:15:19

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH] powerpc/irq: Run softirqs off the top of the irq stack

On Mon, 2013-09-23 at 17:56 +1000, Stephen Rothwell wrote:
> Hi Ben,
>
> On Mon, 23 Sep 2013 14:35:58 +1000 Benjamin Herrenschmidt <[email protected]> wrote:
> >
> > diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> > index c69440c..0c9646f 100644
> > --- a/arch/powerpc/kernel/irq.c
> > +++ b/arch/powerpc/kernel/irq.c
> > @@ -443,46 +443,7 @@ void migrate_irqs(void)
> >
> > static inline void handle_one_irq(unsigned int irq)
> > {
> > - struct thread_info *curtp, *irqtp;
> > - unsigned long saved_sp_limit;
> > - struct irq_desc *desc;
> > -
> > - desc = irq_to_desc(irq);
> > - if (!desc)
> > - return;
> > -
> > - /* Switch to the irq stack to handle this */
> > - curtp = current_thread_info();
> > - irqtp = hardirq_ctx[smp_processor_id()];
> > -
> > - if (curtp == irqtp) {
> > - /* We're already on the irq stack, just handle it */
> > - desc->handle_irq(irq, desc);
> > - return;
> > - }
> > -
> > - saved_sp_limit = current->thread.ksp_limit;
> > -
> > - irqtp->task = curtp->task;
> > - irqtp->flags = 0;
> > -
> > - /* Copy the softirq bits in preempt_count so that the
> > - * softirq checks work in the hardirq context. */
> > - irqtp->preempt_count = (irqtp->preempt_count & ~SOFTIRQ_MASK) |
> > - (curtp->preempt_count & SOFTIRQ_MASK);
> >
> > - current->thread.ksp_limit = (unsigned long)irqtp +
> > - _ALIGN_UP(sizeof(struct thread_info), 16);
> > -
> > - call_handle_irq(irq, desc, irqtp, desc->handle_irq);
> > - current->thread.ksp_limit = saved_sp_limit;
> > - irqtp->task = NULL;
> > -
> > - /* Set any flag that may have been set on the
> > - * alternate stack
> > - */
> > - if (irqtp->flags)
> > - set_bits(irqtp->flags, &curtp->flags);
> > }
>
> This function ends up as a single blank line ...
>
> > @@ -519,18 +480,64 @@ void do_IRQ(struct pt_regs *regs)
> > */
> > irq = ppc_md.get_irq();
> >
> > - /* We can hard enable interrupts now */
> > + /* We can hard enable interrupts now to allow perf interrupts */
> > may_hard_irq_enable();
> >
> > /* And finally process it */
> > - if (irq != NO_IRQ)
> > - handle_one_irq(irq);
>
> then you remove the only call, so why not just remove the function
> completely?

Because I'm an idiot ? :-)

I moved bits and pieces to do_IRQ and forgot to remove the remainder
(and gcc didn't warn :-)

Cheers,
Ben.


2013-09-23 16:47:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] powerpc/irq: Run softirqs off the top of the irq stack

On Sun, Sep 22, 2013 at 9:35 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
>
> This is the "band aid" discussed so far for the stack overflow
> problem for powerpc.

I don't think it's a "band-aid" in any way, except perhaps in the
sense that there are certainly other things we can also do in this
series (ie I liked Frederic's cleanups etc).

Linus

2013-09-23 18:00:03

by Chris Metcalf

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On 9/22/2013 6:23 PM, Linus Torvalds wrote:
> Alternatively, make %r13 point to the percpu side, but make sure that
> you always use an asm accessor to fetch the value. In particular, I
> think you need to make __my_cpu_offset be an inline asm that fetches
> %r13 into some other register. Otherwise you can never get it right.

We just came up against this on tilegx with a customer bug report in a
PREEMPT environment. On tile, %tp is a GPR that points to the percpu area.
The following seems to be the right abstraction -- though I'd also argue
that letting barrier() clobber not just memory, but %tp, might be a better
solution, but it's not clear what the best way is to do per-architecture
overrides of per-compiler definitions like barrier(). See also the ARM v7
code, which has to do something similar, though their percpu pointer is
not a GPR, which changes the tradeoffs somewhat.

register unsigned long my_cpu_offset_reg asm("tp");

#ifdef CONFIG_PREEMPT
/*
 * For full preemption, we can't just use the register variable
 * directly, since we need barrier() to hazard against it, causing the
 * compiler to reload anything computed from a previous "tp" value.
 * But we also don't want to use volatile asm, since we'd like the
 * compiler to be able to cache the value across multiple percpu reads.
 * So we use a fake stack read as a hazard against barrier().
 */
static inline unsigned long __my_cpu_offset(void)
{
        unsigned long tp;
        register unsigned long *sp asm("sp");
        asm("move %0, tp" : "=r" (tp) : "m" (*sp));
        return tp;
}
#define __my_cpu_offset __my_cpu_offset()
#else
/*
 * We don't need to hazard against barrier() since "tp" doesn't ever
 * change with PREEMPT_NONE, and with PREEMPT_VOLUNTARY it only
 * changes at function call points, at which we are already re-reading
 * the value of "tp" due to "my_cpu_offset_reg" being a global variable.
 */
#define __my_cpu_offset my_cpu_offset_reg
#endif

#define set_my_cpu_offset(tp) (my_cpu_offset_reg = (tp))

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2013-09-23 20:52:54

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH] powerpc/irq: Run softirqs off the top of the irq stack

On Mon, 2013-09-23 at 09:47 -0700, Linus Torvalds wrote:
> On Sun, Sep 22, 2013 at 9:35 PM, Benjamin Herrenschmidt
> <[email protected]> wrote:
> >
> > This is the "band aid" discussed so far for the stack overflow
> > problem for powerpc.
>
> I don't think it's a "band-aid" in any way, except perhaps in the
> sense that there are certainly other things we can also do in this
> series (ie I liked Frederic's cleanups etc).

Ah yes, I thought of it as a band-aid in the sense that a better
approach would be to switch to the irq stack earlier like x86_64 does
but that would be a lot more invasive. Definitely something I would look
into if I were to tackle changing how we do per-cpu and the PACA though.

Cheers,
Ben.

2013-09-23 20:58:38

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Mon, 2013-09-23 at 13:59 -0400, Chris Metcalf wrote:
> We just came up against this on tilegx with a customer bug report in a
> PREEMPT environment. On tile, %tp is a GPR that points to the percpu area.
> The following seems to be the right abstraction -- though I'd also argue
> that letting barrier() clobber not just memory, but %tp, might be a better
> solution, but it's not clear what the best way is to do per-architecture
> overrides of per-compiler definitions like barrier(). See also the ARM v7
> code, which has to do something similar, though their percpu pointer is
> not a GPR, which changes the tradeoffs somewhat.

Hrm, if I read correctly, what you did is read "tp" into another
register *and* also mark that action as clobbering the top word of the stack ?

I don't quite get what the stack clobber brings you here and how it works
around the fact that gcc might still cache that copy of tp into another
register across preempt_enable/disable...

It's hard to tell with gcc ... the best I've had so far as an option was
something that would mark my per-cpu register (r13) *itself* as clobbered...

Cheers,
Ben.

2013-09-24 00:11:58

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Sun, 2013-09-22 at 15:22 -0700, Linus Torvalds wrote:
> - use %r13 for the per-thread thread-info pointer instead. A
> per-thread pointer is *not* volatile like the per-cpu base is.

.../...

> Alternatively, make %r13 point to the percpu side, but make sure that
> you always use an asm accessor to fetch the value. In particular, I
> think you need to make __my_cpu_offset be an inline asm that fetches
> %r13 into some other register. Otherwise you can never get it right.

BTW, that boils down to a choice between using r13 as either a TLS for
current or current_thread_info, or as a per-cpu pointer, which one is
the most performance critical ?

Now in the first case, it seems to me that using it as "current" rather
than "current_thread_info()" is a better idea since we access current a
LOT more overall in the kernel. From there we can find a way to put
thread_info into task struct (via thread struct maybe) to make it a
simple offset from current.
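
As a sketch, that layout could boil down to something like this (names
and layout entirely hypothetical; thread_info is assumed to be embedded
in the task struct):

register struct task_struct *__current asm("r13");

#define current                 (__current)
#define current_thread_info()   (&current->thread_info)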

The big pro of that approach is of course that r13 becomes the TLS as
intended, and we can feel a lot more comfortable that we are "safe" vs.
whatever craziness gcc will come up with next.

The flip side is that per-cpu will remain a load away, so getting the
address of a per-cpu variable would typically be a 3 instruction deal
involving a load and a pair of adds to get to the address, then the
actual per-cpu access proper. This is equivalent to what we have today
(we put the per-cpu offset in the PACA). Using r13 as per-cpu allows us to
avoid that first load.

So what's the most worthwhile thing to do here ? I'm leaning toward 1,
ie, stick current in r13 and feel a lot safer about it (I won't have to
scrutinize generated code all over the place to convince myself things
aren't crossing the barriers), and if the thread_info is in the task
struct, that makes accessing it really trivial & fast as well.

Cheers,
Ben.

2013-09-24 01:19:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Mon, Sep 23, 2013 at 5:10 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
>
> BTW, that boils down to a choice between using r13 as either a TLS for
> current or current_thread_info, or as a per-cpu pointer, which one is
> the most performance critical ?

I think you can tune most of the architecture setup to best suit your needs.

For example, on x86, we don't have much choice: the per-cpu accessors
are going to be faster than the alternatives, and there are patches
afoot to tune the preempt and rcu-readside counters to use the percpu
area (and then save/restore things at task switch time). But having
the counters natively in the thread_info struct is fine too and is
what we do now.

Generally, we've put the performance-critical stuff into
"current_thread_info" as opposed to "current", so it's likely that if
the choice is between those two, then you might want to pick %r13
pointing to the thread-info rather than the "struct task_struct" (ie
things like low-level thread flags). But which is better probably
depends on load, and again, some of it you can tweak by just making
per-architecture structure choices and making the macros point at one
or the other.

There's a few things that really depend on per-cpu areas, but I don't
think it's a huge performance issue if you have to indirect off memory
to get that. Most of the performance issues with per-cpu stuff is
about avoiding cachelines ping-ponging back and forth, not so much
about fast direct access. Of course, if some load really uses a *lot*
of percpu accesses, you get both.

The advantage of having %r13 point to thread data (which is "stable"
as far as the compiler is concerned) as opposed to having it be a
per-cpu pointer (which can change randomly due to task switching) is
that from a correctness standpoint I really do think that either
thread-info or current is *much* easier to handle than using it for
the per-cpu base pointer.

Linus

2013-09-24 01:53:32

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Mon, 2013-09-23 at 18:19 -0700, Linus Torvalds wrote:
> On Mon, Sep 23, 2013 at 5:10 PM, Benjamin Herrenschmidt
> <[email protected]> wrote:
> >
> > BTW, that boils down to a choice between using r13 as either a TLS for
> > current or current_thread_info, or as a per-cpu pointer, which one is
> > the most performance critical ?
>
> I think you can tune most of the architecture setup to best suit your needs.
>
> For example, on x86, we don't have much choice: the per-cpu accessors
> are going to be faster than the alternatives, and there are patches
> afoot to tune the preempt and rcu-readside counters to use the percpu
> area (and then save/restore things at task switch time). But having
> the counters natively in the thread_info struct is fine too and is
> what we do now.

Right, as long as the generic code doesn't move toward putting
everything in per-cpu without leaving us the option :-)

> Generally, we've put the performance-critical stuff into
> "current_thread_info" as opposed to "current", so it's likely that if
> the choice is between those two, then you might want to pick %r13
> pointing to the thread-info rather than the "struct task_struct" (ie
> things like low-level thread flags). But which is better probably
> depends on load, and again, some of it you can tweak by just making
> per-architecture structure choices and making the macros point at one
> or the other.

Well, if current_thread_info is basically inside the thread struct, it
will be the same, just a different offset from r13... task struct,
thread struct, thread info, it all becomes just one big structure
pointed to by r13.

> There's a few things that really depend on per-cpu areas, but I don't
> think it's a huge performance issue if you have to indirect off memory
> to get that. Most of the performance issues with per-cpu stuff is
> about avoiding cachelines ping-ponging back and forth, not so much
> about fast direct access. Of course, if some load really uses a *lot*
> of percpu accesses, you get both.
>
> The advantage of having %r13 point to thread data (which is "stable"
> as far as the compiler is concerned) as opposed to having it be a
> per-cpu pointer (which can change randomly due to task switching) is
> that from a correctness standpoint I really do think that either
> thread-info or current is *much* easier to handle than using it for
> the per-cpu base pointer.

Right. I had a chat with Alan Modra (gcc) and he reckons the "right" way
to make the per-cpu (or PACA) stuff work reasonably reliably is to do
something along the lines of:

register unsigned long per_cpu_offset asm("r13");

And have a barrier in preempt_enable/disable (and irq enable/disable,
though arguably we could just make barrier() do it) which marks r13 as
an *output* (not a clobber !).

From there, gcc knows that after any such barrier, r13 can have changed
and we intend to use the new value (if it's marked as a clobber, it
assumes it was *clobbered* and thus needs to be restored to its
*previous* value).
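
Concretely, such a barrier might look like this sketch (the macro name
is illustrative):

register unsigned long per_cpu_offset asm("r13");

/*
 * Empty asm that names r13 as both input and output: gcc must assume
 * r13 holds a new value afterwards, so copies cached in other registers
 * are discarded rather than saved and restored as a clobber would imply.
 */
#define preempt_barrier()       asm volatile("" : "+r" (per_cpu_offset))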

So if that holds, we have a solid way to do per-cpu. On one side, I tend
to think that r13 being task/thread/thread_info is probably a better
overall choice, I'm worried that going in a different direction than x86
means generic code will get "tuned" to use per-cpu for performance
critical stuff rather than task/thread/thread_info in inflexible ways.

Cheers,
Ben.

> Linus

2013-09-24 02:44:47

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Mon, Sep 23, 2013 at 02:40:34PM +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2013-09-22 at 07:45 +1000, Benjamin Herrenschmidt wrote:
> > What I *can* do that would help I suppose would be to switch to the irq
> > stack before irq_enter/exit which would at least mean that softirq would
> > run from the top of the irq stack which is better than the current
> > situation.
> >
> > In fact I'll whip up a quick fix to see if that might be enough of a band
> > aid for RHEL7.
>
> OK I've done that, it seems to work so far. Heads up guys: i386 and sparc
> at least seem to need the same treatment. I haven't looked at others except
> ARM which doesn't seem to have irq stacks to begin with.
>
> We can also instead apply Fred's series to put back the switch to the
> softirq stack since this is actually a regression, but then, arguably,
> making sure irq_exit() is called off the irq stack is better and means
> we do one instead of two stack switches.
>
> Fred: Maybe revert partially under an arch #define/Kconfig so we can get
> the best of both worlds ?

Aye, I did not realize that this is indeed a regression, caused by
commit facd8b80c67a3cf64a467c4a2ac5fb31f2e6745b ("irq: Sanitize
invoke_softirq"), which converted the do_softirq() call to
__do_softirq() in irq_exit(). It indeed looked like the macro-conditional
call was only there for irqs-disabled reasons. But then these crashes...

So the safest way to fix this is to unconditionally call do_softirq() from irq_exit().
A performance penalty may come along but safety comes first.

We should probably do that and work on longer-term solutions (Kconfig-based arch switch, etc.)
for the next merge window?

I'll respin the series plus the regression fix, unless somebody has a better solution.

Thanks.

2013-09-24 04:44:19

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Tue, 2013-09-24 at 04:44 +0200, Frederic Weisbecker wrote:
> So the safest way to fix this is to unconditionally call do_softirq()
> from irq_exit().
> A performance penalty may come along but safety comes first.
>
> We should probably do that and work on longer term solutions (Kconfig
> based arch switch, etc...)
> for the next merge window?

As you prefer, though I'm keen on getting the "fast" version in RHEL7 if
RH will take it :-)

From the generic code POV, it's a one-liner #ifdef to select between
do_softirq() and __do_softirq(), right ? Then it's up to the arch to
#define I_CAN_DO_FAST !

> I'll respin the series plus the regression fix, unless somebody has a
> better solution.

Cheers,
Ben.

2013-09-24 05:44:26

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: [PATCH v2] powerpc/irq: Run softirqs off the top of the irq stack

Nowadays, irq_exit() calls __do_softirq() pretty much directly
instead of calling do_softirq(), which switches to the dedicated
softirq stack.

This has led to observed stack overflows on powerpc, since we call
irq_enter() and irq_exit() outside of the scope that switches to
the irq stack.

This fixes it by moving the stack switching up a level, making
irq_enter() and irq_exit() run off the irq stack.

Signed-off-by: Benjamin Herrenschmidt <[email protected]>

v2: Garbage collect now empty handle_one_irq()

---
arch/powerpc/include/asm/irq.h | 4 +-
arch/powerpc/kernel/irq.c | 104 ++++++++++++++++++++---------------------
arch/powerpc/kernel/misc_32.S | 9 ++--
arch/powerpc/kernel/misc_64.S | 10 ++--
4 files changed, 62 insertions(+), 65 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index 0e40843..41f13ce 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -69,9 +69,9 @@ extern struct thread_info *softirq_ctx[NR_CPUS];

extern void irq_ctx_init(void);
extern void call_do_softirq(struct thread_info *tp);
-extern int call_handle_irq(int irq, void *p1,
- struct thread_info *tp, void *func);
+extern void call_do_irq(struct pt_regs *regs, struct thread_info *tp);
extern void do_IRQ(struct pt_regs *regs);
+extern void __do_irq(struct pt_regs *regs);

int irq_choose_cpu(const struct cpumask *mask);

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index c69440c..2234a12 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -441,50 +441,6 @@ void migrate_irqs(void)
}
#endif

-static inline void handle_one_irq(unsigned int irq)
-{
- struct thread_info *curtp, *irqtp;
- unsigned long saved_sp_limit;
- struct irq_desc *desc;
-
- desc = irq_to_desc(irq);
- if (!desc)
- return;
-
- /* Switch to the irq stack to handle this */
- curtp = current_thread_info();
- irqtp = hardirq_ctx[smp_processor_id()];
-
- if (curtp == irqtp) {
- /* We're already on the irq stack, just handle it */
- desc->handle_irq(irq, desc);
- return;
- }
-
- saved_sp_limit = current->thread.ksp_limit;
-
- irqtp->task = curtp->task;
- irqtp->flags = 0;
-
- /* Copy the softirq bits in preempt_count so that the
- * softirq checks work in the hardirq context. */
- irqtp->preempt_count = (irqtp->preempt_count & ~SOFTIRQ_MASK) |
- (curtp->preempt_count & SOFTIRQ_MASK);
-
- current->thread.ksp_limit = (unsigned long)irqtp +
- _ALIGN_UP(sizeof(struct thread_info), 16);
-
- call_handle_irq(irq, desc, irqtp, desc->handle_irq);
- current->thread.ksp_limit = saved_sp_limit;
- irqtp->task = NULL;
-
- /* Set any flag that may have been set on the
- * alternate stack
- */
- if (irqtp->flags)
- set_bits(irqtp->flags, &curtp->flags);
-}
-
static inline void check_stack_overflow(void)
{
#ifdef CONFIG_DEBUG_STACKOVERFLOW
@@ -501,9 +457,9 @@ static inline void check_stack_overflow(void)
#endif
}

-void do_IRQ(struct pt_regs *regs)
+void __do_irq(struct pt_regs *regs)
{
- struct pt_regs *old_regs = set_irq_regs(regs);
+ struct irq_desc *desc;
unsigned int irq;

irq_enter();
@@ -519,18 +475,64 @@ void do_IRQ(struct pt_regs *regs)
*/
irq = ppc_md.get_irq();

- /* We can hard enable interrupts now */
+ /* We can hard enable interrupts now to allow perf interrupts */
may_hard_irq_enable();

/* And finally process it */
- if (irq != NO_IRQ)
- handle_one_irq(irq);
- else
+ if (unlikely(irq == NO_IRQ))
__get_cpu_var(irq_stat).spurious_irqs++;
+ else {
+ desc = irq_to_desc(irq);
+ if (likely(desc))
+ desc->handle_irq(irq, desc);
+ }

trace_irq_exit(regs);

irq_exit();
+}
+
+void do_IRQ(struct pt_regs *regs)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+ struct thread_info *curtp, *irqtp;
+ unsigned long saved_sp_limit;
+
+ /* Switch to the irq stack to handle this */
+ curtp = current_thread_info();
+ irqtp = hardirq_ctx[raw_smp_processor_id()];
+
+ /* Already there ? */
+ if (unlikely(curtp == irqtp)) {
+ __do_irq(regs);
+ set_irq_regs(old_regs);
+ return;
+ }
+
+ /* Adjust the stack limit */
+ saved_sp_limit = current->thread.ksp_limit;
+ current->thread.ksp_limit = (unsigned long)irqtp +
+ _ALIGN_UP(sizeof(struct thread_info), 16);
+
+
+ /* Prepare the thread_info in the irq stack */
+ irqtp->task = curtp->task;
+ irqtp->flags = 0;
+
+ /* Copy the preempt_count so that the [soft]irq checks work. */
+ irqtp->preempt_count = curtp->preempt_count;
+
+ /* Switch stack and call */
+ call_do_irq(regs, irqtp);
+
+ /* Restore stack limit */
+ current->thread.ksp_limit = saved_sp_limit;
+ irqtp->task = NULL;
+
+ /* Copy back updates to the thread_info */
+ if (irqtp->flags)
+ set_bits(irqtp->flags, &curtp->flags);
+
set_irq_regs(old_regs);
}

@@ -592,12 +594,10 @@ void irq_ctx_init(void)
memset((void *)softirq_ctx[i], 0, THREAD_SIZE);
tp = softirq_ctx[i];
tp->cpu = i;
- tp->preempt_count = 0;

memset((void *)hardirq_ctx[i], 0, THREAD_SIZE);
tp = hardirq_ctx[i];
tp->cpu = i;
- tp->preempt_count = HARDIRQ_OFFSET;
}
}

diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index 777d999..7da3882 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -47,13 +47,12 @@ _GLOBAL(call_do_softirq)
mtlr r0
blr

-_GLOBAL(call_handle_irq)
+_GLOBAL(call_do_irq)
mflr r0
stw r0,4(r1)
- mtctr r6
- stwu r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r5)
- mr r1,r5
- bctrl
+ stwu r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r4)
+ mr r1,r4
+ bl __do_irq
lwz r1,0(r1)
lwz r0,4(r1)
mtlr r0
diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
index 971d7e7..e59caf8 100644
--- a/arch/powerpc/kernel/misc_64.S
+++ b/arch/powerpc/kernel/misc_64.S
@@ -40,14 +40,12 @@ _GLOBAL(call_do_softirq)
mtlr r0
blr

-_GLOBAL(call_handle_irq)
- ld r8,0(r6)
+_GLOBAL(call_do_irq)
mflr r0
std r0,16(r1)
- mtctr r8
- stdu r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r5)
- mr r1,r5
- bctrl
+ stdu r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r4)
+ mr r1,r4
+ bl .__do_irq
ld r1,0(r1)
ld r0,16(r1)
mtlr r0

2013-09-24 08:05:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Tue, Sep 24, 2013 at 11:52:07AM +1000, Benjamin Herrenschmidt wrote:
> So if that holds, we have a solid way to do per-cpu. On one side, I tend
> to think that r13 being task/thread/thread_info is probably a better
> overall choice, I'm worried that going in a different direction than x86
> means generic code will get "tuned" to use per-cpu for performance
> critical stuff rather than task/thread/thread_info in inflexible ways.

The plus side of per-cpu over per-task is that one typically has a lot
fewer cpus than tasks. Also, it's far easier/cheaper to iterate cpus than
it is to iterate tasks.

2013-09-24 08:18:19

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Tue, 2013-09-24 at 10:04 +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2013 at 11:52:07AM +1000, Benjamin Herrenschmidt wrote:
> > So if that holds, we have a solid way to do per-cpu. On one side, I tend
> > to think that r13 being task/thread/thread_info is probably a better
> > overall choice, I'm worried that going in a different direction than x86
> > means generic code will get "tuned" to use per-cpu for performance
> > critical stuff rather than task/thread/thread_info in inflexible ways.
>
> The plus side of per-cpu over per-task is that one typically has a lot
> fewer cpus than tasks. Also, it's far easier/cheaper to iterate cpus than
> it is to iterate tasks.

I don't see how that relates to the above though... ie, how the number
of CPUs vs. tasks and the relative ease of iteration relates to how fast
we access the "current" cpu and the "current" task ?

The thing is, if I use r13 as "current" (and put thread_info in the task
struct), virtually *all* my accesses to task, thread and thread_info
structs become a single load or store instruction from r13.

On the other hand, access to "my_cpu_offset" for per-cpu remains (since
that's what we do today already) a load to get the offset and an addition
+ access. (ie, I would stash the cpu offset into the thread info and
context switch it).
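
A sketch of that stashing scheme (the cpu_offset field and the helper
are hypothetical):

/* One load off r13 instead of today's PACA indirection */
#define __my_cpu_offset (current_thread_info()->cpu_offset)

/* Refreshed whenever a task is switched in on a cpu */
static inline void refresh_cpu_offset(struct thread_info *ti, int cpu)
{
        ti->cpu_offset = __per_cpu_offset[cpu];
}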

If I use r13 as "my_cpu_offset", then I might be able to skip a load for
per-cpu accesses to the current cpu, making them a bit faster, but add
an indirection for "current" and thread_info.

It's basically a question of where do I put the extra load and what has
the bigger net benefit. I tend to think we access "current" a LOT more
than per-cpu but I might be wrong.

Additionally, using r13 for "current" removes all the problems with gcc
copying the value across preempt_disable/enable etc... while using it
as per-cpu means they remain. We think we have a fix (which will involve
a special preempt_barrier() added to preempt_disable/enable and irq
enable/disable), but it's still not as nice as "the problem just doesn't
exist" :-)

Cheers,
Ben.

2013-09-24 08:21:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Tue, Sep 24, 2013 at 06:16:53PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2013-09-24 at 10:04 +0200, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2013 at 11:52:07AM +1000, Benjamin Herrenschmidt wrote:
> > > So if that holds, we have a solid way to do per-cpu. On one side, I tend
> > > to think that r13 being task/thread/thread_info is probably a better
> > > overall choice, I'm worried that going in a different direction than x86
> > > means generic code will get "tuned" to use per-cpu for performance
> > > critical stuff rather than task/thread/thread_info in inflexible ways.
> >
> > The plus side of per-cpu over per-task is that one typically has a lot
> > fewer cpus than tasks. Also, it's far easier/cheaper to iterate cpus than
> > it is to iterate tasks.
>
> I don't see how that relates to the above though...

It was a comment on the increase of per-cpu usage in generic code.

2013-09-24 09:32:47

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Tue, 2013-09-24 at 10:21 +0200, Peter Zijlstra wrote:
> > I don't see how that relates to the above though...
>
> It was a comment on the increase of per-cpu usage in generic code.

Oh I see. Ok, but some things are intrinsically per-task, so if you use
per-cpu in that case you end up context switching them, at least that's
what Linus alluded to. It would be nice if we designed such constructs
so that we can still access them via the task struct on archs where such
accesses are cheap.

I think if/when we get rid of using r13 for our arch specific "PACA", it
becomes easy for me to experiment with using it for either per-cpu or
current and do some measurements. It also means I can easily switch "the
other way" if I need to. I'll dig into that, it will probably take some
time before I sanitize our code enough anyway.

Cheers,
Ben.

2013-09-24 13:56:29

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Tue, Sep 24, 2013 at 02:42:57PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2013-09-24 at 04:44 +0200, Frederic Weisbecker wrote:
> > So the safest way to fix this is to unconditionally call do_softirq()
> > from irq_exit().
> > A performance penalty may come along but safety comes first.
> >
> > We should probably do that and work on longer term solutions (Kconfig
> > based arch switch, etc...)
> > for the next merge window?
>
> As you prefer, though I'm keen on getting the "fast" version in RHEL7 if
> RH will take it :-)

So what is the fast version? Converting __do_softirq() to do_softirq()
unconditionally.

RH will accept any fix that goes upstream.

>
> From the generic code POV, it's a one-liner #ifdef to select between
> do_softirq() and __do_softirq(), right ? Then it's up to the arch to
> #define I_CAN_DO_FAST !

I'd rather say #define I_CAN_DO_SAFE :)

But I guess the kind of symbol we want is some ARCH_HAS_IRQ_STACK_LOW_HANDLER

>
> > I'll respin the series plus the regression fix, unless somebody has a
> > better solution.
>
> Cheers,
> Ben.
>
>

2013-09-24 19:27:33

by Chris Metcalf

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On 9/23/2013 4:57 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2013-09-23 at 13:59 -0400, Chris Metcalf wrote:
>> We just came up against this on tilegx with a customer bug report in a
>> PREEMPT environment. On tile, %tp is a GPR that points to the percpu area.
>> The following seems to be the right abstraction -- though I'd also argue
>> that letting barrier() clobber not just memory, but %tp, might be a better
>> solution, but it's not clear what the best way is to do per-architecture
>> overrides of per-compiler definitions like barrier(). See also the ARM v7
>> code, which has to do something similar, though their percpu pointer is
>> not a GPR, which changes the tradeoffs somewhat.
> Hrm, if I read correctly, what you did is read "tp" into another
> register *and* also mark that action as clobbering the top word of the stack ?
>
> I don't quite get what the stack clobber brings you here and how it works
> around the fact that gcc might still cache that copy of tp into another
> register across preempt_enable/disable...

Correct. The idea is that since gcc knows that the outputs of the asm are
dependent on memory, when it sees a memory clobber it knows it has to
instantiate the asm again to get a new value of the asm outputs. gcc will know
that any copies of tp in other registers are also dependent on memory and
will drop all of them across the barrier() as well.


> It's hard to tell with gcc ... the best I've had so far as an option was
> something that would mark my per-cpu register (r13) *itself* as clobbered...

Well, as I said above, that would be better, but it requires providing an
alternate definition of barrier() that is per-architecture, not just
per-compiler. If there's interest in pursuing a solution like that, it
would be technically somewhat better; in particular, with PREEMPT_NONE,
you could treat the "tp" register variable as locally scoped in an inline, and
the compiler wouldn't have to reload it after function calls. Presumably
we'd need to pick an asm header that could provide an arch_barrier_clobbers
string that would be added to barrier() for gcc if it were defined.

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2013-09-24 20:57:13

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Tue, 2013-09-24 at 15:56 +0200, Frederic Weisbecker wrote:
> On Tue, Sep 24, 2013 at 02:42:57PM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2013-09-24 at 04:44 +0200, Frederic Weisbecker wrote:
> > > So the safest way to fix this is to unconditionally call do_softirq()
> > > from irq_exit().
> > > A performance penalty may come along but safety comes first.
> > >
> > > We should probably do that and work on longer term solutions (Kconfig
> > > based arch switch, etc...)
> > > for the next merge window?
> >
> > As you prefer, though I'm keen on getting the "fast" version in RHEL7 if
> > RH will take it :-)
>
> So what is the fast version? Converting __do_softirq() to do_softirq()
> unconditionally.
>
> RH will accept any fix that goes upstream.

No, me fixing powerpc do_IRQ to run irq_exit on the irq stack, and
your fix for everybody else, with an ifdef such that x86_64 and powerpc
get to skip the additional stack switch.

> >
> > From the generic code POV, it's a one-liner #ifdef to select between
> > do_softirq() and __do_softirq(), right ? Then it's up to the arch to
> > #define I_CAN_DO_FAST !
>
> I'd rather say #define I_CAN_DO_SAFE :)
>
> But I guess the kind of symbol we want is some ARCH_HAS_IRQ_STACK_LOW_HANDLER

ARCH_IRQ_EXIT_ON_IRQ_STACK

Cheers,
Ben.

> >
> > > I'll respin the series plus the regression fix, unless somebody has a
> > > better solution.
> >
> > Cheers,
> > Ben.
> >
> >
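
For reference, the generic-code side of that one-liner switch might look
like the following sketch in kernel/softirq.c (the exact form that lands
upstream may differ):

static inline void invoke_softirq(void)
{
#ifdef ARCH_IRQ_EXIT_ON_IRQ_STACK
        /*
         * irq_exit() already runs on the irq stack, so softirqs can be
         * handled inline without an extra stack switch.
         */
        __do_softirq();
#else
        /* Otherwise switch to the dedicated softirq stack. */
        do_softirq();
#endif
}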

2013-09-24 20:59:22

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Tue, 2013-09-24 at 15:27 -0400, Chris Metcalf wrote:
> > It's hard to tell with gcc ... the best I've had so far as an option was
> > something that would mark my per-cpu register (r13) *itself* as clobbered...
>
> Well, as I said above, that would be better, but it requires providing an
> alternate definition of barrier() that is per-architecture, not just
> per-compiler.

My compiler people tell me "clobbered" is wrong. It would tell the
compiler that barrier() damages r13 (or whatever other register you use)
and make it do exactly the wrong thing, which is to back the value up
before the barrier and use the backup :-)

I'm told what we need is an empty asm that marks r13 as an *output* (and
possibly an input as well, to be safe).

I will experiment.

> If there's interest in pursuing a solution like that, it
> would be technically somewhat better; in particular, with PREEMPT_NONE,
> you could treat the "tp" register int as locally scoped in an inline, and
> the compiler wouldn't have to reload it after function calls. Presumably
> we'd need to pick an asm header that could provide an arch_barrier_clobbers
> string that would be added to barrier() for gcc if it were defined.

My idea was to add a preempt_barrier() and put that in preempt_enable/disable
and the various irq enable/disable.

Cheers,
Ben.

2013-09-25 08:46:50

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

On Wed, Sep 25, 2013 at 06:55:47AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2013-09-24 at 15:56 +0200, Frederic Weisbecker wrote:
> > On Tue, Sep 24, 2013 at 02:42:57PM +1000, Benjamin Herrenschmidt wrote:
> > > On Tue, 2013-09-24 at 04:44 +0200, Frederic Weisbecker wrote:
> > > > So the safest way to fix this is to unconditionally call do_softirq()
> > > > from irq_exit().
> > > > A performance penalty may come along but safety comes first.
> > > >
> > > > We should probably do that and work on longer term solutions (Kconfig
> > > > based arch switch, etc...)
> > > > for the next merge window?
> > >
> > > As you prefer, though I'm keen on getting the "fast" version in RHEL7 if
> > > RH will take it :-)
> >
> > So what is the fast version? Converting __do_softirq() to do_softirq()
> > unconditionally.
> >
> > RH will accept any fix that goes upstream.
>
> No, me fixing powerpc do_IRQ to run irq_exit on the irq stack, and
> your fix for everybody else, with an ifdef such that x86_64 and powerpc
> get to skip the additional stack switch.
>
> > >
> > > From the generic code POV, it's a one-liner #ifdef to select between
> > > do_softirq() and __do_softirq(), right ? Then it's up to the arch to
> > > #define I_CAN_DO_FAST !
> >
> > I'd rather say #define I_CAN_DO_SAFE :)
> >
> > But I guess the kind of symbol we want is some ARCH_HAS_IRQ_STACK_LOW_HANDLER
>
> ARCH_IRQ_EXIT_ON_IRQ_STACK

Ok, I'll pick this one.

Thanks!

>
> Cheers,
> Ben.
>
> > >
> > > > I'll respin the series plus the regression fix, unless somebody has a
> > > > better solution.
> > >
> > > Cheers,
> > > Ben.
> > >
> > >
>
>