Subject: [PATCH] Revert "x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT"

where do I start. Let me explain what is going on here. The code
sequence
| pushf
| pop %edx
| or $0x1,%dh
| push %edx
| mov $0xe0,%eax
| popf
| sysenter

triggers the bug. On 64bit kernel we see the double fault (with 32bit and
64bit userland) and on 32bit kernel there is no problem. The reporter said
that double fault does not happen on 64bit kernel with 64bit userland and
this is because in that case the VDSO uses the "syscall" interface instead
of "sysenter".

The bug. "popf" loads the flags with the TF bit set which enables
"single stepping" and this leads to a debug exception. Usually on 64bit
we have a special IST stack for the debug exception. Due to patch [0] we
do not use the IST stack but the kernel stack instead. On 64bit the
sysenter instruction starts in kernel with the stack address NULL. The
code sequence above enters the debug exception (TF flag) after the
sysenter instruction was executed which sets the stack pointer to NULL
and we have a fault (it seems that the debug exception saves some bytes
on the stack).
To fix the double fault I'm going to drop patch [0]. It is completely
pointless. In do_debug() and do_stack_segment() we disable preemption
which means the task can't leave the CPU. So it does not matter if we run
on IST or on kernel stack.
There is a patch [1] which drops preempt_disable() call for a 32bit
kernel but not for 64bit so there should be no regression.
And [1] seems valid even for this code sequence. We enter the debug
exception with a 256bytes long per cpu stack and migrate to the kernel
stack before calling do_debug().

[0] x86-disable-debug-stack.patch
[1] fix-rt-int3-x86_32-3.2-rt.patch

Reported-by: Brian Silverman <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/x86/include/asm/page_64_types.h | 21 ++++++---------------
arch/x86/kernel/cpu/common.c | 2 --
arch/x86/kernel/dumpstack_64.c | 4 ----
3 files changed, 6 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 695e04d..43dcd80 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -14,21 +14,12 @@
#define IRQ_STACK_ORDER 2
#define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER)

-#ifdef CONFIG_PREEMPT_RT_FULL
-# define STACKFAULT_STACK 0
-# define DOUBLEFAULT_STACK 1
-# define NMI_STACK 2
-# define DEBUG_STACK 0
-# define MCE_STACK 3
-# define N_EXCEPTION_STACKS 3 /* hw limit: 7 */
-#else
-# define STACKFAULT_STACK 1
-# define DOUBLEFAULT_STACK 2
-# define NMI_STACK 3
-# define DEBUG_STACK 4
-# define MCE_STACK 5
-# define N_EXCEPTION_STACKS 5 /* hw limit: 7 */
-#endif
+#define STACKFAULT_STACK 1
+#define DOUBLEFAULT_STACK 2
+#define NMI_STACK 3
+#define DEBUG_STACK 4
+#define MCE_STACK 5
+#define N_EXCEPTION_STACKS 5 /* hw limit: 7 */

#define PUD_PAGE_SIZE (_AC(1, UL) << PUD_SHIFT)
#define PUD_PAGE_MASK (~(PUD_PAGE_SIZE-1))
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c0dcf06..2793d1f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1105,9 +1105,7 @@ DEFINE_PER_CPU(struct task_struct *, fpu_owner_task);
*/
static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
[0 ... N_EXCEPTION_STACKS - 1] = EXCEPTION_STKSZ,
-#if DEBUG_STACK > 0
[DEBUG_STACK - 1] = DEBUG_STKSZ
-#endif
};

static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 52b4bcd..addb207 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -21,14 +21,10 @@
(N_EXCEPTION_STACKS + DEBUG_STKSZ/EXCEPTION_STKSZ - 2)

static char x86_stack_ids[][8] = {
-#if DEBUG_STACK > 0
[ DEBUG_STACK-1 ] = "#DB",
-#endif
[ NMI_STACK-1 ] = "NMI",
[ DOUBLEFAULT_STACK-1 ] = "#DF",
-#if STACKFAULT_STACK > 0
[ STACKFAULT_STACK-1 ] = "#SS",
-#endif
[ MCE_STACK-1 ] = "#MC",
#if DEBUG_STKSZ > EXCEPTION_STKSZ
[ N_EXCEPTION_STACKS ...
--
1.8.5.2


2014-01-04 18:18:12

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Revert "x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT"

On Fri, Jan 03, 2014 at 02:55:48PM +0100, Sebastian Andrzej Siewior wrote:
> where do I start. Let me explain what is going on here. The code
> sequence

Yes the IST stacks are needed for correctness, even in more cases than
the example below. You cannot just disable them, just because you don't
like them.

-Andi

2014-01-05 04:45:55

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] Revert "x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT"

On Sat, 2014-01-04 at 19:18 +0100, Andi Kleen wrote:
> On Fri, Jan 03, 2014 at 02:55:48PM +0100, Sebastian Andrzej Siewior wrote:
> > where do I start. Let me explain what is going on here. The code
> > sequence
>
> Yes the IST stacks are needed for correctness, even in more cases than
> the example below. You cannot just disable them, just because you don't
> like them.

You had a better reason than dislike.

<quote>
From: Andi Kleen <[email protected]>
Date: Fri, 3 Jul 2009 08:44:10 -0500
Subject: [PATCH 209/303] x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT

Normally the x86-64 trap handlers for debug/int 3/stack fault run
on a special interrupt stack to make them more robust
when dealing with kernel code.

The PREEMPT_RT kernel can sleep in locks even while allocating
GFP_ATOMIC memory. When one of these trap handlers needs to send
real time signals for ptrace it allocates memory and could then
try to to schedule. But it is not allowed to schedule on a
IST stack. This can cause warnings and hangs.

This patch disables the IST stacks for these handlers for PREEMPT_RT
kernel. Instead let them run on the normal process stack.

...

A better solution would be to use similar logic as the NMI "paranoid"
path: check if signal is for user space, if yes go back to entry.S, switch stack,
call sync_regs, then do the signal sending etc.

</quote>

Or perhaps tell sleeping locks to spin in annoying spots? I converted
rt spinlocks globally to preemptible spinning locks (wasn't pretty, but
worked), so seems that could work.

-Mike

2014-01-05 05:05:49

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] Revert "x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT"

On Sun, 2014-01-05 at 05:45 +0100, Mike Galbraith wrote:

> Or perhaps tell sleeping locks to spin in annoying spots? I converted
> rt spinlocks globally to preemptible spinning locks (wasn't pretty, but
> worked), so seems that could work.

Bah. Nope, if lock owner is on your cpu, you have to yield.

2014-01-05 18:04:51

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Revert "x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT"

On Sun, Jan 05, 2014 at 05:45:47AM +0100, Mike Galbraith wrote:
> On Sat, 2014-01-04 at 19:18 +0100, Andi Kleen wrote:
> > On Fri, Jan 03, 2014 at 02:55:48PM +0100, Sebastian Andrzej Siewior wrote:
> > > where do I start. Let me explain what is going on here. The code
> > > sequence
> >
> > Yes the IST stacks are needed for correctness, even in more cases than
> > the example below. You cannot just disable them, just because you don't
> > like them.
>
> You had a better reason than dislike.

Ah true it was me. Good point. I forgot all about that.

Probably it needs some form of the NMI style paranoid_* switch.

-Andi

Subject: Re: [PATCH] Revert "x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT"

* Andi Kleen | 2014-01-04 19:18:07 [+0100]:

>On Fri, Jan 03, 2014 at 02:55:48PM +0100, Sebastian Andrzej Siewior wrote:
>> where do I start. Let me explain what is going on here. The code
>> sequence
>
>Yes the IST stacks are needed for correctness, even in more cases than
>the example below. You cannot just disable them, just because you don't
>like them.

Andi, you were the Author of that patch.
I plan to migrate from the IST stack to the kernel stack so I can enable
preemption. This is he case on 32bit. You mention more cases than this.
Could you please give me some examples so I can consider them?

>-Andi

Sebastian