2014-10-01 18:49:10

by Andy Lutomirski

Subject: [PATCH v4 0/2] x86_64,entry: Clear NT on entry and speed up switch_to

Anish Bhatt noticed that user programs can set RFLAGS.NT before
syscall or sysenter, and the kernel entry code doesn't filter out
NT. This causes kernel C code and, depending on thread flags, the
exit slow path to run with NT set.

The former is a little bit scary (imagine calling into EFI with NT
set), and the latter will fail with #GP and send a spurious SIGSEGV.

One answer would be "don't do that". But the kernel can do better
here.

These patches filter NT on all kernel entries. For syscall (both
bitnesses), this is free. For sysenter, it seems to cost very
little (less than my ability to measure, although I didn't try that
hard). Patch 2, which isn't tagged for -stable, speeds up context
switches by avoiding saving and restoring flags, so this series
should be a decent overall performance win.

See: https://bugs.winehq.org/show_bug.cgi?id=33275

Note to bikeshedders: I have no desire to go crazy micro-optimizing
the sysenter path. :) This version seems to be good enough (and
should be a performance *increase* for most workloads).

Changes from v3:
- Added a better description of the impact in patch 1

Changes from v2:
- Move the flag fixup out of line
- Fix a CFI buglet

Changes from v1:
- Spell [email protected] correctly
- Tidy up changelog text
- Actually commit an asm constraint fix in patch 2 (egads!)
- Replace the unconditional popfq with a branch

Andy Lutomirski (2):
x86_64,entry: Filter RFLAGS.NT on entry from userspace
x86_64: Don't save flags on context switch

arch/x86/ia32/ia32entry.S | 18 +++++++++++++++++-
arch/x86/include/asm/switch_to.h | 12 ++++++++----
arch/x86/kernel/cpu/common.c | 2 +-
3 files changed, 26 insertions(+), 6 deletions(-)

--
1.9.3


2014-10-01 18:49:15

by Andy Lutomirski

Subject: [PATCH v4 1/2] x86_64,entry: Filter RFLAGS.NT on entry from userspace

The NT flag doesn't do anything in long mode other than causing IRET
to #GP. Oddly, CPL3 code can still set NT using popf.

Entry via hardware or software interrupt clears NT automatically, so
the only relevant entries are fast syscalls.

If user code causes kernel code to run with NT set, then there's at
least some (small) chance that it could cause trouble. For example,
user code could cause a call to EFI code with NT set, and who knows
what would happen? Apparently some games on Wine sometimes do
this (!), and, if an IRET return happens, they will segfault. That
segfault cannot be handled, because signal delivery fails, too.

This patch programs the CPU to clear NT on entry via SYSCALL (both
32-bit and 64-bit, by my reading of the AMD APM), and it clears NT
in software on entry via SYSENTER.

To save a few cycles, this borrows a trick from Jan Beulich in Xen:
it checks whether NT is set before trying to clear it. As a result,
it seems to have very little effect on SYSENTER performance on my
machine.

There's another minor bug fix in here: it looks like the CFI
annotations were wrong if CONFIG_AUDITSYSCALL=n.

Testers beware: on Xen, SYSENTER with NT set turns into a GPF.

I haven't touched anything on 32-bit kernels.

The syscall mask change comes from a variant of this patch by Anish
Bhatt.

Note to stable maintainers: there is no known security issue here.
A misguided program can set NT and cause the kernel to try and fail
to deliver SIGSEGV, crashing the program. This patch fixes Far Cry
on Wine: https://bugs.winehq.org/show_bug.cgi?id=33275

Cc: [email protected]
Reported-by: Anish Bhatt <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/ia32/ia32entry.S | 18 +++++++++++++++++-
arch/x86/kernel/cpu/common.c | 2 +-
2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 4299eb05023c..711de084ab57 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -151,6 +151,16 @@ ENTRY(ia32_sysenter_target)
1: movl (%rbp),%ebp
_ASM_EXTABLE(1b,ia32_badarg)
ASM_CLAC
+
+ /*
+ * Sysenter doesn't filter flags, so we need to clear NT
+ * ourselves. To save a few cycles, we can check whether
+ * NT was set instead of doing an unconditional popfq.
+ */
+ testl $X86_EFLAGS_NT,EFLAGS(%rsp) /* saved EFLAGS match cpu */
+ jnz sysenter_fix_flags
+sysenter_flags_fixed:
+
orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
CFI_REMEMBER_STATE
@@ -184,6 +194,8 @@ sysexit_from_sys_call:
TRACE_IRQS_ON
ENABLE_INTERRUPTS_SYSEXIT32

+ CFI_RESTORE_STATE
+
#ifdef CONFIG_AUDITSYSCALL
.macro auditsys_entry_common
movl %esi,%r9d /* 6th arg: 4th syscall arg */
@@ -226,7 +238,6 @@ sysexit_from_sys_call:
.endm

sysenter_auditsys:
- CFI_RESTORE_STATE
auditsys_entry_common
movl %ebp,%r9d /* reload 6th syscall arg */
jmp sysenter_dispatch
@@ -235,6 +246,11 @@ sysexit_audit:
auditsys_exit sysexit_from_sys_call
#endif

+sysenter_fix_flags:
+ pushq_cfi $(X86_EFLAGS_IF|X86_EFLAGS_FIXED)
+ popfq_cfi
+ jmp sysenter_flags_fixed
+
sysenter_tracesys:
#ifdef CONFIG_AUDITSYSCALL
testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index e4ab2b42bd6f..31265580c38a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1184,7 +1184,7 @@ void syscall_init(void)
/* Flags to clear on syscall */
wrmsrl(MSR_SYSCALL_MASK,
X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
- X86_EFLAGS_IOPL|X86_EFLAGS_AC);
+ X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
}

/*
--
1.9.3

2014-10-01 18:49:23

by Andy Lutomirski

Subject: [PATCH v4 2/2] x86_64: Don't save flags on context switch

Now that the kernel always runs with clean flags (in particular, NT
is clear), there is no need to save and restore flags on every
context switch.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/switch_to.h | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index d7f3b3b78ac3..751bf4b7bf11 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -79,12 +79,12 @@ do { \
#else /* CONFIG_X86_32 */

/* frame pointer must be last for get_wchan */
-#define SAVE_CONTEXT "pushf ; pushq %%rbp ; movq %%rsi,%%rbp\n\t"
-#define RESTORE_CONTEXT "movq %%rbp,%%rsi ; popq %%rbp ; popf\t"
+#define SAVE_CONTEXT "pushq %%rbp ; movq %%rsi,%%rbp\n\t"
+#define RESTORE_CONTEXT "movq %%rbp,%%rsi ; popq %%rbp\t"

#define __EXTRA_CLOBBER \
, "rcx", "rbx", "rdx", "r8", "r9", "r10", "r11", \
- "r12", "r13", "r14", "r15"
+ "r12", "r13", "r14", "r15", "flags"

#ifdef CONFIG_CC_STACKPROTECTOR
#define __switch_canary \
@@ -100,7 +100,11 @@ do { \
#define __switch_canary_iparam
#endif /* CC_STACKPROTECTOR */

-/* Save restore flags to clear handle leaking NT */
+/*
+ * There is no need to save or restore flags, because flags are always
+ * clean in kernel mode, with the possible exception of IOPL. Kernel IOPL
+ * has no effect.
+ */
#define switch_to(prev, next, last) \
asm volatile(SAVE_CONTEXT \
"movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */ \
--
1.9.3

2014-10-01 19:50:20

by Andy Lutomirski

Subject: Re: [PATCH v4 1/2] x86_64,entry: Filter RFLAGS.NT on entry from userspace

On Wed, Oct 1, 2014 at 11:49 AM, Andy Lutomirski <[email protected]> wrote:
> The NT flag doesn't do anything in long mode other than causing IRET
> to #GP. Oddly, CPL3 code can still set NT using popf.
>

[...]

> +
> + /*
> + * Sysenter doesn't filter flags, so we need to clear NT
> + * ourselves. To save a few cycles, we can check whether
> + * NT was set instead of doing an unconditional popfq.
> + */
> + testl $X86_EFLAGS_NT,EFLAGS(%rsp) /* saved EFLAGS match cpu */
> + jnz sysenter_fix_flags
> +sysenter_flags_fixed:
> +

Because this thread hasn't gone on long enough:

Do we need to clear IOPL here, too? With patch 2 applied, an IOPL !=
0 program can leak IOPL into another task. It should be cleared on
iret, sysexit (via popf) and sysret (directly), so this shouldn't
matter. Am I missing something?

Adding IOPL to the test will add no overhead for non-iopl-using tasks,
but it will slightly slow down 32-bit tasks that use iopl.

--Andy

2014-10-02 15:36:45

by H. Peter Anvin

Subject: Re: [PATCH v4 1/2] x86_64,entry: Filter RFLAGS.NT on entry from userspace

On 10/01/2014 12:49 PM, Andy Lutomirski wrote:
> On Wed, Oct 1, 2014 at 11:49 AM, Andy Lutomirski <[email protected]> wrote:
>> The NT flag doesn't do anything in long mode other than causing IRET
>> to #GP. Oddly, CPL3 code can still set NT using popf.
>>
>
> [...]
>
>> +
>> + /*
>> + * Sysenter doesn't filter flags, so we need to clear NT
>> + * ourselves. To save a few cycles, we can check whether
>> + * NT was set instead of doing an unconditional popfq.
>> + */
>> + testl $X86_EFLAGS_NT,EFLAGS(%rsp) /* saved EFLAGS match cpu */
>> + jnz sysenter_fix_flags
>> +sysenter_flags_fixed:
>> +
>
> Because this thread hasn't gone on long enough:
>
> Do we need to clear IOPL here, too? With patch 2 applied, an IOPL !=
> 0 program can leak IOPL into another task. It should be cleared on
> iret, sysexit (via popf) and sysret (directly), so this shouldn't
> matter. Am I missing something?
>
> Adding IOPL to the test will add no overhead for non-iopl-using tasks,
> but it will slightly slow down 32-bit tasks that use iopl.
>

As you correctly point out, IOPL is completely irrelevant in ring 0. We
have to restore the user space flags before returning to user space, so
it shouldn't matter.

-hpa

2014-10-06 16:39:23

by Andy Lutomirski

Subject: Re: [PATCH v4 0/2] x86_64,entry: Clear NT on entry and speed up switch_to

On Wed, Oct 1, 2014 at 11:49 AM, Andy Lutomirski <[email protected]> wrote:
> Anish Bhatt noticed that user programs can set RFLAGS.NT before
> syscall or sysenter, and the kernel entry code doesn't filter out
> NT. This causes kernel C code and, depending on thread flags, the
> exit slow path to run with NT set.
>

Quick ping: now that the merge window is (sort of) open, what's
happening with these patches?

Thanks,
Andy

> The former is a little bit scary (imagine calling into EFI with NT
> set), and the latter will fail with #GP and send a spurious SIGSEGV.
>
> One answer would be "don't do that". But the kernel can do better
> here.
>
> These patches filter NT on all kernel entries. For syscall (both
> bitnesses), this is free. For sysenter, it seems to cost very
> little (less than my ability to measure, although I didn't try that
> hard). Patch 2, which isn't tagged for -stable, speeds up context
> switches by avoiding saving and restoring flags, so this series
> should be a decent overall performance win.
>
> See: https://bugs.winehq.org/show_bug.cgi?id=33275
>
> Note to bikeshedders: I have no desire to go crazy micro-optimizing
> the sysenter path. :) This version seems to be good enough (and
> should be a performance *increase* for most workloads).
>
> Changes from v3:
> - Added a better description of the impact in patch 1
>
> Changes from v2:
> - Move the flag fixup out of line
> - Fix a CFI buglet
>
> Changes from v1:
> - Spell [email protected] correctly
> - Tidy up changelog text
> - Actually commit an asm constraint fix in patch 2 (egads!)
> - Replace the unconditional popfq with a branch
>
> Andy Lutomirski (2):
> x86_64,entry: Filter RFLAGS.NT on entry from userspace
> x86_64: Don't save flags on context switch
>
> arch/x86/ia32/ia32entry.S | 18 +++++++++++++++++-
> arch/x86/include/asm/switch_to.h | 12 ++++++++----
> arch/x86/kernel/cpu/common.c | 2 +-
> 3 files changed, 26 insertions(+), 6 deletions(-)
>
> --
> 1.9.3
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-10-06 16:41:41

by H. Peter Anvin

Subject: Re: [PATCH v4 0/2] x86_64,entry: Clear NT on entry and speed up switch_to

On 10/06/2014 09:39 AM, Andy Lutomirski wrote:
> On Wed, Oct 1, 2014 at 11:49 AM, Andy Lutomirski <[email protected]> wrote:
>> Anish Bhatt noticed that user programs can set RFLAGS.NT before
>> syscall or sysenter, and the kernel entry code doesn't filter out
>> NT. This causes kernel C code and, depending on thread flags, the
>> exit slow path to run with NT set.
>>
>
> Quick ping: now that the merge window is (sort of) open, what's
> happening with these patches?
>
> Thanks,
> Andy
>

My preference would be to queue them up for 3.19 since they arrived late
in the 3.18 cycle.

-hpa

2014-10-06 16:42:26

by H. Peter Anvin

Subject: Re: [PATCH v4 1/2] x86_64,entry: Filter RFLAGS.NT on entry from userspace

On 10/01/2014 12:49 PM, Andy Lutomirski wrote:
> On Wed, Oct 1, 2014 at 11:49 AM, Andy Lutomirski <[email protected]> wrote:
>> The NT flag doesn't do anything in long mode other than causing IRET
>> to #GP. Oddly, CPL3 code can still set NT using popf.
>>
>
> [...]
>
>> +
>> + /*
>> + * Sysenter doesn't filter flags, so we need to clear NT
>> + * ourselves. To save a few cycles, we can check whether
>> + * NT was set instead of doing an unconditional popfq.
>> + */
>> + testl $X86_EFLAGS_NT,EFLAGS(%rsp) /* saved EFLAGS match cpu */
>> + jnz sysenter_fix_flags
>> +sysenter_flags_fixed:
>> +
>
> Because this thread hasn't gone on long enough:
>
> Do we need to clear IOPL here, too? With patch 2 applied, an IOPL !=
> 0 program can leak IOPL into another task. It should be cleared on
> iret, sysexit (via popf) and sysret (directly), so this shouldn't
> matter. Am I missing something?
>
> Adding IOPL to the test will add no overhead for non-iopl-using tasks,
> but it will slightly slow down 32-bit tasks that use iopl.
>

I don't see why. IOPL has no effect in the kernel.

-hpa

2014-10-06 16:46:19

by Andy Lutomirski

Subject: Re: [PATCH v4 0/2] x86_64,entry: Clear NT on entry and speed up switch_to

On Mon, Oct 6, 2014 at 9:41 AM, H. Peter Anvin <[email protected]> wrote:
> On 10/06/2014 09:39 AM, Andy Lutomirski wrote:
>> On Wed, Oct 1, 2014 at 11:49 AM, Andy Lutomirski <[email protected]> wrote:
>>> Anish Bhatt noticed that user programs can set RFLAGS.NT before
>>> syscall or sysenter, and the kernel entry code doesn't filter out
>>> NT. This causes kernel C code and, depending on thread flags, the
>>> exit slow path to run with NT set.
>>>
>>
>> Quick ping: now that the merge window is (sort of) open, what's
>> happening with these patches?
>>
>> Thanks,
>> Andy
>>
>
> My preference would be to queue them up for 3.19 since they arrived late
> in the 3.18 cycle.

I see nothing wrong with deferring patch 2 for 3.19, but deferring
patch 1, which is tagged for stable, for an entire release seems a bit
silly to me.

--Andy

2014-10-06 16:57:38

by H. Peter Anvin

Subject: Re: [PATCH v4 0/2] x86_64,entry: Clear NT on entry and speed up switch_to

On 10/06/2014 09:45 AM, Andy Lutomirski wrote:
> On Mon, Oct 6, 2014 at 9:41 AM, H. Peter Anvin <[email protected]> wrote:
>> On 10/06/2014 09:39 AM, Andy Lutomirski wrote:
>>> On Wed, Oct 1, 2014 at 11:49 AM, Andy Lutomirski <[email protected]> wrote:
>>>> Anish Bhatt noticed that user programs can set RFLAGS.NT before
>>>> syscall or sysenter, and the kernel entry code doesn't filter out
>>>> NT. This causes kernel C code and, depending on thread flags, the
>>>> exit slow path to run with NT set.
>>>>
>>>
>>> Quick ping: now that the merge window is (sort of) open, what's
>>> happening with these patches?
>>>
>>> Thanks,
>>> Andy
>>>
>>
>> My preference would be to queue them up for 3.19 since they arrived late
>> in the 3.18 cycle.
>
> I see nothing wrong with deferring patch 2 for 3.19, but deferring
> patch 1, which is tagged for stable, for an entire release seems a bit
> silly to me.
>

Yes, that is probably going to get pushed later in this merge window;
that is usually how we deal with that sort of thing.

-hpa

Subject: [tip:x86/urgent] x86_64, entry: Filter RFLAGS.NT on entry from userspace

Commit-ID: 8c7aa698baca5e8f1ba9edb68081f1e7a1abf455
Gitweb: http://git.kernel.org/tip/8c7aa698baca5e8f1ba9edb68081f1e7a1abf455
Author: Andy Lutomirski <[email protected]>
AuthorDate: Wed, 1 Oct 2014 11:49:04 -0700
Committer: H. Peter Anvin <[email protected]>
CommitDate: Mon, 6 Oct 2014 10:53:26 -0700

x86_64, entry: Filter RFLAGS.NT on entry from userspace

The NT flag doesn't do anything in long mode other than causing IRET
to #GP. Oddly, CPL3 code can still set NT using popf.

Entry via hardware or software interrupt clears NT automatically, so
the only relevant entries are fast syscalls.

If user code causes kernel code to run with NT set, then there's at
least some (small) chance that it could cause trouble. For example,
user code could cause a call to EFI code with NT set, and who knows
what would happen? Apparently some games on Wine sometimes do
this (!), and, if an IRET return happens, they will segfault. That
segfault cannot be handled, because signal delivery fails, too.

This patch programs the CPU to clear NT on entry via SYSCALL (both
32-bit and 64-bit, by my reading of the AMD APM), and it clears NT
in software on entry via SYSENTER.

To save a few cycles, this borrows a trick from Jan Beulich in Xen:
it checks whether NT is set before trying to clear it. As a result,
it seems to have very little effect on SYSENTER performance on my
machine.

There's another minor bug fix in here: it looks like the CFI
annotations were wrong if CONFIG_AUDITSYSCALL=n.

Testers beware: on Xen, SYSENTER with NT set turns into a GPF.

I haven't touched anything on 32-bit kernels.

The syscall mask change comes from a variant of this patch by Anish
Bhatt.

Note to stable maintainers: there is no known security issue here.
A misguided program can set NT and cause the kernel to try and fail
to deliver SIGSEGV, crashing the program. This patch fixes Far Cry
on Wine: https://bugs.winehq.org/show_bug.cgi?id=33275

Cc: <[email protected]>
Reported-by: Anish Bhatt <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Link: http://lkml.kernel.org/r/395749a5d39a29bd3e4b35899cf3a3c1340e5595.1412189265.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/ia32/ia32entry.S | 18 +++++++++++++++++-
arch/x86/kernel/cpu/common.c | 2 +-
2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 4299eb0..711de08 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -151,6 +151,16 @@ ENTRY(ia32_sysenter_target)
1: movl (%rbp),%ebp
_ASM_EXTABLE(1b,ia32_badarg)
ASM_CLAC
+
+ /*
+ * Sysenter doesn't filter flags, so we need to clear NT
+ * ourselves. To save a few cycles, we can check whether
+ * NT was set instead of doing an unconditional popfq.
+ */
+ testl $X86_EFLAGS_NT,EFLAGS(%rsp) /* saved EFLAGS match cpu */
+ jnz sysenter_fix_flags
+sysenter_flags_fixed:
+
orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
CFI_REMEMBER_STATE
@@ -184,6 +194,8 @@ sysexit_from_sys_call:
TRACE_IRQS_ON
ENABLE_INTERRUPTS_SYSEXIT32

+ CFI_RESTORE_STATE
+
#ifdef CONFIG_AUDITSYSCALL
.macro auditsys_entry_common
movl %esi,%r9d /* 6th arg: 4th syscall arg */
@@ -226,7 +238,6 @@ sysexit_from_sys_call:
.endm

sysenter_auditsys:
- CFI_RESTORE_STATE
auditsys_entry_common
movl %ebp,%r9d /* reload 6th syscall arg */
jmp sysenter_dispatch
@@ -235,6 +246,11 @@ sysexit_audit:
auditsys_exit sysexit_from_sys_call
#endif

+sysenter_fix_flags:
+ pushq_cfi $(X86_EFLAGS_IF|X86_EFLAGS_FIXED)
+ popfq_cfi
+ jmp sysenter_flags_fixed
+
sysenter_tracesys:
#ifdef CONFIG_AUDITSYSCALL
testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index e4ab2b4..3126558 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1184,7 +1184,7 @@ void syscall_init(void)
/* Flags to clear on syscall */
wrmsrl(MSR_SYSCALL_MASK,
X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
- X86_EFLAGS_IOPL|X86_EFLAGS_AC);
+ X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
}

/*

2014-11-01 00:33:11

by Rusty Russell

Subject: Re: [PATCH v4 0/2] x86_64,entry: Clear NT on entry and speed up switch_to

Andy Lutomirski <[email protected]> writes:
> Anish Bhatt noticed that user programs can set RFLAGS.NT before
> syscall or sysenter, and the kernel entry code doesn't filter out
> NT. This causes kernel C code and, depending on thread flags, the
> exit slow path to run with NT set.

OK, this causes oopsen as a guest under kvm for me. Details below:

commit 8c7aa698baca5e8f1ba9edb68081f1e7a1abf455
Author: Andy Lutomirski <[email protected]>
Date: Wed Oct 1 11:49:04 2014 -0700

x86_64, entry: Filter RFLAGS.NT on entry from userspace

Some dmesg:

[ 0.820982] serio: i8042 KBD port at 0x60,0x64 irq 1
[ 0.822118] serio: i8042 AUX port at 0x60,0x64 irq 12
[ 0.824445] mousedev: PS/2 mouse device common for all mice
[ 0.827262] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input1
[ 0.830249] rtc_cmos 00:00: RTC can wake from S4
[ 0.831830] rtc_cmos 00:00: rtc core: registered rtc_cmos as rtc0
[ 0.833314] rtc_cmos 00:00: alarms up to one day, 114 bytes nvram, hpet irqs
[ 0.835128] device-mapper: uevent: version 1.0.3
[ 0.836526] device-mapper: ioctl: 4.27.0-ioctl (2013-10-30) initialised: [email protected]
[ 0.838566] TCP: cubic registered
[ 0.839891] NET: Registered protocol family 10
[ 0.841868] NET: Registered protocol family 17
[ 0.843005] Key type dns_resolver registered
[ 0.845481] registered taskstats version 1
[ 0.847120] kworker/u2:2 (48) used greatest stack depth: 14400 bytes left
[ 0.849147] kworker/u2:3 (50) used greatest stack depth: 14048 bytes left
[ 0.850779] Key type trusted registered
[ 0.853360] Key type encrypted registered
[ 0.855561] AppArmor: AppArmor sha1 policy hashing enabled
[ 0.856768] cryptomgr_probe (63) used greatest stack depth: 13712 bytes left
[ 0.858156] evm: HMAC attrs: 0x1
[ 0.859577] Magic number: 2:172:455
[ 0.860833] rtc_cmos 00:00: setting system clock to 2014-10-31 23:26:48 UTC (1414798008)
[ 0.862465] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 0.863663] EDD information not available.
[ 0.964548] ata2.00: ATAPI: QEMU DVD-ROM, 2.1.0, max UDMA/100
[ 0.966081] ata2.00: configured for MWDMA2
[ 0.968174] scsi 1:0:0:0: CD-ROM QEMU QEMU DVD-ROM 2.1. PQ: 0 ANSI: 5
[ 0.977913] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray
[ 0.978861] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 0.981138] sr 1:0:0:0: Attached scsi generic sg0 type 5
[ 0.982634] md: Waiting for all devices to be available before autodetect
[ 0.986583] md: If you don't use raid, use raid=noautodetect
[ 0.990236] md: Autodetecting RAID arrays.
[ 0.991035] md: Scanned 0 and added 0 devices.
[ 0.991815] md: autorun ...
[ 0.992215] md: ... autorun DONE.
[ 0.994068] EXT3-fs (vda1): error: couldn't mount because of unsupported optional features (240)
[ 0.996331] EXT4-fs (vda1): couldn't mount as ext2 due to feature incompatibilities
[ 1.003145] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
[ 1.006600] VFS: Mounted root (ext4 filesystem) readonly on device 253:1.
[ 1.010007] devtmpfs: mounted
[ 1.011632] debug: unmapping init [mem 0xffffffff81d2b000-0xffffffff81e6ffff]
[ 1.012631] Write protecting the kernel read-only data: 12288k
[ 1.013571] debug: unmapping init [mem 0xffff88000170d000-0xffff8800017fffff]
[ 1.014639] debug: unmapping init [mem 0xffff880001b21000-0xffff880001bfffff]
[ 1.123201] random: init urandom read with 8 bits of entropy available
[ 1.126953] BUG: unable to handle kernel paging request at ffff88001da4c018
[ 1.128482] IP: [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.129513] PGD 2d6c067 PUD 2d6d067 PMD 1fdf4067 PTE 800000001da4c060
[ 1.129513] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 1.129513] Modules linked in:
[ 1.129513] CPU: 0 PID: 69 Comm: init Not tainted 3.17.0-rc7+ #245
[ 1.129513] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
[ 1.129513] task: ffff88001da08000 ti: ffff88001da48000 task.ti: ffff88001da48000
[ 1.129513] RIP: 0010:[<ffffffff8170703d>] [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.129513] RSP: 0018:ffff88001da4bf88 EFLAGS: 00010296
[ 1.129513] RAX: 0000000000000137 RBX: 00000000f754e730 RCX: 000000000000000c
[ 1.129513] RDX: 00000000f7711000 RSI: 0000000000000000 RDI: 00000000f77c3040
[ 1.129513] RBP: 00000000ffca97c8 R08: ffffffff8138aa0b R09: 00000000ffcaba58
[ 1.129513] R10: 00000000f77a1b70 R11: 0000000000000000 R12: 0000000000000000
[ 1.129513] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1.129513] FS: 0000000000000000(0000) GS:ffff88001fa00000(0063) knlGS:00000000f754e6c0
[ 1.129513] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
[ 1.129513] CR2: ffff88001da4c018 CR3: 000000001da2c000 CR4: 00000000000006f0
[ 1.129513] Stack:
[ 1.129513] 0000000000000000 0000000000000000 00000000ffcaba58 ffffffff8138aa0b
[ 1.129513] 0000000000000137 000000000000000c 00000000f7711000 0000000000000000
[ 1.129513] 00000000f77c3040 0000000000000137 00000000f77a1b70 0000000000000023
[ 1.129513] Call Trace:
[ 1.129513] [<ffffffff8138aa0b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1.129513] Code: c0 41 52 50 fc 48 83 ec 48 48 89 7c 24 40 48 89 74 24 38 48 89 54 24 30 48 89 4c 24 28 48 89 44 24 20 66 66 90 8b 6d 00 66 66 90 <f7> 84 24 90 00 00 00 00 40 00 00 0f 85 2f 01 00 00 83 8c 24 8c
[ 1.129513] RIP [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.129513] RSP <ffff88001da4bf88>
[ 1.129513] CR2: ffff88001da4c018
[ 1.129513] ---[ end trace 7d7a8bfdc14fe3bb ]---
[ 1.129513] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:41
[ 1.129513] in_atomic(): 0, irqs_disabled(): 1, pid: 69, name: init
[ 1.129513] INFO: lockdep is turned off.
[ 1.129513] irq event stamp: 62
[ 1.129513] hardirqs last enabled at (61): [<ffffffff81705909>] retint_swapgs+0xe/0x13
[ 1.129513] hardirqs last disabled at (62): [<ffffffff81706b13>] error_sti+0x5/0x6
[ 1.129513] softirqs last enabled at (0): [<ffffffff81054a28>] copy_process.part.30+0x5b8/0x1c70
[ 1.129513] softirqs last disabled at (0): [< (null)>] (null)
[ 1.129513] CPU: 0 PID: 69 Comm: init Tainted: G D 3.17.0-rc7+ #245
[ 1.129513] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
[ 1.129513] 0000000000000009 ffff88001da4bc08 ffffffff816fbd34 ffff88001f7d35d8
[ 1.129513] ffff88001da4bc18 ffffffff8107d181 ffff88001da4bc38 ffffffff81702314
[ 1.129513] ffff88001da08000 ffff88001da08000 ffff88001da4bc58 ffffffff81067464
[ 1.129513] Call Trace:
[ 1.129513] [<ffffffff816fbd34>] dump_stack+0x4d/0x66
[ 1.129513] [<ffffffff8107d181>] __might_sleep+0xf1/0x120
[ 1.129513] [<ffffffff81702314>] down_read+0x24/0x70
[ 1.129513] [<ffffffff81067464>] exit_signals+0x24/0x130
[ 1.129513] [<ffffffff81058743>] do_exit+0xb3/0xbd0
[ 1.129513] [<ffffffff810b4328>] ? kmsg_dump+0x108/0x120
[ 1.129513] [<ffffffff810b4242>] ? kmsg_dump+0x22/0x120
[ 1.129513] [<ffffffff810064eb>] oops_end+0x8b/0xd0
[ 1.129513] [<ffffffff810452ac>] no_context+0x12c/0x380
[ 1.129513] [<ffffffff81704197>] ? _raw_spin_unlock+0x27/0x40
[ 1.129513] [<ffffffff81180dd5>] ? do_read_fault.isra.77+0xd5/0x2c0
[ 1.129513] [<ffffffff81045585>] __bad_area_nosemaphore+0x85/0x210
[ 1.129513] [<ffffffff81045723>] bad_area_nosemaphore+0x13/0x20
[ 1.129513] [<ffffffff81045bb6>] __do_page_fault+0xd6/0x5d0
[ 1.129513] [<ffffffff81045c72>] ? __do_page_fault+0x192/0x5d0
[ 1.129513] [<ffffffff8109d36f>] ? up_read+0x1f/0x40
[ 1.129513] [<ffffffff81045d74>] ? __do_page_fault+0x294/0x5d0
[ 1.129513] [<ffffffff8138aa4a>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 1.129513] [<ffffffff810460bc>] do_page_fault+0xc/0x10
[ 1.129513] [<ffffffff81706912>] page_fault+0x22/0x30
[ 1.129513] [<ffffffff8138aa0b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1.129513] [<ffffffff8170703d>] ? ia32_sysenter_target+0x4d/0x5e
[ 1.129513] [<ffffffff81705909>] ? retint_swapgs+0xe/0x13
[ 1.129513] [<ffffffff8138aa0b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1.217584] init (69) used greatest stack depth: 13528 bytes left
[ 1.229190] BUG: unable to handle kernel paging request at ffff88001da7c018
[ 1.230520] IP: [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.231890] PGD 2d6c067 PUD 2d6d067 PMD 1fdf4067 PTE 800000001da7c060
[ 1.232181] Oops: 0000 [#2] SMP DEBUG_PAGEALLOC
[ 1.232181] Modules linked in:
[ 1.232181] CPU: 0 PID: 71 Comm: init Tainted: G D 3.17.0-rc7+ #245
[ 1.232181] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
[ 1.232181] task: ffff88001d9f2110 ti: ffff88001da78000 task.ti: ffff88001da78000
[ 1.232181] RIP: 0010:[<ffffffff8170703d>] [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.232181] RSP: 0018:ffff88001da7bf88 EFLAGS: 00010296
[ 1.232181] RAX: 0000000000000137 RBX: 00000000f754e730 RCX: 000000000000000c
[ 1.232181] RDX: 00000000f7711000 RSI: 0000000000000000 RDI: 00000000f77c3040
[ 1.232181] RBP: 00000000ffca97c8 R08: ffffffff8138aa0b R09: 0000000000000000
[ 1.232181] R10: 00000000f77a1b70 R11: 0000000000000000 R12: 0000000000000000
[ 1.232181] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1.232181] FS: 0000000000000000(0000) GS:ffff88001fa00000(0063) knlGS:00000000f754e6c0
[ 1.232181] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
[ 1.232181] CR2: ffff88001da7c018 CR3: 000000001da5e000 CR4: 00000000000006f0
[ 1.232181] Stack:
[ 1.232181] 0000000000000000 0000000000000000 0000000000000000 ffffffff8138aa0b
[ 1.232181] 0000000000000137 000000000000000c 00000000f7711000 0000000000000000
[ 1.232181] 00000000f77c3040 0000000000000137 00000000f77a1b70 0000000000000023
[ 1.232181] Call Trace:
[ 1.232181] [<ffffffff8138aa0b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1.232181] Code: c0 41 52 50 fc 48 83 ec 48 48 89 7c 24 40 48 89 74 24 38 48 89 54 24 30 48 89 4c 24 28 48 89 44 24 20 66 66 90 8b 6d 00 66 66 90 <f7> 84 24 90 00 00 00 00 40 00 00 0f 85 2f 01 00 00 83 8c 24 8c
[ 1.232181] RIP [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.232181] RSP <ffff88001da7bf88>
[ 1.232181] CR2: ffff88001da7c018
[ 1.232181] ---[ end trace 7d7a8bfdc14fe3bc ]---
[ 1.265113] BUG: unable to handle kernel paging request at ffff88001da84018
[ 1.266545] IP: [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.267854] PGD 2d6c067 PUD 2d6d067 PMD 1fdf4067 PTE 800000001da84060
[ 1.268444] Oops: 0000 [#3] SMP DEBUG_PAGEALLOC
[ 1.268444] Modules linked in:
[ 1.268444] CPU: 0 PID: 72 Comm: init Tainted: G D 3.17.0-rc7+ #245
[ 1.268444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
[ 1.268444] task: ffff88001d9f4220 ti: ffff88001da80000 task.ti: ffff88001da80000
[ 1.268444] RIP: 0010:[<ffffffff8170703d>] [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.268444] RSP: 0018:ffff88001da83f88 EFLAGS: 00010296
[ 1.268444] RAX: 0000000000000137 RBX: 00000000f754e730 RCX: 000000000000000c
[ 1.268444] RDX: 00000000f7711000 RSI: 0000000000000000 RDI: 00000000f77c3040
[ 1.268444] RBP: 00000000ffca97c8 R08: ffffffff8138aa0b R09: 0000000000000000
[ 1.268444] R10: 00000000f77a1b70 R11: 0000000000000000 R12: 0000000000000000
[ 1.268444] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1.268444] FS: 0000000000000000(0000) GS:ffff88001fa00000(0063) knlGS:00000000f754e6c0
[ 1.268444] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
[ 1.268444] CR2: ffff88001da84018 CR3: 000000001da5f000 CR4: 00000000000006f0
[ 1.268444] Stack:
[ 1.268444] 0000000000000000 0000000000000000 0000000000000000 ffffffff8138aa0b
[ 1.268444] 0000000000000137 000000000000000c 00000000f7711000 0000000000000000
[ 1.268444] 00000000f77c3040 0000000000000137 00000000f77a1b70 0000000000000023
[ 1.268444] Call Trace:
[ 1.268444] [<ffffffff8138aa0b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1.268444] Code: c0 41 52 50 fc 48 83 ec 48 48 89 7c 24 40 48 89 74 24 38 48 89 54 24 30 48 89 4c 24 28 48 89 44 24 20 66 66 90 8b 6d 00 66 66 90 <f7> 84 24 90 00 00 00 00 40 00 00 0f 85 2f 01 00 00 83 8c 24 8c
[ 1.268444] RIP [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.268444] RSP <ffff88001da83f88>
[ 1.268444] CR2: ffff88001da84018
[ 1.268444] ---[ end trace 7d7a8bfdc14fe3bd ]---
[ 1.301978] init: Error while reading from descriptor: Bad file descriptor
[ 1.303740] init: hostname main process (69) killed by KILL signal
[ 1.306985] init: hwclock main process (71) killed by KILL signal
[ 1.309804] init: ureadahead main process (72) killed by KILL signal
[ 1.322693] BUG: unable to handle kernel paging request at ffff88001daa4018
[ 1.324040] IP: [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.324040] PGD 2d6c067 PUD 2d6d067 PMD 1fdf4067 PTE 800000001daa4060
[ 1.324040] Oops: 0000 [#4] SMP DEBUG_PAGEALLOC
[ 1.324040] Modules linked in:
[ 1.324040] CPU: 0 PID: 75 Comm: init Tainted: G D 3.17.0-rc7+ #245
[ 1.324040] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
[ 1.324040] task: ffff88001d9f2110 ti: ffff88001daa0000 task.ti: ffff88001daa0000
[ 1.324040] RIP: 0010:[<ffffffff8170703d>] [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.324040] RSP: 0018:ffff88001daa3f88 EFLAGS: 00010296
[ 1.324040] RAX: 0000000000000137 RBX: 00000000f754e730 RCX: 000000000000000c
[ 1.324040] RDX: 00000000f7711000 RSI: 0000000000000000 RDI: 00000000f77c3040
[ 1.324040] RBP: 00000000ffca97c8 R08: ffffffff8138aa0b R09: 0000000000000000
[ 1.324040] R10: 00000000f77a1b70 R11: 0000000000000000 R12: 0000000000000000
[ 1.324040] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1.324040] FS: 0000000000000000(0000) GS:ffff88001fa00000(0063) knlGS:00000000f754e6c0
[ 1.324040] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
[ 1.324040] CR2: ffff88001daa4018 CR3: 000000001da6e000 CR4: 00000000000006f0
[ 1.324040] Stack:
[ 1.324040] 0000000000000000 0000000000000000 0000000000000000 ffffffff8138aa0b
[ 1.324040] 0000000000000137 000000000000000c 00000000f7711000 0000000000000000
[ 1.324040] 00000000f77c3040 0000000000000137 00000000f77a1b70 0000000000000023
[ 1.324040] Call Trace:
[ 1.324040] [<ffffffff8138aa0b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1.324040] Code: c0 41 52 50 fc 48 83 ec 48 48 89 7c 24 40 48 89 74 24 38 48 89 54 24 30 48 89 4c 24 28 48 89 44 24 20 66 66 90 8b 6d 00 66 66 90 <f7> 84 24 90 00 00 00 00 40 00 00 0f 85 2f 01 00 00 83 8c 24 8c
[ 1.324040] RIP [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.324040] RSP <ffff88001daa3f88>
[ 1.324040] CR2: ffff88001daa4018
[ 1.324040] ---[ end trace 7d7a8bfdc14fe3be ]---
[ 1.372657] plymouthd (70) used greatest stack depth: 13256 bytes left
[ 1.374306] init: Error while reading from descriptor: Bad file descriptor
[ 1.376348] init: mountall main process (75) killed by KILL signal
[ 1.386907] sh (76) used greatest stack depth: 13208 bytes left
[ 1.388173] tsc: Refined TSC clocksource calibration: 2594.100 MHz
[ 1.390528] BUG: unable to handle kernel paging request at ffff88001daa4018
[ 1.392121] IP: [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.392121] PGD 2d6c067 PUD 2d6d067 PMD 1fdf4067 PTE 800000001daa4060
[ 1.392121] Oops: 0000 [#5] SMP DEBUG_PAGEALLOC
[ 1.392121] Modules linked in:
[ 1.392121] CPU: 0 PID: 78 Comm: init Tainted: G D 3.17.0-rc7+ #245
[ 1.392121] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
[ 1.392121] task: ffff88001da0a110 ti: ffff88001daa0000 task.ti: ffff88001daa0000
[ 1.392121] RIP: 0010:[<ffffffff8170703d>] [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.392121] RSP: 0018:ffff88001daa3f88 EFLAGS: 00010296
[ 1.392121] RAX: 0000000000000137 RBX: 00000000f754e730 RCX: 000000000000000c
[ 1.392121] RDX: 00000000f7711000 RSI: 0000000000000000 RDI: 00000000f77c3040
[ 1.392121] RBP: 00000000ffca97c8 R08: ffffffff8138aa0b R09: 0000000000000000
[ 1.392121] R10: 00000000f77a1b70 R11: 0000000000000000 R12: 0000000000000000
[ 1.392121] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1.392121] FS: 0000000000000000(0000) GS:ffff88001fa00000(0063) knlGS:00000000f754e6c0
[ 1.392121] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
[ 1.392121] CR2: ffff88001daa4018 CR3: 000000001da27000 CR4: 00000000000006f0
[ 1.392121] Stack:
[ 1.392121] 0000000000000000 0000000000000000 0000000000000000 ffffffff8138aa0b
[ 1.392121] 0000000000000137 000000000000000c 00000000f7711000 0000000000000000
[ 1.392121] 00000000f77c3040 0000000000000137 00000000f77a1b70 0000000000000023
[ 1.392121] Call Trace:
[ 1.392121] [<ffffffff8138aa0b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1.392121] Code: c0 41 52 50 fc 48 83 ec 48 48 89 7c 24 40 48 89 74 24 38 48 89 54 24 30 48 89 4c 24 28 48 89 44 24 20 66 66 90 8b 6d 00 66 66 90 <f7> 84 24 90 00 00 00 00 40 00 00 0f 85 2f 01 00 00 83 8c 24 8c
[ 1.392121] RIP [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.392121] RSP <ffff88001daa3f88>
[ 1.392121] CR2: ffff88001daa4018
[ 1.392121] ---[ end trace 7d7a8bfdc14fe3bf ]---
[ 1.436568] BUG: unable to handle kernel paging request at ffff88001da4c018
[ 1.438056] IP: [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.439308] PGD 2d6c067 PUD 2d6d067 PMD 1fdf4067 PTE 800000001da4c060
[ 1.440088] Oops: 0000 [#6] SMP DEBUG_PAGEALLOC
[ 1.440088] Modules linked in:
[ 1.440088] CPU: 0 PID: 73 Comm: plymouthd Tainted: G D 3.17.0-rc7+ #245
[ 1.440088] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
[ 1.440088] task: ffff88001d9f0000 ti: ffff88001da48000 task.ti: ffff88001da48000
[ 1.440088] RIP: 0010:[<ffffffff8170703d>] [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.440088] RSP: 0018:ffff88001da4bf88 EFLAGS: 00010296
[ 1.440088] RAX: 0000000000000066 RBX: 0000000000000005 RCX: 00000000ffdc3810
[ 1.440088] RDX: 000000000a048bd0 RSI: 000000000a048ca0 RDI: 0000000000000000
[ 1.440088] RBP: 000000000a048c58 R08: 0000000000000000 R09: 0000000000000000
[ 1.440088] R10: 00000000f775ab70 R11: 0000000000000000 R12: 0000000000000000
[ 1.440088] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1.440088] FS: 0000000000000000(0000) GS:ffff88001fa00000(0063) knlGS:00000000f75176c0
[ 1.440088] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 1.440088] CR2: ffff88001da4c018 CR3: 000000001da64000 CR4: 00000000000006f0
[ 1.440088] Stack:
[ 1.440088] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1.440088] 0000000000000066 00000000ffdc3810 000000000a048bd0 000000000a048ca0
[ 1.440088] 0000000000000000 0000000000000066 00000000f775ab70 0000000000000023
[ 1.440088] Call Trace:
[ 1.440088] Code: c0 41 52 50 fc 48 83 ec 48 48 89 7c 24 40 48 89 74 24 38 48 89 54 24 30 48 89 4c 24 28 48 89 44 24 20 66 66 90 8b 6d 00 66 66 90 <f7> 84 24 90 00 00 00 00 40 00 00 0f 85 2f 01 00 00 83 8c 24 8c
[ 1.440088] RIP [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.440088] RSP <ffff88001da4bf88>
[ 1.440088] CR2: ffff88001da4c018
[ 1.440088] ---[ end trace 7d7a8bfdc14fe3c0 ]---
[ 1.478043] init: console-setup main process (78) killed by KILL signal
[ 1.485084] plymouthd (73) used greatest stack depth: 13048 bytes left
[ 1.493827] init: plymouth main process (73) killed by KILL signal
[ 1.496444] init: plymouth-stop pre-start process (79) terminated with status 2
General error mounting filesystems.
A maintenance shell will now be started.
CONTROL-D will terminate this shell and reboot the system.
[ 1.651076] BUG: unable to handle kernel paging request at ffff88001daa4018
[ 1.653236] IP: [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.654249] PGD 2d6c067 PUD 2d6d067 PMD 1fdf4067 PTE 800000001daa4060
[ 1.654249] Oops: 0000 [#7] SMP DEBUG_PAGEALLOC
[ 1.654249] Modules linked in:
[ 1.654249] CPU: 0 PID: 83 Comm: bash Tainted: G D 3.17.0-rc7+ #245
[ 1.654249] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
[ 1.654249] task: ffff88001d9f2110 ti: ffff88001daa0000 task.ti: ffff88001daa0000
[ 1.654249] RIP: 0010:[<ffffffff8170703d>] [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.654249] RSP: 0018:ffff88001daa3f88 EFLAGS: 00010296
[ 1.654249] RAX: 00000000000000af RBX: 0000000000000002 RCX: 000000000812e380
[ 1.654249] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 00000000f773d000
[ 1.654249] RBP: 00000000fffd1da0 R08: ffffffff8138aa0b R09: 0000000000000000
[ 1.654249] R10: 00000000f777bb70 R11: 0000000000000000 R12: 0000000000000000
[ 1.654249] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1.654249] FS: 0000000000000000(0000) GS:ffff88001fa00000(0063) knlGS:00000000f75966c0
[ 1.654249] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
[ 1.654249] CR2: ffff88001daa4018 CR3: 000000001da37000 CR4: 00000000000006f0
[ 1.654249] Stack:
[ 1.654249] 0000000000000000 0000000000000000 0000000000000000 ffffffff8138aa0b
[ 1.654249] 00000000000000af 000000000812e380 0000000000000000 0000000000000008
[ 1.654249] 00000000f773d000 00000000000000af 00000000f777bb70 0000000000000023
[ 1.654249] Call Trace:
[ 1.654249] [<ffffffff8138aa0b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1.654249] Code: c0 41 52 50 fc 48 83 ec 48 48 89 7c 24 40 48 89 74 24 38 48 89 54 24 30 48 89 4c 24 28 48 89 44 24 20 66 66 90 8b 6d 00 66 66 90 <f7> 84 24 90 00 00 00 00 40 00 00 0f 85 2f 01 00 00 83 8c 24 8c
[ 1.654249] RIP [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.654249] RSP <ffff88001daa3f88>
[ 1.654249] CR2: ffff88001daa4018
[ 1.654249] ---[ end trace 7d7a8bfdc14fe3c1 ]---
[ 1.846659] BUG: unable to handle kernel paging request at ffff88001daa4018
[ 1.847580] IP: [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.848331] PGD 2d6c067 PUD 2d6d067 PMD 1fdf4067 PTE 800000001daa4060
[ 1.849222] Oops: 0000 [#8] SMP DEBUG_PAGEALLOC
[ 1.849318] Modules linked in:
[ 1.849318] CPU: 0 PID: 85 Comm: bash Tainted: G D 3.17.0-rc7+ #245
[ 1.849318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
[ 1.849318] task: ffff88001d9f2110 ti: ffff88001daa0000 task.ti: ffff88001daa0000
[ 1.849318] RIP: 0010:[<ffffffff8170703d>] [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.849318] RSP: 0018:ffff88001daa3f88 EFLAGS: 00010296
[ 1.849318] RAX: 00000000000000af RBX: 0000000000000002 RCX: 000000000812e380
[ 1.849318] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 00000000f773d000
[ 1.849318] RBP: 00000000fffd1cf0 R08: ffffffff8138aa0b R09: 0000000000000000
[ 1.849318] R10: 00000000f777bb70 R11: 0000000000000000 R12: 0000000000000000
[ 1.849318] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1.849318] FS: 0000000000000000(0000) GS:ffff88001fa00000(0063) knlGS:00000000f75966c0
[ 1.849318] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
[ 1.849318] CR2: ffff88001daa4018 CR3: 000000001da65000 CR4: 00000000000006f0
[ 1.849318] Stack:
[ 1.849318] 0000000000000000 0000000000000000 0000000000000000 ffffffff8138aa0b
[ 1.849318] 00000000000000af 000000000812e380 0000000000000000 0000000000000008
[ 1.849318] 00000000f773d000 00000000000000af 00000000f777bb70 0000000000000023
[ 1.849318] Call Trace:
[ 1.849318] [<ffffffff8138aa0b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1.849318] Code: c0 41 52 50 fc 48 83 ec 48 48 89 7c 24 40 48 89 74 24 38 48 89 54 24 30 48 89 4c 24 28 48 89 44 24 20 66 66 90 8b 6d 00 66 66 90 <f7> 84 24 90 00 00 00 00 40 00 00 0f 85 2f 01 00 00 83 8c 24 8c
[ 1.849318] RIP [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.849318] RSP <ffff88001daa3f88>
[ 1.849318] CR2: ffff88001daa4018
[ 1.849318] ---[ end trace 7d7a8bfdc14fe3c2 ]---
[ 1.882411] BUG: unable to handle kernel paging request at ffff88001daa8018
[ 1.884212] IP: [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.884506] PGD 2d6c067 PUD 2d6d067 PMD 1fdf4067 PTE 800000001daa8060
[ 1.884506] Oops: 0000 [#9] SMP DEBUG_PAGEALLOC
[ 1.884506] Modules linked in:
[ 1.884506] CPU: 0 PID: 86 Comm: bash Tainted: G D 3.17.0-rc7+ #245
[ 1.884506] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
[ 1.884506] task: ffff88001da08000 ti: ffff88001daa4000 task.ti: ffff88001daa4000
[ 1.884506] RIP: 0010:[<ffffffff8170703d>] [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.884506] RSP: 0018:ffff88001daa7f88 EFLAGS: 00010296
[ 1.884506] RAX: 00000000000000af RBX: 0000000000000002 RCX: 000000000812e380
[ 1.884506] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 00000000f773d000
[ 1.884506] RBP: 00000000fffd19e0 R08: ffffffff8138aa0b R09: 0000000000000000
[ 1.884506] R10: 00000000f777bb70 R11: 0000000000000000 R12: 0000000000000000
[ 1.884506] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1.884506] FS: 0000000000000000(0000) GS:ffff88001fa00000(0063) knlGS:00000000f75966c0
[ 1.884506] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
[ 1.884506] CR2: ffff88001daa8018 CR3: 000000001da6e000 CR4: 00000000000006f0
[ 1.884506] Stack:
[ 1.884506] 0000000000000000 0000000000000000 0000000000000000 ffffffff8138aa0b
[ 1.884506] 00000000000000af 000000000812e380 0000000000000000 0000000000000008
[ 1.884506] 00000000f773d000 00000000000000af 00000000f777bb70 0000000000000023
[ 1.884506] Call Trace:
[ 1.884506] [<ffffffff8138aa0b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1.884506] Code: c0 41 52 50 fc 48 83 ec 48 48 89 7c 24 40 48 89 74 24 38 48 89 54 24 30 48 89 4c 24 28 48 89 44 24 20 66 66 90 8b 6d 00 66 66 90 <f7> 84 24 90 00 00 00 00 40 00 00 0f 85 2f 01 00 00 83 8c 24 8c
[ 1.884506] RIP [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
[ 1.884506] RSP <ffff88001daa7f88>
[ 1.884506] CR2: ffff88001daa8018
[ 1.884506] ---[ end trace 7d7a8bfdc14fe3c3 ]---
root@(none):~# [ 2.388435] Switched to clocksource tsc

Qemu version:
QEMU emulator version 2.1.0 (Debian 2.1+dfsg-4ubuntu6), Copyright (c) 2003-2008 Fabrice Bellard

Invoked as:

$QEMU -machine pc,accel=kvm $ARGS -m 512 -net user,restrict=off -net nic,model=virtio -drive file=$QEMUIMAGE,index=0,media=disk,if=virtio -drive file=$QEMUIMAGEB,index=1,media=disk,if=virtio -kernel arch/x86/boot/bzImage -append "ro root=/dev/vda1 $KARGS $*"

The guest is a 32-bit Ubuntu 12.10, running the modern kernel of course.

Thanks,
Rusty.

2014-11-01 01:00:38

by Andy Lutomirski

Subject: Re: [PATCH v4 0/2] x86_64,entry: Clear NT on entry and speed up switch_to

On Fri, Oct 31, 2014 at 5:20 PM, Rusty Russell <[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>> Anish Bhatt noticed that user programs can set RFLAGS.NT before
>> syscall or sysenter, and the kernel entry code doesn't filter out
>> NT. This causes kernel C code and, depending on thread flags, the
>> exit slow path to run with NT set.
>
> OK, this causes oopsen as a guest under kvm for me. Details below:
>
> commit 8c7aa698baca5e8f1ba9edb68081f1e7a1abf455
> Author: Andy Lutomirski <[email protected]>
> Date: Wed Oct 1 11:49:04 2014 -0700
>
> x86_64, entry: Filter RFLAGS.NT on entry from userspace

Well, crap.

>
> Some dmesg:
>

> [ 1.126953] BUG: unable to handle kernel paging request at ffff88001da4c018
> [ 1.128482] IP: [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
> [ 1.129513] PGD 2d6c067 PUD 2d6d067 PMD 1fdf4067 PTE 800000001da4c060
> [ 1.129513] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 1.129513] Modules linked in:
> [ 1.129513] CPU: 0 PID: 69 Comm: init Not tainted 3.17.0-rc7+ #245
> [ 1.129513] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
> [ 1.129513] task: ffff88001da08000 ti: ffff88001da48000 task.ti: ffff88001da48000
> [ 1.129513] RIP: 0010:[<ffffffff8170703d>] [<ffffffff8170703d>] ia32_sysenter_target+0x4d/0x5e
> [ 1.129513] RSP: 0018:ffff88001da4bf88 EFLAGS: 00010296

So we're 0x78 bytes below the page boundary...

> [ 1.129513] RAX: 0000000000000137 RBX: 00000000f754e730 RCX: 000000000000000c
> [ 1.129513] RDX: 00000000f7711000 RSI: 0000000000000000 RDI: 00000000f77c3040
> [ 1.129513] RBP: 00000000ffca97c8 R08: ffffffff8138aa0b R09: 00000000ffcaba58
> [ 1.129513] R10: 00000000f77a1b70 R11: 0000000000000000 R12: 0000000000000000
> [ 1.129513] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 1.129513] FS: 0000000000000000(0000) GS:ffff88001fa00000(0063) knlGS:00000000f754e6c0
> [ 1.129513] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
> [ 1.129513] CR2: ffff88001da4c018 CR3: 000000001da2c000 CR4: 00000000000006f0
> [ 1.129513] Stack:
> [ 1.129513] 0000000000000000 0000000000000000 00000000ffcaba58 ffffffff8138aa0b
> [ 1.129513] 0000000000000137 000000000000000c 00000000f7711000 0000000000000000
> [ 1.129513] 00000000f77c3040 0000000000000137 00000000f77a1b70 0000000000000023
> [ 1.129513] Call Trace:
> [ 1.129513] [<ffffffff8138aa0b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 1.129513] Code: c0 41 52 50 fc 48 83 ec 48 48 89 7c 24 40 48 89 74 24 38 48 89 54 24 30 48 89 4c 24 28 48 89 44 24 20 66 66 90 8b 6d 00 66 66 90 <f7> 84 24 90 00 00 00 00 40 00 00 0f 85 2f 01 00 00 83 8c 24 8c

 0:   c0 41 52 50             rolb   $0x50,0x52(%rcx)
 4:   fc                      cld
 5:   48 83 ec 48             sub    $0x48,%rsp
 9:   48 89 7c 24 40          mov    %rdi,0x40(%rsp)
 e:   48 89 74 24 38          mov    %rsi,0x38(%rsp)
13:   48 89 54 24 30          mov    %rdx,0x30(%rsp)
18:   48 89 4c 24 28          mov    %rcx,0x28(%rsp)
1d:   48 89 44 24 20          mov    %rax,0x20(%rsp)
22:   66 66 90                data32 xchg %ax,%ax
25:   8b 6d 00                mov    0x0(%rbp),%ebp
28:   66 66 90                data32 xchg %ax,%ax
2b:*  f7 84 24 90 00 00 00    testl  $0x4000,0x90(%rsp)    <-- trapping instruction

This seems to be just slightly out of bounds.
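
Both observations can be checked against the numbers in the oops. The following is only a sketch, using Python as a calculator; the 4 KiB page size is an assumption (it is the normal x86_64 page size, but the thread does not state it), and the variable names are made up for illustration.

```python
PAGE_SIZE = 0x1000           # assumed 4 KiB pages

rsp = 0xffff88001da4bf88     # RSP from the first oops
cr2 = 0xffff88001da4c018     # faulting address reported in CR2
disp = 0x90                  # displacement in `testl $0x4000,0x90(%rsp)`


def next_page_boundary(addr):
    """Return the first page-aligned address strictly above addr."""
    return (addr // PAGE_SIZE + 1) * PAGE_SIZE


boundary = next_page_boundary(rsp)
headroom = boundary - rsp        # bytes between RSP and the page boundary
fault_addr = rsp + disp          # address the testl instruction reads
overshoot = fault_addr - boundary

print(hex(headroom))     # 0x78
print(hex(fault_addr))   # 0xffff88001da4c018
print(hex(overshoot))    # 0x18
```

The computed address equals the CR2 value in the oops, so the read lands 0x18 bytes past the page that holds the stack.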

[Insert large number of expletives here] This is really bad. It
worked when I tested it because of dumb luck. If I read random
garbage there, there's a pretty good chance that the code will work.
But somehow you're right at the end of the entire memory map, and
you're totally screwed.

Fix coming.

2014-11-01 01:08:52

by Andy Lutomirski

Subject: [PATCH] x86_64, entry: Fix out of bounds read on sysenter

Rusty noticed a Really Bad Bug (tm) in my NT fix. The entry code
reads out of bounds, causing the NT fix to be unreliable. But, and
this is much, much worse, if your stack is somehow just below the
top of the direct map (or a hole), you read out of bounds and crash.

Excerpt from the crash:

[ 1.129513] RSP: 0018:ffff88001da4bf88 EFLAGS: 00010296

2b:* f7 84 24 90 00 00 00 testl $0x4000,0x90(%rsp)

That read is deterministically above the top of the stack. I
thought I even single-stepped through this code when I wrote it to
check the offset, but I clearly screwed it up.

Fixes: 8c7aa698baca ("x86_64, entry: Filter RFLAGS.NT on entry from userspace")

Reported-by: Rusty Russell <[email protected]>
Cc: [email protected]
Signed-off-by: Andy Lutomirski <[email protected]>
---

Linus, etc: this should probably go in pretty quickly before it hits -stable
too hard. Fortunately it's unlikely to be a meaningful security problem,
but it's a nasty regression. (It *is* a security problem in -next, but we
get a free pass on that one.)

arch/x86/ia32/ia32entry.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 8ffba18395c8..ffe71228fc10 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -157,7 +157,7 @@ ENTRY(ia32_sysenter_target)
 	 * ourselves. To save a few cycles, we can check whether
 	 * NT was set instead of doing an unconditional popfq.
 	 */
-	testl $X86_EFLAGS_NT,EFLAGS(%rsp)	/* saved EFLAGS match cpu */
+	testl $X86_EFLAGS_NT,EFLAGS-ARGOFFSET(%rsp)
 	jnz sysenter_fix_flags
 sysenter_flags_fixed:

--
1.9.3
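
The effect of the one-line change above can be sanity-checked against the oops. This is a rough sketch, not taken from the thread. The EFLAGS offset of 144 (0x90) agrees with the displacement in the trapping instruction. ARGOFFSET = 48 is an assumption based on the 3.17-era arch/x86/include/asm/calling.h, where it covers the six callee-saved register slots that this entry path does not push. The variable names are illustrative only.

```python
# Offsets into a full pt_regs frame. EFLAGS = 144 (0x90) agrees with the
# displacement in the trapping instruction; ARGOFFSET = 48 is an ASSUMPTION
# taken from the 3.17-era arch/x86/include/asm/calling.h.
EFLAGS = 144
ARGOFFSET = 48

rsp = 0xffff88001da4bf88        # RSP from the oops excerpt
page_end = 0xffff88001da4c000   # first address past the stack page

buggy_addr = rsp + EFLAGS               # what EFLAGS(%rsp) reads
fixed_addr = rsp + EFLAGS - ARGOFFSET   # what EFLAGS-ARGOFFSET(%rsp) reads

# The oops dumps 12 quadwords starting at RSP; the last one (index 11) is
# 0x23, which looks like the saved user CS. In an iret-style frame the
# saved flags occupy the slot right after CS.
cs_slot = 11 * 8
flags_slot = cs_slot + 8

print(hex(buggy_addr))   # lands past page_end
print(hex(fixed_addr))   # stays inside the stack page
print(hex(flags_slot))   # offset of the saved flags from RSP
```

Under these assumptions the corrected offset is 0x60, which is the slot immediately after the saved CS in the stack dump, while the old offset points 0x18 bytes beyond the page.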

2014-11-01 02:31:09

by Rusty Russell

Subject: Re: [PATCH] x86_64, entry: Fix out of bounds read on sysenter

Andy Lutomirski <[email protected]> writes:
> Rusty noticed a Really Bad Bug (tm) in my NT fix. The entry code
> reads out of bounds, causing the NT fix to be unreliable. But, and
> this is much, much worse, if your stack is somehow just below the
> top of the direct map (or a hole), you read out of bounds and crash.
>
> Excerpt from the crash:
>
> [ 1.129513] RSP: 0018:ffff88001da4bf88 EFLAGS: 00010296
>
> 2b:* f7 84 24 90 00 00 00 testl $0x4000,0x90(%rsp)
>
> That read is deterministically above the top of the stack. I
> thought I even single-stepped through this code when I wrote it to
> check the offset, but I clearly screwed it up.
>
> Fixes: 8c7aa698baca ("x86_64, entry: Filter RFLAGS.NT on entry from userspace")
>
> Reported-by: Rusty Russell <[email protected]>
> Cc: [email protected]
> Signed-off-by: Andy Lutomirski <[email protected]>

Tested-by: Rusty Russell <[email protected]>

Thanks for the fast response...
Rusty.