Anish Bhatt noticed that user programs can set RFLAGS.NT before
syscall or sysenter, and the kernel entry code doesn't filter out
NT. This causes kernel C code and, depending on thread flags, the
exit slow path to run with NT set.
The former is a little bit scary (imagine calling into EFI with NT
set), and the latter will fail with #GP and send a spurious SIGSEGV.
One answer would be "don't do that". But the kernel can do better
here.
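For anyone who wants to poke at this, a minimal reproducer looks
roughly like this (an untested sketch; NT is bit 14 of RFLAGS, and
popf at CPL3 is allowed to set it):

	/* sketch: set NT from userspace, then enter the kernel */
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		asm volatile("pushfq\n\t"
			     "orq $0x4000, (%%rsp)\n\t" /* X86_EFLAGS_NT */
			     "popfq" ::: "memory", "cc");
		syscall(SYS_getpid);	/* kernel now runs with NT set */
		return 0;
	}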
These patches filter NT on all kernel entries. For syscall (both
bitnesses), this is free. For sysenter, it seems to cost very
little (less than my ability to measure, although I didn't try that
hard). Patch 2, which isn't tagged for -stable, speeds up context
switches by avoiding saving and restoring flags, so this series
should be a decent overall performance win.
See: https://bugs.winehq.org/show_bug.cgi?id=33275
Changes from v1:
- Spell [email protected] correctly
- Tidy up changelog text
- Actually commit an asm constraint fix in patch 2 (egads!)
- Replace the unconditional popfq with a branch
Andy Lutomirski (2):
x86_64,entry: Filter RFLAGS.NT on entry from userspace
x86_64: Don't save flags on context switch
arch/x86/ia32/ia32entry.S | 12 ++++++++++++
arch/x86/include/asm/switch_to.h | 12 ++++++++----
arch/x86/kernel/cpu/common.c | 2 +-
3 files changed, 21 insertions(+), 5 deletions(-)
--
1.9.3
The NT flag doesn't do anything in long mode other than causing IRET
to #GP. Oddly, CPL3 code can still set NT using popf.
Entry via hardware or software interrupt clears NT automatically, so
the only relevant entries are fast syscalls.
If user code causes kernel code to run with NT set, then there's at
least some (small) chance that it could cause trouble. For example,
user code could cause a call to EFI code with NT set, and who knows
what would happen? Apparently some games on Wine sometimes do
this (!), and, if an IRET return happens, they will segfault. That
segfault cannot be handled, because signal delivery fails, too.
This patch programs the CPU to clear NT on entry via SYSCALL (both
32-bit and 64-bit, by my reading of the AMD APM), and it clears NT
in software on entry via SYSENTER.
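For reference, my reading of SYSCALL's flag handling in the APM is
roughly:

	R11 = RFLAGS
	RFLAGS &= ~SFMASK	/* SFMASK is the value in MSR_SYSCALL_MASK */

so adding NT to the mask clears it on entry for free, and SYSRET
restores the user's flags from R11 on the way out.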
To save a few cycles, this borrows a trick from Jan Beulich in Xen:
it checks whether NT is set before trying to clear it. As a result,
it seems to have very little effect on SYSENTER performance on my
machine.
Testers beware: on Xen, SYSENTER with NT set turns into a GPF.
I haven't touched anything on 32-bit kernels.
The syscall mask change comes from a variant of this patch by Anish
Bhatt.
Cc: [email protected]
Reported-by: Anish Bhatt <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/ia32/ia32entry.S | 12 ++++++++++++
arch/x86/kernel/cpu/common.c | 2 +-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 4299eb05023c..44d1dd371454 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -151,6 +151,18 @@ ENTRY(ia32_sysenter_target)
1: movl (%rbp),%ebp
_ASM_EXTABLE(1b,ia32_badarg)
ASM_CLAC
+
+ /*
+ * Sysenter doesn't filter flags, so we need to clear NT
+ * ourselves. To save a few cycles, we can check whether
+ * NT was set instead of doing an unconditional popfq.
+ */
+ testl $X86_EFLAGS_NT,EFLAGS(%rsp) /* saved EFLAGS match cpu */
+ jz 1f
+ pushq_cfi $(X86_EFLAGS_IF|X86_EFLAGS_FIXED)
+ popfq_cfi
+1:
+
orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
CFI_REMEMBER_STATE
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index e4ab2b42bd6f..31265580c38a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1184,7 +1184,7 @@ void syscall_init(void)
/* Flags to clear on syscall */
wrmsrl(MSR_SYSCALL_MASK,
X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
- X86_EFLAGS_IOPL|X86_EFLAGS_AC);
+ X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
}
/*
--
1.9.3
Now that the kernel always runs with clean flags (in particular, NT
is clear), there is no need to save and restore flags on every
context switch.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/switch_to.h | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index d7f3b3b78ac3..751bf4b7bf11 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -79,12 +79,12 @@ do { \
#else /* CONFIG_X86_32 */
/* frame pointer must be last for get_wchan */
-#define SAVE_CONTEXT "pushf ; pushq %%rbp ; movq %%rsi,%%rbp\n\t"
-#define RESTORE_CONTEXT "movq %%rbp,%%rsi ; popq %%rbp ; popf\t"
+#define SAVE_CONTEXT "pushq %%rbp ; movq %%rsi,%%rbp\n\t"
+#define RESTORE_CONTEXT "movq %%rbp,%%rsi ; popq %%rbp\t"
#define __EXTRA_CLOBBER \
, "rcx", "rbx", "rdx", "r8", "r9", "r10", "r11", \
- "r12", "r13", "r14", "r15"
+ "r12", "r13", "r14", "r15", "flags"
#ifdef CONFIG_CC_STACKPROTECTOR
#define __switch_canary \
@@ -100,7 +100,11 @@ do { \
#define __switch_canary_iparam
#endif /* CC_STACKPROTECTOR */
-/* Save restore flags to clear handle leaking NT */
+/*
+ * There is no need to save or restore flags, because flags are always
+ * clean in kernel mode, with the possible exception of IOPL. Kernel IOPL
+ * has no effect.
+ */
#define switch_to(prev, next, last) \
asm volatile(SAVE_CONTEXT \
"movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */ \
--
1.9.3
> + testl $X86_EFLAGS_NT,EFLAGS(%rsp) /* saved EFLAGS match cpu */
> + jz 1f
> + pushq_cfi $(X86_EFLAGS_IF|X86_EFLAGS_FIXED)
> + popfq_cfi
> +1:
> +
Do you think it makes sense to change the order here, so that no jump
happens if NT is not set (which happens a bit more often than the other
way round)? Just a guess though; I haven't measured whether pipeline
effects have that big an influence in this case. ;)
On Tue, Sep 30, 2014 at 10:09 PM, Sebastian Lackner
<[email protected]> wrote:
>> Do you think it makes sense to change the order here, so that no jump
>> happens if NT is not set (which happens a bit more often than the other
>> way round)? Just a guess though; I haven't measured whether pipeline
>> effects have that big an influence in this case. ;)
>
It should be immeasurable in a tight loop, since it will predict
correctly almost every time. And, unless cfi state works across
.pushsection (does it?), getting the cfi annotations right will be
more complicated.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
On Tue, 30 Sep 2014 21:51:27 -0700
Andy Lutomirski <[email protected]> wrote:
> @@ -151,6 +151,18 @@ ENTRY(ia32_sysenter_target)
> 1: movl (%rbp),%ebp
> _ASM_EXTABLE(1b,ia32_badarg)
> ASM_CLAC
> +
> + /*
> + * Sysenter doesn't filter flags, so we need to clear NT
> + * ourselves. To save a few cycles, we can check whether
> + * NT was set instead of doing an unconditional popfq.
> + */
> + testl $X86_EFLAGS_NT,EFLAGS(%rsp) /* saved EFLAGS match cpu */
> + jz 1f
> + pushq_cfi $(X86_EFLAGS_IF|X86_EFLAGS_FIXED)
> + popfq_cfi
> +1:
> +
I think you've gone backwards with this version. The earlier one got
some of the performance loss back by not needing to do the "cld" insn.
You should just replace that "cld" (line 146) with
pushfq_cfi $2
popfq_cfi
Unfortunately I'm not set up to test that yet. But I did look at
the SDM and can't see a need to preserve any of the flags.
On Wed, 1 Oct 2014 09:09:13 -0500
Chuck Ebbert <[email protected]> wrote:
> You should just replace that "cld" (line 146) with
>
> pushfq_cfi $2
> popfq_cfi
>
<sigh> that's:
pushfw_cfi $0x202
IF needs to stay on because we've already enabled interrupts after
sysenter.
On Wed, Oct 1, 2014 at 7:32 AM, Chuck Ebbert <[email protected]> wrote:
> <sigh> that's:
>
> pushfw_cfi $0x202
>
> IF needs to stay on because we've already enabled interrupts after
> sysenter.
I tried exactly this. It was much slower than the version I sent.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
On Wed, 1 Oct 2014 07:46:54 -0700
Andy Lutomirski <[email protected]> wrote:
> I tried exactly this. It was much slower than the version I sent.
>
Yeah, it looks like a new paravirt op that enables interrupts and
clears all the other flags would be the only way to do this without at
least some impact on performance.
On Wed, Oct 1, 2014 at 7:56 AM, Chuck Ebbert <[email protected]> wrote:
> Yeah, it looks like a new paravirt op that enables interrupts and
> clears all the other flags would be the only way to do this without at
> least some impact on performance.
We have that -- it's called something like setfl.
But it still wouldn't help. It seems that cld, test, jnz is simply
much faster than popfq.
If we could fold it with the sti earlier, *maybe* that would be a win,
but then we'd also have to patch the saved flags to avoid returning to
userspace with interrupts off. (And I tried that. It still didn't
seem to be fast enough.)
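For anyone who wants to reproduce the numbers, something like this
should do (a sketch, not my exact harness; build with -m32 so that
glibc enters the kernel via __kernel_vsyscall, which uses sysenter on
Intel hardware):

	#include <stdio.h>
	#include <sys/syscall.h>
	#include <time.h>
	#include <unistd.h>

	int main(void)
	{
		struct timespec a, b;
		long i, n = 10 * 1000 * 1000;

		clock_gettime(CLOCK_MONOTONIC, &a);
		for (i = 0; i < n; i++)
			syscall(SYS_getpid);	/* bypasses glibc's getpid cache */
		clock_gettime(CLOCK_MONOTONIC, &b);

		printf("%.1f ns/syscall\n",
		       ((b.tv_sec - a.tv_sec) * 1e9 +
			(b.tv_nsec - a.tv_nsec)) / n);
		return 0;
	}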
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
On 09/30/2014 10:24 PM, Andy Lutomirski wrote:
> It should be immeasurable in a tight loop, since it will predict
> correctly almost every time. And, unless cfi state works across
> .pushsection (does it?), getting the cfi annotations right will be
> more complicated.
>
It does, actually... otherwise it would be almost impossible to use in a
lot of cases.
-hpa
On 09/30/2014 09:51 PM, Andy Lutomirski wrote:
> @@ -151,6 +151,18 @@ ENTRY(ia32_sysenter_target)
> 1: movl (%rbp),%ebp
> _ASM_EXTABLE(1b,ia32_badarg)
> ASM_CLAC
> +
> + /*
> + * Sysenter doesn't filter flags, so we need to clear NT
> + * ourselves. To save a few cycles, we can check whether
> + * NT was set instead of doing an unconditional popfq.
> + */
> + testl $X86_EFLAGS_NT,EFLAGS(%rsp) /* saved EFLAGS match cpu */
> + jz 1f
> + pushq_cfi $(X86_EFLAGS_IF|X86_EFLAGS_FIXED)
> + popfq_cfi
> +1:
> +
I'm wondering if it would be easier to just remove ASM_CLAC and do this
unconditionally.  On SMAP-enabled hardware that gives us back some of
the cycles, and it may make the branch unnecessary.
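I.e., something like this (sketch):

	/* one unconditional popfq clears AC and NT together */
	pushq_cfi $(X86_EFLAGS_IF|X86_EFLAGS_FIXED)
	popfq_cfi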
-hpa
On 10/01/2014 08:22 AM, H. Peter Anvin wrote:
> I'm wondering if it would be easier to just remove ASM_CLAC and do this
> unconditionally.  On SMAP-enabled hardware that gives us back some of
> the cycles, and it may make the branch unnecessary.
>
Heck, we can drop the CLD and the STI as well (with some tweaking in
ia32_badarg.)
-hpa
On Oct 1, 2014 8:26 AM, "H. Peter Anvin" <[email protected]> wrote:
>
> On 10/01/2014 08:22 AM, H. Peter Anvin wrote:
> > On 09/30/2014 09:51 PM, Andy Lutomirski wrote:
> >>
> >> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> >> index 4299eb05023c..44d1dd371454 100644
> >> --- a/arch/x86/ia32/ia32entry.S
> >> +++ b/arch/x86/ia32/ia32entry.S
> >> @@ -151,6 +151,18 @@ ENTRY(ia32_sysenter_target)
> >> 1: movl (%rbp),%ebp
> >> _ASM_EXTABLE(1b,ia32_badarg)
> >> ASM_CLAC
> >> +
> >> + /*
> >> + * Sysenter doesn't filter flags, so we need to clear NT
> >> + * ourselves. To save a few cycles, we can check whether
> >> + * NT was set instead of doing an unconditional popfq.
> >> + */
> >> + testl $X86_EFLAGS_NT,EFLAGS(%rsp) /* saved EFLAGS match cpu */
> >> + jz 1f
> >> + pushq_cfi $(X86_EFLAGS_IF|X86_EFLAGS_FIXED)
> >> + popfq_cfi
> >> +1:
> >> +
> >
> > I'm wondering if it would be easier to just remove ASM_CLAC and do this
> > unconditionally. On SMAP-enabled hardware then that gives us back some
> > of the cycles, may make the branch unnecessary.
> >
>
> Heck, we can drop the CLD and the STI as well (with some tweaking in
> ia32_badarg.)
I prototyped this, and performance sucked. I suspect that cld and sti
are fairly well optimized, that I ended up introducing stalls due to
stack manipulation, and that Sandy Bridge's popfq microcode is just
not that fast. Maybe I did it wrong. Dunno. Also, I can't benchmark
a SMAP machine, since I don't have one. (Does anyone? I'm currently
tempted to wait for Skylake before upgrading all my systems.)
In fact, I think we should change all the irqrestore code to do

	if (flags & X86_EFLAGS_IF)
		sti;
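Something like this, say (a sketch against native_restore_fl() in
arch/x86/include/asm/irqflags.h; it assumes interrupts are off on
entry, as after local_irq_save(), and that kernel-mode flags are
otherwise clean, which is the point of this series):

	static inline void native_restore_fl(unsigned long flags)
	{
		/* only IF can differ, so no popf needed */
		if (flags & X86_EFLAGS_IF)
			asm volatile("sti" ::: "memory");
	}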
I can send a v3 with the unlikely code moved out of line.
--Andy
On Wed, Oct 1, 2014 at 8:50 AM, Andy Lutomirski <[email protected]> wrote:
> I prototyped this, and performance sucked. I suspect that cld and sti
> are fairly well optimized, that I ended up introducing stalls due to
> stack manipulation, and that Sandy Bridge's popfq microcode is just
> not that fast. Maybe I did it wrong. Dunno. Also, I can't benchmark
> a SMAP machine, since I don't have one. (Does anyone? I'm currently
> tempted to wait for Skylake before upgrading all my systems.)
Agner Fog's tables for Sandy Bridge have 9 uops for popf and
reciprocal throughput 18. sti isn't listed for Sandy Bridge or
anything similar, but cld is 3 uops with reciprocal throughput 4.
Also, popf accesses rsp, and the sysenter code is very heavy on stack
manipulation.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
On 10/01/2014 09:04 AM, Andy Lutomirski wrote:
>
> Agner Fog's tables for Sandy Bridge have 9 uops for popf and
> reciprocal throughput 18. sti isn't listed for Sandy Bridge or
> anything similar, but cld is 3 uops with reciprocal throughput 4.
> Also, popf accesses rsp, and the sysenter code is very heavy on stack
> manipulation.
>
It does a stack operation.  Newer CPUs optimize stack accesses pretty
heavily.  That doesn't mean back-to-back push/pop are all that
optimized, though; I wonder if it would help to separate them.  popf is
unlikely to ever be all that fast.
-hpa