We rely on objtool to verify AC=1 doesn't escape. However there is no
objtool support for x86_32, and thus we cannot guarantee the
correctness of the 32bit code.
Also, if you're running 32bit kernels on hardware with SMAP (all of
which should have Long Mode support, afaik), you're doing it wrong
anyway.
XXX: we could do the PUSHF/POPF thing in __switch_to_asm() on x86_32,
but why bother.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/Kconfig | 2 ++
1 file changed, 2 insertions(+)
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1853,6 +1853,8 @@ config ARCH_RANDOM
config X86_SMAP
def_bool y
+ # Note: we rely on objtool to validate AC=1 doesn't escape
+ depends on HAVE_STACK_VALIDATION
prompt "Supervisor Mode Access Prevention" if EXPERT
---help---
Supervisor Mode Access Prevention (SMAP) is a security
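For context, the failure mode objtool is being taught to catch is an
EFLAGS.AC=1 window that leaks past a function call. A hedged sketch of
the pattern (not actual kernel code; the exact user_access_begin()
signature has varied across kernel versions, and some_function() is a
stand-in for any call objtool would flag):

    /* AC=1 escape, sketch only: */
    if (user_access_begin(ptr, size)) {        /* STAC: sets EFLAGS.AC */
        unsafe_get_user(val, ptr, Efault);     /* intended AC=1 access */
        some_function();                       /* BUG: call with AC=1; it
                                                * might fault, trace or
                                                * schedule */
Efault:
        user_access_end();                     /* CLAC: clears EFLAGS.AC */
    }

If such code schedules, and the context switch does not save and
restore EFLAGS, AC=1 can leak into an unrelated task.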
On Mon, Mar 18, 2019 at 8:54 AM Peter Zijlstra <[email protected]> wrote:
>
> We rely on objtool to verify AC=1 doesn't escape. However there is no
> objtool support for x86_32, and thus we cannot guarantee the
> correctness of the 32bit code.
Absolutely not.
This is just crazy. We had working SMAP long before objtool, and we
will have it regardless of objtool.
This is like saying "ok, I don't have an oxygen sensor, so I don't
know that the air I'm breathing is sufficient to maintain life, so
I'll just stop breathing".
So no way in hell do we make SMAP go away on 32-bit for no sane reason
what-so-ever.
Besides, the x86-64 objtool coverage will cover 99% of all 32-bit code
too, simply because we share it. In fact, it will cover most of the
code for other architectures too.
Linus
On Mon, Mar 18, 2019 at 09:58:20AM -0700, Linus Torvalds wrote:
> On Mon, Mar 18, 2019 at 8:54 AM Peter Zijlstra <[email protected]> wrote:
> >
> > We rely on objtool to verify AC=1 doesn't escape. However there is no
> > objtool support for x86_32, and thus we cannot guarantee the
> > correctness of the 32bit code.
>
> Absolutely not.
>
> This is just crazy. We had working SMAP long before objtool, and we
> will have it regardless of objtool.
Well, 'working'; as this series shows, there have been a number of
bugs around this.
Anyway, since you mentioned it, I checked: the gap was about 16 months.
Broadwell was Oct'14, objtool was Feb'16.
> This is like saying "ok, I don't have an oxygen sensor, so I don't
> know that the air I'm breathing is sufficient to maintain life, so
> I'll just stop breathing".
With PTI, booting a 32bit kernel on a Broadwell or later will already
complain about it being sub-optimal for missing out on PCID. But sure;
if you want to enable this..
> So no way in hell do we make SMAP go away on 32-bit for no sane reason
> what-so-ever.
then I'd sleep much better if we make 32bit context switch EFLAGS.
Because, yes, x86_64 objtool validates a lot of the x86_32 code too, but
unless we have 100% coverage, there's always the chance an AC=1 escapes.
---
arch/x86/entry/entry_32.S | 2 ++
arch/x86/include/asm/switch_to.h | 1 +
arch/x86/kernel/process_32.c | 5 +++++
3 files changed, 8 insertions(+)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..5fc76b755510 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -650,6 +650,7 @@ ENTRY(__switch_to_asm)
pushl %ebx
pushl %edi
pushl %esi
+ pushfl
/* switch stack */
movl %esp, TASK_threadsp(%eax)
@@ -672,6 +673,7 @@ ENTRY(__switch_to_asm)
#endif
/* restore callee-saved registers */
+ popfl
popl %esi
popl %edi
popl %ebx
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 7cf1a270d891..18a4b6890fa8 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -46,6 +46,7 @@ struct inactive_task_frame {
unsigned long r13;
unsigned long r12;
#else
+ unsigned long flags;
unsigned long si;
unsigned long di;
#endif
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index e471d8e6f0b2..d28e9a74b736 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -127,6 +127,11 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
struct task_struct *tsk;
int err;
+ /*
+ * We start a new task with the RESET value of EFLAGS, in particular we
+ * context switch with IRQs disabled.
+ */
+ frame->flags = X86_EFLAGS_FIXED;
frame->bp = 0;
frame->ret_addr = (unsigned long) ret_from_fork;
p->thread.sp = (unsigned long) fork_frame;
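For reference, X86_EFLAGS_FIXED is the architectural reset value of
EFLAGS: only the reserved always-one bit (bit 1) is set, so IF, AC and
DF all start out clear. Paraphrasing
arch/x86/include/uapi/asm/processor-flags.h:

    #define X86_EFLAGS_FIXED_BIT    1    /* Bit 1 - always on */
    #define X86_EFLAGS_FIXED        _BITUL(X86_EFLAGS_FIXED_BIT)    /* 0x0002 */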
On Mon, Mar 18, 2019 at 06:36:57PM +0100, Peter Zijlstra wrote:
> then I'd sleep much better if we make 32bit context switch EFLAGS.
> Because, yes, x86_64 objtool validates a lot of the x86_32 code too, but
> unless we have 100% coverage, there's always the chance an AC=1 escapes.
How about I do a patch that schedules EFLAGS for both 32bit and 64bit,
mark this for backporting to infinity.
And then at the end, after the objtool-ac bits land, I do a patch
removing the EFLAGS scheduling for x86_64.
On Mon, Mar 18, 2019 at 10:51 AM Peter Zijlstra <[email protected]> wrote:
>
> How about I do a patch that schedules EFLAGS for both 32bit and 64bit,
> mark this for backporting to infinity.
>
> And then at the end, after the objtool-ac bits land, I do a patch
> removing the EFLAGS scheduling for x86_64.
Sounds sane to me.
And we can make it AC-conditional if it's actually shown to be visible
from a performance standpoint.
But iirc pushf/popf isn't really that expensive - in fact I think it's
pretty cheap when system flags don't change. Which would be the common
case unless you'd also make the popf do the irq restore part and
simply make finish_lock_switch() re-enable irq's by doing an
irqrestore?
I think popf is like 20 cycles or something (and pushf is just single
cycles). Probably not worth worrying about in the task switch context.
Linus
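As for what "AC-conditional" could look like: a minimal sketch,
assuming kernel context and the real native_save_fl()/native_restore_fl()
helpers; the saved_flags field is hypothetical and this is not a
proposed patch:

    /* Sketch: pay the ~20 cycle POPF only when AC actually differs. */
    static inline void switch_flags(struct task_struct *prev,
                                    struct task_struct *next)
    {
        unsigned long flags = native_save_fl();     /* pushf; pop */

        prev->thread.saved_flags = flags;           /* hypothetical field */
        if (unlikely((flags ^ next->thread.saved_flags) & X86_EFLAGS_AC))
            native_restore_fl(next->thread.saved_flags); /* push; popf */
    }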
New patch #1
---
Subject: sched/x86: Save [ER]FLAGS on context switch
From: Peter Zijlstra <[email protected]>
Date: Thu Feb 14 10:30:52 CET 2019
Effectively reverts commit:
2c7577a75837 ("sched/x86_64: Don't save flags on context switch")
Specifically because SMAP uses FLAGS.AC which invalidates the claim
that the kernel has clean flags.
In particular, while preemption from interrupt return is fine (the
IRET frame on the exception stack contains FLAGS), it breaks any code
that does synchronous scheduling, including preempt_enable().
This has become a significant issue ever since commit:
5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses")
provided a means of having 'normal' C code between STAC / CLAC,
exposing the FLAGS.AC state. So far this hasn't led to trouble, but
fix it before it comes apart.
Fixes: 5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses")
Acked-by: Andy Lutomirski <[email protected]>
Reported-by: Julien Thierry <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/entry/entry_32.S | 2 ++
arch/x86/entry/entry_64.S | 2 ++
arch/x86/include/asm/switch_to.h | 1 +
arch/x86/kernel/process_32.c | 7 +++++++
arch/x86/kernel/process_64.c | 8 ++++++++
5 files changed, 20 insertions(+)
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -650,6 +650,7 @@ ENTRY(__switch_to_asm)
pushl %ebx
pushl %edi
pushl %esi
+ pushfl
/* switch stack */
movl %esp, TASK_threadsp(%eax)
@@ -672,6 +673,7 @@ ENTRY(__switch_to_asm)
#endif
/* restore callee-saved registers */
+ popfl
popl %esi
popl %edi
popl %ebx
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -291,6 +291,7 @@ ENTRY(__switch_to_asm)
pushq %r13
pushq %r14
pushq %r15
+ pushfq
/* switch stack */
movq %rsp, TASK_threadsp(%rdi)
@@ -313,6 +314,7 @@ ENTRY(__switch_to_asm)
#endif
/* restore callee-saved registers */
+ popfq
popq %r15
popq %r14
popq %r13
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -40,6 +40,7 @@ asmlinkage void ret_from_fork(void);
* order of the fields must match the code in __switch_to_asm().
*/
struct inactive_task_frame {
+ unsigned long flags;
#ifdef CONFIG_X86_64
unsigned long r15;
unsigned long r14;
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -127,6 +127,13 @@ int copy_thread_tls(unsigned long clone_
struct task_struct *tsk;
int err;
+ /*
+ * For a new task use the RESET flags value since there is no before.
+ * All the status flags are zero; DF and all the system flags must also
+ * be 0, specifically IF must be 0 because we context switch to the new
+ * task with interrupts disabled.
+ */
+ frame->flags = X86_EFLAGS_FIXED;
frame->bp = 0;
frame->ret_addr = (unsigned long) ret_from_fork;
p->thread.sp = (unsigned long) fork_frame;
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -392,6 +392,14 @@ int copy_thread_tls(unsigned long clone_
childregs = task_pt_regs(p);
fork_frame = container_of(childregs, struct fork_frame, regs);
frame = &fork_frame->frame;
+
+ /*
+ * For a new task use the RESET flags value since there is no before.
+ * All the status flags are zero; DF and all the system flags must also
+ * be 0, specifically IF must be 0 because we context switch to the new
+ * task with interrupts disabled.
+ */
+ frame->flags = X86_EFLAGS_FIXED;
frame->bp = 0;
frame->ret_addr = (unsigned long) ret_from_fork;
p->thread.sp = (unsigned long) fork_frame;
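One layout detail worth spelling out: the stack grows down and the
pushf is the last push in __switch_to_asm(), so the saved flags word
sits at the lowest address, which is why flags must be the first member
of struct inactive_task_frame. On x86_64 the frame the new task's stack
pointer points at is, low address to high:

    /* sp -> flags, r15, r14, r13, r12, bx, bp, ret_addr */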
On March 18, 2019 11:10:22 AM PDT, Linus Torvalds <[email protected]> wrote:
> On Mon, Mar 18, 2019 at 10:51 AM Peter Zijlstra <[email protected]> wrote:
> >
> > How about I do a patch that schedules EFLAGS for both 32bit and 64bit,
> > mark this for backporting to infinity.
> >
> > And then at the end, after the objtool-ac bits land, I do a patch
> > removing the EFLAGS scheduling for x86_64.
>
> Sounds sane to me.
>
> And we can make it AC-conditional if it's actually shown to be visible
> from a performance standpoint.
>
> But iirc pushf/popf isn't really that expensive - in fact I think it's
> pretty cheap when system flags don't change. Which would be the common
> case unless you'd also make the popf do the irq restore part and
> simply make finish_lock_switch() re-enable irq's by doing an
> irqrestore?
>
> I think popf is like 20 cycles or something (and pushf is just single
> cycles). Probably not worth worrying about in the task switch context.
>
> Linus
Yes, the problem isn't scheduling per se but the risk of hiding problems.
On 3/18/19 7:10 PM, Linus Torvalds wrote:
> On Mon, Mar 18, 2019 at 10:51 AM Peter Zijlstra <[email protected]> wrote:
>>
>> How about I do a patch that schedules EFLAGS for both 32bit and 64bit,
>> mark this for backporting to infinity.
>>
>> And then at the end, after the objtool-ac bits land, I do a patch
>> removing the EFLAGS scheduling for x86_64.
>
> Sounds sane to me.
>
> And we can make it AC-conditional if it's actually shown to be visible
> from a performance standpoint.
>
> But iirc pushf/popf isn't really that expensive - in fact I think it's
> pretty cheap when system flags don't change.
I did not see evidence of this. In my testing,
POPF is always ~20 cycles, even if popped flags are identical to current
state of flags.
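A minimal userspace sketch of that kind of measurement (assumptions:
x86-64, plain TSC timing with no serializing instructions, so treat the
number as rough; build with -mno-red-zone so the pushf below %rsp cannot
clobber compiler spills):

    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        enum { N = 1000000 };
        uint64_t t0, t1;
        int i;

        t0 = rdtsc();
        for (i = 0; i < N; i++)
            /* popf restores the identical flags just pushed */
            asm volatile("pushf; popf" ::: "memory", "cc");
        t1 = rdtsc();

        printf("~%.1f cycles per pushf+popf pair\n",
               (double)(t1 - t0) / N);
        return 0;
    }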
On March 21, 2019 10:25:05 AM PDT, Denys Vlasenko <[email protected]> wrote:
> On 3/18/19 7:10 PM, Linus Torvalds wrote:
> > On Mon, Mar 18, 2019 at 10:51 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > How about I do a patch that schedules EFLAGS for both 32bit and 64bit,
> > > mark this for backporting to infinity.
> > >
> > > And then at the end, after the objtool-ac bits land, I do a patch
> > > removing the EFLAGS scheduling for x86_64.
> >
> > Sounds sane to me.
> >
> > And we can make it AC-conditional if it's actually shown to be visible
> > from a performance standpoint.
> >
> > But iirc pushf/popf isn't really that expensive - in fact I think it's
> > pretty cheap when system flags don't change.
>
> I did not see evidence of this. In my testing,
> POPF is always ~20 cycles, even if popped flags are identical to current
> state of flags.
I think you will find that if you change system flags it is much slower.
On Thu, Mar 21, 2019 at 10:25 AM Denys Vlasenko <[email protected]> wrote:
>
> >
> > But iirc pushf/popf isn't really that expensive - in fact I think it's
> > pretty cheap when system flags don't change.
>
> I did not see evidence of this. In my testing,
> POPF is always ~20 cycles, even if popped flags are identical to current
> state of flags.
It may have been an artifact on just some older CPUs. I have this
distinct memory of a popf that changed IF being more expensive, but
maybe that was in the Pentium 4 days.
Or maybe it's just that my memory is garbage.
Linus
On Thu, Mar 21, 2019 at 11:18:05AM -0700, [email protected] wrote:
> On March 21, 2019 10:25:05 AM PDT, Denys Vlasenko <[email protected]> wrote:
> > I did not see evidence of this. In my testing,
> > POPF is always ~20 cycles, even if popped flags are identical to
> > current state of flags.
>
> I think you will find that if you change system flags it is much slower.
So with all the patches in this series applied, only x86_32 will suffer
this, and I don't think anybody still considers that a
performance-critical platform.
That said, we could do something terrible like:
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -673,7 +673,8 @@ ENTRY(__switch_to_asm)
#endif
/* restore callee-saved registers */
- popfl
+ ALTERNATIVE "popl %esi", \
+ "popfl", X86_FEATURE_SMAP
popl %esi
popl %edi
popl %ebx
And then you only pay the POPF penalty when you run a 32bit kernel on a
SMAP enabled CPU, and we have a very good solution in that case: run a
64bit kernel.
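Two notes on the trick above. The non-SMAP variant's popl %esi is not
a stray: it pops the saved flags word into a scratch register purely to
keep the stack balanced, and the popl %esi that follows then loads the
real value. And ALTERNATIVE patches the site at boot: the image carries
the first variant, and apply_alternatives() rewrites it to the second on
CPUs that have X86_FEATURE_SMAP, so non-SMAP hardware never executes a
POPF here at all. Schematically:

    /* in the image:        popl %esi   # discard flags word */
    /* patched, SMAP CPUs:  popfl       # actually restore EFLAGS */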
64bit kernel.