Kernel addresses are always mapped with _PAGE_USER=0, i.e. PKRU
isn't enforced, and so we should never see X86_PF_PK set on a
kernel address fault. WARN once to capture the issue in case we
somehow don't die, e.g. the access originated in userspace.
Remove a similar check and its comment from spurious_fault_check().
The intent of the comment (and later code[1]) was simply to document
that spurious faults due to protection keys should be impossible, but
that's irrelevant and a bit of a red herring since we should never
get a protection keys fault on a kernel address regardless of the
kernel's TLB flushing behavior.
[1] http://lists-archives.com/linux-kernel/28407455-x86-pkeys-new-page-fault-error-code-bit-pf_pk.html
Signed-off-by: Sean Christopherson <[email protected]>
Cc: Dave Hansen <[email protected]>
---
There's no indication that this condition has ever been encountered.
I came across the code in spurious_fault_check() and was confused as
to why we would unconditionally treat a protection keys fault as
spurious when the comment explicitly stated that such a case should
be impossible.
Dave Hansen suggested adding a WARN_ON_ONCE in spurious_fault_check(),
but it seemed more appropriate to freak out on any protection keys
fault on a kernel address since that would imply a hardware issue or
kernel bug. I omitted a Suggested-by since this isn't necessarily
what Dave had in mind.
arch/x86/mm/fault.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 2aafa6ab6103..f19a55972136 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1040,12 +1040,6 @@ static int spurious_fault_check(unsigned long error_code, pte_t *pte)
if ((error_code & X86_PF_INSTR) && !pte_exec(*pte))
return 0;
- /*
- * Note: We do not do lazy flushing on protection key
- * changes, so no spurious fault will ever set X86_PF_PK.
- */
- if ((error_code & X86_PF_PK))
- return 1;
return 1;
}
@@ -1241,6 +1235,14 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
* protection error (error_code & 9) == 0.
*/
if (unlikely(fault_in_kernel_space(address))) {
+ /*
+ * We should never encounter a protection keys fault on a
+ * kernel address as kernel addresses are always mapped with
+ * _PAGE_USER=0, i.e. PKRU isn't enforced.
+ */
+ if (WARN_ON_ONCE(error_code & X86_PF_PK))
+ goto bad_kernel_address;
+
if (!(error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
if (vmalloc_fault(address) >= 0)
return;
@@ -1253,6 +1255,8 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
/* kprobes don't want to hook the spurious faults: */
if (kprobes_fault(regs))
return;
+
+bad_kernel_address:
/*
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock:
--
2.18.0
On 08/07/2018 10:29 AM, Sean Christopherson wrote:
> if (unlikely(fault_in_kernel_space(address))) {
> + /*
> + * We should never encounter a protection keys fault on a
> + * kernel address as kernel addresses are always mapped with
> + * _PAGE_USER=0, i.e. PKRU isn't enforced.
> + */
> + if (WARN_ON_ONCE(error_code & X86_PF_PK))
> + goto bad_kernel_address;
I just realized one more thing: the vsyscall page can bite us here.
It's at a fault_in_kernel_space() address and we *can* trigger a pkey
fault on it if we jump to an instruction that reads from a
pkey-protected area.
We can make a gadget out of unaligned vsyscall instructions that does
that. See:
0xffffffffff600002: shlb $0x0,0x0(%rax)
Then, we turn off access to all pkeys, including pkey-0, then jump to
the unaligned vsyscall instruction, which reads %rax, which is a kernel
address:
asm("movl $0xffffffff, %eax;\
movl $0x00000000, %ecx;\
movl $0x00000000, %edx;\
wrpkru;\
movq $0xffffffffff600000, %rax;\
movq $0xffffffffff600002, %rbx;\
jmpq *%rbx;");
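For reference, here's the same sequence as a self-contained program (a
sketch, not the actual test program; it assumes an old kernel where the
vsyscall page is still natively executable and a CPU with pkeys -- on
current kernels the jmp simply faults without X86_PF_PK):

	int main(void)
	{
		asm volatile("movl $0xffffffff, %%eax\n\t" /* PKRU: deny all keys   */
			     "xorl %%ecx, %%ecx\n\t"       /* wrpkru requires ECX=0 */
			     "xorl %%edx, %%edx\n\t"       /* wrpkru requires EDX=0 */
			     "wrpkru\n\t"
			     "movq $0xffffffffff600000, %%rax\n\t"
			     "movq $0xffffffffff600002, %%rbx\n\t"
			     "jmpq *%%rbx"                 /* the unaligned gadget  */
			     : : : "rax", "rbx", "rcx", "rdx", "memory");
		return 0;
	}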
So, my bad. It was not a good suggestion to do a WARN_ON(). But, the
other funny thing is I would have expected spurious_fault() to get us
into a fault loop, which it doesn't. It's definitely getting *called*
with my little test program (I see it in ftrace) but it's not quite
doing what I expect.
I need to dig a bit more.
On Tue, 7 Aug 2018, Dave Hansen wrote:
> [...]
> So, my bad. It was not a good suggestion to do a WARN_ON(). But, the
> other funny thing is I would have expected spurious_fault() to get us
> into a fault loop, which it doesn't. It's definitely getting *called*
> with my little test program (I see it in ftrace) but it's not quite
> doing what I expect.
>
> I need to dig a bit more.
Given the time span you should be close to ground water with your digging
by now.
Thanks,
tglx
On 08/30/2018 03:40 AM, Thomas Gleixner wrote:
> Given the time span you should be close to ground water with your digging
> by now.
So, turns out that we start our spurious_fault() code with this check:
> if (error_code != (X86_PF_WRITE | X86_PF_PROT) &&
> error_code != (X86_PF_INSTR | X86_PF_PROT))
> return 0;
Which ensures that we only do spurious checking for *very* specific
error_code values. That ends up making the X86_PF_PK check inside of
spurious_fault_check() dead code _anyway_. It's totally unreachable as
far as I can tell.
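A minimal userspace model of that (a sketch; the X86_PF_* bit values
mirror the kernel's definitions):

	#include <stdbool.h>
	#include <stdint.h>

	#define X86_PF_PROT	(1u << 0)
	#define X86_PF_WRITE	(1u << 1)
	#define X86_PF_INSTR	(1u << 4)
	#define X86_PF_PK	(1u << 5)

	/* Models the exact-match filter at the top of spurious_fault(): */
	static bool might_be_spurious(uint32_t error_code)
	{
		return error_code == (X86_PF_WRITE | X86_PF_PROT) ||
		       error_code == (X86_PF_INSTR | X86_PF_PROT);
	}

	int main(void)
	{
		/* Any error code with X86_PF_PK set fails the exact match,
		 * so the PK check further down can never be reached: */
		return might_be_spurious(X86_PF_PROT | X86_PF_WRITE | X86_PF_PK);
	}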
We could add a comment above the error_code check to make it explicit
that it excludes pkeys.
But, otherwise, I think we can just axe the X86_PF_PK
spurious_fault_check().
On Tue, 7 Aug 2018 Dave Hansen <[email protected]> wrote:
>
> [...]
>
> I just realized one more thing: the vsyscall page can bite us here.
> It's at a fault_in_kernel_space() address and we *can* trigger a pkey
> fault on it if we jump to an instruction that reads from a
> pkey-protected area.
>
> We can make a gadget out of unaligned vsyscall instructions that does
> that. See:
>
> 0xffffffffff600002: shlb $0x0,0x0(%rax)
>
> Then, we turn off access to all pkeys, including pkey-0, then jump to
> the unaligned vsyscall instruction, which reads %rax, which is a kernel
> address:
Andy got rid of the (native) vsyscall page in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=076ca272a14cea558b1092ec85cea08510283f2a
('x86/vsyscall/64: Drop "native" vsyscalls') a few months ago, right?
At this point, the vsyscall page should never be executable.
> On Aug 30, 2018, at 7:38 PM, Jann Horn <[email protected]> wrote:
>
>> On Tue, 7 Aug 2018 Dave Hansen <[email protected]> wrote:
>> [...]
>
> Andy got rid of the (native) vsyscall page in
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=076ca272a14cea558b1092ec85cea08510283f2a
> ('x86/vsyscall/64: Drop "native" vsyscalls') a few months ago, right?
> At this point, the vsyscall page should never be executable.
Indeed.
Can one of you cc me on the original patch?
Cc-ing Jann and Andy.
On Tue, Aug 07, 2018 at 10:29:20AM -0700, Sean Christopherson wrote:
> Kernel addresses are always mapped with _PAGE_USER=0, i.e. PKRU
> isn't enforced, and so we should never see X86_PF_PK set on a
> kernel address fault. WARN once to capture the issue in case we
> somehow don't die, e.g. the access originated in userspace.
>
> Remove a similar check and its comment from spurious_fault_check().
> The intent of the comment (and later code[1]) was simply to document
> that spurious faults due to protection keys should be impossible, but
> that's irrelevant and a bit of a red herring since we should never
> get a protection keys fault on a kernel address regardless of the
> kernel's TLB flushing behavior.
>
> [1] http://lists-archives.com/linux-kernel/28407455-x86-pkeys-new-page-fault-error-code-bit-pf_pk.html
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Cc: Dave Hansen <[email protected]>
> ---
> There's no indication that this condition has ever been encountered.
> I came across the code in spurious_fault_check() and was confused as
> to why we would unconditionally treat a protection keys fault as
> spurious when the comment explicitly stated that such a case should
> be impossible.
>
> Dave Hansen suggested adding a WARN_ON_ONCE in spurious_fault_check(),
> but it seemed more appropriate to freak out on any protection keys
> fault on a kernel address since that would imply a hardware issue or
> kernel bug. I omitted a Suggested-by since this isn't necessarily
> what Dave had in mind.
>
> arch/x86/mm/fault.c | 16 ++++++++++------
> 1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 2aafa6ab6103..f19a55972136 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1040,12 +1040,6 @@ static int spurious_fault_check(unsigned long error_code, pte_t *pte)
>
> if ((error_code & X86_PF_INSTR) && !pte_exec(*pte))
> return 0;
> - /*
> - * Note: We do not do lazy flushing on protection key
> - * changes, so no spurious fault will ever set X86_PF_PK.
> - */
> - if ((error_code & X86_PF_PK))
> - return 1;
>
> return 1;
> }
> @@ -1241,6 +1235,14 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
> * protection error (error_code & 9) == 0.
> */
> if (unlikely(fault_in_kernel_space(address))) {
> + /*
> + * We should never encounter a protection keys fault on a
> + * kernel address as kernel addresses are always mapped with
> + * _PAGE_USER=0, i.e. PKRU isn't enforced.
> + */
> + if (WARN_ON_ONCE(error_code & X86_PF_PK))
> + goto bad_kernel_address;
> +
> if (!(error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
> if (vmalloc_fault(address) >= 0)
> return;
> @@ -1253,6 +1255,8 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
> /* kprobes don't want to hook the spurious faults: */
> if (kprobes_fault(regs))
> return;
> +
> +bad_kernel_address:
> /*
> * Don't take the mm semaphore here. If we fixup a prefetch
> * fault we could otherwise deadlock:
> --
> 2.18.0
>
On Tue, Sep 4, 2018 at 12:50 PM, Sean Christopherson
<[email protected]> wrote:
> Cc-ing Jann and Andy.
>
> On Tue, Aug 07, 2018 at 10:29:20AM -0700, Sean Christopherson wrote:
>> [...]
I have no objection to this patch.
Dave, why did you think that we could get a PK fault on the vsyscall
page, even on kernels that still marked it executable? Sure, you
could get an instruction in the vsyscall page to get a PK fault, but
CR2 wouldn't point to the vsyscall page, right?
On 09/04/2018 12:56 PM, Andy Lutomirski wrote:
> I have no objection to this patch.
>
> Dave, why did you think that we could get a PK fault on the vsyscall
> page, even on kernels that still marked it executable? Sure, you
> could get an instruction in the vsyscall page to get a PK fault, but
> CR2 wouldn't point to the vsyscall page, right?
I'm inferring the CR2 value from the page fault trace point. I see
entries like this:
	protection_keys-4313 [002] d... 420257.094541: page_fault_user:
		address=_end ip=_end error_code=0x15
But, that's not a PK fault, and it triggers the "misaligned vsyscall
(exploit attempt or buggy program)" stuff in dmesg. It's just the
symptom of trying to execute the non-executable vsyscall page.
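For reference, error_code=0x15 decodes to X86_PF_PROT|X86_PF_USER|X86_PF_INSTR,
i.e. a user-mode instruction fetch hitting a protection violation on a present
page, with X86_PF_PK clear -- consistent with executing the non-executable
vsyscall page. A quick decode sketch (bit positions per the kernel's X86_PF_*
enum):

	#include <stdint.h>
	#include <stdio.h>

	/* x86 page fault error code bits, lowest to highest: */
	static const char * const pf_bits[] = {
		"PROT", "WRITE", "USER", "RSVD", "INSTR", "PK"
	};

	int main(void)
	{
		uint32_t error_code = 0x15;	/* from the trace entry above */

		for (int i = 0; i < 6; i++)
			if (error_code & (1u << i))
				printf("X86_PF_%s\n", pf_bits[i]);
		return 0;	/* prints PROT, USER, INSTR */
	}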
I'm not a super big fan of this particular patch, though. The
fault_in_kernel_space() check is really presuming two things:
1. pkey faults (PF_PK=1) only occur on user pages (_PAGE_USER=1)
2. fault_in_kernel_space()==1 addresses are never user pages
#1 is a hardware expectation. We *can* look for that directly by just
making sure that X86_PF_PK is only set when it also comes with
X86_PF_USER in the hardware page fault error code.
Aside: We should probably explicitly separate out the hardware
error code from the software-munged version, like we do here:

> 	if (user_mode(regs)) {
> 		local_irq_enable();
> 		error_code |= X86_PF_USER;
But, #2 is a looser check. It wasn't true for the recent vsyscall
page, and I've also seen goofy drivers map memory out to userspace
quite a few times in the kernel address space.

So, I'd much rather see an X86_PF_USER check than a
fault_in_kernel_space() check.
But, as for pkeys...
The original intent here was to relay: "protection key faults can never
be spurious". The reason in my silly comment was that we don't do lazy
flushing, but that's imprecise: the real reasoning is that we don't ever
have kernel pages on which we can take protection key faults.
IOW, I think the check here should be for "protection key faults only
occur on user pages", and all the *spurious* checking should be looking
at *just* user vs. kernel pages, like:
static int spurious_fault_check(unsigned long error_code, pte_t *pte)
{
	/* Only expect spurious faults on kernel pages: */
	WARN_ON_ONCE(pte_flags(*pte) & _PAGE_USER);
	/* Only expect spurious faults originating from kernel code: */
	WARN_ON_ONCE(error_code & X86_PF_USER);
	...
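As a compilable userspace model of that sketch (pte_t, pte_flags() and
WARN_ON_ONCE() are stubbed out here; the real WARN_ON_ONCE warns only
once):

	#include <stdint.h>
	#include <stdio.h>

	#define _PAGE_USER	(1u << 2)	/* PTE: user-accessible page   */
	#define X86_PF_USER	(1u << 2)	/* error code: user-mode fault */

	/* Loose stand-in for the kernel macro (this one warns every time): */
	#define WARN_ON_ONCE(cond) \
		do { if (cond) fprintf(stderr, "WARN: %s\n", #cond); } while (0)

	typedef struct { uint64_t flags; } pte_t;
	static uint64_t pte_flags(pte_t pte) { return pte.flags; }

	static int spurious_fault_check(unsigned long error_code, pte_t *pte)
	{
		/* Only expect spurious faults on kernel pages: */
		WARN_ON_ONCE(pte_flags(*pte) & _PAGE_USER);
		/* Only expect spurious faults originating from kernel code: */
		WARN_ON_ONCE(error_code & X86_PF_USER);
		return 1;			/* remaining checks elided */
	}

	int main(void)
	{
		pte_t kernel_pte = { .flags = 0 };	/* _PAGE_USER clear */
		return !spurious_fault_check(0, &kernel_pte);
	}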
On Tue, Sep 4, 2018 at 2:21 PM, Dave Hansen <[email protected]> wrote:
> [...]
> But, as for pkeys...
>
> The original intent here was to relay: "protection key faults can never
> be spurious". The reason in my silly comment was that we don't do lazy
> flushing, but that's imprecise: the real reasoning is that we don't ever
> have kernel pages on which we can take protection key faults.
>
> IOW, I think the check here should be for "protection key faults only
> occur on user pages", and all the *spurious* checking should be looking
> at *just* user vs. kernel pages, like:
>
> static int spurious_fault_check(unsigned long error_code, pte_t *pte)
> {
> 	/* Only expect spurious faults on kernel pages: */
> 	WARN_ON_ONCE(pte_flags(*pte) & _PAGE_USER);
> 	/* Only expect spurious faults originating from kernel code: */
> 	WARN_ON_ONCE(error_code & X86_PF_USER);
> 	...
>
Want to just send an alternative patch?
Also, I doubt it matters right here, but !X86_PF_USER isn't quite the
same thing as "originating from kernel code" -- it can also be user
code that does a CPL0 access due to exception delivery or access to a
descriptor table. Which you saw plenty of times while debugging
PTI... :) I doubt any of those should be spurious, though.
--Andy
On 09/04/2018 02:27 PM, Andy Lutomirski wrote:
> Also, I doubt it matters right here, but !X86_PF_USER isn't quite the
> same thing as "originating from kernel code" -- it can also be user
> code that does a CPL0 access due to exception delivery or access to a
> descriptor table. Which you saw plenty of times while debugging
> PTI... :) I doubt any of those should be spurious, though.
Yeah, you're talking about "implicit supervisor access". Right?
I definitely saw those during KAISER/KPTI development. But, it's
probably worth noting that the code that we have that _looks_ for these:
> 	/*
> 	 * It's safe to allow irq's after cr2 has been saved and the
> 	 * vmalloc fault has been handled.
> 	 *
> 	 * User-mode registers count as a user access even for any
> 	 * potential system fault or CPU buglet:
> 	 */
> 	if (user_mode(regs)) {
> 		local_irq_enable();
> 		error_code |= X86_PF_USER;
> 		flags |= FAULT_FLAG_USER;
is in the user address space handling portion of the code. So, it's
really not useful for any of the _known_ implicit supervisor-mode
accesses. I think it can only get triggered in the case of hardware or
software bugs.
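A tiny model of the hardware vs. software-munged error code distinction
being discussed (illustrative only):

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define X86_PF_USER (1u << 2)

	int main(void)
	{
		/*
		 * An implicit supervisor access on behalf of user code,
		 * e.g. the CPU touching a descriptor table during
		 * exception delivery: the hardware error code has the
		 * USER bit clear even though regs point at user code.
		 */
		uint32_t hw_error_code = 0;
		bool user_mode_regs = true;

		/* The software munging quoted above: */
		uint32_t sw_error_code = hw_error_code;
		if (user_mode_regs)
			sw_error_code |= X86_PF_USER;

		printf("hw=%#x sw=%#x\n", (unsigned)hw_error_code,
		       (unsigned)sw_error_code);
		return 0;
	}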
On Wed, Sep 5, 2018 at 2:35 PM, Dave Hansen <[email protected]> wrote:
> On 09/04/2018 02:27 PM, Andy Lutomirski wrote:
>> Also, I doubt it matters right here, but !X86_PF_USER isn't quite the
>> same thing as "originating from kernel code" -- it can also be user
>> code that does a CPL0 access due to exception delivery or access to a
>> descriptor table. Which you saw plenty of times while debugging
>> PTI... :) I doubt any of those should be spurious, though.
>
> Yeah, you're talking about "implicit supervisor access". Right?
Yes.