2022-03-11 21:34:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault

On 3/11/22 11:07, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> access_error() currently does not check for execution permission
> violation. As a result, spurious page-faults due to execution permission
> violation cause SIGSEGV.

This is a bit muddy on the problem statement. I get that spurious
faults can theoretically cause this, but *do* they in practice on
current kernels?

> It appears not to be an issue so far, but the next patches avoid TLB
> flushes on permission promotion, which can lead to this scenario. nodejs
> for instance crashes when TLB flush is avoided on permission promotion.

By "it appears not to be an issue", do you mean that this suboptimal
behavior can not be triggered, period? Or, it can be triggered but
folks seem not to care that it can be triggered?

I *think* these can be triggered today. I think it takes two threads
that do something like:

Thread 1 Thread 2
======== ========
ptr = malloc();
memcpy(ptr, &code, len);
exec_now = 1;
while (!exec_now);
call(ptr);
// fault
mprotect(ptr, PROT_EXEC, len);
// fault sees VM_EXEC


But that has a bug: exec_now is set before the mprotect(). It's not
sane code.

Can any sane code trigger this?

> Add a check to prevent access_error() from returning mistakenly that
> spurious page-faults due to instruction fetch are a reason for an access
> error.
>
> It is assumed that error code bits of "instruction fetch" and "write" in
> the hardware error code are mutual exclusive, and the change assumes so.
> However, to be on the safe side, especially if hypervisors misbehave,
> assert this is the case and warn otherwise.
>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andrew Cooper <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Peter Xu <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Yu Zhao <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: [email protected]
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> arch/x86/mm/fault.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index d0074c6ed31a..ad0ef0a6087a 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
> (error_code & X86_PF_INSTR), foreign))
> return 1;
>
> - if (error_code & X86_PF_WRITE) {
> + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
> + /*
> + * CPUs are not expected to set the two error code bits
> + * together, but to ensure that hypervisors do not misbehave,
> + * run an additional sanity check.
> + */
> + if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
> + (X86_PF_WRITE|X86_PF_INSTR)) {
> + WARN_ON_ONCE(1);
> + return 1;
> + }

access_error() is only used on the do_user_addr_fault() side of things.
Can we stick this check somewhere that also works for kernel address
faults? This is a generic sanity check. It can also be in a separate
patch.

Also, we should *probably* stop talking about CPUs here. If there's
ever something wonky with error code bits, I'd put my money on a weird
hypervisor before any kind of CPU issue.

> /* write, present and write, not present: */
> - if (unlikely(!(vma->vm_flags & VM_WRITE)))
> + if ((error_code & X86_PF_WRITE) &&
> + unlikely(!(vma->vm_flags & VM_WRITE)))
> + return 1;
> +
> + /* exec, present and exec, not present: */
> + if ((error_code & X86_PF_INSTR) &&
> + unlikely(!(vma->vm_flags & VM_EXEC)))
> return 1;
> +
> return 0;
> }

This is getting really ugly. I think we've gone over this before, but
it escapes me. Why do we need a common (X86_PF_WRITE | X86_PF_INSTR)
block of code? Why can't we just add a simple X86_PF_INSN if() that
mirrors the current X86_PF_WRITE one?


if (error_code & X86_PF_INSN) {
/* present and not exec: */
if (unlikely(!(vma->vm_flags & VM_EXEC)))
return 1;
return 0;
}



2022-03-11 23:32:23

by Nadav Amit

[permalink] [raw]
Subject: Re: [RESEND PATCH v3 2/5] x86/mm: check exec permissions on fault



> On Mar 11, 2022, at 11:41 AM, Dave Hansen <[email protected]> wrote:
>
> On 3/11/22 11:07, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>>
>> access_error() currently does not check for execution permission
>> violation. As a result, spurious page-faults due to execution permission
>> violation cause SIGSEGV.
>
> This is a bit muddy on the problem statement. I get that spurious
> faults can theoretically cause this, but *do* they in practice on
> current kernels?
>
>> It appears not to be an issue so far, but the next patches avoid TLB
>> flushes on permission promotion, which can lead to this scenario. nodejs
>> for instance crashes when TLB flush is avoided on permission promotion.
>
> By "it appears not to be an issue", do you mean that this suboptimal
> behavior can not be triggered, period? Or, it can be triggered but
> folks seem not to care that it can be triggered?
>
> I *think* these can be triggered today. I think it takes two threads
> that do something like:
>
> Thread 1 Thread 2
> ======== ========
> ptr = malloc();
> memcpy(ptr, &code, len);
> exec_now = 1;
> while (!exec_now);
> call(ptr);
> // fault
> mprotect(ptr, PROT_EXEC, len);
> // fault sees VM_EXEC
>
>
> But that has a bug: exec_now is set before the mprotect(). It's not
> sane code.
>
> Can any sane code trigger this?

Well, regarding this question and the previous one: I do not think that
this scenario is possible today since mprotect() holds the mmap_lock
for write. There is no other code that I am aware of that toggles
the NX bit on a present entry.

But I will not bet my life on it. That’s the reason for the somewhat
vague phrasing that I used.

>>
>> index d0074c6ed31a..ad0ef0a6087a 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>> (error_code & X86_PF_INSTR), foreign))
>> return 1;
>>
>> - if (error_code & X86_PF_WRITE) {
>> + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
>> + /*
>> + * CPUs are not expected to set the two error code bits
>> + * together, but to ensure that hypervisors do not misbehave,
>> + * run an additional sanity check.
>> + */
>> + if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
>> + (X86_PF_WRITE|X86_PF_INSTR)) {
>> + WARN_ON_ONCE(1);
>> + return 1;
>> + }
>
> access_error() is only used on the do_user_addr_fault() side of things.
> Can we stick this check somewhere that also works for kernel address
> faults? This is a generic sanity check. It can also be in a separate
> patch.

I can wrap it in a different function and also call it from
do_kern_addr_fault() or spurious_kernel_fault().

Anyhow, spurious_kernel_fault() should handle spurious faults on
executable code correctly.

>
> Also, we should *probably* stop talking about CPUs here. If there's
> ever something wonky with error code bits, I'd put my money on a weird
> hypervisor before any kind of CPU issue.

I thought I manage to convey exactly that in the comment. Can you provide
a better phrasing?

>
>> /* write, present and write, not present: */
>> - if (unlikely(!(vma->vm_flags & VM_WRITE)))
>> + if ((error_code & X86_PF_WRITE) &&
>> + unlikely(!(vma->vm_flags & VM_WRITE)))
>> + return 1;
>> +
>> + /* exec, present and exec, not present: */
>> + if ((error_code & X86_PF_INSTR) &&
>> + unlikely(!(vma->vm_flags & VM_EXEC)))
>> return 1;
>> +
>> return 0;
>> }
>
> This is getting really ugly. I think we've gone over this before, but
> it escapes me. Why do we need a common (X86_PF_WRITE | X86_PF_INSTR)
> block of code? Why can't we just add a simple X86_PF_INSN if() that
> mirrors the current X86_PF_WRITE one?
>
>
> if (error_code & X86_PF_INSN) {
> /* present and not exec: */
> if (unlikely(!(vma->vm_flags & VM_EXEC)))
> return 1;
> return 0;
> }

You are correct. My bad. I will fix it.