2008-01-24 00:05:49

by Jeremy Fitzhardinge

Subject: [PATCH] x86: ignore spurious faults

When changing a kernel page from RO->RW, it's OK to leave stale TLB
entries around, since doing a global flush is expensive and they pose
no security problem. They can, however, generate a spurious fault,
which we should catch and simply return from (which will have the
side-effect of reloading the TLB to the current PTE).

This can occur when running under Xen, because it frequently changes
kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
doing a global TLB flush after changing page permissions.
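
[ Editorial illustration, not part of the patch: a minimal sketch of
the sequence described above. set_memory_ro()/set_memory_rw() merely
stand in for whatever mechanism flips the kernel pte without a global
flush.

static void example_ro_rw_toggle(unsigned long addr)
{
        set_memory_ro(addr, 1);  /* a CPU may now cache an RO TLB entry */
        set_memory_rw(addr, 1);  /* pte is RW again; no cross-CPU flush */

        /*
         * A write through the stale RO TLB entry faults; spurious_fault()
         * sees pte_write(*pte), do_page_fault() returns, the CPU rewalks
         * the page tables and the write retries successfully.
         */
        *(int *)addr = 1;
} ]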

[ Changes to fault_32.c and fault_64.c are identical, and should be
easy to unify when the time comes. ]

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Harvey Harrison <[email protected]>
---
arch/x86/mm/fault_32.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/fault_64.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+)

===================================================================
--- a/arch/x86/mm/fault_32.c
+++ b/arch/x86/mm/fault_32.c
@@ -290,6 +290,53 @@ static int is_errata93(struct pt_regs *r


/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+                          unsigned long error_code)
+{
+        pgd_t *pgd;
+        pud_t *pud;
+        pmd_t *pmd;
+        pte_t *pte;
+
+        /* Reserved-bit violation or user access to kernel space? */
+        if (error_code & (PF_USER | PF_RSVD))
+                return 0;
+
+        pgd = init_mm.pgd + pgd_index(address);
+        if (!pgd_present(*pgd))
+                return 0;
+
+        pud = pud_offset(pgd, address);
+        if (!pud_present(*pud))
+                return 0;
+
+        pmd = pmd_offset(pud, address);
+        if (!pmd_present(*pmd))
+                return 0;
+
+        pte = pte_offset_kernel(pmd, address);
+        if (!pte_present(*pte))
+                return 0;
+        if ((error_code & 0x02) && !pte_write(*pte))
+                return 0;
+
+#if _PAGE_NX
+        if ((error_code & PF_INSTR) && !pte_exec(*pte))
+                return 0;
+#endif
+
+        return 1;
+}
+
+/*
* Handle a fault on the vmalloc or module mapping area
*
* This assumes no large pages in there.
@@ -412,6 +459,11 @@ void __kprobes do_page_fault(struct pt_r
        if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) &&
            vmalloc_fault(address) >= 0)
                return;
+
+        /* Can handle a stale RO->RW TLB */
+        if (spurious_fault(address, error_code))
+                return;
+
        /*
         * Don't take the mm semaphore here. If we fixup a prefetch
         * fault we could otherwise deadlock.
===================================================================
--- a/arch/x86/mm/fault_64.c
+++ b/arch/x86/mm/fault_64.c
@@ -275,6 +275,53 @@ static noinline void pgtable_bad(unsigne
}

/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+                          unsigned long error_code)
+{
+        pgd_t *pgd;
+        pud_t *pud;
+        pmd_t *pmd;
+        pte_t *pte;
+
+        /* Reserved-bit violation or user access to kernel space? */
+        if (error_code & (PF_USER | PF_RSVD))
+                return 0;
+
+        pgd = init_mm.pgd + pgd_index(address);
+        if (!pgd_present(*pgd))
+                return 0;
+
+        pud = pud_offset(pgd, address);
+        if (!pud_present(*pud))
+                return 0;
+
+        pmd = pmd_offset(pud, address);
+        if (!pmd_present(*pmd))
+                return 0;
+
+        pte = pte_offset_kernel(pmd, address);
+        if (!pte_present(*pte))
+                return 0;
+        if ((error_code & 0x02) && !pte_write(*pte))
+                return 0;
+
+#if _PAGE_NX
+        if ((error_code & PF_INSTR) && !pte_exec(*pte))
+                return 0;
+#endif
+
+        return 1;
+}
+
+/*
* Handle a fault on the vmalloc area
*
* This assumes no large pages in there.
@@ -406,6 +453,11 @@ asmlinkage void __kprobes do_page_fault(
                if (vmalloc_fault(address) >= 0)
                        return;
        }
+
+        /* Can handle a stale RO->RW TLB */
+        if (spurious_fault(address, error_code))
+                return;
+
        /*
         * Don't take the mm semaphore here. If we fixup a prefetch
         * fault we could otherwise deadlock.


2008-01-24 00:18:31

by Harvey Harrison

Subject: Re: [PATCH] x86: ignore spurious faults

On Wed, 2008-01-23 at 16:05 -0800, Jeremy Fitzhardinge wrote:
> ===================================================================
> --- a/arch/x86/mm/fault_32.c
> +++ b/arch/x86/mm/fault_32.c
> @@ -290,6 +290,53 @@ static int is_errata93(struct pt_regs *r
>
>
> /*
> + * Handle a spurious fault caused by a stale TLB entry. This allows
> + * us to lazily refresh the TLB when increasing the permissions of a
> + * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
> + * expensive since that implies doing a full cross-processor TLB
> + * flush, even if no stale TLB entries exist on other processors.
> + * There are no security implications to leaving a stale TLB when
> + * increasing the permissions on a page.
> + */
> +static int spurious_fault(unsigned long address,
> + unsigned long error_code)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + /* Reserved-bit violation or user access to kernel space? */
> + if (error_code & (PF_USER | PF_RSVD))
> + return 0;
> +
> + pgd = init_mm.pgd + pgd_index(address);
> + if (!pgd_present(*pgd))
> + return 0;
> +
> + pud = pud_offset(pgd, address);
> + if (!pud_present(*pud))
> + return 0;
> +
> + pmd = pmd_offset(pud, address);
> + if (!pmd_present(*pmd))
> + return 0;
> +
> + pte = pte_offset_kernel(pmd, address);
> + if (!pte_present(*pte))
> + return 0;
> + if ((error_code & 0x02) && !pte_write(*pte))
> + return 0;

        if ((error_code & PF_WRITE) && !pte_write(*pte))
                return 0;

> +
> +#if _PAGE_NX
> + if ((error_code & PF_INSTR) && !pte_exec(*pte))
> + return 0;
> +#endif
> +

How about dropping the #if and rely on the !pte_exec() test always
being false when _PAGE_NX = 0? The compiler should just trim this
all away.

from pgtable.h:

static inline int pte_exec(pte_t pte) { return !(pte_val(pte) & _PAGE_NX); }
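
[ Editorial aside: as a standalone toy (userspace C, not kernel code;
the macros are redefined locally just to show the compile-time folding),
the point is that with _PAGE_NX defined as 0 the NX check is dead code:

#include <stdio.h>

#define _PAGE_NX 0UL                    /* pretend NX is unsupported */
#define PF_INSTR 0x10UL

typedef struct { unsigned long v; } pte_t;
static unsigned long pte_val(pte_t p) { return p.v; }
static int pte_exec(pte_t p) { return !(pte_val(p) & _PAGE_NX); }

int main(void)
{
        pte_t pte = { 0x1UL };
        unsigned long error_code = PF_INSTR;

        /* With _PAGE_NX == 0, pte_exec() is constant-true, so the
         * condition below folds to 0 and the branch is eliminated. */
        if ((error_code & PF_INSTR) && !pte_exec(pte))
                puts("NX violation");
        else
                puts("branch trimmed away when _PAGE_NX == 0");
        return 0;
} ]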

Cheers,

Harvey

2008-01-24 00:26:41

by Jeremy Fitzhardinge

Subject: Re: [PATCH] x86: ignore spurious faults

Harvey Harrison wrote:
> On Wed, 2008-01-23 at 16:05 -0800, Jeremy Fitzhardinge wrote:
>
>> ===================================================================
>> --- a/arch/x86/mm/fault_32.c
>> +++ b/arch/x86/mm/fault_32.c
>> @@ -290,6 +290,53 @@ static int is_errata93(struct pt_regs *r
>>
>>
>> /*
>> + * Handle a spurious fault caused by a stale TLB entry. This allows
>> + * us to lazily refresh the TLB when increasing the permissions of a
>> + * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
>> + * expensive since that implies doing a full cross-processor TLB
>> + * flush, even if no stale TLB entries exist on other processors.
>> + * There are no security implications to leaving a stale TLB when
>> + * increasing the permissions on a page.
>> + */
>> +static int spurious_fault(unsigned long address,
>> + unsigned long error_code)
>> +{
>> + pgd_t *pgd;
>> + pud_t *pud;
>> + pmd_t *pmd;
>> + pte_t *pte;
>> +
>> + /* Reserved-bit violation or user access to kernel space? */
>> + if (error_code & (PF_USER | PF_RSVD))
>> + return 0;
>> +
>> + pgd = init_mm.pgd + pgd_index(address);
>> + if (!pgd_present(*pgd))
>> + return 0;
>> +
>> + pud = pud_offset(pgd, address);
>> + if (!pud_present(*pud))
>> + return 0;
>> +
>> + pmd = pmd_offset(pud, address);
>> + if (!pmd_present(*pmd))
>> + return 0;
>> +
>> + pte = pte_offset_kernel(pmd, address);
>> + if (!pte_present(*pte))
>> + return 0;
>> + if ((error_code & 0x02) && !pte_write(*pte))
>> + return 0;
>>
>
> if ((error_code & PF_WRITE) && !pte_write(*pte))
> return 0;
>

Oops, thanks. Overlooked that one.

> How about dropping the #if and rely on the !pte_exec() test always
> being false when _PAGE_NX = 0? The compiler should just trim this
> all away.
>
> from pgtable.h:
>
> static inline int pte_exec(pte_t pte) { return !(pte_val(pte) & _PAGE_NX); }
>

Thanks for reminding me; I thought of this the other day, but forgot to
do it. Will post an updated patch shortly.

J

2008-01-24 00:28:26

by Jeremy Fitzhardinge

Subject: [PATCH UPDATE] x86: ignore spurious faults

When changing a kernel page from RO->RW, it's OK to leave stale TLB
entries around, since doing a global flush is expensive and they pose
no security problem. They can, however, generate a spurious fault,
which we should catch and simply return from (which will have the
side-effect of reloading the TLB to the current PTE).

This can occur when running under Xen, because it frequently changes
kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
doing a global TLB flush after changing page permissions.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Harvey Harrison <[email protected]>
---
arch/x86/mm/fault_32.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/fault_64.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)

===================================================================
--- a/arch/x86/mm/fault_32.c
+++ b/arch/x86/mm/fault_32.c
@@ -324,6 +324,51 @@ static int is_f00f_bug(struct pt_regs *r
}

/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+                          unsigned long error_code)
+{
+        pgd_t *pgd;
+        pud_t *pud;
+        pmd_t *pmd;
+        pte_t *pte;
+
+        /* Reserved-bit violation or user access to kernel space? */
+        if (error_code & (PF_USER | PF_RSVD))
+                return 0;
+
+        pgd = init_mm.pgd + pgd_index(address);
+        if (!pgd_present(*pgd))
+                return 0;
+
+        pud = pud_offset(pgd, address);
+        if (!pud_present(*pud))
+                return 0;
+
+        pmd = pmd_offset(pud, address);
+        if (!pmd_present(*pmd))
+                return 0;
+
+        pte = pte_offset_kernel(pmd, address);
+        if (!pte_present(*pte))
+                return 0;
+
+        if ((error_code & PF_WRITE) && !pte_write(*pte))
+                return 0;
+        if ((error_code & PF_INSTR) && !pte_exec(*pte))
+                return 0;
+
+        return 1;
+}
+
+/*
* Handle a fault on the vmalloc or module mapping area
*
* This assumes no large pages in there.
@@ -446,6 +491,11 @@ void __kprobes do_page_fault(struct pt_r
        if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) &&
            vmalloc_fault(address) >= 0)
                return;
+
+        /* Can handle a stale RO->RW TLB */
+        if (spurious_fault(address, error_code))
+                return;
+
        /*
         * Don't take the mm semaphore here. If we fixup a prefetch
         * fault we could otherwise deadlock.
===================================================================
--- a/arch/x86/mm/fault_64.c
+++ b/arch/x86/mm/fault_64.c
@@ -312,6 +312,51 @@ static noinline void pgtable_bad(unsigne
}

/*
+ * Handle a spurious fault caused by a stale TLB entry. This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X). Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+                          unsigned long error_code)
+{
+        pgd_t *pgd;
+        pud_t *pud;
+        pmd_t *pmd;
+        pte_t *pte;
+
+        /* Reserved-bit violation or user access to kernel space? */
+        if (error_code & (PF_USER | PF_RSVD))
+                return 0;
+
+        pgd = init_mm.pgd + pgd_index(address);
+        if (!pgd_present(*pgd))
+                return 0;
+
+        pud = pud_offset(pgd, address);
+        if (!pud_present(*pud))
+                return 0;
+
+        pmd = pmd_offset(pud, address);
+        if (!pmd_present(*pmd))
+                return 0;
+
+        pte = pte_offset_kernel(pmd, address);
+        if (!pte_present(*pte))
+                return 0;
+
+        if ((error_code & PF_WRITE) && !pte_write(*pte))
+                return 0;
+        if ((error_code & PF_INSTR) && !pte_exec(*pte))
+                return 0;
+
+        return 1;
+}
+
+/*
* Handle a fault on the vmalloc area
*
* This assumes no large pages in there.
@@ -443,6 +488,11 @@ asmlinkage void __kprobes do_page_fault(
                if (vmalloc_fault(address) >= 0)
                        return;
        }
+
+        /* Can handle a stale RO->RW TLB */
+        if (spurious_fault(address, error_code))
+                return;
+
        /*
         * Don't take the mm semaphore here. If we fixup a prefetch
         * fault we could otherwise deadlock.

2008-01-24 06:49:27

by Andi Kleen

Subject: Re: [PATCH] x86: ignore spurious faults

Jeremy Fitzhardinge <[email protected]> writes:
>
> /*
> + * Handle a spurious fault caused by a stale TLB entry. This allows

vmalloc_fault already has nearly the same logic. You should look
at sharing the code.

-Andi

2008-01-24 07:02:21

by Jeremy Fitzhardinge

Subject: Re: [PATCH] x86: ignore spurious faults

Andi Kleen wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>> /*
>> + * Handle a spurious fault caused by a stale TLB entry. This allows
>>
>
> vmalloc_fault already has nearly the same logic. You should look
> at sharing the code.

Hm, I see what you mean, but it's hard to see how to share much code
there. It's a case of "two lines common, one line different" repeated 4
times or so.

J

2008-01-24 07:11:18

by Andi Kleen

Subject: Re: [PATCH] x86: ignore spurious faults

On Thursday 24 January 2008 08:02:11 Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > Jeremy Fitzhardinge <[email protected]> writes:
> >> /*
> >> + * Handle a spurious fault caused by a stale TLB entry. This allows
> >
> > vmalloc_fault already has nearly the same logic. You should look
> > at sharing the code.
>
> Hm, I see what you mean, but it's hard to see how to share much code
> there. It's a case of "two lines common, one line different" repeated 4
> times or so.

The core common logic is checking if the fault agrees with the page tables.
If not do different things.
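
[ Editorial illustration: a hypothetical refactoring along those lines
might hoist the common page-table walk out of vmalloc_fault() and
spurious_fault(); lookup_kernel_pte is an invented name, a sketch
rather than code from any posted patch.

/*
 * Hypothetical shared helper: walk init_mm's page tables and return
 * the pte for a kernel address, or NULL if some level is not present.
 * Both fault paths could start here and then diverge in what they
 * check against the pte.
 */
static pte_t *lookup_kernel_pte(unsigned long address)
{
        pgd_t *pgd = init_mm.pgd + pgd_index(address);
        pud_t *pud;
        pmd_t *pmd;

        if (!pgd_present(*pgd))
                return NULL;
        pud = pud_offset(pgd, address);
        if (!pud_present(*pud))
                return NULL;
        pmd = pmd_offset(pud, address);
        if (!pmd_present(*pmd))
                return NULL;
        return pte_offset_kernel(pmd, address);
} ]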

-Andi

2008-01-24 19:14:52

by Matt Mackall

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults


On Wed, 2008-01-23 at 16:28 -0800, Jeremy Fitzhardinge wrote:
> When changing a kernel page from RO->RW, it's OK to leave stale TLB
> entries around, since doing a global flush is expensive and they pose
> no security problem. They can, however, generate a spurious fault,
> which we should catch and simply return from (which will have the
> side-effect of reloading the TLB to the current PTE).
>
> This can occur when running under Xen, because it frequently changes
> kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
> It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
> doing a global TLB flush after changing page permissions.

There's perhaps an opportunity to do this lazy TLB trick in the mmap
path as well, where RW mappings are initially mapped as RO so we can
catch processes dirtying them and then switched to RW. If the mapping is
shared across threads on multiple cores, we can defer synchronizing the
TLBs on the others.

--
Mathematics is the supreme nostalgia of our time.

2008-01-24 19:21:18

by Jeremy Fitzhardinge

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

Matt Mackall wrote:
> There's perhaps an opportunity to do this lazy TLB trick in the mmap
> path as well, where RW mappings are initially mapped as RO so we can
> catch processes dirtying them and then switched to RW. If the mapping is
> shared across threads on multiple cores, we can defer synchronizing the
> TLBs on the others.
>

I think spurious usermode faults are already dealt with.
handle_pte_fault() does essentially the same thing as this patch:

        if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
                update_mmu_cache(vma, address, entry);
        } else {
                /*
                 * This is needed only for protection faults but the arch code
                 * is not yet telling us if this is a protection fault or not.
                 * This still avoids useless tlb flushes for .text page faults
                 * with threads.
                 */
                if (write_access)
                        flush_tlb_page(vma, address);
        }



J

2008-01-24 23:41:39

by Nick Piggin

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

On Friday 25 January 2008 06:21, Jeremy Fitzhardinge wrote:
> Matt Mackall wrote:
> > There's perhaps an opportunity to do this lazy TLB trick in the mmap
> > path as well, where RW mappings are initially mapped as RO so we can
> > catch processes dirtying them and then switched to RW. If the mapping is
> > shared across threads on multiple cores, we can defer synchronizing the
> > TLBs on the others.
>
> I think spurious usermode faults are already dealt with.
> handle_pte_fault() does essentially the same thing as this patch:
>
> if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
> update_mmu_cache(vma, address, entry);
> } else {
> /*
> * This is needed only for protection faults but the arch code
> * is not yet telling us if this is a protection fault or not.
> * This still avoids useless tlb flushes for .text page faults
> * with threads.
> */
> if (write_access)
> flush_tlb_page(vma, address);
> }

I (obviously) don't know exactly how the TLB works in x86, but I
thought that on a miss, the CPU walks the pagetables first before
faulting? Maybe that's not the case if there is an RO entry
actually in the TLB?

2008-01-25 00:26:40

by Jeremy Fitzhardinge

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

Nick Piggin wrote:
> On Friday 25 January 2008 06:21, Jeremy Fitzhardinge wrote:
>
>> Matt Mackall wrote:
>>
>>> There's perhaps an opportunity to do this lazy TLB trick in the mmap
>>> path as well, where RW mappings are initially mapped as RO so we can
>>> catch processes dirtying them and then switched to RW. If the mapping is
>>> shared across threads on multiple cores, we can defer synchronizing the
>>> TLBs on the others.
>>>
>> I think spurious usermode faults are already dealt with.
>> handle_pte_fault() does essentially the same thing as this patch:
>>
>> if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
>> update_mmu_cache(vma, address, entry);
>> } else {
>> /*
>> * This is needed only for protection faults but the arch code
>> * is not yet telling us if this is a protection fault or not.
>> * This still avoids useless tlb flushes for .text page faults
>> * with threads.
>> */
>> if (write_access)
>> flush_tlb_page(vma, address);
>> }
>>
>
> I (obviously) don't know exactly how the TLB works in x86, but I
> thought that on a miss, the CPU walks the pagetables first before
> faulting? Maybe that's not the case if there is an RO entry
> actually in the TLB?
>

My understanding is that it will fault immediately if there's a TLB
entry, and rewalk the tables on return from the fault before restarting
the instruction, so there's no need for an explicit TLB flush. The TLB
doesn't have a notion of negative cache entries, so any entry represents
a present page of some variety.

J

2008-01-25 08:45:00

by Jan Beulich

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

Actually, another thought: permitting (and handling) spurious faults for
kernel mappings conflicts with NMI handling, i.e. great care would be
needed to ensure the NMI path cannot touch any such mapping. So
even the present Xen/Linux Dom0 implementation may have some
(perhaps unlikely) problems here, and it would get worse if we added
e.g. a virtual watchdog NMI (something I am considering, which would
then extend the problem to DomU-s).

Jan

2008-01-25 08:48:37

by Nick Piggin

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

On Friday 25 January 2008 19:15, Jan Beulich wrote:
> Actually, another thought: permitting (and handling) spurious faults for
> kernel mappings conflicts with NMI handling, i.e. great care would be
> needed to ensure the NMI path cannot touch any such mapping. So
> even the present Xen/Linux Dom0 implementation may have some
> (perhaps unlikely) problems here, and it would get worse if we added
> e.g. a virtual watchdog NMI (something I am considering, which would
> then extend the problem to DomU-s).

Can you explain how they conflict?

Thanks,
Nick

2008-01-25 08:56:49

by Keir Fraser

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

On 25/1/08 00:26, "Jeremy Fitzhardinge" <[email protected]> wrote:

>> I (obviously) don't know exactly how the TLB works in x86, but I
>> thought that on a miss, the CPU walks the pagetables first before
>> faulting? Maybe that's not the case if there is an RO entry
>> actually in the TLB?
>>
>
> My understanding is that it will fault immediately if there's a TLB
> entry, and rewalk the tables on return from the fault before restarting
> the instruction, so there's no need for an explicit TLB flush. The TLB
> doesn't have a notion of negative cache entries, so any entry represents
> a present page of some variety.

Yes, write access with a r/o TLB entry causes the TLB entry to be flushed
and an immediate #PF with no page walk. This is a hardware optimisation for
copy-on-write demand faults. Both Intel and AMD implement it.

-- Keir

2008-01-25 09:12:06

by Andi Kleen

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

On Friday 25 January 2008 09:38:38 Nick Piggin wrote:
> On Friday 25 January 2008 19:15, Jan Beulich wrote:
> > Actually, another thought: permitting (and handling) spurious faults for
> > kernel mappings conflicts with NMI handling, i.e. great care would be
> > needed to ensure the NMI path cannot touch any such mapping. So
> > even the present Xen/Linux Dom0 implementation may have some
> > (perhaps unlikely) problems here, and it would get worse if we added
> > e.g. a virtual watchdog NMI (something I am considering, which would
> > then extend the problem to DomU-s).
>
> Can you explain how they conflict?

NMI is blocked by the hardware until IRET and when a page fault happens inside
the NMI handler the early IRET unblocks it and then NMIs can nest, which
will lead to stack corruption.

-Andi

2008-01-25 09:17:56

by Jan Beulich

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

>>> Nick Piggin <[email protected]> 25.01.08 09:38 >>>
>On Friday 25 January 2008 19:15, Jan Beulich wrote:
>> Actually, another thought: permitting (and handling) spurious faults for
>> kernel mappings conflicts with NMI handling, i.e. great care would be
>> needed to ensure the NMI path cannot touch any such mapping. So
>> even the present Xen/Linux Dom0 implementation may have some
>> (perhaps unlikely) problems here, and it would get worse if we added
>> e.g. a virtual watchdog NMI (something I am considering, which would
>> then extend the problem to DomU-s).
>
>Can you explain how they conflict?

In the same way as vmalloc faults do (which is why vmalloc_sync_all()
got introduced): a page fault nested inside an NMI will, by virtue of
executing IRET, prematurely tell the processor that NMI handling is
done (and specifically unmask further NMIs).

Jan

2008-01-25 09:19:53

by Keir Fraser

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

On 25/1/08 09:11, "Andi Kleen" <[email protected]> wrote:

>>> Actually, another thought: permitting (and handling) spurious faults for
>>> kernel mappings conflicts with NMI handling, i.e. great care would be
>>> needed to ensure the NMI path cannot touch any such mapping. So
>>> even the present Xen/Linux Dom0 implementation may have some
>>> (perhaps unlikely) problems here, and it would get worse if we added
>>> e.g. a virtual watchdog NMI (something I am considering, which would
>>> then extend the problem to DomU-s).
>>
>> Can you explain how they conflict?
>
> NMI is blocked by the hardware until IRET and when a page fault happens inside
> the NMI handler the early IRET unblocks it and then NMIs can nest, which
> will lead to stack corruption.

Whether this is a problem in light of Xen spurious faults depends on whether
NMI handlers touch dynamically-allocated data. And if they do, it still
depends on the exact scenario. If it is likely to be a problem, a Xen pv_op
can flush the TLB on NMI entry, or we could have Xen do that implicitly
before invoking the guest NMI handler.

Currently Xen guests do not use NMI for watchdog or oprofile, so that rather
limits the scope for problems.

-- Keir

2008-01-25 09:51:26

by Andi Kleen

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults


> Whether this is a problem in light of Xen spurious faults depends on whether
> NMI handlers touch dynamically-allocated data. And if they do, it still
> depends on the exact scenario. If it is likely to be a problem, a Xen pv_op
> can flush the TLB on NMI entry, or we could have Xen do that implicitly
> before invoking the guest NMI handler.

For PV kernels it would probably be better to just implement a truly
nesting NMI trigger method instead of the one bit state of the real hardware.

Actually I'm not sure NMIs are really that useful for guests. Most things
done traditionally by NMIs are probably better done outside the guest
in a virtualized environment.

-Andi

2008-01-25 10:20:11

by Andi Kleen

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults


> Whether this is a problem in light of Xen spurious faults depends on whether
> NMI handlers touch dynamically-allocated data.

How do you define dynamically-allocated data?

-Andi

2008-01-25 13:18:22

by Keir Fraser

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

On 25/1/08 10:19, "Andi Kleen" <[email protected]> wrote:

>> Whether this is a problem in light of Xen spurious faults depends on whether
>> NMI handlers touch dynamically-allocated data.
>
> How do you define dynamically-allocated data?

Anything that could have been a read-only pte or ldt page in a previous life
with no intervening TLB flush. So get_free_page(), kmalloc(), vmalloc(), ...

Actually I think we are fine, now I think about it some more, because we
only clear the software NMI-in-flight flag if the guest executes IRET via
the hypervisor. Most Xen Linux guests only do IRET via the hypervisor when
the current context is an NMI handler (additionally x86_64 also does so when
returning to ring 3). Most importantly for this case, we will *not* IRET via
the hypervisor when returning from a #PF context nested in an NMI context.
Hence the NMI-in-flight flag will not be cleared, and guest virtual NMIs
will not nest. So that's a relief!

-- Keir

2008-01-25 15:30:54

by Ingo Molnar

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults


* Jeremy Fitzhardinge <[email protected]> wrote:

> When changing a kernel page from RO->RW, it's OK to leave stale TLB
> entries around, since doing a global flush is expensive and they pose
> no security problem. They can, however, generate a spurious fault,
> which we should catch and simply return from (which will have the
> side-effect of reloading the TLB to the current PTE).
>
> This can occur when running under Xen, because it frequently changes
> kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
> It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
> doing a global TLB flush after changing page permissions.

thanks, applied.

it would be nice to expose this ability of the architecture to the core
Linux kernel mprotect code as well, and let it skip the TLB flush when
doing a RO->RW transition. It could speed up valgrind and the other
mprotect() users, I guess? [and UML too perhaps]

Ingo

2008-01-25 15:54:36

by Jeremy Fitzhardinge

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

Ingo Molnar wrote:
> thanks, applied.
>
> it would be nice to expose this ability of the architecture to the core
> Linux kernel mprotect code as well, and let it skip the TLB flush when
> doing a RO->RW transition.

The usermode fault handler already effectively does this; this patch
just does it for kernel mode as well. I don't know if mprotect takes
advantage of this.

> It could speed up valgrind and the other
> mprotect() users, I guess? [and UML too perhaps]
>

Not valgrind (it doesn't rely on mmap protections), but electric fence
perhaps.

J

2008-01-25 18:09:10

by Ingo Molnar

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults


* Jeremy Fitzhardinge <[email protected]> wrote:

> Ingo Molnar wrote:
>> thanks, applied.
>>
>> it would be nice to expose this ability of the architecture to the core
>> Linux kernel mprotect code as well, and let it skip the TLB flush when
>> doing a RO->RW transition.
>
> The usermode fault handler already effectively does this; this patch
> just does it for kernel mode as well. I don't know if mprotect takes
> advantage of this.

spurious faults happen all the time on SMP, in the native kernel.

And what i mean is that Linux mprotect currently does not take advantage
of x86's ability to just change the ptes, because there's no structured
way to tell mm/mprotect.c that "it's safe to skip the TLB flush here".

The flush happens in mm/mprotect.c's change_protection() function:

flush_tlb_range(vma, start, end);

and that is unnecessary when we increase the protection rights, such as
in a RO->RW change. (all that is needed is an smp_wmb() instead, to make
sure all the pte modifications are visible when the syscall returns.)
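
[ Editorial illustration: a minimal sketch of that idea, not actual
mm/mprotect.c code; rights_widened is an invented flag the caller would
compute by comparing the old and new protections.

static void flush_after_change_protection(struct vm_area_struct *vma,
                                          unsigned long start,
                                          unsigned long end,
                                          bool rights_widened)
{
        if (rights_widened) {
                /*
                 * Permissions only grew (e.g. RO->RW): a stale TLB
                 * entry can at worst cause a spurious, self-correcting
                 * fault, so making the pte stores visible before the
                 * syscall returns is enough.
                 */
                smp_wmb();
        } else {
                flush_tlb_range(vma, start, end);
        }
} ]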

and it's a really rare case these days that you can find an area where
Linux does not make use of a hardware MMU feature - so we should fix
this ;-)

Ingo

2008-01-25 18:39:41

by Jeremy Fitzhardinge

Subject: Re: [PATCH UPDATE] x86: ignore spurious faults

Ingo Molnar wrote:
> spurious faults happen all the time on SMP, in the native kernel.
>
> And what i mean is that Linux mprotect currently does not take advantage
> of x86's ability to just change the ptes, because there's no structured
> way to tell mm/mprotect.c that "it's safe to skip the TLB flush here".
>
> The flush happens in mm/mprotect.c's change_protection() function:
>
> flush_tlb_range(vma, start, end);
>
> and that is unnecessary when we increase the protection rights, such as
> in a RO->RW change. (all that is needed is an smp_wmb() instead, to make
> sure all the pte modifications are visible when the syscall returns.)
>
> and it's a really rare case these days that you can find an area where
> Linux does not make use of a hardware MMU feature - so we should fix
> this ;-)

Well, I guess this isn't really specific to x86; we could always
legitimately not do a tlb flush after increasing permissions and leave
the fault handler to clean up the mess where needed. But I don't think
that's necessarily much of a win; it's cheaper to just do the tlb flush
rather than take a spurious fault, unless the faults are very rare. If
someone is doing an mprotect on a piece of memory (esp. to make it
writable), my guess is that they're going to touch that memory in the
very near future.

The big win for this patch is avoiding cross-cpu tlb invalidation when
changing kernel mappings. mprotect doesn't attempt to do that anyway,
and so can incur spurious faults on other CPUs.

J