2018-08-21 15:40:44

by Juergen Gross

Subject: [PATCH v2 0/2] x86/xen: avoid 32-bit writes to PTEs in PV PAE guests

While the hypervisor emulates plain writes to PTEs happily, this is
much slower than issuing a hypercall for PTE modifications. And writing
a PTE via two 32-bit write instructions (especially when clearing the
PTE) will result in an intermediate L1TF vulnerable PTE.

Writes to PAE PTEs should always be done with 64-bit writes or via
hypercalls.
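
As an illustration of the problem, here is a minimal sketch (not kernel
code; the struct and helper names are made up for the example): clearing a
PAE PTE with two 32-bit stores leaves a window in which the entry is
already non-present while its high word still holds physical-address bits,
which is exactly the transient state L1TF can exploit and the state that
makes the hypervisor fall back to emulation/shadowing.

/* Illustrative only -- mirrors the problematic two-write pattern. */
struct pae_pte_example {
	unsigned int pte_low;	/* present bit, flags, low PFN bits */
	unsigned int pte_high;	/* upper PFN bits and NX */
};

static void bad_pae_pte_clear(volatile struct pae_pte_example *ptep)
{
	ptep->pte_low = 0;	/* window opens: P=0, high PFN bits still set */
	/* a speculative access through this PTE here is L1TF attackable */
	ptep->pte_high = 0;	/* window closes */
}

static void good_pae_pte_clear(unsigned long long *ptep)
{
	/* a single 64-bit atomic store never exposes a half-cleared entry */
	__atomic_store_n(ptep, 0ULL, __ATOMIC_RELAXED);
}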

Juergen Gross (2):
x86/xen: don't write ptes directly in 32-bit PV guests
x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear

arch/x86/include/asm/pgtable-3level.h | 7 +++----
arch/x86/xen/mmu_pv.c | 7 +++----
2 files changed, 6 insertions(+), 8 deletions(-)

--
2.13.7



2018-08-21 15:40:01

by Juergen Gross

Subject: [PATCH v2 2/2] x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear

Using only 32-bit writes for the pte will result in an intermediate
L1TF vulnerable PTE. When running as a Xen PV guest this will at once
switch the guest to shadow mode resulting in a loss of performance.

Use arch_atomic64_xchg() instead which will perform the requested
operation atomically with all 64 bits.

Some performance considerations according to:

https://software.intel.com/sites/default/files/managed/ad/dc/Intel-Xeon-Scalable-Processor-throughput-latency.pdf

The main number should be the latency, as there is no tight loop around
native_ptep_get_and_clear().

"lock cmpxchg8b" has a latency of 20 cycles, while "lock xchg" (with a
memory operand) isn't mentioned in that document. "lock xadd" (with xadd
having 3 cycles less latency than xchg) has a latency of 11, so we can
assume a latency of 14 for "lock xchg".
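
For reference, a hand-waved sketch of why the "lock cmpxchg8b" latency is
the number that matters for the new code (this is an assumption about how a
64-bit atomic exchange has to be built on 32-bit x86, not the kernel's
actual arch_atomic64_xchg implementation):

static unsigned long long xchg64_sketch(unsigned long long *p,
					unsigned long long new)
{
	unsigned long long old = *p;	/* plain initial read */

	for (;;) {
		/* compiles to "lock cmpxchg8b": store new iff *p == old */
		unsigned long long seen =
			__sync_val_compare_and_swap(p, old, new);
		if (seen == old)
			return old;	/* exchange done, return old value */
		old = seen;		/* raced with another writer, retry */
	}
}

The old code paid for one "lock xchg" on the low word only, so moving to a
full 64-bit exchange costs roughly the 6 extra cycles mentioned below.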

Signed-off-by: Juergen Gross <[email protected]>
---
In case adding about 6 cycles for native_ptep_get_and_clear() is believed
to be too bad I can modify the patch to add a paravirt function for that
purpose in order to add the overhead for Xen guests only (in fact the
overhead for Xen guests will be less, as only one instruction writing to
the PTE has to be emulated by the hypervisor).
---
arch/x86/include/asm/pgtable-3level.h | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index a564084c6141..f8b1ad2c3828 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -2,6 +2,8 @@
#ifndef _ASM_X86_PGTABLE_3LEVEL_H
#define _ASM_X86_PGTABLE_3LEVEL_H

+#include <asm/atomic64_32.h>
+
/*
* Intel Physical Address Extension (PAE) Mode - three-level page
* tables on PPro+ CPUs.
@@ -150,10 +152,7 @@ static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
{
pte_t res;

- /* xchg acts as a barrier before the setting of the high bits */
- res.pte_low = xchg(&ptep->pte_low, 0);
- res.pte_high = ptep->pte_high;
- ptep->pte_high = 0;
+ res.pte = (pteval_t)arch_atomic64_xchg((atomic64_t *)ptep, 0);

return res;
}
--
2.13.7


2018-08-21 15:40:01

by Juergen Gross

Subject: [PATCH v2 1/2] x86/xen: don't write ptes directly in 32-bit PV guests

In some cases 32-bit PAE PV guests still write PTEs directly instead of
using hypercalls. This is especially bad when clearing a PTE, as the clear
is done via two 32-bit writes which produce intermediate L1TF attackable
PTEs.

Change the code to use hypercalls instead.
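
For readers not familiar with the PV interface, a rough sketch of what a
single-PTE update via hypercall looks like (names, includes and error
handling here are illustrative assumptions, not the actual __xen_set_pte
code in mmu_pv.c):

#include <linux/bug.h>
#include <xen/interface/xen.h>
#include <asm/xen/hypercall.h>
#include <asm/xen/page.h>

static void example_set_pte_via_hypercall(pte_t *ptep, pte_t pte)
{
	struct mmu_update u;

	/* machine address of the PTE slot, tagged as a normal PT update */
	u.ptr = virt_to_machine(ptep).maddr | MMU_NORMAL_PT_UPDATE;
	/* full 64-bit PTE value; pte_val_ma() is from the Xen MMU code */
	u.val = pte_val_ma(pte);

	if (HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF) < 0)
		BUG();
}

The hypervisor validates and installs the whole entry at once, so no
intermediate not-present-but-PFN-bearing state is ever visible.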

Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Jan Beulich <[email protected]>
---
arch/x86/xen/mmu_pv.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 9e7012858420..9396b4d17064 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -434,14 +434,13 @@ static void xen_set_pud(pud_t *ptr, pud_t val)
static void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
{
trace_xen_mmu_set_pte_atomic(ptep, pte);
- set_64bit((u64 *)ptep, native_pte_val(pte));
+ __xen_set_pte(ptep, pte);
}

static void xen_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
trace_xen_mmu_pte_clear(mm, addr, ptep);
- if (!xen_batched_set_pte(ptep, native_make_pte(0)))
- native_pte_clear(mm, addr, ptep);
+ __xen_set_pte(ptep, native_make_pte(0));
}

static void xen_pmd_clear(pmd_t *pmdp)
@@ -1569,7 +1568,7 @@ static void __init xen_set_pte_init(pte_t *ptep, pte_t pte)
pte = __pte_ma(((pte_val_ma(*ptep) & _PAGE_RW) | ~_PAGE_RW) &
pte_val_ma(pte));
#endif
- native_set_pte(ptep, pte);
+ __xen_set_pte(ptep, pte);
}

/* Early in boot, while setting up the initial pagetable, assume
--
2.13.7


2018-08-21 16:45:39

by Jan Beulich

Subject: Re: [Xen-devel] [PATCH v2 2/2] x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear

>>> On 21.08.18 at 17:37, <[email protected]> wrote:
> Using only 32-bit writes for the pte will result in an intermediate
> L1TF vulnerable PTE. When running as a Xen PV guest this will at once
> switch the guest to shadow mode resulting in a loss of performance.
>
> Use arch_atomic64_xchg() instead which will perform the requested
> operation atomically with all 64 bits.
>
> Some performance considerations according to:
>
> https://software.intel.com/sites/default/files/managed/ad/dc/Intel-Xeon-Scalable-Processor-throughput-latency.pdf
>
> The main number should be the latency, as there is no tight loop around
> native_ptep_get_and_clear().
>
> "lock cmpxchg8b" has a latency of 20 cycles, while "lock xchg" (with a
> memory operand) isn't mentioned in that document. "lock xadd" (with xadd
> having 3 cycles less latency than xchg) has a latency of 11, so we can
> assume a latency of 14 for "lock xchg".
>
> Signed-off-by: Juergen Gross <[email protected]>

Reviewed-by: Jan Beulich <[email protected]>
with one further remark:

> @@ -150,10 +152,7 @@ static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
> {
> pte_t res;
>
> - /* xchg acts as a barrier before the setting of the high bits */
> - res.pte_low = xchg(&ptep->pte_low, 0);
> - res.pte_high = ptep->pte_high;
> - ptep->pte_high = 0;
> + res.pte = (pteval_t)arch_atomic64_xchg((atomic64_t *)ptep, 0);

Is the cast on the return value really needed here?

Jan



2018-08-26 10:29:11

by Thomas Gleixner

Subject: Re: [PATCH v2 2/2] x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear

On Tue, 21 Aug 2018, Juergen Gross wrote:

> Using only 32-bit writes for the pte will result in an intermediate
> L1TF vulnerable PTE. When running as a Xen PV guest this will at once
> switch the guest to shadow mode resulting in a loss of performance.
>
> Use arch_atomic64_xchg() instead which will perform the requested
> operation atomically with all 64 bits.
>
> Some performance considerations according to:
>
> https://software.intel.com/sites/default/files/managed/ad/dc/Intel-Xeon-Scalable-Processor-throughput-latency.pdf
>
> The main number should be the latency, as there is no tight loop around
> native_ptep_get_and_clear().
>
> "lock cmpxchg8b" has a latency of 20 cycles, while "lock xchg" (with a
> memory operand) isn't mentioned in that document. "lock xadd" (with xadd
> having 3 cycles less latency than xchg) has a latency of 11, so we can
> assume a latency of 14 for "lock xchg".
>
> Signed-off-by: Juergen Gross <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

2018-08-27 16:05:25

by Jason Andryuk

Subject: Re: [Xen-devel] [PATCH v2 0/2] x86/xen: avoid 32-bit writes to PTEs in PV PAE guests

On Tue, Aug 21, 2018 at 11:40 AM Juergen Gross <[email protected]> wrote:
>
> While the hypervisor emulates plain writes to PTEs happily, this is
> much slower than issuing a hypercall for PTE modifications. And writing
> a PTE via two 32-bit write instructions (especially when clearing the
> PTE) will result in an intermediate L1TF vulnerable PTE.
>
> Writes to PAE PTEs should always be done with 64-bit writes or via
> hypercalls.
>
> Juergen Gross (2):
> x86/xen: don't write ptes directly in 32-bit PV guests
> x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear
>

I tested both patches on 4.14, changing patch 2 to atomic64_xchg since
arch_atomic64_xchg doesn't exist in 4.14.

I haven't seen https://bugzilla.kernel.org/show_bug.cgi?id=198497
trigger since incorporating these patches. Without the patches, I would
have seen it trigger by now. Also, I've confirmed Xen does not enable
page table shadowing. For what it's worth, the PTEs that would
trigger Xen shadowing (0x8000'0002'0000'0000) are the same as those
that triggered bug 198497. There was at least 1 non-Xen user affected
by 198497, but this at least seems to fix it for me.

Tested-by: Jason Andryuk <[email protected]>

Thank you.
Jason

2018-08-27 16:08:51

by Jason Andryuk

Subject: Re: [Xen-devel] [PATCH v2 0/2] x86/xen: avoid 32-bit writes to PTEs in PV PAE guests

On Mon, Aug 27, 2018 at 12:03 PM Jason Andryuk <[email protected]> wrote:
>
> On Tue, Aug 21, 2018 at 11:40 AM Juergen Gross <[email protected]> wrote:
> >
> > While the hypervisor emulates plain writes to PTEs happily, this is
> > much slower than issuing a hypercall for PTE modifications. And writing
> > a PTE via two 32-bit write instructions (especially when clearing the
> > PTE) will result in an intermediate L1TF vulnerable PTE.
> >
> > Writes to PAE PTEs should always be done with 64-bit writes or via
> > hypercalls.
> >
> > Juergen Gross (2):
> > x86/xen: don't write ptes directly in 32-bit PV guests
> > x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear
> >
>
> I tested both patches on 4.14, changing patch 2 to atomic64_xchg since
> arch_atomic64_xchg doesn't exist in 4.14.
>
> I haven't seen https://bugzilla.kernel.org/show_bug.cgi?id=198497
> trigger since incorporating these patches. Without the patches, I would
> have seen it trigger by now. Also, I've confirmed Xen does not enable
> page table shadowing. For what it's worth, the PTEs that would
> trigger Xen shadowing (0x8000'0002'0000'0000) are the same as those
> that triggered bug 198497. There was at least 1 non-Xen user affected
> by 198497, but this at least seems to fix it for me.
>
> Tested-by: Jason Andryuk <[email protected]>

Also, can these patches be Cc: [email protected]?

Thanks,
Jason

2018-08-27 19:25:25

by Boris Ostrovsky

Subject: Re: [PATCH v2 0/2] x86/xen: avoid 32-bit writes to PTEs in PV PAE guests

On 08/21/2018 11:37 AM, Juergen Gross wrote:
> While the hypervisor emulates plain writes to PTEs happily, this is
> much slower than issuing a hypercall for PTE modifications. And writing
> a PTE via two 32-bit write instructions (especially when clearing the
> PTE) will result in an intermediate L1TF vulnerable PTE.
>
> Writes to PAE PTEs should always be done with 64-bit writes or via
> hypercalls.
>
> Juergen Gross (2):
> x86/xen: don't write ptes directly in 32-bit PV guests
> x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear
>
> arch/x86/include/asm/pgtable-3level.h | 7 +++----
> arch/x86/xen/mmu_pv.c | 7 +++----
> 2 files changed, 6 insertions(+), 8 deletions(-)
>


Applied to for-linus-19b.

(+stable.)

-boris