2023-08-02 15:42:58

by Matthew Wilcox

Subject: [PATCH v6 06/38] mm: Add default definition of set_ptes()

Most architectures can just define set_pte() and PFN_PTE_SHIFT to
use this definition. It's also a handy spot to document the guarantees
provided by the MM.

Suggested-by: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
---
include/linux/pgtable.h | 81 ++++++++++++++++++++++++++++++-----------
1 file changed, 60 insertions(+), 21 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f34e0f2cb4d8..3fde0d5d1c29 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -182,6 +182,66 @@ static inline int pmd_young(pmd_t pmd)
}
#endif

+/*
+ * A facility to provide lazy MMU batching. This allows PTE updates and
+ * page invalidations to be delayed until a call to leave lazy MMU mode
+ * is issued. Some architectures may benefit from doing this, and it is
+ * beneficial for both shadow and direct mode hypervisors, which may batch
+ * the PTE updates which happen during this window. Note that using this
+ * interface requires that read hazards be removed from the code. A read
+ * hazard could result in the direct mode hypervisor case, since the actual
+ * write to the page tables may not yet have taken place, so reads though
+ * a raw PTE pointer after it has been modified are not guaranteed to be
+ * up to date. This mode can only be entered and left under the protection of
+ * the page table locks for all page tables which may be modified. In the UP
+ * case, this is required so that preemption is disabled, and in the SMP case,
+ * it must synchronize the delayed page table writes properly on other CPUs.
+ */
+#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+#define arch_enter_lazy_mmu_mode() do {} while (0)
+#define arch_leave_lazy_mmu_mode() do {} while (0)
+#define arch_flush_lazy_mmu_mode() do {} while (0)
+#endif
+
+#ifndef set_ptes
+#ifdef PFN_PTE_SHIFT
+/**
+ * set_ptes - Map consecutive pages to a contiguous range of addresses.
+ * @mm: Address space to map the pages into.
+ * @addr: Address to map the first page at.
+ * @ptep: Page table pointer for the first entry.
+ * @pte: Page table entry for the first page.
+ * @nr: Number of pages to map.
+ *
+ * May be overridden by the architecture, or the architecture can define
+ * set_pte() and PFN_PTE_SHIFT.
+ *
+ * Context: The caller holds the page table lock. The pages all belong
+ * to the same folio. The PTEs are all in the same PMD.
+ */
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+		pte_t *ptep, pte_t pte, unsigned int nr)
+{
+	page_table_check_ptes_set(mm, ptep, pte, nr);
+
+	arch_enter_lazy_mmu_mode();
+	for (;;) {
+		set_pte(ptep, pte);
+		if (--nr == 0)
+			break;
+		ptep++;
+		pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+	}
+	arch_leave_lazy_mmu_mode();
+}
+#ifndef set_pte_at
+#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
+#endif
+#endif
+#else
+#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
extern int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
@@ -1051,27 +1111,6 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
#define pgprot_decrypted(prot) (prot)
#endif

-/*
- * A facility to provide lazy MMU batching. This allows PTE updates and
- * page invalidations to be delayed until a call to leave lazy MMU mode
- * is issued. Some architectures may benefit from doing this, and it is
- * beneficial for both shadow and direct mode hypervisors, which may batch
- * the PTE updates which happen during this window. Note that using this
- * interface requires that read hazards be removed from the code. A read
- * hazard could result in the direct mode hypervisor case, since the actual
- * write to the page tables may not yet have taken place, so reads though
- * a raw PTE pointer after it has been modified are not guaranteed to be
- * up to date. This mode can only be entered and left under the protection of
- * the page table locks for all page tables which may be modified. In the UP
- * case, this is required so that preemption is disabled, and in the SMP case,
- * it must synchronize the delayed page table writes properly on other CPUs.
- */
-#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-#define arch_enter_lazy_mmu_mode() do {} while (0)
-#define arch_leave_lazy_mmu_mode() do {} while (0)
-#define arch_flush_lazy_mmu_mode() do {} while (0)
-#endif
-
/*
* A facility to provide batching of the reload of page tables and
* other process state with the actual context switch code for
--
2.40.1



2023-10-12 13:53:32

by David Woodhouse

Subject: Re: [PATCH v6 06/38] mm: Add default definition of set_ptes()

On Wed, 2023-08-02 at 16:13 +0100, Matthew Wilcox (Oracle) wrote:
> Most architectures can just define set_pte() and PFN_PTE_SHIFT to
> use this definition.  It's also a handy spot to document the guarantees
> provided by the MM.
>
> Suggested-by: Mike Rapoport (IBM) <[email protected]>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Mike Rapoport (IBM) <[email protected]>
> ---
>  include/linux/pgtable.h | 81 ++++++++++++++++++++++++++++++-----------
>  1 file changed, 60 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f34e0f2cb4d8..3fde0d5d1c29 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -182,6 +182,66 @@ static inline int pmd_young(pmd_t pmd)
>  }
>  #endif
>  
> +/*
> + * A facility to provide lazy MMU batching.  This allows PTE updates and
> + * page invalidations to be delayed until a call to leave lazy MMU mode
> + * is issued.  Some architectures may benefit from doing this, and it is
> + * beneficial for both shadow and direct mode hypervisors, which may batch
> + * the PTE updates which happen during this window.  Note that using this
> + * interface requires that read hazards be removed from the code.  A read
> + * hazard could result in the direct mode hypervisor case, since the actual
> + * write to the page tables may not yet have taken place, so reads though
> + * a raw PTE pointer after it has been modified are not guaranteed to be
> + * up to date.  This mode can only be entered and left under the protection of
> + * the page table locks for all page tables which may be modified.  In the UP
> + * case, this is required so that preemption is disabled, and in the SMP case,
> + * it must synchronize the delayed page table writes properly on other CPUs.
> + */
> +#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> +#define arch_enter_lazy_mmu_mode()     do {} while (0)
> +#define arch_leave_lazy_mmu_mode()     do {} while (0)
> +#define arch_flush_lazy_mmu_mode()     do {} while (0)
> +#endif
> +
> +#ifndef set_ptes
> +#ifdef PFN_PTE_SHIFT
> +/**
> + * set_ptes - Map consecutive pages to a contiguous range of addresses.
> + * @mm: Address space to map the pages into.
> + * @addr: Address to map the first page at.
> + * @ptep: Page table pointer for the first entry.
> + * @pte: Page table entry for the first page.
> + * @nr: Number of pages to map.
> + *
> + * May be overridden by the architecture, or the architecture can define
> + * set_pte() and PFN_PTE_SHIFT.
> + *
> + * Context: The caller holds the page table lock.  The pages all belong
> + * to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> +               pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +       page_table_check_ptes_set(mm, ptep, pte, nr);
> +
> +       arch_enter_lazy_mmu_mode();
> +       for (;;) {
> +               set_pte(ptep, pte);
> +               if (--nr == 0)
> +                       break;
> +               ptep++;
> +               pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
> +       }
> +       arch_leave_lazy_mmu_mode();
> +}


This breaks the Xen PV guest.

In move_ptes() in mm/mremap.c we call arch_enter_lazy_mmu_mode() and then
loop calling set_pte_at(), which now (or at least in a few commits' time,
when you wire it up for x86 in commit a3e1c9372c9b959) ends up in your
implementation of set_ptes(), calls arch_enter_lazy_mmu_mode() again,
and:

[ 0.628700] ------------[ cut here ]------------
[ 0.628718] kernel BUG at arch/x86/kernel/paravirt.c:144!
[ 0.628743] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 0.628769] CPU: 0 PID: 1 Comm: init Not tainted 6.5.0-rc4+ #1295
[ 0.628818] RIP: e030:paravirt_enter_lazy_mmu+0x24/0x30
[ 0.628839] Code: 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 65 8b 05 90 28 f9 7e 85 c0 75 10 65 c7 05 81 28 f9 7e 01 00 00 00 c3 cc cc cc cc <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90
[ 0.628875] RSP: e02b:ffffc9004000ba48 EFLAGS: 00010202
[ 0.628891] RAX: 0000000000000001 RBX: ffff8880051b7100 RCX: 000ffffffffff000
[ 0.628908] RDX: 80000000763ff967 RSI: 80000000763ff967 RDI: ffff8880051b7100
[ 0.628925] RBP: 80000000763ff967 R08: ffff8880051b6868 R09: 00007ffce1a20000
[ 0.628943] R10: deadbeefdeadf00d R11: 0000000000000000 R12: 00007ffffffff000
[ 0.628964] R13: ffff8880050b7000 R14: 0000000000000001 R15: 00007fffffffe000
[ 0.628988] FS: 0000000000000000(0000) GS:ffff88807b800000(0000) knlGS:0000000000000000
[ 0.629007] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.629024] CR2: ffffc900003f5000 CR3: 0000000003904000 CR4: 0000000000050660
[ 0.629046] Call Trace:
[ 0.629055] <TASK>
[ 0.629066] ? die+0x36/0x90
[ 0.629081] ? do_trap+0xda/0x100
[ 0.629093] ? paravirt_enter_lazy_mmu+0x24/0x30
[ 0.629112] ? do_error_trap+0x6a/0x90
[ 0.629123] ? paravirt_enter_lazy_mmu+0x24/0x30
[ 0.629138] ? exc_invalid_op+0x50/0x70
[ 0.629155] ? paravirt_enter_lazy_mmu+0x24/0x30
[ 0.629169] ? asm_exc_invalid_op+0x1a/0x20
[ 0.629185] ? paravirt_enter_lazy_mmu+0x24/0x30
[ 0.629212] ? pte_offset_map_nolock+0x48/0xc0
[ 0.629226] set_ptes.constprop.0+0xd/0x30
[ 0.629240] move_ptes.isra.0+0xdd/0x290
[ 0.629253] ? pmd_install+0xab/0xd0
[ 0.629267] move_page_tables+0x3a0/0x850
[ 0.629294] shift_arg_pages+0xf4/0x1d0
[ 0.629317] setup_arg_pages+0x205/0x380
[ 0.629330] load_elf_binary+0x398/0xe00


I'm working on making PV kernels testable in qemu. With...

• some qemu fixes and a nasty hackish Xen console implementation:
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/xenfv-console
• a CONFIG_PV_SHIM_EXCLUSIVE build of Xen itself to run in the guest,
• some suitable disk image lying around, in ${GUEST_IMAGE}, and
• CONFIG_KVM_XEN enabled in your host kernel,

...you should be able to do something like:

$ ./qemu-system-x86_64 --accel kvm,xen-version=0x40011,kernel-irqchip=split -drive file=${GUEST_IMAGE},if=none,id=disk -device xen-disk,drive=disk,vdev=xvda -m 1G -kernel ~/git/xen/xen/xen -initrd ~/git/linux/arch/x86/boot/bzImage -append "loglvl=all -- console=hvc0 root=/dev/xvda1" -display none







2023-10-12 14:05:21

by Matthew Wilcox

Subject: Re: [PATCH v6 06/38] mm: Add default definition of set_ptes()

On Thu, Oct 12, 2023 at 02:53:05PM +0100, David Woodhouse wrote:
> > +       arch_enter_lazy_mmu_mode();
> > +       for (;;) {
> > +               set_pte(ptep, pte);
> > +               if (--nr == 0)
> > +                       break;
> > +               ptep++;
> > +               pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
> > +       }
> > +       arch_leave_lazy_mmu_mode();
>
> This breaks the Xen PV guest.
>
> In move_ptes() in mm/mremap.c we arch_enter_lazy_mmu_mode() and then
> loop calling set_pte_at(). Which now (or at least in a few commits time
> when you wire it up for x86 in commit a3e1c9372c9b959) ends up in your
> implementation of set_ptes(), calls arch_enter_lazy_mmu_mode() again,
> and:
>
> [ 0.628700] ------------[ cut here ]------------
> [ 0.628718] kernel BUG at arch/x86/kernel/paravirt.c:144!

Easy fix ... don't do that ;-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..f3da8836f689 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -231,9 +231,11 @@ static inline pte_t pte_next_pfn(pte_t pte)
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		pte_t *ptep, pte_t pte, unsigned int nr)
 {
+	bool multiple = nr > 1;
 	page_table_check_ptes_set(mm, ptep, pte, nr);

-	arch_enter_lazy_mmu_mode();
+	if (multiple)
+		arch_enter_lazy_mmu_mode();
 	for (;;) {
 		set_pte(ptep, pte);
 		if (--nr == 0)
@@ -241,7 +243,8 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		ptep++;
 		pte = pte_next_pfn(pte);
 	}
-	arch_leave_lazy_mmu_mode();
+	if (multiple)
+		arch_leave_lazy_mmu_mode();
 }
 #endif
 #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

I think long-term, we should make lazy_mmu_mode nestable. But this is
a reasonable quick fix.
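
One possible reading of "nestable" is a depth counter, where inner enter/leave pairs only adjust the depth and the queued updates are flushed by the outermost leave. The following is a user-space sketch under that assumption only. It is not taken from any kernel tree, and every model_* name is hypothetical.

```c
#include <assert.h>

static unsigned int model_depth;	/* current nesting level */
static unsigned int model_pending;	/* updates queued while lazy */
static unsigned int model_applied;	/* updates that reached the "page table" */
static unsigned int model_flushes;	/* number of batch flushes performed */

static void model_enter_lazy_mmu(void)
{
	model_depth++;
}

static void model_leave_lazy_mmu(void)
{
	assert(model_depth > 0);	/* leave must pair with an enter */
	if (--model_depth == 0) {
		/* Outermost leave: apply everything queued so far. */
		model_applied += model_pending;
		model_pending = 0;
		model_flushes++;
	}
}

static void model_set_pte(void)
{
	if (model_depth > 0)
		model_pending++;	/* deferred until the outermost leave */
	else
		model_applied++;	/* not lazy: applied immediately */
}

/*
 * Outer batch wrapping nr_updates inner enter/leave pairs, one update each.
 * Returns the number of flushes that happened.
 */
unsigned int model_nested_run(unsigned int nr_updates)
{
	unsigned int i;

	model_depth = model_pending = model_applied = model_flushes = 0;

	model_enter_lazy_mmu();
	for (i = 0; i < nr_updates; i++) {
		model_enter_lazy_mmu();	/* nested entry is allowed here */
		model_set_pte();
		model_leave_lazy_mmu();
	}
	model_leave_lazy_mmu();

	return model_flushes;
}

unsigned int model_applied_count(void)
{
	return model_applied;
}
```

With a counter like this, a caller such as move_ptes() could keep its own batch open while set_ptes() enters and leaves inside it, without set_ptes() having to guess from nr whether it is already inside a batch.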

2023-10-12 14:44:16

by David Woodhouse

Subject: Re: [PATCH v6 06/38] mm: Add default definition of set_ptes()

On Thu, 2023-10-12 at 15:05 +0100, Matthew Wilcox wrote:
> On Thu, Oct 12, 2023 at 02:53:05PM +0100, David Woodhouse wrote:
> > > +       arch_enter_lazy_mmu_mode();
> > > +       for (;;) {
> > > +               set_pte(ptep, pte);
> > > +               if (--nr == 0)
> > > +                       break;
> > > +               ptep++;
> > > +               pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
> > > +       }
> > > +       arch_leave_lazy_mmu_mode();
> >
> > This breaks the Xen PV guest.
> >
> > In move_ptes() in mm/mremap.c we arch_enter_lazy_mmu_mode() and then
> > loop calling set_pte_at(). Which now (or at least in a few commits time
> > when you wire it up for x86 in commit a3e1c9372c9b959) ends up in your
> > implementation of set_ptes(), calls arch_enter_lazy_mmu_mode() again,
> > and:
> >
> > [    0.628700] ------------[ cut here ]------------
> > [    0.628718] kernel BUG at arch/x86/kernel/paravirt.c:144!
>
> Easy fix ... don't do that ;-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..f3da8836f689 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -231,9 +231,11 @@ static inline pte_t pte_next_pfn(pte_t pte)
>  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>                 pte_t *ptep, pte_t pte, unsigned int nr)
>  {
> +       bool multiple = nr > 1;
>         page_table_check_ptes_set(mm, ptep, pte, nr);
>  
> -       arch_enter_lazy_mmu_mode();
> +       if (multiple)
> +               arch_enter_lazy_mmu_mode();
>         for (;;) {
>                 set_pte(ptep, pte);
>                 if (--nr == 0)
> @@ -241,7 +243,8 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>                 ptep++;
>                 pte = pte_next_pfn(pte);
>         }
> -       arch_leave_lazy_mmu_mode();
> +       if (multiple)
> +               arch_leave_lazy_mmu_mode();
>  }
>  #endif
>  #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
>
> I think long-term, we should make lazy_mmu_mode nestable.  But this is
> a reasonable quick fix.

I don't much like doing it implicitly based on (nr==1) but sure, as a
quick fix that works. The 64-bit PV guest now boots again.

Tested-by: David Woodhouse <[email protected]>

Thanks.

