2024-04-11 17:43:31

by David Hildenbrand

Subject: [PATCH v3 0/2] s390/mm: shared zeropage + KVM fixes

This series fixes one issue with uffd + shared zeropages on s390x and
makes sure that "ordinary" KVM guests can make use of shared zeropages again.

userfaultfd could currently end up mapping shared zeropages into processes
that forbid shared zeropages. This only applies to s390x, where it is
relevant for handling PV guests and guests that use storage keys correctly.
Fix it by placing a zeroed folio instead of the shared zeropage during
UFFDIO_ZEROPAGE.
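
For context, a minimal userspace sketch of the UFFDIO_ZEROPAGE request in
question (not part of this series; uffd is assumed to be a userfaultfd
already registered on the range, and the helper name is made up):

	#include <sys/ioctl.h>
	#include <linux/userfaultfd.h>

	/* Sketch: uffd is already registered on [addr, addr + len). */
	static int place_zeroed_pages(int uffd, unsigned long addr,
				      unsigned long len)
	{
		struct uffdio_zeropage zp = {
			.range = { .start = addr, .len = len },
			.mode = 0,
		};

		/*
		 * Before this fix, this could map the shared zeropage even
		 * into an mm where mm_forbids_zeropage() is true; with the
		 * fix, a zeroed anonymous folio is placed instead.
		 */
		if (ioctl(uffd, UFFDIO_ZEROPAGE, &zp) == -1)
			return -1;	/* zp.zeropage holds a negative errno */
		return 0;
	}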

I stumbled over this issue while looking into a customer scenario that
is using:

(1) Memory ballooning for dynamic resizing. Start a VM with, say, 100 GiB
and inflate the balloon during boot to 60 GiB. The VM has ~40 GiB
available and additional memory can be "fake hotplugged" to the VM
later on demand by deflating the balloon. Actual memory overcommit is
not desired, so physical memory would only be moved between VMs.

(2) Live migration of VMs between sites to evacuate servers in case of
emergency.

Without the shared zeropage, during (2), the VM would suddenly consume
100 GiB on the migration source and destination. On the migration source,
where we don't expect memory overcommit, we could easily end up crashing
the VM during migration.

Independent of that, memory handed back to the hypervisor using "free page
reporting" would end up consuming actual memory after the migration on the
destination, not getting freed up until reused+freed again.

While there might be ways to optimize parts of this in QEMU, we really
should just support the shared zeropage again for ordinary VMs.

We only expect legacy guests to make use of storage keys, so let's handle
zeropages again when enabling storage keys or when enabling PV. To not
break userfaultfd like we did in the past, don't zap the shared zeropages,
but instead trigger unsharing faults, just like we do for unsharing
KSM pages in break_ksm().

Unsharing faults will simply replace the shared zeropage by a zeroed
anonymous folio. We can already trigger the same fault path using GUP,
when trying to long-term pin a shared zeropage, but also when unmerging
KSM-placed zeropages, so this is nothing new.

Patch #1 tested on x86-64 by forcing mm_forbids_zeropage() to be 1, and
running the uffd selftests.

Patch #2 tested on s390x: the live migration scenario now works as
expected, and kvm-unit-tests that trigger usage of skeys work well, where
I can see detection and unsharing of shared zeropages.

Further (this was broken in v2), I tested that the shared zeropage is no
longer populated after skeys are used -- that mm_forbids_zeropage() works
as expected:
./s390x-run s390x/skey.elf \
-no-shutdown \
-chardev socket,id=monitor,path=/var/tmp/mon,server,nowait \
-mon chardev=monitor,mode=readline

Then, in another shell:

# cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
Rss: 31484 kB
# echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
...
# cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
Rss: 160452 kB

-> Reading guest memory does not populate the shared zeropage

Doing the same with selftest.elf (no skeys):

# cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
Rss: 30900 kB
# echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
...
# cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
Rss: 30924 kB

-> Reading guest memory does populate the shared zeropage

Based on s390/features. Andrew agreed that both patches can go via the
s390x tree.

v2 -> v3:
* "mm/userfaultfd: don't place zeropages when zeropages are disallowed"
-> Fix wrong mm_forbids_zeropage check
* "s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests"
-> Fix wrong mm_forbids_zeropage define

v1 -> v2:
* "mm/userfaultfd: don't place zeropages when zeropages are disallowed"
-> Minor "ret" handling tweaks
* "s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests"
-> Added Fixes: tag

Cc: Christian Borntraeger <[email protected]>
Cc: Janosch Frank <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: [email protected]
Cc: [email protected]


David Hildenbrand (2):
mm/userfaultfd: don't place zeropages when zeropages are disallowed
s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests

arch/s390/include/asm/gmap.h | 2 +-
arch/s390/include/asm/mmu.h | 5 +
arch/s390/include/asm/mmu_context.h | 1 +
arch/s390/include/asm/pgtable.h | 16 ++-
arch/s390/kvm/kvm-s390.c | 4 +-
arch/s390/mm/gmap.c | 163 +++++++++++++++++++++-------
mm/userfaultfd.c | 34 ++++++
7 files changed, 178 insertions(+), 47 deletions(-)

--
2.44.0



2024-04-11 17:44:16

by David Hildenbrand

Subject: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests

commit fa41ba0d08de ("s390/mm: avoid empty zero pages for KVM guests to
avoid postcopy hangs") introduced an undesired side effect when combined
with memory ballooning and VM migration: memory that is part of the
inflated memory balloon will consume memory.

Assume we have a 100GiB VM and inflated the balloon to 40GiB. Our VM
will consume ~60GiB of memory. If we now trigger a VM migration,
hypervisors like QEMU will read all VM memory. As s390x does not support
the shared zeropage, we'll end up allocating for all previously-inflated
memory that is part of the memory balloon: 40 GiB. So we might easily
(unexpectedly) crash the VM on the migration source.

Even worse, hypervisors like QEMU optimize zeropage migration to not
consume memory on the migration destination: when migrating a
"page full of zeroes", they check whether the target memory is already
zero (by reading the destination memory) and avoid writing to the memory
so that no memory gets allocated. However, s390x will also allocate
memory here, implying that on the migration destination, too, we will
end up allocating all previously-inflated memory that is part of the
memory balloon.

This is especially bad if actual memory overcommit was not desired, when
memory ballooning is used for dynamic VM memory resizing, setting aside
some memory during boot that can be added later on demand. Alternatives
like virtio-mem that would avoid this issue are not yet available on
s390x.

There could be ways to optimize some cases in user space: before reading
memory in an anonymous private mapping on the migration source, check via
/proc/self/pagemap if anything is already populated. Similarly check on
the migration destination before reading. While that would avoid
populating tables full of shared zeropages on all architectures, it's
harder to get right and performant, and requires user space changes.
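
For illustration, such a userspace check might look roughly like this (a
sketch, not part of the series; the helper name is made up, and bits 63
and 62 of a pagemap entry are the documented present/swapped flags):

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	/*
	 * Hypothetical userspace helper: test via /proc/self/pagemap
	 * whether the page containing addr is populated (present or
	 * swapped). Returns 1/0, or -1 on error.
	 */
	static int page_is_populated(unsigned long addr)
	{
		int fd = open("/proc/self/pagemap", O_RDONLY);
		off_t off = (off_t)(addr / sysconf(_SC_PAGESIZE)) * 8;
		uint64_t ent = 0;
		int populated = -1;

		if (fd < 0)
			return -1;
		if (pread(fd, &ent, sizeof(ent), off) == sizeof(ent))
			populated = !!(ent & (3ULL << 62));
		close(fd);
		return populated;
	}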

Further, with postcopy live migration we must place a page, so there,
"avoid touching memory to avoid allocating memory" is not really
possible. (Note that previously we would have falsely inserted
shared zeropages into processes using UFFDIO_ZEROPAGE where
mm_forbids_zeropage() would have actually forbidden it.)

PV is currently incompatible with memory ballooning, and in the common
case, KVM guests don't make use of storage keys. Instead of zapping
zeropages when enabling storage keys / PV, which turned out to be
problematic in the past, let's do exactly the same we do with KSM pages:
trigger unsharing faults to replace the shared zeropages by proper
anonymous folios.

What about added latency when enabling storage keys? Having a lot of
zeropages in applicable environments (PV, legacy guests, unittests) is
unexpected. Further, KSM could already unshare the zeropages today, and
unmerging KSM pages when enabling storage keys would unshare the
KSM-placed zeropages in the same way, resulting in the same latency.

Reviewed-by: Christian Borntraeger <[email protected]>
Tested-by: Christian Borntraeger <[email protected]>
Fixes: fa41ba0d08de ("s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs")
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/include/asm/gmap.h | 2 +-
arch/s390/include/asm/mmu.h | 5 +
arch/s390/include/asm/mmu_context.h | 1 +
arch/s390/include/asm/pgtable.h | 16 ++-
arch/s390/kvm/kvm-s390.c | 4 +-
arch/s390/mm/gmap.c | 163 +++++++++++++++++++++-------
6 files changed, 144 insertions(+), 47 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 5cc46e0dde62..9725586f4259 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -146,7 +146,7 @@ int gmap_mprotect_notify(struct gmap *, unsigned long start,

void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
unsigned long gaddr, unsigned long vmaddr);
-int gmap_mark_unmergeable(void);
+int s390_disable_cow_sharing(void);
void s390_unlist_old_asce(struct gmap *gmap);
int s390_replace_asce(struct gmap *gmap);
void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns);
diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index bb1b4bef1878..4c2dc7abc285 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -32,6 +32,11 @@ typedef struct {
unsigned int uses_skeys:1;
/* The mmu context uses CMM. */
unsigned int uses_cmm:1;
+ /*
+ * The mmu context allows COW-sharing of memory pages (KSM, zeropage).
+ * Note that COW-sharing during fork() is currently always allowed.
+ */
+ unsigned int allow_cow_sharing:1;
/* The gmaps associated with this context are allowed to use huge pages. */
unsigned int allow_gmap_hpage_1m:1;
} mm_context_t;
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index 929af18b0908..a7789a9f6218 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -35,6 +35,7 @@ static inline int init_new_context(struct task_struct *tsk,
mm->context.has_pgste = 0;
mm->context.uses_skeys = 0;
mm->context.uses_cmm = 0;
+ mm->context.allow_cow_sharing = 1;
mm->context.allow_gmap_hpage_1m = 0;
#endif
switch (mm->context.asce_limit) {
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 60950e7a25f5..259c2439c251 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -566,10 +566,20 @@ static inline pud_t set_pud_bit(pud_t pud, pgprot_t prot)
}

/*
- * In the case that a guest uses storage keys
- * faults should no longer be backed by zero pages
+ * As soon as the guest uses storage keys or enables PV, we deduplicate all
+ * mapped shared zeropages and prevent new shared zeropages from getting
+ * mapped.
*/
-#define mm_forbids_zeropage mm_has_pgste
+#define mm_forbids_zeropage mm_forbids_zeropage
+static inline int mm_forbids_zeropage(struct mm_struct *mm)
+{
+#ifdef CONFIG_PGSTE
+ if (!mm->context.allow_cow_sharing)
+ return 1;
+#endif
+ return 0;
+}
+
static inline int mm_uses_skeys(struct mm_struct *mm)
{
#ifdef CONFIG_PGSTE
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 5147b943a864..db3392f0be21 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2631,9 +2631,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
if (r)
break;

- mmap_write_lock(current->mm);
- r = gmap_mark_unmergeable();
- mmap_write_unlock(current->mm);
+ r = s390_disable_cow_sharing();
if (r)
break;

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 094b43b121cd..9233b0acac89 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2549,41 +2549,6 @@ static inline void thp_split_mm(struct mm_struct *mm)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

-/*
- * Remove all empty zero pages from the mapping for lazy refaulting
- * - This must be called after mm->context.has_pgste is set, to avoid
- * future creation of zero pages
- * - This must be called after THP was disabled.
- *
- * mm contracts with s390, that even if mm were to remove a page table,
- * racing with the loop below and so causing pte_offset_map_lock() to fail,
- * it will never insert a page table containing empty zero pages once
- * mm_forbids_zeropage(mm) i.e. mm->context.has_pgste is set.
- */
-static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
- unsigned long end, struct mm_walk *walk)
-{
- unsigned long addr;
-
- for (addr = start; addr != end; addr += PAGE_SIZE) {
- pte_t *ptep;
- spinlock_t *ptl;
-
- ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
- if (!ptep)
- break;
- if (is_zero_pfn(pte_pfn(*ptep)))
- ptep_xchg_direct(walk->mm, addr, ptep, __pte(_PAGE_INVALID));
- pte_unmap_unlock(ptep, ptl);
- }
- return 0;
-}
-
-static const struct mm_walk_ops zap_zero_walk_ops = {
- .pmd_entry = __zap_zero_pages,
- .walk_lock = PGWALK_WRLOCK,
-};
-
/*
* switch on pgstes for its userspace process (for kvm)
*/
@@ -2601,22 +2566,140 @@ int s390_enable_sie(void)
mm->context.has_pgste = 1;
/* split thp mappings and disable thp for future mappings */
thp_split_mm(mm);
- walk_page_range(mm, 0, TASK_SIZE, &zap_zero_walk_ops, NULL);
mmap_write_unlock(mm);
return 0;
}
EXPORT_SYMBOL_GPL(s390_enable_sie);

-int gmap_mark_unmergeable(void)
+static int find_zeropage_pte_entry(pte_t *pte, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ unsigned long *found_addr = walk->private;
+
+ /* Return 1 if the page is a zeropage. */
+ if (is_zero_pfn(pte_pfn(*pte))) {
+
+ /*
+ * Shared zeropage in e.g., a FS DAX mapping? We cannot do the
+ * right thing and likely don't care: FAULT_FLAG_UNSHARE
+ * currently only works in COW mappings, which is also where
+ * mm_forbids_zeropage() is checked.
+ */
+ if (!is_cow_mapping(walk->vma->vm_flags))
+ return -EFAULT;
+
+ *found_addr = addr;
+ return 1;
+ }
+ return 0;
+}
+
+static const struct mm_walk_ops find_zeropage_ops = {
+ .pte_entry = find_zeropage_pte_entry,
+ .walk_lock = PGWALK_WRLOCK,
+};
+
+/*
+ * Unshare all shared zeropages, replacing them by anonymous pages. Note that
+ * we cannot simply zap all shared zeropages, because this could later
+ * trigger unexpected userfaultfd missing events.
+ *
+ * This must be called after mm->context.allow_cow_sharing was
+ * set to 0, to avoid future mappings of shared zeropages.
+ *
+ * mm contracts with s390, that even if mm were to remove a page table,
+ * and racing with walk_page_range_vma() calling pte_offset_map_lock()
+ * would fail, it will never insert a page table containing empty zero
+ * pages once mm_forbids_zeropage(mm) i.e.
+ * mm->context.allow_cow_sharing is set to 0.
+ */
+static int __s390_unshare_zeropages(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+ VMA_ITERATOR(vmi, mm, 0);
+ unsigned long addr;
+ int rc;
+
+ for_each_vma(vmi, vma) {
+ /*
+ * We could only look at COW mappings, but it's more future
+ * proof to catch unexpected zeropages in other mappings and
+ * fail.
+ */
+ if ((vma->vm_flags & VM_PFNMAP) || is_vm_hugetlb_page(vma))
+ continue;
+ addr = vma->vm_start;
+
+retry:
+ rc = walk_page_range_vma(vma, addr, vma->vm_end,
+ &find_zeropage_ops, &addr);
+ if (rc <= 0)
+ continue;
+
+ /* addr was updated by find_zeropage_pte_entry() */
+ rc = handle_mm_fault(vma, addr,
+ FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE,
+ NULL);
+ if (rc & VM_FAULT_OOM)
+ return -ENOMEM;
+ /*
+ * See break_ksm(): even after handle_mm_fault() returned 0, we
+ * must start the lookup from the current address, because
+ * handle_mm_fault() may back out if there's any difficulty.
+ *
+ * VM_FAULT_SIGBUS and VM_FAULT_SIGSEGV are unexpected but
+ * maybe they could trigger in the future on concurrent
+ * truncation. In that case, the shared zeropage would be gone
+ * and we can simply retry and make progress.
+ */
+ cond_resched();
+ goto retry;
+ }
+
+ return rc;
+}
+
+static int __s390_disable_cow_sharing(struct mm_struct *mm)
{
+ int rc;
+
+ if (!mm->context.allow_cow_sharing)
+ return 0;
+
+ mm->context.allow_cow_sharing = 0;
+
+ /* Replace all shared zeropages by anonymous pages. */
+ rc = __s390_unshare_zeropages(mm);
/*
* Make sure to disable KSM (if enabled for the whole process or
* individual VMAs). Note that nothing currently hinders user space
* from re-enabling it.
*/
- return ksm_disable(current->mm);
+ if (!rc)
+ rc = ksm_disable(mm);
+ if (rc)
+ mm->context.allow_cow_sharing = 1;
+ return rc;
+}
+
+/*
+ * Disable most COW-sharing of memory pages for the whole process:
+ * (1) Disable KSM and unmerge/unshare any KSM pages.
+ * (2) Disallow shared zeropages and unshare any zeropages that are mapped.
+ *
+ * Note that we currently don't bother with COW-shared pages that are shared
+ * with parent/child processes due to fork().
+ */
+int s390_disable_cow_sharing(void)
+{
+ int rc;
+
+ mmap_write_lock(current->mm);
+ rc = __s390_disable_cow_sharing(current->mm);
+ mmap_write_unlock(current->mm);
+ return rc;
}
-EXPORT_SYMBOL_GPL(gmap_mark_unmergeable);
+EXPORT_SYMBOL_GPL(s390_disable_cow_sharing);

/*
* Enable storage key handling from now on and initialize the storage
@@ -2685,7 +2768,7 @@ int s390_enable_skey(void)
goto out_up;

mm->context.uses_skeys = 1;
- rc = gmap_mark_unmergeable();
+ rc = __s390_disable_cow_sharing(mm);
if (rc) {
mm->context.uses_skeys = 0;
goto out_up;
--
2.44.0


2024-04-11 17:53:13

by David Hildenbrand

Subject: [PATCH v3 1/2] mm/userfaultfd: don't place zeropages when zeropages are disallowed

s390x must disable shared zeropages for processes running VMs, because
the VMs could end up making use of "storage keys" or protected
virtualization, which are incompatible with shared zeropages.

Yet, with userfaultfd it is possible to insert shared zeropages into
such processes. Let's fallback to simply allocating a fresh zeroed
anonymous folio and insert that instead.

mm_forbids_zeropage() was introduced in commit 593befa6ab74 ("mm: introduce
mm_forbids_zeropage function"), briefly before userfaultfd went
upstream.

Note that we don't want to fail the UFFDIO_ZEROPAGE request like we do
for hugetlb; it would be rather unexpected. Further, we also
cannot really indicate "not supported" to user space ahead of time: it
could be that the MM disallows zeropages after userfaultfd was already
registered.

Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/userfaultfd.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 3c3539c573e7..7eb3dd0f8a49 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -316,6 +316,37 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
goto out;
}

+static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd,
+ struct vm_area_struct *dst_vma, unsigned long dst_addr)
+{
+ struct folio *folio;
+ int ret = -ENOMEM;
+
+ folio = vma_alloc_zeroed_movable_folio(dst_vma, dst_addr);
+ if (!folio)
+ return ret;
+
+ if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
+ goto out_put;
+
+ /*
+ * The memory barrier inside __folio_mark_uptodate makes sure that
+ * zeroing out the folio becomes visible before mapping the page
+ * using set_pte_at(). See do_anonymous_page().
+ */
+ __folio_mark_uptodate(folio);
+
+ ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
+ &folio->page, true, 0);
+ if (ret)
+ goto out_put;
+
+ return 0;
+out_put:
+ folio_put(folio);
+ return ret;
+}
+
static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr)
@@ -324,6 +355,9 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
spinlock_t *ptl;
int ret;

+ if (mm_forbids_zeropage(dst_vma->vm_mm))
+ return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr);
+
_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
dst_vma->vm_page_prot));
ret = -EAGAIN;
--
2.44.0


2024-04-11 17:55:04

by Alexander Gordeev

Subject: Re: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests

On Thu, Apr 11, 2024 at 06:14:41PM +0200, David Hildenbrand wrote:

David, Christian,

> Tested-by: Christian Borntraeger <[email protected]>

Please, correct me if I am wrong, but (to my understanding) the
Tested-by for v2 does not apply for this version of the patch?

Thanks!

2024-04-11 21:10:15

by David Hildenbrand

Subject: Re: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests

On 11.04.24 18:37, Alexander Gordeev wrote:
> On Thu, Apr 11, 2024 at 06:14:41PM +0200, David Hildenbrand wrote:
>
> David, Christian,
>
>> Tested-by: Christian Borntraeger <[email protected]>
>
> Please, correct me if I am wrong, but (to my understanding) the
> Tested-by for v2 does not apply for this version of the patch?

I thought I'd removed it -- you're absolutely right, this should be dropped.
Hopefully Christian has time to retest.

Thanks!

--
Cheers,

David / dhildenb


2024-04-11 21:28:38

by Andrew Morton

Subject: Re: [PATCH v3 0/2] s390/mm: shared zeropage + KVM fixes

On Thu, 11 Apr 2024 18:14:39 +0200 David Hildenbrand <[email protected]> wrote:

> This series fixes one issue with uffd + shared zeropages on s390x and
> fixes that "ordinary" KVM guests can make use of shared zeropages again.
>
> ...
>
> Without the shared zeropage, during (2), the VM would suddenly consume
> 100 GiB on the migration source and destination. On the migration source,
> where we don't excpect memory overcommit, we could easilt end up crashing
> the VM during migration.
>
> Independent of that, memory handed back to the hypervisor using "free page
> reporting" would end up consuming actual memory after the migration on the
> destination, not getting freed up until reused+freed again.
>

Is a backport desirable?

If so, the [1/2] Fixes dates back to 2015 and the [2/2] Fixes is from
2017. Is it appropriate that the patches be backported so far back,
and into different kernel versions?

2024-04-11 21:56:36

by David Hildenbrand

Subject: Re: [PATCH v3 0/2] s390/mm: shared zeropage + KVM fixes

On 11.04.24 23:28, Andrew Morton wrote:
> On Thu, 11 Apr 2024 18:14:39 +0200 David Hildenbrand <[email protected]> wrote:
>
>> This series fixes one issue with uffd + shared zeropages on s390x and
>> fixes that "ordinary" KVM guests can make use of shared zeropages again.
>>
>> ...
>>
>> Without the shared zeropage, during (2), the VM would suddenly consume
>> 100 GiB on the migration source and destination. On the migration source,
>> where we don't expect memory overcommit, we could easily end up crashing
>> the VM during migration.
>>
>> Independent of that, memory handed back to the hypervisor using "free page
>> reporting" would end up consuming actual memory after the migration on the
>> destination, not getting freed up until reused+freed again.
>>
>
> Is a backport desirable?
>
> If so, the [1/2] Fixes dates back to 2015 and the [2/2] Fixes is from
> 2017. Is it appropriate that the patches be backported so far back,
> and into different kernel versions?
>

[2/2] won't be easy to backport to kernels without FAULT_FLAG_UNSHARE,
so I wouldn't really suggest backports to kernels before that. [1/2]
might be reasonable to backport, but might require some tweaking (page
vs. folio).

--
Cheers,

David / dhildenb


2024-04-12 13:25:54

by Alexander Gordeev

Subject: Re: [PATCH v3 0/2] s390/mm: shared zeropage + KVM fixes

> David Hildenbrand (2):
> mm/userfaultfd: don't place zeropages when zeropages are disallowed
> s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests
>
> arch/s390/include/asm/gmap.h | 2 +-
> arch/s390/include/asm/mmu.h | 5 +
> arch/s390/include/asm/mmu_context.h | 1 +
> arch/s390/include/asm/pgtable.h | 16 ++-
> arch/s390/kvm/kvm-s390.c | 4 +-
> arch/s390/mm/gmap.c | 163 +++++++++++++++++++++-------
> mm/userfaultfd.c | 34 ++++++
> 7 files changed, 178 insertions(+), 47 deletions(-)

Applied.
Thanks, David!

2024-04-15 11:49:46

by Christian Borntraeger

Subject: Re: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests


Am 11.04.24 um 23:09 schrieb David Hildenbrand:
> On 11.04.24 18:37, Alexander Gordeev wrote:
>> On Thu, Apr 11, 2024 at 06:14:41PM +0200, David Hildenbrand wrote:
>>
>> David, Christian,
>>
>>> Tested-by: Christian Borntraeger <[email protected]>
>>
>> Please, correct me if I am wrong, but (to my understanding) the
>> Tested-by for v2 does not apply for this version of the patch?
>
> I thought I'd removed it -- you're absolutely right, this should be dropped. Hopefully Christian has time to retest.

So I can confirm that this patch does continue to fix the qemu memory consumption for a guest doing managedsave/start.
A quick check of other aspects seems to be ok. We will have more coverage on the base functionality as soon as it hits next (via Andrew), as our daily CI will pick this up for lots of KVM tests.

2024-04-15 13:30:18

by Alexander Gordeev

Subject: Re: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests

On Mon, Apr 15, 2024 at 01:49:15PM +0200, Christian Borntraeger wrote:

Hi Christian,

> > On 11.04.24 18:37, Alexander Gordeev wrote:
> > > On Thu, Apr 11, 2024 at 06:14:41PM +0200, David Hildenbrand wrote:
> > > David, Christian,
> > > > Tested-by: Christian Borntraeger <[email protected]>
> > > Please, correct me if I am wrong, but (to my understanding) the
> > > Tested-by for v2 does not apply for this version of the patch?
> > I thought I'd removed it -- you're absolutely right, this should be dropped. Hopefully Christian has time to retest.
>
> So I can confirm that this patch does continue to fix the qemu memory consumption for a guest doing managedsave/start.

I will re-add your Tested-by.

> A quick check of other aspects seems to be ok. We will have more coverage on the base functionality as soon as it hits next (via Andrew), as our daily CI will pick this up for lots of KVM tests.

As per the cover letter I will pull it via s390 tree:
"Based on s390/features. Andrew agreed that both patches can go via the
s390x tree."

Thanks!

2024-04-15 18:25:06

by Alexander Gordeev

Subject: Re: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests

On Thu, Apr 11, 2024 at 06:14:41PM +0200, David Hildenbrand wrote:

David, could you please clarify the below questions?

> +static int __s390_unshare_zeropages(struct mm_struct *mm)
> +{
> + struct vm_area_struct *vma;
> + VMA_ITERATOR(vmi, mm, 0);
> + unsigned long addr;
> + int rc;
> +
> + for_each_vma(vmi, vma) {
> + /*
> + * We could only look at COW mappings, but it's more future
> + * proof to catch unexpected zeropages in other mappings and
> + * fail.
> + */
> + if ((vma->vm_flags & VM_PFNMAP) || is_vm_hugetlb_page(vma))
> + continue;
> + addr = vma->vm_start;
> +
> +retry:
> + rc = walk_page_range_vma(vma, addr, vma->vm_end,
> + &find_zeropage_ops, &addr);
> + if (rc <= 0)
> + continue;

So in case an error is returned for the last vma, __s390_unshare_zeropages()
finishes with that error. By contrast, the error for a non-last vma would
be ignored?

> +
> + /* addr was updated by find_zeropage_pte_entry() */
> + rc = handle_mm_fault(vma, addr,
> + FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE,
> + NULL);
> + if (rc & VM_FAULT_OOM)
> + return -ENOMEM;

Heiko pointed out that rc type is inconsistent vs vm_fault_t returned by
handle_mm_fault(). While fixing it up, I became concerned whether it is
fine to continue in case any other error is met (including possible future
VM_FAULT_xxxx)?

> + /*
> + * See break_ksm(): even after handle_mm_fault() returned 0, we
> + * must start the lookup from the current address, because
> + * handle_mm_fault() may back out if there's any difficulty.
> + *
> + * VM_FAULT_SIGBUS and VM_FAULT_SIGSEGV are unexpected but
> + * maybe they could trigger in the future on concurrent
> + * truncation. In that case, the shared zeropage would be gone
> + * and we can simply retry and make progress.
> + */
> + cond_resched();
> + goto retry;
> + }
> +
> + return rc;
> +}

Thanks!

2024-04-15 19:14:19

by David Hildenbrand

Subject: Re: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests

On 15.04.24 20:24, Alexander Gordeev wrote:
> On Thu, Apr 11, 2024 at 06:14:41PM +0200, David Hildenbrand wrote:
>
> David, could you please clarify the below questions?

Sure, let me take a look at whether we're still failing to handle some corner cases correctly.

>
>> +static int __s390_unshare_zeropages(struct mm_struct *mm)
>> +{
>> + struct vm_area_struct *vma;
>> + VMA_ITERATOR(vmi, mm, 0);
>> + unsigned long addr;
>> + int rc;
>> +
>> + for_each_vma(vmi, vma) {
>> + /*
>> + * We could only look at COW mappings, but it's more future
>> + * proof to catch unexpected zeropages in other mappings and
>> + * fail.
>> + */
>> + if ((vma->vm_flags & VM_PFNMAP) || is_vm_hugetlb_page(vma))
>> + continue;
>> + addr = vma->vm_start;
>> +
>> +retry:
>> + rc = walk_page_range_vma(vma, addr, vma->vm_end,
>> + &find_zeropage_ops, &addr);
>> + if (rc <= 0)
>> + continue;
>
> So in case an error is returned for the last vma, __s390_unshare_zeropages()
> finishes with that error. By contrast, the error for a non-last vma would
> be ignored?

Right, it looks a bit off. walk_page_range_vma() shouldn't fail
unless find_zeropage_pte_entry() would fail -- which would also be
very unexpected.

To handle it cleanly in case we would ever get a weird zeropage where we
don't expect it, we should probably just exit early.

Something like the following (not compiled, addressing the comment below):


From b97cd17a3697ac402b07fe8d0033f3c10fbd6829 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <[email protected]>
Date: Mon, 15 Apr 2024 20:56:20 +0200
Subject: [PATCH] fixup

Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/mm/gmap.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 9233b0acac89..3e3322a9cc32 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2618,7 +2618,8 @@ static int __s390_unshare_zeropages(struct mm_struct *mm)
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, mm, 0);
unsigned long addr;
- int rc;
+ vm_fault_t rc;
+ int zero_page;

for_each_vma(vmi, vma) {
/*
@@ -2631,9 +2632,11 @@ static int __s390_unshare_zeropages(struct mm_struct *mm)
addr = vma->vm_start;

retry:
- rc = walk_page_range_vma(vma, addr, vma->vm_end,
- &find_zeropage_ops, &addr);
- if (rc <= 0)
+ zero_page = walk_page_range_vma(vma, addr, vma->vm_end,
+ &find_zeropage_ops, &addr);
+ if (zero_page < 0)
+ return zero_page;
+ else if (!zero_page)
continue;

/* addr was updated by find_zeropage_pte_entry() */
@@ -2656,7 +2659,7 @@ static int __s390_unshare_zeropages(struct mm_struct *mm)
goto retry;
}

- return rc;
+ return 0;
}

static int __s390_disable_cow_sharing(struct mm_struct *mm)
--
2.44.0



>
>> +
>> + /* addr was updated by find_zeropage_pte_entry() */
>> + rc = handle_mm_fault(vma, addr,
>> + FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE,
>> + NULL);
>> + if (rc & VM_FAULT_OOM)
>> + return -ENOMEM;
>
> Heiko pointed out that rc type is inconsistent vs vm_fault_t returned by

Right, let's use another variable for that.

> handle_mm_fault(). While fixing it up, I became concerned whether it is
> fine to continue in case any other error is met (including possible future
> VM_FAULT_xxxx)?

Such future changes would similarly break break_ksm(). Staring at it, I do wonder
if break_ksm() should be handling VM_FAULT_HWPOISON ... very likely we should
handle it and fail -- we might get an MC while copying from the source page.

VM_FAULT_HWPOISON on the shared zeropage would imply a lot of trouble, so
I'm not concerned about that for the case here, but handling it in the future
would be cleaner.

Note that we always retry the lookup, so we won't just skip a zeropage on unexpected
errors.

We could piggy-back on vm_fault_to_errno(). We could use
vm_fault_to_errno(rc, FOLL_HWPOISON), and only continue (retry) if the rc is 0 or
-EFAULT, otherwise fail with the returned error.

But I'd do that as a follow up, and also use it in break_ksm() in the same fashion.
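
For illustration only, that follow-up might look roughly like this inside
the retry loop (an untested sketch, not part of this series;
vm_fault_to_errno() is the existing helper from <linux/mm.h>, variable
names are illustrative):

	fault = handle_mm_fault(vma, addr,
				FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE, NULL);
	rc = vm_fault_to_errno(fault, FOLL_HWPOISON);
	if (rc && rc != -EFAULT)
		return rc;	/* e.g., -ENOMEM or -EHWPOISON: fail hard */
	/*
	 * rc == 0 or rc == -EFAULT: we either made progress or the
	 * zeropage is already gone; retry the lookup either way.
	 */
	cond_resched();
	goto retry;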

--
Cheers,

David / dhildenb


2024-04-16 06:39:29

by Alexander Gordeev

Subject: Re: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests

On Mon, Apr 15, 2024 at 09:14:03PM +0200, David Hildenbrand wrote:
> > > +retry:
> > > + rc = walk_page_range_vma(vma, addr, vma->vm_end,
> > > + &find_zeropage_ops, &addr);
> > > + if (rc <= 0)
> > > + continue;
> >
> > So in case an error is returned for the last vma, __s390_unshare_zeropages()
> > finishes with that error. By contrast, the error for a non-last vma would
> > be ignored?
>
> Right, it looks a bit off. walk_page_range_vma() shouldn't fail
> unless find_zeropage_pte_entry() would fail -- which would also be
> very unexpected.
>
> To handle it cleanly in case we would ever get a weird zeropage where we
> don't expect it, we should probably just exit early.
>
> Something like the following (not compiled, addressing the comment below):

> @@ -2618,7 +2618,8 @@ static int __s390_unshare_zeropages(struct mm_struct *mm)
> struct vm_area_struct *vma;
> VMA_ITERATOR(vmi, mm, 0);
> unsigned long addr;
> - int rc;
> + vm_fault_t rc;
> + int zero_page;

I would use "fault" for mm faults (just like everywhere else handle_mm_fault() is
called) and leave rc as is:

vm_fault_t fault;
int rc;

> for_each_vma(vmi, vma) {
> /*
> @@ -2631,9 +2632,11 @@ static int __s390_unshare_zeropages(struct mm_struct *mm)
> addr = vma->vm_start;
> retry:
> - rc = walk_page_range_vma(vma, addr, vma->vm_end,
> - &find_zeropage_ops, &addr);
> - if (rc <= 0)
> + zero_page = walk_page_range_vma(vma, addr, vma->vm_end,
> + &find_zeropage_ops, &addr);
> + if (zero_page < 0)
> + return zero_page;
> + else if (!zero_page)
> continue;
> /* addr was updated by find_zeropage_pte_entry() */
> @@ -2656,7 +2659,7 @@ static int __s390_unshare_zeropages(struct mm_struct *mm)
> goto retry;
> }
> - return rc;
> + return 0;
> }
> static int __s390_disable_cow_sharing(struct mm_struct *mm)

..

> > > + /* addr was updated by find_zeropage_pte_entry() */
> > > + rc = handle_mm_fault(vma, addr,
> > > + FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE,
> > > + NULL);
> > > + if (rc & VM_FAULT_OOM)
> > > + return -ENOMEM;
> >
> > Heiko pointed out that rc type is inconsistent vs vm_fault_t returned by
>
> Right, let's use another variable for that.
>
> > handle_mm_fault(). While fixing it up, I became concerned whether it is
> > fine to continue in case any other error is met (including possible future
> > VM_FAULT_xxxx)?
>
> Such future changes would similarly break break_ksm(). Staring at it, I do wonder
> if break_ksm() should be handling VM_FAULT_HWPOISON ... very likely we should
> handle it and fail -- we might get an MC while copying from the source page.
>
> VM_FAULT_HWPOISON on the shared zeropage would imply a lot of trouble, so
> I'm not concerned about that for the case here, but handling it in the future
> would be cleaner.
>
> Note that we always retry the lookup, so we won't just skip a zeropage on unexpected
> errors.
>
> We could piggy-back on vm_fault_to_errno(). We could use
> vm_fault_to_errno(rc, FOLL_HWPOISON), and only continue (retry) if the rc is 0 or
> -EFAULT, otherwise fail with the returned error.
>
> But I'd do that as a follow up, and also use it in break_ksm() in the same fashion.

@Christian, do you agree with this suggestion?

Thanks!

2024-04-16 07:05:34

by David Hildenbrand

Subject: Re: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests

On 16.04.24 08:37, Alexander Gordeev wrote:
> On Mon, Apr 15, 2024 at 09:14:03PM +0200, David Hildenbrand wrote:
>>>> +retry:
>>>> + rc = walk_page_range_vma(vma, addr, vma->vm_end,
>>>> + &find_zeropage_ops, &addr);
>>>> + if (rc <= 0)
>>>> + continue;
>>>
>>> So in case an error is returned for the last vma, __s390_unshare_zeropages()
>>> finishes with that error. By contrast, the error for a non-last vma would
>>> be ignored?
>>
>> Right, it looks a bit off. walk_page_range_vma() shouldn't fail
>> unless find_zeropage_pte_entry() would fail -- which would also be
>> very unexpected.
>>
>> To handle it cleanly in case we would ever get a weird zeropage where we
>> don't expect it, we should probably just exit early.
>>
>> Something like the following (not compiled, addressing the comment below):
>
>> @@ -2618,7 +2618,8 @@ static int __s390_unshare_zeropages(struct mm_struct *mm)
>> struct vm_area_struct *vma;
>> VMA_ITERATOR(vmi, mm, 0);
>> unsigned long addr;
>> - int rc;
>> + vm_fault_t rc;
>> + int zero_page;
>
> I would use "fault" for mm faults (just like everywhere else handle_mm_fault() is
> called) and leave rc as is:
>
> vm_fault_t fault;
> int rc;

Sure, let me know once the discussion here has stopped whether you want a v4
or whether you can fix that up.

--
Cheers,

David / dhildenb


2024-04-16 12:03:13

by Christian Borntraeger

Subject: Re: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests



Am 16.04.24 um 08:37 schrieb Alexander Gordeev:

>> We could piggy-back on vm_fault_to_errno(). We could use
>> vm_fault_to_errno(rc, FOLL_HWPOISON), and only continue (retry) if the rc is 0 or
>> -EFAULT, otherwise fail with the returned error.
>>
>> But I'd do that as a follow up, and also use it in break_ksm() in the same fashion.
>
> @Christian, do you agree with this suggestion?

I would need to look into that more closely to give a proper answer. In general I am ok
with this but I prefer to have more eyes on that.
From what I can tell we should cover all the normal cases with our CI as soon as it hits
next. But maybe we should try to create/change a selftest to trigger these error cases?

2024-04-16 13:42:28

by David Hildenbrand

Subject: Re: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests

On 16.04.24 14:02, Christian Borntraeger wrote:
>
>
> Am 16.04.24 um 08:37 schrieb Alexander Gordeev:
>
>>> We could piggy-back on vm_fault_to_errno(). We could use
>>> vm_fault_to_errno(rc, FOLL_HWPOISON), and only continue (retry) if the rc is 0 or
>>> -EFAULT, otherwise fail with the returned error.
>>>
>>> But I'd do that as a follow up, and also use it in break_ksm() in the same fashion.
>>
>> @Christian, do you agree with this suggestion?
>
> I would need to look into that more closely to give a proper answer. In general I am ok
> with this but I prefer to have more eyes on that.
> From what I can tell we should cover all the normal cases with our CI as soon as it hits
> next. But maybe we should try to create/change a selftest to trigger these error cases?

If we find a shared zeropage we expect the next unsharing fault to
succeed except:

(1) OOM, in which case we translate to -ENOMEM.

(2) Some obscure race with MADV_DONTNEED paired with concurrent
truncate(), in which case we get an error, but if we look again, we will
find the shared zeropage no longer mapped. (this is what break_ksm()
describes)

(3) MCE while copying the page, which doesn't quite apply here.

For the time being, we only get shared zeropages in (a) anon mappings
(b) MAP_PRIVATE shmem mappings via UFFDIO_ZEROPAGE. So (2) is hard or
even impossible to trigger. (1) is hard to test as well, and (3) ...

No easy way to extend selftests that I can see.

If we repeatedly find a shared zeropage in a COW mapping and get an
error from the unsharing fault, something else would be deeply flawed.
So I'm not really worried about that, but I agree that having a more
centralized check will make sense.

--
Cheers,

David / dhildenb


2024-04-17 12:47:45

by David Hildenbrand

Subject: Re: [PATCH v3 0/2] s390/mm: shared zeropage + KVM fixes

On 17.04.24 14:46, Alexander Gordeev wrote:
> On Thu, Apr 11, 2024 at 06:14:39PM +0200, David Hildenbrand wrote:
>
> Hi David,
>
>> Based on s390/features. Andrew agreed that both patches can go via the
>> s390x tree.
>
> I am going to put this series on a branch together with the selftest:
> https://lore.kernel.org/r/[email protected]
>
> Is there something in s390/features your three patches depend on?
> Or does v6.9-rc2 contain everything needed already?

v6.9-rc2 should have all we need IIRC.

--
Cheers,

David / dhildenb


2024-04-17 12:48:06

by Alexander Gordeev

Subject: Re: [PATCH v3 0/2] s390/mm: shared zeropage + KVM fixes

On Thu, Apr 11, 2024 at 06:14:39PM +0200, David Hildenbrand wrote:

Hi David,

> Based on s390/features. Andrew agreed that both patches can go via the
> s390x tree.

I am going to put this series on a branch together with the selftest:
https://lore.kernel.org/r/[email protected]

Is there something in s390/features your three patches depend on?
Or does v6.9-rc2 contain everything needed already?

Thanks!

2024-04-18 13:18:14

by Christian Borntraeger

Subject: Re: [PATCH v3 2/2] s390/mm: re-enable the shared zeropage for !PV and !skeys KVM guests



Am 16.04.24 um 15:41 schrieb David Hildenbrand:
> On 16.04.24 14:02, Christian Borntraeger wrote:
>>
>>
>> Am 16.04.24 um 08:37 schrieb Alexander Gordeev:
>>
>>>> We could piggy-back on vm_fault_to_errno(). We could use
>>>> vm_fault_to_errno(rc, FOLL_HWPOISON), and only continue (retry) if the rc is 0 or
>>>> -EFAULT, otherwise fail with the returned error.
>>>>
>>>> But I'd do that as a follow up, and also use it in break_ksm() in the same fashion.
>>>
>>> @Christian, do you agree with this suggestion?
>>
>> I would need to look into that more closely to give a proper answer. In general I am ok
>> with this but I prefer to have more eyes on that.
>>   From what I can tell we should cover all the normal cases with our CI as soon as it hits
>> next. But maybe we should try to create/change a selftest to trigger these error cases?
>
> If we find a shared zeropage we expect the next unsharing fault to succeed except:
>
> (1) OOM, in which case we translate to -ENOMEM.
>
> (2) Some obscure race with MADV_DONTNEED paired with concurrent truncate(), in which case we get an error, but if we look again, we will find the shared zeropage no longer mapped. (this is what break_ksm() describes)
>
> (3) MCE while copying the page, which doesn't quite apply here.
>
> For the time being, we only get shared zeropages in (a) anon mappings (b) MAP_PRIVATE shmem mappings via UFFDIO_ZEROPAGE. So (2) is hard or even impossible to trigger. (1) is hard to test as well, and (3) ...
>
> No easy way to extend selftests that I can see.

Yes, let's just go forward.
>
> If we repeatedly find a shared zeropage in a COW mapping and get an error from the unsharing fault, something else would be deeply flawed. So I'm not really worried about that, but I agree that having a more centralized check will make sense.