2020-11-27 08:47:24

by Peter Xu

Subject: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

Fault-around on reads is in most cases helpful for performance, since
continuous memory accesses may avoid another trip through the page fault
path. However, it may not always work as expected.

For example, userfaultfd-registered regions may not be the best candidates
for pre-faulting around reads.

For missing-mode uffds, fault-around does not help: if the page cache
existed, then the page should be there already; if the page cache is not
there, there is nothing else we can do either. If the fault-around code is
destined to be helpless for userfault-missing vmas, then ideally we can
skip it.

For wr-protected mode uffds, erroneously faulting in the surrounding pages
could let threads access them without the uffd server's awareness. For
example, when punching holes in uffd-wp registered shmem regions, we first
try to unmap all the pages before evicting the page cache, but without
locking the pages (see shmem_fallocate(), where unmap_mapping_range() is
called before shmem_truncate_range()). When fault-around happens near a
hole being punched, we might erroneously fault in the "holes" right before
they are punched. That leaves a small window after the page becomes
writable again and before the page cache is finally dropped (NOTE: the
uffd-wp protect information is totally lost due to the pre-unmap in
shmem_fallocate(), so the page can be writable within that small window).
That's severe data loss.

Let's grant userspace full control of the uffd-registered ranges, rather
than trying such tricks.

Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---

Note that since there is no file-backed uffd-wp support upstream yet, the
uffd-wp check is not really functional. However, since all the necessary
uffd-wp concepts are already upstream, maybe it's better to do it once and
for all.

This patch comes from debugging a data loss issue while working on the
uffd-wp support for shmem/hugetlbfs. I'm posting it out for early review
and comments, but also because it should already start to benefit
missing-mode userfaultfd by avoiding fault-around on reads.
---
include/linux/userfaultfd_k.h | 5 +++++
mm/memory.c | 17 +++++++++++++++++
2 files changed, 22 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index a8e5f3ea9bb2..451d99bb3a1a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -62,6 +62,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
return vma->vm_flags & VM_UFFD_WP;
}

+static inline bool vma_registered_userfaultfd(struct vm_area_struct *vma)
+{
+ return userfaultfd_missing(vma) || userfaultfd_wp(vma);
+}
+
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
pte_t pte)
{
diff --git a/mm/memory.c b/mm/memory.c
index eeae590e526a..ca58ada94c96 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3933,6 +3933,23 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
int off;
vm_fault_t ret = 0;

+ /*
+ * Be extremely careful with uffd-armed regions.
+ *
+ * For missing mode uffds, fault around does not help because if the
+ * page cache existed, then the page should be there already. If the
+ * page cache is not there, nothing else we can do either.
+ *
+ * For wr-protected mode uffds, erroneously faulting in those pages
+ * could let threads access them without the uffd server's awareness,
+ * which could finally cause ghostly data corruption.
+ *
+ * The idea is that userspace should govern every single page of uffd
+ * regions, including which pages to fault in.
+ */
+ if (unlikely(vma_registered_userfaultfd(vmf->vma)))
+ return 0;
+
nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;

--
2.26.2


2020-11-27 08:54:05

by Mike Rapoport

Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote:
> Fault-around on reads is in most cases helpful for performance, since
> continuous memory accesses may avoid another trip through the page fault
> path. However, it may not always work as expected.
>
> For example, userfaultfd-registered regions may not be the best candidates
> for pre-faulting around reads.
>
> For missing-mode uffds, fault-around does not help: if the page cache
> existed, then the page should be there already; if the page cache is not
> there, there is nothing else we can do either. If the fault-around code is
> destined to be helpless for userfault-missing vmas, then ideally we can
> skip it.
>
> For wr-protected mode uffds, erroneously faulting in the surrounding pages
> could let threads access them without the uffd server's awareness. For
> example, when punching holes in uffd-wp registered shmem regions, we first
> try to unmap all the pages before evicting the page cache, but without
> locking the pages (see shmem_fallocate(), where unmap_mapping_range() is
> called before shmem_truncate_range()). When fault-around happens near a
> hole being punched, we might erroneously fault in the "holes" right before
> they are punched. That leaves a small window after the page becomes
> writable again and before the page cache is finally dropped (NOTE: the
> uffd-wp protect information is totally lost due to the pre-unmap in
> shmem_fallocate(), so the page can be writable within that small window).
> That's severe data loss.
>
> Let's grant userspace full control of the uffd-registered ranges, rather
> than trying such tricks.
>
> Cc: Hugh Dickins <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>

One nit below, except that

Reviewed-by: Mike Rapoport <[email protected]>

> ---
>
> Note that since there is no file-backed uffd-wp support upstream yet, the
> uffd-wp check is not really functional. However, since all the necessary
> uffd-wp concepts are already upstream, maybe it's better to do it once and
> for all.
>
> This patch comes from debugging a data loss issue while working on the
> uffd-wp support for shmem/hugetlbfs. I'm posting it out for early review
> and comments, but also because it should already start to benefit
> missing-mode userfaultfd by avoiding fault-around on reads.
> ---
> include/linux/userfaultfd_k.h | 5 +++++
> mm/memory.c | 17 +++++++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index a8e5f3ea9bb2..451d99bb3a1a 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -62,6 +62,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> return vma->vm_flags & VM_UFFD_WP;
> }
>
> +static inline bool vma_registered_userfaultfd(struct vm_area_struct *vma)
> +{
> + return userfaultfd_missing(vma) || userfaultfd_wp(vma);
> +}

We have userfaultfd_armed() that does exactly this, don't we?

> +
> static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> pte_t pte)
> {
> diff --git a/mm/memory.c b/mm/memory.c
> index eeae590e526a..ca58ada94c96 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3933,6 +3933,23 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
> int off;
> vm_fault_t ret = 0;
>
> + /*
> + * Be extremely careful with uffd-armed regions.
> + *
> + * For missing mode uffds, fault around does not help because if the
> + * page cache existed, then the page should be there already. If the
> + * page cache is not there, nothing else we can do either.
> + *
> + * For wr-protected mode uffds, erroneously faulting in those pages
> + * could let threads access them without the uffd server's awareness,
> + * which could finally cause ghostly data corruption.
> + *
> + * The idea is that userspace should govern every single page of uffd
> + * regions, including which pages to fault in.
> + */
> + if (unlikely(vma_registered_userfaultfd(vmf->vma)))
> + return 0;
> +
> nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
> mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
>
> --
> 2.26.2
>

--
Sincerely yours,
Mike.

2020-11-27 12:26:07

by Matthew Wilcox

Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote:
> For missing-mode uffds, fault-around does not help: if the page cache
> existed, then the page should be there already; if the page cache is not
> there, there is nothing else we can do either. If the fault-around code is
> destined to be helpless for userfault-missing vmas, then ideally we can skip it.

But it might have been faulted into the cache by another task, so skipping
it is bad.

> For wr-protected mode uffds, erroneously faulting in the surrounding pages
> could let threads access them without the uffd server's awareness. For
> example, when punching holes in uffd-wp registered shmem regions, we first
> try to unmap all the pages before evicting the page cache, but without
> locking the pages (see shmem_fallocate(), where unmap_mapping_range() is
> called before shmem_truncate_range()). When fault-around happens near a
> hole being punched, we might erroneously fault in the "holes" right before
> they are punched. That leaves a small window after the page becomes
> writable again and before the page cache is finally dropped (NOTE: the
> uffd-wp protect information is totally lost due to the pre-unmap in
> shmem_fallocate(), so the page can be writable within that small window). That's severe data loss.

Sounds like you have a missing page_mkwrite implementation.

> This patch comes from debugging a data loss issue while working on the
> uffd-wp support for shmem/hugetlbfs. I'm posting it out for early review
> and comments, but also because it should already start to benefit
> missing-mode userfaultfd by avoiding fault-around on reads.

A measurable difference?

2020-11-27 13:35:31

by Peter Xu

Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

On Fri, Nov 27, 2020 at 10:16:05AM +0200, Mike Rapoport wrote:
> On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote:
> > Fault-around on reads is in most cases helpful for performance, since
> > continuous memory accesses may avoid another trip through the page fault
> > path. However, it may not always work as expected.
> >
> > For example, userfaultfd-registered regions may not be the best candidates
> > for pre-faulting around reads.
> >
> > For missing-mode uffds, fault-around does not help: if the page cache
> > existed, then the page should be there already; if the page cache is not
> > there, there is nothing else we can do either. If the fault-around code is
> > destined to be helpless for userfault-missing vmas, then ideally we can
> > skip it.
> >
> > For wr-protected mode uffds, erroneously faulting in the surrounding pages
> > could let threads access them without the uffd server's awareness. For
> > example, when punching holes in uffd-wp registered shmem regions, we first
> > try to unmap all the pages before evicting the page cache, but without
> > locking the pages (see shmem_fallocate(), where unmap_mapping_range() is
> > called before shmem_truncate_range()). When fault-around happens near a
> > hole being punched, we might erroneously fault in the "holes" right before
> > they are punched. That leaves a small window after the page becomes
> > writable again and before the page cache is finally dropped (NOTE: the
> > uffd-wp protect information is totally lost due to the pre-unmap in
> > shmem_fallocate(), so the page can be writable within that small window).
> > That's severe data loss.
> >
> > Let's grant userspace full control of the uffd-registered ranges, rather
> > than trying such tricks.
> >
> > Cc: Hugh Dickins <[email protected]>
> > Cc: Andrea Arcangeli <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Mike Rapoport <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
>
> One nit below, except that
>
> Reviewed-by: Mike Rapoport <[email protected]>

Thanks!

> > +static inline bool vma_registered_userfaultfd(struct vm_area_struct *vma)
> > +{
> > + return userfaultfd_missing(vma) || userfaultfd_wp(vma);
> > +}
>
> We have userfaultfd_armed() that does exactly this, don't we?

Yes, will fix that up.

--
Peter Xu

2020-11-27 13:54:26

by Peter Xu

Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

Matthew,

Thanks for the review comments.

On Fri, Nov 27, 2020 at 12:22:24PM +0000, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote:
> > For missing-mode uffds, fault-around does not help: if the page cache
> > existed, then the page should be there already; if the page cache is not
> > there, there is nothing else we can do either. If the fault-around code is
> > destined to be helpless for userfault-missing vmas, then ideally we can skip it.
>
> But it might have been faulted into the cache by another task, so skipping
> it is bad.

Is there a real use case like that?

I thought about it; at least for qemu postcopy it doesn't work like that. I
don't think CRIU does either, but Mike could correct me.

Even if there is such a case, for example, if task A copies pages over to a
tmpfs file while task B accesses them too with uffd missing registered, the
ideal solution is still that when a page is copied task A can install the pte
for the current task, rather than relying on fault-around on reads, IMHO. I
don't know whether there's a way to do that, though.

>
> > For wr-protected mode uffds, erroneously faulting in the surrounding pages
> > could let threads access them without the uffd server's awareness. For
> > example, when punching holes in uffd-wp registered shmem regions, we first
> > try to unmap all the pages before evicting the page cache, but without
> > locking the pages (see shmem_fallocate(), where unmap_mapping_range() is
> > called before shmem_truncate_range()). When fault-around happens near a
> > hole being punched, we might erroneously fault in the "holes" right before
> > they are punched. That leaves a small window after the page becomes
> > writable again and before the page cache is finally dropped (NOTE: the
> > uffd-wp protect information is totally lost due to the pre-unmap in
> > shmem_fallocate(), so the page can be writable within that small window). That's severe data loss.
>
> Sounds like you have a missing page_mkwrite implementation.

I think it's a slightly different issue, because shmem may not know whether
the page should be writable or not. AFAIU, uffd-wp is designed and
implemented in a way that the final protect information is kept within the
ptes, so e.g. the vmas do not have a solid understanding of whether a page
should be write-protected (VM_UFFD_WP in the vma flags is a hint only, which
is also why we won't abuse it by creating tons of vmas). We tried hard to
keep that pte information, mainly _PAGE_UFFD_WP, alive even across swap
in/outs and migrations. If the pte is lost, we can't recover that
information from the page cache, at least for now.

>
> > This patch comes from debugging a data loss issue while working on the
> > uffd-wp support for shmem/hugetlbfs. I'm posting it out for early review
> > and comments, but also because it should already start to benefit
> > missing-mode userfaultfd by avoiding fault-around on reads.
>
> A measurable difference?

Nope, I didn't measure the missing case. It should really depend on whether
there's a use case for it, imho. If there is, then we may still want
fault-around (though uffd-wp might be a different story, as discussed above).
Otherwise maybe we should just avoid it for all uffd modes.

The other thing that led me to this patch (rather than only checking against
uffd-wp, in which case I'd just keep that small patch in my own tree until
posting the uffd-wp series) is that I thought about when the community will
introduce things like "minor faults" for userfaultfd. In that case we'd need
to be careful here too, otherwise we might lose some minor faults. From that
pov I'm wondering whether it's easier to just forbid kernel-side tricks like
this for uffd in general, since, as I stated previously, IMHO if a memory
region is registered with userfaultfd, it's probably always good to rely on
userspace for resolving page faults. That way we won't need to debug things
like this uffd-wp data corruption as userfaultfd evolves, which would be a
non-trivial task, at least for me...

Thanks,

--
Peter Xu

2020-11-27 18:44:22

by David Hildenbrand

Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

On 26.11.20 23:23, Peter Xu wrote:
> Fault-around on reads is in most cases helpful for performance, since
> continuous memory accesses may avoid another trip through the page fault
> path. However, it may not always work as expected.
>
> For example, userfaultfd-registered regions may not be the best candidates
> for pre-faulting around reads.

Are we getting uffd faults even though no one actually accessed the pages?
So if I were tracking what some part of my program actually reads, would I
get wrong notifications?

--
Thanks,

David / dhildenb

2020-11-27 19:21:43

by Peter Xu

Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

On Fri, Nov 27, 2020 at 06:00:44PM +0100, David Hildenbrand wrote:
> On 26.11.20 23:23, Peter Xu wrote:
> > Fault-around on reads is in most cases helpful for performance, since
> > continuous memory accesses may avoid another trip through the page fault
> > path. However, it may not always work as expected.
> >
> > For example, userfaultfd-registered regions may not be the best candidates
> > for pre-faulting around reads.
>
> Are we getting uffd faults even though no one actually accessed the pages?

Userfaultfd should only notify if someone accessed it.

> So if I were tracking what some part of my program actually reads, would I
> get wrong notifications?

For shmem, we can't track it right now, afaict. The message is only
generated if the page cache is not there.

Meanwhile, tracking page reads without page cache would be unhelpful anyway,
because they're all zeros.

Thanks,

--
Peter Xu

2020-11-28 01:28:24

by Andrea Arcangeli

Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

Hello,

On Fri, Nov 27, 2020 at 12:22:24PM +0000, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote:
> > For wr-protected mode uffds, erroneously faulting in the surrounding pages
> > could let threads access them without the uffd server's awareness. For
> > example, when punching holes in uffd-wp registered shmem regions, we first
> > try to unmap all the pages before evicting the page cache, but without
> > locking the pages (see shmem_fallocate(), where unmap_mapping_range() is
> > called before shmem_truncate_range()). When fault-around happens near a
> > hole being punched, we might erroneously fault in the "holes" right before
> > they are punched. That leaves a small window after the page becomes
> > writable again and before the page cache is finally dropped (NOTE: the
> > uffd-wp protect information is totally lost due to the pre-unmap in
> > shmem_fallocate(), so the page can be writable within that small window). That's severe data loss.
>
> Sounds like you have a missing page_mkwrite implementation.

If the real fault happened through the pagetable (which got dropped by
the hole punching), a "missing type" userfault would be delivered to
userland (because the pte would be none). Userland would invoke
UFFDIO_COPY with the UFFDIO_COPY_MODE_WP flag. Such flag would then
map the filled shmem page (not necessarily all zero and not
necessarily the old content before the hole punch) with _PAGE_RW not
set and _PAGE_UFFD_WP set, so the next write would also trigger a
wrprotect userfault (this is what the uffd-wp app expects).

filemap_map_pages doesn't notify userland when it fills a pte, and it
will map the page read-write again.

However filemap_map_pages isn't capable of filling a hole or undoing the
hole punch; all it can do are minor faults to re-fill the ptes from a
not-yet-truncated inode page.

Now it would be ok if filemap_map_pages re-filled the ptes after they
were zapped the first time by fallocate, and the fallocate then zapped
them again before truncating the page, but I don't see a second pte
zap... there's just the single call of unmap_mapping_range below:

if ((u64)unmap_end > (u64)unmap_start)
unmap_mapping_range(mapping, unmap_start,
1 + unmap_end - unmap_start, 0);
shmem_truncate_range(inode, offset, offset + len - 1);

So looking at filemap_map_pages in shmem, I'm really wondering whether
the non-uffd case is correct.

Do we end up with ptes pointing to non-pagecache pages, so that the
virtual mapping is out of sync with read/write syscalls too, which will
then allocate another shmem page out of sync with the ptes?

If a real page fault happened instead of filemap_map_pages,
shmem_fault() would block during a fallocate punch hole by checking
inode->i_private, as the comment says:

* So refrain from
* faulting pages into the hole while it's being punched.

It's not immediately clear where filemap_map_pages refrains from
faulting pages into the hole while it's being punched... given that it
ignores inode->i_private.

So I'm not exactly sure how shmem pages can safely be faulted in through
filemap_map_pages without going through shmem_fault... I suppose shmem
is simply unsafe to use filemap_map_pages and would require a specific
shmem_map_pages?

If only filemap_map_pages refrained from re-faulting ptes of a shmem
page that is about to be truncated (whose original ptes had _PAGE_RW
unset and _PAGE_UFFD_WP set), there would be no problem with the uffd
interaction. So a proper shmem_map_pages could co-exist with uffd; the
userfaultfd_armed check would then be only an optimization, not required
to avoid userland memory corruption?

From 8c6fb1b7dde309f0c8b5666a8e13557ae46369b4 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <[email protected]>
Date: Fri, 27 Nov 2020 19:12:44 -0500
Subject: [PATCH 1/1] shmem: stop faulting in pages without checking
inode->i_private

Per the shmem_fault comment, shmem needs to "refrain from faulting pages
into the hole while it's being punched", and to do so it must check
inode->i_private, which filemap_map_pages won't do. It is therefore
unsafe to use in shmem, because it can leave ptes pointing to
non-pagecache pages in shmem-backed vmas.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/shmem.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 8e2b35ba93ad..f6f29b3e67c6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3942,7 +3942,6 @@ static const struct super_operations shmem_ops = {

static const struct vm_operations_struct shmem_vm_ops = {
.fault = shmem_fault,
- .map_pages = filemap_map_pages,
#ifdef CONFIG_NUMA
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,


Thanks,
Andrea

2020-11-28 22:03:28

by Peter Xu

Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

On Fri, Nov 27, 2020 at 07:33:39PM -0500, Andrea Arcangeli wrote:
> Hello,

Hi, Andrea!

>
> On Fri, Nov 27, 2020 at 12:22:24PM +0000, Matthew Wilcox wrote:
> > On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote:
> > > For wr-protected mode uffds, erroneously faulting in the surrounding pages
> > > could let threads access them without the uffd server's awareness. For
> > > example, when punching holes in uffd-wp registered shmem regions, we first
> > > try to unmap all the pages before evicting the page cache, but without
> > > locking the pages (see shmem_fallocate(), where unmap_mapping_range() is
> > > called before shmem_truncate_range()). When fault-around happens near a
> > > hole being punched, we might erroneously fault in the "holes" right before
> > > they are punched. That leaves a small window after the page becomes
> > > writable again and before the page cache is finally dropped (NOTE: the
> > > uffd-wp protect information is totally lost due to the pre-unmap in
> > > shmem_fallocate(), so the page can be writable within that small window). That's severe data loss.
> >
> > Sounds like you have a missing page_mkwrite implementation.
>
> If the real fault happened through the pagetable (which got dropped by
> the hole punching), a "missing type" userfault would be delivered to
> userland (because the pte would be none). Userland would invoke
> UFFDIO_COPY with the UFFDIO_COPY_MODE_WP flag. Such flag would then
> map the filled shmem page (not necessarily all zero and not
> necessarily the old content before the hole punch) with _PAGE_RW not
> set and _PAGE_UFFD_WP set, so the next write would also trigger a
> wrprotect userfault (this is what the uffd-wp app expects).
>
> filemap_map_pages doesn't notify userland when it fills a pte and it
> will map again the page read-write.

Yes. A trivial supplementary detail is that filemap_map_pages() will only
set the pte read-only, since alloc_set_pte() only sets the write bit on a
write fault; in our case it's read fault-around, so no write bit. However,
the pte will quickly be marked writable once the thread writes to the page
again.

>
> However filemap_map_pages isn't capable to fill a hole and to undo the
> hole punch, all it can do are minor faults to re-fill the ptes from a
> not-yet-truncated inode page.
>
> Now it would be ok if filemap_map_pages re-filled the ptes, after they
> were zapped the first time by fallocate, and then the fallocate would
> zap them again before truncating the page, but I don't see a second
> pte zap... there's just the below single call of unmap_mapping_range:
>
> if ((u64)unmap_end > (u64)unmap_start)
> unmap_mapping_range(mapping, unmap_start,
> 1 + unmap_end - unmap_start, 0);
> shmem_truncate_range(inode, offset, offset + len - 1);

IMHO the 2nd pte zap is inside shmem_truncate_range(), where we call
truncate_inode_page() and then truncate_cleanup_page().

Since we're at it, some more context: this is actually where I started to
notice the issue, namely that we pre-unmap the whole region before
shmem_truncate_range(). The thing is, shmem_truncate_range() will zap the
ptes too, in an even safer way (with the page lock held). So before I came
up with the current patch, I also tried the patch below, and it fixes the
data corruption issue too:

-----8<-----
diff --git a/mm/shmem.c b/mm/shmem.c
index efa19e33e470..b275f401fe1f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2777,7 +2777,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
inode_lock(inode);

if (mode & FALLOC_FL_PUNCH_HOLE) {
- struct address_space *mapping = file->f_mapping;
loff_t unmap_start = round_up(offset, PAGE_SIZE);
loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
@@ -2795,9 +2794,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
inode->i_private = &shmem_falloc;
spin_unlock(&inode->i_lock);

- if ((u64)unmap_end > (u64)unmap_start)
- unmap_mapping_range(mapping, unmap_start,
- 1 + unmap_end - unmap_start, 0);
shmem_truncate_range(inode, offset, offset + len - 1);
/* No need to unmap again: hole-punching leaves COWed pages */
-----8<-----

The above code has existed since the first day of the shmem fallocate code,
so I was wondering why we did it that way. One reason could be performance:
the explicit pre-unmap via unmap_mapping_range() walks the page tables once
rather than once per page, so when the hole is huge it could benefit us,
since truncate_cleanup_page() can then avoid those per-page walks because
page_mapped() is more likely to be zero:

if (page_mapped(page)) {
pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
unmap_mapping_pages(mapping, page->index, nr, false);
}

It would be good if Hugh could confirm whether my understanding is correct,
though.

As a summary, that's why I didn't try to remove the optimization (which also
fixes the issue) but instead tried to disable fault-around for uffd, which
sounds like a better thing to do.

>
> So looking at filemap_map_pages in shmem, I'm really wondering if the
> non-uffd case is correct or not.
>
> Do we end up with ptes pointing to non pagecache, so then the virtual
> mapping is out of sync also with read/write syscalls that will then
> allocate another shmem page out of sync of the ptes?
>
> If a real page fault happened instead of filemap_map_pages, the
> shmem_fault() would block during fallocate punch hole by checking
> inode->i_private, as the comment says:
>
> * So refrain from
> * faulting pages into the hole while it's being punched.
>
> It's not immediately clear where filemap_map_pages refrains from
> faulting pages into the hole while it's being punched... given it's
> ignoring inode->i_private.
>
> So I'm not exactly sure how shmem can safely faulted in through
> filemap_map_pages, without going through shmem_fault... I suppose
> shmem simply is unsafe to use filemap_map_pages and it'd require
> a specific shmem_map_pages?
>
> If only filemap_map_pages was refraining re-faulting ptes of a shmem
> page that is about to be truncated (whose original ptes had _PAGE_RW
> unset and _PAGE_UFFD_WP set) there would be no problem with the uffd
> interaction. So a proper shmem_map_pages could co-exist with uffd, the
> userfaultfd_armed check would be only an optimization but it wouldn't
> be required to avoid userland memory corruption?

So far I still think it's necessary to have this patch, or an equivalent, to
avoid fault-around.

As discussed above, the current map_pages of shmem (the generic
filemap_map_pages) should be safe without uffd as long as we remove the ptes
correctly with the page lock held (which should be exactly what we do right
now in shmem_truncate_range).

But since we have that pre-unmap optimization, uffd-wp is no longer safe,
because of the race this patch tries to avoid.

And if my understanding above is correct, we may still want to keep the
shmem fault-around logic; it just may not be suitable for uffd-wp, or uffd
in general.

Thanks!

>
> From 8c6fb1b7dde309f0c8b5666a8e13557ae46369b4 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <[email protected]>
> Date: Fri, 27 Nov 2020 19:12:44 -0500
> Subject: [PATCH 1/1] shmem: stop faulting in pages without checking
> inode->i_private
>
> Per the shmem_fault comment, shmem needs to "refrain from faulting pages
> into the hole while it's being punched", and to do so it must check
> inode->i_private, which filemap_map_pages won't do. It is therefore
> unsafe to use in shmem, because it can leave ptes pointing to
> non-pagecache pages in shmem-backed vmas.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> mm/shmem.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 8e2b35ba93ad..f6f29b3e67c6 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3942,7 +3942,6 @@ static const struct super_operations shmem_ops = {
>
> static const struct vm_operations_struct shmem_vm_ops = {
> .fault = shmem_fault,
> - .map_pages = filemap_map_pages,
> #ifdef CONFIG_NUMA
> .set_policy = shmem_set_policy,
> .get_policy = shmem_get_policy,
>
>
> Thanks,
> Andrea
>

--
Peter Xu

2020-12-01 19:13:16

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

Hi Peter,

On Sat, Nov 28, 2020 at 10:29:03AM -0500, Peter Xu wrote:
> Yes. A trivial supplementary detail is that filemap_map_pages() will only set
> it read-only, since alloc_set_pte() will only set the write bit if it's a write. In
> our case it's a read fault-around, so the bit is absent. However it'll quickly be
> marked as writable as soon as the thread writes to it again.

The problem here is that all the information injected into the kernel
through the UFFDIO_WRITEPROTECT syscalls is wiped by the quoted
unmap_mapping_range.

We can later discuss whether we actually check _PAGE_UFFD_WP, but that's not
the first thing to solve here.

Here we need to find a way to retain _PAGE_UFFD_WP until the page is
actually locked and is atomically disappearing from the xarray index
as far as further shmem page faults are concerned.

Adding that information to the xarray isn't ideal because it's mm
specific and not a property of the "pagecache", but keeping that
information in the pagetables as we do for anon memory is also causing
issues as shown below, because the shmem pagetables are supposed to be
zapped at any time and be re-created on the fly based on the
xarray. By contrast, this never happens with anon memory.

For example, when shmem swaps, the swap entry is written in the xarray,
not in the pagetable; how are we going to let _PAGE_UFFD_WP survive
swapping of a shmem page if we store _PAGE_UFFD_WP in the pte? Did you
test swap?

So here I'm afraid there's something more fundamental in how to have
_PAGE_UFFD_WP survive swapping and a random pte zap, and we need more
infrastructure in the unmap_mapping_range to retain that information
so then swapping also works then.

So I don't think we can evaluate this patch without seeing the full
patchset and how the rest is being handled. Still it's great you found
the source of the corruption for the current shmem uffd-wp codebase so
we can move forward!

> > However filemap_map_pages isn't capable to fill a hole and to undo the
> > hole punch, all it can do are minor faults to re-fill the ptes from a
> > not-yet-truncated inode page.
> >
> > Now it would be ok if filemap_map_pages re-filled the ptes, after they
> > were zapped the first time by fallocate, and then the fallocate would
> > zap them again before truncating the page, but I don't see a second
> > pte zap... there's just the below single call of unmap_mapping_range:
> >
> > if ((u64)unmap_end > (u64)unmap_start)
> > unmap_mapping_range(mapping, unmap_start,
> > 1 + unmap_end - unmap_start, 0);
> > shmem_truncate_range(inode, offset, offset + len - 1);
>
> IMHO the 2nd pte zap is inside shmem_truncate_range(), where we will call
> truncate_inode_page() then onto truncate_cleanup_page().

I wrongly assumed there was only one invalidate, and I guessed that is what
caused the problem since it ran outside the page lock, so it wasn't
"atomic" with respect to the page truncation in the xarray.

The comment "refrain from faulting pages into the hole while it's
being punched" and the "inode->i_private waitqueue" logic is just to
slowly throttle the faulting so the truncation is faster. It's still
not entirely clear why it's needed, since shmem_getpage_gfp uses
find_lock_entry and the page had to be locked. I didn't look at the
history to see when those changes were added and whether it's still needed.

Anyway it was already suspect that "unlikely(inode->i_private)" would
be enough as an out-of-order check to fully serialize things, but for a
performance "throttling" logic it's fine and non-concerning (worst
case it is not needed and the page lock waitqueue would have
serialized the faults against the slow invalidate loops already).

> Since we're at it, some more context: this is actually where I started to
> notice the issue, that we'll try to pre-unmap the whole region first before
> shmem_truncate_range(). The thing is shmem_truncate_range() will zap the ptes
> too, in an even safer way (with page lock held). So before I came up with the
> current patch, I also tried the below patch, and it also fixes the data
> corruption issue:
>
> -----8<-----
> diff --git a/mm/shmem.c b/mm/shmem.c
> index efa19e33e470..b275f401fe1f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2777,7 +2777,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> inode_lock(inode);
>
> if (mode & FALLOC_FL_PUNCH_HOLE) {
> - struct address_space *mapping = file->f_mapping;
> loff_t unmap_start = round_up(offset, PAGE_SIZE);
> loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
> DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
> @@ -2795,9 +2794,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> inode->i_private = &shmem_falloc;
> spin_unlock(&inode->i_lock);
>
> - if ((u64)unmap_end > (u64)unmap_start)
> - unmap_mapping_range(mapping, unmap_start,
> - 1 + unmap_end - unmap_start, 0);
> shmem_truncate_range(inode, offset, offset + len - 1);
> /* No need to unmap again: hole-punching leaves COWed pages */
> -----8<-----
>
> Above code existed starting from the 1st day of shmem fallocate code, so I was
> thinking why we did that. One reason could be for performance, since this
> pre-unmap of explicit unmap_mapping_range() will try to walk the page tables
> once rather than walking for every single page, so when the hole is huge it
> could benefit us since truncate_cleanup_page() is able to avoid those per-page
> walks then because page_mapped() could be more likely to be zero:
>
> if (page_mapped(page)) {
> pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
> unmap_mapping_pages(mapping, page->index, nr, false);
> }
>
> It would be good if Hugh can help confirm whether my understanding is correct,
> though..
>
> As a summary, that's why I didn't try to remove the optimization (which fixes
> the issue too) but instead tried to disable fault-around for uffd,
> which sounds like a better thing to do.

I'm afraid the real problem is the above invalidate though, and not
the invalidate itself but the fact uffd-wp shmem cannot cope with it
and we need to find a way to have the _PAGE_UFFD_WP information
survive even without applying your above patch.

> So far I still think it's necessary to have this patch or an equivalent to avoid
> fault-around.
>
> As discussed above, current map_pages of shmem (which is the general
> filemap_map_pages) should be safe without uffd as long as we'll remove ptes
> correctly, but also with page lock held (that should be exactly what we do
> right now in shmem_truncate_range).
>
> But since we have that optimization on pre-unmap, then uffd-wp is not safe any
> more, because of the race that this patch tries to avoid.
>
> And if my understanding above is correct, we may still want to keep the shmem
> fault-around logic; it just may not be suitable for uffd-wp, or uffd in general.

Since there's the second invalidate in truncate_inode_page under the
page lock, I agree it means filemap_map_pages is perfectly safe for
shmem.

However it also means filemap_map_pages is a 100% bypass as far as
userfaultfd is concerned. That applies to all modes, including missing
faults, not just wp faults.

In fact, no matter the workload, given the current semantics of uffd
on file-backed mappings, it's always guaranteed there cannot be even a
single extra userfault, regardless of whether filemap_map_pages is
active or inactive in shmem, hugetlbfs or any other filesystem.

So it's not intuitive why it shouldn't be suitable for uffd, given its
presence is also completely unnoticeable to userland. Despite the
fault-around name, it's not a real fault.

If David later sends the work to handle minor faults, the above will
change (so that the huge final dirty bitmap can be delivered to the
destination node while the guest already runs in the destination),
because the uffd semantics will be extended to cover minor faults. But
the current upstream only triggers userfaults on a file hole, so major
faults only, and filemap_map_pages by definition cannot fill a file
hole.

So again it's not clear why we should even care if filemap_map_pages
is active or not, when uffd is armed, no matter if in wp or missing
mode.

More interesting is how to retain the _PAGE_UFFD_WP during swapping,
which will get rid of all pagetables too, since that looks like a much
wider loss of info than the above race, and it looks like the same issue.

I didn't see the full patchset yet, so I cannot tell if you solved
swapping some other way already, but until that is clear, the above
looks like a minor concern, and whatever solution works to retain the
_PAGE_UFFD_WP information during shmem swapping should also solve the
above without extra changes to filemap_map_pages and the fault-around
logic if the uffd is armed.

What I'm really saying is that there's no point in applying this patch
until we see the full patchset of shmem uffd-wp, and then it's possible
to evaluate if there are no other losses of the _PAGE_UFFD_WP bit.

Anon memory is completely different: it's impossible to lose
_PAGE_UFFD_WP there, since the swap entry format contains it; for
shmem the pte is zero instead during swap.

Thanks!
Andrea

2020-12-01 21:36:09

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

On Sat, 28 Nov 2020, Peter Xu wrote:
> On Fri, Nov 27, 2020 at 07:33:39PM -0500, Andrea Arcangeli wrote:
>
> > Now it would be ok if filemap_map_pages re-filled the ptes, after they
> > were zapped the first time by fallocate, and then the fallocate would
> > zap them again before truncating the page, but I don't see a second
> > pte zap... there's just the below single call of unmap_mapping_range:
> >
> > if ((u64)unmap_end > (u64)unmap_start)
> > unmap_mapping_range(mapping, unmap_start,
> > 1 + unmap_end - unmap_start, 0);
> > shmem_truncate_range(inode, offset, offset + len - 1);
>
> IMHO the 2nd pte zap is inside shmem_truncate_range(), where we will call
> truncate_inode_page() then onto truncate_cleanup_page().

Correct.

>
> Since we're at it, some more context: this is actually where I started to
> notice the issue, that we'll try to pre-unmap the whole region first before
> shmem_truncate_range(). The thing is shmem_truncate_range() will zap the ptes
> too, in an even safer way (with page lock held). So before I came up with the
> current patch, I also tried the below patch, and it also fixes the data
> corruption issue:
>
> -----8<-----
> diff --git a/mm/shmem.c b/mm/shmem.c
> index efa19e33e470..b275f401fe1f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2777,7 +2777,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> inode_lock(inode);
>
> if (mode & FALLOC_FL_PUNCH_HOLE) {
> - struct address_space *mapping = file->f_mapping;
> loff_t unmap_start = round_up(offset, PAGE_SIZE);
> loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
> DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
> @@ -2795,9 +2794,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> inode->i_private = &shmem_falloc;
> spin_unlock(&inode->i_lock);
>
> - if ((u64)unmap_end > (u64)unmap_start)
> - unmap_mapping_range(mapping, unmap_start,
> - 1 + unmap_end - unmap_start, 0);
> shmem_truncate_range(inode, offset, offset + len - 1);
> /* No need to unmap again: hole-punching leaves COWed pages */
> -----8<-----
>
> Above code existed starting from the 1st day of shmem fallocate code, so I was
> thinking why we did that. One reason could be for performance, since this
> pre-unmap of explicit unmap_mapping_range() will try to walk the page tables
> once rather than walking for every single page, so when the hole is huge it
> could benefit us since truncate_cleanup_page() is able to avoid those per-page
> walks then because page_mapped() could be more likely to be zero:
>
> if (page_mapped(page)) {
> pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
> unmap_mapping_pages(mapping, page->index, nr, false);
> }
>
> It would be good if Hugh can help confirm whether my understanding is correct,
> though..

Correct. Code duplicated from mm/truncate.c, but with more comments
over there, in truncate_pagecache() and in truncate_pagecache_range().

>
> As a summary, that's why I didn't try to remove the optimization (which fixes
> the issue too) but instead tried to disable fault-around for uffd,
> which sounds like a better thing to do.
>
> >
> > So looking at filemap_map_pages in shmem, I'm really wondering if the
> > non-uffd case is correct or not.
> >
> > Do we end up with ptes pointing to non pagecache, so then the virtual
> > mapping is out of sync also with read/write syscalls that will then
> > allocate another shmem page out of sync of the ptes?

No, as you point out, the second pte zap, under page lock, keeps it safe.

> >
> > If a real page fault happened instead of filemap_map_pages, the
> > shmem_fault() would block during fallocate punch hole by checking
> > inode->i_private, as the comment says:
> >
> > * So refrain from
> > * faulting pages into the hole while it's being punched.
> >
> > It's not immediately clear where filemap_map_pages refrains from
> > faulting pages into the hole while it's being punched... given it's
> > ignoring inode->i_private.

Please don't ever rely on that i_private business for correctness: as
more of the comment around "So refrain from" explains, it was added
to avoid a livelock with the trinity fuzzer, and I'd gladly delete
it all, if I could ever be confident that enough has changed in the
intervening years that it's no longer necessary.

It tends to prevent shmem faults racing hole punches in the same area
(without quite guaranteeing no race at all). But without the livelock
issue, I'd just have gone on letting them race. "Punch hole" ==
"Lose data" - though I can imagine that UFFD would like more control
over that. Since map_pages is driven by faulting, it should already
be throttled too.

But Andrea in next mail goes on to see other issues with UFFD_WP
in relation to shmem swap, so this thread is probably dead now.
If there's a bit to spare for UFFD_WP in the anonymous swap entry,
then that bit should be usable for shmem too: but a shmem page
(and its swap entry) can be shared between many different users,
so I don't know whether that will make sense.

Hugh

2020-12-01 23:45:51

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads

Hi Hugh,

On Tue, Dec 01, 2020 at 01:31:21PM -0800, Hugh Dickins wrote:
> Please don't ever rely on that i_private business for correctness: as

The out-of-order and lockless "if (inode->i_private)" branch didn't
inspire much confidence that it could be relied on for locking
correctness, indeed.

> more of the comment around "So refrain from" explains, it was added
> to avoid a livelock with the trinity fuzzer, and I'd gladly delete
> it all, if I could ever be confident that enough has changed in the
> intervening years that it's no longer necessary.

I was wondering if it's still needed, so now I checked the old code
and it also did an unconditional lock_page in shmem_fault, so I assume
it's still necessary.

> It tends to prevent shmem faults racing hole punches in the same area
> (without quite guaranteeing no race at all). But without the livelock
> issue, I'd just have gone on letting them race. "Punch hole" ==
> "Lose data" - though I can imagine that UFFD would like more control
> over that. Since map_pages is driven by faulting, it should already
> be throttled too.

Yes, I see now it just needs to "eventually" stop the shmem_fault
activity; it doesn't need to catch those faults already in flight, so
it cannot be relied upon as the form of serialization to zap the
pagetables while truncating the page.

> But Andrea in next mail goes on to see other issues with UFFD_WP
> in relation to shmem swap, so this thread is probably dead now.
> If there's a bit to spare for UFFD_WP in the anonymous swap entry,
> then that bit should be usable for shmem too: but a shmem page
> (and its swap entry) can be shared between many different users,
> so I don't know whether that will make sense.

A UFFD_WP software bit exists both for the present pte and for the
swap entry:

#define _PAGE_UFFD_WP (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
#define _PAGE_SWP_UFFD_WP _PAGE_USER

It works similarly to soft dirty, if the page is swapped _PAGE_UFFD_WP
it's translated to _PAGE_SWP_UFFD_WP and the other way around during
swapin.

The problem is that the bit represents information that is specific
to an mm and a single mapping.

If you punch a hole in a shmem file, map the shmem file in two
processes A and B, and register the mapped range in a uffd opened by
process A using VM_UFFD_MISSING, only process A will generate
userfaults if you access the virtual address that corresponds to the
hole in the file; process B will still fill the hole normally and it
won't trigger userfaults in process A.

The uffd context is attached to an mm, and the notification only has
effect on that mm, the mm that created the context. Only the UFFDIO_
operations that resolve the fault (like UFFDIO_COPY) have effect
visible by all file sharers.

So VM_UFFD_WP shall work the same, and the _PAGE_SWP_UFFD_WP
information of a swapped-out shmem page shouldn't go in the xarray,
because doing so would share it with all "mm" sharing the mapping.

Currently uffd-wp only works on anon memory so this is a new challenge
in extending uffd-wp to shmem it seems.

If any pagetable of an anonymous memory mapping is zapped, then not
just the _PAGE_SWP_UFFD_WP bit is lost, the whole data is lost too, so
it'd break not just uffd-wp. By contrast, with shmem the ptes can be
zapped at any time by memory pressure alone, even before the final
->writepage swap-out is invoked.

It's always possible to make uffd-wp work without those bits (anon
memory would also work without those bits), but the reason those bits
exist is to avoid a flood of UFFDIO_WRITEPROTECT ioctl false negatives
after fork or after swapping (in the anon case). In the shmem case, if
we did it without a bitflag to track which page was wrprotected in which
vma, the uffd handler would be required to mark each pte writable with
a syscall every time a new pte has been established on the range and
there's a write to the page.

One possibility could be to store the bit set by UFFDIO_WRITEPROTECT
in a vmalloc bitmap associated with the shmem vma at uffd registration
time, so it can be checked during the shmem_fault() if VM_UFFD_WP is
set on the vma? The vma_splits and vma_merge would be the hard part.

Another possibility would be to change how the pte invalidates work for
vmas with VM_UFFD_WP set: the pte and the _PAGE_UFFD_WP would need to
survive all invalidates; only the final truncation under page lock
would really be allowed to clear the whole pte, destroy the
_PAGE_UFFD_WP information in the pte and then free the pgtables on
those ranges.

Once we find a way to make that bit survive the invalidates, we can
check it in shmem_fault to decide whether to invoke handle_userfault
(if the bit is set and FAULT_FLAG_WRITE is set in vmf->flags) or to map
the page wrprotected in that vaddr (if the bit is set and
FAULT_FLAG_WRITE is not set).

Thanks,
Andrea