2020-12-23 02:59:42

by Andrea Arcangeli

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote:
> We are talking about non-COW anon pages here -- they can't be mapped
> more than once. So why not just identify them by checking
> page_mapcount == 1 and then unconditionally reuse them? (This is
> probably where I've missed things.)

The problem with depending on page_mapcount to decide if a page is
COW or non-COW (respectively wp_page_copy or wp_page_reuse) is that
GUP may elevate the count of a COW anon page that later becomes a
non-COW anon page.

This is Jann's idea not mine.

The problem is we have an unprivileged long term GUP like vmsplice
that facilitates elevating the page count indefinitely, until the
parent finally writes a secret to it. Theoretically a short term pin
would do it too, so it's not just vmsplice, but the short term pin
would be incredibly more challenging to turn into a concern, since
it'd kill a phone battery and flash before it could read any data.

So what happens with your page_mapcount == 1 check is that it doesn't
mean non-COW (we thought it did, until it didn't for the long term
GUP pin in vmsplice).

Jann's testcase does fork() to raise page_mapcount and page_count to
2, takes an unprivileged, indefinitely long GUP pin with vmsplice to
raise page_count to 3, queues the page in the pipe with page_count
elevated, then munmaps to drop page_count to 2 and page_mapcount to
1.

page_mapcount is 1, so you'd think the page is non-COW and owned by
the parent, but the child can still read it so it's very much still
wp_page_copy material if the parent tries to modify it. Otherwise the
child can read the content.

This was supposed to be solvable by just doing the COW in the
gup(write=0) case if page_mapcount > 1, with commit 17839856fd58. I'm
not exactly sure why that didn't fly and had to be reverted by Peter
in a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a, but while that was
happening I was sidetracked by urgent issues and I didn't manage to
look back at how we ended up with the big hammer page_count == 1
check instead, to decide whether to call wp_page_reuse or
wp_page_shared.

So anyway, the only thing that is clear to me is that keeping the
child from reading the page_mapcount == 1 pages of the parent is the
only reason why wp_page_reuse(vmf) will only be called on
page_count(page) == 1 and not on page_mapcount(page) == 1.

It's also the reason why your page_mapcount assumption risks
reintroducing the issue, and I only wish we could put the
page_mapcount == 1 check back there.

Still, even if we put page_mapcount back there, it is not ok to leave
the page fault with stale TLB entries and rely on the fact that
wp_page_shared won't run. That would also avoid the problem, but I
think if you leave stale TLB entries in change_protection just like
NUMA balancing does, it also requires a catcher just like NUMA
balancing has, or it'd truly work by luck.

So until we can put a page_mapcount == 1 check back there, the
page_count will be unreliable by definition, because the speculative
lookups randomly elevate all non-zero page_counts at any time in the
background on all pages, so you will never be able to tell if a page
is a true COW or just a spurious COW caused by a speculative lookup.
It is impossible to differentiate a speculative lookup from a
vmsplice ref in a child.

Thanks,
Andrea


2020-12-23 03:38:10

by Yu Zhao

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 09:56:11PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote:
> > We are talking about non-COW anon pages here -- they can't be mapped
> > more than once. So why not just identify them by checking
> > page_mapcount == 1 and then unconditionally reuse them? (This is
> > probably where I've missed things.)
>
> The problem with depending on page_mapcount to decide if a page is
> COW or non-COW (respectively wp_page_copy or wp_page_reuse) is that
> GUP may elevate the count of a COW anon page that later becomes a
> non-COW anon page.
>
> This is Jann's idea not mine.
>
> The problem is we have an unprivileged long term GUP like vmsplice
> that facilitates elevating the page count indefinitely, until the
> parent finally writes a secret to it. Theoretically a short term pin
> would do it too, so it's not just vmsplice, but the short term pin
> would be incredibly more challenging to turn into a concern, since
> it'd kill a phone battery and flash before it could read any data.
>
> So what happens with your page_mapcount == 1 check is that it doesn't
> mean non-COW (we thought it did, until it didn't for the long term
> GUP pin in vmsplice).
>
> Jann's testcase does fork() to raise page_mapcount and page_count to
> 2, takes an unprivileged, indefinitely long GUP pin with vmsplice to
> raise page_count to 3, queues the page in the pipe with page_count
> elevated, then munmaps to drop page_count to 2 and page_mapcount to
> 1.
>
> page_mapcount is 1, so you'd think the page is non-COW and owned by
> the parent, but the child can still read it so it's very much still
> wp_page_copy material if the parent tries to modify it. Otherwise the
> child can read the content.
>
> This was supposed to be solvable by just doing the COW in the
> gup(write=0) case if page_mapcount > 1, with commit 17839856fd58. I'm
> not exactly sure why that didn't fly and had to be reverted by Peter
> in a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a, but while that was
> happening I was sidetracked by urgent issues and I didn't manage to
> look back at how we ended up with the big hammer page_count == 1
> check instead, to decide whether to call wp_page_reuse or
> wp_page_shared.
>
> So anyway, the only thing that is clear to me is that keeping the
> child from reading the page_mapcount == 1 pages of the parent is the
> only reason why wp_page_reuse(vmf) will only be called on
> page_count(page) == 1 and not on page_mapcount(page) == 1.
>
> It's also the reason why your page_mapcount assumption risks
> reintroducing the issue, and I only wish we could put the
> page_mapcount == 1 check back there.
>
> Still, even if we put page_mapcount back there, it is not ok to leave
> the page fault with stale TLB entries and rely on the fact that
> wp_page_shared won't run. That would also avoid the problem, but I
> think if you leave stale TLB entries in change_protection just like
> NUMA balancing does, it also requires a catcher just like NUMA
> balancing has, or it'd truly work by luck.
>
> So until we can put a page_mapcount == 1 check back there, the
> page_count will be unreliable by definition, because the speculative
> lookups randomly elevate all non-zero page_counts at any time in the
> background on all pages, so you will never be able to tell if a page
> is a true COW or just a spurious COW caused by a speculative lookup.
> It is impossible to differentiate a speculative lookup from a
> vmsplice ref in a child.

Thanks for the details.

In your patch, do we need to take wrprotect_rwsem in
handle_userfault() as well? Otherwise, it seems userspace would have
to synchronize between its wrprotect ioctl and fault handler? I.e.,
the fault handler needs to be aware that the content of write-
protected pages can actually change before the ioctl returns.

2020-12-23 15:55:38

by Peter Xu

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> In your patch, do we need to take wrprotect_rwsem in
> handle_userfault() as well? Otherwise, it seems userspace would have
> to synchronize between its wrprotect ioctl and fault handler? I.e.,
> the fault handler needs to be aware that the content of write-
> protected pages can actually change before the ioctl returns.

The handle_userfault() thread should be sleeping until another
uffd_wp_resolve fixes the page fault for it. However, when the
uffd_wp_resolve ioctl comes, the rwsem (either the group rwsem lock
as Andrea proposed, or the mmap_sem, or any new rwsem lock we'd like
to introduce, maybe per-uffd rather than per-mm) should have
guaranteed that the previous wr-protect ioctls are finished and the
tlb has been flushed before this thread continues.

And I don't know why it matters even if the data changed - IMHO what uffd-wp
wants to do is simply to make sure that after the wr-protect ioctl
returns to userspace, no change on the page should ever happen
anymore. So "whether data changed" seems to matter more on the ioctl
thread than on the handle_userfault() thread. IOW, I think data
changes before the tlb flush but after the pte wr-protect are always
fine - but that's not fine anymore once the syscall returns.

Thanks,

--
Peter Xu

2020-12-23 21:13:02

by Andrea Arcangeli

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 10:52:35AM -0500, Peter Xu wrote:
> On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > In your patch, do we need to take wrprotect_rwsem in
> > handle_userfault() as well? Otherwise, it seems userspace would have
> > to synchronize between its wrprotect ioctl and fault handler? I.e.,
> > the fault handler needs to be aware that the content of write-
> > protected pages can actually change before the ioctl returns.
>
> The handle_userfault() thread should be sleeping until another
> uffd_wp_resolve fixes the page fault for it. However, when the
> uffd_wp_resolve ioctl comes, the rwsem (either the group rwsem lock
> as Andrea proposed, or the mmap_sem, or any new rwsem lock we'd like
> to introduce, maybe per-uffd rather than per-mm) should have
> guaranteed that the previous wr-protect ioctls are finished and the
> tlb has been flushed before this thread continues.
>
> And I don't know why it matters even if the data changed - IMHO what uffd-wp

The data will change indeed and it's fine.

> wants to do is simply to make sure that after the wr-protect ioctl
> returns to userspace, no change on the page should ever happen
> anymore. So "whether data changed" seems to matter more on the ioctl
> thread than on the handle_userfault() thread. IOW, I think data
> changes before the tlb flush but after the pte wr-protect are always
> fine - but that's not fine anymore once the syscall returns.

Agreed.

From the userland point of view, all that matters is that writes
through the stale TLB entries will stop in both of these two cases:

1) before returning from the UFFDIO_WRITEPROTECT(mode_wp = true) ioctl syscall

2) before a parallel UFFDIO_WRITEPROTECT(mode_wp = false) can clear
the _PAGE_UFFD_WP marker in the pte/hugepmd under the PT lock,
assuming the syscall at point 1) is still in flight

Both points are guaranteed at all times by the group lock now, so
userland cannot even measure or perceive the existence of any stale
TLB at any given time in the whole uffd-wp workload.

So it's perfectly safe and identical as NUMA balancing and requires
zero extra locking in handle_userfault().

handle_userfault() is a dead end that simply waits, and when it's the
right time it restarts the page fault. It can have occasional false
positives after f9bf352224d7d4612b55b8d0cd0eaa981a3246cf, false
positives as in restarting too soon, but even then it's perfectly
safe, since it's equivalent to one more CPU hitting the page fault
path. As long as the marker is there, any spurious userfault will
re-enter handle_userfault().

handle_userfault() doesn't care about the data, and in turn it
couldn't care less about any stale TLB either. Userland cares, but
userland cannot make any assumption about writes being fully stopped
until the ioctl has returned, and by that time the pending flush will
be done; in fact, by the time userland can make any assumption, the
mmap_write_lock would also have been released with the first proposed
patch.

Thanks,
Andrea

2020-12-23 21:42:11

by Andrea Arcangeli

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> Thanks for the details.

I hope we can find a way to put the page_mapcount back where there's
a page_count right now.

If you're so worried about having to maintain a well defined, well
documented (or to be documented even better if you ACK it)
marker/catcher for userfaultfd_writeprotect, I can't see how you
could consider maintaining the page fault safe against any random
code leaving too permissive TLB entries out of sync with the more
restrictive pte permissions, as was happening with clear_refs_write,
which worked by luck until page_mapcount was changed to page_count.

page_count is far from optimal, but it is a feature that it finally
allowed us to notice that various code (clear_refs_write included,
apparently even after the fix) leaves stale too permissive TLB
entries when it shouldn't.

The question is only which way you prefer to fix clear_refs_write,
and I don't think we can deviate from the 3 methods that already
exist today. So clear_refs_write will have to pick one of those, and
currently it doesn't fall in the same category as mprotect even after
the fix.

I think only if clear_refs_write starts to take the mmap_write_lock
and really starts to operate like mprotect can we consider making
userfaultfd_writeprotect also operate like mprotect.

Even then I'd hope we can at least be allowed to make it operate like
KSM's write_protect_page for len <= HPAGE_PMD_SIZE, something
clear_refs_write cannot do since it works in O(N) and tends to scan
everything at once, so there would be no point in optimizing not to
defer the flush for a process with a tiny amount of virtual memory
mapped.

vm86 should also be fixed to fall in the same category as mprotect,
since performance there is irrelevant.

Thanks,
Andrea

2020-12-23 22:31:26

by Yu Zhao

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 04:39:00PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > Thanks for the details.
>
> I hope we can find a way to put the page_mapcount back where there's
> a page_count right now.
>
> If you're so worried about having to maintain a well defined, well
> documented (or to be documented even better if you ACK it)
> marker/catcher for userfaultfd_writeprotect, I can't see how you
> could consider maintaining the page fault safe against any random
> code leaving too permissive TLB entries out of sync with the more
> restrictive pte permissions, as was happening with clear_refs_write,
> which worked by luck until page_mapcount was changed to page_count.
>
> page_count is far from optimal, but it is a feature that it finally
> allowed us to notice that various code (clear_refs_write included,
> apparently even after the fix) leaves stale too permissive TLB
> entries when it shouldn't.
>
> The question is only which way you prefer to fix clear_refs_write,
> and I don't think we can deviate from the 3 methods that already
> exist today. So clear_refs_write will have to pick one of those, and
> currently it doesn't fall in the same category as mprotect even after
> the fix.
>
> I think only if clear_refs_write starts to take the mmap_write_lock
> and really starts to operate like mprotect can we consider making
> userfaultfd_writeprotect also operate like mprotect.
>
> Even then I'd hope we can at least be allowed to make it operate like
> KSM's write_protect_page for len <= HPAGE_PMD_SIZE, something
> clear_refs_write cannot do since it works in O(N) and tends to scan
> everything at once, so there would be no point in optimizing not to
> defer the flush for a process with a tiny amount of virtual memory
> mapped.
>
> vm86 should also be fixed to fall in the same category as mprotect,
> since performance there is irrelevant.

I was hesitant to suggest the following because it isn't that
straightforward. But since you seem to be less concerned with the
complexity, I'll just bring it to the table -- it would take care of
both ufd and clear_refs_write, wouldn't it?

diff --git a/mm/memory.c b/mm/memory.c
index 5e9ca612d7d7..af38c5ee327e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
goto unlock;
}
if (vmf->flags & FAULT_FLAG_WRITE) {
- if (!pte_write(entry))
+ if (!pte_write(entry)) {
+ if (mm_tlb_flush_pending(vmf->vma->vm_mm))
+ flush_tlb_page(vmf->vma, vmf->address);
return do_wp_page(vmf);
+ }
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);

2020-12-23 23:09:48

by Andrea Arcangeli

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 03:29:51PM -0700, Yu Zhao wrote:
> I was hesitant to suggest the following because it isn't that
> straightforward. But since you seem to be less concerned with the
> complexity, I'll just bring it to the table -- it would take care of
> both ufd and clear_refs_write, wouldn't it?

It certainly would, since this is basically declaring that "leaving
stale TLB entries past mmap_read_lock" is now permitted as long as
the pending flush counter is elevated under mmap_sem for reading.

Anything that prevents uffd-wp from taking the mmap_write_lock looks
great to me, anything, the below included, as long as it looks easy
to enforce and understand. And the below certainly is.

My view is that the below is at the very least an improvement in terms
of total complexity, compared to v5.10. At least it'll be documented.

So what would not be ok to me is to depend on undocumented, not
guaranteed behavior in do_wp_page like the page_mapcount, which is
what we had until now in clear_refs_write and in uffd-wp (but only if
wrprotect raced against un-wrprotect, a tiny window compared to
clear_refs_write).

Documenting that clearing pte_write and deferring the flush is
allowed if mm_tlb_flush_pending was elevated before taking the PT
lock is a less complex and very well defined rule, compared to the
page_mapcount dependency of clear_refs_write we had before, which was
prone to break, and in fact it just did.

We'll need a commentary in both userfaultfd_writeprotect and
clear_refs_write that links to the below snippet.

If we abstract it in a header we can also hide there a #ifdef
CONFIG_HAVE_ARCH_SOFT_DIRTY=y && CONFIG_HAVE_ARCH_USERFAULTFD_WP=y &&
CONFIG_USERFAULTFD=y to make it even more explicit.

However it may be simpler to keep it unconditional; I don't mind
either way. If it were up to me I'd tie it to those 3 config options
to be sure I remember where it comes from, and to force any other
user to register, so it's explicit that they depend on it.

>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> goto unlock;
> }
> if (vmf->flags & FAULT_FLAG_WRITE) {
> - if (!pte_write(entry))
> + if (!pte_write(entry)) {
> + if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> + flush_tlb_page(vmf->vma, vmf->address);
> return do_wp_page(vmf);
> + }
> entry = pte_mkdirty(entry);
> }
> entry = pte_mkyoung(entry);
>

2020-12-23 23:41:51

by Linus Torvalds

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 1:39 PM Andrea Arcangeli <[email protected]> wrote:
>
> On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > Thanks for the details.
>
> I hope we can find a way to put the page_mapcount back where there's
> a page_count right now.

I really don't think that's ever going to happen - at least if you're
talking about do_wp_page().

I refuse to make the *simple* core VM operations jump through hoops,
and the old COW mapcount logic was exactly that. I *much* prefer the
newer COW code, because the rules are more straightforward.

> page_count is far from optimal, but it is a feature that it finally
> allowed us to notice that various code (clear_refs_write included,
> apparently even after the fix) leaves stale too permissive TLB
> entries when it shouldn't.

I absolutely agree that page_count isn't exactly optimal, but
"mapcount" is just so much worse.

page_count() is at least _logical_, and has a very clear meaning:
"this page has other users". mapcount() means something else, and is
simply not sufficient or relevant wrt COW.

That doesn't mean that page_mapcount() is wrong - it's just that it's
wrong for COW. page_mapcount() is great for rmap, so that we can see
when we need to shoot down a memory mapping of a page that gets
released (truncate being the classic example).

I think that the mapcount games we used to have were horrible. I
absolutely much prefer where we are now wrt COW.

The modern rules for COW handling are:

- if we got a COW fault there might be another user, we copy (and
this is where the page count makes so much logical sense).

- if somebody needs to pin the page in the VM, we either make sure
that it is pre-COWed and we

(a) either never turn it back into a COW page (ie the fork()-time
stuff we have for pinned pages)

(b) or there is some explicit marker on the page or in the page
table (ie the userfaultfd_pte_wp thing).

those are _so_ much more straightforward than the very complex rules
we used to have that were actively buggy, in addition to requiring the
page lock. So they were buggy and slow.

And yes, I had forgotten about that userfaultfd_pte_wp() because I was
being myopic and only looking at wp_page_copy(). So using that as a
way to make sure that a page doesn't go through COW is a good way to
avoid the COW race, but I think that thing requires a bit in the page
table which might be a problem on some architectures?

Linus

2020-12-24 01:05:31

by Andrea Arcangeli

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

Hello Linus,

On Wed, Dec 23, 2020 at 03:39:53PM -0800, Linus Torvalds wrote:
> On Wed, Dec 23, 2020 at 1:39 PM Andrea Arcangeli <[email protected]> wrote:
> >
> > On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > > Thanks for the details.
> >
> > I hope we can find a way to put the page_mapcount back where there's
> > a page_count right now.
>
> I really don't think that's ever going to happen - at least if you're
> talking about do_wp_page().

Yes, I was referring to do_wp_page().

> I refuse to make the *simple* core VM operations jump through hoops,
> and the old COW mapcount logic was exactly that. I *much* prefer the
> newer COW code, because the rules are more straightforward.

Agreed. I don't have any practical alternative proposal in fact; it
was only a hypothetical wish/hope, echoing Hugh's view on this
matter.

> > page_count is far from optimal, but it is a feature that it finally
> > allowed us to notice that various code (clear_refs_write included,
> > apparently even after the fix) leaves stale too permissive TLB
> > entries when it shouldn't.
>
> I absolutely agree that page_count isn't exactly optimal, but
> "mapcount" is just so much worse.

Agreed. The "isn't exactly optimal" part is the only explanation for
wish/hope.

> page_count() is at least _logical_, and has a very clear meaning:
> "this page has other users". mapcount() means something else, and is
> simply not sufficient or relevant wrt COW.
>
> That doesn't mean that page_mapcount() is wrong - it's just that it's
> wrong for COW. page_mapcount() is great for rmap, so that we can see
> when we need to shoot down a memory mapping of a page that gets
> released (truncate being the classic example).
>
> I think that the mapcount games we used to have were horrible. I
> absolutely much prefer where we are now wrt COW.
>
> The modern rules for COW handling are:
>
> - if we got a COW fault there might be another user, we copy (and
> this is where the page count makes so much logical sense).
>
> - if somebody needs to pin the page in the VM, we either make sure
> that it is pre-COWed and we
>
> (a) either never turn it back into a COW page (ie the fork()-time
> stuff we have for pinned pages)
>
> (b) or there is some explicit marker on the page or in the page
> table (ie the userfaultfd_pte_wp thing).
>
> those are _so_ much more straightforward than the very complex rules
> we used to have that were actively buggy, in addition to requiring the
> page lock. So they were buggy and slow.

Agreed, this is a very solid solution that solves the problem the
mapcount had with GUP pins.

The only alternative, but very vapourware, thought I had on this
matter is that all troubles start when we let a GUP-pinned page be
unmapped from the pgtables.

As I already told Jens, io_uring could use mmu notifiers already
(that would make any io_uring GUP pin not count anymore with regard
to the page_mapcount vs page_count difference), and vmsplice also
needs some handling, or maybe it should become privileged.

The above two improvements are orthogonal to the COW issue, since
long term GUP pins do more than just mlock and they break most
advanced VM features. It's not ok to just account GUP pins in the
mlock rlimit.

The above two improvements however would go into the direction to make
mapcount more suitable for COW, but it'd still not be enough.

The transient GUP pins would also need to be taken care of, but we
could wait for those to be released, since those are guaranteed to be
released; otherwise something, not just munmap()/MADV_DONTNEED, will
remain in D state.

Anyway.. until io_uring and vmsplice are solved first, it's pointless
to even wonder about transient GUP pins.

> And yes, I had forgotten about that userfaultfd_pte_wp() because I was
> being myopic and only looking at wp_page_copy(). So using that as a
> way to make sure that a page doesn't go through COW is a good way to
> avoid the COW race, but I think that thing requires a bit in the page
> table which might be a problem on some architectures?

Correct, it requires a bit in the pgtable.

The bit is required in order to disambiguate which regions have been
marked for wrprotect faults and which didn't.

A practical example would be: fork() called by an app with a very
large vma with VM_UFFD_WP set.

There would be a storm of WP userfaults if we didn't have the software
bit to detect exactly which pte were wrprotected by uffd-wp and which
ones were wrprotected by fork.

That bit then sends the COW fault to a safe place if wrprotect is
running (the problem was that we didn't see the un-wrprotect could
clear the bit while the TLB flush of the wrprotect was still
pending).

The page_mapcount check, I guess, hid that race to begin with, just
like it did in clear_refs_write.

uffd-wp is similar to soft dirty: it won't work well without a pgtable
software bit. The software bit, to avoid storms of false positives
during memory pressure, also survives swapin/swapout, again like soft
dirty.

The bit requirement is handled through a per-arch option
arch/x86/Kconfig similarly to soft dirty.

select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD

From a high level, the extra bit in the pte/hugepmd can be seen as
holding the information that would be generated by split_vma()
followed by clearing VM_WRITE in the middle vma. Setting that bit
results in a cheaper runtime than allocating 2 more vmas, with the
associated locking and rbtree size increase, but the same accuracy.

Thanks,
Andrea

2020-12-24 01:25:05

by Andy Lutomirski

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect


> On Dec 23, 2020, at 2:29 PM, Yu Zhao <[email protected]> wrote:
>

>
> I was hesitant to suggest the following because it isn't that straight
> forward. But since you seem to be less concerned with the complexity,
> I'll just bring it on the table -- it would take care of both ufd and
> clear_refs_write, wouldn't it?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> goto unlock;
> }
> if (vmf->flags & FAULT_FLAG_WRITE) {
> - if (!pte_write(entry))
> + if (!pte_write(entry)) {
> + if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> + flush_tlb_page(vmf->vma, vmf->address);
> return do_wp_page(vmf);
> + }

I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait in IO while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.

I’m not immediately sure how to do all that much better, though. We could potentially keep a record of pending ranges that need flushing per mm or per PTL, protected by the PTL, and arrange to do the flush if we notice that flushes are pending when we want to do_wp_page(). At least this would limit us to one extra flush, at least until the concurrent mprotect (or whatever) makes further progress. The bookkeeping might be nasty, though.

But x86 already sort of does some of this bookkeeping, and arguably x86’s code could be improved by tracking TLB ranges to flush per mm instead of per flush request — Nadav already got us half way there by making a little cache of flush_tlb_info structs.

IMO it wouldn’t be totally crazy to integrate this better with tlb_gather_mmu to make the pending flush data visible to other CPUs even before actually kicking off the flush. In the limit, this starts to look a bit like a fully async flush mechanism. We would have a function to request a flush, and that function would return a generation count but not actually flush anything. The case of flushing a range adjacent to a still-pending range would be explicitly optimized. Then another function would actually initiate and wait for the flush to complete. And we could, while holding PTL, scan the list of pending flushes, if any, to see if the PTE we’re looking at has a flush pending.

This is sort of easy in the one-PTL-per-mm case but potentially rather complicated in the split PTL case. And I’m genuinely unsure where the “batch” TLB flush interface fits in, because it can create a batch that spans more than one mm. x86 can deal with this moderately efficiently since we limit the number of live mms per CPU and our flushes are (for now?) per cpu, not per mm.

u64 gen = 0;
for(...)
gen = queue_flush(mm, start, end, freed_tables);
flush_to_gen(mm, gen);

and the wp fault path does:

wait_for_pending_flushes(mm, address);

Other than the possibility of devolving to one flush per operation if one thread is page faulting at the same speed that another thread is modifying PTEs, this should be decently performant.

2020-12-24 02:03:55

by Andrea Arcangeli

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
> I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait in IO while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.

mprotect can't run concurrently with a page fault in the first place.

One other near zero cost improvement that would be easy to add is
"if (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))", and it could be
made conditional on the two config options too.

Still I don't mind doing it in some other way, uffd-wp has much easier
time doing it in another way in fact.

Whatever performs better is fine, but queuing up pending invalidate
ranges doesn't look very attractive since it'd be a fixed cost that
we'd always have to pay even when there's no fault (and there can't
be any fault, at least for mprotect).

Thanks,
Andrea

2020-12-24 03:10:44

by Nadav Amit

Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

> On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli <[email protected]> wrote:
>
> On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
>> I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait in IO while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.
>
> mprotect can't run concurrently with a page fault in the first place.
>
> One other near-zero-cost improvement that would be easy to add is an "if
> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" check, and it could be made
> conditional on the two config options too.
>
> Still, I don't mind doing it some other way; uffd-wp in fact has a much
> easier time doing it another way.
>
> Whatever performs better is fine, but queuing up pending invalidate
> ranges doesn't look very attractive since it'd be a fixed cost that we'd
> always have to pay even when there's no fault (and there can't be any
> fault, at least for mprotect).

I think there are other cases in which Andy’s concern is relevant
(MADV_PAGEOUT).

Perhaps holding some small bitmap based on part of the deferred flushed
pages (e.g., bits 12-17 of the address or some other kind of a single
hash-function bloom-filter) would be more performant to avoid (most)
unnecessary TLB flushes. It will be cleared before a TLB flush and set while
holding the PTL.

Checking if a flush is needed, under the PTL, would require a single memory
access (although potentially cache miss). It will however require one atomic
operation for each page-table whose PTEs’ flushes are deferred - in contrast
to the current scheme which requires two atomic operations for the *entire*
operation.
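For illustration, here is a minimal userspace sketch of the single-hash filter idea described above. The names, the 64-bucket sizing, and the exact bit range are assumptions for the sketch, not a worked-out kernel design; in the kernel the set would be an atomic OR and the whole thing would sit next to the mm's deferred-flush state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One-hash bloom filter over bits 12-17 of the address: 64 buckets in a
 * single word. Set while holding the PTL, cleared before a TLB flush. */
struct flush_filter {
	uint64_t bits;
};

static unsigned int addr_bucket(uint64_t addr)
{
	return (addr >> 12) & 63;	/* bits 12-17 -> bucket 0..63 */
}

/* Called under the PTL when a PTE change defers its TLB flush. */
static void note_deferred_flush(struct flush_filter *f, uint64_t addr)
{
	f->bits |= 1ULL << addr_bucket(addr);	/* atomic OR in the kernel */
}

/* Called once the TLB flush for the mm has actually completed. */
static void filter_clear(struct flush_filter *f)
{
	f->bits = 0;
}

/* Fault path, under the PTL: might this address have a pending flush?
 * False positives cost only an extra flush; false negatives must never
 * happen, which the single-word set/clear discipline guarantees. */
static bool may_need_flush(struct flush_filter *f, uint64_t addr)
{
	return f->bits & (1ULL << addr_bucket(addr));
}
```

Checking the filter is the single memory access mentioned above; most faults on unrelated addresses see a zero bit and skip the flush entirely.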

2020-12-24 03:36:15

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 09:00:26PM -0500, Andrea Arcangeli wrote:
> One other near zero cost improvement easy to add if this would be "if
> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made

The next worry then is that if UFFDIO_WRITEPROTECT is very large there
would be a flood of wrprotect faults, and they'd all end up issuing a
tlb flush during the UFFDIO_WRITEPROTECT itself, which again is a
performance concern for certain uffd-wp use cases.

Those use cases would be, for example, redis and postcopy live
snapshotting, using uffd-wp for async snapshots; unprivileged too in the
case of redis, if it temporarily uses bounce buffers for the syscall
I/O for the duration of the snapshot. Hypervisor tuned profiles need
to manually lift unprivileged_userfaultfd to 1, unless their jailer
leaves one capability in the snapshot thread.

Moving the check after userfaultfd_pte_wp would solve
userfaultfd_writeprotect(mode_wp=true), but that still wouldn't avoid
a flood of tlb flushes during userfaultfd_writeprotect(mode_wp=false)
because change_protection doesn't restore the pte_write:

		} else if (uffd_wp_resolve) {
			/*
			 * Leave the write bit to be handled
			 * by PF interrupt handler, then
			 * things like COW could be properly
			 * handled.
			 */
			ptent = pte_clear_uffd_wp(ptent);
		}

When the snapshot is complete userfaultfd_writeprotect(mode_wp=false)
would need to run again on the whole range which can be very big
again.

Orthogonally, I think we should also look at restoring the pte_write
above in uffd-wp: it'd then get yet an extra boost compared to current
redis snapshotting with fork(), which cannot restore all pte_write bits
after the snapshot child quits and so forces a flood of spurious
wrprotect faults (uffd-wp can solve that too).

However, even if uffd-wp restored the pte_write, things would remain
suboptimal for a terabyte process under clear_refs, since softdirty
wrprotect faults that start happening while softdirty is still running
on the mm, won't be caught in userfaultfd_pte_wp.

Something like below, if cleaned up, abstracted properly and
documented well in the two places involved, will have a better chance
to perform optimally for softdirty too.

And on a side note the CONFIG_MEM_SOFT_DIRTY compile time check is
compulsory because VM_SOFTDIRTY is defined to zero if softdirty is not
built in. (for VM_UFFD_WP the CONFIG_HAVE_ARCH_USERFAULTFD_WP can be
removed and it won't make any measurable difference even when
USERFAULTFD=n)

RFC untested below, it's supposed to fix the softdirty testcase too,
even without the incremental fix, since it already does tlb_gather_mmu
before walk_page_range and tlb_finish_mmu after it and that appears
enough to define the inc/dec_tlb_flush_pending.

diff --git a/mm/memory.c b/mm/memory.c
index 7d608765932b..66fd6d070c47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2844,11 +2844,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
if (!new_page)
goto oom;
} else {
+ bool in_uffd_wp, in_softdirty;
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (!new_page)
goto oom;

+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+ in_uffd_wp = !!(vma->vm_flags & VM_UFFD_WP);
+#else
+ in_uffd_wp = false;
+#endif
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ in_softdirty = !(vma->vm_flags & VM_SOFTDIRTY);
+#else
+ in_softdirty = false;
+#endif
+ if ((in_uffd_wp || in_softdirty) &&
+ mm_tlb_flush_pending(mm))
+ flush_tlb_page(vma, vmf->address);
+
if (!cow_user_page(new_page, old_page, vmf)) {
/*
* COW failed, if the fault was solved by other,

2020-12-24 03:36:37

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
> > On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli <[email protected]> wrote:
> >
> > On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
> >> I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait in IO while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.
> >
> > mprotect can't run concurrently with a page fault in the first place.
> >
> > One other near-zero-cost improvement that would be easy to add is an "if
> > (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" check, and it could be made
> > conditional on the two config options too.
> >
> > Still, I don't mind doing it some other way; uffd-wp in fact has a much
> > easier time doing it another way.
> >
> > Whatever performs better is fine, but queuing up pending invalidate
> > ranges doesn't look very attractive since it'd be a fixed cost that we'd
> > always have to pay even when there's no fault (and there can't be any
> > fault, at least for mprotect).
>
> I think there are other cases in which Andy’s concern is relevant
> (MADV_PAGEOUT).

That patch only demonstrates a rough idea, and I should have
elaborated: if we ever decide to go in that direction, we only need to
worry about "jumping through hoops", because the final patch (set)
I have in mind would not only have the build-time optimization Andrea
suggested but also include runtime optimizations like skipping the
do_swap_page() path and (!PageAnon() || page_mapcount > 1). Rest
assured, the performance impact on do_wp_page() from occasionally an
additional TLB flush on top of a page copy is negligible.

> Perhaps holding some small bitmap based on part of the deferred flushed
> pages (e.g., bits 12-17 of the address or some other kind of a single
> hash-function bloom-filter) would be more performant to avoid (most)
> unnecessary TLB flushes. It will be cleared before a TLB flush and set while
> holding the PTL.
>
> Checking if a flush is needed, under the PTL, would require a single memory
> access (although potentially cache miss). It will however require one atomic
> operation for each page-table whose PTEs’ flushes are deferred - in contrast
> to the current scheme which requires two atomic operations for the *entire*
> operation.
>

2020-12-24 04:04:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
> > I think there are other cases in which Andy’s concern is relevant
> > (MADV_PAGEOUT).

I didn't try to figure out how it would help MADV_PAGEOUT. MADV_PAGEOUT
sounds like a cool feature, but maybe it'd need a way to flush the
invalidates out, and to take a static key to enable the buildup of those
ranges?

I wouldn't like to slow down the fast paths even for MADV_PAGEOUT, and
the same applies to uffd-wp and softdirty in fact.

On Wed, Dec 23, 2020 at 08:34:00PM -0700, Yu Zhao wrote:
> That patch only demonstrates a rough idea, and I should have
> elaborated: if we ever decide to go in that direction, we only need to
> worry about "jumping through hoops", because the final patch (set)
> I have in mind would not only have the build-time optimization Andrea
> suggested but also include runtime optimizations like skipping the
> do_swap_page() path and (!PageAnon() || page_mapcount > 1). Rest
> assured, the performance impact on do_wp_page() from occasionally an
> additional TLB flush on top of a page copy is negligible.

Agreed, I just posted something in that direction. Feel free to
refactor it, it's just a tentative implementation to show something
that may deliver all the performance we need in all cases involved
without slowing down the fast paths of non-uffd and non-softdirty
(well 1 branch).

> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
> > Perhaps holding some small bitmap based on part of the deferred flushed
> > pages (e.g., bits 12-17 of the address or some other kind of a single
> > hash-function bloom-filter) would be more performant to avoid (most)

The concern here isn't only the page faults having to run the bloom
filter, but how to manage the RAM storage pointed to by the bloom filter,
or whatever index into the storage, which would slow down mprotect.

Granted, mprotect is slow to begin with, but the idea is that we can't
make it any slower just to make MADV_PAGEOUT or uffd-wp or clear_refs run
faster, since it's too important and too frequent in comparison.

Just to restrict the potential false-positive IPIs caused by page_count's
inevitable inaccuracies to uffd-wp and softdirty runtimes, a simple
check on vm_flags should be enough.

Thanks,
Andrea

2020-12-24 04:40:27

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

> On Dec 23, 2020, at 7:34 PM, Yu Zhao <[email protected]> wrote:
>
> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
>>> On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli <[email protected]> wrote:
>>>
>>> On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
>>>> I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait in IO while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.
>>>
>>> mprotect can't run concurrently with a page fault in the first place.
>>>
>>> One other near-zero-cost improvement that would be easy to add is an "if
>>> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" check, and it could be made
>>> conditional on the two config options too.
>>>
>>> Still, I don't mind doing it some other way; uffd-wp in fact has a much
>>> easier time doing it another way.
>>>
>>> Whatever performs better is fine, but queuing up pending invalidate
>>> ranges doesn't look very attractive since it'd be a fixed cost that we'd
>>> always have to pay even when there's no fault (and there can't be any
>>> fault, at least for mprotect).
>>
>> I think there are other cases in which Andy’s concern is relevant
>> (MADV_PAGEOUT).
>
> That patch only demonstrates a rough idea, and I should have
> elaborated: if we ever decide to go in that direction, we only need to
> worry about "jumping through hoops", because the final patch (set)
> I have in mind would not only have the build-time optimization Andrea
> suggested but also include runtime optimizations like skipping the
> do_swap_page() path and (!PageAnon() || page_mapcount > 1). Rest
> assured, the performance impact on do_wp_page() from occasionally an
> additional TLB flush on top of a page copy is negligible.

I agree with you to a certain extent, since there is anyhow another TLB
flush in this path when the PTE is set after copying.

Yet, I think that having a combined and efficient central mechanism for
pending TLB flushes is important even for robustness: to prevent the
development of new independent deferred flushing schemes. I specifically do
not like tlb_flush_batched, which confuses me every time I look at it.
For example, the following code completely confuses me:

void flush_tlb_batched_pending(struct mm_struct *mm)
{
	if (data_race(mm->tlb_flush_batched)) {
		flush_tlb_mm(mm);

		/*
		 * Do not allow the compiler to re-order the clearing of
		 * tlb_flush_batched before the tlb is flushed.
		 */
		barrier();
		mm->tlb_flush_batched = false;
	}
}

… and then I ask myself (no answer):

1. What prevents a concurrent flush_tlb_batched_pending(), called for
instance from madvise_free_pte_range(), from clearing a new deferred
flush indication that was just set by set_tlb_ubc_flush_pending()? Can
it cause a missed TLB flush later?

2. Why the write to tlb_flush_batched is not done with WRITE_ONCE()?

3. Should we have smp_wmb() instead of barrier()? (probably the barrier() is
not needed at all since flush_tlb_mm() serializes if a flush is needed).

4. Why do we need 2 deferred TLB flushing mechanisms?

2020-12-24 05:21:05

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

> On Dec 23, 2020, at 8:01 PM, Andrea Arcangeli <[email protected]> wrote:
>
>> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
>>> Perhaps holding some small bitmap based on part of the deferred flushed
>>> pages (e.g., bits 12-17 of the address or some other kind of a single
>>> hash-function bloom-filter) would be more performant to avoid (most)
>
> The concern here isn't only the page faults having to run the bloom
> filter, but how to manage the RAM storage pointed to by the bloom filter,
> or whatever index into the storage, which would slow down mprotect.
>
> Granted, mprotect is slow to begin with, but the idea is that we can't
> make it any slower just to make MADV_PAGEOUT or uffd-wp or clear_refs run
> faster, since it's too important and too frequent in comparison.
>
> Just to restrict the potential false-positive IPIs caused by page_count's
> inevitable inaccuracies to uffd-wp and softdirty runtimes, a simple
> check on vm_flags should be enough.

Andrea,

I am not trying to be argumentative, and I have not thought through an
alternative solution. It sounds to me that your proposed solution is
correct and would probably eventually be (slightly) more efficient than
anything that I can propose.

Yet, I do want to explain my position. Reasoning about TLB flushes is
hard, as this long thread shows. The question is whether it has to be so
hard. In theory, we should only have to think about architectural
considerations: whether a PTE's permissions are promoted/demoted and
whether the PTE was changed/cleared.

Obviously, it is more complex than that. Yet, once you add into the equation
various parameters such as the VMA flags or whether a page is locked (which
Mel told me was once a consideration), things become much more complicated.
If all the logic of TLB flushes had been concentrated in a single point and
maintenance of this code did not require thought about users and use-cases,
I think things would have been much simpler, at least for me.

Regards,
Nadav

2020-12-24 18:52:49

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 09:18:09PM -0800, Nadav Amit wrote:
> I am not trying to be argumentative, and I have not thought through an
> alternative solution. It sounds to me that your proposed solution is
> correct and would probably eventually be (slightly) more efficient than
> anything that I can propose.

On a side note, regarding the last proposed solution, I've been
wondering about the memory ordering of mm_tlb_flush_pending.

There's plenty of locking before the point where the actual data is
read, but it's not a release/acquire full barrier (or more accurately
it is only on x86), so smp_rmb() seems needed before cow_user_page to
be sure no data can be read before we read the value of
mm_tlb_flush_pending.

To avoid an explicit smp_rmb() and to inherit the implicit smp_mb()
from the release/acquire of PT unlock/PT lock, we'd need to put the
flush before the previous PT unlock. (note, not after the next PT lock
or it'd be even worse).

So this made me look also the inc/dec:

+	smp_mb__before_atomic();
 	atomic_dec(&mm->tlb_flush_pending);

Without the above, can't the CPU decrement the tlb_flush_pending while
the IPI to ack didn't arrive yet in csd_lock_wait?

The smp_rmb() is still needed in the page fault (implicit or explicit
doesn't matter), but we also need a smp_mb__before_atomic() above to
really make this work.

> Yet, I do want to explain my position. Reasoning on TLB flushes is hard, as
> this long thread shows. The question is whether it has to be so hard. In
> theory, we can only think about architectural considerations - whether a PTE
> permissions are promoted/demoted and whether the PTE was changed/cleared.
>
> Obviously, it is more complex than that. Yet, once you add into the equation
> various parameters such as the VMA flags or whether a page is locked (which
> Mel told me was once a consideration), things become much more complicated.
> If all the logic of TLB flushes had been concentrated in a single point and
> maintenance of this code did not require thought about users and use-cases,
> I think things would have been much simpler, at least for me.

In your previous email you also suggested adding range invalidates and
a bloom filter to index them by address in the page fault, since it'd
help MADV_PAGEOUT. That would increase complexity and wouldn't go
exactly in the above direction.

I assume that here nobody wants to add gratuitous complexity, and I
never suggested the code shouldn't become simpler and easier to
understand and maintain. But we can't solve everything in a single
thread in terms of cleaning up and auditing the entirety of the TLB
code.

To refocus: the only short term objective in this thread, is to fix
the data corruption in uffd-wp and clear_refs_write without
introducing new performance regressions compared to the previous
clear_refs_write runtime.

Once that is fixed, and we didn't introduce a performance regression
while fixing a userland memory-corruption regression (i.e. not really
a regression in theory, but in practice it is, because it worked by
luck), then we can look at all the rest without hurry.

So, if we already want to start the cleanups like I mentioned in a
previous email, I'll say it more explicitly: the tlb gather is for
freeing memory. When just changing protection with mmap_read_lock it's
pure overhead and gives a false sense of security, as if it could make
any difference. It wouldn't be needed with the write lock, and it can't
help solve the races that trigger with mmap_read_lock either.

It is needed when you have to store the page pointer outside the pte,
clear a pte, flush the tlb and only then put_page. So it is needed to
keep tracks of which pages got cleared in the ptes so you don't have
to issue a tlb flush for each single pte that gets cleared.

So the only case when to use the tlb_gather is when you must make the
pte stop pointing to the page and you need an external storage that
will keep track of those pages that we cannot yet lose track of since
we didn't do put_page yet. That kind of external storage to keep track
of the pages that have pending tlb flushes, is never needed in
change_protection and clear_refs_write.
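As a toy model of the batching pattern described above: the names loosely echo the kernel's mmu_gather, but everything here is a simplified, hypothetical userspace sketch of "clear the pte, remember the page, flush, only then put_page".

```c
#include <assert.h>

#define BATCH 8

struct page { int refcount; };

struct mmu_gather_model {
	struct page *pages[BATCH];	/* external storage for pending pages */
	int nr;
	int flushes;			/* how many TLB flushes we issued */
};

static void tlb_flush_model(struct mmu_gather_model *tlb)
{
	tlb->flushes++;			/* stands in for flush_tlb_range() */
	for (int i = 0; i < tlb->nr; i++)
		tlb->pages[i]->refcount--;	/* put_page() after the flush */
	tlb->nr = 0;
}

/* PTE was cleared: keep track of the page; flush when the batch fills,
 * so we don't issue one TLB flush per cleared pte. */
static void tlb_remove_page_model(struct mmu_gather_model *tlb, struct page *p)
{
	tlb->pages[tlb->nr++] = p;
	if (tlb->nr == BATCH)
		tlb_flush_model(tlb);
}

static void tlb_finish_model(struct mmu_gather_model *tlb)
{
	if (tlb->nr)
		tlb_flush_model(tlb);
}
```

The point of the sketch is the dependency: the refcount drop only ever happens after a flush, which is exactly the need that change_protection and clear_refs_write don't have, since they never stop the pte pointing to the page.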

change_protection is already correct in that respect: it doesn't use
the tlb_gather. clear_refs_write is not ok and it needs to stop using
tlb_gather_mmu/tlb_finish_mmu.

* tlb_gather_mmu - initialize an mmu_gather structure for page-table tear-down

See also the tear-down mention which doesn't apply to clear_refs_write.

clear_refs_write needs to become identical to change_protection_range:
just call inc_tlb_flush_pending, then do a flush of the range right
before dec_tlb_flush_pending.
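A rough userspace model of that bracketing; the names mirror the kernel's inc/dec_tlb_flush_pending and mm_tlb_flush_pending, but this is only a sketch of the ordering, not the real locking, barriers, or flushing:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct mm_flush_model {
	atomic_int tlb_flush_pending;
	int range_flushes;	/* counts simulated flush_tlb_range() calls */
};

/* What the wp fault path checks to know a stale TLB entry may exist. */
static bool mm_tlb_flush_pending_model(struct mm_flush_model *mm)
{
	return atomic_load(&mm->tlb_flush_pending) > 0;
}

/* The change_protection_range-style shape proposed for clear_refs_write:
 * no mmu_gather, just one pending-bracket around the walk and a single
 * range flush at the end. */
static void clear_refs_walk_model(struct mm_flush_model *mm)
{
	atomic_fetch_add(&mm->tlb_flush_pending, 1);	/* inc_tlb_flush_pending */

	/* ... walk page tables, write-protect PTEs, no per-pte flush ... */

	mm->range_flushes++;	/* flush_tlb_range() over the whole span */

	/* dec only after the flush: a racing wp fault that still observes
	 * pending != 0 knows it may be looking at a stale writable entry */
	atomic_fetch_sub(&mm->tlb_flush_pending, 1);
}
```

The invariant the sketch encodes is that pending never drops to zero before the range flush has happened, which is what the fault side relies on.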

That change is actually orthogonal to the regression fix to teach the
COW to deal with stale TLB entries and it will clean up the code.

So, to pursue your simplification objective, you could already go ahead
with this improvement, since it's very confusing to see clear_refs_write
use something it shouldn't, as if it actually could make any difference,
and then to see an incremental patch trying to perfect the tlb_gather
logic in clear_refs instead of removing it. In fact it's so strange that
it's hard to suggest dropping tlb_gather entirely; I feel like I must be
missing something, so if I am, this would be a good time to explain to
me why tlb_gather is used in clear_refs.

Once that is also done, we can look at the flush_tlb_batched_pending
that I see you mentioned to Yu. I didn't go check it yet, but we can
certainly look at it deeper later, maybe starting a new thread for
it is simpler?

In the short term, for this thread, we can't solve everything at once
and reduce the complexity at the same time.

So refocusing on the memory ordering of dec_tlb_flush_pending and the
mm_tlb_flush_pending mentioned above, to find a proper abstraction and
write proper documentation for the flush in wp_copy_page would be
ideal. Then we can do the rest.

On my to-do list, before worrying about the rest, I in fact also need to
re-send (I already proposed it for merging a few times on lkml) the
ARM64 tlb flushing optimization that skips the hardware SMP un-scalable
ITLB broadcast for single-threaded processes, or multithreaded
processes that are temporarily running single threaded or bound to a
single CPU, which increases the SMP scalability of the arm64 TLB
flushing by an order of magnitude for some workloads, and which we had
to ship in production already.

Thanks,
Andrea

2020-12-24 19:18:48

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Thu, Dec 24, 2020 at 01:49:45PM -0500, Andrea Arcangeli wrote:
> Without the above, can't the CPU decrement the tlb_flush_pending while
> the IPI to ack didn't arrive yet in csd_lock_wait?

Ehm: csd_lock_wait has smp_acquire__after_ctrl_dep() so the write side
looks ok after all sorry.