2020-12-22 23:52:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 3:39 PM Yu Zhao <[email protected]> wrote:
>
> 2) is the false positive because of what we do, and it's causing the
> memory corruption because do_wp_page() tries to make copies of pages
> that seem to be RO but may have stale RW tlb entries pending flush.

Yeah, that's definitely a different bug.

The rule is that the TLB flush has to be done before the page table
lock is released.

See zap_pte_range() for an example of doing it right, even in the
presence of complexities (ie that has an example of both flushing the
TLB, and doing the actual "free the pages after flush", and it does
the two cases separately).

Linus


2020-12-23 00:04:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 3:50 PM Linus Torvalds
<[email protected]> wrote:
>
> See zap_pte_range() for an example of doing it right, even in the
> presence of complexities (ie that has an example of both flushing the
> TLB, and doing the actual "free the pages after flush", and it does
> the two cases separately).

The more I look at the mprotect code, the less I like it. We seem to
be much better about the TLB flushes in other places (looking at
mremap, for example). The mprotect code seems to be very laissez-faire
about the TLB flushing.

Does adding a TLB flush to before that

pte_unmap_unlock(pte - 1, ptl);

fix things for you?

That's not the right fix - leaving a stale TLB entry around is fine if
the TLB entry is more strict wrt protections - but it might be worth
testing as a "does it at least close the problem" patch.

Linus

2020-12-23 00:23:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 3:50 PM Linus Torvalds
<[email protected]> wrote:
>
> The rule is that the TLB flush has to be done before the page table
> lock is released.

I take that back. I guess it's ok as long as the mmap_sem is held for
writing. Then the TLB flush can be delayed until just before releasing
the mmap_sem. I think.

The stale TLB entries still mean that somebody else can write through
them in another thread, but as long as anybody who actually unmaps the
page (and frees it - think rmap etc) is being careful, mprotect()
itself can probably afford to be a bit laissez-faire.

So mprotect() itself should be ok, I think, because it takes things for writing.

Even with the mmap_sem held for writing, truncate and friends can see
the read-only page table entries (because they can look things up
using the file i_mmap thing instead), but then they rely on the page
table lock and they'll also be careful if they then change that PTE
and will force their own TLB flushes.

So I think a pending TLB flush outside the page table lock is fine -
but once again only if you hold the mmap_sem for writing. Not for
reading, because then the page tables need to be synchronized with the
TLB so that other readers don't see the not-yet-synchronized state.

It once again looks like it's just userfaultfd that would trigger this
due to the read-lock on the mmap_sem. And mprotect() itself is fine.

Am I missing something?

But apparently Nadav sees problems even with that lock changed to a
write lock. Navad?

Linus

2020-12-23 00:25:45

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 04:01:45PM -0800, Linus Torvalds wrote:
> On Tue, Dec 22, 2020 at 3:50 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > See zap_pte_range() for an example of doing it right, even in the
> > presence of complexities (ie that has an example of both flushing the
> > TLB, and doing the actual "free the pages after flush", and it does
> > the two cases separately).
>
> The more I look at the mprotect code, the less I like it. We seem to
> be much better about the TLB flushes in other places (looking at
> mremap, for example). The mprotect code seems to be very laissez-faire
> about the TLB flushing.
>
> Does adding a TLB flush to before that
>
> pte_unmap_unlock(pte - 1, ptl);
>
> fix things for you?

It definitely does. But if I had to choose, I'd go with holding
mmap_lock for write because 1) it's less likely to storm other CPUs by
IPI and would only have performance impact on processes that use ufd,
which I guess already have high tolerance for not-so-good performance,
and 2) people are spearheading multiple efforts to reduce the mmap_lock
contention, which hopefully would make ufd users suffer less soon.

> That's not the right fix - leaving a stale TLB entry around is fine if
> the TLB entry is more strict wrt protections - but it might be worth
> testing as a "does it at least close the problem" patch.

Well, things get trick if we do this. I'm not sure if I could vouch
such a fix for stable as confident as I do holding mmap_lock for
write.

2020-12-23 03:05:20

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 05:23:39PM -0700, Yu Zhao wrote:
> and 2) people are spearheading multiple efforts to reduce the mmap_lock
> contention, which hopefully would make ufd users suffer less soon.

In my view UFFD is an already deployed working solution that
eliminates the mmap_lock_write contention to allocate and free memory.

We need to add a UFFDIO_POPULATE to use in combination with
UFFD_FEATURE_SIGBUS (UFFDIO_POPULATE just needs to zero out a page or
THP and map it, it'll be indistinguishable to UFFDIO_ZEROPAGE, but it
will solve the last performance bottleneck by avoiding a suprious
wrprotect fault after the allocation).

After that malloc based on uffd should become competitive single
threaded and it won't ever require the mmap_lock_write so allocations
and freeing of memory can continue indefinitely from all threaded in
parallel. There will never be another mmap or munmap stalling all
threads.

This is not why uffd was created, it's just a secondary performance
benefit of uffd, but it's still a relevant benefit in my view.

Every time I hear people with major mmap_lock_write issues I recommend
uffd, but you know, until we add the UFFDIO_POPULATE, it will still
have higher fixed allocation overhead because of the wprotect fault
after UFFDIO_ZEROCOPY. UFFDIO_COPY also would be not as optimal as a
clear_page and currently it's not even THP capable.

In addition you'll get a SIGBUS after an user after free. It's not
like when you have a malloc lib doing MADV_DONTNEED at PAGE_SIZE
granularity to rate limit the costly munmap, and then the app does an
use after free and it reads zero or writes to a newly faulted in page.

The above will not require any special privilege and all allocated
virtual memory remains fully swappable, because SIGBUS mode will never
have to block any kernel initiated faults.

uffd-wp also is totally usable unprivileged by default to replace
various software dirty bits with the info provided in O(1) instead of
O(N), as long as the writes are done in userland also unprivileged by
default without tweaking any sysctl and with zero risk of increasing
reproduciblity of any exploit against unrelated random kernel bugs.

So if we're forced to take the mmap_lock_write it'd be cool if at
least we can avoid it for 1 single pte or hugepmd wrprotection, as it
happens in write_protect_page() KSM.

Thanks,
Andrea

2020-12-23 09:46:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Tue, Dec 22, 2020 at 4:01 PM Linus Torvalds
<[email protected]> wrote:
>
> The more I look at the mprotect code, the less I like it. We seem to
> be much better about the TLB flushes in other places (looking at
> mremap, for example). The mprotect code seems to be very laissez-faire
> about the TLB flushing.

No, this doesn't help.

> Does adding a TLB flush to before that
>
> pte_unmap_unlock(pte - 1, ptl);
>
> fix things for you?

It really doesn't fix it. Exactly because - as pointed out earlier -
the actual page *copy* happens outside the pte lock.

So what can happen is:

- CPU 1 holds the page table lock, while doing the write protect. It
has cleared the writable bit, but hasn't flushed the TLB's yet

- CPU 2 did *not* have the TLB entry, sees the new read-only state,
takes a COW page fault, and reads the PTE from memory (into
vmf->orig_pte)

- CPU 2 correctly decides it needs to be a COW, and copies the page contents

- CPU 3 *does* have a stale TLB (because TLB invalidation hasn't
happened yet), and writes to that page in users apce

- CPU 1 now does the TLB invalidate, and releases the page table lock

- CPU 2 gets the page table lock, sees that its PTE matches
vmf->orig_pte, and switches it to be that writable copy of the page.

where the copy happened before CPU 3 had stopped writing to the page.

So the pte lock doesn't actually matter, unless we actually do the
page copy inside of it (on CPU2), in addition to doing the TLB flush
inside of it (on CPU1).

mprotect() is actually safe for two independent reasons: (a) it does
the mmap_sem for writing (so mprotect can't race with the COW logic at
all), and (b) it changes the vma permissions so turning something
read-only actually disables COW anyway, since it won't be a COW, it
will be a SIGSEGV.

So mprotect() is irrelevant, other than the fact that it shares some
code with that "turn it read-only in the page tables".

fork() is a much closer operation, in that it actually triggers that
COW behavior, but fork() takes the mmap_sem for writing, so it avoids
this too.

So it's really just userfaultfd and that kind of ilk that is relevant
here, I think. But that "you need to flush the TLB before releasing
the page table lock" was not true (well, it's true in other
circumstances - just not *here*), and is not part of the solution.

Or rather, if it's part of the solution here, it would have to be
matched with that "page copy needs to be done under the page table
lock too".

Linus

2020-12-23 10:08:46

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 01:44:42AM -0800, Linus Torvalds wrote:
> On Tue, Dec 22, 2020 at 4:01 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > The more I look at the mprotect code, the less I like it. We seem to
> > be much better about the TLB flushes in other places (looking at
> > mremap, for example). The mprotect code seems to be very laissez-faire
> > about the TLB flushing.
>
> No, this doesn't help.
>
> > Does adding a TLB flush to before that
> >
> > pte_unmap_unlock(pte - 1, ptl);
> >
> > fix things for you?
>
> It really doesn't fix it. Exactly because - as pointed out earlier -
> the actual page *copy* happens outside the pte lock.

I appreciate all the pointers. It seems to me it does.

> So what can happen is:
>
> - CPU 1 holds the page table lock, while doing the write protect. It
> has cleared the writable bit, but hasn't flushed the TLB's yet
>
> - CPU 2 did *not* have the TLB entry, sees the new read-only state,
> takes a COW page fault, and reads the PTE from memory (into
> vmf->orig_pte)

In handle_pte_fault(), we lock page table and check pte_write(), so
we either see a RW pte before CPU 1 runs or a RO one with no stale tlb
entries after CPU 1 runs, assume CPU 1 flushes tlb while holding the
same page table lock (not mmap_lock).

> - CPU 2 correctly decides it needs to be a COW, and copies the page contents
>
> - CPU 3 *does* have a stale TLB (because TLB invalidation hasn't
> happened yet), and writes to that page in users apce
>
> - CPU 1 now does the TLB invalidate, and releases the page table lock
>
> - CPU 2 gets the page table lock, sees that its PTE matches
> vmf->orig_pte, and switches it to be that writable copy of the page.
>
> where the copy happened before CPU 3 had stopped writing to the page.
>
> So the pte lock doesn't actually matter, unless we actually do the
> page copy inside of it (on CPU2), in addition to doing the TLB flush
> inside of it (on CPU1).
>
> mprotect() is actually safe for two independent reasons: (a) it does
> the mmap_sem for writing (so mprotect can't race with the COW logic at
> all), and (b) it changes the vma permissions so turning something
> read-only actually disables COW anyway, since it won't be a COW, it
> will be a SIGSEGV.
>
> So mprotect() is irrelevant, other than the fact that it shares some
> code with that "turn it read-only in the page tables".
>
> fork() is a much closer operation, in that it actually triggers that
> COW behavior, but fork() takes the mmap_sem for writing, so it avoids
> this too.
>
> So it's really just userfaultfd and that kind of ilk that is relevant
> here, I think. But that "you need to flush the TLB before releasing
> the page table lock" was not true (well, it's true in other
> circumstances - just not *here*), and is not part of the solution.
>
> Or rather, if it's part of the solution here, it would have to be
> matched with that "page copy needs to be done under the page table
> lock too".
>
> Linus
>

2020-12-23 16:28:29

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 03:06:30AM -0700, Yu Zhao wrote:
> On Wed, Dec 23, 2020 at 01:44:42AM -0800, Linus Torvalds wrote:
> > On Tue, Dec 22, 2020 at 4:01 PM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > The more I look at the mprotect code, the less I like it. We seem to
> > > be much better about the TLB flushes in other places (looking at
> > > mremap, for example). The mprotect code seems to be very laissez-faire
> > > about the TLB flushing.
> >
> > No, this doesn't help.
> >
> > > Does adding a TLB flush to before that
> > >
> > > pte_unmap_unlock(pte - 1, ptl);
> > >
> > > fix things for you?
> >
> > It really doesn't fix it. Exactly because - as pointed out earlier -
> > the actual page *copy* happens outside the pte lock.
>
> I appreciate all the pointers. It seems to me it does.
>
> > So what can happen is:
> >
> > - CPU 1 holds the page table lock, while doing the write protect. It
> > has cleared the writable bit, but hasn't flushed the TLB's yet
> >
> > - CPU 2 did *not* have the TLB entry, sees the new read-only state,
> > takes a COW page fault, and reads the PTE from memory (into
> > vmf->orig_pte)
>
> In handle_pte_fault(), we lock page table and check pte_write(), so
> we either see a RW pte before CPU 1 runs or a RO one with no stale tlb
> entries after CPU 1 runs, assume CPU 1 flushes tlb while holding the
> same page table lock (not mmap_lock).

I think this is not against Linus's example - where cpu2 does not have tlb
cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify
it. So IMHO there's no problem here.

But I do think in step 2 here we overlooked _PAGE_UFFD_WP bit. Note that if
it's uffd-wp wr-protection it's always applied along with removing of the write
bit in change_pte_range():

if (uffd_wp) {
ptent = pte_wrprotect(ptent);
ptent = pte_mkuffd_wp(ptent);
}

So instead of being handled as COW page do_wp_page() will always trap
userfaultfd-wp first, hence no chance to race with COW.

COW could only trigger after another uffd-wp-resolve ioctl which could remove
the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will only happen
after all wr-protect completes, which guarantees that when reaching the COW
path the tlb must has been flushed anyways. Then no one should be modifying
the page anymore even without pgtable lock in COW path.

So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems to
work, but it just may cause more tlb flush than Andrea's proposal especially
when the protection range is large (it's common to have a huge protection range
for e.g. VM live snapshotting, where we'll protect all guest rw ram).

My understanding of current issue is that either we can take Andrea's proposal
(although I think the group rwsem may not be extremely better than a per-mm
rwsem, which I don't know... at least not worst than that?), or we can also go
the other way (also as Andrea mentioned) so that when wr-protect:

- for <=2M range (pmd or less), we take read rwsem, but flush tlb within
pgtable lock

- for >2M range, we take write rwsem directly but flush tlb once

Thanks,

>
> > - CPU 2 correctly decides it needs to be a COW, and copies the page contents
> >
> > - CPU 3 *does* have a stale TLB (because TLB invalidation hasn't
> > happened yet), and writes to that page in users apce
> >
> > - CPU 1 now does the TLB invalidate, and releases the page table lock
> >
> > - CPU 2 gets the page table lock, sees that its PTE matches
> > vmf->orig_pte, and switches it to be that writable copy of the page.
> >
> > where the copy happened before CPU 3 had stopped writing to the page.
> >
> > So the pte lock doesn't actually matter, unless we actually do the
> > page copy inside of it (on CPU2), in addition to doing the TLB flush
> > inside of it (on CPU1).
> >
> > mprotect() is actually safe for two independent reasons: (a) it does
> > the mmap_sem for writing (so mprotect can't race with the COW logic at
> > all), and (b) it changes the vma permissions so turning something
> > read-only actually disables COW anyway, since it won't be a COW, it
> > will be a SIGSEGV.
> >
> > So mprotect() is irrelevant, other than the fact that it shares some
> > code with that "turn it read-only in the page tables".
> >
> > fork() is a much closer operation, in that it actually triggers that
> > COW behavior, but fork() takes the mmap_sem for writing, so it avoids
> > this too.
> >
> > So it's really just userfaultfd and that kind of ilk that is relevant
> > here, I think. But that "you need to flush the TLB before releasing
> > the page table lock" was not true (well, it's true in other
> > circumstances - just not *here*), and is not part of the solution.
> >
> > Or rather, if it's part of the solution here, it would have to be
> > matched with that "page copy needs to be done under the page table
> > lock too".
> >
> > Linus
> >
>

--
Peter Xu

2020-12-23 18:56:53

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 11:24:16AM -0500, Peter Xu wrote:
> I think this is not against Linus's example - where cpu2 does not have tlb
> cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify
> it. So IMHO there's no problem here.
>
> But I do think in step 2 here we overlooked _PAGE_UFFD_WP bit. Note that if
> it's uffd-wp wr-protection it's always applied along with removing of the write
> bit in change_pte_range():
>
> if (uffd_wp) {
> ptent = pte_wrprotect(ptent);
> ptent = pte_mkuffd_wp(ptent);
> }
>
> So instead of being handled as COW page do_wp_page() will always trap
> userfaultfd-wp first, hence no chance to race with COW.
>
> COW could only trigger after another uffd-wp-resolve ioctl which could remove
> the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will only happen
> after all wr-protect completes, which guarantees that when reaching the COW
> path the tlb must has been flushed anyways. Then no one should be modifying
> the page anymore even without pgtable lock in COW path.
>
> So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems to
> work, but it just may cause more tlb flush than Andrea's proposal especially
> when the protection range is large (it's common to have a huge protection range
> for e.g. VM live snapshotting, where we'll protect all guest rw ram).
>
> My understanding of current issue is that either we can take Andrea's proposal
> (although I think the group rwsem may not be extremely better than a per-mm
> rwsem, which I don't know... at least not worst than that?), or we can also go
> the other way (also as Andrea mentioned) so that when wr-protect:
>
> - for <=2M range (pmd or less), we take read rwsem, but flush tlb within
> pgtable lock
>
> - for >2M range, we take write rwsem directly but flush tlb once

I fully agree indeed.

I still have to fully understand how the clear_refs_write was fixed.

clear_refs_write has not even a "catcher" in the page fault but it
clears the pte_write under mmap_read_lock and it doesn't flush before
releasing the PT lock. It appears way more broken than
userfaultfd_writeprotect ever was.

static inline void clear_soft_dirty(struct vm_area_struct *vma,
unsigned long addr, pte_t *pte)
{
/*
* The soft-dirty tracker uses #PF-s to catch writes
* to pages, so write-protect the pte as well. See the
* Documentation/admin-guide/mm/soft-dirty.rst for full description
* of how soft-dirty works.
*/

https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/

"Although this is fine when simply aging the ptes, in the case of
clearing the "soft-dirty" state we can end up with entries where
pte_write() is false, yet a writable mapping remains in the TLB.

Fix this by avoiding the mmu_gather API altogether: managing both the
'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB
invalidation for the sort-dirty path, much like mprotect() does
already"

NOTE: about the above comment, that mprotect takes
mmap_read_lock. Your above code change in the commit above, still has
mmap_read_lock, there's still no similarity with mprotect whatsoever.

Did you test what happens when clear_refs_write leaves writable stale
TLB and you run do_wp_page copies the page while the other CPU still
is writing to the page? Has clear_refs_write a selftest as aggressive
and racy as the uffd selftest to reproduce that workload?

The race fixed with the group lock internally to uffd, is possible
thanks to marker+catcher in NUMA balancing style? But that's not there
for clear_refs_write even after the above commit.

It doesn't appear either that this patch is adding a synchronous tlb
flush inside the PT lock either.

Overall, it would be unfair if clear_refs_write is allowed to do the
same thing that userfaultfd_writeprotect has to do, but with the
mmap_read_lock, if userfaultfd_writeprotect will be forced to take
mmap_write_lock.

In my view there are 3 official ways to deal with this:

1) set a marker in the pte/hugepmd inside the PT lock, and add a
catcher for the marker in the page fault to send the page fault to
a dead end while there are stale TLB entries

cases: userfaultfd_writeprotect() and NUMA balancing

2) take mmap_write_lock

case: mprotect

3) flush the TLB before the PT lock release

case: KSM write_protect_page

Where does the below patch land in the 3 cases? It doesn't fit any of
them and what it does looks still not sufficient to fix the userfault
memory corruption.

https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/

It'd be backwards if the complexity of the kernel was increased for
clear_refs_write, but forbidden for the O(1) API that delivers the
exact same information (clear_refs_write API delivers that info in
O(N)).

We went the last mile to guarantee uffd can be always used fully
securely unprivileged and by default in current Linus's tree and in
future Android out of the box. It's simply impossible with the current
defaults, to use uffd to enlarge the any kernel user-after-free race
window either, so even that concern is gone. From on those research
testcases will stick to fuse instead I guess.

Thanks,
Andrea

2020-12-23 18:58:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 01:51:59PM -0500, Andrea Arcangeli wrote:
> NOTE: about the above comment, that mprotect takes
> mmap_read_lock. Your above code change in the commit above, still has
^^^^ write

Correction to avoid any confusion.

2020-12-23 19:14:22

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 11:24:16AM -0500, Peter Xu wrote:
> On Wed, Dec 23, 2020 at 03:06:30AM -0700, Yu Zhao wrote:
> > On Wed, Dec 23, 2020 at 01:44:42AM -0800, Linus Torvalds wrote:
> > > On Tue, Dec 22, 2020 at 4:01 PM Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > The more I look at the mprotect code, the less I like it. We seem to
> > > > be much better about the TLB flushes in other places (looking at
> > > > mremap, for example). The mprotect code seems to be very laissez-faire
> > > > about the TLB flushing.
> > >
> > > No, this doesn't help.
> > >
> > > > Does adding a TLB flush to before that
> > > >
> > > > pte_unmap_unlock(pte - 1, ptl);
> > > >
> > > > fix things for you?
> > >
> > > It really doesn't fix it. Exactly because - as pointed out earlier -
> > > the actual page *copy* happens outside the pte lock.
> >
> > I appreciate all the pointers. It seems to me it does.
> >
> > > So what can happen is:
> > >
> > > - CPU 1 holds the page table lock, while doing the write protect. It
> > > has cleared the writable bit, but hasn't flushed the TLB's yet
> > >
> > > - CPU 2 did *not* have the TLB entry, sees the new read-only state,
> > > takes a COW page fault, and reads the PTE from memory (into
> > > vmf->orig_pte)
> >
> > In handle_pte_fault(), we lock page table and check pte_write(), so
> > we either see a RW pte before CPU 1 runs or a RO one with no stale tlb
> > entries after CPU 1 runs, assume CPU 1 flushes tlb while holding the
> > same page table lock (not mmap_lock).
>
> I think this is not against Linus's example - where cpu2 does not have tlb
> cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify
> it. So IMHO there's no problem here.

None of the CPUs has stale entries when CPU 2 sees a RO PTE. We are
assuming that TLB flush will be done on CPU 1 while it's still holding
page table lock.

CPU 2 (re)locks page table and (re)checks the PTE under question when
it decides if copy is necessary. If it sees a RO PTE, it means the
flush has been done on all CPUs, therefore it fixes the problem.

> But I do think in step 2 here we overlooked _PAGE_UFFD_WP bit. Note that if
> it's uffd-wp wr-protection it's always applied along with removing of the write
> bit in change_pte_range():
>
> if (uffd_wp) {
> ptent = pte_wrprotect(ptent);
> ptent = pte_mkuffd_wp(ptent);
> }
>
> So instead of being handled as COW page do_wp_page() will always trap
> userfaultfd-wp first, hence no chance to race with COW.
>
> COW could only trigger after another uffd-wp-resolve ioctl which could remove
> the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will only happen
> after all wr-protect completes, which guarantees that when reaching the COW
> path the tlb must has been flushed anyways. Then no one should be modifying
> the page anymore even without pgtable lock in COW path.
>
> So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems to
> work, but it just may cause more tlb flush than Andrea's proposal especially
> when the protection range is large (it's common to have a huge protection range
> for e.g. VM live snapshotting, where we'll protect all guest rw ram).
>
> My understanding of current issue is that either we can take Andrea's proposal
> (although I think the group rwsem may not be extremely better than a per-mm
> rwsem, which I don't know... at least not worst than that?), or we can also go
> the other way (also as Andrea mentioned) so that when wr-protect:
>
> - for <=2M range (pmd or less), we take read rwsem, but flush tlb within
> pgtable lock
>
> - for >2M range, we take write rwsem directly but flush tlb once
>
> Thanks,
>
> >
> > > - CPU 2 correctly decides it needs to be a COW, and copies the page contents
> > >
> > > - CPU 3 *does* have a stale TLB (because TLB invalidation hasn't
> > > happened yet), and writes to that page in users apce
> > >
> > > - CPU 1 now does the TLB invalidate, and releases the page table lock
> > >
> > > - CPU 2 gets the page table lock, sees that its PTE matches
> > > vmf->orig_pte, and switches it to be that writable copy of the page.
> > >
> > > where the copy happened before CPU 3 had stopped writing to the page.
> > >
> > > So the pte lock doesn't actually matter, unless we actually do the
> > > page copy inside of it (on CPU2), in addition to doing the TLB flush
> > > inside of it (on CPU1).
> > >
> > > mprotect() is actually safe for two independent reasons: (a) it does
> > > the mmap_sem for writing (so mprotect can't race with the COW logic at
> > > all), and (b) it changes the vma permissions so turning something
> > > read-only actually disables COW anyway, since it won't be a COW, it
> > > will be a SIGSEGV.
> > >
> > > So mprotect() is irrelevant, other than the fact that it shares some
> > > code with that "turn it read-only in the page tables".
> > >
> > > fork() is a much closer operation, in that it actually triggers that
> > > COW behavior, but fork() takes the mmap_sem for writing, so it avoids
> > > this too.
> > >
> > > So it's really just userfaultfd and that kind of ilk that is relevant
> > > here, I think. But that "you need to flush the TLB before releasing
> > > the page table lock" was not true (well, it's true in other
> > > circumstances - just not *here*), and is not part of the solution.
> > >
> > > Or rather, if it's part of the solution here, it would have to be
> > > matched with that "page copy needs to be done under the page table
> > > lock too".
> > >
> > > Linus
> > >
> >
>
> --
> Peter Xu
>

2020-12-23 19:35:30

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 12:12:23PM -0700, Yu Zhao wrote:
> On Wed, Dec 23, 2020 at 11:24:16AM -0500, Peter Xu wrote:
> > On Wed, Dec 23, 2020 at 03:06:30AM -0700, Yu Zhao wrote:
> > > On Wed, Dec 23, 2020 at 01:44:42AM -0800, Linus Torvalds wrote:
> > > > On Tue, Dec 22, 2020 at 4:01 PM Linus Torvalds
> > > > <[email protected]> wrote:
> > > > >
> > > > > The more I look at the mprotect code, the less I like it. We seem to
> > > > > be much better about the TLB flushes in other places (looking at
> > > > > mremap, for example). The mprotect code seems to be very laissez-faire
> > > > > about the TLB flushing.
> > > >
> > > > No, this doesn't help.
> > > >
> > > > > Does adding a TLB flush to before that
> > > > >
> > > > > pte_unmap_unlock(pte - 1, ptl);
> > > > >
> > > > > fix things for you?
> > > >
> > > > It really doesn't fix it. Exactly because - as pointed out earlier -
> > > > the actual page *copy* happens outside the pte lock.
> > >
> > > I appreciate all the pointers. It seems to me it does.
> > >
> > > > So what can happen is:
> > > >
> > > > - CPU 1 holds the page table lock, while doing the write protect. It
> > > > has cleared the writable bit, but hasn't flushed the TLB's yet
> > > >
> > > > - CPU 2 did *not* have the TLB entry, sees the new read-only state,
> > > > takes a COW page fault, and reads the PTE from memory (into
> > > > vmf->orig_pte)
> > >
> > > In handle_pte_fault(), we lock page table and check pte_write(), so
> > > we either see a RW pte before CPU 1 runs or a RO one with no stale tlb
> > > entries after CPU 1 runs, assume CPU 1 flushes tlb while holding the
> > > same page table lock (not mmap_lock).
> >
> > I think this is not against Linus's example - where cpu2 does not have tlb
> > cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify
> > it. So IMHO there's no problem here.
>
> None of the CPUs has stale entries when CPU 2 sees a RO PTE.

In this example we have - Please see [1] below.

> We are
> assuming that TLB flush will be done on CPU 1 while it's still holding
> page table lock.
>
> CPU 2 (re)locks page table and (re)checks the PTE under question when
> it decides if copy is necessary. If it sees a RO PTE, it means the
> flush has been done on all CPUs, therefore it fixes the problem.

I guess you had the assumption that pgtable lock released in step 1 already.
But it's not happening in this specific example, not until step5 [2] below.

Indeed if pgtable lock is not released from cpu1, then COW path won't even
triger, afaiu... do_wp_page() needs the pgtable lock. It seems just even safer.

Irrelevant of the small confusions here and there... I believe we're always on
the same page, at least the conclusion.

Thanks,

>
> > But I do think in step 2 here we overlooked _PAGE_UFFD_WP bit. Note that if
> > it's uffd-wp wr-protection it's always applied along with removing of the write
> > bit in change_pte_range():
> >
> > if (uffd_wp) {
> > ptent = pte_wrprotect(ptent);
> > ptent = pte_mkuffd_wp(ptent);
> > }
> >
> > So instead of being handled as COW page do_wp_page() will always trap
> > userfaultfd-wp first, hence no chance to race with COW.
> >
> > COW could only trigger after another uffd-wp-resolve ioctl which could remove
> > the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will only happen
> > after all wr-protect completes, which guarantees that when reaching the COW
> > path the tlb must has been flushed anyways. Then no one should be modifying
> > the page anymore even without pgtable lock in COW path.
> >
> > So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems to
> > work, but it just may cause more tlb flush than Andrea's proposal especially
> > when the protection range is large (it's common to have a huge protection range
> > for e.g. VM live snapshotting, where we'll protect all guest rw ram).
> >
> > My understanding of current issue is that either we can take Andrea's proposal
> > (although I think the group rwsem may not be extremely better than a per-mm
> > rwsem, which I don't know... at least not worst than that?), or we can also go
> > the other way (also as Andrea mentioned) so that when wr-protect:
> >
> > - for <=2M range (pmd or less), we take read rwsem, but flush tlb within
> > pgtable lock
> >
> > - for >2M range, we take write rwsem directly but flush tlb once
> >
> > Thanks,
> >
> > >
> > > > - CPU 2 correctly decides it needs to be a COW, and copies the page contents
> > > >
> > > > - CPU 3 *does* have a stale TLB (because TLB invalidation hasn't
> > > > happened yet), and writes to that page in users apce

[1]

> > > >
> > > > - CPU 1 now does the TLB invalidate, and releases the page table lock

[2]

> > > >
> > > > - CPU 2 gets the page table lock, sees that its PTE matches
> > > > vmf->orig_pte, and switches it to be that writable copy of the page.
> > > >
> > > > where the copy happened before CPU 3 had stopped writing to the page.
> > > >
> > > > So the pte lock doesn't actually matter, unless we actually do the
> > > > page copy inside of it (on CPU2), in addition to doing the TLB flush
> > > > inside of it (on CPU1).
> > > >
> > > > mprotect() is actually safe for two independent reasons: (a) it does
> > > > the mmap_sem for writing (so mprotect can't race with the COW logic at
> > > > all), and (b) it changes the vma permissions so turning something
> > > > read-only actually disables COW anyway, since it won't be a COW, it
> > > > will be a SIGSEGV.
> > > >
> > > > So mprotect() is irrelevant, other than the fact that it shares some
> > > > code with that "turn it read-only in the page tables".
> > > >
> > > > fork() is a much closer operation, in that it actually triggers that
> > > > COW behavior, but fork() takes the mmap_sem for writing, so it avoids
> > > > this too.
> > > >
> > > > So it's really just userfaultfd and that kind of ilk that is relevant
> > > > here, I think. But that "you need to flush the TLB before releasing
> > > > the page table lock" was not true (well, it's true in other
> > > > circumstances - just not *here*), and is not part of the solution.
> > > >
> > > > Or rather, if it's part of the solution here, it would have to be
> > > > matched with that "page copy needs to be done under the page table
> > > > lock too".
> > > >
> > > > Linus
> > > >
> > >
> >
> > --
> > Peter Xu
> >
>

--
Peter Xu