2020-12-20 02:04:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Sat, Dec 19, 2020 at 1:34 PM Nadav Amit <[email protected]> wrote:
>
> [ cc’ing some more people who have experience with similar problems ]
>
> > On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <[email protected]> wrote:
> >
> > Hello,
> >
> > On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >> Analyzing this problem indicates that there is a real bug since
> >> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >
> > Never having to take the mmap_sem for writing, and in turn never
> > blocking, in order to modify the pagetables is quite an important
> > feature in uffd that justifies uffd instead of mprotect. It's not the
> > most important reason to use uffd, but it'd be nice if that guarantee
> > would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> > other pgtable manipulations.
> >
> >> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>
> >> cpu0 cpu1
> >> ---- ----
> >> userfaultfd_writeprotect()
> >> [ write-protecting ]
> >> mwriteprotect_range()
> >> mmap_read_lock()
> >> change_protection()
> >> change_protection_range()
> >> ...
> >> change_pte_range()
> >> [ defer TLB flushes]
> >> userfaultfd_writeprotect()
> >> mmap_read_lock()
> >> change_protection()
> >> [ write-unprotect ]
> >> ...
> >> [ unprotect PTE logically ]
> >> ...
> >> [ page-fault]
> >> ...
> >> wp_page_copy()
> >> [ set new writable page in PTE]
> >
> > Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL
> > is set and do an explicit (potentially spurious) tlb flush before
> > write-unprotect?
>
> There is a concrete scenario that I actually encountered and then there is a
> general problem.
>
> In general, the kernel code assumes that PTEs that are read from the
> page-tables are coherent across all the TLBs, excluding permission promotion
> (i.e., the PTE may have higher permissions in the page-tables than those
> that are cached in the TLBs).
>
> We therefore need to both: (a) protect change_protection_range() from the
> changes of others who might defer TLB flushes without taking mmap_sem for
> write (e.g., try_to_unmap_one()); and (b) to protect others (e.g.,
> page-fault handlers) from concurrent changes of change_protection().
>
> We have already encountered several similar bugs, and debugging such issues
> s time consuming and these bugs impact is substantial (memory corruption,
> security). So I think we should only stick to general solutions.
>
> So perhaps your the approach of your proposed solution is feasible, but it
> would have to be applied all over the place: we will need to add a check for
> mm_tlb_flush_pending() and conditionally flush the TLB in every case in
> which PTEs are read and there might be an assumption that the
> access-permission reflect what the TLBs hold. This includes page-fault
> handlers, but also NUMA migration code in change_protection(), softdirty
> cleanup in clear_refs_write() and maybe others.

I missed the beginning of this thread, but it looks to me like
userfaultfd changes PTEs with not locking except mmap_read_lock(). It
also calls inc_tlb_flush_pending(), which is very explicitly
documented as requiring the pagetable lock. Those docs must be wrong,
because mprotect() uses the mmap_sem write lock, which is just fine,
but ISTM some kind of mutual exclusion with proper acquire/release
ordering is indeed needed. So the userfaultfd code seems bogus.

I think userfaultfd either needs to take a real lock (probably doesn't
matter which) or the core rules about PTEs need to be rewritten.


2020-12-20 02:53:48

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Sat, Dec 19, 2020 at 06:01:39PM -0800, Andy Lutomirski wrote:
> I missed the beginning of this thread, but it looks to me like
> userfaultfd changes PTEs with not locking except mmap_read_lock(). It

There's no mmap_read_lock, I assume you mean mmap_lock for reading.

The ptes are changed always with the PT lock, in fact there's no
problem with the PTE updates. The only difference with mprotect
runtime is that the mmap_lock is taken for reading. And the effect
contested for this change doesn't affect the PTE, but supposedly the
tlb flushing deferral.

The change_protection_range is identical to what already happens with
zap_page_range. zap_page_range is called with mmap_lock for reading
in MADV_DONTNEED, and by munmap with mmap_lock for
writing. change_protection_range is called with mmap_lock for writing
by mprotect, and mmap_lock for reading by UFFDIO_WRITEPROTECT.

> also calls inc_tlb_flush_pending(), which is very explicitly
> documented as requiring the pagetable lock. Those docs must be wrong,

The comment in inc_tlb_flush_pending() shows the pagetable lock is
taken after inc_tlb_flush_pending():

* atomic_inc(&mm->tlb_flush_pending);
* spin_lock(&ptl);

> because mprotect() uses the mmap_sem write lock, which is just fine,
> but ISTM some kind of mutual exclusion with proper acquire/release
> ordering is indeed needed. So the userfaultfd code seems bogus.

If there's a bug, it'd be nice to fix without taking the mmap_lock for
writing.

The vma is guaranteed not modified, so I think it'd be pretty bad if
we had to give in the mmap_lock for writing just to wait for a tlb
flush that is issued deferred in the context of
userfaultfd_writeprotect.

> I think userfaultfd either needs to take a real lock (probably doesn't
> matter which) or the core rules about PTEs need to be rewritten.

It's not exactly clear how the do_wp_page could run on cpu1 before the
unprotect did an extra flush, I guess the trace needs one more
cpu/thread?

Anyway to wait the wrprotect to do the deferred flush, before the
unprotect can even start, one more mutex in the mm to take in all
callers of change_protection_range with the mmap_lock for reading may
be enough.

Thanks,
Andrea

2020-12-20 05:11:57

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Sat, Dec 19, 2020 at 6:49 PM Andrea Arcangeli <[email protected]> wrote:
>
> On Sat, Dec 19, 2020 at 06:01:39PM -0800, Andy Lutomirski wrote:
> > I missed the beginning of this thread, but it looks to me like
> > userfaultfd changes PTEs with not locking except mmap_read_lock(). It
>
> There's no mmap_read_lock, I assume you mean mmap_lock for reading.

Yes.

>
> The ptes are changed always with the PT lock, in fact there's no
> problem with the PTE updates. The only difference with mprotect
> runtime is that the mmap_lock is taken for reading. And the effect
> contested for this change doesn't affect the PTE, but supposedly the
> tlb flushing deferral.

Can you point me at where the lock ends up being taken in this path?
I apparently missed it somewhere.

> Anyway to wait the wrprotect to do the deferred flush, before the
> unprotect can even start, one more mutex in the mm to take in all
> callers of change_protection_range with the mmap_lock for reading may
> be enough.

I'll read the code again tomorrow.

2020-12-21 18:07:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

Hello,

On Sat, Dec 19, 2020 at 09:08:55PM -0800, Andy Lutomirski wrote:
> On Sat, Dec 19, 2020 at 6:49 PM Andrea Arcangeli <[email protected]> wrote:
> > The ptes are changed always with the PT lock, in fact there's no
> > problem with the PTE updates. The only difference with mprotect
> > runtime is that the mmap_lock is taken for reading. And the effect
> > contested for this change doesn't affect the PTE, but supposedly the
> > tlb flushing deferral.
>
> Can you point me at where the lock ends up being taken in this path?

pte_offset_map_lock in change_pte_range, as in mprotect, no difference.

As I suspected on my follow up, the bug described wasn't there, but
I'll look at the new theory posted.

Thanks,
Andrea

2020-12-21 18:25:47

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Mon, Dec 21, 2020 at 10:04 AM Andrea Arcangeli <[email protected]> wrote:
>
> Hello,
>
> On Sat, Dec 19, 2020 at 09:08:55PM -0800, Andy Lutomirski wrote:
> > On Sat, Dec 19, 2020 at 6:49 PM Andrea Arcangeli <[email protected]> wrote:
> > > The ptes are changed always with the PT lock, in fact there's no
> > > problem with the PTE updates. The only difference with mprotect
> > > runtime is that the mmap_lock is taken for reading. And the effect
> > > contested for this change doesn't affect the PTE, but supposedly the
> > > tlb flushing deferral.
> >
> > Can you point me at where the lock ends up being taken in this path?
>
> pte_offset_map_lock in change_pte_range, as in mprotect, no difference.
>
> As I suspected on my follow up, the bug described wasn't there, but
> I'll look at the new theory posted.

Indeed.