Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
Date:   Fri, 27 Nov 2020 08:51:17 -0500
From:   Peter Xu <peterx@redhat.com>
To:     Matthew Wilcox <willy@infradead.org>
Cc:     linux-kernel@vger.kernel.org, linux-mm@kvack.org,
        Andrew Morton <akpm@linux-foundation.org>,
        Hugh Dickins <hughd@google.com>,
        Andrea Arcangeli <aarcange@redhat.com>,
        Mike Rapoport <rppt@linux.vnet.ibm.com>
Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on
 reads
Message-ID: <20201127135117.GB6573@xz-x1>
References: <20201126222359.8120-1-peterx@redhat.com>
 <20201127122224.GX4327@casper.infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20201127122224.GX4327@casper.infradead.org>
Precedence: bulk

Matthew,

Thanks for the review comments.

On Fri, Nov 27, 2020 at 12:22:24PM +0000, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote:
> > For missing mode uffds, fault around does not help because if the page cache
> > existed, then the page should be there already.  If the page cache is not
> > there, nothing else we can do, either.  If the fault-around code is destined to
> > be helpless for userfault-missing vmas, then ideally we can skip it.
> 
> But it might have been faulted into the cache by another task, so skipping
> it is bad.

Is there a real use case of such?

I thought about it, at least for qemu postcopy it's not working like that.  I
feel like CRIU neither, but Mike could correct me.

Even if there's a case of that, for example, if task A tries to copy the pages
over to a tmpfs file and task B accesses the pages too with uffd missing
registered, the ideal solution is still that when a page is copied task A can
installs the pte for the current task, rather than relying on the fault around
on reads, IMHO.  I don't know whether there's a way to do it, though.

> 
> > For wr-protected mode uffds, errornously fault in those pages around could lead
> > to threads accessing the pages without uffd server's awareness.  For example,
> > when punching holes on uffd-wp registered shmem regions, we'll first try to
> > unmap all the pages before evicting the page cache but without locking the
> > page (please refer to shmem_fallocate(), where unmap_mapping_range() is called
> > before shmem_truncate_range()).  When fault-around happens near a hole being
> > punched, we might errornously fault in the "holes" right before it will be
> > punched.  Then there's a small window before the page cache was finally
> > dropped, and after the page will be writable again (NOTE: the uffd-wp protect
> > information is totally lost due to the pre-unmap in shmem_fallocate(), so the
> > page can be writable within the small window).  That's severe data loss.
> 
> Sounds like you have a missing page_mkwrite implementation.

I think it's slightly different issue, because shmem may not know whether the
page should be allowed to write or not.  AFAIU, uffd-wp is designed and
implemented in a way that the final protect information is kept within ptes so
e.g. vmas does not have a solid understanding on whether a page should be
write-protected or not (so VM_UFFD_WP in vma flags is a hint only, and also
that's why we won't abuse creating tons of vmas).  We tried hard to keep the
pte information on track, majorly _PAGE_UFFD_WP, alive even across swap in/outs
and migrations.  If pte is lost, we can't get that information from page cache,
at least for now.

> 
> > This patch comes from debugging a data loss issue when working on the uffd-wp
> > support on shmem/hugetlbfs.  I posted this out for early review and comments,
> > but also because it should already start to benefit missing mode userfaultfd to
> > avoid trying to fault around on reads.
> 
> A measurable difference?

Nop.  I didn't measure missing case.  It should really depend on whether
there's a use case of such, imho.  If there's, then we may still want that
(however uffd-wp might be a different story, as discussed above).  Otherwise
maybe we should just avoid doing that for all.

The other thing that led me to this patch (rather than only check against
uffd-wp, for which case I'll just keep that small patch in my own tree until
posting the uffd-wp series) is I thought about when the community would like to
introduce things like "minor-faults" for userfaultfd.  In that case we'd need
to be careful too here otherwise we may lose some minor faults.  And from that
pov I'm thinking whether it's easier to just forbid kernel-side tricks like
this for uffd in general, since as I also stated previously that IMHO if a
memory region is registered with userfaultfd, maybe it's always good to
"always" rely on the userspace on resolving page faults, so that we don't need
to debug things like uffd-wp data crash as userfaultfd evolves, because that'll
be non-trivial task at least to me...

Thanks,

-- 
Peter Xu