Date:   Thu, 2 Feb 2023 15:22:50 -0500
From:   Peter Xu <peterx@redhat.com>
To:     Yang Shi <shy828301@gmail.com>
Cc:     David Stevens <stevensd@chromium.org>,
        David Hildenbrand <david@redhat.com>,
        "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
        linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
        linux-kernel@vger.kernel.org, Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH] mm/khugepaged: skip shmem with armed userfaultfd
Message-ID: <Y9wbmuEq0QjZATE6@x1n>
References: <20230201034137.2463113-1-stevensd@google.com>
 <CAHbLzkpbV2LOoTpWwSOS+UUsYJiZX4vO78jZSr6xmpAGNGoH5w@mail.gmail.com>
 <Y9rRCN9EfqzwYnDG@x1n>
 <CAD=HUj4FjuLpihQLGLzUu82vr5fdFFxfnyKNhApC6L67F5iV4g@mail.gmail.com>
 <CAHbLzko_cXrOQCsQC3g_id06Jkcg3=9dsVe+MwuzAh+iC9dDDA@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <CAHbLzko_cXrOQCsQC3g_id06Jkcg3=9dsVe+MwuzAh+iC9dDDA@mail.gmail.com>
Precedence: bulk

On Thu, Feb 02, 2023 at 09:40:12AM -0800, Yang Shi wrote:
> On Thu, Feb 2, 2023 at 1:56 AM David Stevens <stevensd@chromium.org> wrote:
> >
> > On Thu, Feb 2, 2023 at 5:52 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Wed, Feb 01, 2023 at 09:36:37AM -0800, Yang Shi wrote:
> > > > On Tue, Jan 31, 2023 at 7:42 PM David Stevens <stevensd@chromium.org> wrote:
> > > > >
> > > > > From: David Stevens <stevensd@chromium.org>
> > > > >
> > > > > Collapsing memory in a vma that has an armed userfaultfd results in
> > > > > zero-filling any missing pages, which breaks user-space paging for those
> > > > > filled pages. Avoid khugepage bypassing userfaultfd by not collapsing
> > > > > pages in shmem reached via scanning a vma with an armed userfaultfd if
> > > > > doing so would zero-fill any pages.
> > > > >
> > > > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > > > ---
> > > > >  mm/khugepaged.c | 35 ++++++++++++++++++++++++-----------
> > > > >  1 file changed, 24 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index 79be13133322..48e944fb8972 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -1736,8 +1736,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > > > >   *    + restore gaps in the page cache;
> > > > >   *    + unlock and free huge page;
> > > > >   */
> > > > > -static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > > -                        struct file *file, pgoff_t start,
> > > > > +static int collapse_file(struct mm_struct *mm, struct vm_area_struct *vma,
> > > > > +                        unsigned long addr, struct file *file, pgoff_t start,
> > > > >                          struct collapse_control *cc)
> > > > >  {
> > > > >         struct address_space *mapping = file->f_mapping;
> > > > > @@ -1784,6 +1784,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > >          * be able to map it or use it in another way until we unlock it.
> > > > >          */
> > > > >
> > > > > +       if (is_shmem)
> > > > > +               mmap_read_lock(mm);
> > > >
> > > > If you release mmap_lock before then reacquire it here, the vma is not
> > > > trusted anymore. It is not safe to use the vma anymore.
> > > >
> > > > Since you already read uffd_was_armed before releasing mmap_lock, so
> > > > you could pass it directly to collapse_file w/o dereferencing vma
> > > > again. The problem may be false positive (not userfaultfd armed
> > > > anymore), but it should be fine. Khugepaged could collapse this area
> > > > in the next round.
> >
> > I didn't notice this race condition. It should be possible to adapt
> > hugepage_vma_revalidate for this situation, or at least to create an
> > analogous situation.
> 
> But once you release mmap_lock, the vma still may be changed,
> revalidation just can guarantee the vma is valid when you hold the
> mmap_lock unless mmap_lock is held for the whole collapse or at some
> point that the collapse doesn't have impact on userfaultfd anymore. We
> definitely don't want to hold mmap_lock for the whole collapse, but I
> don't know whether we could release it earlier or not due to my
> limited knowledge of userfaultfd.

I agree with Yang; I don't quickly see how that'll resolve the issue.

> 
> >
> > > Unfortunately that may not be enough.. because it's also possible that it
> > > reads uffd_armed==false, released mmap_sem, passed it over to the scanner,
> > > but then when scanning the file uffd got armed in parallel.
> > >
> > > There's another problem where the current vma may not have uffd armed,
> > > khugepaged may think it has nothing to do with uffd and moved on with
> > > collapsing, but actually it's armed in another vma of either the current mm
> > > or just another mm's.
> > >
> > > It seems non-trivial too to safely check this across all the vmas, let's
> > > say, by a reverse walk - the only safe way is to walk all the vmas and take
> > > the write lock for every mm, but that's not only too heavy but also merely
> > > impossible to always make it right because of deadlock issues and on the
> > > order of mmap write lock to take..
> > >
> > > So far what I can still think of is, if we can extend shmem_inode_info and
> > > have a counter showing how many uffd has been armed.  It can be a generic
> > > counter too (e.g. shmem_inode_info.collapse_guard_counter) just to avoid
> > > the page cache being collapsed under the hood, but I am also not aware of
> > > whether it can be reused by other things besides uffd.
> > >
> > > Then when we do the real collapsing, say, when:
> > >
> > >                 xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > >                 xas_store(&xas, hpage);
> > >                 xas_unlock_irq(&xas);
> > >
> > > We may need to make sure that counter keeps static (probably by holding
> > > some locks during the process) and we only do that last phase collapse if
> > > counter==0.
> > >
> > > Similar checks in this patch can still be done, but that'll only service as
> > > a role of failing faster before the ultimate check on the uffd_armed
> > > counter.  Otherwise I just don't quickly see how to avoid race conditions.
> >
> > I don't know if it's necessary to go that far. Userfaultfd plus shmem
> > is inherently brittle. It's possible for userspace to bypass
> > userfaultfd on a shmem mapping by accessing the shmem through a
> > different mapping or simply by using the write syscall.

Yes, this is possible, but those are user-visible operations, whether it is a
read()/write() from another process or an access through mmap()ed memory.
Khugepaged merges ptes behind the user's back, which is something the user
can hardly control.

AFAICT file-based uffd missing mode currently always works that way.  IOW
the user should have full control of the underlying file/inode to make sure
nothing surprising happens.  Otherwise I don't really see how missing mode
can work reliably, since it's page cache based.

> > It might be sufficient to say that the kernel won't directly bypass a
> > VMA's userfaultfd to collapse the underlying shmem's pages. Although on
> > the other hand, I guess it's not great for the presence of an unused
> > shmem mapping lying around to cause khugepaged to have user-visible
> > side effects.

Maybe it works for your use case already, for example, if in your app the
shmem is only ever mapped once?  However that doesn't seem like a complete
solution to me.

Nothing prevents another mapping from being established, and as soon as
that happens it'll stop working, because khugepaged can notice the new
mm/vma, which isn't registered with uffd at all, and think it's a good idea
to collapse the shmem page cache again.  Uffd will then silently fail in
some other case, even if not immediately in your current app/reproducer.

Again, I don't think what I proposed above is anything close to good.. It'll
literally disable any collapsing for a shmem inode as long as any small
portion of the inode's mapping address space is registered with uffd by any
process.  I just don't see any easier approach so far.
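
To make the counter idea a bit more concrete, here is a rough userspace
model of it.  This is not kernel code and not a patch: struct inode_model
and all the *_model() helpers are made up for illustration, and the pthread
mutex only stands in for whatever lock would keep the counter stable during
the final phase of the collapse.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for shmem_inode_info plus the proposed counter. */
struct inode_model {
	pthread_mutex_t lock;        /* assumed lock keeping the counter stable */
	int collapse_guard_counter;  /* number of armed uffd registrations */
	bool collapsed;              /* true once the range was collapsed */
};

void inode_model_init(struct inode_model *ino)
{
	pthread_mutex_init(&ino->lock, NULL);
	ino->collapse_guard_counter = 0;
	ino->collapsed = false;
}

/* Models arming uffd on any vma, in any mm, that maps this inode. */
void uffd_register_model(struct inode_model *ino)
{
	pthread_mutex_lock(&ino->lock);
	ino->collapse_guard_counter++;
	pthread_mutex_unlock(&ino->lock);
}

/* Models disarming uffd on one of those vmas. */
void uffd_unregister_model(struct inode_model *ino)
{
	pthread_mutex_lock(&ino->lock);
	assert(ino->collapse_guard_counter > 0);
	ino->collapse_guard_counter--;
	pthread_mutex_unlock(&ino->lock);
}

/*
 * Models the last phase of the collapse: the counter is checked and the
 * collapse is committed under the same lock, so a registration cannot
 * slip in between the check and the commit.
 */
bool try_collapse_model(struct inode_model *ino)
{
	bool ok;

	pthread_mutex_lock(&ino->lock);
	ok = (ino->collapse_guard_counter == 0);
	if (ok)
		ino->collapsed = true;
	pthread_mutex_unlock(&ino->lock);
	return ok;
}
```

Note how a single registration anywhere blocks the collapse for the whole
inode in this model, which is exactly the coarseness I'm unhappy about.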

Thanks,

-- 
Peter Xu