Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Date:   Wed, 21 Dec 2022 17:10:08 -0500
From:   Peter Xu <peterx@redhat.com>
To:     Mike Kravetz <mike.kravetz@oracle.com>
Cc:     James Houghton <jthoughton@google.com>,
        Muchun Song <songmuchun@bytedance.com>,
        David Hildenbrand <david@redhat.com>,
        David Rientjes <rientjes@google.com>,
        Axel Rasmussen <axelrasmussen@google.com>,
        Mina Almasry <almasrymina@google.com>,
        Zach O'Keefe <zokeefe@google.com>,
        Manish Mishra <manish.mishra@nutanix.com>,
        Naoya Horiguchi <naoya.horiguchi@nec.com>,
        "Dr . David Alan Gilbert" <dgilbert@redhat.com>,
        "Matthew Wilcox (Oracle)" <willy@infradead.org>,
        Vlastimil Babka <vbabka@suse.cz>,
        Baolin Wang <baolin.wang@linux.alibaba.com>,
        Miaohe Lin <linmiaohe@huawei.com>,
        Yang Shi <shy828301@gmail.com>,
        Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org,
        linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2 33/47] userfaultfd: add
 UFFD_FEATURE_MINOR_HUGETLBFS_HGM
Message-ID: <Y6OEQB3dLSa083F6@x1n>
References: <20221021163703.3218176-1-jthoughton@google.com>
 <20221021163703.3218176-34-jthoughton@google.com>
 <Y3VkIdVKRuq+fO0N@x1n>
 <CADrL8HXixUPyTVmYMiwc11Ot5sDMsA3x7VhgXQjimJ93MSZihA@mail.gmail.com>
 <Y6NdN2ADVCcK70ym@x1n>
 <CADrL8HXqE3s4ckxh0OU5onkhystj=1jMTS+S7GFeiO+kwBo0QQ@mail.gmail.com>
 <Y6N9G0Y2j98V8Pnz@monkey>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <Y6N9G0Y2j98V8Pnz@monkey>
Precedence: bulk

On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> On 12/21/22 15:21, James Houghton wrote:
> > On Wed, Dec 21, 2022 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > James,
> > >
> > > On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> > > > On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > > > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > > > > enable HGM. Userspace can check if the feature exists in
> > > > > > uffdio_api.features, and if it does not exist, the kernel does not
> > > > > > support and therefore did not enable HGM.
> > > > > >
> > > > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > > >
> > > > > It's still slightly a pity that this can only be enabled by an uffd context
> > > > > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> > > >
> > > > The idea here is that, for applications that can conceivably benefit
> > > > from HGM, we have a mechanism for enabling it for that application. So
> > > > this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> > > > prefer this approach over something more general like MADV_ENABLE_HGM
> > > > or something.
> > >
> > > Sorry to get back to this very late - I know this has been discussed since
> > > the very early stage of the feature, but is there any reasoning behind?
> > >
> > > When I start to think seriously on applying this to process snapshot with
> > > uffd-wp I found that the minor mode trick won't easily play - normally
> > > that's a case where all the pages were there mapped huge, but when the app
> > > wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
> > > pages, probably some size that the user can specify.  It'll be non-trivial
> > > to enable HGM during that phase using MINOR mode because in that case the
> > > pages are all mapped.
> > >
> > > For the long term, I am just still worried the current interface is still
> > > not as flexible.
> > 
> > Thanks for bringing this up, Peter. I think the main reason was:
> > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > and is not supported.
> 
> IIRC, I think we wanted to initially limit the usage to the very
> specific use case (live migration).  The idea is that we could then
> expand usage as more use cases came to light.
> 
> Another good thing is that userfaultfd has versioning built into the
> API.  Thus a user can determine if HGM is enabled in their running
> kernel.

I'm not too worried about this one.  AFAIU, if there is any way to enable
hgm, then the user can simply try enabling it on a test vma, just like an
app does when it wants to detect whether a new madvise() is present on the
current host OS.
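
To illustrate the kind of probing I mean, here is a rough sketch (untested
against any hgm kernel, of course).  probe_madvise() is a made-up helper
name, and the advice value to pass in is left to the caller because no hgm
madvise exists today.  It also assumes that an unknown advice fails with
EINVAL, and a real probe would need a test vma of the right type (a hugetlb
mapping), since EINVAL can also mean "not valid for this vma":

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: return 1 if the running kernel accepts `advice`
 * on a throwaway anonymous page, 0 if it rejects it with EINVAL, and
 * -1 if the probe itself could not be set up or failed differently.
 */
static int probe_madvise(int advice)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;
	int ret, err;

	if (page <= 0)
		return -1;

	/* The test vma: one private anonymous page. */
	p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	ret = madvise(p, page, advice);
	err = errno;	/* save before munmap() can clobber it */
	munmap(p, page);

	if (ret == 0)
		return 1;
	return err == EINVAL ? 0 : -1;
}
```

An app would call this once at startup and fall back to the old behavior
when it does not return 1.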

Besides, I'm wondering whether something like /sys/kernel/mm/hugepages/hgm
would work too.
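
From the userspace side that could be as simple as the sketch below.  The
file name is purely hypothetical (no released kernel has it); I am only
assuming it would sit in the existing /sys/kernel/mm/hugepages/ directory
and hold a 0/1 value.  read_knob() is a made-up helper:

```c
#include <stdio.h>

/* Hypothetical path; no such file exists in any released kernel. */
#define HGM_KNOB "/sys/kernel/mm/hugepages/hgm"

/*
 * Read an integer sysfs-style knob.  Returns the value read, or -1 if
 * the file is missing or cannot be parsed (e.g. a kernel without the
 * knob at all).
 */
static int read_knob(const char *path)
{
	FILE *f = fopen(path, "r");
	int val;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

read_knob(HGM_KNOB) returning -1 would then simply mean "no hgm on this
host".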

> 
> > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > allowed as of this patch series, but it could be allowed in the
> > future. To add support in the same way as this series, we would add
> > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > having to add another feature isn't great; is this what you're
> > concerned about?
> > 
> > Considering MADV_ENABLE_HGM...
> > 1. If a user provides this, then the contract becomes: "the kernel may
> > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > high-granularities, provided the support exists", but it becomes
> > unclear to userspace to know what's supported and what isn't.
> > 2. We would then need to keep track if a user explicitly enabled it,
> > or if it got enabled automatically in response to memory poison, for
> > example. Not a big problem, just a complication. (Otherwise, if HGM
> > got enabled for poison, suddenly userspace would be allowed to do
> > things it wasn't allowed to do before.)

We could alternatively have two flags for each vma: (a) hgm_advised and (b)
hgm_enabled.  (a) always sets (b) but not vice versa.  We can limit poison
to set (b) only.  For this patchset, it can be all about (a).
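
Roughly something like the sketch below.  All names here (the flag bits,
struct vma_sketch, the helpers) are invented for illustration and are not
taken from the series; it only shows the intended relationship between the
two flags:

```c
#include <stdbool.h>

/* Hypothetical per-vma flag bits, named after (a) and (b). */
#define HGM_ADVISED (1u << 0)	/* (a) userspace explicitly asked for hgm */
#define HGM_ENABLED (1u << 1)	/* (b) high-granularity mappings may exist */

/* Stand-in for the relevant part of a vma; not the real struct. */
struct vma_sketch {
	unsigned int hgm_flags;
};

/* Explicit user request: (a) always sets (b) as well. */
static void hgm_advise(struct vma_sketch *vma)
{
	vma->hgm_flags |= HGM_ADVISED | HGM_ENABLED;
}

/* Memory poison path: sets (b) only, never (a). */
static void hgm_enable_for_poison(struct vma_sketch *vma)
{
	vma->hgm_flags |= HGM_ENABLED;
}

/* High-granularity operations from userspace are gated on (a). */
static bool hgm_user_ops_allowed(const struct vma_sketch *vma)
{
	return vma->hgm_flags & HGM_ADVISED;
}
```

With that, poison enabling hgm on a vma would not suddenly allow userspace
to do things it wasn't allowed to do before.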

> > 3. This API makes sense for enabling HGM for something outside of
> > userfaultfd, like MADV_DONTNEED.
> 
> I think #3 is key here.  Once we start applying HGM to things outside
> userfaultfd, then more thought will be required on APIs.  The API is
> somewhat limited by design until the basic functionality is in place.

Mike, could you elaborate on what the major concern is with having hgm
used outside the uffd and live migration use cases?

I feel like I'm missing something here.  I can understand that we want to
limit the usage to when the user explicitly asks for hgm, because we want
to keep the old behavior intact.  However, if we want another way to enable
hgm, it'll still need a knob anyway, even outside uffd, and I thought that
would serve the same purpose, or maybe not?

Thanks,

-- 
Peter Xu