From: James Houghton
Date: Tue, 7 Feb 2023 14:46:04 -0800
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
To: Peter Xu
Cc: Mike Kravetz, David Hildenbrand, Muchun Song, David Rientjes,
    Axel Rasmussen, Mina Almasry, "Zach O'Keefe", Manish Mishra,
    Naoya Horiguchi, "Dr. David Alan Gilbert",
    "Matthew Wilcox (Oracle)", Vlastimil Babka, Baolin Wang,
    Miaohe Lin, Yang Shi, Andrew Morton, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org

> Here is the result: [1] (sorry it took a little while heh). The
> implementation of the "RFC v1" way is pretty horrible[2] (and this
> implementation probably has bugs anyway; it doesn't account for the
> folio_referenced() problem).
>
> Matthew is trying to solve the same problem with THPs right now: [3].
> I haven't figured out how we can apply Matthew's approach to HGM
> right now, but there probably is a way. (If we left the mapcount
> increment bits in the same place, we couldn't just check the
> hstate-level PTE; it would have already been made present.)
>
> We could:
> - use the THP-like way and tolerate ~1 second collapses

Another thought here. We don't necessarily *need* to collapse the page
table mappings between mmu_notifier_invalidate_range_start() and
mmu_notifier_invalidate_range_end(): the pfns aren't changing, we
aren't punching any holes, and we aren't changing permission bits. If
we had an MMU notifier that simply informed KVM that we collapsed the
page tables *after* we finished collapsing, then it would be ok for
hugetlb_collapse() to be slow. (A rough sketch of what that ordering
could look like is below, after the sign-off.)

If this MMU notifier is something that makes sense, it probably
applies to MADV_COLLAPSE for THPs as well.

> - use the (non-RFC) v1 way and tolerate the migration/smaps differences
> - use the RFC v1 way and tolerate the complicated mapcount accounting
> - flesh out [3] and see if it can be applied to HGM nicely
>
> I'm happy to go with any of these approaches.
>
> [1]: https://pastebin.com/raw/hJzFJHiD
> [2]: https://github.com/48ca/linux/commit/4495f16a09b660aff44b3edcc125aa3a3df85976
> [3]: https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/

- James
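
P.S. To make the "notify only after the collapse" idea above a bit more
concrete, here is a rough sketch. Everything in it is hypothetical:
collapse_hugetlb_range() and mmu_notifier_collapse() are made-up names
(no such notifier exists upstream), and the hugetlb_collapse()
prototype is only a guess at the routine from [2].

/*
 * Hypothetical sketch only.  mmu_notifier_collapse() and the op it
 * would invoke do not exist; hugetlb_collapse() stands in for the HGM
 * collapse routine from [2], with an assumed signature.
 */
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

int hugetlb_collapse(struct mm_struct *mm, unsigned long start,
                     unsigned long end);           /* assumed, from [2] */
void mmu_notifier_collapse(struct mm_struct *mm, unsigned long start,
                           unsigned long end);     /* hypothetical */

static int collapse_hugetlb_range(struct mm_struct *mm,
                                  unsigned long start, unsigned long end)
{
        int ret;

        /*
         * No invalidate_range_start()/end() bracket around the
         * collapse: the pfns, the mapped range, and the permission
         * bits are unchanged, so nothing a secondary MMU has cached
         * becomes wrong while this (potentially slow) walk runs, and
         * KVM never has to block faults on the range.
         */
        ret = hugetlb_collapse(mm, start, end);
        if (ret)
                return ret;

        /*
         * Only *after* the page tables have been collapsed, tell
         * secondary MMUs about it so they can rebuild their shadow
         * mappings at the larger granularity at their leisure.
         */
        mmu_notifier_collapse(mm, start, end);
        return 0;
}

The only point of the sketch is the ordering: the start()/end() bracket
goes away entirely, and the single notification fires once the collapse
has already finished.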