Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
MIME-Version: 1.0
References: <20230105101844.1893104-1-jthoughton@google.com>
 <20230105101844.1893104-22-jthoughton@google.com> <Y78+/wleTEC3gyqu@x1n>
In-Reply-To: <Y78+/wleTEC3gyqu@x1n>
From:   James Houghton <jthoughton@google.com>
Date:   Thu, 12 Jan 2023 09:06:48 -0500
Message-ID: <CADrL8HU-prbfx2xxXCi0RPznp5DB68-NjqqmdM+4aUeUUURhiA@mail.gmail.com>
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
To:     Peter Xu <peterx@redhat.com>
Cc:     Mike Kravetz <mike.kravetz@oracle.com>,
        Muchun Song <songmuchun@bytedance.com>,
        David Hildenbrand <david@redhat.com>,
        David Rientjes <rientjes@google.com>,
        Axel Rasmussen <axelrasmussen@google.com>,
        Mina Almasry <almasrymina@google.com>,
        "Zach O'Keefe" <zokeefe@google.com>,
        Manish Mishra <manish.mishra@nutanix.com>,
        Naoya Horiguchi <naoya.horiguchi@nec.com>,
        "Dr . David Alan Gilbert" <dgilbert@redhat.com>,
        "Matthew Wilcox (Oracle)" <willy@infradead.org>,
        Vlastimil Babka <vbabka@suse.cz>,
        Baolin Wang <baolin.wang@linux.alibaba.com>,
        Miaohe Lin <linmiaohe@huawei.com>,
        Yang Shi <shy828301@gmail.com>,
        Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org,
        linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote:
>
> James,
>
> On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> >               int mapcount = page_mapcount(page);
> >
> >               if (mapcount >= 2)
> > -                     mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > +                     mss->shared_hugetlb += hugetlb_pte_size(hpte);
> >               else
> > -                     mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > +                     mss->private_hugetlb += hugetlb_pte_size(hpte);
> >       }
> >       return 0;
>
> One thing interesting I found with hgm right now is mostly everything will
> be counted as "shared" here, I think it's because mapcount is accounted
> always to the huge page even if mapped in smaller sizes, so page_mapcount()
> to a small page should be huge too because the head page mapcount should be
> huge.  I'm curious the reasons behind the mapcount decision.
>
> For example, would that risk overflow with head_compound_mapcount?  One 1G
> page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> atomic_t.  Looks like it's possible.

The original mapcount approach was "if the hstate-level PTE is
present, increment the compound_mapcount by 1" (basically "if any of
the hugepage is mapped, increment the compound_mapcount by 1"), but
this was painful to implement, so I changed it to what it is now (each
new PT mapping increments the compound_mapcount by 1). I think you're
right, there is absolutely an overflow risk. :( I'm not sure what the
best solution is. I could just go back to the old approach.

Regarding when things are accounted in private_hugetlb vs.
shared_hugetlb, HGM shouldn't change that, because it only applies to
shared mappings (right now anyway). It seems like "private_hugetlb"
can include cases where the page is shared but has only one mapping,
in which case HGM will change it from "private" to "shared".

>
> Btw, are the small page* pointers still needed in the latest HGM design?
> Is there code taking care of disabling of hugetlb vmemmap optimization for
> HGM?  Or maybe it's not needed anymore for the current design?

The hugetlb vmemmap optimization can still be used with HGM, so there
is no code to disable it. We don't need small page* pointers either,
except for when we're dealing with mapping subpages, like in
hugetlb_no_page. Everything else can handle the hugetlb page as a
folio.

I hope we can keep compatibility with the vmemmap optimization while
solving the mapcount overflow risk.

Thanks Peter.
- James