Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Date:   Tue, 17 May 2022 17:16:11 +0800
From:   Muchun Song <songmuchun@bytedance.com>
To:     Oscar Salvador <osalvador@suse.de>
Cc:     corbet@lwn.net, mike.kravetz@oracle.com, akpm@linux-foundation.org,
        mcgrof@kernel.org, keescook@chromium.org, yzaikin@google.com,
        david@redhat.com, masahiroy@kernel.org, linux-doc@vger.kernel.org,
        linux-kernel@vger.kernel.org, linux-mm@kvack.org,
        duanxiongchun@bytedance.com, smuchun@gmail.com
Subject: Re: [PATCH v12 7/7] mm: hugetlb_vmemmap: add
 hugetlb_optimize_vmemmap sysctl
Message-ID: <YoNn2+8VG7XxQ20Y@FVFYT0MHHV2J.usts.net>
References: <20220516102211.41557-1-songmuchun@bytedance.com>
 <20220516102211.41557-8-songmuchun@bytedance.com>
 <YoNXm2c5fJq8luqf@localhost.localdomain>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <YoNXm2c5fJq8luqf@localhost.localdomain>
Precedence: bulk

On Tue, May 17, 2022 at 10:06:51AM +0200, Oscar Salvador wrote:
> On Mon, May 16, 2022 at 06:22:11PM +0800, Muchun Song wrote:
> > We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and
> > reboot the server to enable or disable the feature of optimizing vmemmap
> > pages associated with HugeTLB pages.  However, rebooting usually takes a
> > long time.  So add a sysctl to enable or disable the feature at runtime
> > without rebooting.  Why do we need this?  There are 3 use cases.
> > 
> > 1) The feature of minimizing overhead of struct page associated with each
> > HugeTLB is disabled by default without passing "hugetlb_free_vmemmap=on"
> > to the boot cmdline. When we (ByteDance) deliver the servers to the
> > users who want to enable this feature, they have to configure the grub
> > (change boot cmdline) and reboot the servers, whereas rebooting usually
> > takes a long time (we have thousands of servers).  It's a very bad
> > experience for the users.  So we need an approach to enable this feature
> > after rebooting. This is a use case in our practical environment.
> > 
> > 2) In some use cases HugeTLB pages are allocated 'on the fly'
> > instead of being pulled from the HugeTLB pool; those workloads would be
> > affected with this feature enabled.  Those workloads could be identified
> > by the characteristic that they never explicitly allocate huge pages
> > with 'nr_hugepages' but only set 'nr_overcommit_hugepages' and then let
> > the pages be allocated from the buddy allocator at fault time.  We can
> > confirm it is a real use case from the commit 099730d67417.  For those
> > workloads, the page fault time could be ~2x slower than before. We
> > suspect those users want to disable this feature if the system has enabled
> > this before and they don't think the memory savings benefit is enough to
> > make up for the performance drop.
> > 
> > 3) Suppose a workload which wants vmemmap pages to be optimized and a
> > workload which sets 'nr_overcommit_hugepages' and does not want the
> > extra overhead at fault time when the overcommitted pages are
> > allocated from the buddy allocator are deployed on the same server.
> > The user could enable this feature and set 'nr_hugepages' and
> > 'nr_overcommit_hugepages', then disable the feature.  In this case,
> > the overcommitted HugeTLB pages will not encounter the extra overhead
> > at fault time.
> 
> I am having issues parsing point 3), especially the first part.
> IIUC, you are saying we have two different kinds of workloads:
> 
> - one that wants to have hugetlb vmemmap pages optimized
> - one that wants to allocate hugetlb pages at fault time rather than
>   allocating them via /proc/..., but does not want to suffer the
>   overhead of optimizing the vmemmap pages when faulting them

Let me clarify the second workload.  Its defining trait is that it does
not want to suffer the overhead of optimizing the vmemmap pages at
fault time, not that it wants to allocate hugetlb pages at fault time.
It is different from the one in case 2).  This workload usually
configures 'nr_overcommit_hugepages' as well as 'nr_hugepages'.  If it
does not want to suffer the overhead of optimizing the vmemmap pages
when faulting pages (these must be overcommitted pages), the user
could follow the steps mentioned above.
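
The steps from case 3) could be sketched roughly as follows.  This is
only an illustration, not part of the patch: it assumes the new sysctl
is exposed as /proc/sys/vm/hugetlb_optimize_vmemmap next to the existing
nr_hugepages and nr_overcommit_hugepages files, and the function name
and the VM_DIR override are hypothetical (VM_DIR only exists so the
sequence can be dry-run against a scratch directory).

```shell
#!/bin/sh
# Hypothetical override; defaults to the real sysctl directory.
VM_DIR="${VM_DIR:-/proc/sys/vm}"

# usage: setup_mixed_hugetlb <nr_hugepages> <nr_overcommit_hugepages>
setup_mixed_hugetlb() {
    pool="$1"
    overcommit="$2"

    # 1. Enable the feature so that pages allocated into the pool
    #    get their vmemmap pages optimized.
    echo 1 > "$VM_DIR/hugetlb_optimize_vmemmap" || return 1

    # 2. Fill the persistent pool and set the overcommit limit.
    echo "$pool" > "$VM_DIR/nr_hugepages" || return 1
    echo "$overcommit" > "$VM_DIR/nr_overcommit_hugepages" || return 1

    # 3. Disable the feature again.  Pool pages that were already
    #    optimized are expected to stay optimized, while overcommitted
    #    pages allocated later at fault time should skip the extra
    #    overhead.
    echo 0 > "$VM_DIR/hugetlb_optimize_vmemmap" || return 1
}
```

On a real system this would be run as root, e.g.
"setup_mixed_hugetlb 512 64" (the numbers are arbitrary).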

> 
> Then you say the user could enable the optimization and allocate
> those pages via nr_hugepages, and then disable the feature.
> So, when we fault in those pages, the pages are already in the
> pool, right? And are already optimized.
>

I mean the overcommitted pages (which could be allocated at fault
time), as explained above.

Thanks.