To: Oscar Salvador, Mike Kravetz
Cc: Muchun Song, Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20210208103812.32056-1-osalvador@suse.de> <20210208103812.32056-3-osalvador@suse.de>
From: David Hildenbrand
Organization: Red Hat GmbH
Subject: Re: [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free hugetlb pages
Message-ID: <9ed946df-9c6c-9a4d-4be9-2f32809974f7@redhat.com>
Date: Wed, 10 Feb 2021 09:23:59 +0100
In-Reply-To: <20210208103812.32056-3-osalvador@suse.de>

On 08.02.21 11:38, Oscar Salvador wrote:
> Free hugetlb pages are trickier to handle because, in order to
> guarantee no userspace application disruption, we need to replace the
> current free hugepage with a new one.
>
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugetlb page, and if it
> succeeds, it will dissolve the old one.
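
(Just to confirm my understanding of the flow described above -- the
mm/hugetlb.c hunk is not quoted in this mail, so the following is only
a sketch of what I *think* the patch does; the helper names and the
error handling are my approximation, not the actual patch code:)

/*
 * Sketch only: allocate the replacement first, and only dissolve the
 * old free huge page once the new one is in place, so the pool size
 * (nr_pages) stays stable and userspace never sees a shrunken pool.
 */
bool alloc_and_dissolve_huge_page(struct page *page)
{
	struct hstate *h = page_hstate(page);
	int nid = page_to_nid(page);
	struct page *new_page;

	/* 1) Try to get a fresh huge page for the pool, same node. */
	new_page = alloc_fresh_huge_page(h, htlb_alloc_mask(h), nid,
					 NULL, NULL);
	if (!new_page)
		return false;

	/* 2) Dissolve the old free huge page back to the buddy. */
	if (dissolve_free_huge_page(page)) {
		/* Dissolving failed (e.g., page got reused); release the fresh page. */
		put_page(new_page);
		return false;
	}
	return true;
}
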
Thanks for looking into this! Can we move this patch to #1 in the series? It is the easier case.

I also wonder if the memory unplug path should at least try to keep nr_pages stable by allocating a new page if required, printing a warning if that fails (after all, we're messing with something configured by the admin - "nr_pages"). Note that gigantic pages are special (below).

> Signed-off-by: Oscar Salvador
> ---
>  include/linux/hugetlb.h |  6 ++++++
>  mm/compaction.c         | 11 +++++++++++
>  mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
>  3 files changed, 52 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ebca2ef02212..f81afcb86e89 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -505,6 +505,7 @@ struct huge_bootmem_page {
>  	struct hstate *hstate;
>  };
>
> +bool alloc_and_dissolve_huge_page(struct page *page);
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> @@ -773,6 +774,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
>  #else /* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
>
> +static inline bool alloc_and_dissolve_huge_page(struct page *page)
> +{
> +	return false;
> +}
> +
>  static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>  					   unsigned long addr,
>  					   int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 89cd2e60da29..7969ddc10856 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			low_pfn += compound_nr(page) - 1;
>  			goto isolate_success_no_list;
>  		}
> +		} else {

This reads simpler as

	} else if (alloc_and_dissolve_huge_page(page)) {
		...

> +			/*
> +			 * Free hugetlb page. Allocate a new one and
> +			 * dissolve this one if we succeed.
> +			 */
> +			if (alloc_and_dissolve_huge_page(page)) {
> +				unsigned long order = buddy_order_unsafe(page);
> +
> +				low_pfn += (1UL << order) - 1;
> +				continue;
> +			}

Note that there is a very ugly corner case we will have to handle gracefully (I think also in patch #1):

Assume you allocated a gigantic page (and assume that we are not using CMA for gigantic pages, for simplicity). Assume you want to allocate another one.

alloc_pool_huge_page()->...->alloc_contig_pages() will stumble over the first allocated page. It will try to alloc_and_dissolve_huge_page() the existing gigantic page. To do that, it will call alloc_pool_huge_page()->...->alloc_contig_pages() ... and so on. Bad.

We really don't want to mess with gigantic pages (migrate, dissolve) while allocating a gigantic page. I think the easiest (and cleanest) way forward is to not mess (isolate, migrate, dissolve) with gigantic pages at all.

Gigantic pages are not movable, so they won't be placed on random CMA / ZONE_MOVABLE.

Some hstate_is_gigantic(h) calls (maybe inside alloc_and_dissolve_huge_page()?) along with a nice comment might be good enough to avoid having to pass down some kind of alloc_contig context.
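
Something like the following (completely untested, just to illustrate
the bail-out I have in mind; where exactly it lives is secondary):

bool alloc_and_dissolve_huge_page(struct page *page)
{
	struct hstate *h = page_hstate(page);

	/*
	 * Do not isolate/migrate/dissolve gigantic pages: allocating a
	 * replacement would recurse into alloc_contig_pages(), which
	 * can stumble over the very gigantic page we are trying to
	 * replace. Gigantic pages are unmovable and never end up on
	 * random CMA / ZONE_MOVABLE, so skipping them here is safe.
	 */
	if (hstate_is_gigantic(h))
		return false;

	/* ... allocate a fresh huge page and dissolve the old one, as in the patch ... */
	return true;
}
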
I even think that should be handled there (the main issue is that, in contrast to CMA, plain alloc_contig_pages() has no memory of which parts were allocated, and will simply try re-allocating what it previously allocated and never freed - which is usually fine, unless we're dealing with such special cases).

Apart from that, not messing with gigantic pages feels like the right approach (allocating/migrating gigantic pages is just horribly slow and most probably not worth it anyway).

-- 
Thanks,

David / dhildenb