2023-07-18 01:07:14

by Mike Kravetz

Subject: [PATCH v2 0/2] Fix hugetlb free path race with memory errors

During the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
HWPOISON hugepages", the race window addressed by this series was discovered:
https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/

Freeing a hugetlb page back to low level memory allocators is performed
in two steps.
1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
2) Outside lock, allocate vmemmap if necessary and call low level free
Between these two steps, the hugetlb page will appear as a normal
compound page. However, vmemmap for tail pages could be missing.
If a memory error occurs at this time, we could try to update page
flags in non-existent page structs.
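
Roughly, the sequence and the window look like this (a condensed sketch;
remove_hugetlb_folio() and update_and_free_hugetlb_folio() are the real
routines, but the surrounding code is simplified for illustration):

	spin_lock_irq(&hugetlb_lock);
	/* step 1: off hugetlb lists, hugetlb destructor cleared */
	remove_hugetlb_folio(h, folio, false);
	spin_unlock_irq(&hugetlb_lock);

	/*
	 * Race window: the folio now looks like a normal compound page,
	 * but if it was vmemmap optimized, the struct pages for its tail
	 * pages do not exist.  Memory error handling that runs here may
	 * try to set flags in one of those missing struct pages.
	 */

	/* step 2: allocate vmemmap if needed, then free to buddy/cma */
	update_and_free_hugetlb_folio(h, folio, false);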

A much more detailed description is in the first patch.

The first patch addresses the race window. However, it adds a
hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb
page free operation. This is sub-optimal but is hardly noticeable
on a mostly idle system (the normal case).

The second patch optimizes the update_and_free_pages_bulk routine to
take the hugetlb lock only once per bulk operation.
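
A sketch of the resulting structure of that routine (condensed from the
second patch; see it for the real code and error handling):

	/*
	 * Phase 1: walk the list and allocate any needed vmemmap
	 * (hugetlb_vmemmap_restore()), with no lock held.
	 */

	/* Phase 2: a single lock cycle to clear every destructor */
	spin_lock_irq(&hugetlb_lock);
	list_for_each_entry(page, list, lru)
		__clear_hugetlb_destructor(h, page_folio(page));
	spin_unlock_irq(&hugetlb_lock);

	/* Phase 3: free each page back to its low level allocator */
	list_for_each_entry_safe(page, t_page, list, lru)
		update_and_free_hugetlb_folio(h, page_folio(page), false);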

-> v2
- Used the more definitive method of checking folio_test_hugetlb to
determine if the destructor must be cleared.
- Added comment to clearly describe why and when we clear the
destructor in __update_and_free_hugetlb_folio.
- Clear destructor in hugetlb demote path.
- Do not send second patch to stable releases.

Mike Kravetz (2):
hugetlb: Do not clear hugetlb dtor until allocating vmemmap
hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles

mm/hugetlb.c | 128 ++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 100 insertions(+), 28 deletions(-)

--
2.41.0



2023-07-18 01:11:49

by Mike Kravetz

Subject: [PATCH v2 2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles

update_and_free_pages_bulk is designed to free a list of hugetlb pages
back to their associated lower level allocators. This may require
allocating vmemmap pages associated with each hugetlb page. The
hugetlb page destructor must be changed before pages are freed to lower
level allocators. However, the destructor must be changed under the
hugetlb lock. This means there is potentially one lock cycle per page.

Minimize the number of lock cycles in update_and_free_pages_bulk by:
1) allocating necessary vmemmap for all hugetlb pages on the list
2) taking the hugetlb lock and clearing the destructor for all pages on
   the list
3) freeing all pages on the list back to the low level allocators

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 38 ++++++++++++++++++++++++++++++++++----
1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4a910121a647..e6b780291539 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1856,13 +1856,43 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
{
struct page *page, *t_page;
- struct folio *folio;
+ bool clear_dtor = false;

+ /*
+ * First allocate required vmemmmap for all pages on list. If vmemmap
+ * can not be allocated, we can not free page to lower level allocator,
+ * so add back as hugetlb surplus page.
+ */
list_for_each_entry_safe(page, t_page, list, lru) {
- folio = page_folio(page);
- update_and_free_hugetlb_folio(h, folio, false);
- cond_resched();
+ if (HPageVmemmapOptimized(page)) {
+ if (hugetlb_vmemmap_restore(h, page)) {
+ spin_lock_irq(&hugetlb_lock);
+ add_hugetlb_folio(h, page_folio(page), true);
+ spin_unlock_irq(&hugetlb_lock);
+ } else
+ clear_dtor = true;
+ cond_resched();
+ }
+ }
+
+ /*
+ * If vmemmmap allocation performed above, then take lock to clear
+ * destructor of all pages on list.
+ */
+ if (clear_dtor) {
+ spin_lock_irq(&hugetlb_lock);
+ list_for_each_entry(page, list, lru)
+ __clear_hugetlb_destructor(h, page_folio(page));
+ spin_unlock_irq(&hugetlb_lock);
}
+
+ /*
+ * Free pages back to low level allocators. vmemmap and destructors
+ * were taken care of above, so update_and_free_hugetlb_folio will
+ * not need to take hugetlb lock.
+ */
+ list_for_each_entry_safe(page, t_page, list, lru)
+ update_and_free_hugetlb_folio(h, page_folio(page), false);
}

struct hstate *size_to_hstate(unsigned long size)
--
2.41.0


2023-07-18 01:13:44

by Mike Kravetz

Subject: [PATCH v2 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap

Freeing a hugetlb page and releasing base pages back to the underlying
allocator such as buddy or cma is performed in two steps:
- remove_hugetlb_folio() is called to remove the folio from hugetlb
lists, get a ref on the page and remove hugetlb destructor. This
all must be done under the hugetlb lock. After this call, the page
can be treated as a normal compound page or a collection of base
size pages.
- update_and_free_hugetlb_folio() is called to allocate vmemmap if
needed and the free routine of the underlying allocator is called
on the resulting page. We can not hold the hugetlb lock here.

One issue with this scheme is that a memory error could occur between
these two steps. In this case, the memory error handling code treats
the old hugetlb page as a normal compound page or collection of base
pages. It will then try to SetPageHWPoison(page) on the page with an
error. If the page with error is a tail page without vmemmap, a write
error will occur when trying to set the flag.

Address this issue by modifying remove_hugetlb_folio() and
update_and_free_hugetlb_folio() such that the hugetlb destructor is not
cleared until after allocating vmemmap. Since clearing the destructor
requires holding the hugetlb lock, the clearing is done in
remove_hugetlb_folio() if the vmemmap is present. This saves a
lock/unlock cycle. Otherwise, the destructor is cleared in
update_and_free_hugetlb_folio() after allocating vmemmap.
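
In code terms, the clearing ends up in the two places below (a condensed
view; the diff that follows is the authoritative version):

	/* __remove_hugetlb_folio(), called under hugetlb_lock */
	if (!folio_test_hugetlb_vmemmap_optimized(folio))
		__clear_hugetlb_destructor(h, folio);

	/* __update_and_free_hugetlb_folio(), after vmemmap is allocated */
	if (folio_test_hugetlb(folio)) {
		spin_lock_irq(&hugetlb_lock);
		__clear_hugetlb_destructor(h, folio);
		spin_unlock_irq(&hugetlb_lock);
	}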

Note that this will leave hugetlb pages in a state where they are marked
free (by hugetlb specific page flag) and have a ref count. This is not
a normal state. The only code that would notice is the memory error
code, and it is set up to retry in such a case.

A subsequent patch will create a routine to do bulk processing of
vmemmap allocation. This will eliminate a lock/unlock cycle for each
hugetlb page in the case where we are freeing a large number of pages.

Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Cc: <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 90 ++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 66 insertions(+), 24 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 64a3239b6407..4a910121a647 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
unsigned int order) { }
#endif

+static inline void __clear_hugetlb_destructor(struct hstate *h,
+ struct folio *folio)
+{
+ lockdep_assert_held(&hugetlb_lock);
+
+ /*
+ * Very subtle
+ *
+ * For non-gigantic pages set the destructor to the normal compound
+ * page dtor. This is needed in case someone takes an additional
+ * temporary ref to the page, and freeing is delayed until they drop
+ * their reference.
+ *
+ * For gigantic pages set the destructor to the null dtor. This
+ * destructor will never be called. Before freeing the gigantic
+ * page destroy_compound_gigantic_folio will turn the folio into a
+ * simple group of pages. After this the destructor does not
+ * apply.
+ *
+ */
+ if (hstate_is_gigantic(h))
+ folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
+ else
+ folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
+}
+
/*
- * Remove hugetlb folio from lists, and update dtor so that the folio appears
- * as just a compound page.
+ * Remove hugetlb folio from lists.
+ * If vmemmap exists for the folio, update dtor so that the folio appears
+ * as just a compound page. Otherwise, wait until after allocating vmemmap
+ * to update dtor.
*
* A reference is held on the folio, except in the case of demote.
*
@@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
}

/*
- * Very subtle
- *
- * For non-gigantic pages set the destructor to the normal compound
- * page dtor. This is needed in case someone takes an additional
- * temporary ref to the page, and freeing is delayed until they drop
- * their reference.
- *
- * For gigantic pages set the destructor to the null dtor. This
- * destructor will never be called. Before freeing the gigantic
- * page destroy_compound_gigantic_folio will turn the folio into a
- * simple group of pages. After this the destructor does not
- * apply.
- *
- * This handles the case where more than one ref is held when and
- * after update_and_free_hugetlb_folio is called.
- *
- * In the case of demote we do not ref count the page as it will soon
- * be turned into a page of smaller size.
+ * We can only clear the hugetlb destructor after allocating vmemmap
+ * pages. Otherwise, someone (memory error handling) may try to write
+ * to tail struct pages.
+ */
+ if (!folio_test_hugetlb_vmemmap_optimized(folio))
+ __clear_hugetlb_destructor(h, folio);
+
+ /*
+ * In the case of demote we do not ref count the page as it will soon
+ * be turned into a page of smaller size.
*/
if (!demote)
folio_ref_unfreeze(folio, 1);
- if (hstate_is_gigantic(h))
- folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
- else
- folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);

h->nr_huge_pages--;
h->nr_huge_pages_node[nid]--;
@@ -1728,6 +1744,19 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
return;
}

+ /*
+ * If needed, clear hugetlb destructor under the hugetlb lock.
+ * This must be done AFTER allocating vmemmap pages in case there is an
+ * attempt to write to tail struct pages as in memory poison.
+ * It must be done BEFORE PageHWPoison handling so that any subsequent
+ * memory errors poison individual pages instead of head.
+ */
+ if (folio_test_hugetlb(folio)) {
+ spin_lock_irq(&hugetlb_lock);
+ __clear_hugetlb_destructor(h, folio);
+ spin_unlock_irq(&hugetlb_lock);
+ }
+
/*
* Move PageHWPoison flag from head page to the raw error pages,
* which makes any healthy subpages reusable.
@@ -3604,6 +3633,19 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
return rc;
}

+ /*
+ * The hugetlb destructor could still be set for this folio if vmemmap
+ * was actually allocated above. The ref count on all pages is 0.
+ * Therefore, nobody should attempt access. However, before destroying
+ * compound page below, clear the destructor. Unfortunately, this
+ * requires a lock/unlock cycle.
+ */
+ if (folio_test_hugetlb(folio)) {
+ spin_lock_irq(&hugetlb_lock);
+ __clear_hugetlb_destructor(h, folio);
+ spin_unlock_irq(&hugetlb_lock);
+ }
+
/*
* Use destroy_compound_hugetlb_folio_for_demote for all huge page
* sizes as it will not ref count folios.
--
2.41.0


2023-07-18 16:48:38

by James Houghton

Subject: Re: [PATCH v2 2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles

On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <[email protected]> wrote:
>
> update_and_free_pages_bulk is designed to free a list of hugetlb pages
> back to their associated lower level allocators. This may require
> allocating vmemmmap pages associated with each hugetlb page. The
> hugetlb page destructor must be changed before pages are freed to lower
> level allocators. However, the destructor must be changed under the
> hugetlb lock. This means there is potentially one lock cycle per page.
>
> Minimize the number of lock cycles in update_and_free_pages_bulk by:
> 1) allocating necessary vmemmap for all hugetlb pages on the list
> 2) take hugetlb lock and clear destructor for all pages on the list
> 3) free all pages on list back to low level allocators
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 38 ++++++++++++++++++++++++++++++++++----
> 1 file changed, 34 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4a910121a647..e6b780291539 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1856,13 +1856,43 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
> static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> {
> struct page *page, *t_page;
> - struct folio *folio;
> + bool clear_dtor = false;
>
> + /*
> + * First allocate required vmemmmap for all pages on list. If vmemmap
> + * can not be allocated, we can not free page to lower level allocator,
> + * so add back as hugetlb surplus page.
> + */
> list_for_each_entry_safe(page, t_page, list, lru) {
> - folio = page_folio(page);
> - update_and_free_hugetlb_folio(h, folio, false);
> - cond_resched();
> + if (HPageVmemmapOptimized(page)) {
> + if (hugetlb_vmemmap_restore(h, page)) {
> + spin_lock_irq(&hugetlb_lock);
> + add_hugetlb_folio(h, page_folio(page), true);
> + spin_unlock_irq(&hugetlb_lock);
> + } else
> + clear_dtor = true;
> + cond_resched();
> + }
> + }
> +
> + /*
> + * If vmemmmap allocation performed above, then take lock to clear

s/vmemmmap/vmemmap. Also, this is a little hard to understand; something
like "If vmemmap allocation was performed above for any folios,
then..." seems clearer to me.

> + * destructor of all pages on list.
> + */
> + if (clear_dtor) {
> + spin_lock_irq(&hugetlb_lock);
> + list_for_each_entry(page, list, lru)
> + __clear_hugetlb_destructor(h, page_folio(page));
> + spin_unlock_irq(&hugetlb_lock);
> }

I'm not too familiar with this code, but the above block seems weird
to me. If we successfully allocated the vmemmap for *any* folio, we
clear the hugetlb destructor for all the folios? I feel like we should
only be clearing the hugetlb destructor for all folios if the vmemmap
allocation succeeded for *all* folios. If the code is functionally
correct as is, I'm a little bit confused why we need `clear_dtor`; it
seems like this function doesn't really need it. (I could have some
huge misunderstanding here.)

> +
> + /*
> + * Free pages back to low level allocators. vmemmap and destructors
> + * were taken care of above, so update_and_free_hugetlb_folio will
> + * not need to take hugetlb lock.
> + */
> + list_for_each_entry_safe(page, t_page, list, lru)
> + update_and_free_hugetlb_folio(h, page_folio(page), false);
> }
>
> struct hstate *size_to_hstate(unsigned long size)
> --
> 2.41.0
>

2023-07-18 16:52:45

by James Houghton

Subject: Re: [PATCH v2 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap

On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <[email protected]> wrote:
>
> Freeing a hugetlb page and releasing base pages back to the underlying
> allocator such as buddy or cma is performed in two steps:
> - remove_hugetlb_folio() is called to remove the folio from hugetlb
> lists, get a ref on the page and remove hugetlb destructor. This
> all must be done under the hugetlb lock. After this call, the page
> can be treated as a normal compound page or a collection of base
> size pages.
> - update_and_free_hugetlb_folio() is called to allocate vmemmap if
> needed and the free routine of the underlying allocator is called
> on the resulting page. We can not hold the hugetlb lock here.
>
> One issue with this scheme is that a memory error could occur between
> these two steps. In this case, the memory error handling code treats
> the old hugetlb page as a normal compound page or collection of base
> pages. It will then try to SetPageHWPoison(page) on the page with an
> error. If the page with error is a tail page without vmemmap, a write
> error will occur when trying to set the flag.
>
> Address this issue by modifying remove_hugetlb_folio() and
> update_and_free_hugetlb_folio() such that the hugetlb destructor is not
> cleared until after allocating vmemmap. Since clearing the destructor
> requires holding the hugetlb lock, the clearing is done in
> remove_hugetlb_folio() if the vmemmap is present. This saves a
> lock/unlock cycle. Otherwise, destructor is cleared in
> update_and_free_hugetlb_folio() after allocating vmemmap.
>
> Note that this will leave hugetlb pages in a state where they are marked
> free (by hugetlb specific page flag) and have a ref count. This is not
> a normal state. The only code that would notice is the memory error
> code, and it is set up to retry in such a case.
>
> A subsequent patch will create a routine to do bulk processing of
> vmemmap allocation. This will eliminate a lock/unlock cycle for each
> hugetlb page in the case where we are freeing a large number of pages.
>
> Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
> Cc: <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 90 ++++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 66 insertions(+), 24 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 64a3239b6407..4a910121a647 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
> unsigned int order) { }
> #endif
>
> +static inline void __clear_hugetlb_destructor(struct hstate *h,
> + struct folio *folio)
> +{
> + lockdep_assert_held(&hugetlb_lock);
> +
> + /*
> + * Very subtle
> + *
> + * For non-gigantic pages set the destructor to the normal compound
> + * page dtor. This is needed in case someone takes an additional
> + * temporary ref to the page, and freeing is delayed until they drop
> + * their reference.
> + *
> + * For gigantic pages set the destructor to the null dtor. This
> + * destructor will never be called. Before freeing the gigantic
> + * page destroy_compound_gigantic_folio will turn the folio into a
> + * simple group of pages. After this the destructor does not
> + * apply.
> + *
> + */

Is it correct and useful to add a
WARN_ON_ONCE(folio_test_hugetlb_vmemmap_optimized(folio)) here?

Feel free to add:

Acked-by: James Houghton <[email protected]>

> + if (hstate_is_gigantic(h))
> + folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
> + else
> + folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
> +}
> +
> /*
> - * Remove hugetlb folio from lists, and update dtor so that the folio appears
> - * as just a compound page.
> + * Remove hugetlb folio from lists.
> + * If vmemmap exists for the folio, update dtor so that the folio appears
> + * as just a compound page. Otherwise, wait until after allocating vmemmap
> + * to update dtor.
> *
> * A reference is held on the folio, except in the case of demote.
> *
> @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> }
>
> /*
> - * Very subtle
> - *
> - * For non-gigantic pages set the destructor to the normal compound
> - * page dtor. This is needed in case someone takes an additional
> - * temporary ref to the page, and freeing is delayed until they drop
> - * their reference.
> - *
> - * For gigantic pages set the destructor to the null dtor. This
> - * destructor will never be called. Before freeing the gigantic
> - * page destroy_compound_gigantic_folio will turn the folio into a
> - * simple group of pages. After this the destructor does not
> - * apply.
> - *
> - * This handles the case where more than one ref is held when and
> - * after update_and_free_hugetlb_folio is called.
> - *
> - * In the case of demote we do not ref count the page as it will soon
> - * be turned into a page of smaller size.
> + * We can only clear the hugetlb destructor after allocating vmemmap
> + * pages. Otherwise, someone (memory error handling) may try to write
> + * to tail struct pages.
> + */
> + if (!folio_test_hugetlb_vmemmap_optimized(folio))
> + __clear_hugetlb_destructor(h, folio);
> +
> + /*
> + * In the case of demote we do not ref count the page as it will soon
> + * be turned into a page of smaller size.
> */
> if (!demote)
> folio_ref_unfreeze(folio, 1);
> - if (hstate_is_gigantic(h))
> - folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
> - else
> - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
>
> h->nr_huge_pages--;
> h->nr_huge_pages_node[nid]--;
> @@ -1728,6 +1744,19 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> return;
> }
>
> + /*
> + * If needed, clear hugetlb destructor under the hugetlb lock.
> + * This must be done AFTER allocating vmemmap pages in case there is an
> + * attempt to write to tail struct pages as in memory poison.
> + * It must be done BEFORE PageHWPoison handling so that any subsequent
> + * memory errors poison individual pages instead of head.
> + */
> + if (folio_test_hugetlb(folio)) {
> + spin_lock_irq(&hugetlb_lock);
> + __clear_hugetlb_destructor(h, folio);
> + spin_unlock_irq(&hugetlb_lock);
> + }
> +
> /*
> * Move PageHWPoison flag from head page to the raw error pages,
> * which makes any healthy subpages reusable.
> @@ -3604,6 +3633,19 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
> return rc;
> }
>
> + /*
> + * The hugetlb destructor could still be set for this folio if vmemmap
> + * was actually allocated above. The ref count on all pages is 0.
> + * Therefore, nobody should attempt access. However, before destroying
> + * compound page below, clear the destructor. Unfortunately, this
> + * requires a lock/unlock cycle.
> + */
> + if (folio_test_hugetlb(folio)) {
> + spin_lock_irq(&hugetlb_lock);
> + __clear_hugetlb_destructor(h, folio);
> + spin_unlock_irq(&hugetlb_lock);
> + }
> +
> /*
> * Use destroy_compound_hugetlb_folio_for_demote for all huge page
> * sizes as it will not ref count folios.
> --
> 2.41.0
>

2023-07-18 17:17:57

by Mike Kravetz

Subject: Re: [PATCH v2 2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles

On 07/18/23 09:31, James Houghton wrote:
> On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <[email protected]> wrote:
> >
> > update_and_free_pages_bulk is designed to free a list of hugetlb pages
> > back to their associated lower level allocators. This may require
> > allocating vmemmmap pages associated with each hugetlb page. The
> > hugetlb page destructor must be changed before pages are freed to lower
> > level allocators. However, the destructor must be changed under the
> > hugetlb lock. This means there is potentially one lock cycle per page.
> >
> > Minimize the number of lock cycles in update_and_free_pages_bulk by:
> > 1) allocating necessary vmemmap for all hugetlb pages on the list
> > 2) take hugetlb lock and clear destructor for all pages on the list
> > 3) free all pages on list back to low level allocators
> >
> > Signed-off-by: Mike Kravetz <[email protected]>
> > ---
> > mm/hugetlb.c | 38 ++++++++++++++++++++++++++++++++++----
> > 1 file changed, 34 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 4a910121a647..e6b780291539 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1856,13 +1856,43 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
> > static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> > {
> > struct page *page, *t_page;
> > - struct folio *folio;
> > + bool clear_dtor = false;
> >
> > + /*
> > + * First allocate required vmemmmap for all pages on list. If vmemmap
> > + * can not be allocated, we can not free page to lower level allocator,
> > + * so add back as hugetlb surplus page.
> > + */
> > list_for_each_entry_safe(page, t_page, list, lru) {
> > - folio = page_folio(page);
> > - update_and_free_hugetlb_folio(h, folio, false);
> > - cond_resched();
> > + if (HPageVmemmapOptimized(page)) {
> > + if (hugetlb_vmemmap_restore(h, page)) {
> > + spin_lock_irq(&hugetlb_lock);
> > + add_hugetlb_folio(h, page_folio(page), true);
> > + spin_unlock_irq(&hugetlb_lock);
> > + } else
> > + clear_dtor = true;
> > + cond_resched();
> > + }
> > + }
> > +
> > + /*
> > + * If vmemmmap allocation performed above, then take lock to clear
>
> s/vmemmmap/vmemmap. Also is a little hard to understand, something
> like "If vmemmap allocation was performed above for any folios,
> then..." seems clearer to me.
>

Typo :(
Yes, that would be more clear ... see below.

> > + * destructor of all pages on list.
> > + */
> > + if (clear_dtor) {
> > + spin_lock_irq(&hugetlb_lock);
> > + list_for_each_entry(page, list, lru)
> > + __clear_hugetlb_destructor(h, page_folio(page));
> > + spin_unlock_irq(&hugetlb_lock);
> > }
>
> I'm not too familiar with this code, but the above block seems weird
> to me. If we successfully allocated the vmemmap for *any* folio, we
> clear the hugetlb destructor for all the folios? I feel like we should
> only be clearing the hugetlb destructor for all folios if the vmemmap
> allocation succeeded for *all* folios. If the code is functionally
> correct as is, I'm a little bit confused why we need `clear_dtor`; it
> seems like this function doesn't really need it. (I could have some
> huge misunderstanding here.)
>

Yes, it is a bit strange.

I was thinking this has to also handle the case where hugetlb vmemmap
optimization is off system wide. In that case, clear_dtor would never
be set and there is no sense in ever walking the list and calling
__clear_hugetlb_destructor(), which would be a NOOP in this case. Think
of the case where there are TBs of hugetlb pages.

That is one of the reasons I made __clear_hugetlb_destructor() check
for the need to modify the destructor. The other reason is in the
dissolve_free_huge_page() code path where we allocate vmemmap. I
suppose, there could be an explicit call to __clear_hugetlb_destructor()
in dissolve_free_huge_page. But, I thought it might be better if
we just handled both cases here.

My thinking is that the clear_dtor boolean would tell us if vmemmap was
restored for ANY hugetlb page. I am aware that just because vmemmap was
allocated for one page, does not mean that it was allocated for others.
However, in the common case where hugetlb vmemmap optimization is on
system wide, we would have allocated vmemmap for all pages on the list
and would need to clear the destructor for them all.

So, clear_dtor is really just an optimization for the
hugetlb_free_vmemmap=off case. Perhaps that is just overthinking and
not a useful micro-optimization.
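
Put differently, with hugetlb_free_vmemmap=off the bulk path reduces to
the following (a sketch based on the patch as posted):

	list_for_each_entry_safe(page, t_page, list, lru) {
		if (HPageVmemmapOptimized(page)) {
			/* never true when the optimization is off */
		}
	}
	/* clear_dtor is still false, so the locked walk is skipped ... */
	list_for_each_entry_safe(page, t_page, list, lru)
		update_and_free_hugetlb_folio(h, page_folio(page), false);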

Thanks for taking a look!
--
Mike Kravetz

> > +
> > + /*
> > + * Free pages back to low level allocators. vmemmap and destructors
> > + * were taken care of above, so update_and_free_hugetlb_folio will
> > + * not need to take hugetlb lock.
> > + */
> > + list_for_each_entry_safe(page, t_page, list, lru)
> > + update_and_free_hugetlb_folio(h, page_folio(page), false);
> > }
> >
> > struct hstate *size_to_hstate(unsigned long size)
> > --
> > 2.41.0
> >

2023-07-19 03:04:58

by Muchun Song

Subject: Re: [PATCH v2 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap



> On Jul 18, 2023, at 08:49, Mike Kravetz <[email protected]> wrote:
>
> Freeing a hugetlb page and releasing base pages back to the underlying
> allocator such as buddy or cma is performed in two steps:
> - remove_hugetlb_folio() is called to remove the folio from hugetlb
> lists, get a ref on the page and remove hugetlb destructor. This
> all must be done under the hugetlb lock. After this call, the page
> can be treated as a normal compound page or a collection of base
> size pages.
> - update_and_free_hugetlb_folio() is called to allocate vmemmap if
> needed and the free routine of the underlying allocator is called
> on the resulting page. We can not hold the hugetlb lock here.
>
> One issue with this scheme is that a memory error could occur between
> these two steps. In this case, the memory error handling code treats
> the old hugetlb page as a normal compound page or collection of base
> pages. It will then try to SetPageHWPoison(page) on the page with an
> error. If the page with error is a tail page without vmemmap, a write
> error will occur when trying to set the flag.
>
> Address this issue by modifying remove_hugetlb_folio() and
> update_and_free_hugetlb_folio() such that the hugetlb destructor is not
> cleared until after allocating vmemmap. Since clearing the destructor
> requires holding the hugetlb lock, the clearing is done in
> remove_hugetlb_folio() if the vmemmap is present. This saves a
> lock/unlock cycle. Otherwise, destructor is cleared in
> update_and_free_hugetlb_folio() after allocating vmemmap.
>
> Note that this will leave hugetlb pages in a state where they are marked
> free (by hugetlb specific page flag) and have a ref count. This is not
> a normal state. The only code that would notice is the memory error
> code, and it is set up to retry in such a case.
>
> A subsequent patch will create a routine to do bulk processing of
> vmemmap allocation. This will eliminate a lock/unlock cycle for each
> hugetlb page in the case where we are freeing a large number of pages.
>
> Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
> Cc: <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>

Reviewed-by: Muchun Song <[email protected]>

Thanks.


2023-07-19 04:05:39

by Muchun Song

Subject: Re: [PATCH v2 2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles



> On Jul 18, 2023, at 08:49, Mike Kravetz <[email protected]> wrote:
>
> update_and_free_pages_bulk is designed to free a list of hugetlb pages
> back to their associated lower level allocators. This may require
> allocating vmemmmap pages associated with each hugetlb page. The
> hugetlb page destructor must be changed before pages are freed to lower
> level allocators. However, the destructor must be changed under the
> hugetlb lock. This means there is potentially one lock cycle per page.
>
> Minimize the number of lock cycles in update_and_free_pages_bulk by:
> 1) allocating necessary vmemmap for all hugetlb pages on the list
> 2) take hugetlb lock and clear destructor for all pages on the list
> 3) free all pages on list back to low level allocators
>
> Signed-off-by: Mike Kravetz <[email protected]>

Reviewed-by: Muchun Song <[email protected]>

Thanks.


2023-07-20 00:18:00

by James Houghton

Subject: Re: [PATCH v2 2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles

On Tue, Jul 18, 2023 at 9:47 AM Mike Kravetz <[email protected]> wrote:
>
> On 07/18/23 09:31, James Houghton wrote:
> > On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <[email protected]> wrote:
> > > + * destructor of all pages on list.
> > > + */
> > > + if (clear_dtor) {
> > > + spin_lock_irq(&hugetlb_lock);
> > > + list_for_each_entry(page, list, lru)
> > > + __clear_hugetlb_destructor(h, page_folio(page));
> > > + spin_unlock_irq(&hugetlb_lock);
> > > }
> >
> > I'm not too familiar with this code, but the above block seems weird
> > to me. If we successfully allocated the vmemmap for *any* folio, we
> > clear the hugetlb destructor for all the folios? I feel like we should
> > only be clearing the hugetlb destructor for all folios if the vmemmap
> > allocation succeeded for *all* folios. If the code is functionally
> > correct as is, I'm a little bit confused why we need `clear_dtor`; it
> > seems like this function doesn't really need it. (I could have some
> > huge misunderstanding here.)
> >
>
> Yes, it is a bit strange.
>
> I was thinking this has to also handle the case where hugetlb vmemmap
> optimization is off system wide. In that case, clear_dtor would never
> be set and there is no sense in ever walking the list and calling
> __clear_hugetlb_destructor() would would be a NOOP in this case. Think
> of the case where there are TBs of hugetlb pages.
>
> That is one of the reasons I made __clear_hugetlb_destructor() check
> for the need to modify the destructor. The other reason is in the
> dissolve_free_huge_page() code path where we allocate vmemmap. I
> suppose, there could be an explicit call to __clear_hugetlb_destructor()
> in dissolve_free_huge_page. But, I thought it might be better if
> we just handled both cases here.
>
> My thinking is that the clear_dtor boolean would tell us if vmemmap was
> restored for ANY hugetlb page. I am aware that just because vmemmap was
> allocated for one page, does not mean that it was allocated for others.
> However, in the common case where hugetlb vmemmap optimization is on
> system wide, we would have allocated vmemmap for all pages on the list
> and would need to clear the destructor for them all.
>
> So, clear_dtor is really just an optimization for the
> hugetlb_free_vmemmap=off case. Perhaps that is just over thinking and
> not a useful miro-optimization.

Ok I think I understand; I think the micro-optimization is fine to
add. But I think there's still a bug here:

If we have two vmemmap-optimized hugetlb pages and restoring the page
structs for one of them fails, that page will end up with the
incorrect dtor (add_hugetlb_folio will set it properly, but then we
clear it afterwards because clear_dtor was set).

What do you think?

2023-07-20 00:48:10

by Mike Kravetz

Subject: Re: [PATCH v2 2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles

On 07/19/23 17:02, James Houghton wrote:
> On Tue, Jul 18, 2023 at 9:47 AM Mike Kravetz <[email protected]> wrote:
> >
> > On 07/18/23 09:31, James Houghton wrote:
> > > On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <[email protected]> wrote:
> > > > + * destructor of all pages on list.
> > > > + */
> > > > + if (clear_dtor) {
> > > > + spin_lock_irq(&hugetlb_lock);
> > > > + list_for_each_entry(page, list, lru)
> > > > + __clear_hugetlb_destructor(h, page_folio(page));
> > > > + spin_unlock_irq(&hugetlb_lock);
> > > > }
> > >
> > > I'm not too familiar with this code, but the above block seems weird
> > > to me. If we successfully allocated the vmemmap for *any* folio, we
> > > clear the hugetlb destructor for all the folios? I feel like we should
> > > only be clearing the hugetlb destructor for all folios if the vmemmap
> > > allocation succeeded for *all* folios. If the code is functionally
> > > correct as is, I'm a little bit confused why we need `clear_dtor`; it
> > > seems like this function doesn't really need it. (I could have some
> > > huge misunderstanding here.)
> > >
> >
> > Yes, it is a bit strange.
> >
> > I was thinking this has to also handle the case where hugetlb vmemmap
> > optimization is off system wide. In that case, clear_dtor would never
> > be set and there is no sense in ever walking the list and calling
> > __clear_hugetlb_destructor() would would be a NOOP in this case. Think
> > of the case where there are TBs of hugetlb pages.
> >
> > That is one of the reasons I made __clear_hugetlb_destructor() check
> > for the need to modify the destructor. The other reason is in the
> > dissolve_free_huge_page() code path where we allocate vmemmap. I
> > suppose, there could be an explicit call to __clear_hugetlb_destructor()
> > in dissolve_free_huge_page. But, I thought it might be better if
> > we just handled both cases here.
> >
> > My thinking is that the clear_dtor boolean would tell us if vmemmap was
> > restored for ANY hugetlb page. I am aware that just because vmemmap was
> > allocated for one page, does not mean that it was allocated for others.
> > However, in the common case where hugetlb vmemmap optimization is on
> > system wide, we would have allocated vmemmap for all pages on the list
> > and would need to clear the destructor for them all.
> >
> > So, clear_dtor is really just an optimization for the
> > hugetlb_free_vmemmap=off case. Perhaps that is just over thinking and
> > not a useful miro-optimization.
>
> Ok I think I understand; I think the micro-optimization is fine to
> add. But I think there's still a bug here:
>
> If we have two vmemmap-optimized hugetlb pages and restoring the page
> structs for one of them fails, that page will end up with the
> incorrect dtor (add_hugetlb_folio will set it properly, but then we
> clear it afterwards because clear_dtor was set).
>
> What do you think?

add_hugetlb_folio() will call enqueue_hugetlb_folio() which will move
the folio from the existing list we are processing to the hugetlb free
list. Therefore, the page for which we could not restore vmemmap is not
on the list for that 'if (clear_dtor)' block of code.
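
Concretely, with two vmemmap-optimized folios A and B on 'list' and a
restore failure only for B, the first loop does the following:
- A: hugetlb_vmemmap_restore() succeeds, clear_dtor is set, and A stays
  on 'list'.
- B: hugetlb_vmemmap_restore() fails, so add_hugetlb_folio() (via
  enqueue_hugetlb_folio()) moves B off 'list' and onto the hstate free
  list.
The 'if (clear_dtor)' walk then only sees A, so B keeps the destructor
that add_hugetlb_folio() set.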

--
Mike Kravetz

2023-07-20 01:01:46

by James Houghton

Subject: Re: [PATCH v2 2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles

On Wed, Jul 19, 2023 at 5:19 PM Mike Kravetz <[email protected]> wrote:
>
> On 07/19/23 17:02, James Houghton wrote:
> > On Tue, Jul 18, 2023 at 9:47 AM Mike Kravetz <[email protected]> wrote:
> > >
> > > On 07/18/23 09:31, James Houghton wrote:
> > > > On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <[email protected]> wrote:
> > > > > + * destructor of all pages on list.
> > > > > + */
> > > > > + if (clear_dtor) {
> > > > > + spin_lock_irq(&hugetlb_lock);
> > > > > + list_for_each_entry(page, list, lru)
> > > > > + __clear_hugetlb_destructor(h, page_folio(page));
> > > > > + spin_unlock_irq(&hugetlb_lock);
> > > > > }
> > > >
> > > > I'm not too familiar with this code, but the above block seems weird
> > > > to me. If we successfully allocated the vmemmap for *any* folio, we
> > > > clear the hugetlb destructor for all the folios? I feel like we should
> > > > only be clearing the hugetlb destructor for all folios if the vmemmap
> > > > allocation succeeded for *all* folios. If the code is functionally
> > > > correct as is, I'm a little bit confused why we need `clear_dtor`; it
> > > > seems like this function doesn't really need it. (I could have some
> > > > huge misunderstanding here.)
> > > >
> > >
> > > Yes, it is a bit strange.
> > >
> > > I was thinking this has to also handle the case where hugetlb vmemmap
> > > optimization is off system wide. In that case, clear_dtor would never
> > > be set and there is no sense in ever walking the list and calling
> > > __clear_hugetlb_destructor() would would be a NOOP in this case. Think
> > > of the case where there are TBs of hugetlb pages.
> > >
> > > That is one of the reasons I made __clear_hugetlb_destructor() check
> > > for the need to modify the destructor. The other reason is in the
> > > dissolve_free_huge_page() code path where we allocate vmemmap. I
> > > suppose, there could be an explicit call to __clear_hugetlb_destructor()
> > > in dissolve_free_huge_page. But, I thought it might be better if
> > > we just handled both cases here.
> > >
> > > My thinking is that the clear_dtor boolean would tell us if vmemmap was
> > > restored for ANY hugetlb page. I am aware that just because vmemmap was
> > > allocated for one page, does not mean that it was allocated for others.
> > > However, in the common case where hugetlb vmemmap optimization is on
> > > system wide, we would have allocated vmemmap for all pages on the list
> > > and would need to clear the destructor for them all.
> > >
> > > So, clear_dtor is really just an optimization for the
> > > hugetlb_free_vmemmap=off case. Perhaps that is just over thinking and
> > > not a useful miro-optimization.
> >
> > Ok I think I understand; I think the micro-optimization is fine to
> > add. But I think there's still a bug here:
> >
> > If we have two vmemmap-optimized hugetlb pages and restoring the page
> > structs for one of them fails, that page will end up with the
> > incorrect dtor (add_hugetlb_folio will set it properly, but then we
> > clear it afterwards because clear_dtor was set).
> >
> > What do you think?
>
> add_hugetlb_folio() will call enqueue_hugetlb_folio() which will move
> the folio from the existing list we are processing to the hugetlb free
> list. Therefore, the page for which we could not restore vmemmap is not
> on the list for that 'if (clear_dtor)' block of code.

Oh, I see. Thanks! Unless you think it's pretty obvious, perhaps a
comment would be good to add here, to explain that folios are removed
from 'list' if their vmemmap isn't restored.

Unrelated nit: I think you mean to use
folio_test_hugetlb_vmemmap_optimized instead of HPageVmemmapOptimized
in this patch.

Feel free to add:

Acked-by: James Houghton <[email protected]>


>
> --
> Mike Kravetz

2023-07-20 02:28:20

by Jiaqi Yan

Subject: Re: [PATCH v2 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap

On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <[email protected]> wrote:
>
> Freeing a hugetlb page and releasing base pages back to the underlying
> allocator such as buddy or cma is performed in two steps:
> - remove_hugetlb_folio() is called to remove the folio from hugetlb
> lists, get a ref on the page and remove hugetlb destructor. This
> all must be done under the hugetlb lock. After this call, the page
> can be treated as a normal compound page or a collection of base
> size pages.
> - update_and_free_hugetlb_folio() is called to allocate vmemmap if
> needed and the free routine of the underlying allocator is called
> on the resulting page. We can not hold the hugetlb lock here.
>
> One issue with this scheme is that a memory error could occur between
> these two steps. In this case, the memory error handling code treats
> the old hugetlb page as a normal compound page or collection of base
> pages. It will then try to SetPageHWPoison(page) on the page with an
> error. If the page with error is a tail page without vmemmap, a write
> error will occur when trying to set the flag.
>
> Address this issue by modifying remove_hugetlb_folio() and
> update_and_free_hugetlb_folio() such that the hugetlb destructor is not
> cleared until after allocating vmemmap. Since clearing the destructor
> requires holding the hugetlb lock, the clearing is done in
> remove_hugetlb_folio() if the vmemmap is present. This saves a
> lock/unlock cycle. Otherwise, destructor is cleared in
> update_and_free_hugetlb_folio() after allocating vmemmap.
>
> Note that this will leave hugetlb pages in a state where they are marked
> free (by hugetlb specific page flag) and have a ref count. This is not
> a normal state. The only code that would notice is the memory error
> code, and it is set up to retry in such a case.
>
> A subsequent patch will create a routine to do bulk processing of
> vmemmap allocation. This will eliminate a lock/unlock cycle for each
> hugetlb page in the case where we are freeing a large number of pages.
>
> Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
> Cc: <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 90 ++++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 66 insertions(+), 24 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 64a3239b6407..4a910121a647 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
> unsigned int order) { }
> #endif
>
> +static inline void __clear_hugetlb_destructor(struct hstate *h,
> + struct folio *folio)
> +{
> + lockdep_assert_held(&hugetlb_lock);
> +
> + /*
> + * Very subtle
> + *
> + * For non-gigantic pages set the destructor to the normal compound
> + * page dtor. This is needed in case someone takes an additional
> + * temporary ref to the page, and freeing is delayed until they drop
> + * their reference.
> + *
> + * For gigantic pages set the destructor to the null dtor. This
> + * destructor will never be called. Before freeing the gigantic
> + * page destroy_compound_gigantic_folio will turn the folio into a
> + * simple group of pages. After this the destructor does not
> + * apply.
> + *
> + */
> + if (hstate_is_gigantic(h))
> + folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
> + else
> + folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
> +}
> +
> /*
> - * Remove hugetlb folio from lists, and update dtor so that the folio appears
> - * as just a compound page.
> + * Remove hugetlb folio from lists.
> + * If vmemmap exists for the folio, update dtor so that the folio appears
> + * as just a compound page. Otherwise, wait until after allocating vmemmap
> + * to update dtor.
> *
> * A reference is held on the folio, except in the case of demote.
> *
> @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
> }
>
> /*
> - * Very subtle
> - *
> - * For non-gigantic pages set the destructor to the normal compound
> - * page dtor. This is needed in case someone takes an additional
> - * temporary ref to the page, and freeing is delayed until they drop
> - * their reference.
> - *
> - * For gigantic pages set the destructor to the null dtor. This
> - * destructor will never be called. Before freeing the gigantic
> - * page destroy_compound_gigantic_folio will turn the folio into a
> - * simple group of pages. After this the destructor does not
> - * apply.
> - *
> - * This handles the case where more than one ref is held when and
> - * after update_and_free_hugetlb_folio is called.
> - *
> - * In the case of demote we do not ref count the page as it will soon
> - * be turned into a page of smaller size.
> + * We can only clear the hugetlb destructor after allocating vmemmap
> + * pages. Otherwise, someone (memory error handling) may try to write
> + * to tail struct pages.
> + */
> + if (!folio_test_hugetlb_vmemmap_optimized(folio))
> + __clear_hugetlb_destructor(h, folio);
> +
> + /*
> + * In the case of demote we do not ref count the page as it will soon
> + * be turned into a page of smaller size.
> */
> if (!demote)
> folio_ref_unfreeze(folio, 1);
> - if (hstate_is_gigantic(h))
> - folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
> - else
> - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
>
> h->nr_huge_pages--;
> h->nr_huge_pages_node[nid]--;
> @@ -1728,6 +1744,19 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> return;
> }
>
> + /*
> + * If needed, clear hugetlb destructor under the hugetlb lock.
> + * This must be done AFTER allocating vmemmap pages in case there is an
> + * attempt to write to tail struct pages as in memory poison.
> + * It must be done BEFORE PageHWPoison handling so that any subsequent
> + * memory errors poison individual pages instead of head.
> + */
> + if (folio_test_hugetlb(folio)) {

Thanks Mike, this definitely fixed the issue in v1, :)

> + spin_lock_irq(&hugetlb_lock);
> + __clear_hugetlb_destructor(h, folio);
> + spin_unlock_irq(&hugetlb_lock);
> + }
> +
> /*
> * Move PageHWPoison flag from head page to the raw error pages,
> * which makes any healthy subpages reusable.
> @@ -3604,6 +3633,19 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
> return rc;
> }
>
> + /*
> + * The hugetlb destructor could still be set for this folio if vmemmap
> + * was actually allocated above. The ref count on all pages is 0.
> + * Therefore, nobody should attempt access. However, before destroying
> + * compound page below, clear the destructor. Unfortunately, this
> + * requires a lock/unlock cycle.
> + */
> + if (folio_test_hugetlb(folio)) {
> + spin_lock_irq(&hugetlb_lock);
> + __clear_hugetlb_destructor(h, folio);
> + spin_unlock_irq(&hugetlb_lock);
> + }
> +
> /*
> * Use destroy_compound_hugetlb_folio_for_demote for all huge page
> * sizes as it will not ref count folios.
> --
> 2.41.0
>

2023-07-26 10:00:33

by Naoya Horiguchi

Subject: Re: [PATCH v2 1/2] hugetlb: Do not clear hugetlb dtor until allocating vmemmap

On Mon, Jul 17, 2023 at 05:49:41PM -0700, Mike Kravetz wrote:
> Freeing a hugetlb page and releasing base pages back to the underlying
> allocator such as buddy or cma is performed in two steps:
> - remove_hugetlb_folio() is called to remove the folio from hugetlb
> lists, get a ref on the page and remove hugetlb destructor. This
> all must be done under the hugetlb lock. After this call, the page
> can be treated as a normal compound page or a collection of base
> size pages.
> - update_and_free_hugetlb_folio() is called to allocate vmemmap if
> needed and the free routine of the underlying allocator is called
> on the resulting page. We can not hold the hugetlb lock here.
>
> One issue with this scheme is that a memory error could occur between
> these two steps. In this case, the memory error handling code treats
> the old hugetlb page as a normal compound page or collection of base
> pages. It will then try to SetPageHWPoison(page) on the page with an
> error. If the page with error is a tail page without vmemmap, a write
> error will occur when trying to set the flag.
>
> Address this issue by modifying remove_hugetlb_folio() and
> update_and_free_hugetlb_folio() such that the hugetlb destructor is not
> cleared until after allocating vmemmap. Since clearing the destructor
> requires holding the hugetlb lock, the clearing is done in
> remove_hugetlb_folio() if the vmemmap is present. This saves a
> lock/unlock cycle. Otherwise, destructor is cleared in
> update_and_free_hugetlb_folio() after allocating vmemmap.
>
> Note that this will leave hugetlb pages in a state where they are marked
> free (by hugetlb specific page flag) and have a ref count. This is not
> a normal state. The only code that would notice is the memory error
> code, and it is set up to retry in such a case.
>
> A subsequent patch will create a routine to do bulk processing of
> vmemmap allocation. This will eliminate a lock/unlock cycle for each
> hugetlb page in the case where we are freeing a large number of pages.
>
> Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
> Cc: <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>

I wrote a reproducer to cause the race condition between memory failure
and shrinking the free hugepage pool (with vmemmap optimization enabled).
Then I observed that the v6.5-rc2 kernel panicked with "BUG: unable to handle
page fault for address...", and confirmed that the kernel with your patches
does not reproduce it. Thank you for fixing this.

Tested-by: Naoya Horiguchi <[email protected]>