2023-07-11 22:39:09

by Mike Kravetz

Subject: [PATCH 0/2] Fix hugetlb free path race with memory errors

In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
HWPOISON hugepages" the race window was discovered.
https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/

Freeing a hugetlb page back to low level memory allocators is performed
in two steps.
1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
2) Outside lock, allocate vmemmap if necessary and call low level free
Between these two steps, the hugetlb page will appear as a normal
compound page. However, vmemmap for tail pages could be missing.
If a memory error occurs at this time, we could try to update page
flags in non-existent page structs.
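
To make the window concrete, here is a condensed userspace sketch of that
ordering (not the actual mm/hugetlb.c code; the struct and function names
below are made up purely for illustration):

/* Simplified model of the two step free path described above. */
#include <stdbool.h>
#include <stdio.h>

struct fake_hpage {
	bool on_hugetlb_lists;
	bool hugetlb_dtor;		/* destructor marks page as hugetlb */
	bool tail_vmemmap_present;	/* false while vmemmap optimized */
};

static void free_fake_hpage(struct fake_hpage *p)
{
	/* Step 1: done under hugetlb_lock in the real code. */
	p->on_hugetlb_lists = false;
	p->hugetlb_dtor = false;	/* now looks like a normal compound page */

	/*
	 * Race window: the lock is dropped here.  A memory error handler
	 * sees a normal compound page, but if the page was vmemmap
	 * optimized, the tail page structs it would update do not exist yet.
	 */

	/* Step 2: outside the lock. */
	if (!p->tail_vmemmap_present)
		p->tail_vmemmap_present = true;	/* hugetlb_vmemmap_restore() */
	printf("freed to low level allocator\n");
}

int main(void)
{
	struct fake_hpage page = { true, true, false };

	free_fake_hpage(&page);
	return 0;
}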

A much more detailed description is in the first patch.

The first patch addresses the race window. However, it adds a
hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb
page free operation. This could lead to slowdowns if one is freeing
a large number of hugetlb pages.

The second patch optimizes the update_and_free_pages_bulk routine
to only take the lock once in bulk operations.

The second patch is technically not a bug fix, but includes a Fixes
tag and Cc stable to avoid a performance regression. It can be
combined with the first, but was done separately to make reviewing easier.

Mike Kravetz (2):
hugetlb: Do not clear hugetlb dtor until allocating vmemmap
hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles

mm/hugetlb.c | 110 +++++++++++++++++++++++++++++++++++++++------------
1 file changed, 85 insertions(+), 25 deletions(-)

--
2.41.0



2023-07-11 23:03:03

by Mike Kravetz

Subject: [PATCH 2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles

update_and_free_pages_bulk is designed to free a list of hugetlb pages
back to their associated lower level allocators. This may require
allocating vmemmap pages associated with each hugetlb page. The
hugetlb page destructor must be changed before pages are freed to lower
level allocators. However, the destructor must be changed under the
hugetlb lock. This means there is potentially one lock cycle per page.
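
Roughly, the per-page pattern being minimized versus the batched one (a
standalone userspace sketch, with a pthread mutex standing in for
hugetlb_lock; this is not the kernel code):

#include <pthread.h>
#include <stdio.h>

#define NPAGES 8

static pthread_mutex_t fake_hugetlb_lock = PTHREAD_MUTEX_INITIALIZER;
static int dtor_set[NPAGES];	/* 1 == hugetlb destructor still set */

/* Pre-patch shape: one lock/unlock cycle per page. */
static void clear_dtor_per_page(void)
{
	for (int i = 0; i < NPAGES; i++) {
		pthread_mutex_lock(&fake_hugetlb_lock);
		dtor_set[i] = 0;
		pthread_mutex_unlock(&fake_hugetlb_lock);
	}
}

/* Post-patch shape: one lock/unlock cycle for the whole list. */
static void clear_dtor_bulk(void)
{
	pthread_mutex_lock(&fake_hugetlb_lock);
	for (int i = 0; i < NPAGES; i++)
		dtor_set[i] = 0;
	pthread_mutex_unlock(&fake_hugetlb_lock);
}

int main(void)
{
	for (int i = 0; i < NPAGES; i++)
		dtor_set[i] = 1;

	clear_dtor_per_page();	/* NPAGES lock cycles */
	clear_dtor_bulk();	/* one lock cycle */
	printf("destructors cleared: dtor_set[0] == %d\n", dtor_set[0]);
	return 0;
}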

Minimize the number of lock cycles in update_and_free_pages_bulk by:
1) allocating necessary vmemmap for all hugetlb pages on the list
2) taking the hugetlb lock once and clearing the destructor for all pages
   on the list
3) freeing all pages on the list back to low level allocators

Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Cc: <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1b67bf341c32..e751fced870a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1856,11 +1856,44 @@ static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
 {
 	struct page *page, *t_page;
 	struct folio *folio;
+	bool clear_dtor = false;
 
+	/*
+	 * First allocate required vmemmap for all pages on list.  If vmemmap
+	 * can not be allocated, we can not free page to lower level allocator,
+	 * so add back as hugetlb surplus page.
+	 */
+	list_for_each_entry_safe(page, t_page, list, lru) {
+		if (HPageVmemmapOptimized(page)) {
+			clear_dtor = true;
+			if (hugetlb_vmemmap_restore(h, page)) {
+				spin_lock_irq(&hugetlb_lock);
+				add_hugetlb_folio(h, page_folio(page), true);
+				spin_unlock_irq(&hugetlb_lock);
+			}
+			cond_resched();
+		}
+	}
+
+	/*
+	 * If vmemmap allocation performed above, then take lock to clear
+	 * destructor of all pages on list.
+	 */
+	if (clear_dtor) {
+		spin_lock_irq(&hugetlb_lock);
+		list_for_each_entry(page, list, lru)
+			__clear_hugetlb_destructor(h, page_folio(page));
+		spin_unlock_irq(&hugetlb_lock);
+	}
+
+	/*
+	 * Free pages back to low level allocators.  vmemmap and destructors
+	 * were taken care of above, so update_and_free_hugetlb_folio will
+	 * not need to take hugetlb lock.
+	 */
 	list_for_each_entry_safe(page, t_page, list, lru) {
 		folio = page_folio(page);
 		update_and_free_hugetlb_folio(h, folio, false);
-		cond_resched();
 	}
 }

--
2.41.0


2023-07-13 17:39:34

by Andrew Morton

Subject: Re: [PATCH 0/2] Fix hugetlb free path race with memory errors

On Tue, 11 Jul 2023 15:09:40 -0700 Mike Kravetz <[email protected]> wrote:

> In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
> HWPOISON hugepages" the race window was discovered.
> https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/
>
> Freeing a hugetlb page back to low level memory allocators is performed
> in two steps.
> 1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
> 2) Outside lock, allocate vmemmap if necessary and call low level free
> Between these two steps, the hugetlb page will appear as a normal
> compound page. However, vmemmap for tail pages could be missing.
> If a memory error occurs at this time, we could try to update page
> flags in non-existent page structs.
>
> A much more detailed description is in the first patch.
>
> The first patch addresses the race window. However, it adds a
> hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb
> page free operation. This could lead to slowdowns if one is freeing
> a large number of hugetlb pages.
>
> The second patch optimizes the update_and_free_pages_bulk routine
> to only take the lock once in bulk operations.
>
> The second patch is technically not a bug fix, but includes a Fixes
> tag and Cc stable to avoid a performance regression. It can be
> combined with the first, but was done separately to make reviewing easier.
>

I feel that backporting performance improvements into -stable is not a
usual thing to do. Perhaps the fact that it's a regression fix changes
this, but why?

Much hinges on the magnitude of the performance change. Are you able
to quantify this at all?