This series can minimize the overhead of struct page for 2MB HugeTLB pages
significantly. It further reduces the overhead of struct page by 12.5% for
a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB
HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks.
The main implementation and details can refer to the commit log of patch 1.
In this series, I have changed the following four helpers, the following
table shows the impact of the overhead of those helpers.
+------------------+-----------------------+
| APIs | head page | tail page |
+------------------+-----------+-----------+
| PageHead() | Y | N |
+------------------+-----------+-----------+
| PageTail() | Y | N |
+------------------+-----------+-----------+
| PageCompound() | N | N |
+------------------+-----------+-----------+
| compound_head() | Y | N |
+------------------+-----------+-----------+
Y: Overhead is increased.
N: Overhead is _NOT_ increased.
It shows that the overhead of those helpers on a tail page don't change
between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off". But the
overhead on a head page will be increased when "hugetlb_free_vmemmap=on"
(except PageCompound()). So I believe that Matthew Wilcox's folio series
will help with this.
The users of PageHead() and PageTail() are much less than compound_head()
and most users of PageTail() are VM_BUG_ON(), so I have done some tests
about the overhead of compound_head() on head pages.
I have tested the overhead of calling compound_head() on a head page, which
is 2.11ns (Measure the call time of 10 million times compound_head(), and
then average).
For a head page whose address is not aligned with PAGE_SIZE or a non-compound
page, the overhead of compound_head() is 2.54ns which is increased by 20%.
For a head page whose address is aligned with PAGE_SIZE, the overhead of
compound_head() is 2.97ns which is increased by 40%. Most pages are the former.
I do not think the overhead is significant since the overhead of compound_head()
itself is low.
Changlogs in v7:
1. Fix a typo (change page_head_if_fake to page_fixed_fake_head) in the
commit log of patch 2.
2. Move details of implementation from cover letter to the commit log of
patch 1.
3. Add some overhead numbers to the cover letter.
Changlogs in v6:
1. Add test case to tools/testing/selftests/vm/run_vmtests.sh.
Changlogs in v5:
1. Move NR_RESET_STRUCT_PAGE to the front of reset_struct_pages().
2. Collect Reviewed-by tags.
Thanks Barry for his suggestions and reviews.
Changlogs in v4:
1. Move hugetlb_free_vmemmap_enabled from hugetlb.h to page-flags.h.
2. Collect Reviewed-by.
3. Add a new patch to move vmemmap functions related to HugeTLB to
the scope of the CONFIG_HUGETLB_PAGE_FREE_VMEMMAP.
Thanks Barry for his suggestions and reviews.
Changlogs in v3:
1. Rename page_head_if_fake() to page_fixed_fake_head().
2. Introducing a new helper page_is_fake_head() to make code more readable.
3. Update commit log of patch 3 to add more judgements.
4. Add some comments in check_page_flags() in the patch 4.
Thanks Barry for his suggestions and reviews.
Changlogs in v2:
1. Drop two patches of introducing PAGEFLAGS_MASK from this series.
2. Let page_head_if_fake() return page instead of NULL.
3. Add a selftest to check if PageHead or PageTail work well.
Muchun Song (5):
mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB
page
mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key
mm: sparsemem: use page table lock to protect kernel pmd operations
selftests: vm: add a hugetlb test case
mm: sparsemem: move vmemmap related to HugeTLB to
CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
Documentation/admin-guide/kernel-parameters.txt | 2 +-
include/linux/hugetlb.h | 6 -
include/linux/mm.h | 2 +
include/linux/page-flags.h | 90 ++++++++++++++-
mm/hugetlb_vmemmap.c | 68 ++++++-----
mm/memory_hotplug.c | 2 +-
mm/ptdump.c | 16 ++-
mm/sparse-vmemmap.c | 70 +++++++++---
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/hugepage-vmemmap.c | 144 ++++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests.sh | 11 ++
12 files changed, 350 insertions(+), 63 deletions(-)
create mode 100644 tools/testing/selftests/vm/hugepage-vmemmap.c
--
2.11.0
This patch minimizes the overhead of struct page for 2MB HugeTLB
pages significantly. It further reduces the overhead of struct
page by 12.5% for a 2MB HugeTLB compared to the previous approach,
which means 2GB per 1TB HugeTLB (2MB type).
After the feature of "Free sonme vmemmap pages of HugeTLB page"
is enabled, the mapping of the vmemmap addresses associated with
a 2MB HugeTLB page becomes the figure below.
HugeTLB struct pages(8 pages) page frame(8 pages)
+-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+---> PG_head
| | | 0 | -------------> | 0 |
| | +-----------+ +-----------+
| | | 1 | -------------> | 1 |
| | +-----------+ +-----------+
| | | 2 | ----------------^ ^ ^ ^ ^ ^
| | +-----------+ | | | | |
| | | 3 | ------------------+ | | | |
| | +-----------+ | | | |
| | | 4 | --------------------+ | | |
| 2MB | +-----------+ | | |
| | | 5 | ----------------------+ | |
| | +-----------+ | |
| | | 6 | ------------------------+ |
| | +-----------+ |
| | | 7 | --------------------------+
| | +-----------+
| |
| |
| |
+-----------+
As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and
remaped. However, the 2nd vmemmap page frame is also can be freed to
the buddy allocator, then we can change the mapping from the figure
above to the figure below.
HugeTLB struct pages(8 pages) page frame(8 pages)
+-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+---> PG_head
| | | 0 | -------------> | 0 |
| | +-----------+ +-----------+
| | | 1 | ---------------^ ^ ^ ^ ^ ^ ^
| | +-----------+ | | | | | |
| | | 2 | -----------------+ | | | | |
| | +-----------+ | | | | |
| | | 3 | -------------------+ | | | |
| | +-----------+ | | | |
| | | 4 | ---------------------+ | | |
| 2MB | +-----------+ | | |
| | | 5 | -----------------------+ | |
| | +-----------+ | |
| | | 6 | -------------------------+ |
| | +-----------+ |
| | | 7 | ---------------------------+
| | +-----------+
| |
| |
| |
+-----------+
After we do this, all tail vmemmap pages (1-7) are mapped to the head
vmemmap page frame (0). In other words, there are more than one page
struct with PG_head associated with each HugeTLB page. We __know__
that there is only one head page struct, the tail page structs with
PG_head are fake head page structs. We need an approach to distinguish
between those two different types of page structs so that compound_head(),
PageHead() and PageTail() can work properly if the parameter is the
tail page struct but with PG_head.
The following code snippet describes how to distinguish between real and
fake head page struct.
if (test_bit(PG_head, &page->flags)) {
unsigned long head = READ_ONCE(page[1].compound_head);
if (head & 1) {
if (head == (unsigned long)page + 1)
==> head page struct
else
==> tail page struct
} else
==> head page struct
}
We can safely access the field of the @page[1] with PG_head because the
@page is a compound page composed with at least two contiguous pages.
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Barry Song <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 2 +-
include/linux/page-flags.h | 78 +++++++++++++++++++++++--
mm/hugetlb_vmemmap.c | 62 +++++++++++---------
mm/sparse-vmemmap.c | 21 +++++++
4 files changed, 130 insertions(+), 33 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ad94a2aa9819..8ac050b9b3da 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1619,7 +1619,7 @@
[KNL] Reguires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
enabled.
Allows heavy hugetlb users to free up some more
- memory (6 * PAGE_SIZE for each 2MB hugetlb page).
+ memory (7 * PAGE_SIZE for each 2MB hugetlb page).
Format: { on | off (default) }
on: enable the feature
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 70bf0ec29ee3..7cd386538d0c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -184,13 +184,69 @@ enum pageflags {
#ifndef __GENERATING_BOUNDS_H
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+extern bool hugetlb_free_vmemmap_enabled;
+
+/*
+ * If the feature of freeing some vmemmap pages associated with each HugeTLB
+ * page is enabled, the head vmemmap page frame is reused and all of the tail
+ * vmemmap addresses map to the head vmemmap page frame (furture details can
+ * refer to the figure at the head of the mm/hugetlb_vmemmap.c). In other
+ * words, there are more than one page struct with PG_head associated with each
+ * HugeTLB page. We __know__ that there is only one head page struct, the tail
+ * page structs with PG_head are fake head page structs. We need an approach
+ * to distinguish between those two different types of page structs so that
+ * compound_head() can return the real head page struct when the parameter is
+ * the tail page struct but with PG_head.
+ *
+ * The page_fixed_fake_head() returns the real head page struct if the @page is
+ * fake page head, otherwise, returns @page which can either be a true page
+ * head or tail.
+ */
+static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
+{
+ if (!hugetlb_free_vmemmap_enabled)
+ return page;
+
+ /*
+ * Only addresses aligned with PAGE_SIZE of struct page may be fake head
+ * struct page. The alignment check aims to avoid access the fields (
+ * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
+ * cold cacheline in some cases.
+ */
+ if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
+ test_bit(PG_head, &page->flags)) {
+ /*
+ * We can safely access the field of the @page[1] with PG_head
+ * because the @page is a compound page composed with at least
+ * two contiguous pages.
+ */
+ unsigned long head = READ_ONCE(page[1].compound_head);
+
+ if (likely(head & 1))
+ return (const struct page *)(head - 1);
+ }
+ return page;
+}
+#else
+static inline const struct page *page_fixed_fake_head(const struct page *page)
+{
+ return page;
+}
+#endif
+
+static __always_inline int page_is_fake_head(struct page *page)
+{
+ return page_fixed_fake_head(page) != page;
+}
+
static inline unsigned long _compound_head(const struct page *page)
{
unsigned long head = READ_ONCE(page->compound_head);
if (unlikely(head & 1))
return head - 1;
- return (unsigned long)page;
+ return (unsigned long)page_fixed_fake_head(page);
}
#define compound_head(page) ((typeof(page))_compound_head(page))
@@ -225,12 +281,13 @@ static inline unsigned long _compound_head(const struct page *page)
static __always_inline int PageTail(struct page *page)
{
- return READ_ONCE(page->compound_head) & 1;
+ return READ_ONCE(page->compound_head) & 1 || page_is_fake_head(page);
}
static __always_inline int PageCompound(struct page *page)
{
- return test_bit(PG_head, &page->flags) || PageTail(page);
+ return test_bit(PG_head, &page->flags) ||
+ READ_ONCE(page->compound_head) & 1;
}
#define PAGE_POISON_PATTERN -1l
@@ -675,7 +732,20 @@ static inline bool test_set_page_writeback(struct page *page)
return set_page_writeback(page);
}
-__PAGEFLAG(Head, head, PF_ANY) CLEARPAGEFLAG(Head, head, PF_ANY)
+static __always_inline bool folio_test_head(struct folio *folio)
+{
+ return test_bit(PG_head, folio_flags(folio, FOLIO_PF_ANY));
+}
+
+static __always_inline int PageHead(struct page *page)
+{
+ PF_POISONED_CHECK(page);
+ return test_bit(PG_head, &page->flags) && !page_is_fake_head(page);
+}
+
+__SETPAGEFLAG(Head, head, PF_ANY)
+__CLEARPAGEFLAG(Head, head, PF_ANY)
+CLEARPAGEFLAG(Head, head, PF_ANY)
/* Whether there are one or multiple pages in a folio */
static inline bool folio_test_single(struct folio *folio)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c540c21e26f5..4977f5a520c2 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -124,9 +124,9 @@
* page of page structs (page 0) associated with the HugeTLB page contains the 4
* page structs necessary to describe the HugeTLB. The only use of the remaining
* pages of page structs (page 1 to page 7) is to point to page->compound_head.
- * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
+ * Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of page structs
* will be used for each HugeTLB page. This will allow us to free the remaining
- * 6 pages to the buddy allocator.
+ * 7 pages to the buddy allocator.
*
* Here is how things look after remapping.
*
@@ -134,30 +134,30 @@
* +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
* | | | 0 | -------------> | 0 |
* | | +-----------+ +-----------+
- * | | | 1 | -------------> | 1 |
- * | | +-----------+ +-----------+
- * | | | 2 | ----------------^ ^ ^ ^ ^ ^
- * | | +-----------+ | | | | |
- * | | | 3 | ------------------+ | | | |
- * | | +-----------+ | | | |
- * | | | 4 | --------------------+ | | |
- * | PMD | +-----------+ | | |
- * | level | | 5 | ----------------------+ | |
- * | mapping | +-----------+ | |
- * | | | 6 | ------------------------+ |
- * | | +-----------+ |
- * | | | 7 | --------------------------+
+ * | | | 1 | ---------------^ ^ ^ ^ ^ ^ ^
+ * | | +-----------+ | | | | | |
+ * | | | 2 | -----------------+ | | | | |
+ * | | +-----------+ | | | | |
+ * | | | 3 | -------------------+ | | | |
+ * | | +-----------+ | | | |
+ * | | | 4 | ---------------------+ | | |
+ * | PMD | +-----------+ | | |
+ * | level | | 5 | -----------------------+ | |
+ * | mapping | +-----------+ | |
+ * | | | 6 | -------------------------+ |
+ * | | +-----------+ |
+ * | | | 7 | ---------------------------+
* | | +-----------+
* | |
* | |
* | |
* +-----------+
*
- * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
+ * When a HugeTLB is freed to the buddy system, we should allocate 7 pages for
* vmemmap pages and restore the previous mapping relationship.
*
* For the HugeTLB page of the pud level mapping. It is similar to the former.
- * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages.
+ * We also can use this approach to free (PAGE_SIZE - 1) vmemmap pages.
*
* Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
* (e.g. aarch64) provides a contiguous bit in the translation table entries
@@ -166,7 +166,13 @@
*
* The contiguous bit is used to increase the mapping size at the pmd and pte
* (last) level. So this type of HugeTLB page can be optimized only when its
- * size of the struct page structs is greater than 2 pages.
+ * size of the struct page structs is greater than 1 page.
+ *
+ * Notice: The head vmemmap page is not freed to the buddy allocator and all
+ * tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
+ * more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
+ * associated with each HugeTLB page. The compound_head() can handle this
+ * correctly (more details refer to the comment above compound_head()).
*/
#define pr_fmt(fmt) "HugeTLB: " fmt
@@ -175,19 +181,21 @@
/*
* There are a lot of struct page structures associated with each HugeTLB page.
* For tail pages, the value of compound_head is the same. So we can reuse first
- * page of tail page structures. We map the virtual addresses of the remaining
- * pages of tail page structures to the first tail page struct, and then free
- * these page frames. Therefore, we need to reserve two pages as vmemmap areas.
+ * page of head page structures. We map the virtual addresses of all the pages
+ * of tail page structures to the head page struct, and then free these page
+ * frames. Therefore, we need to reserve one pages as vmemmap areas.
*/
-#define RESERVE_VMEMMAP_NR 2U
+#define RESERVE_VMEMMAP_NR 1U
#define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
-bool hugetlb_free_vmemmap_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);
+bool hugetlb_free_vmemmap_enabled __read_mostly =
+ IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);
+EXPORT_SYMBOL(hugetlb_free_vmemmap_enabled);
static int __init early_hugetlb_free_vmemmap_param(char *buf)
{
/* We cannot optimize if a "struct page" crosses page boundaries. */
- if ((!is_power_of_2(sizeof(struct page)))) {
+ if (!is_power_of_2(sizeof(struct page))) {
pr_warn("cannot free vmemmap pages because \"struct page\" crosses page boundaries\n");
return 0;
}
@@ -236,7 +244,6 @@ int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
*/
ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse,
GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE);
-
if (!ret)
ClearHPageVmemmapOptimized(head);
@@ -282,9 +289,8 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT;
/*
- * The head page and the first tail page are not to be freed to buddy
- * allocator, the other pages will map to the first tail page, so they
- * can be freed.
+ * The head page is not to be freed to buddy allocator, the other tail
+ * pages will map to the head page, so they can be freed.
*
* Could RESERVE_VMEMMAP_NR be greater than @vmemmap_pages? It is true
* on some architectures (e.g. aarch64). See Documentation/arm64/
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index db6df27c852a..e881f5db7091 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -245,6 +245,26 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
set_pte_at(&init_mm, addr, pte, entry);
}
+/*
+ * How many struct page structs need to be reset. When we reuse the head
+ * struct page, the special metadata (e.g. page->flags or page->mapping)
+ * cannot copy to the tail struct page structs. The invalid value will be
+ * checked in the free_tail_pages_check(). In order to avoid the message
+ * of "corrupted mapping in tail page". We need to reset at least 3 (one
+ * head struct page struct and two tail struct page structs) struct page
+ * structs.
+ */
+#define NR_RESET_STRUCT_PAGE 3
+
+static inline void reset_struct_pages(struct page *start)
+{
+ int i;
+ struct page *from = start + NR_RESET_STRUCT_PAGE;
+
+ for (i = 0; i < NR_RESET_STRUCT_PAGE; i++)
+ memcpy(start + i, from, sizeof(*from));
+}
+
static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
struct vmemmap_remap_walk *walk)
{
@@ -258,6 +278,7 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
list_del(&page->lru);
to = page_to_virt(page);
copy_page(to, (void *)walk->reuse_addr);
+ reset_struct_pages(to);
set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
}
--
2.11.0
The page_fixed_fake_head() is used throughout memory management and
the conditional check requires checking a global variable, although
the overhead of this check may be small, it increases when the memory
cache comes under pressure. Also, the global variable will not be
modified after system boot, so it is very appropriate to use static
key machanism.
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Barry Song <[email protected]>
---
include/linux/hugetlb.h | 6 ------
include/linux/page-flags.h | 16 ++++++++++++++--
mm/hugetlb_vmemmap.c | 12 ++++++------
mm/memory_hotplug.c | 2 +-
4 files changed, 21 insertions(+), 15 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 44c2ab0dfa59..27a2adff0db7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1077,12 +1077,6 @@ static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr
}
#endif /* CONFIG_HUGETLB_PAGE */
-#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
-extern bool hugetlb_free_vmemmap_enabled;
-#else
-#define hugetlb_free_vmemmap_enabled false
-#endif
-
static inline spinlock_t *huge_pte_lock(struct hstate *h,
struct mm_struct *mm, pte_t *pte)
{
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7cd386538d0c..26e540fd3393 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -185,7 +185,14 @@ enum pageflags {
#ifndef __GENERATING_BOUNDS_H
#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
-extern bool hugetlb_free_vmemmap_enabled;
+DECLARE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON,
+ hugetlb_free_vmemmap_enabled_key);
+
+static __always_inline bool hugetlb_free_vmemmap_enabled(void)
+{
+ return static_branch_maybe(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON,
+ &hugetlb_free_vmemmap_enabled_key);
+}
/*
* If the feature of freeing some vmemmap pages associated with each HugeTLB
@@ -205,7 +212,7 @@ extern bool hugetlb_free_vmemmap_enabled;
*/
static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
{
- if (!hugetlb_free_vmemmap_enabled)
+ if (!hugetlb_free_vmemmap_enabled())
return page;
/*
@@ -233,6 +240,11 @@ static inline const struct page *page_fixed_fake_head(const struct page *page)
{
return page;
}
+
+static inline bool hugetlb_free_vmemmap_enabled(void)
+{
+ return false;
+}
#endif
static __always_inline int page_is_fake_head(struct page *page)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4977f5a520c2..791626983c2e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -188,9 +188,9 @@
#define RESERVE_VMEMMAP_NR 1U
#define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
-bool hugetlb_free_vmemmap_enabled __read_mostly =
- IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);
-EXPORT_SYMBOL(hugetlb_free_vmemmap_enabled);
+DEFINE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON,
+ hugetlb_free_vmemmap_enabled_key);
+EXPORT_SYMBOL(hugetlb_free_vmemmap_enabled_key);
static int __init early_hugetlb_free_vmemmap_param(char *buf)
{
@@ -204,9 +204,9 @@ static int __init early_hugetlb_free_vmemmap_param(char *buf)
return -EINVAL;
if (!strcmp(buf, "on"))
- hugetlb_free_vmemmap_enabled = true;
+ static_branch_enable(&hugetlb_free_vmemmap_enabled_key);
else if (!strcmp(buf, "off"))
- hugetlb_free_vmemmap_enabled = false;
+ static_branch_disable(&hugetlb_free_vmemmap_enabled_key);
else
return -EINVAL;
@@ -284,7 +284,7 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
BUILD_BUG_ON(__NR_USED_SUBPAGE >=
RESERVE_VMEMMAP_SIZE / sizeof(struct page));
- if (!hugetlb_free_vmemmap_enabled)
+ if (!hugetlb_free_vmemmap_enabled())
return;
vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3de7933e5302..587a8fc61fc8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1326,7 +1326,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
* populate a single PMD.
*/
return memmap_on_memory &&
- !hugetlb_free_vmemmap_enabled &&
+ !hugetlb_free_vmemmap_enabled() &&
IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
size == memory_block_size_bytes() &&
IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
--
2.11.0
The init_mm.page_table_lock is used to protect kernel page tables, we
can use it to serialize splitting vmemmap PMD mappings instead of mmap
write lock, which can increase the concurrency of vmemmap_remap_free().
Actually, It increase the concurrency between allocations of HugeTLB
pages. But it is not the only benefit. There are a lot of users of
mmap read lock of init_mm. The mmap write lock is holding through
vmemmap_remap_free(), removing mmap write lock usage to make it does
not affect other users of mmap read lock. It is not making anything
worse and always a win to move.
Now the kernel page table walker does not hold the page_table_lock when
walking pmd entries. There may be consistency issue of a pmd entry,
because pmd entry might change from a huge pmd entry to a PTE page table.
There is only one user of kernel page table walker, namely ptdump. The
ptdump already considers the consistency, which use a local variable to
cache the value of pmd entry. But we also need to update ->action to
ACTION_CONTINUE to make sure the walker does not walk every pte entry
again when concurrent thread has split the huge pmd.
Signed-off-by: Muchun Song <[email protected]>
---
mm/ptdump.c | 16 ++++++++++++----
mm/sparse-vmemmap.c | 47 +++++++++++++++++++++++++++++++----------------
2 files changed, 43 insertions(+), 20 deletions(-)
diff --git a/mm/ptdump.c b/mm/ptdump.c
index da751448d0e4..eea3d28d173c 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -40,8 +40,10 @@ static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
if (st->effective_prot)
st->effective_prot(st, 0, pgd_val(val));
- if (pgd_leaf(val))
+ if (pgd_leaf(val)) {
st->note_page(st, addr, 0, pgd_val(val));
+ walk->action = ACTION_CONTINUE;
+ }
return 0;
}
@@ -61,8 +63,10 @@ static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
if (st->effective_prot)
st->effective_prot(st, 1, p4d_val(val));
- if (p4d_leaf(val))
+ if (p4d_leaf(val)) {
st->note_page(st, addr, 1, p4d_val(val));
+ walk->action = ACTION_CONTINUE;
+ }
return 0;
}
@@ -82,8 +86,10 @@ static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
if (st->effective_prot)
st->effective_prot(st, 2, pud_val(val));
- if (pud_leaf(val))
+ if (pud_leaf(val)) {
st->note_page(st, addr, 2, pud_val(val));
+ walk->action = ACTION_CONTINUE;
+ }
return 0;
}
@@ -101,8 +107,10 @@ static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
if (st->effective_prot)
st->effective_prot(st, 3, pmd_val(val));
- if (pmd_leaf(val))
+ if (pmd_leaf(val)) {
st->note_page(st, addr, 3, pmd_val(val));
+ walk->action = ACTION_CONTINUE;
+ }
return 0;
}
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index e881f5db7091..c64d1aa3c4b5 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -53,8 +53,7 @@ struct vmemmap_remap_walk {
struct list_head *vmemmap_pages;
};
-static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
- struct vmemmap_remap_walk *walk)
+static int __split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
{
pmd_t __pmd;
int i;
@@ -76,15 +75,34 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
set_pte_at(&init_mm, addr, pte, entry);
}
- /* Make pte visible before pmd. See comment in pmd_install(). */
- smp_wmb();
- pmd_populate_kernel(&init_mm, pmd, pgtable);
-
- flush_tlb_kernel_range(start, start + PMD_SIZE);
+ spin_lock(&init_mm.page_table_lock);
+ if (likely(pmd_leaf(*pmd))) {
+ /* Make pte visible before pmd. See comment in pmd_install(). */
+ smp_wmb();
+ pmd_populate_kernel(&init_mm, pmd, pgtable);
+ flush_tlb_kernel_range(start, start + PMD_SIZE);
+ } else {
+ pte_free_kernel(&init_mm, pgtable);
+ }
+ spin_unlock(&init_mm.page_table_lock);
return 0;
}
+static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
+{
+ int leaf;
+
+ spin_lock(&init_mm.page_table_lock);
+ leaf = pmd_leaf(*pmd);
+ spin_unlock(&init_mm.page_table_lock);
+
+ if (!leaf)
+ return 0;
+
+ return __split_vmemmap_huge_pmd(pmd, start);
+}
+
static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end,
struct vmemmap_remap_walk *walk)
@@ -121,13 +139,12 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
pmd = pmd_offset(pud, addr);
do {
- if (pmd_leaf(*pmd)) {
- int ret;
+ int ret;
+
+ ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
+ if (ret)
+ return ret;
- ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK, walk);
- if (ret)
- return ret;
- }
next = pmd_addr_end(addr, end);
vmemmap_pte_range(pmd, addr, next, walk);
} while (pmd++, addr = next, addr != end);
@@ -321,10 +338,8 @@ int vmemmap_remap_free(unsigned long start, unsigned long end,
*/
BUG_ON(start - reuse != PAGE_SIZE);
- mmap_write_lock(&init_mm);
+ mmap_read_lock(&init_mm);
ret = vmemmap_remap_range(reuse, end, &walk);
- mmap_write_downgrade(&init_mm);
-
if (ret && walk.nr_walked) {
end = reuse + walk.nr_walked * PAGE_SIZE;
/*
--
2.11.0
Since the head vmemmap page frame associated with each HugeTLB page is
reused, we should hide the PG_head flag of tail struct page from the
user. Add a tese case to check whether it is work properly. The test
steps are as follows.
1) alloc 2MB hugeTLB
2) get each page frame
3) apply those APIs in each page frame
4) Those APIs work completely the same as before.
Reading the flags of a page by /proc/kpageflags is done in
stable_page_flags(), which has invoked PageHead(), PageTail(),
PageCompound() and compound_head(). If those APIs work properly, the
head page must have 15 and 17 bits set. And tail pages must have 16
and 17 bits set but 15 bit unset. Those flags are checked in
check_page_flags().
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Barry Song <[email protected]>
---
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/hugepage-vmemmap.c | 144 ++++++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests.sh | 11 ++
4 files changed, 157 insertions(+)
create mode 100644 tools/testing/selftests/vm/hugepage-vmemmap.c
diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 2e7e86e85282..3b5faec3c04f 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -2,6 +2,7 @@
hugepage-mmap
hugepage-mremap
hugepage-shm
+hugepage-vmemmap
khugepaged
map_hugetlb
map_populate
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 1607322a112c..7d100a7dc462 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -31,6 +31,7 @@ TEST_GEN_FILES += hmm-tests
TEST_GEN_FILES += hugepage-mmap
TEST_GEN_FILES += hugepage-mremap
TEST_GEN_FILES += hugepage-shm
+TEST_GEN_FILES += hugepage-vmemmap
TEST_GEN_FILES += khugepaged
TEST_GEN_FILES += madv_populate
TEST_GEN_FILES += map_fixed_noreplace
diff --git a/tools/testing/selftests/vm/hugepage-vmemmap.c b/tools/testing/selftests/vm/hugepage-vmemmap.c
new file mode 100644
index 000000000000..557bdbd4f87e
--- /dev/null
+++ b/tools/testing/selftests/vm/hugepage-vmemmap.c
@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A test case of using hugepage memory in a user application using the
+ * mmap system call with MAP_HUGETLB flag. Before running this program
+ * make sure the administrator has allocated enough default sized huge
+ * pages to cover the 2 MB allocation.
+ */
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+
+#define MAP_LENGTH (2UL * 1024 * 1024)
+
+#ifndef MAP_HUGETLB
+#define MAP_HUGETLB 0x40000 /* arch specific */
+#endif
+
+#define PAGE_SIZE 4096
+
+#define PAGE_COMPOUND_HEAD (1UL << 15)
+#define PAGE_COMPOUND_TAIL (1UL << 16)
+#define PAGE_HUGE (1UL << 17)
+
+#define HEAD_PAGE_FLAGS (PAGE_COMPOUND_HEAD | PAGE_HUGE)
+#define TAIL_PAGE_FLAGS (PAGE_COMPOUND_TAIL | PAGE_HUGE)
+
+#define PM_PFRAME_BITS 55
+#define PM_PFRAME_MASK ~((1UL << PM_PFRAME_BITS) - 1)
+
+/*
+ * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages.
+ * That means the addresses starting with 0x800000... will need to be
+ * specified. Specifying a fixed address is not required on ppc64, i386
+ * or x86_64.
+ */
+#ifdef __ia64__
+#define MAP_ADDR (void *)(0x8000000000000000UL)
+#define MAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED)
+#else
+#define MAP_ADDR NULL
+#define MAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
+#endif
+
+static void write_bytes(char *addr, size_t length)
+{
+ unsigned long i;
+
+ for (i = 0; i < length; i++)
+ *(addr + i) = (char)i;
+}
+
+static unsigned long virt_to_pfn(void *addr)
+{
+ int fd;
+ unsigned long pagemap;
+
+ fd = open("/proc/self/pagemap", O_RDONLY);
+ if (fd < 0)
+ return -1UL;
+
+ lseek(fd, (unsigned long)addr / PAGE_SIZE * sizeof(pagemap), SEEK_SET);
+ read(fd, &pagemap, sizeof(pagemap));
+ close(fd);
+
+ return pagemap & ~PM_PFRAME_MASK;
+}
+
+static int check_page_flags(unsigned long pfn)
+{
+ int fd, i;
+ unsigned long pageflags;
+
+ fd = open("/proc/kpageflags", O_RDONLY);
+ if (fd < 0)
+ return -1;
+
+ lseek(fd, pfn * sizeof(pageflags), SEEK_SET);
+
+ read(fd, &pageflags, sizeof(pageflags));
+ if ((pageflags & HEAD_PAGE_FLAGS) != HEAD_PAGE_FLAGS) {
+ close(fd);
+ printf("Head page flags (%lx) is invalid\n", pageflags);
+ return -1;
+ }
+
+ /*
+ * pages other than the first page must be tail and shouldn't be head;
+ * this also verifies kernel has correctly set the fake page_head to tail
+ * while hugetlb_free_vmemmap is enabled.
+ */
+ for (i = 1; i < MAP_LENGTH / PAGE_SIZE; i++) {
+ read(fd, &pageflags, sizeof(pageflags));
+ if ((pageflags & TAIL_PAGE_FLAGS) != TAIL_PAGE_FLAGS ||
+ (pageflags & HEAD_PAGE_FLAGS) == HEAD_PAGE_FLAGS) {
+ close(fd);
+ printf("Tail page flags (%lx) is invalid\n", pageflags);
+ return -1;
+ }
+ }
+
+ close(fd);
+
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ void *addr;
+ unsigned long pfn;
+
+ addr = mmap(MAP_ADDR, MAP_LENGTH, PROT_READ | PROT_WRITE, MAP_FLAGS, -1, 0);
+ if (addr == MAP_FAILED) {
+ perror("mmap");
+ exit(1);
+ }
+
+ /* Trigger allocation of HugeTLB page. */
+ write_bytes(addr, MAP_LENGTH);
+
+ pfn = virt_to_pfn(addr);
+ if (pfn == -1UL) {
+ munmap(addr, MAP_LENGTH);
+ perror("virt_to_pfn");
+ exit(1);
+ }
+
+ printf("Returned address is %p whose pfn is %lx\n", addr, pfn);
+
+ if (check_page_flags(pfn) < 0) {
+ munmap(addr, MAP_LENGTH);
+ perror("check_page_flags");
+ exit(1);
+ }
+
+ /* munmap() length of MAP_HUGETLB memory must be hugepage aligned */
+ if (munmap(addr, MAP_LENGTH)) {
+ perror("munmap");
+ exit(1);
+ }
+
+ return 0;
+}
diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index 45e803af7c77..745f86e7a086 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -108,6 +108,17 @@ else
echo "[PASS]"
fi
+echo "------------------------"
+echo "running hugepage-vmemmap"
+echo "------------------------"
+./hugepage-vmemmap
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
echo "NOTE: The above hugetlb tests provide minimal coverage. Use"
echo " https://github.com/libhugetlbfs/libhugetlbfs.git for"
echo " hugetlb regression testing."
--
2.11.0
The vmemmap_remap_free/alloc are relevant to HugeTLB, so move those
functiongs to the scope of CONFIG_HUGETLB_PAGE_FREE_VMEMMAP.
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Barry Song <[email protected]>
---
include/linux/mm.h | 2 ++
mm/sparse-vmemmap.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..8c85863a067c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3184,10 +3184,12 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
}
#endif
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
int vmemmap_remap_free(unsigned long start, unsigned long end,
unsigned long reuse);
int vmemmap_remap_alloc(unsigned long start, unsigned long end,
unsigned long reuse, gfp_t gfp_mask);
+#endif
void *sparse_buffer_alloc(unsigned long size);
struct page * __populate_section_memmap(unsigned long pfn,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index c64d1aa3c4b5..8aecd6b3896c 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -34,6 +34,7 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
/**
* struct vmemmap_remap_walk - walk vmemmap page table
*
@@ -419,6 +420,7 @@ int vmemmap_remap_alloc(unsigned long start, unsigned long end,
return 0;
}
+#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
/*
* Allocate a block of memory to be used to back the virtual memory map
--
2.11.0
On Mon, Nov 1, 2021 at 11:22 AM Muchun Song <[email protected]> wrote:
>
> This series can minimize the overhead of struct page for 2MB HugeTLB pages
> significantly. It further reduces the overhead of struct page by 12.5% for
> a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB
> HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks.
>
Hi,
Ping guys. Does anyone have any comments or suggestions
on this series?
Thanks.
On 11/8/21 12:16 AM, Muchun Song wrote:
> On Mon, Nov 1, 2021 at 11:22 AM Muchun Song <[email protected]> wrote:
>>
>> This series can minimize the overhead of struct page for 2MB HugeTLB pages
>> significantly. It further reduces the overhead of struct page by 12.5% for
>> a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB
>> HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks.
>>
>
> Hi,
>
> Ping guys. Does anyone have any comments or suggestions
> on this series?
>
> Thanks.
>
I did look over the series earlier. I have no issue with the hugetlb and
vmemmap modifications as they are enhancements to the existing
optimizations. My primary concern is the (small) increased overhead
for the helpers as outlined in your cover letter. Since these helpers
are not limited to hugetlb and used throughout the kernel, I would
really like to get comments from others with a better understanding of
the potential impact.
--
Mike Kravetz
On Tue, Nov 9, 2021 at 3:33 AM Mike Kravetz <[email protected]> wrote:
>
> On 11/8/21 12:16 AM, Muchun Song wrote:
> > On Mon, Nov 1, 2021 at 11:22 AM Muchun Song <[email protected]> wrote:
> >>
> >> This series can minimize the overhead of struct page for 2MB HugeTLB pages
> >> significantly. It further reduces the overhead of struct page by 12.5% for
> >> a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB
> >> HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks.
> >>
> >
> > Hi,
> >
> > Ping guys. Does anyone have any comments or suggestions
> > on this series?
> >
> > Thanks.
> >
>
> I did look over the series earlier. I have no issue with the hugetlb and
> vmemmap modifications as they are enhancements to the existing
> optimizations. My primary concern is the (small) increased overhead
> for the helpers as outlined in your cover letter. Since these helpers
> are not limited to hugetlb and used throughout the kernel, I would
> really like to get comments from others with a better understanding of
> the potential impact.
Thanks Mike. I'd like to hear others' comments about this as well.
From my point of view, maybe the (small) overhead is acceptable
since it only affects the head page, however Matthew Wilcox's folio
series could reduce this situation as well.
Looking forward to others' comments.
Thanks.
>
> --
> Mike Kravetz
On Wed, Nov 10, 2021 at 2:18 PM Muchun Song <[email protected]> wrote:
>
> On Tue, Nov 9, 2021 at 3:33 AM Mike Kravetz <[email protected]> wrote:
> >
> > On 11/8/21 12:16 AM, Muchun Song wrote:
> > > On Mon, Nov 1, 2021 at 11:22 AM Muchun Song <[email protected]> wrote:
> > >>
> > >> This series can minimize the overhead of struct page for 2MB HugeTLB pages
> > >> significantly. It further reduces the overhead of struct page by 12.5% for
> > >> a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB
> > >> HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks.
> > >>
> > >
> > > Hi,
> > >
> > > Ping guys. Does anyone have any comments or suggestions
> > > on this series?
> > >
> > > Thanks.
> > >
> >
> > I did look over the series earlier. I have no issue with the hugetlb and
> > vmemmap modifications as they are enhancements to the existing
> > optimizations. My primary concern is the (small) increased overhead
> > for the helpers as outlined in your cover letter. Since these helpers
> > are not limited to hugetlb and used throughout the kernel, I would
> > really like to get comments from others with a better understanding of
> > the potential impact.
>
> Thanks Mike. I'd like to hear others' comments about this as well.
> From my point of view, maybe the (small) overhead is acceptable
> since it only affects the head page, however Matthew Wilcox's folio
> series could reduce this situation as well.
>
> Looking forward to others' comments.
>
Ping guys.
Hi Andrew,
Do you have any suggestions on this series to move it on?
Thanks.
On Mon, 22 Nov 2021 12:21:32 +0800 Muchun Song <[email protected]> wrote:
> On Wed, Nov 10, 2021 at 2:18 PM Muchun Song <[email protected]> wrote:
> >
> > On Tue, Nov 9, 2021 at 3:33 AM Mike Kravetz <[email protected]> wrote:
> > >
> > > On 11/8/21 12:16 AM, Muchun Song wrote:
> > > > On Mon, Nov 1, 2021 at 11:22 AM Muchun Song <[email protected]> wrote:
> > > >>
> > > >> This series can minimize the overhead of struct page for 2MB HugeTLB pages
> > > >> significantly. It further reduces the overhead of struct page by 12.5% for
> > > >> a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB
> > > >> HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks.
> > > >>
> > > >
> > > > Hi,
> > > >
> > > > Ping guys. Does anyone have any comments or suggestions
> > > > on this series?
> > > >
> > > > Thanks.
> > > >
> > >
> > > I did look over the series earlier. I have no issue with the hugetlb and
> > > vmemmap modifications as they are enhancements to the existing
> > > optimizations. My primary concern is the (small) increased overhead
> > > for the helpers as outlined in your cover letter. Since these helpers
> > > are not limited to hugetlb and used throughout the kernel, I would
> > > really like to get comments from others with a better understanding of
> > > the potential impact.
> >
> > Thanks Mike. I'd like to hear others' comments about this as well.
> > From my point of view, maybe the (small) overhead is acceptable
> > since it only affects the head page, however Matthew Wilcox's folio
> > series could reduce this situation as well.
I think Mike was inviting you to run some tests to quantify the
overhead ;)
> Ping guys.
>
> Hi Andrew,
>
> Do you have any suggestions on this series to move it on?
>
I tossed it in there for some testing but yes please, additional
reviewing?
On Wed, Nov 24, 2021 at 11:09 AM Andrew Morton
<[email protected]> wrote:
>
> On Mon, 22 Nov 2021 12:21:32 +0800 Muchun Song <[email protected]> wrote:
>
> > On Wed, Nov 10, 2021 at 2:18 PM Muchun Song <[email protected]> wrote:
> > >
> > > On Tue, Nov 9, 2021 at 3:33 AM Mike Kravetz <[email protected]> wrote:
> > > >
> > > > On 11/8/21 12:16 AM, Muchun Song wrote:
> > > > > On Mon, Nov 1, 2021 at 11:22 AM Muchun Song <[email protected]> wrote:
> > > > >>
> > > > >> This series can minimize the overhead of struct page for 2MB HugeTLB pages
> > > > >> significantly. It further reduces the overhead of struct page by 12.5% for
> > > > >> a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB
> > > > >> HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks.
> > > > >>
> > > > >
> > > > > Hi,
> > > > >
> > > > > Ping guys. Does anyone have any comments or suggestions
> > > > > on this series?
> > > > >
> > > > > Thanks.
> > > > >
> > > >
> > > > I did look over the series earlier. I have no issue with the hugetlb and
> > > > vmemmap modifications as they are enhancements to the existing
> > > > optimizations. My primary concern is the (small) increased overhead
> > > > for the helpers as outlined in your cover letter. Since these helpers
> > > > are not limited to hugetlb and used throughout the kernel, I would
> > > > really like to get comments from others with a better understanding of
> > > > the potential impact.
> > >
> > > Thanks Mike. I'd like to hear others' comments about this as well.
> > > From my point of view, maybe the (small) overhead is acceptable
> > > since it only affects the head page, however Matthew Wilcox's folio
> > > series could reduce this situation as well.
>
> I think Mike was inviting you to run some tests to quantify the
> overhead ;)
Hi Andrew,
Sorry for the late reply.
Specific overhead figures are already in the cover letter. Also,
I did some other tests, e.g. kernel compilation, sysbench. I didn't
see any regressions.
>
> > Ping guys.
> >
> > Hi Andrew,
> >
> > Do you have any suggestions on this series to move it on?
> >
>
> I tossed it in there for some testing but yes please, additional
> reviewing?
It's already been in the next-tree (also in our ByteDance servers)
for several months, and I didn't receive any negative feedback.
Do you think it is ready for 5.17?
Thanks.
On Wed, Jan 26, 2022 at 4:04 PM Muchun Song <[email protected]> wrote:
>
> On Wed, Nov 24, 2021 at 11:09 AM Andrew Morton
> <[email protected]> wrote:
> >
> > On Mon, 22 Nov 2021 12:21:32 +0800 Muchun Song <[email protected]> wrote:
> >
> > > On Wed, Nov 10, 2021 at 2:18 PM Muchun Song <[email protected]> wrote:
> > > >
> > > > On Tue, Nov 9, 2021 at 3:33 AM Mike Kravetz <[email protected]> wrote:
> > > > >
> > > > > On 11/8/21 12:16 AM, Muchun Song wrote:
> > > > > > On Mon, Nov 1, 2021 at 11:22 AM Muchun Song <[email protected]> wrote:
> > > > > >>
> > > > > >> This series can minimize the overhead of struct page for 2MB HugeTLB pages
> > > > > >> significantly. It further reduces the overhead of struct page by 12.5% for
> > > > > >> a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB
> > > > > >> HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks.
> > > > > >>
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Ping guys. Does anyone have any comments or suggestions
> > > > > > on this series?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > >
> > > > > I did look over the series earlier. I have no issue with the hugetlb and
> > > > > vmemmap modifications as they are enhancements to the existing
> > > > > optimizations. My primary concern is the (small) increased overhead
> > > > > for the helpers as outlined in your cover letter. Since these helpers
> > > > > are not limited to hugetlb and used throughout the kernel, I would
> > > > > really like to get comments from others with a better understanding of
> > > > > the potential impact.
> > > >
> > > > Thanks Mike. I'd like to hear others' comments about this as well.
> > > > From my point of view, maybe the (small) overhead is acceptable
> > > > since it only affects the head page, however Matthew Wilcox's folio
> > > > series could reduce this situation as well.
> >
> > I think Mike was inviting you to run some tests to quantify the
> > overhead ;)
>
> Hi Andrew,
>
> Sorry for the late reply.
>
> Specific overhead figures are already in the cover letter. Also,
> I did some other tests, e.g. kernel compilation, sysbench. I didn't
> see any regressions.
The overhead is introduced by page_fixed_fake_head() which
has an "if" statement and an access to a possible cold cache line.
I think the main overhead is from the latter. However, probabilistically,
only 1/64 of the pages need to do the latter. And
page_fixed_fake_head() is already simple (I mean the overhead
is small enough) and many performance bottlenecks in mm are
not in compound_head(). This also matches the tests I did.
I didn't see any regressions after enabling this feature.
I knew Mike's concern is the increased overhead to use cases
beyond HugeTLB. If we really want to avoid the access to
a possible cold cache line, we can introduce a new page
flag like PG_hugetlb and test if it is set in the page->flags,
if so, then return the read head page struct. Then
page_fixed_fake_head() looks like below.
static __always_inline const struct page *page_fixed_fake_head(const
struct page *page)
{
if (!hugetlb_free_vmemmap_enabled())
return page;
if (test_bit(PG_hugetlb, &page->flags)) {
unsigned long head = READ_ONCE(page[1].compound_head);
if (likely(head & 1))
return (const struct page *)(head - 1);
}
return page;
}
But I don't think it's worth doing this.
Hi Mike and Andrew,
Since these helpers are not limited to hugetlb and used throughout the
kernel, I would really like to get comments from others with a better
understanding of the potential impact. Do you have any appropriate
reviewers to invite?
Thanks.
>
> >
> > > Ping guys.
> > >
> > > Hi Andrew,
> > >
> > > Do you have any suggestions on this series to move it on?
> > >
> >
> > I tossed it in there for some testing but yes please, additional
> > reviewing?
>
> It's already been in the next-tree (also in our ByteDance servers)
> for several months, and I didn't receive any negative feedback.
>
> Do you think it is ready for 5.17?
>
> Thanks.
On 2/8/22 23:44, Muchun Song wrote:
> On Wed, Jan 26, 2022 at 4:04 PM Muchun Song <[email protected]> wrote:
>>
>> On Wed, Nov 24, 2021 at 11:09 AM Andrew Morton
>> <[email protected]> wrote:
>>>
>>> On Mon, 22 Nov 2021 12:21:32 +0800 Muchun Song <[email protected]> wrote:
>>>
>>>> On Wed, Nov 10, 2021 at 2:18 PM Muchun Song <[email protected]> wrote:
>>>>>
>>>>> On Tue, Nov 9, 2021 at 3:33 AM Mike Kravetz <[email protected]> wrote:
>>>>>>
>>>>>> On 11/8/21 12:16 AM, Muchun Song wrote:
>>>>>>> On Mon, Nov 1, 2021 at 11:22 AM Muchun Song <[email protected]> wrote:
>>>>>>>>
>>>>>>>> This series can minimize the overhead of struct page for 2MB HugeTLB pages
>>>>>>>> significantly. It further reduces the overhead of struct page by 12.5% for
>>>>>>>> a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB
>>>>>>>> HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks.
>>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Ping guys. Does anyone have any comments or suggestions
>>>>>>> on this series?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>
>>>>>> I did look over the series earlier. I have no issue with the hugetlb and
>>>>>> vmemmap modifications as they are enhancements to the existing
>>>>>> optimizations. My primary concern is the (small) increased overhead
>>>>>> for the helpers as outlined in your cover letter. Since these helpers
>>>>>> are not limited to hugetlb and used throughout the kernel, I would
>>>>>> really like to get comments from others with a better understanding of
>>>>>> the potential impact.
>>>>>
>>>>> Thanks Mike. I'd like to hear others' comments about this as well.
>>>>> From my point of view, maybe the (small) overhead is acceptable
>>>>> since it only affects the head page, however Matthew Wilcox's folio
>>>>> series could reduce this situation as well.
>>>
>>> I think Mike was inviting you to run some tests to quantify the
>>> overhead ;)
>>
>> Hi Andrew,
>>
>> Sorry for the late reply.
>>
>> Specific overhead figures are already in the cover letter. Also,
>> I did some other tests, e.g. kernel compilation, sysbench. I didn't
>> see any regressions.
>
> The overhead is introduced by page_fixed_fake_head() which
> has an "if" statement and an access to a possible cold cache line.
> I think the main overhead is from the latter. However, probabilistically,
> only 1/64 of the pages need to do the latter. And
> page_fixed_fake_head() is already simple (I mean the overhead
> is small enough) and many performance bottlenecks in mm are
> not in compound_head(). This also matches the tests I did.
> I didn't see any regressions after enabling this feature.
>
> I knew Mike's concern is the increased overhead to use cases
> beyond HugeTLB. If we really want to avoid the access to
> a possible cold cache line, we can introduce a new page
> flag like PG_hugetlb and test if it is set in the page->flags,
> if so, then return the read head page struct. Then
> page_fixed_fake_head() looks like below.
>
> static __always_inline const struct page *page_fixed_fake_head(const
> struct page *page)
> {
> if (!hugetlb_free_vmemmap_enabled())
> return page;
>
> if (test_bit(PG_hugetlb, &page->flags)) {
> unsigned long head = READ_ONCE(page[1].compound_head);
>
> if (likely(head & 1))
> return (const struct page *)(head - 1);
> }
> return page;
> }
>
> But I don't think it's worth doing this.
>
> Hi Mike and Andrew,
>
> Since these helpers are not limited to hugetlb and used throughout the
> kernel, I would really like to get comments from others with a better
> understanding of the potential impact. Do you have any appropriate
> reviewers to invite?
>
I think the appropriate people are already on Cc as they provided input on
the original vmemmap optimization series.
The question that needs to be answered is simple enough: Is the savings of
one vmemmap page per hugetlb page worth the extra minimal overhead in
compound_head()? Like most things, this depends on workload.
One thing to note is that compound_page() overhead is only introduced if
hugetlb vmemmap freeing is enabled. Correct? During the original vmemmap
optimization discussions, people thought it important that this be 'opt in'. I do not know if distos will enable this by default. But, perhaps the
potential overhead can be thought of as just part of 'opting in' for
vmemmap optimizations.
--
Mike Kravetz
On Thu, Feb 10, 2022 at 6:49 AM Mike Kravetz <[email protected]> wrote:
>
> On 2/8/22 23:44, Muchun Song wrote:
> > On Wed, Jan 26, 2022 at 4:04 PM Muchun Song <[email protected]> wrote:
> >>
> >> On Wed, Nov 24, 2021 at 11:09 AM Andrew Morton
> >> <[email protected]> wrote:
> >>>
> >>> On Mon, 22 Nov 2021 12:21:32 +0800 Muchun Song <[email protected]> wrote:
> >>>
> >>>> On Wed, Nov 10, 2021 at 2:18 PM Muchun Song <[email protected]> wrote:
> >>>>>
> >>>>> On Tue, Nov 9, 2021 at 3:33 AM Mike Kravetz <[email protected]> wrote:
> >>>>>>
> >>>>>> On 11/8/21 12:16 AM, Muchun Song wrote:
> >>>>>>> On Mon, Nov 1, 2021 at 11:22 AM Muchun Song <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> This series can minimize the overhead of struct page for 2MB HugeTLB pages
> >>>>>>>> significantly. It further reduces the overhead of struct page by 12.5% for
> >>>>>>>> a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB
> >>>>>>>> HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Ping guys. Does anyone have any comments or suggestions
> >>>>>>> on this series?
> >>>>>>>
> >>>>>>> Thanks.
> >>>>>>>
> >>>>>>
> >>>>>> I did look over the series earlier. I have no issue with the hugetlb and
> >>>>>> vmemmap modifications as they are enhancements to the existing
> >>>>>> optimizations. My primary concern is the (small) increased overhead
> >>>>>> for the helpers as outlined in your cover letter. Since these helpers
> >>>>>> are not limited to hugetlb and used throughout the kernel, I would
> >>>>>> really like to get comments from others with a better understanding of
> >>>>>> the potential impact.
> >>>>>
> >>>>> Thanks Mike. I'd like to hear others' comments about this as well.
> >>>>> From my point of view, maybe the (small) overhead is acceptable
> >>>>> since it only affects the head page, however Matthew Wilcox's folio
> >>>>> series could reduce this situation as well.
> >>>
> >>> I think Mike was inviting you to run some tests to quantify the
> >>> overhead ;)
> >>
> >> Hi Andrew,
> >>
> >> Sorry for the late reply.
> >>
> >> Specific overhead figures are already in the cover letter. Also,
> >> I did some other tests, e.g. kernel compilation, sysbench. I didn't
> >> see any regressions.
> >
> > The overhead is introduced by page_fixed_fake_head() which
> > has an "if" statement and an access to a possible cold cache line.
> > I think the main overhead is from the latter. However, probabilistically,
> > only 1/64 of the pages need to do the latter. And
> > page_fixed_fake_head() is already simple (I mean the overhead
> > is small enough) and many performance bottlenecks in mm are
> > not in compound_head(). This also matches the tests I did.
> > I didn't see any regressions after enabling this feature.
> >
> > I knew Mike's concern is the increased overhead to use cases
> > beyond HugeTLB. If we really want to avoid the access to
> > a possible cold cache line, we can introduce a new page
> > flag like PG_hugetlb and test if it is set in the page->flags,
> > if so, then return the read head page struct. Then
> > page_fixed_fake_head() looks like below.
> >
> > static __always_inline const struct page *page_fixed_fake_head(const
> > struct page *page)
> > {
> > if (!hugetlb_free_vmemmap_enabled())
> > return page;
> >
> > if (test_bit(PG_hugetlb, &page->flags)) {
> > unsigned long head = READ_ONCE(page[1].compound_head);
> >
> > if (likely(head & 1))
> > return (const struct page *)(head - 1);
> > }
> > return page;
> > }
> >
> > But I don't think it's worth doing this.
> >
> > Hi Mike and Andrew,
> >
> > Since these helpers are not limited to hugetlb and used throughout the
> > kernel, I would really like to get comments from others with a better
> > understanding of the potential impact. Do you have any appropriate
> > reviewers to invite?
> >
>
> I think the appropriate people are already on Cc as they provided input on
> the original vmemmap optimization series.
>
> The question that needs to be answered is simple enough: Is the savings of
> one vmemmap page per hugetlb page worth the extra minimal overhead in
> compound_head()? Like most things, this depends on workload.
>
> One thing to note is that compound_page() overhead is only introduced if
> hugetlb vmemmap freeing is enabled. Correct?
Definitely correct.
> During the original vmemmap
> optimization discussions, people thought it important that this be 'opt in'. I do not know if distos will enable this by default. But, perhaps the
> potential overhead can be thought of as just part of 'opting in' for
> vmemmap optimizations.
I agree. Does anyone else have a different opinion?
Thanks.