2021-06-09 17:23:47

by Muchun Song

[permalink] [raw]
Subject: [PATCH 0/5] Split huge PMD mapping of vmemmap pages

In order to reduce the difficulty of code review in series [1], we disabled
huge PMD mapping of vmemmap pages when that feature was enabled. In this
series, we no longer disable huge PMD mapping of vmemmap pages; instead,
we split the huge PMD mapping when needed.

[1] https://lore.kernel.org/linux-doc/[email protected]/

Muchun Song (5):
mm: hugetlb: introduce helpers to preallocate/free page tables
mm: hugetlb: introduce helpers to preallocate page tables from bootmem
allocator
mm: sparsemem: split the huge PMD mapping of vmemmap pages
mm: sparsemem: use huge PMD mapping for vmemmap pages
mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON

 Documentation/admin-guide/kernel-parameters.txt |  10 +-
 arch/x86/mm/init_64.c                           |   8 +-
 fs/Kconfig                                      |  10 ++
 include/linux/hugetlb.h                         |  28 ++----
 include/linux/mm.h                              |   2 +-
 mm/hugetlb.c                                    |  42 +++++++-
 mm/hugetlb_vmemmap.c                            | 126 +++++++++++++++++++++++-
 mm/hugetlb_vmemmap.h                            |  25 +++++
 mm/memory_hotplug.c                             |   2 +-
 mm/sparse-vmemmap.c                             |  61 ++++++++++--
 10 files changed, 267 insertions(+), 47 deletions(-)

--
2.11.0


2021-06-09 17:25:20

by Muchun Song

[permalink] [raw]
Subject: [PATCH 2/5] mm: hugetlb: introduce helpers to preallocate page tables from bootmem allocator

If we want to split the huge PMD mapping of vmemmap pages associated with
each gigantic page allocated from the bootmem allocator, we should
preallocate the page tables from the bootmem allocator. In this patch, we
introduce some helpers to preallocate page tables for gigantic pages.

Signed-off-by: Muchun Song <[email protected]>
---
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb_vmemmap.c    | 63 +++++++++++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h    | 13 ++++++++++
 3 files changed, 79 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 03ca83db0a3e..c27a299c4211 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -622,6 +622,9 @@ struct hstate {
 struct huge_bootmem_page {
 	struct list_head list;
 	struct hstate *hstate;
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+	pte_t *vmemmap_pte;
+#endif
 };

int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 628e2752714f..6f3a47b4ebd3 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -171,6 +171,7 @@
#define pr_fmt(fmt) "HugeTLB: " fmt

#include <linux/list.h>
+#include <linux/memblock.h>
#include <asm/pgalloc.h>

#include "hugetlb_vmemmap.h"
@@ -263,6 +264,68 @@ int vmemmap_pgtable_prealloc(struct hstate *h, struct list_head *pgtables)
 	return -ENOMEM;
 }

+unsigned long __init gigantic_vmemmap_pgtable_prealloc(void)
+{
+	struct huge_bootmem_page *m, *tmp;
+	unsigned long nr_free = 0;
+
+	list_for_each_entry_safe(m, tmp, &huge_boot_pages, list) {
+		struct hstate *h = m->hstate;
+		unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
+		unsigned long size;
+
+		if (!nr)
+			continue;
+
+		size = nr << PAGE_SHIFT;
+		m->vmemmap_pte = memblock_alloc_try_nid(size, PAGE_SIZE, 0,
+							MEMBLOCK_ALLOC_ACCESSIBLE,
+							NUMA_NO_NODE);
+		if (!m->vmemmap_pte) {
+			nr_free++;
+			list_del(&m->list);
+			memblock_free_early(__pa(m), huge_page_size(h));
+		}
+	}
+
+	return nr_free;
+}
+
+void __init gigantic_vmemmap_pgtable_init(struct huge_bootmem_page *m,
+					  struct page *head)
+{
+	struct hstate *h = m->hstate;
+	unsigned long pte = (unsigned long)m->vmemmap_pte;
+	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
+
+	if (!nr)
+		return;
+
+	/*
+	 * If we had gigantic hugepages allocated at boot time, we need
+	 * to restore the 'stolen' pages to totalram_pages in order to
+	 * fix confusing memory reports from free(1) and other side
+	 * effects, like CommitLimit going negative.
+	 */
+	adjust_managed_page_count(head, nr);
+
+	/*
+	 * Use the huge page lru list to temporarily store the preallocated
+	 * pages. The preallocated pages are used and the list is emptied
+	 * before the huge page is put into use. When the huge page is put
+	 * into use by prep_new_huge_page() the list will be reinitialized.
+	 */
+	INIT_LIST_HEAD(&head->lru);
+
+	while (nr--) {
+		struct page *pte_page = virt_to_page(pte);
+
+		__ClearPageReserved(pte_page);
+		list_add(&pte_page->lru, &head->lru);
+		pte += PAGE_SIZE;
+	}
+}
+
 /*
  * Previously discarded vmemmap pages will be allocated and remapped
  * after this function returns zero.
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 306e15519da1..f6170720f183 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -16,6 +16,9 @@ void free_huge_page_vmemmap(struct hstate *h, struct page *head);
void hugetlb_vmemmap_init(struct hstate *h);
int vmemmap_pgtable_prealloc(struct hstate *h, struct list_head *pgtables);
void vmemmap_pgtable_free(struct list_head *pgtables);
+unsigned long gigantic_vmemmap_pgtable_prealloc(void);
+void gigantic_vmemmap_pgtable_init(struct huge_bootmem_page *m,
+				   struct page *head);

/*
* How many vmemmap pages associated with a HugeTLB page that can be freed
@@ -45,6 +48,16 @@ static inline void vmemmap_pgtable_free(struct list_head *pgtables)
{
}

+static inline unsigned long gigantic_vmemmap_pgtable_prealloc(void)
+{
+	return 0;
+}
+
+static inline void gigantic_vmemmap_pgtable_init(struct huge_bootmem_page *m,
+						 struct page *head)
+{
+}
+
static inline void hugetlb_vmemmap_init(struct hstate *h)
{
}
--
2.11.0

2021-06-10 21:34:48

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 0/5] Split huge PMD mapping of vmemmap pages

On 6/9/21 5:13 AM, Muchun Song wrote:
> In order to reduce the difficulty of code review in series [1], we disabled
> huge PMD mapping of vmemmap pages when that feature was enabled. In this
> series, we no longer disable huge PMD mapping of vmemmap pages; instead,
> we split the huge PMD mapping when needed.

Thank you Muchun!

Adding this functionality should reduce the decisions a sys admin needs
to make WRT vmemmap reduction for hugetlb pages. There should be no
downside to enabling vmemmap reduction as moving from PMD to PTE mapping
happens 'on demand' as hugetlb pages are added to the pool.

I just want to clarify something for myself and possibly other
reviewers. At hugetlb page allocation time, we move to PTE mappings.
When hugetlb pages are freed from the pool we do not attempt to coalesce
and move back to a PMD mapping. Correct? I am not suggesting we do
this and I suspect it is much more complex. Just want to make sure I
understand the functionality of this series.

BTW - Just before you sent this series I had worked up a version of
hugetlb page demote [2] with vmemmap optimizations. That code will need
to be reworked. However, if we never coalesce and move back to PMD
mappings it might make that effort easier.

[2] https://lore.kernel.org/linux-mm/[email protected]/
--
Mike Kravetz

2021-06-10 22:18:28

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 2/5] mm: hugetlb: introduce helpers to preallocate page tables from bootmem allocator

On 6/9/21 5:13 AM, Muchun Song wrote:
> If we want to split the huge PMD mapping of vmemmap pages associated with
> each gigantic page allocated from the bootmem allocator, we should
> preallocate the page tables from the bootmem allocator.

Just curious, why is this necessary and a good idea? Why not wait until
the gigantic pages allocated from bootmem are added to the pool to
allocate any necessary vmemmap pages?

> preallocate the page tables from the bootmem allocator. In this patch, we
> introduce some helpers to preallocate page tables for gigantic pages.
>
> Signed-off-by: Muchun Song <[email protected]>
> ---
> include/linux/hugetlb.h | 3 +++
> mm/hugetlb_vmemmap.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++
> mm/hugetlb_vmemmap.h | 13 ++++++++++
> 3 files changed, 79 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 03ca83db0a3e..c27a299c4211 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -622,6 +622,9 @@ struct hstate {
>  struct huge_bootmem_page {
>  	struct list_head list;
>  	struct hstate *hstate;
> +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
> +	pte_t *vmemmap_pte;
> +#endif
>  };
>
> int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 628e2752714f..6f3a47b4ebd3 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -171,6 +171,7 @@
> #define pr_fmt(fmt) "HugeTLB: " fmt
>
> #include <linux/list.h>
> +#include <linux/memblock.h>
> #include <asm/pgalloc.h>
>
> #include "hugetlb_vmemmap.h"
> @@ -263,6 +264,68 @@ int vmemmap_pgtable_prealloc(struct hstate *h, struct list_head *pgtables)
> return -ENOMEM;
> }
>
> +unsigned long __init gigantic_vmemmap_pgtable_prealloc(void)
> +{
> +	struct huge_bootmem_page *m, *tmp;
> +	unsigned long nr_free = 0;
> +
> +	list_for_each_entry_safe(m, tmp, &huge_boot_pages, list) {
> +		struct hstate *h = m->hstate;
> +		unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> +		unsigned long size;
> +
> +		if (!nr)
> +			continue;
> +
> +		size = nr << PAGE_SHIFT;
> +		m->vmemmap_pte = memblock_alloc_try_nid(size, PAGE_SIZE, 0,
> +							MEMBLOCK_ALLOC_ACCESSIBLE,
> +							NUMA_NO_NODE);
> +		if (!m->vmemmap_pte) {
> +			nr_free++;
> +			list_del(&m->list);
> +			memblock_free_early(__pa(m), huge_page_size(h));

If we cannot allocate the vmemmap pages to split the PMD, then we will
not add the huge page to the pool. Correct?

Perhaps I am thinking about this incorrectly, but this seems wrong. We
already have everything we need to add the page to the pool. vmemmap
reduction is an optimization. So, the allocation failure is associated
with an optimization. In this case, it seems like we should just skip
the optimization (vmemmap reduction) and proceed to add the page to the
pool? It seems we do the same thing in subsequent patches.

Again, I could be thinking about this incorrectly.
--
Mike Kravetz

2021-06-11 03:26:47

by Muchun Song

[permalink] [raw]
Subject: Re: [External] Re: [PATCH 0/5] Split huge PMD mapping of vmemmap pages

On Fri, Jun 11, 2021 at 5:33 AM Mike Kravetz <[email protected]> wrote:
>
> On 6/9/21 5:13 AM, Muchun Song wrote:
> > In order to reduce the difficulty of code review in series [1], we disabled
> > huge PMD mapping of vmemmap pages when that feature was enabled. In this
> > series, we no longer disable huge PMD mapping of vmemmap pages; instead,
> > we split the huge PMD mapping when needed.
>
> Thank you Muchun!
>
> Adding this functionality should reduce the decisions a sys admin needs
> to make WRT vmemmap reduction for hugetlb pages. There should be no
> downside to enabling vmemmap reduction as moving from PMD to PTE mapping
> happens 'on demand' as hugetlb pages are added to the pool.

Agree.

>
> I just want to clarify something for myself and possibly other
> reviewers. At hugetlb page allocation time, we move to PTE mappings.
> When hugetlb pages are freed from the pool we do not attempt to coalesce
> and move back to a PMD mapping. Correct? I am not suggesting we do
> this and I suspect it is much more complex. Just want to make sure I
> understand the functionality of this series.

Totally right. Coalescing is very complex, so I do not do it in this
series.

>
> BTW - Just before you sent this series I had worked up a version of
> hugetlb page demote [2] with vmemmap optimizations. That code will need
> to be reworked. However, if we never coalesce and move back to PMD
> mappings it might make that effort easier.
>
> [2] https://lore.kernel.org/linux-mm/[email protected]/

I've not looked at this deeply. I will go take a look.

Thanks Mike.

> --
> Mike Kravetz