Hi all,
This patch series frees some of the vmemmap pages (struct page structures)
associated with each HugeTLB page when it is preallocated, in order to save
memory.
To make this first version easier to review, from this version on we disable
PMD/huge page mapping of the vmemmap if this feature is enabled. This actually
eliminates a lot of the complex code that manipulates page tables. Once this
patch series is solid, we can add the vmemmap page table manipulation code in
the future.
The struct page structures (page structs) are used to describe a physical
page frame. By default, there is a one-to-one mapping from a page frame to
its corresponding page struct.
HugeTLB pages consist of multiple base page size pages and are supported
by many architectures. See hugetlbpage.rst in the Documentation directory
for more details. On the x86 architecture, HugeTLB pages of size 2MB and 1GB
are currently supported. Since the base page size on x86 is 4KB, a 2MB
HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
262144 base pages. For each base page, there is a corresponding page struct.
Within the HugeTLB subsystem, only the first 4 page structs are used to
contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
provides this upper limit. The only 'useful' information in the remaining
page structs is the compound_head field, and this field is the same for all
tail pages.
By removing redundant page structs for HugeTLB pages, memory can be returned to
the buddy allocator for other uses.
When the system boots up, every 2MB HugeTLB page has 512 struct page structs,
whose total size is 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
   HugeTLB                    struct pages(8 pages)         page frame(8 pages)
+-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
|           |                     |     0     | -------------> |     0     |
|           |                     +-----------+                +-----------+
|           |                     |     1     | -------------> |     1     |
|           |                     +-----------+                +-----------+
|           |                     |     2     | -------------> |     2     |
|           |                     +-----------+                +-----------+
|           |                     |     3     | -------------> |     3     |
|           |                     +-----------+                +-----------+
|           |                     |     4     | -------------> |     4     |
|    2MB    |                     +-----------+                +-----------+
|           |                     |     5     | -------------> |     5     |
|           |                     +-----------+                +-----------+
|           |                     |     6     | -------------> |     6     |
|           |                     +-----------+                +-----------+
|           |                     |     7     | -------------> |     7     |
|           |                     +-----------+                +-----------+
|           |
|           |
|           |
+-----------+
The value of page->compound_head is the same for all tail pages. The first
page of page structs (page 0) associated with the HugeTLB page contains the 4
page structs necessary to describe the HugeTLB. The only use of the remaining
pages of page structs (page 1 to page 7) is to point to page->compound_head.
Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
will be used for each HugeTLB page. This will allow us to free the remaining
6 pages to the buddy allocator.
Here is how things look after remapping.
   HugeTLB                    struct pages(8 pages)         page frame(8 pages)
+-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
|           |                     |     0     | -------------> |     0     |
|           |                     +-----------+                +-----------+
|           |                     |     1     | -------------> |     1     |
|           |                     +-----------+                +-----------+
|           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
|           |                     +-----------+                   | | | | |
|           |                     |     3     | ------------------+ | | | |
|           |                     +-----------+                     | | | |
|           |                     |     4     | --------------------+ | | |
|    2MB    |                     +-----------+                       | | |
|           |                     |     5     | ----------------------+ | |
|           |                     +-----------+                         | |
|           |                     |     6     | ------------------------+ |
|           |                     +-----------+                           |
|           |                     |     7     | --------------------------+
|           |                     +-----------+
|           |
|           |
|           |
+-----------+
When a HugeTLB page is freed to the buddy system, we should allocate 6 pages for
vmemmap pages and restore the previous mapping relationship.
Apart from 2MB HugeTLB pages, we also have 1GB HugeTLB pages. They are similar
to the 2MB HugeTLB pages, so we can use the same approach to free their vmemmap
pages.
In this case, for each 1GB HugeTLB page, we can save 4094 pages. This is a
very substantial gain. On our servers, we run some SPDK/QEMU applications that
use 1024GB of HugeTLB pages. With this feature enabled, we can save ~16GB
(1GB hugepages) or ~12GB (2MB hugepages) of memory.
Because the vmemmap page tables are reconstructed on the freeing/allocating
path, this adds some overhead. Here is some overhead analysis.
1) Allocating 10240 2MB hugetlb pages.
a) With this patch series applied:
# time echo 10240 > /proc/sys/vm/nr_hugepages
real 0m0.166s
user 0m0.000s
sys 0m0.166s
# bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
@latency:
[8K, 16K) 8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K) 1868 |@@@@@@@@@@@ |
[32K, 64K) 10 | |
[64K, 128K) 2 | |
b) Without this patch series:
# time echo 10240 > /proc/sys/vm/nr_hugepages
real 0m0.066s
user 0m0.000s
sys 0m0.066s
# bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
@latency:
[4K, 8K) 10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K) 62 | |
[16K, 32K) 2 | |
Summary: with this feature, allocation is about 2x slower than before.
2) Freeing 10240 2MB hugetlb pages.
a) With this patch series applied:
# time echo 0 > /proc/sys/vm/nr_hugepages
real 0m0.004s
user 0m0.000s
sys 0m0.002s
# bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
@latency:
[16K, 32K) 10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
b) Without this patch series:
# time echo 0 > /proc/sys/vm/nr_hugepages
real 0m0.077s
user 0m0.001s
sys 0m0.075s
# bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
@latency:
[4K, 8K) 9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K) 287 |@ |
[16K, 32K) 3 | |
Summary: __free_hugepage is about 2-4x slower than before.
But according to the allocation test above, I think that it is
also ~2x slower than before here.
But why is the 'real' time with the patch smaller than before? Because
in this patch series, freeing a HugeTLB page is asynchronous (through a
kworker).
Although the overhead has increased, it is not significant. As Mike
said, "However, remember that the majority of use cases create hugetlb pages at
or shortly after boot time and add them to the pool. So, additional overhead is
at pool creation time. There is no change to 'normal run time' operations of
getting a page from or returning a page to the pool (think page fault/unmap)".
Todo:
- Free all of the tail vmemmap pages
Now for the 2MB HugeTLB page, we only free 6 vmemmap pages. We really can
free 7 vmemmap pages. In this case, we can see that 8 of the 512 struct page
structures have the PG_head flag set. To support this, we can adjust
compound_head() slightly and make it return the real head struct page when
the parameter is a tail struct page that has the PG_head flag set.
To make the code evolution route clearer, this feature can be
a separate patch after this patchset is solid.
- Support for other architectures (e.g. aarch64).
- Enable PMD/huge page mapping of vmemmap even if this feature was enabled.
Changelog in v12 -> v13:
- Remove VM_WARN_ON_PAGE macro.
- Add more comments in vmemmap_pte_range() and vmemmap_remap_free().
Thanks to Oscar and Mike's suggestions and review.
Changelog in v11 -> v12:
- Move VM_WARN_ON_PAGE to a separate patch.
- Call __free_hugepage() with hugetlb_lock (See patch #5.) to serialize
with dissolve_free_huge_page(). It is to prepare for patch #9.
- Introduce PageHugeInflight. See patch #9.
Changelog in v10 -> v11:
- Fix compiler error when !CONFIG_HUGETLB_PAGE_FREE_VMEMMAP.
- Rework some comments and commit changes.
- Rework vmemmap_remap_free() to 3 parameters.
Thanks to Oscar and Mike's suggestions and review.
Changelog in v9 -> v10:
- Fix a bug in patch #11. Thanks to Oscar for pointing that out.
- Rework some commit log or comments. Thanks Mike and Oscar for the suggestions.
- Drop VMEMMAP_TAIL_PAGE_REUSE in the patch #3.
Thank you very much Mike and Oscar for reviewing the code.
Changelog in v8 -> v9:
- Rework some code. Many thanks to Oscar.
- Put all the non-hugetlb vmemmap functions under sparsemem-vmemmap.c.
Changelog in v7 -> v8:
- Adjust the order of patches.
Many thanks to David and Oscar. Your suggestions are very valuable.
Changelog in v6 -> v7:
- Rebase to linux-next 20201130
- Do not use basepage mapping for vmemmap when this feature is disabled.
- Rework some patches.
[PATCH v6 08/16] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
[PATCH v6 10/16] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page
Thanks to Oscar and Barry.
Changelog in v5 -> v6:
- Disable PMD/huge page mapping of vmemmap if this feature was enabled.
- Simplify the first version code.
Changelog in v4 -> v5:
- Rework some comments and code in the [PATCH v4 04/21] and [PATCH v4 05/21].
Thanks to Mike and Oscar's suggestions.
Changelog in v3 -> v4:
- Move all the vmemmap functions to hugetlb_vmemmap.c.
- Make CONFIG_HUGETLB_PAGE_FREE_VMEMMAP default to y. If we want to
disable this feature, we should do so with a boot/kernel command line parameter.
- Remove vmemmap_pgtable_{init, deposit, withdraw}() helper functions.
- Initialize page table lock for vmemmap through core_initcall mechanism.
Thanks to Mike and Oscar for their suggestions.
Changelog in v2 -> v3:
- Rename some helper functions. Thanks Mike.
- Rework some code. Thanks Mike and Oscar.
- Remap the tail vmemmap page with PAGE_KERNEL_RO instead of PAGE_KERNEL.
Thanks Matthew.
- Add some overhead analysis in the cover letter.
- Use vmemmap pmd table lock instead of a hugetlb specific global lock.
Changelog in v1 -> v2:
- Fix: do not call dissolve_compound_page() in alloc_huge_page_vmemmap().
- Fix some typos and code style problems.
- Remove unused handle_vmemmap_fault().
- Merge some commits to one commit suggested by Mike.
Muchun Song (12):
mm: memory_hotplug: factor out bootmem core functions to
bootmem_info.c
mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
mm: hugetlb: defer freeing of HugeTLB pages
mm: hugetlb: allocate the vmemmap pages associated with each HugeTLB
page
mm: hugetlb: set the PageHWPoison to the raw error page
mm: hugetlb: flush work when dissolving a HugeTLB page
mm: hugetlb: introduce PageHugeInflight
mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap
mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate
mm: hugetlb: gather discrete indexes of tail page
mm: hugetlb: optimize the code with the help of the compiler
Documentation/admin-guide/kernel-parameters.txt | 14 ++
Documentation/admin-guide/mm/hugetlbpage.rst | 3 +
arch/x86/mm/init_64.c | 13 +-
fs/Kconfig | 18 ++
include/linux/bootmem_info.h | 65 ++++++
include/linux/hugetlb.h | 37 ++++
include/linux/hugetlb_cgroup.h | 15 +-
include/linux/memory_hotplug.h | 27 ---
include/linux/mm.h | 5 +
mm/Makefile | 2 +
mm/bootmem_info.c | 124 +++++++++++
mm/hugetlb.c | 218 +++++++++++++++++--
mm/hugetlb_vmemmap.c | 278 ++++++++++++++++++++++++
mm/hugetlb_vmemmap.h | 45 ++++
mm/memory_hotplug.c | 116 ----------
mm/sparse-vmemmap.c | 273 +++++++++++++++++++++++
mm/sparse.c | 1 +
17 files changed, 1082 insertions(+), 172 deletions(-)
create mode 100644 include/linux/bootmem_info.h
create mode 100644 mm/bootmem_info.c
create mode 100644 mm/hugetlb_vmemmap.c
create mode 100644 mm/hugetlb_vmemmap.h
--
2.11.0
When we free a HugeTLB page to the buddy allocator, we should allocate the
vmemmap pages associated with it. We can do that in the __free_hugepage()
before freeing it to buddy.
Signed-off-by: Muchun Song <[email protected]>
---
include/linux/mm.h | 2 ++
mm/hugetlb.c | 2 ++
mm/hugetlb_vmemmap.c | 15 ++++++++++
mm/hugetlb_vmemmap.h | 5 ++++
mm/sparse-vmemmap.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 100 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f928994ed273..16b55d13b0ab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3007,6 +3007,8 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
void vmemmap_remap_free(unsigned long start, unsigned long end,
unsigned long reuse);
+void vmemmap_remap_alloc(unsigned long start, unsigned long end,
+ unsigned long reuse);
void *sparse_buffer_alloc(unsigned long size);
struct page * __populate_section_memmap(unsigned long pfn,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c165186ec2cf..d11c32fcdb38 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1326,6 +1326,8 @@ static void update_hpage_vmemmap_workfn(struct work_struct *work)
page->mapping = NULL;
h = page_hstate(page);
+ alloc_huge_page_vmemmap(h, page);
+
spin_lock(&hugetlb_lock);
__free_hugepage(h, page);
spin_unlock(&hugetlb_lock);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 19f1898aaede..6108ae80314f 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -183,6 +183,21 @@ static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
}
+void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+ unsigned long vmemmap_addr = (unsigned long)head;
+ unsigned long vmemmap_end, vmemmap_reuse;
+
+ if (!free_vmemmap_pages_per_hpage(h))
+ return;
+
+ vmemmap_addr += RESERVE_VMEMMAP_SIZE;
+ vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
+ vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
+
+ vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse);
+}
+
void free_huge_page_vmemmap(struct hstate *h, struct page *head)
{
unsigned long vmemmap_addr = (unsigned long)head;
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 01f8637adbe0..b2c8d2f11d48 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -11,6 +11,7 @@
#include <linux/hugetlb.h>
#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+void alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
void free_huge_page_vmemmap(struct hstate *h, struct page *head);
/*
@@ -25,6 +26,10 @@ static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
return 0;
}
#else
+static inline void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+}
+
static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
{
}
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index ce4be1fa93c2..3b146d5949f3 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -29,6 +29,7 @@
#include <linux/sched.h>
#include <linux/pgtable.h>
#include <linux/bootmem_info.h>
+#include <linux/delay.h>
#include <asm/dma.h>
#include <asm/pgalloc.h>
@@ -40,7 +41,8 @@
* @remap_pte: called for each non-empty PTE (lowest-level) entry.
* @reuse_page: the page which is reused for the tail vmemmap pages.
* @reuse_addr: the virtual address of the @reuse_page page.
- * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
+ * @vmemmap_pages: the list head of the vmemmap pages that can be freed
+ * or is mapped from.
*/
struct vmemmap_remap_walk {
void (*remap_pte)(pte_t *pte, unsigned long addr,
@@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
struct list_head *vmemmap_pages;
};
+/* The gfp mask of allocating vmemmap page */
+#define GFP_VMEMMAP_PAGE \
+ (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
+
static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end,
struct vmemmap_remap_walk *walk)
@@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned long end,
free_vmemmap_page_list(&vmemmap_pages);
}
+static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
+ struct vmemmap_remap_walk *walk)
+{
+ pgprot_t pgprot = PAGE_KERNEL;
+ struct page *page;
+ void *to;
+
+ BUG_ON(pte_page(*pte) != walk->reuse_page);
+
+ page = list_first_entry(walk->vmemmap_pages, struct page, lru);
+ list_del(&page->lru);
+ to = page_to_virt(page);
+ copy_page(to, (void *)walk->reuse_addr);
+
+ set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+}
+
+static void alloc_vmemmap_page_list(struct list_head *list,
+ unsigned long start, unsigned long end)
+{
+ unsigned long addr;
+
+ for (addr = start; addr < end; addr += PAGE_SIZE) {
+ struct page *page;
+ int nid = page_to_nid((const void *)addr);
+
+retry:
+ page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
+ if (unlikely(!page)) {
+ msleep(100);
+ /*
+ * We should retry infinitely, because we cannot
+ * handle allocation failures. Once we allocate
+ * vmemmap pages successfully, then we can free
+ * a HugeTLB page.
+ */
+ goto retry;
+ }
+ list_add_tail(&page->lru, list);
+ }
+}
+
+/**
+ * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, @end)
+ * to the page which is from the @vmemmap_pages
+ * respectively.
+ * @start: start address of the vmemmap virtual address range.
+ * @end: end address of the vmemmap virtual address range.
+ * @reuse: reuse address.
+ */
+void vmemmap_remap_alloc(unsigned long start, unsigned long end,
+ unsigned long reuse)
+{
+ LIST_HEAD(vmemmap_pages);
+ struct vmemmap_remap_walk walk = {
+ .remap_pte = vmemmap_restore_pte,
+ .reuse_addr = reuse,
+ .vmemmap_pages = &vmemmap_pages,
+ };
+
+ might_sleep();
+
+ /* See the comment in the vmemmap_remap_free(). */
+ BUG_ON(start - reuse != PAGE_SIZE);
+
+ alloc_vmemmap_page_list(&vmemmap_pages, start, end);
+ vmemmap_remap_range(reuse, end, &walk);
+}
+
/*
* Allocate a block of memory to be used to back the virtual memory map
* or to back the page tables that are used to create the mapping.
--
2.11.0
Every HugeTLB has more than one struct page structure. We __know__ that
we only use the first 4(HUGETLB_CGROUP_MIN_ORDER) struct page structures
to store metadata associated with each HugeTLB.
There are a lot of struct page structures associated with each HugeTLB
page. For tail pages, the value of compound_head is the same. So we can
reuse the first page of tail page structures. We map the virtual addresses
of the remaining pages of tail page structures to the first tail page
struct, and then free these page frames. Therefore, we need to reserve
two pages as vmemmap areas.
When we allocate a HugeTLB page from the buddy, we can free some vmemmap
pages associated with each HugeTLB page. It is more appropriate to do it
in the prep_new_huge_page().
The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
pages associated with a HugeTLB page can be freed, returns zero for
now, which means the feature is disabled. We will enable it once all
the infrastructure is there.
Signed-off-by: Muchun Song <[email protected]>
---
include/linux/bootmem_info.h | 27 +++++-
include/linux/mm.h | 3 +
mm/Makefile | 1 +
mm/hugetlb.c | 3 +
mm/hugetlb_vmemmap.c | 211 +++++++++++++++++++++++++++++++++++++++++++
mm/hugetlb_vmemmap.h | 20 ++++
mm/sparse-vmemmap.c | 198 ++++++++++++++++++++++++++++++++++++++++
7 files changed, 462 insertions(+), 1 deletion(-)
create mode 100644 mm/hugetlb_vmemmap.c
create mode 100644 mm/hugetlb_vmemmap.h
diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
index 4ed6dee1adc9..ec03a624dfa2 100644
--- a/include/linux/bootmem_info.h
+++ b/include/linux/bootmem_info.h
@@ -2,7 +2,7 @@
#ifndef __LINUX_BOOTMEM_INFO_H
#define __LINUX_BOOTMEM_INFO_H
-#include <linux/mmzone.h>
+#include <linux/mm.h>
/*
* Types for free bootmem stored in page->lru.next. These have to be in
@@ -22,6 +22,27 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
void get_page_bootmem(unsigned long info, struct page *page,
unsigned long type);
void put_page_bootmem(struct page *page);
+
+/*
+ * Any memory allocated via the memblock allocator and not via the
+ * buddy will be marked reserved already in the memmap. For those
+ * pages, we can call this function to free it to buddy allocator.
+ */
+static inline void free_bootmem_page(struct page *page)
+{
+ unsigned long magic = (unsigned long)page->freelist;
+
+ /*
+ * The reserve_bootmem_region sets the reserved flag on bootmem
+ * pages.
+ */
+ VM_BUG_ON_PAGE(page_ref_count(page) != 2, page);
+
+ if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
+ put_page_bootmem(page);
+ else
+ VM_BUG_ON_PAGE(1, page);
+}
#else
static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
{
@@ -35,6 +56,10 @@ static inline void get_page_bootmem(unsigned long info, struct page *page,
unsigned long type)
{
}
+
+static inline void free_bootmem_page(struct page *page)
+{
+}
#endif
#endif /* __LINUX_BOOTMEM_INFO_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index eabe7d9f80d8..f928994ed273 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3005,6 +3005,9 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
}
#endif
+void vmemmap_remap_free(unsigned long start, unsigned long end,
+ unsigned long reuse);
+
void *sparse_buffer_alloc(unsigned long size);
struct page * __populate_section_memmap(unsigned long pfn,
unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
diff --git a/mm/Makefile b/mm/Makefile
index ed4b88fa0f5e..056801d8daae 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
obj-$(CONFIG_ZSWAP) += zswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
+obj-$(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP) += hugetlb_vmemmap.o
obj-$(CONFIG_NUMA) += mempolicy.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1f3bf1710b66..140135fc8113 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -42,6 +42,7 @@
#include <linux/userfaultfd_k.h>
#include <linux/page_owner.h>
#include "internal.h"
+#include "hugetlb_vmemmap.h"
int hugetlb_max_hstate __read_mostly;
unsigned int default_hstate_idx;
@@ -1497,6 +1498,8 @@ void free_huge_page(struct page *page)
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
{
+ free_huge_page_vmemmap(h, page);
+
INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
set_hugetlb_cgroup(page, NULL);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
new file mode 100644
index 000000000000..4ffa2a4ae2a8
--- /dev/null
+++ b/mm/hugetlb_vmemmap.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Free some vmemmap pages of HugeTLB
+ *
+ * Copyright (c) 2020, Bytedance. All rights reserved.
+ *
+ * Author: Muchun Song <[email protected]>
+ *
+ * The struct page structures (page structs) are used to describe a physical
+ * page frame. By default, there is a one-to-one mapping from a page frame to
+ * its corresponding page struct.
+ *
+ * The HugeTLB pages consist of multiple base page size pages and are supported
+ * by many architectures. See hugetlbpage.rst in the Documentation directory
+ * for more details. On the x86-64 architecture, HugeTLB pages of size 2MB and
+ * 1GB are currently supported. Since the base page size on x86 is 4KB, a 2MB
+ * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
+ * 262144 base pages. For each base page, there is a corresponding page struct.
+ *
+ * Within the HugeTLB subsystem, only the first 4 page structs are used to
+ * contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
+ * provides this upper limit. The only 'useful' information in the remaining
+ * page structs is the compound_head field, and this field is the same for all
+ * tail pages.
+ *
+ * By removing redundant page structs for HugeTLB pages, memory can be returned
+ * to the buddy allocator for other uses.
+ *
+ * Different architectures support different HugeTLB pages. For example, the
+ * following table is the HugeTLB page size supported by x86 and arm64
+ * architectures. Because arm64 supports 4k, 16k, and 64k base pages and
+ * supports contiguous entries, it supports many kinds of sizes of HugeTLB
+ * page.
+ *
+ * +--------------+-----------+-----------------------------------------------+
+ * | Architecture | Page Size | HugeTLB Page Size |
+ * +--------------+-----------+-----------+-----------+-----------+-----------+
+ * | x86-64 | 4KB | 2MB | 1GB | | |
+ * +--------------+-----------+-----------+-----------+-----------+-----------+
+ * | | 4KB | 64KB | 2MB | 32MB | 1GB |
+ * | +-----------+-----------+-----------+-----------+-----------+
+ * | arm64 | 16KB | 2MB | 32MB | 1GB | |
+ * | +-----------+-----------+-----------+-----------+-----------+
+ * | | 64KB | 2MB | 512MB | 16GB | |
+ * +--------------+-----------+-----------+-----------+-----------+-----------+
+ *
+ * When the system boots up, every HugeTLB page has more than one struct page
+ * struct, whose total size is (unit: pages):
+ *
+ * struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+ *
+ * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
+ * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
+ * relationship.
+ *
+ * HugeTLB_Size = n * PAGE_SIZE
+ *
+ * Then,
+ *
+ * struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+ * = n * sizeof(struct page) / PAGE_SIZE
+ *
+ * We can use huge mapping at the pud/pmd level for the HugeTLB page.
+ *
+ * For the HugeTLB page of the pmd level mapping, then
+ *
+ * struct_size = n * sizeof(struct page) / PAGE_SIZE
+ * = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
+ * = sizeof(struct page) / sizeof(pte_t)
+ * = 64 / 8
+ * = 8 (pages)
+ *
+ * Where n is how many pte entries one page can contain. So the value of
+ * n is (PAGE_SIZE / sizeof(pte_t)).
+ *
+ * This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
+ * is 8. And this optimization is also applicable only when the size of struct page
+ * is a power of two. In most cases, the size of struct page is 64 (e.g. x86-64
+ * and arm64). So if we use pmd level mapping for a HugeTLB page, the size of
+ * struct page structs of it is 8 pages whose size depends on the size of the
+ * base page.
+ *
+ * For the HugeTLB page of the pud level mapping, then
+ *
+ * struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
+ * = PAGE_SIZE / 8 * 8 (pages)
+ * = PAGE_SIZE (pages)
+ *
+ * Where the struct_size(pmd) is the size of the struct page structs of a
+ * HugeTLB page of the pmd level mapping.
+ *
+ * Next, we take the pmd level mapping of the HugeTLB page as an example to
+ * show the internal implementation of this optimization. There are 8 pages
+ * struct page structs associated with a HugeTLB page which is pmd mapped.
+ *
+ * Here is how things look before optimization.
+ *
+ * HugeTLB struct pages(8 pages) page frame(8 pages)
+ * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
+ * | | | 0 | -------------> | 0 |
+ * | | +-----------+ +-----------+
+ * | | | 1 | -------------> | 1 |
+ * | | +-----------+ +-----------+
+ * | | | 2 | -------------> | 2 |
+ * | | +-----------+ +-----------+
+ * | | | 3 | -------------> | 3 |
+ * | | +-----------+ +-----------+
+ * | | | 4 | -------------> | 4 |
+ * | PMD | +-----------+ +-----------+
+ * | level | | 5 | -------------> | 5 |
+ * | mapping | +-----------+ +-----------+
+ * | | | 6 | -------------> | 6 |
+ * | | +-----------+ +-----------+
+ * | | | 7 | -------------> | 7 |
+ * | | +-----------+ +-----------+
+ * | |
+ * | |
+ * | |
+ * +-----------+
+ *
+ * The value of page->compound_head is the same for all tail pages. The first
+ * page of page structs (page 0) associated with the HugeTLB page contains the 4
+ * page structs necessary to describe the HugeTLB. The only use of the remaining
+ * pages of page structs (page 1 to page 7) is to point to page->compound_head.
+ * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
+ * will be used for each HugeTLB page. This will allow us to free the remaining
+ * 6 pages to the buddy allocator.
+ *
+ * Here is how things look after remapping.
+ *
+ * HugeTLB struct pages(8 pages) page frame(8 pages)
+ * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
+ * | | | 0 | -------------> | 0 |
+ * | | +-----------+ +-----------+
+ * | | | 1 | -------------> | 1 |
+ * | | +-----------+ +-----------+
+ * | | | 2 | ----------------^ ^ ^ ^ ^ ^
+ * | | +-----------+ | | | | |
+ * | | | 3 | ------------------+ | | | |
+ * | | +-----------+ | | | |
+ * | | | 4 | --------------------+ | | |
+ * | PMD | +-----------+ | | |
+ * | level | | 5 | ----------------------+ | |
+ * | mapping | +-----------+ | |
+ * | | | 6 | ------------------------+ |
+ * | | +-----------+ |
+ * | | | 7 | --------------------------+
+ * | | +-----------+
+ * | |
+ * | |
+ * | |
+ * +-----------+
+ *
+ * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
+ * vmemmap pages and restore the previous mapping relationship.
+ *
+ * For the HugeTLB page of the pud level mapping, it is similar to the former.
+ * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages.
+ *
+ * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
+ * (e.g. aarch64) provide a contiguous bit in the translation table entries
+ * that hints to the MMU to indicate that it is one of a contiguous set of
+ * entries that can be cached in a single TLB entry.
+ *
+ * The contiguous bit is used to increase the mapping size at the pmd and pte
+ * (last) level. So this type of HugeTLB page can be optimized only when its
+ * size of the struct page structs is greater than 2 pages.
+ */
+#include "hugetlb_vmemmap.h"
+
+/*
+ * There are a lot of struct page structures associated with each HugeTLB page.
+ * For tail pages, the value of compound_head is the same. So we can reuse first
+ * page of tail page structures. We map the virtual addresses of the remaining
+ * pages of tail page structures to the first tail page struct, and then free
+ * these page frames. Therefore, we need to reserve two pages as vmemmap areas.
+ */
+#define RESERVE_VMEMMAP_NR 2U
+#define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
+
+/*
+ * How many vmemmap pages associated with a HugeTLB page that can be freed
+ * to the buddy allocator.
+ *
+ * Todo: Returns zero for now, which means the feature is disabled. We will
+ * enable it once all the infrastructure is there.
+ */
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+ return 0;
+}
+
+static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
+{
+ return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
+}
+
+void free_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+ unsigned long vmemmap_addr = (unsigned long)head;
+ unsigned long vmemmap_end, vmemmap_reuse;
+
+ if (!free_vmemmap_pages_per_hpage(h))
+ return;
+
+ vmemmap_addr += RESERVE_VMEMMAP_SIZE;
+ vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
+ vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
+
+ vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
+}
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
new file mode 100644
index 000000000000..6923f03534d5
--- /dev/null
+++ b/mm/hugetlb_vmemmap.h
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Free some vmemmap pages of HugeTLB
+ *
+ * Copyright (c) 2020, Bytedance. All rights reserved.
+ *
+ * Author: Muchun Song <[email protected]>
+ */
+#ifndef _LINUX_HUGETLB_VMEMMAP_H
+#define _LINUX_HUGETLB_VMEMMAP_H
+#include <linux/hugetlb.h>
+
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+void free_huge_page_vmemmap(struct hstate *h, struct page *head);
+#else
+static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+}
+#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
+#endif /* _LINUX_HUGETLB_VMEMMAP_H */
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 16183d85a7d5..ce4be1fa93c2 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -27,8 +27,206 @@
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
#include <linux/sched.h>
+#include <linux/pgtable.h>
+#include <linux/bootmem_info.h>
+
#include <asm/dma.h>
#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+
+/**
+ * vmemmap_remap_walk - walk vmemmap page table
+ *
+ * @remap_pte: called for each non-empty PTE (lowest-level) entry.
+ * @reuse_page: the page which is reused for the tail vmemmap pages.
+ * @reuse_addr: the virtual address of the @reuse_page page.
+ * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
+ */
+struct vmemmap_remap_walk {
+ void (*remap_pte)(pte_t *pte, unsigned long addr,
+ struct vmemmap_remap_walk *walk);
+ struct page *reuse_page;
+ unsigned long reuse_addr;
+ struct list_head *vmemmap_pages;
+};
+
+static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end,
+ struct vmemmap_remap_walk *walk)
+{
+ pte_t *pte;
+
+ pte = pte_offset_kernel(pmd, addr);
+
+ /*
+ * The reuse_page is found 'first' in table walk before we start
+ * remapping (which is calling @walk->remap_pte).
+ */
+ if (walk->reuse_addr == addr) {
+ BUG_ON(pte_none(*pte));
+
+ walk->reuse_page = pte_page(*pte++);
+ /*
+ * Because the reuse address is part of the range that we are
+ * walking, skip the reuse address range.
+ */
+ addr += PAGE_SIZE;
+ }
+
+ for (; addr != end; addr += PAGE_SIZE, pte++) {
+ BUG_ON(pte_none(*pte));
+
+ walk->remap_pte(pte, addr, walk);
+ }
+}
+
+static void vmemmap_pmd_range(pud_t *pud, unsigned long addr,
+ unsigned long end,
+ struct vmemmap_remap_walk *walk)
+{
+ pmd_t *pmd;
+ unsigned long next;
+
+ pmd = pmd_offset(pud, addr);
+ do {
+ BUG_ON(pmd_none(*pmd));
+
+ next = pmd_addr_end(addr, end);
+ vmemmap_pte_range(pmd, addr, next, walk);
+ } while (pmd++, addr = next, addr != end);
+}
+
+static void vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
+ unsigned long end,
+ struct vmemmap_remap_walk *walk)
+{
+ pud_t *pud;
+ unsigned long next;
+
+ pud = pud_offset(p4d, addr);
+ do {
+ BUG_ON(pud_none(*pud));
+
+ next = pud_addr_end(addr, end);
+ vmemmap_pmd_range(pud, addr, next, walk);
+ } while (pud++, addr = next, addr != end);
+}
+
+static void vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
+ unsigned long end,
+ struct vmemmap_remap_walk *walk)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ p4d = p4d_offset(pgd, addr);
+ do {
+ BUG_ON(p4d_none(*p4d));
+
+ next = p4d_addr_end(addr, end);
+ vmemmap_pud_range(p4d, addr, next, walk);
+ } while (p4d++, addr = next, addr != end);
+}
+
+static void vmemmap_remap_range(unsigned long start, unsigned long end,
+ struct vmemmap_remap_walk *walk)
+{
+ unsigned long addr = start;
+ unsigned long next;
+ pgd_t *pgd;
+
+ VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
+ VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
+
+ pgd = pgd_offset_k(addr);
+ do {
+ BUG_ON(pgd_none(*pgd));
+
+ next = pgd_addr_end(addr, end);
+ vmemmap_p4d_range(pgd, addr, next, walk);
+ } while (pgd++, addr = next, addr != end);
+
+ /*
+ * We only change the mapping of the vmemmap virtual address range
+ * [@start + PAGE_SIZE, @end); the page at @start is the reuse page
+ * and stays mapped as is, so only that range needs a TLB flush.
+ */
+ flush_tlb_kernel_range(start + PAGE_SIZE, end);
+}
+
+/*
+ * Free a vmemmap page. A vmemmap page can be allocated from the memblock
+ * allocator or buddy allocator. If the PG_reserved flag is set, the page
+ * was allocated from the memblock allocator, so free it via
+ * free_bootmem_page(). Otherwise, use __free_page().
+ */
+static inline void free_vmemmap_page(struct page *page)
+{
+ if (PageReserved(page))
+ free_bootmem_page(page);
+ else
+ __free_page(page);
+}
+
+/* Free a list of the vmemmap pages */
+static void free_vmemmap_page_list(struct list_head *list)
+{
+ struct page *page, *next;
+
+ list_for_each_entry_safe(page, next, list, lru) {
+ list_del(&page->lru);
+ free_vmemmap_page(page);
+ }
+}
+
+static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
+ struct vmemmap_remap_walk *walk)
+{
+ /*
+ * Remap the tail pages as read-only to catch illegal write operation
+ * to the tail pages.
+ */
+ pgprot_t pgprot = PAGE_KERNEL_RO;
+ pte_t entry = mk_pte(walk->reuse_page, pgprot);
+ struct page *page = pte_page(*pte);
+
+ list_add(&page->lru, walk->vmemmap_pages);
+ set_pte_at(&init_mm, addr, pte, entry);
+}
+
+/**
+ * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
+ * to the page which @reuse is mapped, then free vmemmap
+ * pages.
+ * @start: start address of the vmemmap virtual address range.
+ * @end: end address of the vmemmap virtual address range.
+ * @reuse: reuse address.
+ */
+void vmemmap_remap_free(unsigned long start, unsigned long end,
+ unsigned long reuse)
+{
+ LIST_HEAD(vmemmap_pages);
+ struct vmemmap_remap_walk walk = {
+ .remap_pte = vmemmap_remap_pte,
+ .reuse_addr = reuse,
+ .vmemmap_pages = &vmemmap_pages,
+ };
+
+ /*
+ * In order to make remapping routine most efficient for the huge pages,
+ * the routine of vmemmap page table walking has the following rules
+ * (see more details from the vmemmap_pte_range()):
+ *
+ * - The @reuse address is part of the range that we are walking.
+ * - The @reuse address is the first in the complete range.
+ *
+ * So we need to make sure that @start and @reuse meet the above rules.
+ */
+ BUG_ON(start - reuse != PAGE_SIZE);
+
+ vmemmap_remap_range(reuse, end, &walk);
+ free_vmemmap_page_list(&vmemmap_pages);
+}
/*
* Allocate a block of memory to be used to back the virtual memory map
--
2.11.0
Add a kernel parameter hugetlb_free_vmemmap to enable the feature of
freeing unused vmemmap pages associated with each hugetlb page on boot.
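The accepted values can be illustrated with a small user-space sketch. It copies only the string handling of the early_param handler added in this patch; the function name parse_hugetlb_free_vmemmap() and the local flag are hypothetical stand-ins, and the sizeof(struct page) check is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for the kernel flag of the same name. */
static bool hugetlb_free_vmemmap_enabled;

/*
 * Same parsing rules as the handler in this patch: "on" sets the flag,
 * "off" leaves it untouched, and anything else (including a missing
 * value) is rejected with -EINVAL.
 */
static int parse_hugetlb_free_vmemmap(const char *buf)
{
	if (!buf)
		return -EINVAL;

	if (!strcmp(buf, "on"))
		hugetlb_free_vmemmap_enabled = true;
	else if (strcmp(buf, "off"))
		return -EINVAL;

	return 0;
}
```

Note that "off" does not clear the flag; it only avoids the error path, so the default of false is what makes "off" the disabled state.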
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: Barry Song <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 14 ++++++++++++++
Documentation/admin-guide/mm/hugetlbpage.rst | 3 +++
arch/x86/mm/init_64.c | 8 ++++++--
include/linux/hugetlb.h | 19 +++++++++++++++++++
mm/hugetlb_vmemmap.c | 24 ++++++++++++++++++++++++
5 files changed, 66 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3ae25630a223..44dde9be7e00 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1551,6 +1551,20 @@
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
+ hugetlb_free_vmemmap=
+ [KNL] When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set,
+ this controls freeing unused vmemmap pages associated
+ with each HugeTLB page. When this option is enabled,
+ we disable PMD/huge page mapping of vmemmap pages, which
+ increases page table usage. So if a user/sysadmin only
+ uses a small number of HugeTLB pages (as a percentage
+ of system memory), they could end up using more memory
+ with hugetlb_free_vmemmap on as opposed to off.
+ Format: { on | off (default) }
+
+ on: enable the feature
+ off: disable the feature
+
hung_task_panic=
[KNL] Should the hung task detector generate panics.
Format: 0 | 1
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index f7b1c7462991..3a23c2377acc 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -145,6 +145,9 @@ default_hugepagesz
will all result in 256 2M huge pages being allocated. Valid default
huge page size is architecture dependent.
+hugetlb_free_vmemmap
+ When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this enables freeing
+ unused vmemmap pages associated with each HugeTLB page.
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0435bee2e172..1bce5f20e6ca 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -34,6 +34,7 @@
#include <linux/gfp.h>
#include <linux/kcore.h>
#include <linux/bootmem_info.h>
+#include <linux/hugetlb.h>
#include <asm/processor.h>
#include <asm/bios_ebda.h>
@@ -1557,7 +1558,8 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
{
int err;
- if (end - start < PAGES_PER_SECTION * sizeof(struct page))
+ if (is_hugetlb_free_vmemmap_enabled() ||
+ end - start < PAGES_PER_SECTION * sizeof(struct page))
err = vmemmap_populate_basepages(start, end, node, NULL);
else if (boot_cpu_has(X86_FEATURE_PSE))
err = vmemmap_populate_hugepages(start, end, node, altmap);
@@ -1585,6 +1587,8 @@ void register_page_bootmem_memmap(unsigned long section_nr,
pmd_t *pmd;
unsigned int nr_pmd_pages;
struct page *page;
+ bool base_mapping = !boot_cpu_has(X86_FEATURE_PSE) ||
+ is_hugetlb_free_vmemmap_enabled();
for (; addr < end; addr = next) {
pte_t *pte = NULL;
@@ -1610,7 +1614,7 @@ void register_page_bootmem_memmap(unsigned long section_nr,
}
get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
- if (!boot_cpu_has(X86_FEATURE_PSE)) {
+ if (base_mapping) {
next = (addr + PAGE_SIZE) & PAGE_MASK;
pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd))
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..7f47f0eeca3b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -770,6 +770,20 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
}
#endif
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+extern bool hugetlb_free_vmemmap_enabled;
+
+static inline bool is_hugetlb_free_vmemmap_enabled(void)
+{
+ return hugetlb_free_vmemmap_enabled;
+}
+#else
+static inline bool is_hugetlb_free_vmemmap_enabled(void)
+{
+ return false;
+}
+#endif
+
#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};
@@ -923,6 +937,11 @@ static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr
pte_t *ptep, pte_t pte, unsigned long sz)
{
}
+
+static inline bool is_hugetlb_free_vmemmap_enabled(void)
+{
+ return false;
+}
#endif /* CONFIG_HUGETLB_PAGE */
static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 6108ae80314f..8206978d1679 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -166,6 +166,8 @@
* (last) level. So this type of HugeTLB page can be optimized only when its
* size of the struct page structs is greater than 2 pages.
*/
+#define pr_fmt(fmt) "HugeTLB: " fmt
+
#include "hugetlb_vmemmap.h"
/*
@@ -178,6 +180,28 @@
#define RESERVE_VMEMMAP_NR 2U
#define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
+bool hugetlb_free_vmemmap_enabled;
+
+static int __init early_hugetlb_free_vmemmap_param(char *buf)
+{
+ /* We cannot optimize if a "struct page" crosses page boundaries. */
+ if (!is_power_of_2(sizeof(struct page))) {
+ pr_warn("cannot free vmemmap pages because \"struct page\" crosses page boundaries\n");
+ return 0;
+ }
+
+ if (!buf)
+ return -EINVAL;
+
+ if (!strcmp(buf, "on"))
+ hugetlb_free_vmemmap_enabled = true;
+ else if (strcmp(buf, "off"))
+ return -EINVAL;
+
+ return 0;
+}
+early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);
+
static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
{
return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
--
2.11.0
All the infrastructure is ready, so we introduce the nr_free_vmemmap_pages
field in the hstate to indicate how many vmemmap pages associated with
a HugeTLB page can be freed to the buddy allocator, and initialize it
in hugetlb_vmemmap_init(). This patch actually enables the feature.
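As a rough check of the numbers, the computation done in hugetlb_vmemmap_init() can be reproduced in user space. This is only a sketch: it assumes 4KB base pages and a 64-byte struct page (typical for x86-64), and the helper name nr_free_vmemmap_pages() is made up for the example.

```c
#include <assert.h>

#define PAGE_SHIFT		12U	/* assumption: 4KB base pages */
#define STRUCT_PAGE_SIZE	64UL	/* assumption: sizeof(struct page) */
#define RESERVE_VMEMMAP_NR	2U	/* head page + first tail page */

/* Number of vmemmap pages that can be freed for a huge page of @order. */
static unsigned int nr_free_vmemmap_pages(unsigned int order)
{
	unsigned long nr_pages = 1UL << order;
	unsigned long vmemmap_pages =
		(nr_pages * STRUCT_PAGE_SIZE) >> PAGE_SHIFT;

	/* Nothing to free unless more than the reserved pages exist. */
	if (vmemmap_pages > RESERVE_VMEMMAP_NR)
		return (unsigned int)(vmemmap_pages - RESERVE_VMEMMAP_NR);

	return 0;
}
```

Under these assumptions a 2MB page (order 9) has 8 vmemmap pages, of which 6 can be freed, and a 1GB page (order 18) has 4096, of which 4094 can be freed.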
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
---
include/linux/hugetlb.h | 3 +++
mm/hugetlb.c | 1 +
mm/hugetlb_vmemmap.c | 25 +++++++++++++++++++++++++
mm/hugetlb_vmemmap.h | 10 ++++++----
4 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7f47f0eeca3b..66d82ae7b712 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -492,6 +492,9 @@ struct hstate {
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+ unsigned int nr_free_vmemmap_pages;
+#endif
#ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
struct cftype cgroup_files_dfl[7];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 14549204ddcb..0e14fad63823 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3385,6 +3385,7 @@ void __init hugetlb_add_hstate(unsigned int order)
h->next_nid_to_free = first_memory_node;
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);
+ hugetlb_vmemmap_init(h);
parsed_hstate = h;
}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 8206978d1679..7dcb4aa1e512 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -236,3 +236,28 @@ void free_huge_page_vmemmap(struct hstate *h, struct page *head)
vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
}
+
+void __init hugetlb_vmemmap_init(struct hstate *h)
+{
+ unsigned int nr_pages = pages_per_huge_page(h);
+ unsigned int vmemmap_pages;
+
+ if (!hugetlb_free_vmemmap_enabled)
+ return;
+
+ vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT;
+ /*
+ * The head page and the first tail page are not to be freed to buddy
+ * allocator, the other pages will map to the first tail page, so they
+ * can be freed.
+ *
+ * Could RESERVE_VMEMMAP_NR be greater than @vmemmap_pages? It is true
+ * on some architectures (e.g. aarch64). See Documentation/arm64/
+ * hugetlbpage.rst for more details.
+ */
+ if (likely(vmemmap_pages > RESERVE_VMEMMAP_NR))
+ h->nr_free_vmemmap_pages = vmemmap_pages - RESERVE_VMEMMAP_NR;
+
+ pr_info("can free %u vmemmap pages for %s\n", h->nr_free_vmemmap_pages,
+ h->name);
+}
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index b2c8d2f11d48..8fd9ae113dbd 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -13,17 +13,15 @@
#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
void alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
void free_huge_page_vmemmap(struct hstate *h, struct page *head);
+void hugetlb_vmemmap_init(struct hstate *h);
/*
* How many vmemmap pages associated with a HugeTLB page that can be freed
* to the buddy allocator.
- *
- * Todo: Returns zero for now, which means the feature is disabled. We will
- * enable it once all the infrastructure is there.
*/
static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
{
- return 0;
+ return h->nr_free_vmemmap_pages;
}
#else
static inline void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
@@ -38,5 +36,9 @@ static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
{
return 0;
}
+
+static inline void hugetlb_vmemmap_init(struct hstate *h)
+{
+}
#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
#endif /* _LINUX_HUGETLB_VMEMMAP_H */
--
2.11.0
The HUGETLB_PAGE_FREE_VMEMMAP option is used to enable the freeing
of unnecessary vmemmap associated with HugeTLB pages. The config
option is introduced early so that supporting code can be written
to depend on the option. The initial version of the code only
provides support for x86-64.
Like other code which frees vmemmap, this config option depends on
HAVE_BOOTMEM_INFO_NODE. The routine register_page_bootmem_info() is
used to register bootmem info. Therefore, make sure
register_page_bootmem_info is enabled if HUGETLB_PAGE_FREE_VMEMMAP
is defined.
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
---
arch/x86/mm/init_64.c | 2 +-
fs/Kconfig | 18 ++++++++++++++++++
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0a45f062826e..0435bee2e172 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
static void __init register_page_bootmem_info(void)
{
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
int i;
for_each_online_node(i)
diff --git a/fs/Kconfig b/fs/Kconfig
index 976e8b9033c4..e7c4c2a79311 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -245,6 +245,24 @@ config HUGETLBFS
config HUGETLB_PAGE
def_bool HUGETLBFS
+config HUGETLB_PAGE_FREE_VMEMMAP
+ def_bool HUGETLB_PAGE
+ depends on X86_64
+ depends on SPARSEMEM_VMEMMAP
+ depends on HAVE_BOOTMEM_INFO_NODE
+ help
+ The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of
+ some vmemmap pages associated with pre-allocated HugeTLB pages.
+ For example, on X86_64 6 vmemmap pages of size 4KB each can be
+ saved for each 2MB HugeTLB page. 4094 vmemmap pages of size 4KB
+ each can be saved for each 1GB HugeTLB page.
+
+ When a HugeTLB page is allocated or freed, the vmemmap array
+ representing the range associated with the page will need to be
+ remapped. When a page is allocated, vmemmap pages are freed
+ after remapping. When a page is freed, previously discarded
+ vmemmap pages must be allocated before remapping.
+
config MEMFD_CREATE
def_bool TMPFS || HUGETLBFS
--
2.11.0
For a HugeTLB page, there is more metadata to store than the head struct
page can hold, so we have to borrow tail struct pages to store it. To
avoid conflicts as more tail struct pages come into use, gather the
scattered tail struct page indexes into one enum. This also makes it
easier to add a new tail page index later.
Only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct page structs
can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled, so add a
BUILD_BUG_ON to catch invalid usage of the tail struct pages.
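To make the index layout concrete, here is a user-space sketch of the enum with both CONFIG_CGROUP_HUGETLB and CONFIG_HUGETLB_PAGE_FREE_VMEMMAP assumed enabled. The 4KB page size and 64-byte struct page are assumptions, _Static_assert stands in for BUILD_BUG_ON, and order_has_enough_subpages() is a hypothetical helper that mirrors the new BUG_ON condition in hugetlb_add_hstate().

```c
#include <assert.h>

#define PAGE_SIZE		4096UL	/* assumption: 4KB base pages */
#define STRUCT_PAGE_SIZE	64UL	/* assumption: sizeof(struct page) */
#define RESERVE_VMEMMAP_NR	2UL
#define RESERVE_VMEMMAP_SIZE	(RESERVE_VMEMMAP_NR * PAGE_SIZE)

/* Tail page indexes with both config options assumed enabled. */
enum {
	SUBPAGE_INDEX_ACTIVE = 1,
	SUBPAGE_INDEX_TEMPORARY,
	SUBPAGE_INDEX_CGROUP = SUBPAGE_INDEX_TEMPORARY,
	SUBPAGE_INDEX_CGROUP_RSVD,
	SUBPAGE_INDEX_HWPOISON,
	SUBPAGE_INDEX_INFLIGHT,
	NR_USED_SUBPAGE,
};

/* Every used index must fall inside the two reserved vmemmap pages. */
_Static_assert(NR_USED_SUBPAGE < RESERVE_VMEMMAP_SIZE / STRUCT_PAGE_SIZE,
	       "tail struct page index outside the reserved vmemmap pages");

/* Hypothetical helper: is a huge page of @order big enough for all indexes? */
static int order_has_enough_subpages(unsigned int order)
{
	return (1U << order) >= (unsigned int)NR_USED_SUBPAGE;
}
```

With these assumptions the two reserved vmemmap pages hold 128 struct pages, so the six indexes in use fit comfortably, while an order-2 hstate (4 subpages) would be rejected.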
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
---
include/linux/hugetlb.h | 14 ++++++++++++++
include/linux/hugetlb_cgroup.h | 15 +++++++++------
mm/hugetlb.c | 25 ++++++++++++-------------
mm/hugetlb_vmemmap.c | 8 ++++++++
4 files changed, 43 insertions(+), 19 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 66d82ae7b712..05fd2db09b78 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -28,6 +28,20 @@ typedef struct { unsigned long pd; } hugepd_t;
#include <linux/shm.h>
#include <asm/tlbflush.h>
+enum {
+ SUBPAGE_INDEX_ACTIVE = 1, /* reuse page flags of PG_private */
+ SUBPAGE_INDEX_TEMPORARY, /* reuse page->mapping */
+#ifdef CONFIG_CGROUP_HUGETLB
+ SUBPAGE_INDEX_CGROUP = SUBPAGE_INDEX_TEMPORARY,/* reuse page->private */
+ SUBPAGE_INDEX_CGROUP_RSVD, /* reuse page->private */
+#endif
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+ SUBPAGE_INDEX_HWPOISON, /* reuse page->private */
+ SUBPAGE_INDEX_INFLIGHT, /* reuse page->private */
+#endif
+ NR_USED_SUBPAGE,
+};
+
struct hugepage_subpool {
spinlock_t lock;
long count;
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 2ad6e92f124a..3d3c1c49efe4 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -24,8 +24,9 @@ struct file_region;
/*
* Minimum page order trackable by hugetlb cgroup.
* At least 4 pages are necessary for all the tracking information.
- * The second tail page (hpage[2]) is the fault usage cgroup.
- * The third tail page (hpage[3]) is the reservation usage cgroup.
+ * The second tail page (hpage[SUBPAGE_INDEX_CGROUP]) is the fault
+ * usage cgroup. The third tail page (hpage[SUBPAGE_INDEX_CGROUP_RSVD])
+ * is the reservation usage cgroup.
*/
#define HUGETLB_CGROUP_MIN_ORDER 2
@@ -66,9 +67,9 @@ __hugetlb_cgroup_from_page(struct page *page, bool rsvd)
if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
return NULL;
if (rsvd)
- return (struct hugetlb_cgroup *)page[3].private;
+ return (void *)page_private(page + SUBPAGE_INDEX_CGROUP_RSVD);
else
- return (struct hugetlb_cgroup *)page[2].private;
+ return (void *)page_private(page + SUBPAGE_INDEX_CGROUP);
}
static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
@@ -90,9 +91,11 @@ static inline int __set_hugetlb_cgroup(struct page *page,
if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
return -1;
if (rsvd)
- page[3].private = (unsigned long)h_cg;
+ set_page_private(page + SUBPAGE_INDEX_CGROUP_RSVD,
+ (unsigned long)h_cg);
else
- page[2].private = (unsigned long)h_cg;
+ set_page_private(page + SUBPAGE_INDEX_CGROUP,
+ (unsigned long)h_cg);
return 0;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0e14fad63823..fdabc1d0ef98 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1346,17 +1346,17 @@ static inline void flush_hpage_update_work(struct hstate *h)
#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
static inline bool PageHugeInflight(struct page *head)
{
- return page_private(head + 5) == -1UL;
+ return page_private(head + SUBPAGE_INDEX_INFLIGHT) == -1UL;
}
static inline void SetPageHugeInflight(struct page *head)
{
- set_page_private(head + 5, -1UL);
+ set_page_private(head + SUBPAGE_INDEX_INFLIGHT, -1UL);
}
static inline void ClearPageHugeInflight(struct page *head)
{
- set_page_private(head + 5, 0);
+ set_page_private(head + SUBPAGE_INDEX_INFLIGHT, 0);
}
#else
static inline bool PageHugeInflight(struct page *head)
@@ -1404,7 +1404,7 @@ static inline void hwpoison_subpage_deliver(struct hstate *h, struct page *head)
if (!PageHWPoison(head) || !free_vmemmap_pages_per_hpage(h))
return;
- page = head + page_private(head + 4);
+ page = head + page_private(head + SUBPAGE_INDEX_HWPOISON);
/*
* Move PageHWPoison flag from head page to the raw error page,
@@ -1423,7 +1423,7 @@ static inline void hwpoison_subpage_set(struct hstate *h, struct page *head,
return;
if (free_vmemmap_pages_per_hpage(h)) {
- set_page_private(head + 4, page - head);
+ set_page_private(head + SUBPAGE_INDEX_HWPOISON, page - head);
} else if (page != head) {
/*
* Move PageHWPoison flag from head page to the raw error page,
@@ -1433,7 +1433,6 @@ static inline void hwpoison_subpage_set(struct hstate *h, struct page *head,
ClearPageHWPoison(head);
}
}
-
#else
static inline void hwpoison_subpage_deliver(struct hstate *h, struct page *head)
{
@@ -1514,20 +1513,20 @@ struct hstate *size_to_hstate(unsigned long size)
bool page_huge_active(struct page *page)
{
VM_BUG_ON_PAGE(!PageHuge(page), page);
- return PageHead(page) && PagePrivate(&page[1]);
+ return PageHead(page) && PagePrivate(&page[SUBPAGE_INDEX_ACTIVE]);
}
/* never called for tail page */
static void set_page_huge_active(struct page *page)
{
VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
- SetPagePrivate(&page[1]);
+ SetPagePrivate(&page[SUBPAGE_INDEX_ACTIVE]);
}
static void clear_page_huge_active(struct page *page)
{
VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
- ClearPagePrivate(&page[1]);
+ ClearPagePrivate(&page[SUBPAGE_INDEX_ACTIVE]);
}
/*
@@ -1539,17 +1538,17 @@ static inline bool PageHugeTemporary(struct page *page)
if (!PageHuge(page))
return false;
- return (unsigned long)page[2].mapping == -1U;
+ return (unsigned long)page[SUBPAGE_INDEX_TEMPORARY].mapping == -1U;
}
static inline void SetPageHugeTemporary(struct page *page)
{
- page[2].mapping = (void *)-1U;
+ page[SUBPAGE_INDEX_TEMPORARY].mapping = (void *)-1U;
}
static inline void ClearPageHugeTemporary(struct page *page)
{
- page[2].mapping = NULL;
+ page[SUBPAGE_INDEX_TEMPORARY].mapping = NULL;
}
static void __free_huge_page(struct page *page)
@@ -3374,7 +3373,7 @@ void __init hugetlb_add_hstate(unsigned int order)
return;
}
BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
- BUG_ON(order == 0);
+ BUG_ON((1U << order) < NR_USED_SUBPAGE);
h = &hstates[hugetlb_max_hstate++];
h->order = order;
h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 7dcb4aa1e512..6b8f7bb2273e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -242,6 +242,14 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
unsigned int nr_pages = pages_per_huge_page(h);
unsigned int vmemmap_pages;
+ /*
+ * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct
+ * page structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP,
+ * so add a BUILD_BUG_ON to catch invalid usage of the tail struct page.
+ */
+ BUILD_BUG_ON(NR_USED_SUBPAGE >=
+ RESERVE_VMEMMAP_SIZE / sizeof(struct page));
+
if (!hugetlb_free_vmemmap_enabled)
return;
--
2.11.0
Because we reuse the first tail vmemmap page frame and remap it
read-only, we cannot set PageHWPoison on a tail page.
Instead, we use head[4].private to record the index of the real error
page and set PageHWPoison on the raw error page later.
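The record-then-deliver idea can be modelled in user space as follows. This is a toy sketch, not the kernel code: struct toy_page and the toy_* functions are hypothetical, a plain bool replaces the PG_hwpoison flag, and the model covers only the case where vmemmap pages have been freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SUBPAGE_INDEX_HWPOISON	4	/* head[4].private holds the index */

/* Hypothetical, heavily reduced stand-in for struct page. */
struct toy_page {
	bool hwpoison;
	unsigned long private;
};

/* While tail structs are read-only, only remember which subpage is bad. */
static void toy_hwpoison_subpage_set(struct toy_page *head,
				     struct toy_page *page)
{
	if (!head->hwpoison)
		return;

	head[SUBPAGE_INDEX_HWPOISON].private = (unsigned long)(page - head);
}

/* Once tail structs are writable again, move the flag to the raw page. */
static void toy_hwpoison_subpage_deliver(struct toy_page *head)
{
	struct toy_page *page;

	if (!head->hwpoison)
		return;

	page = head + head[SUBPAGE_INDEX_HWPOISON].private;
	if (page != head) {
		page->hwpoison = true;
		head->hwpoison = false;
	}
}
```

The index is stored in a struct page that lives in the reserved, still writable vmemmap pages, which is why the write is safe even though most tail structs are mapped read-only.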
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
---
mm/hugetlb.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 61 insertions(+), 8 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d11c32fcdb38..6caaa7e5dd2a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1358,6 +1358,63 @@ static inline void __update_and_free_page(struct hstate *h, struct page *page)
schedule_work(&hpage_update_work);
}
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+static inline void hwpoison_subpage_deliver(struct hstate *h, struct page *head)
+{
+ struct page *page;
+
+ if (!PageHWPoison(head) || !free_vmemmap_pages_per_hpage(h))
+ return;
+
+ page = head + page_private(head + 4);
+
+ /*
+ * Move PageHWPoison flag from head page to the raw error page,
+ * which makes any subpages rather than the error page reusable.
+ */
+ if (page != head) {
+ SetPageHWPoison(page);
+ ClearPageHWPoison(head);
+ }
+}
+
+static inline void hwpoison_subpage_set(struct hstate *h, struct page *head,
+ struct page *page)
+{
+ if (!PageHWPoison(head))
+ return;
+
+ if (free_vmemmap_pages_per_hpage(h)) {
+ set_page_private(head + 4, page - head);
+ } else if (page != head) {
+ /*
+ * Move PageHWPoison flag from head page to the raw error page,
+ * which makes any subpages rather than the error page reusable.
+ */
+ SetPageHWPoison(page);
+ ClearPageHWPoison(head);
+ }
+}
+
+#else
+static inline void hwpoison_subpage_deliver(struct hstate *h, struct page *head)
+{
+}
+
+static inline void hwpoison_subpage_set(struct hstate *h, struct page *head,
+ struct page *page)
+{
+ if (PageHWPoison(head) && page != head) {
+ /*
+ * Move PageHWPoison flag from head page to the raw error page,
+ * which makes any subpages rather than the error page reusable.
+ */
+ SetPageHWPoison(page);
+ ClearPageHWPoison(head);
+ }
+}
+#endif
+
static void update_and_free_page(struct hstate *h, struct page *page)
{
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
@@ -1373,6 +1430,8 @@ static void __free_hugepage(struct hstate *h, struct page *page)
{
int i;
+ hwpoison_subpage_deliver(h, page);
+
for (i = 0; i < pages_per_huge_page(h); i++) {
page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
1 << PG_referenced | 1 << PG_dirty |
@@ -1845,14 +1904,8 @@ int dissolve_free_huge_page(struct page *page)
int nid = page_to_nid(head);
if (h->free_huge_pages - h->resv_huge_pages == 0)
goto out;
- /*
- * Move PageHWPoison flag from head page to the raw error page,
- * which makes any subpages rather than the error page reusable.
- */
- if (PageHWPoison(head) && page != head) {
- SetPageHWPoison(page);
- ClearPageHWPoison(head);
- }
+
+ hwpoison_subpage_set(h, head, page);
list_del(&head->lru);
h->free_huge_pages--;
h->free_huge_pages_node[nid]--;
--
2.11.0
When we free a HugeTLB page whose vmemmap pages are optimized, it is
freed to the buddy allocator through a kworker. In the meantime its ref
count is zero, so if we dissolve it before it reaches the buddy
allocator, it can be freed again. To avoid this, we introduce
PageHugeInflight to indicate that the HugeTLB page has already been
removed from the hugepage pool but not yet freed to the buddy
allocator. When we hit an inflight page, we just need to flush
the work.
Signed-off-by: Muchun Song <[email protected]>
---
mm/hugetlb.c | 38 +++++++++++++++++++++++++++++++++++++-
1 file changed, 37 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3222bad8b112..14549204ddcb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1343,6 +1343,36 @@ static inline void flush_hpage_update_work(struct hstate *h)
flush_work(&hpage_update_work);
}
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+static inline bool PageHugeInflight(struct page *head)
+{
+ return page_private(head + 5) == -1UL;
+}
+
+static inline void SetPageHugeInflight(struct page *head)
+{
+ set_page_private(head + 5, -1UL);
+}
+
+static inline void ClearPageHugeInflight(struct page *head)
+{
+ set_page_private(head + 5, 0);
+}
+#else
+static inline bool PageHugeInflight(struct page *head)
+{
+ return false;
+}
+
+static inline void SetPageHugeInflight(struct page *head)
+{
+}
+
+static inline void ClearPageHugeInflight(struct page *head)
+{
+}
+#endif
+
static inline void __update_and_free_page(struct hstate *h, struct page *page)
{
/* No need to allocate vmemmap pages */
@@ -1351,6 +1381,8 @@ static inline void __update_and_free_page(struct hstate *h, struct page *page)
return;
}
+ SetPageHugeInflight(page);
+
/*
* Defer freeing to avoid using GFP_ATOMIC to allocate vmemmap
* pages.
@@ -1637,6 +1669,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
{
free_huge_page_vmemmap(h, page);
+ ClearPageHugeInflight(page);
INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
set_hugetlb_cgroup(page, NULL);
@@ -1913,13 +1946,16 @@ int dissolve_free_huge_page(struct page *page)
if (h->free_huge_pages - h->resv_huge_pages == 0)
goto out;
+ rc = 0;
hwpoison_subpage_set(h, head, page);
+ if (PageHugeInflight(head))
+ goto out;
+
list_del(&head->lru);
h->free_huge_pages--;
h->free_huge_pages_node[nid]--;
h->max_huge_pages--;
update_and_free_page(h, head);
- rc = 0;
}
out:
spin_unlock(&hugetlb_lock);
--
2.11.0
We should flush the work when dissolving a HugeTLB page to make sure
that the HugeTLB page is freed to the buddy allocator, because the caller
of dissolve_free_huge_pages() relies on this guarantee.
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
---
mm/hugetlb.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6caaa7e5dd2a..3222bad8b112 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1337,6 +1337,12 @@ static void update_hpage_vmemmap_workfn(struct work_struct *work)
}
static DECLARE_WORK(hpage_update_work, update_hpage_vmemmap_workfn);
+static inline void flush_hpage_update_work(struct hstate *h)
+{
+ if (free_vmemmap_pages_per_hpage(h))
+ flush_work(&hpage_update_work);
+}
+
static inline void __update_and_free_page(struct hstate *h, struct page *page)
{
/* No need to allocate vmemmap pages */
@@ -1887,6 +1893,7 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
int dissolve_free_huge_page(struct page *page)
{
int rc = -EBUSY;
+ struct hstate *h = NULL;
/* Not to disrupt normal path by vainly holding hugetlb_lock */
if (!PageHuge(page))
@@ -1900,8 +1907,9 @@ int dissolve_free_huge_page(struct page *page)
if (!page_count(page)) {
struct page *head = compound_head(page);
- struct hstate *h = page_hstate(head);
int nid = page_to_nid(head);
+
+ h = page_hstate(head);
if (h->free_huge_pages - h->resv_huge_pages == 0)
goto out;
@@ -1915,6 +1923,14 @@ int dissolve_free_huge_page(struct page *page)
}
out:
spin_unlock(&hugetlb_lock);
+
+ /*
+ * Flush the work before returning to make sure that
+ * the HugeTLB page is freed to the buddy allocator.
+ */
+ if (!rc && h)
+ flush_hpage_update_work(h);
+
return rc;
}
--
2.11.0
We cannot optimize if a "struct page" crosses page boundaries, which
happens when sizeof(struct page) is not a power of 2. Since that size is
a compile-time constant, the compiler can help: when
free_vmemmap_pages_per_hpage() returns zero, most of the related
functions are optimized away.
Signed-off-by: Muchun Song <[email protected]>
---
include/linux/hugetlb.h | 3 ++-
mm/hugetlb_vmemmap.c | 7 +++++++
mm/hugetlb_vmemmap.h | 5 +++--
3 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 05fd2db09b78..b685bc4d79d5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -792,7 +792,8 @@ extern bool hugetlb_free_vmemmap_enabled;
static inline bool is_hugetlb_free_vmemmap_enabled(void)
{
- return hugetlb_free_vmemmap_enabled;
+ return hugetlb_free_vmemmap_enabled &&
+ is_power_of_2(sizeof(struct page));
}
#else
static inline bool is_hugetlb_free_vmemmap_enabled(void)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 6b8f7bb2273e..5ea12c7507a6 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -250,6 +250,13 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
BUILD_BUG_ON(NR_USED_SUBPAGE >=
RESERVE_VMEMMAP_SIZE / sizeof(struct page));
+ /*
+ * The compiler can optimize this function into a no-op
+ * when the size of struct page is not a power of 2.
+ */
+ if (!is_power_of_2(sizeof(struct page)))
+ return;
+
if (!hugetlb_free_vmemmap_enabled)
return;
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 8fd9ae113dbd..e8de41295d4d 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -17,11 +17,12 @@ void hugetlb_vmemmap_init(struct hstate *h);
/*
* How many vmemmap pages associated with a HugeTLB page that can be freed
- * to the buddy allocator.
+ * to the buddy allocator. The is_power_of_2() check lets the compiler
+ * optimize the code as much as possible.
*/
static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
{
- return h->nr_free_vmemmap_pages;
+ return is_power_of_2(sizeof(struct page)) ? h->nr_free_vmemmap_pages : 0;
}
#else
static inline void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
--
2.11.0
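For illustration, the page-boundary argument in the patch above can be
checked with a small userspace sketch. It is not kernel code:
SKETCH_PAGE_SIZE and the sketch_* helpers are hypothetical names, a 4KB
base page and a page-aligned array of structs are assumed, and
sketch_is_power_of_2() only mirrors the idea of the kernel's
is_power_of_2().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed base page size */

/* Same idea as the kernel's is_power_of_2(): exactly one bit is set. */
static bool sketch_is_power_of_2(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Return true if any element of a page-aligned array of `count` elements,
 * each `size` bytes long, straddles a page boundary.
 */
static bool sketch_any_struct_crosses_page(size_t size, size_t count)
{
    assert(size > 0 && size <= SKETCH_PAGE_SIZE);

    for (size_t i = 0; i < count; i++) {
        size_t first = i * size;        /* offset of the first byte */
        size_t last = first + size - 1; /* offset of the last byte */

        if (first / SKETCH_PAGE_SIZE != last / SKETCH_PAGE_SIZE)
            return true;
    }
    return false;
}
```

With a 64-byte struct, no element of a page-aligned array straddles a page
boundary. With a hypothetical 56-byte struct, the element that starts at
offset 4088 does.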
In the subsequent patch, we will allocate the vmemmap pages when
freeing HugeTLB pages. But update_and_free_page() is always called
with hugetlb_lock held, so we cannot use GFP_KERNEL to allocate
vmemmap pages. However, we can defer the actual freeing to a kworker
to avoid using GFP_ATOMIC to allocate the vmemmap pages.
update_hpage_vmemmap_workfn() is where the call to allocate
vmemmap pages will be inserted.
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++--
mm/hugetlb_vmemmap.c | 12 ---------
mm/hugetlb_vmemmap.h | 17 ++++++++++++
3 files changed, 89 insertions(+), 14 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 140135fc8113..c165186ec2cf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1292,15 +1292,85 @@ static inline void destroy_compound_gigantic_page(struct page *page,
unsigned int order) { }
#endif
-static void update_and_free_page(struct hstate *h, struct page *page)
+static void __free_hugepage(struct hstate *h, struct page *page);
+
+/*
+ * update_and_free_page() is always called with hugetlb_lock held, so we
+ * cannot use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
+ * actual freeing to a workqueue to avoid using GFP_ATOMIC to allocate
+ * the vmemmap pages.
+ *
+ * update_hpage_vmemmap_workfn() is where the call to allocate vmemmap
+ * pages will be inserted.
+ *
+ * update_hpage_vmemmap_workfn() locklessly retrieves the linked list of pages
+ * to be freed and frees them one-by-one. As the page->mapping pointer is going
+ * to be cleared in update_hpage_vmemmap_workfn() anyway, it is reused as the
+ * llist_node structure of a lockless linked list of huge pages to be freed.
+ */
+static LLIST_HEAD(hpage_update_freelist);
+
+static void update_hpage_vmemmap_workfn(struct work_struct *work)
{
- int i;
+ struct llist_node *node;
+
+ node = llist_del_all(&hpage_update_freelist);
+
+ while (node) {
+ struct page *page;
+ struct hstate *h;
+
+ page = container_of((struct address_space **)node,
+ struct page, mapping);
+ node = node->next;
+ page->mapping = NULL;
+ h = page_hstate(page);
+
+ spin_lock(&hugetlb_lock);
+ __free_hugepage(h, page);
+ spin_unlock(&hugetlb_lock);
+ cond_resched();
+ }
+}
+static DECLARE_WORK(hpage_update_work, update_hpage_vmemmap_workfn);
+
+static inline void __update_and_free_page(struct hstate *h, struct page *page)
+{
+ /* No need to allocate vmemmap pages */
+ if (!free_vmemmap_pages_per_hpage(h)) {
+ __free_hugepage(h, page);
+ return;
+ }
+
+ /*
+ * Defer freeing to avoid using GFP_ATOMIC to allocate vmemmap
+ * pages.
+ *
+ * Only call schedule_work() if hpage_update_freelist was previously
+ * empty. Otherwise, schedule_work() has already been called but the
+ * workfn hasn't retrieved the list yet.
+ */
+ if (llist_add((struct llist_node *)&page->mapping,
+ &hpage_update_freelist))
+ schedule_work(&hpage_update_work);
+}
+
+static void update_and_free_page(struct hstate *h, struct page *page)
+{
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
return;
h->nr_huge_pages--;
h->nr_huge_pages_node[page_to_nid(page)]--;
+
+ __update_and_free_page(h, page);
+}
+
+static void __free_hugepage(struct hstate *h, struct page *page)
+{
+ int i;
+
for (i = 0; i < pages_per_huge_page(h); i++) {
page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
1 << PG_referenced | 1 << PG_dirty |
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4ffa2a4ae2a8..19f1898aaede 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -178,18 +178,6 @@
#define RESERVE_VMEMMAP_NR 2U
#define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
-/*
- * How many vmemmap pages associated with a HugeTLB page that can be freed
- * to the buddy allocator.
- *
- * Todo: Returns zero for now, which means the feature is disabled. We will
- * enable it once all the infrastructure is there.
- */
-static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
-{
- return 0;
-}
-
static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
{
return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 6923f03534d5..01f8637adbe0 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -12,9 +12,26 @@
#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
void free_huge_page_vmemmap(struct hstate *h, struct page *head);
+
+/*
+ * How many vmemmap pages associated with a HugeTLB page that can be freed
+ * to the buddy allocator.
+ *
+ * Todo: Returns zero for now, which means the feature is disabled. We will
+ * enable it once all the infrastructure is there.
+ */
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+ return 0;
+}
#else
static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
{
}
+
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+ return 0;
+}
#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
#endif /* _LINUX_HUGETLB_VMEMMAP_H */
--
2.11.0
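The scheduling rule in the patch above (only call schedule_work() when the
list was empty before the add) can be sketched in userspace with C11
atomics. This is a simplified stand-in and not the kernel's llist
implementation: all sketch_* names are hypothetical, a counter replaces
schedule_work(), and an embedded node replaces the reused page->mapping
field.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_node {
    struct sketch_node *next;
};

struct sketch_head {
    _Atomic(struct sketch_node *) first;
};

struct sketch_page {
    int id;
    struct sketch_node node; /* stands in for the reused page->mapping */
};

#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static int sketch_work_scheduled; /* counts stand-in schedule_work() calls */

/* Push a node; returns true if the list was empty before the push. */
static bool sketch_llist_add(struct sketch_node *node, struct sketch_head *head)
{
    struct sketch_node *old = atomic_load(&head->first);

    assert(node != NULL);
    do {
        node->next = old;
    } while (!atomic_compare_exchange_weak(&head->first, &old, node));

    return old == NULL;
}

/* Detach the whole list in one atomic exchange. */
static struct sketch_node *sketch_llist_del_all(struct sketch_head *head)
{
    return atomic_exchange(&head->first, (struct sketch_node *)NULL);
}

/* Queue a page; only the add that finds the list empty schedules work. */
static void sketch_defer_free(struct sketch_page *page, struct sketch_head *head)
{
    if (sketch_llist_add(&page->node, head))
        sketch_work_scheduled++;
}

/* Worker body: drain everything queued so far; returns pages processed. */
static int sketch_worker(struct sketch_head *head)
{
    struct sketch_node *node = sketch_llist_del_all(head);
    int freed = 0;

    while (node) {
        struct sketch_page *page =
            sketch_container_of(node, struct sketch_page, node);

        node = node->next;
        page->id = -1; /* mark the page as freed */
        freed++;
    }
    return freed;
}
```

Queuing two pages back to back schedules the worker once. After the worker
has drained the list, the next queued page schedules it again.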
On Sun, Jan 17, 2021 at 11:13 PM Muchun Song <[email protected]> wrote:
>
> Every HugeTLB has more than one struct page structure. We __know__ that
> we only use the first 4(HUGETLB_CGROUP_MIN_ORDER) struct page structures
> to store metadata associated with each HugeTLB.
>
> There are a lot of struct page structures associated with each HugeTLB
> page. For tail pages, the value of compound_head is the same. So we can
> reuse first page of tail page structures. We map the virtual addresses
> of the remaining pages of tail page structures to the first tail page
> struct, and then free these page frames. Therefore, we need to reserve
> two pages as vmemmap areas.
>
> When we allocate a HugeTLB page from the buddy, we can free some vmemmap
> pages associated with each HugeTLB page. It is more appropriate to do it
> in the prep_new_huge_page().
>
> The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
> pages associated with a HugeTLB page can be freed, returns zero for
> now, which means the feature is disabled. We will enable it once all
> the infrastructure is there.
>
> Signed-off-by: Muchun Song <[email protected]>
> ---
> include/linux/bootmem_info.h | 27 +++++-
> include/linux/mm.h | 3 +
> mm/Makefile | 1 +
> mm/hugetlb.c | 3 +
> mm/hugetlb_vmemmap.c | 211 +++++++++++++++++++++++++++++++++++++++++++
> mm/hugetlb_vmemmap.h | 20 ++++
> mm/sparse-vmemmap.c | 198 ++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 462 insertions(+), 1 deletion(-)
> create mode 100644 mm/hugetlb_vmemmap.c
> create mode 100644 mm/hugetlb_vmemmap.h
>
> diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
> index 4ed6dee1adc9..ec03a624dfa2 100644
> --- a/include/linux/bootmem_info.h
> +++ b/include/linux/bootmem_info.h
> @@ -2,7 +2,7 @@
> #ifndef __LINUX_BOOTMEM_INFO_H
> #define __LINUX_BOOTMEM_INFO_H
>
> -#include <linux/mmzone.h>
> +#include <linux/mm.h>
>
> /*
> * Types for free bootmem stored in page->lru.next. These have to be in
> @@ -22,6 +22,27 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
> void get_page_bootmem(unsigned long info, struct page *page,
> unsigned long type);
> void put_page_bootmem(struct page *page);
> +
> +/*
> + * Any memory allocated via the memblock allocator and not via the
> + * buddy will be marked reserved already in the memmap. For those
> + * pages, we can call this function to free it to buddy allocator.
> + */
> +static inline void free_bootmem_page(struct page *page)
> +{
> + unsigned long magic = (unsigned long)page->freelist;
> +
> + /*
> + * The reserve_bootmem_region sets the reserved flag on bootmem
> + * pages.
> + */
> + VM_BUG_ON_PAGE(page_ref_count(page) != 2, page);
> +
> + if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
> + put_page_bootmem(page);
> + else
> + VM_BUG_ON_PAGE(1, page);
> +}
> #else
> static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
> {
> @@ -35,6 +56,10 @@ static inline void get_page_bootmem(unsigned long info, struct page *page,
> unsigned long type)
> {
> }
> +
> +static inline void free_bootmem_page(struct page *page)
> +{
> +}
> #endif
>
> #endif /* __LINUX_BOOTMEM_INFO_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index eabe7d9f80d8..f928994ed273 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3005,6 +3005,9 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
> }
> #endif
>
> +void vmemmap_remap_free(unsigned long start, unsigned long end,
> + unsigned long reuse);
> +
> void *sparse_buffer_alloc(unsigned long size);
> struct page * __populate_section_memmap(unsigned long pfn,
> unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
> diff --git a/mm/Makefile b/mm/Makefile
> index ed4b88fa0f5e..056801d8daae 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -71,6 +71,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
> obj-$(CONFIG_ZSWAP) += zswap.o
> obj-$(CONFIG_HAS_DMA) += dmapool.o
> obj-$(CONFIG_HUGETLBFS) += hugetlb.o
> +obj-$(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP) += hugetlb_vmemmap.o
> obj-$(CONFIG_NUMA) += mempolicy.o
> obj-$(CONFIG_SPARSEMEM) += sparse.o
> obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1f3bf1710b66..140135fc8113 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -42,6 +42,7 @@
> #include <linux/userfaultfd_k.h>
> #include <linux/page_owner.h>
> #include "internal.h"
> +#include "hugetlb_vmemmap.h"
>
> int hugetlb_max_hstate __read_mostly;
> unsigned int default_hstate_idx;
> @@ -1497,6 +1498,8 @@ void free_huge_page(struct page *page)
>
> static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> {
> + free_huge_page_vmemmap(h, page);
> +
> INIT_LIST_HEAD(&page->lru);
> set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> set_hugetlb_cgroup(page, NULL);
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> new file mode 100644
> index 000000000000..4ffa2a4ae2a8
> --- /dev/null
> +++ b/mm/hugetlb_vmemmap.c
> @@ -0,0 +1,211 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Free some vmemmap pages of HugeTLB
> + *
> + * Copyright (c) 2020, Bytedance. All rights reserved.
> + *
> + * Author: Muchun Song <[email protected]>
> + *
> + * The struct page structures (page structs) are used to describe a physical
> + * page frame. By default, there is a one-to-one mapping from a page frame to
> + * its corresponding page struct.
> + *
> + * HugeTLB pages consist of multiple base-page-sized pages and are supported
> + * by many architectures. See hugetlbpage.rst in the Documentation directory
> + * for more details. On the x86-64 architecture, HugeTLB pages of size 2MB and
> + * 1GB are currently supported. Since the base page size on x86 is 4KB, a 2MB
> + * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
> + * 262144 base pages. For each base page, there is a corresponding page struct.
> + *
> + * Within the HugeTLB subsystem, only the first 4 page structs are used to
> + * contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
> + * provides this upper limit. The only 'useful' information in the remaining
> + * page structs is the compound_head field, and this field is the same for all
> + * tail pages.
> + *
> + * By removing redundant page structs for HugeTLB pages, memory can be returned
> + * to the buddy allocator for other uses.
> + *
> + * Different architectures support different HugeTLB pages. For example, the
> + * following table lists the HugeTLB page sizes supported by the x86 and arm64
> + * architectures. Because arm64 supports 4k, 16k, and 64k base pages as well
> + * as contiguous entries, it supports many different HugeTLB page
> + * sizes.
> + *
> + * +--------------+-----------+-----------------------------------------------+
> + * | Architecture | Page Size | HugeTLB Page Size |
> + * +--------------+-----------+-----------+-----------+-----------+-----------+
> + * | x86-64 | 4KB | 2MB | 1GB | | |
> + * +--------------+-----------+-----------+-----------+-----------+-----------+
> + * | | 4KB | 64KB | 2MB | 32MB | 1GB |
> + * | +-----------+-----------+-----------+-----------+-----------+
> + * | arm64 | 16KB | 2MB | 32MB | 1GB | |
> + * | +-----------+-----------+-----------+-----------+-----------+
> + * | | 64KB | 2MB | 512MB | 16GB | |
> + * +--------------+-----------+-----------+-----------+-----------+-----------+
> + *
> + * When the system boots up, every HugeTLB page has more than one struct page
> + * struct, whose total size is (unit: pages):
> + *
> + * struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> + *
> + * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
> + * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
> + * relationship.
> + *
> + * HugeTLB_Size = n * PAGE_SIZE
> + *
> + * Then,
> + *
> + * struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> + * = n * sizeof(struct page) / PAGE_SIZE
> + *
> + * We can use huge mapping at the pud/pmd level for the HugeTLB page.
> + *
> + * For the HugeTLB page of the pmd level mapping, then
> + *
> + * struct_size = n * sizeof(struct page) / PAGE_SIZE
> + * = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
> + * = sizeof(struct page) / sizeof(pte_t)
> + * = 64 / 8
> + * = 8 (pages)
> + *
> + * Where n is the number of pte entries that one page can contain. So the value
> + * of n is (PAGE_SIZE / sizeof(pte_t)).
> + *
> + * This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
> + * is 8. This optimization is also applicable only when the size of struct page
> + * is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
> + * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
> + * page structs for it occupy 8 page frames, whose size depends on the size of
> + * the base page.
> + *
> + * For the HugeTLB page of the pud level mapping, then
> + *
> + * struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
> + * = PAGE_SIZE / 8 * 8 (pages)
> + * = PAGE_SIZE (pages)
> + *
> + * Where the struct_size(pmd) is the size of the struct page structs of a
> + * HugeTLB page of the pmd level mapping.
> + *
> + * Next, we take the pmd level mapping of the HugeTLB page as an example to
> + * show the internal implementation of this optimization. There are 8 pages of
> + * struct page structs associated with a HugeTLB page which is pmd mapped.
> + *
> + * Here is how things look before optimization.
> + *
> + * HugeTLB struct pages(8 pages) page frame(8 pages)
> + * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
> + * | | | 0 | -------------> | 0 |
> + * | | +-----------+ +-----------+
> + * | | | 1 | -------------> | 1 |
> + * | | +-----------+ +-----------+
> + * | | | 2 | -------------> | 2 |
> + * | | +-----------+ +-----------+
> + * | | | 3 | -------------> | 3 |
> + * | | +-----------+ +-----------+
> + * | | | 4 | -------------> | 4 |
> + * | PMD | +-----------+ +-----------+
> + * | level | | 5 | -------------> | 5 |
> + * | mapping | +-----------+ +-----------+
> + * | | | 6 | -------------> | 6 |
> + * | | +-----------+ +-----------+
> + * | | | 7 | -------------> | 7 |
> + * | | +-----------+ +-----------+
> + * | |
> + * | |
> + * | |
> + * +-----------+
> + *
> + * The value of page->compound_head is the same for all tail pages. The first
> + * page of page structs (page 0) associated with the HugeTLB page contains the 4
> + * page structs necessary to describe the HugeTLB. The only use of the remaining
> + * pages of page structs (page 1 to page 7) is to point to page->compound_head.
> + * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
> + * will be used for each HugeTLB page. This will allow us to free the remaining
> + * 6 pages to the buddy allocator.
> + *
> + * Here is how things look after remapping.
> + *
> + * HugeTLB struct pages(8 pages) page frame(8 pages)
> + * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
> + * | | | 0 | -------------> | 0 |
> + * | | +-----------+ +-----------+
> + * | | | 1 | -------------> | 1 |
> + * | | +-----------+ +-----------+
> + * | | | 2 | ----------------^ ^ ^ ^ ^ ^
> + * | | +-----------+ | | | | |
> + * | | | 3 | ------------------+ | | | |
> + * | | +-----------+ | | | |
> + * | | | 4 | --------------------+ | | |
> + * | PMD | +-----------+ | | |
> + * | level | | 5 | ----------------------+ | |
> + * | mapping | +-----------+ | |
> + * | | | 6 | ------------------------+ |
> + * | | +-----------+ |
> + * | | | 7 | --------------------------+
> + * | | +-----------+
> + * | |
> + * | |
> + * | |
> + * +-----------+
> + *
> + * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
> + * vmemmap pages and restore the previous mapping relationship.
> + *
> + * The HugeTLB page of the pud level mapping is similar to the former.
> + * We can also use this approach to free (PAGE_SIZE - 2) vmemmap pages.
> + *
> + * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
> + * (e.g. aarch64) provide a contiguous bit in the translation table entries
> + * that hints to the MMU that the entry is one of a contiguous set of
> + * entries that can be cached in a single TLB entry.
> + *
> + * The contiguous bit is used to increase the mapping size at the pmd and pte
> + * (last) level. So this type of HugeTLB page can be optimized only when the
> + * size of its struct page structs is greater than 2 pages.
> + */
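As a worked example of the struct_size arithmetic in the comment above,
here is a minimal userspace sketch. It assumes sizeof(struct page) is 64
bytes and that two vmemmap pages stay reserved, as RESERVE_VMEMMAP_NR
does. The sketch_* names are hypothetical and not part of the patch.

```c
#include <assert.h>

#define SKETCH_STRUCT_PAGE_SIZE 64UL /* assumed sizeof(struct page) */
#define SKETCH_RESERVED_PAGES   2UL  /* mirrors RESERVE_VMEMMAP_NR */

/* Number of base pages occupied by the page structs of one HugeTLB page. */
static unsigned long sketch_vmemmap_pages(unsigned long hugepage_size,
                                          unsigned long page_size)
{
    unsigned long nr_base_pages;

    assert(page_size != 0);
    nr_base_pages = hugepage_size / page_size;

    return nr_base_pages * SKETCH_STRUCT_PAGE_SIZE / page_size;
}

/* Pages that could go back to the buddy allocator (0 if nothing to gain). */
static unsigned long sketch_freeable_pages(unsigned long hugepage_size,
                                           unsigned long page_size)
{
    unsigned long total = sketch_vmemmap_pages(hugepage_size, page_size);

    return total > SKETCH_RESERVED_PAGES ? total - SKETCH_RESERVED_PAGES : 0;
}
```

For a 2MB HugeTLB page with 4KB base pages this gives 8 vmemmap pages, 6 of
them freeable. For a 1GB HugeTLB page it gives 4096 and 4094. A 2MB
contiguous-entry HugeTLB page on a 64KB base page needs less than one
vmemmap page, so nothing can be freed there.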
> +#include "hugetlb_vmemmap.h"
> +
> +/*
> + * There are a lot of struct page structures associated with each HugeTLB page.
> + * For tail pages, the value of compound_head is the same. So we can reuse first
> + * page of tail page structures. We map the virtual addresses of the remaining
> + * pages of tail page structures to the first tail page struct, and then free
> + * these page frames. Therefore, we need to reserve two pages as vmemmap areas.
> + */
> +#define RESERVE_VMEMMAP_NR 2U
> +#define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> +
> +/*
> + * How many vmemmap pages associated with a HugeTLB page that can be freed
> + * to the buddy allocator.
> + *
> + * Todo: Returns zero for now, which means the feature is disabled. We will
> + * enable it once all the infrastructure is there.
> + */
> +static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
> +{
> + return 0;
> +}
> +
> +static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
> +{
> + return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
> +}
> +
> +void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> +{
> + unsigned long vmemmap_addr = (unsigned long)head;
> + unsigned long vmemmap_end, vmemmap_reuse;
> +
> + if (!free_vmemmap_pages_per_hpage(h))
> + return;
> +
> + vmemmap_addr += RESERVE_VMEMMAP_SIZE;
> + vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
> + vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
> +
> + vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
> +}
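The address computation in free_huge_page_vmemmap() above can be
illustrated with concrete numbers. This is only a sketch under assumptions:
a 4KB base page, a page-aligned head address, and hypothetical sketch_*
names. The head address used in any call is made up for the example.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_RESERVE_NR 2UL /* mirrors RESERVE_VMEMMAP_NR */

struct sketch_range {
    unsigned long start; /* first vmemmap address to remap */
    unsigned long end;   /* one past the last address to remap */
    unsigned long reuse; /* address whose backing page is shared */
};

/* Derive the remap range from the head's vmemmap address. */
static struct sketch_range sketch_remap_range(unsigned long head_vaddr,
                                              unsigned long nr_free)
{
    struct sketch_range r;

    assert((head_vaddr & (SKETCH_PAGE_SIZE - 1)) == 0);

    r.start = head_vaddr + (SKETCH_RESERVE_NR << SKETCH_PAGE_SHIFT);
    r.end = r.start + (nr_free << SKETCH_PAGE_SHIFT);
    r.reuse = r.start - SKETCH_PAGE_SIZE;
    return r;
}
```

With a head address of 0x100000 and 6 freeable pages, the range to remap is
[0x102000, 0x108000) and the reuse address is 0x101000, which is exactly
one page below the start, as the BUG_ON() in vmemmap_remap_free() expects.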
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> new file mode 100644
> index 000000000000..6923f03534d5
> --- /dev/null
> +++ b/mm/hugetlb_vmemmap.h
> @@ -0,0 +1,20 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Free some vmemmap pages of HugeTLB
> + *
> + * Copyright (c) 2020, Bytedance. All rights reserved.
> + *
> + * Author: Muchun Song <[email protected]>
> + */
> +#ifndef _LINUX_HUGETLB_VMEMMAP_H
> +#define _LINUX_HUGETLB_VMEMMAP_H
> +#include <linux/hugetlb.h>
> +
> +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
> +void free_huge_page_vmemmap(struct hstate *h, struct page *head);
> +#else
> +static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> +{
> +}
> +#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
> +#endif /* _LINUX_HUGETLB_VMEMMAP_H */
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 16183d85a7d5..ce4be1fa93c2 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -27,8 +27,206 @@
> #include <linux/spinlock.h>
> #include <linux/vmalloc.h>
> #include <linux/sched.h>
> +#include <linux/pgtable.h>
> +#include <linux/bootmem_info.h>
> +
> #include <asm/dma.h>
> #include <asm/pgalloc.h>
> +#include <asm/tlbflush.h>
> +
> +/**
> + * vmemmap_remap_walk - walk vmemmap page table
> + *
> + * @remap_pte: called for each non-empty PTE (lowest-level) entry.
> + * @reuse_page: the page which is reused for the tail vmemmap pages.
> + * @reuse_addr: the virtual address of the @reuse_page page.
> + * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
> + */
> +struct vmemmap_remap_walk {
> + void (*remap_pte)(pte_t *pte, unsigned long addr,
> + struct vmemmap_remap_walk *walk);
> + struct page *reuse_page;
> + unsigned long reuse_addr;
> + struct list_head *vmemmap_pages;
> +};
> +
> +static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + pte_t *pte;
> +
> + pte = pte_offset_kernel(pmd, addr);
> +
> + /*
> + * The reuse_page is found 'first' in table walk before we start
> + * remapping (which is calling @walk->remap_pte).
> + */
> + if (walk->reuse_addr == addr) {
> + BUG_ON(pte_none(*pte));
> +
> + walk->reuse_page = pte_page(*pte++);
> + /*
> + * Because the reuse address is part of the range that we are
> + * walking, skip the reuse address range.
> + */
> + addr += PAGE_SIZE;
> + }
> +
> + for (; addr != end; addr += PAGE_SIZE, pte++) {
> + BUG_ON(pte_none(*pte));
> +
> + walk->remap_pte(pte, addr, walk);
> + }
> +}
> +
> +static void vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> + unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + pmd_t *pmd;
> + unsigned long next;
> +
> + pmd = pmd_offset(pud, addr);
> + do {
> + BUG_ON(pmd_none(*pmd));
> +
> + next = pmd_addr_end(addr, end);
> + vmemmap_pte_range(pmd, addr, next, walk);
> + } while (pmd++, addr = next, addr != end);
> +}
> +
> +static void vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
> + unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + pud_t *pud;
> + unsigned long next;
> +
> + pud = pud_offset(p4d, addr);
> + do {
> + BUG_ON(pud_none(*pud));
> +
> + next = pud_addr_end(addr, end);
> + vmemmap_pmd_range(pud, addr, next, walk);
> + } while (pud++, addr = next, addr != end);
> +}
> +
> +static void vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
> + unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + p4d_t *p4d;
> + unsigned long next;
> +
> + p4d = p4d_offset(pgd, addr);
> + do {
> + BUG_ON(p4d_none(*p4d));
> +
> + next = p4d_addr_end(addr, end);
> + vmemmap_pud_range(p4d, addr, next, walk);
> + } while (p4d++, addr = next, addr != end);
> +}
> +
> +static void vmemmap_remap_range(unsigned long start, unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + unsigned long addr = start;
> + unsigned long next;
> + pgd_t *pgd;
> +
> + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
> + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
> +
> + pgd = pgd_offset_k(addr);
> + do {
> + BUG_ON(pgd_none(*pgd));
> +
> + next = pgd_addr_end(addr, end);
> + vmemmap_p4d_range(pgd, addr, next, walk);
> + } while (pgd++, addr = next, addr != end);
> +
> + /*
> + * We do not change the mapping of the vmemmap virtual address range
> + * [@start, @start + PAGE_SIZE), which belongs to the reuse range,
> + * so we do not need to flush the TLB for it.
> + */
> + flush_tlb_kernel_range(start - PAGE_SIZE, end);
Sorry. Here should be "flush_tlb_kernel_range(start + PAGE_SIZE, end)".
Will be fixed in the next version.
> +}
> +
> +/*
> + * Free a vmemmap page. A vmemmap page can be allocated from the memblock
> + * allocator or buddy allocator. If the PG_reserved flag is set, it means
> + * that it was allocated from the memblock allocator, so free it via
> + * free_bootmem_page(). Otherwise, use __free_page().
> + */
> +static inline void free_vmemmap_page(struct page *page)
> +{
> + if (PageReserved(page))
> + free_bootmem_page(page);
> + else
> + __free_page(page);
> +}
> +
> +/* Free a list of the vmemmap pages */
> +static void free_vmemmap_page_list(struct list_head *list)
> +{
> + struct page *page, *next;
> +
> + list_for_each_entry_safe(page, next, list, lru) {
> + list_del(&page->lru);
> + free_vmemmap_page(page);
> + }
> +}
> +
> +static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> + struct vmemmap_remap_walk *walk)
> +{
> + /*
> + * Remap the tail pages as read-only to catch illegal write operations
> + * to the tail pages.
> + */
> + pgprot_t pgprot = PAGE_KERNEL_RO;
> + pte_t entry = mk_pte(walk->reuse_page, pgprot);
> + struct page *page = pte_page(*pte);
> +
> + list_add(&page->lru, walk->vmemmap_pages);
> + set_pte_at(&init_mm, addr, pte, entry);
> +}
> +
> +/**
> + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> + * to the page which @reuse is mapped, then free vmemmap
> + * pages.
> + * @start: start address of the vmemmap virtual address range.
> + * @end: end address of the vmemmap virtual address range.
> + * @reuse: reuse address.
> + */
> +void vmemmap_remap_free(unsigned long start, unsigned long end,
> + unsigned long reuse)
> +{
> + LIST_HEAD(vmemmap_pages);
> + struct vmemmap_remap_walk walk = {
> + .remap_pte = vmemmap_remap_pte,
> + .reuse_addr = reuse,
> + .vmemmap_pages = &vmemmap_pages,
> + };
> +
> + /*
> + * To make the remapping routine as efficient as possible for huge pages,
> + * the vmemmap page table walk follows these rules (see
> + * vmemmap_pte_range() for details):
> + *
> + * - The @reuse address is part of the range that we are walking.
> + * - The @reuse address is the first in the complete range.
> + *
> + * So we need to make sure that @start and @reuse meet the above rules.
> + */
> + BUG_ON(start - reuse != PAGE_SIZE);
> +
> + vmemmap_remap_range(reuse, end, &walk);
> + free_vmemmap_page_list(&vmemmap_pages);
> +}
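To make the walk in vmemmap_remap_free() concrete, here is a userspace
simulation of the pmd-mapped case with 8 vmemmap pages. It is a sketch, not
the kernel's page table code: an int array stands in for the PTEs, frame
numbers are arbitrary, and all sketch_* names are hypothetical. The real
code collects pages with list_add(), so the order of the freed list differs
from this array.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_VMEMMAP 8 /* vmemmap pages per pmd-mapped HugeTLB page */

struct sketch_walk {
    int reuse_frame;              /* frame backing the reuse address */
    int freed[SKETCH_NR_VMEMMAP]; /* frames collected for freeing */
    size_t nr_freed;
};

/*
 * pte[i] holds the frame number backing vmemmap page i. Walk [reuse, end):
 * the first entry supplies the frame to share, every later entry is
 * pointed at that frame and its old frame is collected.
 */
static void sketch_remap_free(int *pte, size_t reuse, size_t end,
                              struct sketch_walk *walk)
{
    assert(reuse < end && end <= SKETCH_NR_VMEMMAP);

    walk->reuse_frame = pte[reuse];
    walk->nr_freed = 0;

    for (size_t i = reuse + 1; i < end; i++) {
        walk->freed[walk->nr_freed++] = pte[i];
        pte[i] = walk->reuse_frame;
    }
}
```

Starting the walk at index 1 leaves pages 0 and 1 untouched, points pages 2
to 7 at the frame behind page 1, and collects the 6 old frames, which
matches the "after remapping" diagram in hugetlb_vmemmap.c.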
>
> /*
> * Allocate a block of memory to be used to back the virtual memory map
> --
> 2.11.0
>
On Wed, Jan 20, 2021 at 08:52:50PM +0800, Muchun Song wrote:
> Hi Oscar and Mike,
>
> Any suggestions about this version? Looking forward to your
> review. Thanks a lot.
Hi Muchun,
I plan to keep reviewing it in the coming days (tomorrow or Friday).
I glanced over patch#3 when you posted the series and nothing stuck out besides
what you have already pointed out, but I will have a further look.
thanks
>
> >
> > Changelog in v11 -> v12:
> > - Move VM_WARN_ON_PAGE to a separate patch.
> > - Call __free_hugepage() with hugetlb_lock (See patch #5.) to serialize
> > with dissolve_free_huge_page(). It is to prepare for patch #9.
> > - Introduce PageHugeInflight. See patch #9.
> >
> > Changelog in v10 -> v11:
> > - Fix compiler error when !CONFIG_HUGETLB_PAGE_FREE_VMEMMAP.
> > - Rework some comments and commit changes.
> > - Rework vmemmap_remap_free() to 3 parameters.
> >
> > Thanks to Oscar and Mike's suggestions and review.
> >
> > Changelog in v9 -> v10:
> > - Fix a bug in patch #11. Thanks to Oscar for pointing that out.
> > - Rework some commit log or comments. Thanks Mike and Oscar for the suggestions.
> > - Drop VMEMMAP_TAIL_PAGE_REUSE in the patch #3.
> >
> > Thank you very much Mike and Oscar for reviewing the code.
> >
> > Changelog in v8 -> v9:
> > - Rework some code. Many thanks to Oscar.
> > - Put all the non-hugetlb vmemmap functions under sparsemem-vmemmap.c.
> >
> > Changelog in v7 -> v8:
> > - Adjust the order of patches.
> >
> > Many thanks to David and Oscar. Your suggestions are very valuable.
> >
> > Changelog in v6 -> v7:
> > - Rebase to linux-next 20201130
> > - Do not use basepage mapping for vmemmap when this feature is disabled.
> > - Rework some patches.
> > [PATCH v6 08/16] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
> > [PATCH v6 10/16] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page
> >
> > Thanks to Oscar and Barry.
> >
> > Changelog in v5 -> v6:
> > - Disable PMD/huge page mapping of vmemmap if this feature was enabled.
> > - Simplify the first version code.
> >
> > Changelog in v4 -> v5:
> > - Rework some comments and code in the [PATCH v4 04/21] and [PATCH v4 05/21].
> >
> > Thanks to Mike and Oscar's suggestions.
> >
> > Changelog in v3 -> v4:
> > - Move all the vmemmap functions to hugetlb_vmemmap.c.
> > - Make the CONFIG_HUGETLB_PAGE_FREE_VMEMMAP default to y, if we want to
> > disable this feature, we should disable it by a boot/kernel command line.
> > - Remove vmemmap_pgtable_{init, deposit, withdraw}() helper functions.
> > - Initialize page table lock for vmemmap through core_initcall mechanism.
> >
> > Thanks for Mike and Oscar's suggestions.
> >
> > Changelog in v2 -> v3:
> > - Rename some helper functions. Thanks Mike.
> > - Rework some code. Thanks Mike and Oscar.
> > - Remap the tail vmemmap page with PAGE_KERNEL_RO instead of PAGE_KERNEL.
> > Thanks Matthew.
> > - Add some overhead analysis in the cover letter.
> > - Use vmemmap pmd table lock instead of a hugetlb specific global lock.
> >
> > Changelog in v1 -> v2:
> > - Fix do not call dissolve_compound_page in alloc_huge_page_vmemmap().
> > - Fix some typo and code style problems.
> > - Remove unused handle_vmemmap_fault().
> > - Merge some commits to one commit suggested by Mike.
> >
> > Muchun Song (12):
> > mm: memory_hotplug: factor out bootmem core functions to
> > bootmem_info.c
> > mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
> > mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
> > mm: hugetlb: defer freeing of HugeTLB pages
> > mm: hugetlb: allocate the vmemmap pages associated with each HugeTLB
> > page
> > mm: hugetlb: set the PageHWPoison to the raw error page
> > mm: hugetlb: flush work when dissolving a HugeTLB page
> > mm: hugetlb: introduce PageHugeInflight
> > mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap
> > mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate
> > mm: hugetlb: gather discrete indexes of tail page
> > mm: hugetlb: optimize the code with the help of the compiler
> >
> > Documentation/admin-guide/kernel-parameters.txt | 14 ++
> > Documentation/admin-guide/mm/hugetlbpage.rst | 3 +
> > arch/x86/mm/init_64.c | 13 +-
> > fs/Kconfig | 18 ++
> > include/linux/bootmem_info.h | 65 ++++++
> > include/linux/hugetlb.h | 37 ++++
> > include/linux/hugetlb_cgroup.h | 15 +-
> > include/linux/memory_hotplug.h | 27 ---
> > include/linux/mm.h | 5 +
> > mm/Makefile | 2 +
> > mm/bootmem_info.c | 124 +++++++++++
> > mm/hugetlb.c | 218 +++++++++++++++++--
> > mm/hugetlb_vmemmap.c | 278 ++++++++++++++++++++++++
> > mm/hugetlb_vmemmap.h | 45 ++++
> > mm/memory_hotplug.c | 116 ----------
> > mm/sparse-vmemmap.c | 273 +++++++++++++++++++++++
> > mm/sparse.c | 1 +
> > 17 files changed, 1082 insertions(+), 172 deletions(-)
> > create mode 100644 include/linux/bootmem_info.h
> > create mode 100644 mm/bootmem_info.c
> > create mode 100644 mm/hugetlb_vmemmap.c
> > create mode 100644 mm/hugetlb_vmemmap.h
> >
> > --
> > 2.11.0
> >
>
--
Oscar Salvador
SUSE L3
On Wed, Jan 20, 2021 at 9:10 PM Oscar Salvador <[email protected]> wrote:
>
> On Wed, Jan 20, 2021 at 08:52:50PM +0800, Muchun Song wrote:
> > Hi Oscar and Mike,
> >
> > Any suggestions about this version? Looking forward to your
> > review. Thanks a lot.
>
> Hi Muchun,
>
> I plan to keep reviewing it in the coming days (tomorrow or Friday).
> I glanced over patch#3 when you posted the series and nothing stuck out besides
> what you have already pointed out, but I will have a further look.
OK. Thanks :)
>
> thanks
>
>
>
> >
> > >
> >
>
> --
> Oscar Salvador
> SUSE L3
On Sun, Jan 17, 2021 at 11:12 PM Muchun Song <[email protected]> wrote:
>
> Hi all,
>
> This patch series frees some vmemmap pages (struct page structures)
> associated with each HugeTLB page when it is preallocated, to save memory.
>
> To reduce the difficulty of reviewing the first version of the code, from
> this version on we disable PMD/huge page mapping of vmemmap when this
> feature is enabled. This actually eliminates a lot of the complex code
> doing page table manipulation. Once this patch series is solid, we can add
> the vmemmap page table manipulation code in the future.
>
> The struct page structures (page structs) are used to describe a physical
> page frame. By default, there is a one-to-one mapping from a page frame to
> its corresponding page struct.
>
> HugeTLB pages consist of multiple base page size pages and are supported
> by many architectures. See hugetlbpage.rst in the Documentation directory
> for more details. On the x86 architecture, HugeTLB pages of size 2MB and 1GB
> are currently supported. Since the base page size on x86 is 4KB, a 2MB
> HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
> 262144 base pages. For each base page, there is a corresponding page struct.
>
> Within the HugeTLB subsystem, only the first 4 page structs are used to
> contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
> provides this upper limit. The only 'useful' information in the remaining
> page structs is the compound_head field, and this field is the same for all
> tail pages.
>
> By removing redundant page structs for HugeTLB pages, memory can be returned to
> the buddy allocator for other uses.
>
> When the system boots up, every 2MB HugeTLB page has 512 struct page structs,
> which occupy 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
>
> HugeTLB struct pages(8 pages) page frame(8 pages)
> +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
> | | | 0 | -------------> | 0 |
> | | +-----------+ +-----------+
> | | | 1 | -------------> | 1 |
> | | +-----------+ +-----------+
> | | | 2 | -------------> | 2 |
> | | +-----------+ +-----------+
> | | | 3 | -------------> | 3 |
> | | +-----------+ +-----------+
> | | | 4 | -------------> | 4 |
> | 2MB | +-----------+ +-----------+
> | | | 5 | -------------> | 5 |
> | | +-----------+ +-----------+
> | | | 6 | -------------> | 6 |
> | | +-----------+ +-----------+
> | | | 7 | -------------> | 7 |
> | | +-----------+ +-----------+
> | |
> | |
> | |
> +-----------+
>
> The value of page->compound_head is the same for all tail pages. The first
> page of page structs (page 0) associated with the HugeTLB page contains the 4
> page structs necessary to describe the HugeTLB. The only use of the remaining
> pages of page structs (page 1 to page 7) is to point to page->compound_head.
> Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
> will be used for each HugeTLB page. This will allow us to free the remaining
> 6 pages to the buddy allocator.
>
> Here is how things look after remapping.
>
> HugeTLB struct pages(8 pages) page frame(8 pages)
> +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
> | | | 0 | -------------> | 0 |
> | | +-----------+ +-----------+
> | | | 1 | -------------> | 1 |
> | | +-----------+ +-----------+
> | | | 2 | ----------------^ ^ ^ ^ ^ ^
> | | +-----------+ | | | | |
> | | | 3 | ------------------+ | | | |
> | | +-----------+ | | | |
> | | | 4 | --------------------+ | | |
> | 2MB | +-----------+ | | |
> | | | 5 | ----------------------+ | |
> | | +-----------+ | |
> | | | 6 | ------------------------+ |
> | | +-----------+ |
> | | | 7 | --------------------------+
> | | +-----------+
> | |
> | |
> | |
> +-----------+
>
> When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
> vmemmap pages and restore the previous mapping relationship.
>
> Apart from the 2MB HugeTLB page, we also have the 1GB HugeTLB page. It is
> similar to the 2MB HugeTLB page, so we can use the same approach to free
> its vmemmap pages.
>
> In this case, for the 1GB HugeTLB page, we can save 4094 pages. This is a
> very substantial gain. On our servers, we run some SPDK/QEMU applications
> that use 1024GB of HugeTLB pages. With this feature enabled, we can save
> ~16GB (1GB hugepages) or ~12GB (2MB hugepages) of memory.
>
> Because the vmemmap page tables are reconstructed on the freeing/allocating
> path, this adds some overhead. Here is some overhead analysis.
>
> 1) Allocating 10240 2MB hugetlb pages.
>
> a) With this patch series applied:
> # time echo 10240 > /proc/sys/vm/nr_hugepages
>
> real 0m0.166s
> user 0m0.000s
> sys 0m0.166s
>
> # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [8K, 16K) 8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K) 1868 |@@@@@@@@@@@ |
> [32K, 64K) 10 | |
> [64K, 128K) 2 | |
>
> b) Without this patch series:
> # time echo 10240 > /proc/sys/vm/nr_hugepages
>
> real 0m0.066s
> user 0m0.000s
> sys 0m0.066s
>
> # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [4K, 8K) 10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K) 62 | |
> [16K, 32K) 2 | |
>
> Summary: allocation with this feature is about 2x slower than before.
>
> 2) Freeing 10240 2MB hugetlb pages.
>
> a) With this patch series applied:
> # time echo 0 > /proc/sys/vm/nr_hugepages
>
> real 0m0.004s
> user 0m0.000s
> sys 0m0.002s
>
> # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [16K, 32K) 10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> b) Without this patch series:
> # time echo 0 > /proc/sys/vm/nr_hugepages
>
> real 0m0.077s
> user 0m0.001s
> sys 0m0.075s
>
> # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [4K, 8K) 9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K) 287 |@ |
> [16K, 32K) 3 | |
>
> Summary: __free_hugepage is about 2-4x slower than before.
> But judging by the allocation test above, I think that it is
> also about 2x slower here.
>
> But why is the 'real' time with the patches smaller than before? Because
> in this patch series, freeing a hugetlb page is asynchronous (through a
> kworker).
>
> Although the overhead has increased, the overhead is not significant. Like Mike
> said, "However, remember that the majority of use cases create hugetlb pages at
> or shortly after boot time and add them to the pool. So, additional overhead is
> at pool creation time. There is no change to 'normal run time' operations of
> getting a page from or returning a page to the pool (think page fault/unmap)".
>
> Todo:
> - Free all of the tail vmemmap pages
> Now for the 2MB HugeTLB page, we only free 6 vmemmap pages, but we really
> can free 7. In that case, 8 of the 512 struct page structures would have
> the PG_head flag set. This works if we adjust compound_head() slightly so
> that it returns the real head struct page when the parameter is a tail
> struct page with the PG_head flag set.
>
> To keep the code evolution route clear, this can be a separate patch
> after this patchset is solid.
>
> - Support for other architectures (e.g. aarch64).
> - Enable PMD/huge page mapping of vmemmap even if this feature was enabled.
>
> Changelog in v12 -> v13:
> - Remove VM_WARN_ON_PAGE macro.
> - Add more comments in vmemmap_pte_range() and vmemmap_remap_free().
>
> Thanks to Oscar and Mike's suggestions and review.
Hi Oscar and Mike,
Any suggestions about this version? Looking forward to your
review. Thanks a lot.
>
On 1/17/21 7:10 AM, Muchun Song wrote:
> Every HugeTLB page has more than one struct page structure. We __know__ that
> we only use the first 4 (HUGETLB_CGROUP_MIN_ORDER) struct page structures
> to store metadata associated with each HugeTLB page.
>
> There are a lot of struct page structures associated with each HugeTLB
> page. For tail pages, the value of compound_head is the same. So we can
> reuse the first page of tail page structures. We map the virtual addresses
> of the remaining pages of tail page structures to the first tail page
> struct, and then free these page frames. Therefore, we need to reserve
> two pages as vmemmap areas.
>
> When we allocate a HugeTLB page from the buddy, we can free some vmemmap
> pages associated with each HugeTLB page. It is more appropriate to do it
> in prep_new_huge_page().
>
> The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
> pages associated with a HugeTLB page can be freed, returns zero for
> now, which means the feature is disabled. We will enable it once all
> the infrastructure is there.
>
> Signed-off-by: Muchun Song <[email protected]>
> ---
> include/linux/bootmem_info.h | 27 +++++-
> include/linux/mm.h | 3 +
> mm/Makefile | 1 +
> mm/hugetlb.c | 3 +
> mm/hugetlb_vmemmap.c | 211 +++++++++++++++++++++++++++++++++++++++++++
> mm/hugetlb_vmemmap.h | 20 ++++
> mm/sparse-vmemmap.c | 198 ++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 462 insertions(+), 1 deletion(-)
> create mode 100644 mm/hugetlb_vmemmap.c
> create mode 100644 mm/hugetlb_vmemmap.h
Thank you for the continued updates! Just some comments below.
I am hoping that others can take a look so we can move this forward.
I do not see any obvious issues.
> diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
> index 4ed6dee1adc9..ec03a624dfa2 100644
> --- a/include/linux/bootmem_info.h
> +++ b/include/linux/bootmem_info.h
> @@ -2,7 +2,7 @@
> #ifndef __LINUX_BOOTMEM_INFO_H
> #define __LINUX_BOOTMEM_INFO_H
>
> -#include <linux/mmzone.h>
> +#include <linux/mm.h>
>
> /*
> * Types for free bootmem stored in page->lru.next. These have to be in
> @@ -22,6 +22,27 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
> void get_page_bootmem(unsigned long info, struct page *page,
> unsigned long type);
> void put_page_bootmem(struct page *page);
> +
> +/*
> + * Any memory allocated via the memblock allocator and not via the
> + * buddy will be marked reserved already in the memmap. For those
> + * pages, we can call this function to free it to buddy allocator.
> + */
> +static inline void free_bootmem_page(struct page *page)
> +{
> + unsigned long magic = (unsigned long)page->freelist;
> +
> + /*
> + * The reserve_bootmem_region sets the reserved flag on bootmem
> + * pages.
> + */
> + VM_BUG_ON_PAGE(page_ref_count(page) != 2, page);
> +
> + if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
> + put_page_bootmem(page);
> + else
> + VM_BUG_ON_PAGE(1, page);
> +}
> #else
> static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
> {
> @@ -35,6 +56,10 @@ static inline void get_page_bootmem(unsigned long info, struct page *page,
> unsigned long type)
> {
> }
> +
> +static inline void free_bootmem_page(struct page *page)
> +{
> +}
> #endif
>
> #endif /* __LINUX_BOOTMEM_INFO_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index eabe7d9f80d8..f928994ed273 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3005,6 +3005,9 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
> }
> #endif
>
> +void vmemmap_remap_free(unsigned long start, unsigned long end,
> + unsigned long reuse);
> +
> void *sparse_buffer_alloc(unsigned long size);
> struct page * __populate_section_memmap(unsigned long pfn,
> unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
> diff --git a/mm/Makefile b/mm/Makefile
> index ed4b88fa0f5e..056801d8daae 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -71,6 +71,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
> obj-$(CONFIG_ZSWAP) += zswap.o
> obj-$(CONFIG_HAS_DMA) += dmapool.o
> obj-$(CONFIG_HUGETLBFS) += hugetlb.o
> +obj-$(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP) += hugetlb_vmemmap.o
> obj-$(CONFIG_NUMA) += mempolicy.o
> obj-$(CONFIG_SPARSEMEM) += sparse.o
> obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1f3bf1710b66..140135fc8113 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -42,6 +42,7 @@
> #include <linux/userfaultfd_k.h>
> #include <linux/page_owner.h>
> #include "internal.h"
> +#include "hugetlb_vmemmap.h"
>
> int hugetlb_max_hstate __read_mostly;
> unsigned int default_hstate_idx;
> @@ -1497,6 +1498,8 @@ void free_huge_page(struct page *page)
>
> static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> {
> + free_huge_page_vmemmap(h, page);
> +
> INIT_LIST_HEAD(&page->lru);
> set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> set_hugetlb_cgroup(page, NULL);
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> new file mode 100644
> index 000000000000..4ffa2a4ae2a8
> --- /dev/null
> +++ b/mm/hugetlb_vmemmap.c
> @@ -0,0 +1,211 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Free some vmemmap pages of HugeTLB
> + *
> + * Copyright (c) 2020, Bytedance. All rights reserved.
> + *
> + * Author: Muchun Song <[email protected]>
> + *
> + * The struct page structures (page structs) are used to describe a physical
> + * page frame. By default, there is a one-to-one mapping from a page frame to
> + * its corresponding page struct.
> + *
> + * HugeTLB pages consist of multiple base page size pages and are supported
> + * by many architectures. See hugetlbpage.rst in the Documentation directory
> + * for more details. On the x86-64 architecture, HugeTLB pages of size 2MB and
> + * 1GB are currently supported. Since the base page size on x86 is 4KB, a 2MB
> + * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
> + * 262144 base pages. For each base page, there is a corresponding page struct.
> + *
> + * Within the HugeTLB subsystem, only the first 4 page structs are used to
> + * contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
> + * provides this upper limit. The only 'useful' information in the remaining
> + * page structs is the compound_head field, and this field is the same for all
> + * tail pages.
> + *
> + * By removing redundant page structs for HugeTLB pages, memory can be returned
> + * to the buddy allocator for other uses.
> + *
> + * Different architectures support different HugeTLB pages. For example, the
> + * following table is the HugeTLB page size supported by x86 and arm64
> + * architectures. Because arm64 supports 4k, 16k, and 64k base pages and
> + * supports contiguous entries, it supports many kinds of sizes of HugeTLB
> + * page.
> + *
> + * +--------------+-----------+-----------------------------------------------+
> + * | Architecture | Page Size | HugeTLB Page Size |
> + * +--------------+-----------+-----------+-----------+-----------+-----------+
> + * | x86-64 | 4KB | 2MB | 1GB | | |
> + * +--------------+-----------+-----------+-----------+-----------+-----------+
> + * | | 4KB | 64KB | 2MB | 32MB | 1GB |
> + * | +-----------+-----------+-----------+-----------+-----------+
> + * | arm64 | 16KB | 2MB | 32MB | 1GB | |
> + * | +-----------+-----------+-----------+-----------+-----------+
> + * | | 64KB | 2MB | 512MB | 16GB | |
> + * +--------------+-----------+-----------+-----------+-----------+-----------+
> + *
> + * When the system boots up, every HugeTLB page has more than one struct page
> + * struct, whose total size is (unit: pages):
> + *
> + * struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> + *
> + * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
> + * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
> + * relationship.
> + *
> + * HugeTLB_Size = n * PAGE_SIZE
> + *
> + * Then,
> + *
> + * struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> + * = n * sizeof(struct page) / PAGE_SIZE
> + *
> + * We can use huge mapping at the pud/pmd level for the HugeTLB page.
> + *
> + * For the HugeTLB page of the pmd level mapping, then
> + *
> + * struct_size = n * sizeof(struct page) / PAGE_SIZE
> + * = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
> + * = sizeof(struct page) / sizeof(pte_t)
> + * = 64 / 8
> + * = 8 (pages)
> + *
> + * Where n is how many pte entries one page can contain. So the value of
> + * n is (PAGE_SIZE / sizeof(pte_t)).
> + *
> + * This optimization only supports 64-bit systems, so sizeof(pte_t) is 8.
> + * This optimization is also applicable only when the size of struct page is
> + * a power of two. In most cases, the size of struct page is 64 bytes (e.g.
> + * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
> + * struct page structs occupy 8 pages, whose size depends on the size of the
> + * base page.
> + *
> + * For the HugeTLB page of the pud level mapping, then
> + *
> + * struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
> + * = PAGE_SIZE / 8 * 8 (pages)
> + * = PAGE_SIZE (pages)
> + *
> + * Where the struct_size(pmd) is the size of the struct page structs of a
> + * HugeTLB page of the pmd level mapping.
> + *
> + * Next, we take the pmd level mapping of the HugeTLB page as an example to
> + * show the internal implementation of this optimization. There are 8 pages
> + * struct page structs associated with a HugeTLB page which is pmd mapped.
> + *
> + * Here is how things look before optimization.
> + *
> + * HugeTLB struct pages(8 pages) page frame(8 pages)
> + * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
> + * | | | 0 | -------------> | 0 |
> + * | | +-----------+ +-----------+
> + * | | | 1 | -------------> | 1 |
> + * | | +-----------+ +-----------+
> + * | | | 2 | -------------> | 2 |
> + * | | +-----------+ +-----------+
> + * | | | 3 | -------------> | 3 |
> + * | | +-----------+ +-----------+
> + * | | | 4 | -------------> | 4 |
> + * | PMD | +-----------+ +-----------+
> + * | level | | 5 | -------------> | 5 |
> + * | mapping | +-----------+ +-----------+
> + * | | | 6 | -------------> | 6 |
> + * | | +-----------+ +-----------+
> + * | | | 7 | -------------> | 7 |
> + * | | +-----------+ +-----------+
> + * | |
> + * | |
> + * | |
> + * +-----------+
> + *
> + * The value of page->compound_head is the same for all tail pages. The first
> + * page of page structs (page 0) associated with the HugeTLB page contains the 4
> + * page structs necessary to describe the HugeTLB. The only use of the remaining
> + * pages of page structs (page 1 to page 7) is to point to page->compound_head.
> + * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
> + * will be used for each HugeTLB page. This will allow us to free the remaining
> + * 6 pages to the buddy allocator.
> + *
> + * Here is how things look after remapping.
> + *
> + * HugeTLB struct pages(8 pages) page frame(8 pages)
> + * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
> + * | | | 0 | -------------> | 0 |
> + * | | +-----------+ +-----------+
> + * | | | 1 | -------------> | 1 |
> + * | | +-----------+ +-----------+
> + * | | | 2 | ----------------^ ^ ^ ^ ^ ^
> + * | | +-----------+ | | | | |
> + * | | | 3 | ------------------+ | | | |
> + * | | +-----------+ | | | |
> + * | | | 4 | --------------------+ | | |
> + * | PMD | +-----------+ | | |
> + * | level | | 5 | ----------------------+ | |
> + * | mapping | +-----------+ | |
> + * | | | 6 | ------------------------+ |
> + * | | +-----------+ |
> + * | | | 7 | --------------------------+
> + * | | +-----------+
> + * | |
> + * | |
> + * | |
> + * +-----------+
> + *
> + * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
> + * vmemmap pages and restore the previous mapping relationship.
> + *
> + * The HugeTLB page of the pud level mapping is similar to the former. We can
> + * also use this approach to free (PAGE_SIZE - 2) vmemmap pages.
> + *
> + * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
> + * (e.g. aarch64) provide a contiguous bit in the translation table entries
> + * that hints to the MMU that the entry is one of a contiguous set of
> + * entries that can be cached in a single TLB entry.
> + *
> + * The contiguous bit is used to increase the mapping size at the pmd and pte
> + * (last) level. So this type of HugeTLB page can be optimized only when its
> + * size of the struct page structs is greater than 2 pages.
> + */
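As a quick sanity check of the arithmetic in the comment above, here is a
standalone userspace sketch (not kernel code). The SKETCH_* constants and both
helper names are made up for illustration. They assume 4KB base pages and a
64-byte struct page, and SKETCH_RESERVE_NR simply mirrors RESERVE_VMEMMAP_NR
from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for x86-64 with 4KB base pages; not read from kernel headers. */
#define SKETCH_PAGE_SIZE        4096UL
#define SKETCH_STRUCT_PAGE_SIZE 64UL	/* typical sizeof(struct page) */
#define SKETCH_RESERVE_NR       2UL	/* mirrors RESERVE_VMEMMAP_NR */

/* Number of vmemmap pages that hold the page structs of one HugeTLB page. */
static unsigned long sketch_vmemmap_pages(unsigned long hugepage_size)
{
	unsigned long nr_base_pages = hugepage_size / SKETCH_PAGE_SIZE;

	return nr_base_pages * SKETCH_STRUCT_PAGE_SIZE / SKETCH_PAGE_SIZE;
}

/* Number of those vmemmap pages that could be freed; 0 if there is no gain. */
static unsigned long sketch_free_vmemmap_pages(unsigned long hugepage_size)
{
	unsigned long total = sketch_vmemmap_pages(hugepage_size);

	return total > SKETCH_RESERVE_NR ? total - SKETCH_RESERVE_NR : 0;
}
```

Under these assumptions a 2MB page has 8 vmemmap pages (6 freeable), a 1GB
page has 4096 (4094 freeable), and a 64KB contiguous-PTE page has nothing to
free, which matches the "greater than 2 pages" condition in the comment.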
> +#include "hugetlb_vmemmap.h"
> +
> +/*
> + * There are a lot of struct page structures associated with each HugeTLB page.
> + * For tail pages, the value of compound_head is the same. So we can reuse the
> + * first page of tail page structures. We map the virtual addresses of the
> + * remaining pages of tail page structures to the first tail page struct, and
> + * then free these page frames. So we need to reserve two pages as vmemmap areas.
> + */
> +#define RESERVE_VMEMMAP_NR 2U
> +#define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> +
> +/*
> + * How many vmemmap pages associated with a HugeTLB page that can be freed
> + * to the buddy allocator.
> + *
> + * Todo: Returns zero for now, which means the feature is disabled. We will
> + * enable it once all the infrastructure is there.
> + */
> +static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
> +{
> + return 0;
> +}
> +
> +static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
> +{
> + return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
> +}
> +
> +void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> +{
> + unsigned long vmemmap_addr = (unsigned long)head;
> + unsigned long vmemmap_end, vmemmap_reuse;
> +
> + if (!free_vmemmap_pages_per_hpage(h))
> + return;
> +
> + vmemmap_addr += RESERVE_VMEMMAP_SIZE;
> + vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
> + vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
> +
> + vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
> +}
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> new file mode 100644
> index 000000000000..6923f03534d5
> --- /dev/null
> +++ b/mm/hugetlb_vmemmap.h
> @@ -0,0 +1,20 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Free some vmemmap pages of HugeTLB
> + *
> + * Copyright (c) 2020, Bytedance. All rights reserved.
> + *
> + * Author: Muchun Song <[email protected]>
> + */
> +#ifndef _LINUX_HUGETLB_VMEMMAP_H
> +#define _LINUX_HUGETLB_VMEMMAP_H
> +#include <linux/hugetlb.h>
> +
> +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
> +void free_huge_page_vmemmap(struct hstate *h, struct page *head);
> +#else
> +static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> +{
> +}
> +#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
> +#endif /* _LINUX_HUGETLB_VMEMMAP_H */
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 16183d85a7d5..ce4be1fa93c2 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -27,8 +27,206 @@
> #include <linux/spinlock.h>
> #include <linux/vmalloc.h>
> #include <linux/sched.h>
> +#include <linux/pgtable.h>
> +#include <linux/bootmem_info.h>
> +
> #include <asm/dma.h>
> #include <asm/pgalloc.h>
> +#include <asm/tlbflush.h>
> +
We made the decision to disable PMD mapping of the vmemmap if this feature
is enabled. However, that is not until later in the series. And, the code
which disables PMD mapping is done in arch specific init code. So, a reader
of this new code in sparse-vmemmap.c might not be aware of this. But, the
code below depends on vmemmap being base page mapped.

I know your plan is to perhaps remove this restriction in the future.
Perhaps we should have a big comment in the code (?and commit message?)
noting that this is designed to only work with base page mappings so that
people do not get confused?
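
To illustrate why the walk needs one PTE per vmemmap page, here is a toy
userspace model (not kernel code). The "page table" is just an array of frame
indices, and struct sketch_walk and sketch_remap() are hypothetical names that
only loosely mirror vmemmap_remap_walk and the PTE loop in the patch. With a
PMD-mapped vmemmap there would be a single entry for the whole range and
nothing to retarget individually.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_VMEMMAP_PAGES 8	/* vmemmap pages per 2MB HugeTLB page */

/*
 * Toy walk state. pte[i] in the caller's array holds the index of the page
 * frame backing vmemmap page i; real PTEs are of course not plain integers.
 */
struct sketch_walk {
	int reuse_frame;			/* frame shared by the tail pages */
	int freed[SKETCH_NR_VMEMMAP_PAGES];	/* frames that became free */
	size_t nr_freed;
};

/*
 * Point every entry after reuse_idx at the frame behind reuse_idx and
 * record the frames that are no longer referenced.
 */
static void sketch_remap(int *pte, size_t reuse_idx, size_t nr,
			 struct sketch_walk *walk)
{
	size_t i;

	assert(nr <= SKETCH_NR_VMEMMAP_PAGES && reuse_idx < nr);

	walk->reuse_frame = pte[reuse_idx];
	walk->nr_freed = 0;

	for (i = reuse_idx + 1; i < nr; i++) {
		walk->freed[walk->nr_freed++] = pte[i];
		pte[i] = walk->reuse_frame;
	}
}
```

Starting from an identity mapping of 8 entries and reuse_idx = 1, entries 2 to
7 end up pointing at frame 1 and frames 2 to 7 land on the freed list, which
is the "after remapping" picture from the big comment.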
> +/**
> + * vmemmap_remap_walk - walk vmemmap page table
> + *
> + * @remap_pte: called for each non-empty PTE (lowest-level) entry.
> + * @reuse_page: the page which is reused for the tail vmemmap pages.
> + * @reuse_addr: the virtual address of the @reuse_page page.
> + * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
> + */
> +struct vmemmap_remap_walk {
> + void (*remap_pte)(pte_t *pte, unsigned long addr,
> + struct vmemmap_remap_walk *walk);
> + struct page *reuse_page;
> + unsigned long reuse_addr;
> + struct list_head *vmemmap_pages;
> +};
> +
> +static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + pte_t *pte;
> +
> + pte = pte_offset_kernel(pmd, addr);
> +
> + /*
> + * The reuse_page is found 'first' in table walk before we start
> + * remapping (which is calling @walk->remap_pte).
> + */
> + if (walk->reuse_addr == addr) {
> + BUG_ON(pte_none(*pte));
> +
> + walk->reuse_page = pte_page(*pte++);
> + /*
> + * Because the reuse address is part of the range that we are
> + * walking, skip the reuse address range.
> + */
> + addr += PAGE_SIZE;
> + }
> +
> + for (; addr != end; addr += PAGE_SIZE, pte++) {
> + BUG_ON(pte_none(*pte));
> +
> + walk->remap_pte(pte, addr, walk);
> + }
> +}
> +
> +static void vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> + unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + pmd_t *pmd;
> + unsigned long next;
> +
> + pmd = pmd_offset(pud, addr);
> + do {
> + BUG_ON(pmd_none(*pmd));
> +
> + next = pmd_addr_end(addr, end);
> + vmemmap_pte_range(pmd, addr, next, walk);
> + } while (pmd++, addr = next, addr != end);
> +}
> +
> +static void vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
> + unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + pud_t *pud;
> + unsigned long next;
> +
> + pud = pud_offset(p4d, addr);
> + do {
> + BUG_ON(pud_none(*pud));
> +
> + next = pud_addr_end(addr, end);
> + vmemmap_pmd_range(pud, addr, next, walk);
> + } while (pud++, addr = next, addr != end);
> +}
> +
> +static void vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
> + unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + p4d_t *p4d;
> + unsigned long next;
> +
> + p4d = p4d_offset(pgd, addr);
> + do {
> + BUG_ON(p4d_none(*p4d));
> +
> + next = p4d_addr_end(addr, end);
> + vmemmap_pud_range(p4d, addr, next, walk);
> + } while (p4d++, addr = next, addr != end);
> +}
> +
> +static void vmemmap_remap_range(unsigned long start, unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + unsigned long addr = start;
> + unsigned long next;
> + pgd_t *pgd;
> +
> + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
> + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
> +
> + pgd = pgd_offset_k(addr);
> + do {
> + BUG_ON(pgd_none(*pgd));
> +
> + next = pgd_addr_end(addr, end);
> + vmemmap_p4d_range(pgd, addr, next, walk);
> + } while (pgd++, addr = next, addr != end);
> +
> + /*
> + * We do not change the mapping of the vmemmap virtual address range
> + * [@start, @start + PAGE_SIZE), which belongs to the reuse range, so
> + * that part of the range does not need a TLB flush.
> + */
> + flush_tlb_kernel_range(start - PAGE_SIZE, end);
> +}
> +
> +/*
> + * Free a vmemmap page. A vmemmap page can be allocated from the memblock
> + * allocator or buddy allocator. If the PG_reserved flag is set, it means
> + * that it was allocated from the memblock allocator, so just free it via
> + * free_bootmem_page(). Otherwise, use __free_page().
> + */
> +static inline void free_vmemmap_page(struct page *page)
> +{
> + if (PageReserved(page))
> + free_bootmem_page(page);
> + else
> + __free_page(page);
> +}
> +
> +/* Free a list of the vmemmap pages */
> +static void free_vmemmap_page_list(struct list_head *list)
> +{
> + struct page *page, *next;
> +
> + list_for_each_entry_safe(page, next, list, lru) {
> + list_del(&page->lru);
> + free_vmemmap_page(page);
> + }
> +}
> +
> +static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> + struct vmemmap_remap_walk *walk)
> +{
> + /*
> + * Remap the tail pages as read-only to catch illegal write operation
> + * to the tail pages.
> + */
> + pgprot_t pgprot = PAGE_KERNEL_RO;
> + pte_t entry = mk_pte(walk->reuse_page, pgprot);
> + struct page *page = pte_page(*pte);
> +
> + list_add(&page->lru, walk->vmemmap_pages);
> + set_pte_at(&init_mm, addr, pte, entry);
> +}
> +
> +/**
> + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> + * to the page which @reuse is mapped, then free vmemmap
> + * pages.
> + * @start: start address of the vmemmap virtual address range.
> + * @end: end address of the vmemmap virtual address range.
> + * @reuse: reuse address.
> + */
> +void vmemmap_remap_free(unsigned long start, unsigned long end,
> + unsigned long reuse)
> +{
> + LIST_HEAD(vmemmap_pages);
> + struct vmemmap_remap_walk walk = {
> + .remap_pte = vmemmap_remap_pte,
> + .reuse_addr = reuse,
> + .vmemmap_pages = &vmemmap_pages,
> + };
> +
> + /*
> + * In order to make remapping routine most efficient for the huge pages,
> + * the routine of vmemmap page table walking has the following rules
> + * (see more details from the vmemmap_pte_range()):
> + *
> + * - The @reuse address is part of the range that we are walking.
> + * - The @reuse address is the first in the complete range.
> + *
> + * So we need to make sure that @start and @reuse meet the above rules.
> + */

Thanks for adding this comment.

For now this code only works for huge pages. We need to make sure that is
clear to reviewers and people just reading the code.

--
Mike Kravetz

> + BUG_ON(start - reuse != PAGE_SIZE);
> +
> + vmemmap_remap_range(reuse, end, &walk);
> + free_vmemmap_page_list(&vmemmap_pages);
> +}
>
> /*
> * Allocate a block of memory to be used to back the virtual memory map
>
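
For readers following the two rules in the comment above, the address
arithmetic of free_huge_page_vmemmap() can be checked with a small userspace
sketch (not kernel code). All SKETCH_* names, struct sketch_range and both
helpers are hypothetical; a 4KB base page is assumed.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12UL
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_RESERVE_NR 2UL	/* mirrors RESERVE_VMEMMAP_NR */

struct sketch_range {
	unsigned long start;	/* first vmemmap address to remap */
	unsigned long end;	/* one past the last address to remap */
	unsigned long reuse;	/* address whose page frame is reused */
};

/* Derive the three addresses from the head page struct address. */
static struct sketch_range sketch_vmemmap_range(unsigned long head,
						unsigned long nr_free)
{
	struct sketch_range r;

	r.start = head + (SKETCH_RESERVE_NR << SKETCH_PAGE_SHIFT);
	r.end = r.start + (nr_free << SKETCH_PAGE_SHIFT);
	r.reuse = r.start - SKETCH_PAGE_SIZE;
	return r;
}

/* Models the precondition that vmemmap_remap_free() checks with BUG_ON(). */
static int sketch_range_is_valid(const struct sketch_range *r)
{
	return r->start - r->reuse == SKETCH_PAGE_SIZE && r->end >= r->start;
}
```

With head at 0x100000 and 6 freeable pages this gives reuse = 0x101000,
start = 0x102000 and end = 0x108000, so the reuse page sits immediately
before the range and is the first address visited by the walk.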
On Sat, Jan 23, 2021 at 9:00 AM Mike Kravetz <[email protected]> wrote:
>
> On 1/17/21 7:10 AM, Muchun Song wrote:
> > Every HugeTLB has more than one struct page structure. We __know__ that
> > we only use the first 4(HUGETLB_CGROUP_MIN_ORDER) struct page structures
> > to store metadata associated with each HugeTLB.
> >
> > There are a lot of struct page structures associated with each HugeTLB
> > page. For tail pages, the value of compound_head is the same. So we can
> > reuse the first page of tail page structures. We map the virtual addresses
> > of the remaining pages of tail page structures to the first tail page
> > struct, and then free these page frames. Therefore, we need to reserve
> > two pages as vmemmap areas.
> >
> > When we allocate a HugeTLB page from the buddy, we can free some vmemmap
> > pages associated with each HugeTLB page. It is more appropriate to do it
> > in the prep_new_huge_page().
> >
> > The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
> > pages associated with a HugeTLB page can be freed, returns zero for
> > now, which means the feature is disabled. We will enable it once all
> > the infrastructure is there.
> >
> > Signed-off-by: Muchun Song <[email protected]>
> > ---
> > include/linux/bootmem_info.h | 27 +++++-
> > include/linux/mm.h | 3 +
> > mm/Makefile | 1 +
> > mm/hugetlb.c | 3 +
> > mm/hugetlb_vmemmap.c | 211 +++++++++++++++++++++++++++++++++++++++++++
> > mm/hugetlb_vmemmap.h | 20 ++++
> > mm/sparse-vmemmap.c | 198 ++++++++++++++++++++++++++++++++++++++++
> > 7 files changed, 462 insertions(+), 1 deletion(-)
> > create mode 100644 mm/hugetlb_vmemmap.c
> > create mode 100644 mm/hugetlb_vmemmap.h
>
> Thank you for the continued updates! Just some comments below.
> I am hoping that others can take a look so we can move this forward.
> I do not see any obvious issues.
Yeah, hope more reviewers will participate in this. :)
>
> > diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
> > index 4ed6dee1adc9..ec03a624dfa2 100644
> > --- a/include/linux/bootmem_info.h
> > +++ b/include/linux/bootmem_info.h
> > @@ -2,7 +2,7 @@
> > #ifndef __LINUX_BOOTMEM_INFO_H
> > #define __LINUX_BOOTMEM_INFO_H
> >
> > -#include <linux/mmzone.h>
> > +#include <linux/mm.h>
> >
> > /*
> > * Types for free bootmem stored in page->lru.next. These have to be in
> > @@ -22,6 +22,27 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
> > void get_page_bootmem(unsigned long info, struct page *page,
> > unsigned long type);
> > void put_page_bootmem(struct page *page);
> > +
> > +/*
> > + * Any memory allocated via the memblock allocator and not via the
> > + * buddy will be marked reserved already in the memmap. For those
> > + * pages, we can call this function to free it to buddy allocator.
> > + */
> > +static inline void free_bootmem_page(struct page *page)
> > +{
> > + unsigned long magic = (unsigned long)page->freelist;
> > +
> > + /*
> > + * The reserve_bootmem_region sets the reserved flag on bootmem
> > + * pages.
> > + */
> > + VM_BUG_ON_PAGE(page_ref_count(page) != 2, page);
> > +
> > + if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
> > + put_page_bootmem(page);
> > + else
> > + VM_BUG_ON_PAGE(1, page);
> > +}
> > #else
> > static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
> > {
> > @@ -35,6 +56,10 @@ static inline void get_page_bootmem(unsigned long info, struct page *page,
> > unsigned long type)
> > {
> > }
> > +
> > +static inline void free_bootmem_page(struct page *page)
> > +{
> > +}
> > #endif
> >
> > #endif /* __LINUX_BOOTMEM_INFO_H */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index eabe7d9f80d8..f928994ed273 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3005,6 +3005,9 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
> > }
> > #endif
> >
> > +void vmemmap_remap_free(unsigned long start, unsigned long end,
> > + unsigned long reuse);
> > +
> > void *sparse_buffer_alloc(unsigned long size);
> > struct page * __populate_section_memmap(unsigned long pfn,
> > unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
> > diff --git a/mm/Makefile b/mm/Makefile
> > index ed4b88fa0f5e..056801d8daae 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -71,6 +71,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
> > obj-$(CONFIG_ZSWAP) += zswap.o
> > obj-$(CONFIG_HAS_DMA) += dmapool.o
> > obj-$(CONFIG_HUGETLBFS) += hugetlb.o
> > +obj-$(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP) += hugetlb_vmemmap.o
> > obj-$(CONFIG_NUMA) += mempolicy.o
> > obj-$(CONFIG_SPARSEMEM) += sparse.o
> > obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 1f3bf1710b66..140135fc8113 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -42,6 +42,7 @@
> > #include <linux/userfaultfd_k.h>
> > #include <linux/page_owner.h>
> > #include "internal.h"
> > +#include "hugetlb_vmemmap.h"
> >
> > int hugetlb_max_hstate __read_mostly;
> > unsigned int default_hstate_idx;
> > @@ -1497,6 +1498,8 @@ void free_huge_page(struct page *page)
> >
> > static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> > {
> > + free_huge_page_vmemmap(h, page);
> > +
> > INIT_LIST_HEAD(&page->lru);
> > set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> > set_hugetlb_cgroup(page, NULL);
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > new file mode 100644
> > index 000000000000..4ffa2a4ae2a8
> > --- /dev/null
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -0,0 +1,211 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Free some vmemmap pages of HugeTLB
> > + *
> > + * Copyright (c) 2020, Bytedance. All rights reserved.
> > + *
> > + * Author: Muchun Song <[email protected]>
> > + *
> > + * The struct page structures (page structs) are used to describe a physical
> > + * page frame. By default, there is a one-to-one mapping from a page frame to
> > + * its corresponding page struct.
> > + *
> > + * HugeTLB pages consist of multiple base page size pages and are supported
> > + * by many architectures. See hugetlbpage.rst in the Documentation directory
> > + * for more details. On the x86-64 architecture, HugeTLB pages of size 2MB and
> > + * 1GB are currently supported. Since the base page size on x86 is 4KB, a 2MB
> > + * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
> > + * 262144 base pages. For each base page, there is a corresponding page struct.
> > + *
> > + * Within the HugeTLB subsystem, only the first 4 page structs are used to
> > + * contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
> > + * provides this upper limit. The only 'useful' information in the remaining
> > + * page structs is the compound_head field, and this field is the same for all
> > + * tail pages.
> > + *
> > + * By removing redundant page structs for HugeTLB pages, memory can be returned
> > + * to the buddy allocator for other uses.
> > + *
> > + * Different architectures support different HugeTLB pages. For example, the
> > + * following table lists the HugeTLB page sizes supported by the x86 and arm64
> > + * architectures. Because arm64 supports 4k, 16k, and 64k base pages and
> > + * supports contiguous entries, it supports many different sizes of HugeTLB
> > + * page.
> > + *
> > + * +--------------+-----------+-----------------------------------------------+
> > + * | Architecture | Page Size | HugeTLB Page Size |
> > + * +--------------+-----------+-----------+-----------+-----------+-----------+
> > + * | x86-64 | 4KB | 2MB | 1GB | | |
> > + * +--------------+-----------+-----------+-----------+-----------+-----------+
> > + * | | 4KB | 64KB | 2MB | 32MB | 1GB |
> > + * | +-----------+-----------+-----------+-----------+-----------+
> > + * | arm64 | 16KB | 2MB | 32MB | 1GB | |
> > + * | +-----------+-----------+-----------+-----------+-----------+
> > + * | | 64KB | 2MB | 512MB | 16GB | |
> > + * +--------------+-----------+-----------+-----------+-----------+-----------+
> > + *
> > + * When the system boots up, every HugeTLB page has more than one struct page
> > + * struct, whose total size is (unit: pages):
> > + *
> > + * struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> > + *
> > + * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
> > + * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
> > + * relationship.
> > + *
> > + * HugeTLB_Size = n * PAGE_SIZE
> > + *
> > + * Then,
> > + *
> > + * struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> > + * = n * sizeof(struct page) / PAGE_SIZE
> > + *
> > + * We can use huge mapping at the pud/pmd level for the HugeTLB page.
> > + *
> > + * For the HugeTLB page of the pmd level mapping, then
> > + *
> > + * struct_size = n * sizeof(struct page) / PAGE_SIZE
> > + * = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
> > + * = sizeof(struct page) / sizeof(pte_t)
> > + * = 64 / 8
> > + * = 8 (pages)
> > + *
> > + * Where n is how many pte entries one page can contain. So the value of
> > + * n is (PAGE_SIZE / sizeof(pte_t)).
> > + *
> > + * This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
> > + * is 8. This optimization is also applicable only when the size of struct page
> > + * is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
> > + * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
> > + * page structs occupy 8 pages, whose size depends on the size of the base
> > + * page.
> > + *
> > + * For the HugeTLB page of the pud level mapping, then
> > + *
> > + * struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
> > + * = PAGE_SIZE / 8 * 8 (pages)
> > + * = PAGE_SIZE (pages)
> > + *
> > + * Where the struct_size(pmd) is the size of the struct page structs of a
> > + * HugeTLB page of the pmd level mapping.
> > + *
> > + * Next, we take the pmd level mapping of the HugeTLB page as an example to
> > + * show the internal implementation of this optimization. There are 8 pages
> > + * struct page structs associated with a HugeTLB page which is pmd mapped.
> > + *
> > + * Here is how things look before optimization.
> > + *
> > + * HugeTLB struct pages(8 pages) page frame(8 pages)
> > + * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
> > + * | | | 0 | -------------> | 0 |
> > + * | | +-----------+ +-----------+
> > + * | | | 1 | -------------> | 1 |
> > + * | | +-----------+ +-----------+
> > + * | | | 2 | -------------> | 2 |
> > + * | | +-----------+ +-----------+
> > + * | | | 3 | -------------> | 3 |
> > + * | | +-----------+ +-----------+
> > + * | | | 4 | -------------> | 4 |
> > + * | PMD | +-----------+ +-----------+
> > + * | level | | 5 | -------------> | 5 |
> > + * | mapping | +-----------+ +-----------+
> > + * | | | 6 | -------------> | 6 |
> > + * | | +-----------+ +-----------+
> > + * | | | 7 | -------------> | 7 |
> > + * | | +-----------+ +-----------+
> > + * | |
> > + * | |
> > + * | |
> > + * +-----------+
> > + *
> > + * The value of page->compound_head is the same for all tail pages. The first
> > + * page of page structs (page 0) associated with the HugeTLB page contains the 4
> > + * page structs necessary to describe the HugeTLB. The only use of the remaining
> > + * pages of page structs (page 1 to page 7) is to point to page->compound_head.
> > + * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
> > + * will be used for each HugeTLB page. This will allow us to free the remaining
> > + * 6 pages to the buddy allocator.
> > + *
> > + * Here is how things look after remapping.
> > + *
> > + * HugeTLB struct pages(8 pages) page frame(8 pages)
> > + * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
> > + * | | | 0 | -------------> | 0 |
> > + * | | +-----------+ +-----------+
> > + * | | | 1 | -------------> | 1 |
> > + * | | +-----------+ +-----------+
> > + * | | | 2 | ----------------^ ^ ^ ^ ^ ^
> > + * | | +-----------+ | | | | |
> > + * | | | 3 | ------------------+ | | | |
> > + * | | +-----------+ | | | |
> > + * | | | 4 | --------------------+ | | |
> > + * | PMD | +-----------+ | | |
> > + * | level | | 5 | ----------------------+ | |
> > + * | mapping | +-----------+ | |
> > + * | | | 6 | ------------------------+ |
> > + * | | +-----------+ |
> > + * | | | 7 | --------------------------+
> > + * | | +-----------+
> > + * | |
> > + * | |
> > + * | |
> > + * +-----------+
> > + *
> > + * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
> > + * vmemmap pages and restore the previous mapping relationship.
> > + *
> > + * The HugeTLB page of the pud level mapping is similar to the former.
> > + * We can also use this approach to free (PAGE_SIZE - 2) vmemmap pages.
> > + *
> > + * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
> > + * (e.g. aarch64) provide a contiguous bit in the translation table entries
> > + * that hints to the MMU that the entry is one of a contiguous set of
> > + * entries that can be cached in a single TLB entry.
> > + *
> > + * The contiguous bit is used to increase the mapping size at the pmd and pte
> > + * (last) level. So this type of HugeTLB page can be optimized only when the
> > + * size of its struct page structs is greater than 2 pages.
> > + */
> > +#include "hugetlb_vmemmap.h"
> > +
> > +/*
> > + * There are a lot of struct page structures associated with each HugeTLB page.
> > + * For tail pages, the value of compound_head is the same. So we can reuse the
> > + * first page of tail page structures. We map the virtual addresses of the
> > + * remaining pages of tail page structures to the first tail page struct, and
> > + * then free these page frames. So we need to reserve two pages as vmemmap areas.
> > + */
> > +#define RESERVE_VMEMMAP_NR 2U
> > +#define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> > +
> > +/*
> > + * How many vmemmap pages associated with a HugeTLB page that can be freed
> > + * to the buddy allocator.
> > + *
> > + * Todo: Returns zero for now, which means the feature is disabled. We will
> > + * enable it once all the infrastructure is there.
> > + */
> > +static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
> > +{
> > + return 0;
> > +}
> > +
> > +static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
> > +{
> > + return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
> > +}
> > +
> > +void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> > +{
> > + unsigned long vmemmap_addr = (unsigned long)head;
> > + unsigned long vmemmap_end, vmemmap_reuse;
> > +
> > + if (!free_vmemmap_pages_per_hpage(h))
> > + return;
> > +
> > + vmemmap_addr += RESERVE_VMEMMAP_SIZE;
> > + vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
> > + vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
> > +
> > + vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
> > +}
> > diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> > new file mode 100644
> > index 000000000000..6923f03534d5
> > --- /dev/null
> > +++ b/mm/hugetlb_vmemmap.h
> > @@ -0,0 +1,20 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Free some vmemmap pages of HugeTLB
> > + *
> > + * Copyright (c) 2020, Bytedance. All rights reserved.
> > + *
> > + * Author: Muchun Song <[email protected]>
> > + */
> > +#ifndef _LINUX_HUGETLB_VMEMMAP_H
> > +#define _LINUX_HUGETLB_VMEMMAP_H
> > +#include <linux/hugetlb.h>
> > +
> > +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
> > +void free_huge_page_vmemmap(struct hstate *h, struct page *head);
> > +#else
> > +static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> > +{
> > +}
> > +#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
> > +#endif /* _LINUX_HUGETLB_VMEMMAP_H */
> > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > index 16183d85a7d5..ce4be1fa93c2 100644
> > --- a/mm/sparse-vmemmap.c
> > +++ b/mm/sparse-vmemmap.c
> > @@ -27,8 +27,206 @@
> > #include <linux/spinlock.h>
> > #include <linux/vmalloc.h>
> > #include <linux/sched.h>
> > +#include <linux/pgtable.h>
> > +#include <linux/bootmem_info.h>
> > +
> > #include <asm/dma.h>
> > #include <asm/pgalloc.h>
> > +#include <asm/tlbflush.h>
> > +
>
> We made the decision to disable PMD mapping of the vmemmap if this feature
> is enabled. However, that is not until later in the series. And, the code
> which disables PMD mapping is done in arch specific init code. So, a reader
> of this new code in sparse-vmemmap.c might not be aware of this. But, the
> code below depends on vmemmap being base page mapped.
>
> I know your plan is to perhaps remove this restriction in the future.
> Perhaps we should have a big comment in the code (?and commit message?)
> noting that this is designed to only work with base page mappings so that
> people do not get confused?
Agree. Will add some comments in the next version.
>
> > +/**
> > + * vmemmap_remap_walk - walk vmemmap page table
> > + *
> > + * @remap_pte: called for each non-empty PTE (lowest-level) entry.
> > + * @reuse_page: the page which is reused for the tail vmemmap pages.
> > + * @reuse_addr: the virtual address of the @reuse_page page.
> > + * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
> > + */
> > +struct vmemmap_remap_walk {
> > + void (*remap_pte)(pte_t *pte, unsigned long addr,
> > + struct vmemmap_remap_walk *walk);
> > + struct page *reuse_page;
> > + unsigned long reuse_addr;
> > + struct list_head *vmemmap_pages;
> > +};
> > +
> > +static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> > + unsigned long end,
> > + struct vmemmap_remap_walk *walk)
> > +{
> > + pte_t *pte;
> > +
> > + pte = pte_offset_kernel(pmd, addr);
> > +
> > + /*
> > + * The reuse_page is found 'first' in table walk before we start
> > + * remapping (which is calling @walk->remap_pte).
> > + */
> > + if (walk->reuse_addr == addr) {
> > + BUG_ON(pte_none(*pte));
> > +
> > + walk->reuse_page = pte_page(*pte++);
> > + /*
> > + * Because the reuse address is part of the range that we are
> > + * walking, skip the reuse address range.
> > + */
> > + addr += PAGE_SIZE;
> > + }
> > +
> > + for (; addr != end; addr += PAGE_SIZE, pte++) {
> > + BUG_ON(pte_none(*pte));
> > +
> > + walk->remap_pte(pte, addr, walk);
> > + }
> > +}
> > +
> > +static void vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> > + unsigned long end,
> > + struct vmemmap_remap_walk *walk)
> > +{
> > + pmd_t *pmd;
> > + unsigned long next;
> > +
> > + pmd = pmd_offset(pud, addr);
> > + do {
> > + BUG_ON(pmd_none(*pmd));
> > +
> > + next = pmd_addr_end(addr, end);
> > + vmemmap_pte_range(pmd, addr, next, walk);
> > + } while (pmd++, addr = next, addr != end);
> > +}
> > +
> > +static void vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
> > + unsigned long end,
> > + struct vmemmap_remap_walk *walk)
> > +{
> > + pud_t *pud;
> > + unsigned long next;
> > +
> > + pud = pud_offset(p4d, addr);
> > + do {
> > + BUG_ON(pud_none(*pud));
> > +
> > + next = pud_addr_end(addr, end);
> > + vmemmap_pmd_range(pud, addr, next, walk);
> > + } while (pud++, addr = next, addr != end);
> > +}
> > +
> > +static void vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
> > + unsigned long end,
> > + struct vmemmap_remap_walk *walk)
> > +{
> > + p4d_t *p4d;
> > + unsigned long next;
> > +
> > + p4d = p4d_offset(pgd, addr);
> > + do {
> > + BUG_ON(p4d_none(*p4d));
> > +
> > + next = p4d_addr_end(addr, end);
> > + vmemmap_pud_range(p4d, addr, next, walk);
> > + } while (p4d++, addr = next, addr != end);
> > +}
> > +
> > +static void vmemmap_remap_range(unsigned long start, unsigned long end,
> > + struct vmemmap_remap_walk *walk)
> > +{
> > + unsigned long addr = start;
> > + unsigned long next;
> > + pgd_t *pgd;
> > +
> > + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
> > + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
> > +
> > + pgd = pgd_offset_k(addr);
> > + do {
> > + BUG_ON(pgd_none(*pgd));
> > +
> > + next = pgd_addr_end(addr, end);
> > + vmemmap_p4d_range(pgd, addr, next, walk);
> > + } while (pgd++, addr = next, addr != end);
> > +
> > + /*
> > + * We do not change the mapping of the vmemmap virtual address range
> > + * [@start, @start + PAGE_SIZE), which belongs to the reuse range, so
> > + * that part of the range does not need a TLB flush.
> > + */
> > + flush_tlb_kernel_range(start - PAGE_SIZE, end);
> > +}
> > +
> > +/*
> > + * Free a vmemmap page. A vmemmap page can be allocated from the memblock
> > + * allocator or buddy allocator. If the PG_reserved flag is set, it means
> > + * that it was allocated from the memblock allocator, so just free it via
> > + * free_bootmem_page(). Otherwise, use __free_page().
> > + */
> > +static inline void free_vmemmap_page(struct page *page)
> > +{
> > + if (PageReserved(page))
> > + free_bootmem_page(page);
> > + else
> > + __free_page(page);
> > +}
> > +
> > +/* Free a list of the vmemmap pages */
> > +static void free_vmemmap_page_list(struct list_head *list)
> > +{
> > + struct page *page, *next;
> > +
> > + list_for_each_entry_safe(page, next, list, lru) {
> > + list_del(&page->lru);
> > + free_vmemmap_page(page);
> > + }
> > +}
> > +
> > +static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> > + struct vmemmap_remap_walk *walk)
> > +{
> > + /*
> > + * Remap the tail pages as read-only to catch illegal write operation
> > + * to the tail pages.
> > + */
> > + pgprot_t pgprot = PAGE_KERNEL_RO;
> > + pte_t entry = mk_pte(walk->reuse_page, pgprot);
> > + struct page *page = pte_page(*pte);
> > +
> > + list_add(&page->lru, walk->vmemmap_pages);
> > + set_pte_at(&init_mm, addr, pte, entry);
> > +}
> > +
> > +/**
> > + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> > + * to the page which @reuse is mapped, then free vmemmap
> > + * pages.
> > + * @start: start address of the vmemmap virtual address range.
> > + * @end: end address of the vmemmap virtual address range.
> > + * @reuse: reuse address.
> > + */
> > +void vmemmap_remap_free(unsigned long start, unsigned long end,
> > + unsigned long reuse)
> > +{
> > + LIST_HEAD(vmemmap_pages);
> > + struct vmemmap_remap_walk walk = {
> > + .remap_pte = vmemmap_remap_pte,
> > + .reuse_addr = reuse,
> > + .vmemmap_pages = &vmemmap_pages,
> > + };
> > +
> > + /*
> > + * In order to make remapping routine most efficient for the huge pages,
> > + * the routine of vmemmap page table walking has the following rules
> > + * (see more details from the vmemmap_pte_range()):
> > + *
> > + * - The @reuse address is part of the range that we are walking.
> > + * - The @reuse address is the first in the complete range.
> > + *
> > + * So we need to make sure that @start and @reuse meet the above rules.
> > + */
>
> Thanks for adding this comment.
>
> For now this code only works for huge pages. We need to make sure that is
> clear to reviewers and people just reading the code.
>
> --
> Mike Kravetz
>
> > + BUG_ON(start - reuse != PAGE_SIZE);
> > +
> > + vmemmap_remap_range(reuse, end, &walk);
> > + free_vmemmap_page_list(&vmemmap_pages);
> > +}
> >
> > /*
> > * Allocate a block of memory to be used to back the virtual memory map
> >
On Sun, Jan 17, 2021 at 11:10:44PM +0800, Muchun Song wrote:
> Every HugeTLB has more than one struct page structure. We __know__ that
> we only use the first 4(HUGETLB_CGROUP_MIN_ORDER) struct page structures
> to store metadata associated with each HugeTLB.
>
> There are a lot of struct page structures associated with each HugeTLB
> page. For tail pages, the value of compound_head is the same. So we can
> reuse first page of tail page structures. We map the virtual addresses
> of the remaining pages of tail page structures to the first tail page
> struct, and then free these page frames. Therefore, we need to reserve
> two pages as vmemmap areas.
>
> When we allocate a HugeTLB page from the buddy, we can free some vmemmap
> pages associated with each HugeTLB page. It is more appropriate to do it
> in the prep_new_huge_page().
>
> The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
> pages associated with a HugeTLB page can be freed, returns zero for
> now, which means the feature is disabled. We will enable it once all
> the infrastructure is there.
>
> Signed-off-by: Muchun Song <[email protected]>
Overall looks good to me.
A few nits below, plus what Mike has already said.
I was playing the other day (just for fun) to see how hard it would be to adapt
this to ppc64 but did not have the time :-)
> --- /dev/null
> +++ b/mm/hugetlb_vmemmap.c
> @@ -0,0 +1,211 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Free some vmemmap pages of HugeTLB
> + *
> + * Copyright (c) 2020, Bytedance. All rights reserved.
> + *
> + * Author: Muchun Song <[email protected]>
> + *
> + * The struct page structures (page structs) are used to describe a physical
> + * page frame. By default, there is a one-to-one mapping from a page frame to
> + * it's corresponding page struct.
> + *
> + * The HugeTLB pages consist of multiple base page size pages and is supported
"HugeTLB pages ..."
> + * When the system boot up, every HugeTLB page has more than one struct page
> + * structs whose size is (unit: pages):
^^^^ which?
> + *
> + * struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> + *
> + * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
> + * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
> + * relationship.
> + *
> + * HugeTLB_Size = n * PAGE_SIZE
> + *
> + * Then,
> + *
> + * struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> + * = n * sizeof(struct page) / PAGE_SIZE
> + *
> + * We can use huge mapping at the pud/pmd level for the HugeTLB page.
> + *
> + * For the HugeTLB page of the pmd level mapping, then
> + *
> + * struct_size = n * sizeof(struct page) / PAGE_SIZE
> + * = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
> + * = sizeof(struct page) / sizeof(pte_t)
> + * = 64 / 8
> + * = 8 (pages)
> + *
> + * Where n is how many pte entries which one page can contains. So the value of
> + * n is (PAGE_SIZE / sizeof(pte_t)).
> + *
> + * This optimization only supports 64-bit system, so the value of sizeof(pte_t)
> + * is 8. And this optimization also applicable only when the size of struct page
> + * is a power of two. In most cases, the size of struct page is 64 (e.g. x86-64
> + * and arm64). So if we use pmd level mapping for a HugeTLB page, the size of
> + * struct page structs of it is 8 pages whose size depends on the size of the
> + * base page.
> + *
> + * For the HugeTLB page of the pud level mapping, then
> + *
> + * struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
> + * = PAGE_SIZE / 8 * 8 (pages)
> + * = PAGE_SIZE (pages)
I would try to condense above information and focus on what are the
key points you want people to get.
E.g.: the struct pages of a 2MB HugeTLB page on x86_64 take up 8 page frames,
while those of a 1GB HugeTLB page take up 4096.
If you do not want to be that specific you can always write down the
formula, and maybe put the X86_64 example at the end.
But as I said, I would try to make it more brief.
Maybe others disagree though.
> + *
> + * Where the struct_size(pmd) is the size of the struct page structs of a
> + * HugeTLB page of the pmd level mapping.
[...]
> +void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> +{
> + unsigned long vmemmap_addr = (unsigned long)head;
> + unsigned long vmemmap_end, vmemmap_reuse;
> +
> + if (!free_vmemmap_pages_per_hpage(h))
> + return;
> +
> + vmemmap_addr += RESERVE_VMEMMAP_SIZE;
> + vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
> + vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
> +
I would like to see a comment there explaining why those variables get
the values they do.
> +/**
> + * vmemmap_remap_walk - walk vmemmap page table
> + *
> + * @remap_pte: called for each non-empty PTE (lowest-level) entry.
> + * @reuse_page: the page which is reused for the tail vmemmap pages.
> + * @reuse_addr: the virtual address of the @reuse_page page.
> + * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
Let us align the tabs there.
> +static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + pte_t *pte;
> +
> + pte = pte_offset_kernel(pmd, addr);
> +
> + /*
> + * The reuse_page is found 'first' in table walk before we start
> + * remapping (which is calling @walk->remap_pte).
> + */
> + if (walk->reuse_addr == addr) {
> + BUG_ON(pte_none(*pte));
If it is found first, would not

	if (!walk->reuse_page) {
		BUG_ON(walk->reuse_addr != addr);
		...
	}

be more intuitive?
> +static void vmemmap_remap_range(unsigned long start, unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + unsigned long addr = start;
> + unsigned long next;
> + pgd_t *pgd;
> +
> + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
> + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
> +
> + pgd = pgd_offset_k(addr);
> + do {
> + BUG_ON(pgd_none(*pgd));
> +
> + next = pgd_addr_end(addr, end);
> + vmemmap_p4d_range(pgd, addr, next, walk);
> + } while (pgd++, addr = next, addr != end);
> +
> + /*
> + * We do not change the mapping of the vmemmap virtual address range
> + * [@start, @start + PAGE_SIZE) which is belong to the reuse range.
"which belongs to"
> + * So we not need to flush the TLB.
> + */
> + flush_tlb_kernel_range(start - PAGE_SIZE, end);
You already commented on this one.
> +/**
> + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> + * to the page which @reuse is mapped, then free vmemmap
> + * pages.
> + * @start: start address of the vmemmap virtual address range.
Well, it is the start address of the range we want to remap.
Reading it made me think that it is really the __start__ address
of the vmemmap range.
> +void vmemmap_remap_free(unsigned long start, unsigned long end,
> + unsigned long reuse)
> +{
> + LIST_HEAD(vmemmap_pages);
> + struct vmemmap_remap_walk walk = {
> + .remap_pte = vmemmap_remap_pte,
> + .reuse_addr = reuse,
> + .vmemmap_pages = &vmemmap_pages,
> + };
> +
> + /*
> + * In order to make remapping routine most efficient for the huge pages,
> + * the routine of vmemmap page table walking has the following rules
> + * (see more details from the vmemmap_pte_range()):
> + *
> + * - The @reuse address is part of the range that we are walking.
> + * - The @reuse address is the first in the complete range.
> + *
> + * So we need to make sure that @start and @reuse meet the above rules.
You say that "reuse" and "start" need to meet some rules, but in the
paragraph above you only seem to list rules for "reuse"?
--
Oscar Salvador
SUSE L3
On Sun, Jan 24, 2021 at 1:53 AM Oscar Salvador <[email protected]> wrote:
>
> On Sun, Jan 17, 2021 at 11:10:44PM +0800, Muchun Song wrote:
> > Every HugeTLB has more than one struct page structure. We __know__ that
> > we only use the first 4(HUGETLB_CGROUP_MIN_ORDER) struct page structures
> > to store metadata associated with each HugeTLB.
> >
> > There are a lot of struct page structures associated with each HugeTLB
> > page. For tail pages, the value of compound_head is the same. So we can
> > reuse first page of tail page structures. We map the virtual addresses
> > of the remaining pages of tail page structures to the first tail page
> > struct, and then free these page frames. Therefore, we need to reserve
> > two pages as vmemmap areas.
> >
> > When we allocate a HugeTLB page from the buddy, we can free some vmemmap
> > pages associated with each HugeTLB page. It is more appropriate to do it
> > in the prep_new_huge_page().
> >
> > The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
> > pages associated with a HugeTLB page can be freed, returns zero for
> > now, which means the feature is disabled. We will enable it once all
> > the infrastructure is there.
> >
> > Signed-off-by: Muchun Song <[email protected]>
>
> Overall looks good to me.
> A few nits below, plus what Mike has already said.
>
> I was playing the other day (just for fun) to see how hard it would be to adapt
> this to ppc64 but did not have the time :-)
I have no idea about ppc64, but adapting this to aarch64 is easy
(I have finished that part of the work). Is the size of struct page
64 bytes on ppc64? If so, I think that it is also easy.
>
> > --- /dev/null
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -0,0 +1,211 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Free some vmemmap pages of HugeTLB
> > + *
> > + * Copyright (c) 2020, Bytedance. All rights reserved.
> > + *
> > + * Author: Muchun Song <[email protected]>
> > + *
> > + * The struct page structures (page structs) are used to describe a physical
> > + * page frame. By default, there is a one-to-one mapping from a page frame to
> > + * it's corresponding page struct.
> > + *
> > + * The HugeTLB pages consist of multiple base page size pages and is supported
> "HugeTLB pages ..."
Thanks.
>
> > + * When the system boot up, every HugeTLB page has more than one struct page
> > + * structs whose size is (unit: pages):
> ^^^^ which?
I am not a native English speaker. Thanks for pointing this out.
> > + *
> > + * struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> > + *
> > + * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
> > + * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
> > + * relationship.
> > + *
> > + * HugeTLB_Size = n * PAGE_SIZE
> > + *
> > + * Then,
> > + *
> > + * struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> > + * = n * sizeof(struct page) / PAGE_SIZE
> > + *
> > + * We can use huge mapping at the pud/pmd level for the HugeTLB page.
> > + *
> > + * For the HugeTLB page of the pmd level mapping, then
> > + *
> > + * struct_size = n * sizeof(struct page) / PAGE_SIZE
> > + * = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
> > + * = sizeof(struct page) / sizeof(pte_t)
> > + * = 64 / 8
> > + * = 8 (pages)
> > + *
> > + * Where n is how many pte entries which one page can contains. So the value of
> > + * n is (PAGE_SIZE / sizeof(pte_t)).
> > + *
> > + * This optimization only supports 64-bit system, so the value of sizeof(pte_t)
> > + * is 8. And this optimization also applicable only when the size of struct page
> > + * is a power of two. In most cases, the size of struct page is 64 (e.g. x86-64
> > + * and arm64). So if we use pmd level mapping for a HugeTLB page, the size of
> > + * struct page structs of it is 8 pages whose size depends on the size of the
> > + * base page.
> > + *
> > + * For the HugeTLB page of the pud level mapping, then
> > + *
> > + * struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
> > + * = PAGE_SIZE / 8 * 8 (pages)
> > + * = PAGE_SIZE (pages)
>
> I would try to condense above information and focus on what are the
> key points you want people to get.
> E.g.: the struct pages of a 2MB HugeTLB page on x86_64 take up 8 page frames,
> while those of a 1GB HugeTLB page take up 4096.
> If you do not want to be that specific you can always write down the
> formula, and maybe put the X86_64 example at the end.
> But as I said, I would try to make it more brief.
>
> Maybe others disagree though.
I want to make the formula more general, because PAGE_SIZE can be
different on arm64. But you are right, I should make it brief and
easy to understand. I will add some examples at the end of the
formula. Thanks.
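
As an illustration only (a userspace sketch, not code from this series), the
formula discussed above can be evaluated directly. The function name and its
parameters are hypothetical; the 64-byte struct page size is the value the
patch comment assumes for x86-64 and arm64.

```c
#include <assert.h>

/*
 * Number of base pages taken up by the struct page array (vmemmap) that
 * describes one HugeTLB page:
 *
 *   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
 *
 * All three inputs are supplied by the caller; nothing here reads real
 * kernel state.
 */
static unsigned long sk_vmemmap_pages_per_hpage(unsigned long hugetlb_size,
						unsigned long page_size,
						unsigned long struct_page_size)
{
	unsigned long nr_structs;

	/* A HugeTLB page is always a whole number of base pages. */
	assert(page_size != 0 && hugetlb_size % page_size == 0);

	nr_structs = hugetlb_size / page_size;
	return nr_structs * struct_page_size / page_size;
}
```

With 4KB base pages and a 64-byte struct page this gives 8 for a 2MB page and
4096 for a 1GB page. With 64KB base pages, a 512MB pmd-level page also gives 8,
which matches the sizeof(struct page) / sizeof(pte_t) result in the comment.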
>
>
> > + *
> > + * Where the struct_size(pmd) is the size of the struct page structs of a
> > + * HugeTLB page of the pmd level mapping.
>
> [...]
>
> > +void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> > +{
> > + unsigned long vmemmap_addr = (unsigned long)head;
> > + unsigned long vmemmap_end, vmemmap_reuse;
> > +
> > + if (!free_vmemmap_pages_per_hpage(h))
> > + return;
> > +
> > + vmemmap_addr += RESERVE_VMEMMAP_SIZE;
> > + vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
> > + vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
> > +
>
> I would like to see a comment there explaining why those variables get
> the values they do.
OK. Will add a comment here.
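
A rough userspace sketch of how the three addresses relate, for illustration
only. It assumes that RESERVE_VMEMMAP_SIZE is two base pages (the commit
message says two pages are reserved) and a 4KB base page; the SK_ names and
the struct are hypothetical and do not appear in the patch.

```c
#include <assert.h>

#define SK_PAGE_SIZE		4096UL			/* assumed base page size */
#define SK_RESERVE_VMEMMAP_NR	2UL			/* assumed: two pages stay mapped */
#define SK_RESERVE_VMEMMAP_SIZE	(SK_RESERVE_VMEMMAP_NR * SK_PAGE_SIZE)

struct sk_vmemmap_span {
	unsigned long start;	/* first vmemmap address to remap */
	unsigned long end;	/* one past the last address to remap */
	unsigned long reuse;	/* address whose page frame is shared */
};

/*
 * @head:        vmemmap address of the head struct page.
 * @total_pages: number of vmemmap pages describing the HugeTLB page.
 */
static struct sk_vmemmap_span sk_vmemmap_span_of(unsigned long head,
						 unsigned long total_pages)
{
	struct sk_vmemmap_span s;

	assert(total_pages > SK_RESERVE_VMEMMAP_NR);

	/* Skip the reserved pages; everything after them is remapped. */
	s.start = head + SK_RESERVE_VMEMMAP_SIZE;
	s.end = s.start + (total_pages - SK_RESERVE_VMEMMAP_NR) * SK_PAGE_SIZE;
	/* The last reserved page is the one every remapped page points to. */
	s.reuse = s.start - SK_PAGE_SIZE;
	return s;
}
```

Under these assumptions the invariant checked by
BUG_ON(start - reuse != PAGE_SIZE) in vmemmap_remap_free() holds by
construction.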
>
> > +/**
> > + * vmemmap_remap_walk - walk vmemmap page table
> > + *
> > + * @remap_pte: called for each non-empty PTE (lowest-level) entry.
> > + * @reuse_page: the page which is reused for the tail vmemmap pages.
> > + * @reuse_addr: the virtual address of the @reuse_page page.
> > + * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
>
> Let us align the tabs there.
It is already aligned. :)
>
> > +static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> > + unsigned long end,
> > + struct vmemmap_remap_walk *walk)
> > +{
> > + pte_t *pte;
> > +
> > + pte = pte_offset_kernel(pmd, addr);
> > +
> > + /*
> > + * The reuse_page is found 'first' in table walk before we start
> > + * remapping (which is calling @walk->remap_pte).
> > + */
> > + if (walk->reuse_addr == addr) {
> > + BUG_ON(pte_none(*pte));
>
> If it is found first, would not
>
> 	if (!walk->reuse_page) {
> 		BUG_ON(walk->reuse_addr != addr);
> 		...
> 	}
>
> be more intuitive?
Good. More intuitive. Thanks.
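
To make the suggested control flow concrete, here is a small userspace model
of the walk (a sketch under stated assumptions, not the kernel code). An int
array stands in for the PTEs, each value being a non-negative page frame
number; the sk_ names and the fixed-size freed[] array are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_FREED 16

struct sk_walk {
	int reuse_frame;		/* -1 until the reuse entry has been seen */
	size_t reuse_index;		/* entry expected to be visited first */
	int freed[SK_MAX_FREED];	/* frames that are no longer mapped */
	size_t nr_freed;
};

/* Walk entries [start, end) and point all but the first at the reuse frame. */
static void sk_remap_range(int *pte, size_t start, size_t end,
			   struct sk_walk *walk)
{
	size_t i;

	for (i = start; i < end; i++) {
		if (walk->reuse_frame < 0) {
			/* The first entry visited must be the reuse entry. */
			assert(walk->reuse_index == i);
			walk->reuse_frame = pte[i];
			continue;
		}
		/* Collect the old frame, then remap the entry. */
		assert(walk->nr_freed < SK_MAX_FREED);
		walk->freed[walk->nr_freed++] = pte[i];
		pte[i] = walk->reuse_frame;
	}
}
```

Testing "has the reuse page been recorded yet" first, and asserting on the
address inside that branch, states the "found first" rule directly instead of
leaving it implied by an address comparison.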
>
>
> > +static void vmemmap_remap_range(unsigned long start, unsigned long end,
> > + struct vmemmap_remap_walk *walk)
> > +{
> > + unsigned long addr = start;
> > + unsigned long next;
> > + pgd_t *pgd;
> > +
> > + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
> > + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
> > +
> > + pgd = pgd_offset_k(addr);
> > + do {
> > + BUG_ON(pgd_none(*pgd));
> > +
> > + next = pgd_addr_end(addr, end);
> > + vmemmap_p4d_range(pgd, addr, next, walk);
> > + } while (pgd++, addr = next, addr != end);
> > +
> > + /*
> > + * We do not change the mapping of the vmemmap virtual address range
> > + * [@start, @start + PAGE_SIZE) which is belong to the reuse range.
> "which belongs to"
Thanks. Will fix it.
>
> > + * So we not need to flush the TLB.
> > + */
> > + flush_tlb_kernel_range(start - PAGE_SIZE, end);
>
> You already commented on this one.
Yeah, will fix it.
>
> > +/**
> > + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> > + * to the page which @reuse is mapped, then free vmemmap
> > + * pages.
> > + * @start: start address of the vmemmap virtual address range.
>
> Well, it is the start address of the range we want to remap.
> Reading it made me think that it is really the __start__ address
> of the vmemmap range.
Sorry for the confusion. I will fix it. Thanks.
>
> > +void vmemmap_remap_free(unsigned long start, unsigned long end,
> > + unsigned long reuse)
> > +{
> > + LIST_HEAD(vmemmap_pages);
> > + struct vmemmap_remap_walk walk = {
> > + .remap_pte = vmemmap_remap_pte,
> > + .reuse_addr = reuse,
> > + .vmemmap_pages = &vmemmap_pages,
> > + };
> > +
> > + /*
> > + * In order to make remapping routine most efficient for the huge pages,
> > + * the routine of vmemmap page table walking has the following rules
> > + * (see more details from the vmemmap_pte_range()):
> > + *
> > + * - The @reuse address is part of the range that we are walking.
> > + * - The @reuse address is the first in the complete range.
> > + *
> > + * So we need to make sure that @start and @reuse meet the above rules.
>
> You say that "reuse" and "start" need to meet some rules, but in the
> paragraph above you only seem to list rules for "reuse"?
OK. I should make the comment clearer. I will update it
and point out the relationship between @start and @reuse.
Thanks a lot.
>
>
> --
> Oscar Salvador
> SUSE L3
On Sun, 17 Jan 2021, Muchun Song wrote:
> In the subsequent patch, we should allocate the vmemmap pages when
> freeing HugeTLB pages. But update_and_free_page() is always called
> with holding hugetlb_lock, so we cannot use GFP_KERNEL to allocate
> vmemmap pages. However, we can defer the actual freeing in a kworker
> to prevent from using GFP_ATOMIC to allocate the vmemmap pages.
>
> The update_hpage_vmemmap_workfn() is where the call to allocate
> vmemmmap pages will be inserted.
>
I think it's reasonable to assume that userspace can release free hugetlb
pages from the pool on oom conditions when reclaim has become too
expensive. This approach now requires that we can allocate vmemmap pages
in a potential oom condition as a prerequisite for freeing memory, which
seems less than ideal.
And, by doing this through a kworker, we can presumably get queued behind
another work item that requires memory to make forward progress in this
oom condition.
Two thoughts:
- We're going to be freeing the hugetlb page after we can allocate the
vmemmap pages, so why do we need to allocate with GFP_KERNEL? Can't we
simply dip into memory reserves using GFP_ATOMIC (and thus can be
holding hugetlb_lock) because we know we'll be freeing more memory than
we'll be allocating? I think requiring a GFP_KERNEL allocation to block
to free memory for vmemmap when we'll be freeing memory ourselves is
dubious. This simplifies all of this.
- If the answer is that we actually have to use GFP_KERNEL for other
reasons, what are your thoughts on pre-allocating the vmemmap as opposed
to deferring to a kworker? In other words, preallocate the necessary
memory with GFP_KERNEL and put it on a linked list in struct hstate
before acquiring hugetlb_lock.
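
The second idea could look roughly like the following userspace model, offered
as a sketch only. Here malloc() stands in for a blocking GFP_KERNEL allocation,
and the pool stands in for the per-hstate list; struct sk_pool and both
functions are hypothetical names, not existing kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct sk_page {
	struct sk_page *next;
};

struct sk_pool {
	struct sk_page *head;
	size_t count;
};

/*
 * May block or fail, so call it only before the lock is taken.
 * Returns the number of pages actually added to the pool.
 */
static size_t sk_pool_refill(struct sk_pool *pool, size_t nr)
{
	size_t added = 0;

	while (added < nr) {
		struct sk_page *page = malloc(sizeof(*page));

		if (!page)
			break;
		page->next = pool->head;
		pool->head = page;
		pool->count++;
		added++;
	}
	return added;
}

/* Never allocates, so it is safe inside the critical section. */
static struct sk_page *sk_pool_take(struct sk_pool *pool)
{
	struct sk_page *page = pool->head;

	if (page) {
		pool->head = page->next;
		pool->count--;
		page->next = NULL;
	}
	return page;
}
```

The caller would refill the pool with as many pages as one HugeTLB page needs,
check the return value, and only then take the lock; an allocation failure is
then seen before any state has been changed.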
> Signed-off-by: Muchun Song <[email protected]>
> Reviewed-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> mm/hugetlb_vmemmap.c | 12 ---------
> mm/hugetlb_vmemmap.h | 17 ++++++++++++
> 3 files changed, 89 insertions(+), 14 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 140135fc8113..c165186ec2cf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1292,15 +1292,85 @@ static inline void destroy_compound_gigantic_page(struct page *page,
> unsigned int order) { }
> #endif
>
> -static void update_and_free_page(struct hstate *h, struct page *page)
> +static void __free_hugepage(struct hstate *h, struct page *page);
> +
> +/*
> + * As update_and_free_page() is always called with holding hugetlb_lock, so we
> + * cannot use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
> + * actual freeing in a workqueue to prevent from using GFP_ATOMIC to allocate
> + * the vmemmap pages.
> + *
> + * The update_hpage_vmemmap_workfn() is where the call to allocate vmemmmap
> + * pages will be inserted.
> + *
> + * update_hpage_vmemmap_workfn() locklessly retrieves the linked list of pages
> + * to be freed and frees them one-by-one. As the page->mapping pointer is going
> + * to be cleared in update_hpage_vmemmap_workfn() anyway, it is reused as the
> + * llist_node structure of a lockless linked list of huge pages to be freed.
> + */
> +static LLIST_HEAD(hpage_update_freelist);
> +
> +static void update_hpage_vmemmap_workfn(struct work_struct *work)
> {
> - int i;
> + struct llist_node *node;
> +
> + node = llist_del_all(&hpage_update_freelist);
> +
> + while (node) {
> + struct page *page;
> + struct hstate *h;
> +
> + page = container_of((struct address_space **)node,
> + struct page, mapping);
> + node = node->next;
> + page->mapping = NULL;
> + h = page_hstate(page);
> +
> + spin_lock(&hugetlb_lock);
> + __free_hugepage(h, page);
> + spin_unlock(&hugetlb_lock);
>
> + cond_resched();
Wouldn't it be better to hold hugetlb_lock for the iteration rather than
constantly dropping it and reacquiring it? Use
cond_resched_lock(&hugetlb_lock) instead?
On Sun, 17 Jan 2021, Muchun Song wrote:
> The HUGETLB_PAGE_FREE_VMEMMAP option is used to enable the freeing
> of unnecessary vmemmap associated with HugeTLB pages. The config
> option is introduced early so that supporting code can be written
> to depend on the option. The initial version of the code only
> provides support for x86-64.
>
> Like other code which frees vmemmap, this config option depends on
> HAVE_BOOTMEM_INFO_NODE. The routine register_page_bootmem_info() is
> used to register bootmem info. Therefore, make sure
> register_page_bootmem_info is enabled if HUGETLB_PAGE_FREE_VMEMMAP
> is defined.
>
> Signed-off-by: Muchun Song <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Acked-by: Mike Kravetz <[email protected]>
> ---
> arch/x86/mm/init_64.c | 2 +-
> fs/Kconfig | 18 ++++++++++++++++++
> 2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0a45f062826e..0435bee2e172 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
>
> static void __init register_page_bootmem_info(void)
> {
> -#ifdef CONFIG_NUMA
> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
> int i;
>
> for_each_online_node(i)
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 976e8b9033c4..e7c4c2a79311 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -245,6 +245,24 @@ config HUGETLBFS
> config HUGETLB_PAGE
> def_bool HUGETLBFS
>
> +config HUGETLB_PAGE_FREE_VMEMMAP
> + def_bool HUGETLB_PAGE
I'm not sure I understand the rationale for providing this help text if
this is def_bool depending on CONFIG_HUGETLB_PAGE. Are you intending that
this is actually configurable and we want to provide guidance to the admin
on when to disable it (which it currently doesn't)? If not, why have the
help text?
> + depends on X86_64
> + depends on SPARSEMEM_VMEMMAP
> + depends on HAVE_BOOTMEM_INFO_NODE
> + help
> + The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of
> + some vmemmap pages associated with pre-allocated HugeTLB pages.
> + For example, on X86_64 6 vmemmap pages of size 4KB each can be
> + saved for each 2MB HugeTLB page. 4094 vmemmap pages of size 4KB
> + each can be saved for each 1GB HugeTLB page.
> +
> + When a HugeTLB page is allocated or freed, the vmemmap array
> + representing the range associated with the page will need to be
> + remapped. When a page is allocated, vmemmap pages are freed
> + after remapping. When a page is freed, previously discarded
> + vmemmap pages must be allocated before remapping.
> +
> config MEMFD_CREATE
> def_bool TMPFS || HUGETLBFS
>
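
For reference, the savings quoted in the help text can be reproduced with a
few lines of userspace C. This is an illustrative sketch: it assumes that two
vmemmap pages stay mapped per HugeTLB page (8 - 6 and 4096 - 4094 in the help
text), and the names below are hypothetical.

```c
#include <assert.h>

/* Assumption: two vmemmap pages remain in use per HugeTLB page. */
#define SK_RESERVED_VMEMMAP_PAGES 2ULL

/*
 * Bytes returned to the buddy allocator for @nr_hpages HugeTLB pages,
 * each described by @vmemmap_pages vmemmap pages of @page_size bytes.
 */
static unsigned long long sk_saved_bytes(unsigned long long nr_hpages,
					 unsigned long long vmemmap_pages,
					 unsigned long long page_size)
{
	assert(vmemmap_pages >= SK_RESERVED_VMEMMAP_PAGES);
	return nr_hpages * (vmemmap_pages - SK_RESERVED_VMEMMAP_PAGES) * page_size;
}
```

For example, a 1TB pool of 2MB HugeTLB pages is 524288 pages, so with 4KB base
pages the sketch reports 12GB saved.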
On Sun, 17 Jan 2021, Muchun Song wrote:
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index ce4be1fa93c2..3b146d5949f3 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -29,6 +29,7 @@
> #include <linux/sched.h>
> #include <linux/pgtable.h>
> #include <linux/bootmem_info.h>
> +#include <linux/delay.h>
>
> #include <asm/dma.h>
> #include <asm/pgalloc.h>
> @@ -40,7 +41,8 @@
> * @remap_pte: called for each non-empty PTE (lowest-level) entry.
> * @reuse_page: the page which is reused for the tail vmemmap pages.
> * @reuse_addr: the virtual address of the @reuse_page page.
> - * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
> + * @vmemmap_pages: the list head of the vmemmap pages that can be freed
> + * or is mapped from.
> */
> struct vmemmap_remap_walk {
> void (*remap_pte)(pte_t *pte, unsigned long addr,
> @@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
> struct list_head *vmemmap_pages;
> };
>
> +/* The gfp mask of allocating vmemmap page */
> +#define GFP_VMEMMAP_PAGE \
> + (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
> +
This is unnecessary, just use the gfp mask directly in allocator.
> static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> unsigned long end,
> struct vmemmap_remap_walk *walk)
> @@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned long end,
> free_vmemmap_page_list(&vmemmap_pages);
> }
>
> +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> + struct vmemmap_remap_walk *walk)
> +{
> + pgprot_t pgprot = PAGE_KERNEL;
> + struct page *page;
> + void *to;
> +
> + BUG_ON(pte_page(*pte) != walk->reuse_page);
> +
> + page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> + list_del(&page->lru);
> + to = page_to_virt(page);
> + copy_page(to, (void *)walk->reuse_addr);
> +
> + set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> +}
> +
> +static void alloc_vmemmap_page_list(struct list_head *list,
> + unsigned long start, unsigned long end)
> +{
> + unsigned long addr;
> +
> + for (addr = start; addr < end; addr += PAGE_SIZE) {
> + struct page *page;
> + int nid = page_to_nid((const void *)addr);
> +
> +retry:
> + page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
> + if (unlikely(!page)) {
> + msleep(100);
> + /*
> + * We should retry infinitely, because we cannot
> + * handle allocation failures. Once we allocate
> + * vmemmap pages successfully, then we can free
> + * a HugeTLB page.
> + */
> + goto retry;
Ugh, I don't think this will work, there's no guarantee that we'll ever
succeed and now we can't free a 2MB hugepage because we cannot allocate a
4KB page. We absolutely have to ensure we make forward progress here.
We're going to be freeing the hugetlb page after this succeeds, so can we
not use part of the hugetlb page that we're freeing for this memory
instead?
> + }
> + list_add_tail(&page->lru, list);
> + }
> +}
> +
> +/**
> + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
> + * to the page which is from the @vmemmap_pages
> + * respectively.
> + * @start: start address of the vmemmap virtual address range.
> + * @end: end address of the vmemmap virtual address range.
> + * @reuse: reuse address.
> + */
> +void vmemmap_remap_alloc(unsigned long start, unsigned long end,
> + unsigned long reuse)
> +{
> + LIST_HEAD(vmemmap_pages);
> + struct vmemmap_remap_walk walk = {
> + .remap_pte = vmemmap_restore_pte,
> + .reuse_addr = reuse,
> + .vmemmap_pages = &vmemmap_pages,
> + };
> +
> + might_sleep();
> +
> + /* See the comment in the vmemmap_remap_free(). */
> + BUG_ON(start - reuse != PAGE_SIZE);
> +
> + alloc_vmemmap_page_list(&vmemmap_pages, start, end);
> + vmemmap_remap_range(reuse, end, &walk);
> +}
> +
> /*
> * Allocate a block of memory to be used to back the virtual memory map
> * or to back the page tables that are used to create the mapping.
> --
> 2.11.0
>
>
On Sun, 17 Jan 2021, Muchun Song wrote:
> Because we reuse the first tail vmemmap page frame and remap it
> with read-only, we cannot set the PageHWPosion on a tail page.
> So we can use the head[4].private to record the real error page
> index and set the raw error page PageHWPoison later.
>
> Signed-off-by: Muchun Song <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: David Rientjes <[email protected]>
On 1/24/21 3:58 PM, David Rientjes wrote:
> On Sun, 17 Jan 2021, Muchun Song wrote:
>
>> The HUGETLB_PAGE_FREE_VMEMMAP option is used to enable the freeing
>> of unnecessary vmemmap associated with HugeTLB pages. The config
>> option is introduced early so that supporting code can be written
>> to depend on the option. The initial version of the code only
>> provides support for x86-64.
>>
>> Like other code which frees vmemmap, this config option depends on
>> HAVE_BOOTMEM_INFO_NODE. The routine register_page_bootmem_info() is
>> used to register bootmem info. Therefore, make sure
>> register_page_bootmem_info is enabled if HUGETLB_PAGE_FREE_VMEMMAP
>> is defined.
>>
>> Signed-off-by: Muchun Song <[email protected]>
>> Reviewed-by: Oscar Salvador <[email protected]>
>> Acked-by: Mike Kravetz <[email protected]>
>> ---
>> arch/x86/mm/init_64.c | 2 +-
>> fs/Kconfig | 18 ++++++++++++++++++
>> 2 files changed, 19 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 0a45f062826e..0435bee2e172 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
>>
>> static void __init register_page_bootmem_info(void)
>> {
>> -#ifdef CONFIG_NUMA
>> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
>> int i;
>>
>> for_each_online_node(i)
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index 976e8b9033c4..e7c4c2a79311 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -245,6 +245,24 @@ config HUGETLBFS
>> config HUGETLB_PAGE
>> def_bool HUGETLBFS
>>
>> +config HUGETLB_PAGE_FREE_VMEMMAP
>> + def_bool HUGETLB_PAGE
>
> I'm not sure I understand the rationale for providing this help text if
> this is def_bool depending on CONFIG_HUGETLB_PAGE. Are you intending that
> this is actually configurable and we want to provide guidance to the admin
> on when to disable it (which it currently doesn't)? If not, why have the
> help text?
It's good for the (non-user) Kconfig symbol's meaning to be documented somewhere,
preferably such that one does not have to go digging thru git commit logs
to find it.
>> + depends on X86_64
>> + depends on SPARSEMEM_VMEMMAP
>> + depends on HAVE_BOOTMEM_INFO_NODE
>> + help
>> + The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of
>> + some vmemmap pages associated with pre-allocated HugeTLB pages.
>> + For example, on X86_64 6 vmemmap pages of size 4KB each can be
>> + saved for each 2MB HugeTLB page. 4094 vmemmap pages of size 4KB
>> + each can be saved for each 1GB HugeTLB page.
>> +
>> + When a HugeTLB page is allocated or freed, the vmemmap array
>> + representing the range associated with the page will need to be
>> + remapped. When a page is allocated, vmemmap pages are freed
>> + after remapping. When a page is freed, previously discarded
>> + vmemmap pages must be allocated before remapping.
>> +
>> config MEMFD_CREATE
>> def_bool TMPFS || HUGETLBFS
>>
>
--
~Randy
On Mon, Jan 25, 2021 at 7:55 AM David Rientjes <[email protected]> wrote:
>
>
> On Sun, 17 Jan 2021, Muchun Song wrote:
>
> > In the subsequent patch, we should allocate the vmemmap pages when
> > freeing HugeTLB pages. But update_and_free_page() is always called
> > with holding hugetlb_lock, so we cannot use GFP_KERNEL to allocate
> > vmemmap pages. However, we can defer the actual freeing in a kworker
> > to prevent from using GFP_ATOMIC to allocate the vmemmap pages.
> >
> > The update_hpage_vmemmap_workfn() is where the call to allocate
> > vmemmmap pages will be inserted.
> >
>
> I think it's reasonable to assume that userspace can release free hugetlb
> pages from the pool on oom conditions when reclaim has become too
> expensive. This approach now requires that we can allocate vmemmap pages
> in a potential oom condition as a prerequisite for freeing memory, which
> seems less than ideal.
>
> And, by doing this through a kworker, we can presumably get queued behind
> another work item that requires memory to make forward progress in this
> oom condition.
>
> Two thoughts:
>
> - We're going to be freeing the hugetlb page after we can allocate the
> vmemmap pages, so why do we need to allocate with GFP_KERNEL? Can't we
> simply dip into memory reserves using GFP_ATOMIC (and thus can be
> holding hugetlb_lock) because we know we'll be freeing more memory than
> we'll be allocating?
Right.
> I think requiring a GFP_KERNEL allocation to block
> to free memory for vmemmap when we'll be freeing memory ourselves is
> dubious. This simplifies all of this.
Thanks for your thoughts. I just thought that we could go into reclaim
when there is no memory in the system. But we cannot block when
using GFP_KERNEL. Actually, we cannot handle a memory allocation
failure. In the next patch, when an allocation fails, I sleep for 100ms
and then try to allocate again.
>
> - If the answer is that we actually have to use GFP_KERNEL for other
> reasons, what are your thoughts on pre-allocating the vmemmap as opposed
> to deferring to a kworker? In other words, preallocate the necessary
> memory with GFP_KERNEL and put it on a linked list in struct hstate
> before acquiring hugetlb_lock.
put_page() can be called from an atomic context, so we cannot sleep
in __free_huge_page(). It seems a little tricky. Right?
>
> > Signed-off-by: Muchun Song <[email protected]>
> > Reviewed-by: Mike Kravetz <[email protected]>
> > ---
> > mm/hugetlb.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > mm/hugetlb_vmemmap.c | 12 ---------
> > mm/hugetlb_vmemmap.h | 17 ++++++++++++
> > 3 files changed, 89 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 140135fc8113..c165186ec2cf 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1292,15 +1292,85 @@ static inline void destroy_compound_gigantic_page(struct page *page,
> > unsigned int order) { }
> > #endif
> >
> > -static void update_and_free_page(struct hstate *h, struct page *page)
> > +static void __free_hugepage(struct hstate *h, struct page *page);
> > +
> > +/*
> > + * As update_and_free_page() is always called with holding hugetlb_lock, so we
> > + * cannot use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
> > + * actual freeing in a workqueue to prevent from using GFP_ATOMIC to allocate
> > + * the vmemmap pages.
> > + *
> > + * The update_hpage_vmemmap_workfn() is where the call to allocate vmemmap
> > + * pages will be inserted.
> > + *
> > + * update_hpage_vmemmap_workfn() locklessly retrieves the linked list of pages
> > + * to be freed and frees them one-by-one. As the page->mapping pointer is going
> > + * to be cleared in update_hpage_vmemmap_workfn() anyway, it is reused as the
> > + * llist_node structure of a lockless linked list of huge pages to be freed.
> > + */
> > +static LLIST_HEAD(hpage_update_freelist);
> > +
> > +static void update_hpage_vmemmap_workfn(struct work_struct *work)
> > {
> > - int i;
> > + struct llist_node *node;
> > +
> > + node = llist_del_all(&hpage_update_freelist);
> > +
> > + while (node) {
> > + struct page *page;
> > + struct hstate *h;
> > +
> > + page = container_of((struct address_space **)node,
> > + struct page, mapping);
> > + node = node->next;
> > + page->mapping = NULL;
> > + h = page_hstate(page);
> > +
> > + spin_lock(&hugetlb_lock);
> > + __free_hugepage(h, page);
> > + spin_unlock(&hugetlb_lock);
> >
> > + cond_resched();
>
> Wouldn't it be better to hold hugetlb_lock for the iteration rather than
> constantly dropping it and reacquiring it? Use
> cond_resched_lock(&hugetlb_lock) instead?
Great. We can use it. Thanks.
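A minimal userspace sketch of the lockless-list pattern that the quoted update_hpage_vmemmap_workfn() relies on, assuming C11 atomics in place of the kernel's llist API. The sketch_* names are hypothetical and only mirror llist_add() and llist_del_all() in spirit. The patch reuses page->mapping as the list node, whereas here the node is an ordinary embedded member, and no locking or rescheduling is modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_node {
	struct sketch_node *next;
};

struct sketch_head {
	_Atomic(struct sketch_node *) first;
};

/* Push one node; safe against concurrent pushers via a CAS loop. */
static void sketch_llist_add(struct sketch_node *node, struct sketch_head *head)
{
	struct sketch_node *old = atomic_load(&head->first);

	do {
		node->next = old;
	} while (!atomic_compare_exchange_weak(&head->first, &old, node));
}

/* Detach the whole list with one atomic exchange; returns the old first node. */
static struct sketch_node *sketch_llist_del_all(struct sketch_head *head)
{
	return atomic_exchange(&head->first, (struct sketch_node *)NULL);
}

/* Stand-in for a huge page waiting to be freed. */
struct sketch_hpage {
	int id;
	struct sketch_node node;
	int freed;
};

/* Worker body: drain the list and mark each page freed; returns the count. */
static int sketch_drain(struct sketch_head *head)
{
	struct sketch_node *node = sketch_llist_del_all(head);
	int handled = 0;

	while (node) {
		/* Recover the containing object, as container_of() would. */
		struct sketch_hpage *hp = (struct sketch_hpage *)
			((char *)node - offsetof(struct sketch_hpage, node));

		node = node->next;	/* read the link before handling the page */
		hp->freed = 1;
		handled++;
	}
	return handled;
}
```

Producers only ever push, and the worker takes the entire list at once, so neither side needs a lock to touch the list itself.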
On Mon, Jan 25, 2021 at 7:58 AM David Rientjes <[email protected]> wrote:
>
>
> On Sun, 17 Jan 2021, Muchun Song wrote:
>
> > The HUGETLB_PAGE_FREE_VMEMMAP option is used to enable the freeing
> > of unnecessary vmemmap associated with HugeTLB pages. The config
> > option is introduced early so that supporting code can be written
> > to depend on the option. The initial version of the code only
> > provides support for x86-64.
> >
> > Like other code which frees vmemmap, this config option depends on
> > HAVE_BOOTMEM_INFO_NODE. The routine register_page_bootmem_info() is
> > used to register bootmem info. Therefore, make sure
> > register_page_bootmem_info is enabled if HUGETLB_PAGE_FREE_VMEMMAP
> > is defined.
> >
> > Signed-off-by: Muchun Song <[email protected]>
> > Reviewed-by: Oscar Salvador <[email protected]>
> > Acked-by: Mike Kravetz <[email protected]>
> > ---
> > arch/x86/mm/init_64.c | 2 +-
> > fs/Kconfig | 18 ++++++++++++++++++
> > 2 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index 0a45f062826e..0435bee2e172 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
> >
> > static void __init register_page_bootmem_info(void)
> > {
> > -#ifdef CONFIG_NUMA
> > +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
> > int i;
> >
> > for_each_online_node(i)
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 976e8b9033c4..e7c4c2a79311 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -245,6 +245,24 @@ config HUGETLBFS
> > config HUGETLB_PAGE
> > def_bool HUGETLBFS
> >
> > +config HUGETLB_PAGE_FREE_VMEMMAP
> > + def_bool HUGETLB_PAGE
>
> I'm not sure I understand the rationale for providing this help text if
> this is def_bool depending on CONFIG_HUGETLB_PAGE. Are you intending that
> this is actually configurable and we want to provide guidance to the admin
> on when to disable it (which it currently doesn't)? If not, why have the
> help text?
This is __not__ configurable. The help text serves as a comment to help
others understand this option, as Randy said.
Thanks.
>
> > + depends on X86_64
> > + depends on SPARSEMEM_VMEMMAP
> > + depends on HAVE_BOOTMEM_INFO_NODE
> > + help
> > + The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of
> > + some vmemmap pages associated with pre-allocated HugeTLB pages.
> > + For example, on X86_64 6 vmemmap pages of size 4KB each can be
> > + saved for each 2MB HugeTLB page. 4094 vmemmap pages of size 4KB
> > + each can be saved for each 1GB HugeTLB page.
> > +
> > + When a HugeTLB page is allocated or freed, the vmemmap array
> > + representing the range associated with the page will need to be
> > + remapped. When a page is allocated, vmemmap pages are freed
> > + after remapping. When a page is freed, previously discarded
> > + vmemmap pages must be allocated before remapping.
> > +
> > config MEMFD_CREATE
> > def_bool TMPFS || HUGETLBFS
> >
On 1/24/21 8:06 PM, Muchun Song wrote:
> On Mon, Jan 25, 2021 at 7:58 AM David Rientjes <[email protected]> wrote:
>>
>>
>> On Sun, 17 Jan 2021, Muchun Song wrote:
>>
>>> The HUGETLB_PAGE_FREE_VMEMMAP option is used to enable the freeing
>>> of unnecessary vmemmap associated with HugeTLB pages. The config
>>> option is introduced early so that supporting code can be written
>>> to depend on the option. The initial version of the code only
>>> provides support for x86-64.
>>>
>>> Like other code which frees vmemmap, this config option depends on
>>> HAVE_BOOTMEM_INFO_NODE. The routine register_page_bootmem_info() is
>>> used to register bootmem info. Therefore, make sure
>>> register_page_bootmem_info is enabled if HUGETLB_PAGE_FREE_VMEMMAP
>>> is defined.
>>>
>>> Signed-off-by: Muchun Song <[email protected]>
>>> Reviewed-by: Oscar Salvador <[email protected]>
>>> Acked-by: Mike Kravetz <[email protected]>
>>> ---
>>> arch/x86/mm/init_64.c | 2 +-
>>> fs/Kconfig | 18 ++++++++++++++++++
>>> 2 files changed, 19 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>> index 0a45f062826e..0435bee2e172 100644
>>> --- a/arch/x86/mm/init_64.c
>>> +++ b/arch/x86/mm/init_64.c
>>> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
>>>
>>> static void __init register_page_bootmem_info(void)
>>> {
>>> -#ifdef CONFIG_NUMA
>>> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
>>> int i;
>>>
>>> for_each_online_node(i)
>>> diff --git a/fs/Kconfig b/fs/Kconfig
>>> index 976e8b9033c4..e7c4c2a79311 100644
>>> --- a/fs/Kconfig
>>> +++ b/fs/Kconfig
>>> @@ -245,6 +245,24 @@ config HUGETLBFS
>>> config HUGETLB_PAGE
>>> def_bool HUGETLBFS
>>>
>>> +config HUGETLB_PAGE_FREE_VMEMMAP
>>> + def_bool HUGETLB_PAGE
>>
>> I'm not sure I understand the rationale for providing this help text if
>> this is def_bool depending on CONFIG_HUGETLB_PAGE. Are you intending that
>> this is actually configurable and we want to provide guidance to the admin
>> on when to disable it (which it currently doesn't)? If not, why have the
>> help text?
>
> This is __not__ configurable. Seems like a comment to help others
> understand this option. Like Randy said.
Yes, it could be written with '#' (or "comment") comment syntax instead of as help text.
thanks.
>>
>>> + depends on X86_64
>>> + depends on SPARSEMEM_VMEMMAP
>>> + depends on HAVE_BOOTMEM_INFO_NODE
>>> + help
>>> + The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of
>>> + some vmemmap pages associated with pre-allocated HugeTLB pages.
>>> + For example, on X86_64 6 vmemmap pages of size 4KB each can be
>>> + saved for each 2MB HugeTLB page. 4094 vmemmap pages of size 4KB
>>> + each can be saved for each 1GB HugeTLB page.
>>> +
>>> + When a HugeTLB page is allocated or freed, the vmemmap array
>>> + representing the range associated with the page will need to be
>>> + remapped. When a page is allocated, vmemmap pages are freed
>>> + after remapping. When a page is freed, previously discarded
>>> + vmemmap pages must be allocated before remapping.
>>> +
>>> config MEMFD_CREATE
>>> def_bool TMPFS || HUGETLBFS
>>>
>
--
~Randy
On Mon, Jan 25, 2021 at 8:06 AM David Rientjes <[email protected]> wrote:
>
>
> On Sun, 17 Jan 2021, Muchun Song wrote:
>
> > Because we reuse the first tail vmemmap page frame and remap it
> > read-only, we cannot set PageHWPoison on a tail page.
> > So we can use the head[4].private to record the real error page
> > index and set the raw error page PageHWPoison later.
> >
> > Signed-off-by: Muchun Song <[email protected]>
> > Reviewed-by: Oscar Salvador <[email protected]>
>
> Acked-by: David Rientjes <[email protected]>
Thanks.
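A toy model of the bookkeeping described in the quoted commit message, assuming a simplified stand-in for struct page. The sketch_* names are hypothetical, and flagging the head page while the tail structs are read-only is an assumption of this sketch, not something stated in the message.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_SUBPAGES 512

/* Toy stand-in for struct page: one flag plus the private word. */
struct sketch_page {
	bool hwpoison;
	unsigned long private;
};

/*
 * While the tail vmemmap is shared and read-only, only the head is
 * flagged (an assumption of this sketch) and the index of the page
 * that really failed is remembered in head[4].private.
 */
static void sketch_record_hwpoison(struct sketch_page *head, unsigned long idx)
{
	assert(idx < SKETCH_NR_SUBPAGES);
	head->hwpoison = true;
	head[4].private = idx;
}

/* Once the tail structs are writable again, move the flag to the raw page. */
static void sketch_apply_hwpoison(struct sketch_page *head)
{
	unsigned long idx;

	if (!head->hwpoison)
		return;

	idx = head[4].private;
	head->hwpoison = false;
	head[idx].hwpoison = true;
	head[4].private = 0;
}
```

The point of the indirection is that head[4] lives in a vmemmap page that stays writable, so the error location survives until the per-page flag can be set.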
On Mon, Jan 25, 2021 at 12:09 PM Randy Dunlap <[email protected]> wrote:
>
> On 1/24/21 8:06 PM, Muchun Song wrote:
> > On Mon, Jan 25, 2021 at 7:58 AM David Rientjes <[email protected]> wrote:
> >>
> >>
> >> On Sun, 17 Jan 2021, Muchun Song wrote:
> >>
> >>> The HUGETLB_PAGE_FREE_VMEMMAP option is used to enable the freeing
> >>> of unnecessary vmemmap associated with HugeTLB pages. The config
> >>> option is introduced early so that supporting code can be written
> >>> to depend on the option. The initial version of the code only
> >>> provides support for x86-64.
> >>>
> >>> Like other code which frees vmemmap, this config option depends on
> >>> HAVE_BOOTMEM_INFO_NODE. The routine register_page_bootmem_info() is
> >>> used to register bootmem info. Therefore, make sure
> >>> register_page_bootmem_info is enabled if HUGETLB_PAGE_FREE_VMEMMAP
> >>> is defined.
> >>>
> >>> Signed-off-by: Muchun Song <[email protected]>
> >>> Reviewed-by: Oscar Salvador <[email protected]>
> >>> Acked-by: Mike Kravetz <[email protected]>
> >>> ---
> >>> arch/x86/mm/init_64.c | 2 +-
> >>> fs/Kconfig | 18 ++++++++++++++++++
> >>> 2 files changed, 19 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >>> index 0a45f062826e..0435bee2e172 100644
> >>> --- a/arch/x86/mm/init_64.c
> >>> +++ b/arch/x86/mm/init_64.c
> >>> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
> >>>
> >>> static void __init register_page_bootmem_info(void)
> >>> {
> >>> -#ifdef CONFIG_NUMA
> >>> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
> >>> int i;
> >>>
> >>> for_each_online_node(i)
> >>> diff --git a/fs/Kconfig b/fs/Kconfig
> >>> index 976e8b9033c4..e7c4c2a79311 100644
> >>> --- a/fs/Kconfig
> >>> +++ b/fs/Kconfig
> >>> @@ -245,6 +245,24 @@ config HUGETLBFS
> >>> config HUGETLB_PAGE
> >>> def_bool HUGETLBFS
> >>>
> >>> +config HUGETLB_PAGE_FREE_VMEMMAP
> >>> + def_bool HUGETLB_PAGE
> >>
> >> I'm not sure I understand the rationale for providing this help text if
> >> this is def_bool depending on CONFIG_HUGETLB_PAGE. Are you intending that
> >> this is actually configurable and we want to provide guidance to the admin
> >> on when to disable it (which it currently doesn't)? If not, why have the
> >> help text?
> >
> > This is __not__ configurable. Seems like a comment to help others
> > understand this option. Like Randy said.
>
> Yes, it could be written with '#' (or "comment") comment syntax instead of as help text.
Got it. I will update in the next version. Thanks.
>
> thanks.
>
> >>
> >>> + depends on X86_64
> >>> + depends on SPARSEMEM_VMEMMAP
> >>> + depends on HAVE_BOOTMEM_INFO_NODE
> >>> + help
> >>> + The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of
> >>> + some vmemmap pages associated with pre-allocated HugeTLB pages.
> >>> + For example, on X86_64 6 vmemmap pages of size 4KB each can be
> >>> + saved for each 2MB HugeTLB page. 4094 vmemmap pages of size 4KB
> >>> + each can be saved for each 1GB HugeTLB page.
> >>> +
> >>> + When a HugeTLB page is allocated or freed, the vmemmap array
> >>> + representing the range associated with the page will need to be
> >>> + remapped. When a page is allocated, vmemmap pages are freed
> >>> + after remapping. When a page is freed, previously discarded
> >>> + vmemmap pages must be allocated before remapping.
> >>> +
> >>> config MEMFD_CREATE
> >>> def_bool TMPFS || HUGETLBFS
> >>>
> >
>
>
> --
> ~Randy
>
On Mon, Jan 25, 2021 at 8:05 AM David Rientjes <[email protected]> wrote:
>
>
> On Sun, 17 Jan 2021, Muchun Song wrote:
>
> > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > index ce4be1fa93c2..3b146d5949f3 100644
> > --- a/mm/sparse-vmemmap.c
> > +++ b/mm/sparse-vmemmap.c
> > @@ -29,6 +29,7 @@
> > #include <linux/sched.h>
> > #include <linux/pgtable.h>
> > #include <linux/bootmem_info.h>
> > +#include <linux/delay.h>
> >
> > #include <asm/dma.h>
> > #include <asm/pgalloc.h>
> > @@ -40,7 +41,8 @@
> > * @remap_pte: called for each non-empty PTE (lowest-level) entry.
> > * @reuse_page: the page which is reused for the tail vmemmap pages.
> > * @reuse_addr: the virtual address of the @reuse_page page.
> > - * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
> > + * @vmemmap_pages: the list head of the vmemmap pages that can be freed
> > + * or is mapped from.
> > */
> > struct vmemmap_remap_walk {
> > void (*remap_pte)(pte_t *pte, unsigned long addr,
> > @@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
> > struct list_head *vmemmap_pages;
> > };
> >
> > +/* The gfp mask of allocating vmemmap page */
> > +#define GFP_VMEMMAP_PAGE \
> > + (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
> > +
>
> This is unnecessary, just use the gfp mask directly in allocator.
Will do. Thanks.
>
> > static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> > unsigned long end,
> > struct vmemmap_remap_walk *walk)
> > @@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned long end,
> > free_vmemmap_page_list(&vmemmap_pages);
> > }
> >
> > +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> > + struct vmemmap_remap_walk *walk)
> > +{
> > + pgprot_t pgprot = PAGE_KERNEL;
> > + struct page *page;
> > + void *to;
> > +
> > + BUG_ON(pte_page(*pte) != walk->reuse_page);
> > +
> > + page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> > + list_del(&page->lru);
> > + to = page_to_virt(page);
> > + copy_page(to, (void *)walk->reuse_addr);
> > +
> > + set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> > +}
> > +
> > +static void alloc_vmemmap_page_list(struct list_head *list,
> > + unsigned long start, unsigned long end)
> > +{
> > + unsigned long addr;
> > +
> > + for (addr = start; addr < end; addr += PAGE_SIZE) {
> > + struct page *page;
> > + int nid = page_to_nid((const void *)addr);
> > +
> > +retry:
> > + page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
> > + if (unlikely(!page)) {
> > + msleep(100);
> > + /*
> > + * We should retry infinitely, because we cannot
> > + * handle allocation failures. Once we allocate
> > + * vmemmap pages successfully, then we can free
> > + * a HugeTLB page.
> > + */
> > + goto retry;
>
> Ugh, I don't think this will work, there's no guarantee that we'll ever
> succeed and now we can't free a 2MB hugepage because we cannot allocate a
> 4KB page. We absolutely have to ensure we make forward progress here.
This can trigger the OOM killer when there is no memory, which kills some
task to release memory. Right?
>
> We're going to be freeing the hugetlb page after this succeeds, can we
> not use part of the hugetlb page that we're freeing for this memory
> instead?
It seems like a good idea. We can try to allocate memory first; if that
succeeds, just use the new page to remap (it can reduce memory fragmentation).
If not, we can use part of the hugetlb page to remap. What's your opinion
about this?
>
> > + }
> > + list_add_tail(&page->lru, list);
> > + }
> > +}
> > +
> > +/**
> > + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
> > + * to the page which is from the @vmemmap_pages
> > + * respectively.
> > + * @start: start address of the vmemmap virtual address range.
> > + * @end: end address of the vmemmap virtual address range.
> > + * @reuse: reuse address.
> > + */
> > +void vmemmap_remap_alloc(unsigned long start, unsigned long end,
> > + unsigned long reuse)
> > +{
> > + LIST_HEAD(vmemmap_pages);
> > + struct vmemmap_remap_walk walk = {
> > + .remap_pte = vmemmap_restore_pte,
> > + .reuse_addr = reuse,
> > + .vmemmap_pages = &vmemmap_pages,
> > + };
> > +
> > + might_sleep();
> > +
> > + /* See the comment in the vmemmap_remap_free(). */
> > + BUG_ON(start - reuse != PAGE_SIZE);
> > +
> > + alloc_vmemmap_page_list(&vmemmap_pages, start, end);
> > + vmemmap_remap_range(reuse, end, &walk);
> > +}
> > +
> > /*
> > * Allocate a block of memory to be used to back the virtual memory map
> > * or to back the page tables that are used to create the mapping.
> > --
> > 2.11.0
> >
> >
On Mon, Jan 25, 2021 at 2:40 PM Muchun Song <[email protected]> wrote:
>
> On Mon, Jan 25, 2021 at 8:05 AM David Rientjes <[email protected]> wrote:
> >
> >
> > On Sun, 17 Jan 2021, Muchun Song wrote:
> >
> > > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > > index ce4be1fa93c2..3b146d5949f3 100644
> > > --- a/mm/sparse-vmemmap.c
> > > +++ b/mm/sparse-vmemmap.c
> > > @@ -29,6 +29,7 @@
> > > #include <linux/sched.h>
> > > #include <linux/pgtable.h>
> > > #include <linux/bootmem_info.h>
> > > +#include <linux/delay.h>
> > >
> > > #include <asm/dma.h>
> > > #include <asm/pgalloc.h>
> > > @@ -40,7 +41,8 @@
> > > * @remap_pte: called for each non-empty PTE (lowest-level) entry.
> > > * @reuse_page: the page which is reused for the tail vmemmap pages.
> > > * @reuse_addr: the virtual address of the @reuse_page page.
> > > - * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
> > > + * @vmemmap_pages: the list head of the vmemmap pages that can be freed
> > > + * or is mapped from.
> > > */
> > > struct vmemmap_remap_walk {
> > > void (*remap_pte)(pte_t *pte, unsigned long addr,
> > > @@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
> > > struct list_head *vmemmap_pages;
> > > };
> > >
> > > +/* The gfp mask of allocating vmemmap page */
> > > +#define GFP_VMEMMAP_PAGE \
> > > + (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
> > > +
> >
> > This is unnecessary, just use the gfp mask directly in allocator.
>
> Will do. Thanks.
>
> >
> > > static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> > > unsigned long end,
> > > struct vmemmap_remap_walk *walk)
> > > @@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned long end,
> > > free_vmemmap_page_list(&vmemmap_pages);
> > > }
> > >
> > > +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> > > + struct vmemmap_remap_walk *walk)
> > > +{
> > > + pgprot_t pgprot = PAGE_KERNEL;
> > > + struct page *page;
> > > + void *to;
> > > +
> > > + BUG_ON(pte_page(*pte) != walk->reuse_page);
> > > +
> > > + page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> > > + list_del(&page->lru);
> > > + to = page_to_virt(page);
> > > + copy_page(to, (void *)walk->reuse_addr);
> > > +
> > > + set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> > > +}
> > > +
> > > +static void alloc_vmemmap_page_list(struct list_head *list,
> > > + unsigned long start, unsigned long end)
> > > +{
> > > + unsigned long addr;
> > > +
> > > + for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > + struct page *page;
> > > + int nid = page_to_nid((const void *)addr);
> > > +
> > > +retry:
> > > + page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
> > > + if (unlikely(!page)) {
> > > + msleep(100);
> > > + /*
> > > + * We should retry infinitely, because we cannot
> > > + * handle allocation failures. Once we allocate
> > > + * vmemmap pages successfully, then we can free
> > > + * a HugeTLB page.
> > > + */
> > > + goto retry;
> >
> > Ugh, I don't think this will work, there's no guarantee that we'll ever
> > succeed and now we can't free a 2MB hugepage because we cannot allocate a
> > 4KB page. We absolutely have to ensure we make forward progress here.
>
> This can trigger a OOM when there is no memory and kill someone to release
> some memory. Right?
>
> >
> > > We're going to be freeing the hugetlb page after this succeeds, can we
> > not use part of the hugetlb page that we're freeing for this memory
> > instead?
>
> It seems a good idea. We can try to allocate memory firstly, if successful,
> just use the new page to remap (it can reduce memory fragmentation).
> If not, we can use part of the hugetlb page to remap. What's your opinion
> about this?
But what if the HugeTLB page is a gigantic page allocated from
CMA? In that case, we cannot use part of the hugetlb page to remap.
Right?
>
> >
> > > + }
> > > + list_add_tail(&page->lru, list);
> > > + }
> > > +}
> > > +
> > > +/**
> > > + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
> > > + * to the page which is from the @vmemmap_pages
> > > + * respectively.
> > > + * @start: start address of the vmemmap virtual address range.
> > > + * @end: end address of the vmemmap virtual address range.
> > > + * @reuse: reuse address.
> > > + */
> > > +void vmemmap_remap_alloc(unsigned long start, unsigned long end,
> > > + unsigned long reuse)
> > > +{
> > > + LIST_HEAD(vmemmap_pages);
> > > + struct vmemmap_remap_walk walk = {
> > > + .remap_pte = vmemmap_restore_pte,
> > > + .reuse_addr = reuse,
> > > + .vmemmap_pages = &vmemmap_pages,
> > > + };
> > > +
> > > + might_sleep();
> > > +
> > > + /* See the comment in the vmemmap_remap_free(). */
> > > + BUG_ON(start - reuse != PAGE_SIZE);
> > > +
> > > + alloc_vmemmap_page_list(&vmemmap_pages, start, end);
> > > + vmemmap_remap_range(reuse, end, &walk);
> > > +}
> > > +
> > > /*
> > > * Allocate a block of memory to be used to back the virtual memory map
> > > * or to back the page tables that are used to create the mapping.
> > > --
> > > 2.11.0
> > >
> > >
On 17.01.21 16:10, Muchun Song wrote:
> Add a kernel parameter hugetlb_free_vmemmap to enable the feature of
> freeing unused vmemmap pages associated with each hugetlb page on boot.
The patch description completely lacks an explanation of the changes made
in arch/x86/mm/init_64.c.
[...]
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -34,6 +34,7 @@
> #include <linux/gfp.h>
> #include <linux/kcore.h>
> #include <linux/bootmem_info.h>
> +#include <linux/hugetlb.h>
>
> #include <asm/processor.h>
> #include <asm/bios_ebda.h>
> @@ -1557,7 +1558,8 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> {
> int err;
>
> - if (end - start < PAGES_PER_SECTION * sizeof(struct page))
> + if (is_hugetlb_free_vmemmap_enabled() ||
> + end - start < PAGES_PER_SECTION * sizeof(struct page))
This looks irresponsible. You ignore any altmap, even though current
altmap users (ZONE_DEVICE) will not actually result in applicable
vmemmaps that huge pages could ever use.
Why do you ignore the altmap completely? This has to be properly
documented, but IMHO it's not even the right approach to mess with
altmap here.
--
Thanks,
David / dhildenb
On Mon, 25 Jan 2021, Muchun Song wrote:
> > >> I'm not sure I understand the rationale for providing this help text if
> > >> this is def_bool depending on CONFIG_HUGETLB_PAGE. Are you intending that
> > >> this is actually configurable and we want to provide guidance to the admin
> > >> on when to disable it (which it currently doesn't)? If not, why have the
> > >> help text?
> > >
> > > This is __not__ configurable. Seems like a comment to help others
> > > understand this option. Like Randy said.
> >
> > Yes, it could be written with '#' (or "comment") comment syntax instead of as help text.
>
> Got it. I will update in the next version. Thanks.
>
I'm not sure that Kconfig is the right place to document functional
behavior of the kernel, especially for non-configurable options. Seems
like this is already served by existing comments added by this patch
series in the files where the description is helpful.
Hi:
On 2021/1/17 23:10, Muchun Song wrote:
> The HUGETLB_PAGE_FREE_VMEMMAP option is used to enable the freeing
> of unnecessary vmemmap associated with HugeTLB pages. The config
> option is introduced early so that supporting code can be written
> to depend on the option. The initial version of the code only
> provides support for x86-64.
>
> Like other code which frees vmemmap, this config option depends on
> HAVE_BOOTMEM_INFO_NODE. The routine register_page_bootmem_info() is
> used to register bootmem info. Therefore, make sure
> register_page_bootmem_info is enabled if HUGETLB_PAGE_FREE_VMEMMAP
> is defined.
>
> Signed-off-by: Muchun Song <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Acked-by: Mike Kravetz <[email protected]>
> ---
> arch/x86/mm/init_64.c | 2 +-
> fs/Kconfig | 18 ++++++++++++++++++
> 2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0a45f062826e..0435bee2e172 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
>
> static void __init register_page_bootmem_info(void)
> {
> -#ifdef CONFIG_NUMA
> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
> int i;
>
> for_each_online_node(i)
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 976e8b9033c4..e7c4c2a79311 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -245,6 +245,24 @@ config HUGETLBFS
> config HUGETLB_PAGE
> def_bool HUGETLBFS
>
> +config HUGETLB_PAGE_FREE_VMEMMAP
> + def_bool HUGETLB_PAGE
> + depends on X86_64
> + depends on SPARSEMEM_VMEMMAP
> + depends on HAVE_BOOTMEM_INFO_NODE
> + help
> + The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of
> + some vmemmap pages associated with pre-allocated HugeTLB pages.
> + For example, on X86_64 6 vmemmap pages of size 4KB each can be
> + saved for each 2MB HugeTLB page. 4094 vmemmap pages of size 4KB
> + each can be saved for each 1GB HugeTLB page.
> +
> + When a HugeTLB page is allocated or freed, the vmemmap array
> + representing the range associated with the page will need to be
> + remapped. When a page is allocated, vmemmap pages are freed
> + after remapping. When a page is freed, previously discarded
> + vmemmap pages must be allocated before remapping.
> +
> config MEMFD_CREATE
> def_bool TMPFS || HUGETLBFS
>
>
LGTM. Thanks.
Reviewed-by: Miaohe Lin <[email protected]>
On Tue, Jan 26, 2021 at 2:47 AM David Rientjes <[email protected]> wrote:
>
> On Mon, 25 Jan 2021, Muchun Song wrote:
>
> > > >> I'm not sure I understand the rationale for providing this help text if
> > > >> this is def_bool depending on CONFIG_HUGETLB_PAGE. Are you intending that
> > > >> this is actually configurable and we want to provide guidance to the admin
> > > >> on when to disable it (which it currently doesn't)? If not, why have the
> > > >> help text?
> > > >
> > > > This is __not__ configurable. Seems like a comment to help others
> > > > understand this option. Like Randy said.
> > >
> > > Yes, it could be written with '#' (or "comment") comment syntax instead of as help text.
> >
> > Got it. I will update in the next version. Thanks.
> >
>
> I'm not sure that Kconfig is the right place to document functional
> behavior of the kernel, especially for non-configurable options. Seems
> like this is already served by existing comments added by this patch
> series in the files where the description is helpful.
OK. So do you mean just remove the help text here?
Thanks.
On Sun, Jan 17, 2021 at 11:10:46PM +0800, Muchun Song wrote:
> When we free a HugeTLB page to the buddy allocator, we should allocate the
> vmemmap pages associated with it. We can do that in the __free_hugepage()
> before freeing it to buddy.
>
> Signed-off-by: Muchun Song <[email protected]>
This series has reached a certain degree of maturity and improvement, but it seems
to me that we have been stuck on this patch (and patch#4) for quite some time.
Would it be acceptable for a first implementation not to allow hugetlb pages to
be freed when this feature is in use?
This would simplify things for now, as we could get rid of patch#4 and patch#5.
We can always extend functionality once this has been merged, right?
Of course, this means that e.g. memory-hotplug (hot-remove) will not fully work
when this is in place, but well.
I would like to hear what others think, but in my opinion it would be a big step
forward.
> ---
> include/linux/mm.h | 2 ++
> mm/hugetlb.c | 2 ++
> mm/hugetlb_vmemmap.c | 15 ++++++++++
> mm/hugetlb_vmemmap.h | 5 ++++
> mm/sparse-vmemmap.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> 5 files changed, 100 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f928994ed273..16b55d13b0ab 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3007,6 +3007,8 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
>
> void vmemmap_remap_free(unsigned long start, unsigned long end,
> unsigned long reuse);
> +void vmemmap_remap_alloc(unsigned long start, unsigned long end,
> + unsigned long reuse);
>
> void *sparse_buffer_alloc(unsigned long size);
> struct page * __populate_section_memmap(unsigned long pfn,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c165186ec2cf..d11c32fcdb38 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1326,6 +1326,8 @@ static void update_hpage_vmemmap_workfn(struct work_struct *work)
> page->mapping = NULL;
> h = page_hstate(page);
>
> + alloc_huge_page_vmemmap(h, page);
> +
> spin_lock(&hugetlb_lock);
> __free_hugepage(h, page);
> spin_unlock(&hugetlb_lock);
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 19f1898aaede..6108ae80314f 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -183,6 +183,21 @@ static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
> return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
> }
>
> +void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
> +{
> + unsigned long vmemmap_addr = (unsigned long)head;
> + unsigned long vmemmap_end, vmemmap_reuse;
> +
> + if (!free_vmemmap_pages_per_hpage(h))
> + return;
> +
> + vmemmap_addr += RESERVE_VMEMMAP_SIZE;
> + vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
> + vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
> +
> + vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse);
> +}
> +
> void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> {
> unsigned long vmemmap_addr = (unsigned long)head;
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> index 01f8637adbe0..b2c8d2f11d48 100644
> --- a/mm/hugetlb_vmemmap.h
> +++ b/mm/hugetlb_vmemmap.h
> @@ -11,6 +11,7 @@
> #include <linux/hugetlb.h>
>
> #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
> +void alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
> void free_huge_page_vmemmap(struct hstate *h, struct page *head);
>
> /*
> @@ -25,6 +26,10 @@ static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
> return 0;
> }
> #else
> +static inline void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
> +{
> +}
> +
> static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> {
> }
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index ce4be1fa93c2..3b146d5949f3 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -29,6 +29,7 @@
> #include <linux/sched.h>
> #include <linux/pgtable.h>
> #include <linux/bootmem_info.h>
> +#include <linux/delay.h>
>
> #include <asm/dma.h>
> #include <asm/pgalloc.h>
> @@ -40,7 +41,8 @@
> * @remap_pte: called for each non-empty PTE (lowest-level) entry.
> * @reuse_page: the page which is reused for the tail vmemmap pages.
> * @reuse_addr: the virtual address of the @reuse_page page.
> - * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
> + * @vmemmap_pages: the list head of the vmemmap pages that can be freed
> + * or is mapped from.
> */
> struct vmemmap_remap_walk {
> void (*remap_pte)(pte_t *pte, unsigned long addr,
> @@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
> struct list_head *vmemmap_pages;
> };
>
> +/* The gfp mask of allocating vmemmap page */
> +#define GFP_VMEMMAP_PAGE \
> + (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
> +
> static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> unsigned long end,
> struct vmemmap_remap_walk *walk)
> @@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned long end,
> free_vmemmap_page_list(&vmemmap_pages);
> }
>
> +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> + struct vmemmap_remap_walk *walk)
> +{
> + pgprot_t pgprot = PAGE_KERNEL;
> + struct page *page;
> + void *to;
> +
> + BUG_ON(pte_page(*pte) != walk->reuse_page);
> +
> + page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> + list_del(&page->lru);
> + to = page_to_virt(page);
> + copy_page(to, (void *)walk->reuse_addr);
> +
> + set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> +}
> +
> +static void alloc_vmemmap_page_list(struct list_head *list,
> + unsigned long start, unsigned long end)
> +{
> + unsigned long addr;
> +
> + for (addr = start; addr < end; addr += PAGE_SIZE) {
> + struct page *page;
> + int nid = page_to_nid((const void *)addr);
> +
> +retry:
> + page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
> + if (unlikely(!page)) {
> + msleep(100);
> + /*
> + * We should retry infinitely, because we cannot
> + * handle allocation failures. Once we allocate
> + * vmemmap pages successfully, then we can free
> + * a HugeTLB page.
> + */
> + goto retry;
> + }
> + list_add_tail(&page->lru, list);
> + }
> +}
> +
> +/**
> + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
> + * to the page which is from the @vmemmap_pages
> + * respectively.
> + * @start: start address of the vmemmap virtual address range.
> + * @end: end address of the vmemmap virtual address range.
> + * @reuse: reuse address.
> + */
> +void vmemmap_remap_alloc(unsigned long start, unsigned long end,
> + unsigned long reuse)
> +{
> + LIST_HEAD(vmemmap_pages);
> + struct vmemmap_remap_walk walk = {
> + .remap_pte = vmemmap_restore_pte,
> + .reuse_addr = reuse,
> + .vmemmap_pages = &vmemmap_pages,
> + };
> +
> + might_sleep();
> +
> + /* See the comment in the vmemmap_remap_free(). */
> + BUG_ON(start - reuse != PAGE_SIZE);
> +
> + alloc_vmemmap_page_list(&vmemmap_pages, start, end);
> + vmemmap_remap_range(reuse, end, &walk);
> +}
> +
> /*
> * Allocate a block of memory to be used to back the virtual memory map
> * or to back the page tables that are used to create the mapping.
> --
> 2.11.0
>
>
--
Oscar Salvador
SUSE L3
On 26.01.21 10:29, Oscar Salvador wrote:
> On Sun, Jan 17, 2021 at 11:10:46PM +0800, Muchun Song wrote:
>> When we free a HugeTLB page to the buddy allocator, we should allocate the
>> vmemmap pages associated with it. We can do that in the __free_hugepage()
>> before freeing it to buddy.
>>
>> Signed-off-by: Muchun Song <[email protected]>
>
> This series has reached a certain degree of maturity and improvement, but it seems
> to me that we have been stuck in this patch (and patch#4) for quite some time.
>
> Would it be acceptable for a first implementation to not let hugetlb pages
> be freed when this feature is in use?
> This would simplify things for now, as we could get rid of patch#4 and patch#5.
> We can always extend functionality once this has been merged, right?
I think we should either keep it completely simple (only free the vmemmap of
hugetlb pages allocated early during boot, which is not sufficient for
some use cases) or implement the full thing properly (meaning, solve
the most challenging issues to get the basics running).
I don't want to have some easy parts of complex features merged (e.g.,
breaking other stuff as you indicate below), and later find out "it's
not that easy" again and be stuck with it forever.
>
> Of course, this means that e.g: memory-hotplug (hot-remove) will not fully work
> when this in place, but well.
Can you elaborate? Are we talking about having hugepages in
ZONE_MOVABLE that are not migratable (and/or dissolvable) anymore? Then
a clear NACK from my side.
--
Thanks,
David / dhildenb
On 26.01.21 15:58, Oscar Salvador wrote:
> On Tue, Jan 26, 2021 at 10:36:21AM +0100, David Hildenbrand wrote:
>> I think either keep it completely simple (only free vmemmap of hugetlb
>> pages allocated early during boot - which is what's not sufficient for
>> some use cases) or implement the full thing properly (meaning, solve
>> most challenging issues to get the basics running).
>>
>> I don't want to have some easy parts of complex features merged (e.g.,
>> breaking other stuff as you indicate below), and later finding out "it's
>> not that easy" again and being stuck with it forever.
>
> Well, we could try to do an optimistic allocation, without tricky loopings.
> If that fails, refuse to shrink the pool at that moment.
>
> The user could always try to shrink it later via /proc/sys/vm/nr_hugepages
> interface.
>
> But I am just thinking out loud..
The real issue seems to be discarding the vmemmap on any memory that has
movability constraints - CMA and ZONE_MOVABLE; otherwise, as discussed,
we can reuse parts of the thingy we're freeing for the vmemmap. Not that
it would be ideal: that once-a-huge-page thing will never ever be a huge
page again - but if it helps with OOM in corner cases, sure.
Possible simplification: don't perform the optimization for now with
free huge pages residing on ZONE_MOVABLE or CMA. Certainly not perfect:
what happens when migrating a huge page from ZONE_NORMAL to
(ZONE_MOVABLE|CMA)?
>
>>> Of course, this means that e.g: memory-hotplug (hot-remove) will not fully work
>>> when this in place, but well.
>>
>> Can you elaborate? Are we're talking about having hugepages in
>> ZONE_MOVABLE that are not migratable (and/or dissolvable) anymore? Than
>> a clear NACK from my side.
>
> Pretty much, yeah.
Note that we will most likely soon have to tackle migrating/dissolving (free)
hugetlbfs pages from alloc_contig_range() context - e.g., for CMA
allocations. That's certainly something to keep in mind regarding any
approaches that already break offline_pages().
--
Thanks,
David / dhildenb
On Tue, Jan 26, 2021 at 04:10:53PM +0100, David Hildenbrand wrote:
> The real issue seems to be discarding the vmemmap on any memory that has
> movability constraints - CMA and ZONE_MOVABLE; otherwise, as discussed, we
> can reuse parts of the thingy we're freeing for the vmemmap. Not that it
> would be ideal: that once-a-huge-page thing will never ever be a huge page
> again - but if it helps with OOM in corner cases, sure.
Yes, that is one way, but I am not sure how hard it would be to implement.
Plus the fact that, as you pointed out, once that memory is used for the vmemmap
array, we cannot use it again.
Actually, wouldn't we fragment the memory eventually?
> Possible simplification: don't perform the optimization for now with free
> huge pages residing on ZONE_MOVABLE or CMA. Certainly not perfect: what
> happens when migrating a huge page from ZONE_NORMAL to (ZONE_MOVABLE|CMA)?
But if we do not allow those pages to be in ZONE_MOVABLE or CMA, there is no
point in migrating them, right?
> > > > Of course, this means that e.g: memory-hotplug (hot-remove) will not fully work
> > > > when this in place, but well.
> > >
> > > Can you elaborate? Are we're talking about having hugepages in
> > > ZONE_MOVABLE that are not migratable (and/or dissolvable) anymore? Than
> > > a clear NACK from my side.
> >
> > Pretty much, yeah.
>
> Note that we most likely soon have to tackle migrating/dissolving (free)
> hugetlbfs pages from alloc_contig_range() context - e.g., for CMA
> allocations. That's certainly something to keep in mind regarding any
> approaches that already break offline_pages().
Definitely. I already talked to Mike about that and I am going to have
a look into it pretty soon.
--
Oscar Salvador
SUSE L3
On 26.01.21 16:34, Oscar Salvador wrote:
> On Tue, Jan 26, 2021 at 04:10:53PM +0100, David Hildenbrand wrote:
>> The real issue seems to be discarding the vmemmap on any memory that has
>> movability constraints - CMA and ZONE_MOVABLE; otherwise, as discussed, we
>> can reuse parts of the thingy we're freeing for the vmemmap. Not that it
>> would be ideal: that once-a-huge-page thing will never ever be a huge page
>> again - but if it helps with OOM in corner cases, sure.
>
> Yes, that is one way, but I am not sure how hard would it be to implement.
> Plus the fact that as you pointed out, once that memory is used for vmemmap
> array, we cannot use it again.
> Actually, we would fragment the memory eventually?
>
>> Possible simplification: don't perform the optimization for now with free
>> huge pages residing on ZONE_MOVABLE or CMA. Certainly not perfect: what
>> happens when migrating a huge page from ZONE_NORMAL to (ZONE_MOVABLE|CMA)?
>
> But if we do not allow theose pages to be in ZONE_MOVABLE or CMA, there is no
> point in migrate them, right?
Well, memory unplug "could" still work and migrate them and
alloc_contig_range() "could in the future" still want to migrate them
(virtio-mem, gigantic pages, powernv memtrace). In particular, the latter
two don't work with ZONE_MOVABLE/CMA. But, I mean, it would be fair
enough to say "there are no guarantees for
alloc_contig_range()/offline_pages() with ZONE_NORMAL, so we can break
these use cases when a magic switch is flipped and make these pages
non-migratable".
I assume compaction doesn't care about huge pages either way, not sure
about numa balancing etc.
However, note that there is a fundamental issue with any approach that
allocates a significant amount of unmovable memory for user-space
purposes (excluding CMA allocations for unmovable stuff, CMA is
special): pairing it with ZONE_MOVABLE becomes very tricky as your user
space might just end up eating all kernel memory, although the system
still looks like there is plenty of free memory residing in
ZONE_MOVABLE. I mentioned that in the context of secretmem in a reduced
form as well.
We theoretically have that issue with dynamic allocation of gigantic
pages, but it's something a user explicitly/rarely triggers and it can
be documented well enough as causing problems. We'll have the same issue
with GUP+ZONE_MOVABLE that Pavel is fixing right now - but GUP is
already known to be broken in various ways and has to be treated
in a special way. I'd like to limit the nasty corner cases.
Of course, we could have smart rules like "don't online memory to
ZONE_MOVABLE automatically when the magic switch is active". That's just
ugly, but could work.
--
Thanks,
David / dhildenb
On Mon, Jan 25, 2021 at 12:43:23PM +0100, David Hildenbrand wrote:
> > - if (end - start < PAGES_PER_SECTION * sizeof(struct page))
> > + if (is_hugetlb_free_vmemmap_enabled() ||
> > + end - start < PAGES_PER_SECTION * sizeof(struct page))
>
> This looks irresponsible. You ignore any altmap, even though current
> altmap users (ZONE_DEVICE) will not actually result in applicable
> vmemmaps that huge pages could ever use.
>
> Why do you ignore the altmap completely? This has to be properly
> documented, but IMHO it's not even the right approach to mess with
> altmap here.
The goal was not to ignore the altmap but to disable PMD mapping of sections
when the feature was enabled.
Shame on me, I did not notice that with this change the altmap would be ignored.
Something like the below, maybe:
int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
		struct vmem_altmap *altmap)
{
	int err;
	bool populate_base_pages = false;

	if ((end - start < PAGES_PER_SECTION * sizeof(struct page)) ||
	    (is_hugetlb_free_vmemmap_enabled() && !altmap))
		populate_base_pages = true;

	if (populate_base_pages) {
		err = vmemmap_populate_basepages(start, end, node, NULL);
	} else if (boot_cpu_has(X86_FEATURE_PSE)) {
	....
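The condition suggested above can be modelled in userspace as a plain
predicate. This is only a sketch: sketch_use_base_pages() and its parameters
are hypothetical stand-ins for the kernel symbols, and the section vmemmap size
is passed in as an argument instead of being derived from PAGES_PER_SECTION.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the decision in the proposed vmemmap_populate():
 * returns true when the range should be populated with base pages.
 */
static bool sketch_use_base_pages(unsigned long start, unsigned long end,
                                  unsigned long section_vmemmap_size,
                                  bool feature_enabled, const void *altmap)
{
    /* Ranges smaller than one section's vmemmap always use base pages. */
    if (end - start < section_vmemmap_size)
        return true;

    /* With the feature on, avoid PMD mappings only when no altmap is given. */
    return feature_enabled && altmap == NULL;
}
```

With this shape, a caller that supplies an altmap keeps the existing
behaviour whether or not the feature is enabled, which is the point of the
review comment.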
>
> --
> Thanks,
>
> David / dhildenb
>
>
--
Oscar Salvador
SUSE L3
On Tue, Jan 26, 2021 at 10:36:21AM +0100, David Hildenbrand wrote:
> I think either keep it completely simple (only free vmemmap of hugetlb
> pages allocated early during boot - which is what's not sufficient for
> some use cases) or implement the full thing properly (meaning, solve
> most challenging issues to get the basics running).
>
> I don't want to have some easy parts of complex features merged (e.g.,
> breaking other stuff as you indicate below), and later finding out "it's
> not that easy" again and being stuck with it forever.
Well, we could try to do an optimistic allocation, without tricky looping.
If that fails, refuse to shrink the pool at that moment.
The user could always try to shrink it later via the /proc/sys/vm/nr_hugepages
interface.
But I am just thinking out loud..
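A minimal userspace sketch of that idea, assuming an all-or-nothing pass over
the needed pages: one attempt per page, rollback on the first failure, and the
pool size left untouched so that a later attempt can be made. All sketch_*
names are hypothetical, malloc()/free() stand in for the page allocator, and
the budget allocator exists only to simulate failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical allocator hook so that failure can be simulated. */
typedef void *(*sketch_alloc_fn)(size_t size);

/* Helper for experiments: succeeds only while the budget lasts. */
static int sketch_budget;

static void *sketch_budget_alloc(size_t size)
{
    if (sketch_budget <= 0)
        return NULL;
    sketch_budget--;
    return malloc(size);
}

/*
 * Try to allocate @nr pages in a single pass. On the first failure, free
 * what was obtained so far and report failure without retrying.
 */
static bool sketch_alloc_pages_optimistic(void **pages, size_t nr,
                                          size_t page_size,
                                          sketch_alloc_fn alloc)
{
    size_t i;

    for (i = 0; i < nr; i++) {
        pages[i] = alloc(page_size);
        if (!pages[i]) {
            while (i--)
                free(pages[i]);
            return false;
        }
    }
    return true;
}

/* Shrink the pool by one huge page only if all vmemmap pages were obtained. */
static bool sketch_try_shrink_pool(unsigned long *nr_hugepages,
                                   size_t nr_vmemmap, sketch_alloc_fn alloc)
{
    void *pages[8];
    size_t i;

    if (*nr_hugepages == 0 || nr_vmemmap > 8)
        return false;
    if (!sketch_alloc_pages_optimistic(pages, nr_vmemmap, 4096, alloc))
        return false;   /* pool unchanged; the caller may retry later */

    /* A real implementation would remap the vmemmap here; the sketch frees. */
    for (i = 0; i < nr_vmemmap; i++)
        free(pages[i]);
    (*nr_hugepages)--;
    return true;
}
```

The design choice being illustrated is that a failed attempt has no side
effects, so retrying later through the same interface is safe.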
> > Of course, this means that e.g: memory-hotplug (hot-remove) will not fully work
> > when this in place, but well.
>
> Can you elaborate? Are we're talking about having hugepages in
> ZONE_MOVABLE that are not migratable (and/or dissolvable) anymore? Than
> a clear NACK from my side.
Pretty much, yeah.
--
Oscar Salvador
SUSE L3
On Mon, Jan 25, 2021 at 7:43 PM David Hildenbrand <[email protected]> wrote:
>
> On 17.01.21 16:10, Muchun Song wrote:
> > Add a kernel parameter hugetlb_free_vmemmap to enable the feature of
> > freeing unused vmemmap pages associated with each hugetlb page on boot.
>
> The description completely lacks a description of the changes performed
> in arch/x86/mm/init_64.c.
Will update. Thanks.
>
> [...]
>
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -34,6 +34,7 @@
> > #include <linux/gfp.h>
> > #include <linux/kcore.h>
> > #include <linux/bootmem_info.h>
> > +#include <linux/hugetlb.h>
> >
> > #include <asm/processor.h>
> > #include <asm/bios_ebda.h>
> > @@ -1557,7 +1558,8 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> > {
> > int err;
> >
> > - if (end - start < PAGES_PER_SECTION * sizeof(struct page))
> > + if (is_hugetlb_free_vmemmap_enabled() ||
> > + end - start < PAGES_PER_SECTION * sizeof(struct page))
>
> This looks irresponsible. You ignore any altmap, even though current
> altmap users (ZONE_DEVICE) will not actually result in applicable
> vmemmaps that huge pages could ever use.
>
> Why do you ignore the altmap completely? This has to be properly
> documented, but IMHO it's not even the right approach to mess with
> altmap here.
Thanks for reminding me of this. Sorry, I did not notice that either.
>
> --
> Thanks,
>
> David / dhildenb
>
On Mon, Jan 25, 2021 at 8:08 PM Oscar Salvador <[email protected]> wrote:
>
> On Mon, Jan 25, 2021 at 12:43:23PM +0100, David Hildenbrand wrote:
> > > - if (end - start < PAGES_PER_SECTION * sizeof(struct page))
> > > + if (is_hugetlb_free_vmemmap_enabled() ||
> > > + end - start < PAGES_PER_SECTION * sizeof(struct page))
> >
> > This looks irresponsible. You ignore any altmap, even though current
> > altmap users (ZONE_DEVICE) will not actually result in applicable
> > vmemmaps that huge pages could ever use.
> >
> > Why do you ignore the altmap completely? This has to be properly
> > documented, but IMHO it's not even the right approach to mess with
> > altmap here.
>
> The goal was not to ignore altmap but to disable PMD mapping sections
> when the feature was enabled.
> Shame on me I did not notice that with this, altmap will be ignored.
>
> Something like below maybe:
Yeah, Thanks a lot.
>
> int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> struct vmem_altmap *altmap)
> {
> int err;
> bool populate_base_pages = false;
>
> if ((end - start < PAGES_PER_SECTION * sizeof(struct page)) ||
> (is_hugetlb_free_vmemmap_enabled() && !altmap))
> populate_base_pages = true;
>
> if (populate_base_pages) {
> err = vmemmap_populate_basepages(start, end, node, NULL);
> } else if (boot_cpu_has(X86_FEATURE_PSE)) {
> ....
>
>
> >
> > --
> > Thanks,
> >
> > David / dhildenb
> >
> >
>
> --
> Oscar Salvador
> SUSE L3
On 25.01.21 08:41, Muchun Song wrote:
> On Mon, Jan 25, 2021 at 2:40 PM Muchun Song <[email protected]> wrote:
>>
>> On Mon, Jan 25, 2021 at 8:05 AM David Rientjes <[email protected]> wrote:
>>>
>>>
>>> On Sun, 17 Jan 2021, Muchun Song wrote:
>>>
>>>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>>>> index ce4be1fa93c2..3b146d5949f3 100644
>>>> --- a/mm/sparse-vmemmap.c
>>>> +++ b/mm/sparse-vmemmap.c
>>>> @@ -29,6 +29,7 @@
>>>> #include <linux/sched.h>
>>>> #include <linux/pgtable.h>
>>>> #include <linux/bootmem_info.h>
>>>> +#include <linux/delay.h>
>>>>
>>>> #include <asm/dma.h>
>>>> #include <asm/pgalloc.h>
>>>> @@ -40,7 +41,8 @@
>>>> * @remap_pte: called for each non-empty PTE (lowest-level) entry.
>>>> * @reuse_page: the page which is reused for the tail vmemmap pages.
>>>> * @reuse_addr: the virtual address of the @reuse_page page.
>>>> - * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
>>>> + * @vmemmap_pages: the list head of the vmemmap pages that can be freed
>>>> + * or is mapped from.
>>>> */
>>>> struct vmemmap_remap_walk {
>>>> void (*remap_pte)(pte_t *pte, unsigned long addr,
>>>> @@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
>>>> struct list_head *vmemmap_pages;
>>>> };
>>>>
>>>> +/* The gfp mask of allocating vmemmap page */
>>>> +#define GFP_VMEMMAP_PAGE \
>>>> + (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
>>>> +
>>>
>>> This is unnecessary, just use the gfp mask directly in allocator.
>>
>> Will do. Thanks.
>>
>>>
>>>> static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
>>>> unsigned long end,
>>>> struct vmemmap_remap_walk *walk)
>>>> @@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned long end,
>>>> free_vmemmap_page_list(&vmemmap_pages);
>>>> }
>>>>
>>>> +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>>>> + struct vmemmap_remap_walk *walk)
>>>> +{
>>>> + pgprot_t pgprot = PAGE_KERNEL;
>>>> + struct page *page;
>>>> + void *to;
>>>> +
>>>> + BUG_ON(pte_page(*pte) != walk->reuse_page);
>>>> +
>>>> + page = list_first_entry(walk->vmemmap_pages, struct page, lru);
>>>> + list_del(&page->lru);
>>>> + to = page_to_virt(page);
>>>> + copy_page(to, (void *)walk->reuse_addr);
>>>> +
>>>> + set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>>>> +}
>>>> +
>>>> +static void alloc_vmemmap_page_list(struct list_head *list,
>>>> + unsigned long start, unsigned long end)
>>>> +{
>>>> + unsigned long addr;
>>>> +
>>>> + for (addr = start; addr < end; addr += PAGE_SIZE) {
>>>> + struct page *page;
>>>> + int nid = page_to_nid((const void *)addr);
>>>> +
>>>> +retry:
>>>> + page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
>>>> + if (unlikely(!page)) {
>>>> + msleep(100);
>>>> + /*
>>>> + * We should retry infinitely, because we cannot
>>>> + * handle allocation failures. Once we allocate
>>>> + * vmemmap pages successfully, then we can free
>>>> + * a HugeTLB page.
>>>> + */
>>>> + goto retry;
>>>
>>> Ugh, I don't think this will work, there's no guarantee that we'll ever
>>> succeed and now we can't free a 2MB hugepage because we cannot allocate a
>>> 4KB page. We absolutely have to ensure we make forward progress here.
>>
>> This can trigger a OOM when there is no memory and kill someone to release
>> some memory. Right?
>>
>>>
>>> We're going to be freeing the hugetlb page after this succeeds, can we
>>> not use part of the hugetlb page that we're freeing for this memory
>>> instead?
>>
>> It seems a good idea. We can try to allocate memory firstly, if successful,
>> just use the new page to remap (it can reduce memory fragmentation).
>> If not, we can use part of the hugetlb page to remap. What's your opinion
>> about this?
>
> If the HugeTLB page is a gigantic page which is allocated from
> CMA. In this case, we cannot use part of the hugetlb page to remap.
> Right?
Right; and I don't think the "reuse part of a huge page as vmemmap while
freeing, while that part itself might not have a proper vmemmap yet (or
might cover itself now)" is particularly straightforward. Maybe I'm
wrong :)
Also, watch out for huge pages on ZONE_MOVABLE, in that case you also
shouldn't allocate the vmemmap from there ...
--
Thanks,
David / dhildenb
On Tue, 26 Jan 2021, Muchun Song wrote:
> > I'm not sure that Kconfig is the right place to document functional
> > behavior of the kernel, especially for non-configurable options. Seems
> > like this is already served by existing comments added by this patch
> > series in the files where the description is helpful.
>
> OK. So do you mean just remove the help text here?
>
Yeah, I'd suggest removing the help text from a non-configurable Kconfig
option and double checking that its substance is available elsewhere (like
the giant comment in mm/hugetlb_vmemmap.c).
On 26.01.21 16:56, David Hildenbrand wrote:
> On 26.01.21 16:34, Oscar Salvador wrote:
>> On Tue, Jan 26, 2021 at 04:10:53PM +0100, David Hildenbrand wrote:
>>> The real issue seems to be discarding the vmemmap on any memory that has
>>> movability constraints - CMA and ZONE_MOVABLE; otherwise, as discussed, we
>>> can reuse parts of the thingy we're freeing for the vmemmap. Not that it
>>> would be ideal: that once-a-huge-page thing will never ever be a huge page
>>> again - but if it helps with OOM in corner cases, sure.
>>
>> Yes, that is one way, but I am not sure how hard would it be to implement.
>> Plus the fact that as you pointed out, once that memory is used for vmemmap
>> array, we cannot use it again.
>> Actually, we would fragment the memory eventually?
>>
>>> Possible simplification: don't perform the optimization for now with free
>>> huge pages residing on ZONE_MOVABLE or CMA. Certainly not perfect: what
>>> happens when migrating a huge page from ZONE_NORMAL to (ZONE_MOVABLE|CMA)?
>>
>> But if we do not allow theose pages to be in ZONE_MOVABLE or CMA, there is no
>> point in migrate them, right?
>
> Well, memory unplug "could" still work and migrate them and
> alloc_contig_range() "could in the future" still want to migrate them
> (virtio-mem, gigantic pages, powernv memtrace). Especially, the latter
> two don't work with ZONE_MOVABLE/CMA. But, I mean, it would be fair
> enough to say "there are no guarantees for
> alloc_contig_range()/offline_pages() with ZONE_NORMAL, so we can break
> these use cases when a magic switch is flipped and make these pages
> non-migratable anymore".
>
> I assume compaction doesn't care about huge pages either way, not sure
> about numa balancing etc.
>
>
> However, note that there is a fundamental issue with any approach that
> allocates a significant amount of unmovable memory for user-space
> purposes (excluding CMA allocations for unmovable stuff, CMA is
> special): pairing it with ZONE_MOVABLE becomes very tricky as your user
> space might just end up eating all kernel memory, although the system
> still looks like there is plenty of free memory residing in
> ZONE_MOVABLE. I mentioned that in the context of secretmem in a reduced
> form as well.
>
> We theoretically have that issue with dynamic allocation of gigantic
> pages, but it's something a user explicitly/rarely triggers and it can
> be documented to cause problems well enough. We'll have the same issue
> with GUP+ZONE_MOVABLE that Pavel is fixing right now - but GUP is
> already known to be broken in various ways and that it has to be treated
> in a special way. I'd like to limit the nasty corner cases.
>
> Of course, we could have smart rules like "don't online memory to
> ZONE_MOVABLE automatically when the magic switch is active". That's just
> ugly, but could work.
>
Extending on that, I just discovered that only x86-64, ppc64, and arm64
really support hugepage migration.
Maybe one approach with the "magic switch" really would be to disable
hugepage migration completely in hugepage_migration_supported(), and
consequently making hugepage_movable_supported() always return false.
Huge pages would never get placed onto ZONE_MOVABLE/CMA and cannot be
migrated. The problem I describe would apply (careful with using
ZONE_MOVABLE), but well, it can at least be documented.
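As a rough userspace model of that "magic switch" (not the kernel code): once
the switch is on, the migration check reports false, and the movable check,
which depends on it, follows. The struct, the flag, and the sketch_* functions
are hypothetical, and the model deliberately omits any other conditions the
real helpers may apply.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct hstate. */
struct sketch_hstate {
    bool arch_can_migrate;  /* stand-in for the architecture-specific check */
};

/* The "magic switch": true when vmemmap freeing is enabled. */
static bool sketch_free_vmemmap_enabled;

/* Models hugepage_migration_supported() with the switch added. */
static bool sketch_migration_supported(const struct sketch_hstate *h)
{
    if (sketch_free_vmemmap_enabled)
        return false;
    return h->arch_can_migrate;
}

/* Models hugepage_movable_supported(): never movable if not migratable. */
static bool sketch_movable_supported(const struct sketch_hstate *h)
{
    return sketch_migration_supported(h);
}
```

In this model, flipping the switch is enough to keep huge pages away from
movable placement, at the cost of the ZONE_MOVABLE caveat described above.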
--
Thanks,
David / dhildenb
On Wed, Jan 27, 2021 at 6:36 PM David Hildenbrand <[email protected]> wrote:
>
> On 26.01.21 16:56, David Hildenbrand wrote:
> > On 26.01.21 16:34, Oscar Salvador wrote:
> >> On Tue, Jan 26, 2021 at 04:10:53PM +0100, David Hildenbrand wrote:
> >>> The real issue seems to be discarding the vmemmap on any memory that has
> >>> movability constraints - CMA and ZONE_MOVABLE; otherwise, as discussed, we
> >>> can reuse parts of the thingy we're freeing for the vmemmap. Not that it
> >>> would be ideal: that once-a-huge-page thing will never ever be a huge page
> >>> again - but if it helps with OOM in corner cases, sure.
> >>
> >> Yes, that is one way, but I am not sure how hard would it be to implement.
> >> Plus the fact that as you pointed out, once that memory is used for vmemmap
> >> array, we cannot use it again.
> >> Actually, we would fragment the memory eventually?
> >>
> >>> Possible simplification: don't perform the optimization for now with free
> >>> huge pages residing on ZONE_MOVABLE or CMA. Certainly not perfect: what
> >>> happens when migrating a huge page from ZONE_NORMAL to (ZONE_MOVABLE|CMA)?
> >>
> >> But if we do not allow theose pages to be in ZONE_MOVABLE or CMA, there is no
> >> point in migrate them, right?
> >
> > Well, memory unplug "could" still work and migrate them and
> > alloc_contig_range() "could in the future" still want to migrate them
> > (virtio-mem, gigantic pages, powernv memtrace). Especially, the latter
> > two don't work with ZONE_MOVABLE/CMA. But, I mean, it would be fair
> > enough to say "there are no guarantees for
> > alloc_contig_range()/offline_pages() with ZONE_NORMAL, so we can break
> > these use cases when a magic switch is flipped and make these pages
> > non-migratable anymore".
> >
> > I assume compaction doesn't care about huge pages either way, not sure
> > about numa balancing etc.
> >
> >
> > However, note that there is a fundamental issue with any approach that
> > allocates a significant amount of unmovable memory for user-space
> > purposes (excluding CMA allocations for unmovable stuff, CMA is
> > special): pairing it with ZONE_MOVABLE becomes very tricky as your user
> > space might just end up eating all kernel memory, although the system
> > still looks like there is plenty of free memory residing in
> > ZONE_MOVABLE. I mentioned that in the context of secretmem in a reduced
> > form as well.
> >
> > We theoretically have that issue with dynamic allocation of gigantic
> > pages, but it's something a user explicitly/rarely triggers and it can
> > be documented to cause problems well enough. We'll have the same issue
> > with GUP+ZONE_MOVABLE that Pavel is fixing right now - but GUP is
> > already known to be broken in various ways and that it has to be treated
> > in a special way. I'd like to limit the nasty corner cases.
> >
> > Of course, we could have smart rules like "don't online memory to
> > ZONE_MOVABLE automatically when the magic switch is active". That's just
> > ugly, but could work.
> >
>
> Extending on that, I just discovered that only x86-64, ppc64, and arm64
> really support hugepage migration.
>
> Maybe one approach with the "magic switch" really would be to disable
> hugepage migration completely in hugepage_migration_supported(), and
> consequently making hugepage_movable_supported() always return false.
>
> Huge pages would never get placed onto ZONE_MOVABLE/CMA and cannot be
> migrated. The problem I describe would apply (careful with using
> ZONE_MOVABLE), but well, it can at least be documented.
Thanks for your explanation.
All of these concerns seem to stem from the possibility of hitting OOM. :-(
In order to move forward and still be able to free the hugepage, we should add
the restrictions below.
1. Only free the hugepage which is allocated from ZONE_NORMAL.
2. Disable hugepage migration when this feature is enabled.
3. Use GFP_ATOMIC to allocate the vmemmap pages first (this can reduce
memory fragmentation); if that fails, use part of the hugepage to
remap.
Hi Oscar, Mike and David H,
What's your opinion about this? Should we take this approach?
Thanks.
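One possible reading of restrictions 1 and 3 can be written down as a small
decision function. This is a sketch under assumptions: it takes restriction 1
to mean that only huge pages in ZONE_NORMAL have their vmemmap freed (so only
those need it restored), and it models the fallback as giving up part of the
huge page itself. Every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_zone { SKETCH_ZONE_NORMAL, SKETCH_ZONE_MOVABLE, SKETCH_ZONE_CMA };
enum sketch_source { SKETCH_SRC_NONE, SKETCH_SRC_ALLOCATOR, SKETCH_SRC_HUGEPAGE };

struct sketch_plan {
    enum sketch_source source;  /* where restored vmemmap pages come from */
    unsigned long to_buddy;     /* base pages handed back to the allocator */
};

/* Assumed reading of restriction 1: only ZONE_NORMAL pages are optimized. */
static bool sketch_vmemmap_was_freed(enum sketch_zone zone)
{
    return zone == SKETCH_ZONE_NORMAL;
}

/*
 * Decide how to restore the vmemmap when freeing one huge page made of
 * @nr_base base pages that needs @nr_vmemmap vmemmap pages back.
 */
static struct sketch_plan sketch_plan_free(enum sketch_zone zone,
                                           unsigned long nr_base,
                                           unsigned long nr_vmemmap,
                                           bool atomic_alloc_ok)
{
    struct sketch_plan plan = { SKETCH_SRC_NONE, nr_base };

    if (!sketch_vmemmap_was_freed(zone))
        return plan;            /* nothing to restore */

    if (atomic_alloc_ok) {
        plan.source = SKETCH_SRC_ALLOCATOR;
        return plan;            /* whole huge page goes back */
    }

    /* Fallback: carve the vmemmap pages out of the huge page itself. */
    plan.source = SKETCH_SRC_HUGEPAGE;
    plan.to_buddy = nr_base - nr_vmemmap;
    return plan;
}
```

In the fallback case fewer base pages are returned than the huge page
contained, which matches the earlier remark in this thread that such a range
can never become a huge page again.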
>
> --
> Thanks,
>
> David / dhildenb
>
On Thu, Jan 28, 2021 at 8:37 PM Muchun Song <[email protected]> wrote:
>
> On Wed, Jan 27, 2021 at 6:36 PM David Hildenbrand <[email protected]> wrote:
> >
> > On 26.01.21 16:56, David Hildenbrand wrote:
> > > On 26.01.21 16:34, Oscar Salvador wrote:
> > >> On Tue, Jan 26, 2021 at 04:10:53PM +0100, David Hildenbrand wrote:
> > >>> The real issue seems to be discarding the vmemmap on any memory that has
> > >>> movability constraints - CMA and ZONE_MOVABLE; otherwise, as discussed, we
> > >>> can reuse parts of the thingy we're freeing for the vmemmap. Not that it
> > >>> would be ideal: that once-a-huge-page thing will never ever be a huge page
> > >>> again - but if it helps with OOM in corner cases, sure.
> > >>
> > >> Yes, that is one way, but I am not sure how hard would it be to implement.
> > >> Plus the fact that as you pointed out, once that memory is used for vmemmap
> > >> array, we cannot use it again.
> > >> Actually, we would fragment the memory eventually?
> > >>
> > >>> Possible simplification: don't perform the optimization for now with free
> > >>> huge pages residing on ZONE_MOVABLE or CMA. Certainly not perfect: what
> > >>> happens when migrating a huge page from ZONE_NORMAL to (ZONE_MOVABLE|CMA)?
> > >>
> > >> But if we do not allow theose pages to be in ZONE_MOVABLE or CMA, there is no
> > >> point in migrate them, right?
> > >
> > > Well, memory unplug "could" still work and migrate them and
> > > alloc_contig_range() "could in the future" still want to migrate them
> > > (virtio-mem, gigantic pages, powernv memtrace). Especially, the latter
> > > two don't work with ZONE_MOVABLE/CMA. But, I mean, it would be fair
> > > enough to say "there are no guarantees for
> > > alloc_contig_range()/offline_pages() with ZONE_NORMAL, so we can break
> > > these use cases when a magic switch is flipped and make these pages
> > > non-migratable anymore".
> > >
> > > I assume compaction doesn't care about huge pages either way, not sure
> > > about numa balancing etc.
> > >
> > >
> > > However, note that there is a fundamental issue with any approach that
> > > allocates a significant amount of unmovable memory for user-space
> > > purposes (excluding CMA allocations for unmovable stuff, CMA is
> > > special): pairing it with ZONE_MOVABLE becomes very tricky as your user
> > > space might just end up eating all kernel memory, although the system
> > > still looks like there is plenty of free memory residing in
> > > ZONE_MOVABLE. I mentioned that in the context of secretmem in a reduced
> > > form as well.
> > >
> > > We theoretically have that issue with dynamic allocation of gigantic
> > > pages, but it's something a user explicitly/rarely triggers and it can
> > > be documented to cause problems well enough. We'll have the same issue
> > > with GUP+ZONE_MOVABLE that Pavel is fixing right now - but GUP is
> > > already known to be broken in various ways and that it has to be treated
> > > in a special way. I'd like to limit the nasty corner cases.
> > >
> > > Of course, we could have smart rules like "don't online memory to
> > > ZONE_MOVABLE automatically when the magic switch is active". That's just
> > > ugly, but could work.
> > >
> >
> > Extending on that, I just discovered that only x86-64, ppc64, and arm64
> > really support hugepage migration.
> >
> > Maybe one approach with the "magic switch" really would be to disable
> > hugepage migration completely in hugepage_migration_supported(), and
> > consequently making hugepage_movable_supported() always return false.
> >
> > Huge pages would never get placed onto ZONE_MOVABLE/CMA and cannot be
> > migrated. The problem I describe would apply (careful with using
> > ZONE_MOVABLE), but well, it can at least be documented.
>
> Thanks for your explanation.
>
> All thinking seems to be introduced by encountering OOM. :-(
>
> In order to move forward and free the hugepage, we should add the
> restrictions below.
>
> 1. Only free the hugepage which is allocated from the ZONE_NORMAL.
^^
Sorry. Here "free" should be "optimize".
> 2. Disable hugepage migration when this feature is enabled.
> 3. Using GFP_ATOMIC to allocate vmemmap pages firstly (it can reduce
> memory fragmentation), if it fails, we use part of the hugepage to
> remap.
>
> Hi Oscar, Mike and David H
>
> What's your opinion about this? Should we take this approach?
>
> Thanks.
>
> >
> > --
> > Thanks,
> >
> > David / dhildenb
> >
On Wed, Jan 27, 2021 at 11:36:15AM +0100, David Hildenbrand wrote:
> Extending on that, I just discovered that only x86-64, ppc64, and arm64
> really support hugepage migration.
>
> Maybe one approach with the "magic switch" really would be to disable
> hugepage migration completely in hugepage_migration_supported(), and
> consequently making hugepage_movable_supported() always return false.
Ok, so migration would not work for these pages, and since they would
lie in !ZONE_MOVABLE there is no guarantee we can unplug the memory.
Well, we really cannot unplug it unless the hugepage is unused
(in which case it can at least be dissolved).
Now to the allocation-when-freeing.
The current implementation uses (or wants to use) GFP_ATOMIC plus an endless loop.
One of the problems I see with GFP_ATOMIC is that it gives you access
to memory reserves, but there are other users of those reserves.
Then, in the worst case, we need to allocate 16MB of order-0 pages
to free up a 1GB hugepage, so the question would be whether the reserves
really scale to 16MB plus the other users accessing them.
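As a rough check of the 16MB figure, here is a minimal sketch of the
arithmetic. It assumes 4KB base pages and a 64-byte struct page (typical
for x86-64, but not stated in this thread); the helper names are made up
for illustration.

```c
#include <assert.h>

/* Assumed values for x86-64; neither is taken from this thread. */
#define BASE_PAGE_SIZE   4096UL	/* 4KB base page */
#define STRUCT_PAGE_SIZE 64UL	/* typical sizeof(struct page) */

/* Bytes of vmemmap needed to describe one huge page of the given size. */
static unsigned long vmemmap_bytes(unsigned long hugepage_size)
{
	assert(hugepage_size % BASE_PAGE_SIZE == 0);
	return hugepage_size / BASE_PAGE_SIZE * STRUCT_PAGE_SIZE;
}

/* Number of order-0 pages needed to hold that vmemmap (rounded up). */
static unsigned long vmemmap_pages(unsigned long hugepage_size)
{
	return (vmemmap_bytes(hugepage_size) + BASE_PAGE_SIZE - 1) / BASE_PAGE_SIZE;
}
```

Under those assumptions a 1GB hugepage needs 262144 page structs, i.e.
16MB or 4096 order-0 pages, and a 2MB hugepage needs 8 pages.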
As I said, if anything I would go for an optimistic allocation attempt;
if it fails, just refuse to shrink the pool.
The user can always try to shrink it again later via the /sys interface.
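To make the "optimistic attempt" idea concrete, here is a user-space
sketch, not kernel code. It tries each allocation once, and on the first
failure it releases what it got and reports -ENOMEM so the caller can
refuse to shrink the pool. All names (try_alloc_vmemmap_pages,
budgeted_alloc, alloc_budget) are hypothetical, and malloc() stands in
for the page allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define BASE_PAGE_SIZE 4096UL

/* Hypothetical allocator hook so the failure path can be exercised. */
typedef void *(*page_alloc_fn)(size_t size);

/* Stand-in allocator: succeeds until the budget is used up, then fails. */
static size_t alloc_budget;

static void *budgeted_alloc(size_t size)
{
	if (alloc_budget == 0)
		return NULL;
	alloc_budget--;
	return malloc(size);
}

/*
 * Try to allocate nr pages into pages[]. On the first failure, free
 * everything allocated so far and return -ENOMEM. There is no retry
 * loop and no dipping into reserves.
 */
static int try_alloc_vmemmap_pages(void **pages, size_t nr, page_alloc_fn alloc)
{
	size_t i;

	assert(pages != NULL);
	for (i = 0; i < nr; i++) {
		pages[i] = alloc(BASE_PAGE_SIZE);
		if (!pages[i]) {
			while (i--)
				free(pages[i]);
			return -ENOMEM;
		}
	}
	return 0;
}
```

A caller would treat -ENOMEM as "leave the hugepage in the pool" and let
the user retry later.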
Since hugepages would no longer be in ZONE_MOVABLE/CMA and are not
expected to be migratable, is that ok?
Using the hugepage for the vmemmap array was brought up several times,
but that would imply fragmenting memory over time.
All in all, this seems to be overly complicated (I might be wrong).
> Huge pages would never get placed onto ZONE_MOVABLE/CMA and cannot be
> migrated. The problem I describe would apply (careful with using
> ZONE_MOVABLE), but well, it can at least be documented.
I am not a page allocator expert, but can't the allocation fall back
to ZONE_MOVABLE under memory shortage in other zones?
--
Oscar Salvador
SUSE L3
On 1/28/21 4:37 AM, Muchun Song wrote:
> On Wed, Jan 27, 2021 at 6:36 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 26.01.21 16:56, David Hildenbrand wrote:
>>> On 26.01.21 16:34, Oscar Salvador wrote:
>>>> On Tue, Jan 26, 2021 at 04:10:53PM +0100, David Hildenbrand wrote:
>>>>> The real issue seems to be discarding the vmemmap on any memory that has
>>>>> movability constraints - CMA and ZONE_MOVABLE; otherwise, as discussed, we
>>>>> can reuse parts of the thingy we're freeing for the vmemmap. Not that it
>>>>> would be ideal: that once-a-huge-page thing will never ever be a huge page
>>>>> again - but if it helps with OOM in corner cases, sure.
>>>>
>>>> Yes, that is one way, but I am not sure how hard would it be to implement.
>>>> Plus the fact that as you pointed out, once that memory is used for vmemmap
>>>> array, we cannot use it again.
>>>> Actually, we would fragment the memory eventually?
>>>>
>>>>> Possible simplification: don't perform the optimization for now with free
>>>>> huge pages residing on ZONE_MOVABLE or CMA. Certainly not perfect: what
>>>>> happens when migrating a huge page from ZONE_NORMAL to (ZONE_MOVABLE|CMA)?
>>>>
>>>> But if we do not allow those pages to be in ZONE_MOVABLE or CMA, there is no
>>>> point in migrating them, right?
>>>
>>> Well, memory unplug "could" still work and migrate them and
>>> alloc_contig_range() "could in the future" still want to migrate them
>>> (virtio-mem, gigantic pages, powernv memtrace). Especially, the latter
>>> two don't work with ZONE_MOVABLE/CMA. But, I mean, it would be fair
>>> enough to say "there are no guarantees for
>>> alloc_contig_range()/offline_pages() with ZONE_NORMAL, so we can break
>>> these use cases when a magic switch is flipped and make these pages
>>> non-migratable anymore".
>>>
>>> I assume compaction doesn't care about huge pages either way, not sure
>>> about numa balancing etc.
>>>
>>>
>>> However, note that there is a fundamental issue with any approach that
>>> allocates a significant amount of unmovable memory for user-space
>>> purposes (excluding CMA allocations for unmovable stuff, CMA is
>>> special): pairing it with ZONE_MOVABLE becomes very tricky as your user
>>> space might just end up eating all kernel memory, although the system
>>> still looks like there is plenty of free memory residing in
>>> ZONE_MOVABLE. I mentioned that in the context of secretmem in a reduced
>>> form as well.
>>>
>>> We theoretically have that issue with dynamic allocation of gigantic
>>> pages, but it's something a user explicitly/rarely triggers and it can
>>> be documented to cause problems well enough. We'll have the same issue
>>> with GUP+ZONE_MOVABLE that Pavel is fixing right now - but GUP is
>>> already known to be broken in various ways and that it has to be treated
>>> in a special way. I'd like to limit the nasty corner cases.
>>>
>>> Of course, we could have smart rules like "don't online memory to
>>> ZONE_MOVABLE automatically when the magic switch is active". That's just
>>> ugly, but could work.
>>>
>>
>> Extending on that, I just discovered that only x86-64, ppc64, and arm64
>> really support hugepage migration.
>>
>> Maybe one approach with the "magic switch" really would be to disable
>> hugepage migration completely in hugepage_migration_supported(), and
>> consequently making hugepage_movable_supported() always return false.
>>
>> Huge pages would never get placed onto ZONE_MOVABLE/CMA and cannot be
>> migrated. The problem I describe would apply (careful with using
>> ZONE_MOVABLE), but well, it can at least be documented.
>
> Thanks for your explanation.
>
> All thinking seems to be introduced by encountering OOM. :-(
Yes. Or, I think about it as the problem of not being able to dissolve (free
to buddy) a hugetlb page. We cannot dissolve because we cannot allocate
vmemmap for all subpages.
> In order to move forward and free the hugepage, we should add the
> restrictions below.
>
> 1. Only free the hugepage which is allocated from the ZONE_NORMAL.
Corrected: Only vmemmap optimize hugepages in ZONE_NORMAL
> 2. Disable hugepage migration when this feature is enabled.
I am not sure if we want to fully disable migration. I may be misunderstanding,
but the thought was to prevent migration between some movability types. It
seems we should be able to migrate from ZONE_NORMAL to ZONE_NORMAL.
Also, if we do allow huge pages without vmemmap optimization in MOVABLE or CMA,
then we should allow those to be migrated to NORMAL? Or is there a reason why
we should prevent that?
> 3. Using GFP_ATOMIC to allocate vmemmap pages firstly (it can reduce
> memory fragmentation), if it fails, we use part of the hugepage to
> remap.
I honestly am not sure about this. This would only happen for pages in
NORMAL. The only time using part of the huge page for vmemmap would help is
if we are trying to dissolve huge pages to free up memory for other uses.
> What's your opinion about this? Should we take this approach?
I think trying to solve all the issues that could happen as the result of
not being able to dissolve a hugetlb page has made this extremely complex.
I know this is something we need to address/solve. We do not want to add
more unexpected behavior in corner cases. However, I can not help but think
about similar issues today. For example, if a huge page is in use in
ZONE_MOVABLE or CMA there is no guarantee that it can be migrated today.
Correct? We may need to allocate another huge page for the target of the
migration, and there is no guarantee we can do that.
--
Mike Kravetz
On Fri, Jan 29, 2021 at 6:29 AM Oscar Salvador <[email protected]> wrote:
>
> On Wed, Jan 27, 2021 at 11:36:15AM +0100, David Hildenbrand wrote:
> > Extending on that, I just discovered that only x86-64, ppc64, and arm64
> > really support hugepage migration.
> >
> > Maybe one approach with the "magic switch" really would be to disable
> > hugepage migration completely in hugepage_migration_supported(), and
> > consequently making hugepage_movable_supported() always return false.
>
> Ok, so migration would not work for these pages, and since they would
> lie in !ZONE_MOVABLE there is no guarantee we can unplug the memory.
> Well, we really cannot unplug it unless the hugepage is unused
> (in which case it can at least be dissolved).
>
> Now to the allocation-when-freeing.
> The current implementation uses (or wants to use) GFP_ATOMIC plus an endless loop.
> One of the problems I see with GFP_ATOMIC is that it gives you access
> to memory reserves, but there are other users of those reserves.
> Then, in the worst case, we need to allocate 16MB of order-0 pages
> to free up a 1GB hugepage, so the question would be whether the reserves
> really scale to 16MB plus the other users accessing them.
>
> As I said, if anything I would go for an optimistic allocation attempt;
> if it fails, just refuse to shrink the pool.
> The user can always try to shrink it again later via the /sys interface.
Yeah. It seems that this is the easy way to move on.
Thanks.
>
> Since hugepages would no longer be in ZONE_MOVABLE/CMA and are not
> expected to be migratable, is that ok?
>
> Using the hugepage for the vmemmap array was brought up several times,
> but that would imply fragmenting memory over time.
>
> All in all, this seems to be overly complicated (I might be wrong).
>
>
> > Huge pages would never get placed onto ZONE_MOVABLE/CMA and cannot be
> > migrated. The problem I describe would apply (careful with using
> > ZONE_MOVABLE), but well, it can at least be documented.
>
> I am not a page allocator expert, but can't the allocation fall back
> to ZONE_MOVABLE under memory shortage in other zones?
>
>
> --
> Oscar Salvador
> SUSE L3
On Fri, Jan 29, 2021 at 9:04 AM Mike Kravetz <[email protected]> wrote:
>
> On 1/28/21 4:37 AM, Muchun Song wrote:
> > On Wed, Jan 27, 2021 at 6:36 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 26.01.21 16:56, David Hildenbrand wrote:
> >>> On 26.01.21 16:34, Oscar Salvador wrote:
> >>>> On Tue, Jan 26, 2021 at 04:10:53PM +0100, David Hildenbrand wrote:
> >>>>> The real issue seems to be discarding the vmemmap on any memory that has
> >>>>> movability constraints - CMA and ZONE_MOVABLE; otherwise, as discussed, we
> >>>>> can reuse parts of the thingy we're freeing for the vmemmap. Not that it
> >>>>> would be ideal: that once-a-huge-page thing will never ever be a huge page
> >>>>> again - but if it helps with OOM in corner cases, sure.
> >>>>
> >>>> Yes, that is one way, but I am not sure how hard would it be to implement.
> >>>> Plus the fact that as you pointed out, once that memory is used for vmemmap
> >>>> array, we cannot use it again.
> >>>> Actually, we would fragment the memory eventually?
> >>>>
> >>>>> Possible simplification: don't perform the optimization for now with free
> >>>>> huge pages residing on ZONE_MOVABLE or CMA. Certainly not perfect: what
> >>>>> happens when migrating a huge page from ZONE_NORMAL to (ZONE_MOVABLE|CMA)?
> >>>>
> >>>> But if we do not allow those pages to be in ZONE_MOVABLE or CMA, there is no
> >>>> point in migrating them, right?
> >>>
> >>> Well, memory unplug "could" still work and migrate them and
> >>> alloc_contig_range() "could in the future" still want to migrate them
> >>> (virtio-mem, gigantic pages, powernv memtrace). Especially, the latter
> >>> two don't work with ZONE_MOVABLE/CMA. But, I mean, it would be fair
> >>> enough to say "there are no guarantees for
> >>> alloc_contig_range()/offline_pages() with ZONE_NORMAL, so we can break
> >>> these use cases when a magic switch is flipped and make these pages
> >>> non-migratable anymore".
> >>>
> >>> I assume compaction doesn't care about huge pages either way, not sure
> >>> about numa balancing etc.
> >>>
> >>>
> >>> However, note that there is a fundamental issue with any approach that
> >>> allocates a significant amount of unmovable memory for user-space
> >>> purposes (excluding CMA allocations for unmovable stuff, CMA is
> >>> special): pairing it with ZONE_MOVABLE becomes very tricky as your user
> >>> space might just end up eating all kernel memory, although the system
> >>> still looks like there is plenty of free memory residing in
> >>> ZONE_MOVABLE. I mentioned that in the context of secretmem in a reduced
> >>> form as well.
> >>>
> >>> We theoretically have that issue with dynamic allocation of gigantic
> >>> pages, but it's something a user explicitly/rarely triggers and it can
> >>> be documented to cause problems well enough. We'll have the same issue
> >>> with GUP+ZONE_MOVABLE that Pavel is fixing right now - but GUP is
> >>> already known to be broken in various ways and that it has to be treated
> >>> in a special way. I'd like to limit the nasty corner cases.
> >>>
> >>> Of course, we could have smart rules like "don't online memory to
> >>> ZONE_MOVABLE automatically when the magic switch is active". That's just
> >>> ugly, but could work.
> >>>
> >>
> >> Extending on that, I just discovered that only x86-64, ppc64, and arm64
> >> really support hugepage migration.
> >>
> >> Maybe one approach with the "magic switch" really would be to disable
> >> hugepage migration completely in hugepage_migration_supported(), and
> >> consequently making hugepage_movable_supported() always return false.
> >>
> >> Huge pages would never get placed onto ZONE_MOVABLE/CMA and cannot be
> >> migrated. The problem I describe would apply (careful with using
> >> ZONE_MOVABLE), but well, it can at least be documented.
> >
> > Thanks for your explanation.
> >
> > All thinking seems to be introduced by encountering OOM. :-(
>
> Yes. Or, I think about it as the problem of not being able to dissolve (free
> to buddy) a hugetlb page. We cannot dissolve because we cannot allocate
> vmemmap for all subpages.
>
> > In order to move forward and free the hugepage, we should add the
> > restrictions below.
> >
> > 1. Only free the hugepage which is allocated from the ZONE_NORMAL.
> Corrected: Only vmemmap optimize hugepages in ZONE_NORMAL
>
> > 2. Disable hugepage migration when this feature is enabled.
>
> I am not sure if we want to fully disable migration. I may be misunderstanding,
> but the thought was to prevent migration between some movability types. It
> seems we should be able to migrate from ZONE_NORMAL to ZONE_NORMAL.
>
> Also, if we do allow huge pages without vmemmap optimization in MOVABLE or CMA,
> then we should allow those to be migrated to NORMAL? Or is there a reason why
> we should prevent that?
>
> > 3. Using GFP_ATOMIC to allocate vmemmap pages firstly (it can reduce
> > memory fragmentation), if it fails, we use part of the hugepage to
> > remap.
>
> I honestly am not sure about this. This would only happen for pages in
> NORMAL. The only time using part of the huge page for vmemmap would help is
> if we are trying to dissolve huge pages to free up memory for other uses.
>
> > What's your opinion about this? Should we take this approach?
>
> I think trying to solve all the issues that could happen as the result of
> not being able to dissolve a hugetlb page has made this extremely complex.
> I know this is something we need to address/solve. We do not want to add
> more unexpected behavior in corner cases. However, I can not help but think
> about similar issues today. For example, if a huge page is in use in
> ZONE_MOVABLE or CMA there is no guarantee that it can be migrated today.
> Correct? We may need to allocate another huge page for the target of the
> migration, and there is no guarantee we can do that.
Yeah. Adding more restrictions makes things more complex. As you
and Oscar said, refusing to free a hugepage when allocating
vmemmap pages fails may be the easy way for now.
> --
> Mike Kravetz
On 28.01.21 23:29, Oscar Salvador wrote:
> On Wed, Jan 27, 2021 at 11:36:15AM +0100, David Hildenbrand wrote:
>> Extending on that, I just discovered that only x86-64, ppc64, and arm64
>> really support hugepage migration.
>>
>> Maybe one approach with the "magic switch" really would be to disable
>> hugepage migration completely in hugepage_migration_supported(), and
>> consequently making hugepage_movable_supported() always return false.
>
> Ok, so migration would not work for these pages, and since they would
> lie in !ZONE_MOVABLE there is no guarantee we can unplug the memory.
> Well, we really cannot unplug it unless the hugepage is unused
> (in which case it can at least be dissolved).
>
> Now to the allocation-when-freeing.
> The current implementation uses (or wants to use) GFP_ATOMIC plus an endless loop.
> One of the problems I see with GFP_ATOMIC is that it gives you access
> to memory reserves, but there are other users of those reserves.
> Then, in the worst case, we need to allocate 16MB of order-0 pages
> to free up a 1GB hugepage, so the question would be whether the reserves
> really scale to 16MB plus the other users accessing them.
>
> As I said, if anything I would go for an optimistic allocation attempt;
> if it fails, just refuse to shrink the pool.
> The user can always try to shrink it again later via the /sys interface.
>
> Since hugepages would no longer be in ZONE_MOVABLE/CMA and are not
> expected to be migratable, is that ok?
>
> Using the hugepage for the vmemmap array was brought up several times,
> but that would imply fragmenting memory over time.
>
> All in all, this seems to be overly complicated (I might be wrong).
>
>
>> Huge pages would never get placed onto ZONE_MOVABLE/CMA and cannot be
>> migrated. The problem I describe would apply (careful with using
>> ZONE_MOVABLE), but well, it can at least be documented.
>
> I am not a page allocator expert, but can't the allocation fall back
> to ZONE_MOVABLE under memory shortage in other zones?
No, for now it's not done. Only movable allocations target ZONE_MOVABLE.
Doing so would be controversial: when would be the right point in time
to start spilling unmovable allocations into CMA/ZONE_MOVABLE? You
certainly want to try other things first (swapping, reclaim,
compaction), before breaking any guarantees regarding
hotunplug+migration/compaction you have with CMA/ZONE_MOVABLE. And even
if you would allow it, your workload would already suffer extremely.
So it smells more like a setup issue. But then, who knows when
allocating huge pages (esp. at runtime) that there are such side effects
before actually running into them?
We can make sure that all relevant archs support migration of ordinary
(!gigantic) huge pages (for now, only x86-64, ppc64/spapr, arm64), so we
can place them onto ZONE_MOVABLE. It gets harder with more special cases.
Gigantic pages (without CMA) are more of a general issue, but at least
it's simple to document ("Careful when pairing ZONE_MOVABLE with
gigantic pages on !CMA").
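The placement rule described above can be written down as a small
predicate. This is only a user-space model with hypothetical names
(struct hpage_model, may_use_zone_movable); it is not the kernel's
hugepage_movable_supported() and makes no claim about how that function
is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a huge page size class; not the kernel's struct hstate. */
struct hpage_model {
	bool arch_migration;	/* arch can migrate ordinary huge pages */
	bool gigantic;		/* gigantic page, e.g. 1GB on x86-64 */
};

/*
 * Placing a huge page on ZONE_MOVABLE is only reasonable if it can be
 * migrated away again: the arch must support migration and the page
 * must not be gigantic.
 */
static bool may_use_zone_movable(const struct hpage_model *h)
{
	assert(h != NULL);
	return h->arch_migration && !h->gigantic;
}
```

Every extra special case (such as the "magic switch") would add another
condition to this predicate, which is why it gets harder to document.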
An unexpectedly high amount of unmovable memory is just extremely
difficult to handle with ZONE_MOVABLE; it's hard for the user/admin to
figure out that such restrictions actually apply.
--
Thanks,
David / dhildenb
>> What's your opinion about this? Should we take this approach?
>
> I think trying to solve all the issues that could happen as the result of
> not being able to dissolve a hugetlb page has made this extremely complex.
> I know this is something we need to address/solve. We do not want to add
> more unexpected behavior in corner cases. However, I can not help but think
> about similar issues today. For example, if a huge page is in use in
> ZONE_MOVABLE or CMA there is no guarantee that it can be migrated today.
Yes, hugetlbfs is broken with alloc_contig_range() as used by, e.g., CMA,
and needs fixing. Then, problems similar to those with hugetlbfs pages on
ZONE_MOVABLE apply.
hugetlbfs pages on ZONE_MOVABLE for memory unplug are, I think,
problematic in corner cases only:
1. Insufficient memory to allocate a destination page. Well, nothing
we can really do about that - just like trying to migrate any other
memory but running into -ENOMEM.
2. Trying to dissolve a free huge page but running into reservation
limits. I think we should at least try allocating a new free huge page
before failing. To be tackled in the future.
> Correct? We may need to allocate another huge page for the target of the
> migration, and there is no guarantee we can do that.
>
I agree that 1. is similar to "cannot migrate because OOM".
So thinking about it again, we don't actually seem to lose that much when
a) Rejecting migration of a huge page when not being able to allocate
the vmemmap for our source page. Our system seems to be under quite some
memory pressure already. Migration could just fail because we fail to
allocate a migration target already.
b) Refusing to dissolve a huge page when not able to allocate the
vmemmap. Dissolving can fail already. And, again, our system seems to be
under quite some memory pressure already.
c) Rejecting freeing huge pages when not able to allocate the vmemmap. I
guess the "only" surprise is that the user might now no longer get what
he asked for. This seems to be the "real change".
So maybe little actually speaks against allowing for migration of such
huge pages and optimizing any huge page, besides rejecting freeing of
huge pages and surprising the user/admin.
I guess while our system is under memory pressure CMA and ZONE_MOVABLE
are already no longer able to always keep their guarantees - until there
is no more memory pressure.
--
Thanks,
David / dhildenb
On 2/1/21 8:10 AM, David Hildenbrand wrote:
>>> What's your opinion about this? Should we take this approach?
>>
>> I think trying to solve all the issues that could happen as the result of
>> not being able to dissolve a hugetlb page has made this extremely complex.
>> I know this is something we need to address/solve. We do not want to add
>> more unexpected behavior in corner cases. However, I can not help but think
>> about similar issues today. For example, if a huge page is in use in
>> ZONE_MOVABLE or CMA there is no guarantee that it can be migrated today.
>
> Yes, hugetlbfs is broken with alloc_contig_range() as used by, e.g., CMA, and needs fixing. Then, problems similar to those with hugetlbfs pages on ZONE_MOVABLE apply.
>
>
> hugetlbfs pages on ZONE_MOVABLE for memory unplug are, I think, problematic in corner cases only:
>
> 1. Insufficient memory to allocate a destination page. Well, nothing we can really do about that - just like trying to migrate any other memory but running into -ENOMEM.
>
> 2. Trying to dissolve a free huge page but running into reservation limits. I think we should at least try allocating a new free huge page before failing. To be tackled in the future.
>
>> Correct? We may need to allocate another huge page for the target of the
>> migration, and there is no guarantee we can do that.
>>
>
> I agree that 1. is similar to "cannot migrate because OOM".
>
>
> So thinking about it again, we don't actually seem to lose that much when
>
> a) Rejecting migration of a huge page when not being able to allocate the vmemmap for our source page. Our system seems to be under quite some memory pressure already. Migration could just fail because we fail to allocate a migration target already.
>
> b) Refusing to dissolve a huge page when not able to allocate the vmemmap. Dissolving can fail already. And, again, our system seems to be under quite some memory pressure already.
>
> c) Rejecting freeing huge pages when not able to allocate the vmemmap. I guess the "only" surprise is that the user might now no longer get what he asked for. This seems to be the "real change".
>
> So maybe little actually speaks against allowing for migration of such huge pages and optimizing any huge page, besides rejecting freeing of huge pages and surprising the user/admin.
>
> I guess while our system is under memory pressure CMA and ZONE_MOVABLE are already no longer able to always keep their guarantees - until there is no more memory pressure.
>
My thinking was similar. Failing to dissolve a hugetlb page because we could
not allocate vmemmap pages is not much/any worse than what we do when near
OOM conditions today. As for surprising the user/admin, we should certainly
log a warning if we cannot dissolve a hugetlb page.
One point David R brought up is still a bit concerning. When getting close
to OOM, there may be users/code that will try to dissolve free hugetlb pages
to give back as much memory as possible to buddy. I've seen users holding
'big chunks' of memory for a specific purpose and dumping them when needed.
They were not doing this with hugetlb pages, but nothing would surprise me.
In this series, vmemmap freeing is 'opt in' at boot time. I would expect
the use cases that want to opt in to rarely, if ever, free/dissolve hugetlb
pages. But I could be wrong.
--
Mike Kravetz