2020-09-02 18:10:25

by Zi Yan

Subject: [RFC PATCH 00/16] 1GB THP support on x86_64

From: Zi Yan <[email protected]>

Hi all,

This patchset adds support for 1GB THP on x86_64. It is on top of
v5.9-rc2-mmots-2020-08-25-21-13.

Compared to hugetlb, 1GB THP is more flexible: it reduces address translation
overhead and improves the performance of applications with large memory
footprints without requiring application changes.

Design
=======

The 1GB THP implementation looks similar to the existing THP code, except for
some new designs for the additional page table level.

1. Page table deposit and withdraw using a new pagechain data structure:
instead of one PTE page table page, 1GB THP requires 513 page table pages
(one PMD page table page and 512 PTE page table pages) to be deposited
at page allocation time, so that we can split the page later. Currently,
the page table deposit uses ->lru, so only one page can be deposited.
A new pagechain data structure is added to enable multi-page deposit.

2. Triple-mapped 1GB THP: a 1GB THP can be mapped by a combination of PUD, PMD,
and PTE entries. Mixing PUD and PTE mappings can be achieved with the existing
PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
page in a 1GB THP and sub_compound_mapcount counts the PMD mappings by using
page[N*512 + 3].compound_mapcount.

3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is saner
to use something less intrusive. So all 1GB THPs are allocated from reserved
CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
THP is cleared, as the resulting pages can be freed via the normal page free
path. We can fall back to alloc_contig_pages for 1GB THP if necessary.
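
The page-table arithmetic behind items 1 and 2 can be checked with a small
userspace sketch. It assumes x86_64 with 4KB base pages and 512 entries per
page table. The helper names pud_thp_deposit_pages() and sub_mapcount_index()
are hypothetical and only mirror the numbers quoted above; they are not
functions from this series.

```c
#include <assert.h>

/* Assumed x86_64 geometry: 4KB base pages, 512 entries per page table. */
#define PTRS_PER_TABLE 512UL
#define HPAGE_PMD_NR   PTRS_PER_TABLE                    /* 4KB pages per 2MB */
#define HPAGE_PUD_NR   (PTRS_PER_TABLE * PTRS_PER_TABLE) /* 4KB pages per 1GB */

/*
 * Page table pages to deposit for one 1GB THP so that it can later be
 * split all the way down to PTEs: one PMD table plus one PTE table for
 * each of its 512 PMD entries.
 */
static unsigned long pud_thp_deposit_pages(void)
{
	return 1 + PTRS_PER_TABLE;
}

/*
 * Index of the subpage whose compound_mapcount field holds
 * sub_compound_mapcount for the n-th 2MB-aligned chunk of the 1GB THP,
 * following the page[N*512 + 3] rule from the cover letter.
 */
static unsigned long sub_mapcount_index(unsigned long n)
{
	assert(n < PTRS_PER_TABLE);	/* a 1GB THP has 512 such chunks */
	return n * HPAGE_PMD_NR + 3;
}
```

Under these assumptions the last chunk (n = 511) stores its counter at index
511 * 512 + 3, which is still inside the 262144 subpages of the 1GB THP.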


Patch Organization
=======

Patch 01 adds the new pagechain data structure.

Patches 02 to 13 add 1GB THP support in various places.

Patch 14 tries to use alloc_contig_pages for 1GB THP allocation.

Patch 15 moves the hugetlb_cma reservation to cma.c and renames it to hugepage_cma.

Patch 16 uses the hugepage_cma reservation for 1GB THP allocation.


Any suggestions and comments are welcome.


Zi Yan (16):
mm: add pagechain container for storing multiple pages.
mm: thp: 1GB anonymous page implementation.
mm: proc: add 1GB THP kpageflag.
mm: thp: 1GB THP copy on write implementation.
mm: thp: handling 1GB THP reference bit.
mm: thp: add 1GB THP split_huge_pud_page() function.
mm: stats: make smap stats understand PUD THPs.
mm: page_vma_walk: teach it about PMD-mapped PUD THP.
mm: thp: 1GB THP support in try_to_unmap().
mm: thp: split 1GB THPs at page reclaim.
mm: thp: 1GB THP follow_p*d_page() support.
mm: support 1GB THP pagemap support.
mm: thp: add a knob to enable/disable 1GB THPs.
mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
hugetlb: cma: move cma reserve function to cma.c.
mm: thp: use cma reservation for pud thp allocation.

.../admin-guide/kernel-parameters.txt | 2 +-
arch/arm64/mm/hugetlbpage.c | 2 +-
arch/powerpc/mm/hugetlbpage.c | 2 +-
arch/x86/include/asm/pgalloc.h | 68 ++
arch/x86/include/asm/pgtable.h | 26 +
arch/x86/kernel/setup.c | 8 +-
arch/x86/mm/pgtable.c | 38 +
drivers/base/node.c | 3 +
fs/proc/meminfo.c | 2 +
fs/proc/page.c | 2 +
fs/proc/task_mmu.c | 122 ++-
include/linux/cma.h | 18 +
include/linux/huge_mm.h | 84 +-
include/linux/hugetlb.h | 12 -
include/linux/memcontrol.h | 5 +
include/linux/mm.h | 29 +-
include/linux/mm_types.h | 1 +
include/linux/mmu_notifier.h | 13 +
include/linux/mmzone.h | 1 +
include/linux/page-flags.h | 47 +
include/linux/pagechain.h | 73 ++
include/linux/pgtable.h | 34 +
include/linux/rmap.h | 10 +-
include/linux/swap.h | 2 +
include/linux/vm_event_item.h | 7 +
include/uapi/linux/kernel-page-flags.h | 2 +
kernel/events/uprobes.c | 4 +-
kernel/fork.c | 5 +
mm/cma.c | 119 +++
mm/gup.c | 60 +-
mm/huge_memory.c | 939 +++++++++++++++++-
mm/hugetlb.c | 114 +--
mm/internal.h | 2 +
mm/khugepaged.c | 6 +-
mm/ksm.c | 4 +-
mm/memcontrol.c | 13 +
mm/memory.c | 51 +-
mm/mempolicy.c | 21 +-
mm/migrate.c | 12 +-
mm/page_alloc.c | 57 +-
mm/page_vma_mapped.c | 129 ++-
mm/pgtable-generic.c | 56 ++
mm/rmap.c | 289 ++++--
mm/swap.c | 31 +
mm/swap_slots.c | 2 +
mm/swapfile.c | 8 +-
mm/userfaultfd.c | 2 +-
mm/util.c | 16 +-
mm/vmscan.c | 58 +-
mm/vmstat.c | 8 +
50 files changed, 2270 insertions(+), 349 deletions(-)
create mode 100644 include/linux/pagechain.h

--
2.28.0


2020-09-02 18:11:30

by Zi Yan

Subject: [RFC PATCH 16/16] mm: thp: use cma reservation for pud thp allocation.

From: Zi Yan <[email protected]>

Share the hugepage_cma reservation with hugetlb for PUD THP allocation.
The reserved CMA regions can still be used for movable page allocations.

During a 1GB page split, all subpages are cleared from the CMA bitmap,
since they are no longer part of a 1GB page and will be freed via the normal
path instead of cma_release().

Signed-off-by: Zi Yan <[email protected]>
---
include/linux/cma.h | 3 +++
include/linux/huge_mm.h | 10 ++++++++++
mm/cma.c | 31 +++++++++++++++++++++++++++++++
mm/huge_memory.c | 30 ++++++++++++++++++++++++++++++
mm/mempolicy.c | 12 +++++++++---
mm/page_alloc.c | 3 ++-
6 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index abcf7ab712f9..b765d19e4052 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -46,6 +46,9 @@ extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
bool no_warn);
extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);

+extern bool cma_clear_bitmap_if_in_range(struct cma *cma, const struct page *page,
+ unsigned int count);
+
extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);

extern void cma_reserve(int min_order, unsigned long requested_size,
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3bf8d8a09f08..5a45877055bb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -24,6 +24,8 @@ extern struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
unsigned long addr,
pud_t *pud,
unsigned int flags);
+extern struct page *alloc_thp_pud_page(int nid);
+extern bool free_thp_pud_page(struct page *page, int order);
#else
static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
{
@@ -43,6 +45,14 @@ struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
{
return NULL;
}
+static inline struct page *alloc_thp_pud_page(int nid)
+{
+ return NULL;
+}
+static inline bool free_thp_pud_page(struct page *page, int order)
+{
+ return false;
+}
#endif

extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
diff --git a/mm/cma.c b/mm/cma.c
index aa3a17d8a191..3f721b8f7ccd 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -532,6 +532,37 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
return true;
}

+/**
+ * cma_clear_bitmap_if_in_range() - clear the bitmap for a range of pages
+ * @cma: Contiguous memory region that the pages were allocated from.
+ * @pages: First page of the range.
+ * @count: Number of pages in the range.
+ *
+ * This function clears the bitmap of memory allocated by cma_alloc() without
+ * freeing the pages. It returns false when the provided pages do not lie
+ * entirely within the contiguous area and true otherwise.
+ */
+bool cma_clear_bitmap_if_in_range(struct cma *cma, const struct page *pages,
+ unsigned int count)
+{
+ unsigned long pfn;
+
+ if (!cma || !pages)
+ return false;
+
+ pfn = page_to_pfn(pages);
+
+ if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
+ return false;
+
+ if (pfn + count > cma->base_pfn + cma->count)
+ return false;
+
+ cma_clear_bitmap(cma, pfn, count);
+
+ return true;
+}
+
int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data)
{
int i;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e1440a13da63..2020b843fd97 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,7 @@
#include <linux/oom.h>
#include <linux/numa.h>
#include <linux/page_owner.h>
+#include <linux/cma.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -64,6 +65,10 @@ static struct shrinker deferred_split_shrinker;
static atomic_t huge_zero_refcount;
struct page *huge_zero_page __read_mostly;

+#ifdef CONFIG_CMA
+extern struct cma *hugepage_cma[MAX_NUMNODES];
+#endif
+
bool transparent_hugepage_enabled(struct vm_area_struct *vma)
{
/* The addr is used to check if the vma size fits */
@@ -2526,6 +2531,13 @@ static void __split_huge_pud_page(struct page *page, struct list_head *list,
/* no file-back page support yet */
VM_BUG_ON(!PageAnon(page));

+ /* subpages are freed normally; clear them from the CMA bitmap */
+ if (IS_ENABLED(CONFIG_CMA)) {
+ struct cma *cma = hugepage_cma[page_to_nid(head)];
+ VM_BUG_ON(!cma_clear_bitmap_if_in_range(cma, head,
+ thp_nr_pages(head)));
+ }
+
for (i = HPAGE_PUD_NR - HPAGE_PMD_NR; i >= 1; i -= HPAGE_PMD_NR) {
__split_huge_pud_page_tail(head, i, lruvec, list);
}
@@ -3753,3 +3765,21 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
update_mmu_cache_pmd(vma, address, pvmw->pmd);
}
#endif
+
+struct page *alloc_thp_pud_page(int nid)
+{
+ struct page *page = NULL;
+#ifdef CONFIG_CMA
+ page = cma_alloc(hugepage_cma[nid], HPAGE_PUD_NR, HPAGE_PUD_ORDER, true);
+#endif
+ return page;
+}
+
+bool free_thp_pud_page(struct page *page, int order)
+{
+ bool ret = false;
+#ifdef CONFIG_CMA
+ ret = cma_release(hugepage_cma[page_to_nid(page)], page, 1<<order);
+#endif
+ return ret;
+}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4bae089e7a89..82b496922196 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2139,7 +2139,10 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
struct page *page;

if (order > MAX_ORDER) {
- page = alloc_contig_pages(1UL<<order, gfp, nid, NULL);
+ page = (order == HPAGE_PUD_ORDER) ?
+ alloc_thp_pud_page(nid) : NULL;
+ if (!page)
+ page = alloc_contig_pages(1UL<<order, gfp, nid, NULL);
if (page && (gfp & __GFP_COMP))
prep_compound_page(page, order);
} else
@@ -2219,8 +2222,11 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
mpol_cond_put(pol);

if (order > MAX_ORDER) {
- page = alloc_contig_pages(1UL<<order, gfp,
- hpage_node, NULL);
+ page = (order == HPAGE_PUD_ORDER) ?
+ alloc_thp_pud_page(hpage_node) : NULL;
+ if (!page)
+ page = alloc_contig_pages(1UL<<order,
+ gfp, hpage_node, NULL);
if (page && (gfp & __GFP_COMP))
prep_compound_page(page, order);
goto out;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8a8b241508f7..eff307b4dc57 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1509,7 +1509,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)

if (order >= MAX_ORDER) {
destroy_compound_gigantic_page(page, order);
- free_contig_range(page_to_pfn(page), 1 << order);
+ if (!free_thp_pud_page(page, order))
+ free_contig_range(page_to_pfn(page), 1 << order);
} else {
migratetype = get_pfnblock_migratetype(page, pfn);
local_irq_save(flags);
--
2.28.0
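
The range check in cma_clear_bitmap_if_in_range() above can be tried out in
userspace with a toy model. This is only a sketch: struct toy_cma and
toy_clear_if_in_range() are hypothetical names, the area is limited to 64
pages, and one byte per page stands in for the real CMA bitmap (which the
kernel tracks in units of order_per_bit).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for struct cma: one byte per page instead of a bitmap. */
struct toy_cma {
	unsigned long base_pfn;		/* first page frame of the area */
	unsigned long count;		/* number of pages in the area */
	unsigned char used[64];		/* 1 = allocated (hypothetical size) */
};

/* Same three checks as the patch, then clear the covered range. */
static bool toy_clear_if_in_range(struct toy_cma *cma, unsigned long pfn,
				  unsigned long count)
{
	if (!cma)
		return false;

	assert(cma->count <= sizeof(cma->used));

	/* The first page must lie inside the area. */
	if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
		return false;

	/* The range must not run past the end of the area. */
	if (pfn + count > cma->base_pfn + cma->count)
		return false;

	/* Mark the pages as no longer owned by the CMA allocator. */
	memset(&cma->used[pfn - cma->base_pfn], 0, count);
	return true;
}
```

In the model, a range that starts before the area or ends after it is
rejected without touching the bitmap, which is the case the split path in the
patch treats as a bug.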

2020-09-02 18:11:36

by Zi Yan

Subject: [RFC PATCH 15/16] hugetlb: cma: move cma reserve function to cma.c.

From: Zi Yan <[email protected]>

It will be used by other allocations, like 1GB THP allocation in the
upcoming commit.

Signed-off-by: Zi Yan <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 2 +-
arch/arm64/mm/hugetlbpage.c | 2 +-
arch/powerpc/mm/hugetlbpage.c | 2 +-
arch/x86/kernel/setup.c | 8 +-
include/linux/cma.h | 15 ++++
include/linux/hugetlb.h | 12 ---
mm/cma.c | 88 +++++++++++++++++++
mm/hugetlb.c | 88 ++-----------------
8 files changed, 118 insertions(+), 99 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 68fee5e034ca..600668ee0ac7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1507,7 +1507,7 @@
hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET
registers. Default set by CONFIG_HPET_MMAP_DEFAULT.

- hugetlb_cma= [HW] The size of a cma area used for allocation
+ hugepage_cma= [HW] The size of a cma area used for allocation
of gigantic hugepages.
Format: nn[KMGTPE]

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 55ecf6de9ff7..8a3ad7eaae49 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -52,7 +52,7 @@ void __init arm64_hugetlb_cma_reserve(void)
* breaking this assumption.
*/
WARN_ON(order <= MAX_ORDER);
- hugetlb_cma_reserve(order);
+ hugepage_cma_reserve(order);
}
#endif /* CONFIG_CMA */

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..d608e58cb69b 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -699,6 +699,6 @@ void __init gigantic_hugetlb_cma_reserve(void)

if (order) {
VM_WARN_ON(order < MAX_ORDER);
- hugetlb_cma_reserve(order);
+ hugepage_cma_reserve(order);
}
}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 52e83ba607b3..93c8fbdff972 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -16,7 +16,7 @@
#include <linux/pci.h>
#include <linux/root_dev.h>
#include <linux/sfi.h>
-#include <linux/hugetlb.h>
+#include <linux/cma.h>
#include <linux/tboot.h>
#include <linux/usb/xhci-dbgp.h>

@@ -640,7 +640,7 @@ static void __init trim_snb_memory(void)
* already been reserved.
*/
memblock_reserve(0, 1<<20);
-
+
for (i = 0; i < ARRAY_SIZE(bad_pages); i++) {
if (memblock_reserve(bad_pages[i], PAGE_SIZE))
printk(KERN_WARNING "failed to reserve 0x%08lx\n",
@@ -732,7 +732,7 @@ static void __init trim_low_memory_range(void)
{
memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
}
-
+
/*
* Dump out kernel offset information on panic.
*/
@@ -1142,7 +1142,7 @@ void __init setup_arch(char **cmdline_p)
dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);

if (boot_cpu_has(X86_FEATURE_GBPAGES))
- hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+ hugepage_cma_reserve(PUD_SHIFT - PAGE_SHIFT);

/*
* Reserve memory for crash kernel after SRAT is parsed so that it
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 6ff79fefd01f..abcf7ab712f9 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -47,4 +47,19 @@ extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);

extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
+
+extern void cma_reserve(int min_order, unsigned long requested_size,
+ const char *name, struct cma *cma_struct[MAX_NUMNODES]);
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+extern void __init hugepage_cma_reserve(int order);
+extern void __init hugepage_cma_check(void);
+#else
+static inline void __init hugepage_cma_check(void)
+{
+}
+static inline void __init hugepage_cma_reserve(int order)
+{
+}
+#endif
+
#endif
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d5cc5f802dd4..087d13a1dc24 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -935,16 +935,4 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
return ptl;
}

-#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
-extern void __init hugetlb_cma_reserve(int order);
-extern void __init hugetlb_cma_check(void);
-#else
-static inline __init void hugetlb_cma_reserve(int order)
-{
-}
-static inline __init void hugetlb_cma_check(void)
-{
-}
-#endif
-
#endif /* _LINUX_HUGETLB_H */
diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7cda9f..aa3a17d8a191 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -37,6 +37,10 @@
#include "cma.h"

struct cma cma_areas[MAX_CMA_AREAS];
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+struct cma *hugepage_cma[MAX_NUMNODES];
+#endif
+unsigned long hugepage_cma_size __initdata;
unsigned cma_area_count;
static DEFINE_MUTEX(cma_mutex);

@@ -541,3 +545,87 @@ int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data)

return 0;
}
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+/*
+ * cma_reserve() - reserve CMA for gigantic pages on nodes with memory
+ *
+ * must be called after free_area_init() that updates N_MEMORY via node_set_state().
+ * cma_reserve() scans over N_MEMORY nodemask and hence expects the platforms
+ * to have initialized N_MEMORY state.
+ */
+void __init cma_reserve(int min_order, unsigned long requested_size, const char *name,
+ struct cma *cma_struct[MAX_NUMNODES])
+{
+ unsigned long size, reserved, per_node;
+ int nid;
+
+ if (!requested_size)
+ return;
+
+ if (requested_size < (PAGE_SIZE << min_order)) {
+ pr_warn("%s_cma: cma area should be at least %lu MiB\n",
+ name, (PAGE_SIZE << min_order) / SZ_1M);
+ return;
+ }
+
+ /*
+ * If 3 GB area is requested on a machine with 4 numa nodes,
+ * let's allocate 1 GB on first three nodes and ignore the last one.
+ */
+ per_node = DIV_ROUND_UP(requested_size, nr_online_nodes);
+ pr_info("%s_cma: reserve %lu MiB, up to %lu MiB per node\n",
+ name, requested_size / SZ_1M, per_node / SZ_1M);
+
+ reserved = 0;
+ for_each_node_state(nid, N_ONLINE) {
+ int res;
+ char node_name[20];
+
+ size = min(per_node, requested_size - reserved);
+ size = round_up(size, PAGE_SIZE << min_order);
+
+ snprintf(node_name, 20, "%s%d", name, nid);
+ res = cma_declare_contiguous_nid(0, size, 0,
+ PAGE_SIZE << min_order,
+ 0, false, node_name,
+ &cma_struct[nid], nid);
+ if (res) {
+ pr_warn("%s_cma: reservation failed: err %d, node %d\n",
+ name, res, nid);
+ continue;
+ }
+
+ reserved += size;
+ pr_info("%s_cma: reserved %lu MiB on node %d\n",
+ name, size / SZ_1M, nid);
+
+ if (reserved >= requested_size)
+ break;
+ }
+}
+
+static bool hugepage_cma_reserve_called __initdata;
+
+static int __init cmdline_parse_hugepage_cma(char *p)
+{
+ hugepage_cma_size = memparse(p, &p);
+ return 0;
+}
+
+early_param("hugepage_cma", cmdline_parse_hugepage_cma);
+
+void __init hugepage_cma_reserve(int order)
+{
+ hugepage_cma_reserve_called = true;
+ cma_reserve(order, hugepage_cma_size, "hugepage", hugepage_cma);
+}
+
+void __init hugepage_cma_check(void)
+{
+ if (!hugepage_cma_size || hugepage_cma_reserve_called)
+ return;
+
+ pr_warn("hugepage_cma: the option isn't supported by current arch\n");
+}
+#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d5357778b026..6685cad879d0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -48,9 +48,9 @@ unsigned int default_hstate_idx;
struct hstate hstates[HUGE_MAX_HSTATE];

#ifdef CONFIG_CMA
-static struct cma *hugetlb_cma[MAX_NUMNODES];
+extern struct cma *hugepage_cma[MAX_NUMNODES];
#endif
-static unsigned long hugetlb_cma_size __initdata;
+extern unsigned long hugepage_cma_size __initdata;

/*
* Minimum page order among possible hugepage sizes, set to a proper value
@@ -1218,7 +1218,7 @@ static void free_gigantic_page(struct page *page, unsigned int order)
* cma_release() returns false.
*/
#ifdef CONFIG_CMA
- if (cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order))
+ if (cma_release(hugepage_cma[page_to_nid(page)], page, 1 << order))
return;
#endif

@@ -1237,10 +1237,10 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
int node;

for_each_node_mask(node, *nodemask) {
- if (!hugetlb_cma[node])
+ if (!hugepage_cma[node])
continue;

- page = cma_alloc(hugetlb_cma[node], nr_pages,
+ page = cma_alloc(hugepage_cma[node], nr_pages,
huge_page_order(h), true);
if (page)
return page;
@@ -2532,8 +2532,8 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)

for (i = 0; i < h->max_huge_pages; ++i) {
if (hstate_is_gigantic(h)) {
- if (hugetlb_cma_size) {
- pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
+ if (hugepage_cma_size) {
+ pr_warn_once("HugeTLB: hugepage_cma is enabled, skip boot time allocation\n");
break;
}
if (!alloc_bootmem_huge_page(h))
@@ -3209,7 +3209,7 @@ static int __init hugetlb_init(void)
}
}

- hugetlb_cma_check();
+ hugepage_cma_check();
hugetlb_init_hstates();
gather_bootmem_prealloc();
report_hugepages();
@@ -5622,75 +5622,3 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
spin_unlock(&hugetlb_lock);
}
}
-
-#ifdef CONFIG_CMA
-static bool cma_reserve_called __initdata;
-
-static int __init cmdline_parse_hugetlb_cma(char *p)
-{
- hugetlb_cma_size = memparse(p, &p);
- return 0;
-}
-
-early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);
-
-void __init hugetlb_cma_reserve(int order)
-{
- unsigned long size, reserved, per_node;
- int nid;
-
- cma_reserve_called = true;
-
- if (!hugetlb_cma_size)
- return;
-
- if (hugetlb_cma_size < (PAGE_SIZE << order)) {
- pr_warn("hugetlb_cma: cma area should be at least %lu MiB\n",
- (PAGE_SIZE << order) / SZ_1M);
- return;
- }
-
- /*
- * If 3 GB area is requested on a machine with 4 numa nodes,
- * let's allocate 1 GB on first three nodes and ignore the last one.
- */
- per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
- pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
- hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
-
- reserved = 0;
- for_each_node_state(nid, N_ONLINE) {
- int res;
- char name[20];
-
- size = min(per_node, hugetlb_cma_size - reserved);
- size = round_up(size, PAGE_SIZE << order);
-
- snprintf(name, 20, "hugetlb%d", nid);
- res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
- 0, false, name,
- &hugetlb_cma[nid], nid);
- if (res) {
- pr_warn("hugetlb_cma: reservation failed: err %d, node %d",
- res, nid);
- continue;
- }
-
- reserved += size;
- pr_info("hugetlb_cma: reserved %lu MiB on node %d\n",
- size / SZ_1M, nid);
-
- if (reserved >= hugetlb_cma_size)
- break;
- }
-}
-
-void __init hugetlb_cma_check(void)
-{
- if (!hugetlb_cma_size || cma_reserve_called)
- return;
-
- pr_warn("hugetlb_cma: the option isn't supported by current arch\n");
-}
-
-#endif /* CONFIG_CMA */
--
2.28.0
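
The per-node split performed by cma_reserve() above can be sketched in
userspace as follows. The sketch assumes that every per-node reservation
succeeds and that nodes are numbered 0..nr_nodes-1. toy_cma_split() and
TOY_MAX_NODES are hypothetical names, and sizes are in arbitrary units (MiB in
the example below) rather than bytes.

```c
#include <assert.h>

#define TOY_MAX_NODES 8

static unsigned long div_round_up(unsigned long n, unsigned long d)
{
	return (n + d - 1) / d;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Fill out[] with the size reserved on each node and return how many
 * nodes received a reservation. granule is the gigantic page size,
 * i.e. PAGE_SIZE << min_order in the patch.
 */
static int toy_cma_split(unsigned long requested, unsigned long granule,
			 int nr_nodes, unsigned long out[TOY_MAX_NODES])
{
	unsigned long per_node, reserved = 0;
	int nid, used = 0;

	assert(nr_nodes > 0 && nr_nodes <= TOY_MAX_NODES);
	for (nid = 0; nid < nr_nodes; nid++)
		out[nid] = 0;

	/* Nothing requested, or less than one gigantic page: reserve nothing. */
	if (!requested || requested < granule)
		return 0;

	per_node = div_round_up(requested, nr_nodes);
	for (nid = 0; nid < nr_nodes; nid++) {
		unsigned long size = min_ul(per_node, requested - reserved);

		/* Each node gets a whole number of gigantic pages. */
		size = div_round_up(size, granule) * granule;
		out[nid] = size;
		reserved += size;
		used++;
		if (reserved >= requested)
			break;
	}
	return used;
}
```

With requested = 3072, granule = 1024, and four nodes, the sketch reserves
1024 on each of the first three nodes and nothing on the last, matching the
"3 GB on 4 numa nodes" comment in the patch.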

2020-09-02 18:12:21

by Zi Yan

Subject: [RFC PATCH 13/16] mm: thp: add a knob to enable/disable 1GB THPs.

From: Zi Yan <[email protected]>

Add an enabled_1gb sysfs knob that accepts always, madvise, or never.
Changing it does not affect existing 1GB THPs. It is similar to the
enabled knob for 2MB THPs.

Signed-off-by: Zi Yan <[email protected]>
---
include/linux/huge_mm.h | 14 ++++++++++++++
mm/huge_memory.c | 40 ++++++++++++++++++++++++++++++++++++++++
mm/memory.c | 2 +-
3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c7bc40c4a5e2..3bf8d8a09f08 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -119,6 +119,8 @@ enum transparent_hugepage_flag {
#ifdef CONFIG_DEBUG_VM
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
#endif
+ TRANSPARENT_PUD_HUGEPAGE_FLAG,
+ TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG,
};

struct kobject;
@@ -184,6 +186,18 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
}

bool transparent_hugepage_enabled(struct vm_area_struct *vma);
+static inline bool transparent_pud_hugepage_enabled(struct vm_area_struct *vma)
+{
+ if (transparent_hugepage_enabled(vma)) {
+ if (transparent_hugepage_flags & (1 << TRANSPARENT_PUD_HUGEPAGE_FLAG))
+ return true;
+ if (transparent_hugepage_flags &
+ (1 << TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG))
+ return !!(vma->vm_flags & VM_HUGEPAGE);
+ }
+
+ return false;
+}

#define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e209c2dfc5b7..e1440a13da63 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -49,9 +49,11 @@
unsigned long transparent_hugepage_flags __read_mostly =
#ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS
(1<<TRANSPARENT_HUGEPAGE_FLAG)|
+ (1<<TRANSPARENT_PUD_HUGEPAGE_FLAG)|
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
+ (1<<TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG)|
#endif
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
@@ -199,6 +201,43 @@ static ssize_t enabled_store(struct kobject *kobj,
static struct kobj_attribute enabled_attr =
__ATTR(enabled, 0644, enabled_show, enabled_store);

+static ssize_t enabled_1gb_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ if (test_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags))
+ return sprintf(buf, "[always] madvise never\n");
+ else if (test_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags))
+ return sprintf(buf, "always [madvise] never\n");
+ else
+ return sprintf(buf, "always madvise [never]\n");
+}
+
+static ssize_t enabled_1gb_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ ssize_t ret = count;
+
+ if (!memcmp("always", buf,
+ min(sizeof("always")-1, count))) {
+ clear_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+ set_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+ } else if (!memcmp("madvise", buf,
+ min(sizeof("madvise")-1, count))) {
+ clear_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+ set_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+ } else if (!memcmp("never", buf,
+ min(sizeof("never")-1, count))) {
+ clear_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+ } else
+ ret = -EINVAL;
+
+ return ret;
+}
+static struct kobj_attribute enabled_1gb_attr =
+ __ATTR(enabled_1gb, 0644, enabled_1gb_show, enabled_1gb_store);
+
ssize_t single_hugepage_flag_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf,
enum transparent_hugepage_flag flag)
@@ -305,6 +344,7 @@ static struct kobj_attribute hpage_pmd_size_attr =

static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
+ &enabled_1gb_attr.attr,
&defrag_attr.attr,
&use_zero_page_attr.attr,
&hpage_pmd_size_attr.attr,
diff --git a/mm/memory.c b/mm/memory.c
index 184d8eb2d060..518f29a5903e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4305,7 +4305,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
if (!vmf.pud)
return VM_FAULT_OOM;
retry_pud:
- if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
+ if (pud_none(*vmf.pud) && transparent_pud_hugepage_enabled(vma)) {
ret = create_huge_pud(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
--
2.28.0
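
The decision made by transparent_pud_hugepage_enabled() above can be modeled
in userspace with a three-state mode instead of two flag bits. This is a
sketch under assumptions: enum pud_thp_mode, toy_parse_mode(), and
toy_pud_thp_enabled() are hypothetical names, and the parser uses an exact
string match where the patch uses a memcmp() prefix match.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* The two flag bits in the patch encode these three states. */
enum pud_thp_mode { PUD_THP_NEVER, PUD_THP_MADVISE, PUD_THP_ALWAYS };

/*
 * Parse a value written to the knob. Returns 0 on success and -1 for an
 * unknown value (the patch returns -EINVAL in that case).
 */
static int toy_parse_mode(const char *buf, enum pud_thp_mode *mode)
{
	size_t len;

	assert(buf && mode);
	len = strcspn(buf, "\n");	/* ignore a trailing newline */

	if (len == 6 && !memcmp(buf, "always", 6))
		*mode = PUD_THP_ALWAYS;
	else if (len == 7 && !memcmp(buf, "madvise", 7))
		*mode = PUD_THP_MADVISE;
	else if (len == 5 && !memcmp(buf, "never", 5))
		*mode = PUD_THP_NEVER;
	else
		return -1;
	return 0;
}

/*
 * 1GB THP is only considered when THP is enabled for the VMA at all;
 * in madvise mode the VMA must also carry VM_HUGEPAGE.
 */
static bool toy_pud_thp_enabled(bool thp_enabled, enum pud_thp_mode mode,
				bool vma_madvised)
{
	if (!thp_enabled)
		return false;
	if (mode == PUD_THP_ALWAYS)
		return true;
	if (mode == PUD_THP_MADVISE)
		return vma_madvised;
	return false;
}
```

In the model, as in the patch, the 1GB setting can only narrow the generic
THP setting: a VMA for which THP is disabled never gets a 1GB page, whatever
the knob says.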

2020-09-02 18:12:39

by Zi Yan

Subject: [RFC PATCH 02/16] mm: thp: 1GB anonymous page implementation.

From: Zi Yan <[email protected]>

This adds 1GB THP support for anonymous pages. Applications can get 1GB
pages during page faults when their VMAs are larger than 1GB. For
read-only 1GB zero THP, a shared 1GB zero THP is created for all
readers.

Signed-off-by: Zi Yan <[email protected]>
---
arch/x86/include/asm/pgalloc.h | 59 +++++++++++
arch/x86/include/asm/pgtable.h | 2 +
arch/x86/mm/pgtable.c | 25 +++++
drivers/base/node.c | 3 +
fs/proc/meminfo.c | 2 +
include/linux/huge_mm.h | 13 ++-
include/linux/mm.h | 4 +
include/linux/mm_types.h | 1 +
include/linux/mmzone.h | 1 +
include/linux/pgtable.h | 3 +
include/linux/vm_event_item.h | 3 +
kernel/fork.c | 5 +
mm/huge_memory.c | 188 +++++++++++++++++++++++++++++++--
mm/memory.c | 29 ++++-
mm/page_alloc.c | 3 +-
mm/pgtable-generic.c | 45 ++++++++
mm/rmap.c | 30 ++++--
mm/vmstat.c | 4 +
18 files changed, 396 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 62ad61d6fefc..fae13467d3e1 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -52,6 +52,18 @@ extern pgd_t *pgd_alloc(struct mm_struct *);
extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);

extern pgtable_t pte_alloc_one(struct mm_struct *);
+extern pgtable_t pte_alloc_order(struct mm_struct *, unsigned long, int);
+
+static inline void pte_free_order(struct mm_struct *mm, struct page *pte,
+ int order)
+{
+ int i;
+
+ for (i = 0; i < (1<<order); i++) {
+ pgtable_pte_page_dtor(&pte[i]);
+ __free_page(&pte[i]);
+ }
+}

extern void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte);

@@ -87,6 +99,53 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
#define pmd_pgtable(pmd) pmd_page(pmd)

#if CONFIG_PGTABLE_LEVELS > 2
+static inline pmd_t *pmd_alloc_one_page_with_ptes(struct mm_struct *mm, unsigned long addr)
+{
+ pgtable_t pte_pgtables;
+ pmd_t *pmd;
+ spinlock_t *pmd_ptl;
+ int i;
+
+ pte_pgtables = pte_alloc_order(mm, addr,
+ HPAGE_PUD_ORDER - HPAGE_PMD_ORDER);
+ if (!pte_pgtables)
+ return NULL;
+
+ pmd = pmd_alloc_one(mm, addr);
+ if (unlikely(!pmd)) {
+ pte_free_order(mm, pte_pgtables,
+ HPAGE_PUD_ORDER - HPAGE_PMD_ORDER);
+ return NULL;
+ }
+ pmd_ptl = pmd_lock(mm, pmd);
+
+ for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+ pgtable_trans_huge_deposit(mm, pmd, pte_pgtables + i);
+
+ spin_unlock(pmd_ptl);
+
+ return pmd;
+}
+
+static inline void pmd_free_page_with_ptes(struct mm_struct *mm, pmd_t *pmd)
+{
+ spinlock_t *pmd_ptl;
+ int i;
+
+ BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
+ pmd_ptl = pmd_lock(mm, pmd);
+
+ for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++) {
+ pgtable_t pte_pgtable;
+
+ pte_pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pte_free(mm, pte_pgtable);
+ }
+
+ spin_unlock(pmd_ptl);
+ pmd_free(mm, pmd);
+}
+
extern void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd);

static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd,
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5e0dcc20614d..26255cac78c0 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1141,6 +1141,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long
return native_pmdp_get_and_clear(pmdp);
}

+#define mk_pud(page, pgprot) pfn_pud(page_to_pfn(page), (pgprot))
+
#define __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR
static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
unsigned long addr, pud_t *pudp)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f51ba66..7be73aee6183 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -33,6 +33,31 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
return __pte_alloc_one(mm, __userpte_alloc_gfp);
}

+pgtable_t pte_alloc_order(struct mm_struct *mm, unsigned long address, int order)
+{
+ struct page *pte;
+ int i;
+
+ pte = alloc_pages(__userpte_alloc_gfp, order);
+ if (!pte)
+ return NULL;
+ split_page(pte, order);
+ for (i = 1; i < (1 << order); i++)
+ set_page_private(pte + i, 0);
+
+ for (i = 0; i < (1<<order); i++) {
+ if (!pgtable_pte_page_ctor(&pte[i])) {
+ __free_page(&pte[i]);
+ while (--i >= 0) {
+ pgtable_pte_page_dtor(&pte[i]);
+ __free_page(&pte[i]);
+ }
+ return NULL;
+ }
+ }
+ return pte;
+}
+
static int __init setup_userpte(char *arg)
{
if (!arg)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 508b80f6329b..f11b4d88911c 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -428,6 +428,7 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d SUnreclaim: %8lu kB\n"
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
+ "Node %d AnonHugePUDPages: %8lu kB\n"
"Node %d ShmemHugePages: %8lu kB\n"
"Node %d ShmemPmdMapped: %8lu kB\n"
"Node %d FileHugePages: %8lu kB\n"
@@ -457,6 +458,8 @@ static ssize_t node_read_meminfo(struct device *dev,
,
nid, K(node_page_state(pgdat, NR_ANON_THPS) *
HPAGE_PMD_NR),
+ nid, K(node_page_state(pgdat, NR_ANON_THPS_PUD) *
+ HPAGE_PUD_NR),
nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
HPAGE_PMD_NR),
nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 887a5532e449..b60e0c241015 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -130,6 +130,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
show_val_kb(m, "AnonHugePages: ",
global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR);
+ show_val_kb(m, "AnonHugePUDPages: ",
+ global_node_page_state(NR_ANON_THPS_PUD) * HPAGE_PUD_NR);
show_val_kb(m, "ShmemHugePages: ",
global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR);
show_val_kb(m, "ShmemPmdMapped: ",
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 8a8bc46a2432..7528652400e4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -18,10 +18,15 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,

#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
+extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
#else
static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
{
}
+static inline int do_huge_pud_anonymous_page(struct vm_fault *vmf)
+{
+ return VM_FAULT_FALLBACK;
+}
#endif

extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
@@ -115,6 +120,9 @@ extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define HPAGE_PMD_SHIFT PMD_SHIFT
#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
@@ -276,7 +284,7 @@ static inline unsigned int thp_order(struct page *page)
{
VM_BUG_ON_PGFLAGS(PageTail(page), page);
if (PageHead(page))
- return HPAGE_PMD_ORDER;
+ return page[1].compound_order;
return 0;
}

@@ -288,7 +296,7 @@ static inline int thp_nr_pages(struct page *page)
{
VM_BUG_ON_PGFLAGS(PageTail(page), page);
if (PageHead(page))
- return HPAGE_PMD_NR;
+ return (1<<page[1].compound_order);
return 1;
}

@@ -320,6 +328,7 @@ struct page *mm_get_huge_zero_page(struct mm_struct *mm);
void mm_put_huge_zero_page(struct mm_struct *mm);

#define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
+#define mk_huge_pud(page, prot) pud_mkhuge(mk_pud(page, prot))

static inline bool thp_migration_supported(void)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3a4f099fb1b..cb1ccf804404 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -31,6 +31,7 @@
#include <linux/sizes.h>
#include <linux/sched.h>
#include <linux/pgtable.h>
+#include <linux/pagechain.h>

struct mempolicy;
struct anon_vma;
@@ -2184,6 +2185,7 @@ static inline void pgtable_init(void)
{
ptlock_cache_init();
pgtable_cache_init();
+ pagechain_cache_init();
}

static inline bool pgtable_pte_page_ctor(struct page *page)
@@ -2316,6 +2318,8 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
return ptl;
}

+#define pud_huge_pte(mm, pud) ((mm)->pud_huge_pte)
+
extern void __init pagecache_init(void);
extern void __init free_area_init_memoryless_node(int nid);
extern void free_initmem(void);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..4c1839366af4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -513,6 +513,7 @@ struct mm_struct {
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
+ struct list_head pud_huge_pte; /* protected by page_table_lock */
#ifdef CONFIG_NUMA_BALANCING
/*
* numa_next_scan is the next time that the PTEs will be marked
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0a404552ecc1..3a8f54a2c5a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -196,6 +196,7 @@ enum node_stat_item {
NR_FILE_THPS,
NR_FILE_PMDMAPPED,
NR_ANON_THPS,
+ NR_ANON_THPS_PUD,
NR_VMSCAN_WRITE,
NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
NR_DIRTIED, /* page dirtyings since bootup */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..255275d5b73e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -462,10 +462,13 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable);
+extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+ pgtable_t pgtable);
#endif

#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
+extern pgtable_t pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
#endif

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2e6ca53b9bbd..a3f1093a55bb 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -92,6 +92,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_DEFERRED_SPLIT_PAGE,
THP_SPLIT_PMD,
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+ THP_FAULT_ALLOC_PUD,
+ THP_FAULT_FALLBACK_PUD,
+ THP_FAULT_FALLBACK_PUD_CHARGE,
THP_SPLIT_PUD,
#endif
THP_ZERO_PAGE_ALLOC,
diff --git a/kernel/fork.c b/kernel/fork.c
index 3f281814a3d3..842fdc4ae5fc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -663,6 +663,10 @@ static void check_mm(struct mm_struct *mm)
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
VM_BUG_ON_MM(mm->pmd_huge_pte, mm);
#endif
+ VM_BUG_ON_MM(!list_empty(&mm->pud_huge_pte) &&
+ !pagechain_empty(list_first_entry(&mm->pud_huge_pte,
+ struct pagechain, list)),
+ mm);
}

#define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
@@ -1023,6 +1027,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
#endif
+ INIT_LIST_HEAD(&mm->pud_huge_pte);
mm_init_uprobes_state(mm);

if (current->mm) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 90733cefa528..ec3847392208 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -933,6 +933,112 @@ vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn,
return VM_FAULT_NOPAGE;
}
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud_prot);
+
+static int __do_huge_pud_anonymous_page(struct vm_fault *vmf, struct page *page,
+ gfp_t gfp)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ pmd_t *pmd_pgtable;
+ unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+ int ret = 0;
+
+ VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+ if (mem_cgroup_charge(page, vma->vm_mm, gfp)) {
+ put_page(page);
+ count_vm_event(THP_FAULT_FALLBACK_PUD);
+ count_vm_event(THP_FAULT_FALLBACK_PUD_CHARGE);
+ return VM_FAULT_FALLBACK;
+ }
+ cgroup_throttle_swaprate(page, gfp);
+
+ pmd_pgtable = pmd_alloc_one_page_with_ptes(vma->vm_mm, haddr);
+ if (unlikely(!pmd_pgtable)) {
+ ret = VM_FAULT_OOM;
+ goto release;
+ }
+
+ clear_huge_page(page, vmf->address, HPAGE_PUD_NR);
+ /*
+ * The memory barrier inside __SetPageUptodate makes sure that
+ * clear_huge_page writes become visible before the set_pud_at()
+ * write.
+ */
+ __SetPageUptodate(page);
+
+ vmf->ptl = pud_lock(vma->vm_mm, vmf->pud);
+ if (unlikely(!pud_none(*vmf->pud))) {
+ goto unlock_release;
+ } else {
+ pud_t entry;
+ int i;
+
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ goto unlock_release;
+
+ /* Deliver the page fault to userland */
+ if (userfaultfd_missing(vma)) {
+ vm_fault_t ret2;
+
+ spin_unlock(vmf->ptl);
+ put_page(page);
+ pmd_free_page_with_ptes(vma->vm_mm, pmd_pgtable);
+ ret2 = handle_userfault(vmf, VM_UFFD_MISSING);
+ VM_BUG_ON(ret2 & VM_FAULT_FALLBACK);
+ return ret2;
+ }
+
+ entry = mk_huge_pud(page, vma->vm_page_prot);
+ entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
+ page_add_new_anon_rmap(page, vma, haddr, true);
+ lru_cache_add_inactive_or_unevictable(page, vma);
+ pgtable_trans_huge_pud_deposit(vma->vm_mm, vmf->pud,
+ virt_to_page(pmd_pgtable));
+ set_pud_at(vma->vm_mm, haddr, vmf->pud, entry);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PUD_NR);
+ mm_inc_nr_pmds(vma->vm_mm);
+ for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+ mm_inc_nr_ptes(vma->vm_mm);
+ spin_unlock(vmf->ptl);
+ count_vm_event(THP_FAULT_ALLOC_PUD);
+ }
+
+ return 0;
+unlock_release:
+ spin_unlock(vmf->ptl);
+release:
+ if (pmd_pgtable)
+ pmd_free_page_with_ptes(vma->vm_mm, pmd_pgtable);
+ put_page(page);
+ return ret;
+
+}
+
+int do_huge_pud_anonymous_page(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ gfp_t gfp;
+ struct page *page;
+ unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+
+ if (haddr < vma->vm_start || haddr + HPAGE_PUD_SIZE > vma->vm_end)
+ return VM_FAULT_FALLBACK;
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+ if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
+ return VM_FAULT_OOM;
+
+ gfp = alloc_hugepage_direct_gfpmask(vma);
+ page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PUD_ORDER);
+ if (unlikely(!page)) {
+ count_vm_event(THP_FAULT_FALLBACK_PUD);
+ return VM_FAULT_FALLBACK;
+ }
+ prep_transhuge_page(page);
+ return __do_huge_pud_anonymous_page(vmf, page, gfp);
+}
+
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */

static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -1159,7 +1265,12 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
{
spinlock_t *dst_ptl, *src_ptl;
pud_t pud;
- int ret;
+ pmd_t *pmd_pgtable = NULL;
+ int ret = -ENOMEM;
+
+ pmd_pgtable = pmd_alloc_one_page_with_ptes(dst_mm, addr);
+ if (unlikely(!pmd_pgtable))
+ goto out;

dst_ptl = pud_lock(dst_mm, dst_pud);
src_ptl = pud_lockptr(src_mm, src_pud);
@@ -1167,16 +1278,28 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,

ret = -EAGAIN;
pud = *src_pud;
+
+ /* only transparent huge pud page needs extra page table pages for
+ * possible huge page split */
+ if (!pud_trans_huge(pud))
+ pmd_free_page_with_ptes(dst_mm, pmd_pgtable);
+
if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
goto out_unlock;

- /*
- * When page table lock is held, the huge zero pud should not be
- * under splitting since we don't split the page itself, only pud to
- * a page table.
- */
- if (is_huge_zero_pud(pud)) {
- /* No huge zero pud yet */
+ if (pud_trans_huge(pud)) {
+ struct page *src_page;
+ int i;
+
+ src_page = pud_page(pud);
+ VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+ get_page(src_page);
+ page_dup_rmap(src_page, true);
+ add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PUD_NR);
+ mm_inc_nr_pmds(dst_mm);
+ for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+ mm_inc_nr_ptes(dst_mm);
+ pgtable_trans_huge_pud_deposit(dst_mm, dst_pud, virt_to_page(pmd_pgtable));
}

pudp_set_wrprotect(src_mm, addr, src_pud);
@@ -1187,6 +1310,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
out_unlock:
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
+out:
return ret;
}

@@ -1887,11 +2011,27 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
}

#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline void zap_pud_deposited_table(struct mm_struct *mm, pud_t *pud)
+{
+ pgtable_t pgtable;
+ int i;
+
+ pgtable = pgtable_trans_huge_pud_withdraw(mm, pud);
+ pmd_free_page_with_ptes(mm, (pmd_t *)page_address(pgtable));
+
+ mm_dec_nr_pmds(mm);
+ for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+ mm_dec_nr_ptes(mm);
+}
+
int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
pud_t *pud, unsigned long addr)
{
+ pud_t orig_pud;
spinlock_t *ptl;

+ tlb_change_page_size(tlb, HPAGE_PUD_SIZE);
+
ptl = __pud_trans_huge_lock(pud, vma);
if (!ptl)
return 0;
@@ -1901,14 +2041,40 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
* pgtable_trans_huge_withdraw after finishing pudp related
* operations.
*/
- pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
+ orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud,
+ tlb->fullmm);
tlb_remove_pud_tlb_entry(tlb, pud, addr);
if (vma_is_special_huge(vma)) {
spin_unlock(ptl);
/* No zero page support yet */
+ } else if (is_huge_zero_pud(orig_pud)) {
+ zap_pud_deposited_table(tlb->mm, pud);
+ spin_unlock(ptl);
+ tlb_remove_page_size(tlb, pud_page(orig_pud), HPAGE_PUD_SIZE);
} else {
- /* No support for anonymous PUD pages yet */
- BUG();
+ struct page *page = NULL;
+ int flush_needed = 1;
+
+ if (pud_present(orig_pud)) {
+ page = pud_page(orig_pud);
+ page_remove_rmap(page, true);
+ VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ } else
+ WARN_ONCE(1, "Non present huge pud without pud migration enabled!");
+
+ if (PageAnon(page)) {
+ zap_pud_deposited_table(tlb->mm, pud);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PUD_NR);
+ } else {
+ if (arch_needs_pgtable_deposit())
+ zap_pud_deposited_table(tlb->mm, pud);
+ add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PUD_NR);
+ }
+
+ spin_unlock(ptl);
+ if (flush_needed)
+ tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
}
return 1;
}
diff --git a/mm/memory.c b/mm/memory.c
index fb5463153351..6f86294438fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4147,14 +4147,13 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
/* No support for anonymous transparent PUD pages yet */
if (vma_is_anonymous(vmf->vma))
- goto split;
+ return do_huge_pud_anonymous_page(vmf);
if (vmf->vma->vm_ops->huge_fault) {
vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);

if (!(ret & VM_FAULT_FALLBACK))
return ret;
}
-split:
/* COW or write-notify not handled on PUD level: split pud.*/
__split_huge_pud(vmf->vma, vmf->pud, vmf->address);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -5098,3 +5097,29 @@ void ptlock_free(struct page *page)
kmem_cache_free(page_ptl_cachep, page->ptl);
}
#endif
+
+static struct kmem_cache *pagechain_cachep;
+
+void __init pagechain_cache_init(void)
+{
+ pagechain_cachep = kmem_cache_create("pagechain",
+ sizeof(struct pagechain), 0, SLAB_PANIC, NULL);
+}
+
+struct pagechain *pagechain_alloc(void)
+{
+ struct pagechain *chain;
+
+ chain = kmem_cache_alloc(pagechain_cachep, GFP_ATOMIC);
+
+ if (!chain)
+ return NULL;
+
+ pagechain_init(chain);
+ return chain;
+}
+
+void pagechain_free(struct pagechain *pchain)
+{
+ kmem_cache_free(pagechain_cachep, pchain);
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d9f9bd0e06c..763acbed66f1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5443,7 +5443,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)
* HPAGE_PMD_NR),
- K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
+ K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR +
+ node_page_state(pgdat, NR_ANON_THPS_PUD) * HPAGE_PUD_NR),
#endif
K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
node_page_state(pgdat, NR_KERNEL_STACK_KB),
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 9578db83e312..ef218b0f5d74 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -10,6 +10,7 @@
#include <linux/pagemap.h>
#include <linux/hugetlb.h>
#include <linux/pgtable.h>
+#include <linux/pagechain.h>
#include <asm/tlb.h>

/*
@@ -170,6 +171,23 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
pmd_huge_pte(mm, pmdp) = pgtable;
}
+
+void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+ pgtable_t pgtable)
+{
+ struct pagechain *chain = NULL;
+
+ assert_spin_locked(pud_lockptr(mm, pudp));
+ /* FIFO */
+ chain = list_first_entry_or_null(&pud_huge_pte(mm, pudp),
+ struct pagechain, list);
+
+ if (!chain || !pagechain_space(chain)) {
+ chain = pagechain_alloc();
+ list_add(&chain->list, &pud_huge_pte(mm, pudp));
+ }
+ pagechain_deposit(chain, pgtable);
+}
#endif

#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
@@ -188,6 +206,33 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
list_del(&pgtable->lru);
return pgtable;
}
+
+pgtable_t pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
+{
+ pgtable_t pgtable;
+ struct pagechain *chain = NULL;
+
+ assert_spin_locked(pud_lockptr(mm, pudp));
+
+ /* FIFO */
+retry:
+ chain = list_first_entry_or_null(&pud_huge_pte(mm, pudp),
+ struct pagechain, list);
+
+ if (!chain)
+ return NULL;
+
+ if (pagechain_empty(chain)) {
+ if (list_is_singular(&chain->list))
+ return NULL;
+ list_del(&chain->list);
+ pagechain_free(chain);
+ goto retry;
+ }
+
+ pgtable = pagechain_withdraw(chain);
+ return pgtable;
+}
#endif

#ifndef __HAVE_ARCH_PMDP_INVALIDATE
diff --git a/mm/rmap.c b/mm/rmap.c
index 9425260774a1..10195a2421cf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -726,6 +726,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
+ pud_t pude;
pmd_t *pmd = NULL;
pmd_t pmde;

@@ -738,7 +739,10 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
goto out;

pud = pud_offset(p4d, address);
- if (!pud_present(*pud))
+
+ pude = *pud;
+ barrier();
+ if (!pud_present(pude) || pud_trans_huge(pude))
goto out;

pmd = pmd_offset(pud, address);
@@ -1033,7 +1037,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
* __page_set_anon_rmap - set up new anonymous rmap
* @page: Page or Hugepage to add to rmap
* @vma: VM area to add page to.
- * @address: User virtual address of the mapping
+ * @address: User virtual address of the mapping
* @exclusive: the page is exclusively owned by the current process
*/
static void __page_set_anon_rmap(struct page *page,
@@ -1137,8 +1141,12 @@ void do_page_add_anon_rmap(struct page *page,
* pte lock(a spinlock) is held, which implies preemption
* disabled.
*/
- if (compound)
- __inc_lruvec_page_state(page, NR_ANON_THPS);
+ if (compound) {
+ if (nr == HPAGE_PMD_NR)
+ __inc_lruvec_page_state(page, NR_ANON_THPS);
+ else
+ __inc_lruvec_page_state(page, NR_ANON_THPS_PUD);
+ }
__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
}

@@ -1180,7 +1188,10 @@ void page_add_new_anon_rmap(struct page *page,
if (hpage_pincount_available(page))
atomic_set(compound_pincount_ptr(page), 0);

- __inc_lruvec_page_state(page, NR_ANON_THPS);
+ if (nr == HPAGE_PMD_NR)
+ __inc_lruvec_page_state(page, NR_ANON_THPS);
+ else
+ __inc_lruvec_page_state(page, NR_ANON_THPS_PUD);
} else {
/* Anon THP always mapped first with PMD */
VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1286,14 +1297,17 @@ static void page_remove_anon_compound_rmap(struct page *page)
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
return;

- __dec_lruvec_page_state(page, NR_ANON_THPS);
+ if (thp_nr_pages(page) == HPAGE_PMD_NR)
+ __dec_lruvec_page_state(page, NR_ANON_THPS);
+ else
+ __dec_lruvec_page_state(page, NR_ANON_THPS_PUD);

if (TestClearPageDoubleMap(page)) {
/*
* Subpages can be mapped with PTEs too. Check how many of
* them are still mapped.
*/
- for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+ for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
if (atomic_add_negative(-1, &page[i]._mapcount))
nr++;
}
@@ -1306,7 +1320,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
if (nr && nr < HPAGE_PMD_NR)
deferred_split_huge_page(page);
} else {
- nr = HPAGE_PMD_NR;
+ nr = thp_nr_pages(page);
}

if (unlikely(PageMlocked(page)))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 06fd13ebc2b8..3a01212b652c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1209,6 +1209,7 @@ const char * const vmstat_text[] = {
"nr_file_hugepages",
"nr_file_pmdmapped",
"nr_anon_transparent_hugepages",
+ "nr_anon_transparent_pud_hugepages",
"nr_vmscan_write",
"nr_vmscan_immediate_reclaim",
"nr_dirtied",
@@ -1325,6 +1326,9 @@ const char * const vmstat_text[] = {
"thp_deferred_split_page",
"thp_split_pmd",
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+ "thp_fault_alloc_pud",
+ "thp_fault_fallback_pud",
+ "thp_fault_fallback_pud_charge",
"thp_split_pud",
#endif
"thp_zero_page_alloc",
--
2.28.0
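
Note for readers: the deposit accounting in this patch (one mm_inc_nr_pmds() plus a loop of 1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER) mm_inc_nr_ptes() calls) and the chain-based deposit/withdraw can be modelled in userspace. The following is only a sketch under stated assumptions: it assumes x86_64 with 4KB base pages, every demo_ name and the 15-slot chain capacity are hypothetical, and the real struct pagechain comes from patch 01, which is not shown here. This model pops most-recent-first; the ordering of the patch's pagechain_deposit()/pagechain_withdraw() is defined in patch 01.

```c
#include <stdlib.h>

/* Assumed x86_64 geometry with 4KB base pages. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PMD_SHIFT  21
#define DEMO_PUD_SHIFT  30
#define DEMO_HPAGE_PMD_ORDER (DEMO_PMD_SHIFT - DEMO_PAGE_SHIFT)
#define DEMO_HPAGE_PUD_ORDER (DEMO_PUD_SHIFT - DEMO_PAGE_SHIFT)

/* PTE tables per PUD THP, plus the single PMD table: 512 + 1. */
#define DEMO_PTE_TABLES (1UL << (DEMO_HPAGE_PUD_ORDER - DEMO_HPAGE_PMD_ORDER))
#define DEMO_DEPOSITED  (DEMO_PTE_TABLES + 1)

#define DEMO_CHAIN_SLOTS 15	/* hypothetical capacity of one chain */

struct demo_chain {
	struct demo_chain *next;
	unsigned int nr;		/* slots in use */
	void *pages[DEMO_CHAIN_SLOTS];
};

struct demo_deposit {
	struct demo_chain *head;	/* newest chain first */
};

/* Store one page, opening a new chain when the newest one is full. */
static int demo_deposit_page(struct demo_deposit *d, void *page)
{
	struct demo_chain *c = d->head;

	if (!c || c->nr == DEMO_CHAIN_SLOTS) {
		c = calloc(1, sizeof(*c));
		if (!c)
			return -1;	/* report allocation failure to the caller */
		c->next = d->head;
		d->head = c;
	}
	c->pages[c->nr++] = page;
	return 0;
}

/* Take back one page, freeing drained chains; NULL once nothing is left. */
static void *demo_withdraw_page(struct demo_deposit *d)
{
	struct demo_chain *c;

	while ((c = d->head) != NULL) {
		if (c->nr)
			return c->pages[--c->nr];
		d->head = c->next;
		free(c);
	}
	return NULL;
}
```

Depositing DEMO_DEPOSITED (513) entries and withdrawing them again returns them in reverse order, after which demo_withdraw_page() returns NULL. Unlike the patch, which keeps the last empty chain on the list, this model frees every drained chain.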

2020-09-02 18:12:47

by Zi Yan

Subject: [RFC PATCH 12/16] mm: support 1GB THP pagemap support.

From: Zi Yan <[email protected]>

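Add a pud_entry callback (pagemap_pud_range) to the pagemap walker so that
/proc/<pid>/pagemap reports page flags and PFNs for PUD-mapped 1GB THPs.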

Signed-off-by: Zi Yan <[email protected]>
---
fs/proc/task_mmu.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2ff80a9c8b57..7254c7ecf659 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1557,6 +1557,64 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
return err;
}

+static int pagemap_pud_range(pud_t *pudp, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct vm_area_struct *vma = walk->vma;
+ struct pagemapread *pm = walk->private;
+ spinlock_t *ptl;
+ int err = 0;
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+ ptl = pud_trans_huge_lock(pudp, vma);
+ if (ptl) {
+ u64 flags = 0, frame = 0;
+ pud_t pud = *pudp;
+ struct page *page = NULL;
+
+ if (vma->vm_flags & VM_SOFTDIRTY)
+ flags |= PM_SOFT_DIRTY;
+
+ if (pud_present(pud)) {
+ page = pud_page(pud);
+
+ flags |= PM_PRESENT;
+ if (pud_soft_dirty(pud))
+ flags |= PM_SOFT_DIRTY;
+ if (pm->show_pfn)
+ frame = pud_pfn(pud) +
+ ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+ }
+
+ if (page && page_mapcount(page) == 1)
+ flags |= PM_MMAP_EXCLUSIVE;
+
+ for (; addr != end; addr += PAGE_SIZE) {
+ pagemap_entry_t pme = make_pme(frame, flags);
+
+ err = add_to_pagemap(addr, &pme, pm);
+ if (err)
+ break;
+ if (pm->show_pfn) {
+ if (flags & PM_PRESENT)
+ frame++;
+ else if (flags & PM_SWAP)
+ frame += (1 << MAX_SWAPFILES_SHIFT);
+ }
+ }
+ spin_unlock(ptl);
+ walk->action = ACTION_CONTINUE;
+ return err;
+ }
+
+ if (pud_trans_unstable(pudp)) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+ return err;
+}
+
#ifdef CONFIG_HUGETLB_PAGE
/* This function walks within one hugetlb entry in the single call */
static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
@@ -1607,6 +1665,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
#endif /* HUGETLB_PAGE */

static const struct mm_walk_ops pagemap_ops = {
+ .pud_entry = pagemap_pud_range,
.pmd_entry = pagemap_pmd_range,
.pte_hole = pagemap_pte_hole,
.hugetlb_entry = pagemap_hugetlb_range,
--
2.28.0
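
Note for readers: pagemap_pud_range() reports, for each 4KB step inside a PUD-mapped THP, the head PFN plus the page offset within the 1GB region. A minimal userspace sketch of that arithmetic and of decoding the resulting entry follows. It assumes x86_64 with 4KB base pages; the demo_ names are illustrative and not part of the patch, and the bit positions follow the documented /proc/<pid>/pagemap layout (PFN in bits 0-54, bit 56 exclusively mapped, bit 63 present).

```c
#include <stdint.h>

/* Assumed x86_64 geometry with 4KB base pages. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PUD_SHIFT  30
#define DEMO_PUD_MASK   (~((UINT64_C(1) << DEMO_PUD_SHIFT) - 1))

/* pagemap entry layout as documented for /proc/<pid>/pagemap. */
#define DEMO_PM_PFN_MASK       ((UINT64_C(1) << 55) - 1)
#define DEMO_PM_MMAP_EXCLUSIVE (UINT64_C(1) << 56)
#define DEMO_PM_PRESENT        (UINT64_C(1) << 63)

/* PFN for addr inside a PUD THP whose head page has PFN head_pfn. */
static uint64_t demo_pud_frame(uint64_t head_pfn, uint64_t addr)
{
	return head_pfn + ((addr & ~DEMO_PUD_MASK) >> DEMO_PAGE_SHIFT);
}

/* Build the entry a reader would see for a present, exclusively mapped page. */
static uint64_t demo_make_entry(uint64_t frame)
{
	return (frame & DEMO_PM_PFN_MASK) | DEMO_PM_PRESENT |
	       DEMO_PM_MMAP_EXCLUSIVE;
}

/* Extract the PFN from an entry, or 0 if the page is not present. */
static uint64_t demo_entry_pfn(uint64_t entry)
{
	return (entry & DEMO_PM_PRESENT) ? (entry & DEMO_PM_PFN_MASK) : 0;
}
```

In the patch the frame is only filled in when pm->show_pfn is set, so a reader without the needed privilege would see a zero PFN field even for present pages.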

2020-09-02 18:44:41

by Jason Gunthorpe

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> Hi all,
>
> This patchset adds support for 1GB THP on x86_64. It is on top of
> v5.9-rc2-mmots-2020-08-25-21-13.
>
> 1GB THP is more flexible for reducing translation overhead and increasing the
> performance of applications with large memory footprint without application
> changes compared to hugetlb.
>
> Design
> =======
>
> The 1GB THP implementation looks similar to the existing THP code, except for some
> new designs for the additional page table level.
>
> 1. Page table deposit and withdraw using a new pagechain data structure:
> instead of one PTE page table page, 1GB THP requires 513 page table pages
> (one PMD page table page and 512 PTE page table pages) to be deposited
> at the page allocation time, so that we can split the page later. Currently,
> the page table deposit is using ->lru, thus only one page can be deposited.
> A new pagechain data structure is added to enable multi-page deposit.
>
> 2. Triple mapped 1GB THP: a 1GB THP can be mapped by a combination of PUD, PMD,
> and PTE entries. Mixing PUD and PTE mappings can be achieved with the existing
> PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
> sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
> page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
> page[N*512 + 3].compound_mapcount.
>
> 3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is more sane
> to use something less intrusive. So all 1GB THPs are allocated from reserved
> CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
> THP is cleared as the resulting pages can be freed via normal page free path.
> We can fall back to alloc_contig_pages for 1GB THP if necessary.
>
>
> Patch Organization
> =======
>
> Patch 01 adds the new pagechain data structure.
>
> Patches 02 to 13 add 1GB THP support in various places.
>
> Patch 14 tries to use alloc_contig_pages for 1GB THP allocation.
>
> Patch 15 moves hugetlb_cma reservation to cma.c and renames it to hugepage_cma.
>
> Patch 16 uses hugepage_cma reservation for 1GB THP allocation.
>
>
> Any suggestions and comments are welcome.
>
>
> Zi Yan (16):
> mm: add pagechain container for storing multiple pages.
> mm: thp: 1GB anonymous page implementation.
> mm: proc: add 1GB THP kpageflag.
> mm: thp: 1GB THP copy on write implementation.
> mm: thp: handling 1GB THP reference bit.
> mm: thp: add 1GB THP split_huge_pud_page() function.
> mm: stats: make smap stats understand PUD THPs.
> mm: page_vma_walk: teach it about PMD-mapped PUD THP.
> mm: thp: 1GB THP support in try_to_unmap().
> mm: thp: split 1GB THPs at page reclaim.
> mm: thp: 1GB THP follow_p*d_page() support.
> mm: support 1GB THP pagemap support.
> mm: thp: add a knob to enable/disable 1GB THPs.
> mm: page_alloc: >=MAX_ORDER pages allocation and deallocation.
> hugetlb: cma: move cma reserve function to cma.c.
> mm: thp: use cma reservation for pud thp allocation.

Surprised this doesn't touch mm/pagewalk.c ?

Jason

2020-09-02 18:48:30

by Zi Yan

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 2 Sep 2020, at 14:40, Jason Gunthorpe wrote:

> Surprised this doesn't touch mm/pagewalk.c ?

1GB PUD page support is already present for DAX, so the code is in
mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
the functions in mm/pagewalk.c. :)


Best Regards,
Yan Zi
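
Note for readers: the interaction described above, where a caller supplies ops->pud_entry and steers the walker through walk->action, can be modelled in a few lines of userspace C. This is a hedged sketch only: the demo_ types and the DEMO_HUGE bit are hypothetical stand-ins, not the kernel's struct mm_walk or its page table bits, and the real walker also handles holes, VMAs and PUD splitting.

```c
/* Mirrors the three walker actions; values here are illustrative. */
enum demo_action { DEMO_SUBTREE, DEMO_CONTINUE, DEMO_AGAIN };

#define DEMO_HUGE 0x80UL	/* stand-in for "this entry maps a huge page" */

struct demo_walk;

struct demo_ops {
	int (*pud_entry)(unsigned long *pud, struct demo_walk *walk);
	int (*pmd_entry)(unsigned long *pud, struct demo_walk *walk);
};

struct demo_walk {
	const struct demo_ops *ops;
	enum demo_action action;
	int huge_seen;		/* entries handled at PUD level */
	int pmd_visits;		/* entries handed to the PMD level */
};

/* Handle huge entries here and tell the walker not to descend. */
static int demo_pud_entry(unsigned long *pud, struct demo_walk *walk)
{
	if (*pud & DEMO_HUGE) {
		walk->huge_seen++;
		walk->action = DEMO_CONTINUE;
	}
	return 0;
}

/* Count every entry that reaches the PMD level. */
static int demo_pmd_entry(unsigned long *pud, struct demo_walk *walk)
{
	(void)pud;
	walk->pmd_visits++;
	return 0;
}

/* One step of a PUD-level walk: callback first, then maybe descend. */
static int demo_walk_one_pud(unsigned long *pud, struct demo_walk *walk)
{
	int err;

again:
	walk->action = DEMO_SUBTREE;
	if (walk->ops->pud_entry) {
		err = walk->ops->pud_entry(pud, walk);
		if (err)
			return err;
	}
	if (walk->action == DEMO_AGAIN)
		goto again;
	if (walk->action == DEMO_CONTINUE || !walk->ops->pmd_entry)
		return 0;
	return walk->ops->pmd_entry(pud, walk);
}
```

With both callbacks installed, a huge entry is counted once at the PUD level and never reaches the PMD callback, while a normal table entry falls through to it.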



2020-09-02 18:52:17

by Jason Gunthorpe

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:

> > Surprised this doesn't touch mm/pagewalk.c ?
>
> 1GB PUD page support is present for DAX purpose, so the code is there
> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
> the functions in mm/pagewalk.c. :)

Yes, but doesn't this change what is possible under the mmap_sem
without the page table locks?

i.e., I would expect something like pmd_trans_unstable() to be required
as well for lockless walkers (and I don't think the pmd code is 100%
right either).

Jason
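
Note for readers: the concern about walkers that run without the page table lock is that the entry can change between two reads, so the "is it present" and "is it huge" tests must be made on one snapshot. The following userspace analogy uses C11 atomics to show the read-once-then-test pattern. It is only a sketch: the demo_ names are hypothetical, the two bits are stand-ins chosen to resemble x86 present/huge bits, and the kernel's own helpers cover more cases than this.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ENTRY_PRESENT (UINT64_C(1) << 0)
#define DEMO_ENTRY_HUGE    (UINT64_C(1) << 7)

/* Decide on a value, not on the live slot, so both tests see one entry. */
static bool demo_entry_unstable(uint64_t snap)
{
	return !(snap & DEMO_ENTRY_PRESENT) || (snap & DEMO_ENTRY_HUGE);
}

/* Returns true when the snapshot points to a lower-level table. */
static bool demo_can_descend(_Atomic uint64_t *slot)
{
	/* Single load: a concurrent writer cannot change what we test. */
	uint64_t snap = atomic_load_explicit(slot, memory_order_relaxed);

	return !demo_entry_unstable(snap);
}
```

A caller would retry or skip the entry whenever demo_can_descend() returns false, rather than re-reading the slot and descending on a value it never checked.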

2020-09-02 19:07:55

by Zi Yan

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:

> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
>
>>> Surprised this doesn't touch mm/pagewalk.c ?
>>
>> 1GB PUD page support is present for DAX purpose, so the code is there
>> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
>> the functions in mm/pagewalk.c. :)
>
> Yes, but doesn't this change what is possible under the mmap_sem
> without the page table locks?
>
> ie I would expect some thing like pmd_trans_unstable() to be required
> as well for lockless walkers. (and I don't think the pmd code is 100%
> right either)
>

Right. I missed that. Thanks for pointing it out.
Something like this, right?

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..4fe6ce4a92eb 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -152,10 +152,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
!(ops->pmd_entry || ops->pte_entry))
continue;

- if (walk->vma)
+ if (walk->vma) {
split_huge_pud(walk->vma, pud, addr);
- if (pud_none(*pud))
- goto again;
+ if (pud_trans_unstable(pud))
+ goto again;
+ }

err = walk_pmd_range(pud, addr, next, walk);
if (err)



Best Regards,
Yan Zi



2020-09-02 19:38:28

by Zi Yan

Subject: [RFC PATCH 03/16] mm: proc: add 1GB THP kpageflag.

From: Zi Yan <[email protected]>

Bit 27 (KPF_PUD_THP) of /proc/kpageflags is used to identify 1GB THPs.

Signed-off-by: Zi Yan <[email protected]>
---
fs/proc/page.c | 2 ++
include/uapi/linux/kernel-page-flags.h | 2 ++
2 files changed, 4 insertions(+)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index f3b39a7d2bf3..e4e2ad3612c9 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -161,6 +161,8 @@ u64 stable_page_flags(struct page *page)
u |= BIT_ULL(KPF_ZERO_PAGE);
u |= BIT_ULL(KPF_THP);
}
+ if (compound_order(head) == HPAGE_PUD_ORDER)
+ u |= BIT_ULL(KPF_PUD_THP);
} else if (is_zero_pfn(page_to_pfn(page)))
u |= BIT_ULL(KPF_ZERO_PAGE);

diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index 6f2f2720f3ac..cdeb33ab655c 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -36,5 +36,7 @@
#define KPF_ZERO_PAGE 24
#define KPF_IDLE 25
#define KPF_PGTABLE 26
+#define KPF_PUD_THP 27
+

#endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
--
2.28.0
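
Note for readers: a userspace tool that has read a 64-bit word from /proc/kpageflags could test for the new flag as sketched below. This assumes the bit number proposed in this patch (27); the demo_ names are illustrative only, and KPF_THP at bit 22 is the existing flag for transparent huge pages.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_KPF_THP     22	/* existing flag: part of a THP */
#define DEMO_KPF_PUD_THP 27	/* flag proposed by this patch */

/* True if the flags word marks the page as part of any THP. */
static bool demo_is_thp(uint64_t kpageflags)
{
	return (kpageflags & (UINT64_C(1) << DEMO_KPF_THP)) != 0;
}

/* True if the flags word marks the page as part of a 1GB (PUD) THP. */
static bool demo_is_pud_thp(uint64_t kpageflags)
{
	return (kpageflags & (UINT64_C(1) << DEMO_KPF_PUD_THP)) != 0;
}
```

The shifts use a 64-bit constant so the same helpers stay correct for flag numbers above 31.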

2020-09-02 19:38:33

by Zi Yan

Subject: [RFC PATCH 10/16] mm: thp: split 1GB THPs at page reclaim.

From: Zi Yan <[email protected]>

We cannot swap 1GB THPs, so split them before swapping them out.

Signed-off-by: Zi Yan <[email protected]>
---
mm/swap_slots.c | 2 ++
mm/vmscan.c | 58 +++++++++++++++++++++++++++++++++++++------------
2 files changed, 46 insertions(+), 14 deletions(-)

diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 3e6453573a89..65b8742a0446 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -312,6 +312,8 @@ swp_entry_t get_swap_page(struct page *page)
entry.val = 0;

if (PageTransHuge(page)) {
+ if (compound_order(page) == HPAGE_PUD_ORDER)
+ return entry;
if (IS_ENABLED(CONFIG_THP_SWAP))
get_swap_pages(1, &entry, HPAGE_PMD_NR);
goto out;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99e1796eb833..617d15a041f8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1240,23 +1240,49 @@ static unsigned int shrink_page_list(struct list_head *page_list,
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
if (PageTransHuge(page)) {
- /* cannot split THP, skip it */
- if (!can_split_huge_page(page, NULL))
- goto activate_locked;
- /*
- * Split pages without a PMD map right
- * away. Chances are some or all of the
- * tail pages can be freed without IO.
- */
- if (!compound_mapcount(page) &&
- split_huge_page_to_list(page,
- page_list))
+ if (compound_order(page) == HPAGE_PUD_ORDER) {
+ /* cannot split THP, skip it */
+ if (!can_split_huge_pud_page(page, NULL))
+ goto activate_locked;
+ /*
+ * Split pages without a PUD map right
+ * away. Chances are some or all of the
+ * tail pages can be freed without IO.
+ */
+ if (!compound_mapcount(page) &&
+ split_huge_pud_page_to_list(page,
+ page_list))
+ goto activate_locked;
+ }
+ if (compound_order(page) == HPAGE_PMD_ORDER) {
+ /* cannot split THP, skip it */
+ if (!can_split_huge_page(page, NULL))
+ goto activate_locked;
+ /*
+ * Split pages without a PMD map right
+ * away. Chances are some or all of the
+ * tail pages can be freed without IO.
+ */
+ if (!compound_mapcount(page) &&
+ split_huge_page_to_list(page,
+ page_list))
+ goto activate_locked;
+ }
+ }
+ /* Split PUD THPs before swapping */
+ if (compound_order(page) == HPAGE_PUD_ORDER) {
+ if (split_huge_pud_page_to_list(page, page_list))
goto activate_locked;
+ else {
+ sc->nr_scanned -= (nr_pages - HPAGE_PMD_NR);
+ nr_pages = HPAGE_PMD_NR;
+ }
}
if (!add_to_swap(page)) {
if (!PageTransHuge(page))
goto activate_locked_split;
/* Fallback to swap normal pages */
+ VM_BUG_ON_PAGE(compound_order(page) != HPAGE_PMD_ORDER, page);
if (split_huge_page_to_list(page,
page_list))
goto activate_locked;
@@ -1273,6 +1299,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
mapping = page_mapping(page);
}
} else if (unlikely(PageTransHuge(page))) {
+ VM_BUG_ON_PAGE(compound_order(page) != HPAGE_PMD_ORDER, page);
/* Split file THP */
if (split_huge_page_to_list(page, page_list))
goto keep_locked;
@@ -1298,9 +1325,12 @@ static unsigned int shrink_page_list(struct list_head *page_list,
enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
bool was_swapbacked = PageSwapBacked(page);

- if (unlikely(PageTransHuge(page)))
- flags |= TTU_SPLIT_HUGE_PMD;
-
+ if (unlikely(PageTransHuge(page))) {
+ if (compound_order(page) == HPAGE_PMD_ORDER)
+ flags |= TTU_SPLIT_HUGE_PMD;
+ else if (compound_order(page) == HPAGE_PUD_ORDER)
+ flags |= TTU_SPLIT_HUGE_PUD;
+ }
if (!try_to_unmap(page, flags)) {
stat->nr_unmap_fail += nr_pages;
if (!was_swapbacked && PageSwapBacked(page))
--
2.28.0

2020-09-02 19:39:27

by Zi Yan

Subject: [RFC PATCH 04/16] mm: thp: 1GB THP copy on write implementation.

From: Zi Yan <[email protected]>

COW on 1GB THPs will fall back to 2MB THPs if 1GB THP is not available.

Signed-off-by: Zi Yan <[email protected]>
---
arch/x86/include/asm/pgalloc.h | 9 ++++++
include/linux/huge_mm.h | 5 ++++
mm/huge_memory.c | 54 ++++++++++++++++++++++++++++++++++
mm/memory.c | 2 +-
mm/swapfile.c | 4 ++-
5 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index fae13467d3e1..31221269c387 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -98,6 +98,15 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,

#define pmd_pgtable(pmd) pmd_page(pmd)

+static inline void pud_populate_with_pgtable(struct mm_struct *mm, pud_t *pud,
+ struct page *pte)
+{
+ unsigned long pfn = page_to_pfn(pte);
+
+ paravirt_alloc_pmd(mm, pfn);
+ set_pud(pud, __pud(((pteval_t)pfn << PAGE_SHIFT) | _PAGE_TABLE));
+}
+
#if CONFIG_PGTABLE_LEVELS > 2
static inline pmd_t *pmd_alloc_one_page_with_ptes(struct mm_struct *mm, unsigned long addr)
{
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7528652400e4..0c20a8ea6911 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,7 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
+extern vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud);
#else
static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
{
@@ -27,6 +28,10 @@ extern int do_huge_pud_anonymous_page(struct vm_fault *vmf)
{
return VM_FAULT_FALLBACK;
}
+static inline vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
+{
+ return VM_FAULT_FALLBACK;
+}
#endif

extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec3847392208..6da9b02501b7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1334,6 +1334,60 @@ void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
unlock:
spin_unlock(vmf->ptl);
}
+
+vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct page *page = NULL;
+ unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+
+ vmf->ptl = pud_lockptr(vma->vm_mm, vmf->pud);
+ VM_BUG_ON_VMA(!vma->anon_vma, vma);
+
+ if (is_huge_zero_pud(orig_pud))
+ goto fallback;
+
+ spin_lock(vmf->ptl);
+
+ if (unlikely(!pud_same(*vmf->pud, orig_pud))) {
+ spin_unlock(vmf->ptl);
+ return 0;
+ }
+
+ page = pud_page(orig_pud);
+ VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
+
+ /* Lock page for reuse_swap_page() */
+ if (!trylock_page(page)) {
+ get_page(page);
+ spin_unlock(vmf->ptl);
+ lock_page(page);
+ spin_lock(vmf->ptl);
+ if (unlikely(!pud_same(*vmf->pud, orig_pud))) {
+ unlock_page(page);
+ put_page(page);
+ return 0;
+ }
+ put_page(page);
+ }
+ if (reuse_swap_page(page, NULL)) {
+ pud_t entry;
+
+ entry = pud_mkyoung(orig_pud);
+ entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
+ if (pudp_set_access_flags(vma, haddr, vmf->pud, entry, 1))
+ update_mmu_cache_pud(vma, vmf->address, vmf->pud);
+ unlock_page(page);
+ spin_unlock(vmf->ptl);
+ return VM_FAULT_WRITE;
+ }
+ unlock_page(page);
+ spin_unlock(vmf->ptl);
+fallback:
+ __split_huge_pud(vma, vmf->pud, vmf->address);
+ return VM_FAULT_FALLBACK;
+}
+
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */

void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
diff --git a/mm/memory.c b/mm/memory.c
index 6f86294438fd..b88587256bc1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4165,7 +4165,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* No support for anonymous transparent PUD pages yet */
if (vma_is_anonymous(vmf->vma))
- return VM_FAULT_FALLBACK;
+ return do_huge_pud_wp_page(vmf, orig_pud);
if (vmf->vma->vm_ops->huge_fault)
return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 20012c0c0252..e3f771c2ad83 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1635,7 +1635,9 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
/* hugetlbfs shouldn't call it */
VM_BUG_ON_PAGE(PageHuge(page), page);

- if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page))) {
+ if (!IS_ENABLED(CONFIG_THP_SWAP) ||
+ unlikely(compound_order(compound_head(page)) == HPAGE_PUD_ORDER) ||
+ likely(!PageTransCompound(page))) {
mapcount = page_trans_huge_mapcount(page, total_mapcount);
if (PageSwapCache(page))
swapcount = page_swapcount(page);
--
2.28.0

2020-09-02 19:40:14

by Zi Yan

Subject: [RFC PATCH 11/16] mm: thp: 1GB THP follow_p*d_page() support.

From: Zi Yan <[email protected]>

Add follow_page support for 1GB THPs.

Signed-off-by: Zi Yan <[email protected]>
---
include/linux/huge_mm.h | 11 +++++++
mm/gup.c | 60 ++++++++++++++++++++++++++++++++-
mm/huge_memory.c | 73 ++++++++++++++++++++++++++++++++++++++++-
3 files changed, 142 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 589e5af5a1c2..c7bc40c4a5e2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -20,6 +20,10 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
extern vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud);
+extern struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+ unsigned long addr,
+ pud_t *pud,
+ unsigned int flags);
#else
static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
{
@@ -32,6 +36,13 @@ extern vm_fault_t do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
{
return VM_FAULT_FALLBACK;
}
+static inline struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+ unsigned long addr,
+ pud_t *pud,
+ unsigned int flags)
+{
+ return NULL;
+}
#endif

extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
diff --git a/mm/gup.c b/mm/gup.c
index bd883a112724..4b32ae3c5fa2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -698,10 +698,68 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
if (page)
return page;
}
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+ if (likely(!pud_trans_huge(*pud))) {
+ if (unlikely(pud_bad(*pud)))
+ return no_page_table(vma, flags);
+ return follow_pmd_mask(vma, address, pud, flags, ctx);
+ }
+
+ ptl = pud_lock(mm, pud);
+
+ if (unlikely(!pud_trans_huge(*pud))) {
+ spin_unlock(ptl);
+ if (unlikely(pud_bad(*pud)))
+ return no_page_table(vma, flags);
+ return follow_pmd_mask(vma, address, pud, flags, ctx);
+ }
+
+ if (flags & FOLL_SPLIT) {
+ int ret;
+ pmd_t *pmd = NULL;
+
+ page = pud_page(*pud);
+ if (is_huge_zero_page(page)) {
+
+ spin_unlock(ptl);
+ ret = 0;
+ split_huge_pud(vma, pud, address);
+ pmd = pmd_offset(pud, address);
+ split_huge_pmd(vma, pmd, address);
+ if (pmd_trans_unstable(pmd))
+ ret = -EBUSY;
+ } else {
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ ret = split_huge_pud_page(page);
+ if (!ret)
+ ret = split_huge_page(page);
+ else {
+ unlock_page(page);
+ put_page(page);
+ goto out;
+ }
+ unlock_page(page);
+ put_page(page);
+ if (pud_none(*pud))
+ return no_page_table(vma, flags);
+ pmd = pmd_offset(pud, address);
+ }
+out:
+ return ret ? ERR_PTR(ret) :
+ follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+ }
+ page = follow_trans_huge_pud(vma, address, pud, flags);
+ spin_unlock(ptl);
+ ctx->page_mask = HPAGE_PUD_NR - 1;
+ return page;
+#else
if (unlikely(pud_bad(*pud)))
return no_page_table(vma, flags);
-
return follow_pmd_mask(vma, address, pud, flags, ctx);
+#endif
}

static struct page *follow_p4d_mask(struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 398f1b52f789..e209c2dfc5b7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1259,6 +1259,77 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
return page;
}

+/*
+ * FOLL_FORCE can write to even unwritable pud's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pud(pud_t pud, unsigned int flags)
+{
+ return pud_write(pud) ||
+ ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pud_dirty(pud));
+}
+
+struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+ unsigned long addr,
+ pud_t *pud,
+ unsigned int flags)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page = NULL;
+
+ assert_spin_locked(pud_lockptr(mm, pud));
+
+ if (flags & FOLL_WRITE && !can_follow_write_pud(*pud, flags))
+ goto out;
+
+ /* Avoid dumping huge zero page */
+ if ((flags & FOLL_DUMP) && is_huge_zero_pud(*pud))
+ return ERR_PTR(-EFAULT);
+
+ /* Full NUMA hinting faults to serialise migration in fault paths */
+ /*&& pud_protnone(*pmd)*/
+ if ((flags & FOLL_NUMA))
+ goto out;
+
+ page = pud_page(*pud);
+ VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+ if (flags & FOLL_TOUCH)
+ touch_pud(vma, addr, pud, flags);
+ if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
+ /*
+ * We don't mlock() pte-mapped THPs. This way we can avoid
+ * leaking mlocked pages into non-VM_LOCKED VMAs.
+ *
+ * For anon THP:
+ *
+ * We do the same thing as PMD-level THP.
+ *
+ * For file THP:
+ *
+ * No support yet.
+ *
+ */
+
+ if (PageAnon(page) && compound_mapcount(page) != 1)
+ goto skip_mlock;
+ if (PagePUDDoubleMap(page) || !page->mapping)
+ goto skip_mlock;
+ if (!trylock_page(page))
+ goto skip_mlock;
+ lru_add_drain();
+ if (page->mapping && !PagePUDDoubleMap(page))
+ mlock_vma_page(page);
+ unlock_page(page);
+ }
+skip_mlock:
+ page += (addr & ~HPAGE_PUD_MASK) >> PAGE_SHIFT;
+ VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
+ if (flags & FOLL_GET)
+ get_page(page);
+
+out:
+ return page;
+}
int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
struct vm_area_struct *vma)
@@ -1501,7 +1572,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
goto out;

page = pmd_page(*pmd);
- VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+ VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page) && !PMDPageInPUD(page), page);

if (!try_grab_page(page, flags))
return ERR_PTR(-ENOMEM);
--
2.28.0

2020-09-02 19:59:35

by Jason Gunthorpe

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
>
> > On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
> >
> >>> Surprised this doesn't touch mm/pagewalk.c ?
> >>
> >> 1GB PUD page support is present for DAX purpose, so the code is there
> >> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
> >> the functions in mm/pagewalk.c. :)
> >
> > Yes, but doesn't this change what is possible under the mmap_sem
> > without the page table locks?
> >
> > i.e., I would expect something like pmd_trans_unstable() to be required
> > as well for lockless walkers. (and I don't think the pmd code is 100%
> > right either)
> >
>
> Right. I missed that. Thanks for pointing it out.
> The code like this, right?

Technically all those *pud dereferences are racy too; the design here with
the _unstable function call always seemed weird. I strongly suspect it
should mirror how get_user_pages_fast works for lockless walking.

Jason

2020-09-02 20:34:05

by Zi Yan

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 2 Sep 2020, at 15:57, Jason Gunthorpe wrote:

> On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
>> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
>>
>>> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
>>>
>>>>> Surprised this doesn't touch mm/pagewalk.c ?
>>>>
>>>> 1GB PUD page support is present for DAX purpose, so the code is there
>>>> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
>>>> the functions in mm/pagewalk.c. :)
>>>
>>> Yes, but doesn't this change what is possible under the mmap_sem
>>> without the page table locks?
>>>
>>> i.e., I would expect something like pmd_trans_unstable() to be required
>>> as well for lockless walkers. (and I don't think the pmd code is 100%
>>> right either)
>>>
>>
>> Right. I missed that. Thanks for pointing it out.
>> The code like this, right?
>
> Technically all those *pud's are racy too, the design here with the
> _unstable function call always seemed weird. I strongly suspect it
> should mirror how get_user_pages_fast works for lockless walking

Do you mean doing READ_ONCE on the page table entry pointer first, then using
that value for the rest of the loop? I am not very familiar with this racy-check
part of the code and would be happy to hear more about it.
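
The pattern being asked about (read the entry once, then decide everything from that snapshot) can be sketched in userspace as below. This is only an illustration under stated assumptions, not kernel code: entry_t stands in for pud_t, read_entry_once() stands in for READ_ONCE(), the two flag bits are chosen to resemble the x86 present and huge bits, and walk_one() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t val; } entry_t;   /* stand-in for pud_t */

#define ENTRY_PRESENT  (1ULL << 0)           /* illustrative bit choice */
#define ENTRY_HUGE     (1ULL << 7)           /* illustrative bit choice */

/* Userspace stand-in for READ_ONCE(): exactly one volatile load. */
static entry_t read_entry_once(const entry_t *p)
{
    entry_t e;

    e.val = *(const volatile uint64_t *)&p->val;
    return e;
}

enum walk_action { WALK_SKIP, WALK_LEAF, WALK_DESCEND };

/* Every decision is made on the snapshot, never by re-reading the slot. */
static enum walk_action classify(entry_t e)
{
    if (!(e.val & ENTRY_PRESENT))
        return WALK_SKIP;
    if (e.val & ENTRY_HUGE)
        return WALK_LEAF;
    return WALK_DESCEND;
}

static enum walk_action walk_one(const entry_t *slot)
{
    entry_t snap = read_entry_once(slot);

    return classify(snap);
}
```

The point of the snapshot is that a concurrent split or collapse can change the slot between two dereferences; testing *pud twice may observe two different entries, while testing snap twice cannot.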



Best Regards,
Yan Zi



2020-09-03 07:37:04

by Michal Hocko

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed 02-09-20 14:06:12, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> Hi all,
>
> This patchset adds support for 1GB THP on x86_64. It is on top of
> v5.9-rc2-mmots-2020-08-25-21-13.
>
> 1GB THP is more flexible for reducing translation overhead and increasing the
> performance of applications with large memory footprint without application
> changes compared to hugetlb.

Please be more specific about use cases. There had better be some strong
ones, because the THP code is already too complex to add to solely on the
basis of easing generic TLB pressure.

> Design
> =======
>
> 1GB THP implementation looks similar to existing THP code except some new designs
> for the additional page table level.
>
> 1. Page table deposit and withdraw using a new pagechain data structure:
> instead of one PTE page table page, 1GB THP requires 513 page table pages
> (one PMD page table page and 512 PTE page table pages) to be deposited
> at the page allocation time, so that we can split the page later. Currently,
> the page table deposit is using ->lru, thus only one page can be deposited.
> A new pagechain data structure is added to enable multi-page deposit.
>
> 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
> and PTE entries. Mixing PUD and PTE mapping can be achieved with existing
> PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
> sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
> page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
> page[N*512 + 3].compound_mapcount.
>
> 3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is more sane
> to use something less intrusive. So all 1GB THPs are allocated from reserved
> CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
> THP is cleared as the resulting pages can be freed via normal page free path.
> We can fall back to alloc_contig_pages for 1GB THP if necessary.

Do those pages get instantiated during the page fault or only via
khugepaged? This is an important design detail because then we have to
think carefully about how automatic we want this to be. Memory
overhead can already be quite large with 2MB THPs. Also, what about the
allocation overhead? Do you have any numbers?

Maybe all these details are described in the patchset, but the cover
letter should contain all that information. It doesn't make much sense
to dig into details in a patchset this large without having an idea of how
feasible this is.

Thanks.

> Patch Organization
> =======
>
> Patch 01 adds the new pagechain data structure.
>
> Patches 02 to 13 add 1GB THP support in various places.
>
> Patch 14 tries to use alloc_contig_pages for 1GB THP allocation.
>
> Patch 15 moves hugetlb_cma reservation to cma.c and renames it to hugepage_cma.
>
> Patch 16 uses hugepage_cma reservation for 1GB THP allocation.
>
>
> Any suggestions and comments are welcome.
>
>
> Zi Yan (16):
> mm: add pagechain container for storing multiple pages.
> mm: thp: 1GB anonymous page implementation.
> mm: proc: add 1GB THP kpageflag.
> mm: thp: 1GB THP copy on write implementation.
> mm: thp: handling 1GB THP reference bit.
> mm: thp: add 1GB THP split_huge_pud_page() function.
> mm: stats: make smap stats understand PUD THPs.
> mm: page_vma_walk: teach it about PMD-mapped PUD THP.
> mm: thp: 1GB THP support in try_to_unmap().
> mm: thp: split 1GB THPs at page reclaim.
> mm: thp: 1GB THP follow_p*d_page() support.
> mm: support 1GB THP pagemap support.
> mm: thp: add a knob to enable/disable 1GB THPs.
> mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
> hugetlb: cma: move cma reserve function to cma.c.
> mm: thp: use cma reservation for pud thp allocation.
>
> .../admin-guide/kernel-parameters.txt | 2 +-
> arch/arm64/mm/hugetlbpage.c | 2 +-
> arch/powerpc/mm/hugetlbpage.c | 2 +-
> arch/x86/include/asm/pgalloc.h | 68 ++
> arch/x86/include/asm/pgtable.h | 26 +
> arch/x86/kernel/setup.c | 8 +-
> arch/x86/mm/pgtable.c | 38 +
> drivers/base/node.c | 3 +
> fs/proc/meminfo.c | 2 +
> fs/proc/page.c | 2 +
> fs/proc/task_mmu.c | 122 ++-
> include/linux/cma.h | 18 +
> include/linux/huge_mm.h | 84 +-
> include/linux/hugetlb.h | 12 -
> include/linux/memcontrol.h | 5 +
> include/linux/mm.h | 29 +-
> include/linux/mm_types.h | 1 +
> include/linux/mmu_notifier.h | 13 +
> include/linux/mmzone.h | 1 +
> include/linux/page-flags.h | 47 +
> include/linux/pagechain.h | 73 ++
> include/linux/pgtable.h | 34 +
> include/linux/rmap.h | 10 +-
> include/linux/swap.h | 2 +
> include/linux/vm_event_item.h | 7 +
> include/uapi/linux/kernel-page-flags.h | 2 +
> kernel/events/uprobes.c | 4 +-
> kernel/fork.c | 5 +
> mm/cma.c | 119 +++
> mm/gup.c | 60 +-
> mm/huge_memory.c | 939 +++++++++++++++++-
> mm/hugetlb.c | 114 +--
> mm/internal.h | 2 +
> mm/khugepaged.c | 6 +-
> mm/ksm.c | 4 +-
> mm/memcontrol.c | 13 +
> mm/memory.c | 51 +-
> mm/mempolicy.c | 21 +-
> mm/migrate.c | 12 +-
> mm/page_alloc.c | 57 +-
> mm/page_vma_mapped.c | 129 ++-
> mm/pgtable-generic.c | 56 ++
> mm/rmap.c | 289 ++++--
> mm/swap.c | 31 +
> mm/swap_slots.c | 2 +
> mm/swapfile.c | 8 +-
> mm/userfaultfd.c | 2 +-
> mm/util.c | 16 +-
> mm/vmscan.c | 58 +-
> mm/vmstat.c | 8 +
> 50 files changed, 2270 insertions(+), 349 deletions(-)
> create mode 100644 include/linux/pagechain.h
>
> --
> 2.28.0
>

--
Michal Hocko
SUSE Labs

2020-09-03 14:48:09

by Kirill A. Shutemov

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> Hi all,
>
> This patchset adds support for 1GB THP on x86_64. It is on top of
> v5.9-rc2-mmots-2020-08-25-21-13.
>
> 1GB THP is more flexible for reducing translation overhead and increasing the
> performance of applications with large memory footprint without application
> changes compared to hugetlb.

This statement needs a lot of justification. I don't see 1GB THP as viable
for any workload. Opportunistic 1GB allocation is a very questionable
strategy.

> Design
> =======
>
> 1GB THP implementation looks similar to existing THP code except some new designs
> for the additional page table level.
>
> 1. Page table deposit and withdraw using a new pagechain data structure:
> instead of one PTE page table page, 1GB THP requires 513 page table pages
> (one PMD page table page and 512 PTE page table pages) to be deposited
> at the page allocation time, so that we can split the page later. Currently,
> the page table deposit is using ->lru, thus only one page can be deposited.

False. The current code can deposit an arbitrary number of page tables.

What may be a problem for you is that these page tables are tied to the
struct page of the PMD page table.
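
The figure of 513 page table pages in the quoted text follows from the x86_64 geometry: splitting a 1GB mapping down to 4KB PTEs needs one PMD table plus one PTE table per 2MB region. The short sketch below just writes that arithmetic out; the macro and function names are hypothetical and 4KB base pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_BYTES  4096ULL        /* assumed 4KB base page */
#define PMD_THP_BYTES    (2ULL << 20)   /* 2MB, covered by one PTE table */
#define PUD_THP_BYTES    (1ULL << 30)   /* 1GB, covered by one PMD table */

/* Page table pages needed to fully split one 1GB THP down to PTEs. */
static uint64_t deposit_pages_for_pud_thp(void)
{
    uint64_t pte_tables = PUD_THP_BYTES / PMD_THP_BYTES;  /* 512 */
    uint64_t pmd_tables = 1;

    return pmd_tables + pte_tables;
}

/* Memory held by that deposit, in bytes. */
static uint64_t deposit_bytes_for_pud_thp(void)
{
    return deposit_pages_for_pud_thp() * BASE_PAGE_BYTES;
}
```

That comes to 513 pages, or 2052KB of page tables set aside per 1GB THP, roughly 0.2% of the mapped size.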

> A new pagechain data structure is added to enable multi-page deposit.
>
> 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
> and PTE entries. Mixing PUD and PTE mapping can be achieved with existing
> PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
> sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
> page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
> page[N*512 + 3].compound_mapcount.

I had a hard time reasoning about DoubleMap vs. rmap. Good for you if you
get it right.

> 3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is more sane
> to use something less intrusive. So all 1GB THPs are allocated from reserved
> CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
> THP is cleared as the resulting pages can be freed via normal page free path.
> We can fall back to alloc_contig_pages for 1GB THP if necessary.
>

--
Kirill A. Shutemov

2020-09-03 16:32:12

by Roman Gushchin

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> > From: Zi Yan <[email protected]>
> >
> > Hi all,
> >
> > This patchset adds support for 1GB THP on x86_64. It is on top of
> > v5.9-rc2-mmots-2020-08-25-21-13.
> >
> > 1GB THP is more flexible for reducing translation overhead and increasing the
> > performance of applications with large memory footprint without application
> > changes compared to hugetlb.
>
> This statement needs a lot of justification. I don't see 1GB THP as viable
> for any workload. Opportunistic 1GB allocation is a very questionable
> strategy.

Hello, Kirill!

I share your skepticism about opportunistic 1 GB allocations; however, it might be useful
if backed by madvise() annotations from a userspace application. In this case,
1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
interface.

Thanks!

>
> > Design
> > =======
> >
> > 1GB THP implementation looks similar to existing THP code except some new designs
> > for the additional page table level.
> >
> > 1. Page table deposit and withdraw using a new pagechain data structure:
> > instead of one PTE page table page, 1GB THP requires 513 page table pages
> > (one PMD page table page and 512 PTE page table pages) to be deposited
> > at the page allocation time, so that we can split the page later. Currently,
> > the page table deposit is using ->lru, thus only one page can be deposited.
>
> False. The current code can deposit an arbitrary number of page tables.
>
> What may be a problem for you is that these page tables are tied to the
> struct page of the PMD page table.
>
> > A new pagechain data structure is added to enable multi-page deposit.
> >
> > 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
> > and PTE entries. Mixing PUD and PTE mapping can be achieved with existing
> > PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
> > sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
> > page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
> > page[N*512 + 3].compound_mapcount.
>
> I had a hard time reasoning about DoubleMap vs. rmap. Good for you if you
> get it right.
>
> > 3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is more sane
> > to use something less intrusive. So all 1GB THPs are allocated from reserved
> > CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
> > THP is cleared as the resulting pages can be freed via normal page free path.
> > We can fall back to alloc_contig_pages for 1GB THP if necessary.
> >
>
> --
> Kirill A. Shutemov

2020-09-03 16:33:52

by Roman Gushchin

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > From: Zi Yan <[email protected]>
> >
> > Hi all,
> >
> > This patchset adds support for 1GB THP on x86_64. It is on top of
> > v5.9-rc2-mmots-2020-08-25-21-13.
> >
> > 1GB THP is more flexible for reducing translation overhead and increasing the
> > performance of applications with large memory footprint without application
> > changes compared to hugetlb.
>
> Please be more specific about use cases. There had better be some strong
> ones, because the THP code is already too complex to add to solely on the
> basis of easing generic TLB pressure.

Hello, Michal!

We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
performance wins on some workloads.

Historically we allocated gigantic pages at boot time, but recently moved
to a CMA-based dynamic approach. Still, the hugetlbfs interface requires more management
than we would like to do. 1 GB THP seems to be a better alternative, so I definitely
see it as a very useful feature.

Given the cost of an allocation, I'm slightly skeptical about an automatic
heuristics-based approach, but if an application can explicitly mark target areas
with madvise(), I don't see why it wouldn't work.
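
As a rough picture of the madvise() route, an application could reserve an aligned anonymous region and mark it with the existing MADV_HUGEPAGE hint, as sketched below. This is a sketch under assumptions: map_thp_hinted() is a hypothetical helper, and whether such a region would ever be backed by a 1 GB THP depends entirely on this RFC (and any knob it adds); today the hint only requests regular THP.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory aligned to align and hint THP use.
 * len must be a multiple of the page size and align a power of two.
 * Returns NULL if mmap() fails. The madvise() result (0 or -1) is stored
 * in *advised, since THP may be disabled on the running kernel.
 */
static void *map_thp_hinted(size_t len, size_t align, int *advised)
{
    size_t span = len + align;           /* over-allocate, then trim */
    char *raw, *aligned;
    size_t tail;

    assert(align && (align & (align - 1)) == 0);

    raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    aligned = (char *)(((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1));

    /* Drop the unaligned head and the leftover tail of the mapping. */
    if (aligned > raw)
        munmap(raw, (size_t)(aligned - raw));
    tail = (size_t)((raw + span) - (aligned + len));
    if (tail)
        munmap(aligned + len, tail);

    *advised = madvise(aligned, len, MADV_HUGEPAGE);
    return aligned;
}
```

For the 1 GB case a caller would pass 1UL << 30 for both len and align at application start, then munmap() the region on exit; nothing needs to be reserved through hugetlbfs beforehand.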

In our case we'd like to have a reliable way to get 1 GB THPs at some point
(usually at the start of an application), and transparently destroy them on
application exit.

Once the patchset is in relatively good shape, I'll be happy to give
it a test in our environment and share results.

Thanks!

>
> > Design
> > =======
> >
> > 1GB THP implementation looks similar to existing THP code except some new designs
> > for the additional page table level.
> >
> > 1. Page table deposit and withdraw using a new pagechain data structure:
> > instead of one PTE page table page, 1GB THP requires 513 page table pages
> > (one PMD page table page and 512 PTE page table pages) to be deposited
> > at the page allocation time, so that we can split the page later. Currently,
> > the page table deposit is using ->lru, thus only one page can be deposited.
> > A new pagechain data structure is added to enable multi-page deposit.
> >
> > 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
> > and PTE entries. Mixing PUD and PTE mapping can be achieved with existing
> > PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
> > sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
> > page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
> > page[N*512 + 3].compound_mapcount.
> >
> > 3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is more sane
> > to use something less intrusive. So all 1GB THPs are allocated from reserved
> > CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
> > THP is cleared as the resulting pages can be freed via normal page free path.
> > We can fall back to alloc_contig_pages for 1GB THP if necessary.
>
> Do those pages get instantiated during the page fault or only via
> khugepaged? This is an important design detail because then we have to
> think carefully about how automatic we want this to be. Memory
> overhead can already be quite large with 2MB THPs. Also, what about the
> allocation overhead? Do you have any numbers?
>
> Maybe all these details are described in the patchset, but the cover
> letter should contain all that information. It doesn't make much sense
> to dig into details in a patchset this large without having an idea of how
> feasible this is.
>
> Thanks.
>
> > Patch Organization
> > =======
> >
> > Patch 01 adds the new pagechain data structure.
> >
> > Patches 02 to 13 add 1GB THP support in various places.
> >
> > Patch 14 tries to use alloc_contig_pages for 1GB THP allocation.
> >
> > Patch 15 moves hugetlb_cma reservation to cma.c and renames it to hugepage_cma.
> >
> > Patch 16 uses hugepage_cma reservation for 1GB THP allocation.
> >
> >
> > Any suggestions and comments are welcome.
> >
> >
> > Zi Yan (16):
> > mm: add pagechain container for storing multiple pages.
> > mm: thp: 1GB anonymous page implementation.
> > mm: proc: add 1GB THP kpageflag.
> > mm: thp: 1GB THP copy on write implementation.
> > mm: thp: handling 1GB THP reference bit.
> > mm: thp: add 1GB THP split_huge_pud_page() function.
> > mm: stats: make smap stats understand PUD THPs.
> > mm: page_vma_walk: teach it about PMD-mapped PUD THP.
> > mm: thp: 1GB THP support in try_to_unmap().
> > mm: thp: split 1GB THPs at page reclaim.
> > mm: thp: 1GB THP follow_p*d_page() support.
> > mm: support 1GB THP pagemap support.
> > mm: thp: add a knob to enable/disable 1GB THPs.
> > mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
> > hugetlb: cma: move cma reserve function to cma.c.
> > mm: thp: use cma reservation for pud thp allocation.
> >
> > .../admin-guide/kernel-parameters.txt | 2 +-
> > arch/arm64/mm/hugetlbpage.c | 2 +-
> > arch/powerpc/mm/hugetlbpage.c | 2 +-
> > arch/x86/include/asm/pgalloc.h | 68 ++
> > arch/x86/include/asm/pgtable.h | 26 +
> > arch/x86/kernel/setup.c | 8 +-
> > arch/x86/mm/pgtable.c | 38 +
> > drivers/base/node.c | 3 +
> > fs/proc/meminfo.c | 2 +
> > fs/proc/page.c | 2 +
> > fs/proc/task_mmu.c | 122 ++-
> > include/linux/cma.h | 18 +
> > include/linux/huge_mm.h | 84 +-
> > include/linux/hugetlb.h | 12 -
> > include/linux/memcontrol.h | 5 +
> > include/linux/mm.h | 29 +-
> > include/linux/mm_types.h | 1 +
> > include/linux/mmu_notifier.h | 13 +
> > include/linux/mmzone.h | 1 +
> > include/linux/page-flags.h | 47 +
> > include/linux/pagechain.h | 73 ++
> > include/linux/pgtable.h | 34 +
> > include/linux/rmap.h | 10 +-
> > include/linux/swap.h | 2 +
> > include/linux/vm_event_item.h | 7 +
> > include/uapi/linux/kernel-page-flags.h | 2 +
> > kernel/events/uprobes.c | 4 +-
> > kernel/fork.c | 5 +
> > mm/cma.c | 119 +++
> > mm/gup.c | 60 +-
> > mm/huge_memory.c | 939 +++++++++++++++++-
> > mm/hugetlb.c | 114 +--
> > mm/internal.h | 2 +
> > mm/khugepaged.c | 6 +-
> > mm/ksm.c | 4 +-
> > mm/memcontrol.c | 13 +
> > mm/memory.c | 51 +-
> > mm/mempolicy.c | 21 +-
> > mm/migrate.c | 12 +-
> > mm/page_alloc.c | 57 +-
> > mm/page_vma_mapped.c | 129 ++-
> > mm/pgtable-generic.c | 56 ++
> > mm/rmap.c | 289 ++++--
> > mm/swap.c | 31 +
> > mm/swap_slots.c | 2 +
> > mm/swapfile.c | 8 +-
> > mm/userfaultfd.c | 2 +-
> > mm/util.c | 16 +-
> > mm/vmscan.c | 58 +-
> > mm/vmstat.c | 8 +
> > 50 files changed, 2270 insertions(+), 349 deletions(-)
> > create mode 100644 include/linux/pagechain.h
> >
> > --
> > 2.28.0
> >
>
> --
> Michal Hocko
> SUSE Labs

2020-09-03 16:41:56

by Jason Gunthorpe

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed, Sep 02, 2020 at 04:29:46PM -0400, Zi Yan wrote:
> On 2 Sep 2020, at 15:57, Jason Gunthorpe wrote:
>
> > On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
> >> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
> >>
> >>> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
> >>>
> >>>>> Surprised this doesn't touch mm/pagewalk.c ?
> >>>>
> >>>> 1GB PUD page support is present for DAX purpose, so the code is there
> >>>> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
> >>>> the functions in mm/pagewalk.c. :)
> >>>
> >>> Yes, but doesn't this change what is possible under the mmap_sem
> >>> without the page table locks?
> >>>
> >>> ie I would expect some thing like pmd_trans_unstable() to be required
> >>> as well for lockless walkers. (and I don't think the pmd code is 100%
> >>> right either)
> >>>
> >>
> >> Right. I missed that. Thanks for pointing it out.
> >> The code like this, right?
> >
> > Technically all those *pud's are racy too, the design here with the
> > _unstable function call always seemed weird. I strongly suspect it
> > should mirror how get_user_pages_fast works for lockless walking
>
> You mean READ_ONCE on page table entry pointer first, then use the value
> for the rest of the loop? I am not quite familiar with this racy check
> part of the code and happy to hear more about it.

There are two main issues with the THPs and lockless walks

- The *pXX value can change at any time, as THPs can be split at any
moment. However, once observed to be a sub page table pointer the
value is fixed under the read side of the mmap (I think, I never
did find the code path supporting this, but everything is busted if
it isn't true...)

- Reading the *pXX without load tearing is difficult on 32 bit arches

So if you do READ_ONCE() it defeats the first problem.

However if the sizeof(*pXX) is 8 on a 32 bit platform then load
tearing is a problem. At least the various pXX_*() test functions
operate on a single 32 bit word so don't tear, but to convert the
*pXX to a lower level page table pointer a coherent, untorn, read is
required.

So, looking again, I remember now, I could never quite figure out why
gup_pmd_range() was safe to do:

pmd_t pmd = READ_ONCE(*pmdp);
[..]
} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
[..]
ptem = ptep = pte_offset_map(&pmd, addr);

As I don't see what prevents load tearing a 64 bit pmd.. Eg no
pmd_trans_unstable() or equivalent here.

But we see gup_get_pte() using an anti-load tearing technique..
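
For illustration, a minimal userspace sketch of that kind of anti-tearing
read, written with C11 atomics rather than the kernel's own primitives. It
assumes a 64-bit entry stored as two 32-bit words and a writer that always
changes the low word whenever it changes the entry; `struct split_entry` and
`read_entry_untorn()` are made-up names for this example, not kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit table entry stored as two 32-bit words, as on a
 * 32-bit platform that cannot load all 8 bytes in one access. */
struct split_entry {
	_Atomic uint32_t low;
	_Atomic uint32_t high;
};

/* Read both halves, then re-read the low half. If the low half changed in
 * between, the two halves may belong to different entries, so retry.
 * Assumption: every writer modifies the low word when it changes the entry. */
static uint64_t read_entry_untorn(struct split_entry *e)
{
	uint32_t low, high;

	assert(e != NULL);
	do {
		low = atomic_load_explicit(&e->low, memory_order_acquire);
		high = atomic_load_explicit(&e->high, memory_order_acquire);
	} while (low != atomic_load_explicit(&e->low, memory_order_acquire));

	return ((uint64_t)high << 32) | low;
}
```

The retry loop only detects a change that is visible in the low word, which
is why the writer-side assumption matters; without it a torn value could
still slip through.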

Jason

2020-09-03 16:53:33

by Jason Gunthorpe

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Thu, Sep 03, 2020 at 09:25:27AM -0700, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> > On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > > From: Zi Yan <[email protected]>
> > >
> > > Hi all,
> > >
> > > This patchset adds support for 1GB THP on x86_64. It is on top of
> > > v5.9-rc2-mmots-2020-08-25-21-13.
> > >
> > > 1GB THP is more flexible for reducing translation overhead and increasing the
> > > performance of applications with large memory footprint without application
> > > changes compared to hugetlb.
> >
> > Please be more specific about usecases. This better have some strong
> > ones because THP code is complex enough already to add on top solely
> > based on a generic TLB pressure easing.
>
> Hello, Michal!
>
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.

At least from a RDMA NIC perspective I've heard from a lot of users
that higher order pages at the DMA level are giving big speed ups too.

It is basically the same dynamic as CPU TLB, except missing a 'TLB'
cache in a PCI-E device is dramatically more expensive to refill. With
200G and soon 400G networking these misses are a growing problem.

With HPC nodes now pushing 1TB of actual physical RAM and single
applications basically using all of it, there is definitely some
meaningful return - if pages can be reliably available.

At least for HPC where the node returns to an idle state after each
job and most of the 1TB memory becomes freed up again, it seems more
believable to me that a large cache of 1G pages could be available?

Even triggering some kind of cleaner between jobs to defragment could
be a reasonable approach..

Jason

2020-09-03 16:59:31

by Matthew Wilcox

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Thu, Sep 03, 2020 at 01:40:32PM -0300, Jason Gunthorpe wrote:
> However if the sizeof(*pXX) is 8 on a 32 bit platform then load
> tearing is a problem. At least the various pXX_*() test functions
> operate on a single 32 bit word so don't tear, but to convert the
> *pXX to a lower level page table pointer a coherent, untorn, read is
> required.
>
> So, looking again, I remember now, I could never quite figure out why
> gup_pmd_range() was safe to do:
>
> pmd_t pmd = READ_ONCE(*pmdp);
> [..]
> } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> [..]
> ptem = ptep = pte_offset_map(&pmd, addr);
>
> As I don't see what prevents load tearing a 64 bit pmd.. Eg no
> pmd_trans_unstable() or equivalent here.

I don't think there are any 32-bit page tables which support a PUD-sized
page. Pretty sure x86 doesn't until you get to 4- or 5- level page tables
(which need you to be running in 64-bit mode). There's not much utility
in having 1GB of your 3GB process address space taken up by a single page.

I'm OK if there are some oddball architectures which support it, but
Linux doesn't.

2020-09-03 17:04:42

by Matthew Wilcox

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Thu, Sep 03, 2020 at 01:50:51PM -0300, Jason Gunthorpe wrote:
> At least from a RDMA NIC perspective I've heard from a lot of users
> that higher order pages at the DMA level are giving big speed ups too.
>
> It is basically the same dynamic as CPU TLB, except missing a 'TLB'
> cache in a PCI-E device is dramatically more expensive to refill. With
> 200G and soon 400G networking these misses are a growing problem.
>
> With HPC nodes now pushing 1TB of actual physical RAM and single
> applications basically using all of it, there is definitely some
> meaningful return - if pages can be reliably available.
>
> At least for HPC where the node returns to an idle state after each
> job and most of the 1TB memory becomes freed up again, it seems more
> believable to me that a large cache of 1G pages could be available?

You may be interested in trying out my current THP patchset:

http://git.infradead.org/users/willy/pagecache.git

It doesn't allocate pages larger than PMD size, but it does allocate pages
*up to* PMD size for the page cache which means that larger pages are
easier to create as larger pages aren't fragmented all over the system.

If someone wants to opportunistically allocate pages larger than PMD
size, I've put some preliminary support in for that, but I've never
tested any of it. That's not my goal at the moment.

I'm not clear whether these HPC users primarily use page cache or
anonymous memory (with O_DIRECT). Probably a mixture.

2020-09-03 17:10:13

by Jason Gunthorpe

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Thu, Sep 03, 2020 at 05:55:59PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 03, 2020 at 01:40:32PM -0300, Jason Gunthorpe wrote:
> > However if the sizeof(*pXX) is 8 on a 32 bit platform then load
> > tearing is a problem. At least the various pXX_*() test functions
> > operate on a single 32 bit word so don't tear, but to convert the
> > *pXX to a lower level page table pointer a coherent, untorn, read is
> > required.
> >
> > So, looking again, I remember now, I could never quite figure out why
> > gup_pmd_range() was safe to do:
> >
> > pmd_t pmd = READ_ONCE(*pmdp);
> > [..]
> > } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> > [..]
> > ptem = ptep = pte_offset_map(&pmd, addr);
> >
> > As I don't see what prevents load tearing a 64 bit pmd.. Eg no
> > pmd_trans_unstable() or equivalent here.
>
> I don't think there are any 32-bit page tables which support a PUD-sized
> page. Pretty sure x86 doesn't until you get to 4- or 5- level page tables
> (which need you to be running in 64-bit mode). There's not much utility
> in having 1GB of your 3GB process address space taken up by a single page.

Makes sense for PUD, but why is the above GUP code OK for PMD?
pmd_trans_unstable() exists specifically to close read tearing races,
so it looks like a real problem?

> I'm OK if there are some oddball architectures which support it, but
> Linux doesn't.

So, based on that observation, I think something approximately like
this is needed for the page walker for PUD: (this has been on my
backlog to return to these patches..)

From 00a361ecb2d9e1226600d9e78e6e1803a886f2d6 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <[email protected]>
Date: Fri, 13 Mar 2020 13:15:36 -0300
Subject: [RFC] mm/pagewalk: use READ_ONCE when reading the PUD entry
unlocked

The pagewalker runs while only holding the mmap_sem for read. The pud can
be set asynchronously, while also holding the mmap_sem for read

eg from:

handle_mm_fault()
__handle_mm_fault()
create_huge_pmd()
dev_dax_huge_fault()
__dev_dax_pud_fault()
vmf_insert_pfn_pud()
insert_pfn_pud()
pud_lock()
set_pud_at()

At least x86 sets the PUD using WRITE_ONCE(), so an unlocked read of
unstable data should be paired to use READ_ONCE().

For the pagewalker to work locklessly the PUD must work similarly to the
PMD: once the PUD entry becomes a pointer to a PMD, it must be stable, and
safe to pass to pmd_offset()

Passing the value from READ_ONCE into the callbacks prevents the callers
from seeing inconsistencies after they re-read, such as seeing pud_none().

If a callback does obtain the pud_lock then it should trigger ACTION_AGAIN
if a data race caused the original value to change.

Use the same pattern as gup_pmd_range() and pass in the address of the
local READ_ONCE stack variable to pmd_offset() to avoid reading it again.

Signed-off-by: Jason Gunthorpe <[email protected]>
---
include/linux/pagewalk.h | 2 +-
mm/hmm.c | 16 +++++++---------
mm/mapping_dirty_helpers.c | 6 ++----
mm/pagewalk.c | 28 ++++++++++++++++------------
mm/ptdump.c | 3 +--
5 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b1cb6b753abb53..6caf28aadafbff 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -39,7 +39,7 @@ struct mm_walk_ops {
unsigned long next, struct mm_walk *walk);
int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
unsigned long next, struct mm_walk *walk);
- int (*pud_entry)(pud_t *pud, unsigned long addr,
+ int (*pud_entry)(pud_t pud, pud_t *pudp, unsigned long addr,
unsigned long next, struct mm_walk *walk);
int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
unsigned long next, struct mm_walk *walk);
diff --git a/mm/hmm.c b/mm/hmm.c
index 6d9da4b0f0a9f8..98ced96421b913 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -459,28 +459,26 @@ static inline uint64_t pud_to_hmm_pfn_flags(struct hmm_range *range, pud_t pud)
range->flags[HMM_PFN_VALID];
}

-static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
- struct mm_walk *walk)
+static int hmm_vma_walk_pud(pud_t pud, pud_t *pudp, unsigned long start,
+ unsigned long end, struct mm_walk *walk)
{
struct hmm_vma_walk *hmm_vma_walk = walk->private;
struct hmm_range *range = hmm_vma_walk->range;
unsigned long addr = start;
- pud_t pud;
int ret = 0;
spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);

if (!ptl)
return 0;
+ if (memcmp(pudp, &pud, sizeof(pud)) != 0) {
+ walk->action = ACTION_AGAIN;
+ spin_unlock(ptl);
+ return 0;
+ }

/* Normally we don't want to split the huge page */
walk->action = ACTION_CONTINUE;

- pud = READ_ONCE(*pudp);
- if (pud_none(pud)) {
- spin_unlock(ptl);
- return hmm_vma_walk_hole(start, end, -1, walk);
- }
-
if (pud_huge(pud) && pud_devmap(pud)) {
unsigned long i, npages, pfn;
uint64_t *pfns, cpu_flags;
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 71070dda9643d4..8943c2509ec0f7 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -125,12 +125,10 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
}

/* wp_clean_pud_entry - The pagewalk pud callback. */
-static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+static int wp_clean_pud_entry(pud_t pudval, pud_t *pudp, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
{
/* Dirty-tracking should be handled on the pte level */
- pud_t pudval = READ_ONCE(*pud);
-
if (pud_trans_huge(pudval) || pud_devmap(pudval))
WARN_ON(pud_write(pudval) || pud_dirty(pudval));

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 928df1638c30d1..cf99536cec23be 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
return err;
}

-static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+static int walk_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
pmd_t *pmd;
@@ -67,7 +67,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
int err = 0;
int depth = real_depth(3);

- pmd = pmd_offset(pud, addr);
+ pmd = pmd_offset(&pud, addr);
do {
again:
next = pmd_addr_end(addr, end);
@@ -119,17 +119,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
- pud_t *pud;
+ pud_t *pudp;
+ pud_t pud;
unsigned long next;
const struct mm_walk_ops *ops = walk->ops;
int err = 0;
int depth = real_depth(2);

- pud = pud_offset(p4d, addr);
+ pudp = pud_offset(p4d, addr);
do {
again:
+ pud = READ_ONCE(*pudp);
next = pud_addr_end(addr, end);
- if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
+ if (pud_none(pud) || (!walk->vma && !walk->no_vma)) {
if (ops->pte_hole)
err = ops->pte_hole(addr, next, depth, walk);
if (err)
@@ -140,27 +142,29 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
walk->action = ACTION_SUBTREE;

if (ops->pud_entry)
- err = ops->pud_entry(pud, addr, next, walk);
+ err = ops->pud_entry(pud, pudp, addr, next, walk);
if (err)
break;

if (walk->action == ACTION_AGAIN)
goto again;

- if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
+ if ((!walk->vma && (pud_leaf(pud) || !pud_present(pud))) ||
walk->action == ACTION_CONTINUE ||
!(ops->pmd_entry || ops->pte_entry))
continue;

- if (walk->vma)
- split_huge_pud(walk->vma, pud, addr);
- if (pud_none(*pud))
- goto again;
+ if (walk->vma) {
+ split_huge_pud(walk->vma, pudp, addr);
+ pud = READ_ONCE(*pudp);
+ if (pud_none(pud))
+ goto again;
+ }

err = walk_pmd_range(pud, addr, next, walk);
if (err)
break;
- } while (pud++, addr = next, addr != end);
+ } while (pudp++, addr = next, addr != end);

return err;
}
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 26208d0d03b7a9..c5e1717671e36a 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -59,11 +59,10 @@ static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
return 0;
}

-static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+static int ptdump_pud_entry(pud_t val, pud_t *pudp, unsigned long addr,
unsigned long next, struct mm_walk *walk)
{
struct ptdump_state *st = walk->private;
- pud_t val = READ_ONCE(*pud);

#if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_KASAN)
if (pud_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pmd)))
--
2.28.0
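
A userspace sketch of the snapshot-and-recheck flow the patch describes: the
walker takes one snapshot of the entry, hands both the snapshot and the
pointer to the callback, and the callback asks for a retry if the entry
changed by the time it holds the lock. All names here (`struct slot`,
`visit()`, `walk_one()`, `WALK_AGAIN`) are hypothetical stand-ins, and an
`atomic_flag` spinlock stands in for pud_lock(); this is not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

enum walk_action { WALK_CONTINUE, WALK_AGAIN };

struct slot {
	atomic_flag lock;        /* stand-in for the page table lock */
	_Atomic uint64_t entry;  /* stand-in for *pudp */
};

static void slot_init(struct slot *s, uint64_t value)
{
	atomic_flag_clear(&s->lock);
	atomic_init(&s->entry, value);
}

static void slot_lock(struct slot *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void slot_unlock(struct slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Callback: works on the snapshot, but first verifies under the lock that
 * the live entry still matches it. On a mismatch it requests a retry. */
static enum walk_action visit(uint64_t snap, struct slot *s, uint64_t *out)
{
	enum walk_action action = WALK_CONTINUE;

	assert(s && out);
	slot_lock(s);
	if (atomic_load_explicit(&s->entry, memory_order_relaxed) != snap)
		action = WALK_AGAIN;
	else
		*out = snap;
	slot_unlock(s);
	return action;
}

/* Walker: one unlocked read per attempt, never re-reading the entry after
 * the snapshot has been handed to the callback. */
static uint64_t walk_one(struct slot *s)
{
	uint64_t snap, out = 0;

	do {
		snap = atomic_load_explicit(&s->entry, memory_order_relaxed);
	} while (visit(snap, s, &out) == WALK_AGAIN);
	return out;
}
```

The point of passing the snapshot by value is that every check in one
attempt sees the same entry, so a concurrent change shows up as a single
retry instead of as inconsistent results between two reads.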

2020-09-03 17:19:59

by Jason Gunthorpe

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Thu, Sep 03, 2020 at 06:01:57PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 03, 2020 at 01:50:51PM -0300, Jason Gunthorpe wrote:
> > At least from a RDMA NIC perspective I've heard from a lot of users
> > that higher order pages at the DMA level are giving big speed ups too.
> >
> > It is basically the same dynamic as CPU TLB, except missing a 'TLB'
> > cache in a PCI-E device is dramatically more expensive to refill. With
> > 200G and soon 400G networking these misses are a growing problem.
> >
> > With HPC nodes now pushing 1TB of actual physical RAM and single
> > applications basically using all of it, there is definitely some
> > meaningful return - if pages can be reliably available.
> >
> > At least for HPC where the node returns to an idle state after each
> > job and most of the 1TB memory becomes freed up again, it seems more
> > believable to me that a large cache of 1G pages could be available?
>
> You may be interested in trying out my current THP patchset:
>
> http://git.infradead.org/users/willy/pagecache.git
>
> It doesn't allocate pages larger than PMD size, but it does allocate pages
> *up to* PMD size for the page cache which means that larger pages are
> easier to create as larger pages aren't fragmented all over the system.

Yeah, I saw that, it looks like a great direction.

> If someone wants to opportunistically allocate pages larger than PMD
> size, I've put some preliminary support in for that, but I've never
> tested any of it. That's not my goal at the moment.
>
> I'm not clear whether these HPC users primarily use page cache or
> anonymous memory (with O_DIRECT). Probably a mixture.

There are definitely HPC systems now that are filesystem-less - they
import data for computation from the network using things like blob
storage or some other kind of non-POSIX userspace based data storage
scheme.

Jason

2020-09-03 20:59:56

by Mike Kravetz

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 9/3/20 9:25 AM, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
>> On Wed 02-09-20 14:06:12, Zi Yan wrote:
>>> From: Zi Yan <[email protected]>
>>>
>>> Hi all,
>>>
>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>
>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>> performance of applications with large memory footprint without application
>>> changes compared to hugetlb.
>>
>> Please be more specific about usecases. This better have some strong
>> ones because THP code is complex enough already to add on top solely
>> based on a generic TLB pressure easing.
>
> Hello, Michal!
>
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.
>
> Historically we allocated gigantic pages at the boot time, but recently moved
> to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> see it as a very useful feature.
>
> Given the cost of an allocation, I'm slightly skeptical about an automatic
> heuristics-based approach, but if an application can explicitly mark target areas
> with madvise(), I don't see why it wouldn't work.
>
> In our case we'd like to have a reliable way to get 1 GB THPs at some point
> (usually at the start of an application), and transparently destroy them on
> the application exit.

Hi Roman,

In your current use case at Facebook, are you adding 1G hugetlb pages to
the hugetlb pool and then using them within applications? Or, are you
dynamically allocating them at fault time (hugetlb overcommit/surplus)?

Latency time for use of such pages includes:
- Putting together 1G contiguous
- Clearing 1G memory

In the 'allocation at fault time' mode you incur both costs at fault time.
If using pages from the pool, your only cost at fault time is clearing the
page.
--
Mike Kravetz

2020-09-04 07:45:40

by Michal Hocko

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Thu 03-09-20 09:25:27, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> > On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > > From: Zi Yan <[email protected]>
> > >
> > > Hi all,
> > >
> > > This patchset adds support for 1GB THP on x86_64. It is on top of
> > > v5.9-rc2-mmots-2020-08-25-21-13.
> > >
> > > 1GB THP is more flexible for reducing translation overhead and increasing the
> > > performance of applications with large memory footprint without application
> > > changes compared to hugetlb.
> >
> > Please be more specific about usecases. This better have some strong
> > ones because THP code is complex enough already to add on top solely
> > based on a generic TLB pressure easing.
>
> Hello, Michal!
>
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.

Let me clarify. I am not questioning 1GB (or large) pages in general. I
believe it is quite clear that there are usecases which hugely benefit
from them. I am mostly asking about the transparent part of it which
traditionally means that userspace mostly doesn't have to care and still
gets them. 2MB THPs have established certain expectations, mostly a really
aggressive pro-active instantiation. This has bitten us many times and
created a "you need to disable THP to fix your problem whatever that is"
cargo cult. I hope we do not want to repeat that mistake here again.

> Historically we allocated gigantic pages at the boot time, but recently moved
> to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> see it as a very useful feature.
>
> Given the cost of an allocation, I'm slightly skeptical about an automatic
> heuristics-based approach, but if an application can explicitly mark target areas
> with madvise(), I don't see why it wouldn't work.

An explicit opt-in sounds much more appropriate to me as well. If we go
with a specific API then I would not make it 1GB pages specific. Why
cannot we have an explicit interface to "defragment" address space
range into large pages and the kernel would use large pages where
appropriate? Or is the additional copying prohibitively expensive?
--
Michal Hocko
SUSE Labs

2020-09-04 21:12:42

by Roman Gushchin

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> On Thu 03-09-20 09:25:27, Roman Gushchin wrote:
> > On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> > > On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > > > From: Zi Yan <[email protected]>
> > > >
> > > > Hi all,
> > > >
> > > > This patchset adds support for 1GB THP on x86_64. It is on top of
> > > > v5.9-rc2-mmots-2020-08-25-21-13.
> > > >
> > > > 1GB THP is more flexible for reducing translation overhead and increasing the
> > > > performance of applications with large memory footprint without application
> > > > changes compared to hugetlb.
> > >
> > > Please be more specific about usecases. This better have some strong
> > > ones because THP code is complex enough already to add on top solely
> > > based on a generic TLB pressure easing.
> >
> > Hello, Michal!
> >
> > We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> > performance wins on some workloads.
>
> Let me clarify. I am not questioning 1GB (or large) pages in general. I
> believe it is quite clear that there are usecases which hugely benefit
> from them. I am mostly asking about the transparent part of it which
> traditionally means that userspace mostly doesn't have to care and still
> gets them. 2MB THPs have established certain expectations, mostly a really
> aggressive pro-active instantiation. This has bitten us many times and
> created a "you need to disable THP to fix your problem whatever that is"
> cargo cult. I hope we do not want to repeat that mistake here again.

Absolutely, I agree with all above. 1 GB THPs have even fewer chances
to be allocated automatically without hurting overall performance.

I believe that historically the THP allocation success rate and cost were not good
enough to have a strict interface, that's why the "best effort" approach was used.
Maybe I'm wrong here. Also in some cases (e.g. desktop) an opportunistic approach
looks like "it's some perf boost for free". However in case of large distributed
systems it's important to get a predictable and uniform performance across nodes,
so "maybe some hosts will perform better" is not giving much.

>
> > Historically we allocated gigantic pages at the boot time, but recently moved
> > to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> > than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> > see it as a very useful feature.
> >
> > Given the cost of an allocation, I'm slightly skeptical about an automatic
> > heuristics-based approach, but if an application can explicitly mark target areas
> > with madvise(), I don't see why it wouldn't work.
>
> An explicit opt-in sounds much more appropriate to me as well. If we go
> with a specific API then I would not make it 1GB pages specific. Why
> cannot we have an explicit interface to "defragment" address space
> range into large pages and the kernel would use large pages where
> appropriate? Or is the additional copying prohibitively expensive?

Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
provides something similar to what you're describing, but there are a lot
of details here, so I'm probably missing something.
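
For reference, a minimal sketch of the explicit opt-in that exists today:
map an anonymous region aligned to 2 MB and mark it with
madvise(MADV_HUGEPAGE). `map_thp_hinted()` and `HPAGE_2M` are names made up
for this example, the 2 MB alignment assumes the x86_64 PMD size, and
whether huge pages actually get used still depends on the global THP
settings.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_2M ((size_t)2 << 20)  /* assumed PMD-level THP size */

/* Map len bytes (a multiple of 2 MB) of anonymous memory on a 2 MB
 * boundary and hint that THP should back it. *advised is set to 1 if the
 * kernel accepted the hint, 0 otherwise. Returns NULL if mmap() fails. */
static void *map_thp_hinted(size_t len, int *advised)
{
	size_t span = len + HPAGE_2M;
	size_t head, tail;
	char *raw, *aligned;

	assert(len > 0 && len % HPAGE_2M == 0 && advised);

	/* Over-allocate by one huge page so an aligned start always fits. */
	raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;

	aligned = (char *)(((uintptr_t)raw + HPAGE_2M - 1) &
			   ~(uintptr_t)(HPAGE_2M - 1));
	head = (size_t)(aligned - raw);
	tail = span - head - len;

	/* Trim the unaligned slack on both sides. */
	if (head)
		munmap(raw, head);
	if (tail)
		munmap(aligned + len, tail);

	*advised = 0;
#ifdef MADV_HUGEPAGE
	/* The hint is advisory; failure (e.g. THP not built in) is not fatal. */
	*advised = (madvise(aligned, len, MADV_HUGEPAGE) == 0);
#endif
	return aligned;
}
```

Note that this only states a preference for the range; it does not by
itself guarantee that any huge page is allocated.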

Thank you!

2020-09-07 07:21:39

by Michal Hocko

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
[...]
> > An explicit opt-in sounds much more appropriate to me as well. If we go
> > with a specific API then I would not make it 1GB pages specific. Why
> > cannot we have an explicit interface to "defragment" address space
> > range into large pages and the kernel would use large pages where
> > appropriate? Or is the additional copying prohibitively expensive?
>
> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
> provides something similar to what you're describing, but there are a lot
> of details here, so I'm probably missing something.

MADV_HUGEPAGE is controlling a preference for THP to be used for a
particular address range. So it looks similar but the historical
behavior is to control page faults as well and the behavior depends on
the global setup.

I've had in mind something much simpler. Effectively an API to invoke
khugepaged (like) functionality synchronously from the calling context
on the specific address range. It could be more aggressive than the
regular khugepaged and create even 1G pages (or as large THPs as page
tables can handle on the particular arch for that matter).

As this would be an explicit call we do not have to be worried about
the resulting latency because it would be an explicit call by the
userspace. The default khugepaged has a harder position there because
it has no understanding of the target address space and cannot make any
cost/benefit evaluation so it has to be more conservative.
--
Michal Hocko
SUSE Labs

2020-09-08 15:31:19

by David Hildenbrand

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 08.09.20 16:41, Rik van Riel wrote:
> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>
>> A global knob is insufficient. 1G pages will become a very precious
>> resource as it requires a pre-allocation (reservation). So it really
>> has
>> to be an opt-in and the question is whether there is also some sort
>> of
>> access control needed.
>
> The 1GB pages do not require that much in the way of
> pre-allocation. The memory can be obtained through CMA,
> which means it can be used for movable 4kB and 2MB
> allocations when not
> being used for 1GB pages.
>
> That makes it relatively easy to set aside
> some fraction
> of system memory in every system for 1GB and movable
> allocations, and use it for whatever way it is needed
> depending on what workload(s) end up running on a system.
>

Linking secretmem discussion

https://lkml.kernel.org/r/[email protected]

--
Thanks,

David / dhildenb

2020-09-08 16:49:09

by David Hildenbrand

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 03.09.20 18:30, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>> From: Zi Yan <[email protected]>
>>>
>>> Hi all,
>>>
>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>
>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>> performance of applications with large memory footprint without application
>>> changes compared to hugetlb.
>>
>> This statement needs a lot of justification. I don't see 1GB THP as viable
>> for any workload. Opportunistic 1GB allocation is very questionable
>> strategy.
>
> Hello, Kirill!
>
> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
> if backed by an madvise() annotations from userspace application. In this case,
> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
> interface.

I have concerns if we would silently use 1 GB THPs in most scenarios
where we would have used 2 MB THP. I'd appreciate a trigger to
explicitly enable that - MADV_HUGEPAGE is not sufficient because some
applications relying on that assume that the THP size will be 2 MB
(especially, if you want sparse, large VMAs).

E.g., read via man page

"This feature is primarily aimed at applications that use large
mappings of data and access large regions of that memory at a time
(e.g., virtualization systems such as QEMU). It can very easily
waste memory (e.g., a 2 MB mapping that only ever accesses 1 byte will
result in 2 MB of wired memory instead of one 4 KB page)."
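For illustration, a minimal sketch of how an application applies the hint described in that man page excerpt today. The helper name map_with_thp_hint() is hypothetical; only mmap(), madvise() and MADV_HUGEPAGE are existing interfaces, and nothing here comes from the patchset.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous memory and tell the
 * kernel that THP is welcome for the range.
 * Returns the mapping, or NULL if mmap() failed. *advice_errno is set to 0
 * if the hint was accepted, otherwise to the errno reported by madvise()
 * (for example EINVAL on a kernel built without THP support).
 */
static void *map_with_thp_hint(size_t len, int *advice_errno)
{
    assert(len > 0 && advice_errno != NULL);
    *advice_errno = 0;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Only a hint: the kernel may still back the range with 4 KB pages. */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        *advice_errno = errno;
    return p;
}
```

An application that maps a sparse region this way and touches a single byte is exactly the waste case the man page warns about. With a 1 GB page the same access would wire 262144 times the memory of one 4 KB page.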



That said, I consider 1 GB THP - similar to 512 MB THP on
arm64 - useless in most setups, and I am not sure it is worth the
trouble. Just use hugetlbfs for the handful of applications where it
makes sense.

--
Thanks,

David / dhildenb

2020-09-08 19:21:17

by Michal Hocko

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Tue 08-09-20 10:05:11, Zi Yan wrote:
> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
>
> > On 03.09.20 18:30, Roman Gushchin wrote:
> >> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
> >>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> >>>> From: Zi Yan <[email protected]>
> >>>>
> >>>> Hi all,
> >>>>
> >>>> This patchset adds support for 1GB THP on x86_64. It is on top of
> >>>> v5.9-rc2-mmots-2020-08-25-21-13.
> >>>>
> >>>> 1GB THP is more flexible for reducing translation overhead and increasing the
> >>>> performance of applications with large memory footprint without application
> >>>> changes compared to hugetlb.
> >>>
> >>> This statement needs a lot of justification. I don't see 1GB THP as viable
> >>> for any workload. Opportunistic 1GB allocation is very questionable
> >>> strategy.
> >>
> >> Hello, Kirill!
> >>
> >> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
> >> if backed by an madvise() annotations from userspace application. In this case,
> >> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
> >> interface.
> >
> > I have concerns if we would silently use 1 GB THPs in most scenarios
> > where we would have used 2 MB THP. I'd appreciate a trigger to
> > explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> > applications relying on that assume that the THP size will be 2 MB
> > (especially, if you want sparse, large VMAs).
>
> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> region (although I had alloc_contig_pages as a fallback, which can be removed
> in next version), so users need to add hugepage_cma=nG kernel parameter to
> enable 1GB THP allocation. If a finer control is necessary, we can add
> a new MADV_HUGEPAGE_1GB for 1GB THP.

A global knob is insufficient. 1G pages will become a very precious
resource as it requires a pre-allocation (reservation). So it really has
to be an opt-in and the question is whether there is also some sort of
access control needed.

--
Michal Hocko
SUSE Labs

2020-09-08 19:47:49

by Zi Yan

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 8 Sep 2020, at 7:57, David Hildenbrand wrote:

> On 03.09.20 18:30, Roman Gushchin wrote:
>> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>>> From: Zi Yan <[email protected]>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>>
>>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>>> performance of applications with large memory footprint without application
>>>> changes compared to hugetlb.
>>>
>>> This statement needs a lot of justification. I don't see 1GB THP as viable
>>> for any workload. Opportunistic 1GB allocation is very questionable
>>> strategy.
>>
>> Hello, Kirill!
>>
>> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
>> if backed by an madvise() annotations from userspace application. In this case,
>> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
>> interface.
>
> I have concerns if we would silently use 1 GB THPs in most scenarios
> where we would have used 2 MB THP. I'd appreciate a trigger to
> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> applications relying on that assume that the THP size will be 2 MB
> (especially, if you want sparse, large VMAs).

This patchset is not intended to silently use 1GB THP in place of 2MB THP.
First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
region (although I had alloc_contig_pages as a fallback, which can be removed
in next version), so users need to add hugepage_cma=nG kernel parameter to
enable 1GB THP allocation. If a finer control is necessary, we can add
a new MADV_HUGEPAGE_1GB for 1GB THP.



Best Regards,
Yan Zi



2020-09-08 19:51:15

by Zi Yan

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 8 Sep 2020, at 10:27, Matthew Wilcox wrote:

> On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote:
>> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
>>> I have concerns if we would silently use 1 GB THPs in most scenarios
>>> where we would have used 2 MB THP. I'd appreciate a trigger to
>>> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
>>> applications relying on that assume that the THP size will be 2 MB
>>> (especially, if you want sparse, large VMAs).
>>
>> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
>> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
>> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
>> region (although I had alloc_contig_pages as a fallback, which can be removed
>> in next version), so users need to add hugepage_cma=nG kernel parameter to
>> enable 1GB THP allocation. If a finer control is necessary, we can add
>> a new MADV_HUGEPAGE_1GB for 1GB THP.
>
> I think we do need that flag. Machines don't run a single workload
> (arguably with VMs, we're getting closer to going back to the single
> workload per machine, but that's a different matter). So if there's
> one app that wants 2MB pages and one that wants 1GB pages, we need to
> be able to distinguish them.
>
> I could also see there being an app which benefits from 1GB for
> one mapping and prefers 2MB for a different mapping, so I think the
> per-mapping madvise flag is best.
>
> I'm a little wary of encoding the size of an x86 PUD in the Linux API
> though. Probably best to follow the example set in
> include/uapi/asm-generic/hugetlb_encode.h, but I don't love it. I
> don't have a better suggestion though.

Using hugetlb_encode.h makes sense to me. I will add it in the next version.

Thanks for the suggestion.



Best Regards,
Yan Zi



2020-09-08 19:56:34

by Zi Yan

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 8 Sep 2020, at 10:22, David Hildenbrand wrote:

> On 08.09.20 16:05, Zi Yan wrote:
>> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
>>
>>> On 03.09.20 18:30, Roman Gushchin wrote:
>>>> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>>>>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>>>>> From: Zi Yan <[email protected]>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>>>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>>>>
>>>>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>>>>> performance of applications with large memory footprint without application
>>>>>> changes compared to hugetlb.
>>>>>
>>>>> This statement needs a lot of justification. I don't see 1GB THP as viable
>>>>> for any workload. Opportunistic 1GB allocation is very questionable
>>>>> strategy.
>>>>
>>>> Hello, Kirill!
>>>>
>>>> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
>>>> if backed by an madvise() annotations from userspace application. In this case,
>>>> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
>>>> interface.
>>>
>>> I have concerns if we would silently use 1 GB THPs in most scenarios
>>> where we would have used 2 MB THP. I'd appreciate a trigger to
>>> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
>>> applications relying on that assume that the THP size will be 2 MB
>>> (especially, if you want sparse, large VMAs).
>>
>> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
>> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
>> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
>> region (although I had alloc_contig_pages as a fallback, which can be removed
>> in next version), so users need to add hugepage_cma=nG kernel parameter to
>> enable 1GB THP allocation. If a finer control is necessary, we can add
>> a new MADV_HUGEPAGE_1GB for 1GB THP.
>
> Thanks for the information - I would have loved to see important
> information like that (esp. how to use) in the cover letter.
>
> So what you propose is (excluding alloc_contig_pages()) really just
> automatically using (previously reserved) 1GB huge pages as 1GB THP
> instead of explicitly using them in an application using hugetlbfs.
> Still, not convinced how helpful that actually is - most certainly you
> really want a mechanism to control this per application (+ maybe make
> the application indicate actual ranges where it makes sense - but then
> you can directly modify the application to use hugetlbfs).
>
> I guess the interesting thing of this approach is that we can
> mix-and-match THP of differing granularity within a single mapping -
> whereby a hugetlbfs allocation would fail in case there isn't sufficient
> 1GB pages available. However, there are no guarantees for applications
> anymore (thinking about RT KVM and similar, we really want gigantic
> pages and cannot tolerate falling back to smaller granularity).

I agree that currently THP allocation does not provide a strong guarantee
like hugetlbfs, which can pre-allocate pages at boot time. For users like
RT KVM and such, pre-allocated hugetlb might be the only choice, since
allocating huge pages from CMA (either hugetlb or 1GB THP) would fail
if some pages are pinned and scattered in the CMA that could prevent
huge page allocation.

In other cases, if users can tolerate fallbacks but do not like the
unpredictable huge page formation outcome, we could add an madvise()
option like Michal suggested [1], so users will know whether they get
huge pages or not and can act accordingly.


> What are intended use cases/applications that could benefit? I doubt
> databases and virtualization are really a good fit - they know how to
> handle hugetlbfs just fine.

Roman and Jason have provided some use cases [2,3].

[1]https://lore.kernel.org/linux-mm/[email protected]/
[2]https://lore.kernel.org/linux-mm/[email protected]/
[3]https://lore.kernel.org/linux-mm/[email protected]/


Best Regards,
Yan Zi



2020-09-08 20:00:22

by David Hildenbrand

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 08.09.20 16:05, Zi Yan wrote:
> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
>
>> On 03.09.20 18:30, Roman Gushchin wrote:
>>> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>>>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>>>> From: Zi Yan <[email protected]>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>>>
>>>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>>>> performance of applications with large memory footprint without application
>>>>> changes compared to hugetlb.
>>>>
>>>> This statement needs a lot of justification. I don't see 1GB THP as viable
>>>> for any workload. Opportunistic 1GB allocation is very questionable
>>>> strategy.
>>>
>>> Hello, Kirill!
>>>
>>> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
>>> if backed by an madvise() annotations from userspace application. In this case,
>>> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
>>> interface.
>>
>> I have concerns if we would silently use 1 GB THPs in most scenarios
>> where we would have used 2 MB THP. I'd appreciate a trigger to
>> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
>> applications relying on that assume that the THP size will be 2 MB
>> (especially, if you want sparse, large VMAs).
>
> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> region (although I had alloc_contig_pages as a fallback, which can be removed
> in next version), so users need to add hugepage_cma=nG kernel parameter to
> enable 1GB THP allocation. If a finer control is necessary, we can add
> a new MADV_HUGEPAGE_1GB for 1GB THP.

Thanks for the information - I would have loved to see important
information like that (esp. how to use) in the cover letter.

So what you propose is (excluding alloc_contig_pages()) really just
automatically using (previously reserved) 1GB huge pages as 1GB THP
instead of explicitly using them in an application using hugetlbfs.
Still, not convinced how helpful that actually is - most certainly you
really want a mechanism to control this per application (+ maybe make
the application indicate actual ranges where it makes sense - but then
you can directly modify the application to use hugetlbfs).

I guess the interesting thing about this approach is that we can
mix-and-match THP of differing granularity within a single mapping -
whereas a hugetlbfs allocation would fail in case there aren't sufficient
1GB pages available. However, there are no guarantees for applications
anymore (thinking about RT KVM and similar, we really want gigantic
pages and cannot tolerate falling back to smaller granularity).

What are intended use cases/applications that could benefit? I doubt
databases and virtualization are really a good fit - they know how to
handle hugetlbfs just fine.

--
Thanks,

David / dhildenb

2020-09-08 20:00:43

by Roman Gushchin

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote:
> On 7 Sep 2020, at 3:20, Michal Hocko wrote:
>
> > On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
> >> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> > [...]
> >>> An explicit opt-in sounds much more appropriate to me as well. If we go
> >>> with a specific API then I would not make it 1GB pages specific. Why
> >>> cannot we have an explicit interface to "defragment" address space
> >>> range into large pages and the kernel would use large pages where
> >>> appropriate? Or is the additional copying prohibitively expensive?
> >>
> >> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
> >> provides something similar to what you're describing, but there are lot
> >> of details here, so I'm probably missing something.
> >
> > MADV_HUGEPAGE is controlling a preference for THP to be used for a
> > particular address range. So it looks similar but the historical
> > behavior is to control page faults as well and the behavior depends on
> > the global setup.
> >
> > I've had in mind something much simpler. Effectively an API to invoke
> > khugepaged (like) functionality synchronously from the calling context
> > on the specific address range. It could be more aggressive than the
> > regular khugepaged and create even 1G pages (or as large THPs as page
> > tables can handle on the particular arch for that matter).
> >
> > As this would be an explicit call we do not have to be worried about
> > the resulting latency because it would be an explicit call by the
> > userspace. The default khugepaged has a harder position there because
> > has no understanding of the target address space and cannot make any
> > cost/benefit evaluation so it has to be more conservative.
>
> Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
> better and clearer control of getting huge pages from the kernel and
> know when they will pay the cost of getting the huge pages.
>
> I would think the suggestion is more about the huge page control options
> currently provided by the kernel do not have predictable performance
> outcome, since MADV_HUGEPAGE is a best-effort option and does not tell
> users whether the marked virtual address range is backed by huge pages
> or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a
> deterministic result to users on whether the huge page(s) are formed
> or not.

Yeah, I agree with Michal here, we need a more straightforward interface.

The hard question here is how hard the kernel should try to allocate
a gigantic page and how fast it should give up and return an error?
I'd say to try really hard if there are some chances to succeed,
so that if an error is returned, there are no more reasons to retry.
Any objections/better ideas here?

Given that we need to pass a page size, we probably need either to introduce
a new syscall (madvise2?) with an additional argument, or add a bunch
of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc.

Idk what is better long-term, but new madvise flags are probably slightly
easier to deal with in the development process.

Thanks!

2020-09-08 20:02:17

by Matthew Wilcox

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote:
> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
> > I have concerns if we would silently use 1 GB THPs in most scenarios
> > where we would have used 2 MB THP. I'd appreciate a trigger to
> > explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> > applications relying on that assume that the THP size will be 2 MB
> > (especially, if you want sparse, large VMAs).
>
> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> region (although I had alloc_contig_pages as a fallback, which can be removed
> in next version), so users need to add hugepage_cma=nG kernel parameter to
> enable 1GB THP allocation. If a finer control is necessary, we can add
> a new MADV_HUGEPAGE_1GB for 1GB THP.

I think we do need that flag. Machines don't run a single workload
(arguably with VMs, we're getting closer to going back to the single
workload per machine, but that's a different matter). So if there's
one app that wants 2MB pages and one that wants 1GB pages, we need to
be able to distinguish them.

I could also see there being an app which benefits from 1GB for
one mapping and prefers 2MB for a different mapping, so I think the
per-mapping madvise flag is best.

I'm a little wary of encoding the size of an x86 PUD in the Linux API
though. Probably best to follow the example set in
include/uapi/asm-generic/hugetlb_encode.h, but I don't love it. I
don't have a better suggestion though.
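As a rough sketch of the convention in include/uapi/asm-generic/hugetlb_encode.h: the page size is stored as log2(size) in a 6-bit field starting at bit 26 of the flags word. The two constants below mirror that header; the helper names encode_page_size() and decode_page_size() are made up for illustration and are not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirrored from include/uapi/asm-generic/hugetlb_encode.h. */
#define HUGETLB_FLAG_ENCODE_SHIFT 26
#define HUGETLB_FLAG_ENCODE_MASK  0x3f

/* Hypothetical helper: encode a power-of-two page size as log2(size). */
static unsigned int encode_page_size(uint64_t size)
{
    /* The scheme can only express power-of-two sizes. */
    assert(size != 0 && (size & (size - 1)) == 0);

    unsigned int order = 0;
    while ((size >> order) != 1)
        order++;
    return (order & HUGETLB_FLAG_ENCODE_MASK) << HUGETLB_FLAG_ENCODE_SHIFT;
}

/* Hypothetical helper: recover the size; 0 means "no size encoded". */
static uint64_t decode_page_size(unsigned int flags)
{
    unsigned int order =
        (flags >> HUGETLB_FLAG_ENCODE_SHIFT) & HUGETLB_FLAG_ENCODE_MASK;
    return order ? (uint64_t)1 << order : 0;
}
```

Encoding 1 GB this way yields 30 << 26, the same value the header assigns to HUGETLB_FLAG_ENCODE_1GB, so the API carries a size exponent rather than the name of an x86 page table level.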

2020-09-08 20:07:54

by Rik van Riel

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:

> A global knob is insufficient. 1G pages will become a very precious
> resource as it requires a pre-allocation (reservation). So it really has
> to be an opt-in and the question is whether there is also some sort of
> access control needed.

The 1GB pages do not require that much in the way of pre-allocation.
The memory can be obtained through CMA, which means it can be used for
movable 4kB and 2MB allocations when not being used for 1GB pages.

That makes it relatively easy to set aside some fraction of system
memory in every system for 1GB and movable allocations, and use it for
whatever way it is needed depending on what workload(s) end up running
on a system.

--
All Rights Reversed.



2020-09-08 20:19:43

by Zi Yan

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 7 Sep 2020, at 3:20, Michal Hocko wrote:

> On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
>> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> [...]
>>> An explicit opt-in sounds much more appropriate to me as well. If we go
>>> with a specific API then I would not make it 1GB pages specific. Why
>>> cannot we have an explicit interface to "defragment" address space
>>> range into large pages and the kernel would use large pages where
>>> appropriate? Or is the additional copying prohibitively expensive?
>>
>> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
>> provides something similar to what you're describing, but there are lot
>> of details here, so I'm probably missing something.
>
> MADV_HUGEPAGE is controlling a preference for THP to be used for a
> particular address range. So it looks similar but the historical
> behavior is to control page faults as well and the behavior depends on
> the global setup.
>
> I've had in mind something much simpler. Effectively an API to invoke
> khugepaged (like) functionality synchronously from the calling context
> on the specific address range. It could be more aggressive than the
> regular khugepaged and create even 1G pages (or as large THPs as page
> tables can handle on the particular arch for that matter).
>
> As this would be an explicit call we do not have to be worried about
> the resulting latency because it would be an explicit call by the
> userspace. The default khugepaged has a harder position there because
> has no understanding of the target address space and cannot make any
> cost/benefit evaluation so it has to be more conservative.

Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
better and clearer control of getting huge pages from the kernel and
know when they will pay the cost of getting the huge pages.

I think the suggestion is more about the fact that the huge page control
options currently provided by the kernel do not have a predictable
performance outcome, since MADV_HUGEPAGE is a best-effort option and does
not tell users whether the marked virtual address range is backed by huge
pages or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a
deterministic result to users on whether the huge page(s) are formed
or not.


Best Regards,
Yan Zi



2020-09-09 04:05:16

by John Hubbard

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 9/8/20 12:58 PM, Roman Gushchin wrote:
> On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote:
>> On 7 Sep 2020, at 3:20, Michal Hocko wrote:
>>> On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
>>>> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
>>> [...]
>> Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
>> better and clearer control of getting huge pages from the kernel and
>> know when they will pay the cost of getting the huge pages.
>>
>> I would think the suggestion is more about the huge page control options
>> currently provided by the kernel do not have predictable performance
>> outcome, since MADV_HUGEPAGE is a best-effort option and does not tell
>> users whether the marked virtual address range is backed by huge pages
>> or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a
>> deterministic result to users on whether the huge page(s) are formed
>> or not.
>
> Yeah, I agree with Michal here, we need a more straightforward interface.
>
> The hard question here is how hard the kernel should try to allocate
> a gigantic page and how fast it should give up and return an error?
> I'd say to try really hard if there are some chances to succeed,
> so that if an error is returned, there are no more reasons to retry.
> Any objections/better ideas here?

I agree, especially because this is starting to look a lot more like an
allocation call. And I think it would be appropriate for the kernel to
try approximately as hard to provide these 1GB pages, as it would to
allocate normal memory to a process.

In fact, for a moment I thought, why not go all the way and make this
actually be a true allocation? However, given that we still have
operations that require page splitting, with no good way to call back
user space to notify it that its "allocated" huge pages are being split,
that fails. But it's still pretty close.


>
> Given that we need to pass a page size, we probably need either to introduce
> a new syscall (madvise2?) with an additional argument, or add a bunch
> of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc.
>
> Idk what is better long-term, but new madvise flags are probably slightly
> easier to deal with in the development process.
>

Probably either an MADV_* flag or a new syscall would work fine. But
given that this seems like a pretty distinct new capability, one with
options and man page documentation and possibly future flags itself, I'd
lean toward making it its own new syscall, maybe:

compact_huge_pages(nbytes or npages, flags /* page size, etc */);

...thus leaving madvise() and its remaining flags still available, to
further refine things.


thanks,
--
John Hubbard
NVIDIA

2020-09-09 07:06:15

by Michal Hocko

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>
> > A global knob is insufficient. 1G pages will become a very precious
> > resource as it requires a pre-allocation (reservation). So it really has
> > to be an opt-in and the question is whether there is also some sort of
> > access control needed.
>
> The 1GB pages do not require that much in the way of pre-allocation.
> The memory can be obtained through CMA, which means it can be used for
> movable 4kB and 2MB allocations when not being used for 1GB pages.

That CMA has to be pre-reserved, right? That requires a configuration.

> That makes it relatively easy to set aside some fraction of system
> memory in every system for 1GB and movable allocations, and use it for
> whatever way it is needed depending on what workload(s) end up running
> on a system.

I was not talking about how easy or hard it is. My main concern is that
this is effectively a pre-reserved pool and a global knob is a very
suboptimal way to control access to it. I (rather) strongly believe this
should be an explicit opt-in and ideally not 1GB specific but rather
something to allow large pages to be created as there is a fit. See
other subthread for more details.

--
Michal Hocko
SUSE Labs

2020-09-09 07:18:19

by Michal Hocko

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Tue 08-09-20 12:58:59, Roman Gushchin wrote:
> On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote:
> > On 7 Sep 2020, at 3:20, Michal Hocko wrote:
> >
> > > On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
> > >> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> > > [...]
> > >>> An explicit opt-in sounds much more appropriate to me as well. If we go
> > >>> with a specific API then I would not make it 1GB pages specific. Why
> > >>> cannot we have an explicit interface to "defragment" address space
> > >>> range into large pages and the kernel would use large pages where
> > >>> appropriate? Or is the additional copying prohibitively expensive?
> > >>
> > >> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
> > >> provides something similar to what you're describing, but there are lot
> > >> of details here, so I'm probably missing something.
> > >
> > > MADV_HUGEPAGE is controlling a preference for THP to be used for a
> > > particular address range. So it looks similar but the historical
> > > behavior is to control page faults as well and the behavior depends on
> > > the global setup.
> > >
> > > I've had in mind something much simpler. Effectively an API to invoke
> > > khugepaged (like) functionality synchronously from the calling context
> > > on the specific address range. It could be more aggressive than the
> > > regular khugepaged and create even 1G pages (or as large THPs as page
> > > tables can handle on the particular arch for that matter).
> > >
> > > As this would be an explicit call we do not have to be worried about
> > > the resulting latency because it would be an explicit call by the
> > > userspace. The default khugepaged has a harder position there because
> > > has no understanding of the target address space and cannot make any
> > > cost/benefit evaluation so it has to be more conservative.
> >
> > Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
> > better and clearer control of getting huge pages from the kernel and
> > know when they will pay the cost of getting the huge pages.

The name is not really that important. The crucial design decisions are
- THP allocation time - #PF and/or madvise context
- lazy/sync instantiation
- huge page sizes controllable by the userspace?
- aggressiveness - how hard to try
- internal fragmentation - allow to create THPs on sparsely or unpopulated
  ranges
- do we need some sort of access control or privilege check as some THPs
  would be really scarce (like those that require pre-reservation)?

> > I would think the suggestion is more about the huge page control options
> > currently provided by the kernel do not have predictable performance
> > outcome, since MADV_HUGEPAGE is a best-effort option and does not tell
> > users whether the marked virtual address range is backed by huge pages
> > or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a
> > deterministic result to users on whether the huge page(s) are formed
> > or not.
>
> Yeah, I agree with Michal here, we need a more straightforward interface.
>
> The hard question here is how hard the kernel should try to allocate
> a gigantic page and how fast it should give up and return an error?
> I'd say to try really hard if there are some chances to succeed,
> so that if an error is returned, there are no more reasons to retry.
> Any objections/better ideas here?

If this is going to be an explicit interface like madvise then I would
follow the same semantic as hugetlb pages allocation - aka try as hard
as feasible (whatever that means).

> Given that we need to pass a page size, we probably need either to introduce
> a new syscall (madvise2?) with an additional argument, or add a bunch
> of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc.

Do we really need to bother userspace with making a decision about the
page size? I would expect that the userspace only cares to get a memory
range backed by huge pages. The larger the pages the better. It is up to
the kernel to make the resource control here. After all, THPs can be
split/reclaimed under memory pressure so we do not want to make any
promises about pages backing any mapping.
--
Michal Hocko
SUSE Labs

2020-09-09 12:56:58

by Matthew Wilcox

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed, Sep 09, 2020 at 09:11:17AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 08, 2020 at 03:27:58PM +0100, Matthew Wilcox wrote:
> > I could also see there being an app which benefits from 1GB for
> > one mapping and prefers 2MB for a different mapping, so I think the
> > per-mapping madvise flag is best.
>
> I wonder if apps really care about the specific page size?
> Particularly from a portability view?

No, they don't. They just want to run as fast as possible ;-)

> The general app desire seems to be the need for 'efficient' memory (eg
> because it is highly accessed) and I suspect comes with a desire to
> populate the pages too.

The problem with a MAP_GOES_FASTER flag is that everybody sets it.
Any flag name needs to convey its drawbacks as well as its advantages.
Maybe MAP_EXTREMELY_COARSE_WORKINGSET would do that -- the VM will work
in terms of 1GB pages for this mapping, so any swap-out is going to take
out an entire 1GB at once.

But here's the thing ... we already allow
mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)

So if we're not doing THP, what's the point of this thread?
My understanding of THP is "Application doesn't need to change, kernel
makes a decision about what page size is best based on entire system
state and process's behaviour".
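
For reference, the existing hugetlb route mentioned above can be sketched in
userspace C roughly as follows. This is a minimal sketch and not part of the
patchset: the helper name map_hugetlb_1gb() is hypothetical, the fallback
#defines are only there for older libc headers, and the call can only succeed
if 1GB hugetlb pages have been reserved on the system.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for libc headers that do not expose the size encoding. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)	/* log2(1GB) = 30 */
#endif

/*
 * Hypothetical helper: map and pre-fault `len` bytes (a multiple of 1GB)
 * from the 1GB hugetlb pool. Returns NULL if the pool cannot satisfy it.
 */
static void *map_hugetlb_1gb(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE |
		       MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```

Unlike THP, this requires the application to opt in explicitly, and the
mapping fails outright instead of falling back to smaller pages.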

An madvise flag is a different beast; that's just letting the kernel
know what the app thinks its behaviour will be. The kernel can pay
as much (or as little) attention to that hint as it sees fit. And of
course, it can change over time (either by kernel release as we change
the algorithms, or simply from one minute to the next as more or less
memory becomes available).

2020-09-09 13:19:50

by Jason Gunthorpe

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote:

> But here's the thing ... we already allow
> mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)
>
> So if we're not doing THP, what's the point of this thread?

I wondered that too..

> An madvise flag is a different beast; that's just letting the kernel
> know what the app thinks its behaviour will be. The kernel can pay

But madvise is too late: the VMA already has an address, and if it is not
1G aligned it can no longer be backed by a 1G THP.

Jason

2020-09-09 13:39:11

by Rik van Riel

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> > On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> >
> > > A global knob is insufficient. 1G pages will become a very
> > > precious
> > > resource as it requires a pre-allocation (reservation). So it
> > > really
> > > has
> > > to be an opt-in and the question is whether there is also some
> > > sort
> > > of
> > > access control needed.
> >
> > The 1GB pages do not require that much in the way of
> > pre-allocation. The memory can be obtained through CMA,
> > which means it can be used for movable 4kB and 2MB
> > allocations when not
> > being used for 1GB pages.
>
> That CMA has to be pre-reserved, right? That requires a
> configuration.

To some extent, yes.

However, because that pool can be used for movable
4kB and 2MB
pages as well as for 1GB pages, it would be easy to just set
the size of that pool to eg. 1/3 or even 1/2 of memory for every
system.

It isn't like the pool needs to be the exact right size. We
just need to avoid the "highmem problem" of having too little
memory for kernel allocations.

--
All Rights Reversed.



2020-09-09 15:03:46

by Jason Gunthorpe

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Tue, Sep 08, 2020 at 03:27:58PM +0100, Matthew Wilcox wrote:
> On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote:
> > On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
> > > I have concerns if we would silently use 1 GB THPs in most scenarios
> > > where we would have used 2 MB THPs. I'd appreciate a trigger to
> > > explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> > > applications relying on that assume that the THP size will be 2 MB
> > > (especially, if you want sparse, large VMAs).
> >
> > This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> > First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> > to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> > region (although I had alloc_contig_pages as a fallback, which can be removed
> > in next version), so users need to add hugepage_cma=nG kernel parameter to
> > enable 1GB THP allocation. If a finer control is necessary, we can add
> > a new MADV_HUGEPAGE_1GB for 1GB THP.
>
> I think we do need that flag. Machines don't run a single workload
> (arguably with VMs, we're getting closer to going back to the single
> workload per machine, but that's a different matter). So if there's
> one app that wants 2MB pages and one that wants 1GB pages, we need to
> be able to distinguish them.
>
> I could also see there being an app which benefits from 1GB for
> one mapping and prefers 2MB for a different mapping, so I think the
> per-mapping madvise flag is best.

I wonder if apps really care about the specific page size?
Particularly from a portability view?

The general app desire seems to be the need for 'efficient' memory (eg
because it is highly accessed) and I suspect comes with a desire to
populate the pages too.

Maybe doing something with MAP_POPULATE is an idea?

eg if I ask for 1GB of MAP_POPULATE it seems fairly natural the thing
that comes back should be a 1GB THP? If I ask for only .5GB then it
could be 2M pages, or whatever depending on arch support.

Jason

2020-09-09 15:53:34

by David Hildenbrand

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 09.09.20 15:14, Jason Gunthorpe wrote:
> On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote:
>
>> But here's the thing ... we already allow
>> mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)
>>
>> So if we're not doing THP, what's the point of this thread?
>
> I wondered that too..
>
>> An madvise flag is a different beast; that's just letting the kernel
>> know what the app thinks its behaviour will be. The kernel can pay
>
> But madvise is too late, the VMA already has an address, if it is not
> 1G aligned it cannot be 1G THP already.

That's why user space (like QEMU) is THP-aware and selects an address
that is aligned to the expected THP granularity (e.g., 2MB on x86_64).
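
The alignment trick can be sketched as follows: over-reserve by one alignment
unit, trim the unaligned head and tail, then apply the hint. This is a minimal
sketch and not QEMU's actual code; mmap_aligned() is a hypothetical helper,
and it assumes that align is a power of two no smaller than the page size and
that len is a multiple of the page size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: return a len-byte anonymous mapping whose start
 * address is a multiple of `align`, or NULL on failure.
 */
static void *mmap_aligned(size_t len, size_t align)
{
	uintptr_t start;
	size_t span, head, tail;
	uint8_t *raw;

	if (align == 0 || (align & (align - 1)) || len > SIZE_MAX - align)
		return NULL;

	/* Reserve enough address space that an aligned window must exist. */
	span = len + align;
	raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;

	/* Round up to the next aligned address and drop the slack. */
	start = ((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1);
	head = start - (uintptr_t)raw;
	tail = span - head - len;
	if (head)
		munmap(raw, head);
	if (tail)
		munmap((uint8_t *)start + len, tail);

	/* Best-effort hint only; the result is deliberately ignored. */
	madvise((void *)start, len, MADV_HUGEPAGE);
	return (void *)start;
}
```

Calling mmap_aligned(len, 1UL << 21) yields a 2MB-aligned range; passing
1UL << 30 would give the alignment a 1GB THP needs. MADV_HUGEPAGE remains a
hint, so the return value says nothing about whether huge pages were used.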

--
Thanks,

David / dhildenb

2020-09-09 16:07:59

by Michal Hocko

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed 09-09-20 09:19:16, Rik van Riel wrote:
> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> > On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> > > On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> > >
> > > > A global knob is insufficient. 1G pages will become a very
> > > > precious
> > > > resource as it requires a pre-allocation (reservation). So it
> > > > really
> > > > has
> > > > to be an opt-in and the question is whether there is also some
> > > > sort
> > > > of
> > > > access control needed.
> > >
> > > The 1GB pages do not require that much in the way of
> > > pre-allocation. The memory can be obtained through CMA,
> > > which means it can be used for movable 4kB and 2MB
> > > allocations when not
> > > being used for 1GB pages.
> >
> > That CMA has to be pre-reserved, right? That requires a
> > configuration.
>
> To some extent, yes.
>
> However, because that pool can be used for movable
> 4kB and 2MB
> pages as well as for 1GB pages, it would be easy to just set
> the size of that pool to eg. 1/3 or even 1/2 of memory for every
> system.
>
> It isn't like the pool needs to be the exact right size. We
> just need to avoid the "highmem problem" of having too little
> memory for kernel allocations.

Which is exactly why this is not really suitable for uneducated
guesses. It is really hard to guess the right amount of lowmem. Think of
heavy fs metadata workloads and their memory demand. In my experience,
memory reclaim usually struggles when zones are imbalanced.

--
Michal Hocko
SUSE Labs

2020-09-09 16:18:46

by David Hildenbrand

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 09.09.20 15:19, Rik van Riel wrote:
> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>>>
>>>> A global knob is insufficient. 1G pages will become a very
>>>> precious
>>>> resource as it requires a pre-allocation (reservation). So it
>>>> really
>>>> has
>>>> to be an opt-in and the question is whether there is also some
>>>> sort
>>>> of
>>>> access control needed.
>>>
>>> The 1GB pages do not require that much in the way of
>>> pre-allocation. The memory can be obtained through CMA,
>>> which means it can be used for movable 4kB and 2MB
>>> allocations when not
>>> being used for 1GB pages.
>>
>> That CMA has to be pre-reserved, right? That requires a
>> configuration.
>
> To some extent, yes.
>
> However, because that pool can be used for movable
> 4kB and 2MB
> pages as well as for 1GB pages, it would be easy to just set
> the size of that pool to eg. 1/3 or even 1/2 of memory for every
> system.
>
> It isn't like the pool needs to be the exact right size. We
> just need to avoid the "highmem problem" of having too little
> memory for kernel allocations.
>

I am not sure I like the trend towards CMA that we are seeing, reserving
huge buffers for specific users (and eventually even doing it
automatically).

What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
anybody who requires large, unmovable allocations can use it.

I once played with the idea of having ZONE_PREFER_MOVABLE, which
a) Is the primary choice for movable allocations
b) Is allowed to contain unmovable allocations (esp., gigantic pages)
c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
running out of memory

If someone messes up the zone ratio, issues known from zone imbalances
are avoided - large allocations simply become less likely to succeed. In
contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.
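
The fallback order implied by (a) and (c) can be illustrated with a toy
model. This is only a sketch of the idea: ZONE_PREFER_MOVABLE does not exist
in the kernel, the toy_* names are hypothetical, and letting movable
allocations fall back to ZONE_NORMAL is an assumption borrowed from how
zonelists behave today.

```c
enum toy_zone {
	TOY_ZONE_NONE = -1,
	TOY_ZONE_NORMAL,
	TOY_ZONE_PREFER_MOVABLE,
	TOY_NR_ZONES
};

struct toy_node {
	unsigned long free_pages[TOY_NR_ZONES];
};

/*
 * Take nr pages for a movable or unmovable request and report which zone
 * served it. Movable requests try ZONE_PREFER_MOVABLE first (a); unmovable
 * requests try ZONE_NORMAL first and only then overflow (c).
 */
static enum toy_zone toy_alloc(struct toy_node *node, int movable,
			       unsigned long nr)
{
	enum toy_zone order[2];
	int i;

	order[0] = movable ? TOY_ZONE_PREFER_MOVABLE : TOY_ZONE_NORMAL;
	order[1] = movable ? TOY_ZONE_NORMAL : TOY_ZONE_PREFER_MOVABLE;

	for (i = 0; i < 2; i++) {
		if (node->free_pages[order[i]] >= nr) {
			node->free_pages[order[i]] -= nr;
			return order[i];
		}
	}
	return TOY_ZONE_NONE;
}
```

In this model a badly chosen zone ratio never makes an unmovable allocation
fail while the other zone still has room; it only eats into the space that
large allocations would otherwise have found.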

--
Thanks,

David / dhildenb

2020-09-09 16:19:04

by Kirill A. Shutemov

Subject: Re: [RFC PATCH 03/16] mm: proc: add 1GB THP kpageflag.

On Wed, Sep 02, 2020 at 02:06:15PM -0400, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> Bit 27 is used to identify 1GB THP.
>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> fs/proc/page.c | 2 ++
> include/uapi/linux/kernel-page-flags.h | 2 ++
> 2 files changed, 4 insertions(+)
>
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index f3b39a7d2bf3..e4e2ad3612c9 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -161,6 +161,8 @@ u64 stable_page_flags(struct page *page)
> u |= BIT_ULL(KPF_ZERO_PAGE);
> u |= BIT_ULL(KPF_THP);
> }
> + if (compound_order(head) == HPAGE_PUD_ORDER)
> + u |= 1 << KPF_PUD_THP;
> } else if (is_zero_pfn(page_to_pfn(page)))
> u |= BIT_ULL(KPF_ZERO_PAGE);
>
> diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> index 6f2f2720f3ac..cdeb33ab655c 100644
> --- a/include/uapi/linux/kernel-page-flags.h
> +++ b/include/uapi/linux/kernel-page-flags.h
> @@ -36,5 +36,7 @@
> #define KPF_ZERO_PAGE 24
> #define KPF_IDLE 25
> #define KPF_PGTABLE 26
> +#define KPF_PUD_THP 27
> +

Redundant newline.

> #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> --
> 2.28.0
>
>
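
From userspace, the proposed bit would be consumed through /proc/kpageflags,
which exposes one 64-bit flags word per PFN. A minimal sketch, assuming the
patch is applied: KPF_PUD_THP is the value proposed above and is not a
mainline flag, and both helper names are hypothetical.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define KPF_PUD_THP 27	/* proposed by the patch under review */

/* Test one flag bit; the shift is done on a 64-bit value on purpose. */
static int kpf_test(uint64_t flags, unsigned int bit)
{
	return (int)((flags >> bit) & 1);
}

/*
 * Read the flags word for one PFN. Returns 0 on success, -1 on error.
 * Reading /proc/kpageflags usually requires root.
 */
static int read_kpageflags(uint64_t pfn, uint64_t *flags)
{
	int fd = open("/proc/kpageflags", O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = pread(fd, flags, sizeof(*flags), (off_t)(pfn * sizeof(*flags)));
	close(fd);
	return n == (ssize_t)sizeof(*flags) ? 0 : -1;
}
```

A caller would combine the two as
read_kpageflags(pfn, &flags) == 0 && kpf_test(flags, KPF_PUD_THP).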

--
Kirill A. Shutemov

2020-09-09 16:24:20

by David Hildenbrand

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 09.09.20 15:49, Rik van Riel wrote:
> On Wed, 2020-09-09 at 15:43 +0200, David Hildenbrand wrote:
>> On 09.09.20 15:19, Rik van Riel wrote:
>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>>>
>>>> That CMA has to be pre-reserved, right? That requires a
>>>> configuration.
>>>
>>> To some extent, yes.
>>>
>>> However, because that pool can be used for movable
>>> 4kB and 2MB
>>> pages as well as for 1GB pages, it would be easy to just set
>>> the size of that pool to eg. 1/3 or even 1/2 of memory for every
>>> system.
>>>
>>> It isn't like the pool needs to be the exact right size. We
>>> just need to avoid the "highmem problem" of having too little
>>> memory for kernel allocations.
>>>
>>
>> I am not sure I like the trend towards CMA that we are seeing,
>> reserving
>> huge buffers for specific users (and eventually even doing it
>> automatically).
>>
>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such
>> that
>> anybody who requires large, unmovable allocations can use it.
>>
>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>> a) Is the primary choice for movable allocations
>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead
>> of
>> running out of memory
>>
>> If someone messes up the zone ratio, issues known from zone
>> imbalances
>> are avoided - large allocations simply become less likely to succeed.
>> In
>> contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.
>
> I really like that idea. This will be easier to deal with than
> a "just the right size" CMA area, and seems like it would be
> pretty forgiving in both directions.
>

Yes, and can be extended using memory hotplug.

> Keeping unmovable allocations
> contained to one part of memory
> should also make compaction within the ZONE_PREFER_MOVABLE area
> a lot easier than compaction for higher order allocations is
> today.
>
> I suspect your proposal solves a lot of issues at once.
>
> For (c) from your proposal, we could even claim a whole
> 2MB or even 1GB area at once for unmovable allocations,
> keeping those contained in a limited amount of physical
> memory again, to make life easier on compaction.
>

Exactly, locally limiting unmovable allocations to a sane minimum.

(with some smart extra work, we could even convert ZONE_PREFER_MOVABLE
to ZONE_NORMAL, one memory section/block at a time where needed, that
direction always works. But that's very tricky.)

--
Thanks,

David / dhildenb

2020-09-09 17:21:35

by Rik van Riel

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed, 2020-09-09 at 15:43 +0200, David Hildenbrand wrote:
> On 09.09.20 15:19, Rik van Riel wrote:
> > On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> >
> > > That CMA has to be pre-reserved, right? That requires a
> > > configuration.
> >
> > To some extent, yes.
> >
> > However, because that pool can be used for movable
> > 4kB and 2MB
> > pages as well as for 1GB pages, it would be easy to just set
> > the size of that pool to eg. 1/3 or even 1/2 of memory for every
> > system.
> >
> > It isn't like the pool needs to be the exact right size. We
> > just need to avoid the "highmem problem" of having too little
> > memory for kernel allocations.
> >
>
> I am not sure I like the trend towards CMA that we are seeing,
> reserving
> huge buffers for specific users (and eventually even doing it
> automatically).
>
> What we actually want is ZONE_MOVABLE with relaxed guarantees, such
> that
> anybody who requires large, unmovable allocations can use it.
>
> I once played with the idea of having ZONE_PREFER_MOVABLE, which
> a) Is the primary choice for movable allocations
> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead
> of
> running out of memory
>
> If someone messes up the zone ratio, issues known from zone
> imbalances
> are avoided - large allocations simply become less likely to succeed.
> In
> contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.

I really like that idea. This will be easier to deal with than
a "just the right size" CMA area, and seems like it would be
pretty forgiving in both directions.

Keeping unmovable allocations
contained to one part of memory
should also make compaction within the ZONE_PREFER_MOVABLE area
a lot easier than compaction for higher order allocations is
today.

I suspect your proposal solves a lot of issues at once.

For (c) from your proposal, we could even claim a whole
2MB or even 1GB area at once for unmovable allocations,
keeping those contained in a limited amount of physical
memory again, to make life easier on compaction.

--
All Rights Reversed.



2020-09-10 07:33:54

by Michal Hocko

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

[Cc Vlastimil and Mel - the whole email thread starts
http://lkml.kernel.org/r/[email protected]
but this particular subthread has diverged a bit and you might find it
interesting]

On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
> On 09.09.20 15:19, Rik van Riel wrote:
> > On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> >> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> >>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> >>>
> >>>> A global knob is insufficient. 1G pages will become a very
> >>>> precious
> >>>> resource as it requires a pre-allocation (reservation). So it
> >>>> really
> >>>> has
> >>>> to be an opt-in and the question is whether there is also some
> >>>> sort
> >>>> of
> >>>> access control needed.
> >>>
> >>> The 1GB pages do not require that much in the way of
> >>> pre-allocation. The memory can be obtained through CMA,
> >>> which means it can be used for movable 4kB and 2MB
> >>> allocations when not
> >>> being used for 1GB pages.
> >>
> >> That CMA has to be pre-reserved, right? That requires a
> >> configuration.
> >
> > To some extent, yes.
> >
> > However, because that pool can be used for movable
> > 4kB and 2MB
> > pages as well as for 1GB pages, it would be easy to just set
> > the size of that pool to eg. 1/3 or even 1/2 of memory for every
> > system.
> >
> > It isn't like the pool needs to be the exact right size. We
> > just need to avoid the "highmem problem" of having too little
> > memory for kernel allocations.
> >
>
> I am not sure I like the trend towards CMA that we are seeing, reserving
> huge buffers for specific users (and eventually even doing it
> automatically).
>
> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
> anybody who requires large, unmovable allocations can use it.
>
> I once played with the idea of having ZONE_PREFER_MOVABLE, which
> a) Is the primary choice for movable allocations
> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
> running out of memory

I might be missing something, but how can this work long term? Or, put
another way, why would this work any better than the existing fragmentation
avoidance techniques that the page allocator already implements (movability
grouping etc.)? Please note that I am not deeply familiar with those, but
my high-level understanding is that we already try hard not to mix
movable and unmovable objects in the same page blocks as much as we can.

My suspicion is that a separate zone would work in a similar fashion. As
long as there is a lot of free memory, the zone will be effectively
MOVABLE. The same applies to the Normal zone when unmovable allocations are
in the minority. Once the Normal zone gets full of unmovable objects,
they start overflowing to ZONE_PREFER_MOVABLE, and it will resemble page
block stealing when unmovable objects start spreading over movable page
blocks.

Again, my level of expertise with the page allocator is quite low, so all of
the above might be simply wrong...

> If someone messes up the zone ratio, issues known from zone imbalances
> are avoided - large allocations simply become less likely to succeed. In
> contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.
--
Michal Hocko
SUSE Labs

2020-09-10 08:30:13

by David Hildenbrand

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 10.09.20 09:32, Michal Hocko wrote:
> [Cc Vlastimil and Mel - the whole email thread starts
> http://lkml.kernel.org/r/[email protected]
> but this particular subthread has diverged a bit and you might find it
> interesting]
>
> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>> On 09.09.20 15:19, Rik van Riel wrote:
>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>>>> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
>>>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>>>>>
>>>>>> A global knob is insufficient. 1G pages will become a very
>>>>>> precious
>>>>>> resource as it requires a pre-allocation (reservation). So it
>>>>>> really
>>>>>> has
>>>>>> to be an opt-in and the question is whether there is also some
>>>>>> sort
>>>>>> of
>>>>>> access control needed.
>>>>>
>>>>> The 1GB pages do not require that much in the way of
>>>>> pre-allocation. The memory can be obtained through CMA,
>>>>> which means it can be used for movable 4kB and 2MB
>>>>> allocations when not
>>>>> being used for 1GB pages.
>>>>
>>>> That CMA has to be pre-reserved, right? That requires a
>>>> configuration.
>>>
>>> To some extent, yes.
>>>
>>> However, because that pool can be used for movable
>>> 4kB and 2MB
>>> pages as well as for 1GB pages, it would be easy to just set
>>> the size of that pool to eg. 1/3 or even 1/2 of memory for every
>>> system.
>>>
>>> It isn't like the pool needs to be the exact right size. We
>>> just need to avoid the "highmem problem" of having too little
>>> memory for kernel allocations.
>>>
>>
>> I am not sure I like the trend towards CMA that we are seeing, reserving
>> huge buffers for specific users (and eventually even doing it
>> automatically).
>>
>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
>> anybody who requires large, unmovable allocations can use it.
>>
>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>> a) Is the primary choice for movable allocations
>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
>> running out of memory
>
> I might be missing something but how can this work longterm? Or put in
> another words why would this work any better than existing fragmentation
> avoidance techniques that page allocator implements already - movability
> grouping etc. Please note that I am not deeply familiar with those but
> my high level understanding is that we already try hard to not mix
> movable and unmovable objects in same page blocks as much as we can.

Note that we group at pageblock granularity, which avoids fragmentation
at the pageblock level but not for anything bigger than that, in particular
MAX_ORDER - 1 pages (e.g., on x86-64) and gigantic pages.

So once you run for some time on a system (especially thinking about
page shuffling *within* a zone), trying to allocate a gigantic page will
simply always fail - even if you always had plenty of free memory in
your single zone.
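
To put numbers on the granularity gap, here is a small worked example. The
constants are assumptions for x86-64 at the time of this thread (4 KiB base
pages, pageblock order 9, MAX_ORDER 11); the helper names are hypothetical.

```c
#define PAGE_SHIFT	12	/* 4 KiB base pages */
#define PAGEBLOCK_ORDER	 9	/* assumed: 2 MiB pageblocks */
#define MAX_ORDER	11	/* assumed: largest buddy order is MAX_ORDER - 1 */
#define PUD_ORDER	18	/* 1 GiB / 4 KiB = 2^18 base pages */

/* Size in bytes of a single allocation of the given order. */
static unsigned long order_to_bytes(unsigned int order)
{
	return 1UL << (PAGE_SHIFT + order);
}

/* Number of pageblocks that one allocation of the given order covers. */
static unsigned long pageblocks_spanned(unsigned int order)
{
	if (order <= PAGEBLOCK_ORDER)
		return 1;
	return 1UL << (order - PAGEBLOCK_ORDER);
}
```

With these values a MAX_ORDER - 1 page (4 MiB) spans 2 pageblocks and a 1 GiB
gigantic page spans 512. Movability grouping keeps each pageblock homogeneous
but does nothing to keep 512 physically contiguous pageblocks all movable,
and a single unmovable pageblock in the range defeats the allocation.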

>
> My suspicion is that a separate zone would work in a similar fashion. As
> long as there is a lot of free memory then zone will be effectively
> MOVABLE. Similar applies to normal zone when unmovable allocations are

Note the difference to MOVABLE: if you really want, you *can* put
unmovable allocations into that zone. So you can happily allocate gigantic
pages from it. Or anything else you like. As the name suggests "prefer
movable allocations".

> in minority. As long as the Normal zone gets full of unmovable objects
> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page
> block stealing when unmovable objects start spreading over movable page
> blocks.

Right, the long-term goal would be
1. To limit the chance of that happening. (e.g., size it in a way that's
safe for 99.9% of all setups, resize dynamically on demand)
2. To limit the physical area where that is happening (e.g., find lowest
possible pageblock etc.). That's more tricky but I consider this a pure
optimization on top.

As long as we stay in safe zone boundaries you get a benefit in most
scenarios. As soon as we would have a (temporary) workload that would
require more unmovable allocations we would fall back to polluting some
pageblocks only.

>
> Again, my level of expertise to page allocator is quite low so all the
> above might be simply wrong...

Same over here. I have had this idea in my mind for quite a while but
obviously haven't gotten around to figuring out the details or implementing
it yet - that's why I decided to share the basic idea just now.

--
Thanks,

David / dhildenb

2020-09-10 10:07:40

by William Kucharski

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64



> On Sep 9, 2020, at 7:27 AM, David Hildenbrand <[email protected]> wrote:
>
> On 09.09.20 15:14, Jason Gunthorpe wrote:
>> On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote:
>>
>>> But here's the thing ... we already allow
>>> mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)
>>>
>>> So if we're not doing THP, what's the point of this thread?
>>
>> I wondered that too..
>>
>>> An madvise flag is a different beast; that's just letting the kernel
>>> know what the app thinks its behaviour will be. The kernel can pay
>>
>> But madvise is too late, the VMA already has an address, if it is not
>> 1G aligned it cannot be 1G THP already.
>
> That's why user space (like QEMU) is THP-aware and selects an address
> that is aligned to the expected THP granularity (e.g., 2MB on x86_64).


To me it's always seemed like there are two major divisions among THP use
cases:

1) Applications that KNOW they would benefit from use of THPs, so they
call madvise() with an appropriate parameter and explicitly inform the
kernel of such

2) Applications that know nothing about THP but there may be an
advantage that comes from "automatic" THP mapping when possible.

The second is the approach I am more familiar with, and it comes down to:

1) Is a VMA properly aligned for a (whatever size) THP?

2) Is the mapping request for a length >= (whatever size) THP?

3) Let's try allocating memory to map the space using (whatever size)
THP, and:

-- If we succeed, great, awesome, let's do it.
-- If not, no big deal, map using as large a page as we CAN get.
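
The three steps above can be sketched as a simple selection loop. This is a
minimal sketch under stated assumptions: the x86-64 page sizes are hard-coded,
pick_page_size() and both predicates are hypothetical names, and the
allocation attempt is reduced to a callback that merely reports whether a
page of that size could be obtained.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_4K ((size_t)1 << 12)
#define SZ_2M ((size_t)1 << 21)
#define SZ_1G ((size_t)1 << 30)

/* Stand-in for the allocator: can a page of this size be had right now? */
typedef bool (*try_alloc_fn)(size_t size);

/* Example predicates: plenty of memory vs. no 1GB page available. */
static bool alloc_always(size_t size) { (void)size; return true; }
static bool alloc_no_1g(size_t size) { return size < SZ_1G; }

/*
 * Return the largest page size usable for a mapping at addr of length len,
 * or 0 if even a base page does not fit.
 */
static size_t pick_page_size(uintptr_t addr, size_t len, try_alloc_fn try_alloc)
{
	static const size_t sizes[] = { SZ_1G, SZ_2M, SZ_4K };
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		if (addr & (sizes[i] - 1))	/* 1) is the start aligned? */
			continue;
		if (len < sizes[i])		/* 2) is the range long enough? */
			continue;
		if (try_alloc(sizes[i]))	/* 3) try it, else fall through */
			return sizes[i];
	}
	return 0;
}
```

A failed attempt at one size is not an error here; the loop just moves on to
the next smaller size, which is the "no big deal" behaviour described above.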

There of course are myriad performance implications to this. Processes
that start early after boot have a better chance of getting a THP,
but that also means frequently mapped large memory spaces have a better
chance of being mapped in a shared manner via a THP, e.g. libc, X servers
or Firefox/Chrome. It also means that processes that would be mapped
using THPs early in boot may not be if they should crash and need to be
restarted.

There are all sorts of tunables that would likely need to be in place to make
the second approach more viable, but I think it's certainly worth investigating.

The address selection you suggest is the basis of one of the patches I wrote
for a previous iteration of THP support (and that is in Matthew's THP tree)
that will try to round VM addresses to the proper alignment if possible so a
THP can then be used to map the area.



2020-09-10 13:45:05

by Rik van Riel

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Thu, 2020-09-10 at 09:32 +0200, Michal Hocko wrote:
> [Cc Vlastimil and Mel - the whole email thread starts
> http://lkml.kernel.org/r/[email protected]
> but this particular subthread has diverged a bit and you might find
> it
> interesting]
>
> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
> >
> > I am not sure I like the trend towards CMA that we are seeing,
> > reserving
> > huge buffers for specific users (and eventually even doing it
> > automatically).
> >
> > What we actually want is ZONE_MOVABLE with relaxed guarantees, such
> > that
> > anybody who requires large, unmovable allocations can use it.
> >
> > I once played with the idea of having ZONE_PREFER_MOVABLE, which
> > a) Is the primary choice for movable allocations
> > b) Is allowed to contain unmovable allocations (esp., gigantic
> > pages)
> > c) Is the fallback for ZONE_NORMAL for unmovable allocations,
> > instead of
> > running out of memory
>
> I might be missing something but how can this work longterm? Or put
> in
> another words why would this work any better than existing
> fragmentation
> avoidance techniques that page allocator implements already -

One big difference is reclaim. If ZONE_NORMAL runs low on
free memory, page reclaim would kick in and evict some
movable/reclaimable things, to free up more space for
unmovable allocations.

The current fragmentation avoidance techniques don't do
things like reclaim, or proactively migrating movable
pages out of unmovable page blocks to prevent unmovable
allocations in currently movable page blocks.

> My suspicion is that a separate zone would work in a similar fashion.
> As
> long as there is a lot of free memory then zone will be effectively
> MOVABLE. Similar applies to normal zone when unmovable allocations
> are
> in minority. As long as the Normal zone gets full of unmovable
> objects
> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble
> page
> block stealing when unmovable objects start spreading over movable
> page
> blocks.

You are right, with the difference being reclaim and/or
migration, which could make a real difference in limiting
the number of pageblocks that have unmovable allocations.

--
All Rights Reversed.



2020-09-10 14:46:21

by Zi Yan

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 10 Sep 2020, at 10:34, David Hildenbrand wrote:

>>> As long as we stay in safe zone boundaries you get a benefit in most
>>> scenarios. As soon as we would have a (temporary) workload that would
>>> require more unmovable allocations we would fallback to polluting some
>>> pageblocks only.
>>
>> The idea would work well until unmovable pages begin to overflow into
>> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
>> avoid unmovable page overflow. The issue comes from the lifetime of
>> the unmovable pages. Since some long-lived ones can be around the boundary,
>> there is no guarantee that ZONE_PREFER_MOVABLE can grow back
>> even if other unmovable pages are deallocated. Ultimately,
>> ZONE_PREFER_MOVABLE would shrink to a small size and the situation is
>> back to what we have now.
>
> As discussed, this would not happen in the usual case if we size it
> reasonably. Of course, if you push it to the extreme (which was never
> suggested!), you would create a mess. There is always a way to create a
> mess if you abuse such a mechanism. Also see Rik's reply regarding reclaim.
>
>>
>> OK. I have a stupid question here. Why not just grow pageblock to a larger
>> size, like 1GB? So the fragmentation of unmovable pages will be at larger
>> granularity. But it is less likely unmovable pages will be allocated at
>> a movable pageblock, since the kernel has 1GB pageblock for them after
>> a pageblock stealing. If other kinds of pageblocks run out, movable and
>> reclaimable pages can fall back to unmovable pageblocks.
>> What am I missing here?
>
> Oh no. For example pageblocks have to completely fit into a single
> section (that's where metadata is maintained). Please refrain from
> suggesting to increase the section size ;)

Thank you for the explanation. I had no idea about the restrictions on
pageblocks and sections. Out of curiosity, what prevents the growth of
the section size?
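
For scale, the constraint David describes can be written as a one-line check.
The constants are assumptions for x86-64 sparsemem at the time of this thread
(4 KiB base pages, SECTION_SIZE_BITS of 27, i.e. 128 MiB sections), and the
helper name is hypothetical. It only illustrates the size mismatch and does
not answer why the section size is what it is.

```c
#define PAGE_SHIFT		12	/* 4 KiB base pages */
#define SECTION_SIZE_BITS	27	/* assumed: 128 MiB memory sections */
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)	/* 15 */

/* A pageblock fits only if it is no larger than one memory section. */
static int pageblock_fits_in_section(unsigned int pageblock_order)
{
	return pageblock_order <= PFN_SECTION_SHIFT;
}
```

With these values the current order-9 pageblock (2 MiB) fits comfortably,
while an order-18 pageblock (1 GiB) would span eight sections.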


Best Regards,
Yan Zi



2020-09-10 19:49:03

by David Hildenbrand

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 10.09.20 16:41, Zi Yan wrote:
> On 10 Sep 2020, at 10:34, David Hildenbrand wrote:
>
>>>> As long as we stay in safe zone boundaries you get a benefit in most
>>>> scenarios. As soon as we would have a (temporary) workload that would
>>>> require more unmovable allocations we would fall back to polluting some
>>>> pageblocks only.
>>>
>>> The idea would work well until unmovable pages begin to overflow into
>>> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
>>> avoid unmovable page overflow. The issue comes from the lifetime of
>>> the unmovable pages. Since some long-lived ones can sit around the boundary,
>>> there is no guarantee that ZONE_PREFER_MOVABLE can grow back
>>> even if other unmovable pages are deallocated. Ultimately,
>>> ZONE_PREFER_MOVABLE would shrink to a small size and the situation would be
>>> back to what we have now.
>>
>> As discussed, this would not happen in the usual case, provided we size it
>> reasonably. Of course, if you push it to the extreme (which was never
>> suggested!), you would create a mess. There is always a way to create a
>> mess if you abuse such a mechanism. Also see Rik's reply regarding reclaim.
>>
>>>
>>> OK. I have a stupid question here. Why not just grow the pageblock to a
>>> larger size, like 1GB? The fragmentation of unmovable pages would then be at
>>> a larger granularity, but unmovable pages would be less likely to be
>>> allocated in a movable pageblock, since the kernel has a whole 1GB pageblock
>>> for them after a pageblock steal. If other kinds of pageblocks run out,
>>> movable and reclaimable pages can fall back to unmovable pageblocks.
>>> What am I missing here?
>>
>> Oh no. For example pageblocks have to completely fit into a single
>> section (that's where metadata is maintained). Please refrain from
>> suggesting to increase the section size ;)
>
> Thank you for the explanation. I was not aware of the restrictions on
> pageblocks and sections. Out of curiosity, what prevents the growth of
> the section size?

The section size (and based on that the Linux memory block size) defines
- the minimum size in which we can add_memory()
- the alignment requirement in which we can add_memory()

This is applicable
- in physical environments, where the BIOS will decide where to place
  DIMMs/NVDIMMs. The coarser the granularity, the less memory we might
  be able to make use of in corner cases.
- in virtualized environments, where we want to add memory in fairly
  small granularity. The coarser the granularity, the less flexibility
  we have.

arm64 has a section size of 1GB (and a THP/MAX_ORDER - 1 size of 512MB
with 64k base pages :/ ). That already turned out to be a problem - see
[1] regarding thoughts on how to shrink the section size. I once read
about thoughts of switching to 2MB THP on arm64 with any base page size;
not sure if that will become real at some point (and we might be able to
reduce the pageblock size there as well ... )

[1]
https://lkml.kernel.org/r/AM6PR08MB40690714A2E77A7128B2B2ADF7700@AM6PR08MB4069.eurprd08.prod.outlook.com
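
To make the "pageblock must fit into a section" constraint concrete, here
is a minimal userspace sketch (not kernel code). The struct and function
names are made up for illustration, and the numbers are assumptions based
on what v5.9-era kernels are believed to use: a section size of 2^27 bytes
on x86_64 and 2^30 bytes on arm64, with pageblock order 9 for 4k pages and
13 for 64k pages. The 1GB pageblock entry is purely hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative geometry; all field values below are assumptions. */
struct mem_geometry {
	unsigned int page_shift;        /* log2(base page size in bytes) */
	unsigned int section_size_bits; /* log2(memory section size in bytes) */
	unsigned int pageblock_order;   /* log2(base pages per pageblock) */
};

uint64_t section_bytes(const struct mem_geometry *g)
{
	assert(g->section_size_bits < 64);
	return UINT64_C(1) << g->section_size_bits;
}

uint64_t pageblock_bytes(const struct mem_geometry *g)
{
	assert(g->page_shift + g->pageblock_order < 64);
	return UINT64_C(1) << (g->page_shift + g->pageblock_order);
}

/*
 * Both sizes are powers of two and naturally aligned, so "a pageblock
 * lies entirely within one section" reduces to a size comparison.
 */
bool pageblock_fits_in_section(const struct mem_geometry *g)
{
	return pageblock_bytes(g) <= section_bytes(g);
}

/* x86_64, 4k pages: 128MB sections, 2MB pageblocks. */
const struct mem_geometry x86_64_4k = { 12, 27, 9 };

/* Hypothetical x86_64 with 1GB pageblocks (order 18), as asked above. */
const struct mem_geometry x86_64_1g_pageblock = { 12, 27, 18 };

/* arm64, 64k pages: 1GB sections, 512MB pageblocks. */
const struct mem_geometry arm64_64k = { 16, 30, 13 };
```

Under these assumptions, pageblock_fits_in_section() holds for x86_64_4k
and arm64_64k but not for x86_64_1g_pageblock, which is why a 1GB
pageblock on x86_64 would also force the section size up to at least 1GB.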

>
> —
> Best Regards,
> Yan Zi
>


--
Thanks,

David / dhildenb

2020-09-10 21:12:05

by David Hildenbrand

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

>> As long as we stay in safe zone boundaries you get a benefit in most
>> scenarios. As soon as we would have a (temporary) workload that would
>> require more unmovable allocations we would fall back to polluting some
>> pageblocks only.
>
> The idea would work well until unmovable pages begin to overflow into
> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
> avoid unmovable page overflow. The issue comes from the lifetime of
> the unmovable pages. Since some long-lived ones can sit around the boundary,
> there is no guarantee that ZONE_PREFER_MOVABLE can grow back
> even if other unmovable pages are deallocated. Ultimately,
> ZONE_PREFER_MOVABLE would shrink to a small size and the situation would be
> back to what we have now.

As discussed, this would not happen in the usual case, provided we size it
reasonably. Of course, if you push it to the extreme (which was never
suggested!), you would create a mess. There is always a way to create a
mess if you abuse such a mechanism. Also see Rik's reply regarding reclaim.

>
> OK. I have a stupid question here. Why not just grow the pageblock to a
> larger size, like 1GB? The fragmentation of unmovable pages would then be at
> a larger granularity, but unmovable pages would be less likely to be
> allocated in a movable pageblock, since the kernel has a whole 1GB pageblock
> for them after a pageblock steal. If other kinds of pageblocks run out,
> movable and reclaimable pages can fall back to unmovable pageblocks.
> What am I missing here?

Oh no. For example pageblocks have to completely fit into a single
section (that's where metadata is maintained). Please refrain from
suggesting to increase the section size ;)

There is plenty of code relying on pageblocks/MAX_ORDER - 1 to be
reasonable in size. Examples in VMs are free page reporting or virtio-mem.

--
Thanks,

David / dhildenb

2020-09-10 21:13:46

by Zi Yan

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 10 Sep 2020, at 9:32, Rik van Riel wrote:

> On Thu, 2020-09-10 at 09:32 +0200, Michal Hocko wrote:
>> [Cc Vlastimil and Mel - the whole email thread starts
>> http://lkml.kernel.org/r/[email protected]
>> but this particular subthread has diverged a bit and you might find it
>> interesting]
>>
>> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>>>
>>> I am not sure I like the trend towards CMA that we are seeing, reserving
>>> huge buffers for specific users (and eventually even doing it
>>> automatically).
>>>
>>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
>>> anybody who requires large, unmovable allocations can use it.
>>>
>>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>>> a) Is the primary choice for movable allocations
>>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
>>> running out of memory
>>
>> I might be missing something but how can this work long term? Or, put in
>> other words, why would this work any better than existing fragmentation
>> avoidance techniques that the page allocator implements already -
>
> One big difference is reclaim. If ZONE_NORMAL runs low on
> free memory, page reclaim would kick in and evict some
> movable/reclaimable things, to free up more space for
> unmovable allocations.
>
> The current fragmentation avoidance techniques don't do
> things like reclaim, or proactively migrating movable
> pages out of unmovable page blocks to prevent unmovable
> allocations in currently movable page blocks.

Isn’t Mel Gorman’s watermark boost patch[1] (merged about a year ago)
doing what you are describing?


[1] https://lore.kernel.org/linux-mm/[email protected]/



Best Regards,
Yan Zi



2020-09-10 21:27:49

by Zi Yan

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 10 Sep 2020, at 4:27, David Hildenbrand wrote:

> On 10.09.20 09:32, Michal Hocko wrote:
>> [Cc Vlastimil and Mel - the whole email thread starts
>> http://lkml.kernel.org/r/[email protected]
>> but this particular subthread has diverged a bit and you might find it
>> interesting]
>>
>> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>>> On 09.09.20 15:19, Rik van Riel wrote:
>>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>>>>> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
>>>>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>>>>>>
>>>>>>> A global knob is insufficient. 1G pages will become a very precious
>>>>>>> resource as it requires a pre-allocation (reservation). So it really
>>>>>>> has to be an opt-in and the question is whether there is also some
>>>>>>> sort of access control needed.
>>>>>>
>>>>>> The 1GB pages do not require that much in the way of
>>>>>> pre-allocation. The memory can be obtained through CMA,
>>>>>> which means it can be used for movable 4kB and 2MB
>>>>>> allocations when not being used for 1GB pages.
>>>>>
>>>>> That CMA has to be pre-reserved, right? That requires a configuration.
>>>>
>>>> To some extent, yes.
>>>>
>>>> However, because that pool can be used for movable 4kB and 2MB
>>>> pages as well as for 1GB pages, it would be easy to just set
>>>> the size of that pool to e.g. 1/3 or even 1/2 of memory for every
>>>> system.
>>>>
>>>> It isn't like the pool needs to be the exact right size. We
>>>> just need to avoid the "highmem problem" of having too little
>>>> memory for kernel allocations.
>>>>
>>>
>>> I am not sure I like the trend towards CMA that we are seeing, reserving
>>> huge buffers for specific users (and eventually even doing it
>>> automatically).
>>>
>>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
>>> anybody who requires large, unmovable allocations can use it.
>>>
>>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>>> a) Is the primary choice for movable allocations
>>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
>>> running out of memory
>>
>> I might be missing something but how can this work long term? Or, put in
>> other words, why would this work any better than existing fragmentation
>> avoidance techniques that page allocator implements already - movability
>> grouping etc. Please note that I am not deeply familiar with those but
>> my high level understanding is that we already try hard to not mix
>> movable and unmovable objects in same page blocks as much as we can.
>
> Note that we group in pageblock granularity, which avoids fragmentation
> on a pageblock level, not on anything bigger than that. Especially
> MAX_ORDER - 1 pages (e.g., on x86-64) and gigantic pages.
>
> So once you run for some time on a system (especially thinking about
> page shuffling *within* a zone), trying to allocate a gigantic page will
> simply always fail - even if you always had plenty of free memory in
> your single zone.
>
>>
>> My suspicion is that a separate zone would work in a similar fashion. As
>> long as there is a lot of free memory then zone will be effectively
>> MOVABLE. Similar applies to normal zone when unmovable allocations are
>
> Note the difference to MOVABLE: if you really want, you *can* put
> unmovable allocations into that zone. So you can happily allocate gigantic
> pages from it. Or anything else you like. As the name suggests "prefer
> movable allocations".
>
>> in minority. As soon as the Normal zone gets full of unmovable objects
>> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page
>> block stealing when unmovable objects start spreading over movable page
>> blocks.
>
> Right, the long-term goal would be
> 1. To limit the chance of that happening. (e.g., size it in a way that's
> safe for 99.9% of all setups, resize dynamically on demand)
> 2. To limit the physical area where that is happening (e.g., find lowest
> possible pageblock etc.). That's more tricky but I consider this a pure
> optimization on top.
>
> As long as we stay in safe zone boundaries you get a benefit in most
> scenarios. As soon as we would have a (temporary) workload that would
> require more unmovable allocations we would fall back to polluting some
> pageblocks only.

The idea would work well until unmovable pages begin to overflow into
ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
avoid unmovable page overflow. The issue comes from the lifetime of
the unmovable pages. Since some long-lived ones can sit around the boundary,
there is no guarantee that ZONE_PREFER_MOVABLE can grow back
even if other unmovable pages are deallocated. Ultimately,
ZONE_PREFER_MOVABLE would shrink to a small size and the situation would be
back to what we have now.
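
The effect of a few long-lived unmovable pages can be shown with a toy
model (a sketch only, not how the page allocator is implemented). It
assumes 2MB pageblocks and 1GB gigantic pages, treats a pageblock as
either movable or unmovable, and asks whether any naturally aligned 1GB
window is free of unmovable pageblocks. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy pageblock types; TOY_MOVABLE is 0 so zeroed arrays start movable. */
enum toy_migratetype { TOY_MOVABLE = 0, TOY_UNMOVABLE };

/* 2MB pageblocks per 1GB gigantic page (assumes 4k base pages). */
#define PAGEBLOCKS_PER_GIGANTIC 512

/* Count the unmovable pageblocks in the modeled memory range. */
size_t toy_count_unmovable(const enum toy_migratetype *blocks, size_t nr)
{
	size_t count = 0;

	assert(blocks != NULL);
	for (size_t i = 0; i < nr; i++)
		if (blocks[i] == TOY_UNMOVABLE)
			count++;
	return count;
}

/*
 * Return true if some naturally aligned 1GB window contains only movable
 * pageblocks, i.e. it could in principle be vacated by migration.
 */
bool toy_can_alloc_gigantic(const enum toy_migratetype *blocks, size_t nr)
{
	assert(blocks != NULL);
	for (size_t start = 0; start + PAGEBLOCKS_PER_GIGANTIC <= nr;
	     start += PAGEBLOCKS_PER_GIGANTIC) {
		bool clean = true;

		for (size_t i = 0; i < PAGEBLOCKS_PER_GIGANTIC; i++) {
			if (blocks[start + i] == TOY_UNMOVABLE) {
				clean = false;
				break;
			}
		}
		if (clean)
			return true;
	}
	return false;
}
```

In this model, an 8GB range (4096 pageblocks) with just one unmovable
pageblock in each 1GB window has 8 unmovable pageblocks out of 4096, yet
toy_can_alloc_gigantic() returns false until one of them goes away.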

OK. I have a stupid question here. Why not just grow the pageblock to a
larger size, like 1GB? The fragmentation of unmovable pages would then be at
a larger granularity, but unmovable pages would be less likely to be
allocated in a movable pageblock, since the kernel has a whole 1GB pageblock
for them after a pageblock steal. If other kinds of pageblocks run out,
movable and reclaimable pages can fall back to unmovable pageblocks.
What am I missing here?

Thanks.



Best Regards,
Yan Zi

