2018-12-03 23:52:19

by David Rientjes

Subject: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

This fixes a 13.9% remote memory access regression and a 40% remote
memory allocation regression on Haswell when the local node is fragmented
for hugepage sized pages and memory is being faulted with either the thp
defrag setting of "always" or in a vma that has been madvised with
MADV_HUGEPAGE.

The usecase that initially identified this issue was binaries that mremap
their .text segment to be backed by transparent hugepages on startup.
They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
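
For illustration only, a minimal sketch of that sequence (not the actual
production code); text_start/text_len are placeholders for values derived
from the ELF headers or /proc/self/maps, and permission fixups and error
handling are elided:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)

/* Remap an existing .text range onto a THP-backed anonymous mapping. */
static void remap_text_to_thp(void *text_start, size_t text_len)
{
	size_t len = (text_len + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return;
	madvise(buf, len, MADV_HUGEPAGE);	/* ask for thp at fault */
	memcpy(buf, text_start, text_len);	/* faults the region */
	/* replace the original text mapping with the THP-backed copy */
	mremap(buf, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, text_start);
}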

This requires a full revert and a partial revert of commits merged during
the 4.20 rc cycle. The fully reverted commit, ac5b2c18911f ("mm: thp: relax
__GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
amounts of swap activity on the local zone when faulting hugepages by
falling back to remote memory. That remote allocation causes the access
regression and, if remote memory is also fragmented, the allocation
regression.

This patchset also fixes that issue by not attempting direct reclaim at
all when compaction fails to free a hugepage. Note that if remote memory
is also low or fragmented, ac5b2c18911f ("mm: thp: relax
__GFP_THISNODE for MADV_HUGEPAGE mappings") would only have compounded the
problem it attempts to address by thrashing all nodes instead of only
the local node.

The reverts for the stable trees will be different: just a straight revert
of commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE
mappings") is likely needed.

Cross compiled for architectures with thp support and thp enabled:
arc (with ISA_ARCV2), arm (with ARM_LPAE), arm64, i386, mips64, powerpc,
s390, sparc, x86_64.

Andrea, is this acceptable?
---
drivers/gpu/drm/ttm/ttm_page_alloc.c | 8 +++---
drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 3 --
include/linux/gfp.h | 3 +-
include/linux/mempolicy.h | 2 -
mm/huge_memory.c | 41 +++++++++++--------------------
mm/mempolicy.c | 7 +++--
mm/page_alloc.c | 16 ++++++++++++
7 files changed, 42 insertions(+), 38 deletions(-)


2018-12-03 23:51:23

by David Rientjes

Subject: [patch 2/2 for-4.20] mm, thp: always fault memory with __GFP_NORETRY

If memory compaction initially fails to free a hugepage, reclaiming and
retrying compaction is more likely to be harmful than beneficial.

For reclaim, it is unlikely that the reclaimed pages will form contiguous
memory the size of a hugepage without reclaiming an excessive amount of
memory. Reclaim is also not guaranteed to benefit compaction if the
reclaimed memory is not accessible to the per-zone freeing scanner. For
either of these reasons independently, all reclaim activity may be
entirely fruitless.

Given these two issues, retrying compaction is not likely to have a
different result. It is better to fall back to pages of the native page
size and allow khugepaged to collapse the memory into a hugepage later
when the fragmentation or availability of local memory is better.

If __GFP_NORETRY is set, which the page allocator implementation already
expects for these allocations per its comments, this prevents large
amounts of unnecessary reclaim and swapping activity that can quickly
degrade the performance of other applications.

Furthermore, since reclaim is likely to be more harmful than beneficial
for such large order allocations, it is better to fail early rather
than reclaim SWAP_CLUSTER_MAX pages, which is unlikely to make the
difference needed for memory compaction to succeed.

Signed-off-by: David Rientjes <[email protected]>
---
drivers/gpu/drm/ttm/ttm_page_alloc.c | 8 ++++----
drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 3 +--
include/linux/gfp.h | 3 ++-
mm/huge_memory.c | 3 +--
mm/page_alloc.c | 16 ++++++++++++++++
5 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -860,8 +860,8 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,
while (npages >= HPAGE_PMD_NR) {
gfp_t huge_flags = gfp_flags;

- huge_flags |= GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
- __GFP_KSWAPD_RECLAIM;
+ huge_flags |= GFP_TRANSHUGE_LIGHT |
+ __GFP_KSWAPD_RECLAIM;
huge_flags &= ~__GFP_MOVABLE;
huge_flags &= ~__GFP_COMP;
p = alloc_pages(huge_flags, HPAGE_PMD_ORDER);
@@ -978,13 +978,13 @@ int ttm_page_alloc_init(struct ttm_mem_global *glob, unsigned max_pages)
GFP_USER | GFP_DMA32, "uc dma", 0);

ttm_page_pool_init_locked(&_manager->wc_pool_huge,
- (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+ (GFP_TRANSHUGE_LIGHT |
__GFP_KSWAPD_RECLAIM) &
~(__GFP_MOVABLE | __GFP_COMP),
"wc huge", order);

ttm_page_pool_init_locked(&_manager->uc_pool_huge,
- (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+ (GFP_TRANSHUGE_LIGHT |
__GFP_KSWAPD_RECLAIM) &
~(__GFP_MOVABLE | __GFP_COMP)
, "uc huge", order);
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -863,8 +863,7 @@ static gfp_t ttm_dma_pool_gfp_flags(struct ttm_dma_tt *ttm_dma, bool huge)
gfp_flags |= __GFP_ZERO;

if (huge) {
- gfp_flags |= GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
- __GFP_KSWAPD_RECLAIM;
+ gfp_flags |= GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
gfp_flags &= ~__GFP_MOVABLE;
gfp_flags &= ~__GFP_COMP;
}
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -298,7 +298,8 @@ struct vm_area_struct;
#define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
#define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
- __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
+ __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
+ ~__GFP_RECLAIM)
#define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)

/* Convert GFP flags to their corresponding migrate type */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -636,8 +636,7 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, un

/* Always do synchronous compaction */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
- return GFP_TRANSHUGE | __GFP_THISNODE |
- (vma_madvised ? 0 : __GFP_NORETRY);
+ return GFP_TRANSHUGE | __GFP_THISNODE;

/* Kick kcompactd and fail quickly */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4139,6 +4139,22 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (compact_result == COMPACT_DEFERRED)
goto nopage;

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ * When faulting a hugepage, it is very unlikely that
+ * thrashing the zonelist is going to help compaction in
+ * freeing such a high-order page. Reclaim would need
+ * to free contiguous memory itself or guarantee the
+ * reclaimed memory is accessible by the compaction
+ * freeing scanner. Since there is no such guarantee,
+ * thrashing is more harmful than beneficial. It is
+ * better to simply fail and fallback to native pages.
+ */
+ if (order == HPAGE_PMD_ORDER &&
+ !(current->flags & PF_KTHREAD))
+ goto nopage;
+#endif
+
/*
* Looks like reclaim/compaction is worth trying, but
* sync compaction could be very expensive, so keep

2018-12-03 23:52:38

by David Rientjes

Subject: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").

By not setting __GFP_THISNODE, applications can allocate remote hugepages
when the local node is fragmented or low on memory, if either the thp
defrag setting is "always" or the vma has been madvised with
MADV_HUGEPAGE.

Remote access to hugepages often has much higher latency than local pages
of the native page size. On Haswell, ac5b2c18911f was shown to cause a
13.9% access regression for binaries that remap their
text segment to be backed by transparent hugepages.

The intent of ac5b2c18911f is to address an issue where a local node is
low on memory or fragmented such that a hugepage cannot be allocated. In
every scenario where this was described as a fix, there is abundant and
unfragmented remote memory available to allocate from, even with a greater
access latency.

If remote memory is also low or fragmented, not setting __GFP_THISNODE was
also measured on Haswell to have a 40% regression in allocation latency.

Restore __GFP_THISNODE for thp allocations.

Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")

Signed-off-by: David Rientjes <[email protected]>
---
include/linux/mempolicy.h | 2 --
mm/huge_memory.c | 42 +++++++++++++++------------------------
mm/mempolicy.c | 7 ++++---
3 files changed, 20 insertions(+), 31 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
struct mempolicy *get_task_policy(struct task_struct *p);
struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
unsigned long addr);
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
- unsigned long addr);
bool vma_policy_mof(struct vm_area_struct *vma);

extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -632,37 +632,27 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
{
const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
- gfp_t this_node = 0;
-
-#ifdef CONFIG_NUMA
- struct mempolicy *pol;
- /*
- * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
- * specified, to express a general desire to stay on the current
- * node for optimistic allocation attempts. If the defrag mode
- * and/or madvise hint requires the direct reclaim then we prefer
- * to fallback to other node rather than node reclaim because that
- * can lead to excessive reclaim even though there is free memory
- * on other nodes. We expect that NUMA preferences are specified
- * by memory policies.
- */
- pol = get_vma_policy(vma, addr);
- if (pol->mode != MPOL_BIND)
- this_node = __GFP_THISNODE;
- mpol_cond_put(pol);
-#endif
+ const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;

+ /* Always do synchronous compaction */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
- return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+ return GFP_TRANSHUGE | __GFP_THISNODE |
+ (vma_madvised ? 0 : __GFP_NORETRY);
+
+ /* Kick kcompactd and fail quickly */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
- return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
+ return gfp_mask | __GFP_KSWAPD_RECLAIM;
+
+ /* Synchronous compaction if madvised, otherwise kick kcompactd */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
- return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
- __GFP_KSWAPD_RECLAIM | this_node);
+ return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+ __GFP_KSWAPD_RECLAIM);
+
+ /* Only do synchronous compaction if madvised */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
- return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
- this_node);
- return GFP_TRANSHUGE_LIGHT | this_node;
+ return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+
+ return gfp_mask;
}

/* Caller must hold page table lock. */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1116,8 +1116,9 @@ static struct page *new_page(struct page *page, unsigned long start)
} else if (PageTransHuge(page)) {
struct page *thp;

- thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
- address, numa_node_id());
+ thp = alloc_pages_vma(GFP_TRANSHUGE | __GFP_THISNODE,
+ HPAGE_PMD_ORDER, vma, address,
+ numa_node_id());
if (!thp)
return NULL;
prep_transhuge_page(thp);
@@ -1662,7 +1663,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
* freeing by another task. It is the caller's responsibility to free the
* extra reference for shared policies.
*/
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
unsigned long addr)
{
struct mempolicy *pol = __get_vma_policy(vma, addr);

2018-12-04 07:38:15

by Michal Hocko

Subject: Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

On Mon 03-12-18 15:50:24, David Rientjes wrote:
> This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
> consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").
>
> By not setting __GFP_THISNODE, applications can allocate remote hugepages
> when the local node is fragmented or low on memory when either the thp
> defrag setting is "always" or the vma has been madvised with
> MADV_HUGEPAGE.
>
> Remote access to hugepages often has much higher latency than local pages
> of the native page size. On Haswell, ac5b2c18911f was shown to have a
> 13.9% access regression after this commit for binaries that remap their
> text segment to be backed by transparent hugepages.
>
> The intent of ac5b2c18911f is to address an issue where a local node is
> low on memory or fragmented such that a hugepage cannot be allocated. In
> every scenario where this was described as a fix, there is abundant and
> unfragmented remote memory available to allocate from, even with a greater
> access latency.
>
> If remote memory is also low or fragmented, not setting __GFP_THISNODE was
> also measured on Haswell to have a 40% regression in allocation latency.
>
> Restore __GFP_THISNODE for thp allocations.
>
> Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
> Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")

At minimum do not remove the cleanup part which consolidates the gfp
handling to a single place. There is no real reason to have the
__GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.

I still hate the __GFP_THISNODE part as mentioned before. It is an ugly
hack but I can learn to live with it if this is indeed the only option
for the short term workaround until we find a proper solution.

> Signed-off-by: David Rientjes <[email protected]>
> ---
> include/linux/mempolicy.h | 2 --
> mm/huge_memory.c | 42 +++++++++++++++------------------------
> mm/mempolicy.c | 7 ++++---
> 3 files changed, 20 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
> struct mempolicy *get_task_policy(struct task_struct *p);
> struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> unsigned long addr);
> -struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> - unsigned long addr);
> bool vma_policy_mof(struct vm_area_struct *vma);
>
> extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -632,37 +632,27 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> {
> const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> - gfp_t this_node = 0;
> -
> -#ifdef CONFIG_NUMA
> - struct mempolicy *pol;
> - /*
> - * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
> - * specified, to express a general desire to stay on the current
> - * node for optimistic allocation attempts. If the defrag mode
> - * and/or madvise hint requires the direct reclaim then we prefer
> - * to fallback to other node rather than node reclaim because that
> - * can lead to excessive reclaim even though there is free memory
> - * on other nodes. We expect that NUMA preferences are specified
> - * by memory policies.
> - */
> - pol = get_vma_policy(vma, addr);
> - if (pol->mode != MPOL_BIND)
> - this_node = __GFP_THISNODE;
> - mpol_cond_put(pol);
> -#endif
> + const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
>
> + /* Always do synchronous compaction */
> if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> - return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> + return GFP_TRANSHUGE | __GFP_THISNODE |
> + (vma_madvised ? 0 : __GFP_NORETRY);
> +
> + /* Kick kcompactd and fail quickly */
> if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> - return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
> + return gfp_mask | __GFP_KSWAPD_RECLAIM;
> +
> + /* Synchronous compaction if madvised, otherwise kick kcompactd */
> if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
> - return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> - __GFP_KSWAPD_RECLAIM | this_node);
> + return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> + __GFP_KSWAPD_RECLAIM);
> +
> + /* Only do synchronous compaction if madvised */
> if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> - return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> - this_node);
> - return GFP_TRANSHUGE_LIGHT | this_node;
> + return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
> +
> + return gfp_mask;
> }
>
> /* Caller must hold page table lock. */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1116,8 +1116,9 @@ static struct page *new_page(struct page *page, unsigned long start)
> } else if (PageTransHuge(page)) {
> struct page *thp;
>
> - thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
> - address, numa_node_id());
> + thp = alloc_pages_vma(GFP_TRANSHUGE | __GFP_THISNODE,
> + HPAGE_PMD_ORDER, vma, address,
> + numa_node_id());
> if (!thp)
> return NULL;
> prep_transhuge_page(thp);
> @@ -1662,7 +1663,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> * freeing by another task. It is the caller's responsibility to free the
> * extra reference for shared policies.
> */
> -struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> unsigned long addr)
> {
> struct mempolicy *pol = __get_vma_policy(vma, addr);

--
Michal Hocko
SUSE Labs

2018-12-04 07:39:43

by Michal Hocko

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Mon 03-12-18 15:50:18, David Rientjes wrote:
> This fixes a 13.9% of remote memory access regression and 40% remote
> memory allocation regression on Haswell when the local node is fragmented
> for hugepage sized pages and memory is being faulted with either the thp
> defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
>
> The usecase that initially identified this issue were binaries that mremap
> their .text segment to be backed by transparent hugepages on startup.
> They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().

Do you have something you can share so that other people can play
with it and try to reproduce?

> This requires a full revert and partial revert of commits merged during
> the 4.20 rc cycle. The full revert, of ac5b2c18911f ("mm: thp: relax
> __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
> amounts of swap activity on the local zone when faulting hugepages by
> falling back to remote memory. This remote allocation causes the access
> regression and, if fragmented, the allocation regression.

Have you tried to measure any of the workloads Mel and Andrea have
pointed out during the previous review discussion? In other words what
is the impact on the THP success rate and allocation latencies for other
usecases?
--
Michal Hocko
SUSE Labs

2018-12-04 10:13:35

by Vlastimil Babka

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On 12/4/18 12:50 AM, David Rientjes wrote:
> This fixes a 13.9% of remote memory access regression and 40% remote
> memory allocation regression on Haswell when the local node is fragmented
> for hugepage sized pages and memory is being faulted with either the thp
> defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
>
> The usecase that initially identified this issue were binaries that mremap
> their .text segment to be backed by transparent hugepages on startup.
> They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
>
> This requires a full revert and partial revert of commits merged during
> the 4.20 rc cycle. The full revert, of ac5b2c18911f ("mm: thp: relax
> __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
> amounts of swap activity on the local zone when faulting hugepages by
> falling back to remote memory. This remote allocation causes the access
> regression and, if fragmented, the allocation regression.
>
> This patchset also fixes that issue by not attempting direct reclaim at
> all when compaction fails to free a hugepage. Note that if remote memory
> was also low or fragmented that ac5b2c18911f ("mm: thp: relax
> __GFP_THISNODE for MADV_HUGEPAGE mappings") would only have compounded the
> problem it attempts to address by now thrashing all nodes instead of only
> the local node.
>
> The reverts for the stable trees will be different: just a straight revert
> of commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE
> mappings") is likely needed.
>
> Cross compiled for architectures with thp support and thp enabled:
> arc (with ISA_ARCV2), arm (with ARM_LPAE), arm64, i386, mips64, powerpc,
> s390, sparc, x86_64.
>
> Andrea, is this acceptable?

So, AFAIK, the situation is:

- commit 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP. The
intention came a bit earlier in 4.0 commit 077fcf116c8c. (I admit acking
both as it seemed to make sense).
- The resulting node-reclaim-like behavior regressed Andrea's KVM
workloads, but reverting it (only for madvised or non-default
defrag=always THP by commit ac5b2c18911f) would regress David's
workloads starting with 4.20 to pre-4.1 levels.

If the decision is that it's too late to revert a 4.1 regression for one
kind of workload in 4.20 because it causes regression for another
workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20
and don't rush a different fix (patch 2) to 4.20. It's not a big
difference if a 4.1 regression is fixed in 4.20 or 4.21?

Because there might be other unexpected consequences of patch 2 that
testing won't be able to catch in the remaining 4.20 rc's. And I'm not
even sure if it will fix Andrea's workloads. While it should prevent
node-reclaim-like thrashing, it will still mean that KVM (or anyone)
won't be able to allocate THP's remotely, even if the local node is
exhausted of both huge and base pages.

> ---
> drivers/gpu/drm/ttm/ttm_page_alloc.c | 8 +++---
> drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 3 --
> include/linux/gfp.h | 3 +-
> include/linux/mempolicy.h | 2 -
> mm/huge_memory.c | 41 +++++++++++--------------------
> mm/mempolicy.c | 7 +++--
> mm/page_alloc.c | 16 ++++++++++++
> 7 files changed, 42 insertions(+), 38 deletions(-)
>


2018-12-04 21:57:26

by David Rientjes

Subject: Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

On Tue, 4 Dec 2018, Michal Hocko wrote:

> > This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
> > consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").
> >
> > By not setting __GFP_THISNODE, applications can allocate remote hugepages
> > when the local node is fragmented or low on memory when either the thp
> > defrag setting is "always" or the vma has been madvised with
> > MADV_HUGEPAGE.
> >
> > Remote access to hugepages often has much higher latency than local pages
> > of the native page size. On Haswell, ac5b2c18911f was shown to have a
> > 13.9% access regression after this commit for binaries that remap their
> > text segment to be backed by transparent hugepages.
> >
> > The intent of ac5b2c18911f is to address an issue where a local node is
> > low on memory or fragmented such that a hugepage cannot be allocated. In
> > every scenario where this was described as a fix, there is abundant and
> > unfragmented remote memory available to allocate from, even with a greater
> > access latency.
> >
> > If remote memory is also low or fragmented, not setting __GFP_THISNODE was
> > also measured on Haswell to have a 40% regression in allocation latency.
> >
> > Restore __GFP_THISNODE for thp allocations.
> >
> > Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
> > Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
>
> At minimum do not remove the cleanup part which consolidates the gfp
> hadnling to a single place. There is no real reason to have the
> __GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.
>

The __GFP_THISNODE usage is still confined to
alloc_hugepage_direct_gfpmask() for the thp fault path, we no longer set
it in alloc_pages_vma() as done before the cleanup.

2018-12-04 22:06:12

by David Rientjes

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Tue, 4 Dec 2018, Vlastimil Babka wrote:

> So, AFAIK, the situation is:
>
> - commit 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP. The
> intention came a bit earlier in 4.0 commit 077fcf116c8c. (I admit acking
> both as it seemed to make sense).

Yes, both are based on the preference to fault local thp and fall back to
local pages before allocating remotely, which avoids the performance
regression introduced by not setting __GFP_THISNODE.

> - The resulting node-reclaim-like behavior regressed Andrea's KVM
> workloads, but reverting it (only for madvised or non-default
> defrag=always THP by commit ac5b2c18911f) would regress David's
> workloads starting with 4.20 to pre-4.1 levels.
>

Almost, but the defrag=always case had the subtle difference of also
setting __GFP_NORETRY whereas MADV_HUGEPAGE did not. This was different
than the comment in __alloc_pages_slowpath() that expected thp fault
allocations to be caught by checking __GFP_NORETRY.

> If the decision is that it's too late to revert a 4.1 regression for one
> kind of workload in 4.20 because it causes regression for another
> workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20
> and don't rush a different fix (patch 2) to 4.20. It's not a big
> difference if a 4.1 regression is fixed in 4.20 or 4.21?
>

The revert is certainly needed to prevent the regression, yes, but I
anticipate that Andrea will report back that patch 2 at least improves the
situation for the problem that he was addressing, specifically that it is
pointless to thrash any node or reclaim unnecessarily when compaction has
already failed. This is what setting __GFP_NORETRY for all thp fault
allocations fixes.

> Because there might be other unexpected consequences of patch 2 that
> testing won't be able to catch in the remaining 4.20 rc's. And I'm not
> even sure if it will fix Andrea's workloads. While it should prevent
> node-reclaim-like thrashing, it will still mean that KVM (or anyone)
> won't be able to allocate THP's remotely, even if the local node is
> exhausted of both huge and base pages.
>

Patch 2 does nothing with respect to the remote allocation policy; it
simply prevents reclaim (and potentially thrashing). Patch 1 sets
__GFP_THISNODE to prevent the remote allocation.

Note that the commit to be reverted in patch 1, if not reverted, would
cause an even more severe regression from Andrea's case if remote memory
is fragmented as well: it opens the door to thrashing both local and
remote memory instead of only local memory. I measured this as a 40%
allocation latency regression when purposefully fragmenting all nodes and
faulting without __GFP_THISNODE, and that was on Haswell; I can imagine it
would be even greater on Rome.

2018-12-04 22:26:47

by David Rientjes

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Tue, 4 Dec 2018, Michal Hocko wrote:

> > This fixes a 13.9% of remote memory access regression and 40% remote
> > memory allocation regression on Haswell when the local node is fragmented
> > for hugepage sized pages and memory is being faulted with either the thp
> > defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> >
> > The usecase that initially identified this issue were binaries that mremap
> > their .text segment to be backed by transparent hugepages on startup.
> > They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
>
> Do you have something you can share with so that other people can play
> and try to reproduce?
>

This is a single MADV_HUGEPAGE usecase; there is nothing special about it.
It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and
faulted the memory with a fragmented local node and then measured the
remote access latency to the remote hugepage that occurs without setting
__GFP_THISNODE. You can also measure the remote allocation latency by
fragmenting the entire system and then faulting.

(Remapping the text segment only involves parsing /proc/self/exe, mmap,
madvise, memcpy, and mremap.)
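
For concreteness, a rough sketch of such a measurement (an assumed test
harness, not something taken from the actual workload); the local node
would need to be fragmented beforehand, and the task pinned to it, for
the fault to spill over to a remote node:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)
#define LEN		(512 * HPAGE_SIZE)	/* 1GB; placeholder size */

int main(void)
{
	unsigned char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec t0, t1;
	unsigned long sum = 0;
	size_t i;

	if (p == MAP_FAILED)
		return 1;
	madvise(p, LEN, MADV_HUGEPAGE);
	memset(p, 1, LEN);	/* fault: local thp, remote thp, or 4k pages */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < LEN; i += 64)	/* touch every cache line */
		sum += p[i];
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("access pass: %ld ns (sum=%lu)\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000000L +
	       (t1.tv_nsec - t0.tv_nsec), sum);
	return 0;
}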

> > This requires a full revert and partial revert of commits merged during
> > the 4.20 rc cycle. The full revert, of ac5b2c18911f ("mm: thp: relax
> > __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
> > amounts of swap activity on the local zone when faulting hugepages by
> > falling back to remote memory. This remote allocation causes the access
> > regression and, if fragmented, the allocation regression.
>
> Have you tried to measure any of the workloads Mel and Andrea have
> pointed out during the previous review discussion? In other words what
> is the impact on the THP success rate and allocation latencies for other
> usecases?

It isn't a property of the workload, it's a property of how fragmented
both local and remote memory are. In Andrea's case, I believe he has
stated that memory compaction has failed locally and the resulting reclaim
activity ends up looping and causing it to thrash the local node whereas
75% of remote memory is free and not fragmented. So we have local
fragmentation, reclaim that is very expensive for enabling compaction to
succeed, if it ever does succeed[*], and mostly free remote memory.

If remote memory is also fragmented, Andrea's case will run into a much
more severe swap storm as a result of not setting __GFP_THISNODE. The
premise of the entire change is that his remote memory is mostly free so
fallback results in a quick allocation. For balanced nodes, that's not
going to be the case. The fix to prevent the heavy reclaim activity is to
set __GFP_NORETRY as the page allocator suspects, which patch 2 here does.

That's an interesting memory state to

[*] Reclaim here would only be beneficial if we fail the order-0
watermark check in __compaction_suitable() *and* the reclaimed
memory can be accessed during isolate_freepages().

2018-12-05 07:35:30

by Michal Hocko

Subject: Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

On Tue 04-12-18 13:56:30, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
>
> > > This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > > MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
> > > consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").
> > >
> > > By not setting __GFP_THISNODE, applications can allocate remote hugepages
> > > when the local node is fragmented or low on memory when either the thp
> > > defrag setting is "always" or the vma has been madvised with
> > > MADV_HUGEPAGE.
> > >
> > > Remote access to hugepages often has much higher latency than local pages
> > > of the native page size. On Haswell, ac5b2c18911f was shown to have a
> > > 13.9% access regression after this commit for binaries that remap their
> > > text segment to be backed by transparent hugepages.
> > >
> > > The intent of ac5b2c18911f is to address an issue where a local node is
> > > low on memory or fragmented such that a hugepage cannot be allocated. In
> > > every scenario where this was described as a fix, there is abundant and
> > > unfragmented remote memory available to allocate from, even with a greater
> > > access latency.
> > >
> > > If remote memory is also low or fragmented, not setting __GFP_THISNODE was
> > > also measured on Haswell to have a 40% regression in allocation latency.
> > >
> > > Restore __GFP_THISNODE for thp allocations.
> > >
> > > Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
> > > Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
> >
> > At minimum do not remove the cleanup part which consolidates the gfp
> > hadnling to a single place. There is no real reason to have the
> > __GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.
> >
>
> The __GFP_THISNODE usage is still confined to
> alloc_hugepage_direct_gfpmask() for the thp fault path, we no longer set
> it in alloc_pages_vma() as done before the cleanup.

Why should be new_page any different?

--
Michal Hocko
SUSE Labs

2018-12-05 07:43:25

by Michal Hocko

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Tue 04-12-18 14:25:54, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
>
> > > This fixes a 13.9% of remote memory access regression and 40% remote
> > > memory allocation regression on Haswell when the local node is fragmented
> > > for hugepage sized pages and memory is being faulted with either the thp
> > > defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> > >
> > > The usecase that initially identified this issue were binaries that mremap
> > > their .text segment to be backed by transparent hugepages on startup.
> > > They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
> >
> > Do you have something you can share with so that other people can play
> > and try to reproduce?
> >
>
> This is a single MADV_HUGEPAGE usecase, there is nothing special about it.
> It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and
> faulted the memory with a fragmented local node and then measured the
> remote access latency to the remote hugepage that occurs without setting
> __GFP_THISNODE. You can also measure the remote allocation latency by
> fragmenting the entire system and then faulting.
>
> (Remapping the text segment only involves parsing /proc/self/exe, mmap,
> madvise, memcpy, and mremap.)

How does this reflect your real workload and the regressions in it? It
certainly shows the worst case behavior where the access penalty is
prevalent while there are no other metrics which might be interesting or
even important, e.g. page table savings or the TLB pressure in general
when THP allocations fail too eagerly.
As Andrea mentioned there are really valid cases where the remote latency
pays off.

Have you actually seen the advertised regression in the real workload,
and do you have any means to simulate that workload?

> > > This requires a full revert and partial revert of commits merged during
> > > the 4.20 rc cycle. The full revert, of ac5b2c18911f ("mm: thp: relax
> > > __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
> > > amounts of swap activity on the local zone when faulting hugepages by
> > > falling back to remote memory. This remote allocation causes the access
> > > regression and, if fragmented, the allocation regression.
> >
> > Have you tried to measure any of the workloads Mel and Andrea have
> > pointed out during the previous review discussion? In other words what
> > is the impact on the THP success rate and allocation latencies for other
> > usecases?
>
> It isn't a property of the workload, it's a property of the how fragmented
> both local and remote memory is. In Andrea's case, I believe he has
> stated that memory compaction has failed locally and the resulting reclaim
> activity ends up looping and causing it the thrash the local node whereas
> 75% of remote memory is free and not fragmented. So we have local
> fragmentation and reclaim is very expensive to enable compaction to
> succeed, if it ever does succeed[*], and mostly free remote memory.
>
> If remote memory is also fragmented, Andrea's case will run into a much
> more severe swap storm as a result of not setting __GFP_THISNODE. The
> premise of the entire change is that his remote memory is mostly free so
> fallback results in a quick allocation. For balanced nodes, that's not
> going to be the case. The fix to prevent the heavy reclaim activity is to
> set __GFP_NORETRY as the page allocator suspects, which patch 2 here does.

You are not answering my question and I take it that you haven't actually
tested this in a variety of workloads and base your assumptions on an
artificial worst case scenario. Please correct me if I am wrong.

--
Michal Hocko
SUSE Labs

2018-12-05 09:07:01

by Michal Hocko

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Tue 04-12-18 14:04:10, David Rientjes wrote:
> On Tue, 4 Dec 2018, Vlastimil Babka wrote:
>
> > So, AFAIK, the situation is:
> >
> > - commit 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP. The
> > intention came a bit earlier in 4.0 commit 077fcf116c8c. (I admit acking
> > both as it seemed to make sense).
>
> Yes, both are based on the preference to fault local thp and fallback to
> local pages before allocating remotely because it does not cause the
> performance regression introduced by not setting __GFP_THISNODE.
>
> > - The resulting node-reclaim-like behavior regressed Andrea's KVM
> > workloads, but reverting it (only for madvised or non-default
> > defrag=always THP by commit ac5b2c18911f) would regress David's
> > workloads starting with 4.20 to pre-4.1 levels.
> >
>
> Almost, but the defrag=always case had the subtle difference of also
> setting __GFP_NORETRY whereas MADV_HUGEPAGE did not. This was different
> than the comment in __alloc_pages_slowpath() that expected thp fault
> allocations to be caught by checking __GFP_NORETRY.
>
> > If the decision is that it's too late to revert a 4.1 regression for one
> > kind of workload in 4.20 because it causes regression for another
> > workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20
> > and don't rush a different fix (patch 2) to 4.20. It's not a big
> > difference if a 4.1 regression is fixed in 4.20 or 4.21?
> >
>
> The revert is certainly needed to prevent the regression, yes, but I
> anticipate that Andrea will report back that patch 2 at least improves the
> situation for the problem that he was addressing, specifically that it is
> pointless to thrash any node or reclaim unnecessarily when compaction has
> already failed. This is what setting __GFP_NORETRY for all thp fault
> allocations fixes.

Yes, but the earlier numbers from Mel, repeated again in [1], simply show
that the swap storms are only handled at the cost of an absolute drop in
the THP success rate.

> > Because there might be other unexpected consequences of patch 2 that
> > testing won't be able to catch in the remaining 4.20 rc's. And I'm not
> > even sure if it will fix Andrea's workloads. While it should prevent
> > node-reclaim-like thrashing, it will still mean that KVM (or anyone)
> > won't be able to allocate THP's remotely, even if the local node is
> > exhausted of both huge and base pages.
> >
>
> Patch 2 does nothing with respect to the remote allocation policy; it
> simply prevents reclaim (and potentially thrashing). Patch 1 sets
> __GFP_THISNODE to prevent the remote allocation.

Yes, this is understood. So we are getting the worst of both: we have a
numa locality side effect of MADV_HUGEPAGE and we have poor THP
utilization. So how is this an improvement? Especially when the
reported regression hasn't been demonstrated on a real or repeatable
workload but rather on a vague, presumably worst case behavior where
the access penalty is absolutely prevailing.

[1] http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs

2018-12-05 10:16:05

by Mel Gorman

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Tue, Dec 04, 2018 at 02:25:54PM -0800, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
>
> > > This fixes a 13.9% of remote memory access regression and 40% remote
> > > memory allocation regression on Haswell when the local node is fragmented
> > > for hugepage sized pages and memory is being faulted with either the thp
> > > defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> > >
> > > The usecase that initially identified this issue were binaries that mremap
> > > their .text segment to be backed by transparent hugepages on startup.
> > > They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
> >
> > Do you have something you can share with so that other people can play
> > and try to reproduce?
> >
>
> This is a single MADV_HUGEPAGE usecase, there is nothing special about it.
> It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and
> faulted the memory with a fragmented local node and then measured the
> remote access latency to the remote hugepage that occurs without setting
> __GFP_THISNODE. You can also measure the remote allocation latency by
> fragmenting the entire system and then faulting.
>

I'll make the same point as before: the form the fragmentation takes
matters, as well as the types of pages that are resident and whether
they are active or not. It affects the level of work the system does
as well as the overall success rate of operations (be it reclaim, THP
allocation, compaction, whatever). This is why a reproduction case that is
representative of the problem you're facing on the real workload
would have been helpful, because then any alternative proposal could have
taken your workload into account during testing.

--
Mel Gorman
SUSE Labs

2018-12-05 19:25:46

by David Rientjes

Subject: Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

On Wed, 5 Dec 2018, Michal Hocko wrote:

> > > At minimum do not remove the cleanup part which consolidates the gfp
> > > hadnling to a single place. There is no real reason to have the
> > > __GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.
> > >
> >
> > The __GFP_THISNODE usage is still confined to
> > alloc_hugepage_direct_gfpmask() for the thp fault path, we no longer set
> > it in alloc_pages_vma() as done before the cleanup.
>
> Why should be new_page any different?
>

To match alloc_new_node_page(), which does this correctly and does not
change the behavior of mbind() the way the cleanup did; mbind() previously
used alloc_hugepage_vma() to get the __GFP_THISNODE behavior. If there is
a reason mbind() is somehow different wrt allocating hugepages locally, I
think that should be a separate patch, but the goal of this patch is to
revert all the behavioral change that caused hugepages to be allocated
remotely.

2018-12-05 19:42:55

by David Rientjes

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Wed, 5 Dec 2018, Mel Gorman wrote:

> > This is a single MADV_HUGEPAGE usecase, there is nothing special about it.
> > It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and
> > faulted the memory with a fragmented local node and then measured the
> > remote access latency to the remote hugepage that occurs without setting
> > __GFP_THISNODE. You can also measure the remote allocation latency by
> > fragmenting the entire system and then faulting.
> >
>
> I'll make the same point as before, the form the fragmentation takes
> matters as well as the types of pages that are resident and whether
> they are active or not. It affects the level of work the system does
> as well as the overall success rate of operations (be it reclaim, THP
> allocation, compaction, whatever). This is why a reproduction case that is
> representative of the problem you're facing on the real workload matters
> would have been helpful because then any alternative proposal could have
> taken your workload into account during testing.
>

We know from Andrea's report that compaction is failing, and repeatedly
failing because otherwise we would not need excessive swapping to make it
work. That can mean one of two things: (1) a general low-on-memory
situation that causes us repeatedly to be under watermarks to deem
compaction suitable (isolate_freepages() will be too painful) or (2)
compaction has the memory that it needs but is failing to make a hugepage
available because all pages from a pageblock cannot be migrated.

If (1), perhaps in the presence of an antagonist that is quickly
allocating the memory before compaction can pass watermark checks, further
reclaim is not beneficial: the allocation is becoming too expensive and
there is no guarantee that compaction can find this reclaimed memory in
isolate_freepages().

I chose to duplicate (2) by synthetically introducing fragmentation
(high-order slab, free every other one) locally to test the patch that
does not set __GFP_THISNODE. The result is a remote transparent hugepage,
but we do not even need to get to the point of local compaction for that
fallback to happen. And this is where I measure the 13.9% access latency
regression for the lifetime of the binary as a result of this patch.
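
For reference, a hedged sketch of one way to produce that kind of
synthetic fragmentation: a throwaway debug module under my own
assumptions (raw alloc_pages() rather than actual high-order slab
objects; order and count are placeholders), not part of the patchset:

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/vmalloc.h>

#define FRAG_ORDER	4		/* 64KB chunks; placeholder */
#define FRAG_COUNT	(64 * 1024)	/* placeholder count */

static struct page **frag_pages;

static int __init frag_init(void)
{
	int i;

	frag_pages = vmalloc(FRAG_COUNT * sizeof(*frag_pages));
	if (!frag_pages)
		return -ENOMEM;

	/* pin many high-order chunks on the local node */
	for (i = 0; i < FRAG_COUNT; i++)
		frag_pages[i] = alloc_pages(GFP_KERNEL | __GFP_THISNODE,
					    FRAG_ORDER);

	/* free every other chunk: free memory exists, but no 2MB blocks */
	for (i = 0; i < FRAG_COUNT; i += 2) {
		if (frag_pages[i])
			__free_pages(frag_pages[i], FRAG_ORDER);
		frag_pages[i] = NULL;
	}
	return 0;
}

static void __exit frag_exit(void)
{
	int i;

	/* release the chunks that were left pinned (odd indices) */
	for (i = 1; i < FRAG_COUNT; i += 2)
		if (frag_pages[i])
			__free_pages(frag_pages[i], FRAG_ORDER);
	vfree(frag_pages);
}

module_init(frag_init);
module_exit(frag_exit);
MODULE_LICENSE("GPL");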

If local compaction works the first time, great! But that is not what is
happening in Andrea's report and as a result of not setting __GFP_THISNODE
we are *guaranteed* worse access latency and may encounter even worse
allocation latency if the remote memory is fragmented as well.

So while I'm only testing the functional behavior of the patch itself, I
cannot speak to the nature of the local fragmentation on Andrea's systems.

2018-12-05 19:51:26

by David Rientjes

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Wed, 5 Dec 2018, Michal Hocko wrote:

> > The revert is certainly needed to prevent the regression, yes, but I
> > anticipate that Andrea will report back that patch 2 at least improves the
> > situation for the problem that he was addressing, specifically that it is
> > pointless to thrash any node or reclaim unnecessarily when compaction has
> > already failed. This is what setting __GFP_NORETRY for all thp fault
> > allocations fixes.
>
> Yes but earlier numbers from Mel and repeated again [1] simply show
> that the swap storms are only handled in favor of an absolute drop of
> THP success rate.
>

As we've been over countless times, this is the desired effect for
workloads that fit on a single node. We want local pages of the native
page size because they (1) are accessed faster than remote hugepages and
(2) are candidates for collapse by khugepaged.

For applications that do not fit in a single node, we have discussed
possible ways to extend the API to allow remote faulting of hugepages
(absent remote fragmentation as well); the long-standing behavior is then
preserved and large applications can use the API to increase their thp
success rate.

> Yes, this is understood. So we are getting worst of both. We have a
> numa locality side effect of MADV_HUGEPAGE and we have a poor THP
> utilization. So how come this is an improvement. Especially when the
> reported regression hasn't been demonstrated on a real or repeatable
> workload but rather a very vague presumably worst case behavior where
> the access penalty is absolutely prevailing.
>

High thp utilization is not always better, especially when those hugepages
are accessed remotely and introduce the regressions that I've reported.
Seeking high thp utilization at all costs is not the goal if it causes
workloads to regress.

2018-12-05 20:17:02

by Michal Hocko

Subject: Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

On Wed 05-12-18 11:24:53, David Rientjes wrote:
> On Wed, 5 Dec 2018, Michal Hocko wrote:
>
> > > > At minimum do not remove the cleanup part which consolidates the gfp
> > > > hadnling to a single place. There is no real reason to have the
> > > > __GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.
> > > >
> > >
> > > The __GFP_THISNODE usage is still confined to
> > > alloc_hugepage_direct_gfpmask() for the thp fault path, we no longer set
> > > it in alloc_pages_vma() as done before the cleanup.
> >
> > Why should be new_page any different?
> >
>
> To match alloc_new_node_page() which does it correctly and does not change
> the behavior of mbind() that the cleanup did, which used
> alloc_hugepage_vma() to get the __GFP_THISNODE behavior. If there is a
> reason mbind() is somehow different wrt allocating hugepages locally, I
> think that should be a separate patch, but the goal of this patch is to
> revert all the behavioral change that caused hugepages to be allocated
> remotely.

If __GFP_THISNODE really should be used then it should be applied to
all other types of pages, not only THP, and done as such in a separate
patch, not as part of the revert. The cleanup was meant to unify THP
allocations and that is why I object to reverting it as a part of this
work.

--
Michal Hocko
SUSE Labs

2018-12-05 20:35:09

by Michal Hocko

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Wed 05-12-18 11:49:26, David Rientjes wrote:
> On Wed, 5 Dec 2018, Michal Hocko wrote:
>
> > > The revert is certainly needed to prevent the regression, yes, but I
> > > anticipate that Andrea will report back that patch 2 at least improves the
> > > situation for the problem that he was addressing, specifically that it is
> > > pointless to thrash any node or reclaim unnecessarily when compaction has
> > > already failed. This is what setting __GFP_NORETRY for all thp fault
> > > allocations fixes.
> >
> > Yes but earlier numbers from Mel and repeated again [1] simply show
> > that the swap storms are only handled in favor of an absolute drop of
> > THP success rate.
> >
>
> As we've been over countless times, this is the desired effect for
> workloads that fit on a single node. We want local pages of the native
> page size because they (1) are accessed faster than remote hugepages and
> (2) are candidates for collapse by khugepaged.
>
> For applications that do not fit in a single node, we have discussed
> possible ways to extend the API to allow remote faulting of hugepages,
> absent remote fragmentation as well, then the long-standing behavior is
> preserved and large applications can use the API to increase their thp
> success rate.

OK, I just give up. This doesn't lead anywhere. You keep repeating the
same stuff over and over, neglect other usecases and actually force them
to do something special just to keep your very specific usecase which
you clearly refuse to abstract into a form other people can experiment
with or at least provide more detailed broken down numbers for a more
serious analyses. Fault latency is only a part of the picture which is
much more complex. Look at Mel's report to get an impression of what
might be really useful for a _productive_ discussion.
--
Michal Hocko
SUSE Labs

2018-12-05 21:17:23

by David Rientjes

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Wed, 5 Dec 2018, Michal Hocko wrote:

> > As we've been over countless times, this is the desired effect for
> > workloads that fit on a single node. We want local pages of the native
> > page size because they (1) are accessed faster than remote hugepages and
> > (2) are candidates for collapse by khugepaged.
> >
> > For applications that do not fit in a single node, we have discussed
> > possible ways to extend the API to allow remote faulting of hugepages,
> > absent remote fragmentation as well, then the long-standing behavior is
> > preserved and large applications can use the API to increase their thp
> > success rate.
>
> OK, I just give up. This doesn't lead anywhere. You keep repeating the
> same stuff over and over, neglect other usecases and actually force them
> to do something special just to keep your very specific usecase which
> you clearly refuse to abstract into a form other people can experiment
> with or at least provide more detailed broken down numbers for a more
> serious analyses. Fault latency is only a part of the picture which is
> much more complex. Look at Mel's report to get an impression of what
> might be really useful for a _productive_ discussion.

The other usecases are addressed by patch 2/2 in this series, which is
functionally similar to the __GFP_COMPACT_ONLY patch that Andrea proposed.
We can also work to extend the API to allow remote thp allocations.

Patch 1/2 reverts the behavior of commit ac5b2c18911f ("mm: thp: relax
__GFP_THISNODE for MADV_HUGEPAGE mappings") which added NUMA locality on
top of an already conflated madvise mode. Prior to this commit that was
merged for 4.20, *all* thp faults were constrained to the local node; this
has been the case for three years and even prior to that in other kernels.
It turns out that allowing remote allocations introduces access latency in
the presence of local fragmentation.

The solution is not to conflate MADV_HUGEPAGE with any semantic that
suggests it allows remote thp allocations, especially when that changes
long-standing behavior, regresses my usecase, and regresses the kernel
test robot.

I'll change patch 1/2 to not touch new_page() so that we are only
addressing thp faults and post a v2.

2018-12-05 21:47:57

by Andrea Arcangeli

Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Wed, Dec 05, 2018 at 11:49:26AM -0800, David Rientjes wrote:
> High thp utilization is not always better, especially when those hugepages
> are accessed remotely and introduce the regressions that I've reported.
> Seeking high thp utilization at all costs is not the goal if it causes
> workloads to regress.

Is it possible that what you need is a defrag=compactonly_thisnode
setting instead of the default defrag=madvise? The fact that you seem
concerned about page fault latencies doesn't make your workload an obvious
candidate for MADV_HUGEPAGE to begin with. At least unless you decide
to smooth the MADV_HUGEPAGE behavior with an mbind that simply
adds __GFP_THISNODE to the allocations; perhaps you'll be even faster
if you invoke reclaim on the local node for 4k allocations too.

It looks like for your workload THP is a nice to have add-on, which is
practically true of all workloads (with a few corner cases that must
use MADV_NOHUGEPAGE), and it's what the defrag= default is about.
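
(For reference, a small sketch of flipping the existing defrag knob; the
values available in 4.20 are "always", "defer", "defer+madvise", "madvise"
and "never", whereas the compactonly_thisnode/thisnode variants discussed
in this mail are hypothetical modes being proposed, not existing settings:

#include <stdio.h>

/* Write one of the existing defrag modes to the system-wide knob. */
static int set_thp_defrag(const char *mode)
{
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", mode);
	return fclose(f);
}
)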

Is it possible that you just don't want to completely shut off
compaction in the page fault, and if you're ok with doing that for your
library, you may be ok with it for all other apps too?

That's a different stance from other MADV_HUGEPAGE users because you
don't seem to mind a severely crippled THP utilization in your
app.

With your patch the utilization will go down a lot compared to the
previous swap-storm-capable __GFP_THISNODE behavior and you're still very
fine with that. The fact you're fine with that points in the direction of
changing the default tuning for defrag= to something stronger than
madvise (that is precisely the default setting that is forcing you to
use MADV_HUGEPAGE to get a chance to get some THP once in a while
during the page fault, after some uptime).

Considering mbind surprisingly isn't privileged (so I suppose it may
cause swap storms equivalent to __GFP_THISNODE if maliciously used
after all), you could even consider a defrag=thisnode to force
compaction+defrag local to the node and retain your THP+NUMA dynamic
partitioning behavior that ends up swapping heavily in the local node.

2018-12-05 22:12:39

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Wed, 5 Dec 2018, Andrea Arcangeli wrote:

> > High thp utilization is not always better, especially when those hugepages
> > are accessed remotely and introduce the regressions that I've reported.
> > Seeking high thp utilization at all costs is not the goal if it causes
> > workloads to regress.
>
> Is it possible what you need is a defrag=compactonly_thisnode to set
> instead of the default defrag=madvise? The fact you seem concerned
> about page fault latencies doesn't make your workload an obvious
> candidate for MADV_HUGEPAGE to begin with. At least unless you decide
> to smooth the MADV_HUGEPAGE behavior with an mbind that will simply
> add __GFP_THISNODE to the allocations, perhaps you'll be even faster
> if you invoke reclaim in the local node for 4k allocations too.
>

I must have said this at least six or seven times: fault latency is
secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE
users to do synchronous compaction and try to make a hugepage available.
We really want to be backed by hugepages, but certainly not when the
access latency becomes 13.9% worse as a result compared to local pages of
the native page size.

This is not a system-wide configuration detail, it is specific to the
workload: does it span more than one node or not? No workload that can
fit into a single node, which you also say is going to be the majority of
workloads on today's platforms, is going to want to revert __GFP_THISNODE
behavior of the past almost four years. It makes perfect sense, however,
for this to be a new mempolicy mode, a new madvise mode, or a prctl.

> It looks like for your workload THP is a nice to have add-on, which is
> practically true of all workloads (with a few corner cases that must
> use MADV_NOHUGEPAGE), and it's what the defrag= default is about.
>
> Is it possible that you just don't want to shut off completely
> compaction in the page fault and if you're ok to do it for your
> library, you may be ok with that for all other apps too?
>

We enable synchronous compaction for MADV_HUGEPAGE users, yes, because we
are not concerned with the fault latency but rather the access latency.

> That's a different stance from other MADV_HUGEPAGE users because you
> don't seem to mind a severely crippled THP utilization in your
> app.
>

If access latency really is better with local pages of the native page
size, we of course want to fault those instead. For almost the past four
years, the behavior of MADV_HUGEPAGE has been to compact and possibly
reclaim locally and then fall back to local pages. It is exactly what our
users of MADV_HUGEPAGE want; I did not introduce this NUMA locality
restriction but our users have used it.

Please: if we wish to change behavior from February 2015, let's extend the
API to allow for remote allocations in several of the ways we have already
brainstormed rather than cause regressions.

2018-12-05 22:23:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

On Wed, Dec 05, 2018 at 09:15:28PM +0100, Michal Hocko wrote:
> If the __GFP_THISNODE should be really used then it should be applied to
> all other types of pages. Not only THP. And as such done in a separate
> patch. Not a part of the revert. The cleanup was meant to unify THP
> allocations and that is why I object to reverting it as a part of this
> work.

I also wonder if using __GFP_THISNODE for all pages, and not only THP,
may help David's workload performance even further if it's so NUMA
sensitive and short lived too (plus we know it most of the time fits
in a single node). It'll get rid of the cache in the local node before
allocating remote memory. Of course the swap storms and pathological
performance for processes that don't fit in a node will then also
materialize without THP enabled. If this is true, that'd point further
to the need for an MPOL that can set __GFP_THISNODE not just for THP
but for all allocations, with the only difference that if it's the
regular page size failing, instead of hitting OOM it should make one
last fallback attempt at a regular page size allocation without
__GFP_THISNODE set.
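
Purely as a sketch of the MPOL idea above (the helper below and its
placement are made up for illustration; only __GFP_THISNODE and
__alloc_pages_node() are existing kernel symbols):

/*
 * Hypothetical allocation path for a policy that forces local-node
 * allocation for every order, but lets order-0 fall back anywhere
 * rather than OOMing the local node.
 */
static struct page *alloc_page_local_first(gfp_t gfp, unsigned int order,
					    int node)
{
	struct page *page;

	/* First attempt: local node only, for every order. */
	page = __alloc_pages_node(node, gfp | __GFP_THISNODE, order);
	if (page)
		return page;

	/* THP and other high-order requests: let the caller fall back to
	 * base pages instead of reclaiming or allocating remotely. */
	if (order > 0)
		return NULL;

	/* Base pages: one last attempt anywhere, instead of a local OOM. */
	return __alloc_pages_node(node, gfp, order);
}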

Regardless of the above, I think it's still interesting to look into
adding more NUMA affinity to THP somehow, but I doubt we want to do
that at the price of crippling compaction so that it can't generate
THP anymore once a node is full of cache, and we certainly don't want
to cripple compaction on non-NUMA hosts too, as would happen with the
currently proposed patch. Furthermore, whatever we do should work for
orders 8, 7, 6, etc. too. If we did an order-9 specialized trick that
cripples compaction effectiveness, it'd be a short-term approach
tailored to David's use case, which seems to be sensitive to
allocation latency. Being sensitive to allocation latency doesn't make
that process a good fit for MADV_HUGEPAGE to begin with.

2018-12-06 00:32:27

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Wed, Dec 05, 2018 at 02:10:47PM -0800, David Rientjes wrote:
> I must have said this at least six or seven times: fault latency is

In your original regression report in this thread to Linus:

https://lkml.kernel.org/r/[email protected]

you said "On a fragmented host, the change itself showed a 13.9%
access latency regression on Haswell and up to 40% allocation latency
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
regression. This is more substantial on Naples and Rome. I also
^^^^^^^^^^
measured similar numbers to this for Haswell."

> secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE
> users to do synchronous compaction and try to make a hugepage available.

I'm glad you said it six or seven times now, because you forgot to
mention in the above email that the "40% allocation/fault latency
regression" you reported above, is actually a secondary concern because
those must be long lived allocations and we can't yet generate
compound pages for free after all..

> We really want to be backed by hugepages, but certainly not when the
> access latency becomes 13.9% worse as a result compared to local pages of
> the native page size.

Yes, the only regression you measured that isn't merely a secondary
concern is the 13.9% access latency caused by the lack of immediate
NUMA locality.

BTW, I never bothered to ask yet, but, did you enable NUMA balancing
in your benchmarks? NUMA balancing would fix the access latency very
easily too, so that 13.9% access latency must quickly disappear if you
correctly have NUMA balancing enabled in a NUMA system.

Furthermore, NUMA balancing is guaranteed to fully converge if the
workload can fit in a single node (your case, or __GFP_THISNODE would
hardly fly in the first place). It'll work even better for you because
you copy all MAP_PRIVATE binaries into MAP_ANON to make them THP
backed, so they can also be replicated per node and they won't increase
NUMA balancing false sharing. And khugepaged always remains NUMA
agnostic, so it won't risk stepping on NUMA balancing's toes no matter
how we tweak the MADV_HUGEPAGE behavior.
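
(For reference, whether automatic NUMA balancing is active can be checked
at runtime through the kernel.numa_balancing sysctl; a trivial sketch,
assuming a kernel built with CONFIG_NUMA_BALANCING:)

#include <stdio.h>

int main(void)
{
	/* /proc/sys/kernel/numa_balancing is absent when the kernel was
	 * built without CONFIG_NUMA_BALANCING. */
	FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
	int enabled = -1;

	if (!f) {
		puts("NUMA balancing not available");
		return 1;
	}
	if (fscanf(f, "%d", &enabled) == 1)
		printf("kernel.numa_balancing = %d\n", enabled);
	fclose(f);
	return 0;
}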

> This is not a system-wide configuration detail, it is specific to the
> workload: does it span more than one node or not? No workload that can
> fit into a single node, which you also say is going to be the majority of
> workloads on today's platforms, is going to want to revert __GFP_THISNODE
> behavior of the past almost four years. It makes perfect sense, however,
> for this to be a new mempolicy mode, a new madvise mode, or a prctl.

qemu has been using MADV_HUGEPAGE since the below commit in Oct 2012.

From ad0b5321f1f797274603ebbe20108b0750baee94 Mon Sep 17 00:00:00 2001
From: Luiz Capitulino <[email protected]>
Date: Fri, 5 Oct 2012 16:47:57 -0300
Subject: [PATCH 1/1] Call MADV_HUGEPAGE for guest RAM allocations

This makes it possible for QEMU to use transparent huge pages (THP)
when transparent_hugepage/enabled=madvise. Otherwise THP is only
used when it's enabled system wide.

Signed-off-by: Luiz Capitulino <[email protected]>
Signed-off-by: Anthony Liguori <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
exec.c | 1 +
osdep.h | 5 +++++
2 files changed, 6 insertions(+)

diff --git a/exec.c b/exec.c
index c4ed6fdef1..750008c4e1 100644
--- a/exec.c
+++ b/exec.c
@@ -2571,6 +2571,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
cpu_physical_memory_set_dirty_range(new_block->offset, size, 0xff);

qemu_ram_setup_dump(new_block->host, size);
+ qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);

if (kvm_enabled())
kvm_setup_guest_memory(new_block->host, size);
diff --git a/osdep.h b/osdep.h
index cb213e0295..c5fd3d91ff 100644
--- a/osdep.h
+++ b/osdep.h
@@ -108,6 +108,11 @@ void qemu_vfree(void *ptr);
#else
#define QEMU_MADV_DONTDUMP QEMU_MADV_INVALID
#endif
+#ifdef MADV_HUGEPAGE
+#define QEMU_MADV_HUGEPAGE MADV_HUGEPAGE
+#else
+#define QEMU_MADV_HUGEPAGE QEMU_MADV_INVALID
+#endif

#elif defined(CONFIG_POSIX_MADVISE)

> Please: if we wish to change behavior from February 2015, let's extend the
> API to allow for remote allocations in several of the ways we have already
> brainstormed rather than cause regressions.

So if I understand correctly, you broke QEMU 3 years ago (again, only
noticed in Jan 2018 because it requires >2 nodes and a VM larger than
one node, which isn't common).

QEMU before your regression in April 2015:

Finished: 30 GB mapped, 10.163159s elapsed, 2.95GB/s
Finished: 34 GB mapped, 11.806526s elapsed, 2.88GB/s
Finished: 38 GB mapped, 10.369081s elapsed, 3.66GB/s
Finished: 42 GB mapped, 12.357719s elapsed, 3.40GB/s

QEMU after your regression in April 2015:

Finished: 30 GB mapped, 8.821632s elapsed, 3.40GB/s
Finished: 34 GB mapped, 341.979543s elapsed, 0.10GB/s
Finished: 38 GB mapped, 761.933231s elapsed, 0.05GB/s
Finished: 42 GB mapped, 1188.409235s elapsed, 0.04GB/s

And now you ask open source upstream QEMU to start using a special
mempolicy if it doesn't want to keep breaking with your change, so you
don't have to apply a one-liner to your so-far-unreleased proprietary
software that uses MADV_HUGEPAGE and depends on the special
__GFP_THISNODE behavior that broke QEMU in the first place?

Thanks,
Andrea

2018-12-09 23:46:08

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

On Wed, 5 Dec 2018, Andrea Arcangeli wrote:

> > I must have said this at least six or seven times: fault latency is
>
> In your original regression report in this thread to Linus:
>
> https://lkml.kernel.org/r/[email protected]
>
> you said "On a fragmented host, the change itself showed a 13.9%
> access latency regression on Haswell and up to 40% allocation latency
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> regression. This is more substantial on Naples and Rome. I also
> ^^^^^^^^^^
> measured similar numbers to this for Haswell."
>
> > secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE
> > users to do synchronous compaction and try to make a hugepage available.
>
> I'm glad you said it six or seven times now, because you forgot to
> mention in the above email that the "40% allocation/fault latency
> regression" you reported above, is actually a secondary concern because
> those must be long lived allocations and we can't yet generate
> compound pages for free after all..
>

I've been referring to the long history of this discussion, namely my
explicit Nacked-by in https://marc.info/?l=linux-kernel&m=153868420126775
two months ago citing the 13.9% access latency regression. The patch was
nonetheless merged; I proposed the revert for the same chief complaint,
and it was reverted.

I brought up the access latency issue three months ago in
https://marc.info/?l=linux-kernel&m=153661012118046 and said allocation
latency was a secondary concern, specifically that our users of
MADV_HUGEPAGE are willing to accept the increased allocation latency for
local hugepages.

> BTW, I never bothered to ask yet, but, did you enable NUMA balancing
> in your benchmarks? NUMA balancing would fix the access latency very
> easily too, so that 13.9% access latency must quickly disappear if you
> correctly have NUMA balancing enabled in a NUMA system.
>

No, we do not have CONFIG_NUMA_BALANCING enabled. The __GFP_THISNODE
behavior for hugepages was added in 4.0 for the PPC usecase, not by me.
That had nothing to do with the madvise mode: the initial documentation
referred to the mode as a way to prevent an increase in rss for configs
where "enabled" was set to madvise. The allocation policy was never about
MADV_HUGEPAGE in any 4.x kernel; the madvise mode was only an indication,
for certain defrag settings, of how much work should be done to allocate
*local* hugepages at fault.
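
To make that concrete, a simplified paraphrase (not the exact kernel
code; the helper name is invented and details are elided) of how the
defrag setting and MADV_HUGEPAGE combine into a gfp mask at fault time
in mm/huge_memory.c of this era:

/* thp_defrag_is() is a made-up stand-in for the transparent_hugepage_flags
 * test_bit() checks the real code performs. */
static gfp_t thp_fault_gfpmask(bool vma_madvised)
{
	if (thp_defrag_is("always"))            /* sync compaction for all */
		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
	if (thp_defrag_is("defer"))             /* kick kcompactd, fail fast */
		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
	if (thp_defrag_is("defer+madvise"))     /* sync only if madvised */
		return GFP_TRANSHUGE_LIGHT | (vma_madvised ?
			__GFP_DIRECT_RECLAIM : __GFP_KSWAPD_RECLAIM);
	if (thp_defrag_is("madvise"))           /* effort only if madvised */
		return GFP_TRANSHUGE_LIGHT |
			(vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
	return GFP_TRANSHUGE_LIGHT;             /* "never" */
}

In every branch the madvise mode only widens the reclaim/compaction
effort; node selection is decided separately, which is where
__GFP_THISNODE came in.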

If you are saying that the change in allocator policy came in a patch from
Aneesh almost four years ago and went unreported by anybody up until a
few months ago, I can understand the frustration. I do, however, support
the __GFP_THISNODE change he made because his data shows the same results
as mine.

I've suggested a very simple extension, specifically a prctl() mode,
inherited across fork, that would allow a workload to specify that it
prefers remote allocations over local compaction/reclaim because it is too
large to fit on a single node. I'd value your feedback on that
suggestion as a way to fix your usecase.
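
Purely hypothetical sketch of what such a prctl() could look like from
userspace (PR_THP_PREFER_REMOTE does not exist; the name and request
number are made up to illustrate the suggestion):

#include <sys/prctl.h>

/* Invented request number for illustration only. */
#define PR_THP_PREFER_REMOTE	0x54485001

int main(void)
{
	/*
	 * A workload that spans more than one node opts in once; the
	 * setting would be inherited across fork() so every worker
	 * prefers remote hugepages over local compaction/reclaim.
	 */
	if (prctl(PR_THP_PREFER_REMOTE, 1, 0, 0, 0) < 0) {
		/* Older kernel: fall back to today's MADV_HUGEPAGE-only
		 * behavior. */
	}

	/* ... mmap() + madvise(MADV_HUGEPAGE) + fault as usual ... */
	return 0;
}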