LinuxLists.cc - [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
<[email protected]> wrote:
>
> FYI, we noticed a -61.3% regression of vm-scalability.throughput due
> to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> MADV_HUGEPAGE mappings")

Well, that's certainly noticeable and not good.

Andrea, I suspect it might be causing fights with auto numa migration..

Lots more system time, but also look at this:

> 1122389 ± 9% +17.2% 1315380 ± 4% proc-vmstat.numa_hit
> 214722 ± 5% +21.6% 261076 ± 3% proc-vmstat.numa_huge_pte_updates
> 1108142 ± 9% +17.4% 1300857 ± 4% proc-vmstat.numa_local
> 145368 ± 48% +63.1% 237050 ± 17% proc-vmstat.numa_miss
> 159615 ± 44% +57.6% 251573 ± 16% proc-vmstat.numa_other
> 185.50 ± 81% +8278.6% 15542 ± 40% proc-vmstat.numa_pages_migrated

Should the commit be reverted? Or perhaps at least modified?

Linus

2018-11-27 19:28:58

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Tue 27-11-18 09:08:50, Linus Torvalds wrote:
> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
> <[email protected]> wrote:
> >
> > FYI, we noticed a -61.3% regression of vm-scalability.throughput due
> > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > MADV_HUGEPAGE mappings")
>
> Well, that's certainly noticeable and not good.
>
> Andrea, I suspect it might be causing fights with auto numa migration..
>
> Lots more system time, but also look at this:
>
> > 1122389 ? 9% +17.2% 1315380 ? 4% proc-vmstat.numa_hit
> > 214722 ? 5% +21.6% 261076 ? 3% proc-vmstat.numa_huge_pte_updates
> > 1108142 ? 9% +17.4% 1300857 ? 4% proc-vmstat.numa_local
> > 145368 ? 48% +63.1% 237050 ? 17% proc-vmstat.numa_miss
> > 159615 ? 44% +57.6% 251573 ? 16% proc-vmstat.numa_other
> > 185.50 ? 81% +8278.6% 15542 ? 40% proc-vmstat.numa_pages_migrated
>
> Should the commit be reverted? Or perhaps at least modified?

Well, the commit is trying to revert to the behavior before
5265047ac301 because there are real usecases that suffered from that
change and bug reports as a result of that.

will-it-scale is certainly worth considering but it is an artificial
testcase. A higher NUMA miss rate is an expected side effect of the
patch because the fallback to a different NUMA node is more likely. The
__GFP_THISNODE side effect is basically introducing node-reclaim
behavior for THPages. Another thing is that there is no good behavior
for everybody. Reclaim locally vs. THP on a remote node is hard to
tell by default. We have discussed that at length and there were some
conclusions. One of them is that we need a numa policy to tell whether
a expensive localility is preferred over remote allocation. Also we
definitely need a better pro-active defragmentation to allow larger
pages on a local node. This is a work in progress and this patch is a
stop gap fix.

--
Michal Hocko
SUSE Labs

2018-11-27 19:29:44

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Tue 27-11-18 19:17:27, Michal Hocko wrote:
> On Tue 27-11-18 09:08:50, Linus Torvalds wrote:
> > On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
> > <[email protected]> wrote:
> > >
> > > FYI, we noticed a -61.3% regression of vm-scalability.throughput due
> > > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > > MADV_HUGEPAGE mappings")
> >
> > Well, that's certainly noticeable and not good.
> >
> > Andrea, I suspect it might be causing fights with auto numa migration..
> >
> > Lots more system time, but also look at this:
> >
> > > 1122389 ? 9% +17.2% 1315380 ? 4% proc-vmstat.numa_hit
> > > 214722 ? 5% +21.6% 261076 ? 3% proc-vmstat.numa_huge_pte_updates
> > > 1108142 ? 9% +17.4% 1300857 ? 4% proc-vmstat.numa_local
> > > 145368 ? 48% +63.1% 237050 ? 17% proc-vmstat.numa_miss
> > > 159615 ? 44% +57.6% 251573 ? 16% proc-vmstat.numa_other
> > > 185.50 ? 81% +8278.6% 15542 ? 40% proc-vmstat.numa_pages_migrated
> >
> > Should the commit be reverted? Or perhaps at least modified?
>
> Well, the commit is trying to revert to the behavior before
> 5265047ac301 because there are real usecases that suffered from that
> change and bug reports as a result of that.
>
> will-it-scale is certainly worth considering but it is an artificial
> testcase. A higher NUMA miss rate is an expected side effect of the
> patch because the fallback to a different NUMA node is more likely. The
> __GFP_THISNODE side effect is basically introducing node-reclaim
> behavior for THPages. Another thing is that there is no good behavior
> for everybody. Reclaim locally vs. THP on a remote node is hard to
> tell by default. We have discussed that at length and there were some
> conclusions. One of them is that we need a numa policy to tell whether
> a expensive localility is preferred over remote allocation. Also we
> definitely need a better pro-active defragmentation to allow larger
> pages on a local node. This is a work in progress and this patch is a
> stop gap fix.

Btw. the associated discussion is http://lkml.kernel.org/r/[email protected]

--
Michal Hocko
SUSE Labs

2018-11-27 19:30:09

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On 11/27/18 6:08 PM, Linus Torvalds wrote:
> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
> <[email protected]> wrote:
>>
>> FYI, we noticed a -61.3% regression of vm-scalability.throughput due
>> to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
>> MADV_HUGEPAGE mappings")
>
> Well, that's certainly noticeable and not good.
>
> Andrea, I suspect it might be causing fights with auto numa migration..
>
> Lots more system time, but also look at this:
>
>> 1122389 ± 9% +17.2% 1315380 ± 4% proc-vmstat.numa_hit
>> 214722 ± 5% +21.6% 261076 ± 3% proc-vmstat.numa_huge_pte_updates
>> 1108142 ± 9% +17.4% 1300857 ± 4% proc-vmstat.numa_local
>> 145368 ± 48% +63.1% 237050 ± 17% proc-vmstat.numa_miss
>> 159615 ± 44% +57.6% 251573 ± 16% proc-vmstat.numa_other
>> 185.50 ± 81% +8278.6% 15542 ± 40% proc-vmstat.numa_pages_migrated
>
> Should the commit be reverted? Or perhaps at least modified?

This part of the test's config is important:

thp_defrag: always

While the commit targets MADV_HUGEPAGE mappings (such as Andrea's
kvm-qemu usecase), with defrag=always, all mappings behave almost as a
MADV_HUGEPAGE mapping. That's no longer a default for some years now and
I think nobody recommends it. In the default configuration nothing
changes for non-madvise mappings.

Vlastimil

> Linus
>

2018-11-27 19:34:39

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On 11/27/18 8:05 PM, Vlastimil Babka wrote:
> On 11/27/18 6:08 PM, Linus Torvalds wrote:
>> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
>> <[email protected]> wrote:
>>>
>>> FYI, we noticed a -61.3% regression of vm-scalability.throughput due
>>> to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
>>> MADV_HUGEPAGE mappings")
>>
>> Well, that's certainly noticeable and not good.
>>
>> Andrea, I suspect it might be causing fights with auto numa migration..
>>
>> Lots more system time, but also look at this:
>>
>>> 1122389 ± 9% +17.2% 1315380 ± 4% proc-vmstat.numa_hit
>>> 214722 ± 5% +21.6% 261076 ± 3% proc-vmstat.numa_huge_pte_updates
>>> 1108142 ± 9% +17.4% 1300857 ± 4% proc-vmstat.numa_local
>>> 145368 ± 48% +63.1% 237050 ± 17% proc-vmstat.numa_miss
>>> 159615 ± 44% +57.6% 251573 ± 16% proc-vmstat.numa_other
>>> 185.50 ± 81% +8278.6% 15542 ± 40% proc-vmstat.numa_pages_migrated
>>
>> Should the commit be reverted? Or perhaps at least modified?
>
> This part of the test's config is important:
>
> thp_defrag: always
>
> While the commit targets MADV_HUGEPAGE mappings (such as Andrea's
> kvm-qemu usecase), with defrag=always, all mappings behave almost as a
> MADV_HUGEPAGE mapping. That's no longer a default for some years now and

Specifically, that's 444eb2a449ef ("mm: thp: set THP defrag by default
to madvise and add a stall-free defrag option") merged in v4.5. So we
might actually hit this regression with 4.4 stable backport...

> I think nobody recommends it. In the default configuration nothing
> changes for non-madvise mappings.
>
> Vlastimil
>
>> Linus
>>
>

2018-11-27 20:59:17

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

Hi Linus,

On Tue, Nov 27, 2018 at 09:08:50AM -0800, Linus Torvalds wrote:
> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
> <[email protected]> wrote:
> >
> > FYI, we noticed a -61.3% regression of vm-scalability.throughput due
> > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > MADV_HUGEPAGE mappings")
>
> Well, that's certainly noticeable and not good.

Noticed this email too.

This difference can only happen with defrag=always, and that's not the
current upstream default.

thp_enabled: always
thp_defrag: always
^^^^^^^^^^^^^^^^^^ emulates MADV_HUGEPAGE always set

> Andrea, I suspect it might be causing fights with auto numa migration..

That MADV_HUGEPAGE causes flights with NUMA balancing is not great
indeed, qemu needs NUMA locality too, but then the badness caused by
__GFP_THISNODE was a larger regression in the worst case for qemu.

When the kernel wants to allocate a THP from node A, if there are no
THP generated on node A but there are in node B, they'll be picked from
node B now.

__GFP_THISNODE previously prevented any THP to be allocated from any
node except A. This introduces a higher chance of initial misplacement
which NUMA balancing will correct over time, but it should only apply
to long lived allocations under MADV_HUGEPAGE. Perhaps the workload
here uses short lived allocations and sets defrag=always which is not
optimal to begin with?

The motivation of the fix, is that the previous code invoked reclaim
with __GFP_THISNODE set. That looked insecure and such behavior should
only have been possible under a mlock/mempolicy
capability. __GFP_THISNODE is like a transient but still privileged
mbind for reclaim.

Before the fix, __GFP_THISNODE would end up swapping out everything
from node A to free 4k pages from node A, despite perhaps there were
gigabytes of memory free in node B. That caused severe regression to
threaded workloads whose memory spanned more than one NUMA node. So
again going back doesn't sounds great for NUMA in general.

The vmscalability test is most certainly not including any
multithreaded process whose memory doesn't fit in a single NUMA node
or we'd see also the other side of the tradeoff. It'd be nice to add
such a test to be sure that the old __GFP_THISNODE behavior won't
happen again for apps that don't fit in a single node.

> Lots more system time, but also look at this:
>
> > 1122389 ? 9% +17.2% 1315380 ? 4% proc-vmstat.numa_hit
> > 214722 ? 5% +21.6% 261076 ? 3% proc-vmstat.numa_huge_pte_updates
> > 1108142 ? 9% +17.4% 1300857 ? 4% proc-vmstat.numa_local
> > 145368 ? 48% +63.1% 237050 ? 17% proc-vmstat.numa_miss
> > 159615 ? 44% +57.6% 251573 ? 16% proc-vmstat.numa_other
> > 185.50 ? 81% +8278.6% 15542 ? 40% proc-vmstat.numa_pages_migrated
>
> Should the commit be reverted? Or perhaps at least modified?

I proposed two solutions, the other one required a new minor feature:
__GFP_ONLY_COMPACT. The other solution wouldn't regress like
above. The THP utilization ratio would decrease though (it had margin
for improvement though).

Kirill preferred the __GFP_ONLY_COMPACT, I was mostly neutral because
it's a tradeoff.

So the short term alternative again would be the alternate patch that
does __GFP_THISNODE|GFP_ONLY_COMPACT appended below.

There's no particular problem in restricting only compaction to the
local node and to skip reclaim entirely during a THP allocation as
long as reclaim is skipped entirely.

David implemented a hardcoded version of GFP_COMPACTONLY too which was
runtime equivalent, but it was hardcoded for THP only the allocator,
but it looks less flexible to hardcode it for THP.

The current fix you merged is simpler overall and puts us back to a
"stable" state without introducing new (minor) features.

The below is for further review of the potential alternative (which
has still margin for improvement).

=======
From: Andrea Arcangeli <[email protected]>
Subject: [PATCH 1/2] mm: thp: consolidate policy_nodemask call

Just a minor cleanup.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/mempolicy.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 01f1a14facc4..d6512ef28cde 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2026,6 +2026,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
goto out;
}

+ nmask = policy_nodemask(gfp, pol);
+
if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
int hpage_node = node;

@@ -2043,7 +2045,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
!(pol->flags & MPOL_F_LOCAL))
hpage_node = pol->v.preferred_node;

- nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(hpage_node, *nmask)) {
mpol_cond_put(pol);
page = __alloc_pages_node(hpage_node,
@@ -2052,7 +2053,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
}
}

- nmask = policy_nodemask(gfp, pol);
preferred_nid = policy_node(gfp, pol, node);
page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
mpol_cond_put(pol);

From: Andrea Arcangeli <[email protected]>
Subject: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always

qemu uses MADV_HUGEPAGE which allows direct compaction (i.e.
__GFP_DIRECT_RECLAIM is set).

The problem is that direct compaction combined with the NUMA
__GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
hard the local node, instead of failing the allocation if there's no
THP available in the local node.

Such logic was ok until __GFP_THISNODE was added to the THP allocation
path even with MPOL_DEFAULT.

The idea behind the __GFP_THISNODE addition, is that it is better to
provide local memory in PAGE_SIZE units than to use remote NUMA THP
backed memory. That largely depends on the remote latency though, on
threadrippers for example the overhead is relatively low in my
experience.

The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
extremely slow qemu startup with vfio, if the VM is larger than the
size of one host NUMA node. This is because it will try very hard to
unsuccessfully swapout get_user_pages pinned pages as result of the
__GFP_THISNODE being set, instead of falling back to PAGE_SIZE
allocations and instead of trying to allocate THP on other nodes (it
would be even worse without vfio type1 GUP pins of course, except it'd
be swapping heavily instead).

It's very easy to reproduce this by setting
transparent_hugepage/defrag to "always", even with a simple memhog.

1) This can be fixed by retaining the __GFP_THISNODE logic also for
__GFP_DIRECT_RELCAIM by allowing only one compaction run. Not even
COMPACT_SKIPPED (i.e. compaction failing because not enough free
memory in the zone) should be allowed to invoke reclaim.

2) An alternative is not use __GFP_THISNODE if __GFP_DIRECT_RELCAIM
has been set by the caller (i.e. MADV_HUGEPAGE or
defrag="always"). That would keep the NUMA locality restriction
only when __GFP_DIRECT_RECLAIM is not set by the caller. So THP
will be provided from remote nodes if available before falling back
to PAGE_SIZE units in the local node, but an app using defrag =
always (or madvise with MADV_HUGEPAGE) supposedly prefers that.

These are the results of 1) (higher GB/s is better).

Finished: 30 GB mapped, 10.188535s elapsed, 2.94GB/s
Finished: 34 GB mapped, 12.274777s elapsed, 2.77GB/s
Finished: 38 GB mapped, 13.847840s elapsed, 2.74GB/s
Finished: 42 GB mapped, 14.288587s elapsed, 2.94GB/s

Finished: 30 GB mapped, 8.907367s elapsed, 3.37GB/s
Finished: 34 GB mapped, 10.724797s elapsed, 3.17GB/s
Finished: 38 GB mapped, 14.272882s elapsed, 2.66GB/s
Finished: 42 GB mapped, 13.929525s elapsed, 3.02GB/s

These are the results of 2) (higher GB/s is better).

Finished: 30 GB mapped, 10.163159s elapsed, 2.95GB/s
Finished: 34 GB mapped, 11.806526s elapsed, 2.88GB/s
Finished: 38 GB mapped, 10.369081s elapsed, 3.66GB/s
Finished: 42 GB mapped, 12.357719s elapsed, 3.40GB/s

Finished: 30 GB mapped, 8.251396s elapsed, 3.64GB/s
Finished: 34 GB mapped, 12.093030s elapsed, 2.81GB/s
Finished: 38 GB mapped, 11.824903s elapsed, 3.21GB/s
Finished: 42 GB mapped, 15.950661s elapsed, 2.63GB/s

This is current upstream (higher GB/s is better).

Finished: 30 GB mapped, 8.821632s elapsed, 3.40GB/s
Finished: 34 GB mapped, 341.979543s elapsed, 0.10GB/s
Finished: 38 GB mapped, 761.933231s elapsed, 0.05GB/s
Finished: 42 GB mapped, 1188.409235s elapsed, 0.04GB/s

vfio is a good test because by pinning all memory it avoids the
swapping and reclaim only wastes CPU, a memhog based test would
created swapout storms and supposedly show a bigger stddev.

What is better between 1) and 2) depends on the hardware and on the
software. Virtualization EPT/NTP gets a bigger boost from THP as well
than host applications.

This commit implements 1).

Reported-by: Alex Williamson <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/gfp.h | 18 ++++++++++++++++++
mm/mempolicy.c | 12 +++++++++++-
mm/page_alloc.c | 4 ++++
3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index a6afcec53795..3c04d5d90e6d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -44,6 +44,7 @@ struct vm_area_struct;
#else
#define ___GFP_NOLOCKDEP 0
#endif
+#define ___GFP_ONLY_COMPACT 0x1000000u
/* If the above are modified, __GFP_BITS_SHIFT may need updating */

/*
@@ -178,6 +179,21 @@ struct vm_area_struct;
* definitely preferable to use the flag rather than opencode endless
* loop around allocator.
* Using this flag for costly allocations is _highly_ discouraged.
+ *
+ * __GFP_ONLY_COMPACT: Only invoke compaction. Do not try to succeed
+ * the allocation by freeing memory. Never risk to free any
+ * "PAGE_SIZE" memory unit even if compaction failed specifically
+ * because of not enough free pages in the zone. This only makes sense
+ * only in combination with __GFP_THISNODE (enforced with a
+ * VM_WARN_ON), to restrict the THP allocation in the local node that
+ * triggered the page fault and fallback into PAGE_SIZE allocations in
+ * the same node. We don't want to invoke reclaim because there may be
+ * plenty of free memory already in the local node. More importantly
+ * there may be even plenty of free THP available in remote nodes so
+ * we should allocate those if something instead of reclaiming any
+ * memory in the local node. Implementation detail: set ___GFP_NORETRY
+ * too so that ___GFP_ONLY_COMPACT only needs to be checked in a slow
+ * path.
*/
#define __GFP_IO ((__force gfp_t)___GFP_IO)
#define __GFP_FS ((__force gfp_t)___GFP_FS)
@@ -187,6 +203,8 @@ struct vm_area_struct;
#define __GFP_RETRY_MAYFAIL ((__force gfp_t)___GFP_RETRY_MAYFAIL)
#define __GFP_NOFAIL ((__force gfp_t)___GFP_NOFAIL)
#define __GFP_NORETRY ((__force gfp_t)___GFP_NORETRY)
+#define __GFP_ONLY_COMPACT ((__force gfp_t)(___GFP_NORETRY | \
+ ___GFP_ONLY_COMPACT))

/*
* Action modifiers
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d6512ef28cde..6bf839f20dcc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2047,8 +2047,18 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,

if (!nmask || node_isset(hpage_node, *nmask)) {
mpol_cond_put(pol);
+ /*
+ * We restricted the allocation to the
+ * hpage_node so we must use
+ * __GFP_ONLY_COMPACT to allow at most a
+ * compaction attempt and not ever get into
+ * reclaim or it'll swap heavily with
+ * transparent_hugepage/defrag = always (or
+ * madvise under MADV_HUGEPAGE).
+ */
page = __alloc_pages_node(hpage_node,
- gfp | __GFP_THISNODE, order);
+ gfp | __GFP_THISNODE |
+ __GFP_ONLY_COMPACT, order);
goto out;
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a790ef4be74e..01a5c2bd0860 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4144,6 +4144,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
*/
if (compact_result == COMPACT_DEFERRED)
goto nopage;
+ if (gfp_mask & __GFP_ONLY_COMPACT) {
+ VM_WARN_ON(!(gfp_mask & __GFP_THISNODE));
+ goto nopage;
+ }

/*
* Looks like reclaim/compaction is worth trying, but

2018-11-27 22:52:36

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Tue, Nov 27, 2018 at 12:57 PM Andrea Arcangeli <[email protected]> wrote:
>
> This difference can only happen with defrag=always, and that's not the
> current upstream default.

Ok, thanks. That makes it a bit less critical.

> That MADV_HUGEPAGE causes flights with NUMA balancing is not great
> indeed, qemu needs NUMA locality too, but then the badness caused by
> __GFP_THISNODE was a larger regression in the worst case for qemu.
[...]
> So the short term alternative again would be the alternate patch that
> does __GFP_THISNODE|GFP_ONLY_COMPACT appended below.

Sounds like we should probably do this. Particularly since Vlastimil
pointed out that we'd otherwise have issues with the back-port for 4.4
where that "defrag=always" was the default.

The patch doesn't look horrible, and it directly addresses this
particular issue.

Is there some reason we wouldn't want to do it?

Linus

2018-11-28 03:21:42

by Huang, Ying

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

Andrea Arcangeli <[email protected]> writes:

> Hi Linus,
>
> On Tue, Nov 27, 2018 at 09:08:50AM -0800, Linus Torvalds wrote:
>> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
>> <[email protected]> wrote:
>> >
>> > FYI, we noticed a -61.3% regression of vm-scalability.throughput due
>> > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
>> > MADV_HUGEPAGE mappings")
>>
>> Well, that's certainly noticeable and not good.
>
> Noticed this email too.
>
> This difference can only happen with defrag=always, and that's not the
> current upstream default.
>
> thp_enabled: always
> thp_defrag: always
> ^^^^^^^^^^^^^^^^^^ emulates MADV_HUGEPAGE always set
>
>> Andrea, I suspect it might be causing fights with auto numa migration..
>
> That MADV_HUGEPAGE causes flights with NUMA balancing is not great
> indeed, qemu needs NUMA locality too, but then the badness caused by
> __GFP_THISNODE was a larger regression in the worst case for qemu.
>
> When the kernel wants to allocate a THP from node A, if there are no
> THP generated on node A but there are in node B, they'll be picked from
> node B now.
>
> __GFP_THISNODE previously prevented any THP to be allocated from any
> node except A. This introduces a higher chance of initial misplacement
> which NUMA balancing will correct over time, but it should only apply
> to long lived allocations under MADV_HUGEPAGE. Perhaps the workload
> here uses short lived allocations and sets defrag=always which is not
> optimal to begin with?
>
> The motivation of the fix, is that the previous code invoked reclaim
> with __GFP_THISNODE set. That looked insecure and such behavior should
> only have been possible under a mlock/mempolicy
> capability. __GFP_THISNODE is like a transient but still privileged
> mbind for reclaim.
>
> Before the fix, __GFP_THISNODE would end up swapping out everything
> from node A to free 4k pages from node A, despite perhaps there were
> gigabytes of memory free in node B. That caused severe regression to
> threaded workloads whose memory spanned more than one NUMA node. So
> again going back doesn't sounds great for NUMA in general.
>
> The vmscalability test is most certainly not including any
> multithreaded process whose memory doesn't fit in a single NUMA node
> or we'd see also the other side of the tradeoff. It'd be nice to add
> such a test to be sure that the old __GFP_THISNODE behavior won't
> happen again for apps that don't fit in a single node.

The test case is to test swap subsystem. Where tens (32 in test job)
processes are run to eat memory to trigger to swap to NVMe disk. The
memory size to eat is almost same in this commit and its parent. But I
found the swap triggered is much more for this commit.

70934968 ± 10% +51.7% 1.076e+08 ± 3% proc-vmstat.pswpout

One possibility is that for parent commit, some processes exit much
earlier than other processes, so the total memory requirement of the
whole system is much lower. So I dig more on test log and found,

For the parent commit,

$ grep 'usecs =' vm-scalability
24573771360 bytes / 13189705 usecs = 1819435 KB/s
24573771360 bytes / 13853913 usecs = 1732205 KB/s
24573771360 bytes / 42953388 usecs = 558694 KB/s
24573771360 bytes / 52782761 usecs = 454652 KB/s
24573771360 bytes / 84026989 usecs = 285596 KB/s
24573771360 bytes / 111677310 usecs = 214885 KB/s
24573771360 bytes / 146084698 usecs = 164273 KB/s
24573771360 bytes / 146978329 usecs = 163274 KB/s
24573771360 bytes / 149371224 usecs = 160658 KB/s
24573771360 bytes / 162892070 usecs = 147323 KB/s
24573771360 bytes / 177949001 usecs = 134857 KB/s
24573771360 bytes / 181729992 usecs = 132052 KB/s
24573771360 bytes / 189812698 usecs = 126428 KB/s
24573771360 bytes / 190992698 usecs = 125647 KB/s
24573771360 bytes / 200039238 usecs = 119965 KB/s
24573771360 bytes / 201254461 usecs = 119241 KB/s
24573771360 bytes / 202825077 usecs = 118317 KB/s
24573771360 bytes / 203441285 usecs = 117959 KB/s
24573771360 bytes / 205378150 usecs = 116847 KB/s
24573771360 bytes / 204840555 usecs = 117153 KB/s
24573771360 bytes / 206235458 usecs = 116361 KB/s
24573771360 bytes / 206419877 usecs = 116257 KB/s
24573771360 bytes / 206619347 usecs = 116145 KB/s
24573771360 bytes / 206942267 usecs = 115963 KB/s
24573771360 bytes / 210289229 usecs = 114118 KB/s
24573771360 bytes / 210504531 usecs = 114001 KB/s
24573771360 bytes / 210521351 usecs = 113992 KB/s
24573771360 bytes / 211012852 usecs = 113726 KB/s
24573771360 bytes / 211547509 usecs = 113439 KB/s
24573771360 bytes / 212179521 usecs = 113101 KB/s
24573771360 bytes / 212907825 usecs = 112714 KB/s
24573771360 bytes / 215558786 usecs = 111328 KB/s

For this commit,

$ grep 'usecs =' vm-scalability
24573681072 bytes / 203705073 usecs = 117806 KB/s
24573681072 bytes / 216146130 usecs = 111025 KB/s
24573681072 bytes / 257234408 usecs = 93291 KB/s
24573681072 bytes / 259530715 usecs = 92465 KB/s
24573681072 bytes / 261335046 usecs = 91827 KB/s
24573681072 bytes / 260134706 usecs = 92251 KB/s
24573681072 bytes / 258848653 usecs = 92709 KB/s
24573681072 bytes / 259889050 usecs = 92338 KB/s
24573681072 bytes / 265457907 usecs = 90401 KB/s
24573681072 bytes / 261698183 usecs = 91700 KB/s
24573681072 bytes / 266806783 usecs = 89944 KB/s
24573681072 bytes / 273096611 usecs = 87872 KB/s
24573681072 bytes / 273601276 usecs = 87710 KB/s
24573681072 bytes / 276132454 usecs = 86906 KB/s
24573681072 bytes / 274162852 usecs = 87530 KB/s
24573681072 bytes / 277901662 usecs = 86353 KB/s
24573681072 bytes / 282373557 usecs = 84985 KB/s
24573681072 bytes / 278202538 usecs = 86259 KB/s
24573681072 bytes / 283311157 usecs = 84704 KB/s
24573681072 bytes / 284181483 usecs = 84445 KB/s
24573681072 bytes / 283331985 usecs = 84698 KB/s
24573681072 bytes / 284573067 usecs = 84328 KB/s
24573681072 bytes / 277832459 usecs = 86374 KB/s
24573681072 bytes / 284753391 usecs = 84275 KB/s
24573681072 bytes / 287701035 usecs = 83412 KB/s
24573681072 bytes / 287816910 usecs = 83378 KB/s
24573681072 bytes / 287871244 usecs = 83362 KB/s
24573681072 bytes / 288322443 usecs = 83232 KB/s
24573681072 bytes / 288750156 usecs = 83108 KB/s
24573681072 bytes / 289595079 usecs = 82866 KB/s
24573681072 bytes / 289741926 usecs = 82824 KB/s
24573681072 bytes / 290746427 usecs = 82538 KB/s

From the above data, for the parent commit 3 processes exited within
14s, another 3 exited within 100s. For this commit, the first process
exited at 203s. That is, this commit makes memory allocation more fair
among processes, so that processes proceeded at more similar speed. But
this raises system memory footprint too, so triggered much more swap,
thus lower benchmark score.

In general, memory allocation fairness among processes should be a good
thing. So I think the report should have been a "performance
improvement" instead of "performance regression".

Best Regards,
Huang, Ying

2018-11-28 06:31:36

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Tue 27-11-18 14:50:05, Linus Torvalds wrote:
> On Tue, Nov 27, 2018 at 12:57 PM Andrea Arcangeli <[email protected]> wrote:
> >
> > This difference can only happen with defrag=always, and that's not the
> > current upstream default.
>
> Ok, thanks. That makes it a bit less critical.
>
> > That MADV_HUGEPAGE causes flights with NUMA balancing is not great
> > indeed, qemu needs NUMA locality too, but then the badness caused by
> > __GFP_THISNODE was a larger regression in the worst case for qemu.
> [...]
> > So the short term alternative again would be the alternate patch that
> > does __GFP_THISNODE|GFP_ONLY_COMPACT appended below.
>
> Sounds like we should probably do this. Particularly since Vlastimil
> pointed out that we'd otherwise have issues with the back-port for 4.4
> where that "defrag=always" was the default.
>
> The patch doesn't look horrible, and it directly addresses this
> particular issue.
>
> Is there some reason we wouldn't want to do it?

We have discussed it previously and the biggest concern was that it
introduces a new GFP flag with a very weird and one-off semantic.
Anytime we have done that in the past it basically kicked back because
people have started to use such a flag and any further changes were
really hard to do. So I would really prefer some more systematic
solution. And I believe we can do that here. MADV_HUGEPAGE (resp. THP
always enabled) has gained a local memory policy with the patch which
got effectively reverted. I do believe that conflating "I want THP" with
"I want them local" is just wrong from the API point of view. There are
different classes of usecases which obviously disagree on the later.

So I believe that a long term solution should introduce a
MPOL_NODE_RECLAIM kind of policy. It would effectively reclaim local
nodes (within NODE_RECLAIM distance) before falling to other nodes.

Apart from that we need a less disruptive reclaim driven by compaction
and Mel is already working on that AFAIK.
--
Michal Hocko
SUSE Labs

2018-11-28 16:51:13

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying <[email protected]> wrote:
>
> From the above data, for the parent commit 3 processes exited within
> 14s, another 3 exited within 100s. For this commit, the first process
> exited at 203s. That is, this commit makes memory allocation more fair
> among processes, so that processes proceeded at more similar speed. But
> this raises system memory footprint too, so triggered much more swap,
> thus lower benchmark score.
>
> In general, memory allocation fairness among processes should be a good
> thing. So I think the report should have been a "performance
> improvement" instead of "performance regression".

Hey, when you put it that way...

Let's ignore this issue for now, and see if it shows up in some real
workload and people complain.

Linus

2018-11-28 18:42:21

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, Nov 28, 2018 at 08:48:46AM -0800, Linus Torvalds wrote:
> On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying <[email protected]> wrote:
> >
> > From the above data, for the parent commit 3 processes exited within
> > 14s, another 3 exited within 100s. For this commit, the first process
> > exited at 203s. That is, this commit makes memory allocation more fair
> > among processes, so that processes proceeded at more similar speed. But
> > this raises system memory footprint too, so triggered much more swap,
> > thus lower benchmark score.

Ok so it was the previous more unfair behavior that increased overall
performance. It was also unclear to me that this was a full swap storm
test.

> > In general, memory allocation fairness among processes should be a good
> > thing. So I think the report should have been a "performance
> > improvement" instead of "performance regression".
>
> Hey, when you put it that way...
>
> Let's ignore this issue for now, and see if it shows up in some real
> workload and people complain.

Agreed.

With regard to the other question about 4.4 backports, 4.0 didn't have
__GFP_THISNODE, so this will still revert to the previous behavior and
it won't risk to land into uncharted territory. So there should be no
major concern for the backports.

We should still work on improving this area, for now the idea was to
apply a strict hotfix that would just revert to the previous behavior
without introducing new features and new APIs, that would also have
the side effect of diminishing THP utilization under MADV_HUGEPAGE.

Thanks!
Andrea

2018-11-28 23:11:42

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, 28 Nov 2018, Linus Torvalds wrote:

> On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying <[email protected]> wrote:
> >
> > From the above data, for the parent commit 3 processes exited within
> > 14s, another 3 exited within 100s. For this commit, the first process
> > exited at 203s. That is, this commit makes memory allocation more fair
> > among processes, so that processes proceeded at more similar speed. But
> > this raises system memory footprint too, so triggered much more swap,
> > thus lower benchmark score.
> >
> > In general, memory allocation fairness among processes should be a good
> > thing. So I think the report should have been a "performance
> > improvement" instead of "performance regression".
>
> Hey, when you put it that way...
>
> Let's ignore this issue for now, and see if it shows up in some real
> workload and people complain.
>

Well, I originally complained[*] when the change was first proposed and
when the stable backports were proposed[**]. On a fragmented host, the
change itself showed a 13.9% access latency regression on Haswell and up
to 40% allocation latency regression. This is more substantial on Naples
and Rome. I also measured similar numbers to this for Haswell.

We are particularly hit hard by this because we have libraries that remap
the text segment of binaries to hugepages; hugetlbfs is not widely used so
this normally falls back to transparent hugepages. We mmap(),
madvise(MADV_HUGEPAGE), memcpy(), mremap(). We fully accept the latency
to do this when the binary starts because the access latency at runtime is
so much better.

With this change, however, we have no userspace workaround other than
mbind() to prefer the local node. On all of our platforms, native sized
pages are always a win over remote hugepages and it leaves open the
opportunity that we collapse memory into hugepages later by khugepaged if
fragmentation is the issue. mbind() is not viable if the local node is
saturated, we are ok with falling back to remote pages of the native page
size when the local node is oom; this would result in an oom kill if we
used it to retain the old behavior.

Given this severe access and allocation latency regression, we must revert
this patch in our own kernel, there is simply no path forward without
doing so.

[*] https://marc.info/?l=linux-kernel&m=153868420126775
[**] https://marc.info/?l=linux-kernel&m=154269994800842

2018-12-03 18:04:36

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, Nov 28, 2018 at 8:48 AM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying <[email protected]> wrote:
> >
> > In general, memory allocation fairness among processes should be a good
> > thing. So I think the report should have been a "performance
> > improvement" instead of "performance regression".
>
> Hey, when you put it that way...
>
> Let's ignore this issue for now, and see if it shows up in some real
> workload and people complain.

Well, David Rientjes points out that it *does* cause real problems in
real workloads, so it's not just this benchmark run that shows the
issue.

So I guess we should revert, or at least fix. David, please post your
numbers again in public along with your suggested solution...

Linus

2018-12-03 18:16:06

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon 03-12-18 10:01:18, Linus Torvalds wrote:
> On Wed, Nov 28, 2018 at 8:48 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying <[email protected]> wrote:
> > >
> > > In general, memory allocation fairness among processes should be a good
> > > thing. So I think the report should have been a "performance
> > > improvement" instead of "performance regression".
> >
> > Hey, when you put it that way...
> >
> > Let's ignore this issue for now, and see if it shows up in some real
> > workload and people complain.
>
> Well, David Rientjes points out that it *does* cause real problems in
> real workloads, so it's not just this benchmark run that shows the
> issue.

The thing is that there is no universal win here. There are two
different types of workloads and we cannot satisfy both. This has been
discussed at lenght during the review process. David's workload makes
some assumptions about the MADV_HUGEPAGE numa placement. There are other
workalods like KVM setups which do not really require that and those are
ones which regressed.

The prevalent consensus was that a NUMA placement is not really implied
by MADV_HUGEPAGE because a) this has never been documented or intended
behavior and b) it is not a universal win (basically the same as
node/zone_reclaim which used to be on by default on some NUMA setups).

Reverting the patch would regress another class of workloads. As we
cannot satisfy both I believe we should make the API clear and in favor
of a more relaxed workloads. Those with special requirements should have
a proper API to reflect that (this is our general NUMA policy pattern
already).
--
Michal Hocko
SUSE Labs

2018-12-03 18:22:32

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, Dec 3, 2018 at 10:15 AM Michal Hocko <[email protected]> wrote:
>
> The thing is that there is no universal win here. There are two
> different types of workloads and we cannot satisfy both.

Ok, if that's the case, then I'll just revert the commit.

Michal, our rules are very simple: we don't generate regressions. It's
better to have old reliable behavior than to start creating *new*
problems.

Linus

2018-12-03 18:31:53

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon 03-12-18 10:19:55, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 10:15 AM Michal Hocko <[email protected]> wrote:
> >
> > The thing is that there is no universal win here. There are two
> > different types of workloads and we cannot satisfy both.
>
> Ok, if that's the case, then I'll just revert the commit.
>
> Michal, our rules are very simple: we don't generate regressions. It's
> better to have old reliable behavior than to start creating *new*
> problems.

I do not get it. 5265047ac301 which this patch effectively reverts has
regressed kvm workloads. People started to notice only later because
they were not running on kernels with that commit until later. We have
4.4 based kernels reports. What do you propose to do for those people?
Let me remind that it was David who introduced 5265047ac301, presumably
because his workload benefits from it.
--
Michal Hocko
SUSE Labs

2018-12-03 18:47:17

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, Dec 3, 2018 at 10:30 AM Michal Hocko <[email protected]> wrote:
>
> I do not get it. 5265047ac301 which this patch effectively reverts has
> regressed kvm workloads. People started to notice only later because
> they were not running on kernels with that commit until later. We have
> 4.4 based kernels reports. What do you propose to do for those people?

We have at least two patches that others claim to fix things.

You dismissed them and said "can't be done".

As a result, I'm not really interested in this discussion.

Linus

2018-12-03 19:00:58

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon 03-12-18 10:45:35, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 10:30 AM Michal Hocko <[email protected]> wrote:
> >
> > I do not get it. 5265047ac301 which this patch effectively reverts has
> > regressed kvm workloads. People started to notice only later because
> > they were not running on kernels with that commit until later. We have
> > 4.4 based kernels reports. What do you propose to do for those people?
>
> We have at least two patches that others claim to fix things.
>
> You dismissed them and said "can't be done".

You are misinterpreting my words. I haven't dismissed anything. I do
recognize both usecases under discussion.

I have merely said that a better THP locality needs more work and during
the review discussion I have even volunteered to work on that. There
are other reclaim related fixes under work right now. All I am saying
is that MADV_TRANSHUGE having numa locality implications cannot satisfy
all the usecases and it is particurarly KVM that suffers from it.
--
Michal Hocko
SUSE Labs

2018-12-03 19:25:56

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, Dec 03, 2018 at 07:59:54PM +0100, Michal Hocko wrote:
> I have merely said that a better THP locality needs more work and during
> the review discussion I have even volunteered to work on that. There
> are other reclaim related fixes under work right now. All I am saying
> is that MADV_TRANSHUGE having numa locality implications cannot satisfy
> all the usecases and it is particurarly KVM that suffers from it.

I'd like to clarify it's not just KVM, we found with KVM because for
KVM it's fairly common to create VM that won't possibly fit in a
single node, while most other apps don't tend to allocate that much
memory.

It's trivial to reproduce the badness by running a memhog process that
allocates more than the RAM of 1 NUMA node, under defrag=always
setting (or by changing memhog to use MADV_HUGEPAGE) and it'll create
swap storms despite 75% of the RAM is completely free in a 4 node NUMA
(or 50% of RAM free in a 2 node NUMA) etc..

How can it be ok to push the system into gigabytes of swap by default
without any special capability despite 50% - 75% or more of the RAM is
free? That's the downside of the __GFP_THISNODE optimizaton.

__GFP_THISNODE helps increasing NUMA locality if your app can fit in a
single node which is the common David's workload. But if his workload
would more often than not fit in a single node, he would also run into
an unacceptable slowdown because of the __GFP_THISNODE.

I think there's lots of room for improvement for the future, but in my
view that __GFP_THISNODE as it was implemented was an incomplete hack,
that opened the door for bad VM corner cases that should not happen.

It also would be nice to have a reproducer for David's workload, the
software to run the binary on THP is not released either. We have lots
of reproducer for the corner case introduced by the __GFP_THISNODE
trick.

So this is basically a revert of the commit that made MADV_HUGEPAGE
with __GFP_THISNODE behave like a privileged (although not as static)
mbind.

I provided an alternative but we weren't sure if that was the best
long term solution that could satisfy everyone because it does have
some drawback too.

Thanks,
Andrea

2018-12-03 19:30:30

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, Dec 3, 2018 at 10:59 AM Michal Hocko <[email protected]> wrote:
>
> You are misinterpreting my words. I haven't dismissed anything. I do
> recognize both usecases under discussion.
>
> I have merely said that a better THP locality needs more work and during
> the review discussion I have even volunteered to work on that.

We have two known patches that seem to have no real downsides.

One is the patch posted by Andrea earlier in this thread, which seems
to target just this known regression.

The other seems to be to revert commit ac5b2c1891 and instead apply

https://lore.kernel.org/lkml/[email protected]/

which also seems to be sensible.

I'm not seeing why the KVM load would react badly to either of those
models, and they are known to fix the google local-node issue.

Linus

2018-12-03 20:13:10

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, Dec 03, 2018 at 11:28:07AM -0800, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 10:59 AM Michal Hocko <[email protected]> wrote:
> >
> > You are misinterpreting my words. I haven't dismissed anything. I do
> > recognize both usecases under discussion.
> >
> > I have merely said that a better THP locality needs more work and during
> > the review discussion I have even volunteered to work on that.
>
> We have two known patches that seem to have no real downsides.
>
> One is the patch posted by Andrea earlier in this thread, which seems
> to target just this known regression.

For the short term the important thing is to fix the VM regression one
way or another, I don't personally mind which way.

> The other seems to be to revert commit ac5b2c1891 and instead apply
>
> https://lore.kernel.org/lkml/[email protected]/
>
> which also seems to be sensible.

In my earlier review of David's patch, it looked runtime equivalent to
the __GFP_COMPACT_ONLY solution. It has the only advantage of adding a
new gfpflag until we're sure we need it but it's the worst solution
available for the long term in my view. It'd be ok to apply it as
stop-gap measure though.

The "order == pageblock_order" hardcoding inside the allocator to
workaround the __GFP_THISNODE flag passed from outside the allocator
in the THP MADV_HUGEPAGE case, didn't look very attractive because
it's not just THP allocating order >0 pages.

It'd be nicer if whatever compaction latency optimization that applies
to THP could also apply to all other allocation orders too and the
hardcoding of the THP order prevents that.

On the same lines if __GFP_THISNODE is so badly needed by
MADV_HUGEPAGE, all other larger order allocations should also be able
to take advantage of __GFP_THISNODE without ending in the same VM
corner cases that required the "order == pageblock_order" hardcoding
inside the allocator.

If you prefer David's patch I would suggest pageblock_order to be
replaced with HPAGE_PMD_ORDER so it's more likely to match the THP
order in all archs.

Thanks,
Andrea

2018-12-03 20:28:18

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, 3 Dec 2018, Andrea Arcangeli wrote:

> It's trivial to reproduce the badness by running a memhog process that
> allocates more than the RAM of 1 NUMA node, under defrag=always
> setting (or by changing memhog to use MADV_HUGEPAGE) and it'll create
> swap storms despite 75% of the RAM is completely free in a 4 node NUMA
> (or 50% of RAM free in a 2 node NUMA) etc..
>
> How can it be ok to push the system into gigabytes of swap by default
> without any special capability despite 50% - 75% or more of the RAM is
> free? That's the downside of the __GFP_THISNODE optimizaton.
>

The swap storm is the issue that is being addressed. If your remote
memory is as low as local memory, the patch to clear __GFP_THISNODE has
done nothing to fix it: you still get swap storms and memory compaction
can still fail if the per-zone freeing scanner cannot utilize the
reclaimed memory. Recall that this patch to clear __GFP_THISNODE was
measured by me to have a 40% increase in allocation latency for fragmented
remote memory on Haswell. It makes the problem much, much worse.

> __GFP_THISNODE helps increasing NUMA locality if your app can fit in a
> single node which is the common David's workload. But if his workload
> would more often than not fit in a single node, he would also run into
> an unacceptable slowdown because of the __GFP_THISNODE.
>

Which is why I have suggested that we do not do direct reclaim, as the
page allocator implementation expects all thp page fault allocations to
have __GFP_NORETRY set, because no amount of reclaim can be shown to be
useful to the memory compaction freeing scanner if it is iterated over by
the migration scanner.

> I think there's lots of room for improvement for the future, but in my
> view that __GFP_THISNODE as it was implemented was an incomplete hack,
> that opened the door for bad VM corner cases that should not happen.
>

__GFP_THISNODE is intended specifically because of the remote access
latency increase that is encountered if you fault remote hugepages over
local pages of the native page size.

2018-12-03 20:37:01

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, 3 Dec 2018, Andrea Arcangeli wrote:

> In my earlier review of David's patch, it looked runtime equivalent to
> the __GFP_COMPACT_ONLY solution. It has the only advantage of adding a
> new gfpflag until we're sure we need it but it's the worst solution
> available for the long term in my view. It'd be ok to apply it as
> stop-gap measure though.
>
> The "order == pageblock_order" hardcoding inside the allocator to
> workaround the __GFP_THISNODE flag passed from outside the allocator
> in the THP MADV_HUGEPAGE case, didn't look very attractive because
> it's not just THP allocating order >0 pages.
>

We have two different things to consider: NUMA locality and the order of
the allocation. THP is preferred locally and we know the order. For the
other high-order pages you're referring to, I don't know if they are using
__GFP_THISNODE or not (likely not). I see them as two separate issues.

For thp on all platforms I have measured it on specifically for this patch
(Broadwell, Haswell, Rome) there is a clear advantage to faulting local
pages of the native page size over remote hugepages. It also has the
added effect of allowing khugepaged to collapse it into a hugepage later
if fragmentation allows (the reason why khugepaged cares about NUMA
locality, the same reason I do). This is the rationale for __GFP_THISNODE
for thp allocations.

For order == pageblock_order (or more correctly order >= pageblock_order),
this is not based on NUMA whatsoever but is rather based on the
implementation of memory compaction. If it has already failed (or was
deferred for order-HPAGE_PMD_ORDER), reclaim cannot be shown to help if
memory compaction cannot utilize the freed memory in isolate_freepages(),
so that reclaim has been pointless. If compaction fails for other reasons
(any unmovable page preventing a pageblock from becoming free), *all*
reclaim activity has been pointless.

> It'd be nicer if whatever compaction latency optimization that applies
> to THP could also apply to all other allocation orders too and the
> hardcoding of the THP order prevents that.
>
> On the same lines if __GFP_THISNODE is so badly needed by
> MADV_HUGEPAGE, all other larger order allocations should also be able
> to take advantage of __GFP_THISNODE without ending in the same VM
> corner cases that required the "order == pageblock_order" hardcoding
> inside the allocator.
>
> If you prefer David's patch I would suggest pageblock_order to be
> replaced with HPAGE_PMD_ORDER so it's more likely to match the THP
> order in all archs.
>

That sounds fine and I will do that in my v2.

2018-12-03 20:43:10

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, 3 Dec 2018, Michal Hocko wrote:

> I have merely said that a better THP locality needs more work and during
> the review discussion I have even volunteered to work on that. There
> are other reclaim related fixes under work right now. All I am saying
> is that MADV_TRANSHUGE having numa locality implications cannot satisfy
> all the usecases and it is particurarly KVM that suffers from it.

I think extending functionality so thp can be allocated remotely if truly
desired is worthwhile just so long as it does not cause regressions for
other users. I think that is separate from the swap storm regression that
Andrea is reporting, however, since that would also exist even if we
allowed remote thp allocations when the host is fully fragmented rather
than only locally fragmented.

2018-12-03 21:26:56

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon 03-12-18 12:39:34, David Rientjes wrote:
> On Mon, 3 Dec 2018, Michal Hocko wrote:
>
> > I have merely said that a better THP locality needs more work and during
> > the review discussion I have even volunteered to work on that. There
> > are other reclaim related fixes under work right now. All I am saying
> > is that MADV_TRANSHUGE having numa locality implications cannot satisfy
> > all the usecases and it is particurarly KVM that suffers from it.
>
> I think extending functionality so thp can be allocated remotely if truly
> desired is worthwhile

This is a complete NUMA policy antipatern that we have for all other
user memory allocations. So far you have to be explicit for your numa
requirements. You are trying to conflate NUMA api with MADV and that is
just conflating two orthogonal things and that is just wrong.

Let's put the __GFP_THISNODE issue aside. I do not remember you
confirming that __GFP_COMPACT_ONLY patch is OK for you (sorry it might
got lost in the emails storm from back then) but if that is the only
agreeable solution for now then I can live with that. __GFP_NORETRY hack
was shown to not work properly by Mel AFAIR. Again if I misremember then
I am sorry and I can live with that. But conflating MADV_TRANSHUGE with
an implicit numa placement policy and/or adding an opt-in for remote
NUMA placing is completely backwards and a broken API which will likely
bites us later. I sincerely hope we are not going to repeat mistakes
from the past.
--
Michal Hocko
SUSE Labs

2018-12-03 21:56:16

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, 3 Dec 2018, Michal Hocko wrote:

> > I think extending functionality so thp can be allocated remotely if truly
> > desired is worthwhile
>
> This is a complete NUMA policy antipatern that we have for all other
> user memory allocations. So far you have to be explicit for your numa
> requirements. You are trying to conflate NUMA api with MADV and that is
> just conflating two orthogonal things and that is just wrong.
>

No, the page allocator change for both my patch and __GFP_COMPACT_ONLY has
nothing to do with any madvise() mode. It has to do with where thp
allocations are preferred. Yes, this is different than other memory
allocations where it doesn't cause a 13.9% access latency regression for
the lifetime of a binary for users who back their text with hugepages.
MADV_HUGEPAGE still has its purpose to try synchronous memory compaction
at fault time under all thp defrag modes other than "never". The specific
problem being reported here, and that both my patch and __GFP_COMPACT_ONLY
address, is the pointless reclaim activity that does not assist in making
compaction more successful.

> Let's put the __GFP_THISNODE issue aside. I do not remember you
> confirming that __GFP_COMPACT_ONLY patch is OK for you (sorry it might
> got lost in the emails storm from back then) but if that is the only
> agreeable solution for now then I can live with that.

The discussion between my patch and Andrea's patch seemed to only be about
whether this should be a gfp bit or not

> __GFP_NORETRY hack
> was shown to not work properly by Mel AFAIR. Again if I misremember then
> I am sorry and I can live with that.

Andrea's patch as posted in this thread sets __GFP_NORETRY for
__GFP_ONLY_COMPACT, so both my patch and his patch require it. His patch
gets this behavior for page faults by way of alloc_pages_vma(), mine gets
it from modifying GFP_TRANSHUGE.

> But conflating MADV_TRANSHUGE with
> an implicit numa placement policy and/or adding an opt-in for remote
> NUMA placing is completely backwards and a broken API which will likely
> bites us later. I sincerely hope we are not going to repeat mistakes
> from the past.

Assuming s/MADV_TRANSHUGE/MADV_HUGEPAGE/. Again, this is *not* about the
madvise(); it's specifically about the role of direct reclaim in the
allocation of a transparent hugepage at fault time regardless of any
madvise() because you can get the same behavior with defrag=always (and
the inconsistent use of __GFP_NORETRY there that is fixed by both of our
patches).

2018-12-03 22:06:16

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, Dec 3, 2018 at 12:12 PM Andrea Arcangeli <[email protected]> wrote:
>
> On Mon, Dec 03, 2018 at 11:28:07AM -0800, Linus Torvalds wrote:
> >
> > One is the patch posted by Andrea earlier in this thread, which seems
> > to target just this known regression.
>
> For the short term the important thing is to fix the VM regression one
> way or another, I don't personally mind which way.
>
> > The other seems to be to revert commit ac5b2c1891 and instead apply
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > which also seems to be sensible.
>
> In my earlier review of David's patch, it looked runtime equivalent to
> the __GFP_COMPACT_ONLY solution. It has the only advantage of adding a

I think there's a missing "not" in the above.

> new gfpflag until we're sure we need it but it's the worst solution
> available for the long term in my view. It'd be ok to apply it as
> stop-gap measure though.

So I have no really strong opinions either way.

I looking at the two options, I think I'd personally have a slight
preference for that patch by David, not so much because it doesn't add
a new GFP flag, but because it seems to make it a lot more explicit
that GFP_TRANSHUGE_LIGHT automatically implies __GFP_NORETRY.

I think that makes a whole lot of conceptual sense with the whole
meaning of GFP_TRANSHUGE_LIGHT. It's all about "no
reclaim/compaction", but honestly, doesn't __GFP_NORETRY make sense?

So I look at David's patch, and I go "that makes sense", and then I
compare it with ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
MADV_HUGEPAGE mappings") and that makes me go "ok, that's a hack".

So *if* reverting ac5b2c18911f and applying David's patch instead
fixes the KVM latency issues (which I assume it really should do,
simply thanks to __GFP_NORETRY), then I think that makes more sense.

That said, I do agree that the

if (order == pageblock_order ...)

test in __alloc_pages_slowpath() in David's patch then argues for
"that looks hacky". But that code *is* inside the test for

if (costly_order && (gfp_mask & __GFP_NORETRY)) {

so within the context of that (not visible in the patch itself), it
looks like a sensible model. The whole point of that block is, as the
comment above it says

/*
* Checks for costly allocations with __GFP_NORETRY, which
* includes THP page fault allocations
*/

so I think all of David's patch is somewhat sensible, even if that
specific "order == pageblock_order" test really looks like it might
want to be clarified.

BUT.

With all that said, I really don't mind that __GFP_COMPACT_ONLY
approach either. I think David's patch makes sense in a bigger
context, while the __GFP_COMPACT_ONLY patch makes sense in the context
of "let's just fix this _particular_ special case.

As long as both work (and apparently they do), either is perfectly find by me.

Some kind of "Thunderdome for patches" is needed, with an epic soundtrack.

"Two patches enter, one patch leaves!"

I don't so much care which one.

Linus

2018-12-03 22:29:45

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, Dec 3, 2018 at 2:04 PM Linus Torvalds
<[email protected]> wrote:
>
> so I think all of David's patch is somewhat sensible, even if that
> specific "order == pageblock_order" test really looks like it might
> want to be clarified.

Side note: I think maybe people should just look at that whole
compaction logic for that block, because it doesn't make much sense to
me:

/*
* Checks for costly allocations with __GFP_NORETRY, which
* includes THP page fault allocations
*/
if (costly_order && (gfp_mask & __GFP_NORETRY)) {
/*
* If compaction is deferred for high-order allocations,
* it is because sync compaction recently failed. If
* this is the case and the caller requested a THP
* allocation, we do not want to heavily disrupt the
* system, so we fail the allocation instead of entering
* direct reclaim.
*/
if (compact_result == COMPACT_DEFERRED)
goto nopage;

/*
* Looks like reclaim/compaction is worth trying, but
* sync compaction could be very expensive, so keep
* using async compaction.
*/
compact_priority = INIT_COMPACT_PRIORITY;
}

this is where David wants to add *his* odd test, and I think everybody
looks at that added case

+ if (order == pageblock_order &&
+ !(current->flags & PF_KTHREAD))
+ goto nopage;

and just goes "Eww".

But I think the real problem is that it's the "goto nopage" thing that
makes _sense_, and the current cases for "let's try compaction" that
are the odd ones, and then David adds one new special case for the
sensible behavior.

For example, why would COMPACT_DEFERRED mean "don't bother", but not
all the other reasons it didn't really make sense?

So does it really make sense to fall through AT ALL to that "retry"
case, when we explicitly already had (gfp_mask & __GFP_NORETRY)?

Maybe the real fix is to instead of adding yet another special case
for "goto nopage", it should just be unconditional: simply don't try
to compact large-pages if __GFP_NORETRY was set.

Hmm? I dunno. Right now - for 4.20, I'd obviously want to keep changes
smallish, so a hacky added special case might be the right thing to
do. But the code does look odd, doesn't it?

I think part of it comes from the fact that we *used* to do the
compaction first, and then we did the reclaim, and then it was
re-orghanized to do reclaim first, but it tried to keep semantic
changes minimal and some of the above comes from that re-org.

I think.

Linus

2018-12-03 22:58:54

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon, 3 Dec 2018, Linus Torvalds wrote:

> Side note: I think maybe people should just look at that whole
> compaction logic for that block, because it doesn't make much sense to
> me:
>
> /*
> * Checks for costly allocations with __GFP_NORETRY, which
> * includes THP page fault allocations
> */
> if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> /*
> * If compaction is deferred for high-order allocations,
> * it is because sync compaction recently failed. If
> * this is the case and the caller requested a THP
> * allocation, we do not want to heavily disrupt the
> * system, so we fail the allocation instead of entering
> * direct reclaim.
> */
> if (compact_result == COMPACT_DEFERRED)
> goto nopage;
>
> /*
> * Looks like reclaim/compaction is worth trying, but
> * sync compaction could be very expensive, so keep
> * using async compaction.
> */
> compact_priority = INIT_COMPACT_PRIORITY;
> }
>
> this is where David wants to add *his* odd test, and I think everybody
> looks at that added case
>
> + if (order == pageblock_order &&
> + !(current->flags & PF_KTHREAD))
> + goto nopage;
>
> and just goes "Eww".
>
> But I think the real problem is that it's the "goto nopage" thing that
> makes _sense_, and the current cases for "let's try compaction" that
> are the odd ones, and then David adds one new special case for the
> sensible behavior.
>
> For example, why would COMPACT_DEFERRED mean "don't bother", but not
> all the other reasons it didn't really make sense?
>
> So does it really make sense to fall through AT ALL to that "retry"
> case, when we explicitly already had (gfp_mask & __GFP_NORETRY)?
>
> Maybe the real fix is to instead of adding yet another special case
> for "goto nopage", it should just be unconditional: simply don't try
> to compact large-pages if __GFP_NORETRY was set.
>

I think what is intended, which may not be represented by the code, is
that if compaction is not suitable (__compaction_suitable() returns
COMPACT_SKIPPED because of failing watermarks) that for non-hugepage
allocations reclaim may be useful. We just want to reclaim memory so that
memory compaction has pages available for migration targets.

Note the same caveat I keep bringing up still applies, though: if reclaim
frees memory that is iterated over by the compaction migration scanner, it
was pointless. That is a memory compaction implementation detail and can
lead to a lot of unnecessary reclaim (or even thrashing) if unmovable page
fragmentation cause compaction to fail even after it has migrated
everything it could. I think the likelihood of that happening increases
by the allocation order.

2018-12-04 08:49:52

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Mon 03-12-18 13:53:21, David Rientjes wrote:
> On Mon, 3 Dec 2018, Michal Hocko wrote:
>
> > > I think extending functionality so thp can be allocated remotely if truly
> > > desired is worthwhile
> >
> > This is a complete NUMA policy antipatern that we have for all other
> > user memory allocations. So far you have to be explicit for your numa
> > requirements. You are trying to conflate NUMA api with MADV and that is
> > just conflating two orthogonal things and that is just wrong.
> >
>
> No, the page allocator change for both my patch and __GFP_COMPACT_ONLY has
> nothing to do with any madvise() mode. It has to do with where thp
> allocations are preferred. Yes, this is different than other memory
> allocations where it doesn't cause a 13.9% access latency regression for
> the lifetime of a binary for users who back their text with hugepages.
> MADV_HUGEPAGE still has its purpose to try synchronous memory compaction
> at fault time under all thp defrag modes other than "never". The specific
> problem being reported here, and that both my patch and __GFP_COMPACT_ONLY
> address, is the pointless reclaim activity that does not assist in making
> compaction more successful.

You do not address my concern though. Sure there are reclaim related
issues. Nobody is questioning that. But that is only half of the
problem.

The thing I am really up to here is that reintroduction of
__GFP_THISNODE, which you are pushing for, will conflate madvise mode
resp. defrag=always with a numa placement policy because the allocation
doesn't fallback to a remote node.

And that is a fundamental problem and the antipattern I am talking
about. Look at it this way. All normal allocations are utilizing all the
available memory even though they might hit a remote latency penalty. If
you do care about NUMA placement you have an API to enforce a specific
placement. What is so different about THP to behave differently. Do
we really want to later invent an API to actually allow to utilize all
the memory? There are certainly usecases (that triggered the discussion
previously) that do not mind the remote latency because all other
benefits simply outweight it?

That being said what should users who want to use all the memory do to
use as many THPs as possible?
--
Michal Hocko
SUSE Labs

2018-12-04 09:24:37

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On 12/3/18 11:27 PM, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 2:04 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> so I think all of David's patch is somewhat sensible, even if that
>> specific "order == pageblock_order" test really looks like it might
>> want to be clarified.
>
> Side note: I think maybe people should just look at that whole
> compaction logic for that block, because it doesn't make much sense to
> me:
>
> /*
> * Checks for costly allocations with __GFP_NORETRY, which
> * includes THP page fault allocations
> */
> if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> /*
> * If compaction is deferred for high-order allocations,
> * it is because sync compaction recently failed. If
> * this is the case and the caller requested a THP
> * allocation, we do not want to heavily disrupt the
> * system, so we fail the allocation instead of entering
> * direct reclaim.
> */
> if (compact_result == COMPACT_DEFERRED)
> goto nopage;
>
> /*
> * Looks like reclaim/compaction is worth trying, but
> * sync compaction could be very expensive, so keep
> * using async compaction.
> */
> compact_priority = INIT_COMPACT_PRIORITY;
> }
>
> this is where David wants to add *his* odd test, and I think everybody
> looks at that added case
>
> + if (order == pageblock_order &&
> + !(current->flags & PF_KTHREAD))
> + goto nopage;
>
> and just goes "Eww".
>
> But I think the real problem is that it's the "goto nopage" thing that
> makes _sense_, and the current cases for "let's try compaction" that

More precisely it's "let's try reclaim + compaction".

> are the odd ones, and then David adds one new special case for the
> sensible behavior.
>
> For example, why would COMPACT_DEFERRED mean "don't bother", but not
> all the other reasons it didn't really make sense?

COMPACT_DEFERRED means that compaction was failing recently, even with
sufficient free pages (e.g. freed by direct reclaim), so it doesn't make
sense to continue.
What are "all the other reasons"? __alloc_pages_direct_compact() could
have also returned COMPACT_SKIPPED, which means compaction actually
didn't happen at all, because there's not enough free pages.

> So does it really make sense to fall through AT ALL to that "retry"
> case, when we explicitly already had (gfp_mask & __GFP_NORETRY)?

Well if there was no free memory to begin with, and thus compaction
returned COMPACT_SKIPPED, then we didn't really "try" anything yet, so
there's nothing to "not retry".

> Maybe the real fix is to instead of adding yet another special case
> for "goto nopage", it should just be unconditional: simply don't try
> to compact large-pages if __GFP_NORETRY was set.

I think that would destroy THP success rates too much, in situations
where reclaim and compaction would succeed, because there's enough
easily reclaimable and migratable memory.

> Hmm? I dunno. Right now - for 4.20, I'd obviously want to keep changes
> smallish, so a hacky added special case might be the right thing to
> do. But the code does look odd, doesn't it?
>
> I think part of it comes from the fact that we *used* to do the
> compaction first, and then we did the reclaim, and then it was
> re-orghanized to do reclaim first, but it tried to keep semantic
> changes minimal and some of the above comes from that re-org.

IIRC the point of reorg was that in typical case we actually do want to
try the reclaim first (or only), and the exception are those THP-ish
allocations where typically the problem is fragmentation, and not number
of free pages, so we check first if we can defragment the memory or
whether it makes sense to free pages in case the defragmentation is
expected to help afterwards. It seemed better to put this special case
out of the main reclaim/compaction retry-with-increasing-priority loop
for non-costly-order allocations that in general can't fail.

Vlastimil

> I think.
>
> Linus
>

2018-12-04 10:48:32

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

Much of this thread is a rehash of previous discussions so as a result,
I glossed over parts of it so there will be a degree of error. Very
preliminary results from David's approach are below and the bottom line
is that it might fix some latency issues and locality issues at the cost
of a high degree of THP allocation failure.

On Tue, Dec 04, 2018 at 10:22:26AM +0100, Vlastimil Babka wrote:
> > + if (order == pageblock_order &&
> > + !(current->flags & PF_KTHREAD))
> > + goto nopage;
> >
> > and just goes "Eww".
> >
> > But I think the real problem is that it's the "goto nopage" thing that
> > makes _sense_, and the current cases for "let's try compaction" that
>
> More precisely it's "let's try reclaim + compaction".
>

The original intent has been muddied and special cased but the general idea
was that compaction needs space to work with to both succeed and avoid
excessive scanning -- particularly in direct context that is visible to
the application. Before compaction, linear-reclaim (aka lumpy reclaim)
was used but this caused both page age inversion issues and excessive
thrasing. In Andrew's tree, there are patches that also do small amounts
of reclaim in response to fragmentation which in some cases alleviates
the need for the reclaim + compaction step as the reclaim has sometimes
already happened. This has reduced latencies and increased THP allocation
success rates but not by enough which needs further work.

Parts of compaction are in need of a revisit. I'm in the process of doing
but it's time consuming to do this because of the level of testing required
at every step. The prototype currently is 12 patches and growing and I'm
not sure what the final series will look like or how far it'll go.

At this point, I believe that even when it's finished that the concept of
"do some reclaim and try compaction" will remain. I'm focused primarily
on the compaction core at the moment rather than the outer part in the
page allocator.

> > are the odd ones, and then David adds one new special case for the
> > sensible behavior.
> >
> > For example, why would COMPACT_DEFERRED mean "don't bother", but not
> > all the other reasons it didn't really make sense?
>
> COMPACT_DEFERRED means that compaction was failing recently, even with
> sufficient free pages (e.g. freed by direct reclaim), so it doesn't make
> sense to continue.

Yes, the intent is that recent failures should not incur more useless
scanning and stalls. As it is, the latencies are too high and too often
it's useless work. Historically, this was put into place as the time
spent in compaction was too high and the THP allocation latencies were so
bad that it was preferred to disable THP entirely. This has improved in
recent years with general improvements and changes to defaults but there
is room to improve. Again, it's something I'm looking into but it's slow.

> > <SNIP>
> > So does it really make sense to fall through AT ALL to that "retry"
> > case, when we explicitly already had (gfp_mask & __GFP_NORETRY)?
>
> Well if there was no free memory to begin with, and thus compaction
> returned COMPACT_SKIPPED, then we didn't really "try" anything yet, so
> there's nothing to "not retry".
>

What should also be kept in mind is that we should avoid conflating
locality preferences with THP preferences which is separate from THP
allocation latencies. The whole __GFP_THISNODE approach is pushing too
hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
are used which is very unfortunate given that MADV_HUGEPAGE in itself says
nothing about locality -- that is the business of other madvise flags or
a specific policy. Using remote nodes is bad but reclaiming excessively
and pushing data out of memory is worse as the latency to fault data back
from disk is higher than a remote access.

Andrea already pointed it out -- workloads that fit within a node are happy
to reclaim local memory, particularly in the case where the existing data
is old which is the ideal for David. Workloads that do not fit within a
node will often prefer using remote memory -- either THP or base pages
in the general case and THP for definite in the KVM case. While KVM
might not like remote memory, using THP at least reduces the page table
access overhead even if the access is remote and eventually automatic
NUMA balancing might intervene.

> > Maybe the real fix is to instead of adding yet another special case
> > for "goto nopage", it should just be unconditional: simply don't try
> > to compact large-pages if __GFP_NORETRY was set.
>
> I think that would destroy THP success rates too much, in situations
> where reclaim and compaction would succeed, because there's enough
> easily reclaimable and migratable memory.
>

Tests are in progress but yes, this is the primary risk of abandoning
the allocation request too early. I've already found during developing
the prototype series that it's very easy to "fix" latencies by simply
failing THP allocation very quickly. This is not the desired outcome and
has occupied the bulk of my attention.

I have *one* result of the series on a 1-socket machine running
"thpscale". It creates a file, punches holes in it to create a
very light form of fragmentation and then tries THP allocations
using madvise measuring latency and success rates. It's the
global-dhp__workload_thpscale-madvhugepage in mmtests using XFS as the
filesystem.

thpscale Fault Latencies
4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Amean fault-base-3 5358.54 ( 0.00%) 2408.93 * 55.04%*
Amean fault-base-5 9742.30 ( 0.00%) 3035.25 * 68.84%*
Amean fault-base-7 13069.18 ( 0.00%) 4362.22 * 66.62%*
Amean fault-base-12 14882.53 ( 0.00%) 9424.38 * 36.67%*
Amean fault-base-18 15692.75 ( 0.00%) 16280.03 ( -3.74%)
Amean fault-base-24 28775.11 ( 0.00%) 18374.84 * 36.14%*
Amean fault-base-30 42056.32 ( 0.00%) 21984.55 * 47.73%*
Amean fault-base-32 38634.26 ( 0.00%) 22199.49 * 42.54%*
Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
Amean fault-huge-3 3628.86 ( 0.00%) 963.45 * 73.45%*
Amean fault-huge-5 4926.42 ( 0.00%) 2959.85 * 39.92%*
Amean fault-huge-7 6717.15 ( 0.00%) 3828.68 * 43.00%*
Amean fault-huge-12 11393.47 ( 0.00%) 5772.92 * 49.33%*
Amean fault-huge-18 16979.38 ( 0.00%) 4435.95 * 73.87%*
Amean fault-huge-24 16558.00 ( 0.00%) 4416.46 * 73.33%*
Amean fault-huge-30 20351.46 ( 0.00%) 5099.73 * 74.94%*
Amean fault-huge-32 23332.54 ( 0.00%) 6524.73 * 72.04%*

So, looks like massive latency improvements but then the THP allocation
success rates

thpscale Percentage Faults Huge
4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Percentage huge-3 95.14 ( 0.00%) 7.94 ( -91.65%)
Percentage huge-5 91.28 ( 0.00%) 5.00 ( -94.52%)
Percentage huge-7 86.87 ( 0.00%) 9.36 ( -89.22%)
Percentage huge-12 83.36 ( 0.00%) 21.03 ( -74.78%)
Percentage huge-18 83.04 ( 0.00%) 30.73 ( -63.00%)
Percentage huge-24 83.74 ( 0.00%) 27.47 ( -67.20%)
Percentage huge-30 83.66 ( 0.00%) 31.85 ( -61.93%)
Percentage huge-32 83.89 ( 0.00%) 29.09 ( -65.32%)

They're down the toilet. 3 threads are able to get 95% of the requested
THP pages with Andrews tree as of Nov 30th. David's patch drops that to
8% success rate.

"Compaction efficiency" which takes success vs failure rate into account
goes from 45% to 1%. Compaction scan efficiency, which is how many pages
for migration are scanned vs how many are scanned as free targets goes
from 21% to 1%.

I do not consider this to be a good outcome and hence will not be acking
the patches.

I would also re-emphasise that a major problem with addressing this
problem is that we do not have a general reproducible test case for
David's scenario where as we do have reproduction cases for the others.
They're not related to KVM but that doesn't matter because it's enough
to have a memory hog try allocating more memory than fits on a single node.

Remember too that while the headline regression reported by LKP looks
bad, they have mentioned themselves that the different threads in the
workload are being treated fairly. We've seen "regressions" like this in
the past. For example, way back we had a problem with a dbench regression
that was due to IO fairness giving each thread time to issue IO which
slowed overall throughput that benefitted from few threads making progress
while others starved.

> > Hmm? I dunno. Right now - for 4.20, I'd obviously want to keep changes
> > smallish, so a hacky added special case might be the right thing to
> > do. But the code does look odd, doesn't it?
> >
> > I think part of it comes from the fact that we *used* to do the
> > compaction first, and then we did the reclaim, and then it was
> > re-orghanized to do reclaim first, but it tried to keep semantic
> > changes minimal and some of the above comes from that re-org.
>
> IIRC the point of reorg was that in typical case we actually do want to
> try the reclaim first (or only), and the exception are those THP-ish
> allocations where typically the problem is fragmentation, and not number
> of free pages, so we check first if we can defragment the memory or
> whether it makes sense to free pages in case the defragmentation is
> expected to help afterwards. It seemed better to put this special case
> out of the main reclaim/compaction retry-with-increasing-priority loop
> for non-costly-order allocations that in general can't fail.
>

Again, this is accurate. Scanning/compaction costs a lot. This has improved
over time, but minimally it's unmapping pages, copying data and a bunch
of TLB flushes. During migration, any access to the data being migrated
stalls. The harm of reclaiming a little first so that the compaction is
more likely to succeed incurred fewer stalls of small magnitude in
general -- or at least it was the case when that behaviour was
developed.

--
Mel Gorman
SUSE Labs

2018-12-05 00:08:32

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Tue, 4 Dec 2018, Michal Hocko wrote:

> The thing I am really up to here is that reintroduction of
> __GFP_THISNODE, which you are pushing for, will conflate madvise mode
> resp. defrag=always with a numa placement policy because the allocation
> doesn't fallback to a remote node.
>

It isn't specific to MADV_HUGEPAGE, it is the policy for all transparent
hugepage allocations, including defrag=always. We agree that
MADV_HUGEPAGE is not exactly defined: does it mean try harder to allocate
a hugepage locally, try compaction synchronous to the fault, allow remote
fallback? It's undefined.

The original intent was to be used when thp is disabled system wide
(enabled set to "madvise") because its possible the rss of the process
increases if backed by thp. That occurs either if faulting on a hugepage
aligned area or based on max_ptes_none. So we have at least three
possible policies that have evolved over time: preventing increased rss,
direct compaction, remote fallback. Certainly not something that fits
under a single madvise mode.

> And that is a fundamental problem and the antipattern I am talking
> about. Look at it this way. All normal allocations are utilizing all the
> available memory even though they might hit a remote latency penalty. If
> you do care about NUMA placement you have an API to enforce a specific
> placement. What is so different about THP to behave differently. Do
> we really want to later invent an API to actually allow to utilize all
> the memory? There are certainly usecases (that triggered the discussion
> previously) that do not mind the remote latency because all other
> benefits simply outweight it?
>

What is different about THP is that on every platform I have measured,
NUMA matters more than hugepages. Obviously if on Broadwell, Haswell, and
Rome, remote hugepages were a performance win over local pages, this
discussion would not be happening. Faulting local pages rather than
local hugepages, if possible, is easy and doesn't require reclaim.
Faulting remote pages rather than reclaiming local pages is easy in your
scenario, it's non-disruptive.

So to answer "what is so different about THP?", it's the performance data.
The NUMA locality matters more than whether the pages are huge or not. We
also have the added benefit of khugepaged being able to collapse pages
locally if fragmentation improves rather than being stuck accessing a
remote hugepage forever.

> That being said what should users who want to use all the memory do to
> use as many THPs as possible?

If those users want to accept the performance degradation of allocating
remote hugepages instead of local pages, that should likely be an
extension, either madvise or prctl. That's not necessarily the usecase
Andrea would have, I don't believe: he'd still prefer to compact memory
locally and avoid the swap storm than allocate remotely. If impossible to
reclaim locally for regular pages, remote hugepages may be more beneficial
than remote pages.

2018-12-05 00:48:25

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Tue, 4 Dec 2018, Mel Gorman wrote:

> What should also be kept in mind is that we should avoid conflating
> locality preferences with THP preferences which is separate from THP
> allocation latencies. The whole __GFP_THISNODE approach is pushing too
> hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> are used which is very unfortunate given that MADV_HUGEPAGE in itself says
> nothing about locality -- that is the business of other madvise flags or
> a specific policy.

We currently lack those other madvise modes or mempolicies: mbind() is not
a viable alternative because we do not want to oom kill when local memory
is depleted, we want to fallback to remote memory. In my response to
Michal, I noted three possible usecases that MADV_HUGEPAGE either
currently has or has taken before: direct compaction/reclaim, avoid
increased rss, and allow fallback to remote memory. It's certainly not
the business of one madvise mode to define this. Thus, I'm trying to
return to the behavior that was in 4.1 and what was restored three years
ago because suddenly changing the behavior to allow remote allocation
causes real-world regressions.

> Using remote nodes is bad but reclaiming excessively
> and pushing data out of memory is worse as the latency to fault data back
> from disk is higher than a remote access.
>

That's discussing two things at the same time: local fragmentation and
local low-on-memory conditions. If compaction quickly fails and local
pages are available as fallback, that requires no reclaim. If we're truly
low-on-memory locally then it is obviously better to allocate remotely
than aggressively reclaim.

> Andrea already pointed it out -- workloads that fit within a node are happy
> to reclaim local memory, particularly in the case where the existing data
> is old which is the ideal for David. Workloads that do not fit within a
> node will often prefer using remote memory -- either THP or base pages
> in the general case and THP for definite in the KVM case. While KVM
> might not like remote memory, using THP at least reduces the page table
> access overhead even if the access is remote and eventually automatic
> NUMA balancing might intervene.
>

Sure, but not at the cost of regressing real-world workloads; what is
being asked for here is legitimate and worthy of an extension, but since
the long-standing behavior has been to use __GFP_THISNODE and people
depend on that for NUMA locality, can we not fix the swap storm and look
to extending the API to include workloads that span multiple nodes?

> I have *one* result of the series on a 1-socket machine running
> "thpscale". It creates a file, punches holes in it to create a
> very light form of fragmentation and then tries THP allocations
> using madvise measuring latency and success rates. It's the
> global-dhp__workload_thpscale-madvhugepage in mmtests using XFS as the
> filesystem.
>
> thpscale Fault Latencies
> 4.20.0-rc4 4.20.0-rc4
> mmots-20181130 gfpthisnode-v1r1
> Amean fault-base-3 5358.54 ( 0.00%) 2408.93 * 55.04%*
> Amean fault-base-5 9742.30 ( 0.00%) 3035.25 * 68.84%*
> Amean fault-base-7 13069.18 ( 0.00%) 4362.22 * 66.62%*
> Amean fault-base-12 14882.53 ( 0.00%) 9424.38 * 36.67%*
> Amean fault-base-18 15692.75 ( 0.00%) 16280.03 ( -3.74%)
> Amean fault-base-24 28775.11 ( 0.00%) 18374.84 * 36.14%*
> Amean fault-base-30 42056.32 ( 0.00%) 21984.55 * 47.73%*
> Amean fault-base-32 38634.26 ( 0.00%) 22199.49 * 42.54%*
> Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
> Amean fault-huge-3 3628.86 ( 0.00%) 963.45 * 73.45%*
> Amean fault-huge-5 4926.42 ( 0.00%) 2959.85 * 39.92%*
> Amean fault-huge-7 6717.15 ( 0.00%) 3828.68 * 43.00%*
> Amean fault-huge-12 11393.47 ( 0.00%) 5772.92 * 49.33%*
> Amean fault-huge-18 16979.38 ( 0.00%) 4435.95 * 73.87%*
> Amean fault-huge-24 16558.00 ( 0.00%) 4416.46 * 73.33%*
> Amean fault-huge-30 20351.46 ( 0.00%) 5099.73 * 74.94%*
> Amean fault-huge-32 23332.54 ( 0.00%) 6524.73 * 72.04%*
>
> So, looks like massive latency improvements but then the THP allocation
> success rates
>
> thpscale Percentage Faults Huge
> 4.20.0-rc4 4.20.0-rc4
> mmots-20181130 gfpthisnode-v1r1
> Percentage huge-3 95.14 ( 0.00%) 7.94 ( -91.65%)
> Percentage huge-5 91.28 ( 0.00%) 5.00 ( -94.52%)
> Percentage huge-7 86.87 ( 0.00%) 9.36 ( -89.22%)
> Percentage huge-12 83.36 ( 0.00%) 21.03 ( -74.78%)
> Percentage huge-18 83.04 ( 0.00%) 30.73 ( -63.00%)
> Percentage huge-24 83.74 ( 0.00%) 27.47 ( -67.20%)
> Percentage huge-30 83.66 ( 0.00%) 31.85 ( -61.93%)
> Percentage huge-32 83.89 ( 0.00%) 29.09 ( -65.32%)
>
> They're down the toilet. 3 threads are able to get 95% of the requested
> THP pages with Andrews tree as of Nov 30th. David's patch drops that to
> 8% success rate.
>

I'm not as concerned about fault latency for these binaries that remap
their text segments to be backed by transparent hugepages, that's
secondary to the primary concern: access latency. I agree that faulting
thp is more likely successful if allowed to access remote memory; I'm
reporting the regression in the access latency to that memory for the
lifetime of the binary.

> "Compaction efficiency" which takes success vs failure rate into account
> goes from 45% to 1%. Compaction scan efficiency, which is how many pages
> for migration are scanned vs how many are scanned as free targets goes
> from 21% to 1%.
>
> I do not consider this to be a good outcome and hence will not be acking
> the patches.
>
> I would also re-emphasise that a major problem with addressing this
> problem is that we do not have a general reproducible test case for
> David's scenario where as we do have reproduction cases for the others.
> They're not related to KVM but that doesn't matter because it's enough
> to have a memory hog try allocating more memory than fits on a single node.
>

It's trivial to reproduce this issue: fragment all local memory that
compaction cannot resolve, do posix_memalign() for hugepage aligned
memory, and measure the access latency. To fragment all local memory, you
can simply insert a kernel module and allocate high-order memory (just do
kmem_cache_alloc_node() or get_page() to pin it so compaction fails or
punch holes in the file as you did above). You can do this for all memory
rather than the local node to measure the even more severe allocation
latency regression that not setting __GFP_THISNODE introduces.

2018-12-05 09:10:56

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Tue 04-12-18 16:47:23, David Rientjes wrote:
> On Tue, 4 Dec 2018, Mel Gorman wrote:
>
> > What should also be kept in mind is that we should avoid conflating
> > locality preferences with THP preferences which is separate from THP
> > allocation latencies. The whole __GFP_THISNODE approach is pushing too
> > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> > are used which is very unfortunate given that MADV_HUGEPAGE in itself says
> > nothing about locality -- that is the business of other madvise flags or
> > a specific policy.
>
> We currently lack those other madvise modes or mempolicies: mbind() is not
> a viable alternative because we do not want to oom kill when local memory
> is depleted, we want to fallback to remote memory.

Yes, there was a clear agreement that there is no suitable mempolicy
right now and there were proposals to introduce MPOL_NODE_RECLAIM to
introduce that behavior. This would be an improvement regardless of THP
because global node-reclaim policy was simply a disaster we had to turn
off by default and the global semantic was a reason people just gave up
using it completely.

[...]

> Sure, but not at the cost of regressing real-world workloads; what is
> being asked for here is legitimate and worthy of an extension, but since
> the long-standing behavior has been to use __GFP_THISNODE and people
> depend on that for NUMA locality,

Well, your patch has altered the semantic and has introduced a subtle
and _undocumented_ NUMA policy into MADV_HUGEPAGE. All that without any
workload numbers. It would be preferable to have a simulator of those
real world workloads of course but even getting some more detailed
metric - e.g. without the patch we have X THP utilization and the
runtime characteristics Y but without X1 and Y1).

> can we not fix the swap storm and look
> to extending the API to include workloads that span multiple nodes?

Yes, we definitely want to address swap storms. No question about that.
But our established approach for NUMA policy has been to fallback to
other nodes and everybody focused on NUMA latency should use NUMA API to
achive that. Not vice versa.

As I've said in other thread, I am OK with restoring __GFP_THISNODE for
now but we should really have a very good plan for further steps. And
that involves an agreed NUMA behavior. I haven't seen any widespread
agreement on that yet though.

[...]
> > I would also re-emphasise that a major problem with addressing this
> > problem is that we do not have a general reproducible test case for
> > David's scenario where as we do have reproduction cases for the others.
> > They're not related to KVM but that doesn't matter because it's enough
> > to have a memory hog try allocating more memory than fits on a single node.
> >
>
> It's trivial to reproduce this issue: fragment all local memory that
> compaction cannot resolve, do posix_memalign() for hugepage aligned
> memory, and measure the access latency. To fragment all local memory, you
> can simply insert a kernel module and allocate high-order memory (just do
> kmem_cache_alloc_node() or get_page() to pin it so compaction fails or
> punch holes in the file as you did above). You can do this for all memory
> rather than the local node to measure the even more severe allocation
> latency regression that not setting __GFP_THISNODE introduces.

Sure, but can we get some numbers from a real workload rather than an
artificial worst case? The utilization issue Mel pointed out before and
here again is a real concern IMHO. We we definitely need a better
picture to make an educated decision.
--
Michal Hocko
SUSE Labs

2018-12-05 10:07:24

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Tue, Dec 04, 2018 at 10:45:58AM +0000, Mel Gorman wrote:
> I have *one* result of the series on a 1-socket machine running
> "thpscale". It creates a file, punches holes in it to create a
> very light form of fragmentation and then tries THP allocations
> using madvise measuring latency and success rates. It's the
> global-dhp__workload_thpscale-madvhugepage in mmtests using XFS as the
> filesystem.
>
> thpscale Fault Latencies
> 4.20.0-rc4 4.20.0-rc4
> mmots-20181130 gfpthisnode-v1r1
> Amean fault-base-3 5358.54 ( 0.00%) 2408.93 * 55.04%*
> Amean fault-base-5 9742.30 ( 0.00%) 3035.25 * 68.84%*
> Amean fault-base-7 13069.18 ( 0.00%) 4362.22 * 66.62%*
> Amean fault-base-12 14882.53 ( 0.00%) 9424.38 * 36.67%*
> Amean fault-base-18 15692.75 ( 0.00%) 16280.03 ( -3.74%)
> Amean fault-base-24 28775.11 ( 0.00%) 18374.84 * 36.14%*
> Amean fault-base-30 42056.32 ( 0.00%) 21984.55 * 47.73%*
> Amean fault-base-32 38634.26 ( 0.00%) 22199.49 * 42.54%*
> Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
> Amean fault-huge-3 3628.86 ( 0.00%) 963.45 * 73.45%*
> Amean fault-huge-5 4926.42 ( 0.00%) 2959.85 * 39.92%*
> Amean fault-huge-7 6717.15 ( 0.00%) 3828.68 * 43.00%*
> Amean fault-huge-12 11393.47 ( 0.00%) 5772.92 * 49.33%*
> Amean fault-huge-18 16979.38 ( 0.00%) 4435.95 * 73.87%*
> Amean fault-huge-24 16558.00 ( 0.00%) 4416.46 * 73.33%*
> Amean fault-huge-30 20351.46 ( 0.00%) 5099.73 * 74.94%*
> Amean fault-huge-32 23332.54 ( 0.00%) 6524.73 * 72.04%*
>
> So, looks like massive latency improvements but then the THP allocation
> success rates
>
> thpscale Percentage Faults Huge
> 4.20.0-rc4 4.20.0-rc4
> mmots-20181130 gfpthisnode-v1r1
> Percentage huge-3 95.14 ( 0.00%) 7.94 ( -91.65%)
> Percentage huge-5 91.28 ( 0.00%) 5.00 ( -94.52%)
> Percentage huge-7 86.87 ( 0.00%) 9.36 ( -89.22%)
> Percentage huge-12 83.36 ( 0.00%) 21.03 ( -74.78%)
> Percentage huge-18 83.04 ( 0.00%) 30.73 ( -63.00%)
> Percentage huge-24 83.74 ( 0.00%) 27.47 ( -67.20%)
> Percentage huge-30 83.66 ( 0.00%) 31.85 ( -61.93%)
> Percentage huge-32 83.89 ( 0.00%) 29.09 ( -65.32%)
>

Other results arrived once the grid caught up and it's a mixed bag of
gains and losses roughtly along the lines predicted by the discussion
already -- namely locality is better as long as the workload fits,
compaction is reduced, reclaim is reduced, THP allocation success rates
are reduced but latencies are often better.

Whether this is "good" or "bad" depends on whether you have a workload
that benefits because it's neither universally good or bad. It would
still be nice to hear how Andreas fared but I think we'll reach the same
conclusion -- the patches shuffles the problem around with limited effort
to address the root causes so all we end up changing is the identity of
the person who complains about their workload. One might be tempted to
think that the reduced latencies in some cases are great but not if the
workload is one that benefits from longer startup costs in exchange for
lower runtime costs in the active phase.

For the much longer answer, I'll focus on the two-socket results because
they are more relevant to the current discussion. The workloads are
not realistic in the slightest, they just happen to trigger some of the
interesting corner cases.

global-dhp__workload_usemem-stress-numa-compact
o Plain anonymous faulting workload
o defrag=always (not representative, simply triggers a bad case)

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Amean Elapsd-1 26.79 ( 0.00%) 34.92 * -30.37%*
Amean Elapsd-3 7.32 ( 0.00%) 8.10 * -10.61%*
Amean Elapsd-4 5.53 ( 0.00%) 5.64 ( -1.94%)

Units are seconds, time to complete 30.37% worse for the single-threaded
case. No direct reclaim activity but other activity is interesting and
I'll pick it out snippets;

4.20.0-rc4 4.20.0-rc4
mmots-20181130gfpthisnode-v1r1
Swap Ins 8 0
Swap Outs 1546 0
Allocation stalls 0 0
Fragmentation stalls 0 2022
Direct pages scanned 0 0
Kswapd pages scanned 42719 1078
Kswapd pages reclaimed 41082 1049
Page writes by reclaim 1546 0
Page writes file 0 0
Page writes anon 1546 0
Page reclaim immediate 2 0

Baseline kernel swaps out (bad), David's patch reclaims less (good).
That's reasonably positive. Less positive is that fragmentation stalls are
triggered with David's patch. This is due to a patch of mine in Andrew's
tree which I've asked that he drop as while it helps control long-term
fragmentation, there was always a risk that the short stalls would be
problematic and it's a distraction.

THP fault alloc 540043 456714
THP fault fallback 0 83329
THP collapse alloc 0 4
THP collapse fail 0 0
THP split 1 0
THP split failed 0 0

David's patch falls back to base page allocation to a much higher degree
(bad).

Compaction pages isolated 85381 11432
Compaction migrate scanned 204787 42635
Compaction free scanned 72376 13061
Compact scan efficiency 282% 326%

David's patch also compacts less.

NUMA alloc hit 1188182 1093244
NUMA alloc miss 68199 42764192
NUMA interleave hit 0 0
NUMA alloc local 1179614 1084665
NUMA base PTE updates 28902547 23270389
NUMA huge PMD updates 56437 45438
NUMA page range updates 57798291 46534645
NUMA hint faults 61395 47838
NUMA hint local faults 46440 47833
NUMA hint local percent 75% 99%
NUMA pages migrated 2000156 5

Interestingly, the NUMA misses are higher with David's patch indicating
that it's allocating *more* from remote nodes. However, there are also
hints that the accessing process then removes to the remote node instead
of the current mmotm kernel which tries to migrate the memory locally.

So, in line with expectations. The baseline kernel works harder to
allocate the THPs where as David's gives up quickly and moves over. At
one level this is good but the bottom line is total time to complete the
workload goes from the baseline of 280 seconds up to 344 seconds. This
is overall mixed because depending on what you look at, it's both good
and bad.

global-dhp__workload_thpscale-xfs
o Workload creates a large file, punches holes in it
o Mapping is created and faulted to measure allocation success rates and
latencies
o No special madvise
o Considered a relatively "simple" case

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Amean fault-base-3 2021.05 ( 0.00%) 2633.11 * -30.28%*
Amean fault-base-5 2475.25 ( 0.00%) 2997.15 * -21.08%*
Amean fault-base-7 5595.79 ( 0.00%) 7523.10 ( -34.44%)
Amean fault-base-12 15604.91 ( 0.00%) 16355.02 ( -4.81%)
Amean fault-base-18 20277.13 ( 0.00%) 22062.73 ( -8.81%)
Amean fault-base-24 24218.46 ( 0.00%) 25772.49 ( -6.42%)
Amean fault-base-30 28516.75 ( 0.00%) 28208.14 ( 1.08%)
Amean fault-base-32 36722.30 ( 0.00%) 20712.46 * 43.60%*
Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
Amean fault-huge-3 685.38 ( 0.00%) 512.02 * 25.29%*
Amean fault-huge-5 3639.75 ( 0.00%) 807.33 ( 77.82%)
Amean fault-huge-7 1139.54 ( 0.00%) 555.45 * 51.26%*
Amean fault-huge-12 1012.64 ( 0.00%) 850.68 ( 15.99%)
Amean fault-huge-18 6694.45 ( 0.00%) 1310.39 * 80.43%*
Amean fault-huge-24 10165.27 ( 0.00%) 3822.23 * 62.40%*
Amean fault-huge-30 13496.19 ( 0.00%) 19248.06 * -42.62%*
Amean fault-huge-32 4477.05 ( 0.00%) 63463.78 *-1317.54%*

These latency outliers can be huge so take them with a grain of salt.
Sometimes I'll look at the percentiles but it takes an age to discuss.

In general, David's patch faults huge pages faster, particularly with
higher threads. The allocation success rates are also great

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Percentage huge-3 2.86 ( 0.00%) 26.48 ( 825.27%)
Percentage huge-5 1.07 ( 0.00%) 1.41 ( 31.16%)
Percentage huge-7 20.38 ( 0.00%) 54.82 ( 168.94%)
Percentage huge-12 19.07 ( 0.00%) 38.10 ( 99.76%)
Percentage huge-18 10.72 ( 0.00%) 30.18 ( 181.49%)
Percentage huge-24 8.44 ( 0.00%) 15.48 ( 83.39%)
Percentage huge-30 7.41 ( 0.00%) 10.78 ( 45.38%)
Percentage huge-32 29.08 ( 0.00%) 3.23 ( -88.91%)

Overall system activity looks similar which is counter-intuitive. The
only hints of what is going on is that David's patch reclaims less from
kswapd context. Direct reclaim scanning is high in both cases but does
not reclaim much. David's patch scans for free page as compaction
targets much more aggressively but no indication as to why. Locality
information looks similar.

So, not as sure what to think about this. Headline results look good but
no obvious explanation as to why exactly. It could be that stalls
(higher with David's patch) mean there is less inteference between
threads but that's thin.

global-dhp__workload_thpscale-madvhugepage-xfs
o Same as above except that MADV_HUGEPAGE is used

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Amean fault-base-1 0.00 ( 0.00%) 0.00 ( 0.00%)
Amean fault-base-3 18880.35 ( 0.00%) 6341.60 * 66.41%*
Amean fault-base-5 27608.74 ( 0.00%) 6515.10 * 76.40%*
Amean fault-base-7 28345.03 ( 0.00%) 7529.98 * 73.43%*
Amean fault-base-12 35690.33 ( 0.00%) 13518.77 * 62.12%*
Amean fault-base-18 56538.31 ( 0.00%) 23933.91 * 57.67%*
Amean fault-base-24 71485.33 ( 0.00%) 26927.03 * 62.33%*
Amean fault-base-30 54286.39 ( 0.00%) 23453.61 * 56.80%*
Amean fault-base-32 92143.50 ( 0.00%) 19474.99 * 78.86%*
Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
Amean fault-huge-3 5666.72 ( 0.00%) 1351.55 * 76.15%*
Amean fault-huge-5 8307.35 ( 0.00%) 2776.28 * 66.58%*
Amean fault-huge-7 10651.96 ( 0.00%) 2397.70 * 77.49%*
Amean fault-huge-12 15489.56 ( 0.00%) 7034.98 * 54.58%*
Amean fault-huge-18 20278.54 ( 0.00%) 6417.46 * 68.35%*
Amean fault-huge-24 29378.24 ( 0.00%) 16173.41 * 44.95%*
Amean fault-huge-30 29237.66 ( 0.00%) 81198.70 *-177.72%*
Amean fault-huge-32 27177.37 ( 0.00%) 18966.08 * 30.21%*

Superb improvement in latencies coupled with the following

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Percentage huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
Percentage huge-3 99.74 ( 0.00%) 49.62 ( -50.25%)
Percentage huge-5 99.24 ( 0.00%) 12.19 ( -87.72%)
Percentage huge-7 97.98 ( 0.00%) 19.20 ( -80.40%)
Percentage huge-12 95.76 ( 0.00%) 21.33 ( -77.73%)
Percentage huge-18 94.91 ( 0.00%) 31.63 ( -66.67%)
Percentage huge-24 94.36 ( 0.00%) 9.27 ( -90.18%)
Percentage huge-30 92.15 ( 0.00%) 9.60 ( -89.58%)
Percentage huge-32 94.18 ( 0.00%) 8.67 ( -90.79%)

THP allocation success rates are through the floor which is why
latencies overall are better.

This goes back to the fundamental question -- does your workload benefit
from THP or not and is it the primary metric? If yes (potentially in the
case with KVM) then this is a disaster. It's actually a mixed bad for
David because THP was desired but so was locality. In this case, the
application specifically requested THP so presumably a real application
specifying the flag means it

The high-level system stats reflect the level of effort, David's patch
does less work in the system which is both good and bad depending on
your requirements

4.20.0-rc4 4.20.0-rc4
mmots-20181130gfpthisnode-v1r1
Swap Ins 1564 0
Swap Outs 12283 163
Allocation stalls 30236 24
Fragmentation stalls 1069 24683

Baseline kernel swaps and has high allocation stalls to reclaim memory.
David's patch stalls on trying to control fragmentation instead.

4.20.0-rc4 4.20.0-rc4
mmots-20181130gfpthisnode-v1r1
Direct pages scanned 12780511 9955217
Kswapd pages scanned 1944181 16554296
Kswapd pages reclaimed 870023 4029534
Direct pages reclaimed 6738924 5884
Kswapd efficiency 44% 24%
Kswapd velocity 1308.975 11200.850
Direct efficiency 52% 0%
Direct velocity 8604.840 6735.828

The baseline kernel does much of the reclaim work in direct context
while David's does it in kswapd context.

THP fault alloc 316843 238810
THP fault fallback 17224 95256
THP collapse alloc 2 0
THP collapse fail 0 5
THP split 177536 180673
THP split failed 10024 2

Baseline kernel allocates THP while David's falls back

THP collapse alloc 2 0
THP collapse fail 0 5
Compaction stalls 100198 75267
Compaction success 65803 3964
Compaction failures 34395 71303
Compaction efficiency 65% 5%
Page migrate success 40807601 17963914
Page migrate failure 16206 16782
Compaction pages isolated 90818819 41285100
Compaction migrate scanned 98628306 36990342
Compaction free scanned 6547623619 6870889207

Unsurprisingly, David's patch tries to compact less. The collapse activity
shows that enough time didn't pass for khugepaged to intervene. While
outside the context of the current discussion, that compaction scanning
activity is mental but also unsurprising. A lot of it is from kcompactd
activity. It's a seperate series to deal with that.

Given the mix of gains and losses, the patch simply shuffles the problem
around in a circle. Some workloads benefit, some don't and whether it's
merged or not merged, someone ends up annoyed as their workload suffers.

I know I didn't review the patch in much detail because in this context,
it was more interesting to know "what it does" than the specifics of the
approach. I'm going to go back to hopping my face off the compaction
series because I think it has the potential to reduce the problem
overall instead of shuffling the deckchairs around the titanic[1]

[1] Famous last words, the series could end up being the iceberg

--
Mel Gorman
SUSE Labs

2018-12-05 10:19:39

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Tue 04-12-18 16:07:27, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
>
> > The thing I am really up to here is that reintroduction of
> > __GFP_THISNODE, which you are pushing for, will conflate madvise mode
> > resp. defrag=always with a numa placement policy because the allocation
> > doesn't fallback to a remote node.
> >
>
> It isn't specific to MADV_HUGEPAGE, it is the policy for all transparent
> hugepage allocations, including defrag=always. We agree that
> MADV_HUGEPAGE is not exactly defined: does it mean try harder to allocate
> a hugepage locally, try compaction synchronous to the fault, allow remote
> fallback? It's undefined.

Yeah, it is certainly underdefined. One thing is clear though. Using
MADV_HUGEPAGE implies that the specific mapping benefits from THPs and
is willing to pay associated init cost. This doesn't imply anything
regarding NUMA locality and as we have NUMA API it shouldn't even
attempt to do so because it would be conflating two things.
[...]

> > And that is a fundamental problem and the antipattern I am talking
> > about. Look at it this way. All normal allocations are utilizing all the
> > available memory even though they might hit a remote latency penalty. If
> > you do care about NUMA placement you have an API to enforce a specific
> > placement. What is so different about THP to behave differently. Do
> > we really want to later invent an API to actually allow to utilize all
> > the memory? There are certainly usecases (that triggered the discussion
> > previously) that do not mind the remote latency because all other
> > benefits simply outweight it?
> >
>
> What is different about THP is that on every platform I have measured,
> NUMA matters more than hugepages. Obviously if on Broadwell, Haswell, and
> Rome, remote hugepages were a performance win over local pages, this
> discussion would not be happening. Faulting local pages rather than
> local hugepages, if possible, is easy and doesn't require reclaim.
> Faulting remote pages rather than reclaiming local pages is easy in your
> scenario, it's non-disruptive.

You keep ignoring all other usecases mentioned before and that is not
really helpful. Access cost can be amortized by other savings. Not to
mention NUMA balancing moving around hot THPs with remote accesses.

> So to answer "what is so different about THP?", it's the performance data.
> The NUMA locality matters more than whether the pages are huge or not. We
> also have the added benefit of khugepaged being able to collapse pages
> locally if fragmentation improves rather than being stuck accessing a
> remote hugepage forever.

Please back your claims by a variety of workloads. Including mentioned
KVMs one. You keep hand waving about access latency completely ignoring
all other aspects and that makes my suspicious that you do not really
appreciate all the complexity here even stronger.

If there was a general consensus we want to make THP very special wrt.
numa locality, I could live with that. It would be inconsistency in the
API and as such something that will kick us sooner or later. But it
seems that _you_ are the only one to push that direction and you keep
ignoring all other usecases consistently throughout all the discussions
we have had so far. Several people keeps pointing out that this is a
wrong direction but that seems to be completely ignored.

I believe that the only way forward is back your claims by numbers
covering a larger set of THP users and prove that remote THP is
a wrong default behavior. But you cannot really push that through based
on a single usecase of yours which you refuse to describe beyond a
simple access latency metric.
--
Michal Hocko
SUSE Labs

2018-12-05 10:45:45

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, Dec 05, 2018 at 10:08:56AM +0100, Michal Hocko wrote:
> On Tue 04-12-18 16:47:23, David Rientjes wrote:
> > On Tue, 4 Dec 2018, Mel Gorman wrote:
> >
> > > What should also be kept in mind is that we should avoid conflating
> > > locality preferences with THP preferences which is separate from THP
> > > allocation latencies. The whole __GFP_THISNODE approach is pushing too
> > > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> > > are used which is very unfortunate given that MADV_HUGEPAGE in itself says
> > > nothing about locality -- that is the business of other madvise flags or
> > > a specific policy.
> >
> > We currently lack those other madvise modes or mempolicies: mbind() is not
> > a viable alternative because we do not want to oom kill when local memory
> > is depleted, we want to fallback to remote memory.
>
> Yes, there was a clear agreement that there is no suitable mempolicy
> right now and there were proposals to introduce MPOL_NODE_RECLAIM to
> introduce that behavior. This would be an improvement regardless of THP
> because global node-reclaim policy was simply a disaster we had to turn
> off by default and the global semantic was a reason people just gave up
> using it completely.
>

The alternative is to define a clear semantic for THP allocation
requests that are considered "light" regardless of whether that needs a
GFP flag or not. A sensible default might be

o Allocate THP local if the amount of work is light or non-existant.
o Allocate THP remote if one is freely available with no additional work
(maybe kick remote kcompactd)
o Allocate base page local if the amount of work is light or non-existant
o Allocate base page remote if the amount of work is light or non-existant
o Do heavy work in zonelist order until a base page is allocated somewhere

It's not something could be clearly expressed with either NORETRY or
THISNODE but longer-term might be saner than chopping and changing on
which flags are more important and which workload is most relevant. That
runs the risk of a revert-loop where each person targetting one workload
reverts one patch to insert another until someone throws up their hands
in frustration and just carries patches out-of-tree long-term.

I'm not going to prototype something along these lines for now as
fundamentally a better compaction could cut out part of the root cause
of pain.

--
Mel Gorman
SUSE Labs

2018-12-05 11:46:10

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed 05-12-18 10:43:43, Mel Gorman wrote:
> On Wed, Dec 05, 2018 at 10:08:56AM +0100, Michal Hocko wrote:
> > On Tue 04-12-18 16:47:23, David Rientjes wrote:
> > > On Tue, 4 Dec 2018, Mel Gorman wrote:
> > >
> > > > What should also be kept in mind is that we should avoid conflating
> > > > locality preferences with THP preferences which is separate from THP
> > > > allocation latencies. The whole __GFP_THISNODE approach is pushing too
> > > > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> > > > are used which is very unfortunate given that MADV_HUGEPAGE in itself says
> > > > nothing about locality -- that is the business of other madvise flags or
> > > > a specific policy.
> > >
> > > We currently lack those other madvise modes or mempolicies: mbind() is not
> > > a viable alternative because we do not want to oom kill when local memory
> > > is depleted, we want to fallback to remote memory.
> >
> > Yes, there was a clear agreement that there is no suitable mempolicy
> > right now and there were proposals to introduce MPOL_NODE_RECLAIM to
> > introduce that behavior. This would be an improvement regardless of THP
> > because global node-reclaim policy was simply a disaster we had to turn
> > off by default and the global semantic was a reason people just gave up
> > using it completely.
> >
>
> The alternative is to define a clear semantic for THP allocation
> requests that are considered "light" regardless of whether that needs a
> GFP flag or not. A sensible default might be
>
> o Allocate THP local if the amount of work is light or non-existant.
> o Allocate THP remote if one is freely available with no additional work
> (maybe kick remote kcompactd)
> o Allocate base page local if the amount of work is light or non-existant
> o Allocate base page remote if the amount of work is light or non-existant
> o Do heavy work in zonelist order until a base page is allocated somewhere

I am not sure about the ordering without a deeper consideration but I
thin THP should reflect the approach we have for base bages.

> It's not something could be clearly expressed with either NORETRY or
> THISNODE but longer-term might be saner than chopping and changing on
> which flags are more important and which workload is most relevant. That
> runs the risk of a revert-loop where each person targetting one workload
> reverts one patch to insert another until someone throws up their hands
> in frustration and just carries patches out-of-tree long-term.

Fully agreed!

> I'm not going to prototype something along these lines for now as
> fundamentally a better compaction could cut out part of the root cause
> of pain.

Yes there is some ground work to be done first.

--
Michal Hocko
SUSE Labs

2018-12-05 19:17:04

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, 5 Dec 2018, Michal Hocko wrote:

> > It isn't specific to MADV_HUGEPAGE, it is the policy for all transparent
> > hugepage allocations, including defrag=always. We agree that
> > MADV_HUGEPAGE is not exactly defined: does it mean try harder to allocate
> > a hugepage locally, try compaction synchronous to the fault, allow remote
> > fallback? It's undefined.
>
> Yeah, it is certainly underdefined. One thing is clear though. Using
> MADV_HUGEPAGE implies that the specific mapping benefits from THPs and
> is willing to pay associated init cost. This doesn't imply anything
> regarding NUMA locality and as we have NUMA API it shouldn't even
> attempt to do so because it would be conflating two things.

This is exactly why we use MADV_HUGEPAGE when remapping our text segment
to be backed by transparent hugepages, we want to pay the cost at startup
to fault thp and that involves synchronous memory compaction rather than
quickly falling back to remote memory. This is making the case for me.

> > So to answer "what is so different about THP?", it's the performance data.
> > The NUMA locality matters more than whether the pages are huge or not. We
> > also have the added benefit of khugepaged being able to collapse pages
> > locally if fragmentation improves rather than being stuck accessing a
> > remote hugepage forever.
>
> Please back your claims by a variety of workloads. Including mentioned
> KVMs one. You keep hand waving about access latency completely ignoring
> all other aspects and that makes my suspicious that you do not really
> appreciate all the complexity here even stronger.
>

I discussed the tradeoff of local hugepages vs local pages vs remote
hugepages in https://marc.info/?l=linux-kernel&m=154077010828940 on
Broadwell, Haswell, and Rome. When a single application does not fit on a
single node, we obviously need to extend the API to allow it to fault
remotely. We can do that without changing long-standing behavior that
prefers to only fault locally and causing real-world users to regress.
Your suggestions about how we can extend the API are all very logical.

[ Note that is not the regression being addressed here, however, which is
massive swap storms due to a fragmented local node, which is why the
__GFP_COMPACT_ONLY patch was also proposed by Andrea. The ability to
prefer faulting remotely is a worthwhile extension but it does no
good whatsoever if we can encounter massive swap storms because we
didn't set __GFP_NORETRY appropriately (which both of our patches do)
both locally and now remotely. ]

2018-12-05 20:41:32

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

Hello,

Sorry, it has been challenging to keep up with all fast replies, so
I'll start by answering to the critical result below:

On Tue, Dec 04, 2018 at 10:45:58AM +0000, Mel Gorman wrote:
> thpscale Percentage Faults Huge
> 4.20.0-rc4 4.20.0-rc4
> mmots-20181130 gfpthisnode-v1r1
> Percentage huge-3 95.14 ( 0.00%) 7.94 ( -91.65%)
> Percentage huge-5 91.28 ( 0.00%) 5.00 ( -94.52%)
> Percentage huge-7 86.87 ( 0.00%) 9.36 ( -89.22%)
> Percentage huge-12 83.36 ( 0.00%) 21.03 ( -74.78%)
> Percentage huge-18 83.04 ( 0.00%) 30.73 ( -63.00%)
> Percentage huge-24 83.74 ( 0.00%) 27.47 ( -67.20%)
> Percentage huge-30 83.66 ( 0.00%) 31.85 ( -61.93%)
> Percentage huge-32 83.89 ( 0.00%) 29.09 ( -65.32%)
>
> They're down the toilet. 3 threads are able to get 95% of the requested
> THP pages with Andrews tree as of Nov 30th. David's patch drops that to
> 8% success rate.

This is the downside of David's patch very well exposed above. And
this will make non-NUMA system regress like above too despite they
have no issue to begin with (which is probably why nobody noticed the
trouble with __GFP_THISNODE reclaim until recently, combined with the
fact most workloads can fit in a single NUMA node).

So we're effectively crippling down MADV_HUGEPAGE effectiveness on
non-NUMA (where it cannot help to do so) and on NUMA (as a workaround
for the false positive swapout storms) because in some workload and
system THP improvements are less significant than NUMA improvements.

The higher fault latency is generally the higher cost you pay to get
the good initial THP utilization for apps that do long lived
allocations and in turn can use MADV_HUGEPAGE without downsides. The
cost of compaction pays off over time.

Short lived allocations sensitive to the allocation latency should not
use MADV_HUGEPAGE in the first place. If you don't want high latency
you shouldn't use MADV_HUGEPAGE and khugepaged already uses
__GFP_THISNODE but it replaces memory so it has a neutral memory
footprint at it, so it's ok with regard to reclaim.

In my view David's workload is the outlier that uses MADV_HUGEPAGE but
pretends a low latency and NUMA local behavior as first priority. If
your workload fits in the per-socket CPU cache it doesn't matter which
node it is but it totally matters if you've 2M or 4k tlb. I'm not even
talking about KVM where THP has a multipler effect with EPT.

Even if you make the __GFP_NORETRY change for the HPAGE_PMD_ORDER to
skip reclaim in David's patch conditional NUMA being enabled in the
host (so that it won't cripple THP utilization also on non-NUMA
systems), imagine that you go in the bios, turn off interleaving to
enable host NUMA and THP utilization unexpectedly drops significantly
for your VM.

Rome ryzen architecture has been mentioned several times by David but
in my threadripper (not-Rome, as it's supposed to be available in 2019
only AFIK) enabling THP made a measurable difference for me for some
workloads. As opposed if I turn off NUMA by setting up the
interleaving in the dimm I get a barely measurable slowdown. So I'm
surprised in Rome there's such a radical difference in behavior.

Like Mel said we need to work towards a more complete solution than
putting __GFP_THISNODE from the outside and then turning off reclaim
from the inside. Mel made examples of things that should
happen, that won't increase allocation latency and that can't happen
with __GFP_THISNODE.

I'll try to describe again what's going on:

1: The allocator is being asked through __GFP_THISNODE "ignore all
remote nodes for all reclaim and compaction" from the
outside. Compaction then returns COMPACT_SKIPPED and tells the
allocator "I can generate many more huge pages if you reclaim/swapout
2M of anon memory in this node, the only reason I failed to compact
memory is because there aren't enough 4k fragmented pages free in this
zone". The allocator then goes ahead and swaps 2M and invokes
compaction again that succeeds the order 9 allocation fine. Goto 1;

The above keeps running in a loop at every additional page fault of
the app using MADV_HUGEPAGE until all RAM of the node is swapped out
and replaced by THP and all others nodes had 100% free memory,
potentially 100% order 9, but the allocator completely ignored all
other nodes. That is the thing that we're fixing here, because such
swap storms caused massive slowdowns. If the workload can't fit in a
single node, it's like running with only a fraction of the RAM.

So David's patch (and __GFP_COMPACT_ONLY) to fix the above swap storm,
inside the allocator skips reclaim entirely when compaction tells "I
can generate one more HPAGE_PMD_ORDER compound page if you
reclaim/swap 2M", if __GFP_NORETRY is set (and makes sure
__GFP_NORETRY is always set for THP). And that however prevents to
generate any more THP globally the moment any node is full of
filesystem cache.

NOTE: the filesystem cache will still be shrunk but it'll be shrunk by
4k allocations only. So we just artificially cripple compaction with
David's patch as shown in the quoted results above. This applied to
__GFP_COMPACT_ONLY too and that's I always said there's lots of margin
for improvement there and even __GFP_COMPACT_ONLY was also a stop-gap
measure.

So ultimately we decided that the saner behavior that gives the least
risk of regression for the short term, until we can do something
better, was the one that is already applied upstream.

Of course David's workload regressed, but that's because it gets a
minuscle improvement from THP, maybe it's seeking across all RAM and
it's very RAM heavy bandwidth-heavy workload so 4k or 2m tlb don't
matter at all for his workload and probably he's running it on bare
metal only.

I think the challenge here is to get David's workload optimally
without creating the above regression but we don't have a way to do it
right now in an automatic way. It's trivial however to add a mbind new
MPOL_THISNODE or MPOL_THISNODE_THP policy to force THP to set
__GFP_THISNODE and return to the swap storm behavior that needle to
say may have worked best by practically partioning the system, in fact
you may want to use __GFP_THISNODE for 4k allocations too so you
invoke reclaim from the local node before allocating RAM from the
remote nodes.

To me it doesn't seem the requirements of David's workload are the
same as for other MADV_HUGEPAGE users, I can't image other
MADV_HUGEPAGE users not to care at all if the THP utilization drops,
for David it seems THP is a nice thing to have only, and it seems to
care about allocation latency too (which normal apps using
MADV_HUGEPAGE must not).

In any case David's patch is better than reverting the revert, as the
swap storms are a showstopper compared to crippling down compaction
ability to compact memory when all nodes are full of cache.

Thanks,
Andrea

2018-12-05 22:00:51

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, 5 Dec 2018, Andrea Arcangeli wrote:

> > thpscale Percentage Faults Huge
> > 4.20.0-rc4 4.20.0-rc4
> > mmots-20181130 gfpthisnode-v1r1
> > Percentage huge-3 95.14 ( 0.00%) 7.94 ( -91.65%)
> > Percentage huge-5 91.28 ( 0.00%) 5.00 ( -94.52%)
> > Percentage huge-7 86.87 ( 0.00%) 9.36 ( -89.22%)
> > Percentage huge-12 83.36 ( 0.00%) 21.03 ( -74.78%)
> > Percentage huge-18 83.04 ( 0.00%) 30.73 ( -63.00%)
> > Percentage huge-24 83.74 ( 0.00%) 27.47 ( -67.20%)
> > Percentage huge-30 83.66 ( 0.00%) 31.85 ( -61.93%)
> > Percentage huge-32 83.89 ( 0.00%) 29.09 ( -65.32%)
> >
> > They're down the toilet. 3 threads are able to get 95% of the requested
> > THP pages with Andrews tree as of Nov 30th. David's patch drops that to
> > 8% success rate.
>
> This is the downside of David's patch very well exposed above. And
> this will make non-NUMA system regress like above too despite they
> have no issue to begin with (which is probably why nobody noticed the
> trouble with __GFP_THISNODE reclaim until recently, combined with the
> fact most workloads can fit in a single NUMA node).
>
> So we're effectively crippling down MADV_HUGEPAGE effectiveness on
> non-NUMA (where it cannot help to do so) and on NUMA (as a workaround
> for the false positive swapout storms) because in some workload and
> system THP improvements are less significant than NUMA improvements.
>

For context, you're referring to the patch I posted that is similar to
__GFP_COMPACT_ONLY and patch 2/2 in my series. It's not referring to the
revert of the 4.20-rc commit that relaxes the __GFP_THISNODE restriction
on thp faults and conflates MADV_HUGEPAGE with NUMA locality. For 4.20, I
believe at minimum that patch 1/2 should be merged to restore what we have
had for three years, stop piling more semantics on top of the intent (or
perceived intent) of MADV_HUGEPAGE, and address the swap storm issue
separately.

> The higher fault latency is generally the higher cost you pay to get
> the good initial THP utilization for apps that do long lived
> allocations and in turn can use MADV_HUGEPAGE without downsides. The
> cost of compaction pays off over time.
>
> Short lived allocations sensitive to the allocation latency should not
> use MADV_HUGEPAGE in the first place. If you don't want high latency
> you shouldn't use MADV_HUGEPAGE and khugepaged already uses
> __GFP_THISNODE but it replaces memory so it has a neutral memory
> footprint at it, so it's ok with regard to reclaim.
>

Completely agreed, and is why we want to try synchronous memory compaction
to try to allocate hugepages locally in our usecases as well. We aren't
particularly concerned about the allocation latency, that is secondary to
the long-lived access latency regression that occurs when you do not set
__GFP_THISNODE.

> In my view David's workload is the outlier that uses MADV_HUGEPAGE but
> pretends a low latency and NUMA local behavior as first priority. If
> your workload fits in the per-socket CPU cache it doesn't matter which
> node it is but it totally matters if you've 2M or 4k tlb. I'm not even
> talking about KVM where THP has a multipler effect with EPT.
>

Hm, no, we do not mind the high allocation latency for MADV_HUGEPAGE
users. We *do* care about access latency and that is due to NUMA
locality. Before commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
MADV_HUGEPAGE mappings"), *all* thp faults were done with __GFP_THISNODE
and had been for at least three years. That commit conflates
MADV_HUGEPAGE with a new semantic that it allows remote allocation instead
of what it has done for three years: try harder synchronously to allocate
hugepages locally. We obviously need to address the problem in another
way and not change long-standing behavior that causes regressions. Either
my patch 2/2, __GFP_COMPACT_ONLY, a new mempolicy mode, new madvise mode,
prctl, etc.

> Even if you make the __GFP_NORETRY change for the HPAGE_PMD_ORDER to
> skip reclaim in David's patch conditional NUMA being enabled in the
> host (so that it won't cripple THP utilization also on non-NUMA
> systems), imagine that you go in the bios, turn off interleaving to
> enable host NUMA and THP utilization unexpectedly drops significantly
> for your VM.
>

What's needed is appropriate feedback from memory compaction to determine
if reclaim is worthwhile: checking only COMPACT_DEFERRED is insufficient.
We need to determine if compaction has failed due to order-0 low watermark
checks or whether it simply failed to defragment memory so a hugepage
could be allocated. Determining if compaction has failed due to order-0
low watermark checks is harder than it seems because the reclaimed memory
may not be accessible by isolate_freepages(); we don't have the ability to
only reclaim memory from the end of the zone. Otherwise, reclaim is very
unlikely to help.

> Rome ryzen architecture has been mentioned several times by David but
> in my threadripper (not-Rome, as it's supposed to be available in 2019
> only AFIK) enabling THP made a measurable difference for me for some
> workloads. As opposed if I turn off NUMA by setting up the
> interleaving in the dimm I get a barely measurable slowdown. So I'm
> surprised in Rome there's such a radical difference in behavior.
>

We can compare Naples if that's better; accessing remote hugepages over
local pages of the native page size incurs a 13.8% latency increase
intrasocket and 38.9% latency increase intersocket. Accessing remote
hugepages over remote pages of the native page size is 2.2% better
intrasocket and unchanged intersocket.

> Like Mel said we need to work towards a more complete solution than
> putting __GFP_THISNODE from the outside and then turning off reclaim
> from the inside. Mel made examples of things that should
> happen, that won't increase allocation latency and that can't happen
> with __GFP_THISNODE.
>
> I'll try to describe again what's going on:
>
> 1: The allocator is being asked through __GFP_THISNODE "ignore all
> remote nodes for all reclaim and compaction" from the
> outside. Compaction then returns COMPACT_SKIPPED and tells the
> allocator "I can generate many more huge pages if you reclaim/swapout
> 2M of anon memory in this node, the only reason I failed to compact
> memory is because there aren't enough 4k fragmented pages free in this
> zone". The allocator then goes ahead and swaps 2M and invokes
> compaction again that succeeds the order 9 allocation fine. Goto 1;
>
> The above keeps running in a loop at every additional page fault of
> the app using MADV_HUGEPAGE until all RAM of the node is swapped out
> and replaced by THP and all others nodes had 100% free memory,
> potentially 100% order 9, but the allocator completely ignored all
> other nodes. That is the thing that we're fixing here, because such
> swap storms caused massive slowdowns. If the workload can't fit in a
> single node, it's like running with only a fraction of the RAM.
>
> So David's patch (and __GFP_COMPACT_ONLY) to fix the above swap storm,
> inside the allocator skips reclaim entirely when compaction tells "I
> can generate one more HPAGE_PMD_ORDER compound page if you
> reclaim/swap 2M", if __GFP_NORETRY is set (and makes sure
> __GFP_NORETRY is always set for THP). And that however prevents to
> generate any more THP globally the moment any node is full of
> filesystem cache.
>
> NOTE: the filesystem cache will still be shrunk but it'll be shrunk by
> 4k allocations only. So we just artificially cripple compaction with
> David's patch as shown in the quoted results above. This applied to
> __GFP_COMPACT_ONLY too and that's I always said there's lots of margin
> for improvement there and even __GFP_COMPACT_ONLY was also a stop-gap
> measure.
>

Right, this is all the same as __GFP_COMPACT_ONLY which implicitly set
__GFP_NORETRY as part of the gfp flag itself.

> So ultimately we decided that the saner behavior that gives the least
> risk of regression for the short term, until we can do something
> better, was the one that is already applied upstream.
>
> Of course David's workload regressed, but that's because it gets a
> minuscle improvement from THP, maybe it's seeking across all RAM and
> it's very RAM heavy bandwidth-heavy workload so 4k or 2m tlb don't
> matter at all for his workload and probably he's running it on bare
> metal only.
>
> I think the challenge here is to get David's workload optimally
> without creating the above regression but we don't have a way to do it
> right now in an automatic way. It's trivial however to add a mbind new
> MPOL_THISNODE or MPOL_THISNODE_THP policy to force THP to set
> __GFP_THISNODE and return to the swap storm behavior that needle to
> say may have worked best by practically partioning the system, in fact
> you may want to use __GFP_THISNODE for 4k allocations too so you
> invoke reclaim from the local node before allocating RAM from the
> remote nodes.
>

I'm not sure why we'd be changing the default behavior of three years that
causes reported regressions instead of introducing a mempolicy that allows
for the remote allocation.

This commit from 4.0:

commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1
Author: Aneesh Kumar K.V <[email protected]>
Date: Wed Feb 11 15:27:12 2015 -0800

mm/thp: allocate transparent hugepages on local node

This make sure that we try to allocate hugepages from local node if
allowed by mempolicy. If we can't, we fallback to small page allocation
based on mempolicy. This is based on the observation that allocating
pages on local node is more beneficial than allocating hugepages on remote
node.

With this patch applied we may find transparent huge page allocation
failures if the current node doesn't have enough freee hugepages. Before
this patch such failures result in us retrying the allocation on other
nodes in the numa node mask.

[[email protected]: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Is still needed, and the reason why I'm requesting we maintain that
behavior of nearly four years and not conflate MADV_HUGEPAGE any more than
it already is.

> To me it doesn't seem the requirements of David's workload are the
> same as for other MADV_HUGEPAGE users, I can't image other
> MADV_HUGEPAGE users not to care at all if the THP utilization drops,
> for David it seems THP is a nice thing to have only, and it seems to
> care about allocation latency too (which normal apps using
> MADV_HUGEPAGE must not).
>

I have repeatedly said that allocation latency is secondary to the access
latency regression that not setting __GFP_THISNODE causes with a
fragmented local node on all of our platforms.

> In any case David's patch is better than reverting the revert, as the
> swap storms are a showstopper compared to crippling down compaction
> ability to compact memory when all nodes are full of cache.
>

I'm going to propose a v2 of my two patch series, including the feedback
from Michal in the first. I suggest the first step is to restore the
behavior we have had for three years to prevent the regression I'm
reporting and the kernel test robot has reported, and then look to
improvements that can be made to prevent local thrashing for users who
would rather allocate remote hugepages (and perhaps incur the access
latency issues if their workload does not span multiple nodes) instead of
trying harder to allocate locally.

2018-12-05 22:04:51

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, Dec 5, 2018 at 12:40 PM Andrea Arcangeli <[email protected]> wrote:
>
> So ultimately we decided that the saner behavior that gives the least
> risk of regression for the short term, until we can do something
> better, was the one that is already applied upstream.

You're ignoring the fact that people *did* report things regressed.

That's the part I find unacceptable. You're saying "we picked
something that minimized regressions".

No it didn't. The regression is present and real, and is on a real
load, not a benchmark.

So that argument is clearly bogus.

I'm going to revert the commit since people apparently seem to be
ignoring this fundamental issue.

Real workloads regressed. The regressions got reported. Ignoring that
isn't acceptable.

Linus

2018-12-05 22:13:37

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, 5 Dec 2018, Linus Torvalds wrote:

> > So ultimately we decided that the saner behavior that gives the least
> > risk of regression for the short term, until we can do something
> > better, was the one that is already applied upstream.
>
> You're ignoring the fact that people *did* report things regressed.
>
> That's the part I find unacceptable. You're saying "we picked
> something that minimized regressions".
>
> No it didn't. The regression is present and real, and is on a real
> load, not a benchmark.
>
> So that argument is clearly bogus.
>
> I'm going to revert the commit since people apparently seem to be
> ignoring this fundamental issue.
>
> Real workloads regressed. The regressions got reported. Ignoring that
> isn't acceptable.
>

Please allow me to prepare my v2 because it's not a clean revert due to
the follow-up 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into
alloc_hugepage_direct_gfpmask") and will incorporate the feedback from
Michal to not change anything outside of the thp fault path.

2018-12-05 23:37:38

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, Dec 05, 2018 at 02:03:10PM -0800, Linus Torvalds wrote:
> On Wed, Dec 5, 2018 at 12:40 PM Andrea Arcangeli <[email protected]> wrote:
> >
> > So ultimately we decided that the saner behavior that gives the least
> > risk of regression for the short term, until we can do something
> > better, was the one that is already applied upstream.
>
> You're ignoring the fact that people *did* report things regressed.

I don't ignore regressions.. after all the only reason I touched this
code is that I have been asked to fix a regression that made the
upstream kernel unusable in some enterprise workloads with very large
processes. Enterprise releases don't happen every year so it's normal
we noticed only last January a 3 years old regression. The fact it's
an old regression doesn't make it any less relevant. It took until
August until I had the time to track down this specific
regression which artificially delayed this by another 8 months.

With regard to David's specific regression I didn't ignore it either,
I just prioritize on which regression has to be fixed with the most
urgency and David's regression is less severe than the one we're
fixing here. I posted below the numbers for the regression that is
more urgent to fix.

Now suppose (like I think is likely) David may be better off setting
__GFP_THISNODE across the board including for 4k pages not just for
THP. I don't think anybody would be ok if we set __GFP_THISNODE by on
4k pages too unless it's done under a very specific new MPOL. It'll
probably work even better for him probably (the cache will be pushed
into remote nodes by 4k allocations too, and even more of the app data
and executable will be in the local NUMA node). But that's unusable
for anything except his specialized workload that tends to fit in a
single node and can accept to pay an incredible slowdown if it ever
spills over (as long as the process is not getting OOM killed he's ok
because it's such an uncommon occurrence for him that he can pay an
extreme cost just to avoid OOM killing). It's totally fine to optimize
such things with an opt-in like a new MPOL that makes those
assumptions about process size, but it's that's an unacceptable
assumption to impose on all workloads, because it breaks the VM bad
for all workload that can't fit in a single NUMA node.

> That's the part I find unacceptable. You're saying "we picked
> something that minimized regressions".
>
> No it didn't. The regression is present and real, and is on a real
> load, not a benchmark.
>
> So that argument is clearly bogus.

Note that "this give the least risk of regression" I never meant the
risk is zero. Obviously we know it's higher than zero. Otherwise David
would have no regression in the first place.

So I stand by my argument that this is what "gives the least risk of
regression" if you're given any workload you know nothing about that
uses MADV_HUGEPAGE and it's benefiting from it and you don't know
beforehand if it can fit or not fit in a single NUMA node.

If you knew for sure it can fit in a single NUMA node, __GFP_THISNODE
would be better, obviously, but the same applies to 4k pages
too... and we're not setting __GFP_THISNODE on 4k allocations under
MPOL_DEFAULT.

So I'm all for fixing David's workload but here we're trying to
generalize an ad-hoc NUMA optimization that isn't necessarily only
applicable to THP order allocations either, like it's a generic good
thing when it isn't.

__GFP_COMPACT_ONLY gave an hope it could give some middle ground but
it shows awful compaction results, it basically destroys compaction
effectiveness and we know why (COMPACT_SKIPPED must call reclaim or
compaction can't succeed because there's not enough free memory in the
node). If somebody used MADV_HUGEPAGE compaction should still work and
not fail like that. Compaction would fail to be effective even in the
local node where __GFP_THISNODE didn't fail. Worst of all it'd fail
even on non-NUMA systems (that would be easy to fix though by making
the HPAGE_PMD_ORDER check conditional to NUMA being enabled at
runtime).

Like said earlier still better to apply __GFP_COMPACT_ONLY or David's
patch than to return to v4.18 though.

===
From: Andrea Arcangeli <[email protected]>
To: Andrew Morton <[email protected]>
Cc: [email protected],
Alex Williamson <[email protected]>,
David Rientjes <[email protected]>,
Vlastimil Babka <[email protected]>
Subject: [PATCH 1/1] mm: thp: fix transparent_hugepage/defrag = madvise || always
Date: Sun, 19 Aug 2018 23:26:40 -0400

qemu uses MADV_HUGEPAGE which allows direct compaction (i.e.
__GFP_DIRECT_RECLAIM is set).

The problem is that direct compaction combined with the NUMA
__GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
hard the local node, instead of failing the allocation if there's no
THP available in the local node.

Such logic was ok until __GFP_THISNODE was added to the THP allocation
path even with MPOL_DEFAULT.

The idea behind the __GFP_THISNODE addition, is that it is better to
provide local memory in PAGE_SIZE units than to use remote NUMA THP
backed memory. That largely depends on the remote latency though, on
threadrippers for example the overhead is relatively low in my
experience.

The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
extremely slow qemu startup with vfio, if the VM is larger than the
size of one host NUMA node. This is because it will try very hard to
unsuccessfully swapout get_user_pages pinned pages as result of the
__GFP_THISNODE being set, instead of falling back to PAGE_SIZE
allocations and instead of trying to allocate THP on other nodes (it
would be even worse without vfio type1 GUP pins of course, except it'd
be swapping heavily instead).

It's very easy to reproduce this by setting
transparent_hugepage/defrag to "always", even with a simple memhog.

1) This can be fixed by retaining the __GFP_THISNODE logic also for
__GFP_DIRECT_RELCAIM by allowing only one compaction run. Not even
COMPACT_SKIPPED (i.e. compaction failing because not enough free
memory in the zone) should be allowed to invoke reclaim.

2) An alternative is not use __GFP_THISNODE if __GFP_DIRECT_RELCAIM
has been set by the caller (i.e. MADV_HUGEPAGE or
defrag="always"). That would keep the NUMA locality restriction
only when __GFP_DIRECT_RECLAIM is not set by the caller. So THP
will be provided from remote nodes if available before falling back
to PAGE_SIZE units in the local node, but an app using defrag =
always (or madvise with MADV_HUGEPAGE) supposedly prefers that.

These are the results of 1) (higher GB/s is better).

Finished: 30 GB mapped, 10.188535s elapsed, 2.94GB/s
Finished: 34 GB mapped, 12.274777s elapsed, 2.77GB/s
Finished: 38 GB mapped, 13.847840s elapsed, 2.74GB/s
Finished: 42 GB mapped, 14.288587s elapsed, 2.94GB/s

Finished: 30 GB mapped, 8.907367s elapsed, 3.37GB/s
Finished: 34 GB mapped, 10.724797s elapsed, 3.17GB/s
Finished: 38 GB mapped, 14.272882s elapsed, 2.66GB/s
Finished: 42 GB mapped, 13.929525s elapsed, 3.02GB/s

These are the results of 2) (higher GB/s is better).

Finished: 30 GB mapped, 10.163159s elapsed, 2.95GB/s
Finished: 34 GB mapped, 11.806526s elapsed, 2.88GB/s
Finished: 38 GB mapped, 10.369081s elapsed, 3.66GB/s
Finished: 42 GB mapped, 12.357719s elapsed, 3.40GB/s

Finished: 30 GB mapped, 8.251396s elapsed, 3.64GB/s
Finished: 34 GB mapped, 12.093030s elapsed, 2.81GB/s
Finished: 38 GB mapped, 11.824903s elapsed, 3.21GB/s
Finished: 42 GB mapped, 15.950661s elapsed, 2.63GB/s

This is current upstream (higher GB/s is better).

Finished: 30 GB mapped, 8.821632s elapsed, 3.40GB/s
Finished: 34 GB mapped, 341.979543s elapsed, 0.10GB/s
Finished: 38 GB mapped, 761.933231s elapsed, 0.05GB/s
Finished: 42 GB mapped, 1188.409235s elapsed, 0.04GB/s

vfio is a good test because by pinning all memory it avoids the
swapping and reclaim only wastes CPU, a memhog based test would
created swapout storms and supposedly show a bigger stddev.

What is better between 1) and 2) depends on the hardware and on the
software. Virtualization EPT/NTP gets a bigger boost from THP as well
than host applications.

This commit implements 2).

Reported-by: Alex Williamson <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d6512ef28cde..fb7f9581a835 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2047,8 +2047,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,

if (!nmask || node_isset(hpage_node, *nmask)) {
mpol_cond_put(pol);
- page = __alloc_pages_node(hpage_node,
- gfp | __GFP_THISNODE, order);
+ /*
+ * We cannot invoke reclaim if __GFP_THISNODE
+ * is set. Invoking reclaim with
+ * __GFP_THISNODE set, would cause THP
+ * allocations to trigger heavy swapping
+ * despite there may be tons of free memory
+ * (including potentially plenty of THP
+ * already available in the buddy) on all the
+ * other NUMA nodes.
+ *
+ * At most we could invoke compaction when
+ * __GFP_THISNODE is set (but we would need to
+ * refrain from invoking reclaim even if
+ * compaction returned COMPACT_SKIPPED because
+ * there wasn't not enough memory to succeed
+ * compaction). For now just avoid
+ * __GFP_THISNODE instead of limiting the
+ * allocation path to a strict and single
+ * compaction invocation.
+ *
+ * Supposedly if direct reclaim was enabled by
+ * the caller, the app prefers THP regardless
+ * of the node it comes from so this would be
+ * more desiderable behavior than only
+ * providing THP originated from the local
+ * node in such case.
+ */
+ if (!(gfp & __GFP_DIRECT_RECLAIM))
+ gfp |= __GFP_THISNODE;
+ page = __alloc_pages_node(hpage_node, gfp, order);
goto out;
}
}

===

2018-12-05 23:53:19

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, Dec 5, 2018 at 3:36 PM Andrea Arcangeli <[email protected]> wrote:
>
> Like said earlier still better to apply __GFP_COMPACT_ONLY or David's
> patch than to return to v4.18 though.

Ok, I've applied David's latest patch.

I'm not at all objecting to tweaking this further, I just didn't want
to have this regression stand.

Linus

2018-12-06 00:03:17

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

Hello,

On Wed, Dec 05, 2018 at 01:59:32PM -0800, David Rientjes wrote:
> [..] and the kernel test robot has reported, [..]

Just for completeness you may have missed one email:
https://lkml.kernel.org/r/[email protected]

'So I think the report should have been a "performance
improvement" instead of "performance regression".'

2018-12-06 00:21:56

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, 5 Dec 2018, Andrea Arcangeli wrote:

> __GFP_COMPACT_ONLY gave an hope it could give some middle ground but
> it shows awful compaction results, it basically destroys compaction
> effectiveness and we know why (COMPACT_SKIPPED must call reclaim or
> compaction can't succeed because there's not enough free memory in the
> node). If somebody used MADV_HUGEPAGE compaction should still work and
> not fail like that. Compaction would fail to be effective even in the
> local node where __GFP_THISNODE didn't fail. Worst of all it'd fail
> even on non-NUMA systems (that would be easy to fix though by making
> the HPAGE_PMD_ORDER check conditional to NUMA being enabled at
> runtime).
>

Note that in addition to COMPACT_SKIPPED that you mention, compaction can
fail with COMPACT_COMPLETE, meaning the full scan has finished without
freeing a hugepage, or COMPACT_DEFERRED, meaning that doing another scan
is unlikely to produce a different result. COMPACT_SKIPPED makes sense to
do reclaim if it can become accessible to isolate_freepages() and
hopefully another allocator does not allocate from these newly freed pages
before compaction can scan the zone again. For COMPACT_COMPLETE and
COMPACT_DEFERRED, reclaim is unlikely to ever help.

2018-12-06 00:55:19

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, Dec 05, 2018 at 04:18:14PM -0800, David Rientjes wrote:
> On Wed, 5 Dec 2018, Andrea Arcangeli wrote:
>
> > __GFP_COMPACT_ONLY gave an hope it could give some middle ground but
> > it shows awful compaction results, it basically destroys compaction
> > effectiveness and we know why (COMPACT_SKIPPED must call reclaim or
> > compaction can't succeed because there's not enough free memory in the
> > node). If somebody used MADV_HUGEPAGE compaction should still work and
> > not fail like that. Compaction would fail to be effective even in the
> > local node where __GFP_THISNODE didn't fail. Worst of all it'd fail
> > even on non-NUMA systems (that would be easy to fix though by making
> > the HPAGE_PMD_ORDER check conditional to NUMA being enabled at
> > runtime).
> >
>
> Note that in addition to COMPACT_SKIPPED that you mention, compaction can
> fail with COMPACT_COMPLETE, meaning the full scan has finished without
> freeing a hugepage, or COMPACT_DEFERRED, meaning that doing another scan
> is unlikely to produce a different result. COMPACT_SKIPPED makes sense to
> do reclaim if it can become accessible to isolate_freepages() and
> hopefully another allocator does not allocate from these newly freed pages
> before compaction can scan the zone again. For COMPACT_COMPLETE and
> COMPACT_DEFERRED, reclaim is unlikely to ever help.

The COMPACT_COMPLETE and (COMPACT_PARTIAL_SKIPPED for that matter)
seems just a mistake in the max() evaluation try_to_compact_pages()
that let it return COMPACT_COMPLETE and COMPACT_PARTIAL_SKIPPED. I
think it should just return COMPACT_DEFERRED in those two cases and it
should be enforced forced for all prio.

There are really only 3 cases that matter for the caller:

1) succeed -> we got the page
2) defer -> we failed (caller won't care about why)
3) skipped -> failed because not enough 4k freed -> reclaim must be invoked then
compaction can be retried

PARTIAL_SKIPPED/COMPLETE both fall into 2) above so for the caller
they should be treated the same way. It doesn't seem very concerning
that it may try like if it succeeded and do a spurious single reclaim
invocation, but it's good to fix this and take the COMPACT_DEFERRED
nopage path in the __GFP_NORETRY case.

2018-12-06 00:59:16

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, Dec 5, 2018 at 3:51 PM Linus Torvalds
<[email protected]> wrote:
>
> Ok, I've applied David's latest patch.
>
> I'm not at all objecting to tweaking this further, I just didn't want
> to have this regression stand.

Hmm. Can somebody (David?) also perhaps try to state what the
different latency impacts end up being? I suspect it's been mentioned
several times during the argument, but it would be nice to have a
"going forward, this is what I care about" kind of setup for good
default behavior.

How much of the problem ends up being about the cost of compaction vs
the cost of getting a remote node bigpage?

That would seem to be a fairly major issue, but __GFP_THISNODE affects
both. It limits compaction to just this now, in addition to obviously
limiting the allocation result.

I realize that we probably do want to just have explicit policies that
do not exist right now, but what are (a) sane defaults, and (b) sane
policies?

For example, if we cannot get a hugepage on this node, but we *do* get
a node-local small page, is the local memory advantage simply better
than the possible TLB advantage?

Because if that's the case (at least commonly), then that in itself is
a fairly good argument for "hugepage allocations should always be
THISNODE".

But David also did mention the actual allocation overhead itself in
the commit, and maybe the math is more "try to get a local hugepage,
but if no such thing exists, see if you can get a remote hugepage
_cheaply_".

So another model can be "do local-only compaction, but allow non-local
allocation if the local node doesn't have anything". IOW, if other
nodes have hugepages available, pick them up, but don't try to compact
other nodes to do so?

And yet another model might be "do a least-effort thing, give me a
local hugepage if it exists, otherwise fall back to small pages".

So there are different combinations of "try compaction" vs "local-remote".

Linus

2018-12-06 09:15:10

[permalink] [raw]

Subject: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

On Wed 05-12-18 16:58:02, Linus Torvalds wrote:
[...]
> I realize that we probably do want to just have explicit policies that
> do not exist right now, but what are (a) sane defaults, and (b) sane
> policies?

I would focus on the current default first (which is defrag=madvise).
This means that we only try the cheapest possible THP without
MADV_HUGEPAGE. If there is none we simply fallback. We do restrict to
the local node. I guess there is a general agreement that this is a sane
default.

MADV_HUGEPAGE changes the picture because the caller expressed a need
for THP and is willing to go extra mile to get it. That involves
allocation latency and as of now also a potential remote access. We do
not have complete agreement on the later but the prevailing argument is
that any strong NUMA locality is just reinventing node-reclaim story
again or makes THP success rate down the toilet (to quote Mel). I agree
that we do not want to fallback to a remote node overeagerly. I believe
that something like the below would be sensible
1) THP on a local node with compaction not giving up too early
2) THP on a remote node in NOWAIT mode - so no direct
compaction/reclaim (trigger kswapd/kcompactd only for
defrag=defer+madvise)
3) fallback to the base page allocation

This would allow both full memory utilization and try to be as local as
possible. Whoever strongly prefers NUMA locality should be using
MPOL_NODE_RECLAIM (or similar) and that would skip 2 and make 1) and 2)
use more aggressive compaction and reclaim.

This will also fit into our existing NUMA api. MPOL_NODE_RECLAIM
wouldn't be restricted to THP obviously. It would act on base pages as
well and it would basically use the same implementation as we have for
the global node_reclaim and make it usable again.

Does this sound at least remotely sane?
--
Michal Hocko
SUSE Labs

2018-12-06 09:28:39

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On 12/6/18 1:54 AM, Andrea Arcangeli wrote:
> On Wed, Dec 05, 2018 at 04:18:14PM -0800, David Rientjes wrote:
>> On Wed, 5 Dec 2018, Andrea Arcangeli wrote:
>>
>> Note that in addition to COMPACT_SKIPPED that you mention, compaction can
>> fail with COMPACT_COMPLETE, meaning the full scan has finished without
>> freeing a hugepage, or COMPACT_DEFERRED, meaning that doing another scan
>> is unlikely to produce a different result. COMPACT_SKIPPED makes sense to
>> do reclaim if it can become accessible to isolate_freepages() and
>> hopefully another allocator does not allocate from these newly freed pages
>> before compaction can scan the zone again. For COMPACT_COMPLETE and
>> COMPACT_DEFERRED, reclaim is unlikely to ever help.
>
> The COMPACT_COMPLETE and (COMPACT_PARTIAL_SKIPPED for that matter)
> seems just a mistake in the max() evaluation try_to_compact_pages()
> that let it return COMPACT_COMPLETE and COMPACT_PARTIAL_SKIPPED. I
> think it should just return COMPACT_DEFERRED in those two cases and it
> should be enforced forced for all prio.
>
> There are really only 3 cases that matter for the caller:
>
> 1) succeed -> we got the page
> 2) defer -> we failed (caller won't care about why)
> 3) skipped -> failed because not enough 4k freed -> reclaim must be invoked then
> compaction can be retried
>
> PARTIAL_SKIPPED/COMPLETE both fall into 2) above so for the caller
> they should be treated the same way. It doesn't seem very concerning
> that it may try like if it succeeded and do a spurious single reclaim
> invocation, but it's good to fix this and take the COMPACT_DEFERRED
> nopage path in the __GFP_NORETRY case.

Yeah good point. I wouldn't change the general logic of
try_to_compact_pages() though, but the condition for __GFP_NORETRY can
simply change to:

if (compact_result != COMPACT_SKIPPED)
goto nopage;

I can make a patch ASAP together with a few others I think are needed,
that should hopefully avoid the need for __GFP_COMPACT_ONLY or checks
based on order. What's probably unavoidable though is adding back
__GFP_NORETRY for madvised allocations (i.e. partially reverting
2516035499b95), but David was fine with that and your __GFP_ONLY_COMPACT
approach effectively did it too.

2018-12-06 23:45:01

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, 5 Dec 2018, Linus Torvalds wrote:

> > Ok, I've applied David's latest patch.
> >
> > I'm not at all objecting to tweaking this further, I just didn't want
> > to have this regression stand.
>
> Hmm. Can somebody (David?) also perhaps try to state what the
> different latency impacts end up being? I suspect it's been mentioned
> several times during the argument, but it would be nice to have a
> "going forward, this is what I care about" kind of setup for good
> default behavior.
>

I'm in the process of writing a more complete test case for this but I
benchmarked a few platforms based solely on remote hugepages vs local
small pages vs remote hugepages. My previous numbers were based on data
from actual workloads.

For all platforms, local hugepages are the premium, of course.

On Broadwell, the access latency to local small pages was +5.6%, remote
hugepages +16.4%, and remote small pages +19.9%.

On Naples, the access latency to local small pages was +4.9%, intrasocket
hugepages +10.5%, intrasocket small pages +19.6%, intersocket small pages
+26.6%, and intersocket hugepages +29.2%

The results on Murano were similar, which is why I suspect Aneesh
introduced the __GFP_THISNODE requirement for thp in 4.0, which preferred,
in order, local small pages, remote 1-hop hugepages, remote 2-hop
hugepages, remote 1-hop small pages, remote 2-hop small pages.

So it *appears* from the x86 platforms that NUMA matters much more
significantly than hugeness, but remote hugepages are a slight win over
remote small pages. PPC appeared the same wrt the local node but then
prefers hugeness over affinity when it comes to remote pages.

Of course this could be much different on platforms I have not tested. I
can look at POWER9 but I suspect it will be similar to Murano.

> How much of the problem ends up being about the cost of compaction vs
> the cost of getting a remote node bigpage?
>
> That would seem to be a fairly major issue, but __GFP_THISNODE affects
> both. It limits compaction to just this now, in addition to obviously
> limiting the allocation result.
>
> I realize that we probably do want to just have explicit policies that
> do not exist right now, but what are (a) sane defaults, and (b) sane
> policies?
>

The common case is that local node allocation, whether huge or small, is
*always* better. After that, I assume than some actual measurement of
access latency at boot would be better than hardcoding a single policy in
the page allocator for everybody. On my x86 platforms, it's always a
simple preference of "try huge, try small, go to the next nearest node,
repeat". On my PPC platforms, it's "try local huge, try local small, try
huge from remaining nodes, try small from remaining nodes."

> For example, if we cannot get a hugepage on this node, but we *do* get
> a node-local small page, is the local memory advantage simply better
> than the possible TLB advantage?
>
> Because if that's the case (at least commonly), then that in itself is
> a fairly good argument for "hugepage allocations should always be
> THISNODE".
>
> But David also did mention the actual allocation overhead itself in
> the commit, and maybe the math is more "try to get a local hugepage,
> but if no such thing exists, see if you can get a remote hugepage
> _cheaply_".
>
> So another model can be "do local-only compaction, but allow non-local
> allocation if the local node doesn't have anything". IOW, if other
> nodes have hugepages available, pick them up, but don't try to compact
> other nodes to do so?
>

It would be nice if there was a specific policy that was optimal on all
platforms; since that's not the case, introducing a sane default policy is
going to require some complexity.

It would likely always make sense to allocate huge over small pages
remotely when local allocation is not possible both for MADV_HUGEPAGE
users and non-MADV_HUGEPAGE users. That would require a restructuring of
how thp fallback is done which, today, is try to allocate huge locally and
fail so handle_pte_fault() can take it from there and would obviously
touch more than just the page allocator. I *suspect* that's not all that
common because it's easier to reclaim some pages and fault local small
pages instead, which always has better access latency.

What's different in this discussion thus far is workloads that do not fit
into a single node so allocating remote hugepages is actually better than
constantly reclaiming and compacting locally. Mempolicies are
interesting, but I worry about the interaction it would have with small
page policies because you can only define one mode: we may have a
combination of default, interleave, bind, and preferred policies for huge
and small memory and that may become overly complex.

Since these workloads are in the minority and it seems, to me at least,
that it's a property of the size of the workload rather than a general
desire for remote hugepages over small pages for specific ranges of
memory.

We already have prctl(PR_SET_THP_DISABLE) which was introduced by SGI and
is inherited by child processes so that it's possible to disable hugepages
for a process where you cannot modify the binary or rebuild it. For this
particular usecase, I'd suggest adding a new prctl() mode rather than any
new madvise mode or mempolicy to prefer allocating remote hugepages as
well because the workload cannot fit into a single node.

The implementation would be quite simple, add a new per-process
PF_REMOTE_HUGEPAGE flag that is inherited across fork, and does not set
__GFP_THISNODE in alloc_pages_vma() when faulting hugepages. This would
require no change to qemu or any other binary if the execing process sets
it because it already *knows* the special requirements of that specific
workload. Andrea, would this work for you?

It also seems more extensible because prctl() modes can take arguments so
you could specify the exact allocation policy for the workload to define
whether it is willing to reclaim or compact from remote memory, for
example, during fault to get a hugepage or whether it should truly be best
effort.

2018-12-06 23:50:13

[permalink] [raw]

Subject: Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

On Thu, 6 Dec 2018, Michal Hocko wrote:

> MADV_HUGEPAGE changes the picture because the caller expressed a need
> for THP and is willing to go extra mile to get it. That involves
> allocation latency and as of now also a potential remote access. We do
> not have complete agreement on the later but the prevailing argument is
> that any strong NUMA locality is just reinventing node-reclaim story
> again or makes THP success rate down the toilet (to quote Mel). I agree
> that we do not want to fallback to a remote node overeagerly. I believe
> that something like the below would be sensible
> 1) THP on a local node with compaction not giving up too early
> 2) THP on a remote node in NOWAIT mode - so no direct
> compaction/reclaim (trigger kswapd/kcompactd only for
> defrag=defer+madvise)
> 3) fallback to the base page allocation
>

I disagree that MADV_HUGEPAGE should take on any new semantic that
overrides the preference of node local memory for a hugepage, which is the
nearly four year behavior. The order of MADV_HUGEPAGE preferences listed
above would cause current users to regress who rely on local small page
fallback rather than remote hugepages because the access latency is much
better. I think the preference of remote hugepages over local small pages
needs to be expressed differently to prevent regression.

2018-12-07 04:03:45

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Thu, Dec 6, 2018 at 3:43 PM David Rientjes <[email protected]> wrote:
>
> On Broadwell, the access latency to local small pages was +5.6%, remote
> hugepages +16.4%, and remote small pages +19.9%.
>
> On Naples, the access latency to local small pages was +4.9%, intrasocket
> hugepages +10.5%, intrasocket small pages +19.6%, intersocket small pages
> +26.6%, and intersocket hugepages +29.2%

Are those two last numbers transposed?

Or why would small page accesses be *faster* than hugepages for the
intersocket case?

Of course, depending on testing, maybe the page itself was remote, but
the page tables were random, and you happened to get a remote page
table for the hugepage case?

> The results on Murano were similar, which is why I suspect Aneesh
> introduced the __GFP_THISNODE requirement for thp in 4.0, which preferred,
> in order, local small pages, remote 1-hop hugepages, remote 2-hop
> hugepages, remote 1-hop small pages, remote 2-hop small pages.

it sounds like on the whole the TLB advantage of hugepages is smaller
than the locality advantage.

Which doesn't surprise me on x86, because TLB costs really are fairly
low. Very good TLB fills, relatively to what I've seen elsewhere.

> So it *appears* from the x86 platforms that NUMA matters much more
> significantly than hugeness, but remote hugepages are a slight win over
> remote small pages. PPC appeared the same wrt the local node but then
> prefers hugeness over affinity when it comes to remote pages.

I do think POWER at least historically has much weaker TLB fills, but
also very costly page table creation/teardown. Constant-time O(1)
arguments about hash lookups are only worth so much when the constant
time is pretty big. They've been working on it.

So at least on POWER, afaik one issue is literally that hugepages made
the hash setup and teardown situation much better.

One thing that might be worth looking at is whether the process itself
is all that node-local. Maybe we could aim for a policy that says
"prefer local memory, but if we notice that the accesses to this vma
aren't all that local, then who cares?".

IOW, the default could be something more dynamic than just "always use
__GFP_THISNODE". It could be more along the lines of "start off using
__GFP_THISNODE, but for longer-lived processes that bounce around
across nodes, maybe relax it?"

I don't think we have that kind of information right now, though, do we?

Honestly, I think things like vm_policy etc should not be the solution
- yes, some people may know *exactly* what access patterns they want,
but for most situations, I think the policy should be that defaults
"just work".

In fact, I wish even MADV_HUGEPAGE itself were to approach being a
no-op with THP.

We already have TRANSPARENT_HUGEPAGE_ALWAYS being the default kconfig
option (but I think it's a bit detbatable, because I'm not sure
everybody always agrees about memory use), so on the whole
MADV_HUGEPAGE shouldn't really *do* anything.

Linus

Attachments:

smime.p7s (5.00 kB)
S/MIME Cryptographic Signature

2018-12-07 04:34:33

[permalink] [raw]

Subject: Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

[ Oops. different thread for me due to edited subject, so I saw this
after replying to the earlier email by David ]

On Thu, Dec 6, 2018 at 1:14 AM Michal Hocko <[email protected]> wrote:
>
> MADV_HUGEPAGE changes the picture because the caller expressed a need
> for THP and is willing to go extra mile to get it.

Actually, I think MADV_HUGEPAGE should just be
"TRANSPARENT_HUGEPAGE_ALWAYS but only for this vma".

So MADV_HUGEPAGE shouldn't change any behavior at all, if the kernel
was built with TRANSPARENT_HUGEPAGE_ALWAYS.

Put another way: even if you decide to run a kernel that does *not*
have that "always THP" (beause you presumably think that it's too
blunt an instrument), then MADV_HUGEPAGE says "for _this_ vma, do the
'always THP' bebavior"

I think those semantics would be a whole lot easier to explain to
people, and perhaps more imporantly, starting off from that kind of
mindset also gives good guidance to what MADV_HUGEPAGE behavior should
be: it should be sane enough that it makes sense as the _default_
behavior for the TRANSPARENT_HUGEPAGE_ALWAYS configuration.

But that also means that no, MADV_HUGEPAGE doesn't really change the
picture. All it does is says "I know that for this vma, THP really
does make sense as a default".

It doesn't say "I _have_ to have THP", exactly like
TRANSPARENT_HUGEPAGE_ALWAYS does not mean that every allocation should
strive to be THP.

>I believe that something like the below would be sensible
> 1) THP on a local node with compaction not giving up too early
> 2) THP on a remote node in NOWAIT mode - so no direct
> compaction/reclaim (trigger kswapd/kcompactd only for
> defrag=defer+madvise)
> 3) fallback to the base page allocation

That doesn't sound insane to me. That said, the numbers David quoted
do fairly strongly imply that local small-pages are actually preferred
to any remote THP pages.

But *that* in turn makes for other possible questions:

- if the reason we couldn't get a local hugepage is that we're simply
out of local memory (huge *or* small), then maybe a remote hugepage is
better.

Note that this now implies that the choice can be an issue of "did
the hugepage allocation fail due to fragmentation, or due to the node
being low of memory"

and there is the other question that I asked in the other thread
(before subject edit):

- how local is the load to begin with?

Relatively shortlived processes - or processes that are explicitly
bound to a node - might have different preferences than some
long-lived process where the CPU bounces around, and might have
different trade-offs for the local vs remote question too.

So just based on David's numbers, and some wild handwaving on my part,
a slightly more complex, but still very sensible default might be
something like

1) try to do a cheap local node hugepage allocation

Rationale: everybody agrees this is the best case.

But if that fails:

2) look at compacting and the local node, but not very hard.

If there's lots of memory on the local node, but synchronous
compaction doesn't do anything easily, just fall back to small pages.

Rationale: local memory is generally more important than THP.

If that fails (ie local node is simply low on memory):

3) Try to do remote THP allocation

Rationale: Ok, we simply didn't have a lot of local memory, so
it's not just a question of fragmentation. If it *had* been
fragmentation, lots of small local pages would have been better than a
remote THP page.

Oops, remote THP allocation failed (possibly after synchronous
remote compaction, but maybe this is where we do kcompactd).

4) Just do any small page, and do reclaim etc. THP isn't happening,
and it's not a priority when you're starting to feel memory pressure.

In general, I really would want to avoid magic kernel command lines
(or sysfs settings, or whatever) making a huge difference in behavior.
So I really wish people would see the whole
'transparent_hugepage_flags' thing as a way for kernel developers to
try different settings, not as a way for users to tune their loads.

Our default should work as sane defaults, we shouldn't have a "ok,
let's have this sysfs tunable and let people make their own
decisions". That's a cop-out.

Btw, don't get me wrong: I'm not suggesting removing the sysfs knob.
As a debug tool, it's great, where you can ask "ok, do things work
better if you set THP-defrag to defer+madvise".

I'm just saying that we should *not* use that sysfs flag as an excuse
for "ok, if we get the default wrong, people can make their own
defaults". We should strive to do well enough that it really shouldn't
be an issue in normal situations.

Linus

Attachments:

smime.p7s (5.00 kB)
S/MIME Cryptographic Signature

2018-12-07 07:48:00

[permalink] [raw]

Subject: Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

On Thu 06-12-18 15:49:04, David Rientjes wrote:
> On Thu, 6 Dec 2018, Michal Hocko wrote:
>
> > MADV_HUGEPAGE changes the picture because the caller expressed a need
> > for THP and is willing to go extra mile to get it. That involves
> > allocation latency and as of now also a potential remote access. We do
> > not have complete agreement on the later but the prevailing argument is
> > that any strong NUMA locality is just reinventing node-reclaim story
> > again or makes THP success rate down the toilet (to quote Mel). I agree
> > that we do not want to fallback to a remote node overeagerly. I believe
> > that something like the below would be sensible
> > 1) THP on a local node with compaction not giving up too early
> > 2) THP on a remote node in NOWAIT mode - so no direct
> > compaction/reclaim (trigger kswapd/kcompactd only for
> > defrag=defer+madvise)
> > 3) fallback to the base page allocation
> >
>
> I disagree that MADV_HUGEPAGE should take on any new semantic that
> overrides the preference of node local memory for a hugepage, which is the
> nearly four year behavior. The order of MADV_HUGEPAGE preferences listed
> above would cause current users to regress who rely on local small page
> fallback rather than remote hugepages because the access latency is much
> better. I think the preference of remote hugepages over local small pages
> needs to be expressed differently to prevent regression.

Such a model would be broken. It doesn't provide consistent semantic and
leads to surprising results. MADV_HUGEPAGE with local node binding will
not prevent remote base pages to be used and you are back to square one.

It has been a huge mistake to merge your __GFP_THISNODE patch back then
in 4.1. Especially with an absolute lack of numbers for a variety of
workloads. I still believe we can do better, offer a sane mem policy to
help workloads with higher locality demands but it is outright wrong
to confalte demand for THP with the locality semantic.

If this is absolutely no go then we need a MADV_HUGEPAGE_SANE...

--
Michal Hocko
SUSE Labs

2018-12-07 07:51:00

[permalink] [raw]

Subject: Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

On Thu 06-12-18 20:31:46, Linus Torvalds wrote:
> [ Oops. different thread for me due to edited subject, so I saw this
> after replying to the earlier email by David ]

Sorry about that but I really wanted to make the actual discussion about
semantic clearly distinguished because the thread just grown too large
with back and forth that didn't lead to anywhere.

> On Thu, Dec 6, 2018 at 1:14 AM Michal Hocko <[email protected]> wrote:
> >
> > MADV_HUGEPAGE changes the picture because the caller expressed a need
> > for THP and is willing to go extra mile to get it.
>
> Actually, I think MADV_HUGEPAGE should just be
> "TRANSPARENT_HUGEPAGE_ALWAYS but only for this vma".

Yes, that is the case and I didn't want to make the description more
complicated than necessary so I've focused only on the current default.
But historically we have treated defrag=always and MADV_HUGEPAGE the
same.

[...]
> >I believe that something like the below would be sensible
> > 1) THP on a local node with compaction not giving up too early
> > 2) THP on a remote node in NOWAIT mode - so no direct
> > compaction/reclaim (trigger kswapd/kcompactd only for
> > defrag=defer+madvise)
> > 3) fallback to the base page allocation
>
> That doesn't sound insane to me. That said, the numbers David quoted
> do fairly strongly imply that local small-pages are actually preferred
> to any remote THP pages.

As I and others pointed out elsewhere remote penalty is just a part of
the picture and on its own might be quite misleading. There are other
aspects (TLB pressure, page tables overhead etc) that might amortize the
access latency.

> But *that* in turn makes for other possible questions:
>
> - if the reason we couldn't get a local hugepage is that we're simply
> out of local memory (huge *or* small), then maybe a remote hugepage is
> better.
>
> Note that this now implies that the choice can be an issue of "did
> the hugepage allocation fail due to fragmentation, or due to the node
> being low of memory"

How exactly do you tell? Many systems are simply low on memory due to
caching. A clean pagecache is quite cheap to reclaim but it can be more
expensive to fault in. Do we consider it to be a viable target?

>
> and there is the other question that I asked in the other thread
> (before subject edit):
>
> - how local is the load to begin with?
>
> Relatively shortlived processes - or processes that are explicitly
> bound to a node - might have different preferences than some
> long-lived process where the CPU bounces around, and might have
> different trade-offs for the local vs remote question too.

Agreed

> So just based on David's numbers, and some wild handwaving on my part,
> a slightly more complex, but still very sensible default might be
> something like
>
> 1) try to do a cheap local node hugepage allocation
>
> Rationale: everybody agrees this is the best case.
>
> But if that fails:
>
> 2) look at compacting and the local node, but not very hard.
>
> If there's lots of memory on the local node, but synchronous
> compaction doesn't do anything easily, just fall back to small pages.

Do we reclaim at this stage or this is mostly GFP_NOWAIT attempt?

> Rationale: local memory is generally more important than THP.
>
> If that fails (ie local node is simply low on memory):
>
> 3) Try to do remote THP allocation
>
> Rationale: Ok, we simply didn't have a lot of local memory, so
> it's not just a question of fragmentation. If it *had* been
> fragmentation, lots of small local pages would have been better than a
> remote THP page.
>
> Oops, remote THP allocation failed (possibly after synchronous
> remote compaction, but maybe this is where we do kcompactd).
>
> 4) Just do any small page, and do reclaim etc. THP isn't happening,
> and it's not a priority when you're starting to feel memory pressure.

If 2) doesn't reclaim heavily (e.g. only try to reclaim clean page
cache) or even do NOWAIT (which would be even better) then I _think_
this sounds sane.

> In general, I really would want to avoid magic kernel command lines
> (or sysfs settings, or whatever) making a huge difference in behavior.
> So I really wish people would see the whole
> 'transparent_hugepage_flags' thing as a way for kernel developers to
> try different settings, not as a way for users to tune their loads.
>
> Our default should work as sane defaults, we shouldn't have a "ok,
> let's have this sysfs tunable and let people make their own
> decisions". That's a cop-out.

Agreed. I cannot say I am happy with all the ways THP can be tuned. It
is quite confusing to say the least.

--
Michal Hocko
SUSE Labs

2018-12-07 09:07:31

[permalink] [raw]

Subject: Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

On 12/7/18 8:49 AM, Michal Hocko wrote:
>> But *that* in turn makes for other possible questions:
>>
>> - if the reason we couldn't get a local hugepage is that we're simply
>> out of local memory (huge *or* small), then maybe a remote hugepage is
>> better.
>>
>> Note that this now implies that the choice can be an issue of "did
>> the hugepage allocation fail due to fragmentation, or due to the node
>> being low of memory"
> How exactly do you tell? Many systems are simply low on memory due to
> caching. A clean pagecache is quite cheap to reclaim but it can be more
> expensive to fault in. Do we consider it to be a viable target?

Compaction can report if it failed (more precisely: was skipped) due to
low memory, or for other reasons. It doesn't distinguish how easily
reclaimable is the memory, but I don't think we should reclaim anything
(see below).

>> and there is the other question that I asked in the other thread
>> (before subject edit):
>>
>> - how local is the load to begin with?
>>
>> Relatively shortlived processes - or processes that are explicitly
>> bound to a node - might have different preferences than some
>> long-lived process where the CPU bounces around, and might have
>> different trade-offs for the local vs remote question too.
> Agreed
>
>> So just based on David's numbers, and some wild handwaving on my part,
>> a slightly more complex, but still very sensible default might be
>> something like
>>
>> 1) try to do a cheap local node hugepage allocation
>>
>> Rationale: everybody agrees this is the best case.
>>
>> But if that fails:
>>
>> 2) look at compacting and the local node, but not very hard.
>>
>> If there's lots of memory on the local node, but synchronous
>> compaction doesn't do anything easily, just fall back to small pages.
> Do we reclaim at this stage or this is mostly GFP_NOWAIT attempt?

I would expect no reclaim, because for non-THP faults we also don't
reclaim the local node before trying to allocate from remote node. If
somebody wants such behavior they can enable the node reclaim mode. THP
faults shouldn't be different in this regard, right?

2018-12-07 23:16:38

[permalink] [raw]

Subject: Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

On Fri, 7 Dec 2018, Vlastimil Babka wrote:

> >> But *that* in turn makes for other possible questions:
> >>
> >> - if the reason we couldn't get a local hugepage is that we're simply
> >> out of local memory (huge *or* small), then maybe a remote hugepage is
> >> better.
> >>
> >> Note that this now implies that the choice can be an issue of "did
> >> the hugepage allocation fail due to fragmentation, or due to the node
> >> being low of memory"
> > How exactly do you tell? Many systems are simply low on memory due to
> > caching. A clean pagecache is quite cheap to reclaim but it can be more
> > expensive to fault in. Do we consider it to be a viable target?
>
> Compaction can report if it failed (more precisely: was skipped) due to
> low memory, or for other reasons. It doesn't distinguish how easily
> reclaimable is the memory, but I don't think we should reclaim anything
> (see below).
>

Note that just reclaiming when the order-0 watermark in
__compaction_suitable() fails is unfortunately not always sufficient: it
needs to be accessible to isolate_freepages(). For order-9 memory, it's
possible for isolate_migratepages_block() to skip over a top of free pages
that were just reclaimed if there are unmovable pages preventing the
entire pageblock from being freed.

2018-12-10 00:36:36

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Thu, 6 Dec 2018, Linus Torvalds wrote:

> > On Broadwell, the access latency to local small pages was +5.6%, remote
> > hugepages +16.4%, and remote small pages +19.9%.
> >
> > On Naples, the access latency to local small pages was +4.9%, intrasocket
> > hugepages +10.5%, intrasocket small pages +19.6%, intersocket small pages
> > +26.6%, and intersocket hugepages +29.2%
>
> Are those two last numbers transposed?
>
> Or why would small page accesses be *faster* than hugepages for the
> intersocket case?
>
> Of course, depending on testing, maybe the page itself was remote, but
> the page tables were random, and you happened to get a remote page
> table for the hugepage case?
>

Yes, looks like that was the case, if the page tables were from the same
node as the intersocket remote hugepage it looks like a ~0.1% increase
accessing small pages, so basically unchanged. So this complicates the
allocation strategy somewhat; on this platform, at least, hugepages are
preferred on the same socket but there isn't a significant benefit from
getting a cross socket hugepage over small page.

The typical way this is resolved is based on the SLIT and how the kernel
defines RECLAIM_DISTANCE. I'm not sure that we can expect the distances
between proximity domains to be defined according to this value for a
one-size-fits-all solution. I've always thought that RECLAIM_DISTANCE
should be configurable so that initscripts can actually determine its
ideal value when using vm.zone_reclaim_mode.

> > So it *appears* from the x86 platforms that NUMA matters much more
> > significantly than hugeness, but remote hugepages are a slight win over
> > remote small pages. PPC appeared the same wrt the local node but then
> > prefers hugeness over affinity when it comes to remote pages.
>
> I do think POWER at least historically has much weaker TLB fills, but
> also very costly page table creation/teardown. Constant-time O(1)
> arguments about hash lookups are only worth so much when the constant
> time is pretty big. They've been working on it.
>
> So at least on POWER, afaik one issue is literally that hugepages made
> the hash setup and teardown situation much better.
>

I'm still working on the more elaborate test case that will generate these
results because I think I can use it at boot to determine an ideal
RECLAIM_DISTANCE. I can also get numbers for hash vs radix MMU if you're
interested.

> One thing that might be worth looking at is whether the process itself
> is all that node-local. Maybe we could aim for a policy that says
> "prefer local memory, but if we notice that the accesses to this vma
> aren't all that local, then who cares?".
>
> IOW, the default could be something more dynamic than just "always use
> __GFP_THISNODE". It could be more along the lines of "start off using
> __GFP_THISNODE, but for longer-lived processes that bounce around
> across nodes, maybe relax it?"
>

It would allow the use of MPOL_PREFERRED for an exact preference if they
are known to not be bounced around. This would be required for processes
that are bound to the cpus of a single node through cpuset or
sched_setaffinity() but unconstrained as far as memory is concerned.

The goal of __GFP_THISNODE being the default for thp, however, is that we
*know* we're going to be accessing it locally at least in the short term,
perhaps forever. Any other default would assume the remotely allocated
hugepage would eventually be accessed locally, otherwise we would have
been much better off just failing the hugepage allocation and accessing
small pages. You could make an assumption that's the case iff the process
does not fit in its local node, and I think that would be the minority of
applications.

I guess there could be some heuristic that could determine this based on
MM_ANONPAGES of Andrea's qemu and zone->zone_pgdat->node_present_pages.
It feels like something that should be more exactly defined, though, for
the application to say that it prefers remote hugepages over local
small pages because it can't access either locally forever anyway.

This was where I suggested a new prctl() mode so that an application can
prefer remote hugepages because it knows it's larger than the single node
and that requires no change to the binary itself because it is inherited
across fork.

The sane default, though, seems to always prefer local allocation, whether
hugepages or small pages, for the majority of workloads since that's where
the lowest access latency is.

> Honestly, I think things like vm_policy etc should not be the solution
> - yes, some people may know *exactly* what access patterns they want,
> but for most situations, I think the policy should be that defaults
> "just work".
>
> In fact, I wish even MADV_HUGEPAGE itself were to approach being a
> no-op with THP.
>

Besides the NUMA locality of the allocations, we still have the allocation
latency concern that MADV_HUGEPAGE changes. The madvise mode has taken on
two meanings: (1) prefer to fault hugepages when the thp enabled setting
is "madvise" so other applications don't blow up their rss unexpectedly,
and (2) try synchronous compaction/reclaim at fault for thp defrag
settings of "madvise" or "defer+madvise".

[ It was intended to take on an additional meaning through the
now-reverted patch, which was (3) relax the NUMA locality preference. ]

My binaries that remap their text segment to be backed by transparent
hugepages and qemu both share the same preference to try hard to fault
hugepages through compaction because we don't necessarily care about
allocation latency, we care about access latency later. Smaller binaries,
those that do not have strict NUMA locality requirements, and short-lived
allocations are not going to want to incur the performance penalty of
synchronous compaction.

So I think today's semantics for MADV_HUGEPAGE make sense, but I'd like to
explore other areas that could improve this, both for the default case and
the specialized cases:

- prctl() mode to readily allow remote allocations rather than reclaiming
or compacting memory locally (affects more than just hugepages if the
system has a non-zero vm.zone_reclaim_mode),

- better feedback loop after the first compaction attempt in the page
allocator slowpath to determine if reclaim is actually worthwhile for
high-order allocations, and

- configurable vm.reclaim_distance (use RECLAIM_DISTANCE as the default)
which can be defined since there is not a one-size-fits-all strategy
for allocations (there's no benefit to allocating hugepages cross
socket on Naples, for example),

2018-12-10 04:51:13

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

Hello,

On Sun, Dec 09, 2018 at 04:29:13PM -0800, David Rientjes wrote:
> [..] on this platform, at least, hugepages are
> preferred on the same socket but there isn't a significant benefit from
> getting a cross socket hugepage over small page. [..]

You didn't release the proprietary software that depends on
__GFP_THISNODE behavior and that you're afraid is getting a
regression.

Could you at least release with an open source license the benchmark
software that you must have used to do the above measurement to
understand why it gives such a weird result on remote THP?

On skylake and on the threadripper I can't confirm that there isn't a
significant benefit from cross socket hugepage over cross socket small
page.

Skylake Xeon(R) Gold 5115:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 15602 MB
node 0 free: 14077 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 16099 MB
node 1 free: 15949 MB
node distances:
node 0 1
0: 10 21
1: 21 10
# numactl -m 0 -C 0 ./numa-thp-bench
random writes MADV_HUGEPAGE 10109753 usec
random writes MADV_NOHUGEPAGE 13682041 usec
random writes MADV_NOHUGEPAGE 13704208 usec
random writes MADV_HUGEPAGE 10120405 usec
# numactl -m 0 -C 10 ./numa-thp-bench
random writes MADV_HUGEPAGE 15393923 usec
random writes MADV_NOHUGEPAGE 19644793 usec
random writes MADV_NOHUGEPAGE 19671287 usec
random writes MADV_HUGEPAGE 15495281 usec
# grep Xeon /proc/cpuinfo |head -1
model name : Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz

local 4k -> local 2m: +35%
local 4k -> remote 2m: -11%
remote 4k -> remote 2m: +26%

threadripper 1950x:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15982 MB
node 0 free: 14422 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16124 MB
node 1 free: 5357 MB
node distances:
node 0 1
0: 10 16
1: 16 10
# numactl -m 0 -C 0 /tmp/numa-thp-bench
random writes MADV_HUGEPAGE 12902667 usec
random writes MADV_NOHUGEPAGE 17543070 usec
random writes MADV_NOHUGEPAGE 17568858 usec
random writes MADV_HUGEPAGE 12896588 usec
# numactl -m 0 -C 8 /tmp/numa-thp-bench
random writes MADV_HUGEPAGE 19663515 usec
random writes MADV_NOHUGEPAGE 27819864 usec
random writes MADV_NOHUGEPAGE 27844066 usec
random writes MADV_HUGEPAGE 19662706 usec
# grep Threadripper /proc/cpuinfo |head -1
model name : AMD Ryzen Threadripper 1950X 16-Core Processor

local 4k -> local 2m: +35%
local 4k -> remote 2m: -10%
remote 4k -> remote 2m: +41%

Or if you prefer reversed in terms of compute time (negative
percentage is better in this case):

local 4k -> local 2m: -26%
local 4k -> remote 2m: +12%
remote 4k -> remote 2m: -29%

It's true that local 4k is generally a win vs remote THP when the
workload is memory bound also for the threadripper, the threadripper
seems even more favorable to remote THP than skylake Xeon is.

The above is the host bare metal result. Now let's try guest mode on
the threadripper. The last two lines seems more reliable (the first
two lines also needs to fault in the guest RAM because the guest
was fresh booted).

guest backed by local 2M pages:

random writes MADV_HUGEPAGE 16025855 usec
random writes MADV_NOHUGEPAGE 21903002 usec
random writes MADV_NOHUGEPAGE 19762767 usec
random writes MADV_HUGEPAGE 15189231 usec

guest backed by remote 2M pages:

random writes MADV_HUGEPAGE 25434251 usec
random writes MADV_NOHUGEPAGE 32404119 usec
random writes MADV_NOHUGEPAGE 31455592 usec
random writes MADV_HUGEPAGE 22248304 usec

guest backed by local 4k pages:

random writes MADV_HUGEPAGE 28945251 usec
random writes MADV_NOHUGEPAGE 32217690 usec
random writes MADV_NOHUGEPAGE 30664731 usec
random writes MADV_HUGEPAGE 22981082 usec

guest backed by remote 4k pages:

random writes MADV_HUGEPAGE 43772939 usec
random writes MADV_NOHUGEPAGE 52745664 usec
random writes MADV_NOHUGEPAGE 51632065 usec
random writes MADV_HUGEPAGE 40263194 usec

I haven't yet tried the guest mode on the skylake nor
haswell/broadwell. I can do that too but I don't expect a significant
difference.

On a threadripper guest, the remote 2m is practically identical to
local 4k. So shutting down compaction to try to generate local 4k
memory looks a sure loss.

Even if we ignore the guest mode results completely, if we don't make
assumption on the workload to be able to fit in the node, if I use
MADV_HUGEPAGE I think I'd prefer the risk of a -10% slowdown if the
THP page ends up in a remote node, than not getting the +41% THP
speedup on remote memory if the pagetable ends up being remote or the
4k page itself ends up being remote over time.

The cons left from your latest patch, is that you eventually also lose
the +35% speedup when compaction is clogged by COMPACT_SKIPPED, which
for a guest mode computation translates in losing the +59% speedup of
having host local THP (when guest uses 4k pages). khugepaged will
correct that by unclogging compaction but it may take hours. The idea
was to have MADV_HUGEPAGE provide THP without having to wait for
khugepaged to catch up with it.

Thanks,
Andrea

=====
/*
* numa-thp-bench.c
*
* Copyright (C) 2018 Red Hat, Inc.
*
* This work is licensed under the terms of the GNU GPL, version 2.
*/

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/time.h>

#define HPAGE_PMD_SIZE (2*1024*1024)
#define SIZE (2048UL*1024*1024-HPAGE_PMD_SIZE)
#if SIZE >= RAND_MAX
#error "SIZE >= RAND_MAX"
#endif
#define RATIO 5

int main()
{
char * p;
struct timeval before, after;
unsigned long i;

if (posix_memalign((void **) &p, HPAGE_PMD_SIZE, SIZE))
perror("posix_memalign"), exit(1);
if (madvise(p, SIZE, MADV_HUGEPAGE))
perror("madvise"), exit(1);
memset(p, 0, SIZE);
srand(100);
if (gettimeofday(&before, NULL))
perror("gettimeofday"), exit(1);
for (i = 0; i < SIZE / RATIO; i++)
p[rand() % SIZE] = 0;
if (gettimeofday(&after, NULL))
perror("gettimeofday"), exit(1);
printf("random writes MADV_HUGEPAGE %lu usec\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
munmap(p, SIZE);

if (posix_memalign((void **) &p, HPAGE_PMD_SIZE, SIZE))
perror("posix_memalign"), exit(1);
if (madvise(p, SIZE, MADV_NOHUGEPAGE))
perror("madvise"), exit(1);
memset(p, 0, SIZE);
srand(100);
if (gettimeofday(&before, NULL))
perror("gettimeofday"), exit(1);
for (i = 0; i < SIZE / RATIO; i++)
p[rand() % SIZE] = 0;
if (gettimeofday(&after, NULL))
perror("gettimeofday"), exit(1);
printf("random writes MADV_NOHUGEPAGE %lu usec\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
munmap(p, SIZE);

if (posix_memalign((void **) &p, HPAGE_PMD_SIZE, SIZE))
perror("posix_memalign"), exit(1);
if (madvise(p, SIZE, MADV_NOHUGEPAGE))
perror("madvise"), exit(1);
memset(p, 0, SIZE);
srand(100);
if (gettimeofday(&before, NULL))
perror("gettimeofday"), exit(1);
for (i = 0; i < SIZE / RATIO; i++)
p[rand() % SIZE] = 0;
if (gettimeofday(&after, NULL))
perror("gettimeofday"), exit(1);
printf("random writes MADV_NOHUGEPAGE %lu usec\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
munmap(p, SIZE);

if (posix_memalign((void **) &p, HPAGE_PMD_SIZE, SIZE))
perror("posix_memalign"), exit(1);
if (madvise(p, SIZE, MADV_HUGEPAGE))
perror("madvise"), exit(1);
memset(p, 0, SIZE);
srand(100);
if (gettimeofday(&before, NULL))
perror("gettimeofday"), exit(1);
for (i = 0; i < SIZE / RATIO; i++)
p[rand() % SIZE] = 0;
if (gettimeofday(&after, NULL))
perror("gettimeofday"), exit(1);
printf("random writes MADV_HUGEPAGE %lu usec\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
munmap(p, SIZE);

return 0;
}

2018-12-12 00:39:43

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Sun, 9 Dec 2018, Andrea Arcangeli wrote:

> You didn't release the proprietary software that depends on
> __GFP_THISNODE behavior and that you're afraid is getting a
> regression.
>
> Could you at least release with an open source license the benchmark
> software that you must have used to do the above measurement to
> understand why it gives such a weird result on remote THP?
>

Hi Andrea,

As I said in response to Linus, I'm in the process of writing a more
complete benchmarking test across all of our platforms for access and
allocation latency for x86 (both Intel and AMD), POWER8/9, and arm64, and
doing so on a kernel with minimum overhead (for the allocation latency, I
want to remove things like mem cgroup overhead from the result).

> On skylake and on the threadripper I can't confirm that there isn't a
> significant benefit from cross socket hugepage over cross socket small
> page.
>
> Skylake Xeon(R) Gold 5115:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
> node 0 size: 15602 MB
> node 0 free: 14077 MB
> node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
> node 1 size: 16099 MB
> node 1 free: 15949 MB
> node distances:
> node 0 1
> 0: 10 21
> 1: 21 10
> # numactl -m 0 -C 0 ./numa-thp-bench
> random writes MADV_HUGEPAGE 10109753 usec
> random writes MADV_NOHUGEPAGE 13682041 usec
> random writes MADV_NOHUGEPAGE 13704208 usec
> random writes MADV_HUGEPAGE 10120405 usec
> # numactl -m 0 -C 10 ./numa-thp-bench
> random writes MADV_HUGEPAGE 15393923 usec
> random writes MADV_NOHUGEPAGE 19644793 usec
> random writes MADV_NOHUGEPAGE 19671287 usec
> random writes MADV_HUGEPAGE 15495281 usec
> # grep Xeon /proc/cpuinfo |head -1
> model name : Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
>
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -11%
> remote 4k -> remote 2m: +26%
>
> threadripper 1950x:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 0 size: 15982 MB
> node 0 free: 14422 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node 1 size: 16124 MB
> node 1 free: 5357 MB
> node distances:
> node 0 1
> 0: 10 16
> 1: 16 10
> # numactl -m 0 -C 0 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 12902667 usec
> random writes MADV_NOHUGEPAGE 17543070 usec
> random writes MADV_NOHUGEPAGE 17568858 usec
> random writes MADV_HUGEPAGE 12896588 usec
> # numactl -m 0 -C 8 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 19663515 usec
> random writes MADV_NOHUGEPAGE 27819864 usec
> random writes MADV_NOHUGEPAGE 27844066 usec
> random writes MADV_HUGEPAGE 19662706 usec
> # grep Threadripper /proc/cpuinfo |head -1
> model name : AMD Ryzen Threadripper 1950X 16-Core Processor
>
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -10%
> remote 4k -> remote 2m: +41%
>
> Or if you prefer reversed in terms of compute time (negative
> percentage is better in this case):
>
> local 4k -> local 2m: -26%
> local 4k -> remote 2m: +12%
> remote 4k -> remote 2m: -29%
>
> It's true that local 4k is generally a win vs remote THP when the
> workload is memory bound also for the threadripper, the threadripper
> seems even more favorable to remote THP than skylake Xeon is.
>

My results are organized slightly different since it considers local
hugepages as the baseline and is what we optimize for: on Broadwell, I've
obtained more accurate results that show local small pages at +3.8%,
remote hugepages at +12.8% and remote small pages at +18.8%. I think we
both agree that the locality preference for workloads that fit within a
single node is local hugepage -> local small page -> remote hugepage ->
remote small page, and that has been unchanged in any of benchmarking
results for either of us.

> The above is the host bare metal result. Now let's try guest mode on
> the threadripper. The last two lines seems more reliable (the first
> two lines also needs to fault in the guest RAM because the guest
> was fresh booted).
>
> guest backed by local 2M pages:
>
> random writes MADV_HUGEPAGE 16025855 usec
> random writes MADV_NOHUGEPAGE 21903002 usec
> random writes MADV_NOHUGEPAGE 19762767 usec
> random writes MADV_HUGEPAGE 15189231 usec
>
> guest backed by remote 2M pages:
>
> random writes MADV_HUGEPAGE 25434251 usec
> random writes MADV_NOHUGEPAGE 32404119 usec
> random writes MADV_NOHUGEPAGE 31455592 usec
> random writes MADV_HUGEPAGE 22248304 usec
>
> guest backed by local 4k pages:
>
> random writes MADV_HUGEPAGE 28945251 usec
> random writes MADV_NOHUGEPAGE 32217690 usec
> random writes MADV_NOHUGEPAGE 30664731 usec
> random writes MADV_HUGEPAGE 22981082 usec
>
> guest backed by remote 4k pages:
>
> random writes MADV_HUGEPAGE 43772939 usec
> random writes MADV_NOHUGEPAGE 52745664 usec
> random writes MADV_NOHUGEPAGE 51632065 usec
> random writes MADV_HUGEPAGE 40263194 usec
>
> I haven't yet tried the guest mode on the skylake nor
> haswell/broadwell. I can do that too but I don't expect a significant
> difference.
>
> On a threadripper guest, the remote 2m is practically identical to
> local 4k. So shutting down compaction to try to generate local 4k
> memory looks a sure loss.
>

I'm assuming your results above are with a defrag setting of "madvise" or
"defer+madvise".

> Even if we ignore the guest mode results completely, if we don't make
> assumption on the workload to be able to fit in the node, if I use
> MADV_HUGEPAGE I think I'd prefer the risk of a -10% slowdown if the
> THP page ends up in a remote node, than not getting the +41% THP
> speedup on remote memory if the pagetable ends up being remote or the
> 4k page itself ends up being remote over time.
>

I'm agreeing with you that the preference for remote hugepages over local
small pages depends on the configuration and the workload that you are
running and there are clear advantages and disadvantages to both. This is
different than what the long-standing NUMA preferences have been for thp
allocations.

I think we can optimize for *both* usecases without causing an unnecessary
regression for other and doing so is not extremely complex.

Since it depends on the workload, specifically workloads that fit within a
single node, I think the reasonable approach would be to have a sane
default regardless of the use of MADV_HUGEPAGE or thp defrag settings and
then optimzie for the minority of cases where the workload does not fit in
a single node. I'm assuming there is no debate about these larger
workloads being in the minority, although we have single machines where
this encompasses the totality of their workloads.

Regarding the role of direct reclaim in the allocator, I think we need
work on the feedback from compaction to determine whether it's worthwhile.
That's difficult because of the point I continue to bring up:
isolate_freepages() is not necessarily always able to access this freed
memory. But for cases where we get COMPACT_SKIPPED because the order-0
watermarks are failing, reclaim *is* likely to have an impact in the
success of compaction, otherwise we fail and defer because it wasn't able
to make a hugepage available.

[ If we run compaction regardless of the order-0 watermark check and find
a pageblock where we can likely free a hugepage because it is
fragmented movable pages, this is a pretty good indication that reclaim
is worthwhile iff the reclaimed memory is beyond the migration scanner. ]

Let me try to list out what I think is a reasonable design for the various
configs assuming we are able to address the reclaim concern above. Note
that this is for the majority of users where workloads do not span
multiple nodes:

- defrag=always: always compact, obviously

- defrag=madvise/defer+madvise:

- MADV_HUGEPAGE: always compact locally, fallback to small pages
locally (small pages become eligible for khugepaged to collapse
locally later, no chance of additional access latency)

- neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE: kick kcompactd locally,
fallback to small pages locally

- defrag=defer: kick kcompactd locally, fallback to small pages locally

- defrag=never: fallback to small pages locally

And that, AFAICT, has been the implementation for almost four years.

For workloads that *can* span multiple nodes, this doesn't make much
sense, as you point out and have reported in your bug. Considering the
reclaim problem separately where we thrash a node unnecessarily, if we
consider only hugepages and NUMA locality:

- defrag=always: always compact for all allowed zones, zonelist ordered
according to NUMA locality

- defrag=madvise/defer+madvise:

- MADV_HUGEPAGE: always compact for all allowed zones, try to allocate
hugepages in zonelist order, only fallback to small pages when
compaction fails

- neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE: kick kcompactd for all
allowed zones, fallback to small pages locally

- defrag=defer: kick kcompactd for all allowed zones, fallback to small
pages locally

- defrag=never: fallback to small pages locally

For this policy to be possible, we must clear __GFP_THISNODE. How to
determine when to do this? I think we have three options: heuristics (rss
vs zone managed pages), per-process prctl(), or global thp setting for
machine-wide behavior.

I've been suggesting a per-process prctl() that can be set and carried
across fork so that there are no changes needed to any workload and can
simply special-case the thp allocation policy to use __GFP_THISNODE, which
is the default for bare metal, and to not use it when we've said the
workload will span multiple nodes. Depending on the size of the workload,
it may choose to use this setting on certain systems and not others.

2018-12-12 09:52:05

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Tue 11-12-18 16:37:22, David Rientjes wrote:
[...]
> Since it depends on the workload, specifically workloads that fit within a
> single node, I think the reasonable approach would be to have a sane
> default regardless of the use of MADV_HUGEPAGE or thp defrag settings and
> then optimzie for the minority of cases where the workload does not fit in
> a single node. I'm assuming there is no debate about these larger
> workloads being in the minority, although we have single machines where
> this encompasses the totality of their workloads.

Your assumption is wrong I believe. This is the fundamental disagreement
we are discussing here. You are essentially arguing for node_reclaim
(formerly zone_reclaim) behavior for THP pages. All that without any
actual data on wider variety of workloads. As the matter of _fact_ we
know that node_reclaim behavior is not a suitable default. We did
that mistake in the past and we had to revert that default _exactly_
because a wider variety of workloads suffered from over reclaim and
performance issues as a result of constant reclaim

You have also haven't explained why you do care so much about remote THP
while you do not care about remote base bages (the page allocator
falls back to those as soon as the kswapd doesn't keep pace with the
allocation rate; THP or high order pages in general is analoguous with
kcompactd doing a pro-active compaction). Like the base pages we do not
want larger pages to fallback to a remote node too easily. There is no
question about that I believe.

I can be convinced that larger pages really require a different behavior
than base pages but you should better show _real_ numbers on a wider
variety workloads to back your claims. I have only heard hand waving and
a very vague and quite doubtful numbers for a non-disclosed benchmark
without a clear indication on how it relates to real world workloads. So
color me unconvinced.
--
Michal Hocko
SUSE Labs

2018-12-12 10:19:24

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On 12/12/18 1:37 AM, David Rientjes wrote:
>
> Regarding the role of direct reclaim in the allocator, I think we need
> work on the feedback from compaction to determine whether it's worthwhile.
> That's difficult because of the point I continue to bring up:
> isolate_freepages() is not necessarily always able to access this freed
> memory.

That's one of the *many* reasons why having free base pages doesn't
guarantee compaction success. We can and will improve on that. But I
don't think it would be e.g. practical to check the pfns of free pages
wrt compaction scanner positions and decide based on that. Also when you
invoke reclaim, you can't tell in advance those pfns, so I'm not sure
how the better feedback from compaction to reclaim for this specific
aspect would be supposed to work?

> But for cases where we get COMPACT_SKIPPED because the order-0
> watermarks are failing, reclaim *is* likely to have an impact in the
> success of compaction,

Yes that's the heuristic we rely on.

> otherwise we fail and defer because it wasn't able
> to make a hugepage available.

Note that THP fault compaction doesn't actually defer itself, which I
think is a weakness of the current implementation and hope that patch 3
in my series from yesterday [1] can address that. Because defering is
the general feedback mechanism that we have for suppressing compaction
(and thus associated reclaim) in cases it fails for any reason, not just
the one you mention. Instead of inspecting failure conditions in detail,
which would be costly, it's a simple statistical approach. And when
compaction is improved to fail less, defering automatically also happens
less.

> [ If we run compaction regardless of the order-0 watermark check and find
> a pageblock where we can likely free a hugepage because it is
> fragmented movable pages, this is a pretty good indication that reclaim
> is worthwhile iff the reclaimed memory is beyond the migration scanner. ]

I don't think that would be a good direction to pursue, to let scanning
happen even without having the free pages. Also as I've mentioned above,
LRU-based reclaim cannot satisfy your 'iff' condition, unless it
inspected the pfn's it freed, and continued reclaiming until enough of
those beyond migration scanner were freed. Instead IMHO we should look
again into replacing the free scanner with direct allocation from freelists.

[1] https://lkml.kernel.org/r/[email protected]

2018-12-12 10:45:47

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

Hello,

I now found a two socket EPYC (is this Naples?) to try to confirm the
THP effect of intra-socket THP.

CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 8
NUMA node0 CPU(s): 0-7,64-71
NUMA node1 CPU(s): 8-15,72-79
NUMA node2 CPU(s): 16-23,80-87
NUMA node3 CPU(s): 24-31,88-95
NUMA node4 CPU(s): 32-39,96-103
NUMA node5 CPU(s): 40-47,104-111
NUMA node6 CPU(s): 48-55,112-119
NUMA node7 CPU(s): 56-63,120-127

# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 32658 MB
node 0 free: 31554 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 32767 MB
node 1 free: 31854 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 32767 MB
node 2 free: 31535 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 32767 MB
node 3 free: 31777 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 32767 MB
node 4 free: 31949 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 32767 MB
node 5 free: 31957 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 32767 MB
node 6 free: 31945 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 32767 MB
node 7 free: 31958 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 16 32 32 32 32
1: 16 10 16 16 32 32 32 32
2: 16 16 10 16 32 32 32 32
3: 16 16 16 10 32 32 32 32
4: 32 32 32 32 10 16 16 16
5: 32 32 32 32 16 10 16 16
6: 32 32 32 32 16 16 10 16
7: 32 32 32 32 16 16 16 10
# for i in 0 8 16 24 32 40 48 56; do numactl -m 0 -C $i /tmp/numa-thp-bench; done
random writes MADV_HUGEPAGE 17622885 usec
random writes MADV_NOHUGEPAGE 25316593 usec
random writes MADV_NOHUGEPAGE 25291927 usec
random writes MADV_HUGEPAGE 17672446 usec
random writes MADV_HUGEPAGE 25698555 usec
random writes MADV_NOHUGEPAGE 36413941 usec
random writes MADV_NOHUGEPAGE 36402155 usec
random writes MADV_HUGEPAGE 25689574 usec
random writes MADV_HUGEPAGE 25136558 usec
random writes MADV_NOHUGEPAGE 35562724 usec
random writes MADV_NOHUGEPAGE 35504708 usec
random writes MADV_HUGEPAGE 25123186 usec
random writes MADV_HUGEPAGE 25137002 usec
random writes MADV_NOHUGEPAGE 35577429 usec
random writes MADV_NOHUGEPAGE 35582865 usec
random writes MADV_HUGEPAGE 25116561 usec
random writes MADV_HUGEPAGE 40281721 usec
random writes MADV_NOHUGEPAGE 56891233 usec
random writes MADV_NOHUGEPAGE 56924134 usec
random writes MADV_HUGEPAGE 40286512 usec
random writes MADV_HUGEPAGE 40377662 usec
random writes MADV_NOHUGEPAGE 56731400 usec
random writes MADV_NOHUGEPAGE 56443959 usec
random writes MADV_HUGEPAGE 40379022 usec
random writes MADV_HUGEPAGE 33907588 usec
random writes MADV_NOHUGEPAGE 47609976 usec
random writes MADV_NOHUGEPAGE 47523481 usec
random writes MADV_HUGEPAGE 33881974 usec
random writes MADV_HUGEPAGE 40809719 usec
random writes MADV_NOHUGEPAGE 57148321 usec
random writes MADV_NOHUGEPAGE 57164499 usec
random writes MADV_HUGEPAGE 40802979 usec
# grep EPYC /proc/cpuinfo |head -1
model name : AMD EPYC 7601 32-Core Processor

I suppose node 0-1-2-3 are socket 0 and node 4-5-6-7 are socket 1.

With the ram kept in nodeid 0, cpuid 0 is NUMA local, cpuid 8,16,24
are NUMA intrasocket remote and cpuid 32 40 48 56 are NUMA
intersocket remote.

local 4k -> local THP: +43.6% improvement

local 4k -> intersocket remote THP: -1.4%
intersocket remote 4k -> intersocket remote THP: +41.6%

local 4k -> intersocket remote 4k: -30.4%
local THP -> intersocket remote THP: -31.4%

local 4k -> intrasocket remote THP: -37.15% (-25% on node 6?)
intrasocket remote 4k -> intrasocket remote THP: +41.23%

local 4k -> intrasocket remote 4k: -55.5% (-46% on node 6?)
local THP -> intrasocket remote THP: -56.25% (-47% on node 6?)

In short intrasocket is a whole lot more expensive (4k -55% THP -56%)
than intersocket (4k -30% THP -31%)... as expected. The benefits of
THP vs 4k remains the same for intrasocket (+41.23%) and intersocket
(+41.6%) and local (+43.6%), also as expected.

The above was measured on bare metal on guests the impact of THP as
usual will be multiplied (I can try to measure that another time).

So while before I couldn't confirm that THP didn't help intersocket, I
think I can confirm it helps just like intrasocket and local now on
this architecture.

Especially intresocket the slowdown from remote THP compared to local
4k is a tiny -1% so in theory __GFP_THISNODE would at least need to
switch to __GFP_THISSOCKET for this architecture.. (I'm not suggesting
that, I'm talking in theory). Intersocket is even more favorable than
a 2 node 1 socket threadripper and a 2 node (2 sockets?) skylake in
fact even on bare metal.

Losing the +41% THP benefit across all distances makes __GFP_THISNODE
again questionable optimization, it only ever pays off in the
intersocket case (-37% is an increase of compute time of +59% which is
pretty bad and we should definitely have some logic that optimizes for
it). However eliminating the possibility of remote THP doesn't look
good, especially because the 4k allocation may end up being remote too
if the node is full of cache (there is no __GFP_THISNODE set on 4k
allocations...).

__GFP_THISNODE made MADV_HUGEPAGE act a bit like a NUMA node reclaim
hack embedded in MADV_HUGEPAGE: if you didn't set MADV_HUGEPAGE you'd
potentially get more intrasocket remote 4k pages instead of local THPs
thanks to the __GFP_THISNODE reclaim.

I think with the current __GFP_THISNODE and HPAGE_PMD_SIZE hardcoding,
you'll get a lot less local node memory when the local node is full of
clean pagecache, because 4k allocations have no __GFP_THISNODE set, so
you'll get a lot more of intrasocket remote 4k even when there's a ton
of +41% faster intrasocket remote THP memory available. This is
because it the new safe __GFP_THISNODE won't shrink the cache without
mbind anymore.

NUMA balancing must be enabled in a system like above and it should be
able to optimize a lack of convergence as long as the workload can fit
in a node (and if the workload doesn't fit in a node __GFP_THISNODE
will backfire anyway).

We need to keep thinking at how to optimize the case of preferring
local 4k to remote THP in a way that won't backfire the moment the
task in the CPU is moved to a different node, but __GFP_THISNODE set
in the only single attempt of allocating THP memory in a THP fault,
isn't the direction because without a __GFP_THISNODE-reclaim, less
local 4k memory will be allocated once the node is full of cache and
the remote THP will be totally ignored when they would perform better
than remote 4k. Ignoring remote THP isn't the purpose of
MADV_HUGEPAGE, quite the contrary.

Thanks,
Andrea

2018-12-12 17:11:32

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, Dec 12, 2018 at 10:50:51AM +0100, Michal Hocko wrote:
> I can be convinced that larger pages really require a different behavior
> than base pages but you should better show _real_ numbers on a wider
> variety workloads to back your claims. I have only heard hand waving and

I agree with your point about node_reclaim and I think David complaint
of "I got remote THP instead of local 4k" with our proposed fix, is
going to morph into "I got remote 4k instead of local 4k" with his
favorite fix.

Because David stopped calling reclaim with __GFP_THISNODE, the moment
the node is full of pagecache, node_reclaim behavior will go away and
even 4k pages will start to be allocated remote (and because of
__GFP_THISNODE set in the THP allocation, all readily available or
trivial to compact remote THP will be ignored too).

What David needs I think is a way to set __GFP_THISNODE for THP *and
4k* allocations and if both fails in a row with __GFP_THISNODE set, we
need to repeat the whole thing without __GFP_THISNODE set (ideally
with a mask to skip the node that we already scraped down to the
bottom during the initial __GFP_THISNODE pass). This way his
proprietary software binary will work even better than before when the
local node is fragmented and he'll finally be able to get the speedup
from remote THP too in case the local node is truly OOM, but all other
nodes are full of readily available THP.

To achieve this without a new MADV_THISNODE/MADV_NODE_RECLAIM, we'd
need a way to start with __GFP_THISNODE and then draw the line in
reclaim and decide to drop __GFP_THISNODE when too much pressure
mounts in the local node, but like you said it becomes like
node_reclaim and it would be better if it can be done with an opt-in,
like MADV_HUGEPAGE because not all workloads would benefit from such
extra pagecache reclaim cost (like not all workload benefits from
synchronous compaction).

I think some NUMA reclaim mode semantics ended up being embedded and
hidden in the THP MADV_HUGEPAGE, but they imposed massive slowdown to
all workloads that can't cope with the node_reclaim mode behavior
because they don't fit in a node.

Adding MADV_THISNODE/MADV_NODE_RECLAIM, will guarantee his proprietary
software binary will run at maximum performance without cache
interference, and he's happy to accept the risk of massive slowdown in
case the local node is truly OOM. The fallback, despite very
inefficient, will still happen without OOM killer triggering.

Thanks,
Andrea

2018-12-14 11:33:57

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed 12-12-18 12:00:16, Andrea Arcangeli wrote:
[...]
> Adding MADV_THISNODE/MADV_NODE_RECLAIM, will guarantee his proprietary
> software binary will run at maximum performance without cache
> interference, and he's happy to accept the risk of massive slowdown in
> case the local node is truly OOM. The fallback, despite very
> inefficient, will still happen without OOM killer triggering.

I believe this fits much better into a MPOL_$FOO rather than MADV_$FOO.
But other than that I full agree. There are reasonable usecases for the
node reclaim like behavior. As a bonus you do not get local node only
but all nodes within reclaim distance as well.
--
Michal Hocko
SUSE Labs

2018-12-14 21:05:50

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Wed, 12 Dec 2018, Vlastimil Babka wrote:

> > Regarding the role of direct reclaim in the allocator, I think we need
> > work on the feedback from compaction to determine whether it's worthwhile.
> > That's difficult because of the point I continue to bring up:
> > isolate_freepages() is not necessarily always able to access this freed
> > memory.
>
> That's one of the *many* reasons why having free base pages doesn't
> guarantee compaction success. We can and will improve on that. But I
> don't think it would be e.g. practical to check the pfns of free pages
> wrt compaction scanner positions and decide based on that.

Yeah, agreed. Rather than proposing that memory is only reclaimed if its
known that it can be accessible to isolate_freepages(), I'm wondering
about the implementation of the freeing scanner entirely.

In other words, I think there is a lot of potential stranding that occurs
for both scanners that could otherwise result in completely free
pageblocks. If there a single movable page present near the end of the
zone in an otherwise fully free pageblock, surely we can do better than
the current implementation that would never consider this very easy to
compact memory.

For hugepages, we don't care what pageblock we allocate from. There are
requirements for MAX_ORDER-1, but I assume we shouldn't optimize for these
cases (and if CMA has requirements for a migration/freeing scanner
redesign, I think that can be special cased).

The same problem occurs for the migration scanner where we can iterate
over a ton of free memory that is never considered a suitable migration
target. The implementation that attempts to migrate all memory toward the
end of the zone penalizes the freeing scanner when it is reset: we just
iterate over a ton of used pages.

Reclaim likely could be deterministically useful if we consider a redesign
of how migration sources and targets are determined in compaction.

Has anybody tried a migration scanner that isn't linearly based, rather
finding the highest-order free page of the same migratetype, iterating the
pages of its pageblock, and using this to determine whether the actual
migration will be worthwhile or not? I could imagine pageblock_skip being
repurposed for this as the heuristic.

Finding migration targets would be more tricky, but if we iterate the
pages of the pageblock for low-order free pages and find them to be mostly
used, that seems more appropriate than just pushing all memory to the end
of the zone?

It would be interesting to know if anybody has tried using the per-zone
free_area's to determine migration targets and set a bit if it should be
considered a migration source or a migration target. If all pages for a
pageblock are not on free_areas, they are fully used.

> > otherwise we fail and defer because it wasn't able
> > to make a hugepage available.
>
> Note that THP fault compaction doesn't actually defer itself, which I
> think is a weakness of the current implementation and hope that patch 3
> in my series from yesterday [1] can address that. Because defering is
> the general feedback mechanism that we have for suppressing compaction
> (and thus associated reclaim) in cases it fails for any reason, not just
> the one you mention. Instead of inspecting failure conditions in detail,
> which would be costly, it's a simple statistical approach. And when
> compaction is improved to fail less, defering automatically also happens
> less.
>

I couldn't get the link to work, unfortunately, I don't think the patch
series made it to LKML :/ I do see it archived for linux-mm, though, so
I'll take a look, thanks!

> [1] https://lkml.kernel.org/r/[email protected]
>

2018-12-14 21:38:03

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On 12/14/18 10:04 PM, David Rientjes wrote:
> On Wed, 12 Dec 2018, Vlastimil Babka wrote:

...

> Reclaim likely could be deterministically useful if we consider a redesign
> of how migration sources and targets are determined in compaction.
>
> Has anybody tried a migration scanner that isn't linearly based, rather
> finding the highest-order free page of the same migratetype, iterating the
> pages of its pageblock, and using this to determine whether the actual
> migration will be worthwhile or not?

Not exactly that AFAIK, but a year ago in my series [1] patch 6 made
migration scanner 'prescan' the block of requested order before actually
trying to isolate anything for migration.

> I could imagine pageblock_skip being
> repurposed for this as the heuristic.
>
> Finding migration targets would be more tricky, but if we iterate the
> pages of the pageblock for low-order free pages and find them to be mostly
> used, that seems more appropriate than just pushing all memory to the end
> of the zone?

Agree. That was patch 8/8 of the same series [1].

> It would be interesting to know if anybody has tried using the per-zone
> free_area's to determine migration targets and set a bit if it should be
> considered a migration source or a migration target. If all pages for a
> pageblock are not on free_areas, they are fully used.

Repurposing/adding a new pageblock bit was in my mind to help multiple
compactors not undo each other's work in the scheme where there's no
free page scanner, but I didn't implement it yet.

>>> otherwise we fail and defer because it wasn't able
>>> to make a hugepage available.
>>
>> Note that THP fault compaction doesn't actually defer itself, which I
>> think is a weakness of the current implementation and hope that patch 3
>> in my series from yesterday [1] can address that. Because defering is
>> the general feedback mechanism that we have for suppressing compaction
>> (and thus associated reclaim) in cases it fails for any reason, not just
>> the one you mention. Instead of inspecting failure conditions in detail,
>> which would be costly, it's a simple statistical approach. And when
>> compaction is improved to fail less, defering automatically also happens
>> less.
>>
>
> I couldn't get the link to work, unfortunately, I don't think the patch
> series made it to LKML :/ I do see it archived for linux-mm, though, so
> I'll take a look, thanks!

Yeah I forgot to Cc: LKML, but you were also in direct To: so you should
have received them directly. Also the abovementioned series, but that's
year ago. My fault for not returning to it after being done with the
Meltdown fun. I hope to do that soon.

[1] https://marc.info/?l=linux-mm&m=151315560308753

>> [1] https://lkml.kernel.org/r/[email protected]
>>

2018-12-14 23:12:57

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Fri, Dec 14, 2018 at 01:04:11PM -0800, David Rientjes wrote:
> On Wed, 12 Dec 2018, Vlastimil Babka wrote:
>
> > > Regarding the role of direct reclaim in the allocator, I think we need
> > > work on the feedback from compaction to determine whether it's worthwhile.
> > > That's difficult because of the point I continue to bring up:
> > > isolate_freepages() is not necessarily always able to access this freed
> > > memory.
> >
> > That's one of the *many* reasons why having free base pages doesn't
> > guarantee compaction success. We can and will improve on that. But I
> > don't think it would be e.g. practical to check the pfns of free pages
> > wrt compaction scanner positions and decide based on that.
>
> Yeah, agreed. Rather than proposing that memory is only reclaimed if its
> known that it can be accessible to isolate_freepages(), I'm wondering
> about the implementation of the freeing scanner entirely.
>
> In other words, I think there is a lot of potential stranding that occurs
> for both scanners that could otherwise result in completely free
> pageblocks. If there a single movable page present near the end of the
> zone in an otherwise fully free pageblock, surely we can do better than
> the current implementation that would never consider this very easy to
> compact memory.
>

While it's somewhat premature, I posted a series before I had a full set
of results because it uses free lists to reduce searches and reduces
inference between multiple scanners. Preliminary results indicated it
boosted allocation success rates by 20%ish, reduced migration scanning
by 99% and free scanning by 27%.

> The same problem occurs for the migration scanner where we can iterate
> over a ton of free memory that is never considered a suitable migration
> target. The implementation that attempts to migrate all memory toward the
> end of the zone penalizes the freeing scanner when it is reset: we just
> iterate over a ton of used pages.
>

Yes, partially addressed in series. It can be improved significantly but it
hit a boundary condition near the points where compaction scanners meet. I
dropped the patch in question as it needs more thought on how to deal
with the boundary condition without remigrating the blocks close to it.
Besides, at 14 patches, it would probably be best to get that reviewed
and finalised before building upon it further so review would be welcome.

> Has anybody tried a migration scanner that isn't linearly based, rather
> finding the highest-order free page of the same migratetype, iterating the
> pages of its pageblock, and using this to determine whether the actual
> migration will be worthwhile or not? I could imagine pageblock_skip being
> repurposed for this as the heuristic.
>

Yes, but it has downsides. Redoing the same work on pageblocks, tracking
state and tracking the exit conditions are tricky. I think it's best to
squeeze the most out of the linear scanning first and the series is the
first step in that.

> It would be interesting to know if anybody has tried using the per-zone
> free_area's to determine migration targets and set a bit if it should be
> considered a migration source or a migration target. If all pages for a
> pageblock are not on free_areas, they are fully used.
>

Series has patches which implement something similar to this idea.

--
Mel Gorman
SUSE Labs

2018-12-22 17:59:15

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Fri, 14 Dec 2018, Mel Gorman wrote:

> > In other words, I think there is a lot of potential stranding that occurs
> > for both scanners that could otherwise result in completely free
> > pageblocks. If there a single movable page present near the end of the
> > zone in an otherwise fully free pageblock, surely we can do better than
> > the current implementation that would never consider this very easy to
> > compact memory.
> >
>
> While it's somewhat premature, I posted a series before I had a full set
> of results because it uses free lists to reduce searches and reduces
> inference between multiple scanners. Preliminary results indicated it
> boosted allocation success rates by 20%ish, reduced migration scanning
> by 99% and free scanning by 27%.
>

Always good to have code to look at, I'll take a closer look. I've
unfortunately been distracted with other kernel issues lately :/

2018-12-23 09:38:35

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Fri, 14 Dec 2018, Vlastimil Babka wrote:

> > It would be interesting to know if anybody has tried using the per-zone
> > free_area's to determine migration targets and set a bit if it should be
> > considered a migration source or a migration target. If all pages for a
> > pageblock are not on free_areas, they are fully used.
>
> Repurposing/adding a new pageblock bit was in my mind to help multiple
> compactors not undo each other's work in the scheme where there's no
> free page scanner, but I didn't implement it yet.
>

It looks like Mel has a series posted that still is implemented with
linear scans through memory, so I'm happy to move the discussion there; I
think the goal for compaction with regard to this thread is determining
whether reclaim in the page allocator would actually be useful and
targeted reclaim to make memory available for isolate_freepages() could be
expensive. I'd hope that we could move in a direction where compaction
doesn't care where the pageblock is and does the minimal amount of work
possible to make a high-order page available, not sure if that's possible
with a linear scan. I'll take a look at Mel's series though.

2018-12-23 16:06:06

[permalink] [raw]

Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

On Fri, Dec 21, 2018 at 02:18:45PM -0800, David Rientjes wrote:
> On Fri, 14 Dec 2018, Vlastimil Babka wrote:
>
> > > It would be interesting to know if anybody has tried using the per-zone
> > > free_area's to determine migration targets and set a bit if it should be
> > > considered a migration source or a migration target. If all pages for a
> > > pageblock are not on free_areas, they are fully used.
> >
> > Repurposing/adding a new pageblock bit was in my mind to help multiple
> > compactors not undo each other's work in the scheme where there's no
> > free page scanner, but I didn't implement it yet.
> >
>
> It looks like Mel has a series posted that still is implemented with
> linear scans through memory, so I'm happy to move the discussion there; I
> think the goal for compaction with regard to this thread is determining
> whether reclaim in the page allocator would actually be useful and
> targeted reclaim to make memory available for isolate_freepages() could be
> expensive. I'd hope that we could move in a direction where compaction
> doesn't care where the pageblock is and does the minimal amount of work
> possible to make a high-order page available, not sure if that's possible
> with a linear scan. I'll take a look at Mel's series though.

That series has evolved significantly because there was a lot of missing
pieces. While it's somewhat ready other than badly written changelogs, I
didn't post it because I'm going offline and wouldn't respond to feedback
and I imagine others are offline too and unavailable for review. Besides,
the merge window is about to open and I know there are patches in Andrews
tree for mainline that should be taken into account.

The series is now 25 patches long and covers a lot of pre-requisites that
would be necessary before removing the linear scanner. What is critical
for a purely free-list scanner is that the exit conditions are identified
and the series provides a lot of the pieces. For example, a non-linear
scanner must properly control skip bits and isolate pageblocks from
multiple compaction instances which this series does.

The main takeawy from the series is that it reduces system CPU usage by
17%, reduces free scan rates by 99.5% and increases THP allocation success
rates by 33% giving almost 99% allocation success rates. It also;

o Isolates pageblocks for a single compaction instance
o Synchronises async/sync scanners when appropriate to reduce rescanning
o Identifies when a pageblock is being rescanned and is "sticky" and
makes forward progress instead of looping excessively
o Smarter logic when clearing pageblock skip bits so reduce scanning
o Various different methods for reducing unnecessary scanning
o Better handling of contention
o Avoids compaction of remote nodes in direct compaction context

If you do not want to wait until the new year, it's at
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-fast-compact-v2r15

Preliminary results based on thpscale using MADV_HUGEPAGE to allocate
huge pages on a fragmented system.

thpscale Fault Latencies
4.20.0-rc6 4.20.0-rc6
mmotm-20181210 noremote-v2r14
Amean fault-both-1 864.83 ( 0.00%) 1006.88 * -16.43%*
Amean fault-both-3 3566.05 ( 0.00%) 2460.97 * 30.99%*
Amean fault-both-5 5685.02 ( 0.00%) 4052.92 * 28.71%*
Amean fault-both-7 7289.40 ( 0.00%) 5929.65 ( 18.65%)
Amean fault-both-12 10937.46 ( 0.00%) 8870.53 ( 18.90%)
Amean fault-both-18 15440.48 ( 0.00%) 11464.86 * 25.75%*
Amean fault-both-24 15345.83 ( 0.00%) 13040.01 * 15.03%*
Amean fault-both-30 20159.73 ( 0.00%) 16618.73 * 17.56%*
Amean fault-both-32 20843.51 ( 0.00%) 14401.25 * 30.91%*

Fault latency (either huge or base) is mostly improved even when 32
tasks are trying to allocate huge pages on an 8-CPU single socket
machine where contention is a factor

thpscale Percentage Faults Huge
4.20.0-rc6 4.20.0-rc6
mmotm-20181210 noremote-v2r14
Percentage huge-1 96.03 ( 0.00%) 96.94 ( 0.95%)
Percentage huge-3 71.43 ( 0.00%) 95.43 ( 33.60%)
Percentage huge-5 70.44 ( 0.00%) 96.85 ( 37.48%)
Percentage huge-7 70.39 ( 0.00%) 94.77 ( 34.63%)
Percentage huge-12 71.53 ( 0.00%) 98.07 ( 37.11%)
Percentage huge-18 70.61 ( 0.00%) 98.42 ( 39.38%)
Percentage huge-24 71.84 ( 0.00%) 97.85 ( 36.20%)
Percentage huge-30 69.94 ( 0.00%) 98.13 ( 40.31%)
Percentage huge-32 66.92 ( 0.00%) 97.79 ( 46.13%)

96-98% of THP requests get huge pages on request

4.20.0-rc6 4.20.0-rc6
mmotm-20181210noremote-v2r14
User 27.30 27.86
System 192.70 159.42
Elapsed 580.13 571.98

System CPU usage is reduced so we get more huge pages for less work and
the workload completes slightly faster.

4.20.0-rc6 4.20.0-rc6
mmotm-20181210 noremote-v2r14
Allocation stalls 19156.00 3627.00

Fewer stalls which is always a plus.

THP fault alloc 77804.00 84618.00
THP fault fallback 7628.00 816.00
THP collapse alloc 12.00 0.00
THP collapse fail 0.00 0.00
THP split 56921.00 56920.00
THP split failed 1982.00 116.00
Compaction stalls 36350.00 25541.00
Compaction success 17491.00 22651.00
Compaction failures 18859.00 2890.00
Compaction efficiency 48.12 88.68

Compaction efficiency is increased a lot (efficiency is a basic measure
of success vs failure). Previously almost half of the THP requests failed.

Page migrate success 10200844.00 7802473.00
Page migrate failure 3703.00 409.00
Compaction pages isolated 23093029.00 16532642.00
Compaction migrate scanned 28454655.00 8976143.00
Compaction free scanned 717517120.00 3632762.00
Compact scan efficiency 3.97 247.09

Migration scanning is down 32%, free scanning is down 99.5%. Scan efficiency
is interesting because it's a measure of how many pages the free scanner
examines for one migration source. Before the series, we had to scan *way*
more pages to find a free page where as now we scan *fewer* pages to find
a migration target due to the use of free lists.

Kcompactd wake 1.00 9.00
Kcompactd migrate scanned 14023.00 13318.00
Kcompactd free scanned 6932.00 6643.00

Minor improvements for kcompactd but for this workload, it was barely
active.

I'll rebase and repost in the new year and I think it should be considered
a prerequisite before considering the removal of the linear scanning.
It'll be impossible to remove completely due to memory isoluation. If
built on this series I would imagine that it would take the following
approach.

o The migration scanner remains linear or mostly linear (series uses
free page lists to get hints on where suitable migration sources are)
o The free scanner would be purely based on the free lists i.e.
fast_isolate_freepages would be the only scanner
o The migration scanner would need to be strict about obeying the skip
bit to avoid picking a migration source that was previously a
migration target
o The exit condition for compaction is not when scanners meet but when
fast_isolate_freepages cannot find any pageblock that is
MIGRATE_MOVABLE && !pageblock_skip

--
Mel Gorman
SUSE Labs

2019-04-15 11:51:16