LinuxLists.cc - [PATCH v2 00/12] Introduced multi-preference mempolicy

2020-06-30 22:09:04

Subject: [PATCH v2 00/12] Introduced multi-preference mempolicy

Significant changes since v1:
* Dropped patch to replace numa_node_id in some places (mhocko)
* Dropped all the page allocation patches in favor of new mechanism to use
fallbacks. (mhocko)
* Dropped the special snowflake preferred node algorithm (bwidawsk)
* If the preferred node fails, ALL nodes are rechecked instead of just the
non-preferred nodes.

In v1, Andi Kleen brought up reusing MPOL_PREFERRED as the mode for the API.
There wasn't consensus around this, so I've left the existing API as it was. I'm
open to more feedback here, but my slight preference is to use a new API as it
ensures if people are using it, they are entirely aware of what they're doing
and not accidentally misusing the old interface. (In a similar way to how
MPOL_LOCAL was introduced).

In v1, Michal also brought up renaming this MPOL_PREFERRED_MASK. I'm equally
fine with that change, but I hadn't heard much emphatic support for one way or
another, so I've left that too.

v2 Summary:
1: Random fix I found along the way
2-5: Represent node preference as a mask internally
6-7: Treat many preferred like bind
8-11: Handle page allocation for the new policy
12: Enable the uapi

This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
preference for nodes which will fulfil memory allocation requests. Unlike the
MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
invoke the OOM killer if those preferred nodes are not available.
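
The difference from MPOL_BIND can be sketched in userspace as a toy model. Everything below is hypothetical illustration, not kernel code: model_pick_node() and its helper are invented names, a node "has memory" if its free_pages[] entry is non-zero, and the lowest-numbered node wins, whereas the kernel walks zonelists ordered by distance. It also assumes nr_nodes does not exceed the number of bits in an unsigned long.

```c
#include <stddef.h>

/* Hypothetical userspace model: bit i of a mask stands for NUMA node i. */
enum model_policy { MODEL_BIND, MODEL_PREFERRED_MANY };

/* Return the lowest-numbered node in `mask` that has free pages, or -1. */
static int first_node_with_memory(unsigned long mask,
                                  const unsigned long *free_pages,
                                  size_t nr_nodes)
{
    for (size_t nid = 0; nid < nr_nodes; nid++) {
        if ((mask & (1UL << nid)) && free_pages[nid] > 0)
            return (int)nid;
    }
    return -1;
}

/* Pick a node for one allocation under the given (modelled) policy. */
static int model_pick_node(enum model_policy policy, unsigned long mask,
                           const unsigned long *free_pages, size_t nr_nodes)
{
    /* First pass: only the nodes named in the mask. */
    int nid = first_node_with_memory(mask, free_pages, nr_nodes);

    /* Bind never looks outside its mask, so a miss is a hard failure. */
    if (nid >= 0 || policy == MODEL_BIND)
        return nid;

    /* Second pass: ALL nodes are rechecked, preferred ones included. */
    return first_node_with_memory(~0UL, free_pages, nr_nodes);
}
```

With free_pages = {0, 0, 5, 7} and a mask of 0x3, the bind model fails with -1 while the preferred-many model falls back to node 2.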

Along with these patches are patches for libnuma, numactl, numademo, and memhog.
They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
It allows new usage: `numactl -P 0,3,4`

The goal of the new mode is to enable some use-cases when using tiered memory
usage models which I've lovingly named.
1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
requirements allowing preference to be given to all nodes with "fast" memory.
1b. The Indiscriminate Hare - An application knows it wants fast memory (or
perhaps slow memory), but doesn't care which node it runs on. The application
can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
etc). This reverses how nodes are chosen today, where the kernel attempts to use
memory local to the CPU whenever possible. This will instead attempt to use the
accelerator local to the memory.
2. The Tortoise - The administrator (or the application itself) is aware it only
needs slow memory, and so can prefer that.

Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fall back to another set of nodes if
binding to every node in the nodemask fails.

Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
preference.

> /* Set first two nodes as preferred in an 8 node system. */
> const unsigned long nodes = 0x3;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

> /* Mimic interleave policy, but have fallback. */
> const unsigned long nodes = 0xaa;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);
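
For readers less used to raw nodemasks, the two constants above are plain bitmasks in which bit N stands for node N. A minimal sketch of building them follows; nodes_to_mask() is a hypothetical helper, not part of libnuma or the kernel, and it assumes every node id is smaller than the number of bits in an unsigned long.

```c
#include <stddef.h>

/*
 * Hypothetical helper: fold a list of node ids into a single-word nodemask.
 * Assumes each id is below the bit width of unsigned long.
 */
static unsigned long nodes_to_mask(const unsigned int *nodes, size_t count)
{
    unsigned long mask = 0;

    for (size_t i = 0; i < count; i++)
        mask |= 1UL << nodes[i];

    return mask;
}
```

Nodes {0, 1} give 0x3, and nodes {1, 3, 5, 7} give 0xaa, i.e. every other node of an 8 node system.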

Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:
1. Ordered list of nodes. Currently it's believed that the added complexity is
not needed for expected usecases.
2. A flag for bind to allow falling back to other nodes. This confuses the
notion of binding and is less flexible than the current solution.
3. Create flags or new modes that help with some ordering. This offers both a
friendlier API as well as a solution for more customized usage. It's unknown
if it's worth the complexity to support this. Here is sample code for how
this might work:

> // Prefer specific nodes for something wacky
> set_mempolicy(MPOL_PREFERRED_MANY, 0x17c, 1024);
>
> // Default
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>

Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>

Ben Widawsky (8):
mm/mempolicy: Add comment for missing LOCAL
mm/mempolicy: kill v.preferred_nodes
mm/mempolicy: handle MPOL_PREFERRED_MANY like BIND
mm/mempolicy: Create a page allocator for policy
mm/mempolicy: Thread allocation for many preferred
mm/mempolicy: VMA allocation for many preferred
mm/mempolicy: huge-page allocation for many preferred
mm/mempolicy: Advertise new MPOL_PREFERRED_MANY

Dave Hansen (4):
mm/mempolicy: convert single preferred_node to full nodemask
mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
mm/mempolicy: allow preferred code to take a nodemask
mm/mempolicy: refactor rebind code for PREFERRED_MANY

.../admin-guide/mm/numa_memory_policy.rst | 22 +-
include/linux/mempolicy.h | 6 +-
include/uapi/linux/mempolicy.h | 6 +-
mm/hugetlb.c | 20 +-
mm/mempolicy.c | 273 ++++++++++++------
5 files changed, 222 insertions(+), 105 deletions(-)

--
2.27.0

2020-06-30 22:09:05

by Ben Widawsky

Subject: [PATCH 04/12] mm/mempolicy: allow preferred code to take a nodemask

From: Dave Hansen <[email protected]>

Create a helper function (mpol_new_preferred_many()) which is usable
both by the old, single-node MPOL_PREFERRED and the new
MPOL_PREFERRED_MANY.

Enforce the old single-node MPOL_PREFERRED behavior in the "new"
version of mpol_new_preferred() which calls mpol_new_preferred_many().
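
The relationship between the two helpers can be sketched in userspace. This is a toy model under stated assumptions, not the kernel code: nodemasks are a single uint64_t, the model_* names are invented, the NULL case only notes where the kernel would set MPOL_F_LOCAL, and the explicit empty-mask check in the single-node wrapper is an addition of this sketch that the patch itself does not have.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Many-node variant: keep the caller's whole mask; reject an empty one. */
static int model_new_preferred_many(uint64_t *stored, const uint64_t *nodes)
{
    if (!nodes)
        return 0;            /* kernel: pol->flags |= MPOL_F_LOCAL */
    if (*nodes == 0)
        return -EINVAL;      /* no allowed nodes */
    *stored = *nodes;
    return 0;
}

/* Single-node wrapper: reduce the mask to its lowest set bit first. */
static int model_new_preferred(uint64_t *stored, const uint64_t *nodes)
{
    if (nodes && *nodes != 0) {
        /* x & (~x + 1) isolates the lowest set bit, i.e. the first node. */
        uint64_t tmp = *nodes & (~*nodes + 1);

        return model_new_preferred_many(stored, &tmp);
    }

    /* NULL or empty: let the shared helper decide (local or -EINVAL). */
    return model_new_preferred_many(stored, nodes);
}
```

Given a mask of 0x1a (nodes 1, 3 and 4), the wrapper stores 0x2 while the many-node helper stores 0x1a unchanged.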

Cc: Andrew Morton <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
---
mm/mempolicy.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 33bf29ddfab2..1ad6e446d8f6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -203,17 +203,30 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
return 0;
}

-static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
+static int mpol_new_preferred_many(struct mempolicy *pol,
+ const nodemask_t *nodes)
{
if (!nodes)
pol->flags |= MPOL_F_LOCAL; /* local allocation */
else if (nodes_empty(*nodes))
return -EINVAL; /* no allowed nodes */
else
- pol->v.preferred_nodes = nodemask_of_node(first_node(*nodes));
+ pol->v.preferred_nodes = *nodes;
return 0;
}

+static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
+{
+ if (nodes) {
+ /* MPOL_PREFERRED can only take a single node: */
+ nodemask_t tmp = nodemask_of_node(first_node(*nodes));
+
+ return mpol_new_preferred_many(pol, &tmp);
+ }
+
+ return mpol_new_preferred_many(pol, NULL);
+}
+
static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
{
if (nodes_empty(*nodes))
--
2.27.0

2020-06-30 22:09:48

by Ben Widawsky

Subject: [PATCH 07/12] mm/mempolicy: handle MPOL_PREFERRED_MANY like BIND

This patch begins the real plumbing for handling this new policy. Now
that the internal representation for preferred nodes and bound nodes is
the same, and we can envision how multiple preferred nodes will behave,
there are obvious places where we can simply reuse the bind
behavior.

In v1 of this series, the moral equivalent was:
"mm: Finish handling MPOL_PREFERRED_MANY". Like that, this attempts to
implement the easiest spots for the new policy. Unlike that, this just
reuses BIND.

Cc: Andrew Morton <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
---
mm/mempolicy.c | 23 ++++++++---------------
1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e71ebc906ff0..3b38c9c4e580 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -950,8 +950,6 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
switch (p->mode) {
case MPOL_BIND:
case MPOL_INTERLEAVE:
- *nodes = p->nodes;
- break;
case MPOL_PREFERRED_MANY:
*nodes = p->nodes;
break;
@@ -1938,7 +1936,8 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
/* Lower zones don't get a nodemask applied for MPOL_BIND */
- if (unlikely(policy->mode == MPOL_BIND) &&
+ if (unlikely(policy->mode == MPOL_BIND ||
+ policy->mode == MPOL_PREFERRED_MANY) &&
apply_policy_zone(policy, gfp_zone(gfp)) &&
cpuset_nodemask_valid_mems_allowed(&policy->nodes))
return &policy->nodes;
@@ -1995,7 +1994,6 @@ unsigned int mempolicy_slab_node(void)
return node;

switch (policy->mode) {
- case MPOL_PREFERRED_MANY:
case MPOL_PREFERRED:
/*
* handled MPOL_F_LOCAL above
@@ -2005,6 +2003,7 @@ unsigned int mempolicy_slab_node(void)
case MPOL_INTERLEAVE:
return interleave_nodes(policy);

+ case MPOL_PREFERRED_MANY:
case MPOL_BIND: {
struct zoneref *z;

@@ -2020,6 +2019,7 @@ unsigned int mempolicy_slab_node(void)
return z->zone ? zone_to_nid(z->zone) : node;
}

+
default:
BUG();
}
@@ -2130,9 +2130,6 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
task_lock(current);
mempolicy = current->mempolicy;
switch (mempolicy->mode) {
- case MPOL_PREFERRED_MANY:
- *mask = mempolicy->nodes;
- break;
case MPOL_PREFERRED:
if (mempolicy->flags & MPOL_F_LOCAL)
nid = numa_node_id();
@@ -2143,6 +2140,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)

case MPOL_BIND:
case MPOL_INTERLEAVE:
+ case MPOL_PREFERRED_MANY:
*mask = mempolicy->nodes;
break;

@@ -2186,12 +2184,11 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
* Thus, it's possible for tsk to have allocated memory from
* nodes in mask.
*/
- break;
- case MPOL_PREFERRED_MANY:
ret = nodes_intersects(mempolicy->nodes, *mask);
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
+ case MPOL_PREFERRED_MANY:
ret = nodes_intersects(mempolicy->nodes, *mask);
break;
default:
@@ -2415,7 +2412,6 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
switch (a->mode) {
case MPOL_BIND:
case MPOL_INTERLEAVE:
- return !!nodes_equal(a->nodes, b->nodes);
case MPOL_PREFERRED_MANY:
return !!nodes_equal(a->nodes, b->nodes);
case MPOL_PREFERRED:
@@ -2569,6 +2565,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
polnid = first_node(pol->nodes);
break;

+ case MPOL_PREFERRED_MANY:
case MPOL_BIND:

/*
@@ -2585,8 +2582,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
polnid = zone_to_nid(z->zone);
break;

- /* case MPOL_PREFERRED_MANY: */
-
default:
BUG();
}
@@ -3099,15 +3094,13 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
switch (mode) {
case MPOL_DEFAULT:
break;
- case MPOL_PREFERRED_MANY:
- WARN_ON(flags & MPOL_F_LOCAL);
- fallthrough;
case MPOL_PREFERRED:
if (flags & MPOL_F_LOCAL)
mode = MPOL_LOCAL;
else
nodes_or(nodes, nodes, pol->nodes);
break;
+ case MPOL_PREFERRED_MANY:
case MPOL_BIND:
case MPOL_INTERLEAVE:
nodes = pol->nodes;
--
2.27.0

2020-06-30 22:10:04

by Ben Widawsky

Subject: [PATCH 10/12] mm/mempolicy: VMA allocation for many preferred

This patch implements MPOL_PREFERRED_MANY for alloc_pages_vma(). Like
alloc_pages_current(), alloc_pages_vma() needs to support policy based
decisions if they've been configured via mbind(2).

The temporary "hack" of treating MPOL_PREFERRED and MPOL_PREFERRED_MANY
the same can now be removed with this, too.

All the actual machinery to make this work was part of
("mm/mempolicy: Create a page allocator for policy")

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
---
mm/mempolicy.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5fb70e6599a6..51ac0d4a2eda 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2281,8 +2281,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
{
struct mempolicy *pol;
struct page *page;
- int preferred_nid;
- nodemask_t *nmask;

pol = get_vma_policy(vma, addr);

@@ -2296,6 +2294,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
}

if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
+ nodemask_t *nmask;
int hpage_node = node;

/*
@@ -2309,10 +2308,26 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
* does not allow the current node in its nodemask, we allocate
* the standard way.
*/
- if ((pol->mode == MPOL_PREFERRED ||
- pol->mode == MPOL_PREFERRED_MANY) &&
- !(pol->flags & MPOL_F_LOCAL))
+ if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL)) {
hpage_node = first_node(pol->nodes);
+ } else if (pol->mode == MPOL_PREFERRED_MANY) {
+ struct zoneref *z;
+
+ /*
+ * In this policy, with direct reclaim, the normal
+ * policy based allocation will do the right thing - try
+ * twice using the preferred nodes first, and all nodes
+ * second.
+ */
+ if (gfp & __GFP_DIRECT_RECLAIM) {
+ page = alloc_pages_policy(pol, gfp, order, NUMA_NO_NODE);
+ goto out;
+ }
+
+ z = first_zones_zonelist(node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ gfp_zone(GFP_HIGHUSER), &pol->nodes);
+ hpage_node = zone_to_nid(z->zone);
+ }

nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(hpage_node, *nmask)) {
@@ -2338,9 +2353,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
}
}

- nmask = policy_nodemask(gfp, pol);
- preferred_nid = policy_node(gfp, pol, node);
- page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
+ page = alloc_pages_policy(pol, gfp, order, NUMA_NO_NODE);
mpol_cond_put(pol);
out:
return page;
--
2.27.0

2020-07-02 11:40:42

by Chen, Rong A

Subject: [mm/mempolicy] 9586f666c8: Kernel_panic-not_syncing:stack-protector:Kernel_stack_is_corrupted_in:mpol_new_preferred

Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 9586f666c84d6b357371aff0237269852f64e3b6 ("[PATCH 04/12] mm/mempolicy: allow preferred code to take a nodemask")
url: https://github.com/0day-ci/linux/commits/Ben-Widawsky/Introduced-multi-preference-mempolicy/20200701-052810

in testcase: trinity
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

+------------------------------------------------------------------------------------------+------------+------------+
| | 65c23f0f71 | 9586f666c8 |
+------------------------------------------------------------------------------------------+------------+------------+
| boot_successes | 6 | 7 |
| boot_failures | 1 | 10 |
| INFO:rcu_sched_self-detected_stall_on_CPU | 1 | |
| RIP:iov_iter_copy_from_user_atomic | 1 | |
| BUG:soft_lockup-CPU##stuck_for#s![trinity-c5:#] | 1 | |
| Kernel_panic-not_syncing:softlockup:hung_tasks | 1 | |
| Kernel_panic-not_syncing:stack-protector:Kernel_stack_is_corrupted_in:mpol_new_preferred | 0 | 10 |
+------------------------------------------------------------------------------------------+------------+------------+

If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>

[ 236.443959] [main] 284488 iterations. [F:217538 S:65817 HI:4015]
[ 236.443963]
[ 238.480132] futex_wake_op: trinity-c3 tries to shift op by -16; fix this program
[ 246.551236] [main] 294727 iterations. [F:225347 S:68192 HI:4015]
[ 246.551240]
[ 247.209348] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: mpol_new_preferred+0x12f/0x130
[ 247.211379] CPU: 1 PID: 4445 Comm: trinity-c4 Not tainted 5.8.0-rc3-00004-g9586f666c84d6 #1
[ 247.213010] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 247.214503] Call Trace:
[ 247.215111] dump_stack+0x6d/0x90
[ 247.215814] panic+0x108/0x2de
[ 247.216476] ? mpol_new_preferred+0x12f/0x130
[ 247.217446] __stack_chk_fail+0x10/0x10
[ 247.218252] mpol_new_preferred+0x12f/0x130
[ 247.219145] do_set_mempolicy+0x7e/0x130
[ 247.219910] kernel_set_mempolicy+0x7c/0x90
[ 247.220705] do_syscall_64+0x4d/0x90
[ 247.221415] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 247.222305] RIP: 0033:0x453b29
[ 247.222987] Code: Bad RIP value.
[ 247.223665] RSP: 002b:00007ffc9b1c52c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ee
[ 247.225066] RAX: ffffffffffffffda RBX: 00000000000000ee RCX: 0000000000453b29
[ 247.226265] RDX: 0000000000000200 RSI: 00007f0f2a7c0000 RDI: 0000000000000001
[ 247.227477] RBP: 00007ffc9b1c5370 R08: 3bbfcbe05d2a35be R09: 00000a6226195b86
[ 247.228728] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000002
[ 247.230005] R13: 00007f0f2aed9058 R14: 0000000003007830 R15: 00007f0f2aed9000
[ 247.231283] Kernel Offset: disabled

Elapsed time: 300
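
A possible explanation, not confirmed anywhere in this thread: the new mpol_new_preferred() calls nodemask_of_node(first_node(*nodes)) before any nodes_empty() check. In the kernel, first_node() on an empty mask returns MAX_NUMNODES, and building a single-node mask for that index would set a bit one past the end of the on-stack tmp, which is the kind of write the stack protector catches. RDI = 1 in the trace matches MPOL_PREFERRED, so a fuzzer-supplied empty nodemask would fit. The userspace model below only illustrates that hazard; the model_* names and MODEL_MAX_NODES are hypothetical, and the bounds check is what this sketch adds.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_NODES      64   /* hypothetical stand-in for MAX_NUMNODES */
#define MODEL_BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_MASK_LONGS \
    ((MODEL_MAX_NODES + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)

struct model_nodemask {
    unsigned long bits[MODEL_MASK_LONGS];
};

/* Index of the lowest set bit, or MODEL_MAX_NODES when the mask is empty. */
static unsigned int model_first_node(const struct model_nodemask *m)
{
    for (unsigned int nid = 0; nid < MODEL_MAX_NODES; nid++) {
        if (m->bits[nid / MODEL_BITS_PER_LONG] &
            (1UL << (nid % MODEL_BITS_PER_LONG)))
            return nid;
    }
    return MODEL_MAX_NODES;   /* sentinel: not a valid node index */
}

/*
 * Bounds-checked single-node mask. Returns false instead of writing when
 * nid is the "empty" sentinel or otherwise out of range.
 */
static bool model_nodemask_of_node(struct model_nodemask *out, unsigned int nid)
{
    if (nid >= MODEL_MAX_NODES)
        return false;

    for (size_t i = 0; i < MODEL_MASK_LONGS; i++)
        out->bits[i] = 0;
    out->bits[nid / MODEL_BITS_PER_LONG] = 1UL << (nid % MODEL_BITS_PER_LONG);
    return true;
}
```

In this model an empty mask yields the sentinel MODEL_MAX_NODES, and the checked helper refuses to turn that sentinel into a bit position.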

To reproduce:

# build kernel
cd linux
cp config-5.8.0-rc3-00004-g9586f666c84d6 .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email

Thanks,
Rong Chen

Attachments:

(No filename) (4.04 kB)
config-5.8.0-rc3-00004-g9586f666c84d6 (191.62 kB)
job-script (4.39 kB)
dmesg.xz (16.19 kB)