2023-10-03 00:23:02

by Gregory Price

Subject: [RFC PATCH v2 0/4] mm/mempolicy: get/set_mempolicy2 syscalls

v2: style updates, weighted-interleave, rename partial-interleave to
preferred-interleave, variety of bug fixes.

---

This patch set is a proposal for set_mempolicy2 and get_mempolicy2
system calls.  This is an extension of the existing mempolicy
syscalls that allows for a more extensible mempolicy interface and
new, complex memory policies.

This RFC is broken into 4 patches for discussion:

1) A refactor of do_set_mempolicy that allows code reuse for
the new syscalls when replacing the task mempolicy.

2) The implementation of get_mempolicy2 and set_mempolicy2 which
includes a new uapi type: "struct mempolicy_args" and denotes
the original mempolicies as "legacy". This allows the existing
policies to be routed through the original interface.

(note: only implemented on x86 at this time, though can be
hacked into other architectures somewhat trivially)

3) The implementation of "preferred-interleave", a policy which
applies a weight to the local node while interleaving.

4) The implementation of "weighted-interleave", a policy which
applies weights to all enabled nodes while interleaving.

x) Future Updates: ktest, numactl, and man page updates


Besides the obvious proposal of extending the mempolicy subsystem for
new policies, the core proposal is the addition of the new uapi type
"struct mempolicy_args".  In this proposal, the get and set interfaces
use the same structure, and some fields may be ignored depending on
the requested operation.

This sample implementation of get_mempolicy2 allows for the retrieval
of all information that would previously have required multiple calls
to get_mempolicy, and implements an area for per-policy information.

This allows for extensibility, and would avoid the need for
additional syscalls in the future.

struct mempolicy_args {
        unsigned short mode;
        unsigned long *nodemask;
        unsigned long maxnode;
        unsigned short flags;
        struct {
                /* Memory allowed */
                struct {
                        unsigned long maxnode;
                        unsigned long *nodemask;
                } allowed;
                /* Address information */
                struct {
                        unsigned long addr;
                        unsigned long node;
                        unsigned short mode;
                        unsigned short flags;
                } addr;
        } get;
        union {
                /* Interleave */
                struct {
                        unsigned long next_node;        /* get only */
                } interleave;
                /* Preferred Interleave */
                struct {
                        unsigned long weight;           /* get and set */
                        unsigned long next_node;        /* get only */
                } pil;
                /* Weighted Interleave */
                struct {
                        unsigned long next_node;        /* get only */
                        unsigned char *weight;          /* get and set */
                } wil;
        };
};
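
For a sense of how this might be driven from userspace, here is a
minimal, untested sketch.  The syscall number and the presence of
struct mempolicy_args in <linux/mempolicy.h> are assumptions taken
from patch 2 of this series; the example simply pushes a legacy
MPOL_INTERLEAVE request through the new entry point:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>    /* struct mempolicy_args, MPOL_* (this series) */

#ifndef __NR_set_mempolicy2
#define __NR_set_mempolicy2 454 /* assumed: the x86_64 number from patch 2 */
#endif

int main(void)
{
        struct mempolicy_args args;
        unsigned long nodemask = 0x7;           /* nodes 0-2 */

        memset(&args, 0, sizeof(args));
        args.mode = MPOL_INTERLEAVE;            /* legacy mode: routed to the old path */
        args.nodemask = &nodemask;
        args.maxnode = 8 * sizeof(nodemask);

        if (syscall(__NR_set_mempolicy2, &args, sizeof(args)))
                perror("set_mempolicy2");
        return 0;
}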

In the third and fourth patch, we implement preferred and weighted
interleave policies (respectively), which could not be implemented
with the existing syscalls.

We extend the internal mempolicy structure to include a new union
area which can be used to host complex policy data.

Example:
union {
        /* Preferred Interleave: Allocate local count, then interleave */
        struct {
                int weight;
                int count;
        } pil;
        /* Weighted Interleave */
        struct {
                unsigned int il_weight;
                unsigned char cur_weight;
                unsigned char weights[MAX_NUMNODES];
        } wil;
};


Summary of Preferred Interleave:
================================
nodeset=0,1,2
interval=3
cpunode=0

The preferred node (cpunode) is the node on which [weight] allocations
(the interval, 3 in this example) are made before an interleave occurs.

Over 10 consecutive allocations, the following nodes will be selected:
[0,0,0,1,2,0,0,0,1,2]

In this example, there is a 60%/20%/20% distribution of memory across
the node set.
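
Purely for illustration (not code from the patches), a small userspace
sketch that reproduces the selection order above; the weight/interval
of 3, local node 0 and nodeset {0,1,2} are the values from the example:

#include <stdio.h>

int main(void)
{
        int nodes[] = { 0, 1, 2 };
        int nr_nodes = 3, local = 0, weight = 3;  /* values from the example */
        int count = 0, next = -1;

        for (int alloc = 0; alloc < 10; alloc++) {
                int node;

                if (count < weight) {
                        /* first, [weight] allocations on the local node */
                        node = local;
                        count++;
                } else {
                        /* then one pass over the remaining nodes */
                        next = (next + 1) % nr_nodes;
                        if (nodes[next] == local)
                                next = (next + 1) % nr_nodes;
                        node = nodes[next];
                        if (next == nr_nodes - 1) {
                                /* pass complete: restart the local burst */
                                count = 0;
                                next = -1;
                        }
                }
                printf("%d ", node);
        }
        printf("\n");   /* prints: 0 0 0 1 2 0 0 0 1 2 */
        return 0;
}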

This is a useful strategy if the goal is an even distribution of
memory across all non-local nodes for the purpose of bandwidth AND
task-node migrations are a possibility. In this case, the weight
applies to whatever the local node happens to be at the time of the
interleave, rather than a static node weight.


Summary of Weighted Interleave:
===============================

The weighted-interleave mempolicy implements weights per-node
which are used to distribute memory while interleaving.

For example:
nodes: 0,1,2
weights: 5,3,2

Over 10 consecutive allocations, the following nodes will be selected:
[0,0,0,0,0,1,1,1,2,2]

If a node is enabled, the minimum weight is 1.  If an enabled node
ends up with a weight of 0 (cgroup updates can cause a runtime
recalculation), a minimum of 1 is applied during interleave.
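
Again purely illustrative (the in-kernel implementation is in patch 4),
a userspace sketch that reproduces the order above and applies the
minimum-weight-of-1 rule:

#include <stdio.h>

int main(void)
{
        int nodes[] = { 0, 1, 2 };
        unsigned char weights[] = { 5, 3, 2 };  /* values from the example */
        int nr_nodes = 3, cur = 0, left = 0;

        for (int alloc = 0; alloc < 10; alloc++) {
                if (left == 0)
                        /* reload the current node's weight, clamped to >= 1 */
                        left = weights[cur] ? weights[cur] : 1;
                printf("%d ", nodes[cur]);
                if (--left == 0)
                        cur = (cur + 1) % nr_nodes;     /* weight spent: next node */
        }
        printf("\n");   /* prints: 0 0 0 0 0 1 1 1 2 2 */
        return 0;
}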

This is a useful strategy if the goal is a non-even distribution of
memory across a variety of nodes AND task-node migrations are NOT
expected to occur (or the relative weights are approximately the
same from all possible target nodes).

This is because "Thread A" with weights set for best performance
from the perspective of "Socket 0" may have a less-than-optimal
interleave strategy if "Thread A" is migrated to "Socket 1". In
this scenario, the bandwidth and latency attributes of each node
will have changed, as will the local node.

In the above example, a thread migrating from node 0 to node 1 will
cause most of its memory to be allocated on remote nodes, which is
less than optimal.


Some notes for discussion
=========================
0) Why?

In the coming age of CXL and many-NUMA-node systems with memory
hosted on the PCIe bus, it is likely to be beneficial to experiment
with, and ultimately implement, new allocation-time placement
policies.

Presently, much focus is placed on memory-usage monitoring and data
migration, but these methods steal performance to accomplish what
could be optimized for up-front.  For example, if maximizing bandwidth
is preferable, then a statistical distribution of memory can be
calculated fairly easily based on task location.
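
As a rough, purely hypothetical example: if a socket's local DRAM
delivers ~200 GB/s and its attached CXL memory ~50 GB/s, distributing
allocations roughly 4:1 between the DRAM node and the CXL node loads
each link in proportion to what it can deliver, and that ratio maps
directly onto per-node interleave weights.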

Getting a fair approximation of the distribution at allocation time
can help reduce the migration load required after the fact.  This is
the intent of the included preferred-interleave example, which allows
for an approximate distribution of memory, where the local node is
still the preferred location for the majority of memory.

1) Maybe this should be a set of sysfs interfaces?

This would involve adding a /proc/pid/mempolicy interface that
allows for external processes to interrogate and change the
mempolicy of running processes. This would be a fundamental
change to the mempolicy subsystem.

I attempted this, but eventually came to the conclusion that it
would require a much more radical rewrite of the mempolicy.c code
due to concurrency issues.

Notably, mempolicy.c is very "current"-centric, and is not well
designed for runtime changes to the nodemask (and, subsequently, the
new weights added to struct mempolicy).

I avoided that for this RFC as it seemed far more radical than
proposing a set/get_mempolicy2 interface, though technically it
could be done.

2) Why not do this in cgroups or memtier?

Both have the issue of functionally being a "global" setting,
in the sense that weights implemented in cgroups/memtier would
produce poor results for processes whose threads span multiple
sockets (or on a thread migration).

Consider the following scenario:

Node 0 - Socket 0 DRAM
Node 1 - Socket 1 DRAM
Node 2 - Socket 0 local CXL
Node 3 - Socket 1 local CXL
Weights:
[0:4, 1:2, 2:2, 3:1]

The "Tiers" in this case are essentially [0, 1-2, 3]

We have 2 tasks in our cgroup:
Thread A - socket 0
Thread B - socket 1

In this scenario, Thread B will have a very poor distribution of
memory, with most of its memory landing on the remote socket.
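
(With the example weights, 4+2 of every 9 pages Thread B allocates
land on node 0 or node 2, i.e. roughly two thirds of its memory ends
up a socket away.)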

Instead, it's preferable for workloads to stick to a single socket
where possible, and future work will be needed to determine how to
handle workloads which span sockets.  Due to the above-mentioned
issues with concurrency, this may take quite some time.

In the meantime, there is a use case for weights to be carried per-task.

For migrations:

Weights could be recalculated based on the new location of the
task. This recalculation of weights is not included in this
patch set, but could be done as an extension to weighted
interleave, where a thread that detects it has been migrated
works with memtier.c to adjust its weights internally.

So basically, even if you implement these things in cgroups/memtier,
you still require per-task information (the local node) to adjust the
weights.  My proposal: just do it in mempolicy and use things like
cgroups/memtier to enrich that implementation, rather than the other
way around.

3) Do we need this level of extensibility?

Presently the ability to dictate allocation-time placement is
limited to a few primitive mechanisms:

1) existing mempolicy, and those that can be implemented using
the existing interface.
2) numa-aware applications, requiring code changes.
3) LD_PRELOAD methods, which have compatibility issues.

For the sake of compatibility, being able to extend numactl to
include newer, more complex policies would be beneficial.

Gregory Price (4):
mm/mempolicy: refactor do_set_mempolicy for code re-use
mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls
mm/mempolicy: implement a preferred-interleave mempolicy
mm/mempolicy: implement a weighted-interleave mempolicy

arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
include/linux/mempolicy.h | 14 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 10 +-
include/uapi/linux/mempolicy.h | 41 ++
mm/mempolicy.c | 688 ++++++++++++++++++++++++-
7 files changed, 741 insertions(+), 20 deletions(-)

--
2.39.1


2023-10-03 00:23:12

by Gregory Price

Subject: [RFC PATCH v2 2/4] mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls

sys_set_mempolicy is limited by its current argument structure
(mode, nodes, flags) to implementing policies that can be described
in that manner.

Implement set/get_mempolicy2 with a new mempolicy_args structure
which encapsulates the old behavior, and allows for new mempolicies
which may require additional information.

Signed-off-by: Gregory Price <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 10 +-
include/uapi/linux/mempolicy.h | 29 ++++
mm/mempolicy.c | 196 ++++++++++++++++++++++++-
6 files changed, 241 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2d0b1bd866ea..a72ef588a704 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -457,3 +457,5 @@
450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
451 i386 cachestat sys_cachestat
452 i386 fchmodat2 sys_fchmodat2
+454 i386 set_mempolicy2 sys_set_mempolicy2
+455 i386 get_mempolicy2 sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1d6eee30eceb..ec54064de8b3 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -375,6 +375,8 @@
451 common cachestat sys_cachestat
452 common fchmodat2 sys_fchmodat2
453 64 map_shadow_stack sys_map_shadow_stack
+454 common set_mempolicy2 sys_set_mempolicy2
+455 common get_mempolicy2 sys_get_mempolicy2

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 22bc6bc147f8..0c4a71177df9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -813,6 +813,10 @@ asmlinkage long sys_get_mempolicy(int __user *policy,
unsigned long addr, unsigned long flags);
asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
unsigned long maxnode);
+asmlinkage long sys_get_mempolicy2(struct mempolicy_args __user *args,
+ size_t size);
+asmlinkage long sys_set_mempolicy2(struct mempolicy_args __user *args,
+ size_t size);
asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
const unsigned long __user *from,
const unsigned long __user *to);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index abe087c53b4b..397dcf804941 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -823,8 +823,16 @@ __SYSCALL(__NR_cachestat, sys_cachestat)
#define __NR_fchmodat2 452
__SYSCALL(__NR_fchmodat2, sys_fchmodat2)

+/* CONFIG_MMU only */
+#ifndef __ARCH_NOMMU
+#define __NR_set_mempolicy2 454
+__SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
+#define __NR_get_mempolicy2 455
+__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
+#endif
+
#undef __NR_syscalls
-#define __NR_syscalls 453
+#define __NR_syscalls 456

/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 046d0ccba4cd..ea386872094b 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,9 +23,38 @@ enum {
MPOL_INTERLEAVE,
MPOL_LOCAL,
MPOL_PREFERRED_MANY,
+ MPOL_LEGACY, /* set_mempolicy limited to above modes */
MPOL_MAX, /* always last member of enum */
};

+struct mempolicy_args {
+ unsigned short mode;
+ unsigned long *nodemask;
+ unsigned long maxnode;
+ unsigned short flags;
+ struct {
+ /* Memory allowed */
+ struct {
+ unsigned long maxnode;
+ unsigned long *nodemask;
+ } allowed;
+ /* Address information */
+ struct {
+ unsigned long addr;
+ unsigned long node;
+ unsigned short mode;
+ unsigned short flags;
+ } addr;
+ /* Interleave */
+ } get;
+ /* Mode specific settings */
+ union {
+ struct {
+ unsigned long next_node; /* get only */
+ } interleave;
+ };
+};
+
/* Flags for set_mempolicy */
#define MPOL_F_STATIC_NODES (1 << 15)
#define MPOL_F_RELATIVE_NODES (1 << 14)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ad26f41b91de..936c641f554e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1478,7 +1478,7 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
*flags = *mode & MPOL_MODE_FLAGS;
*mode &= ~MPOL_MODE_FLAGS;

- if ((unsigned int)(*mode) >= MPOL_MAX)
+ if ((unsigned int)(*mode) >= MPOL_LEGACY)
return -EINVAL;
if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
return -EINVAL;
@@ -1609,6 +1609,200 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
return kernel_set_mempolicy(mode, nmask, maxnode);
}

+static long do_set_mempolicy2(struct mempolicy_args *args)
+{
+ struct mempolicy *new = NULL;
+ nodemask_t nodes;
+ int err;
+
+ if (args->mode <= MPOL_LEGACY)
+ return -EINVAL;
+
+ if (args->mode >= MPOL_MAX)
+ return -EINVAL;
+
+ err = get_nodes(&nodes, args->nodemask, args->maxnode);
+ if (err)
+ return err;
+
+ new = mpol_new(args->mode, args->flags, &nodes);
+ if (IS_ERR(new))
+ return PTR_ERR(new);
+
+ switch (args->mode) {
+ default:
+ BUG();
+ }
+
+ if (err)
+ goto out;
+
+ err = replace_mempolicy(new, &nodes);
+out:
+ if (err)
+ mpol_put(new);
+ return err;
+};
+
+static bool mempolicy2_args_valid(struct mempolicy_args *kargs)
+{
+ /* Legacy modes are routed through the legacy interface */
+ return kargs->mode > MPOL_LEGACY && kargs->mode < MPOL_MAX;
+}
+
+static long kernel_set_mempolicy2(const struct mempolicy_args __user *uargs,
+ size_t usize)
+{
+ struct mempolicy_args kargs;
+ int err;
+
+ if (usize < sizeof(kargs))
+ return -EINVAL;
+
+ err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+ if (err)
+ return err;
+
+ /* If the mode is legacy, use the legacy path */
+ if (kargs.mode < MPOL_LEGACY) {
+ int legacy_mode = kargs.mode | kargs.flags;
+ const unsigned long __user *lnmask = kargs.nodemask;
+ unsigned long maxnode = kargs.maxnode;
+
+ return kernel_set_mempolicy(legacy_mode, lnmask, maxnode);
+ }
+
+ if (!mempolicy2_args_valid(&kargs))
+ return -EINVAL;
+
+ return do_set_mempolicy2(&kargs);
+}
+
+SYSCALL_DEFINE2(set_mempolicy2, const struct mempolicy_args __user *, args,
+ size_t, size)
+{
+ return kernel_set_mempolicy2(args, size);
+}
+
+/* Gets extended mempolicy information */
+static long do_get_mempolicy2(struct mempolicy_args *kargs)
+{
+ struct mempolicy *pol = current->mempolicy;
+ nodemask_t knodes;
+ int rc = 0;
+
+ kargs->mode = pol->mode;
+ /* Mask off internal flags */
+ kargs->flags = pol->flags & MPOL_MODE_FLAGS;
+
+ if (kargs->nodemask) {
+ if (mpol_store_user_nodemask(pol)) {
+ knodes = pol->w.user_nodemask;
+ } else {
+ task_lock(current);
+ get_policy_nodemask(pol, &knodes);
+ task_unlock(current);
+ }
+ rc = copy_nodes_to_user(kargs->nodemask, kargs->maxnode,
+ &knodes);
+ if (rc)
+ return rc;
+ }
+
+
+ if (kargs->get.allowed.nodemask) {
+ task_lock(current);
+ knodes = cpuset_current_mems_allowed;
+ task_unlock(current);
+ rc = copy_nodes_to_user(kargs->get.allowed.nodemask,
+ kargs->get.allowed.maxnode,
+ &knodes);
+ if (rc)
+ return rc;
+ }
+
+ if (kargs->get.addr.addr) {
+ struct mempolicy *addr_pol;
+ struct vm_area_struct *vma;
+ struct mm_struct *mm = current->mm;
+ unsigned long addr = kargs->get.addr.addr;
+
+ /*
+ * Do NOT fall back to task policy if the vma/shared policy
+ * at addr is NULL. Return MPOL_DEFAULT in this case.
+ */
+ mmap_read_lock(mm);
+ vma = vma_lookup(mm, addr);
+ if (!vma) {
+ mmap_read_unlock(mm);
+ return -EFAULT;
+ }
+ if (vma->vm_ops && vma->vm_ops->get_policy)
+ addr_pol = vma->vm_ops->get_policy(vma, addr);
+ else
+ addr_pol = vma->vm_policy;
+
+ kargs->get.addr.mode = addr_pol->mode;
+ /* Mask off internal flags */
+ kargs->get.addr.flags = (addr_pol->flags & MPOL_MODE_FLAGS);
+
+ /*
+ * Take a refcount on the mpol, because we are about to
+ * drop the mmap_lock, after which only "addr_pol" remains
+ * valid, "vma" is stale.
+ */
+ vma = NULL;
+ mpol_get(addr_pol);
+ mmap_read_unlock(mm);
+ rc = lookup_node(mm, addr);
+ mpol_put(addr_pol);
+ if (rc < 0)
+ return rc;
+ kargs->get.addr.node = rc;
+ }
+
+ switch (kargs->mode) {
+ case MPOL_INTERLEAVE:
+ kargs->interleave.next_node = next_node_in(current->il_prev,
+ pol->nodes);
+ rc = 0;
+ break;
+ default:
+ BUG();
+ }
+
+ return rc;
+}
+
+static long kernel_get_mempolicy2(struct mempolicy_args __user *uargs,
+ size_t usize)
+{
+ struct mempolicy_args kargs;
+ int err;
+
+ if (usize < sizeof(kargs))
+ return -EINVAL;
+
+ err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+ if (err)
+ return err;
+
+ /* Get the extended memory policy information */
+ err = do_get_mempolicy2(&kargs);
+ if (err)
+ return err;
+
+ err = copy_to_user(uargs, &kargs, sizeof(kargs)) ? -EFAULT : 0;
+
+ return err;
+}
+
+SYSCALL_DEFINE2(get_mempolicy2, struct mempolicy_args __user *, policy,
+ size_t, size)
+{
+ return kernel_get_mempolicy2(policy, size);
+}
+
static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
const unsigned long __user *old_nodes,
const unsigned long __user *new_nodes)
--
2.39.1