2023-11-22 21:14:16

by Gregory Price

[permalink] [raw]
Subject: [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs

The patch set changes task->mempolicy to be modifiable by tasks other
than just current.

The ultimate goal is to make mempolicy more flexible and extensible,
such as adding interleave weights (which may need to change at runtime
due to hotplug events). Making mempolicy externally modifiable allows
for userland daemons to make runtime performance adjustments to running
tasks without that software needing to be made numa-aware.

This initial RFC involves 3 major updates the mempolicy.

1. Refactor modifying interfaces to accept a task as an argument,
and change existing callers to send `current` in to retain
the existing behavior.

2. Change locking behaviors to ensure task->mpol is referenced
safely by acquiring the task_lock where required. Since
allocators take the alloc lock (task lock), this successfully
prevents changes from being made during allocations.

3. Add external interfaces which allow for a task mempolicy to be
modified by another task. This is implemented in 4 syscalls
and a procfs interface:
sys_set_task_mempolicy
sys_get_task_mempolicy
sys_set_task_mempolicy_home_node
sys_task_mbind
/proc/[pid]/mempolicy

The new syscalls are the same as their current-task counterparts,
except that they take a pid as an argument. The exception is
task_mbind, which required a new struct due to the number of args.

The /proc/pid/mempolicy re-uses the interface mpol_parse_str format
to enable get/set of mempolicy via procsfs.

mpol_parse_str format:
<mode>[=<flags>][:<nodelist>]

Example usage:

echo "default" > /proc/pid/mempolicy
echo "prefer=relative:0" > /proc/pid/mempolicy
echo "interleave:0-3" > /proc/pid/mempolicy

Changing the mempolicy does not induce memory migrations via the
procfs interface (which is the exact same behavior as set_mempolicy).

Signed-off-by: Gregory Price <[email protected]>

Gregory Price (11):
mm/mempolicy: refactor do_set_mempolicy for code re-use
mm/mempolicy: swap cond reference counting logic in do_get_mempolicy
mm/mempolicy: refactor set_mempolicy stack to take a task argument
mm/mempolicy: modify get_mempolicy call stack to take a task argument
mm/mempolicy: modify set_mempolicy_home_node to take a task argument
mm/mempolicy: modify do_mbind to operate on task argument instead of
current
mm/mempolicy: add task mempolicy syscall variants
mm/mempolicy: export replace_mempolicy for use by procfs
mm/mempolicy: build mpol_parse_str unconditionally
mm/mempolicy: mpol_parse_str should ignore trailing characters in
nodelist
fs/proc: Add mempolicy attribute to allow read/write of task mempolicy

arch/x86/entry/syscalls/syscall_32.tbl | 4 +
arch/x86/entry/syscalls/syscall_64.tbl | 4 +
fs/proc/Makefile | 1 +
fs/proc/base.c | 1 +
fs/proc/internal.h | 1 +
fs/proc/mempolicy.c | 117 +++++++
include/linux/mempolicy.h | 13 +-
include/linux/syscalls.h | 14 +
include/uapi/asm-generic/unistd.h | 10 +-
include/uapi/linux/mempolicy.h | 10 +
mm/mempolicy.c | 432 +++++++++++++++++++------
11 files changed, 502 insertions(+), 105 deletions(-)
create mode 100644 fs/proc/mempolicy.c

--
2.39.1


2023-11-22 21:14:20

by Gregory Price

[permalink] [raw]
Subject: [RFC PATCH 04/11] mm/mempolicy: modify get_mempolicy call stack to take a task argument

To make mempolicy fetchable by external tasks, we must first change
the callstack to take a task as an argument.

Modify the following functions to require a task argument:
do_get_mempolicy
kernel_get_mempolicy

The way the task->mm is acquired must change slightly to enable this
change. Originally, do_get_mempolicy would acquire the task->mm
directly via (current->mm). This is unsafe to do in a non-current
context. However, utilizing get_task_mm would break the original
functionality of do_get_mempolicy due to the following check
in get_task_mm:

if (mm) {
if (task->flags & PF_KTHREAD)
mm = NULL;
else
mmget(mm);
}

To retain the original behavior, if (task == current) we access
the task->mm directly, but if (task != current) we will utilize
get_task_mm to safely access the mm.

We simplify the get/put mechanics by always taking a reference to
the mm, even if we are in the context of (task == current).

Additionally, since the mempolicy will become externally modifiable,
we need to take the task lock to acquire task->mempolicy safely,
regardless of whether we are operating on current or not.

Signed-off-by: Gregory Price <[email protected]>
---
mm/mempolicy.c | 43 +++++++++++++++++++++++++++++--------------
1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9ea3e1bfc002..4519f39b1a07 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -899,8 +899,9 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
}

/* Retrieve NUMA policy */
-static long do_get_mempolicy(int *policy, nodemask_t *nmask,
- unsigned long addr, unsigned long flags)
+static long do_get_mempolicy(struct task_struct *task, int *policy,
+ nodemask_t *nmask, unsigned long addr,
+ unsigned long flags)
{
int err;
struct mm_struct *mm;
@@ -915,9 +916,9 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
return -EINVAL;
*policy = 0; /* just so it's initialized */
- task_lock(current);
- *nmask = cpuset_current_mems_allowed;
- task_unlock(current);
+ task_lock(task);
+ *nmask = task->mems_allowed;
+ task_unlock(task);
return 0;
}

@@ -928,7 +929,16 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
* vma/shared policy at addr is NULL. We
* want to return MPOL_DEFAULT in this case.
*/
- mm = current->mm;
+ if (task == current) {
+ /*
+ * original behavior allows a kernel task changing its
+ * own policy to avoid the condition in get_task_mm,
+ * so we'll directly access
+ */
+ mm = task->mm;
+ mmget(mm);
+ } else
+ mm = get_task_mm(task);
mmap_read_lock(mm);
vma = vma_lookup(mm, addr);
if (!vma) {
@@ -947,8 +957,10 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
return -EINVAL;
else {
/* take a reference of the task policy now */
- pol = current->mempolicy;
+ task_lock(task);
+ pol = task->mempolicy;
mpol_get(pol);
+ task_unlock(task);
}

if (!pol) {
@@ -962,12 +974,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
vma = NULL;
mmap_read_unlock(mm);
err = lookup_node(mm, addr);
+ mmput(mm);
if (err < 0)
goto out;
*policy = err;
- } else if (pol == current->mempolicy &&
+ } else if (pol == task->mempolicy &&
pol->mode == MPOL_INTERLEAVE) {
- *policy = next_node_in(current->il_prev, pol->nodes);
+ *policy = next_node_in(task->il_prev, pol->nodes);
} else {
err = -EINVAL;
goto out;
@@ -987,9 +1000,9 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
if (mpol_store_user_nodemask(pol)) {
*nmask = pol->w.user_nodemask;
} else {
- task_lock(current);
+ task_lock(task);
get_policy_nodemask(pol, nmask);
- task_unlock(current);
+ task_unlock(task);
}
}

@@ -1704,7 +1717,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
}

/* Retrieve NUMA policy */
-static int kernel_get_mempolicy(int __user *policy,
+static int kernel_get_mempolicy(struct task_struct *task,
+ int __user *policy,
unsigned long __user *nmask,
unsigned long maxnode,
unsigned long addr,
@@ -1719,7 +1733,7 @@ static int kernel_get_mempolicy(int __user *policy,

addr = untagged_addr(addr);

- err = do_get_mempolicy(&pval, &nodes, addr, flags);
+ err = do_get_mempolicy(task, &pval, &nodes, addr, flags);

if (err)
return err;
@@ -1737,7 +1751,8 @@ SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
unsigned long __user *, nmask, unsigned long, maxnode,
unsigned long, addr, unsigned long, flags)
{
- return kernel_get_mempolicy(policy, nmask, maxnode, addr, flags);
+ return kernel_get_mempolicy(current, policy, nmask, maxnode, addr,
+ flags);
}

bool vma_migratable(struct vm_area_struct *vma)
--
2.39.1

2023-11-22 21:14:24

by Gregory Price

[permalink] [raw]
Subject: [RFC PATCH 08/11] mm/mempolicy: export replace_mempolicy for use by procfs

We will be adding a /proc/pid/mempolicy entry for use in swapping
the mempolicy of a process at runtime. Export replace_mempolicy
so that this can be used by that interface.

Signed-off-by: Gregory Price <[email protected]>
---
include/linux/mempolicy.h | 9 +++++++++
mm/mempolicy.c | 5 ++---
2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..b951e96a53ce 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -177,6 +177,8 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)

extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);

+extern long replace_mempolicy(struct task_struct *task, struct mempolicy *new,
+ nodemask_t *nodes);
#else

struct mempolicy {};
@@ -297,5 +299,12 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
return false;
}

+static inline long replace_mempolicy(struct task_struct *task,
+ struct mempolicy *new,
+ nodemask_t *nodes)
+{
+ return -ENODEV;
+}
+
#endif /* CONFIG_NUMA */
#endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fb295ade8ad7..e0c9127571dd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -815,9 +815,8 @@ static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma,
}

/* Attempt to replace mempolicy, release the old one if successful */
-static long replace_mempolicy(struct task_struct *task,
- struct mempolicy *new,
- nodemask_t *nodes)
+long replace_mempolicy(struct task_struct *task, struct mempolicy *new,
+ nodemask_t *nodes)
{
struct mempolicy *old = NULL;
NODEMASK_SCRATCH(scratch);
--
2.39.1

2023-11-22 21:14:26

by Gregory Price

[permalink] [raw]
Subject: [RFC PATCH 06/11] mm/mempolicy: modify do_mbind to operate on task argument instead of current

To make mbind applied mempolicies modifable by external tasks, we must
first change the do_mbind callstack to take a task as an argument.

This patch includes changes to the following functions:
do_mbind
kernel_mbind
get_vma_policy

And adds the following:
get_task_vma_policy

get_vma_policy is changed into a wrapper of get_task_vma_policy which
passes current as an argument to retain the existing behavior for
callers of get_vma_policy.

do_mbind is modified as followed:

1) the way task->mm is acquired is changed to be safe for non-current
tasks, but the original behavior of (task == current) is retained.
2) we take a reference to the mm so that the task lock can be dropped.
3) the task lock must now be acquired on call to get_task_policy to
ensure we acquire and reference the policy safely.
4) get_task_vma_policy is called instead of get_vma_policy. This
requires taking the task_lock because of the new semantics.

Change to acquiring task->mm:

When (task == curent), if we use get_task_mm, it would prevent a
kernel task from making modifications or accessing information
about its own vma's. So in this scenario, we simply access and
reference the mm directly, since the mempolicy information is
being accessed in the context of the task itself.

if (mm) {
if (task->flags & PF_KTHREAD)
mm = NULL;
else
mmget(mm);
}

The retains the existing behavior.

Change to get_task_vma_policy locking behavior:

Since task->policy is no longer guaranteed to be stable, any time we
seek to acquire a policy via get_task_vma_policy, we must use the
task_lock and reference it accordingly, regardless of where it
ultimately came from. A similar behvior can be seen in
do_get_mempolicy, where a reference is taken and a conditional release
is made to handle the situation where a shared policy is acquired.

In the case of do_mbind, we don't actually need to take a reference
to the policy, as we only call get_task_vma_policy to find the ilx.
In this case, we only need to call mpol_cond_put immediately to ensure
that if get_task_vma_policy returns a shared policy we decrement the
reference count since a shared mpol will return already referenced.

Signed-off-by: Gregory Price <[email protected]>
---
mm/mempolicy.c | 92 +++++++++++++++++++++++++++++++++++++-------------
1 file changed, 69 insertions(+), 23 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 540163f5d349..3d2171ac4098 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -422,6 +422,10 @@ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
unsigned long flags);
static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
pgoff_t ilx, int *nid);
+static struct mempolicy *get_task_vma_policy(struct task_struct *task,
+ struct vm_area_struct *vma,
+ unsigned long addr, int order,
+ pgoff_t *ilx);

static bool strictly_unmovable(unsigned long flags)
{
@@ -1250,11 +1254,12 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
}
#endif

-static long do_mbind(unsigned long start, unsigned long len,
- unsigned short mode, unsigned short mode_flags,
- nodemask_t *nmask, unsigned long flags)
+static long do_mbind(struct task_struct *task, unsigned long start,
+ unsigned long len, unsigned short mode,
+ unsigned short mode_flags, nodemask_t *nmask,
+ unsigned long flags)
{
- struct mm_struct *mm = current->mm;
+ struct mm_struct *mm;
struct vm_area_struct *vma, *prev;
struct vma_iterator vmi;
struct migration_mpol mmpol;
@@ -1287,6 +1292,21 @@ static long do_mbind(unsigned long start, unsigned long len,
if (IS_ERR(new))
return PTR_ERR(new);

+ /*
+ * original behavior allows a kernel task modifying itself to bypass
+ * check in get_task_mm, so directly acquire mm in this case
+ */
+ if (task == current) {
+ mm = task->mm;
+ mmget(mm);
+ } else
+ mm = get_task_mm(task);
+
+ if (!mm) {
+ err = -ENODEV;
+ goto mpol_out;
+ }
+
/*
* If we are using the default policy then operation
* on discontinuous address spaces is okay after all
@@ -1300,7 +1320,9 @@ static long do_mbind(unsigned long start, unsigned long len,
NODEMASK_SCRATCH(scratch);
if (scratch) {
mmap_write_lock(mm);
- err = mpol_set_nodemask(current, new, nmask, scratch);
+ task_lock(task);
+ err = mpol_set_nodemask(task, new, nmask, scratch);
+ task_unlock(task);
if (err)
mmap_write_unlock(mm);
} else
@@ -1308,7 +1330,7 @@ static long do_mbind(unsigned long start, unsigned long len,
NODEMASK_SCRATCH_FREE(scratch);
}
if (err)
- goto mpol_out;
+ goto mm_out;

/*
* Lock the VMAs before scanning for pages to migrate,
@@ -1333,8 +1355,10 @@ static long do_mbind(unsigned long start, unsigned long len,
if (!err && !list_empty(&pagelist)) {
/* Convert MPOL_DEFAULT's NULL to task or default policy */
if (!new) {
- new = get_task_policy(current);
+ task_lock(task);
+ new = get_task_policy(task);
mpol_get(new);
+ task_unlock(task);
}
mmpol.pol = new;
mmpol.ilx = 0;
@@ -1365,8 +1389,11 @@ static long do_mbind(unsigned long start, unsigned long len,
if (addr != -EFAULT) {
order = compound_order(page);
/* We already know the pol, but not the ilx */
- mpol_cond_put(get_vma_policy(vma, addr, order,
- &mmpol.ilx));
+ task_lock(task);
+ mpol_cond_put(get_task_vma_policy(task, vma,
+ addr, order,
+ &mmpol.ilx));
+ task_unlock(task);
/* Set base from which to increment by index */
mmpol.ilx -= page->index >> order;
}
@@ -1386,6 +1413,8 @@ static long do_mbind(unsigned long start, unsigned long len,
err = -EIO;
if (!list_empty(&pagelist))
putback_movable_pages(&pagelist);
+mm_out:
+ mmput(mm);
mpol_out:
mpol_put(new);
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
@@ -1500,8 +1529,9 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
return 0;
}

-static long kernel_mbind(unsigned long start, unsigned long len,
- unsigned long mode, const unsigned long __user *nmask,
+static long kernel_mbind(struct task_struct *task, unsigned long start,
+ unsigned long len, unsigned long mode,
+ const unsigned long __user *nmask,
unsigned long maxnode, unsigned int flags)
{
unsigned short mode_flags;
@@ -1518,7 +1548,7 @@ static long kernel_mbind(unsigned long start, unsigned long len,
if (err)
return err;

- return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
+ return do_mbind(task, start, len, lmode, mode_flags, &nodes, flags);
}

static long __set_mempolicy_home_node(struct task_struct *task,
@@ -1628,7 +1658,7 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
unsigned long, mode, const unsigned long __user *, nmask,
unsigned long, maxnode, unsigned int, flags)
{
- return kernel_mbind(start, len, mode, nmask, maxnode, flags);
+ return kernel_mbind(current, start, len, mode, nmask, maxnode, flags);
}

/* Set the process memory policy */
@@ -1827,6 +1857,31 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
vma->vm_ops->get_policy(vma, addr, ilx) : vma->vm_policy;
}

+/*
+ * Task variant of get_vma_policy for use internally. Returns the task
+ * policy if the vma does not have a policy of its own. get_vma_policy
+ * will return current->mempolicy as a result.
+ *
+ * Like get_vma_policy and get_task_policy, must hold alloc/task_lock
+ * while calling this.
+ */
+static struct mempolicy *get_task_vma_policy(struct task_struct *task,
+ struct vm_area_struct *vma,
+ unsigned long addr, int order,
+ pgoff_t *ilx)
+{
+ struct mempolicy *pol;
+
+ pol = __get_vma_policy(vma, addr, ilx);
+ if (!pol)
+ pol = get_task_policy(task);
+ if (pol->mode == MPOL_INTERLEAVE) {
+ *ilx += vma->vm_pgoff >> order;
+ *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
+ }
+ return pol;
+}
+
/*
* get_vma_policy(@vma, @addr, @order, @ilx)
* @vma: virtual memory area whose policy is sought
@@ -1844,16 +1899,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
unsigned long addr, int order, pgoff_t *ilx)
{
- struct mempolicy *pol;
-
- pol = __get_vma_policy(vma, addr, ilx);
- if (!pol)
- pol = get_task_policy(current);
- if (pol->mode == MPOL_INTERLEAVE) {
- *ilx += vma->vm_pgoff >> order;
- *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
- }
- return pol;
+ return get_task_vma_policy(current, vma, addr, order, ilx);
}

bool vma_policy_mof(struct vm_area_struct *vma)
--
2.39.1

2023-11-22 21:14:30

by Gregory Price

[permalink] [raw]
Subject: [RFC PATCH 10/11] mm/mempolicy: mpol_parse_str should ignore trailing characters in nodelist

When validating MPOL_PREFERRED, the nodelist has already been parsed
and error checked by nodelist_parse. So rather than looping through
the string again, we should just check that the weight of the nodemask
is 1, which is the actual condition we care to check.

This also handles the case where newline characters are present.

Signed-off-by: Gregory Price <[email protected]>
---
mm/mempolicy.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a418af0a1359..eac71f2adfdc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3159,12 +3159,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
* nodelist (or nodes) cannot be empty.
*/
if (nodelist) {
- char *rest = nodelist;
- while (isdigit(*rest))
- rest++;
- if (*rest)
- goto out;
- if (nodes_empty(nodes))
+ if (nodes_weight(nodes) != 1)
goto out;
}
break;
--
2.39.1

2023-11-22 21:34:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs

On Wed, 22 Nov 2023 16:11:49 -0500 Gregory Price <[email protected]> wrote:

> The patch set changes task->mempolicy to be modifiable by tasks other
> than just current.
>
> The ultimate goal is to make mempolicy more flexible and extensible,
> such as adding interleave weights (which may need to change at runtime
> due to hotplug events). Making mempolicy externally modifiable allows
> for userland daemons to make runtime performance adjustments to running
> tasks without that software needing to be made numa-aware.

Please add to this [0/N] a full description of the security aspect: who
can modify whose mempolicy, along with a full description of the
reasoning behind this decision.

> 3. Add external interfaces which allow for a task mempolicy to be
> modified by another task. This is implemented in 4 syscalls
> and a procfs interface:
> sys_set_task_mempolicy
> sys_get_task_mempolicy
> sys_set_task_mempolicy_home_node
> sys_task_mbind
> /proc/[pid]/mempolicy

Why is the procfs interface needed? Doesn't it simply duplicate the
syscall interface? Please update [0/N] with a description of this
decision.

> The new syscalls are the same as their current-task counterparts,
> except that they take a pid as an argument. The exception is
> task_mbind, which required a new struct due to the number of args.
>
> The /proc/pid/mempolicy re-uses the interface mpol_parse_str format
> to enable get/set of mempolicy via procsfs.
>
> mpol_parse_str format:
> <mode>[=<flags>][:<nodelist>]
>
> Example usage:
>
> echo "default" > /proc/pid/mempolicy
> echo "prefer=relative:0" > /proc/pid/mempolicy
> echo "interleave:0-3" > /proc/pid/mempolicy

What do we get when we read from this? Please add to changelog.

> Changing the mempolicy does not induce memory migrations via the
> procfs interface (which is the exact same behavior as set_mempolicy).
>

2023-11-22 21:36:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs

On Wed, 22 Nov 2023 13:33:48 -0800 Andrew Morton <[email protected]> wrote:

> On Wed, 22 Nov 2023 16:11:49 -0500 Gregory Price <[email protected]> wrote:
> > echo "default" > /proc/pid/mempolicy
> > echo "prefer=relative:0" > /proc/pid/mempolicy
> > echo "interleave:0-3" > /proc/pid/mempolicy
>
> What do we get when we read from this? Please add to changelog.
>

Also a description of the permissions for this procfs file, along with
reasoning. If it has global readability, and there's something
interesting in there, let's show that the security implications have
been fully considered.

2023-11-22 22:24:49

by Gregory Price

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs

On Wed, Nov 22, 2023 at 01:33:48PM -0800, Andrew Morton wrote:
> On Wed, 22 Nov 2023 16:11:49 -0500 Gregory Price <[email protected]> wrote:
>
> > The patch set changes task->mempolicy to be modifiable by tasks other
> > than just current.
> >
> > The ultimate goal is to make mempolicy more flexible and extensible,
> > such as adding interleave weights (which may need to change at runtime
> > due to hotplug events). Making mempolicy externally modifiable allows
> > for userland daemons to make runtime performance adjustments to running
> > tasks without that software needing to be made numa-aware.
>
> Please add to this [0/N] a full description of the security aspect: who
> can modify whose mempolicy, along with a full description of the
> reasoning behind this decision.
>

Will do. For the sake of v0 for now:

1) the task itself (task == current)
for obvious reasons: it already can

2) from external interfaces: CAP_SYS_NICE

There might be an argument for CAP_SYS_ADMIN, but CAP_SYS_NICE has
access to scheduling controls, and mbind uses CAP_SYS_NICE to validate
whether shared pages can be migrated. The same is true of migrate_pages
and other memory management controls. For this reason, I chose to gate
the task syscalls behind CAP_SYS_NICE unless (task == current).

I'm by no means an expert in this area, so slap away if i'm egregiously
wrong here.

I will add additional security context to v2 as to what impacts changing
a mempolicy can have at runtime. This will mostly be related to cpusets
implications, as mempolicy itself is not a "constraining" interface in
terms of security. For example: one can mbind/interleave/whatever a set
of nodes, and then use migrate_pages or move_pages to violate that
mempolicy. This is explicitly allowed and discussed in the
implementation of the existing syscalls / libnuma.

However, if cpusets must be respected.

This is why i refactored out replace_mempolicy and reused it, because
this enforcement is already handled by checking task->mems_allowed.

> > 3. Add external interfaces which allow for a task mempolicy to be
> > modified by another task. This is implemented in 4 syscalls
> > and a procfs interface:
> > sys_set_task_mempolicy
> > sys_get_task_mempolicy
> > sys_set_task_mempolicy_home_node
> > sys_task_mbind
> > /proc/[pid]/mempolicy
>
> Why is the procfs interface needed? Doesn't it simply duplicate the
> syscall interface? Please update [0/N] with a description of this
> decision.
>

Honestly I wrote the procfs interface first, and then came back around
to just implement the syscalls. mbind is not friendly to being procfs'd
so if the preference is to have only one, not both, then it should
probably be the syscalls.

That said, when I introduce weighted interleave on top of this, having a
simple procfs interface to those weights would be valuable, so I
imagined something like `proc/mempolicy` to determine if interleave was
being used and something like `proc/mpol_interleave_weights` for a clean
interface to update weights.

However, in the same breath, I have a prior RFC with set/get_mempolicy2
which could probably take all future mempolicy extensions and wrap them
up into one pair of syscalls, instead of us ending up with 200 more
sys_mempolicy_whatever as memory attached fabrics become more common.

So... yeah... the is one area I think the community very much needs to
comment: set/get_mempolicy2, many new mempolicy syscalls, procfs? All
of the above?

The procfs route provides a command-line user a nice, clean way to
update policies without the need for an additional tool, but if there is
an "all or nothing" preference on mempolicy controls - then procfs is
probably not the way to go.

This RFC at least shows there are options. I very much welcome input in
this particular area.

> > The new syscalls are the same as their current-task counterparts,
> > except that they take a pid as an argument. The exception is
> > task_mbind, which required a new struct due to the number of args.
> >
> > The /proc/pid/mempolicy re-uses the interface mpol_parse_str format
> > to enable get/set of mempolicy via procsfs.
> >
> > mpol_parse_str format:
> > <mode>[=<flags>][:<nodelist>]
> >
> > Example usage:
> >
> > echo "default" > /proc/pid/mempolicy
> > echo "prefer=relative:0" > /proc/pid/mempolicy
> > echo "interleave:0-3" > /proc/pid/mempolicy
>
> What do we get when we read from this? Please add to changelog.
>
> Also a description of the permissions for this procfs file, along with
> reasoning. If it has global readability, and there's something
> interesting in there, let's show that the security implications have
> been fully considered.
>

Ah, should have included that. Will add. For the sake of v0:

Current permissions: (S_IRUSR|S_IWUSR)
Which presumes the owner and obviosly root. Tried parity the syscall.

the total set of (current) policy outputs are:

"default"
"local"
"prefer:node"
"prefer=static:node"
"prefer=relative:node"
"prefer (many):nodelist"
"prefer (many)=static:nodelist"
"prefer (many)=relative:nodelist"
"interleave:nodelist"
"interleave=static:nodelist"
"interleave=relative:nodelist"
"bind:nodelist"
"bind=static:nodelist"
"bind=relative:nodelist"

There doesn't seem to be much of a security implication here, at least
not anything that can't already be gleaned via something like numa_maps,
but it does provide *some* level of memory placement imformation, so
it's still probably best gated behind owner/root.

That said, changing this policy may not imply it is actually used,
because individual VMA policies can override this policy. So it really
doesn't provide much info at all.

Something I just noticed: mpol_parse_str does not presently support the
numa balancing flag, so that would have to be added to achieve parity
with the set_mempolicy syscall.

> > Changing the mempolicy does not induce memory migrations via the
> > procfs interface (which is the exact same behavior as set_mempolicy).
> >
>

Thanks for taking a quick look!
~Gregory

2023-11-27 15:31:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs

Sorry, didn't have much time to do a proper review. Couple of points
here at least.

On Wed 22-11-23 17:24:10, Gregory Price wrote:
> On Wed, Nov 22, 2023 at 01:33:48PM -0800, Andrew Morton wrote:
> > On Wed, 22 Nov 2023 16:11:49 -0500 Gregory Price <[email protected]> wrote:
> >
> > > The patch set changes task->mempolicy to be modifiable by tasks other
> > > than just current.
> > >
> > > The ultimate goal is to make mempolicy more flexible and extensible,
> > > such as adding interleave weights (which may need to change at runtime
> > > due to hotplug events). Making mempolicy externally modifiable allows
> > > for userland daemons to make runtime performance adjustments to running
> > > tasks without that software needing to be made numa-aware.
> >
> > Please add to this [0/N] a full description of the security aspect: who
> > can modify whose mempolicy, along with a full description of the
> > reasoning behind this decision.
> >
>
> Will do. For the sake of v0 for now:
>
> 1) the task itself (task == current)
> for obvious reasons: it already can
>
> 2) from external interfaces: CAP_SYS_NICE

Makes sense.

[...]
> > > 3. Add external interfaces which allow for a task mempolicy to be
> > > modified by another task. This is implemented in 4 syscalls
> > > and a procfs interface:
> > > sys_set_task_mempolicy
> > > sys_get_task_mempolicy
> > > sys_set_task_mempolicy_home_node
> > > sys_task_mbind
> > > /proc/[pid]/mempolicy
> >
> > Why is the procfs interface needed? Doesn't it simply duplicate the
> > syscall interface? Please update [0/N] with a description of this
> > decision.
> >
>
> Honestly I wrote the procfs interface first, and then came back around
> to just implement the syscalls. mbind is not friendly to being procfs'd
> so if the preference is to have only one, not both, then it should
> probably be the syscalls.
>
> That said, when I introduce weighted interleave on top of this, having a
> simple procfs interface to those weights would be valuable, so I
> imagined something like `proc/mempolicy` to determine if interleave was
> being used and something like `proc/mpol_interleave_weights` for a clean
> interface to update weights.
>
> However, in the same breath, I have a prior RFC with set/get_mempolicy2
> which could probably take all future mempolicy extensions and wrap them
> up into one pair of syscalls, instead of us ending up with 200 more
> sys_mempolicy_whatever as memory attached fabrics become more common.
>
> So... yeah... the is one area I think the community very much needs to
> comment: set/get_mempolicy2, many new mempolicy syscalls, procfs? All
> of the above?

I think we should actively avoid using proc interface. The most
reasonable way would be to add get_mempolicy2 interface that would allow
extensions and then create a pidfd counterpart to allow acting on a
remote task. The latter would require some changes to make mempolicy
code less current oriented.
--
Michal Hocko
SUSE Labs

2023-11-27 16:16:15

by Gregory Price

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs

On Mon, Nov 27, 2023 at 04:29:56PM +0100, Michal Hocko wrote:
> Sorry, didn't have much time to do a proper review. Couple of points
> here at least.
>
> >
> > So... yeah... the is one area I think the community very much needs to
> > comment: set/get_mempolicy2, many new mempolicy syscalls, procfs? All
> > of the above?
>
> I think we should actively avoid using proc interface. The most
> reasonable way would be to add get_mempolicy2 interface that would allow
> extensions and then create a pidfd counterpart to allow acting on a
> remote task. The latter would require some changes to make mempolicy
> code less current oriented.

Sounds good, I'll pull my get/set_mempolicy2 RFC on top of this.

Just context: patches 1-6 refactor mempolicy to allow remote task
twiddling (fixing the current-oriented issues), and patch 7 adds the pidfd
interfaces you describe above.


Couple Questions

1) Should we consider simply adding a pidfd arg to set/get_mempolicy2,
where if (pidfd == 0), then it operates on current, otherwise it
operates on the target task? That would mitigate the need for what
amounts to the exact same interface.

2) Should we combine all the existing operations into set_mempolicy2 and
add an operation arg.

set_mempolicy2(pidfd, arg_struct, len)

struct {
int pidfd; /* optional */
int operation; /* describe which op_args to use */
union {
struct {
} set_mempolicy;
struct {
} set_vma_home_node;
struct {
} mbind;
...
} op_args;
} args;

capturing:
sys_set_mempolicy
sys_set_mempolicy_home_node
sys_mbind

or should we just make a separate interface for mbind/home_node to
limit complexity of the single syscall?

Personally I like the dispatch for the extensibility nature of the arg
struct, but I can understand wanting to limit complexity of a syscall
interface for a variety of reasons.

~Gregory

2023-11-28 09:45:24

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs

On Mon 27-11-23 11:14:44, Gregory Price wrote:
> On Mon, Nov 27, 2023 at 04:29:56PM +0100, Michal Hocko wrote:
> > Sorry, didn't have much time to do a proper review. Couple of points
> > here at least.
> >
> > >
> > > So... yeah... the is one area I think the community very much needs to
> > > comment: set/get_mempolicy2, many new mempolicy syscalls, procfs? All
> > > of the above?
> >
> > I think we should actively avoid using proc interface. The most
> > reasonable way would be to add get_mempolicy2 interface that would allow
> > extensions and then create a pidfd counterpart to allow acting on a
> > remote task. The latter would require some changes to make mempolicy
> > code less current oriented.
>
> Sounds good, I'll pull my get/set_mempolicy2 RFC on top of this.
>
> Just context: patches 1-6 refactor mempolicy to allow remote task
> twiddling (fixing the current-oriented issues), and patch 7 adds the pidfd
> interfaces you describe above.
>
>
> Couple Questions
>
> 1) Should we consider simply adding a pidfd arg to set/get_mempolicy2,
> where if (pidfd == 0), then it operates on current, otherwise it
> operates on the target task? That would mitigate the need for what
> amounts to the exact same interface.

This wouldn't fit into existing pidfd interfaces I am aware of. We
assume pidfd to be real fd, no special cases.

> 2) Should we combine all the existing operations into set_mempolicy2 and
> add an operation arg.
>
> set_mempolicy2(pidfd, arg_struct, len)
>
> struct {
> int pidfd; /* optional */
> int operation; /* describe which op_args to use */
> union {
> struct {
> } set_mempolicy;
> struct {
> } set_vma_home_node;
> struct {
> } mbind;
> ...
> } op_args;
> } args;
>
> capturing:
> sys_set_mempolicy
> sys_set_mempolicy_home_node
> sys_mbind
>
> or should we just make a separate interface for mbind/home_node to
> limit complexity of the single syscall?

My preference would be to go with specific syscalls. Multiplexing
syscalls have turned much more complex and less flexible over time.
Just have a look at futex.
--
Michal Hocko
SUSE Labs

2023-11-28 13:16:45

by Gregory Price

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs

On Tue, Nov 28, 2023 at 10:45:02AM +0100, Michal Hocko wrote:
> > 2) Should we combine all the existing operations into set_mempolicy2 and
> > add an operation arg.
> >
> > set_mempolicy2(pidfd, arg_struct, len)
> >
> > struct {
> > int pidfd; /* optional */
> > int operation; /* describe which op_args to use */
> > union {
> > struct {
> > } set_mempolicy;
> > struct {
> > } set_vma_home_node;
> > struct {
> > } mbind;
> > ...
> > } op_args;
> > } args;
> >
> > capturing:
> > sys_set_mempolicy
> > sys_set_mempolicy_home_node
> > sys_mbind
> >
> > or should we just make a separate interface for mbind/home_node to
> > limit complexity of the single syscall?
>
> My preference would be to go with specific syscalls. Multiplexing
> syscalls have turned much more complex and less flexible over time.
> Just have a look at futex.

got it, that simplifies things a bit. I can pull my set/get mempolicy2
work forward and just keep the interfaces pretty much the same. Only
difference being an argument structure that is extensible and possibly
some additional refactoring in do_get_mempolicy to make things a bit
cleaner.

~Gregory

2023-11-28 14:12:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 06/11] mm/mempolicy: modify do_mbind to operate on task argument instead of current

On Wed 22-11-23 16:11:55, Gregory Price wrote:
[...]
> + * Like get_vma_policy and get_task_policy, must hold alloc/task_lock
> + * while calling this.
> + */
> +static struct mempolicy *get_task_vma_policy(struct task_struct *task,
> + struct vm_area_struct *vma,
> + unsigned long addr, int order,
> + pgoff_t *ilx)
[...]

You should add lockdep annotation for alloc_lock/task_lock here for clarity and
also...
> @@ -1844,16 +1899,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> unsigned long addr, int order, pgoff_t *ilx)
> {
> - struct mempolicy *pol;
> -
> - pol = __get_vma_policy(vma, addr, ilx);
> - if (!pol)
> - pol = get_task_policy(current);
> - if (pol->mode == MPOL_INTERLEAVE) {
> - *ilx += vma->vm_pgoff >> order;
> - *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> - }
> - return pol;
> + return get_task_vma_policy(current, vma, addr, order, ilx);

I do not think that all get_vma_policy take task_lock (just random check
dequeue_hugetlb_folio_vma->huge_node->get_vma_policy AFAICS)

Also I do not see policy_nodemask to be handled anywhere. That one is
used along with get_vma_policy (sometimes hidden like in
alloc_pages_mpol). It has a dependency on
cpuset_nodemask_valid_mems_allowed. That means that e.g. mbind on a
remote task would be constrained by current task cpuset when allocating
migration targets for the target task. I am wondering how many other
dependencies like that are lurking there.
--
Michal Hocko
SUSE Labs

2023-11-28 14:49:36

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 04/11] mm/mempolicy: modify get_mempolicy call stack to take a task argument

[restoring the CC list as I believe this was not meant to be a private
response]

On Tue 28-11-23 09:12:35, Gregory Price wrote:
> On Tue, Nov 28, 2023 at 03:07:28PM +0100, Michal Hocko wrote:
> > On Wed 22-11-23 16:11:53, Gregory Price wrote:
> > [...]
> > > @@ -928,7 +929,16 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> > > * vma/shared policy at addr is NULL. We
> > > * want to return MPOL_DEFAULT in this case.
> > > */
> > > - mm = current->mm;
> > > + if (task == current) {
> > > + /*
> > > + * original behavior allows a kernel task changing its
> > > + * own policy to avoid the condition in get_task_mm,
> > > + * so we'll directly access
> > > + */
> > > + mm = task->mm;
> > > + mmget(mm);
> >
> > Do we actually have any kernel thread that would call this? Does it
> > actually make sense to support?
> >
>
> This was changed in the upcoming v2 by using the pidfd interface for
> referencing both the task and the mm, so this code is a bit dead.

OK, that is the right thing to do IMHO. Allowing modifications on memory
policies on borrowed mms sounds rather weird and if we do not have any
actual usecases that would require that support then I would rather not
open that possibility at all.

--
Michal Hocko
SUSE Labs

2023-11-28 14:53:26

by Gregory Price

[permalink] [raw]
Subject: Re: [RFC PATCH 06/11] mm/mempolicy: modify do_mbind to operate on task argument instead of current

On Tue, Nov 28, 2023 at 03:11:06PM +0100, Michal Hocko wrote:
> On Wed 22-11-23 16:11:55, Gregory Price wrote:
> [...]
> > + * Like get_vma_policy and get_task_policy, must hold alloc/task_lock
> > + * while calling this.
> > + */
> > +static struct mempolicy *get_task_vma_policy(struct task_struct *task,
> > + struct vm_area_struct *vma,
> > + unsigned long addr, int order,
> > + pgoff_t *ilx)
> [...]
>
> You should add lockdep annotation for alloc_lock/task_lock here for clarity and
> also...
> > @@ -1844,16 +1899,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> > struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> > unsigned long addr, int order, pgoff_t *ilx)
> > {
> > - struct mempolicy *pol;
> > -
> > - pol = __get_vma_policy(vma, addr, ilx);
> > - if (!pol)
> > - pol = get_task_policy(current);
> > - if (pol->mode == MPOL_INTERLEAVE) {
> > - *ilx += vma->vm_pgoff >> order;
> > - *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> > - }
> > - return pol;
> > + return get_task_vma_policy(current, vma, addr, order, ilx);
>
> I do not think that all get_vma_policy take task_lock (just random check
> dequeue_hugetlb_folio_vma->huge_node->get_vma_policy AFAICS)
>

hm, i might have gotten turned around on this one. Forgot to check for
external references to get_vma_policy. I thought I considered it, but i
clearly did not leave myself any notes if I did.

This pattern is troublesome, we're holding the task lock during the
callback stack in __get_vma_policy - just incase that returns NULL so we
can return the task policy instead. If that vma is shared, it will take
the vma shared policy lock (sp->lock)

I almost want to change this interface to return NULL if the VMA doesn't
have one, and change callers to fetch the task policy explicitly instead
of implicitly returning the task policy. At least then we'd only take
the task lock on an explicit access to the *Task* policy.

> Also I do not see policy_nodemask to be handled anywhere. That one is
> used along with get_vma_policy (sometimes hidden like in
> alloc_pages_mpol). It has a dependency on
> cpuset_nodemask_valid_mems_allowed. That means that e.g. mbind on a
> remote task would be constrained by current task cpuset when allocating
> migration targets for the target task. I am wondering how many other
> dependencies like that are lurking there.

bah!

thought i dug all these out, but i missed alloc_migration_target_by_mpol
from do_mbind.

I'll need to take another look at the calls to cpusets interfaces to
make sure i dig this out. The number of hidden accesses to current is
really nasty :[

> --
> Michal Hocko
> SUSE Labs

2023-11-28 18:09:36

by Gregory Price

[permalink] [raw]
Subject: Re: [RFC PATCH 06/11] mm/mempolicy: modify do_mbind to operate on task argument instead of current

On Tue, Nov 28, 2023 at 03:11:06PM +0100, Michal Hocko wrote:
> On Wed 22-11-23 16:11:55, Gregory Price wrote:
> [...]
> > + * Like get_vma_policy and get_task_policy, must hold alloc/task_lock
> > + * while calling this.
> > + */
> > +static struct mempolicy *get_task_vma_policy(struct task_struct *task,
> > + struct vm_area_struct *vma,
> > + unsigned long addr, int order,
> > + pgoff_t *ilx)
> [...]
>
> You should add lockdep annotation for alloc_lock/task_lock here for clarity and
> also...
> > @@ -1844,16 +1899,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> > struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> > unsigned long addr, int order, pgoff_t *ilx)
> > {
> > - struct mempolicy *pol;
> > -
> > - pol = __get_vma_policy(vma, addr, ilx);
> > - if (!pol)
> > - pol = get_task_policy(current);
> > - if (pol->mode == MPOL_INTERLEAVE) {
> > - *ilx += vma->vm_pgoff >> order;
> > - *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> > - }
> > - return pol;
> > + return get_task_vma_policy(current, vma, addr, order, ilx);
>
> I do not think that all get_vma_policy take task_lock (just random check
> dequeue_hugetlb_folio_vma->huge_node->get_vma_policy AFAICS)
>
> Also I do not see policy_nodemask to be handled anywhere. That one is
> used along with get_vma_policy (sometimes hidden like in
> alloc_pages_mpol). It has a dependency on
> cpuset_nodemask_valid_mems_allowed. That means that e.g. mbind on a
> remote task would be constrained by current task cpuset when allocating
> migration targets for the target task. I am wondering how many other
> dependencies like that are lurking there.

So after further investigation, I'm going to have to back out the
changes that make home_node and mbind modifiable by an external task
and revisit it at a later time.

Right now, there's a very nasty rats nest of entanglement between
mempolicy and vma/shmem that hides a bunch of accesses to current.

It only becomes apparently when you start chasing all the callers of
mpol_dup, which had another silent access to current->cpusets.

mpol_dup calls the following:
current_cpuset_is_being_rebound
cpuset_mems_allowed(current)

So we would need to do the following
1) create mpol_dup_task and make current explicit, not implicit
2) chase down all callers to mpol_dup and make sure it isn't generated
from any of the task interfaces
3) if it is generated from the task interfaces, plumb a reference to
current down through... somehow... if possible...

Here's a ~1 hour chase that lead me to the conclusion that this will
take considerably more work, and is not to be taken lightly:

do_mbind
mbind_range
vma_modify_policy
split_vma
__split_vma
vma_dup_policy
mpol_dup
vma_replace_policy
mpol_dup
vma->vm_ops->set_policy - see below

__set_mempolicy_home_node
mbind_range
... same as above ...

digging into vma->vm_ops->set_policy we end up in mm/shmem.c

shmem_set_policy
mpol_set_shared_policy
sp_alloc
mpol_dup
current_cpuset_is_being_rebound()
cpuset_mems_allowed(current)

Who knows what else is burried in the vma stack, but making vma
mempolicies externally modifiable looks to be a much more monumental
task than just simply making the task policy modifiable.

For now i'm going to submit a V2 with home_node and mbind removed from
the proposal. Those will take far more investigation.

This also means that process_set_mempolicy should not be extended to
allow for vma policy replacements.

~Gregory