From: Vinicius Tavares Petrucci <[email protected]>
This patch introduces `process_mbind()` to enable a userspace orchestrator with
an understanding of another process's memory layout to alter its memory policy.
As system memory configurations become increasingly complex (e.g., DDR+HBM+CXL memories),
such a userspace orchestrator can explore more advanced techniques to guide memory placement
to individual NUMA nodes across memory tiers. This allows for a more efficient allocation of
memory resources, leading to enhanced application performance.
There are existing alternatives such as LD_PRELOAD (https://pmem.io/memkind/) or
syscall_intercept (https://github.com/pmem/syscall_intercept), but these techniques,
beyond lacking portability and universality, can lead to system incompatibility issues,
inconsistent application behavior, risks from global system-wide settings, and
increased implementation complexity.
The concept of an external entity that understands the layout of another process's VM
is already present with `process_madvise()`. Thus, it seems reasonable to introduce
the `process_mbind` variant of `mbind`. The implementation of `process_mbind()`
mirrors that of `process_madvise()`: it uses the pidfd of an external process to direct
the memory policy and supports a vector of memory address ranges.
The general use case here is similar to the prior RFC `pidfd_set_mempolicy()`
(https://lore.kernel.org/linux-mm/[email protected]/),
but offers a more fine-grained external control by binding specific memory regions
(say, heap data structures) to specific NUMA nodes. Another concrete use case was described
in prior work showing up to a 2X runtime improvement (compared to AutoNUMA tiering) using
memory object/region-based memory placement for workloads with irregular access patterns
such as graph analytics: https://iiswc.org/iiswc2022/IISWC2022_42.pdf
The proposed API is as follows:
long process_mbind(int pidfd,
                   const struct iovec *iovec,
                   unsigned long vlen,
                   unsigned long mode,
                   const unsigned long *nmask,
                   unsigned int flags);
The `pidfd` argument selects the target process identified by the PID file
descriptor provided in pidfd. (See pidfd_open(2) for more information.)
The pointer `iovec` points to an array of iovec structures (as described in <sys/uio.h>):
struct iovec {
    void  *iov_base;    /* starting address of region */
    size_t iov_len;     /* size of region (in bytes) */
};
Each element of `iovec` describes a memory region that starts at the address
iov_base and spans iov_len bytes.
The `vlen` argument gives the number of elements in `iovec`.
Please note the initial `maxnode` parameter from `mbind` was omitted to keep the
API within 6 arguments; the constant MAX_NUMNODES is used instead.
Please see the mbind(2) man page for more details about the remaining arguments.
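For illustration, here is a minimal sketch of how an orchestrator might invoke the
proposed call. The syscall number matches the x86-64 entry added below; the target
PID, the region address, and the wrapper itself are placeholders, not part of this patch:

#define _GNU_SOURCE
#include <numaif.h>             /* MPOL_BIND, MPOL_MF_MOVE (libnuma headers) */
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_process_mbind
#define __NR_process_mbind 457  /* x86-64 number added by this patch */
#endif

static long process_mbind(int pidfd, const struct iovec *iov,
                          unsigned long vlen, unsigned long mode,
                          const unsigned long *nmask, unsigned int flags)
{
        return syscall(__NR_process_mbind, pidfd, iov, vlen, mode, nmask, flags);
}

int main(void)
{
        /* Placeholder pid and region; a real orchestrator would derive
         * regions from /proc/<pid>/maps or application-provided hints. */
        int pidfd = syscall(SYS_pidfd_open, 1234, 0);
        struct iovec regions[] = {
                { .iov_base = (void *)0x7f0000000000ul, .iov_len = 1 << 20 },
        };
        /* Oversized buffer: with maxnode dropped, the kernel reads
         * MAX_NUMNODES bits from this mask. */
        unsigned long nmask[16] = { 1ul << 1 };     /* bind to node 1 */

        if (process_mbind(pidfd, regions, 1, MPOL_BIND, nmask, MPOL_MF_MOVE) < 0)
                perror("process_mbind");
        return 0;
}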
Additionally, it is worth noting the following:
- Using a vector of address ranges as an argument in `process_mbind` provides more
flexibility than the original `mbind` system call, even when invoked on the current
(local) process.
- In contrast to `move_pages`, which requires an array of fixed-size pages,
`process_mbind` (with the MPOL_MF_MOVE* flags) offers a more convenient and flexible
page migration capability on a per-object or per-region basis.
- Similar to `process_madvise`, manipulating the memory binding of external processes
necessitates `CAP_SYS_NICE` and `PTRACE_MODE_READ_FSCREDS` checks (refer to ptrace(2)).
Suggested-by: Frank van der Linden <[email protected]>
Signed-off-by: Vinicius Tavares Petrucci <[email protected]>
Signed-off-by: Hasan Al Maruf <[email protected]>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/syscalls.h               |  4 ++
 include/uapi/asm-generic/unistd.h      |  4 +-
 kernel/sys_ni.c                        |  1 +
 mm/mempolicy.c                         | 86 +++++++++++++++++++++++++-
 5 files changed, 92 insertions(+), 4 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8cb8bf68721c..9d9db49a3242 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -378,6 +378,7 @@
454 common futex_wake sys_futex_wake
455 common futex_wait sys_futex_wait
456 common futex_requeue sys_futex_requeue
+457 common process_mbind sys_process_mbind
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fd9d12de7e92..def5250ed625 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -816,6 +816,10 @@ asmlinkage long sys_mbind(unsigned long start, unsigned long len,
const unsigned long __user *nmask,
unsigned long maxnode,
unsigned flags);
+asmlinkage long sys_process_mbind(int pidfd, const struct iovec __user *vec,
+ size_t vlen, unsigned long mode,
+ const unsigned long __user *nmask,
+ unsigned flags);
asmlinkage long sys_get_mempolicy(int __user *policy,
unsigned long __user *nmask,
unsigned long maxnode,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 756b013fb832..9ed2c91940d6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -828,9 +828,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
__SYSCALL(__NR_futex_wait, sys_futex_wait)
#define __NR_futex_requeue 456
__SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_process_mbind 457
+__SYSCALL(__NR_process_mbind, sys_process_mbind)
#undef __NR_syscalls
-#define __NR_syscalls 457
+#define __NR_syscalls 458
/*
* 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e1a6e3c675c0..cc5cb5ae3ae7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -187,6 +187,7 @@ COND_SYSCALL(process_madvise);
COND_SYSCALL(process_mrelease);
COND_SYSCALL(remap_file_pages);
COND_SYSCALL(mbind);
+COND_SYSCALL(process_mbind);
COND_SYSCALL(get_mempolicy);
COND_SYSCALL(set_mempolicy);
COND_SYSCALL(migrate_pages);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..91ee300fa728 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1215,11 +1215,10 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
}
#endif
-static long do_mbind(unsigned long start, unsigned long len,
+static long do_mbind(struct mm_struct *mm, unsigned long start, unsigned long len,
unsigned short mode, unsigned short mode_flags,
nodemask_t *nmask, unsigned long flags)
{
- struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
struct vma_iterator vmi;
struct migration_mpol mmpol;
@@ -1465,10 +1464,84 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
return 0;
}
+static long kernel_mbind_process(int pidfd, const struct iovec __user *vec,
+ size_t vlen, unsigned long mode,
+ const unsigned long __user *nmask, unsigned int flags)
+{
+ ssize_t ret;
+ struct iovec iovstack[UIO_FASTIOV];
+ struct iovec *iov = iovstack;
+ struct iov_iter iter;
+ struct task_struct *task;
+ struct mm_struct *mm;
+ unsigned int f_flags;
+ unsigned short mode_flags;
+ int lmode = mode;
+ unsigned long maxnode = MAX_NUMNODES;
+ int err;
+ nodemask_t nodes;
+
+ err = sanitize_mpol_flags(&lmode, &mode_flags);
+ if (err)
+ goto out;
+
+ err = get_nodes(&nodes, nmask, maxnode);
+ if (err)
+ goto out;
+
+ ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack),
+ &iov, &iter);
+ if (ret < 0)
+ goto out;
+
+ task = pidfd_get_task(pidfd, &f_flags);
+ if (IS_ERR(task)) {
+ ret = PTR_ERR(task);
+ goto free_iov;
+ }
+
+ /* From process_madvise: Require PTRACE_MODE_READ
+ * to avoid leaking ASLR metadata. */
+ mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
+ if (IS_ERR_OR_NULL(mm)) {
+ ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+ goto release_task;
+ }
+
+ /* From process_madvise: Require CAP_SYS_NICE for
+ * influencing process performance. */
+ if (!capable(CAP_SYS_NICE)) {
+ ret = -EPERM;
+ goto release_mm;
+ }
+
+ while (iov_iter_count(&iter)) {
+ unsigned long start = untagged_addr(
+ (unsigned long)iter_iov_addr(&iter));
+ unsigned long len = iter_iov_len(&iter);
+
+ ret = do_mbind(mm, start, len, lmode, mode_flags,
+ &nodes, flags);
+ if (ret < 0)
+ break;
+ iov_iter_advance(&iter, iter_iov_len(&iter));
+ }
+
+release_mm:
+ mmput(mm);
+release_task:
+ put_task_struct(task);
+free_iov:
+ kfree(iov);
+out:
+ return ret;
+}
+
static long kernel_mbind(unsigned long start, unsigned long len,
unsigned long mode, const unsigned long __user *nmask,
unsigned long maxnode, unsigned int flags)
{
+ struct mm_struct *mm = current->mm;
unsigned short mode_flags;
nodemask_t nodes;
int lmode = mode;
@@ -1483,7 +1556,7 @@ static long kernel_mbind(unsigned long start, unsigned long len,
if (err)
return err;
- return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
+ return do_mbind(mm, start, len, lmode, mode_flags, &nodes, flags);
}
SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
@@ -1553,6 +1626,13 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
return err;
}
+SYSCALL_DEFINE6(process_mbind, int, pidfd, const struct iovec __user *, vec,
+ size_t, vlen, unsigned long, mode,
+ const unsigned long __user *, nmask, unsigned int, flags)
+{
+ return kernel_mbind_process(pidfd, vec, vlen, mode, nmask, flags);
+}
+
SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
unsigned long, mode, const unsigned long __user *, nmask,
unsigned long, maxnode, unsigned int, flags)
--
2.41.0
On Wed, 22 Nov 2023 15:31:05 -0600 Vinicius Petrucci <[email protected]> wrote:
> From: Vinicius Tavares Petrucci <[email protected]>
>
> This patch introduces `process_mbind()` to enable a userspace orchestrator with
> an understanding of another process's memory layout to alter its memory policy.
I'm having deja vu. Not 10 minutes ago, Gregory sent out a patchset
which does the same thing.
https://lkml.kernel.org/r/[email protected]
Please share notes ;)
On Wed, Nov 22, 2023 at 01:39:44PM -0800, Andrew Morton wrote:
> On Wed, 22 Nov 2023 15:31:05 -0600 Vinicius Petrucci <[email protected]> wrote:
>
> > From: Vinicius Tavares Petrucci <[email protected]>
> >
> > This patch introduces `process_mbind()` to enable a userspace orchestrator with
> > an understanding of another process's memory layout to alter its memory policy.
>
>
> I'm having deja vu. Not 10 minutes ago, Gregory sent out a patchset
> which does the same thing.
>
> https://lkml.kernel.org/r/[email protected]
>
> Please share notes ;)
Heh, we discussed doing this at Linux Plumbers, as well as a long set of RFCs
related to weighted mempolicies.
Guess we were in a bit of a race to RFC without knowing!
More context: mhocko suggested this interface would be welcome and useful before
the introduction of weighted interleave.
link: https://lore.kernel.org/linux-mm/ZVNBMW8iJIGDyp0y@tiehlicka/
~Gregory
On Wed, Nov 22, 2023 at 03:31:05PM -0600, Vinicius Petrucci wrote:
> From: Vinicius Tavares Petrucci <[email protected]>
>
> The proposed API is as follows:
>
> long process_mbind(int pidfd,
>                    const struct iovec *iovec,
>                    unsigned long vlen,
>                    unsigned long mode,
>                    const unsigned long *nmask,
>                    unsigned int flags);
>
> The `pidfd` argument selects the target process identified by the PID file
> descriptor provided in pidfd. (See pidfd_open(2) for more information.)
>
This is probably a more maintainable interface than the pid_t interface
in my RFC, but I have not used pidfd extensively myself.
I can pivot my RFC to utilize pidfd as a whole and pop this on top, if
the other interfaces are generally useful.
> Please note the initial `maxnode` parameter from `mbind` was omitted to keep the
> API within 6 arguments; the constant MAX_NUMNODES is used instead.
>
I don't think this will work: users have traditionally been allowed to shorten
their nodemasks, and maxnode also provides some level of portability.
We may want to consider an arg structure, rather than just chopping an
argument off.
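Something along these lines, purely as an illustration (the struct and field
names below are invented):

/* Hypothetical extensible-args variant, in the style of clone3() or
 * mount_setattr(); all names invented for illustration.  This keeps
 * maxnode while staying within 6 syscall arguments. */
struct process_mbind_args {
        __u64 mode;
        __u64 flags;
        __u64 nmask;            /* userspace pointer to the nodemask */
        __u64 maxnode;          /* size of the user nodemask, in bits */
};

long process_mbind(int pidfd, const struct iovec *iovec,
                   unsigned long vlen,
                   const struct process_mbind_args *args, size_t args_size);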
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..91ee300fa728 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1215,11 +1215,10 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> }
> #endif
>
> -static long do_mbind(unsigned long start, unsigned long len,
> +static long do_mbind(struct mm_struct *mm, unsigned long start, unsigned long len,
> unsigned short mode, unsigned short mode_flags,
> nodemask_t *nmask, unsigned long flags)
> {
> - struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma, *prev;
> struct vma_iterator vmi;
> struct migration_mpol mmpol;
> @@ -1465,10 +1464,84 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
> return 0;
> }
This is a completely insufficient change to do_mbind. do_mbind utilizes
`current` in a variety of places for nodemask (cpuset) validation and to
acquire the task's lock. This will not work the way you intend it to; you
end up mixing up nodemasks between the current and target tasks.
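Concretely, one of those hidden dependencies, paraphrased from
mpol_set_nodemask(), which do_mbind() calls to validate the requested nodemask:

/* Paraphrased from mpol_set_nodemask() in mm/mempolicy.c: the requested
 * nodemask is intersected with the *calling* task's cpuset, so invoking
 * do_mbind() on a remote mm still validates against current rather than
 * the target task. */
nodes_and(nsc->mask1, cpuset_current_mems_allowed, node_states[N_MEMORY]);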
see here:
https://lore.kernel.org/all/[email protected]/
I had to make do_mbind operate on the target task explicitly.
We may want to combine this change with mine so that your iovec changes can be
reused, because that is a very nice feature.
> + unsigned long maxnode = MAX_NUMNODES;
> + int err;
> + nodemask_t nodes;
> +
> + err = sanitize_mpol_flags(&lmode, &mode_flags);
> + if (err)
> + goto out;
> +
> + err = get_nodes(&nodes, nmask, maxnode);
Per the above, userland MAX_NUMNODES may or may not be equal to the kernel's
MAX_NUMNODES, so I don't think you can just chop this argument off.
I think we can combine our RFCs here and get this sorted out.
~Gregory
On Wed, Nov 22, 2023 at 3:39 PM Andrew Morton <[email protected]> wrote:
> I'm having deja vu. Not 10 minutes ago, Gregory sent out a patchset
> which does the same thing.
>
> https://lkml.kernel.org/r/[email protected]
>
Certainly, it appears that we were addressing the same matter but
approaching it from different perspectives or discussions.
And we hastened to submit our RFCs before the Thanksgiving break. :-)
The patch I sent originated from the discussion thread that began with the
'pidfd_set_mempolicy' RFC, itself possibly influenced by 'process_madvise'.
Specifically, the concept of a 'process_mbind()' syscall was suggested
by Frank van der Linden (Google):
https://lkml.kernel.org/linux-mm/[email protected]/T/#m950f26bdd6ce494b20ba170cc1f6db85e1924e8e
I would appreciate it if others could kindly evaluate what appears to
be the suitable course of action moving forward.
Thank you!
Hi Greg!
Thanks a lot for quickly looking into this and sharing your notes here.
On Wed, Nov 22, 2023 at 5:53 PM Gregory Price
<[email protected]> wrote:
>
> > Please note the initial `maxnode` parameter from `mbind` was omitted to keep the
> > API within 6 arguments; the constant MAX_NUMNODES is used instead.
> >
>
> I don't think this will work, users have traditionally been allowed to
> shorten their nodemasks, and also for some level of portability.
>
> We may want to consider an arg structure, rather than just chopping an
> argument off.
>
Yes, good point... that should be considered as a more complete
long-term approach beyond the MVP.
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 10a590ee1c89..91ee300fa728 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1215,11 +1215,10 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> > }
> > #endif
> >
> > -static long do_mbind(unsigned long start, unsigned long len,
> > +static long do_mbind(struct mm_struct *mm, unsigned long start, unsigned long len,
> > unsigned short mode, unsigned short mode_flags,
> > nodemask_t *nmask, unsigned long flags)
> > {
> > - struct mm_struct *mm = current->mm;
> > struct vm_area_struct *vma, *prev;
> > struct vma_iterator vmi;
> > struct migration_mpol mmpol;
> > @@ -1465,10 +1464,84 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
> > return 0;
> > }
>
> This is a completely insufficient change to do_mbind. do_mbind utilizes
> `current` in a variety of places for nodemask (cpuset) validation and to
> acquire the task's lock. This will not work the way you intend it to; you
> end up mixing up nodemasks between the current and target tasks.
>
Oh oh. True! Good catch!
> see here:
> https://lore.kernel.org/all/[email protected]/
>
Let me go over this... Thanks!
> We may want to combine this change with mine so that your iovec changes can be
> reused, because that is a very nice feature.
>
Sounds good. Thanks again!
Best,
Vinicius
Hi Vinicius!
On Thu, Nov 23, 2023 at 5:32 AM Vinicius Petrucci <[email protected]> wrote:
>
> From: Vinicius Tavares Petrucci <[email protected]>
>
> This patch introduces `process_mbind()` to enable a userspace orchestrator with
> an understanding of another process's memory layout to alter its memory policy.
> As system memory configurations become increasingly complex (e.g., DDR+HBM+CXL memories),
> such a userspace orchestrator can explore more advanced techniques to guide memory placement
> to individual NUMA nodes across memory tiers. This allows for a more efficient allocation of
> memory resources, leading to enhanced application performance.
>
> There are existing alternatives such as LD_PRELOAD (https://pmem.io/memkind/) or
> syscall_intercept (https://github.com/pmem/syscall_intercept), but these techniques,
> beyond lacking portability and universality, can lead to system incompatibility issues,
> inconsistent application behavior, risks from global system-wide settings, and
> increased implementation complexity.
>
> The concept of an external entity that understands the layout of another process's VM
> is already present with `process_madvise()`. Thus, it seems reasonable to introduce
> the `process_mbind` variant of `mbind`. The implementation of `process_mbind()`
> mirrors that of `process_madvise()`: it uses the pidfd of an external process to direct
> the memory policy and supports a vector of memory address ranges.
>
> The general use case here is similar to the prior RFC `pidfd_set_mempolicy()`
> (https://lore.kernel.org/linux-mm/[email protected]/),
> but offers a more fine-grained external control by binding specific memory regions
> (say, heap data structures) to specific NUMA nodes. Another concrete use case was described
> in prior work showing up to a 2X runtime improvement (compared to AutoNUMA tiering) using
> memory object/region-based memory placement for workloads with irregular access patterns
> such as graph analytics: https://iiswc.org/iiswc2022/IISWC2022_42.pdf
>
> The proposed API is as follows:
>
> long process_mbind(int pidfd,
>                    const struct iovec *iovec,
>                    unsigned long vlen,
>                    unsigned long mode,
>                    const unsigned long *nmask,
>                    unsigned int flags);
>
> The `pidfd` argument selects the target process identified by the PID file
> descriptor provided in pidfd. (See pidfd_open(2) for more information.)
>
> The pointer `iovec` points to an array of iovec structures (as described in <sys/uio.h>):
>
> struct iovec {
>     void  *iov_base;    /* starting address of region */
>     size_t iov_len;     /* size of region (in bytes) */
> };
>
> Each element of `iovec` describes a memory region that starts at the address
> iov_base and spans iov_len bytes.
Good idea.
>
> The `vlen` argument gives the number of elements in `iovec`.
>
> Please note the initial `maxnode` parameter from `mbind` was omitted to keep the
> API within 6 arguments; the constant MAX_NUMNODES is used instead.
The original parameters should not be omitted; the patch from Gregory Price is a
good solution that puts all of them together.
>
> Please see the mbind(2) man page for more details about the remaining arguments.
>
> Additionally, it is worth noting the following:
> - Using a vector of address ranges as an argument in `process_mbind` provides more
> flexibility than the original `mbind` system call, even when invoked on the current
> (local) process.
> - In contrast to `move_pages`, which requires an array of fixed-size pages,
> `process_mbind` (with the MPOL_MF_MOVE* flags) offers a more convenient and flexible
> page migration capability on a per-object or per-region basis.
> - Similar to `process_madvise`, manipulating the memory binding of external processes
> necessitates `CAP_SYS_NICE` and `PTRACE_MODE_READ_FSCREDS` checks (refer to ptrace(2)).
>
> Suggested-by: Frank van der Linden <[email protected]>
> Signed-off-by: Vinicius Tavares Petrucci <[email protected]>
> Signed-off-by: Hasan Al Maruf <[email protected]>
> ---
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  include/linux/syscalls.h               |  4 ++
>  include/uapi/asm-generic/unistd.h      |  4 +-
>  kernel/sys_ni.c                        |  1 +
>  mm/mempolicy.c                         | 86 +++++++++++++++++++++++++-
>  5 files changed, 92 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 8cb8bf68721c..9d9db49a3242 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -378,6 +378,7 @@
> 454 common futex_wake sys_futex_wake
> 455 common futex_wait sys_futex_wait
> 456 common futex_requeue sys_futex_requeue
> +457 common process_mbind sys_process_mbind
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index fd9d12de7e92..def5250ed625 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -816,6 +816,10 @@ asmlinkage long sys_mbind(unsigned long start, unsigned long len,
> const unsigned long __user *nmask,
> unsigned long maxnode,
> unsigned flags);
> +asmlinkage long sys_process_mbind(int pidfd, const struct iovec __user *vec,
> + size_t vlen, unsigned long mode,
> + const unsigned long __user *nmask,
> + unsigned flags);
> asmlinkage long sys_get_mempolicy(int __user *policy,
> unsigned long __user *nmask,
> unsigned long maxnode,
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 756b013fb832..9ed2c91940d6 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -828,9 +828,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
> __SYSCALL(__NR_futex_wait, sys_futex_wait)
> #define __NR_futex_requeue 456
> __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
> +#define __NR_process_mbind 457
> +__SYSCALL(__NR_process_mbind, sys_process_mbind)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 457
> +#define __NR_syscalls 458
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index e1a6e3c675c0..cc5cb5ae3ae7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -187,6 +187,7 @@ COND_SYSCALL(process_madvise);
> COND_SYSCALL(process_mrelease);
> COND_SYSCALL(remap_file_pages);
> COND_SYSCALL(mbind);
> +COND_SYSCALL(process_mbind);
> COND_SYSCALL(get_mempolicy);
> COND_SYSCALL(set_mempolicy);
> COND_SYSCALL(migrate_pages);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..91ee300fa728 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1215,11 +1215,10 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> }
> #endif
>
> -static long do_mbind(unsigned long start, unsigned long len,
> +static long do_mbind(struct mm_struct *mm, unsigned long start, unsigned long len,
> unsigned short mode, unsigned short mode_flags,
> nodemask_t *nmask, unsigned long flags)
> {
> - struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma, *prev;
> struct vma_iterator vmi;
> struct migration_mpol mmpol;
> @@ -1465,10 +1464,84 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
> return 0;
> }
>
> +static long kernel_mbind_process(int pidfd, const struct iovec __user *vec,
> + size_t vlen, unsigned long mode,
> + const unsigned long __user *nmask, unsigned int flags)
> +{
> + ssize_t ret;
> + struct iovec iovstack[UIO_FASTIOV];
> + struct iovec *iov = iovstack;
> + struct iov_iter iter;
> + struct task_struct *task;
> + struct mm_struct *mm;
> + unsigned int f_flags;
> + unsigned short mode_flags;
> + int lmode = mode;
> + unsigned long maxnode = MAX_NUMNODES;
> + int err;
> + nodemask_t nodes;
> +
> + err = sanitize_mpol_flags(&lmode, &mode_flags);
> + if (err)
> + goto out;
> +
> + err = get_nodes(&nodes, nmask, maxnode);
> + if (err)
> + goto out;
> +
> + ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack),
> + &iov, &iter);
> + if (ret < 0)
> + goto out;
> +
> + task = pidfd_get_task(pidfd, &f_flags);
> + if (IS_ERR(task)) {
> + ret = PTR_ERR(task);
> + goto free_iov;
> + }
> +
> + /* From process_madvise: Require PTRACE_MODE_READ
> + * to avoid leaking ASLR metadata. */
> + mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> + if (IS_ERR_OR_NULL(mm)) {
> + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> + goto release_task;
> + }
> +
> + /* From process_madvise: Require CAP_SYS_NICE for
> + * influencing process performance. */
> + if (!capable(CAP_SYS_NICE)) {
> + ret = -EPERM;
> + goto release_mm;
> + }
> +
> + while (iov_iter_count(&iter)) {
> + unsigned long start = untagged_addr(
> + (unsigned long)iter_iov_addr(&iter));
> + unsigned long len = iter_iov_len(&iter);
> +
> + ret = do_mbind(mm, start, len, lmode, mode_flags,
> + &nodes, flags);
> + if (ret < 0)
> + break;
> + iov_iter_advance(&iter, iter_iov_len(&iter));
> + }
> +
> +release_mm:
> + mmput(mm);
> +release_task:
> + put_task_struct(task);
> +free_iov:
> + kfree(iov);
> +out:
> + return ret;
> +}
> +
The do_mbind() function relies on the current task to obtain the nodemask and
task policy, so this modification is not enough.
> static long kernel_mbind(unsigned long start, unsigned long len,
> unsigned long mode, const unsigned long __user *nmask,
> unsigned long maxnode, unsigned int flags)
> {
> + struct mm_struct *mm = current->mm;
> unsigned short mode_flags;
> nodemask_t nodes;
> int lmode = mode;
> @@ -1483,7 +1556,7 @@ static long kernel_mbind(unsigned long start, unsigned long len,
> if (err)
> return err;
>
> - return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
> + return do_mbind(mm, start, len, lmode, mode_flags, &nodes, flags);
> }
>
> SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
> @@ -1553,6 +1626,13 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
> return err;
> }
>
> +SYSCALL_DEFINE6(process_mbind, int, pidfd, const struct iovec __user *, vec,
> + size_t, vlen, unsigned long, mode,
> + const unsigned long __user *, nmask, unsigned int, flags)
> +{
> + return kernel_mbind_process(pidfd, vec, vlen, mode, nmask, flags);
> +}
> +
> SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
> unsigned long, mode, const unsigned long __user *, nmask,
> unsigned long, maxnode, unsigned int, flags)
> --
> 2.41.0
>
Per my understanding, process_mbind() is implementable without many difficult
challenges, since it is always protected by mm->mmap_lock. But the task mempolicy
is not protected by any lock in alloc_pages().
On Fri, Nov 24, 2023 at 04:13:41PM +0800, Zhongkun He wrote:
>
> Per my understanding, process_mbind() is implementable without many difficult
> challenges, since it is always protected by mm->mmap_lock. But the task mempolicy
> is not protected by any lock in alloc_pages().
Per-VMA policies are protected by the mmap lock, while the task mempolicy is
protected by the task lock on replacement, as is task->mems_allowed.
There is an update in my refactor tickets that enforces the acquisition of
task_lock in mpol_set_nodemask, which removes the need for alloc_pages to do
anything else. That's not present in this patch.
Basically mems_allowed deals with the majority of situations, and
mmap_lock deals with per-vma mempolicy changes and migrations.
~Gregory
Hi Gregory, sorry for the late reply.
I tried pidfd_set_mempolicy() (suggested by Michal) about a year ago.
There is a problem here that may need attention.
A mempolicy can be either associated with a process or with a VMA.
All VMA manipulation is somewhat protected by a down_read on mmap_lock. In
process context (in alloc_pages()) there is no locking, because only the process
accesses its own state.

Now we need to change the process-context mempolicy of the task specified by the
pidfd. The mempolicy may be about to be freed by pidfd_set_mempolicy() while
alloc_pages() is still using it, so a race condition appears.
Say something like the following:
pidfd_set_mempolicy()            target task stack:
                                 alloc_pages:
                                     mpol = p->mempolicy;
task_lock(task);
old = task->mempolicy;
task->mempolicy = new;
task_unlock(task);
mpol_put(old);
/* old mpol has been freed. */
                                     policy_node(...., mpol)
                                     __alloc_pages();
To reduce locking and atomic operations (mpol_get/put) in the hot path, there is
no reference or lock protection on the task mempolicy here.
It would be great if your refactoring has a good solution.
Thanks.
On Sat, Nov 25, 2023 at 4:09 AM Gregory Price
<[email protected]> wrote:
>
> On Fri, Nov 24, 2023 at 04:13:41PM +0800, Zhongkun He wrote:
> >
> > Per my understanding, process_mbind() is implementable without many difficult
> > challenges, since it is always protected by mm->mmap_lock. But the task mempolicy
> > is not protected by any lock in alloc_pages().
>
> Per-VMA policies are protected by the mmap lock, while the task mempolicy is
> protected by the task lock on replacement, as is task->mems_allowed.
>
> There is an update in my refactor tickets that enforces the acquisition of
> task_lock in mpol_set_nodemask, which removes the need for alloc_pages to do
> anything else. That's not present in this patch.
>
> Basically mems_allowed deals with the majority of situations, and
> mmap_lock deals with per-vma mempolicy changes and migrations.
>
> ~Gregory
On Thu, Nov 30, 2023 at 05:34:04PM +0800, Zhongkun He wrote:
> Hi Gregory, sorry for the late reply.
>
> I tried pidfd_set_mempolicy() (suggested by Michal) about a year ago.
> There is a problem here that may need attention.
>
> A mempolicy can be either associated with a process or with a VMA.
> All VMA manipulation is somewhat protected by a down_read on mmap_lock. In
> process context (in alloc_pages()) there is no locking, because only the process
> accesses its own state.
>
> Now we need to change the process-context mempolicy of the task specified by the
> pidfd. The mempolicy may be about to be freed by pidfd_set_mempolicy() while
> alloc_pages() is still using it, so a race condition appears.
>
> Say something like the following:
>
> pidfd_set_mempolicy()            target task stack:
>                                  alloc_pages:
>                                      mpol = p->mempolicy;
> task_lock(task);
> old = task->mempolicy;
> task->mempolicy = new;
> task_unlock(task);
> mpol_put(old);
> /* old mpol has been freed. */
>                                      policy_node(...., mpol)
>                                      __alloc_pages();
>
> To reduce locking and atomic operations (mpol_get/put) in the hot path, there is
> no reference or lock protection on the task mempolicy here.
>
> It would be great if your refactoring has a good solution.
>
> Thanks.
>
Hi ZhongKun!
I actually just sent out a more general RFC on mempolicy updates that discusses
this more completely:
https://lore.kernel.org/linux-mm/ZWezcQk+BYEq%[email protected]/
and another post on even more issues with pidfd modifications to vma
mempolicies:
https://lore.kernel.org/linux-mm/[email protected]/
We may have to slow-walk the changes to vma policies due to there being many more
hidden accesses to (current) than expected. It's a rather nasty rat's nest of
mempolicy-vma-cpusets-shmem callbacks that obscure these current-task accesses; it
will take time to work through.
As for hot-path reference counting - we may need to change the way
mempolicy is managed, possibly we could leverage RCU to manage mempolicy
references in the hot path, rather than using locks. In this scenario,
we would likely need to change the way the default policy is applied
(maybe not, I haven't fully explored it).
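Roughly the shape I have in mind, as a sketch only (this assumes an __rcu
annotation on task->mempolicy and an rcu_head embedded in struct mempolicy; a
real version would defer the final mpol_put() rather than free directly):

/* Writer side, e.g. a pidfd-based policy update: */
struct mempolicy *old;

task_lock(task);
old = rcu_dereference_protected(task->mempolicy,
                                lockdep_is_held(&task->alloc_lock));
rcu_assign_pointer(task->mempolicy, new);
task_unlock(task);
kfree_rcu(old, rcu);            /* sketch; really a deferred mpol_put() */

/* Reader side, in the allocation hot path: */
rcu_read_lock();
pol = rcu_dereference(task->mempolicy);
nid = policy_node(gfp, pol, numa_node_id());
rcu_read_unlock();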
Do you have thoughts on this? Would very much like additional comments
before I go through the refactor work.
Regards,
Gregory
>
> Hi ZhongKun!
>
> I actually just sent out a more general RFC on mempolicy updates that discusses
> this more completely:
>
> https://lore.kernel.org/linux-mm/ZWezcQk+BYEq%[email protected]/
>
OK.
> and another post on even more issues with pidfd modifications to vma
> mempolicies:
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> We may have to slow-walk the changes to vma policies due to there being many more
> hidden accesses to (current) than expected. It's a rather nasty rat's nest of
> mempolicy-vma-cpusets-shmem callbacks that obscure these current-task accesses; it
> will take time to work through.
>
Got it, thanks. It's more complicated than I thought.
> As for hot-path reference counting - we may need to change the way
> mempolicy is managed, possibly we could leverage RCU to manage mempolicy
> references in the hot path, rather than using locks. In this scenario,
> we would likely need to change the way the default policy is applied
> (maybe not, I haven't fully explored it).
>
RCU read-side critical sections may also be held for a long time here.
We should probably replace the atomic_t refcnt with a percpu_ref in mempolicy
(also suggested by Michal), but that refactoring involves a lot of code.
A simpler way is to use task_work to release a mempolicy that may still be in use
by alloc_pages(), though the release is then deferred rather than immediate.
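For example (a sketch only; it assumes a callback_head field added to struct
mempolicy for this purpose):

/* Sketch: defer the final mpol_put() to the target task itself, so the
 * free cannot race with that task's own alloc_pages(). */
static void mpol_free_work(struct callback_head *head)
{
        struct mempolicy *old = container_of(head, struct mempolicy,
                                             free_work);

        mpol_put(old);
}

/* In the pidfd-based setter, instead of calling mpol_put(old) directly: */
init_task_work(&old->free_work, mpol_free_work);
if (task_work_add(task, &old->free_work, TWA_RESUME))
        mpol_put(old);          /* task is exiting; no concurrent allocations */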
> Do you have thoughts on this? Would very much like additional comments
> before I go through the refactor work.
>
> Regards,
> Gregory