2021-06-23 19:30:40

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH 1/1] mm: introduce process_reap system call

In modern systems it's not unusual to have a system component monitoring
memory conditions of the system and tasked with keeping system memory
pressure under control. One way to accomplish that is to kill
non-essential processes to free up memory for more important ones.
Examples of this are Facebook's OOM killer daemon called oomd and
Android's low memory killer daemon called lmkd.
For such system component it's important to be able to free memory
quickly and efficiently. Unfortunately the time process takes to free
up its memory after receiving a SIGKILL might vary based on the state
of the process (uninterruptible sleep), size and OPP level of the core
the process is running. A mechanism to free resources of the target
process in a more predictable way would improve system's ability to
control its memory pressure.
Introduce process_reap system call that reclaims memory of a dying process
from the context of the caller. This way the memory in freed in a more
controllable way with CPU affinity and priority of the caller. The workload
of freeing the memory will also be charged to the caller.
The operation is allowed only on a dying process.

Previously I proposed a number of alternatives to accomplish this:
- https://lore.kernel.org/patchwork/patch/1060407 extending
pidfd_send_signal to allow memory reaping using oom_reaper thread;
- https://lore.kernel.org/patchwork/patch/1338196 extending
pidfd_send_signal to reap memory of the target process synchronously from
the context of the caller;
- https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
support for process_madvise implementing synchronous memory reaping.

The end of the last discussion culminated with suggestion to introduce a
dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
The reasoning was that the new variant of process_madvise
a) does not work on an address range
b) is destructive
c) doesn't share much code at all with the rest of process_madvise
From the userspace point of view it was awkward and inconvenient to provide
memory range for this operation that operates on the entire address space.
Using special flags or address values to specify the entire address space
was too hacky.

The API is as follows,

int process_reap(int pidfd, unsigned int flags);

DESCRIPTION
The process_reap() system call is used to free the memory of a
dying process.

The pidfd selects the process referred to by the PID file
descriptor.
(See pidofd_open(2) for further information)

The flags argument is reserved for future use; currently, this
argument must be specified as 0.

RETURN VALUE
On success, process_reap() returns 0. On error, -1 is returned
and errno is set to indicate the error.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 1 +
mm/oom_kill.c | 50 +++++++++++++++++++++
22 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..14b9e81d2fc4 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -486,3 +486,4 @@
554 common landlock_create_ruleset sys_landlock_create_ruleset
555 common landlock_add_rule sys_landlock_add_rule
556 common landlock_restrict_self sys_landlock_restrict_self
+557 common process_reap sys_process_reap
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 28e03b5fec00..889b78d0f63f 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -460,3 +460,4 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common process_reap sys_process_reap
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 727bfc3be99b..fb7a0be2f3d9 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)

-#define __NR_compat_syscalls 447
+#define __NR_compat_syscalls 448
#endif

#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 5dab69d2c22b..80593454173e 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
__SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
#define __NR_landlock_restrict_self 446
__SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
+#define __NR_process_reap 447
+__SYSCALL(__NR_process_reap, sys_process_reap)

/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index bb11fe4c875a..6c94feedf086 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -367,3 +367,4 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common process_reap sys_process_reap
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 79c2d24c89dd..e80a7fa55696 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -446,3 +446,4 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common process_reap sys_process_reap
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index b11395a20c20..511b2bd61fc1 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -452,3 +452,4 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common process_reap sys_process_reap
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 9220909526f9..1775704c6a24 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -385,3 +385,4 @@
444 n32 landlock_create_ruleset sys_landlock_create_ruleset
445 n32 landlock_add_rule sys_landlock_add_rule
446 n32 landlock_restrict_self sys_landlock_restrict_self
+447 n32 process_reap sys_process_reap
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 9cd1c34f31b5..d769daca3f79 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -361,3 +361,4 @@
444 n64 landlock_create_ruleset sys_landlock_create_ruleset
445 n64 landlock_add_rule sys_landlock_add_rule
446 n64 landlock_restrict_self sys_landlock_restrict_self
+447 n64 process_reap sys_process_reap
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index d560c467a8c6..1bd2fc056677 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -434,3 +434,4 @@
444 o32 landlock_create_ruleset sys_landlock_create_ruleset
445 o32 landlock_add_rule sys_landlock_add_rule
446 o32 landlock_restrict_self sys_landlock_restrict_self
+447 o32 process_reap sys_process_reap
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index aabc37f8cae3..0012561ca557 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -444,3 +444,4 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common process_reap sys_process_reap
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 8f052ff4058c..89cbcc732b18 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -526,3 +526,4 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common process_reap sys_process_reap
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 0690263df1dd..7ebd4d809b5e 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -449,3 +449,4 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self sys_landlock_restrict_self
+447 common process_reap sys_process_reap sys_process_reap
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 0b91499ebdcf..178fd47b372e 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -449,3 +449,4 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common process_reap sys_process_reap
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e34cc30ef22c..faee121b7ae2 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -492,3 +492,4 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common process_reap sys_process_reap
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4bbc267fb36b..cbe070de9884 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -451,3 +451,4 @@
444 i386 landlock_create_ruleset sys_landlock_create_ruleset
445 i386 landlock_add_rule sys_landlock_add_rule
446 i386 landlock_restrict_self sys_landlock_restrict_self
+447 i386 process_reap sys_process_reap
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ce18119ea0d0..e6765646731b 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -368,6 +368,7 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common process_reap sys_process_reap

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index fd2f30227d96..f0e9dbee1a5b 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -417,3 +417,4 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common process_reap sys_process_reap
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..b6659e09bf0d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len,
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
size_t vlen, int behavior, unsigned int flags);
+asmlinkage long sys_process_reap(int pidfd, unsigned int flags);
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index d2a942086fcb..b3bf57b928af 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
__SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
#define __NR_landlock_restrict_self 446
__SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
+#define __NR_process_reap 447
+__SYSCALL(__NR_process_reap, sys_process_reap)

#undef __NR_syscalls
-#define __NR_syscalls 447
+#define __NR_syscalls 448

/*
* 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..56eb7c9f8356 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -289,6 +289,7 @@ COND_SYSCALL(munlockall);
COND_SYSCALL(mincore);
COND_SYSCALL(madvise);
COND_SYSCALL(process_madvise);
+COND_SYSCALL(process_reap);
COND_SYSCALL(remap_file_pages);
COND_SYSCALL(mbind);
COND_SYSCALL_COMPAT(mbind);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eefd3f5fde46..0f85a0442fa5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -28,6 +28,7 @@
#include <linux/sched/task.h>
#include <linux/sched/debug.h>
#include <linux/swap.h>
+#include <linux/syscalls.h>
#include <linux/timex.h>
#include <linux/jiffies.h>
#include <linux/cpuset.h>
@@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void)
out_of_memory(&oc);
mutex_unlock(&oom_lock);
}
+
+SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
+{
+ struct pid *pid;
+ struct task_struct *task;
+ struct mm_struct *mm = NULL;
+ unsigned int f_flags;
+ long ret = 0;
+
+ if (flags != 0)
+ return -EINVAL;
+
+ pid = pidfd_get_pid(pidfd, &f_flags);
+ if (IS_ERR(pid))
+ return PTR_ERR(pid);
+
+ task = get_pid_task(pid, PIDTYPE_PID);
+ if (!task) {
+ ret = -ESRCH;
+ goto put_pid;
+ }
+
+ /*
+ * If the task is dying and in the process of releasing its memory
+ * then get its mm.
+ */
+ task_lock(task);
+ if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
+ mm = task->mm;
+ mmget(mm);
+ }
+ task_unlock(task);
+ if (!mm) {
+ ret = -EINVAL;
+ goto put_task;
+ }
+
+ mmap_read_lock(mm);
+ if (!__oom_reap_task_mm(mm))
+ ret = -EAGAIN;
+ mmap_read_unlock(mm);
+
+ mmput(mm);
+put_task:
+ put_task_struct(task);
+put_pid:
+ put_pid(pid);
+ return ret;
+}
--
2.32.0.93.g670b81a890-goog


2021-06-23 19:38:05

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <[email protected]> wrote:
>
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running. A mechanism to free resources of the target
> process in a more predictable way would improve system's ability to
> control its memory pressure.
> Introduce process_reap system call that reclaims memory of a dying process
> from the context of the caller. This way the memory in freed in a more
> controllable way with CPU affinity and priority of the caller. The workload
> of freeing the memory will also be charged to the caller.
> The operation is allowed only on a dying process.
>
> Previously I proposed a number of alternatives to accomplish this:
> - https://lore.kernel.org/patchwork/patch/1060407 extending
> pidfd_send_signal to allow memory reaping using oom_reaper thread;
> - https://lore.kernel.org/patchwork/patch/1338196 extending
> pidfd_send_signal to reap memory of the target process synchronously from
> the context of the caller;
> - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> support for process_madvise implementing synchronous memory reaping.
>
> The end of the last discussion culminated with suggestion to introduce a
> dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> The reasoning was that the new variant of process_madvise
> a) does not work on an address range
> b) is destructive
> c) doesn't share much code at all with the rest of process_madvise
> From the userspace point of view it was awkward and inconvenient to provide
> memory range for this operation that operates on the entire address space.
> Using special flags or address values to specify the entire address space
> was too hacky.
>
> The API is as follows,
>
> int process_reap(int pidfd, unsigned int flags);
>
> DESCRIPTION
> The process_reap() system call is used to free the memory of a
> dying process.
>
> The pidfd selects the process referred to by the PID file
> descriptor.
> (See pidofd_open(2) for further information)
>
> The flags argument is reserved for future use; currently, this
> argument must be specified as 0.
>
> RETURN VALUE
> On success, process_reap() returns 0. On error, -1 is returned
> and errno is set to indicate the error.
>

I noticed that the patch does not apply to linux-next because of the
new memfd_secret syscall introduced on x86 architecture only. It still
applies to Linus' ToT. If needed I can change it to apply on top of
linux-next.

> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> arch/arm/tools/syscall.tbl | 1 +
> arch/arm64/include/asm/unistd.h | 2 +-
> arch/arm64/include/asm/unistd32.h | 2 +
> arch/ia64/kernel/syscalls/syscall.tbl | 1 +
> arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
> arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> arch/s390/kernel/syscalls/syscall.tbl | 1 +
> arch/sh/kernel/syscalls/syscall.tbl | 1 +
> arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
> include/linux/syscalls.h | 1 +
> include/uapi/asm-generic/unistd.h | 4 +-
> kernel/sys_ni.c | 1 +
> mm/oom_kill.c | 50 +++++++++++++++++++++
> 22 files changed, 74 insertions(+), 2 deletions(-)
>
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index 3000a2e8ee21..14b9e81d2fc4 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -486,3 +486,4 @@
> 554 common landlock_create_ruleset sys_landlock_create_ruleset
> 555 common landlock_add_rule sys_landlock_add_rule
> 556 common landlock_restrict_self sys_landlock_restrict_self
> +557 common process_reap sys_process_reap
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index 28e03b5fec00..889b78d0f63f 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -460,3 +460,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 727bfc3be99b..fb7a0be2f3d9 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -38,7 +38,7 @@
> #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
> #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
>
> -#define __NR_compat_syscalls 447
> +#define __NR_compat_syscalls 448
> #endif
>
> #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index 5dab69d2c22b..80593454173e 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
> __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> #define __NR_landlock_restrict_self 446
> __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> +#define __NR_process_reap 447
> +__SYSCALL(__NR_process_reap, sys_process_reap)
>
> /*
> * Please add new compat syscalls above this comment and update
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index bb11fe4c875a..6c94feedf086 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -367,3 +367,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index 79c2d24c89dd..e80a7fa55696 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -446,3 +446,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index b11395a20c20..511b2bd61fc1 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -452,3 +452,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index 9220909526f9..1775704c6a24 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -385,3 +385,4 @@
> 444 n32 landlock_create_ruleset sys_landlock_create_ruleset
> 445 n32 landlock_add_rule sys_landlock_add_rule
> 446 n32 landlock_restrict_self sys_landlock_restrict_self
> +447 n32 process_reap sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index 9cd1c34f31b5..d769daca3f79 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -361,3 +361,4 @@
> 444 n64 landlock_create_ruleset sys_landlock_create_ruleset
> 445 n64 landlock_add_rule sys_landlock_add_rule
> 446 n64 landlock_restrict_self sys_landlock_restrict_self
> +447 n64 process_reap sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index d560c467a8c6..1bd2fc056677 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -434,3 +434,4 @@
> 444 o32 landlock_create_ruleset sys_landlock_create_ruleset
> 445 o32 landlock_add_rule sys_landlock_add_rule
> 446 o32 landlock_restrict_self sys_landlock_restrict_self
> +447 o32 process_reap sys_process_reap
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index aabc37f8cae3..0012561ca557 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -444,3 +444,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index 8f052ff4058c..89cbcc732b18 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -526,3 +526,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index 0690263df1dd..7ebd4d809b5e 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap sys_process_reap
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index 0b91499ebdcf..178fd47b372e 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index e34cc30ef22c..faee121b7ae2 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -492,3 +492,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 4bbc267fb36b..cbe070de9884 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -451,3 +451,4 @@
> 444 i386 landlock_create_ruleset sys_landlock_create_ruleset
> 445 i386 landlock_add_rule sys_landlock_add_rule
> 446 i386 landlock_restrict_self sys_landlock_restrict_self
> +447 i386 process_reap sys_process_reap
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index ce18119ea0d0..e6765646731b 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -368,6 +368,7 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index fd2f30227d96..f0e9dbee1a5b 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -417,3 +417,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 050511e8f1f8..b6659e09bf0d 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len,
> asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
> size_t vlen, int behavior, unsigned int flags);
> +asmlinkage long sys_process_reap(int pidfd, unsigned int flags);
> asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> unsigned long prot, unsigned long pgoff,
> unsigned long flags);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index d2a942086fcb..b3bf57b928af 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
> __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> #define __NR_landlock_restrict_self 446
> __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> +#define __NR_process_reap 447
> +__SYSCALL(__NR_process_reap, sys_process_reap)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 447
> +#define __NR_syscalls 448
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 0ea8128468c3..56eb7c9f8356 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -289,6 +289,7 @@ COND_SYSCALL(munlockall);
> COND_SYSCALL(mincore);
> COND_SYSCALL(madvise);
> COND_SYSCALL(process_madvise);
> +COND_SYSCALL(process_reap);
> COND_SYSCALL(remap_file_pages);
> COND_SYSCALL(mbind);
> COND_SYSCALL_COMPAT(mbind);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eefd3f5fde46..0f85a0442fa5 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -28,6 +28,7 @@
> #include <linux/sched/task.h>
> #include <linux/sched/debug.h>
> #include <linux/swap.h>
> +#include <linux/syscalls.h>
> #include <linux/timex.h>
> #include <linux/jiffies.h>
> #include <linux/cpuset.h>
> @@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void)
> out_of_memory(&oc);
> mutex_unlock(&oom_lock);
> }
> +
> +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
> +{
> + struct pid *pid;
> + struct task_struct *task;
> + struct mm_struct *mm = NULL;
> + unsigned int f_flags;
> + long ret = 0;
> +
> + if (flags != 0)
> + return -EINVAL;
> +
> + pid = pidfd_get_pid(pidfd, &f_flags);
> + if (IS_ERR(pid))
> + return PTR_ERR(pid);
> +
> + task = get_pid_task(pid, PIDTYPE_PID);
> + if (!task) {
> + ret = -ESRCH;
> + goto put_pid;
> + }
> +
> + /*
> + * If the task is dying and in the process of releasing its memory
> + * then get its mm.
> + */
> + task_lock(task);
> + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
> + mm = task->mm;
> + mmget(mm);
> + }
> + task_unlock(task);
> + if (!mm) {
> + ret = -EINVAL;
> + goto put_task;
> + }
> +
> + mmap_read_lock(mm);
> + if (!__oom_reap_task_mm(mm))
> + ret = -EAGAIN;
> + mmap_read_unlock(mm);
> +
> + mmput(mm);
> +put_task:
> + put_task_struct(task);
> +put_pid:
> + put_pid(pid);
> + return ret;
> +}
> --
> 2.32.0.93.g670b81a890-goog
>

2021-06-29 15:39:12

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jun 23, 2021 at 12:28:22PM -0700, Suren Baghdasaryan wrote:
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running. A mechanism to free resources of the target
> process in a more predictable way would improve system's ability to
> control its memory pressure.
> Introduce process_reap system call that reclaims memory of a dying process
> from the context of the caller. This way the memory in freed in a more
> controllable way with CPU affinity and priority of the caller. The workload
> of freeing the memory will also be charged to the caller.
> The operation is allowed only on a dying process.
>
> Previously I proposed a number of alternatives to accomplish this:
> - https://lore.kernel.org/patchwork/patch/1060407 extending
> pidfd_send_signal to allow memory reaping using oom_reaper thread;
> - https://lore.kernel.org/patchwork/patch/1338196 extending
> pidfd_send_signal to reap memory of the target process synchronously from
> the context of the caller;
> - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> support for process_madvise implementing synchronous memory reaping.
>
> The end of the last discussion culminated with suggestion to introduce a
> dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> The reasoning was that the new variant of process_madvise
> a) does not work on an address range
> b) is destructive
> c) doesn't share much code at all with the rest of process_madvise
> From the userspace point of view it was awkward and inconvenient to provide
> memory range for this operation that operates on the entire address space.
> Using special flags or address values to specify the entire address space
> was too hacky.
>
> The API is as follows,
>
> int process_reap(int pidfd, unsigned int flags);
>
> DESCRIPTION
> The process_reap() system call is used to free the memory of a
> dying process.
>
> The pidfd selects the process referred to by the PID file
> descriptor.
> (See pidofd_open(2) for further information)
>
> The flags argument is reserved for future use; currently, this
> argument must be specified as 0.
>
> RETURN VALUE
> On success, process_reap() returns 0. On error, -1 is returned
> and errno is set to indicate the error.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> arch/arm/tools/syscall.tbl | 1 +
> arch/arm64/include/asm/unistd.h | 2 +-
> arch/arm64/include/asm/unistd32.h | 2 +
> arch/ia64/kernel/syscalls/syscall.tbl | 1 +
> arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
> arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> arch/s390/kernel/syscalls/syscall.tbl | 1 +
> arch/sh/kernel/syscalls/syscall.tbl | 1 +
> arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
> include/linux/syscalls.h | 1 +
> include/uapi/asm-generic/unistd.h | 4 +-
> kernel/sys_ni.c | 1 +
> mm/oom_kill.c | 50 +++++++++++++++++++++
> 22 files changed, 74 insertions(+), 2 deletions(-)
>
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index 3000a2e8ee21..14b9e81d2fc4 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -486,3 +486,4 @@
> 554 common landlock_create_ruleset sys_landlock_create_ruleset
> 555 common landlock_add_rule sys_landlock_add_rule
> 556 common landlock_restrict_self sys_landlock_restrict_self
> +557 common process_reap sys_process_reap
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index 28e03b5fec00..889b78d0f63f 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -460,3 +460,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 727bfc3be99b..fb7a0be2f3d9 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -38,7 +38,7 @@
> #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
> #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
>
> -#define __NR_compat_syscalls 447
> +#define __NR_compat_syscalls 448
> #endif
>
> #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index 5dab69d2c22b..80593454173e 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
> __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> #define __NR_landlock_restrict_self 446
> __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> +#define __NR_process_reap 447
> +__SYSCALL(__NR_process_reap, sys_process_reap)
>
> /*
> * Please add new compat syscalls above this comment and update
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index bb11fe4c875a..6c94feedf086 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -367,3 +367,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index 79c2d24c89dd..e80a7fa55696 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -446,3 +446,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index b11395a20c20..511b2bd61fc1 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -452,3 +452,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index 9220909526f9..1775704c6a24 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -385,3 +385,4 @@
> 444 n32 landlock_create_ruleset sys_landlock_create_ruleset
> 445 n32 landlock_add_rule sys_landlock_add_rule
> 446 n32 landlock_restrict_self sys_landlock_restrict_self
> +447 n32 process_reap sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index 9cd1c34f31b5..d769daca3f79 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -361,3 +361,4 @@
> 444 n64 landlock_create_ruleset sys_landlock_create_ruleset
> 445 n64 landlock_add_rule sys_landlock_add_rule
> 446 n64 landlock_restrict_self sys_landlock_restrict_self
> +447 n64 process_reap sys_process_reap
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index d560c467a8c6..1bd2fc056677 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -434,3 +434,4 @@
> 444 o32 landlock_create_ruleset sys_landlock_create_ruleset
> 445 o32 landlock_add_rule sys_landlock_add_rule
> 446 o32 landlock_restrict_self sys_landlock_restrict_self
> +447 o32 process_reap sys_process_reap
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index aabc37f8cae3..0012561ca557 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -444,3 +444,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index 8f052ff4058c..89cbcc732b18 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -526,3 +526,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index 0690263df1dd..7ebd4d809b5e 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap sys_process_reap
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index 0b91499ebdcf..178fd47b372e 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index e34cc30ef22c..faee121b7ae2 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -492,3 +492,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 4bbc267fb36b..cbe070de9884 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -451,3 +451,4 @@
> 444 i386 landlock_create_ruleset sys_landlock_create_ruleset
> 445 i386 landlock_add_rule sys_landlock_add_rule
> 446 i386 landlock_restrict_self sys_landlock_restrict_self
> +447 i386 process_reap sys_process_reap
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index ce18119ea0d0..e6765646731b 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -368,6 +368,7 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index fd2f30227d96..f0e9dbee1a5b 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -417,3 +417,4 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common process_reap sys_process_reap
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 050511e8f1f8..b6659e09bf0d 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len,
> asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
> size_t vlen, int behavior, unsigned int flags);
> +asmlinkage long sys_process_reap(int pidfd, unsigned int flags);
> asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> unsigned long prot, unsigned long pgoff,
> unsigned long flags);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index d2a942086fcb..b3bf57b928af 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
> __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> #define __NR_landlock_restrict_self 446
> __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> +#define __NR_process_reap 447
> +__SYSCALL(__NR_process_reap, sys_process_reap)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 447
> +#define __NR_syscalls 448
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 0ea8128468c3..56eb7c9f8356 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -289,6 +289,7 @@ COND_SYSCALL(munlockall);
> COND_SYSCALL(mincore);
> COND_SYSCALL(madvise);
> COND_SYSCALL(process_madvise);
> +COND_SYSCALL(process_reap);
> COND_SYSCALL(remap_file_pages);
> COND_SYSCALL(mbind);
> COND_SYSCALL_COMPAT(mbind);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eefd3f5fde46..0f85a0442fa5 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -28,6 +28,7 @@
> #include <linux/sched/task.h>
> #include <linux/sched/debug.h>
> #include <linux/swap.h>
> +#include <linux/syscalls.h>
> #include <linux/timex.h>
> #include <linux/jiffies.h>
> #include <linux/cpuset.h>
> @@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void)
> out_of_memory(&oc);
> mutex_unlock(&oom_lock);
> }
> +
> +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)

Hey Suren,

Wouldn't
- process_memory_reap()
- process_reap_memory()
- process_mreap()
be better names?

> +{
> + struct pid *pid;
> + struct task_struct *task;
> + struct mm_struct *mm = NULL;
> + unsigned int f_flags;
> + long ret = 0;
> +
> + if (flags != 0)
> + return -EINVAL;
> +
> + pid = pidfd_get_pid(pidfd, &f_flags);
> + if (IS_ERR(pid))
> + return PTR_ERR(pid);
> +
> + task = get_pid_task(pid, PIDTYPE_PID);
> + if (!task) {
> + ret = -ESRCH;
> + goto put_pid;
> + }

You have a similar pattern in process_madvise():

pid = pidfd_get_pid(pidfd, &f_flags);
if (IS_ERR(pid)) {
ret = PTR_ERR(pid);
goto free_iov;
}

task = get_pid_task(pid, PIDTYPE_PID);
if (!task) {
ret = -ESRCH;
goto put_pid;
}

I'd suggest you add a tiny helper to kernel/pid.c and call it in both
places.

2021-06-29 16:19:19

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Tue, Jun 29, 2021 at 6:14 AM Christian Brauner
<[email protected]> wrote:
>
> On Wed, Jun 23, 2021 at 12:28:22PM -0700, Suren Baghdasaryan wrote:
> > In modern systems it's not unusual to have a system component monitoring
> > memory conditions of the system and tasked with keeping system memory
> > pressure under control. One way to accomplish that is to kill
> > non-essential processes to free up memory for more important ones.
> > Examples of this are Facebook's OOM killer daemon called oomd and
> > Android's low memory killer daemon called lmkd.
> > For such system component it's important to be able to free memory
> > quickly and efficiently. Unfortunately the time process takes to free
> > up its memory after receiving a SIGKILL might vary based on the state
> > of the process (uninterruptible sleep), size and OPP level of the core
> > the process is running. A mechanism to free resources of the target
> > process in a more predictable way would improve system's ability to
> > control its memory pressure.
> > Introduce process_reap system call that reclaims memory of a dying process
> > from the context of the caller. This way the memory in freed in a more
> > controllable way with CPU affinity and priority of the caller. The workload
> > of freeing the memory will also be charged to the caller.
> > The operation is allowed only on a dying process.
> >
> > Previously I proposed a number of alternatives to accomplish this:
> > - https://lore.kernel.org/patchwork/patch/1060407 extending
> > pidfd_send_signal to allow memory reaping using oom_reaper thread;
> > - https://lore.kernel.org/patchwork/patch/1338196 extending
> > pidfd_send_signal to reap memory of the target process synchronously from
> > the context of the caller;
> > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> > support for process_madvise implementing synchronous memory reaping.
> >
> > The end of the last discussion culminated with suggestion to introduce a
> > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> > The reasoning was that the new variant of process_madvise
> > a) does not work on an address range
> > b) is destructive
> > c) doesn't share much code at all with the rest of process_madvise
> > From the userspace point of view it was awkward and inconvenient to provide
> > memory range for this operation that operates on the entire address space.
> > Using special flags or address values to specify the entire address space
> > was too hacky.
> >
> > The API is as follows,
> >
> > int process_reap(int pidfd, unsigned int flags);
> >
> > DESCRIPTION
> > The process_reap() system call is used to free the memory of a
> > dying process.
> >
> > The pidfd selects the process referred to by the PID file
> > descriptor.
> > (See pidofd_open(2) for further information)
> >
> > The flags argument is reserved for future use; currently, this
> > argument must be specified as 0.
> >
> > RETURN VALUE
> > On success, process_reap() returns 0. On error, -1 is returned
> > and errno is set to indicate the error.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> > arch/arm/tools/syscall.tbl | 1 +
> > arch/arm64/include/asm/unistd.h | 2 +-
> > arch/arm64/include/asm/unistd32.h | 2 +
> > arch/ia64/kernel/syscalls/syscall.tbl | 1 +
> > arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> > arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> > arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
> > arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
> > arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
> > arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> > arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> > arch/s390/kernel/syscalls/syscall.tbl | 1 +
> > arch/sh/kernel/syscalls/syscall.tbl | 1 +
> > arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> > arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> > arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> > arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
> > include/linux/syscalls.h | 1 +
> > include/uapi/asm-generic/unistd.h | 4 +-
> > kernel/sys_ni.c | 1 +
> > mm/oom_kill.c | 50 +++++++++++++++++++++
> > 22 files changed, 74 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> > index 3000a2e8ee21..14b9e81d2fc4 100644
> > --- a/arch/alpha/kernel/syscalls/syscall.tbl
> > +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> > @@ -486,3 +486,4 @@
> > 554 common landlock_create_ruleset sys_landlock_create_ruleset
> > 555 common landlock_add_rule sys_landlock_add_rule
> > 556 common landlock_restrict_self sys_landlock_restrict_self
> > +557 common process_reap sys_process_reap
> > diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> > index 28e03b5fec00..889b78d0f63f 100644
> > --- a/arch/arm/tools/syscall.tbl
> > +++ b/arch/arm/tools/syscall.tbl
> > @@ -460,3 +460,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +447 common process_reap sys_process_reap
> > diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> > index 727bfc3be99b..fb7a0be2f3d9 100644
> > --- a/arch/arm64/include/asm/unistd.h
> > +++ b/arch/arm64/include/asm/unistd.h
> > @@ -38,7 +38,7 @@
> > #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
> > #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
> >
> > -#define __NR_compat_syscalls 447
> > +#define __NR_compat_syscalls 448
> > #endif
> >
> > #define __ARCH_WANT_SYS_CLONE
> > diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> > index 5dab69d2c22b..80593454173e 100644
> > --- a/arch/arm64/include/asm/unistd32.h
> > +++ b/arch/arm64/include/asm/unistd32.h
> > @@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
> > __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> > #define __NR_landlock_restrict_self 446
> > __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> > +#define __NR_process_reap 447
> > +__SYSCALL(__NR_process_reap, sys_process_reap)
> >
> > /*
> > * Please add new compat syscalls above this comment and update
> > diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> > index bb11fe4c875a..6c94feedf086 100644
> > --- a/arch/ia64/kernel/syscalls/syscall.tbl
> > +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> > @@ -367,3 +367,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +447 common process_reap sys_process_reap
> > diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> > index 79c2d24c89dd..e80a7fa55696 100644
> > --- a/arch/m68k/kernel/syscalls/syscall.tbl
> > +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> > @@ -446,3 +446,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +447 common process_reap sys_process_reap
> > diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> > index b11395a20c20..511b2bd61fc1 100644
> > --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> > +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> > @@ -452,3 +452,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +447 common process_reap sys_process_reap
> > diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> > index 9220909526f9..1775704c6a24 100644
> > --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> > @@ -385,3 +385,4 @@
> > 444 n32 landlock_create_ruleset sys_landlock_create_ruleset
> > 445 n32 landlock_add_rule sys_landlock_add_rule
> > 446 n32 landlock_restrict_self sys_landlock_restrict_self
> > +447 n32 process_reap sys_process_reap
> > diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> > index 9cd1c34f31b5..d769daca3f79 100644
> > --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> > @@ -361,3 +361,4 @@
> > 444 n64 landlock_create_ruleset sys_landlock_create_ruleset
> > 445 n64 landlock_add_rule sys_landlock_add_rule
> > 446 n64 landlock_restrict_self sys_landlock_restrict_self
> > +447 n64 process_reap sys_process_reap
> > diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> > index d560c467a8c6..1bd2fc056677 100644
> > --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> > @@ -434,3 +434,4 @@
> > 444 o32 landlock_create_ruleset sys_landlock_create_ruleset
> > 445 o32 landlock_add_rule sys_landlock_add_rule
> > 446 o32 landlock_restrict_self sys_landlock_restrict_self
> > +447 o32 process_reap sys_process_reap
> > diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> > index aabc37f8cae3..0012561ca557 100644
> > --- a/arch/parisc/kernel/syscalls/syscall.tbl
> > +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> > @@ -444,3 +444,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +447 common process_reap sys_process_reap
> > diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> > index 8f052ff4058c..89cbcc732b18 100644
> > --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> > +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> > @@ -526,3 +526,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +447 common process_reap sys_process_reap
> > diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> > index 0690263df1dd..7ebd4d809b5e 100644
> > --- a/arch/s390/kernel/syscalls/syscall.tbl
> > +++ b/arch/s390/kernel/syscalls/syscall.tbl
> > @@ -449,3 +449,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self sys_landlock_restrict_self
> > +447 common process_reap sys_process_reap sys_process_reap
> > diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> > index 0b91499ebdcf..178fd47b372e 100644
> > --- a/arch/sh/kernel/syscalls/syscall.tbl
> > +++ b/arch/sh/kernel/syscalls/syscall.tbl
> > @@ -449,3 +449,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +447 common process_reap sys_process_reap
> > diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> > index e34cc30ef22c..faee121b7ae2 100644
> > --- a/arch/sparc/kernel/syscalls/syscall.tbl
> > +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> > @@ -492,3 +492,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +447 common process_reap sys_process_reap
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 4bbc267fb36b..cbe070de9884 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -451,3 +451,4 @@
> > 444 i386 landlock_create_ruleset sys_landlock_create_ruleset
> > 445 i386 landlock_add_rule sys_landlock_add_rule
> > 446 i386 landlock_restrict_self sys_landlock_restrict_self
> > +447 i386 process_reap sys_process_reap
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index ce18119ea0d0..e6765646731b 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -368,6 +368,7 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +447 common process_reap sys_process_reap
> >
> > #
> > # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> > index fd2f30227d96..f0e9dbee1a5b 100644
> > --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> > +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> > @@ -417,3 +417,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +447 common process_reap sys_process_reap
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 050511e8f1f8..b6659e09bf0d 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len,
> > asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> > asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
> > size_t vlen, int behavior, unsigned int flags);
> > +asmlinkage long sys_process_reap(int pidfd, unsigned int flags);
> > asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> > unsigned long prot, unsigned long pgoff,
> > unsigned long flags);
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index d2a942086fcb..b3bf57b928af 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
> > __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> > #define __NR_landlock_restrict_self 446
> > __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> > +#define __NR_process_reap 447
> > +__SYSCALL(__NR_process_reap, sys_process_reap)
> >
> > #undef __NR_syscalls
> > -#define __NR_syscalls 447
> > +#define __NR_syscalls 448
> >
> > /*
> > * 32 bit systems traditionally used different
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 0ea8128468c3..56eb7c9f8356 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -289,6 +289,7 @@ COND_SYSCALL(munlockall);
> > COND_SYSCALL(mincore);
> > COND_SYSCALL(madvise);
> > COND_SYSCALL(process_madvise);
> > +COND_SYSCALL(process_reap);
> > COND_SYSCALL(remap_file_pages);
> > COND_SYSCALL(mbind);
> > COND_SYSCALL_COMPAT(mbind);
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index eefd3f5fde46..0f85a0442fa5 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -28,6 +28,7 @@
> > #include <linux/sched/task.h>
> > #include <linux/sched/debug.h>
> > #include <linux/swap.h>
> > +#include <linux/syscalls.h>
> > #include <linux/timex.h>
> > #include <linux/jiffies.h>
> > #include <linux/cpuset.h>
> > @@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void)
> > out_of_memory(&oc);
> > mutex_unlock(&oom_lock);
> > }
> > +
> > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
>
> Hey Suren,
>
> Wouldn't
> - process_memory_reap()
> - process_reap_memory()
> - process_mreap()
> be better names?

Hi Christian,
I'm open to other names, whichever sounds better. From the list
process_reap_memory() sounds best to me but I'm open to others as
well.

>
> > +{
> > + struct pid *pid;
> > + struct task_struct *task;
> > + struct mm_struct *mm = NULL;
> > + unsigned int f_flags;
> > + long ret = 0;
> > +
> > + if (flags != 0)
> > + return -EINVAL;
> > +
> > + pid = pidfd_get_pid(pidfd, &f_flags);
> > + if (IS_ERR(pid))
> > + return PTR_ERR(pid);
> > +
> > + task = get_pid_task(pid, PIDTYPE_PID);
> > + if (!task) {
> > + ret = -ESRCH;
> > + goto put_pid;
> > + }
>
> You have a similar pattern in process_madvise():
>
> pid = pidfd_get_pid(pidfd, &f_flags);
> if (IS_ERR(pid)) {
> ret = PTR_ERR(pid);
> goto free_iov;
> }
>
> task = get_pid_task(pid, PIDTYPE_PID);
> if (!task) {
> ret = -ESRCH;
> goto put_pid;
> }
>
> I'd suggest you add a tiny helper to kernel/pid.c and call it in both
> places.

Agree. I'll post the new rev next week to give some more time for
reviews of this version.
Thanks!

>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>

2021-06-30 18:02:42

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

Hi Suren,

On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <[email protected]> wrote:
>
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running. A mechanism to free resources of the target
> process in a more predictable way would improve system's ability to
> control its memory pressure.
> Introduce process_reap system call that reclaims memory of a dying process
> from the context of the caller. This way the memory in freed in a more
> controllable way with CPU affinity and priority of the caller. The workload
> of freeing the memory will also be charged to the caller.
> The operation is allowed only on a dying process.
>
> Previously I proposed a number of alternatives to accomplish this:
> - https://lore.kernel.org/patchwork/patch/1060407 extending
> pidfd_send_signal to allow memory reaping using oom_reaper thread;
> - https://lore.kernel.org/patchwork/patch/1338196 extending
> pidfd_send_signal to reap memory of the target process synchronously from
> the context of the caller;
> - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> support for process_madvise implementing synchronous memory reaping.
>
> The end of the last discussion culminated with suggestion to introduce a
> dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> The reasoning was that the new variant of process_madvise
> a) does not work on an address range
> b) is destructive
> c) doesn't share much code at all with the rest of process_madvise
> From the userspace point of view it was awkward and inconvenient to provide
> memory range for this operation that operates on the entire address space.
> Using special flags or address values to specify the entire address space
> was too hacky.
>
> The API is as follows,
>
> int process_reap(int pidfd, unsigned int flags);
>
> DESCRIPTION
> The process_reap() system call is used to free the memory of a
> dying process.
>
> The pidfd selects the process referred to by the PID file
> descriptor.
> (See pidofd_open(2) for further information)

*pidfd_open

>
> The flags argument is reserved for future use; currently, this
> argument must be specified as 0.
>
> RETURN VALUE
> On success, process_reap() returns 0. On error, -1 is returned
> and errno is set to indicate the error.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>

Thanks for continuously pushing this. One question I have is how do
you envision this syscall to be used for the cgroup based workloads.
Traverse the target tree, read pids from cgroup.procs files,
pidfd_open them, send SIGKILL and then process_reap them. Is that
right?

Orthogonal to this patch I wonder if we should have an optimized way
to reap processes from a cgroup. Something similar to cgroup.kill (or
maybe overload cgroup.kill with reaping as well).

[...]

> +
> +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
> +{
> + struct pid *pid;
> + struct task_struct *task;
> + struct mm_struct *mm = NULL;
> + unsigned int f_flags;
> + long ret = 0;
> +
> + if (flags != 0)
> + return -EINVAL;
> +
> + pid = pidfd_get_pid(pidfd, &f_flags);
> + if (IS_ERR(pid))
> + return PTR_ERR(pid);
> +
> + task = get_pid_task(pid, PIDTYPE_PID);
> + if (!task) {
> + ret = -ESRCH;
> + goto put_pid;
> + }
> +
> + /*
> + * If the task is dying and in the process of releasing its memory
> + * then get its mm.
> + */
> + task_lock(task);
> + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {

task_will_free_mem() is fine here but I think in parallel we should
optimize this function. At the moment it is traversing all the
processes on the machine. It is very normal to have tens of thousands
of processes on big machines, so it would be really costly when
reaping a bunch of processes.

2021-06-30 18:27:57

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <[email protected]> wrote:
>
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running. A mechanism to free resources of the target
> process in a more predictable way would improve system's ability to
> control its memory pressure.
> Introduce process_reap system call that reclaims memory of a dying process
> from the context of the caller. This way the memory in freed in a more
> controllable way with CPU affinity and priority of the caller. The workload
> of freeing the memory will also be charged to the caller.
> The operation is allowed only on a dying process.

At the risk of asking a potentially silly question, should this just
be a file in procfs?

Also, please consider removing all mention of the word "reap" from the
user API. For better or for worse, "reap" in UNIX refers to what
happens when a dead task gets wait()ed. I sincerely wish I could go
back in time and gently encourage whomever invented that particular
abomination to change their mind, but my time machine doesn't work.

--Andy

2021-06-30 18:45:10

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jun 30, 2021 at 11:01 AM Shakeel Butt <[email protected]> wrote:
>
> Hi Suren,
>
> On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > In modern systems it's not unusual to have a system component monitoring
> > memory conditions of the system and tasked with keeping system memory
> > pressure under control. One way to accomplish that is to kill
> > non-essential processes to free up memory for more important ones.
> > Examples of this are Facebook's OOM killer daemon called oomd and
> > Android's low memory killer daemon called lmkd.
> > For such system component it's important to be able to free memory
> > quickly and efficiently. Unfortunately the time process takes to free
> > up its memory after receiving a SIGKILL might vary based on the state
> > of the process (uninterruptible sleep), size and OPP level of the core
> > the process is running. A mechanism to free resources of the target
> > process in a more predictable way would improve system's ability to
> > control its memory pressure.
> > Introduce process_reap system call that reclaims memory of a dying process
> > from the context of the caller. This way the memory in freed in a more
> > controllable way with CPU affinity and priority of the caller. The workload
> > of freeing the memory will also be charged to the caller.
> > The operation is allowed only on a dying process.
> >
> > Previously I proposed a number of alternatives to accomplish this:
> > - https://lore.kernel.org/patchwork/patch/1060407 extending
> > pidfd_send_signal to allow memory reaping using oom_reaper thread;
> > - https://lore.kernel.org/patchwork/patch/1338196 extending
> > pidfd_send_signal to reap memory of the target process synchronously from
> > the context of the caller;
> > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED
> > support for process_madvise implementing synchronous memory reaping.
> >
> > The end of the last discussion culminated with suggestion to introduce a
> > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875)
> > The reasoning was that the new variant of process_madvise
> > a) does not work on an address range
> > b) is destructive
> > c) doesn't share much code at all with the rest of process_madvise
> > From the userspace point of view it was awkward and inconvenient to provide
> > memory range for this operation that operates on the entire address space.
> > Using special flags or address values to specify the entire address space
> > was too hacky.
> >
> > The API is as follows,
> >
> > int process_reap(int pidfd, unsigned int flags);
> >
> > DESCRIPTION
> > The process_reap() system call is used to free the memory of a
> > dying process.
> >
> > The pidfd selects the process referred to by the PID file
> > descriptor.
> > (See pidofd_open(2) for further information)
>
> *pidfd_open

Ack

>
> >
> > The flags argument is reserved for future use; currently, this
> > argument must be specified as 0.
> >
> > RETURN VALUE
> > On success, process_reap() returns 0. On error, -1 is returned
> > and errno is set to indicate the error.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>
> Thanks for continuously pushing this. One question I have is how do
> you envision this syscall to be used for the cgroup based workloads.
> Traverse the target tree, read pids from cgroup.procs files,
> pidfd_open them, send SIGKILL and then process_reap them. Is that
> right?

Yes, at least that's how Android does that. It's a bit more involved
but it's a technical detail. Userspace low memory killer kills a
process (sends SIGKILL and calls process_reap) and another system
component detects that a process died and will kill all processes
belonging to the same cgroup (that's how we identify related
processes).

>
> Orthogonal to this patch I wonder if we should have an optimized way
> to reap processes from a cgroup. Something similar to cgroup.kill (or
> maybe overload cgroup.kill with reaping as well).

Seems reasonable to me. We could use that in the above scenario.

>
> [...]
>
> > +
> > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
> > +{
> > + struct pid *pid;
> > + struct task_struct *task;
> > + struct mm_struct *mm = NULL;
> > + unsigned int f_flags;
> > + long ret = 0;
> > +
> > + if (flags != 0)
> > + return -EINVAL;
> > +
> > + pid = pidfd_get_pid(pidfd, &f_flags);
> > + if (IS_ERR(pid))
> > + return PTR_ERR(pid);
> > +
> > + task = get_pid_task(pid, PIDTYPE_PID);
> > + if (!task) {
> > + ret = -ESRCH;
> > + goto put_pid;
> > + }
> > +
> > + /*
> > + * If the task is dying and in the process of releasing its memory
> > + * then get its mm.
> > + */
> > + task_lock(task);
> > + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
>
> task_will_free_mem() is fine here but I think in parallel we should
> optimize this function. At the moment it is traversing all the
> processes on the machine. It is very normal to have tens of thousands
> of processes on big machines, so it would be really costly when
> reaping a bunch of processes.

Hmm. But I think we still need to make sure that the mm is not shared
with another non-dying process. IIUC that's the point of that
traversal. Am I mistaken?

2021-06-30 18:54:07

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <[email protected]> wrote:
>
> On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > In modern systems it's not unusual to have a system component monitoring
> > memory conditions of the system and tasked with keeping system memory
> > pressure under control. One way to accomplish that is to kill
> > non-essential processes to free up memory for more important ones.
> > Examples of this are Facebook's OOM killer daemon called oomd and
> > Android's low memory killer daemon called lmkd.
> > For such system component it's important to be able to free memory
> > quickly and efficiently. Unfortunately the time process takes to free
> > up its memory after receiving a SIGKILL might vary based on the state
> > of the process (uninterruptible sleep), size and OPP level of the core
> > the process is running. A mechanism to free resources of the target
> > process in a more predictable way would improve system's ability to
> > control its memory pressure.
> > Introduce process_reap system call that reclaims memory of a dying process
> > from the context of the caller. This way the memory in freed in a more
> > controllable way with CPU affinity and priority of the caller. The workload
> > of freeing the memory will also be charged to the caller.
> > The operation is allowed only on a dying process.
>
> At the risk of asking a potentially silly question, should this just
> be a file in procfs?

Hmm. I guess it's doable if procfs will not disappear too soon before
memory is released... syscall also supports parameters, in this case
flags can be used in the future to support PIDs in addition to PIDFDs
for example.
Before looking more in that direction, a silly question from my side:
why procfs interface would be preferable to a syscall?

>
> Also, please consider removing all mention of the word "reap" from the
> user API. For better or for worse, "reap" in UNIX refers to what
> happens when a dead task gets wait()ed. I sincerely wish I could go
> back in time and gently encourage whomever invented that particular
> abomination to change their mind, but my time machine doesn't work.

I see. Thanks for the note. How about process_mem_release() and
replacing reap with release everywhere?

>
> --Andy

2021-06-30 19:03:21

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jun 30, 2021 at 11:44 AM Suren Baghdasaryan <[email protected]> wrote:
>
[...]
> > > + /*
> > > + * If the task is dying and in the process of releasing its memory
> > > + * then get its mm.
> > > + */
> > > + task_lock(task);
> > > + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
> >
> > task_will_free_mem() is fine here but I think in parallel we should
> > optimize this function. At the moment it is traversing all the
> > processes on the machine. It is very normal to have tens of thousands
> > of processes on big machines, so it would be really costly when
> > reaping a bunch of processes.
>
> Hmm. But I think we still need to make sure that the mm is not shared
> with another non-dying process. IIUC that's the point of that
> traversal. Am I mistaken?

You are right. I am talking about efficiently finding all processes
which are sharing mm (maybe linked into another list) instead of
traversing all the processes on the system.

2021-06-30 19:08:42

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jun 30, 2021 at 12:00 PM Shakeel Butt <[email protected]> wrote:
>
> On Wed, Jun 30, 2021 at 11:44 AM Suren Baghdasaryan <[email protected]> wrote:
> >
> [...]
> > > > + /*
> > > > + * If the task is dying and in the process of releasing its memory
> > > > + * then get its mm.
> > > > + */
> > > > + task_lock(task);
> > > > + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
> > >
> > > task_will_free_mem() is fine here but I think in parallel we should
> > > optimize this function. At the moment it is traversing all the
> > > processes on the machine. It is very normal to have tens of thousands
> > > of processes on big machines, so it would be really costly when
> > > reaping a bunch of processes.
> >
> > Hmm. But I think we still need to make sure that the mm is not shared
> > with another non-dying process. IIUC that's the point of that
> > traversal. Am I mistaken?
>
> You are right. I am talking about efficiently finding all processes
> which are sharing mm (maybe linked into another list) instead of
> traversing all the processes on the system.

Oh, I see. I think that's a good idea but belongs to a separate patch
as an optimization for task_will_free_mem().
Thanks for reviewing and for good suggestions!

2021-06-30 21:46:34

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote:
> On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <[email protected]> wrote:
> > Also, please consider removing all mention of the word "reap" from the
> > user API. For better or for worse, "reap" in UNIX refers to what
> > happens when a dead task gets wait()ed. I sincerely wish I could go
> > back in time and gently encourage whomever invented that particular
> > abomination to change their mind, but my time machine doesn't work.
>
> I see. Thanks for the note. How about process_mem_release() and
> replacing reap with release everywhere?

I don't quite understand the objection. This syscall works on tasks
that are at the end of their life, right? Isn't something like
process_mreap() establishing exactly the mental link we want here?
Release is less descriptive for what this thing is to be used for.

2021-07-01 00:46:47

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote:
> > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <[email protected]> wrote:
> > > Also, please consider removing all mention of the word "reap" from the
> > > user API. For better or for worse, "reap" in UNIX refers to what
> > > happens when a dead task gets wait()ed. I sincerely wish I could go
> > > back in time and gently encourage whomever invented that particular
> > > abomination to change their mind, but my time machine doesn't work.
> >
> > I see. Thanks for the note. How about process_mem_release() and
> > replacing reap with release everywhere?
>
> I don't quite understand the objection. This syscall works on tasks
> that are at the end of their life, right? Isn't something like
> process_mreap() establishing exactly the mental link we want here?
> Release is less descriptive for what this thing is to be used for.

For better or for worse, "reap" means to make a zombie pid go away.
From the description, this new operation takes a dying process (not
necessarily a zombie yet) and aggressively frees its memory. This is
a different optioneration.

How about "free_dying_process_memory"?

2021-07-01 00:48:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jun 30, 2021 at 11:51 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <[email protected]> wrote:
> >
> > On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > In modern systems it's not unusual to have a system component monitoring
> > > memory conditions of the system and tasked with keeping system memory
> > > pressure under control. One way to accomplish that is to kill
> > > non-essential processes to free up memory for more important ones.
> > > Examples of this are Facebook's OOM killer daemon called oomd and
> > > Android's low memory killer daemon called lmkd.
> > > For such system component it's important to be able to free memory
> > > quickly and efficiently. Unfortunately the time process takes to free
> > > up its memory after receiving a SIGKILL might vary based on the state
> > > of the process (uninterruptible sleep), size and OPP level of the core
> > > the process is running. A mechanism to free resources of the target
> > > process in a more predictable way would improve system's ability to
> > > control its memory pressure.
> > > Introduce process_reap system call that reclaims memory of a dying process
> > > from the context of the caller. This way the memory in freed in a more
> > > controllable way with CPU affinity and priority of the caller. The workload
> > > of freeing the memory will also be charged to the caller.
> > > The operation is allowed only on a dying process.
> >
> > At the risk of asking a potentially silly question, should this just
> > be a file in procfs?
>
> Hmm. I guess it's doable if procfs will not disappear too soon before
> memory is released... syscall also supports parameters, in this case
> flags can be used in the future to support PIDs in addition to PIDFDs
> for example.
> Before looking more in that direction, a silly question from my side:
> why procfs interface would be preferable to a syscall?

It avoids using a syscall nr. (Admittedly a syscall nr is not *that*
precious of a resource.) It also makes it possible to use a shell
script to do this, which is maybe useful.

--Andy

2021-07-01 23:01:27

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jun 30, 2021 at 5:44 PM Andy Lutomirski <[email protected]> wrote:
>
> On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote:
> > > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <[email protected]> wrote:
> > > > Also, please consider removing all mention of the word "reap" from the
> > > > user API. For better or for worse, "reap" in UNIX refers to what
> > > > happens when a dead task gets wait()ed. I sincerely wish I could go
> > > > back in time and gently encourage whomever invented that particular
> > > > abomination to change their mind, but my time machine doesn't work.
> > >
> > > I see. Thanks for the note. How about process_mem_release() and
> > > replacing reap with release everywhere?
> >
> > I don't quite understand the objection. This syscall works on tasks
> > that are at the end of their life, right? Isn't something like
> > process_mreap() establishing exactly the mental link we want here?
> > Release is less descriptive for what this thing is to be used for.
>
> For better or for worse, "reap" means to make a zombie pid go away.
> From the description, this new operation takes a dying process (not
> necessarily a zombie yet) and aggressively frees its memory. This is
> a different optioneration.
>
> How about "free_dying_process_memory"?

process_mreap sounds definitely better and in line with names like
process_madvise. So maybe we can use it?

2021-07-01 23:10:34

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

)

On Wed, Jun 30, 2021 at 5:46 PM Andy Lutomirski <[email protected]> wrote:
>
> On Wed, Jun 30, 2021 at 11:51 AM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <[email protected]> wrote:
> > >
> > > On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > In modern systems it's not unusual to have a system component monitoring
> > > > memory conditions of the system and tasked with keeping system memory
> > > > pressure under control. One way to accomplish that is to kill
> > > > non-essential processes to free up memory for more important ones.
> > > > Examples of this are Facebook's OOM killer daemon called oomd and
> > > > Android's low memory killer daemon called lmkd.
> > > > For such system component it's important to be able to free memory
> > > > quickly and efficiently. Unfortunately the time process takes to free
> > > > up its memory after receiving a SIGKILL might vary based on the state
> > > > of the process (uninterruptible sleep), size and OPP level of the core
> > > > the process is running. A mechanism to free resources of the target
> > > > process in a more predictable way would improve system's ability to
> > > > control its memory pressure.
> > > > Introduce process_reap system call that reclaims memory of a dying process
> > > > from the context of the caller. This way the memory in freed in a more
> > > > controllable way with CPU affinity and priority of the caller. The workload
> > > > of freeing the memory will also be charged to the caller.
> > > > The operation is allowed only on a dying process.
> > >
> > > At the risk of asking a potentially silly question, should this just
> > > be a file in procfs?
> >
> > Hmm. I guess it's doable if procfs will not disappear too soon before
> > memory is released... syscall also supports parameters, in this case
> > flags can be used in the future to support PIDs in addition to PIDFDs
> > for example.
> > Before looking more in that direction, a silly question from my side:
> > why procfs interface would be preferable to a syscall?
>
> It avoids using a syscall nr. (Admittedly a syscall nr is not *that*
> precious of a resource.) It also makes it possible to use a shell
> script to do this, which is maybe useful.

I see. Not really sure if the shell usage is a big usecase for this
operation but let's see if more people like that approach. For my
specific usecase one syscall (process_reap) is better than three
syscalls (open, write, close) and the possibility to extend the
functionality using flags might be of value for the future.

>
> --Andy

2021-07-02 15:39:40

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Thu, Jul 01, 2021 at 03:59:48PM -0700, Suren Baghdasaryan wrote:
> On Wed, Jun 30, 2021 at 5:44 PM Andy Lutomirski <[email protected]> wrote:
> >
> > On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <[email protected]> wrote:
> > >
> > > On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote:
> > > > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <[email protected]> wrote:
> > > > > Also, please consider removing all mention of the word "reap" from the
> > > > > user API. For better or for worse, "reap" in UNIX refers to what
> > > > > happens when a dead task gets wait()ed. I sincerely wish I could go
> > > > > back in time and gently encourage whomever invented that particular
> > > > > abomination to change their mind, but my time machine doesn't work.
> > > >
> > > > I see. Thanks for the note. How about process_mem_release() and
> > > > replacing reap with release everywhere?
> > >
> > > I don't quite understand the objection. This syscall works on tasks
> > > that are at the end of their life, right? Isn't something like
> > > process_mreap() establishing exactly the mental link we want here?
> > > Release is less descriptive for what this thing is to be used for.
> >
> > For better or for worse, "reap" means to make a zombie pid go away.
> > From the description, this new operation takes a dying process (not
> > necessarily a zombie yet) and aggressively frees its memory. This is
> > a different optioneration.
> >
> > How about "free_dying_process_memory"?
>
> process_mreap sounds definitely better and in line with names like
> process_madvise. So maybe we can use it?

That one was my favorite from the list I gave too but maybe we can
satisfy Andy too if we use one of:
- process_mfree()
- process_mrelease()

2021-07-05 07:42:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On 02.07.21 17:27, Christian Brauner wrote:
> On Thu, Jul 01, 2021 at 03:59:48PM -0700, Suren Baghdasaryan wrote:
>> On Wed, Jun 30, 2021 at 5:44 PM Andy Lutomirski <[email protected]> wrote:
>>>
>>> On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <[email protected]> wrote:
>>>>
>>>> On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote:
>>>>> On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <[email protected]> wrote:
>>>>>> Also, please consider removing all mention of the word "reap" from the
>>>>>> user API. For better or for worse, "reap" in UNIX refers to what
>>>>>> happens when a dead task gets wait()ed. I sincerely wish I could go
>>>>>> back in time and gently encourage whomever invented that particular
>>>>>> abomination to change their mind, but my time machine doesn't work.
>>>>>
>>>>> I see. Thanks for the note. How about process_mem_release() and
>>>>> replacing reap with release everywhere?
>>>>
>>>> I don't quite understand the objection. This syscall works on tasks
>>>> that are at the end of their life, right? Isn't something like
>>>> process_mreap() establishing exactly the mental link we want here?
>>>> Release is less descriptive for what this thing is to be used for.
>>>
>>> For better or for worse, "reap" means to make a zombie pid go away.
>>> From the description, this new operation takes a dying process (not
>>> necessarily a zombie yet) and aggressively frees its memory. This is
>>> a different optioneration.
>>>
>>> How about "free_dying_process_memory"?
>>
>> process_mreap sounds definitely better and in line with names like
>> process_madvise. So maybe we can use it?
>
> That one was my favorite from the list I gave too but maybe we can
> satisfy Andy too if we use one of:
> - process_mfree()
> - process_mrelease()
>

FWIW, I tend to like process_mrelease(), due to the implied "release"
("free the memory if there are no other references") semantics. Further,
a new syscall feels cleaner than some magic sysfs/procfs toggle. Just my
2 cents.

--
Thanks,

David / dhildenb

2021-07-07 09:48:24

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

* Suren Baghdasaryan:

> The API is as follows,
>
> int process_reap(int pidfd, unsigned int flags);
>
> DESCRIPTION
> The process_reap() system call is used to free the memory of a
> dying process.
>
> The pidfd selects the process referred to by the PID file
> descriptor.
> (See pidofd_open(2) for further information)
>
> The flags argument is reserved for future use; currently, this
> argument must be specified as 0.
>
> RETURN VALUE
> On success, process_reap() returns 0. On error, -1 is returned
> and errno is set to indicate the error.

I think the manual page should mention what it means for a process to be
“dying”, and how to move a process to this state.

Thanks,
Florian

2021-07-07 12:39:30

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Mon 05-07-21 09:41:54, David Hildenbrand wrote:
> On 02.07.21 17:27, Christian Brauner wrote:
[...]
> > That one was my favorite from the list I gave too but maybe we can
> > satisfy Andy too if we use one of:
> > - process_mfree()
> > - process_mrelease()
> >
>
> FWIW, I tend to like process_mrelease(), due to the implied "release" ("free
> the memory if there are no other references") semantics.

Agreed.

> Further, a new
> syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents.

Yeah, proc based interface is both tricky to use and kinda ugly now that
pidfd can solve all at in once.

My original preference was a more generic kill syscall to allow flags
but a dedicated syscall doesn't look really bad either.
--
Michal Hocko
SUSE Labs

2021-07-07 21:53:34

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <[email protected]> wrote:
>
> * Suren Baghdasaryan:
>
> > The API is as follows,
> >
> > int process_reap(int pidfd, unsigned int flags);
> >
> > DESCRIPTION
> > The process_reap() system call is used to free the memory of a
> > dying process.
> >
> > The pidfd selects the process referred to by the PID file
> > descriptor.
> > (See pidofd_open(2) for further information)
> >
> > The flags argument is reserved for future use; currently, this
> > argument must be specified as 0.
> >
> > RETURN VALUE
> > On success, process_reap() returns 0. On error, -1 is returned
> > and errno is set to indicate the error.
>
> I think the manual page should mention what it means for a process to be
> “dying”, and how to move a process to this state.

Thanks for the suggestion, Florian! Would replacing "dying process"
with "process which was sent a SIGKILL signal" be sufficient?

>
> Thanks,
> Florian
>

2021-07-07 21:54:14

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jul 7, 2021 at 5:38 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 05-07-21 09:41:54, David Hildenbrand wrote:
> > On 02.07.21 17:27, Christian Brauner wrote:
> [...]
> > > That one was my favorite from the list I gave too but maybe we can
> > > satisfy Andy too if we use one of:
> > > - process_mfree()
> > > - process_mrelease()
> > >
> >
> > FWIW, I tend to like process_mrelease(), due to the implied "release" ("free
> > the memory if there are no other references") semantics.
>
> Agreed.

Ok, sounds like process_mrelease() would be an acceptable compromise.

>
> > Further, a new
> > syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents.
>
> Yeah, proc based interface is both tricky to use and kinda ugly now that
> pidfd can solve all at in once.

Sounds good. Will keep it as is then.

> My original preference was a more generic kill syscall to allow flags
> but a dedicated syscall doesn't look really bad either.

Yeah, I have tried that direction unsuccessfully before arriving at
this one. Hopefully it represents the right compromise which can
satisfy everyone's usecase.

> --
> Michal Hocko
> SUSE Labs

2021-07-08 05:42:40

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

* Suren Baghdasaryan:

> On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <[email protected]> wrote:
>>
>> * Suren Baghdasaryan:
>>
>> > The API is as follows,
>> >
>> > int process_reap(int pidfd, unsigned int flags);
>> >
>> > DESCRIPTION
>> > The process_reap() system call is used to free the memory of a
>> > dying process.
>> >
>> > The pidfd selects the process referred to by the PID file
>> > descriptor.
>> > (See pidofd_open(2) for further information)
>> >
>> > The flags argument is reserved for future use; currently, this
>> > argument must be specified as 0.
>> >
>> > RETURN VALUE
>> > On success, process_reap() returns 0. On error, -1 is returned
>> > and errno is set to indicate the error.
>>
>> I think the manual page should mention what it means for a process to be
>> “dying”, and how to move a process to this state.
>
> Thanks for the suggestion, Florian! Would replacing "dying process"
> with "process which was sent a SIGKILL signal" be sufficient?

That explains very clearly the requirement, but it raises the question
why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
system call.

Thanks,
Florian

2021-07-08 06:07:50

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jul 7, 2021 at 10:41 PM Florian Weimer <[email protected]> wrote:
>
> * Suren Baghdasaryan:
>
> > On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <[email protected]> wrote:
> >>
> >> * Suren Baghdasaryan:
> >>
> >> > The API is as follows,
> >> >
> >> > int process_reap(int pidfd, unsigned int flags);
> >> >
> >> > DESCRIPTION
> >> > The process_reap() system call is used to free the memory of a
> >> > dying process.
> >> >
> >> > The pidfd selects the process referred to by the PID file
> >> > descriptor.
> >> > (See pidofd_open(2) for further information)
> >> >
> >> > The flags argument is reserved for future use; currently, this
> >> > argument must be specified as 0.
> >> >
> >> > RETURN VALUE
> >> > On success, process_reap() returns 0. On error, -1 is returned
> >> > and errno is set to indicate the error.
> >>
> >> I think the manual page should mention what it means for a process to be
> >> “dying”, and how to move a process to this state.
> >
> > Thanks for the suggestion, Florian! Would replacing "dying process"
> > with "process which was sent a SIGKILL signal" be sufficient?
>
> That explains very clearly the requirement, but it raises the question
> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
> system call.

I think you are suggesting to use sigqueue() to deliver the signal and
perform the reaping when a special value accompanies it. This would be
somewhat similar to my early suggestion to use a flag in
pidfd_send_signal() (see:
https://lore.kernel.org/patchwork/patch/1060407) to implement memory
reaping which has another advantage of operation on PIDFDs instead of
PIDs which can be recycled.
kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the
signal and return without blocking. Changing that behavior was
considered unacceptable in these discussions. On the other hand using
some kthread to do the reaping asynchronously has its disadvantages:
userspace can't control the priority and cpu affinity of the thread
doing the reaping and this work is not charged towards the caller. In
the end a separate blocking syscall was deemed appropriate for this
operation. More details can be found in the links I posted in the
description of the patch.

>
> Thanks,
> Florian
>

2021-07-08 06:18:48

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

* Suren Baghdasaryan:

> On Wed, Jul 7, 2021 at 10:41 PM Florian Weimer <[email protected]> wrote:
>>
>> * Suren Baghdasaryan:
>>
>> > On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <[email protected]> wrote:
>> >>
>> >> * Suren Baghdasaryan:
>> >>
>> >> > The API is as follows,
>> >> >
>> >> > int process_reap(int pidfd, unsigned int flags);
>> >> >
>> >> > DESCRIPTION
>> >> > The process_reap() system call is used to free the memory of a
>> >> > dying process.
>> >> >
>> >> > The pidfd selects the process referred to by the PID file
>> >> > descriptor.
>> >> > (See pidofd_open(2) for further information)
>> >> >
>> >> > The flags argument is reserved for future use; currently, this
>> >> > argument must be specified as 0.
>> >> >
>> >> > RETURN VALUE
>> >> > On success, process_reap() returns 0. On error, -1 is returned
>> >> > and errno is set to indicate the error.
>> >>
>> >> I think the manual page should mention what it means for a process to be
>> >> “dying”, and how to move a process to this state.
>> >
>> > Thanks for the suggestion, Florian! Would replacing "dying process"
>> > with "process which was sent a SIGKILL signal" be sufficient?
>>
>> That explains very clearly the requirement, but it raises the question
>> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
>> system call.
>
> I think you are suggesting to use sigqueue() to deliver the signal and
> perform the reaping when a special value accompanies it. This would be
> somewhat similar to my early suggestion to use a flag in
> pidfd_send_signal() (see:
> https://lore.kernel.org/patchwork/patch/1060407) to implement memory
> reaping which has another advantage of operation on PIDFDs instead of
> PIDs which can be recycled.
> kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the
> signal and return without blocking. Changing that behavior was
> considered unacceptable in these discussions.

Does this mean that you need two threads, one that sends SIGKILL, and
one that calls process_reap? Given that sending SIGKILL is blocking
with the existing interfaces?

Please also note that asynchronous deallocation of resources leads to
bugs and can cause unrelated workloads to fail. For example, in some
configurations, clone can fail with EAGAIN even in cases where the total
number of tasks is clearly bounded because the kernel signals task exit
to applications before all resources are deallocated. I'm worried that
the new interface makes things quite a bit worse in this regard.

Thanks,
Florian

2021-07-08 06:40:40

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jul 7, 2021 at 11:15 PM Florian Weimer <[email protected]> wrote:
>
> * Suren Baghdasaryan:
>
> > On Wed, Jul 7, 2021 at 10:41 PM Florian Weimer <[email protected]> wrote:
> >>
> >> * Suren Baghdasaryan:
> >>
> >> > On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <[email protected]> wrote:
> >> >>
> >> >> * Suren Baghdasaryan:
> >> >>
> >> >> > The API is as follows,
> >> >> >
> >> >> > int process_reap(int pidfd, unsigned int flags);
> >> >> >
> >> >> > DESCRIPTION
> >> >> > The process_reap() system call is used to free the memory of a
> >> >> > dying process.
> >> >> >
> >> >> > The pidfd selects the process referred to by the PID file
> >> >> > descriptor.
> >> >> > (See pidofd_open(2) for further information)
> >> >> >
> >> >> > The flags argument is reserved for future use; currently, this
> >> >> > argument must be specified as 0.
> >> >> >
> >> >> > RETURN VALUE
> >> >> > On success, process_reap() returns 0. On error, -1 is returned
> >> >> > and errno is set to indicate the error.
> >> >>
> >> >> I think the manual page should mention what it means for a process to be
> >> >> “dying”, and how to move a process to this state.
> >> >
> >> > Thanks for the suggestion, Florian! Would replacing "dying process"
> >> > with "process which was sent a SIGKILL signal" be sufficient?
> >>
> >> That explains very clearly the requirement, but it raises the question
> >> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
> >> system call.
> >
> > I think you are suggesting to use sigqueue() to deliver the signal and
> > perform the reaping when a special value accompanies it. This would be
> > somewhat similar to my early suggestion to use a flag in
> > pidfd_send_signal() (see:
> > https://lore.kernel.org/patchwork/patch/1060407) to implement memory
> > reaping which has another advantage of operation on PIDFDs instead of
> > PIDs which can be recycled.
> > kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the
> > signal and return without blocking. Changing that behavior was
> > considered unacceptable in these discussions.
>
> Does this mean that you need two threads, one that sends SIGKILL, and
> one that calls process_reap? Given that sending SIGKILL is blocking
> with the existing interfaces?

Sending SIGKILL is blocking in terms of delivering the signal, but it
does not block waiting for SIGKILL to be processed by the signal
recipient and memory to be released. When I was talking about
"blocking", I meant that current kill() and friends do not block to
wait for SIGKILL to be processed.
process_reap() will block until the memory is released. Whether the
userspace caller is using it right after sending a SIGKILL to reclaim
the memory synchronously or spawns a separate thread to reclaim memory
asynchronously is up to the user. Both patterns are supported.

> Please also note that asynchronous deallocation of resources leads to
> bugs and can cause unrelated workloads to fail. For example, in some
> configurations, clone can fail with EAGAIN even in cases where the total
> number of tasks is clearly bounded because the kernel signals task exit
> to applications before all resources are deallocated. I'm worried that
> the new interface makes things quite a bit worse in this regard.

The process_reap() releases memory synchronously, no kthreads are
being used. If asynchronous release is required, the userspace would
need to spawn a userspace thread and issue this syscall from it. I
hope this clears your concerns, which I think are about asynchronous
deallocations within the kernel.
Thanks!

>
> Thanks,
> Florian
>

2021-07-08 07:17:44

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

* Suren Baghdasaryan:

> Sending SIGKILL is blocking in terms of delivering the signal, but it
> does not block waiting for SIGKILL to be processed by the signal
> recipient and memory to be released. When I was talking about
> "blocking", I meant that current kill() and friends do not block to
> wait for SIGKILL to be processed.
> process_reap() will block until the memory is released. Whether the
> userspace caller is using it right after sending a SIGKILL to reclaim
> the memory synchronously or spawns a separate thread to reclaim memory
> asynchronously is up to the user. Both patterns are supported.

I see, this makes sense.

Considering that the pidfd sticks around after process_reap returns, the
issue described in bug 154011 probably does not apply to process_reap.
(This relates to asynchronous resource deallocation, as discussed before.)

Thanks,
Florian

2021-07-09 09:00:25

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Wed, Jul 07, 2021 at 02:14:23PM -0700, Suren Baghdasaryan wrote:
> On Wed, Jul 7, 2021 at 5:38 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 05-07-21 09:41:54, David Hildenbrand wrote:
> > > On 02.07.21 17:27, Christian Brauner wrote:
> > [...]
> > > > That one was my favorite from the list I gave too but maybe we can
> > > > satisfy Andy too if we use one of:
> > > > - process_mfree()
> > > > - process_mrelease()
> > > >
> > >
> > > FWIW, I tend to like process_mrelease(), due to the implied "release" ("free
> > > the memory if there are no other references") semantics.
> >
> > Agreed.
>
> Ok, sounds like process_mrelease() would be an acceptable compromise.
>
> >
> > > Further, a new
> > > syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents.
> >
> > Yeah, proc based interface is both tricky to use and kinda ugly now that
> > pidfd can solve all at in once.
>
> Sounds good. Will keep it as is then.
>
> > My original preference was a more generic kill syscall to allow flags
> > but a dedicated syscall doesn't look really bad either.
>
> Yeah, I have tried that direction unsuccessfully before arriving at
> this one. Hopefully it represents the right compromise which can
> satisfy everyone's usecase.

I think a syscall is fine and it's not we're running out of numbers
(anymore). :)

Christian

2021-07-09 20:07:28

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Fri, Jul 9, 2021 at 1:59 AM Christian Brauner
<[email protected]> wrote:
>
> On Wed, Jul 07, 2021 at 02:14:23PM -0700, Suren Baghdasaryan wrote:
> > On Wed, Jul 7, 2021 at 5:38 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 05-07-21 09:41:54, David Hildenbrand wrote:
> > > > On 02.07.21 17:27, Christian Brauner wrote:
> > > [...]
> > > > > That one was my favorite from the list I gave too but maybe we can
> > > > > satisfy Andy too if we use one of:
> > > > > - process_mfree()
> > > > > - process_mrelease()
> > > > >
> > > >
> > > > FWIW, I tend to like process_mrelease(), due to the implied "release" ("free
> > > > the memory if there are no other references") semantics.
> > >
> > > Agreed.
> >
> > Ok, sounds like process_mrelease() would be an acceptable compromise.
> >
> > >
> > > > Further, a new
> > > > syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents.
> > >
> > > Yeah, proc based interface is both tricky to use and kinda ugly now that
> > > pidfd can solve all at in once.
> >
> > Sounds good. Will keep it as is then.
> >
> > > My original preference was a more generic kill syscall to allow flags
> > > but a dedicated syscall doesn't look really bad either.
> >
> > Yeah, I have tried that direction unsuccessfully before arriving at
> > this one. Hopefully it represents the right compromise which can
> > satisfy everyone's usecase.
>
> I think a syscall is fine and it's not we're running out of numbers
> (anymore). :)

Thanks everyone for the input!
So far I collected:
1. rename the syscall to process_mrelease()
2. replace "dying process" with "process which was sent a SIGKILL
signal" in the manual page text

I'll respin a v2 with these changes next week.
Have a great weekend!
Suren.

>
> Christian

2021-07-12 13:01:18

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call


On Thursday 2021-07-08 08:05, Suren Baghdasaryan wrote:
>>
>> That explains very clearly the requirement, but it raises the question
>> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
>> system call.
>
>I think you are suggesting to use sigqueue() to deliver the signal and
>perform the reaping when a special value accompanies it. This would be
>somewhat similar to my early suggestion to use a flag in
>pidfd_send_signal() (see:
>https://lore.kernel.org/patchwork/patch/1060407) to implement memory
>reaping which has another advantage of operation on PIDFDs instead of
>PIDs which can be recycled.
>kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the
>signal and return without blocking. Changing that behavior was
>considered unacceptable in these discussions.

The way I understood the request is that a userspace program (or perhaps two,
if so desired) should issue _two_ calls, one to deliver the signal,
one to perform the reap portion:

uinfo.si_code = SI_QUEUE;
sigqueue(pid, SIGKILL, &uinfo);
uinfo.si_code = SI_REAP;
sigqueue(pid, SIGKILL, &uinfo);

2021-07-12 18:41:49

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call

On Mon, Jul 12, 2021 at 5:51 AM Jan Engelhardt <[email protected]> wrote:
>
>
> On Thursday 2021-07-08 08:05, Suren Baghdasaryan wrote:
> >>
> >> That explains very clearly the requirement, but it raises the question
> >> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing
> >> system call.
> >
> >I think you are suggesting to use sigqueue() to deliver the signal and
> >perform the reaping when a special value accompanies it. This would be
> >somewhat similar to my early suggestion to use a flag in
> >pidfd_send_signal() (see:
> >https://lore.kernel.org/patchwork/patch/1060407) to implement memory
> >reaping which has another advantage of operation on PIDFDs instead of
> >PIDs which can be recycled.
> >kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the
> >signal and return without blocking. Changing that behavior was
> >considered unacceptable in these discussions.
>
> The way I understood the request is that a userspace program (or perhaps two,
> if so desired) should issue _two_ calls, one to deliver the signal,
> one to perform the reap portion:
>
> uinfo.si_code = SI_QUEUE;
> sigqueue(pid, SIGKILL, &uinfo);
> uinfo.si_code = SI_REAP;
> sigqueue(pid, SIGKILL, &uinfo);

This approach would still lead to the same discussion: by design,
sigqueue/kill/pidfd_send_signal deliver the signal but do not wait for
the signal to be processed by the recipient. Changing that would be a
behavior change. Therefore we would have to follow this pattern and
implement memory reaping in an asynchronous manner using a
kthread/workqueue and it won't be done in the context of the calling
process. This is undesirable because we lose the ability to control
priority and cpu affinity for this operation and work won't be charged
to the caller.
That's why the proposed syscall performs memory reaping in the
caller's context and blocks until the operation is done. In this
proposal, your sequence looks like this:

pidfd_send_signal(pidfd, SIGKILL, NULL, 0);
process_reap(pidfd, 0);

except we decided to rename process_reap() to process_mrelease() in
the next revision.

2021-07-12 19:17:20

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call


On Monday 2021-07-12 20:39, Suren Baghdasaryan wrote:
>>
>> The way I understood the request is that a userspace program (or perhaps two,
>> if so desired) should issue _two_ calls, one to deliver the signal,
>> one to perform the reap portion:
>>
>> uinfo.si_code = SI_QUEUE;
>> sigqueue(pid, SIGKILL, &uinfo);
>> uinfo.si_code = SI_REAP;
>> sigqueue(pid, SIGKILL, &uinfo);
>
>This approach would still lead to the same discussion: by design,
>sigqueue/kill/pidfd_send_signal deliver the signal but do not wait for
>the signal to be processed by the recipient.

Oh, so the only reason not to do that is because there is some POSIX
specification that says the sigqueue API should be non-waiting for all
possible parameter values (with an implied "present and future
values!"), not because there's some hurdle to actually add a wait
inside within rt_sigqueueinfo if the REAP flag is set.
Gotcha.