Many thanks to Jann Horn for a detailed review of v0.5:
https://lore.kernel.org/lkml/[email protected]/
The main change in this patchet vs v0.5 is pinning of
pages containing struct umcg_task of RUNNING workers
and their servers. This allows uaccess operations from
sched_submit_work (preempt-disabled). The pages are
pinned only for RUNNING workers, i.e. at most NR_CPUS
pages are pinned by UMCG at any one time, across the
whole machine.
A smaller change is the move of worker wakeup processing
from sched_update_worker() to exit_to_user_mode_loop(),
to ensure that no kernel locks are held if the worker
is descheduled there.
Commit messages in patches 2 and 3 list more changes.
Peter Oskolkov (5):
sched/umcg: add WF_CURRENT_CPU and externise ttwu
sched/umcg: RFC: add userspace atomic helpers
sched/umcg: RFC: implement UMCG syscalls
sched/umcg: add Documentation/userspace-api/umcg.rst
sched/umcg: add Documentation/userspace-api/umcg.txt
Documentation/userspace-api/umcg.rst | 550 ++++++++++++++++++
Documentation/userspace-api/umcg.txt | 537 ++++++++++++++++++
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
fs/exec.c | 1 +
include/linux/sched.h | 56 ++
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 8 +-
include/uapi/linux/umcg.h | 117 ++++
init/Kconfig | 10 +
kernel/entry/common.c | 1 +
kernel/exit.c | 2 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 18 +-
kernel/sched/fair.c | 4 +
kernel/sched/sched.h | 15 +-
kernel/sched/umcg.c | 745 +++++++++++++++++++++++++
kernel/sched/umcg_uaccess.h | 344 ++++++++++++
kernel/sys_ni.c | 4 +
18 files changed, 2408 insertions(+), 11 deletions(-)
create mode 100644 Documentation/userspace-api/umcg.rst
create mode 100644 Documentation/userspace-api/umcg.txt
create mode 100644 include/uapi/linux/umcg.h
create mode 100644 kernel/sched/umcg.c
create mode 100644 kernel/sched/umcg_uaccess.h
--
2.25.1
Add WF_CURRENT_CPU wake flag that advices the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases such as UMCG.
In addition, make ttwu external rather than static so that
the flag could be passed to it from outside of sched/core.c.
Signed-off-by: Peter Oskolkov <[email protected]>
---
kernel/sched/core.c | 3 +--
kernel/sched/fair.c | 4 ++++
kernel/sched/sched.h | 15 +++++++++------
3 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b36b5d7f2617..12a9d053e724 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3878,8 +3878,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
* Return: %true if @p->state changes (an actual wakeup was done),
* %false otherwise.
*/
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
unsigned long flags;
int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e26d622762a9..7face28b1830 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6863,6 +6863,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (wake_flags & WF_TTWU) {
record_wakee(p);
+ if ((wake_flags & WF_CURRENT_CPU) &&
+ cpumask_test_cpu(cpu, p->cpus_ptr))
+ return cpu;
+
if (sched_energy_enabled()) {
new_cpu = find_energy_efficient_cpu(p, prev_cpu);
if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 148d5d3255a5..2a79c302ee81 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2040,13 +2040,14 @@ static inline int task_on_rq_migrating(struct task_struct *p)
}
/* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC 0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK 0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */
+#define WF_EXEC 0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK 0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */
-#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
-#define WF_ON_CPU 0x40 /* Wakee is on_cpu */
+#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
+#define WF_ON_CPU 0x40 /* Wakee is on_cpu */
+#define WF_CURRENT_CPU 0x80 /* Prefer to move the wakee to the current CPU. */
#ifdef CONFIG_SMP
static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3054,6 +3055,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
extern void swake_up_all_locked(struct swait_queue_head *q);
extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
#ifdef CONFIG_PREEMPT_DYNAMIC
extern int preempt_dynamic_mode;
extern int sched_dynamic_mode(const char *str);
--
2.25.1
Add helper functions to work atomically with userspace 32/64 bit values -
there are some .*futex.* named helpers, but they are not exactly
what is needed for UMCG; I haven't found what else I could use, so I
rolled these.
At the moment only X86_64 is supported.
Note: the helpers should probably go into arch/ somewhere; I have
them in kernel/sched/umcg_uaccess.h temporarily for convenience. Please
let me know where I should put them.
Changelog:
v0.5->v0.6:
- replaced mmap_read_lock with mmap_read_lock_killable in fix_pagefault();
- fix_pagefault now validates proper uaddr alignment;
- renamed umcg.h to umcg_uaccess.h;
v0.4->v0.5:
- added xchg_user_** helpers;
v0.3->v0.4:
- added put_user_nosleep;
- removed linked list/stack operations patch;
v0.2->v0.3:
- renamed and refactored the helpers a bit, as described above;
- moved linked list/stack operations into a separate patch.
Signed-off-by: Peter Oskolkov <[email protected]>
---
kernel/sched/umcg_uaccess.h | 344 ++++++++++++++++++++++++++++++++++++
1 file changed, 344 insertions(+)
create mode 100644 kernel/sched/umcg_uaccess.h
diff --git a/kernel/sched/umcg_uaccess.h b/kernel/sched/umcg_uaccess.h
new file mode 100644
index 000000000000..e4ead8d2fd62
--- /dev/null
+++ b/kernel/sched/umcg_uaccess.h
@@ -0,0 +1,344 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _KERNEL_SCHED_UMCG_UACCESS_H
+#define _KERNEL_SCHED_UMCG_UACCESS_H
+
+#ifdef CONFIG_X86_64
+
+#include <linux/uaccess.h>
+
+#include <asm/asm.h>
+#include <linux/atomic.h>
+#include <asm/uaccess.h>
+
+/* TODO: move atomic operations below into arch/ headers */
+static inline int __try_cmpxchg_user_32(u32 *uval, u32 __user *uaddr,
+ u32 oldval, u32 newval)
+{
+ int ret = 0;
+
+ asm volatile("\n"
+ "1:\t" LOCK_PREFIX "cmpxchgl %4, %2\n"
+ "2:\n"
+ "\t.section .fixup, \"ax\"\n"
+ "3:\tmov %3, %0\n"
+ "\tjmp 2b\n"
+ "\t.previous\n"
+ _ASM_EXTABLE_UA(1b, 3b)
+ : "+r" (ret), "=a" (oldval), "+m" (*uaddr)
+ : "i" (-EFAULT), "r" (newval), "1" (oldval)
+ : "memory"
+ );
+ *uval = oldval;
+ return ret;
+}
+
+static inline int __try_cmpxchg_user_64(u64 *uval, u64 __user *uaddr,
+ u64 oldval, u64 newval)
+{
+ int ret = 0;
+
+ asm volatile("\n"
+ "1:\t" LOCK_PREFIX "cmpxchgq %4, %2\n"
+ "2:\n"
+ "\t.section .fixup, \"ax\"\n"
+ "3:\tmov %3, %0\n"
+ "\tjmp 2b\n"
+ "\t.previous\n"
+ _ASM_EXTABLE_UA(1b, 3b)
+ : "+r" (ret), "=a" (oldval), "+m" (*uaddr)
+ : "i" (-EFAULT), "r" (newval), "1" (oldval)
+ : "memory"
+ );
+ *uval = oldval;
+ return ret;
+}
+
+static inline int fix_pagefault(unsigned long uaddr, bool write_fault, int bytes)
+{
+ struct mm_struct *mm = current->mm;
+ int ret;
+
+ /* Validate proper alignment. */
+ if (uaddr % bytes)
+ return -EINVAL;
+
+ if (mmap_read_lock_killable(mm))
+ return -EINTR;
+ ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
+ NULL);
+ mmap_read_unlock(mm);
+
+ return ret < 0 ? ret : 0;
+}
+
+/**
+ * cmpxchg_32_user_nosleep - compare_exchange 32-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ * -EAGAIN: @expected did not match; consult @prev
+ */
+static inline int cmpxchg_user_32_nosleep(u32 __user *uaddr, u32 *old, u32 new)
+{
+ int ret = -EFAULT;
+ u32 __old = *old;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ __uaccess_begin_nospec();
+ ret = __try_cmpxchg_user_32(old, uaddr, __old, new);
+ user_access_end();
+
+ if (!ret)
+ ret = *old == __old ? 0 : -EAGAIN;
+
+ pagefault_enable();
+ return ret;
+}
+
+/**
+ * cmpxchg_64_user_nosleep - compare_exchange 64-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ * -EAGAIN: @expected did not match; consult @prev
+ */
+static inline int cmpxchg_user_64_nosleep(u64 __user *uaddr, u64 *old, u64 new)
+{
+ int ret = -EFAULT;
+ u64 __old = *old;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ __uaccess_begin_nospec();
+ ret = __try_cmpxchg_user_64(old, uaddr, __old, new);
+ user_access_end();
+
+ if (!ret)
+ ret = *old == __old ? 0 : -EAGAIN;
+
+ pagefault_enable();
+
+ return ret;
+}
+
+/**
+ * cmpxchg_32_user - compare_exchange 32-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ * -EAGAIN: @expected did not match; consult @prev
+ */
+static inline int cmpxchg_user_32(u32 __user *uaddr, u32 *old, u32 new)
+{
+ int ret = -EFAULT;
+ u32 __old = *old;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ while (true) {
+ __uaccess_begin_nospec();
+ ret = __try_cmpxchg_user_32(old, uaddr, __old, new);
+ user_access_end();
+
+ if (!ret) {
+ ret = *old == __old ? 0 : -EAGAIN;
+ break;
+ }
+
+ if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
+ break;
+ }
+
+ pagefault_enable();
+ return ret;
+}
+
+/**
+ * cmpxchg_64_user - compare_exchange 64-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ * -EAGAIN: @expected did not match; consult @prev
+ */
+static inline int cmpxchg_user_64(u64 __user *uaddr, u64 *old, u64 new)
+{
+ int ret = -EFAULT;
+ u64 __old = *old;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ while (true) {
+ __uaccess_begin_nospec();
+ ret = __try_cmpxchg_user_64(old, uaddr, __old, new);
+ user_access_end();
+
+ if (!ret) {
+ ret = *old == __old ? 0 : -EAGAIN;
+ break;
+ }
+
+ if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
+ break;
+ }
+
+ pagefault_enable();
+
+ return ret;
+}
+
+static inline int __try_xchg_user_32(u32 *oval, u32 __user *uaddr, u32 newval)
+{
+ u32 oldval = 0;
+ int ret = 0;
+
+ asm volatile("\n"
+ "1:\txchgl %0, %2\n"
+ "2:\n"
+ "\t.section .fixup, \"ax\"\n"
+ "3:\tmov %3, %0\n"
+ "\tjmp 2b\n"
+ "\t.previous\n"
+ _ASM_EXTABLE_UA(1b, 3b)
+ : "=r" (oldval), "=r" (ret), "+m" (*uaddr)
+ : "i" (-EFAULT), "0" (newval), "1" (0)
+ );
+
+ if (ret)
+ return ret;
+
+ *oval = oldval;
+ return 0;
+}
+
+static inline int __try_xchg_user_64(u64 *oval, u64 __user *uaddr, u64 newval)
+{
+ u64 oldval = 0;
+ int ret = 0;
+
+ asm volatile("\n"
+ "1:\txchgq %0, %2\n"
+ "2:\n"
+ "\t.section .fixup, \"ax\"\n"
+ "3:\tmov %3, %0\n"
+ "\tjmp 2b\n"
+ "\t.previous\n"
+ _ASM_EXTABLE_UA(1b, 3b)
+ : "=r" (oldval), "=r" (ret), "+m" (*uaddr)
+ : "i" (-EFAULT), "0" (newval), "1" (0)
+ );
+
+ if (ret)
+ return ret;
+
+ *oval = oldval;
+ return 0;
+}
+
+/**
+ * xchg_32_user - atomically exchange 64-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ */
+static inline int xchg_user_32(u32 __user *uaddr, u32 *val)
+{
+ int ret = -EFAULT;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ while (true) {
+
+ __uaccess_begin_nospec();
+ ret = __try_xchg_user_32(val, uaddr, *val);
+ user_access_end();
+
+ if (!ret)
+ break;
+
+ if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
+ break;
+ }
+
+ pagefault_enable();
+
+ return ret;
+}
+
+/**
+ * xchg_64_user - atomically exchange 64-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ */
+static inline int xchg_user_64(u64 __user *uaddr, u64 *val)
+{
+ int ret = -EFAULT;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ while (true) {
+
+ __uaccess_begin_nospec();
+ ret = __try_xchg_user_64(val, uaddr, *val);
+ user_access_end();
+
+ if (!ret)
+ break;
+
+ if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
+ break;
+ }
+
+ pagefault_enable();
+
+ return ret;
+}
+
+/**
+ * get_user_nosleep - get user value without sleeping.
+ *
+ * get_user() might sleep and therefore cannot be used in preempt-disabled
+ * regions.
+ */
+#define get_user_nosleep(out, uaddr) \
+({ \
+ int ret = -EFAULT; \
+ \
+ if (access_ok((uaddr), sizeof(*(uaddr)))) { \
+ pagefault_disable(); \
+ \
+ if (!__get_user((out), (uaddr))) \
+ ret = 0; \
+ \
+ pagefault_enable(); \
+ } \
+ ret; \
+})
+
+#endif /* CONFIG_X86_64 */
+#endif /* _KERNEL_SCHED_UMCG_UACCESS_H */
--
2.25.1
Define struct umcg_task and two syscalls: sys_umcg_ctl sys_umcg_wait.
All key operations, such as wait/wake/context-switch, as well as
timeouts and block/wake detection, are working quite reliably.
In addition, the userspace can now force the kernel to preempt
a running worker by changing its state from RUNNING to
RUNNING | PREEMPTED and sending a signal to it. This new functionality
is less well tested than the key operations above, but is working
well in the common case of the worker busy in the userspace.
These big things remain to be addressed (in no particular order):
- tracing/debugging
- faster context switches (see umcg_do_context_switch in umcg.c)
- other architectures (we will need at least arm64 in addition to amd64)
- tools/lib/umcg for userspace
- kselftests
I'm working on finalizing libumcg and kselftests.
See Documentation/userspace-api/umcg.[txt|rst] for API usage and
other details.
v0.5->v0.6 changes:
- umcg_task pages are now pinned for RUNNING workers;
- waking workers now wait for the userspace to schedule them
in exit_to_user_mode_loop() instead of in sched_update_worker();
- added umcg_clear_child to fork and execve;
- changed current->umcg_task assignments to WRITE_ONCE;
- server/worker interactions are restricted to tasks in the same mm;
v0.4->v0.5 changes:
- handling idle workers and servers is now much simpler on the kernel
side, thanks to Thierry Delisle's suggestion:
https://lore.kernel.org/lkml/[email protected]/
- minor tweaks to improve preemption handling;
v0.3->v0.4 changes:
- removed server_tid and api_version fields from struct umcg_task;
- added timeout handling to sys_umcg_wait();
- implemented worker preemption via signals;
- handling idle workers and servers is changed again (see umcg.rst).
v0.2->v0.3 changes:
- the overall approach is now based on peterz@'s suggestion in
https://lore.kernel.org/patchwork/cover/1433967/
(should I add Suggested-by?)
- new protocol for working with idle workers and servers is used, to avoid
spinning in the kernel;
- waking a UMCG task now does not require spinning.
Signed-off-by: Peter Oskolkov <[email protected]>
---
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
fs/exec.c | 1 +
include/linux/sched.h | 56 ++
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 8 +-
include/uapi/linux/umcg.h | 117 ++++
init/Kconfig | 10 +
kernel/entry/common.c | 1 +
kernel/exit.c | 2 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 15 +-
kernel/sched/umcg.c | 745 +++++++++++++++++++++++++
kernel/sys_ni.c | 4 +
13 files changed, 963 insertions(+), 3 deletions(-)
create mode 100644 include/uapi/linux/umcg.h
create mode 100644 kernel/sched/umcg.c
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ce18119ea0d0..0c6c7fd72b0b 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -368,6 +368,8 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common umcg_ctl sys_umcg_ctl
+448 common umcg_wait sys_umcg_wait
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..d652ef8017b2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1835,6 +1835,7 @@ static int bprm_execve(struct linux_binprm *bprm,
current->fs->in_exec = 0;
current->in_execve = 0;
rseq_execve(current);
+ umcg_execve(current);
acct_update_integrals(current);
task_numa_free(current, false);
return retval;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 549018e46801..4cf9070d1361 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -66,6 +66,7 @@ struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
+struct umcg_task;
/*
* Task state bitmask. NOTE! These bits are also
@@ -1230,6 +1231,12 @@ struct task_struct {
unsigned long rseq_event_mask;
#endif
+#ifdef CONFIG_UMCG
+ struct umcg_task __user *umcg_task;
+ struct page *pinned_umcg_worker_page; /* self */
+ struct page *pinned_umcg_server_page;
+#endif
+
struct tlbflush_unmap_batch tlb_ubc;
union {
@@ -1606,6 +1613,7 @@ extern struct pid *cad_pid;
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
@@ -2191,6 +2199,54 @@ static inline void rseq_execve(struct task_struct *t)
#endif
+#ifdef CONFIG_UMCG
+
+void umcg_handle_resuming_worker(void);
+void umcg_handle_exiting_worker(void);
+void umcg_clear_child(struct task_struct *tsk);
+
+/* Called by bprm_execve() in fs/exec.c. */
+static inline void umcg_execve(struct task_struct *tsk)
+{
+ if (tsk->umcg_task)
+ umcg_clear_child(tsk);
+}
+
+/* Called by exit_to_user_mode_loop() in kernel/entry/common.c.*/
+static inline void umcg_handle_notify_resume(void)
+{
+ if (current->flags & PF_UMCG_WORKER)
+ umcg_handle_resuming_worker();
+}
+
+/* Called by do_exit() in kernel/exit.c. */
+static inline void umcg_handle_exit(void)
+{
+ if (current->flags & PF_UMCG_WORKER)
+ umcg_handle_exiting_worker();
+}
+
+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+void umcg_wq_worker_sleeping(struct task_struct *tsk);
+void umcg_wq_worker_running(struct task_struct *tsk);
+
+#else
+
+static inline void umcg_execve(struct task_struct *tsk)
+{
+}
+static inline void umcg_handle_notify_resume(void)
+{
+}
+static inline void umcg_handle_exit(void)
+{
+}
+
+#endif
+
#ifdef CONFIG_DEBUG_RSEQ
void rseq_syscall(struct pt_regs *regs);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..f3e1ef8d842f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -71,6 +71,7 @@ struct open_how;
struct mount_attr;
struct landlock_ruleset_attr;
enum landlock_rule_type;
+struct umcg_task;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -1050,6 +1051,9 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _
asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
const void __user *rule_attr, __u32 flags);
asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
+
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 6de5a7fc066b..1a4c9ac0e296 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -873,8 +873,14 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
#define __NR_landlock_restrict_self 446
__SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
+#define __NR_umcg_ctl 447
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 448
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
+
+
#undef __NR_syscalls
-#define __NR_syscalls 447
+#define __NR_syscalls 449
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
new file mode 100644
index 000000000000..edce804781f9
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,117 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/limits.h>
+#include <linux/types.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ * sys_umcg_ctl() - register/unregister UMCG tasks;
+ * sys_umcg_wait() - wait/wake/context-switch.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ *
+ * See Documentation/userspace-api/umcg.[txt|rst] for detals.
+ */
+
+/*
+ * UMCG task states, the first 8 bits. The states represent the user space
+ * point of view.
+ */
+#define UMCG_TASK_NONE 0
+#define UMCG_TASK_RUNNING 1
+#define UMCG_TASK_IDLE 2
+#define UMCG_TASK_BLOCKED 3
+
+/* The first byte: RUNNING, IDLE, or BLOCKED. */
+#define UMCG_TASK_STATE_MASK 0xff
+
+/* UMCG task state flags, bits 8-15 */
+
+/*
+ * UMCG_TF_LOCKED: locked by the userspace in preparation to calling umcg_wait.
+ */
+#define UMCG_TF_LOCKED (1 << 8)
+
+/*
+ * UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted.
+ */
+#define UMCG_TF_PREEMPTED (1 << 9)
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+ /**
+ * @state: the current state of the UMCG task described by this struct.
+ *
+ * Readable/writable by both the kernel and the userspace.
+ *
+ * UMCG task state:
+ * bits 0 - 7: task state;
+ * bits 8 - 15: state flags;
+ * bits 16 - 23: reserved; must be zeroes;
+ * bits 24 - 31: for userspace use.
+ */
+ uint32_t state; /* r/w */
+
+ /**
+ * @next_tid: the TID of the UMCG task that should be context-switched
+ * into in sys_umcg_wait(). Can be zero.
+ *
+ * Running UMCG workers must have next_tid set to point to IDLE
+ * UMCG servers.
+ *
+ * Read-only for the kernel, read/write for the userspace.
+ */
+ uint32_t next_tid; /* r */
+
+ /**
+ * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
+ *
+ * Readable/writable by both the kernel and the userspace: the
+ * kernel adds items to the list, the userspace removes them.
+ */
+ uint64_t idle_workers_ptr; /* r/w */
+
+ /**
+ * @idle_server_tid_ptr: a pointer pointing to a single idle server.
+ * Readonly.
+ */
+ uint64_t idle_server_tid_ptr; /* r */
+} __attribute__((packed, aligned(8 * sizeof(__u64))));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER: register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER: register the current task as a UMCG worker
+ */
+enum umcg_ctl_flag {
+ UMCG_CTL_REGISTER = 0x00001,
+ UMCG_CTL_UNREGISTER = 0x00002,
+ UMCG_CTL_WORKER = 0x10000,
+};
+
+/**
+ * enum umcg_wait_flag - flags to pass to sys_umcg_wait
+ * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
+ * @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
+ * (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
+ * must be set.
+ */
+enum umcg_wait_flag {
+ UMCG_WAIT_WAKE_ONLY = 1,
+ UMCG_WAIT_WF_CURRENT_CPU = 2,
+};
+
+/* See Documentation/userspace-api/umcg.[txt|rst].*/
+#define UMCG_IDLE_NODE_PENDING (1ULL)
+
+#endif /* _UAPI_LINUX_UMCG_H */
diff --git a/init/Kconfig b/init/Kconfig
index a61c92066c2e..c15a50a61ba6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1662,6 +1662,16 @@ config MEMBARRIER
If unsure, say Y.
+config UMCG
+ bool "Enable User Managed Concurrency Groups API"
+ depends on X86_64
+ default n
+ help
+ Enable User Managed Concurrency Groups API, which form the basis
+ for an in-process M:N userspace scheduling framework.
+ At the moment this is an experimental/RFC feature that is not
+ guaranteed to be backward-compatible.
+
config KALLSYMS
bool "Load all symbols for debugging/ksymoops" if EXPERT
default y
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bf16395b9e13..f3cd335ab513 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -173,6 +173,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
if (ti_work & _TIF_NOTIFY_RESUME) {
tracehook_notify_resume(regs);
+ umcg_handle_notify_resume(); /* might sleep */
rseq_handle_notify_resume(NULL, regs);
}
diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..fdd4e923cca9 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -744,6 +744,8 @@ void __noreturn do_exit(long code)
if (unlikely(!tsk->pid))
panic("Attempted to kill the idle task!");
+ umcg_handle_exit();
+
/*
* If do_exit is called because this processes oopsed, it's possible
* that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 978fcfca5871..e4e481eee1b7 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
obj-$(CONFIG_PSI) += psi.o
obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 12a9d053e724..c9133cf153b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4159,6 +4159,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
#endif
+#ifdef CONFIG_UMCG
+ umcg_clear_child(p);
+#endif
}
DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6105,10 +6108,14 @@ static inline void sched_submit_work(struct task_struct *tsk)
* in the possible wakeup of a kworker and because wq_worker_sleeping()
* requires it.
*/
- if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
preempt_disable();
if (task_flags & PF_WQ_WORKER)
wq_worker_sleeping(tsk);
+#ifdef CONFIG_UMCG
+ else if (task_flags & PF_UMCG_WORKER)
+ umcg_wq_worker_sleeping(tsk);
+#endif
else
io_wq_worker_sleeping(tsk);
preempt_enable_no_resched();
@@ -6127,9 +6134,13 @@ static inline void sched_submit_work(struct task_struct *tsk)
static void sched_update_worker(struct task_struct *tsk)
{
- if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
if (tsk->flags & PF_WQ_WORKER)
wq_worker_running(tsk);
+#ifdef CONFIG_UMCG
+ else if (tsk->flags & PF_UMCG_WORKER)
+ umcg_wq_worker_running(tsk);
+#endif
else
io_wq_worker_running(tsk);
}
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
new file mode 100644
index 000000000000..aa4dbb31c425
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,745 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ *
+ * See Documentation/userspace-api/umcg.[txt|rst] for detals.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+
+#include "sched.h"
+#include "umcg_uaccess.h"
+
+/**
+ * umcg_pin_pages: pin pages containing struct umcg_task of this worker
+ * and its server.
+ *
+ * The pages are pinned when the worker exits to the userspace and unpinned
+ * when the worker is in sched_submit_work(), i.e. when the worker is
+ * about to be removed from its runqueue. Thus at most NR_CPUS UMCG pages
+ * are pinned at any one time across the whole system.
+ */
+static int umcg_pin_pages(u32 server_tid)
+{
+ struct umcg_task __user *worker_ut = current->umcg_task;
+ struct umcg_task __user *server_ut = NULL;
+ struct task_struct *tsk;
+
+ rcu_read_lock();
+ tsk = find_task_by_vpid(server_tid);
+ if (tsk)
+ server_ut = READ_ONCE(tsk->umcg_task);
+ rcu_read_unlock();
+
+ if (!server_ut)
+ return -EINVAL;
+
+ if (READ_ONCE(current->mm) != READ_ONCE(tsk->mm))
+ return -EINVAL;
+
+ tsk = current;
+
+ /* worker_ut is stable, don't need to repin */
+ if (!tsk->pinned_umcg_worker_page)
+ if (1 != pin_user_pages_fast((unsigned long)worker_ut, 1, 0,
+ &tsk->pinned_umcg_worker_page))
+ return -EFAULT;
+
+ /* server_ut may change, need to repin */
+ if (tsk->pinned_umcg_server_page) {
+ unpin_user_page(tsk->pinned_umcg_server_page);
+ tsk->pinned_umcg_server_page = NULL;
+ }
+
+ if (1 != pin_user_pages_fast((unsigned long)server_ut, 1, 0,
+ &tsk->pinned_umcg_server_page))
+ return -EFAULT;
+
+ return 0;
+}
+
+static void umcg_unpin_pages(void)
+{
+ struct task_struct *tsk = current;
+
+ if (tsk->pinned_umcg_worker_page)
+ unpin_user_page(tsk->pinned_umcg_worker_page);
+ if (tsk->pinned_umcg_server_page)
+ unpin_user_page(tsk->pinned_umcg_server_page);
+
+ tsk->pinned_umcg_worker_page = NULL;
+ tsk->pinned_umcg_server_page = NULL;
+}
+
+static void umcg_clear_task(struct task_struct *tsk)
+{
+ /*
+ * This is either called for the current task, or for a newly forked
+ * task that is not yet running, so we don't need strict atomicity
+ * below.
+ */
+ if (tsk->umcg_task) {
+ WRITE_ONCE(tsk->umcg_task, NULL);
+
+ /* These can be simple writes - see the commment above. */
+ tsk->pinned_umcg_worker_page = NULL;
+ tsk->pinned_umcg_server_page = NULL;
+ tsk->flags &= ~PF_UMCG_WORKER;
+ }
+}
+
+/* Called for a forked or execve-ed child. */
+void umcg_clear_child(struct task_struct *tsk)
+{
+ umcg_clear_task(tsk);
+}
+
+/* Called both by normally (unregister) and abnormally exiting workers. */
+void umcg_handle_exiting_worker(void)
+{
+ umcg_unpin_pages();
+ umcg_clear_task(current);
+}
+
+/**
+ * sys_umcg_ctl: (un)register the current task as a UMCG task.
+ * @flags: ORed values from enum umcg_ctl_flag; see below;
+ * @self: a pointer to struct umcg_task that describes this
+ * task and governs the behavior of sys_umcg_wait if
+ * registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ * UMCG workers:
+ * - self->state must be UMCG_TASK_IDLE
+ * - @flags & UMCG_CTL_WORKER
+ * UMCG servers:
+ * - self->state must be UMCG_TASK_RUNNING
+ * - !(@flags & UMCG_CTL_WORKER)
+ *
+ * All tasks:
+ * - self->next_tid must be zero
+ *
+ * If the conditions above are met, sys_umcg_ctl() immediately returns
+ * if the registered task is a server; a worker will be added to
+ * idle_workers_ptr, and the worker put to sleep; an idle server
+ * from idle_server_tid_ptr will be woken, if present.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
+ * is a UMCG worker, the userspace is responsible for waking its
+ * server (before or after calling sys_umcg_ctl).
+ *
+ * Return:
+ * 0 - success
+ * -EFAULT - failed to read @self
+ * -EINVAL - some other error occurred
+ */
+SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self)
+{
+ struct umcg_task ut;
+
+ if (flags == UMCG_CTL_UNREGISTER) {
+ if (self || !current->umcg_task)
+ return -EINVAL;
+
+ if (current->flags & PF_UMCG_WORKER)
+ umcg_handle_exiting_worker();
+ else
+ umcg_clear_task(current);
+
+ return 0;
+ }
+
+ /* Register the current task as a UMCG task. */
+ if (!(flags & UMCG_CTL_REGISTER))
+ return -EINVAL;
+
+ flags &= ~UMCG_CTL_REGISTER;
+ if (flags && flags != UMCG_CTL_WORKER)
+ return -EINVAL;
+
+ if (current->umcg_task || !self)
+ return -EINVAL;
+
+ if (copy_from_user(&ut, self, sizeof(ut)))
+ return -EFAULT;
+
+ if (ut.next_tid)
+ return -EINVAL;
+
+ if (flags == UMCG_CTL_WORKER) {
+ if (ut.state != UMCG_TASK_BLOCKED)
+ return -EINVAL;
+
+ WRITE_ONCE(current->umcg_task, self);
+ current->flags |= PF_UMCG_WORKER;
+
+ set_tsk_need_resched(current);
+ return 0;
+ }
+
+ /* This is a server task. */
+ if (ut.state != UMCG_TASK_RUNNING)
+ return -EINVAL;
+
+ WRITE_ONCE(current->umcg_task, self);
+ return 0;
+}
+
+/**
+ * handle_timedout_worker - make sure the worker is added to idle_workers
+ * upon a "clean" timeout.
+ */
+static int handle_timedout_worker(struct umcg_task __user *self)
+{
+ u32 prev_state, next_state;
+ int ret;
+
+ if (get_user(prev_state, &self->state))
+ return -EFAULT;
+
+ if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE) {
+ /* TODO: should we care here about TF_LOCKED or TF_PREEMPTED? */
+
+ next_state = prev_state & ~UMCG_TASK_STATE_MASK;
+ next_state |= UMCG_TASK_BLOCKED;
+
+ ret = cmpxchg_user_32(&self->state, &prev_state, next_state);
+ if (ret)
+ return ret;
+
+ return -ETIMEDOUT;
+ }
+
+ return 0; /* Not really timed out. */
+}
+
+/**
+ * umcg_idle_loop - sleep until the current task becomes RUNNING or a timeout
+ * @abs_timeout - absolute timeout in nanoseconds; zero => no timeout
+ *
+ * The function marks the current task as INTERRUPTIBLE and calls
+ * schedule(). It returns when either the timeout expires or
+ * the UMCG state of the task becomes RUNNING.
+ *
+ * Note: because UMCG workers should not be running WITHOUT attached servers,
+ * and because servers should not be running WITH attached workers,
+ * the function returns only on fatal signal pending and ignores/flushes
+ * all other signals.
+ */
+static int umcg_idle_loop(u64 abs_timeout)
+{
+ int ret;
+ struct hrtimer_sleeper timeout;
+ struct umcg_task __user *self = current->umcg_task;
+
+ if (abs_timeout) {
+ hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
+ HRTIMER_MODE_ABS);
+
+ hrtimer_set_expires_range_ns(&timeout.timer, (s64)abs_timeout,
+ current->timer_slack_ns);
+ }
+
+ while (true) {
+ u32 umcg_state;
+
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ smp_mb(); /* Order with set_current_state() above. */
+ ret = -EFAULT;
+ if (get_user(umcg_state, &self->state)) {
+ set_current_state(TASK_RUNNING);
+ goto out;
+ }
+
+ ret = 0;
+ if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
+ set_current_state(TASK_RUNNING);
+ goto out;
+ }
+
+ if (abs_timeout)
+ hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
+
+ if (!abs_timeout || timeout.task) {
+ /*
+ * Clear PF_UMCG_WORKER to elide workqueue handlers.
+ */
+ const bool worker = current->flags & PF_UMCG_WORKER;
+
+ if (worker)
+ current->flags &= ~PF_UMCG_WORKER;
+
+ /*
+ * Note: freezable_schedule() here is not appropriate
+ * as umcg_idle_loop can be called from rwsem locking
+ * context (via workqueue handlers), which may
+ * trigger a lockdep warning for mmap_lock.
+ */
+ schedule();
+
+ if (worker)
+ current->flags |= PF_UMCG_WORKER;
+ }
+ __set_current_state(TASK_RUNNING);
+
+ /*
+ * Check for timeout before checking the state, as workers
+ * are not going to return from schedule() unless
+ * they are RUNNING.
+ */
+ ret = -ETIMEDOUT;
+ if (abs_timeout && !timeout.task)
+ goto out;
+
+ ret = -EFAULT;
+ if (get_user(umcg_state, &self->state))
+ goto out;
+
+ ret = 0;
+ if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING)
+ goto out;
+
+ ret = -EINTR;
+ if (fatal_signal_pending(current))
+ goto out;
+
+ if (signal_pending(current))
+ flush_signals(current);
+ }
+
+out:
+ if (abs_timeout) {
+ hrtimer_cancel(&timeout.timer);
+ destroy_hrtimer_on_stack(&timeout.timer);
+ }
+
+ /* Workers must go through workqueue handlers upon wakeup. */
+ if (current->flags & PF_UMCG_WORKER) {
+ if (ret == -ETIMEDOUT)
+ ret = handle_timedout_worker(self);
+
+ set_tsk_need_resched(current);
+ }
+
+ return ret;
+}
+
+/*
+ * Try to wake up. May be called with preempt_disable set. May be called
+ * cross-process.
+ *
+ * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
+ * ordering logic.
+ */
+static int umcg_ttwu(u32 next_tid, int wake_flags)
+{
+ struct task_struct *next;
+
+ rcu_read_lock();
+ next = find_task_by_vpid(next_tid);
+ if (!next || !(READ_ONCE(next->umcg_task))) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+
+ /* Note: next does not necessarily share mm with current. */
+
+ try_to_wake_up(next, TASK_NORMAL, wake_flags); /* Result ignored. */
+ rcu_read_unlock();
+
+ return 0;
+}
+
+/*
+ * At the moment, umcg_do_context_switch simply wakes up @next with
+ * WF_CURRENT_CPU and puts the current task to sleep. May be called cross-mm.
+ *
+ * In the future an optimization will be added to adjust runtime accounting
+ * so that from the kernel scheduling perspective the two tasks are
+ * essentially treated as one. In addition, the context switch may be performed
+ * right here on the fast path, instead of going through the wake/wait pair.
+ */
+static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout)
+{
+ struct task_struct *next;
+
+ rcu_read_lock();
+ next = find_task_by_vpid(next_tid);
+ if (!next) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+
+ /* Note: next does not necessarily share mm with current. */
+
+ /* TODO: instead of wake + sleep, do a context switch. */
+ try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU); /* Result ignored. */
+ rcu_read_unlock();
+
+ return umcg_idle_loop(abs_timeout);
+}
+
+/**
+ * sys_umcg_wait: put the current task to sleep and/or wake another task.
+ * @flags: zero or a value from enum umcg_wait_flag.
+ * @abs_timeout: when to wake the task, in nanoseconds; zero for no timeout.
+ *
+ * @self->state must be UMCG_TASK_IDLE (where @self is current->umcg_task)
+ * if !(@flags & UMCG_WAIT_WAKE_ONLY).
+ *
+ * If @self->next_tid is not zero, it must point to an IDLE UMCG task.
+ * The userspace must have changed its state from IDLE to RUNNING
+ * before calling sys_umcg_wait() in the current task. This "next"
+ * task will be woken (context-switched-to on the fast path) when the
+ * current task is put to sleep.
+ *
+ * See Documentation/userspace-api/umcg.[txt|rst] for detals.
+ *
+ * Return:
+ * 0 - OK;
+ * -ETIMEDOUT - the timeout expired;
+ * -EFAULT - failed accessing struct umcg_task __user of the current
+ * task;
+ * -ESRCH - the task to wake not found or not a UMCG task;
+ * -EINVAL - another error happened (e.g. bad @flags, or the current
+ * task is not a UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout)
+{
+ struct umcg_task __user *self = current->umcg_task;
+ u32 next_tid;
+
+ if (!self)
+ return -EINVAL;
+
+ if (get_user(next_tid, &self->next_tid))
+ return -EFAULT;
+
+ if (flags & UMCG_WAIT_WAKE_ONLY) {
+ if (!next_tid || abs_timeout)
+ return -EINVAL;
+
+ flags &= ~UMCG_WAIT_WAKE_ONLY;
+ if (flags & ~UMCG_WAIT_WF_CURRENT_CPU)
+ return -EINVAL;
+
+ return umcg_ttwu(next_tid, flags & UMCG_WAIT_WF_CURRENT_CPU ?
+ WF_CURRENT_CPU : 0);
+ }
+
+ /* Unlock the worker, if locked. */
+ if (current->flags & PF_UMCG_WORKER) {
+ u32 umcg_state;
+
+ if (get_user(umcg_state, &self->state))
+ return -EFAULT;
+
+ if ((umcg_state & UMCG_TF_LOCKED) && cmpxchg_user_32(
+ &self->state, &umcg_state,
+ umcg_state & ~UMCG_TF_LOCKED))
+ return -EFAULT;
+ }
+
+ if (next_tid)
+ return umcg_do_context_switch(next_tid, abs_timeout);
+
+ return umcg_idle_loop(abs_timeout);
+}
+
+/*
+ * NOTE: all code below is called from workqueue submit/update, or
+ * syscall exit to usermode loop, so all errors result in the
+ * termination of the current task (via SIGKILL).
+ */
+
+/* Returns true on success, false on _any_ error. */
+static bool mark_server_running(u32 server_tid, bool may_sleep)
+{
+ struct umcg_task __user *ut_server = NULL;
+ u32 state = UMCG_TASK_IDLE;
+ struct task_struct *tsk;
+
+ rcu_read_lock();
+ tsk = find_task_by_vpid(server_tid);
+ if (tsk)
+ ut_server = READ_ONCE(tsk->umcg_task);
+ rcu_read_unlock();
+
+ if (!ut_server)
+ return false;
+
+ if (READ_ONCE(current->mm) != READ_ONCE(tsk->mm))
+ return false;
+
+ if (may_sleep)
+ return !cmpxchg_user_32(&ut_server->state, &state, UMCG_TASK_RUNNING);
+
+ return !cmpxchg_user_32_nosleep(&ut_server->state, &state, UMCG_TASK_RUNNING);
+}
+
+/*
+ * Called by sched_submit_work() for UMCG workers from within preempt_disable()
+ * context. In the common case, the worker's state changes RUNNING => BLOCKED,
+ * and its server's state changes IDLE => RUNNING, and the server is ttwu-ed.
+ *
+ * Under some conditions (e.g. the worker is "locked", see
+ * /Documentation/userspace-api/umcg.[txt|rst] for more details), the
+ * function does nothing.
+ */
+static void __umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+ struct umcg_task __user *ut_worker = tsk->umcg_task;
+ u32 prev_state, next_state, server_tid;
+ bool preempted = false;
+ int ret;
+
+ if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+ return;
+
+ /* Sometimes "locked" workers run without servers. */
+ if (unlikely(!tsk->pinned_umcg_server_page))
+ return;
+
+ smp_mb(); /* The userspace may change the state concurrently. */
+ if (get_user_nosleep(prev_state, &ut_worker->state))
+ goto die; /* EFAULT */
+
+ if (prev_state & UMCG_TF_LOCKED)
+ return;
+
+ if ((prev_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING)
+ return; /* the worker is in umcg_wait */
+
+retry_once:
+ next_state = prev_state & ~UMCG_TASK_STATE_MASK;
+ next_state |= UMCG_TASK_BLOCKED;
+ preempted = prev_state & UMCG_TF_PREEMPTED;
+
+ ret = cmpxchg_user_32_nosleep(&ut_worker->state, &prev_state, next_state);
+ if (ret == -EAGAIN) {
+ if (preempted)
+ goto die; /* Preemption can only happen once. */
+
+ if (prev_state != (UMCG_TASK_RUNNING | UMCG_TF_PREEMPTED))
+ goto die; /* Only preemption can happen. */
+
+ preempted = true;
+ goto retry_once;
+ }
+ if (ret)
+ goto die; /* EFAULT */
+
+ if (get_user_nosleep(server_tid, &ut_worker->next_tid))
+ goto die; /* EFAULT */
+
+ if (!server_tid)
+ return; /* Waking a waiting worker leads here. */
+
+ /* The idle server's wait may timeout. */
+ /* TODO: make a smarter context switch below when available. */
+ if (mark_server_running(server_tid, false))
+ umcg_ttwu(server_tid, WF_CURRENT_CPU);
+
+ return;
+
+die:
+ pr_warn("umcg_wq_worker_sleeping: killing task %d\n", current->pid);
+ force_sig(SIGKILL);
+}
+
+/* Called from sched_submit_work() with preempt_disable. */
+void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+ __umcg_wq_worker_sleeping(tsk);
+ umcg_unpin_pages();
+}
+
+/**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
+ *
+ * Returns true on success, false on a fatal failure.
+ *
+ * See Documentation/userspace-api/umcg.[txt|rst] for details.
+ */
+static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+{
+ u64 __user *node = &ut_worker->idle_workers_ptr;
+ u64 __user *head_ptr;
+ u64 first = (u64)node;
+ u64 head;
+
+ if (get_user(head, node) || !head)
+ return false;
+
+ head_ptr = (u64 __user *)head;
+
+ if (put_user(UMCG_IDLE_NODE_PENDING, node))
+ return false;
+
+ if (xchg_user_64(head_ptr, &first))
+ return false;
+
+ if (put_user(first, node))
+ return false;
+
+ return true;
+}
+
+/**
+ * get_idle_server - retrieve an idle server, if present.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+static bool get_idle_server(struct umcg_task __user *ut_worker, u32 *server_tid)
+{
+ u64 server_tid_ptr;
+ u32 tid;
+ int ret;
+
+ *server_tid = 0; /* Empty result is OK. */
+
+ if (get_user(server_tid_ptr, &ut_worker->idle_server_tid_ptr))
+ return false;
+
+ if (!server_tid_ptr)
+ return false;
+
+ tid = 0;
+ ret = xchg_user_32((u32 __user *)server_tid_ptr, &tid);
+
+ if (ret)
+ return false;
+
+ if (tid && mark_server_running(tid, true))
+ *server_tid = tid;
+
+ return true;
+}
+
+/*
+ * Returns true to wait for the userspace to schedule this worker, false
+ * to return to the userspace. In the common case, enqueues the worker
+ * to idle_workers_ptr list and wakes the idle server (if present).
+ */
+static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid)
+{
+ struct umcg_task __user *ut_worker = tsk->umcg_task;
+ u32 prev_state, next_state;
+ int ret = 0;
+
+ *server_tid = 0;
+
+ if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+ return false;
+
+ if (fatal_signal_pending(tsk))
+ return false;
+
+ smp_mb(); /* The userspace may concurrently modify the worker's state. */
+ if (get_user(prev_state, &ut_worker->state))
+ goto die;
+
+ if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
+ u32 tid;
+
+ if (prev_state & UMCG_TF_LOCKED)
+ return true; /* Wakeup: wait but don't enqueue. */
+
+ smp_mb(); /* Order getting state and getting server_tid */
+
+ if (get_user(tid, &ut_worker->next_tid))
+ goto die;
+
+ *server_tid = tid;
+
+ if (prev_state & UMCG_TF_PREEMPTED) {
+ if (!tid)
+ goto die; /* PREEMPTED workers must have a server. */
+
+ /* Always enqueue preempted workers. */
+ if (!mark_server_running(tid, true))
+ goto die;
+ } else if (tid)
+ return false; /* pass-through: RUNNING with a server. */
+
+ /* If !PREEMPTED, the worker gets here via UMCG_WAIT_WAKE_ONLY */
+ } else if (unlikely((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
+ (prev_state & UMCG_TF_LOCKED)))
+ return false; /* The worker prepares to sleep or to unregister. */
+
+ if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE)
+ return true; /* the worker called umcg_wait(); don't enqueue */
+
+ next_state = prev_state & ~UMCG_TASK_STATE_MASK;
+ next_state |= UMCG_TASK_IDLE;
+
+ if (prev_state != next_state)
+ ret = cmpxchg_user_32(&ut_worker->state, &prev_state, next_state);
+ if (ret)
+ goto die;
+
+ if (!enqueue_idle_worker(ut_worker))
+ goto die;
+
+ smp_mb(); /* Order enqueuing the worker with getting the server. */
+ if (!(*server_tid) && !get_idle_server(ut_worker, server_tid))
+ goto die;
+
+ return true;
+
+die:
+ pr_warn("umcg_process_waking_worker: killing task %d\n", current->pid);
+ force_sig(SIGKILL);
+ return false;
+}
+
+/*
+ * Called from sched_update_worker(): defer all work until later, as
+ * sched_update_worker() may be called with in-kernel locks held.
+ */
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+ set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
+}
+
+/* Called via TIF_NOTIFY_RESUME flag from exit_to_user_mode_loop. */
+void umcg_handle_resuming_worker(void)
+{
+ u32 server_tid;
+
+ /* Avoid recursion by removing PF_UMCG_WORKER */
+ current->flags &= ~PF_UMCG_WORKER;
+
+ do {
+ bool should_wait;
+
+ should_wait = process_waking_worker(current, &server_tid);
+
+ if (!should_wait)
+ break;
+
+ if (server_tid)
+ umcg_do_context_switch(server_tid, 0);
+ else
+ umcg_idle_loop(0);
+ } while (true);
+
+ if (server_tid && umcg_pin_pages(server_tid))
+ goto die;
+
+ if (!server_tid) /* No server => no reason to pin pages. */
+ umcg_unpin_pages();
+
+ goto out;
+
+die:
+ pr_warn("%s: killing task %d\n", __func__, current->pid);
+ force_sig(SIGKILL);
+out:
+ current->flags |= PF_UMCG_WORKER;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..cd1be6356e42 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -272,6 +272,10 @@ COND_SYSCALL(landlock_create_ruleset);
COND_SYSCALL(landlock_add_rule);
COND_SYSCALL(landlock_restrict_self);
+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+
/* arch/example/kernel/sys_example.c */
/* mm/fadvise.c */
--
2.25.1
Document User Managed Concurrency Groups syscalls, data structures,
state transitions, etc.
This is a text version of umcg.rst.
Signed-off-by: Peter Oskolkov <[email protected]>
---
Documentation/userspace-api/umcg.txt | 537 +++++++++++++++++++++++++++
1 file changed, 537 insertions(+)
create mode 100644 Documentation/userspace-api/umcg.txt
diff --git a/Documentation/userspace-api/umcg.txt b/Documentation/userspace-api/umcg.txt
new file mode 100644
index 000000000000..ef055bc40fc6
--- /dev/null
+++ b/Documentation/userspace-api/umcg.txt
@@ -0,0 +1,537 @@
+UMCG USERSPACE API
+
+User Managed Concurrency Groups (UMCG) is an M:N threading
+subsystem/toolkit that lets user space application developers implement
+in-process user space schedulers.
+
+
+CONTENTS
+
+ WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
+ REQUIREMENTS
+ UMCG KERNEL API
+ SERVERS
+ WORKERS
+ UMCG TASK STATES
+ STRUCT UMCG_TASK
+ SYS_UMCG_CTL()
+ SYS_UMCG_WAIT()
+ STATE TRANSITIONS
+ SERVER-ONLY USE CASES
+
+
+WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
+
+Linux kernel's CFS scheduler is designed for the "common" use case, with
+efficiency/throughput in mind. Work isolation and workloads of different
+"urgency" are addressed by tools such as cgroups, CPU affinity, priorities,
+etc., which are difficult or impossible to efficiently use in-process.
+
+For example, a single DBMS process may receive tens of thousands requests
+per second; some of these requests may have strong response latency
+requirements as they serve live user requests (e.g. login authentication);
+some of these requests may not care much about latency but must be served
+within a certain time period (e.g. an hourly aggregate usage report); some
+of these requests are to be served only on a best-effort basis and can be
+NACKed under high load (e.g. an exploratory research/hypothesis testing
+workload).
+
+Beyond different work item latency/throughput requirements as outlined
+above, the DBMS may need to provide certain guarantees to different users;
+for example, user A may "reserve" 1 CPU for their high-priority/low latency
+requests, 2 CPUs for mid-level throughput workloads, and be allowed to send
+as many best-effort requests as possible, which may or may not be served,
+depending on the DBMS load. Besides, the best-effort work, started when the
+load was low, may need to be delayed if suddenly a large amount of
+higher-priority work arrives. With hundreds or thousands of users like
+this, it is very difficult to guarantee the application's responsiveness
+using standard Linux tools while maintaining high CPU utilization.
+
+Gaming is another use case: some in-process work must be completed before a
+certain deadline dictated by frame rendering schedule, while other work
+items can be delayed; some work may need to be cancelled/discarded because
+the deadline has passed; etc.
+
+User Managed Concurrency Groups is an M:N threading toolkit that allows
+constructing user space schedulers designed to efficiently manage
+heterogeneous in-process workloads described above while maintaining high
+CPU utilization (95%+).
+
+
+REQUIREMENTS
+
+One relatively established way to design high-efficiency, low-latency
+systems is to split all work into small on-cpu work items, with
+asynchronous I/O and continuations, all executed on a thread pool with the
+number of threads not exceeding the number of available CPUs. Although this
+approach works, it is quite difficult to develop and maintain such a
+system, as, for example, small continuations are difficult to piece
+together when debugging. Besides, such asynchronous callback-based systems
+tend to be somewhat cache-inefficient, as continuations can get scheduled
+on any CPU regardless of cache locality.
+
+M:N threading and cooperative user space scheduling enables controlled CPU
+usage (minimal OS preemption), synchronous coding style, and better cache
+locality.
+
+Specifically:
+
+* a variable/fluctuating number M of "application" threads should be
+ "scheduled over" a relatively fixed number N of "kernel" threads, where
+ N is less than or equal to the number of CPUs available;
+* only those application threads that are attached to kernel threads are
+ scheduled "on CPU";
+* application threads should be able to cooperatively
+ yield to each other;
+* when an application thread blocks in kernel (e.g. in I/O), this becomes
+ a scheduling event ("block") that the userspace scheduler should be able
+ to efficiently detect, and reassign a waiting application thread to the
+ freeded "kernel" thread;
+* when a blocked application thread wakes (e.g. its I/O operation
+ completes), this even ("wake") should also be detectable by the
+ userspace scheduler, which should be able to either quickly dispatch the
+ newly woken thread to an idle "kernel" thread or, if all "kernel"
+ threads are busy, put it in the waiting queue;
+* in addition to the above, it would be extremely useful for a separate
+ in-process "watchdog" facility to be able to monitor the state of each
+ of the M+N threads, and to intervene in case of runaway workloads
+ (interrupt/preempt).
+
+
+UMCG KERNEL API
+
+Based on the requrements above, UMCG kernel API is build around the
+following ideas:
+
+* UMCG server: a task/thread representing "kernel threads", or CPUs from
+ the requirements above;
+* UMCG worker: a task/thread representing "application threads", to be
+ scheduled over servers;
+* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
+ server or a worker) can be in;
+* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
+ can be ORed with the task state to communicate additional information to
+ the kernel;
+* struct umcg_task: a per-task userspace set of data fields, usually
+ residing in the TLS, that fully reflects the current task's UMCG state
+ and controls the way the kernel manages the task;
+* sys_umcg_ctl(): a syscall used to register the current task/thread as a
+ server or a worker, or to unregister a UMCG task;
+* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
+ wake another task, pontentially context-switching between the two tasks
+ on-CPU synchronously.
+
+
+SERVERS
+
+When a task/thread is registered as a server, it is in RUNNING state and
+behaves like any other normal task/thread. In addition, servers can
+interact with other UMCG tasks via sys_umcg_wait():
+
+* servers can voluntarily suspend their execution (wait), becoming IDLE;
+* servers can wake other IDLE servers;
+* servers can context-switch between each other.
+
+Note that if a server blocks in the kernel not via sys_umcg_wait(), it
+still retains its RUNNING state.
+
+Also note that servers can be used for fast on-CPU context switching across
+process boundaries; server-worker interactions assume they belong to the
+same mm.
+
+See the next section on how servers interact with workers.
+
+
+WORKERS
+
+A worker cannot be RUNNING without having a server associated with it, so
+when a task is first registered as a worker, it enters the IDLE state.
+
+* a worker becomes RUNNING when a server calls sys_umcg_wait to
+ context-switch into it; the server goes IDLE, and the worker becomes
+ RUNNING in its place;
+* when a running worker blocks in the kernel, it becomes BLOCKED, its
+ associated server becomes RUNNING and the server's sys_umcg_wait() call
+ from the bullet above returns; this transition is sometimes called
+ "block detection";
+* when the syscall on which a BLOCKED worker completes, the worker
+ becomes IDLE and is added to the list of idle workers; if there is an
+ idle server waiting, the kernel wakes it; this transition is sometimes
+ called "wake detection";
+* running workers can voluntarily suspend their execution (wait),
+ becoming IDLE; their associated servers are woken;
+* a RUNNING worker can context-switch with an IDLE worker; the server of
+ the switched-out worker is transferred to the switched-in worker;
+* any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
+ this is a server running the worker as described in the first bullet in
+ this list, the worker remain IDLE but is added to the idle workers list;
+ this "wake" operation exists for completeness, to make sure
+ wait/wake/context-switch operations are available for all UMCG tasks;
+* the userspace can preempt a RUNNING worker by marking it
+ RUNNING|PREEMPTED and sending a signal to it; the userspace should have
+ installed a NOP signal handler for the signal; the kernel will then
+ transition the worker into IDLE|PREEMPTED state and wake its associated
+ server.
+
+
+UMCG TASK STATES
+
+Important: all state transitions described below involve at least two
+steps: the change of the state field in struct umcg_task, for example
+RUNNING to IDLE, and the corresponding change in struct task_struct state,
+for example a transition between the task running on CPU and being
+descheduled and removed from the kernel runqueue. The key principle of UMCG
+API design is that the party initiating the state transition modifies the
+state variable.
+
+For example, a task going IDLE first changes its state from RUNNING to IDLE
+in the userpace and then calls sys_umcg_wait(), which completes the
+transition.
+
+Note on documentation: in include/uapi/linux/umcg.h, task states have the
+form UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, etc. In this document these are
+usually referred to simply RUNNING and BLOCKED, unless it creates
+ambiguity. Task state flags, e.g. UMCG_TF_PREEMPTED, are treated similarly.
+
+UMCG task states reflect the view from the userspace, rather than from the
+kernel. There are three fundamental task states:
+
+* RUNNING: indicates that the task is schedulable by the kernel; applies
+ to both servers and workers;
+* IDLE: indicates that the task is not schedulable by the kernel (see
+ umcg_idle_loop() in kernel/sched/umcg.c); applies to both servers and
+ workers;
+* BLOCKED: indicates that the worker is blocked in the kernel; does not
+ apply to servers.
+
+In addition to the three states above, two state flags help with state
+transitions:
+
+* LOCKED: the userspace is preparing the worker for a state transition
+ and "locks" the worker until the worker is ready for the kernel to act
+ on the state transition; used similarly to preempt_disable or
+ irq_disable in the kernel; applies only to workers in RUNNING or IDLE
+ state; RUNNING|LOCKED means "this worker is about to become RUNNING,
+ while IDLE|LOCKED means "this worker is about to become IDLE or
+ unregister;
+* PREEMPTED: the userspace indicates it wants the worker to be preempted;
+ there are no situations when both LOCKED and PREEMPTED flags are set at
+ the same time.
+
+
+STRUCT UMCG_TASK
+
+From include/uapi/linux/umcg.h:
+
+struct umcg_task {
+ uint32_t state; /* r/w */
+ uint32_t next_tid; /* r */
+ uint64_t idle_workers_ptr; /* r/w */
+ uint64_t idle_server_tid_ptr; /* r* */
+};
+
+Each UMCG task is identified by struct umcg_task, which is provided to the
+kernel when the task is registered via sys_umcg_ctl().
+
+* uint32_t state: the current state of the task this struct identifies,
+ as described in the previous section. Readable/writable by both the
+ kernel and the userspace.
+
+ bits 0 - 7: task state (RUNNING, IDLE, BLOCKED);
+ bits 8 - 15: state flags (LOCKED, PREEMPTED);
+ bits 16 - 23: reserved; must be zeroes;
+ bits 24 - 31: for userspace use.
+
+* uint32_t next_tid: contains the TID of the task to context-switch-into
+ in sys_umcg_wait(); can be zero; writable by the userspace, readable by
+ the kernel; if this is a RUNNING worker, this field contains the TID of
+ the server that should be woken when this worker blocks; see
+ sys_umcg_wait() for more details;
+
+* uint64_t idle_workers_ptr: this field forms a single-linked list of
+ idle workers: all RUNNING workers have this field set to point to the
+ head of the list (a pointer variable in the userspace).
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from BLOCKED to IDLE and adds the worker to
+ the top of the list of idle workers using this logic:
+
+ /* kernel side */
+ /**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr
+ * list/stack.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+ static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+ {
+ u64 __user *node = &ut_worker->idle_workers_ptr;
+ u64 __user *head_ptr;
+ u64 first = (u64)node;
+ u64 head;
+
+ if (get_user_nosleep(head, node) || !head)
+ return false;
+
+ head_ptr = (u64 __user *)head;
+
+ if (put_user_nosleep(UMCG_IDLE_NODE_PENDING, node))
+ return false;
+
+ if (xchg_user_64(head_ptr, &first))
+ return false;
+
+ if (put_user_nosleep(first, node))
+ return false;
+
+ return true;
+ }
+
+ In the userspace the list is cleared atomically using this logic:
+
+ /* userspace side */
+ uint64_t *idle_workers = (uint64_t *)*head;
+
+ atomic_exchange(&idle_workers, NULL);
+
+ The userspace re-points workers' idle_workers_ptr to the list head
+ variable before the worker is allowed to become RUNNING again.
+
+ When processing the idle workers list, the userspace should wait for
+ workers marked as UMCG_IDLE_NODE_PENDING to have the flag cleared (see
+ enqueue_idle_worker() above).
+
+* uint64_t idle_server_tid_ptr: points to a variable in the userspace
+ that points to an idle server, i.e. a server in IDLE state waiting in
+ sys_umcg_wait(); read-only; workers must have this field set; not used
+ in servers.
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from BLOCKED to IDLE, adds the worker to the
+ list of idle workers, and wakes the idle server if present; the kernel
+ atomically exchanges (*idle_server_tid_ptr) with 0, thus waking the idle
+ server, if present, only once. See State transitions below for more
+ details.
+
+
+SYS_UMCG_CTL()
+
+int sys_umcg_ctl(uint32_t flags, struct umcg_task *self) is used to
+register or unregister the current task as a worker or server. Flags can be
+one of the following:
+
+ UMCG_CTL_REGISTER: register a server;
+ UMCG_CTL_REGISTER | UMCG_CTL_WORKER: register a worker;
+ UMCG_CTL_UNREGISTER: unregister the current server or worker.
+
+When registering a task, self must point to struct umcg_task describing
+this server or worker; the pointer must remain valid until the task is
+unregistered.
+
+When registering a server, self->state must be RUNNING; all other fields in
+self must be zeroes.
+
+When registering a worker, self->state must be BLOCKED;
+self->idle_server_tid_ptr and self->idle_workers_ptr must be valid pointers
+as described in struct umcg_task; self->next_tid must be zero.
+
+When unregistering a task, self must be NULL.
+
+
+SYS_UMCG_WAIT()
+
+int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout) operates on
+registered UMCG servers and workers: struct umcg_task *self provided to
+sys_umcg_ctl() when registering the current task is consulted in addition
+to flags and abs_timeout parameters.
+
+The function can be used to perform one of the three operations:
+
+* wait: if self->next_tid is zero, sys_umcg_wait() puts the current
+ server or worker to sleep;
+* wake: if self->next_tid is not zero, and flags & UMCG_WAIT_WAKE_ONLY,
+ the task identified by next_tid is woken;
+* context switch: if self->next_tid is not zero, and !(flags &
+ UMCG_WAIT_WAKE_ONLY), the current task is put to sleep and the next task
+ is woken, synchronously switching between the tasks on the current CPU
+ on the fast path.
+
+Flags can be zero or a combination of the following values:
+
+* UMCG_WAIT_WAKE_ONLY: wake the next task, don't put the current task to
+ sleep;
+* UMCG_WAIT_WF_CURRENT_CPU: wake the next task on the curent CPU; this
+ flag has an effect only if UMCG_WAIT_WAKE_ONLY is set: context switching
+ is always attempted to happen on the curent CPU.
+
+The section below provides more details on how servers and workers interact
+via sys_umcg_wait(), during worker block/wake events, and during worker
+preemption.
+
+
+STATE TRANSITIONS
+
+As mentioned above, the key principle of UMCG state transitions is that the
+party initiating the state transition modifies the state of affected tasks.
+
+Below, "TASK:STATE" indicates a task T, where T can be either W for worker
+or S for server, in state S, where S can be one of the three states,
+potentially ORed with a state flag. Each individual state transition is an
+atomic operation (cmpxchg) unless indicated otherwise. Also note that the
+order of state transitions is important and is part of the contract between
+the userspace and the kernel. The kernel is free to kill the task (SIGKILL)
+if the contract is broken.
+
+Some worker state transitions below include adding LOCKED flag to worker
+state. This is done to indicate to the kernel that the worker is
+transitioning state and should not participate in the block/wake detection
+routines, which can happen due to interrupts/pagefaults/signals.
+
+IDLE|LOCKED means that a running worker is preparing to sleep, so
+interrupts should not lead to server wakeup; RUNNING|LOCKED means that an
+idle worker is going to be "scheduled to run", but may not yet have its
+server set up properly.
+
+Key state transitions:
+
+* server to worker context switch ("schedule a worker to run"):
+ S:RUNNING+W:IDLE => S:IDLE+W:RUNNING:
+ in the userspace, in the context of the server S running:
+ S:RUNNING => S:IDLE (mark self as idle)
+ W:IDLE => W:RUNNING|LOCKED (mark the worker as running)
+ W.next_tid := S.tid; S.next_tid := W.tid (link the server with
+ the worker)
+ W:RUNNING|LOCKED => W:RUNNING (unlock the worker)
+ S: sys_umcg_wait() (make the syscall)
+ the kernel context switches from the server to the worker; the
+ server sleeps until it becomes RUNNING during one of the
+ transitions below;
+
+* worker to server context switch (worker "yields"): S:IDLE+W:RUNNING =>
+S:RUNNING+W:IDLE:
+ in the userspace, in the context of the worker W running (note that
+ a running worker has its next_tid set to point to its server):
+ W:RUNNING => W:IDLE|LOCKED (mark self as idle)
+ S:IDLE => S:RUNNING (mark the server as running)
+ W: sys_umcg_wait() (make the syscall)
+ the kernel removes the LOCKED flag from the worker's state and
+ context switches from the worker to the server; the worker sleeps
+ until it becomes RUNNING;
+
+* worker to worker context switch: W1:RUNNING+W2:IDLE =>
+ W1:IDLE+W2:RUNNING:
+ in the userspace, in the context of W1 running:
+ W2:IDLE => W2:RUNNING|LOCKED (mark W2 as running)
+ W1:RUNNING => W1:IDLE|LOCKED (mark self as idle)
+ W2.next_tid := W1.next_tid; S.next_tid := W2.tid (transfer the
+ server W1 => W2)
+ W1:next_tid := W2.tid (indicate that W1 should context-switch
+ into W2)
+ W2:RUNNING|LOCKED => W2:RUNNING (unlock W2)
+ W1: sys_umcg_wait() (make the syscall)
+ same as above, the kernel removes the LOCKED flag from the W1's
+ state and context switches to next_tid;
+
+* worker wakeup: W:IDLE => W:RUNNING:
+ in the userspace, a server S can wake a worker W without "running"
+ it:
+ S:next_tid :=W.tid
+ W:next_tid := 0
+ W:IDLE => W:RUNNING
+ sys_umcg_wait(UMCG_WAIT_WAKE_ONLY) (make the syscall)
+ the kernel will wake the worker W; as the worker does not have a
+ server assigned, "wake detection" will happen, the worker will be
+ immediately marked as IDLE and added to idle workers list; an idle
+ server, if any, will be woken (see 'wake detection' below);
+
+ Note: if needed, it is possible for a worker to wake another
+ worker: the waker marks itself "IDLE|LOCKED", points its next_tid
+ to the wakee, makes the syscall, restores its server in next_tid,
+ marks itself as RUNNING.
+
+* block detection: worker blocks in the kernel: S:IDLE+W:RUNNING =>
+ S:RUNNING+W:BLOCKED:
+ when a worker blocks in the kernel in RUNNING state (not LOCKED),
+ before descheduling the task from the CPU the kernel performs
+ these operations:
+ W:RUNNING => W:BLOCKED
+ S := W.next_tid
+ S:IDLE => S:RUNNING
+ try_to_wake_up(S)
+ if any of the first three operations above fail, the worker is
+ killed via SIGKILL. Note that ttwu(S) is not required to succeed,
+ as the server may still be transitioning to sleep in
+ sys_umcg_wait(); before actually putting the server to sleep its
+ UMCG state is checked and, if it is RUNNING, sys_umcg_wait()
+ returns to the userspace;
+ if the worker has its LOCKED flag set, block detection does not
+ trigger, as the worker is assumed to be in the userspace
+ scheduling code.
+
+* wake detection: worker wakes in the kernel: W:BLOCKED => W:IDLE:
+ all workers' returns to the userspace are intercepted:
+ start: (a label)
+ if W:RUNNING & W.next_tid != 0: let the worker exit to the
+ userspace, as this is a RUNNING worker with a server;
+ W:* => W:IDLE (previously blocked or woken without servers
+ workers are not allowed to return to the userspace);
+ the worker is appended to W.idle_workers_ptr idle workers list;
+ S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S.RUNNING;
+ ttwu(S)
+ idle_loop(W): this is the same idle loop that sys_umcg_wait()
+ uses: it breaks only when the worker becomes RUNNING; when
+ the idle loop exits, it is assumed that the userspace has
+ properly removed the worker from the idle workers list
+ before marking it RUNNING;
+ goto start; (repeat from the beginning).
+
+ the logic above is a bit more complicated in the presence of
+ LOCKED or PREEMPTED flags, but the main invariants
+ stay the same:
+ only RUNNING workers with servers assigned are allowed to run
+ in the userspace (unless LOCKED);
+ newly IDLE workers are added to the idle workers list; any
+ user-initiated state change assumes the userspace
+ properly removed the worker from the list;
+ as with wake detection, any "breach of contract" by the
+ userspace will result in the task termination via SIGKILL.
+
+* worker preemption: S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED:
+ when the userspace wants to preempt a RUNNING worker, it changes it
+ state, atomically, RUNNING => RUNNING|PREEMPTED and sends a
+ signal to the worker via tgkill(); the signal handler, previously
+ set up by the userspace, can be a NOP (note that only RUNNING
+ workers can be preempted);
+
+ if the worker, at the moment the signal arrived, continued to be
+ running on-CPU in the userspace, the "wake detection" code will be
+ triggered that, in addition to what was described above, will
+ check if the worker is in RUNNING|PREEMPTED state:
+ W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED
+ S := W.next_tid
+ S:IDLE => S:RUNNING
+ try_to_wakeup(S)
+
+ if the signal arrives after the worker blocks in the kernel,
+ the "block detection" happened as described above, with the
+ following change:
+ W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED
+ S := W.next_tid
+ S:IDLE => S:RUNNING
+ try_to_wake_up(S)
+
+ in any case, the worker's server is woken, with its attached
+ worker (S.next_tid) either in BLOCKED|PREEMPTED or IDLE|PREEMPTED
+ state.
+
+
+SERVER-ONLY USE CASES
+
+Some workloads/applications may benefit from fast and synchronous on-CPU
+user-initiated context switches without the need for full userspace
+scheduling (block/wake detection). These applications can use "standalone"
+UMCG servers to wait/wake/context-switch, including across process
+boundaries.
+
+These "worker-less" operations involve trivial RUNNING <==> IDLE state
+changes, not discussed here for brevity.
--
2.25.1
Document User Managed Concurrency Groups syscalls, data structures,
state transitions, etc.
Signed-off-by: Peter Oskolkov <[email protected]>
---
Documentation/userspace-api/umcg.rst | 550 +++++++++++++++++++++++++++
1 file changed, 550 insertions(+)
create mode 100644 Documentation/userspace-api/umcg.rst
diff --git a/Documentation/userspace-api/umcg.rst b/Documentation/userspace-api/umcg.rst
new file mode 100644
index 000000000000..825bf0fec4b6
--- /dev/null
+++ b/Documentation/userspace-api/umcg.rst
@@ -0,0 +1,550 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+UMCG Userspace API
+=====================================
+
+User Managed Concurrency Groups (UMCG) is an M:N threading
+subsystem/toolkit that lets user space application developers
+implement in-process user space schedulers.
+
+.. contents:: :local:
+
+Why? Heterogeneous in-process workloads
+=======================================
+Linux kernel's CFS scheduler is designed for the "common" use case,
+with efficiency/throughput in mind. Work isolation and workloads of
+different "urgency" are addressed by tools such as cgroups, CPU
+affinity, priorities, etc., which are difficult or impossible to
+efficiently use in-process.
+
+For example, a single DBMS process may receive tens of thousands
+requests per second; some of these requests may have strong response
+latency requirements as they serve live user requests (e.g. login
+authentication); some of these requests may not care much about
+latency but must be served within a certain time period (e.g. an
+hourly aggregate usage report); some of these requests are to be
+served only on a best-effort basis and can be NACKed under high load
+(e.g. an exploratory research/hypothesis testing workload).
+
+Beyond different work item latency/throughput requirements as outlined
+above, the DBMS may need to provide certain guarantees to different
+users; for example, user A may "reserve" 1 CPU for their
+high-priority/low latency requests, 2 CPUs for mid-level throughput
+workloads, and be allowed to send as many best-effort requests as
+possible, which may or may not be served, depending on the DBMS load.
+Besides, the best-effort work, started when the load was low, may need
+to be delayed if suddenly a large amount of higher-priority work
+arrives. With hundreds or thousands of users like this, it is very
+difficult to guarantee the application's responsiveness using standard
+Linux tools while maintaining high CPU utilization.
+
+Gaming is another use case: some in-process work must be completed
+before a certain deadline dictated by frame rendering schedule, while
+other work items can be delayed; some work may need to be
+cancelled/discarded because the deadline has passed; etc.
+
+User Managed Concurrency Groups is an M:N threading toolkit that
+allows constructing user space schedulers designed to efficiently
+manage heterogeneous in-process workloads described above while
+maintaining high CPU utilization (95%+).
+
+Requirements
+============
+One relatively established way to design high-efficiency, low-latency
+systems is to split all work into small on-cpu work items, with
+asynchronous I/O and continuations, all executed on a thread pool with
+the number of threads not exceeding the number of available CPUs.
+Although this approach works, it is quite difficult to develop and
+maintain such a system, as, for example, small continuations are
+difficult to piece together when debugging. Besides, such asynchronous
+callback-based systems tend to be somewhat cache-inefficient, as
+continuations can get scheduled on any CPU regardless of cache
+locality.
+
+M:N threading and cooperative user space scheduling enables controlled
+CPU usage (minimal OS preemption), synchronous coding style, and
+better cache locality.
+
+Specifically:
+
+- a variable/fluctuating number M of "application" threads should be
+ "scheduled over" a relatively fixed number N of "kernel" threads,
+ where N is less than or equal to the number of CPUs available;
+- only those application threads that are attached to kernel threads
+ are scheduled "on CPU";
+- application threads should be able to cooperatively yield to each other;
+- when an application thread blocks in kernel (e.g. in I/O), this
+ becomes a scheduling event ("block") that the userspace scheduler
+ should be able to efficiently detect, and reassign a waiting
+ application thread to the freeded "kernel" thread;
+- when a blocked application thread wakes (e.g. its I/O operation
+ completes), this even ("wake") should also be detectable by the
+ userspace scheduler, which should be able to either quickly dispatch
+ the newly woken thread to an idle "kernel" thread or, if all "kernel"
+ threads are busy, put it in the waiting queue;
+- in addition to the above, it would be extremely useful for a
+ separate in-process "watchdog" facility to be able to monitor the
+ state of each of the M+N threads, and to intervene in case of runaway
+ workloads (interrupt/preempt).
+
+
+UMCG kernel API
+===============
+Based on the requrements above, UMCG *kernel* API is build around
+the following ideas:
+
+- *UMCG server*: a task/thread representing "kernel threads", or CPUs
+ from the requirements above;
+- *UMCG worker*: a task/thread representing "application threads", to
+ be scheduled over servers;
+- UMCG *task state*: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG
+ task (a server or a worker) can be in;
+- UMCG task *state flag*: LOCKED, PREEMPTED: additional state flags
+ that can be ORed with the task state to communicate additional information
+ to the kernel;
+- ``struct umcg_task``: a per-task userspace set of data fields, usually
+ residing in the TLS, that fully reflects the current task's UMCG
+ state and controls the way the kernel manages the task;
+- ``sys_umcg_ctl()``: a syscall used to register the current task/thread
+ as a server or a worker, or to unregister a UMCG task;
+- ``sys_umcg_wait()``: a syscall used to put the current task to
+ sleep and/or wake another task, pontentially context-switching
+ between the two tasks on-CPU synchronously.
+
+
+Servers
+=======
+
+When a task/thread is registered as a server, it is in RUNNING
+state and behaves like any other normal task/thread. In addition,
+servers can interact with other UMCG tasks via sys_umcg_wait():
+
+- servers can voluntarily suspend their execution (wait), becoming IDLE;
+- servers can wake other IDLE servers;
+- servers can context-switch between each other.
+
+Note that if a server blocks in the kernel *not* via sys_umcg_wait(),
+it still retains its RUNNING state.
+
+Also note that servers can be used for fast on-CPU context switching
+across process boundaries; server-worker interactions assume they
+belong to the same mm.
+
+See the next section on how servers interact with workers.
+
+Workers
+=======
+
+A worker cannot be RUNNING without having a server associated
+with it, so when a task is first registered as a worker, it enters
+the IDLE state.
+
+- a worker becomes RUNNING when a server calls sys_umcg_wait to
+ context-switch into it; the server goes IDLE, and the worker becomes
+ RUNNING in its place;
+- when a running worker blocks in the kernel, it becomes BLOCKED,
+ its associated server becomes RUNNING and the server's
+ sys_umcg_wait() call from the bullet above returns; this transition
+ is sometimes called "block detection";
+- when the syscall on which a BLOCKED worker completes, the worker
+ becomes IDLE and is added to the list of idle workers; if there
+ is an idle server waiting, the kernel wakes it; this transition
+ is sometimes called "wake detection";
+- running workers can voluntarily suspend their execution (wait),
+ becoming IDLE; their associated servers are woken;
+- a RUNNING worker can context-switch with an IDLE worker; the server
+ of the switched-out worker is transferred to the switched-in worker;
+- any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
+ this is a server running the worker as described in the first bullet
+ in this list, the worker remain IDLE but is added to the idle workers
+ list; this "wake" operation exists for completeness, to make sure
+ wait/wake/context-switch operations are available for all UMCG tasks;
+- the userspace can preempt a RUNNING worker by marking it
+ ``RUNNING|PREEMPTED`` and sending a signal to it; the userspace should
+ have installed a NOP signal handler for the signal; the kernel will
+ then transition the worker into ``IDLE|PREEMPTED`` state and wake
+ its associated server.
+
+UMCG task states
+================
+
+Important: all state transitions described below involve at least
+two steps: the change of the state field in ``struct umcg_task``,
+for example ``RUNNING`` to ``IDLE``, and the corresponding change in
+``struct task_struct`` state, for example a transition between the task
+running on CPU and being descheduled and removed from the kernel runqueue.
+The key principle of UMCG API design is that the party initiating
+the state transition modifies the state variable.
+
+For example, a task going ``IDLE`` first changes its state from ``RUNNING``
+to ``IDLE`` in the userpace and then calls ``sys_umcg_wait()``, which
+completes the transition.
+
+Note on documentation: in ``include/uapi/linux/umcg.h``, task states
+have the form ``UMCG_TASK_RUNNING``, ``UMCG_TASK_BLOCKED``, etc. In
+this document these are usually referred to simply ``RUNNING`` and
+``BLOCKED``, unless it creates ambiguity. Task state flags, e.g.
+``UMCG_TF_PREEMPTED``, are treated similarly.
+
+UMCG task states reflect the view from the userspace, rather than from
+the kernel. There are three fundamental task states:
+
+- ``RUNNING``: indicates that the task is schedulable by the kernel; applies
+ to both servers and workers;
+- ``IDLE``: indicates that the task is *not* schedulable by the kernel
+ (see ``umcg_idle_loop()`` in ``kernel/sched/umcg.c``); applies to
+ both servers and workers;
+- ``BLOCKED``: indicates that the worker is blocked in the kernel;
+ does not apply to servers.
+
+In addition to the three states above, two state flags help with
+state transitions:
+
+- ``LOCKED``: the userspace is preparing the worker for a state transition
+ and "locks" the worker until the worker is ready for the kernel to
+ act on the state transition; used similarly to preempt_disable or
+ irq_disable in the kernel; applies only to workers in ``RUNNING`` or
+ ``IDLE`` state; ``RUNNING|LOCKED`` means "this worker is about to
+ become ``RUNNING``, while ``IDLE|LOCKED`` means "this worker is about
+ to become ``IDLE`` or unregister;
+- ``PREEMPTED``: the userspace indicates it wants the worker to be
+ preempted; there are no situations when both ``LOCKED`` and ``PREEMPTED``
+ flags are set at the same time.
+
+struct umcg_task
+================
+
+From ``include/uapi/linux/umcg.h``:
+
+.. code-block:: C
+
+ struct umcg_task {
+ uint32_t state; /* r/w */
+ uint32_t next_tid; /* r */
+ uint64_t idle_workers_ptr; /* r/w */
+ uint64_t idle_server_tid_ptr; /* r* */
+ };
+
+Each UMCG task is identified by ``struct umcg_task``, which is provided
+to the kernel when the task is registered via ``sys_umcg_ctl()``.
+
+- ``uint32_t state``: the current state of the task this struct identifies,
+ as described in the previous section. Readable/writable by both the kernel
+ and the userspace.
+
+ - bits 0 - 7: task state (RUNNING, IDLE, BLOCKED);
+ - bits 8 - 15: state flags (LOCKED, PREEMPTED);
+ - bits 16 - 23: reserved; must be zeroes;
+ - bits 24 - 31: for userspace use.
+
+- ``uint32_t next_tid``: contains the TID of the task to context-switch-into
+ in ``sys_umcg_wait()``; can be zero; writable by the userspace, readable
+ by the kernel; if this is a RUNNING worker, this field contains
+ the TID of the server that should be woken when this worker blocks;
+ see ``sys_umcg_wait()`` for more details;
+
+- ``uint64_t idle_workers_ptr``: this field forms a single-linked list
+ of idle workers: all RUNNING workers have this field set to point
+ to the head of the list (a pointer variable in the userspace).
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from ``BLOCKED`` to ``IDLE`` and adds the worker
+ to the top of the list of idle workers using this logic:
+
+ .. code-block:: C
+
+ /* kernel side */
+ /**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+ static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+ {
+ u64 __user *node = &ut_worker->idle_workers_ptr;
+ u64 __user *head_ptr;
+ u64 first = (u64)node;
+ u64 head;
+
+ if (get_user_nosleep(head, node) || !head)
+ return false;
+
+ head_ptr = (u64 __user *)head;
+
+ if (put_user_nosleep(UMCG_IDLE_NODE_PENDING, node))
+ return false;
+
+ if (xchg_user_64(head_ptr, &first))
+ return false;
+
+ if (put_user_nosleep(first, node))
+ return false;
+
+ return true;
+ }
+
+
+ In the userspace the list is cleared atomically using this logic:
+
+ .. code-block:: C
+
+ /* userspace side */
+ uint64_t *idle_workers = (uint64_t *)*head;
+
+ atomic_exchange(&idle_workers, NULL);
+
+ The userspace re-points workers' idle_workers_ptr to the list head
+ variable before the worker is allowed to become RUNNING again.
+
+ When processing the idle workers list, the userspace should wait for
+ workers marked as UMCG_IDLE_NODE_PENDING to have the flag cleared
+ (see ``enqueue_idle_worker()`` above).
+
+- ``uint64_t idle_server_tid_ptr``: points to a variable in the
+ userspace that points to an idle server, i.e. a server in IDLE state waiting
+ in sys_umcg_wait(); read-only; workers must have this field set; not used
+ in servers.
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from ``BLOCKED`` to ``IDLE``, adds the worker
+ to the list of idle workers, and wakes the idle server if present;
+ the kernel atomically exchanges ``(*idle_server_tid_ptr)`` with 0,
+ thus waking the idle server, if present, only once.
+ See `State transitions`_ below for more details.
+
+sys_umcg_ctl()
+==============
+
+``int sys_umcg_ctl(uint32_t flags, struct umcg_task *self)`` is used to
+register or unregister the current task as a worker or server. Flags
+can be one of the following:
+
+- ``UMCG_CTL_REGISTER``: register a server;
+- ``UMCG_CTL_REGISTER | UMCG_CTL_WORKER``: register a worker;
+- ``UMCG_CTL_UNREGISTER``: unregister the current server or worker.
+
+When registering a task, ``self`` must point to ``struct umcg_task``
+describing this server or worker; the pointer must remain valid until
+the task is unregistered.
+
+When registering a server, ``self->state`` must be ``RUNNING``; all other
+fields in ``self`` must be zeroes.
+
+When registering a worker, ``self->state`` must be ``BLOCKED``;
+``self->idle_server_tid_ptr`` and ``self->idle_workers_ptr`` must be
+valid pointers as described in `struct umcg_task`_; ``self->next_tid`` must
+be zero.
+
+When unregistering a task, ``self`` must be ``NULL``.
+
+sys_umcg_wait()
+===============
+
+``int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout)`` operates
+on registered UMCG servers and workers: ``struct umcg_task *self`` provided
+to ``sys_umcg_ctl()`` when registering the current task is consulted
+in addition to ``flags`` and ``abs_timeout`` parameters.
+
+The function can be used to perform one of the three operations:
+
+- wait: if ``self->next_tid`` is zero, ``sys_umcg_wait()`` puts the current
+ server or worker to sleep;
+- wake: if ``self->next_tid`` is not zero, and ``flags & UMCG_WAIT_WAKE_ONLY``,
+ the task identified by ``next_tid`` is woken;
+- context switch: if ``self->next_tid`` is not zero, and
+ ``!(flags & UMCG_WAIT_WAKE_ONLY)``, the current task is put to sleep and
+ the next task is woken, synchronously switching between the tasks on the
+ current CPU on the fast path.
+
+Flags can be zero or a combination of the following values:
+
+- ``UMCG_WAIT_WAKE_ONLY``: wake the next task, don't put the current task
+ to sleep;
+- ``UMCG_WAIT_WF_CURRENT_CPU``: wake the next task on the curent CPU;
+ this flag has an effect only if ``UMCG_WAIT_WAKE_ONLY`` is set: context
+ switching is always attempted to happen on the curent CPU.
+
+The section below provides more details on how servers and workers interact
+via ``sys_umcg_wait()``, during worker block/wake events, and during
+worker preemption.
+
+State transitions
+=================
+
+As mentioned above, the key principle of UMCG state transitions is that
+**the party initiating the state transition modifies the state of affected
+tasks**.
+
+Below, "``TASK:STATE``" indicates a task T, where T can be either W for
+worker or S for server, in state S, where S can be one of the three states,
+potentially ORed with a state flag. Each individual state transition
+is an atomic operation (cmpxchg) unless indicated otherwise. Also note
+that **the order of state transitions is important and is part of the
+contract between the userspace and the kernel. The kernel is free
+to kill the task (SIGKILL) if the contract is broken.**
+
+Some worker state transitions below include adding ``LOCKED`` flag to
+worker state. This is done to indicate to the kernel that the worker
+is transitioning state and should not participate in the block/wake
+detection routines, which can happen due to interrupts/pagefaults/signals.
+
+``IDLE|LOCKED`` means that a running worker is preparing to sleep, so
+interrupts should not lead to server wakeup; ``RUNNING|LOCKED`` means that
+an idle worker is going to be "scheduled to run", but may not yet have its
+server set up properly.
+
+Key state transitions:
+
+- server to worker context switch ("schedule a worker to run"):
+ ``S:RUNNING+W:IDLE => S:IDLE+W:RUNNING``:
+
+ - in the userspace, in the context of the server S running:
+
+ - ``S:RUNNING => S:IDLE`` (mark self as idle)
+ - ``W:IDLE => W:RUNNING|LOCKED`` (mark the worker as running)
+ - ``W.next_tid := S.tid; S.next_tid := W.tid``
+ (link the server with the worker)
+ - ``W:RUNNING|LOCKED => W:RUNNING`` (unlock the worker)
+ - ``S: sys_umcg_wait()`` (make the syscall)
+
+ - the kernel context switches from the server to the worker; the server
+ sleeps until it becomes ``RUNNING`` during one of the transitions below;
+
+- worker to server context switch (worker "yields"):
+ ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE``:
+
+ - in the userspace, in the context of the worker W running (note that
+ a running worker has its ``next_tid`` set to point to its server):
+
+ - ``W:RUNNING => W:IDLE|LOCKED`` (mark self as idle)
+ - ``S:IDLE => S:RUNNING`` (mark the server as running)
+ - ``W: sys_umcg_wait()`` (make the syscall)
+
+ - the kernel removes the ``LOCKED`` flag from the worker's state and
+ context switches from the worker to the server; the worker
+ sleeps until it becomes ``RUNNING``;
+
+- worker to worker context switch:
+ ``W1:RUNNING+W2:IDLE => W1:IDLE+W2:RUNNING``:
+
+ - in the userspace, in the context of W1 running:
+
+ - ``W2:IDLE => W2:RUNNING|LOCKED`` (mark W2 as running)
+ - ``W1:RUNNING => W1:IDLE|LOCKED`` (mark self as idle)
+ - ``W2.next_tid := W1.next_tid; S.next_tid := W2.tid``
+ (transfer the server W1 => W2)
+ - ``W1:next_tid := W2.tid`` (indicate that W1 should
+ context-switch into W2)
+ - ``W2:RUNNING|LOCKED => W2:RUNNING`` (unlock W2)
+ - ``W1: sys_umcg_wait()`` (make the syscall)
+
+ - same as above, the kernel removes the ``LOCKED`` flag from the W1's state
+ and context switches to next_tid;
+
+- worker wakeup: ``W:IDLE => W:RUNNING``:
+
+ - in the userspace, a server S can wake a worker W without "running" it:
+
+ - ``S:next_tid :=W.tid``
+ - ``W:next_tid := 0``
+ - ``W:IDLE => W:RUNNING``
+ - ``sys_umcg_wait(UMCG_WAIT_WAKE_ONLY)`` (make the syscall)
+
+ - the kernel will wake the worker W; as the worker does not have a server
+ assigned, "wake detection" will happen, the worker will be immediately
+ marked as ``IDLE`` and added to idle workers list; an idle server, if any,
+ will be woken (see 'wake detection' below);
+ - Note: if needed, it is possible for a worker to wake another worker:
+ the waker marks itself "IDLE|LOCKED", points its next_tid to the wakee,
+ makes the syscall, restores its server in next_tid, marks itself
+ as ``RUNNING``.
+
+- block detection: worker blocks in the kernel: ``S:IDLE+W:RUNNING => S:RUNNING+W:BLOCKED``:
+
+ - when a worker blocks in the kernel in ``RUNNING`` state (not ``LOCKED``),
+ before descheduling the task from the CPU the kernel performs these
+ operations:
+
+ - ``W:RUNNING => W:BLOCKED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+ - ``try_to_wake_up(S)``
+
+ - if any of the first three operations above fail, the worker is killed via
+ ``SIGKILL``. Note that ``ttwu(S)`` is not required to succeed, as the
+ server may still be transitioning to sleep in ``sys_umcg_wait()``; before
+ actually putting the server to sleep its UMCG state is checked and, if
+ it is ``RUNNING``, sys_umcg_wait() returns to the userspace;
+ - if the worker has its ``LOCKED`` flag set, block detection does not trigger,
+ as the worker is assumed to be in the userspace scheduling code.
+
+- wake detection: worker wakes in the kernel: ``W:BLOCKED => W:IDLE``:
+
+ - all workers' returns to the userspace are intercepted:
+
+ - ``start:`` (a label)
+ - if ``W:RUNNING & W.next_tid != 0``: let the worker exit to the userspace,
+ as this is a ``RUNNING`` worker with a server;
+ - ``W:* => W:IDLE`` (previously blocked or woken without servers workers
+ are not allowed to return to the userspace);
+ - the worker is appended to ``W.idle_workers_ptr`` idle workers list;
+ - ``S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S.RUNNING; ttwu(S)``
+ - ``idle_loop(W)``: this is the same idle loop that ``sys_umcg_wait()``
+ uses: it breaks only when the worker becomes ``RUNNING``; when the
+ idle loop exits, it is assumed that the userspace has properly
+ removed the worker from the idle workers list before marking it
+ ``RUNNING``;
+ - ``goto start;`` (repeat from the beginning).
+
+ - the logic above is a bit more complicated in the presence of ``LOCKED`` or
+ ``PREEMPTED`` flags, but the main invariants stay the same:
+
+ - only ``RUNNING`` workers with servers assigned are allowed to run
+ in the userspace (unless ``LOCKED``);
+ - newly ``IDLE`` workers are added to the idle workers list; any
+ user-initiated state change assumes the userspace properly removed
+ the worker from the list;
+ - as with wake detection, any "breach of contract" by the userspace
+ will result in the task termination via ``SIGKILL``.
+
+- worker preemption: ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED``:
+
+ - when the userspace wants to preempt a ``RUNNING`` worker, it changes
+ it state, atomically, ``RUNNING => RUNNING|PREEMPTED`` and sends a signal
+ to the worker via ``tgkill()``; the signal handler, previously set up
+ by the userspace, can be a NOP (note that only ``RUNNING`` workers can be
+ preempted);
+ - if the worker, at the moment the signal arrived, continued to be running
+ on-CPU in the userspace, the "wake detection" code will be triggered that,
+ in addition to what was described above, will check if the worker is in
+ ``RUNNING|PREEMPTED`` state:
+
+ - ``W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+ - ``try_to_wakeup(S)``
+
+ - if the signal arrives after the worker blocks in the kernel, the "block
+ detection" happened as described above, with the following change:
+
+ - ``W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+ - ``try_to_wake_up(S)``
+
+ - in any case, the worker's server is woken, with its attached worker
+ (``S.next_tid``) either in ``BLOCKED|PREEMPTED`` or ``IDLE|PREEMPTED``
+ state.
+
+Server-only use cases
+=====================
+
+Some workloads/applications may benefit from fast and synchronous on-CPU
+user-initiated context switches without the need for full userspace
+scheduling (block/wake detection). These applications can use "standalone"
+UMCG servers to wait/wake/context-switch, including across process boundaries.
+
+These "worker-less" operations involve trivial ``RUNNING`` <==> ``IDLE``
+state changes, not discussed here for brevity.
+
--
2.25.1
On Fri, Sep 17, 2021 at 11:03:20AM -0700, Peter Oskolkov wrote:
> Add helper functions to work atomically with userspace 32/64 bit values -
> there are some .*futex.* named helpers, but they are not exactly
> what is needed for UMCG; I haven't found what else I could use, so I
> rolled these.
>
> At the moment only X86_64 is supported.
>
> Note: the helpers should probably go into arch/ somewhere; I have
> them in kernel/sched/umcg_uaccess.h temporarily for convenience. Please
> let me know where I should put them.
>
> Changelog:
> v0.5->v0.6:
> - replaced mmap_read_lock with mmap_read_lock_killable in fix_pagefault();
> - fix_pagefault now validates proper uaddr alignment;
> - renamed umcg.h to umcg_uaccess.h;
> v0.4->v0.5:
> - added xchg_user_** helpers;
> v0.3->v0.4:
> - added put_user_nosleep;
> - removed linked list/stack operations patch;
> v0.2->v0.3:
> - renamed and refactored the helpers a bit, as described above;
> - moved linked list/stack operations into a separate patch.
>
> Signed-off-by: Peter Oskolkov <[email protected]>
> ---
> kernel/sched/umcg_uaccess.h | 344 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 344 insertions(+)
> create mode 100644 kernel/sched/umcg_uaccess.h
>
> diff --git a/kernel/sched/umcg_uaccess.h b/kernel/sched/umcg_uaccess.h
> new file mode 100644
> index 000000000000..e4ead8d2fd62
> --- /dev/null
> +++ b/kernel/sched/umcg_uaccess.h
> @@ -0,0 +1,344 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _KERNEL_SCHED_UMCG_UACCESS_H
> +#define _KERNEL_SCHED_UMCG_UACCESS_H
> +
> +#ifdef CONFIG_X86_64
> +
> +#include <linux/uaccess.h>
> +
> +#include <asm/asm.h>
> +#include <linux/atomic.h>
> +#include <asm/uaccess.h>
> +
> +/* TODO: move atomic operations below into arch/ headers */
> +static inline int __try_cmpxchg_user_32(u32 *uval, u32 __user *uaddr,
> + u32 oldval, u32 newval)
> +{
> + int ret = 0;
> +
> + asm volatile("\n"
> + "1:\t" LOCK_PREFIX "cmpxchgl %4, %2\n"
> + "2:\n"
> + "\t.section .fixup, \"ax\"\n"
> + "3:\tmov %3, %0\n"
> + "\tjmp 2b\n"
> + "\t.previous\n"
> + _ASM_EXTABLE_UA(1b, 3b)
> + : "+r" (ret), "=a" (oldval), "+m" (*uaddr)
> + : "i" (-EFAULT), "r" (newval), "1" (oldval)
> + : "memory"
> + );
> + *uval = oldval;
> + return ret;
> +}
> +
> +static inline int __try_cmpxchg_user_64(u64 *uval, u64 __user *uaddr,
> + u64 oldval, u64 newval)
> +{
> + int ret = 0;
> +
> + asm volatile("\n"
> + "1:\t" LOCK_PREFIX "cmpxchgq %4, %2\n"
> + "2:\n"
> + "\t.section .fixup, \"ax\"\n"
> + "3:\tmov %3, %0\n"
> + "\tjmp 2b\n"
> + "\t.previous\n"
> + _ASM_EXTABLE_UA(1b, 3b)
> + : "+r" (ret), "=a" (oldval), "+m" (*uaddr)
> + : "i" (-EFAULT), "r" (newval), "1" (oldval)
> + : "memory"
> + );
> + *uval = oldval;
> + return ret;
> +}
> +
> +static inline int fix_pagefault(unsigned long uaddr, bool write_fault, int bytes)
> +{
> + struct mm_struct *mm = current->mm;
> + int ret;
> +
> + /* Validate proper alignment. */
> + if (uaddr % bytes)
> + return -EINVAL;
> +
> + if (mmap_read_lock_killable(mm))
> + return -EINTR;
> + ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
> + NULL);
> + mmap_read_unlock(mm);
> +
> + return ret < 0 ? ret : 0;
> +}
> +
> +/**
> + * cmpxchg_32_user_nosleep - compare_exchange 32-bit values
^^^^^^^^^^^^^^^^^^^^^^^
Need to be consistent with the function name below.
> + * Return:
> + * 0 - OK
> + * -EFAULT: memory access error
> + * -EAGAIN: @expected did not match; consult @prev
> + */
> +static inline int cmpxchg_user_32_nosleep(u32 __user *uaddr, u32 *old, u32 new)
> +{
> + int ret = -EFAULT;
> + u32 __old = *old;
> +
> + if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> + return -EFAULT;
> +
> + pagefault_disable();
> +
> + __uaccess_begin_nospec();
> + ret = __try_cmpxchg_user_32(old, uaddr, __old, new);
> + user_access_end();
> +
> + if (!ret)
> + ret = *old == __old ? 0 : -EAGAIN;
> +
> + pagefault_enable();
> + return ret;
> +}
> +
> +/**
> + * cmpxchg_64_user_nosleep - compare_exchange 64-bit values
^^^^^^^^^^^^^^^^^^^^^^^
Ditto.
> + * Return:
> + * 0 - OK
> + * -EFAULT: memory access error
> + * -EAGAIN: @expected did not match; consult @prev
> + */
> +static inline int cmpxchg_user_64_nosleep(u64 __user *uaddr, u64 *old, u64 new)
> +{
> + int ret = -EFAULT;
> + u64 __old = *old;
> +
> + if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> + return -EFAULT;
> +
> + pagefault_disable();
> +
> + __uaccess_begin_nospec();
> + ret = __try_cmpxchg_user_64(old, uaddr, __old, new);
> + user_access_end();
> +
> + if (!ret)
> + ret = *old == __old ? 0 : -EAGAIN;
> +
> + pagefault_enable();
> +
> + return ret;
> +}
> +
> +/**
> + * cmpxchg_32_user - compare_exchange 32-bit values
^^^^^^^^^^^^^^^
Ditto.
> + * Return:
> + * 0 - OK
> + * -EFAULT: memory access error
> + * -EAGAIN: @expected did not match; consult @prev
> + */
> +static inline int cmpxchg_user_32(u32 __user *uaddr, u32 *old, u32 new)
> +{
> + int ret = -EFAULT;
> + u32 __old = *old;
> +
> + if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> + return -EFAULT;
> +
> + pagefault_disable();
> +
> + while (true) {
> + __uaccess_begin_nospec();
> + ret = __try_cmpxchg_user_32(old, uaddr, __old, new);
> + user_access_end();
> +
> + if (!ret) {
> + ret = *old == __old ? 0 : -EAGAIN;
> + break;
> + }
> +
> + if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
> + break;
> + }
> +
> + pagefault_enable();
> + return ret;
> +}
> +
> +/**
> + * cmpxchg_64_user - compare_exchange 64-bit values
^^^^^^^^^^^^^^^
Ditto.
> + * Return:
> + * 0 - OK
> + * -EFAULT: memory access error
> + * -EAGAIN: @expected did not match; consult @prev
> + */
> +static inline int cmpxchg_user_64(u64 __user *uaddr, u64 *old, u64 new)
> +{
> + int ret = -EFAULT;
> + u64 __old = *old;
> +
> + if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> + return -EFAULT;
> +
> + pagefault_disable();
> +
> + while (true) {
> + __uaccess_begin_nospec();
> + ret = __try_cmpxchg_user_64(old, uaddr, __old, new);
> + user_access_end();
> +
> + if (!ret) {
> + ret = *old == __old ? 0 : -EAGAIN;
> + break;
> + }
> +
> + if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
> + break;
> + }
> +
> + pagefault_enable();
> +
> + return ret;
> +}
> +
> +static inline int __try_xchg_user_32(u32 *oval, u32 __user *uaddr, u32 newval)
> +{
> + u32 oldval = 0;
> + int ret = 0;
> +
> + asm volatile("\n"
> + "1:\txchgl %0, %2\n"
> + "2:\n"
> + "\t.section .fixup, \"ax\"\n"
> + "3:\tmov %3, %0\n"
> + "\tjmp 2b\n"
> + "\t.previous\n"
> + _ASM_EXTABLE_UA(1b, 3b)
> + : "=r" (oldval), "=r" (ret), "+m" (*uaddr)
> + : "i" (-EFAULT), "0" (newval), "1" (0)
> + );
> +
> + if (ret)
> + return ret;
> +
> + *oval = oldval;
> + return 0;
> +}
> +
> +static inline int __try_xchg_user_64(u64 *oval, u64 __user *uaddr, u64 newval)
> +{
> + u64 oldval = 0;
> + int ret = 0;
> +
> + asm volatile("\n"
> + "1:\txchgq %0, %2\n"
> + "2:\n"
> + "\t.section .fixup, \"ax\"\n"
> + "3:\tmov %3, %0\n"
> + "\tjmp 2b\n"
> + "\t.previous\n"
> + _ASM_EXTABLE_UA(1b, 3b)
> + : "=r" (oldval), "=r" (ret), "+m" (*uaddr)
> + : "i" (-EFAULT), "0" (newval), "1" (0)
> + );
> +
> + if (ret)
> + return ret;
> +
> + *oval = oldval;
> + return 0;
> +}
> +
> +/**
> + * xchg_32_user - atomically exchange 64-bit values
^^^^^^^^^^^^
Ditto.
> + * Return:
> + * 0 - OK
> + * -EFAULT: memory access error
> + */
> +static inline int xchg_user_32(u32 __user *uaddr, u32 *val)
> +{
> + int ret = -EFAULT;
> +
> + if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> + return -EFAULT;
> +
> + pagefault_disable();
> +
> + while (true) {
> +
> + __uaccess_begin_nospec();
> + ret = __try_xchg_user_32(val, uaddr, *val);
> + user_access_end();
> +
> + if (!ret)
> + break;
> +
> + if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
> + break;
> + }
> +
> + pagefault_enable();
> +
> + return ret;
> +}
> +
> +/**
> + * xchg_64_user - atomically exchange 64-bit values
^^^^^^^^^^^^
Ditto.
> + * Return:
> + * 0 - OK
> + * -EFAULT: memory access error
> + */
> +static inline int xchg_user_64(u64 __user *uaddr, u64 *val)
> +{
> + int ret = -EFAULT;
> +
> + if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> + return -EFAULT;
> +
> + pagefault_disable();
> +
> + while (true) {
> +
> + __uaccess_begin_nospec();
> + ret = __try_xchg_user_64(val, uaddr, *val);
> + user_access_end();
> +
> + if (!ret)
> + break;
> +
> + if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
> + break;
> + }
> +
> + pagefault_enable();
> +
> + return ret;
> +}
> +
> +/**
> + * get_user_nosleep - get user value without sleeping.
> + *
> + * get_user() might sleep and therefore cannot be used in preempt-disabled
> + * regions.
> + */
> +#define get_user_nosleep(out, uaddr) \
> +({ \
> + int ret = -EFAULT; \
> + \
> + if (access_ok((uaddr), sizeof(*(uaddr)))) { \
> + pagefault_disable(); \
> + \
> + if (!__get_user((out), (uaddr))) \
> + ret = 0; \
> + \
> + pagefault_enable(); \
> + } \
> + ret; \
> +})
> +
> +#endif /* CONFIG_X86_64 */
> +#endif /* _KERNEL_SCHED_UMCG_UACCESS_H */
> --
> 2.25.1
>
Thanks,
Tao
Hi Peter,
On Fri, Sep 17, 2021 at 11:03:21AM -0700, Peter Oskolkov wrote:
> Define struct umcg_task and two syscalls: sys_umcg_ctl sys_umcg_wait.
>
> All key operations, such as wait/wake/context-switch, as well as
> timeouts and block/wake detection, are working quite reliably.
>
> In addition, the userspace can now force the kernel to preempt
> a running worker by changing its state from RUNNING to
> RUNNING | PREEMPTED and sending a signal to it. This new functionality
> is less well tested than the key operations above, but is working
> well in the common case of the worker busy in the userspace.
>
> These big things remain to be addressed (in no particular order):
> - tracing/debugging
> - faster context switches (see umcg_do_context_switch in umcg.c)
> - other architectures (we will need at least arm64 in addition to amd64)
> - tools/lib/umcg for userspace
> - kselftests
>
> I'm working on finalizing libumcg and kselftests.
>
> See Documentation/userspace-api/umcg.[txt|rst] for API usage and
> other details.
>
> v0.5->v0.6 changes:
> - umcg_task pages are now pinned for RUNNING workers;
> - waking workers now wait for the userspace to schedule them
> in exit_to_user_mode_loop() instead of in sched_update_worker();
> - added umcg_clear_child to fork and execve;
> - changed current->umcg_task assignments to WRITE_ONCE;
> - server/worker interactions are restricted to tasks in the same mm;
>
> v0.4->v0.5 changes:
> - handling idle workers and servers is now much simpler on the kernel
> side, thanks to Thierry Delisle's suggestion:
> https://lore.kernel.org/lkml/[email protected]/
> - minor tweaks to improve preemption handling;
>
> v0.3->v0.4 changes:
> - removed server_tid and api_version fields from struct umcg_task;
> - added timeout handling to sys_umcg_wait();
> - implemented worker preemption via signals;
> - handling idle workers and servers is changed again (see umcg.rst).
>
> v0.2->v0.3 changes:
> - the overall approach is now based on peterz@'s suggestion in
> https://lore.kernel.org/patchwork/cover/1433967/
> (should I add Suggested-by?)
> - new protocol for working with idle workers and servers is used, to avoid
> spinning in the kernel;
> - waking a UMCG task now does not require spinning.
>
> Signed-off-by: Peter Oskolkov <[email protected]>
> ---
> arch/x86/entry/syscalls/syscall_64.tbl | 2 +
> fs/exec.c | 1 +
> include/linux/sched.h | 56 ++
> include/linux/syscalls.h | 4 +
> include/uapi/asm-generic/unistd.h | 8 +-
> include/uapi/linux/umcg.h | 117 ++++
> init/Kconfig | 10 +
> kernel/entry/common.c | 1 +
> kernel/exit.c | 2 +
> kernel/sched/Makefile | 1 +
> kernel/sched/core.c | 15 +-
> kernel/sched/umcg.c | 745 +++++++++++++++++++++++++
> kernel/sys_ni.c | 4 +
> 13 files changed, 963 insertions(+), 3 deletions(-)
> create mode 100644 include/uapi/linux/umcg.h
> create mode 100644 kernel/sched/umcg.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index ce18119ea0d0..0c6c7fd72b0b 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -368,6 +368,8 @@
> 444 common landlock_create_ruleset sys_landlock_create_ruleset
> 445 common landlock_add_rule sys_landlock_add_rule
> 446 common landlock_restrict_self sys_landlock_restrict_self
> +447 common umcg_ctl sys_umcg_ctl
> +448 common umcg_wait sys_umcg_wait
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/fs/exec.c b/fs/exec.c
> index 18594f11c31f..d652ef8017b2 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1835,6 +1835,7 @@ static int bprm_execve(struct linux_binprm *bprm,
> current->fs->in_exec = 0;
> current->in_execve = 0;
> rseq_execve(current);
> + umcg_execve(current);
> acct_update_integrals(current);
> task_numa_free(current, false);
> return retval;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 549018e46801..4cf9070d1361 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -66,6 +66,7 @@ struct sighand_struct;
> struct signal_struct;
> struct task_delay_info;
> struct task_group;
> +struct umcg_task;
>
> /*
> * Task state bitmask. NOTE! These bits are also
> @@ -1230,6 +1231,12 @@ struct task_struct {
> unsigned long rseq_event_mask;
> #endif
>
> +#ifdef CONFIG_UMCG
> + struct umcg_task __user *umcg_task;
> + struct page *pinned_umcg_worker_page; /* self */
> + struct page *pinned_umcg_server_page;
> +#endif
> +
> struct tlbflush_unmap_batch tlb_ubc;
>
> union {
> @@ -1606,6 +1613,7 @@ extern struct pid *cad_pid;
> #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
> #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> +#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */
> #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
> #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> #define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
> @@ -2191,6 +2199,54 @@ static inline void rseq_execve(struct task_struct *t)
>
> #endif
>
> +#ifdef CONFIG_UMCG
> +
> +void umcg_handle_resuming_worker(void);
> +void umcg_handle_exiting_worker(void);
> +void umcg_clear_child(struct task_struct *tsk);
> +
> +/* Called by bprm_execve() in fs/exec.c. */
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> + if (tsk->umcg_task)
> + umcg_clear_child(tsk);
> +}
> +
> +/* Called by exit_to_user_mode_loop() in kernel/entry/common.c.*/
> +static inline void umcg_handle_notify_resume(void)
> +{
> + if (current->flags & PF_UMCG_WORKER)
> + umcg_handle_resuming_worker();
> +}
> +
> +/* Called by do_exit() in kernel/exit.c. */
> +static inline void umcg_handle_exit(void)
> +{
> + if (current->flags & PF_UMCG_WORKER)
> + umcg_handle_exiting_worker();
> +}
> +
> +/*
> + * umcg_wq_worker_[sleeping|running] are called in core.c by
> + * sched_submit_work() and sched_update_worker().
> + */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk);
> +void umcg_wq_worker_running(struct task_struct *tsk);
> +
> +#else
> +
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_handle_notify_resume(void)
> +{
> +}
> +static inline void umcg_handle_exit(void)
> +{
> +}
> +
> +#endif
> +
> #ifdef CONFIG_DEBUG_RSEQ
>
> void rseq_syscall(struct pt_regs *regs);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 050511e8f1f8..f3e1ef8d842f 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -71,6 +71,7 @@ struct open_how;
> struct mount_attr;
> struct landlock_ruleset_attr;
> enum landlock_rule_type;
> +struct umcg_task;
>
> #include <linux/types.h>
> #include <linux/aio_abi.h>
> @@ -1050,6 +1051,9 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _
> asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
> const void __user *rule_attr, __u32 flags);
> asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
> +asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
> +asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
> +
>
> /*
> * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 6de5a7fc066b..1a4c9ac0e296 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -873,8 +873,14 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> #define __NR_landlock_restrict_self 446
> __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
>
> +#define __NR_umcg_ctl 447
> +__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
> +#define __NR_umcg_wait 448
> +__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
> +
> +
> #undef __NR_syscalls
> -#define __NR_syscalls 447
> +#define __NR_syscalls 449
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
> new file mode 100644
> index 000000000000..edce804781f9
> --- /dev/null
> +++ b/include/uapi/linux/umcg.h
> @@ -0,0 +1,117 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UMCG_H
> +#define _UAPI_LINUX_UMCG_H
> +
> +#include <linux/limits.h>
> +#include <linux/types.h>
> +
> +/*
> + * UMCG: User Managed Concurrency Groups.
> + *
> + * Syscalls (see kernel/sched/umcg.c):
> + * sys_umcg_ctl() - register/unregister UMCG tasks;
> + * sys_umcg_wait() - wait/wake/context-switch.
> + *
> + * struct umcg_task (below): controls the state of UMCG tasks.
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for detals.
> + */
> +
> +/*
> + * UMCG task states, the first 8 bits. The states represent the user space
> + * point of view.
> + */
> +#define UMCG_TASK_NONE 0
> +#define UMCG_TASK_RUNNING 1
> +#define UMCG_TASK_IDLE 2
> +#define UMCG_TASK_BLOCKED 3
> +
> +/* The first byte: RUNNING, IDLE, or BLOCKED. */
> +#define UMCG_TASK_STATE_MASK 0xff
> +
> +/* UMCG task state flags, bits 8-15 */
> +
> +/*
> + * UMCG_TF_LOCKED: locked by the userspace in preparation to calling umcg_wait.
> + */
> +#define UMCG_TF_LOCKED (1 << 8)
> +
> +/*
> + * UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted.
> + */
> +#define UMCG_TF_PREEMPTED (1 << 9)
> +
> +/**
> + * struct umcg_task - controls the state of UMCG tasks.
> + *
> + * The struct is aligned at 64 bytes to ensure that it fits into
> + * a single cache line.
> + */
> +struct umcg_task {
> + /**
> + * @state: the current state of the UMCG task described by this struct.
> + *
> + * Readable/writable by both the kernel and the userspace.
> + *
> + * UMCG task state:
> + * bits 0 - 7: task state;
> + * bits 8 - 15: state flags;
> + * bits 16 - 23: reserved; must be zeroes;
> + * bits 24 - 31: for userspace use.
> + */
> + uint32_t state; /* r/w */
> +
> + /**
> + * @next_tid: the TID of the UMCG task that should be context-switched
> + * into in sys_umcg_wait(). Can be zero.
> + *
> + * Running UMCG workers must have next_tid set to point to IDLE
> + * UMCG servers.
> + *
> + * Read-only for the kernel, read/write for the userspace.
> + */
> + uint32_t next_tid; /* r */
> +
> + /**
> + * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
> + *
> + * Readable/writable by both the kernel and the userspace: the
> + * kernel adds items to the list, the userspace removes them.
> + */
> + uint64_t idle_workers_ptr; /* r/w */
> +
> + /**
> + * @idle_server_tid_ptr: a pointer pointing to a single idle server.
> + * Readonly.
> + */
> + uint64_t idle_server_tid_ptr; /* r */
> +} __attribute__((packed, aligned(8 * sizeof(__u64))));
> +
> +/**
> + * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
> + * @UMCG_CTL_REGISTER: register the current task as a UMCG task
> + * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
> + * @UMCG_CTL_WORKER: register the current task as a UMCG worker
> + */
> +enum umcg_ctl_flag {
> + UMCG_CTL_REGISTER = 0x00001,
> + UMCG_CTL_UNREGISTER = 0x00002,
> + UMCG_CTL_WORKER = 0x10000,
> +};
> +
> +/**
> + * enum umcg_wait_flag - flags to pass to sys_umcg_wait
> + * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
> + * @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
> + * (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
> + * must be set.
> + */
> +enum umcg_wait_flag {
> + UMCG_WAIT_WAKE_ONLY = 1,
> + UMCG_WAIT_WF_CURRENT_CPU = 2,
> +};
> +
> +/* See Documentation/userspace-api/umcg.[txt|rst].*/
> +#define UMCG_IDLE_NODE_PENDING (1ULL)
> +
> +#endif /* _UAPI_LINUX_UMCG_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index a61c92066c2e..c15a50a61ba6 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1662,6 +1662,16 @@ config MEMBARRIER
>
> If unsure, say Y.
>
> +config UMCG
> + bool "Enable User Managed Concurrency Groups API"
> + depends on X86_64
> + default n
> + help
> + Enable User Managed Concurrency Groups API, which form the basis
> + for an in-process M:N userspace scheduling framework.
> + At the moment this is an experimental/RFC feature that is not
> + guaranteed to be backward-compatible.
> +
> config KALLSYMS
> bool "Load all symbols for debugging/ksymoops" if EXPERT
> default y
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bf16395b9e13..f3cd335ab513 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -173,6 +173,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>
> if (ti_work & _TIF_NOTIFY_RESUME) {
> tracehook_notify_resume(regs);
> + umcg_handle_notify_resume(); /* might sleep */
> rseq_handle_notify_resume(NULL, regs);
> }
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index fd1c04193e18..fdd4e923cca9 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -744,6 +744,8 @@ void __noreturn do_exit(long code)
> if (unlikely(!tsk->pid))
> panic("Attempted to kill the idle task!");
>
> + umcg_handle_exit();
> +
> /*
> * If do_exit is called because this processes oopsed, it's possible
> * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 978fcfca5871..e4e481eee1b7 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
> obj-$(CONFIG_CPU_ISOLATION) += isolation.o
> obj-$(CONFIG_PSI) += psi.o
> obj-$(CONFIG_SCHED_CORE) += core_sched.o
> +obj-$(CONFIG_UMCG) += umcg.o
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 12a9d053e724..c9133cf153b9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4159,6 +4159,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> p->wake_entry.u_flags = CSD_TYPE_TTWU;
> p->migration_pending = NULL;
> #endif
> +#ifdef CONFIG_UMCG
> + umcg_clear_child(p);
> +#endif
> }
>
> DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> @@ -6105,10 +6108,14 @@ static inline void sched_submit_work(struct task_struct *tsk)
> * in the possible wakeup of a kworker and because wq_worker_sleeping()
> * requires it.
> */
> - if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> + if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
> preempt_disable();
> if (task_flags & PF_WQ_WORKER)
> wq_worker_sleeping(tsk);
> +#ifdef CONFIG_UMCG
> + else if (task_flags & PF_UMCG_WORKER)
> + umcg_wq_worker_sleeping(tsk);
> +#endif
> else
> io_wq_worker_sleeping(tsk);
> preempt_enable_no_resched();
> @@ -6127,9 +6134,13 @@ static inline void sched_submit_work(struct task_struct *tsk)
>
> static void sched_update_worker(struct task_struct *tsk)
> {
> - if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> + if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
> if (tsk->flags & PF_WQ_WORKER)
> wq_worker_running(tsk);
> +#ifdef CONFIG_UMCG
> + else if (tsk->flags & PF_UMCG_WORKER)
> + umcg_wq_worker_running(tsk);
> +#endif
> else
> io_wq_worker_running(tsk);
> }
> diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
> new file mode 100644
> index 000000000000..aa4dbb31c425
> --- /dev/null
> +++ b/kernel/sched/umcg.c
> @@ -0,0 +1,745 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * User Managed Concurrency Groups (UMCG).
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for detals.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/umcg.h>
> +
> +#include "sched.h"
> +#include "umcg_uaccess.h"
> +
> +/**
> + * umcg_pin_pages: pin pages containing struct umcg_task of this worker
> + * and its server.
> + *
> + * The pages are pinned when the worker exits to the userspace and unpinned
> + * when the worker is in sched_submit_work(), i.e. when the worker is
> + * about to be removed from its runqueue. Thus at most NR_CPUS UMCG pages
> + * are pinned at any one time across the whole system.
> + */
> +static int umcg_pin_pages(u32 server_tid)
> +{
> + struct umcg_task __user *worker_ut = current->umcg_task;
> + struct umcg_task __user *server_ut = NULL;
> + struct task_struct *tsk;
> +
> + rcu_read_lock();
> + tsk = find_task_by_vpid(server_tid);
> + if (tsk)
> + server_ut = READ_ONCE(tsk->umcg_task);
> + rcu_read_unlock();
> +
> + if (!server_ut)
> + return -EINVAL;
> +
> + if (READ_ONCE(current->mm) != READ_ONCE(tsk->mm))
> + return -EINVAL;
> +
> + tsk = current;
> +
> + /* worker_ut is stable, don't need to repin */
> + if (!tsk->pinned_umcg_worker_page)
> + if (1 != pin_user_pages_fast((unsigned long)worker_ut, 1, 0,
> + &tsk->pinned_umcg_worker_page))
> + return -EFAULT;
> +
> + /* server_ut may change, need to repin */
> + if (tsk->pinned_umcg_server_page) {
> + unpin_user_page(tsk->pinned_umcg_server_page);
> + tsk->pinned_umcg_server_page = NULL;
> + }
> +
> + if (1 != pin_user_pages_fast((unsigned long)server_ut, 1, 0,
> + &tsk->pinned_umcg_server_page))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +static void umcg_unpin_pages(void)
> +{
> + struct task_struct *tsk = current;
> +
> + if (tsk->pinned_umcg_worker_page)
> + unpin_user_page(tsk->pinned_umcg_worker_page);
> + if (tsk->pinned_umcg_server_page)
> + unpin_user_page(tsk->pinned_umcg_server_page);
> +
> + tsk->pinned_umcg_worker_page = NULL;
> + tsk->pinned_umcg_server_page = NULL;
> +}
> +
> +static void umcg_clear_task(struct task_struct *tsk)
> +{
> + /*
> + * This is either called for the current task, or for a newly forked
> + * task that is not yet running, so we don't need strict atomicity
> + * below.
> + */
> + if (tsk->umcg_task) {
> + WRITE_ONCE(tsk->umcg_task, NULL);
> +
> + /* These can be simple writes - see the commment above. */
> + tsk->pinned_umcg_worker_page = NULL;
> + tsk->pinned_umcg_server_page = NULL;
> + tsk->flags &= ~PF_UMCG_WORKER;
> + }
> +}
> +
> +/* Called for a forked or execve-ed child. */
> +void umcg_clear_child(struct task_struct *tsk)
> +{
> + umcg_clear_task(tsk);
> +}
> +
> +/* Called both by normally (unregister) and abnormally exiting workers. */
> +void umcg_handle_exiting_worker(void)
> +{
> + umcg_unpin_pages();
> + umcg_clear_task(current);
> +}
> +
> +/**
> + * sys_umcg_ctl: (un)register the current task as a UMCG task.
> + * @flags: ORed values from enum umcg_ctl_flag; see below;
> + * @self: a pointer to struct umcg_task that describes this
> + * task and governs the behavior of sys_umcg_wait if
> + * registering; must be NULL if unregistering.
> + *
> + * @flags & UMCG_CTL_REGISTER: register a UMCG task:
> + * UMCG workers:
> + * - self->state must be UMCG_TASK_IDLE
The code below checks UMCG_TASK_BLOCK, but the comment says UMCG_TASK_IDLE.
Doc says: '+When registering a worker, self->state must be BLOCKED;'
So, the comment here needs to be modified.
> + * - @flags & UMCG_CTL_WORKER
> + * UMCG servers:
> + * - self->state must be UMCG_TASK_RUNNING
> + * - !(@flags & UMCG_CTL_WORKER)
> + *
> + * All tasks:
> + * - self->next_tid must be zero
> + *
> + * If the conditions above are met, sys_umcg_ctl() immediately returns
> + * if the registered task is a server; a worker will be added to
> + * idle_workers_ptr, and the worker put to sleep; an idle server
> + * from idle_server_tid_ptr will be woken, if present.
> + *
> + * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
> + * is a UMCG worker, the userspace is responsible for waking its
> + * server (before or after calling sys_umcg_ctl).
> + *
> + * Return:
> + * 0 - success
> + * -EFAULT - failed to read @self
> + * -EINVAL - some other error occurred
> + */
> +SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self)
> +{
> + struct umcg_task ut;
> +
> + if (flags == UMCG_CTL_UNREGISTER) {
> + if (self || !current->umcg_task)
> + return -EINVAL;
> +
> + if (current->flags & PF_UMCG_WORKER)
> + umcg_handle_exiting_worker();
> + else
> + umcg_clear_task(current);
> +
> + return 0;
> + }
> +
> + /* Register the current task as a UMCG task. */
> + if (!(flags & UMCG_CTL_REGISTER))
> + return -EINVAL;
> +
> + flags &= ~UMCG_CTL_REGISTER;
> + if (flags && flags != UMCG_CTL_WORKER)
> + return -EINVAL;
> +
> + if (current->umcg_task || !self)
> + return -EINVAL;
> +
> + if (copy_from_user(&ut, self, sizeof(ut)))
> + return -EFAULT;
> +
> + if (ut.next_tid)
> + return -EINVAL;
> +
> + if (flags == UMCG_CTL_WORKER) {
> + if (ut.state != UMCG_TASK_BLOCKED)
> + return -EINVAL;
Here is where the code check UMCG_TASK_BLOCKED.
> +
> + WRITE_ONCE(current->umcg_task, self);
> + current->flags |= PF_UMCG_WORKER;
> +
> + set_tsk_need_resched(current);
> + return 0;
> + }
> +
> + /* This is a server task. */
> + if (ut.state != UMCG_TASK_RUNNING)
> + return -EINVAL;
> +
> + WRITE_ONCE(current->umcg_task, self);
> + return 0;
> +}
> +
> +/**
> + * handle_timedout_worker - make sure the worker is added to idle_workers
> + * upon a "clean" timeout.
> + */
> +static int handle_timedout_worker(struct umcg_task __user *self)
> +{
> + u32 prev_state, next_state;
> + int ret;
> +
> + if (get_user(prev_state, &self->state))
> + return -EFAULT;
> +
> + if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE) {
> + /* TODO: should we care here about TF_LOCKED or TF_PREEMPTED? */
> +
> + next_state = prev_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_BLOCKED;
> +
> + ret = cmpxchg_user_32(&self->state, &prev_state, next_state);
> + if (ret)
> + return ret;
> +
> + return -ETIMEDOUT;
> + }
> +
> + return 0; /* Not really timed out. */
> +}
> +
> +/**
> + * umcg_idle_loop - sleep until the current task becomes RUNNING or a timeout
> + * @abs_timeout - absolute timeout in nanoseconds; zero => no timeout
> + *
> + * The function marks the current task as INTERRUPTIBLE and calls
> + * schedule(). It returns when either the timeout expires or
> + * the UMCG state of the task becomes RUNNING.
> + *
> + * Note: because UMCG workers should not be running WITHOUT attached servers,
> + * and because servers should not be running WITH attached workers,
> + * the function returns only on fatal signal pending and ignores/flushes
> + * all other signals.
> + */
> +static int umcg_idle_loop(u64 abs_timeout)
> +{
> + int ret;
> + struct hrtimer_sleeper timeout;
> + struct umcg_task __user *self = current->umcg_task;
> +
> + if (abs_timeout) {
> + hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
> + HRTIMER_MODE_ABS);
> +
> + hrtimer_set_expires_range_ns(&timeout.timer, (s64)abs_timeout,
> + current->timer_slack_ns);
> + }
> +
> + while (true) {
> + u32 umcg_state;
> +
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + smp_mb(); /* Order with set_current_state() above. */
set_current_state() implies the smp_mb(). I don't think about it just
feel it should be put above the set_current_state() to order with
__set_current_state() below.
> + ret = -EFAULT;
> + if (get_user(umcg_state, &self->state)) {
> + set_current_state(TASK_RUNNING);
> + goto out;
> + }
> +
> + ret = 0;
> + if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
> + set_current_state(TASK_RUNNING);
> + goto out;
> + }
> +
> + if (abs_timeout)
> + hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> +
> + if (!abs_timeout || timeout.task) {
> + /*
> + * Clear PF_UMCG_WORKER to elide workqueue handlers.
> + */
> + const bool worker = current->flags & PF_UMCG_WORKER;
> +
> + if (worker)
> + current->flags &= ~PF_UMCG_WORKER;
> +
> + /*
> + * Note: freezable_schedule() here is not appropriate
> + * as umcg_idle_loop can be called from rwsem locking
> + * context (via workqueue handlers), which may
> + * trigger a lockdep warning for mmap_lock.
> + */
> + schedule();
> +
> + if (worker)
> + current->flags |= PF_UMCG_WORKER;
> + }
> + __set_current_state(TASK_RUNNING);
> +
> + /*
> + * Check for timeout before checking the state, as workers
> + * are not going to return from schedule() unless
> + * they are RUNNING.
> + */
> + ret = -ETIMEDOUT;
> + if (abs_timeout && !timeout.task)
> + goto out;
> +
> + ret = -EFAULT;
> + if (get_user(umcg_state, &self->state))
> + goto out;
> +
> + ret = 0;
> + if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING)
> + goto out;
> +
> + ret = -EINTR;
> + if (fatal_signal_pending(current))
> + goto out;
> +
> + if (signal_pending(current))
> + flush_signals(current);
> + }
> +
> +out:
> + if (abs_timeout) {
> + hrtimer_cancel(&timeout.timer);
> + destroy_hrtimer_on_stack(&timeout.timer);
> + }
> +
> + /* Workers must go through workqueue handlers upon wakeup. */
> + if (current->flags & PF_UMCG_WORKER) {
> + if (ret == -ETIMEDOUT)
> + ret = handle_timedout_worker(self);
> +
> + set_tsk_need_resched(current);
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * Try to wake up. May be called with preempt_disable set. May be called
> + * cross-process.
> + *
> + * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
> + * ordering logic.
> + */
> +static int umcg_ttwu(u32 next_tid, int wake_flags)
> +{
> + struct task_struct *next;
> +
> + rcu_read_lock();
> + next = find_task_by_vpid(next_tid);
> + if (!next || !(READ_ONCE(next->umcg_task))) {
> + rcu_read_unlock();
> + return -ESRCH;
> + }
> +
> + /* Note: next does not necessarily share mm with current. */
> +
> + try_to_wake_up(next, TASK_NORMAL, wake_flags); /* Result ignored. */
> + rcu_read_unlock();
> +
> + return 0;
> +}
> +
> +/*
> + * At the moment, umcg_do_context_switch simply wakes up @next with
> + * WF_CURRENT_CPU and puts the current task to sleep. May be called cross-mm.
> + *
> + * In the future an optimization will be added to adjust runtime accounting
> + * so that from the kernel scheduling perspective the two tasks are
> + * essentially treated as one. In addition, the context switch may be performed
> + * right here on the fast path, instead of going through the wake/wait pair.
> + */
> +static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout)
> +{
> + struct task_struct *next;
> +
> + rcu_read_lock();
> + next = find_task_by_vpid(next_tid);
> + if (!next) {
> + rcu_read_unlock();
> + return -ESRCH;
> + }
> +
> + /* Note: next does not necessarily share mm with current. */
> +
> + /* TODO: instead of wake + sleep, do a context switch. */
> + try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU); /* Result ignored. */
> + rcu_read_unlock();
> +
> + return umcg_idle_loop(abs_timeout);
> +}
> +
> +/**
> + * sys_umcg_wait: put the current task to sleep and/or wake another task.
> + * @flags: zero or a value from enum umcg_wait_flag.
> + * @abs_timeout: when to wake the task, in nanoseconds; zero for no timeout.
> + *
> + * @self->state must be UMCG_TASK_IDLE (where @self is current->umcg_task)
> + * if !(@flags & UMCG_WAIT_WAKE_ONLY).
> + *
> + * If @self->next_tid is not zero, it must point to an IDLE UMCG task.
> + * The userspace must have changed its state from IDLE to RUNNING
> + * before calling sys_umcg_wait() in the current task. This "next"
> + * task will be woken (context-switched-to on the fast path) when the
> + * current task is put to sleep.
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for detals.
> + *
> + * Return:
> + * 0 - OK;
> + * -ETIMEDOUT - the timeout expired;
> + * -EFAULT - failed accessing struct umcg_task __user of the current
> + * task;
> + * -ESRCH - the task to wake not found or not a UMCG task;
> + * -EINVAL - another error happened (e.g. bad @flags, or the current
> + * task is not a UMCG task, etc.)
> + */
> +SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout)
> +{
> + struct umcg_task __user *self = current->umcg_task;
> + u32 next_tid;
> +
> + if (!self)
> + return -EINVAL;
> +
> + if (get_user(next_tid, &self->next_tid))
> + return -EFAULT;
> +
> + if (flags & UMCG_WAIT_WAKE_ONLY) {
> + if (!next_tid || abs_timeout)
> + return -EINVAL;
> +
> + flags &= ~UMCG_WAIT_WAKE_ONLY;
> + if (flags & ~UMCG_WAIT_WF_CURRENT_CPU)
> + return -EINVAL;
> +
> + return umcg_ttwu(next_tid, flags & UMCG_WAIT_WF_CURRENT_CPU ?
> + WF_CURRENT_CPU : 0);
> + }
> +
> + /* Unlock the worker, if locked. */
> + if (current->flags & PF_UMCG_WORKER) {
> + u32 umcg_state;
> +
> + if (get_user(umcg_state, &self->state))
> + return -EFAULT;
> +
> + if ((umcg_state & UMCG_TF_LOCKED) && cmpxchg_user_32(
> + &self->state, &umcg_state,
> + umcg_state & ~UMCG_TF_LOCKED))
> + return -EFAULT;
> + }
> +
> + if (next_tid)
> + return umcg_do_context_switch(next_tid, abs_timeout);
> +
> + return umcg_idle_loop(abs_timeout);
> +}
> +
> +/*
> + * NOTE: all code below is called from workqueue submit/update, or
> + * syscall exit to usermode loop, so all errors result in the
> + * termination of the current task (via SIGKILL).
> + */
> +
> +/* Returns true on success, false on _any_ error. */
> +static bool mark_server_running(u32 server_tid, bool may_sleep)
> +{
> + struct umcg_task __user *ut_server = NULL;
> + u32 state = UMCG_TASK_IDLE;
> + struct task_struct *tsk;
> +
> + rcu_read_lock();
> + tsk = find_task_by_vpid(server_tid);
> + if (tsk)
> + ut_server = READ_ONCE(tsk->umcg_task);
> + rcu_read_unlock();
> +
> + if (!ut_server)
> + return false;
> +
> + if (READ_ONCE(current->mm) != READ_ONCE(tsk->mm))
> + return false;
> +
> + if (may_sleep)
> + return !cmpxchg_user_32(&ut_server->state, &state, UMCG_TASK_RUNNING);
> +
> + return !cmpxchg_user_32_nosleep(&ut_server->state, &state, UMCG_TASK_RUNNING);
> +}
> +
> +/*
> + * Called by sched_submit_work() for UMCG workers from within preempt_disable()
> + * context. In the common case, the worker's state changes RUNNING => BLOCKED,
> + * and its server's state changes IDLE => RUNNING, and the server is ttwu-ed.
> + *
> + * Under some conditions (e.g. the worker is "locked", see
> + * /Documentation/userspace-api/umcg.[txt|rst] for more details), the
> + * function does nothing.
> + */
> +static void __umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> + struct umcg_task __user *ut_worker = tsk->umcg_task;
> + u32 prev_state, next_state, server_tid;
> + bool preempted = false;
> + int ret;
> +
> + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
> + return;
> +
> + /* Sometimes "locked" workers run without servers. */
> + if (unlikely(!tsk->pinned_umcg_server_page))
> + return;
> +
> + smp_mb(); /* The userspace may change the state concurrently. */
> + if (get_user_nosleep(prev_state, &ut_worker->state))
> + goto die; /* EFAULT */
> +
> + if (prev_state & UMCG_TF_LOCKED)
> + return;
> +
> + if ((prev_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING)
> + return; /* the worker is in umcg_wait */
> +
> +retry_once:
> + next_state = prev_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_BLOCKED;
> + preempted = prev_state & UMCG_TF_PREEMPTED;
> +
> + ret = cmpxchg_user_32_nosleep(&ut_worker->state, &prev_state, next_state);
> + if (ret == -EAGAIN) {
> + if (preempted)
> + goto die; /* Preemption can only happen once. */
> +
> + if (prev_state != (UMCG_TASK_RUNNING | UMCG_TF_PREEMPTED))
> + goto die; /* Only preemption can happen. */
> +
> + preempted = true;
> + goto retry_once;
> + }
> + if (ret)
> + goto die; /* EFAULT */
> +
> + if (get_user_nosleep(server_tid, &ut_worker->next_tid))
> + goto die; /* EFAULT */
> +
> + if (!server_tid)
> + return; /* Waking a waiting worker leads here. */
> +
> + /* The idle server's wait may timeout. */
> + /* TODO: make a smarter context switch below when available. */
> + if (mark_server_running(server_tid, false))
> + umcg_ttwu(server_tid, WF_CURRENT_CPU);
> +
> + return;
> +
> +die:
> + pr_warn("umcg_wq_worker_sleeping: killing task %d\n", current->pid);
> + force_sig(SIGKILL);
> +}
> +
> +/* Called from sched_submit_work() with preempt_disable. */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> + __umcg_wq_worker_sleeping(tsk);
> + umcg_unpin_pages();
> +}
> +
> +/**
> + * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
This function attract idot like me enough.
> + * Returns true on success, false on a fatal failure.
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for details.
> + */
> +static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
> +{
> + u64 __user *node = &ut_worker->idle_workers_ptr;
> + u64 __user *head_ptr;
> + u64 first = (u64)node;
> + u64 head;
> +
> + if (get_user(head, node) || !head)
> + return false;
> +
> + head_ptr = (u64 __user *)head;
> +
> + if (put_user(UMCG_IDLE_NODE_PENDING, node))
> + return false;
> +
> + if (xchg_user_64(head_ptr, &first))
> + return false;
> +
> + if (put_user(first, node))
> + return false;
> +
> + return true;
> +}
> +
> +/**
> + * get_idle_server - retrieve an idle server, if present.
> + *
> + * Returns true on success, false on a fatal failure.
> + */
> +static bool get_idle_server(struct umcg_task __user *ut_worker, u32 *server_tid)
> +{
> + u64 server_tid_ptr;
> + u32 tid;
> + int ret;
> +
> + *server_tid = 0; /* Empty result is OK. */
> +
> + if (get_user(server_tid_ptr, &ut_worker->idle_server_tid_ptr))
> + return false;
> +
> + if (!server_tid_ptr)
> + return false;
> +
> + tid = 0;
> + ret = xchg_user_32((u32 __user *)server_tid_ptr, &tid);
> +
> + if (ret)
> + return false;
> +
> + if (tid && mark_server_running(tid, true))
> + *server_tid = tid;
> +
> + return true;
> +}
> +
> +/*
> + * Returns true to wait for the userspace to schedule this worker, false
> + * to return to the userspace. In the common case, enqueues the worker
> + * to idle_workers_ptr list and wakes the idle server (if present).
> + */
> +static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid)
> +{
> + struct umcg_task __user *ut_worker = tsk->umcg_task;
> + u32 prev_state, next_state;
> + int ret = 0;
> +
> + *server_tid = 0;
> +
> + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
> + return false;
> +
> + if (fatal_signal_pending(tsk))
> + return false;
> +
> + smp_mb(); /* The userspace may concurrently modify the worker's state. */
> + if (get_user(prev_state, &ut_worker->state))
> + goto die;
> +
> + if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
> + u32 tid;
> +
> + if (prev_state & UMCG_TF_LOCKED)
> + return true; /* Wakeup: wait but don't enqueue. */
> +
> + smp_mb(); /* Order getting state and getting server_tid */
> +
> + if (get_user(tid, &ut_worker->next_tid))
> + goto die;
> +
> + *server_tid = tid;
> +
> + if (prev_state & UMCG_TF_PREEMPTED) {
> + if (!tid)
> + goto die; /* PREEMPTED workers must have a server. */
> +
> + /* Always enqueue preempted workers. */
> + if (!mark_server_running(tid, true))
> + goto die;
> + } else if (tid)
> + return false; /* pass-through: RUNNING with a server. */
> +
> + /* If !PREEMPTED, the worker gets here via UMCG_WAIT_WAKE_ONLY */
> + } else if (unlikely((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
> + (prev_state & UMCG_TF_LOCKED)))
> + return false; /* The worker prepares to sleep or to unregister. */
> +
> + if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE)
> + return true; /* the worker called umcg_wait(); don't enqueue */
> +
> + next_state = prev_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_IDLE;
> +
> + if (prev_state != next_state)
> + ret = cmpxchg_user_32(&ut_worker->state, &prev_state, next_state);
> + if (ret)
> + goto die;
> +
> + if (!enqueue_idle_worker(ut_worker))
> + goto die;
> +
> + smp_mb(); /* Order enqueuing the worker with getting the server. */
> + if (!(*server_tid) && !get_idle_server(ut_worker, server_tid))
> + goto die;
> +
> + return true;
> +
> +die:
> + pr_warn("umcg_process_waking_worker: killing task %d\n", current->pid);
> + force_sig(SIGKILL);
> + return false;
> +}
> +
> +/*
> + * Called from sched_update_worker(): defer all work until later, as
> + * sched_update_worker() may be called with in-kernel locks held.
> + */
> +void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> + set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
> +}
> +
> +/* Called via TIF_NOTIFY_RESUME flag from exit_to_user_mode_loop. */
> +void umcg_handle_resuming_worker(void)
> +{
> + u32 server_tid;
> +
> + /* Avoid recursion by removing PF_UMCG_WORKER */
> + current->flags &= ~PF_UMCG_WORKER;
> +
> + do {
> + bool should_wait;
> +
> + should_wait = process_waking_worker(current, &server_tid);
> +
> + if (!should_wait)
> + break;
> +
> + if (server_tid)
> + umcg_do_context_switch(server_tid, 0);
> + else
> + umcg_idle_loop(0);
> + } while (true);
> +
> + if (server_tid && umcg_pin_pages(server_tid))
> + goto die;
> +
> + if (!server_tid) /* No server => no reason to pin pages. */
> + umcg_unpin_pages();
> +
> + goto out;
> +
> +die:
> + pr_warn("%s: killing task %d\n", __func__, current->pid);
> + force_sig(SIGKILL);
> +out:
> + current->flags |= PF_UMCG_WORKER;
> +}
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 0ea8128468c3..cd1be6356e42 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -272,6 +272,10 @@ COND_SYSCALL(landlock_create_ruleset);
> COND_SYSCALL(landlock_add_rule);
> COND_SYSCALL(landlock_restrict_self);
>
> +/* kernel/sched/umcg.c */
> +COND_SYSCALL(umcg_ctl);
> +COND_SYSCALL(umcg_wait);
> +
> /* arch/example/kernel/sys_example.c */
>
> /* mm/fadvise.c */
> --
> 2.25.1
>
Thanks,
Tao
On 2021-09-17 2:03 p.m., Peter Oskolkov wrote:
> [...]
> +SYS_UMCG_WAIT()
> +
> +int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout) operates on
> +registered UMCG servers and workers: struct umcg_task *self provided to
> +sys_umcg_ctl() when registering the current task is consulted in
addition
> +to flags and abs_timeout parameters.
> +
> +The function can be used to perform one of the three operations:
> +
> +* wait: if self->next_tid is zero, sys_umcg_wait() puts the current
> + server or worker to sleep;
I believe this description is misleading but I might be wrong.
From the example
* worker to server context switch (worker "yields"):
S:IDLE+W:RUNNING => +S:RUNNING+W:IDLE
It seems to me that when a worker goes from running to idle, it should
*not* set the next_tid to 0, it should preserve the next_tid as-is,
which is expected to point to its current server. This is consistent
with my understanding of the umcg_wait implementation. This operation
is effectively a direct context-switch to the server.
With that said, I'm a little confused by the usage of "yields" in that
example. I would expect workers yielding to behave like kernel threads
calling sched_yield(), i.e., context switch to the server but also be
immediately added to the idle_workers_ptr.
From my understanding of the umcg_wait call, "worker to server context
switch" does not have analogous behaviour to sched_yield. Am I correct?
If so, I suggest using "park" instead of "yield" in the description
of that example. I believe the naming of wait/wake as park/unpark is
consistent with Java[1] and Rust[2], but I don't know if that naming
is used in contexts closer to the linux kernel.
[1]
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/LockSupport.html
[2] https://doc.rust-lang.org/std/thread/fn.park.html
Peter,
On Fri, Sep 17 2021 at 11:03, Peter Oskolkov wrote:
> Define struct umcg_task and two syscalls: sys_umcg_ctl sys_umcg_wait.
>
> All key operations, such as wait/wake/context-switch, as well as
> timeouts and block/wake detection, are working quite reliably.
>
> In addition, the userspace can now force the kernel to preempt
> a running worker by changing its state from RUNNING to
> RUNNING | PREEMPTED and sending a signal to it. This new functionality
> is less well tested than the key operations above, but is working
> well in the common case of the worker busy in the userspace.
>
> These big things remain to be addressed (in no particular order):
> - tracing/debugging
> - faster context switches (see umcg_do_context_switch in umcg.c)
> - other architectures (we will need at least arm64 in addition to amd64)
> - tools/lib/umcg for userspace
> - kselftests
>
> I'm working on finalizing libumcg and kselftests.
>
> See Documentation/userspace-api/umcg.[txt|rst] for API usage and
> other details.
The above is a work log and a todo list, but not a change log.
Change logs have to explain the what and especially the why and for new
concepts also the design and the rationale behind it.
And no, links to random discussions are not a replacement for that. It's
not the job of a reviewer to dig for that information. It's the task of
the submitter to provide that information so the reviewer can digest it.
> v0.5->v0.6 changes:
> - umcg_task pages are now pinned for RUNNING workers;
What's the rationale for that? Why is that needed and desired?
> v0.2->v0.3 changes:
> - the overall approach is now based on peterz@'s suggestion in
> https://lore.kernel.org/patchwork/cover/1433967/
All of these lore.kernel/org/patchwork/* links resolve to 404. Please
use proper lore.kernel.org/r/$MSGID links
> /*
> +
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -173,6 +173,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>
> if (ti_work & _TIF_NOTIFY_RESUME) {
> tracehook_notify_resume(regs);
> + umcg_handle_notify_resume(); /* might sleep */
Please do not use tail comments. They are horrible and disturb the
reading flow.
Aside of that this 'might sleep' info is not really interesting. All
functions which are invoked in exit_to_user_mode_loop() can sleep, so
what's special about this one?
> diff --git a/kernel/exit.c b/kernel/exit.c
> index fd1c04193e18..fdd4e923cca9 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -744,6 +744,8 @@ void __noreturn do_exit(long code)
> if (unlikely(!tsk->pid))
> panic("Attempted to kill the idle task!");
>
> + umcg_handle_exit();
Why is this invoked _before_ the oops fixup? How is that correct in the
oops case?
> index 12a9d053e724..c9133cf153b9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4159,6 +4159,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> p->wake_entry.u_flags = CSD_TYPE_TTWU;
> p->migration_pending = NULL;
> #endif
> +#ifdef CONFIG_UMCG
> + umcg_clear_child(p);
> +#endif
Can you please provide stub functions for the CONFIG_UMCG=n case instead
of sprinkling #ifdefs all over the place?
> DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> @@ -6105,10 +6108,14 @@ static inline void sched_submit_work(struct task_struct *tsk)
> * in the possible wakeup of a kworker and because wq_worker_sleeping()
> * requires it.
> */
> - if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> + if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
> preempt_disable();
> if (task_flags & PF_WQ_WORKER)
> wq_worker_sleeping(tsk);
> +#ifdef CONFIG_UMCG
> + else if (task_flags & PF_UMCG_WORKER)
> + umcg_wq_worker_sleeping(tsk);
> +#endif
This #ifdeffery can be completely avoided:
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
#ifdef CONFIG_UMCG
# define PF_UMCG_WORKER 0x01000000 /* UMCG worker */
#else
# define PF_UMCG_WORKER 0x00000000
#endif
plus a stub function for umcg_wq_worker_sleeping() and for UMCG=n the whole muck is
compiled out, the flags test does not care about PF_UMCG_WORKER....
But aside of that, why has umcg_wq_worker_sleeping() to run with
preemption disabled?
> +/*
> + * Called by sched_submit_work() for UMCG workers from within preempt_disable()
> + * context. In the common case, the worker's state changes RUNNING => BLOCKED,
> + * and its server's state changes IDLE => RUNNING, and the server is ttwu-ed.
> + *
> + * Under some conditions (e.g. the worker is "locked", see
> + * /Documentation/userspace-api/umcg.[txt|rst] for more details), the
> + * function does nothing.
> + */
> +static void __umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> + struct umcg_task __user *ut_worker = tsk->umcg_task;
> + u32 prev_state, next_state, server_tid;
> + bool preempted = false;
> + int ret;
> +
> + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
> + return;
> +
> + /* Sometimes "locked" workers run without servers. */
Sometimes the sun shines... Can you please add understandable comments?
What's wrong with:
/* Nothing to do for workers which are not attached to a server */
or something like that which explains nicely what this is about.
> + if (unlikely(!tsk->pinned_umcg_server_page))
and the comment is needed because tsk->pinned_umcg_server_page does not
make it obvious what the test is about while
if (unlikely(!ut_worker_has_server(tsk)))
or
if (unlikely(ut_worker_detached(tsk)))
would be self explanatory.
> + return;
> +
> + smp_mb(); /* The userspace may change the state concurrently. */
No tail comments please.
Also this does not explain why this needs to be a full barrier. All
barriers have to come with a proper explanation what they are
serializing against and what the counter part is even if the counter
part is in user space. And that documentation wants to be in the code
and not somewhere else.
> + if (get_user_nosleep(prev_state, &ut_worker->state))
> + goto die; /* EFAULT */
This /* EFAULT */ comment documents the obvious. Can you please document
the non-obvious things properly?
> +
> + if (prev_state & UMCG_TF_LOCKED)
> + return;
> +
> + if ((prev_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING)
> + return; /* the worker is in umcg_wait */
So if the worker is in umcg_wait() then why is it reaching this place at
all? The current task surely knows that it comes here from umcg_wait(),
right?
> +retry_once:
> + next_state = prev_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_BLOCKED;
> + preempted = prev_state & UMCG_TF_PREEMPTED;
> +
> + ret = cmpxchg_user_32_nosleep(&ut_worker->state, &prev_state, next_state);
> + if (ret == -EAGAIN) {
> + if (preempted)
> + goto die; /* Preemption can only happen once. */
> +
> + if (prev_state != (UMCG_TASK_RUNNING | UMCG_TF_PREEMPTED))
This check falls flat on it's nose when one of the user space bits
(24-31) is set in prev_state. prev_state is a misnomer anyway. The usual
convention is to use cur_state because that's what it is up to the point
where the cmpxchg succeeds.
> + goto die; /* Only preemption can happen. */
> +
> + preempted = true;
How is this supposed to do anything? This goes back to retry_once which
overwrites preempted...
> + goto retry_once;
> + }
> + if (ret)
> + goto die; /* EFAULT */
> +
> + if (get_user_nosleep(server_tid, &ut_worker->next_tid))
> + goto die; /* EFAULT */
> +
> + if (!server_tid)
> + return; /* Waking a waiting worker leads here. */
I have no idea what that comment is trying to say.
> + /* The idle server's wait may timeout. */
> + /* TODO: make a smarter context switch below when available. */
Neither those make any sense to me.
> + if (mark_server_running(server_tid, false))
> + umcg_ttwu(server_tid, WF_CURRENT_CPU);
Well, after looking at both functions I can see why this wants to be
smarter. Doing the vpid lookup twice instead of once is certainly not
the smartest solution.
> + return;
> +
> +die:
> + pr_warn("umcg_wq_worker_sleeping: killing task %d\n", current->pid);
> + force_sig(SIGKILL);
> +}
> +
> +/* Called from sched_submit_work() with preempt_disable. */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> + __umcg_wq_worker_sleeping(tsk);
> + umcg_unpin_pages();
Coming back to my previous question: Why has all of this to run with
preemption disabled?
There is absolutely no reason to do that and just because you picked a
place to invoke that with preemption disabled does not count.
In fact none of the existing two functions require to be invoked with
preemption disabled and I'm going to send out a patch which removes that
historic leftover.
And if this is not called with preemption disabled then none of these
misnomed _nosleep() accessors are needed and the code can be simplified.
> + * Try to wake up. May be called with preempt_disable set. May be called
> + * cross-process.
> + *
> + * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
> + * ordering logic.
> + */
> +static int umcg_ttwu(u32 next_tid, int wake_flags)
> +{
> + struct task_struct *next;
> +
> + rcu_read_lock();
> + next = find_task_by_vpid(next_tid);
> + if (!next || !(READ_ONCE(next->umcg_task))) {
> + rcu_read_unlock();
> + return -ESRCH;
> + }
> +
> + /* Note: next does not necessarily share mm with current. */
Which is irrelevant, but what's relevant is that there is absolutely no
sanity check of next_tid. So this just wakes up what ever TID user space
writes into the user space task memory. Anything copmeing from user
space cannot be trusted and needs to be validated. find_task_by_vpid()
is not sufficient for that.
> + try_to_wake_up(next, TASK_NORMAL, wake_flags); /* Result ignored. */
> + rcu_read_unlock();
> +
> + return 0;
> +}
> +/* Returns true on success, false on _any_ error. */
> +static bool mark_server_running(u32 server_tid, bool may_sleep)
> +{
> + struct umcg_task __user *ut_server = NULL;
> + u32 state = UMCG_TASK_IDLE;
> + struct task_struct *tsk;
> +
> + rcu_read_lock();
> + tsk = find_task_by_vpid(server_tid);
> + if (tsk)
> + ut_server = READ_ONCE(tsk->umcg_task);
> + rcu_read_unlock();
> +
> + if (!ut_server)
> + return false;
> +
> + if (READ_ONCE(current->mm) != READ_ONCE(tsk->mm))
> + return false;
This is broken because it's outside the rcu read side critical section
so @tsk can be gone already. Also this lacks a check whether @tsk is a
kthread because kthreads can use a user mm temporarily.
Also what's the purpose of these undocumented READ_ONCE() instances?
Thanks,
tglx
Hi Thomas,
Thanks a lot for the review! I will not comment here on your
suggestions that I fully understand how to address. Please see my
comments below on issues that need more clarification.
Thanks,
Peter
On Tue, Sep 28, 2021 at 2:21 AM Thomas Gleixner <[email protected]> wrote:
>
> Peter,
>
> On Fri, Sep 17 2021 at 11:03, Peter Oskolkov wrote:
>
> > Define struct umcg_task and two syscalls: sys_umcg_ctl sys_umcg_wait.
> >
> > All key operations, such as wait/wake/context-switch, as well as
> > timeouts and block/wake detection, are working quite reliably.
> >
> > In addition, the userspace can now force the kernel to preempt
> > a running worker by changing its state from RUNNING to
> > RUNNING | PREEMPTED and sending a signal to it. This new functionality
> > is less well tested than the key operations above, but is working
> > well in the common case of the worker busy in the userspace.
> >
> > These big things remain to be addressed (in no particular order):
> > - tracing/debugging
> > - faster context switches (see umcg_do_context_switch in umcg.c)
> > - other architectures (we will need at least arm64 in addition to amd64)
> > - tools/lib/umcg for userspace
> > - kselftests
> >
> > I'm working on finalizing libumcg and kselftests.
> >
> > See Documentation/userspace-api/umcg.[txt|rst] for API usage and
> > other details.
>
> The above is a work log and a todo list, but not a change log.
>
> Change logs have to explain the what and especially the why and for new
> concepts also the design and the rationale behind it.
>
> And no, links to random discussions are not a replacement for that. It's
> not the job of a reviewer to dig for that information. It's the task of
> the submitter to provide that information so the reviewer can digest it.
>
> > v0.5->v0.6 changes:
> > - umcg_task pages are now pinned for RUNNING workers;
>
> What's the rationale for that? Why is that needed and desired?
This was discussed in detail here:
https://lore.kernel.org/lkml/[email protected]/,
for example: https://lore.kernel.org/lkml/[email protected]/
In short, workqueue functions (sched_submit_work, sched_update_worker)
can be called with mm lock held, and so we cannot fixup pagefaults
inline; so we need to pin struct umcg_task pages temporarily. On the
wakeup path I elide this by setting TF_NOTIFY_RESUME flag and doing
everything later in return-to-usermode-loop in a sleepable context; on
the sched_submit_work (going to sleep) path, we need to access the
userspace to wake up the server, but this is a "nosleep" context, so
the pages have to be pinned.
>
> > v0.2->v0.3 changes:
> > - the overall approach is now based on peterz@'s suggestion in
> > https://lore.kernel.org/patchwork/cover/1433967/
>
> All of these lore.kernel/org/patchwork/* links resolve to 404. Please
> use proper lore.kernel.org/r/$MSGID links
>
> > /*
> > +
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -173,6 +173,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> >
> > if (ti_work & _TIF_NOTIFY_RESUME) {
> > tracehook_notify_resume(regs);
> > + umcg_handle_notify_resume(); /* might sleep */
>
> Please do not use tail comments. They are horrible and disturb the
> reading flow.
>
> Aside of that this 'might sleep' info is not really interesting. All
> functions which are invoked in exit_to_user_mode_loop() can sleep, so
> what's special about this one?
>
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index fd1c04193e18..fdd4e923cca9 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -744,6 +744,8 @@ void __noreturn do_exit(long code)
> > if (unlikely(!tsk->pid))
> > panic("Attempted to kill the idle task!");
> >
> > + umcg_handle_exit();
>
> Why is this invoked _before_ the oops fixup? How is that correct in the
> oops case?
umcg_handle_exit() is just unpinning pinned pages, assigning NULL to a
couple of pointers, and clearing PF_UMCG_WORKER flag, so I assumed
doing it as early as possible is the best way to avoid unnecessarily
triggering workqueue callbacks. Is my logic wrong? (UMCG tasks should
unregister before exiting; if they do, umcg_handle_exit() is a NOOP;
but if they did not unregister, umcg_handle_exit() just makes sure no
pinned pages are left over).
>
> > index 12a9d053e724..c9133cf153b9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4159,6 +4159,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > p->wake_entry.u_flags = CSD_TYPE_TTWU;
> > p->migration_pending = NULL;
> > #endif
> > +#ifdef CONFIG_UMCG
> > + umcg_clear_child(p);
> > +#endif
>
> Can you please provide stub functions for the CONFIG_UMCG=n case instead
> of sprinkling #ifdefs all over the place?
Will do.
>
> > DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> > @@ -6105,10 +6108,14 @@ static inline void sched_submit_work(struct task_struct *tsk)
> > * in the possible wakeup of a kworker and because wq_worker_sleeping()
> > * requires it.
> > */
> > - if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> > + if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
> > preempt_disable();
> > if (task_flags & PF_WQ_WORKER)
> > wq_worker_sleeping(tsk);
> > +#ifdef CONFIG_UMCG
> > + else if (task_flags & PF_UMCG_WORKER)
> > + umcg_wq_worker_sleeping(tsk);
> > +#endif
>
> This #ifdeffery can be completely avoided:
>
> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> #ifdef CONFIG_UMCG
> # define PF_UMCG_WORKER 0x01000000 /* UMCG worker */
> #else
> # define PF_UMCG_WORKER 0x00000000
> #endif
>
> plus a stub function for umcg_wq_worker_sleeping() and for UMCG=n the whole muck is
> compiled out, the flags test does not care about PF_UMCG_WORKER....
>
> But aside of that, why has umcg_wq_worker_sleeping() to run with
> preemption disabled?
>
> > +/*
> > + * Called by sched_submit_work() for UMCG workers from within preempt_disable()
> > + * context. In the common case, the worker's state changes RUNNING => BLOCKED,
> > + * and its server's state changes IDLE => RUNNING, and the server is ttwu-ed.
> > + *
> > + * Under some conditions (e.g. the worker is "locked", see
> > + * /Documentation/userspace-api/umcg.[txt|rst] for more details), the
> > + * function does nothing.
> > + */
> > +static void __umcg_wq_worker_sleeping(struct task_struct *tsk)
> > +{
> > + struct umcg_task __user *ut_worker = tsk->umcg_task;
> > + u32 prev_state, next_state, server_tid;
> > + bool preempted = false;
> > + int ret;
> > +
> > + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
> > + return;
> > +
> > + /* Sometimes "locked" workers run without servers. */
>
> Sometimes the sun shines... Can you please add understandable comments?
> What's wrong with:
>
> /* Nothing to do for workers which are not attached to a server */
>
> or something like that which explains nicely what this is about.
>
> > + if (unlikely(!tsk->pinned_umcg_server_page))
>
> and the comment is needed because tsk->pinned_umcg_server_page does not
> make it obvious what the test is about while
>
> if (unlikely(!ut_worker_has_server(tsk)))
>
> or
>
> if (unlikely(ut_worker_detached(tsk)))
>
> would be self explanatory.
>
> > + return;
> > +
> > + smp_mb(); /* The userspace may change the state concurrently. */
>
> No tail comments please.
>
> Also this does not explain why this needs to be a full barrier. All
> barriers have to come with a proper explanation what they are
> serializing against and what the counter part is even if the counter
> part is in user space. And that documentation wants to be in the code
> and not somewhere else.
>
> > + if (get_user_nosleep(prev_state, &ut_worker->state))
> > + goto die; /* EFAULT */
>
> This /* EFAULT */ comment documents the obvious. Can you please document
> the non-obvious things properly?
>
> > +
> > + if (prev_state & UMCG_TF_LOCKED)
> > + return;
> > +
> > + if ((prev_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING)
> > + return; /* the worker is in umcg_wait */
>
> So if the worker is in umcg_wait() then why is it reaching this place at
> all? The current task surely knows that it comes here from umcg_wait(),
> right?
This place is on the "worker goes to sleep" path (sched_submit_work).
umcg_wait() puts the worker to sleep, so this place is triggered. I
can remove PF_UMCG_WORKER in umcg_wait() to elide this if you think
this would be better. I'll try that in the next patchset version.
>
> > +retry_once:
> > + next_state = prev_state & ~UMCG_TASK_STATE_MASK;
> > + next_state |= UMCG_TASK_BLOCKED;
> > + preempted = prev_state & UMCG_TF_PREEMPTED;
> > +
> > + ret = cmpxchg_user_32_nosleep(&ut_worker->state, &prev_state, next_state);
> > + if (ret == -EAGAIN) {
> > + if (preempted)
> > + goto die; /* Preemption can only happen once. */
> > +
> > + if (prev_state != (UMCG_TASK_RUNNING | UMCG_TF_PREEMPTED))
>
> This check falls flat on it's nose when one of the user space bits
> (24-31) is set in prev_state. prev_state is a misnomer anyway. The usual
> convention is to use cur_state because that's what it is up to the point
> where the cmpxchg succeeds.
Right, will fix.
>
> > + goto die; /* Only preemption can happen. */
> > +
> > + preempted = true;
>
> How is this supposed to do anything? This goes back to retry_once which
> overwrites preempted...
Hmm, yea, will fix. The idea is that we retry here only once, and only
if we retry due to UMCG_TF_PREEMPTED getting set.
>
> > + goto retry_once;
> > + }
>
> > + if (ret)
> > + goto die; /* EFAULT */
> > +
> > + if (get_user_nosleep(server_tid, &ut_worker->next_tid))
> > + goto die; /* EFAULT */
> > +
> > + if (!server_tid)
> > + return; /* Waking a waiting worker leads here. */
>
> I have no idea what that comment is trying to say.
>
> > + /* The idle server's wait may timeout. */
> > + /* TODO: make a smarter context switch below when available. */
>
> Neither those make any sense to me.
>
> > + if (mark_server_running(server_tid, false))
> > + umcg_ttwu(server_tid, WF_CURRENT_CPU);
>
> Well, after looking at both functions I can see why this wants to be
> smarter. Doing the vpid lookup twice instead of once is certainly not
> the smartest solution.
I'll do some refactoring to avoid extra vpid lookups. The comment,
though, hints at a "smarter context switch" that is more than just
"ttwu the server and put the worker to sleep".
>
> > + return;
> > +
> > +die:
> > + pr_warn("umcg_wq_worker_sleeping: killing task %d\n", current->pid);
> > + force_sig(SIGKILL);
> > +}
> > +
> > +/* Called from sched_submit_work() with preempt_disable. */
> > +void umcg_wq_worker_sleeping(struct task_struct *tsk)
> > +{
> > + __umcg_wq_worker_sleeping(tsk);
> > + umcg_unpin_pages();
>
> Coming back to my previous question: Why has all of this to run with
> preemption disabled?
>
> There is absolutely no reason to do that and just because you picked a
> place to invoke that with preemption disabled does not count.
>
> In fact none of the existing two functions require to be invoked with
> preemption disabled and I'm going to send out a patch which removes that
> historic leftover.
I've seen your patch - thanks!
>
> And if this is not called with preemption disabled then none of these
> misnomed _nosleep() accessors are needed and the code can be simplified.
We still need to work with pinned pages, as described above - mm lock
can be held here. How would you prefer these accessors to be named
other than "_nosleep()"? Maybe "_nofixup()"?
>
> > + * Try to wake up. May be called with preempt_disable set. May be called
> > + * cross-process.
> > + *
> > + * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
> > + * ordering logic.
> > + */
> > +static int umcg_ttwu(u32 next_tid, int wake_flags)
> > +{
> > + struct task_struct *next;
> > +
> > + rcu_read_lock();
> > + next = find_task_by_vpid(next_tid);
> > + if (!next || !(READ_ONCE(next->umcg_task))) {
> > + rcu_read_unlock();
> > + return -ESRCH;
> > + }
> > +
> > + /* Note: next does not necessarily share mm with current. */
>
> Which is irrelevant, but what's relevant is that there is absolutely no
> sanity check of next_tid. So this just wakes up what ever TID user space
> writes into the user space task memory. Anything copmeing from user
> space cannot be trusted and needs to be validated. find_task_by_vpid()
> is not sufficient for that.
Well, next_tid must point to a task registered with UMCG, and this is
checked above. I'm not sure what other validation is appropriate here.
What kind of sanity checks are needed? The worst thing that can happen
is a spurious wakeup, which is supposedly something that can happen
anyway and is not a major concern?
>
> > + try_to_wake_up(next, TASK_NORMAL, wake_flags); /* Result ignored. */
> > + rcu_read_unlock();
> > +
> > + return 0;
> > +}
>
>
> > +/* Returns true on success, false on _any_ error. */
> > +static bool mark_server_running(u32 server_tid, bool may_sleep)
> > +{
> > + struct umcg_task __user *ut_server = NULL;
> > + u32 state = UMCG_TASK_IDLE;
> > + struct task_struct *tsk;
> > +
> > + rcu_read_lock();
> > + tsk = find_task_by_vpid(server_tid);
> > + if (tsk)
> > + ut_server = READ_ONCE(tsk->umcg_task);
> > + rcu_read_unlock();
> > +
> > + if (!ut_server)
> > + return false;
> > +
> > + if (READ_ONCE(current->mm) != READ_ONCE(tsk->mm))
> > + return false;
>
> This is broken because it's outside the rcu read side critical section
> so @tsk can be gone already. Also this lacks a check whether @tsk is a
> kthread because kthreads can use a user mm temporarily.
I don't think kthreads can have tsk->umcg_task set, so the function
will return earlier. I'll move the check under rcu.
>
> Also what's the purpose of these undocumented READ_ONCE() instances?
While ttwu can be called cross-mm, anything more involved, such as
changing the server's UMCG state, requires the remote task to belong
to the same process, and this is what the check validates. I'll add
that comment to the code.
>
> Thanks,
>
> tglx
Thanks,
Peter
Peter,
On Tue, Sep 28 2021 at 10:26, Peter Oskolkov wrote:
> On Tue, Sep 28, 2021 at 2:21 AM Thomas Gleixner <[email protected]> wrote:
>> On Fri, Sep 17 2021 at 11:03, Peter Oskolkov wrote:
>>
>> What's the rationale for that? Why is that needed and desired?
>
> In short, workqueue functions (sched_submit_work, sched_update_worker)
> can be called with mm lock held, and so we cannot fixup pagefaults
> inline; so we need to pin struct umcg_task pages temporarily. On the
> wakeup path I elide this by setting TF_NOTIFY_RESUME flag and doing
> everything later in return-to-usermode-loop in a sleepable context; on
> the sched_submit_work (going to sleep) path, we need to access the
> userspace to wake up the server, but this is a "nosleep" context, so
> the pages have to be pinned.
Fair enough. That's the kind of information which belongs into the
change log. But this wants also be documented in the code, e.g. in the
comment above umcg_pin_pages(). You'll be happy to have it there when
you are forced to stare at that code 3 month after it got merged.
>> > diff --git a/kernel/exit.c b/kernel/exit.c
>> > index fd1c04193e18..fdd4e923cca9 100644
>> > --- a/kernel/exit.c
>> > +++ b/kernel/exit.c
>> > @@ -744,6 +744,8 @@ void __noreturn do_exit(long code)
>> > if (unlikely(!tsk->pid))
>> > panic("Attempted to kill the idle task!");
>> >
>> > + umcg_handle_exit();
>>
>> Why is this invoked _before_ the oops fixup? How is that correct in the
>> oops case?
>
> umcg_handle_exit() is just unpinning pinned pages, assigning NULL to a
> couple of pointers, and clearing PF_UMCG_WORKER flag, so I assumed
> doing it as early as possible is the best way to avoid unnecessarily
> triggering workqueue callbacks. Is my logic wrong? (UMCG tasks should
> unregister before exiting; if they do, umcg_handle_exit() is a NOOP;
> but if they did not unregister, umcg_handle_exit() just makes sure no
> pinned pages are left over).
Look at the comment right below where you placed that call.
"Just unpinning" calls into a boatload of code in mm and you really want
to do that _after_ all sanity checks and fixups and before exit_mm().
It really does not matter much where exactly and I have no idea which
workqueue callbacks you are worried about. The code after the fixup is
fully preemptible (in theory).
>> > +
>> > + if (prev_state & UMCG_TF_LOCKED)
>> > + return;
>> > +
>> > + if ((prev_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING)
>> > + return; /* the worker is in umcg_wait */
>>
>> So if the worker is in umcg_wait() then why is it reaching this place at
>> all? The current task surely knows that it comes here from umcg_wait(),
>> right?
>
> This place is on the "worker goes to sleep" path (sched_submit_work).
> umcg_wait() puts the worker to sleep, so this place is triggered. I
> can remove PF_UMCG_WORKER in umcg_wait() to elide this if you think
> this would be better. I'll try that in the next patchset version.
I was just wondering about that as it seemed strange to me that a task
which invoked umcg_wait() has to inspect it's own user space state to
figure out that it comes from umcg_wait(). There might be a reason for
this, but if not then avoiding this pointless exercise completely is
definitely not the worst idea.
>> > + /* The idle server's wait may timeout. */
>> > + /* TODO: make a smarter context switch below when available. */
>>
>> Neither those make any sense to me.
>>
>> > + if (mark_server_running(server_tid, false))
>> > + umcg_ttwu(server_tid, WF_CURRENT_CPU);
>>
>> Well, after looking at both functions I can see why this wants to be
>> smarter. Doing the vpid lookup twice instead of once is certainly not
>> the smartest solution.
>
> I'll do some refactoring to avoid extra vpid lookups. The comment,
> though, hints at a "smarter context switch" that is more than just
> "ttwu the server and put the worker to sleep".
These TODOs are not really helpful at the moment. Let's get the straight
forward version to work first. Enhancing this is an orthogonal issue.
>> And if this is not called with preemption disabled then none of these
>> misnomed _nosleep() accessors are needed and the code can be simplified.
>
> We still need to work with pinned pages, as described above - mm lock
> can be held here. How would you prefer these accessors to be named
> other than "_nosleep()"? Maybe "_nofixup()"?
nofixup() would be weird because the actual access uses an exception
fixup :)
If you look at mm/maccess.c you might notice that we already have such
functions, e.g. copy_from_user_nofault(). Pretty obvious name that.
I'll reply to that patch separately as the approach taken there needs
more than a quick name change.
>> > +static int umcg_ttwu(u32 next_tid, int wake_flags)
>> > +{
>> > + struct task_struct *next;
>> > +
>> > + rcu_read_lock();
>> > + next = find_task_by_vpid(next_tid);
>> > + if (!next || !(READ_ONCE(next->umcg_task))) {
>> > + rcu_read_unlock();
>> > + return -ESRCH;
>> > + }
>> > +
>> > + /* Note: next does not necessarily share mm with current. */
>>
>> Which is irrelevant, but what's relevant is that there is absolutely no
>> sanity check of next_tid. So this just wakes up what ever TID user space
>> writes into the user space task memory. Anything copmeing from user
>> space cannot be trusted and needs to be validated. find_task_by_vpid()
>> is not sufficient for that.
>
> Well, next_tid must point to a task registered with UMCG, and this is
> checked above. I'm not sure what other validation is appropriate here.
> What kind of sanity checks are needed? The worst thing that can happen
> is a spurious wakeup, which is supposedly something that can happen
> anyway and is not a major concern?
A spurious wakeup which is caused by a race in the kernel is definitely
not unexpected and inevitable.
User controlled targeted wakeups at random threads are not really the
same category. There is a reason why you can't send signals to random
processes/tasks.
Whether you like it or not, _any_ user supplied value which is used to
cause a directed action of the kernel has to be considered malicious
unless you can explain coherently why there is absolutely no risk.
> Well, next_tid must point to a task registered with UMCG ...
is not really sufficient. All it tells us is that it is a task which has
a umcg pointer. Not more not less. It does not answer the following
questions:
- Is it a task which is related to the same UMCG entity?
- Does it belong to the same user/group or whatever?
- ...
You might want to talk to Kees Cook about that.
>> > + rcu_read_lock();
>> > + tsk = find_task_by_vpid(server_tid);
>> > + if (tsk)
>> > + ut_server = READ_ONCE(tsk->umcg_task);
>> > + rcu_read_unlock();
>> > +
>> > + if (!ut_server)
>> > + return false;
>> > +
>> > + if (READ_ONCE(current->mm) != READ_ONCE(tsk->mm))
>> > + return false;
>>
>> This is broken because it's outside the rcu read side critical section
>> so @tsk can be gone already. Also this lacks a check whether @tsk is a
>> kthread because kthreads can use a user mm temporarily.
>
> I don't think kthreads can have tsk->umcg_task set, so the function
> will return earlier. I'll move the check under rcu.
Indeed, you are right. This stuff is just hard to read.
rcu_read_lock();
tsk = find_task_by_vpid(server_tid);
/*
* Validate that the task is associated with UMCG and
* has the same mm because $RAISINS...
*/
if (tsk && tsk->umcg_task && current->mm == tsk->mm) {
...
do_stuff(tsk);
...
}
rcu_read_unlock();
>> Also what's the purpose of these undocumented READ_ONCE() instances?
>
> While ttwu can be called cross-mm, anything more involved, such as
> changing the server's UMCG state, requires the remote task to belong
> to the same process, and this is what the check validates. I'll add
> that comment to the code.
But that has nothing to do with my question:
>> Also what's the purpose of these undocumented READ_ONCE() instances?
READ_ONCE(current->mm) is pointless. current->mm cannot change while
current is evaluating it. READ_ONCE(tsk->mm) does not provide any value
either.
READ_ONCE() is a clear indicator for dealing with concurrency and we
expect a corresponding WRITE_ONCE() and a comment explaining what this
is about.
Sprinkling READ_ONCE() around the code is absolutely not helpful.
If you are unsure about the usage of this, please ask your colleagues or
on the mailing list upfront instead of letting reviewers scratch their
heads.
Btw, the same problems exist all over the place and there is quite some
duplicated code which is wrong except for one instance. Can you please
make that:
find_umcg_task(tid)
{
tsk = find_task_by_vpid(tid);
/*
* Validate that the task is associated with UMCG and
* has the same mm because $RAISINS...
*/
if (tsk && tsk->umcg_task && current->mm == tsk->mm)
return tsk;
return NULL:
}
and at the callsites:
rcu_read_lock();
tsk = find_umcg_task(tid);
if (tsk)
do_stuff(tsk);
rcu_read_unlock();
Hmm?
Thanks,
tglx
Peter,
On Fri, Sep 17 2021 at 11:03, Peter Oskolkov wrote:
> Add helper functions to work atomically with userspace 32/64 bit values -
> there are some .*futex.* named helpers, but they are not exactly
> what is needed for UMCG; I haven't found what else I could use, so I
> rolled these.
>
> At the moment only X86_64 is supported.
>
> Note: the helpers should probably go into arch/ somewhere; I have
> them in kernel/sched/umcg_uaccess.h temporarily for convenience. Please
> let me know where I should put them.
Again: This does not qualify as a changelog, really.
That aside, you already noticed that there are futex helpers. Your
reasoning that they can't be reused is only partially correct.
What you named __try_cmpxchg_user_32() is pretty much a verbatim copy of
X86 futex_atomic_cmpxchg_inatomic(). The only difference is that you placed
the uaccess_begin()/end() into the inline.
Not going anywhere. You have the bad luck to have the second use case
for such an infrastucture and therefore you have the honours of mopping
it up by making it a generic facility which replaces the futex specific
variant.
Also some of the other instances are just a remix of the futex_op()
mechanics so your argument is even more weak.
> +static inline int fix_pagefault(unsigned long uaddr, bool write_fault, int bytes)
> +{
> + struct mm_struct *mm = current->mm;
> + int ret;
> +
> + /* Validate proper alignment. */
> + if (uaddr % bytes)
> + return -EINVAL;
Why do you want to make this check _after_ the page fault? Checks
on user supplied pointers have to be done _before_ trying to access
them.
> +
> + if (mmap_read_lock_killable(mm))
> + return -EINTR;
> + ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
> + NULL);
> + mmap_read_unlock(mm);
> +
> + return ret < 0 ? ret : 0;
> +}
There is no point in making this inline. Fault handling is not a hotpath
by any means.
Aside of that it's pretty much what futex.c::fault_in_user_writeable()
does. So it's pretty obvious to factor this out in the first step into
mm/gup.c or whatever place the mm people fancy and make the futex code
use it.
> +/**
> + * cmpxchg_32_user_nosleep - compare_exchange 32-bit values
> + *
> + * Return:
> + * 0 - OK
> + * -EFAULT: memory access error
> + * -EAGAIN: @expected did not match; consult @prev
> + */
> +static inline int cmpxchg_user_32_nosleep(u32 __user *uaddr, u32 *old, u32 new)
> +{
> + int ret = -EFAULT;
> + u32 __old = *old;
> +
> + if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> + return -EFAULT;
> +
> + pagefault_disable();
> +
> + __uaccess_begin_nospec();
Why exactly do you need _nospec() here? Just to make sure that this code
is x86 only or just because you happened to copy a x86 implementation
which uses these nospec() variants?
Again, 90% of this is generic and not at all x86 specific and this
nospec() thing is very well hidden in the architecture code for a good
reason while
if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
return -EFAULT;
pagefault_disable();
ret = user_op(.....);
pagefault_enable();
is completely generic and does not have any x86 or other architecture
dependency in it.
> + ret = __try_cmpxchg_user_32(old, uaddr, __old, new);
> + user_access_end();
> +
> + if (!ret)
> + ret = *old == __old ? 0 : -EAGAIN;
> +
> + pagefault_enable();
> + return ret;
> +}
Aside of that this should go into mm/maccess.c or some other reasonable
place where people can find it along with other properly named
_nofault() helpers.
Nothing except the ASM wrappers is architecture specific here. So 90% of
this can be made generic infrastructure as out of line library code.
And yes, I mean out of line library code unless you can come up with a
compelling reason backed by actual numbers why this has to be inlined.
May I recommend to read this for inspiration:
https://lore.kernel.org/lkml/[email protected]/
Thanks,
tglx
Thanks for the review, Thomas!
I'll work on a patch(set) to put this stuff into mm/ somewhere. Let's
see how quickly that can happen...
Thanks,
Peter
On Tue, Sep 28, 2021 at 2:58 PM Thomas Gleixner <[email protected]> wrote:
>
> Peter,
>
> On Fri, Sep 17 2021 at 11:03, Peter Oskolkov wrote:
>
> > Add helper functions to work atomically with userspace 32/64 bit values -
> > there are some .*futex.* named helpers, but they are not exactly
> > what is needed for UMCG; I haven't found what else I could use, so I
> > rolled these.
> >
> > At the moment only X86_64 is supported.
> >
> > Note: the helpers should probably go into arch/ somewhere; I have
> > them in kernel/sched/umcg_uaccess.h temporarily for convenience. Please
> > let me know where I should put them.
>
> Again: This does not qualify as a changelog, really.
>
> That aside, you already noticed that there are futex helpers. Your
> reasoning that they can't be reused is only partially correct.
>
> What you named __try_cmpxchg_user_32() is pretty much a verbatim copy of
> X86 futex_atomic_cmpxchg_inatomic(). The only difference is that you placed
> the uaccess_begin()/end() into the inline.
>
> Not going anywhere. You have the bad luck to have the second use case
> for such an infrastucture and therefore you have the honours of mopping
> it up by making it a generic facility which replaces the futex specific
> variant.
>
> Also some of the other instances are just a remix of the futex_op()
> mechanics so your argument is even more weak.
>
> > +static inline int fix_pagefault(unsigned long uaddr, bool write_fault, int bytes)
> > +{
> > + struct mm_struct *mm = current->mm;
> > + int ret;
> > +
> > + /* Validate proper alignment. */
> > + if (uaddr % bytes)
> > + return -EINVAL;
>
> Why do you want to make this check _after_ the page fault? Checks
> on user supplied pointers have to be done _before_ trying to access
> them.
>
> > +
> > + if (mmap_read_lock_killable(mm))
> > + return -EINTR;
> > + ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
> > + NULL);
> > + mmap_read_unlock(mm);
> > +
> > + return ret < 0 ? ret : 0;
> > +}
>
> There is no point in making this inline. Fault handling is not a hotpath
> by any means.
>
> Aside of that it's pretty much what futex.c::fault_in_user_writeable()
> does. So it's pretty obvious to factor this out in the first step into
> mm/gup.c or whatever place the mm people fancy and make the futex code
> use it.
>
> > +/**
> > + * cmpxchg_32_user_nosleep - compare_exchange 32-bit values
> > + *
> > + * Return:
> > + * 0 - OK
> > + * -EFAULT: memory access error
> > + * -EAGAIN: @expected did not match; consult @prev
> > + */
> > +static inline int cmpxchg_user_32_nosleep(u32 __user *uaddr, u32 *old, u32 new)
> > +{
> > + int ret = -EFAULT;
> > + u32 __old = *old;
> > +
> > + if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> > + return -EFAULT;
> > +
> > + pagefault_disable();
> > +
> > + __uaccess_begin_nospec();
>
> Why exactly do you need _nospec() here? Just to make sure that this code
> is x86 only or just because you happened to copy a x86 implementation
> which uses these nospec() variants?
>
> Again, 90% of this is generic and not at all x86 specific and this
> nospec() thing is very well hidden in the architecture code for a good
> reason while
>
> if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> return -EFAULT;
>
> pagefault_disable();
> ret = user_op(.....);
> pagefault_enable();
>
> is completely generic and does not have any x86 or other architecture
> dependency in it.
>
> > + ret = __try_cmpxchg_user_32(old, uaddr, __old, new);
> > + user_access_end();
> > +
> > + if (!ret)
> > + ret = *old == __old ? 0 : -EAGAIN;
> > +
> > + pagefault_enable();
> > + return ret;
> > +}
>
> Aside of that this should go into mm/maccess.c or some other reasonable
> place where people can find it along with other properly named
> _nofault() helpers.
>
> Nothing except the ASM wrappers is architecture specific here. So 90% of
> this can be made generic infrastructure as out of line library code.
>
> And yes, I mean out of line library code unless you can come up with a
> compelling reason backed by actual numbers why this has to be inlined.
>
> May I recommend to read this for inspiration:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Thanks,
>
> tglx
Hi Thierry,
sorry for the delayed reply - I'm finally going through the
documentation patches in preparation for the upcoming next version
patchset mail-out.
On Wed, Sep 22, 2021 at 11:39 AM Thierry Delisle <[email protected]> wrote:
>
> On 2021-09-17 2:03 p.m., Peter Oskolkov wrote:
> > [...]
> > +SYS_UMCG_WAIT()
> > +
> > +int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout) operates on
> > +registered UMCG servers and workers: struct umcg_task *self provided to
> > +sys_umcg_ctl() when registering the current task is consulted in
> addition
> > +to flags and abs_timeout parameters.
> > +
> > +The function can be used to perform one of the three operations:
> > +
> > +* wait: if self->next_tid is zero, sys_umcg_wait() puts the current
> > + server or worker to sleep;
>
> I believe this description is misleading but I might be wrong.
> From the example
> * worker to server context switch (worker "yields"):
> S:IDLE+W:RUNNING => +S:RUNNING+W:IDLE
>
> It seems to me that when a worker goes from running to idle, it should
> *not* set the next_tid to 0, it should preserve the next_tid as-is,
> which is expected to point to its current server. This is consistent
> with my understanding of the umcg_wait implementation. This operation
> is effectively a direct context-switch to the server.
The documentation here outlines what sys_umcg_wait does, and it does
put the current task to sleep without context switching if next_tid is
zero. The question of whether this behavior is or is not appropriate
for a worker wishing to yield/park itself is at a "policy" level, if
you wish, and this "policy" level is described in "state transitions"
section later in the document. sys_umcg_wait() does not enforce this
"policy" directly, in order to make it simpler and easier to describe
and reason about.
>
> With that said, I'm a little confused by the usage of "yields" in that
> example. I would expect workers yielding to behave like kernel threads
> calling sched_yield(), i.e., context switch to the server but also be
> immediately added to the idle_workers_ptr.
>
> From my understanding of the umcg_wait call, "worker to server context
> switch" does not have analogous behaviour to sched_yield. Am I correct?
> If so, I suggest using "park" instead of "yield" in the description
> of that example. I believe the naming of wait/wake as park/unpark is
> consistent with Java[1] and Rust[2], but I don't know if that naming
> is used in contexts closer to the linux kernel.
>
> [1]
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/LockSupport.html
> [2] https://doc.rust-lang.org/std/thread/fn.park.html
I'm not a fan of arguing about how to name things. If the maintainers
ask me to rename wait/wake to park/unpark, I'll do that. But it seems
they are OK with this terminology, I believe because wait/wake is a
relatively well understood pair of verbs in the kernel context;
futexes, for example, have wait/wake operations. A higher level
library in the userspace may later expose park/unpark functions that
at the lower level call sys_umcg_wait...
> Hi Thierry,
>
> sorry for the delayed reply - I'm finally going through the
> documentation patches in preparation for the upcoming next version
> patchset mail-out.
No problem.
> The documentation here outlines what sys_umcg_wait does, and it does
> put the current task to sleep without context switching if next_tid is
> zero. The question of whether this behavior is or is not appropriate
> for a worker wishing to yield/park itself is at a "policy" level, if
> you wish, and this "policy" level is described in "state transitions"
> section later in the document. sys_umcg_wait() does not enforce this
> "policy" directly, in order to make it simpler and easier to describe
> and reason about.
Just to be clear, sys_umcg_wait supports an operation that, when called
from
a worker, puts the worker to sleep without triggering block detection
or context-switching back to the server?
>> With that said, I'm a little confused by the usage of "yields" in that
>> example. I would expect workers yielding to behave like kernel threads
>> calling sched_yield(), i.e., context switch to the server but also be
>> immediately added to the idle_workers_ptr.
>
> I'm not a fan of arguing about how to name things. If the maintainers
> ask me to rename wait/wake to park/unpark, I'll do that.
I understand the sentiment, and I'm perfectly happy with the use of
wait/wake.
I was exclusively referring to this one use of "yield" in the
documentation.
On Tue, Oct 12, 2021 at 9:25 AM Thierry Delisle <[email protected]> wrote:
[...]
> Just to be clear, sys_umcg_wait supports an operation that, when called
> from
> a worker, puts the worker to sleep without triggering block detection
> or context-switching back to the server?
Potentially, yes - when a worker wants to yield (e.g. as part of a
custom UMCG-aware mutex/condvar code), and calls into the userspace
scheduler, it may be faster to skip the server wakeup (e.g. reassign
the server to another sleeping worker and wake this worker). This is
not a supported operation right now, but I see how it could be used to
optimize some things in the future.
Do you have any concerns here?
>
>
>
> >> With that said, I'm a little confused by the usage of "yields" in that
> >> example. I would expect workers yielding to behave like kernel threads
> >> calling sched_yield(), i.e., context switch to the server but also be
> >> immediately added to the idle_workers_ptr.
> >
> > I'm not a fan of arguing about how to name things. If the maintainers
> > ask me to rename wait/wake to park/unpark, I'll do that.
>
> I understand the sentiment, and I'm perfectly happy with the use of
> wait/wake.
> I was exclusively referring to this one use of "yield" in the
> documentation.
I don't see a big difference here, sorry. We are mixing levels of
abstraction here again, I think. For example, the higher level
userspace scheduling code will have more nuanced treatment of IDLE
workers; but down at the kernel they are all the same: IDLE worker is
a worker that the userspace can "schedule" by marking it RUNNING,
regardless of whether the worker is "parked", or "woke from a blocking
op", or whatever other semantically different state the worker can be.
For the kernel, they are all the same, idle, not runnable, waiting for
the userspace to explicitly "schedule" them.
Similarly, I don't see a need to semantically distinguish "yield" from
"park" at the kernel level of things; this distinction seems to be a
higher-level abstraction that can be properly expressed in the
userspace, and does not need to be explicitly addressed in the kernel
(to make the code faster and simpler, for example).
>> Just to be clear, sys_umcg_wait supports an operation that, when called
>> from a worker, puts the worker to sleep without triggering block
detection
>> or context-switching back to the server?
>
> Potentially, yes - when a worker wants to yield (e.g. as part of a
> custom UMCG-aware mutex/condvar code), and calls into the userspace
> scheduler, it may be faster to skip the server wakeup (e.g. reassign
> the server to another sleeping worker and wake this worker). This is
> not a supported operation right now, but I see how it could be used to
> optimize some things in the future.
>
> Do you have any concerns here?
To be honest, I did not realize this was a possibility until your previous
email. I'm not sure I buy your example, it just sounds like worker to worker
context-switching, but I could imagine "stop the world" cases or some "race
to idle" policy using this feature.
It seems to me the corresponding wake needs to know if it needs to enqueue
the worker into the idle workers list or if it should just schedule the
worker
as it would a server.
How does the wake know which to do?
> I don't see a big difference here, sorry. We are mixing levels of
> abstraction here again, I think. For example, the higher level
> userspace scheduling code will have more nuanced treatment of IDLE
> workers; but down at the kernel they are all the same: IDLE worker is
> a worker that the userspace can "schedule" by marking it RUNNING,
> regardless of whether the worker is "parked", or "woke from a blocking
> op", or whatever other semantically different state the worker can be.
> For the kernel, they are all the same, idle, not runnable, waiting for
> the userspace to explicitly "schedule" them.
>
> Similarly, I don't see a need to semantically distinguish "yield" from
> "park" at the kernel level of things; this distinction seems to be a
> higher-level abstraction that can be properly expressed in the
> userspace, and does not need to be explicitly addressed in the kernel
> (to make the code faster and simpler, for example).
From the kernel's perspective, I can see two distinct operation:
1 - Mark the worker as IDLE and put it to sleep.
2 - Mark the worker as IDLE, put it to sleep *and* immediately add it
to the idle workers list.
The wait in operation 1 expects an outside wakeup call to match it and
resume
the worker, while operation 2 is its own wakeup. To me that is the
distinction
between wait/park and yield, respectively.
Is Operation 2 supported?
I'm not sure this distinction can be handled in userspace in all cases.
Waking
oneself is generally not a possibility.
Hi Thierry!
Again, it seems you are ascribing higher level semantics to lower
level constructs here, specifically to "idle workers list".
In this patchset, idle workers list is just a mechanism used by the
kernel to notify the userspace that previously blocked workers are now
available for scheduling. Maybe a better name would have been
"unblocked workers list". The true list of idle workers that the
userspace scheduler can schedule is maintained by the userspace; it
can be said that if a worker is on the kernel's idle workers list, it
is NOT on the userspace's idle workers list, and so workers on the
kernel's idle workers list are not yet fully "idle workers" that can
be scheduled.
On Tue, Oct 12, 2021 at 11:46 AM Thierry Delisle <[email protected]> wrote:
>
> >> Just to be clear, sys_umcg_wait supports an operation that, when called
> >> from a worker, puts the worker to sleep without triggering block
> detection
> >> or context-switching back to the server?
> >
> > Potentially, yes - when a worker wants to yield (e.g. as part of a
> > custom UMCG-aware mutex/condvar code), and calls into the userspace
> > scheduler, it may be faster to skip the server wakeup (e.g. reassign
> > the server to another sleeping worker and wake this worker). This is
> > not a supported operation right now, but I see how it could be used to
> > optimize some things in the future.
> >
> > Do you have any concerns here?
>
> To be honest, I did not realize this was a possibility until your previous
> email. I'm not sure I buy your example, it just sounds like worker to worker
> context-switching, but I could imagine "stop the world" cases or some "race
> to idle" policy using this feature.
>
> It seems to me the corresponding wake needs to know if it needs to enqueue
> the worker into the idle workers list or if it should just schedule the
> worker
> as it would a server.
>
> How does the wake know which to do?
If the worker is IDLE, and the userspace knows about it (i.e. the
worker is NOT on the kernel's idle workers list), the userspace either
can directly schedule the worker with a server (mark it RUNNING,
assign a server, etc.), or instruct the kernel to put the worker onto
the kernel's idle workers list so that it can later be picked up by a
userspace thread checking the kernel's idle workers list. This last
operation is provided for cases when for example a server wants to run
IDLE worker A and knows about IDLE worker B, but for some reason
cannot properly process worker B at the moment, so by putting worker B
back into the kernel's idle worker list this server delegates
processing worker B to another server in the future.
Both of these state transitions are explicitly covered in the documentation.
Again, you appear to be trying to equate kernel UMCG API with higher
level userspace scheduling notions/facilities, while in reality kernel
UMCG API is a low level facility that will be _used_ to construct the
higher level behaviors/semantics you are wondering about.
I suggest you wait until I post the userspace library to ask these
higher-level semantic questions.
Thanks,
Peter
[...]