2021-09-08 18:50:11

by Peter Oskolkov

[permalink] [raw]
Subject: [PATCH 0/4 v0.5] sched/umcg: RFC UMCG patchset

This is an update on v0.4:
https://lore.kernel.org/lkml/[email protected]/

Key changes:

v0.3 => v0.4:
- made idle workers list logic simpler
- only one idle server tracking
- removed two fields from struct umcg_task
- added timeout handling
- added worker preemption
- added a doc patch that now documents the syscalls and state
transitions.

v0.4 => v0.5:
- refactored idle workers list management, as suggested by
Thierry Delisle here:
https://lore.kernel.org/lkml/[email protected]/

I will now clean up libumcg (userspace) and selftests and post them
in a follow-up patchset.

Peter Oskolkov (4):
sched/umcg: add WF_CURRENT_CPU and externise ttwu
sched/umcg: RFC: add userspace atomic helpers
sched/umcg: RFC: implement UMCG syscalls
sched/umcg: add Documentation/userspace-api/umcg.rst

Documentation/userspace-api/umcg.rst | 546 ++++++++++++++++++++++
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
include/linux/sched.h | 6 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 8 +-
include/uapi/linux/umcg.h | 114 +++++
init/Kconfig | 10 +
kernel/exit.c | 7 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 20 +-
kernel/sched/fair.c | 4 +
kernel/sched/sched.h | 15 +-
kernel/sched/umcg.c | 618 +++++++++++++++++++++++++
kernel/sched/umcg.h | 325 +++++++++++++
kernel/sys_ni.c | 4 +
15 files changed, 1673 insertions(+), 11 deletions(-)
create mode 100644 Documentation/userspace-api/umcg.rst
create mode 100644 include/uapi/linux/umcg.h
create mode 100644 kernel/sched/umcg.c
create mode 100644 kernel/sched/umcg.h

--
2.25.1


2021-09-08 18:50:11

by Peter Oskolkov

[permalink] [raw]
Subject: [PATCH 1/4 v0.5] sched/umcg: add WF_CURRENT_CPU and externise ttwu

Add WF_CURRENT_CPU wake flag that advices the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases such as UMCG.

In addition, make ttwu external rather than static so that
the flag could be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov <[email protected]>
---
kernel/sched/core.c | 3 +--
kernel/sched/fair.c | 4 ++++
kernel/sched/sched.h | 15 +++++++++------
3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a22cc3c156ce..377c4d931546 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3878,8 +3878,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
* Return: %true if @p->state changes (an actual wakeup was done),
* %false otherwise.
*/
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
unsigned long flags;
int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b3e85912a1f..133eea90e045 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6893,6 +6893,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (wake_flags & WF_TTWU) {
record_wakee(p);

+ if ((wake_flags & WF_CURRENT_CPU) &&
+ cpumask_test_cpu(cpu, p->cpus_ptr))
+ return cpu;
+
if (sched_energy_enabled()) {
new_cpu = find_energy_efficient_cpu(p, prev_cpu);
if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e7e2bba5b520..bb807650722a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2041,13 +2041,14 @@ static inline int task_on_rq_migrating(struct task_struct *p)
}

/* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC 0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK 0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */
+#define WF_EXEC 0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK 0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */

-#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
-#define WF_ON_CPU 0x40 /* Wakee is on_cpu */
+#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
+#define WF_ON_CPU 0x40 /* Wakee is on_cpu */
+#define WF_CURRENT_CPU 0x80 /* Prefer to move the wakee to the current CPU. */

#ifdef CONFIG_SMP
static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3048,6 +3049,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
extern void swake_up_all_locked(struct swait_queue_head *q);
extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);

+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
#ifdef CONFIG_PREEMPT_DYNAMIC
extern int preempt_dynamic_mode;
extern int sched_dynamic_mode(const char *str);
--
2.25.1

2021-09-08 18:51:55

by Peter Oskolkov

[permalink] [raw]
Subject: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

Add helper functions to work atomically with userspace 32/64 bit values -
there are some .*futex.* named helpers, but they are not exactly
what is needed for UMCG; I haven't found what else I could use, so I
rolled these.

At the moment only X86_64 is supported.

Note: the helpers should probably go into arch/ somewhere; I have
them in kernel/sched/umcg.h temporarily for convenience. Please
let me know where I should put them.

Note: the current code follows sugestions here:
https://lore.kernel.org/lkml/[email protected]/
with the exception that I couldn't combine __try_cmpxchg_user_32/64 functions
into a macro, as my asm foo is not too strong. I'll continue trying to make
the macro work, but for the moment I've decided to post this RFC so that
other areas of the patchset could be reviewed.

Changelog:
v04->v0.5:
- added xchg_user_** helpers;
v0.3->v0.4:
- added put_user_nosleep;
- removed linked list/stack operations patch;
v0.2->v0.3:
- renamed and refactored the helpers a bit, as described above;
- moved linked list/stack operations into a separate patch.

Signed-off-by: Peter Oskolkov <[email protected]>
---
kernel/sched/umcg.h | 312 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 312 insertions(+)
create mode 100644 kernel/sched/umcg.h

diff --git a/kernel/sched/umcg.h b/kernel/sched/umcg.h
new file mode 100644
index 000000000000..89ba84afa977
--- /dev/null
+++ b/kernel/sched/umcg.h
@@ -0,0 +1,312 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _KERNEL_SCHED_UMCG_H
+#define _KERNEL_SCHED_UMCG_H
+
+#ifdef CONFIG_X86_64
+
+#include <linux/uaccess.h>
+
+#include <asm/asm.h>
+#include <linux/atomic.h>
+
+/* TODO: move atomic operations below into arch/ headers */
+static inline int __try_cmpxchg_user_32(u32 *uval, u32 __user *uaddr,
+ u32 oldval, u32 newval)
+{
+ int ret = 0;
+
+ asm volatile("\n"
+ "1:\t" LOCK_PREFIX "cmpxchgl %4, %2\n"
+ "2:\n"
+ "\t.section .fixup, \"ax\"\n"
+ "3:\tmov %3, %0\n"
+ "\tjmp 2b\n"
+ "\t.previous\n"
+ _ASM_EXTABLE_UA(1b, 3b)
+ : "+r" (ret), "=a" (oldval), "+m" (*uaddr)
+ : "i" (-EFAULT), "r" (newval), "1" (oldval)
+ : "memory"
+ );
+ *uval = oldval;
+ return ret;
+}
+
+static inline int __try_cmpxchg_user_64(u64 *uval, u64 __user *uaddr,
+ u64 oldval, u64 newval)
+{
+ int ret = 0;
+
+ asm volatile("\n"
+ "1:\t" LOCK_PREFIX "cmpxchgq %4, %2\n"
+ "2:\n"
+ "\t.section .fixup, \"ax\"\n"
+ "3:\tmov %3, %0\n"
+ "\tjmp 2b\n"
+ "\t.previous\n"
+ _ASM_EXTABLE_UA(1b, 3b)
+ : "+r" (ret), "=a" (oldval), "+m" (*uaddr)
+ : "i" (-EFAULT), "r" (newval), "1" (oldval)
+ : "memory"
+ );
+ *uval = oldval;
+ return ret;
+}
+
+static inline int fix_pagefault(unsigned long uaddr, bool write_fault)
+{
+ struct mm_struct *mm = current->mm;
+ int ret;
+
+ mmap_read_lock(mm);
+ ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
+ NULL);
+ mmap_read_unlock(mm);
+
+ return ret < 0 ? ret : 0;
+}
+
+/**
+ * cmpxchg_32_user - compare_exchange 32-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ * -EAGAIN: @expected did not match; consult @prev
+ */
+static inline int cmpxchg_user_32(u32 __user *uaddr, u32 *old, u32 new)
+{
+ int ret = -EFAULT;
+ u32 __old = *old;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ while (true) {
+ __uaccess_begin_nospec();
+ ret = __try_cmpxchg_user_32(old, uaddr, __old, new);
+ user_access_end();
+
+ if (!ret) {
+ ret = *old == __old ? 0 : -EAGAIN;
+ break;
+ }
+
+ if (fix_pagefault((unsigned long)uaddr, true) < 0)
+ break;
+ }
+
+ pagefault_enable();
+ return ret;
+}
+
+/**
+ * cmpxchg_64_user - compare_exchange 64-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ * -EAGAIN: @expected did not match; consult @prev
+ */
+static inline int cmpxchg_user_64(u64 __user *uaddr, u64 *old, u64 new)
+{
+ int ret = -EFAULT;
+ u64 __old = *old;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ while (true) {
+ __uaccess_begin_nospec();
+ ret = __try_cmpxchg_user_64(old, uaddr, __old, new);
+ user_access_end();
+
+ if (!ret) {
+ ret = *old == __old ? 0 : -EAGAIN;
+ break;
+ }
+
+ if (fix_pagefault((unsigned long)uaddr, true) < 0)
+ break;
+ }
+
+ pagefault_enable();
+
+ return ret;
+}
+
+static inline int __try_xchg_user_32(u32 *oval, u32 __user *uaddr, u32 newval)
+{
+ u32 oldval = 0;
+ int ret = 0;
+
+ asm volatile("\n"
+ "1:\txchgl %0, %2\n"
+ "2:\n"
+ "\t.section .fixup, \"ax\"\n"
+ "3:\tmov %3, %0\n"
+ "\tjmp 2b\n"
+ "\t.previous\n"
+ _ASM_EXTABLE_UA(1b, 3b)
+ : "=r" (oldval), "=r" (ret), "+m" (*uaddr)
+ : "i" (-EFAULT), "0" (newval), "1" (0)
+ );
+
+ if (ret)
+ return ret;
+
+ *oval = oldval;
+ return 0;
+}
+
+static inline int __try_xchg_user_64(u64 *oval, u64 __user *uaddr, u64 newval)
+{
+ u64 oldval = 0;
+ int ret = 0;
+
+ asm volatile("\n"
+ "1:\txchgq %0, %2\n"
+ "2:\n"
+ "\t.section .fixup, \"ax\"\n"
+ "3:\tmov %3, %0\n"
+ "\tjmp 2b\n"
+ "\t.previous\n"
+ _ASM_EXTABLE_UA(1b, 3b)
+ : "=r" (oldval), "=r" (ret), "+m" (*uaddr)
+ : "i" (-EFAULT), "0" (newval), "1" (0)
+ );
+
+ if (ret)
+ return ret;
+
+ *oval = oldval;
+ return 0;
+}
+
+/**
+ * xchg_32_user - atomically exchange 64-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ */
+static inline int xchg_user_32(u32 __user *uaddr, u32 *val)
+{
+ int ret = -EFAULT;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ while (true) {
+ __uaccess_begin_nospec();
+ ret = __try_xchg_user_32(val, uaddr, *val);
+ user_access_end();
+
+ if (!ret)
+ break;
+
+ if (fix_pagefault((unsigned long)uaddr, true) < 0)
+ break;
+ }
+
+ pagefault_enable();
+
+ return ret;
+}
+
+/**
+ * xchg_64_user - atomically exchange 64-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ */
+static inline int xchg_user_64(u64 __user *uaddr, u64 *val)
+{
+ int ret = -EFAULT;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ while (true) {
+ __uaccess_begin_nospec();
+ ret = __try_xchg_user_64(val, uaddr, *val);
+ user_access_end();
+
+ if (!ret)
+ break;
+
+ if (fix_pagefault((unsigned long)uaddr, true) < 0)
+ break;
+ }
+
+ pagefault_enable();
+
+ return ret;
+}
+
+/**
+ * get_user_nosleep - get user value with inline fixup without sleeping.
+ *
+ * get_user() might sleep and therefore cannot be used in preempt-disabled
+ * regions.
+ */
+#define get_user_nosleep(out, uaddr) \
+({ \
+ int ret = -EFAULT; \
+ \
+ if (access_ok((uaddr), sizeof(*(uaddr)))) { \
+ pagefault_disable(); \
+ \
+ while (true) { \
+ if (!__get_user((out), (uaddr))) { \
+ ret = 0; \
+ break; \
+ } \
+ \
+ if (fix_pagefault((unsigned long)(uaddr), false) < 0) \
+ break; \
+ } \
+ \
+ pagefault_enable(); \
+ } \
+ ret; \
+})
+
+/**
+ * put_user_nosleep - put user value with inline fixup without sleeping.
+ *
+ * put_user() might sleep and therefore cannot be used in preempt-disabled
+ * regions.
+ */
+#define put_user_nosleep(out, uaddr) \
+({ \
+ int ret = -EFAULT; \
+ \
+ if (access_ok((uaddr), sizeof(*(uaddr)))) { \
+ pagefault_disable(); \
+ \
+ while (true) { \
+ if (!__put_user((out), (uaddr))) { \
+ ret = 0; \
+ break; \
+ } \
+ \
+ if (fix_pagefault((unsigned long)(uaddr), true) < 0) \
+ break; \
+ } \
+ \
+ pagefault_enable(); \
+ } \
+ ret; \
+})
+
+#endif /* CONFIG_X86_64 */
+#endif /* _KERNEL_SCHED_UMCG_H */
--
2.25.1

2021-09-08 18:52:05

by Peter Oskolkov

[permalink] [raw]
Subject: [PATCH 3/4 v0.5] sched/umcg: RFC: implement UMCG syscalls

Define struct umcg_task and two syscalls: sys_umcg_ctl sys_umcg_wait.

All key operations, such as wait/wake/context-switch, as well as
timeouts and block/wake detection, are working quite reliably.

In addition, the userspace can now force the kernel to preempt
a running worker by changing its state from RUNNING to
RUNNING | PREEMPTED and sending a signal to it. This new functionality
is less well tested than the key operations above, but is working
well in the common case of the worker busy in the userspace.

Peter Z. suggested a different approach in
https://lore.kernel.org/lkml/[email protected]/
but I'm not sure how a new syscall (called from a signal handler?)
would improve things: at least now the interrupter can be reasonably
assured that if the state change RUNNING => RUNNING | PREEMPTED
succeeded, and the signal delivered, the worker will be interrupted;
am I missing something?

These big things remain to be addressed:
- tracing/debugging
- faster context switches (see umcg_do_context_switch in umcg.c)
- other architectures (we will need at least arm64 in addition to amd64)
- tools/lib/umcg for userspace
- kselftests

I obviosly have the last two items (libumcg and selftests) to a
certain extent, but they are a bit rough at the moment - I'll
clean them up and post to LKML soon(ish).

See Documentation/userspace-api/umcg.rst for API usage and other details.

v0.4->v0.5 changes:
- handling idle workers and servers is now much simpler on the kernel
side, thanks to Thierry Delisle's suggestion:
https://lore.kernel.org/lkml/[email protected]/
- minor tweaks to improve preemption handling;

v0.3->v0.4 changes:
- removed server_tid and api_version fields from struct umcg_task;
- added timeout handling to sys_umcg_wait();
- implemented worker preemption via signals;
- handling idle workers and servers is changed again (see umcg.rst).

v0.2->v0.3 changes:
- the overall approach is now based on peterz@'s suggestion in
https://lore.kernel.org/patchwork/cover/1433967/
(should I add Suggested-by?)
- new protocol for working with idle workers and servers is used, to avoid
spinning in the kernel;
- waking a UMCG task now does not require spinning.

Signed-off-by: Peter Oskolkov <[email protected]>
---
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
include/linux/sched.h | 6 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 8 +-
include/uapi/linux/umcg.h | 114 +++++
init/Kconfig | 10 +
kernel/exit.c | 7 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 17 +-
kernel/sched/umcg.c | 618 +++++++++++++++++++++++++
kernel/sched/umcg.h | 13 +
kernel/sys_ni.c | 4 +
12 files changed, 801 insertions(+), 3 deletions(-)
create mode 100644 include/uapi/linux/umcg.h
create mode 100644 kernel/sched/umcg.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ce18119ea0d0..0c6c7fd72b0b 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -368,6 +368,8 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common umcg_ctl sys_umcg_ctl
+448 common umcg_wait sys_umcg_wait

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3bb9fecfdaa1..1b38737ef0ab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -66,6 +66,7 @@ struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
+struct umcg_task;

/*
* Task state bitmask. NOTE! These bits are also
@@ -1224,6 +1225,10 @@ struct task_struct {
unsigned long rseq_event_mask;
#endif

+#ifdef CONFIG_UMCG
+ struct umcg_task __user *umcg_task;
+#endif
+
struct tlbflush_unmap_batch tlb_ubc;

union {
@@ -1600,6 +1605,7 @@ extern struct pid *cad_pid;
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..f3e1ef8d842f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -71,6 +71,7 @@ struct open_how;
struct mount_attr;
struct landlock_ruleset_attr;
enum landlock_rule_type;
+struct umcg_task;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -1050,6 +1051,9 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _
asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
const void __user *rule_attr, __u32 flags);
asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
+

/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 6de5a7fc066b..1a4c9ac0e296 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -873,8 +873,14 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
#define __NR_landlock_restrict_self 446
__SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)

+#define __NR_umcg_ctl 447
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 448
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
+
+
#undef __NR_syscalls
-#define __NR_syscalls 447
+#define __NR_syscalls 449

/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
new file mode 100644
index 000000000000..2fc15c6068ad
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/limits.h>
+#include <linux/types.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ * sys_umcg_ctl() - register/unregister UMCG tasks;
+ * sys_umcg_wait() - wait/wake/context-switch.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ *
+ * See Documentation/userspace-api/umcg.rst for detals.
+ */
+
+/*
+ * UMCG task states, the first 8 bits. The states represent the user space
+ * point of view.
+ */
+#define UMCG_TASK_NONE 0
+#define UMCG_TASK_RUNNING 1
+#define UMCG_TASK_IDLE 2
+#define UMCG_TASK_BLOCKED 3
+
+/* The first byte: RUNNING, IDLE, or BLOCKED. */
+#define UMCG_TASK_STATE_MASK 0xff
+
+/* UMCG task state flags, bits 8-15 */
+
+/*
+ * UMCG_TF_LOCKED: locked by the userspace in preparation to calling umcg_wait.
+ */
+#define UMCG_TF_LOCKED (1 << 8)
+
+/*
+ * UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted.
+ */
+#define UMCG_TF_PREEMPTED (1 << 9)
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+ /**
+ * @state: the current state of the UMCG task described by this struct.
+ *
+ * Readable/writable by both the kernel and the userspace.
+ *
+ * UMCG task state:
+ * bits 0 - 7: task state;
+ * bits 8 - 15: state flags;
+ * bits 16 - 23: reserved; must be zeroes;
+ * bits 24 - 31: for userspace use.
+ */
+ uint32_t state; /* r/w */
+
+ /**
+ * @next_tid: the TID of the UMCG task that should be context-switched
+ * into in sys_umcg_wait(). Can be zero.
+ *
+ * Running UMCG workers must have next_tid set to point to IDLE
+ * UMCG servers.
+ *
+ * Read-only for the kernel, read/write for the userspace.
+ */
+ uint32_t next_tid; /* r */
+
+ /**
+ * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
+ *
+ * Readable/writable by both the kernel and the userspace: the
+ * kernel adds items to the list, the userspace removes them.
+ */
+ uint64_t idle_workers_ptr; /* r/w */
+
+ /**
+ * @idle_server_tid_ptr: a pointer pointing to a single idle server.
+ * Readonly.
+ */
+ uint64_t idle_server_tid_ptr; /* r */
+} __attribute__((packed, aligned(8 * sizeof(__u64))));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER: register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER: register the current task as a UMCG worker
+ */
+enum umcg_ctl_flag {
+ UMCG_CTL_REGISTER = 0x00001,
+ UMCG_CTL_UNREGISTER = 0x00002,
+ UMCG_CTL_WORKER = 0x10000,
+};
+
+/**
+ * enum umcg_wait_flag - flags to pass to sys_umcg_wait
+ * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
+ * @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
+ * (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
+ * must be set.
+ */
+enum umcg_wait_flag {
+ UMCG_WAIT_WAKE_ONLY = 1,
+ UMCG_WAIT_WF_CURRENT_CPU = 2,
+};
+
+#endif /* _UAPI_LINUX_UMCG_H */
diff --git a/init/Kconfig b/init/Kconfig
index a61c92066c2e..c15a50a61ba6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1662,6 +1662,16 @@ config MEMBARRIER

If unsure, say Y.

+config UMCG
+ bool "Enable User Managed Concurrency Groups API"
+ depends on X86_64
+ default n
+ help
+ Enable User Managed Concurrency Groups API, which form the basis
+ for an in-process M:N userspace scheduling framework.
+ At the moment this is an experimental/RFC feature that is not
+ guaranteed to be backward-compatible.
+
config KALLSYMS
bool "Load all symbols for debugging/ksymoops" if EXPERT
default y
diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..dc8398558d87 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -744,6 +744,13 @@ void __noreturn do_exit(long code)
if (unlikely(!tsk->pid))
panic("Attempted to kill the idle task!");

+#ifdef CONFIG_UMCG
+ if (unlikely(tsk->flags & PF_UMCG_WORKER))
+ tsk->flags &= ~PF_UMCG_WORKER;
+
+ tsk->umcg_task = NULL;
+#endif
+
/*
* If do_exit is called because this processes oopsed, it's possible
* that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 978fcfca5871..e4e481eee1b7 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
obj-$(CONFIG_PSI) += psi.o
obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 377c4d931546..367a21fa45ea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -26,6 +26,7 @@

#include "pelt.h"
#include "smp.h"
+#include "umcg.h"

/*
* Export tracepoints that act as a bare tracehook (ie: have no trace event
@@ -4159,6 +4160,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
#endif
+#ifdef CONFIG_UMCG
+ p->umcg_task = NULL;
+ p->flags &= ~PF_UMCG_WORKER;
+#endif
}

DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6171,10 +6176,14 @@ static inline void sched_submit_work(struct task_struct *tsk)
* in the possible wakeup of a kworker and because wq_worker_sleeping()
* requires it.
*/
- if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
preempt_disable();
if (task_flags & PF_WQ_WORKER)
wq_worker_sleeping(tsk);
+#ifdef CONFIG_UMCG
+ else if (task_flags & PF_UMCG_WORKER)
+ umcg_wq_worker_sleeping(tsk);
+#endif
else
io_wq_worker_sleeping(tsk);
preempt_enable_no_resched();
@@ -6193,9 +6202,13 @@ static inline void sched_submit_work(struct task_struct *tsk)

static void sched_update_worker(struct task_struct *tsk)
{
- if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
if (tsk->flags & PF_WQ_WORKER)
wq_worker_running(tsk);
+#ifdef CONFIG_UMCG
+ else if (tsk->flags & PF_UMCG_WORKER)
+ umcg_wq_worker_running(tsk);
+#endif
else
io_wq_worker_running(tsk);
}
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
new file mode 100644
index 000000000000..c833246a6bbc
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,618 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ *
+ * See Documentation/userspace-api/umcg.rst for detals.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+
+#include "sched.h"
+#include "umcg.h"
+
+/**
+ * sys_umcg_ctl: (un)register the current task as a UMCG task.
+ * @flags: ORed values from enum umcg_ctl_flag; see below;
+ * @self: a pointer to struct umcg_task that describes this
+ * task and governs the behavior of sys_umcg_wait if
+ * registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ * UMCG workers:
+ * - self->state must be UMCG_TASK_IDLE
+ * - @flags & UMCG_CTL_WORKER
+ * UMCG servers:
+ * - self->state must be UMCG_TASK_RUNNING
+ * - !(@flags & UMCG_CTL_WORKER)
+ *
+ * All tasks:
+ * - self->next_tid must be zero
+ *
+ * If the conditions above are met, sys_umcg_ctl() immediately returns
+ * if the registered task is a server; a worker will be added to
+ * idle_workers_ptr, and the worker put to sleep; an idle server
+ * from idle_server_tid_ptr will be woken, if present.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
+ * is a UMCG worker, the userspace is responsible for waking its
+ * server (before or after calling sys_umcg_ctl).
+ *
+ * Return:
+ * 0 - success
+ * -EFAULT - failed to read @self
+ * -EINVAL - some other error occurred
+ */
+SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self)
+{
+ struct umcg_task ut;
+
+ if (flags == UMCG_CTL_UNREGISTER) {
+ if (self || !current->umcg_task)
+ return -EINVAL;
+
+ if (current->flags & PF_UMCG_WORKER)
+ current->flags &= ~PF_UMCG_WORKER;
+
+ current->umcg_task = NULL;
+ return 0;
+ }
+
+ /* Register the current task as a UMCG task. */
+ if (!(flags & UMCG_CTL_REGISTER))
+ return -EINVAL;
+
+ flags &= ~UMCG_CTL_REGISTER;
+ if (flags && flags != UMCG_CTL_WORKER)
+ return -EINVAL;
+
+ if (current->umcg_task || !self)
+ return -EINVAL;
+
+ if (copy_from_user(&ut, self, sizeof(ut)))
+ return -EFAULT;
+
+ if (ut.next_tid)
+ return -EINVAL;
+
+ if (flags == UMCG_CTL_WORKER) {
+ if (ut.state != UMCG_TASK_BLOCKED)
+ return -EINVAL;
+
+ current->umcg_task = self;
+ current->flags |= PF_UMCG_WORKER;
+
+ set_tsk_need_resched(current);
+ return 0;
+ }
+
+ /* This is a server task. */
+ if (ut.state != UMCG_TASK_RUNNING)
+ return -EINVAL;
+
+ current->umcg_task = self;
+ return 0;
+}
+
+/**
+ * handle_timedout_worker - make sure the worker is added to idle_workers
+ * upon a "clean" timeout.
+ */
+static int handle_timedout_worker(struct umcg_task __user *self)
+{
+ u32 prev_state, next_state;
+ int ret;
+
+ if (get_user_nosleep(prev_state, &self->state))
+ return -EFAULT;
+
+ if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE) {
+ /* TODO: should we care here about TF_LOCKED or TF_PREEMPTED? */
+
+ next_state = prev_state & ~UMCG_TASK_STATE_MASK;
+ next_state |= UMCG_TASK_BLOCKED;
+
+ ret = cmpxchg_user_32(&self->state, &prev_state, next_state);
+ if (ret)
+ return ret;
+
+ return -ETIMEDOUT;
+ }
+
+ return 0; /* Not really timed out. */
+}
+
+/**
+ * umcg_idle_loop - sleep until the current task becomes RUNNING or a timeout
+ * @abs_timeout - absolute timeout in nanoseconds; zero => no timeout
+ *
+ * The function marks the current task as INTERRUPTIBLE and calls
+ * schedule(). It returns when either the timeout expires or
+ * the UMCG state of the task becomes RUNNING.
+ *
+ * Note: because UMCG workers should not be running WITHOUT attached servers,
+ * and because servers should not be running WITH attached workers,
+ * the function returns only on fatal signal pending and ignores/flushes
+ * all other signals.
+ */
+static int umcg_idle_loop(u64 abs_timeout)
+{
+ int ret;
+ struct hrtimer_sleeper timeout;
+ struct umcg_task __user *self = current->umcg_task;
+
+ if (abs_timeout) {
+ hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
+ HRTIMER_MODE_ABS);
+
+ hrtimer_set_expires_range_ns(&timeout.timer, (s64)abs_timeout,
+ current->timer_slack_ns);
+ }
+
+ while (true) {
+ u32 umcg_state;
+
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ smp_mb(); /* Order with set_current_state() above. */
+ ret = -EFAULT;
+ if (get_user_nosleep(umcg_state, &self->state)) {
+ set_current_state(TASK_RUNNING);
+ goto out;
+ }
+
+ ret = 0;
+ if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
+ set_current_state(TASK_RUNNING);
+ goto out;
+ }
+
+ if (abs_timeout)
+ hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
+
+ if (!abs_timeout || timeout.task) {
+ /*
+ * Clear PF_UMCG_WORKER to elide workqueue handlers.
+ */
+ const bool worker = current->flags & PF_UMCG_WORKER;
+
+ if (worker)
+ current->flags &= ~PF_UMCG_WORKER;
+
+ /*
+ * Note: freezable_schedule() here is not appropriate
+ * as umcg_idle_loop can be called from rwsem locking
+ * context (via workqueue handlers), which may
+ * trigger a lockdep warning for mmap_lock.
+ */
+ schedule();
+
+ if (worker)
+ current->flags |= PF_UMCG_WORKER;
+ }
+ __set_current_state(TASK_RUNNING);
+
+ /*
+ * Check for timeout before checking the state, as workers
+ * are not going to return from schedule() unless
+ * they are RUNNING.
+ */
+ ret = -ETIMEDOUT;
+ if (abs_timeout && !timeout.task)
+ goto out;
+
+ ret = -EFAULT;
+ if (get_user_nosleep(umcg_state, &self->state))
+ goto out;
+
+ ret = 0;
+ if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING)
+ goto out;
+
+ ret = -EINTR;
+ if (fatal_signal_pending(current))
+ goto out;
+
+ if (signal_pending(current))
+ flush_signals(current);
+ }
+
+out:
+ if (abs_timeout) {
+ hrtimer_cancel(&timeout.timer);
+ destroy_hrtimer_on_stack(&timeout.timer);
+ }
+
+ /* Workers must go through workqueue handlers upon wakeup. */
+ if (current->flags & PF_UMCG_WORKER) {
+ if (ret == -ETIMEDOUT)
+ ret = handle_timedout_worker(self);
+
+ set_tsk_need_resched(current);
+ }
+
+ return ret;
+}
+
+/*
+ * Try to wake up. May be called with preempt_disable set.
+ *
+ * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
+ * ordering logic.
+ */
+static int umcg_ttwu(u32 next_tid, int wake_flags)
+{
+ struct task_struct *next;
+
+ rcu_read_lock();
+ next = find_task_by_vpid(next_tid);
+ if (!next || !(READ_ONCE(next->umcg_task))) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+
+ try_to_wake_up(next, TASK_NORMAL, wake_flags); /* Result ignored. */
+ rcu_read_unlock();
+
+ return 0;
+}
+
+/*
+ * At the moment, umcg_do_context_switch simply wakes up @next with
+ * WF_CURRENT_CPU and puts the current task to sleep.
+ *
+ * In the future an optimization will be added to adjust runtime accounting
+ * so that from the kernel scheduling perspective the two tasks are
+ * essentially treated as one. In addition, the context switch may be performed
+ * right here on the fast path, instead of going through the wake/wait pair.
+ */
+static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout)
+{
+ struct task_struct *next;
+
+ rcu_read_lock();
+ next = find_task_by_vpid(next_tid);
+ if (!next) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+
+ /* TODO: instead of wake + sleep, do a context switch. */
+ try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU); /* Result ignored. */
+ rcu_read_unlock();
+
+ return umcg_idle_loop(abs_timeout);
+}
+
+/**
+ * sys_umcg_wait: put the current task to sleep and/or wake another task.
+ * @flags: zero or a value from enum umcg_wait_flag.
+ * @abs_timeout: when to wake the task, in nanoseconds; zero for no timeout.
+ *
+ * @self->state must be UMCG_TASK_IDLE (where @self is current->umcg_task)
+ * if !(@flags & UMCG_WAIT_WAKE_ONLY).
+ *
+ * If @self->next_tid is not zero, it must point to an IDLE UMCG task.
+ * The userspace must have changed its state from IDLE to RUNNING
+ * before calling sys_umcg_wait() in the current task. This "next"
+ * task will be woken (context-switched-to on the fast path) when the
+ * current task is put to sleep.
+ *
+ * See Documentation/userspace-api/umcg.rst for detals.
+ *
+ * Return:
+ * 0 - OK;
+ * -ETIMEDOUT - the timeout expired;
+ * -EFAULT - failed accessing struct umcg_task __user of the current
+ * task;
+ * -ESRCH - the task to wake not found or not a UMCG task;
+ * -EINVAL - another error happened (e.g. bad @flags, or the current
+ * task is not a UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout)
+{
+ struct umcg_task __user *self = current->umcg_task;
+ u32 next_tid;
+
+ if (!self)
+ return -EINVAL;
+
+ if (get_user(next_tid, &self->next_tid))
+ return -EFAULT;
+
+ if (flags & UMCG_WAIT_WAKE_ONLY) {
+ if (!next_tid || abs_timeout)
+ return -EINVAL;
+
+ flags &= ~UMCG_WAIT_WAKE_ONLY;
+ if (flags & ~UMCG_WAIT_WF_CURRENT_CPU)
+ return -EINVAL;
+
+ return umcg_ttwu(next_tid, flags & UMCG_WAIT_WF_CURRENT_CPU ?
+ WF_CURRENT_CPU : 0);
+ }
+
+ /* Unlock the worker, if locked. */
+ if (current->flags & PF_UMCG_WORKER) {
+ u32 umcg_state;
+
+ if (get_user(umcg_state, &self->state))
+ return -EFAULT;
+
+ if ((umcg_state & UMCG_TF_LOCKED) && cmpxchg_user_32(
+ &self->state, &umcg_state,
+ umcg_state & ~UMCG_TF_LOCKED))
+ return -EFAULT;
+ }
+
+ if (next_tid)
+ return umcg_do_context_switch(next_tid, abs_timeout);
+
+ return umcg_idle_loop(abs_timeout);
+}
+
+/*
+ * NOTE: all code below is called from workqueue submit/update, so all
+ * errors result in the termination of the current task (via SIGKILL).
+ */
+
+/* Returns true on success, false on _any_ error. */
+static bool mark_server_running(u32 server_tid)
+{
+ struct umcg_task __user *ut_server = NULL;
+ u32 state = UMCG_TASK_IDLE;
+ struct task_struct *tsk;
+
+ rcu_read_lock();
+ tsk = find_task_by_vpid(server_tid);
+ if (tsk)
+ ut_server = READ_ONCE(tsk->umcg_task);
+ rcu_read_unlock();
+
+ if (!ut_server)
+ return false;
+
+ return !cmpxchg_user_32(&ut_server->state, &state, UMCG_TASK_RUNNING);
+}
+
+/*
+ * In the common case, change @tsk RUNNING => BLOCKED. Called from
+ * preempt_disable() and local_irq_disable() context.
+ */
+static void __umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+ struct umcg_task __user *ut_worker = tsk->umcg_task;
+ u32 prev_state, next_state, server_tid;
+ bool preempted = false;
+ int ret;
+
+ if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+ return;
+
+ smp_mb(); /* Guard the read below. */
+ if (get_user_nosleep(prev_state, &ut_worker->state))
+ goto die; /* EFAULT */
+
+ if (prev_state & UMCG_TF_LOCKED)
+ return;
+
+ if ((prev_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING)
+ return; /* the worker is in umcg_wait */
+
+retry_once:
+ next_state = prev_state & ~UMCG_TASK_STATE_MASK;
+ next_state |= UMCG_TASK_BLOCKED;
+ preempted = prev_state & UMCG_TF_PREEMPTED;
+
+ ret = cmpxchg_user_32(&ut_worker->state, &prev_state, next_state);
+ if (ret == -EAGAIN) {
+ if (preempted)
+ goto die; /* Preemption can only happen once. */
+
+ if (prev_state != (UMCG_TASK_RUNNING | UMCG_TF_PREEMPTED))
+ goto die; /* Only preemption can happen. */
+
+ preempted = true;
+ goto retry_once;
+ }
+ if (ret)
+ goto die; /* EFAULT */
+
+ if (get_user_nosleep(server_tid, &ut_worker->next_tid))
+ goto die; /* EFAULT */
+
+ if (!server_tid)
+ return; /* Waking a waiting worker leads here. */
+
+ /* The idle server's wait may timeout. */
+ /* TODO: make a smarter context switch below when available. */
+ if (mark_server_running(server_tid))
+ umcg_ttwu(server_tid, WF_CURRENT_CPU);
+
+ return;
+
+die:
+ pr_warn("umcg_wq_worker_sleeping: killing task %d\n", current->pid);
+ force_sig(SIGKILL);
+}
+
+/* Called from sched_submit_work() with preempt_disable. */
+void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+ /*
+ * Although UMCG preemption state change (UMCG_TF_PREEMPTED) racing
+ * with the worker blocking in a syscall is handled correctly in
+ * __umcg_wq_worker_sleeping() above, actual signal to the worker
+ * during the execution of this function might be causing
+ * isuses, based on some observed test failures. Disabling IRQs
+ * make the failures go away.
+ */
+ local_irq_disable();
+ __umcg_wq_worker_sleeping(tsk);
+ local_irq_enable();
+}
+
+/**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+{
+ u64 __user *node = &ut_worker->idle_workers_ptr;
+ u64 __user *head_ptr;
+ u64 first = (u64)node;
+ u64 head;
+
+ if (get_user_nosleep(head, node) || !head)
+ return false;
+
+ head_ptr = (u64 __user *)head;
+
+ if (put_user_nosleep(1ULL, node))
+ return false;
+
+ if (xchg_user_64(head_ptr, &first))
+ return false;
+
+ if (put_user_nosleep(first, node))
+ return false;
+
+ return true;
+}
+
+/**
+ * get_idle_server - retrieve an idle server, if present.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+static bool get_idle_server(struct umcg_task __user *ut_worker, u32 *server_tid)
+{
+ u64 server_tid_ptr;
+ u32 tid;
+ int ret;
+
+ *server_tid = 0; /* Empty result is OK. */
+
+ if (get_user_nosleep(server_tid_ptr, &ut_worker->idle_server_tid_ptr))
+ return false;
+
+ if (!server_tid_ptr)
+ return false;
+
+ tid = 0;
+ ret = xchg_user_32((u32 __user *)server_tid_ptr, &tid);
+
+ if (ret)
+ return false;
+
+ if (tid && mark_server_running(tid))
+ *server_tid = tid;
+
+ return true;
+}
+
+/*
+ * Returns true to wait, false to return to the userspace. Called with IRQs
+ * disabled. In the common case, enqueues the worker to idle_workers_ptr list
+ * and wakes the idle server (if present).
+ */
+static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid)
+{
+ struct umcg_task __user *ut_worker = tsk->umcg_task;
+ u32 prev_state, next_state;
+ int ret = 0;
+
+ *server_tid = 0;
+
+ if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+ return false;
+
+ if (fatal_signal_pending(tsk))
+ return false;
+
+ smp_mb(); /* Guard the read below. */
+ if (get_user_nosleep(prev_state, &ut_worker->state))
+ goto die;
+
+ if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
+ u32 tid;
+
+ if (prev_state & UMCG_TF_LOCKED)
+ return true; /* Wakeup: wait but don't enqueue. */
+
+ smp_mb(); /* Order getting state and getting server_tid */
+
+ if (get_user_nosleep(tid, &ut_worker->next_tid))
+ goto die;
+
+ if (prev_state & UMCG_TF_PREEMPTED) {
+ if (!tid)
+ goto die; /* PREEMPTED workers must have a server. */
+
+ /* Always enqueue preempted workers. */
+ if (!mark_server_running(tid))
+ goto die;
+
+ *server_tid = tid;
+ } else if (tid)
+ return false; /* pass-through: RUNNING with a server. */
+
+ /* If !PREEMPTED, the worker gets here via UMCG_WAIT_WAKE_ONLY */
+ } else if (unlikely((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
+ (prev_state & UMCG_TF_LOCKED)))
+ return false; /* The worker prepares to sleep or to unregister. */
+
+ if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE)
+ return true; /* the worker called umcg_wait(); don't enqueue */
+
+ next_state = prev_state & ~UMCG_TASK_STATE_MASK;
+ next_state |= UMCG_TASK_IDLE;
+
+ if (prev_state != next_state)
+ ret = cmpxchg_user_32(&ut_worker->state, &prev_state, next_state);
+ if (ret)
+ goto die;
+
+ if (!enqueue_idle_worker(ut_worker))
+ goto die;
+
+ smp_mb(); /* Order enqueuing the worker with getting the server. */
+ if (!(*server_tid) && !get_idle_server(ut_worker, server_tid))
+ goto die;
+
+ return true;
+
+die:
+ pr_warn("umcg_wq_worker_running: killing task %d\n", current->pid);
+ force_sig(SIGKILL);
+ return false;
+}
+
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+ might_sleep();
+
+ /* Avoid recursion by removing PF_UMCG_WORKER */
+ current->flags &= ~PF_UMCG_WORKER;
+
+ do {
+ bool should_wait;
+ u32 server_tid;
+
+ should_wait = process_waking_worker(tsk, &server_tid);
+
+ if (!should_wait)
+ break;
+
+ if (server_tid)
+ umcg_do_context_switch(server_tid, 0);
+ else
+ umcg_idle_loop(0);
+ } while (true);
+
+ current->flags |= PF_UMCG_WORKER;
+}
diff --git a/kernel/sched/umcg.h b/kernel/sched/umcg.h
index 89ba84afa977..737c8612fb47 100644
--- a/kernel/sched/umcg.h
+++ b/kernel/sched/umcg.h
@@ -9,6 +9,19 @@
#include <asm/asm.h>
#include <linux/atomic.h>

+#ifdef CONFIG_UMCG
+
+struct task_struct;
+
+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+void umcg_wq_worker_sleeping(struct task_struct *tsk);
+void umcg_wq_worker_running(struct task_struct *tsk);
+
+#endif /* CONFIG_UMCG */
+
/* TODO: move atomic operations below into arch/ headers */
static inline int __try_cmpxchg_user_32(u32 *uval, u32 __user *uaddr,
u32 oldval, u32 newval)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..cd1be6356e42 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -272,6 +272,10 @@ COND_SYSCALL(landlock_create_ruleset);
COND_SYSCALL(landlock_add_rule);
COND_SYSCALL(landlock_restrict_self);

+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+
/* arch/example/kernel/sys_example.c */

/* mm/fadvise.c */
--
2.25.1

2021-09-08 19:07:40

by Peter Oskolkov

[permalink] [raw]
Subject: [PATCH 4/4 v0.5] sched/umcg: add Documentation/userspace-api/umcg.rst

Document User Managed Concurrency Groups syscalls, data structures,
state transitions, etc.

Signed-off-by: Peter Oskolkov <[email protected]>
---
Documentation/userspace-api/umcg.rst | 546 +++++++++++++++++++++++++++
1 file changed, 546 insertions(+)
create mode 100644 Documentation/userspace-api/umcg.rst

diff --git a/Documentation/userspace-api/umcg.rst b/Documentation/userspace-api/umcg.rst
new file mode 100644
index 000000000000..b3e84cef212c
--- /dev/null
+++ b/Documentation/userspace-api/umcg.rst
@@ -0,0 +1,546 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+UMCG Userspace API
+=====================================
+
+User Managed Concurrency Groups (UMCG) is an M:N threading
+subsystem/toolkit that lets user space application developers
+implement in-process user space schedulers.
+
+.. contents:: :local:
+
+Why? Heterogeneous in-process workloads
+=======================================
+Linux kernel's CFS scheduler is designed for the "common" use case,
+with efficiency/throughput in mind. Work isolation and workloads of
+different "urgency" are addressed by tools such as cgroups, CPU
+affinity, priorities, etc., which are difficult or impossible to
+efficiently use in-process.
+
+For example, a single DBMS process may receive tens of thousands
+requests per second; some of these requests may have strong response
+latency requirements as they serve live user requests (e.g. login
+authentication); some of these requests may not care much about
+latency but must be served within a certain time period (e.g. an
+hourly aggregate usage report); some of these requests are to be
+served only on a best-effort basis and can be NACKed under high load
+(e.g. an exploratory research/hypothesis testing workload).
+
+Beyond different work item latency/throughput requirements as outlined
+above, the DBMS may need to provide certain guarantees to different
+users; for example, user A may "reserve" 1 CPU for their
+high-priority/low latency requests, 2 CPUs for mid-level throughput
+workloads, and be allowed to send as many best-effort requests as
+possible, which may or may not be served, depending on the DBMS load.
+Besides, the best-effort work, started when the load was low, may need
+to be delayed if suddenly a large amount of higher-priority work
+arrives. With hundreds or thousands of users like this, it is very
+difficult to guarantee the application's responsiveness using standard
+Linux tools while maintaining high CPU utilization.
+
+Gaming is another use case: some in-process work must be completed
+before a certain deadline dictated by frame rendering schedule, while
+other work items can be delayed; some work may need to be
+cancelled/discarded because the deadline has passed; etc.
+
+User Managed Concurrency Groups is an M:N threading toolkit that
+allows constructing user space schedulers designed to efficiently
+manage heterogeneous in-process workloads described above while
+maintaining high CPU utilization (95%+).
+
+Requirements
+============
+One relatively established way to design high-efficiency, low-latency
+systems is to split all work into small on-cpu work items, with
+asynchronous I/O and continuations, all executed on a thread pool with
+the number of threads not exceeding the number of available CPUs.
+Although this approach works, it is quite difficult to develop and
+maintain such a system, as, for example, small continuations are
+difficult to piece together when debugging. Besides, such asynchronous
+callback-based systems tend to be somewhat cache-inefficient, as
+continuations can get scheduled on any CPU regardless of cache
+locality.
+
+M:N threading and cooperative user space scheduling enables controlled
+CPU usage (minimal OS preemption), synchronous coding style, and
+better cache locality.
+
+Specifically:
+
+- a variable/fluctuating number M of "application" threads should be
+ "scheduled over" a relatively fixed number N of "kernel" threads,
+ where N is less than or equal to the number of CPUs available;
+- only those application threads that are attached to kernel threads
+ are scheduled "on CPU";
+- application threads should be able to cooperatively yield to each other;
+- when an application thread blocks in kernel (e.g. in I/O), this
+ becomes a scheduling event ("block") that the userspace scheduler
+ should be able to efficiently detect, and reassign a waiting
+ application thread to the freeded "kernel" thread;
+- when a blocked application thread wakes (e.g. its I/O operation
+ completes), this even ("wake") should also be detectable by the
+ userspace scheduler, which should be able to either quickly dispatch
+ the newly woken thread to an idle "kernel" thread or, if all "kernel"
+ threads are busy, put it in the waiting queue;
+- in addition to the above, it would be extremely useful for a
+ separate in-process "watchdog" facility to be able to monitor the
+ state of each of the M+N threads, and to intervene in case of runaway
+ workloads (interrupt/preempt).
+
+
+UMCG kernel API
+===============
+Based on the requrements above, UMCG *kernel* API is build around
+the following ideas:
+
+- *UMCG server*: a task/thread representing "kernel threads", or CPUs
+ from the requirements above;
+- *UMCG worker*: a task/thread representing "application threads", to
+ be scheduled over servers;
+- UMCG *task state*: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG
+ task (a server or a worker) can be in;
+- UMCG task *state flag*: LOCKED, PREEMPTED: additional state flags
+ that can be ORed with the task state to communicate additional information
+ to the kernel;
+- ``struct umcg_task``: a per-task userspace set of data fields, usually
+ residing in the TLS, that fully reflects the current task's UMCG
+ state and controls the way the kernel manages the task;
+- ``sys_umcg_ctl()``: a syscall used to register the current task/thread
+ as a server or a worker, or to unregister a UMCG task;
+- ``sys_umcg_wait()``: a syscall used to put the current task to
+ sleep and/or wake another task, pontentially context-switching
+ between the two tasks on-CPU synchronously.
+
+
+Servers
+=======
+
+When a task/thread is registered as a server, it is in RUNNING
+state and behaves like any other normal task/thread. In addition,
+servers can interact with other UMCG tasks via sys_umcg_wait():
+
+- servers can voluntarily suspend their execution (wait), becoming IDLE;
+- servers can wake other IDLE servers;
+- servers can context-switch between each other.
+
+Note that if a server blocks in the kernel *not* via sys_umcg_wait(),
+it still retains its RUNNING state.
+
+Also note that servers can be used for fast on-CPU context switching
+across process boundaries; server-worker interactions assume they
+belong to the same mm.
+
+See the next section on how servers interact with workers.
+
+Workers
+=======
+
+A worker cannot be RUNNING without having a server associated
+with it, so when a task is first registered as a worker, it enters
+the IDLE state.
+
+- a worker becomes RUNNING when a server calls sys_umcg_wait to
+ context-switch into it; the server goes IDLE, and the worker becomes
+ RUNNING in its place;
+- when a running worker blocks in the kernel, it becomes BLOCKED,
+ its associated server becomes RUNNING and the server's
+ sys_umcg_wait() call from the bullet above returns; this transition
+ is sometimes called "block detection";
+- when the syscall on which a BLOCKED worker completes, the worker
+ becomes IDLE and is added to the list of idle workers; if there
+ is an idle server waiting, the kernel wakes it; this transition
+ is sometimes called "wake detection";
+- running workers can voluntarily suspend their execution (wait),
+ becoming IDLE; their associated servers are woken;
+- a RUNNING worker can context-switch with an IDLE worker; the server
+ of the switched-out worker is transferred to the switched-in worker;
+- any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
+ this is a server running the worker as described in the first bullet
+ in this list, the worker remain IDLE but is added to the idle workers
+ list; this "wake" operation exists for completeness, to make sure
+ wait/wake/context-switch operations are available for all UMCG tasks;
+- the userspace can preempt a RUNNING worker by marking it
+ ``RUNNING|PREEMPTED`` and sending a signal to it; the userspace should
+ have installed a NOP signal handler for the signal; the kernel will
+ then transition the worker into ``IDLE|PREEMPTED`` state and wake
+ its associated server.
+
+UMCG task states
+================
+
+Important: all state transitions described below involve at least
+two steps: the change of the state field in ``struct umcg_task``,
+for example ``RUNNING`` to ``IDLE``, and the corresponding change in
+``struct task_struct`` state, for example a transition between the task
+running on CPU and being descheduled and removed from the kernel runqueue.
+The key principle of UMCG API design is that the party initiating
+the state transition modifies the state variable.
+
+For example, a task going ``IDLE`` first changes its state from ``RUNNING``
+to ``IDLE`` in the userpace and then calls ``sys_umcg_wait()``, which
+completes the transition.
+
+Note on documentation: in ``include/uapi/linux/umcg.h``, task states
+have the form ``UMCG_TASK_RUNNING``, ``UMCG_TASK_BLOCKED``, etc. In
+this document these are usually referred to simply ``RUNNING`` and
+``BLOCKED``, unless it creates ambiguity. Task state flags, e.g.
+``UMCG_TF_PREEMPTED``, are treated similarly.
+
+UMCG task states reflect the view from the userspace, rather than from
+the kernel. There are three fundamental task states:
+
+- ``RUNNING``: indicates that the task is schedulable by the kernel; applies
+ to both servers and workers;
+- ``IDLE``: indicates that the task is *not* schedulable by the kernel
+ (see ``umcg_idle_loop()`` in ``kernel/sched/umcg.c``); applies to
+ both servers and workers;
+- ``BLOCKED``: indicates that the worker is blocked in the kernel;
+ does not apply to servers.
+
+In addition to the three states above, two state flags help with
+state transitions:
+
+- ``LOCKED``: the userspace is preparing the worker for a state transition
+ and "locks" the worker until the worker is ready for the kernel to
+ act on the state transition; used similarly to preempt_disable or
+ irq_disable in the kernel; applies only to workers in ``RUNNING`` or
+ ``IDLE`` state; ``RUNNING|LOCKED`` means "this worker is about to
+ become ``RUNNING``, while ``IDLE|LOCKED`` means "this worker is about
+ to become ``IDLE`` or unregister;
+- ``PREEMPTED``: the userspace indicates it wants the worker to be
+ preempted; there are no situations when both ``LOCKED`` and ``PREEMPTED``
+ flags are set at the same time.
+
+struct umcg_task
+================
+
+From ``include/uapi/linux/umcg.h``:
+
+.. code-block:: C
+
+ struct umcg_task {
+ uint32_t state; /* r/w */
+ uint32_t next_tid; /* r */
+ uint64_t idle_workers_ptr; /* r/w */
+ uint64_t idle_server_tid_ptr; /* r* */
+ };
+
+Each UMCG task is identified by ``struct umcg_task``, which is provided
+to the kernel when the task is registered via ``sys_umcg_ctl()``.
+
+- ``uint32_t state``: the current state of the task this struct identifies,
+ as described in the previous section. Readable/writable by both the kernel
+ and the userspace.
+
+ - bits 0 - 7: task state (RUNNING, IDLE, BLOCKED);
+ - bits 8 - 15: state flags (LOCKED, PREEMPTED);
+ - bits 16 - 23: reserved; must be zeroes;
+ - bits 24 - 31: for userspace use.
+
+- ``uint32_t next_tid``: contains the TID of the task to context-switch-into
+ in ``sys_umcg_wait()``; can be zero; writable by the userspace, readable
+ by the kernel; if this is a RUNNING worker, this field contains
+ the TID of the server that should be woken when this worker blocks;
+ see ``sys_umcg_wait()`` for more details;
+
+- ``uint64_t idle_workers_ptr``: this field forms a single-linked list
+ of idle workers: all RUNNING workers have this field set to point
+ to the head of the list (a pointer variable in the userspace).
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from ``BLOCKED`` to ``IDLE`` and adds the worker
+ to the top of the list of idle workers using this logic:
+
+ .. code-block:: C
+
+ /* kernel side */
+ /**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+ static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+ {
+ u64 __user *node = &ut_worker->idle_workers_ptr;
+ u64 __user *head_ptr;
+ u64 first = (u64)node;
+ u64 head;
+
+ if (get_user_nosleep(head, node) || !head)
+ return false;
+
+ head_ptr = (u64 __user *)head;
+
+ if (put_user_nosleep(1ULL, node))
+ return false;
+
+ if (xchg_user_64(head_ptr, &first))
+ return false;
+
+ if (put_user_nosleep(first, node))
+ return false;
+
+ return true;
+ }
+
+
+ In the userspace the list is cleared atomically using this logic:
+
+ .. code-block:: C
+
+ /* userspace side */
+ uint64_t *idle_workers = (uint64_t *)*head;
+
+ atomic_exchange(&idle_workers, NULL);
+
+ The userspace re-points workers' idle_workers_ptr to the list head
+ variable before the worker is allowed to become RUNNING again.
+
+- ``uint64_t idle_server_tid_ptr``: points to a pointer variable in the
+ userspace that points to an idle server, i.e. a server in IDLE state waiting
+ in sys_umcg_wait(); read-only; workers must have this field set; not used
+ in servers.
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from ``BLOCKED`` to ``IDLE``, adds the worker
+ to the list of idle workers, and checks whether
+ ``*idle_server_tid_ptr`` is not zero. If not, the kernel tries to cmpxchg()
+ it with zero; if cmpxchg() succeeds, the kernel will then wake the server.
+ See `State transitions`_ below for more details.
+
+sys_umcg_ctl()
+==============
+
+``int sys_umcg_ctl(uint32_t flags, struct umcg_task *self)`` is used to
+register or unregister the current task as a worker or server. Flags
+can be one of the following:
+
+- ``UMCG_CTL_REGISTER``: register a server;
+- ``UMCG_CTL_REGISTER | UMCG_CTL_WORKER``: register a worker;
+- ``UMCG_CTL_UNREGISTER``: unregister the current server or worker.
+
+When registering a task, ``self`` must point to ``struct umcg_task``
+describing this server or worker; the pointer must remain valid until
+the task is unregistered.
+
+When registering a server, ``self->state`` must be ``RUNNING``; all other
+fields in ``self`` must be zeroes.
+
+When registering a worker, ``self->state`` must be ``BLOCKED``;
+``self->idle_server_tid_ptr`` and ``self->idle_workers_ptr`` must be
+valid pointers as described in `struct umcg_task`_; ``self->next_tid`` must
+be zero.
+
+When unregistering a task, ``self`` must be ``NULL``.
+
+sys_umcg_wait()
+===============
+
+``int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout)`` operates
+on registered UMCG servers and workers: ``struct umcg_task *self`` provided
+to ``sys_umcg_ctl()`` when registering the current task is consulted
+in addition to ``flags`` and ``abs_timeout`` parameters.
+
+The function can be used to perform one of the three operations:
+
+- wait: if ``self->next_tid`` is zero, ``sys_umcg_wait()`` puts the current
+ server or worker to sleep;
+- wake: if ``self->next_tid`` is not zero, and ``flags & UMCG_WAIT_WAKE_ONLY``,
+ the task identified by ``next_tid`` is woken (must be in ``IDLE`` state);
+- context switch: if ``self->next_tid`` is not zero, and
+ ``!(flags & UMCG_WAIT_WAKE_ONLY)``, the current task is put to sleep and
+ the next task is woken, synchronously switching between the tasks on the
+ current CPU on the fast path.
+
+Flags can be zero or a combination of the following values:
+
+- ``UMCG_WAIT_WAKE_ONLY``: wake the next task, don't put the current task
+ to sleep;
+- ``UMCG_WAIT_WF_CURRENT_CPU``: wake the next task on the curent CPU;
+ this flag has an effect only if ``UMCG_WAIT_WAKE_ONLY`` is set: context
+ switching is always attempted to happen on the curent CPU.
+
+The section below provides more details on how servers and workers interact
+via ``sys_umcg_wait()``, during worker block/wake events, and during
+worker preemption.
+
+State transitions
+=================
+
+As mentioned above, the key principle of UMCG state transitions is that
+**the party initiating the state transition modifies the state of affected
+tasks**.
+
+Below, "``TASK:STATE``" indicates a task T, where T can be either W for
+worker or S for server, in state S, where S can be one of the three states,
+potentially ORed with a state flag. Each individual state transition
+is an atomic operation (cmpxchg) unless indicated otherwise. Also note
+that **the order of state transitions is important and is part of the
+contract between the userspace and the kernel. The kernel is free
+to kill the task (SIGSEGV) if the contract is broken.**
+
+Some worker state transitions below include adding ``LOCKED`` flag to
+worker state. This is done to indicate to the kernel that the worker
+is transitioning state and should not participate in the block/wake
+detection routines, which can happen due to interrupts/pagefaults/signals.
+
+``IDLE|LOCKED`` means that a running worker is preparing to sleep, so
+interrupts should not lead to server wakeup; ``RUNNING|LOCKED`` means that
+an idle worker is going to be "scheduled to run", but may not yet have its
+server set up properly.
+
+Key state transitions:
+
+- server to worker context switch ("schedule a worker to run"):
+ ``S:RUNNING+W:IDLE => S:IDLE+W:RUNNING``:
+
+ - in the userspace, in the context of the server S running:
+
+ - ``S:RUNNING => S:IDLE`` (mark self as idle)
+ - ``W:IDLE => W:RUNNING|LOCKED`` (mark the worker as running)
+ - ``W.next_tid := S.tid; S.next_tid := W.tid``
+ (link the server with the worker)
+ - ``W:RUNNING|LOCKED => W:RUNNING`` (unlock the worker)
+ - ``S: sys_umcg_wait()`` (make the syscall)
+
+ - the kernel context switches from the server to the worker; the server
+ sleeps until it becomes ``RUNNING`` during one of the transitions below;
+
+- worker to server context switch (worker "yields"):
+ ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE``:
+
+ - in the userspace, in the context of the worker W running (note that
+ a running worker has its ``next_tid`` set to point to its server):
+
+ - ``W:RUNNING => W:IDLE|LOCKED`` (mark self as idle)
+ - ``S:IDLE => S:RUNNING`` (mark the server as running)
+ - ``W: sys_umcg_wait()`` (make the syscall)
+
+ - the kernel removes the ``LOCKED`` flag from the worker's state and
+ context switches from the worker to the server; the worker
+ sleeps until it becomes ``RUNNING``;
+
+- worker to worker context switch:
+ ``W1:RUNNING+W2:IDLE => W1:IDLE+W2:RUNNING``:
+
+ - in the userspace, in the context of W1 running:
+
+ - ``W2:IDLE => W2:RUNNING|LOCKED`` (mark W2 as running)
+ - ``W1:RUNNING => W1:IDLE|LOCKED`` (mark self as idle)
+ - ``W2.next_tid := W1.next_tid; S.next_tid := W2.next_tid``
+ (transfer the server W1 => W2)
+ - ``W1:next_tid := W2.tid`` (indicate that W1 should
+ context-switch into W2)
+ - ``W2:RUNNING|LOCKED => W2:RUNNING`` (unlock W2)
+ - ``W1: sys_umcg_wait()`` (make the syscall)
+
+ - same as above, the kernel removes the ``LOCKED`` flag from the W1's state
+ and context switches to next_tid;
+
+- worker wakeup: ``W:IDLE => W:RUNNING``:
+
+ - in the userspace, a server S can wake a worker W without "running" it:
+
+ - ``S:next_tid :=W.tid``
+ - ``W:next_tid := 0``
+ - ``W:IDLE => W:RUNNING``
+ - ``sys_umcg_wait(UMCG_WAIT_WAKE_ONLY)`` (make the syscall)
+
+ - the kernel will wake the worker W; as the worker does not have a server
+ assigned, "wake detection" will happen, the worker will be immediately
+ marked as ``IDLE`` and added to idle workers list; an idle server, if any,
+ will be woken (see 'wake detection' below);
+ - Note: if needed, it is possible for a worker to wake another worker:
+ the waker marks itself "IDLE|LOCKED", points its next_tid to the wakee,
+ makes the syscall, restores its server in next_tid, marks itself
+ as ``RUNNING``.
+
+- block detection: worker blocks in the kernel: ``S:IDLE+W:RUNNING => S:RUNNING+W:BLOCKED``:
+
+ - when a worker blocks in the kernel in ``RUNNING`` state (not ``LOCKED``),
+ before descheduling the task from the CPU the kernel performs these
+ operations:
+
+ - ``W:RUNNING => W:BLOCKED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+ - ``try_to_wake_up(S)``
+
+ - if any of the first three operations above fail, the worker is killed via
+ ``SIGSEGV``. Note that ``ttwu(S)`` is not required to succeed, as the
+ server may still be transitioning to sleep in ``sys_umcg_wait()``; before
+ actually putting the server to sleep its UMCG state is checked and, if
+ it is ``RUNNING``, sys_umcg_wait() returns to the userspace;
+ - if the worker has its ``LOCKED`` flag set, block detection does not trigger,
+ as the worker is assumed to be in the userspace scheduling code.
+
+- wake detection: worker wakes in the kernel: ``W:BLOCKED => W:IDLE``:
+
+ - all workers' returns to the userspace are intercepted:
+
+ - ``start:`` (a label)
+ - if ``W:RUNNING & W.next_tid != 0``: let the worker exit to the userspace,
+ as this is a ``RUNNING`` worker with a server;
+ - ``W:* => W:IDLE`` (previously blocked or woken without servers workers
+ are not allowed to return to the userspace);
+ - the worker is appended to ``W.idle_workers_ptr`` idle workers list;
+ - ``S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S.RUNNING; ttwu(S)``
+ - ``idle_loop(W)``: this is the same idle loop that ``sys_umcg_wait()``
+ uses: it breaks only when the worker becomes ``RUNNING``; when the
+ idle loop exits, it is assumed that the userspace has properly
+ removed the worker from the idle workers list before marking it
+ ``RUNNING``;
+ - ``goto start;`` (repeat from the beginning).
+
+ - the logic above is a bit more complicated in the presence of ``LOCKED`` or
+ ``PREEMPTED`` flags, but the main invariants stay the same:
+
+ - only ``RUNNING`` workers with servers assigned are allowed to run
+ in the userspace (unless ``LOCKED``);
+ - newly ``IDLE`` workers are added to the idle workers list; any
+ user-initiated state change assumes the userspace properly removed
+ the worker from the list;
+ - as with wake detection, any "breach of contract" by the userspace
+ will result in the task termination via ``SIGSEGV``.
+
+- worker preemption: ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED``:
+
+ - when the userspace wants to preempt a ``RUNNING`` worker, it changes
+ it state, atomically, ``RUNNING => RUNNING|PREEMPTED`` and sends a signal
+ to the worker via ``tgkill()``; the signal handler, previously set up
+ by the userspace, can be a NOP (note that only ``RUNNING`` workers can be
+ preempted);
+ - if the worker, at the moment the signal arrived, continued to be running
+ on-CPU in the userspace, the "wake detection" code will be triggered that,
+ in addition to what was described above, will check if the worker is in
+ ``RUNNING|PREEMPTED`` state:
+
+ - ``W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+ - ``try_to_wakeup(S)``
+
+ - if the signal arrives after the worker blocks in the kernel, the "block
+ detection" happened as described above, with the following change:
+
+ - ``W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+ - ``try_to_wake_up(S)``
+
+ - in any case, the worker's server is woken, with its attached worker
+ (``S.next_tid``) either in ``BLOCKED|PREEMPTED`` or ``IDLE|PREEMPTED``
+ state.
+
+Server-only use cases
+=====================
+
+Some workloads/applications may benefit from fast and synchronous on-CPU
+user-initiated context switches without the need for full userspace
+scheduling (block/wake detection). These applications can use "standalone"
+UMCG servers to wait/wake/context-switch, including across process boundaries.
+
+These "worker-less" operations involve trivial ``RUNNING`` <==> ``IDLE``
+state changes, not discussed here for brevity.
+
--
2.25.1

2021-09-08 23:40:33

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Wed, Sep 8, 2021 at 8:49 PM Peter Oskolkov <[email protected]> wrote:
> Add helper functions to work atomically with userspace 32/64 bit values -
> there are some .*futex.* named helpers, but they are not exactly
> what is needed for UMCG; I haven't found what else I could use, so I
> rolled these.
[...]
> +static inline int fix_pagefault(unsigned long uaddr, bool write_fault)
> +{
> + struct mm_struct *mm = current->mm;
> + int ret;
> +
> + mmap_read_lock(mm);

A minor note: mmap_read_lock() can potentially block for extended
amounts of time, so *if* you have a way to safely bail out, it's nice
to use mmap_read_lock_killable() to ensure that if the task gets
killed, it'll go away quickly instead of continuing to potentially
wait on the lock. (But of course, there are situations where you just
can't do that, so there you have to use the unconditional
mmap_read_lock().)

So here, since you need a bailout path anyway in this function, I'd write:

if (mmap_read_lock_killable(mm))
return -EINTR;

> + ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
> + NULL);
> + mmap_read_unlock(mm);
> +
> + return ret < 0 ? ret : 0;
> +}
[...]
> +static inline int __try_xchg_user_64(u64 *oval, u64 __user *uaddr, u64 newval)
> +{
> + u64 oldval = 0;
> + int ret = 0;
> +
> + asm volatile("\n"
> + "1:\txchgq %0, %2\n"
> + "2:\n"
> + "\t.section .fixup, \"ax\"\n"
> + "3:\tmov %3, %0\n"
> + "\tjmp 2b\n"
> + "\t.previous\n"
> + _ASM_EXTABLE_UA(1b, 3b)
> + : "=r" (oldval), "=r" (ret), "+m" (*uaddr)
> + : "i" (-EFAULT), "0" (newval), "1" (0)
> + );
> +
> + if (ret)
> + return ret;
> +
> + *oval = oldval;
> + return 0;
> +}
[...]
> +/**
> + * xchg_64_user - atomically exchange 64-bit values
> + *
> + * Return:
> + * 0 - OK
> + * -EFAULT: memory access error
> + */
> +static inline int xchg_user_64(u64 __user *uaddr, u64 *val)
> +{
> + int ret = -EFAULT;
> +
> + if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> + return -EFAULT;

For these atomic xchg operations, I think you should probably also
check for proper alignment (something like "(unsigned long)uaddr %
sizeof(*uaddr) == 0", I guess)?
Otherwise the __try_xchg_user_64() call could hit Split Lock Detection
(see <https://lore.kernel.org/all/[email protected]/>)
on newer hardware, meaning the XCHGQ instruction would throw a #AC
exception, which would get caught by the kernel because it happened
during user access, so then you land on the fixup path that returns
-EFAULT, and then this function would assume that it was caused by a
pagefault, invoke pagefault handling, the pagefault handling would say
"the page is present, try again now", and you'd end up in an infinite
loop...

(Yes, the futex version doesn't have that - the futex code instead
does that check further up, in get_futex_key() and
handle_futex_death().)

> + pagefault_disable();
> +
> + while (true) {
> + __uaccess_begin_nospec();
> + ret = __try_xchg_user_64(val, uaddr, *val);
> + user_access_end();
> +
> + if (!ret)
> + break;
> +
> + if (fix_pagefault((unsigned long)uaddr, true) < 0)
> + break;
> + }
> +
> + pagefault_enable();
> +
> + return ret;
> +}
> +
> +/**
> + * get_user_nosleep - get user value with inline fixup without sleeping.
> + *
> + * get_user() might sleep and therefore cannot be used in preempt-disabled
> + * regions.
> + */

If this function is not allowed to sleep, as the comment says...

> +#define get_user_nosleep(out, uaddr) \
> +({ \
> + int ret = -EFAULT; \
> + \
> + if (access_ok((uaddr), sizeof(*(uaddr)))) { \
> + pagefault_disable(); \
> + \
> + while (true) { \
> + if (!__get_user((out), (uaddr))) { \
> + ret = 0; \
> + break; \
> + } \
> + \
> + if (fix_pagefault((unsigned long)(uaddr), false) < 0) \
> + break; \

... then I'm pretty sure you can't call fix_pagefault() here, which
acquires the mmap semaphore (which may involve sleeping) and then goes
through the pagefault handling path (which can also sleep for various
reasons, like allocating memory for pagetables, loading pages from
disk / NFS / FUSE, and so on).

If you're in some kind of non-sleepable context, and you want to
access a userspace address that isn't currently paged in, you have to
get out of whatever non-sleepable context you're in before going
through the fault-handling path and back into the context you were in.

Alternatively, if sleeping isn't possible and getting back out of the
sleepable context temporarily is also too hard, you could try to look
up the userspace page ahead of time (e.g. during umcg_ctl()) with
pin_user_pages_unlocked() and kmap() it into the kernel. That's
probably a lot slower than a direct userspace access, but if you only
have to do it during umcg_ctl(), that might not be a problem. It also
more or less requires that the userspace struct doesn't cross a page
boundary (otherwise you'd have to either vmap it or use helpers for
accessing the pages), and it means you have 4KiB of extra unswappable
memory per thread, and it might worsen memory fragmentation (because
pinned pages can't be migrated anymore even though the kernel thought
they'd probably be migratable at page allocation time).

Since it looks like you want to access userspace memory during
sched_submit_work() (e.g. for storing the UMCG_TF_PREEMPTED flag), I
think the pin_user_pages_unlocked() approach is what you'll have to
use there. There you can then essentially access the userspace
structure through its kernel mapping, basically like normal kernel
memory (except of course that you have to keep in mind that userspace
can concurrently read and write that memory, so instead of plain
loads/stores you have to use at least READ_ONCE()/WRITE_ONCE()).

You of course won't be able to simply traverse userspace pointers in
such a situation, only access the specific userspace object that
you've prepared beforehand, but luckily it looks like:

* idle_server_tid_ptr is only accessed in get_idle_server()
-> which is used from process_waking_worker()
-> which runs in sleepable context
* idle_workers_ptr is accessed from:
-> enqueue_idle_worker
-> which is also used from process_waking_worker()

so I guess that's not a problem?





> + } \
> + \
> + pagefault_enable(); \
> + } \
> + ret; \
> +})

2021-09-09 01:45:36

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 3/4 v0.5] sched/umcg: RFC: implement UMCG syscalls

On Wed, Sep 8, 2021 at 8:49 PM Peter Oskolkov <[email protected]> wrote:
> Define struct umcg_task and two syscalls: sys_umcg_ctl sys_umcg_wait.
>
> All key operations, such as wait/wake/context-switch, as well as
> timeouts and block/wake detection, are working quite reliably.
>
> In addition, the userspace can now force the kernel to preempt
> a running worker by changing its state from RUNNING to
> RUNNING | PREEMPTED and sending a signal to it. This new functionality
> is less well tested than the key operations above, but is working
> well in the common case of the worker busy in the userspace.

I haven't properly reviewed the entire thing, but I've left a bunch of
comments below. Especially for the part about priority inversion /
livelocks at the bottom, it might be tricky to figure out exactly how
this should work.

[...]
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
[...]
> @@ -4159,6 +4160,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> p->wake_entry.u_flags = CSD_TYPE_TTWU;
> p->migration_pending = NULL;
> #endif
> +#ifdef CONFIG_UMCG
> + p->umcg_task = NULL;
> + p->flags &= ~PF_UMCG_WORKER;
> +#endif

These things probably also need to be reset on execve, not just on
fork? In particular the userspace pointer in ->umcg_task will become
meaningless on execve(). Maybe a good spot to reset that would be
somewhere in begin_new_exec(), before the call to exec_mmap()? Or
maybe execve should bail out early on if the task is configured for
UMCG?

[...]
> diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
[...]
> +/*
> + * NOTE: all code below is called from workqueue submit/update, so all
> + * errors result in the termination of the current task (via SIGKILL).
> + */
> +
> +/* Returns true on success, false on _any_ error. */
> +static bool mark_server_running(u32 server_tid)
> +{
> + struct umcg_task __user *ut_server = NULL;
> + u32 state = UMCG_TASK_IDLE;
> + struct task_struct *tsk;
> +
> + rcu_read_lock();
> + tsk = find_task_by_vpid(server_tid);
> + if (tsk)
> + ut_server = READ_ONCE(tsk->umcg_task);

Since you are reading the ->umcg_task pointer of another task here,
protected only by READ_ONCE(), I think the writes over in the
umcg_ctl() syscall and in do_exit() should be using WRITE_ONCE()?

You should check here that userspace isn't just giving us a TID of
some entirely unrelated thread (and also that if that thread just
died, and its ID was reassigned to a new thread in another process,
we're not going to access an address belonging to an entirely
different process here). I think an appropriate check would be
"READ_ONCE(tsk->group_leader) == current->group_leader"?

(And then maybe you could also do that check in the other places that
do find_task_by_vpid() - for consistency and also in general to reduce
how much userspace can interfere with the scheduling of unrelated
processes.)

> + rcu_read_unlock();
> +
> + if (!ut_server)
> + return false;
> +
> + return !cmpxchg_user_32(&ut_server->state, &state, UMCG_TASK_RUNNING);
> +}
> +
> +/*
> + * In the common case, change @tsk RUNNING => BLOCKED. Called from
> + * preempt_disable() and local_irq_disable() context.
> + */
> +static void __umcg_wq_worker_sleeping(struct task_struct *tsk)

Can you maybe add a sentence here that describes in more high-level
terms what's going on? Something like "The current task is a UMCG
worker that is about to block, so we may have to inform userspace and
maybe also wake up another task on this CPU to keep the CPU fully
utilized"?

> +{
> + struct umcg_task __user *ut_worker = tsk->umcg_task;
> + u32 prev_state, next_state, server_tid;
> + bool preempted = false;
> + int ret;
> +
> + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
> + return;
> +
> + smp_mb(); /* Guard the read below. */

There are two sides to a barrier - one specific operation (or set of
operations) has to happen after another operation (or set of
operations). The comment doesn't make it clear against which preceding
operation we're ordering the following read.

> + if (get_user_nosleep(prev_state, &ut_worker->state))
> + goto die; /* EFAULT */
> +
> + if (prev_state & UMCG_TF_LOCKED)
> + return;
> + if ((prev_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING)
> + return; /* the worker is in umcg_wait */
> +
> +retry_once:
> + next_state = prev_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_BLOCKED;
> + preempted = prev_state & UMCG_TF_PREEMPTED;
> +
> + ret = cmpxchg_user_32(&ut_worker->state, &prev_state, next_state);
> + if (ret == -EAGAIN) {
> + if (preempted)
> + goto die; /* Preemption can only happen once. */
> +
> + if (prev_state != (UMCG_TASK_RUNNING | UMCG_TF_PREEMPTED))
> + goto die; /* Only preemption can happen. */
> +
> + preempted = true;
> + goto retry_once;
> + }
> + if (ret)
> + goto die; /* EFAULT */
> +
> + if (get_user_nosleep(server_tid, &ut_worker->next_tid))
> + goto die; /* EFAULT */
> +
> + if (!server_tid)
> + return; /* Waking a waiting worker leads here. */
> +
> + /* The idle server's wait may timeout. */
> + /* TODO: make a smarter context switch below when available. */
> + if (mark_server_running(server_tid))
> + umcg_ttwu(server_tid, WF_CURRENT_CPU);
> +
> + return;
> +
> +die:
> + pr_warn("umcg_wq_worker_sleeping: killing task %d\n", current->pid);
> + force_sig(SIGKILL);
> +}
> +
> +/* Called from sched_submit_work() with preempt_disable. */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> + /*
> + * Although UMCG preemption state change (UMCG_TF_PREEMPTED) racing
> + * with the worker blocking in a syscall is handled correctly in
> + * __umcg_wq_worker_sleeping() above, actual signal to the worker
> + * during the execution of this function might be causing
> + * isuses, based on some observed test failures. Disabling IRQs
> + * make the failures go away.
> + */

I guess that's a todo item to figure out before this lands?

> + local_irq_disable();
> + __umcg_wq_worker_sleeping(tsk);
> + local_irq_enable();
> +}
> +
> +/**
> + * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
> + *
> + * Returns true on success, false on a fatal failure.
> + */
> +static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
> +{
> + u64 __user *node = &ut_worker->idle_workers_ptr;
> + u64 __user *head_ptr;
> + u64 first = (u64)node;
> + u64 head;
> +
> + if (get_user_nosleep(head, node) || !head)
> + return false;
> +
> + head_ptr = (u64 __user *)head;
> +
> + if (put_user_nosleep(1ULL, node))
> + return false;

Can anyone observe this value, and if so, what does it mean for them?

If it's a meaningful value, maybe there should be a #define for it
somewhere that gives it a proper name; if it isn't, and you're just
writing it here as a sort of sanity check, maybe add a comment that
explains that?

> + if (xchg_user_64(head_ptr, &first))
> + return false;
> +
> + if (put_user_nosleep(first, node))
> + return false;
> +
> + return true;
> +}
[...]
> +/*
> + * Returns true to wait,

Maybe clarify this as "Returns true to switch away from this task
(either by switching to a specific other task or by just scheduling)",
or something like that?

> false to return to the userspace.

Maybe "false to continue execution in the current task"? We're not
necessarily going to return straight to userspace from here.

> Called with IRQs
> + * disabled. In the common case, enqueues the worker to idle_workers_ptr list
> + * and wakes the idle server (if present).
> + */

The comment here says that we're called with IRQs disabled, but
there's a path in our caller umcg_wq_worker_running() where we're
called almost directly after calling might_sleep(). Isn't that a
contradiction?

As far as I can tell, we always get here in sleepable context.

> +static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid)
> +{
> + struct umcg_task __user *ut_worker = tsk->umcg_task;
> + u32 prev_state, next_state;
> + int ret = 0;
> +
> + *server_tid = 0;
> +
> + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
> + return false;
> +
> + if (fatal_signal_pending(tsk))
> + return false;
> +
> + smp_mb(); /* Guard the read below. */
> + if (get_user_nosleep(prev_state, &ut_worker->state))
> + goto die;
> + if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
> + u32 tid;
> +
> + if (prev_state & UMCG_TF_LOCKED)
> + return true; /* Wakeup: wait but don't enqueue. */
> +
> + smp_mb(); /* Order getting state and getting server_tid */
> +
> + if (get_user_nosleep(tid, &ut_worker->next_tid))
> + goto die;
> + if (prev_state & UMCG_TF_PREEMPTED) {
> + if (!tid)
> + goto die; /* PREEMPTED workers must have a server. */
> +
> + /* Always enqueue preempted workers. */
> + if (!mark_server_running(tid))
> + goto die;
> +
> + *server_tid = tid;
> + } else if (tid)
> + return false; /* pass-through: RUNNING with a server. */
> +
> + /* If !PREEMPTED, the worker gets here via UMCG_WAIT_WAKE_ONLY */
> + } else if (unlikely((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
> + (prev_state & UMCG_TF_LOCKED)))
> + return false; /* The worker prepares to sleep or to unregister. */
> +
> + if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE)
> + return true; /* the worker called umcg_wait(); don't enqueue */
> +
> + next_state = prev_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_IDLE;
> +
> + if (prev_state != next_state)
> + ret = cmpxchg_user_32(&ut_worker->state, &prev_state, next_state);
> + if (ret)
> + goto die;
> +
> + if (!enqueue_idle_worker(ut_worker))
> + goto die;
> +
> + smp_mb(); /* Order enqueuing the worker with getting the server. */
> + if (!(*server_tid) && !get_idle_server(ut_worker, server_tid))
> + goto die;
> +
> + return true;
> +
> +die:
> + pr_warn("umcg_wq_worker_running: killing task %d\n", current->pid);
> + force_sig(SIGKILL);
> + return false;
> +}
> +
> +void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> + might_sleep();
> +
> + /* Avoid recursion by removing PF_UMCG_WORKER */
> + current->flags &= ~PF_UMCG_WORKER;
> +
> + do {
> + bool should_wait;
> + u32 server_tid;
> +
> + should_wait = process_waking_worker(tsk, &server_tid);
> +
> + if (!should_wait)
> + break;
> +
> + if (server_tid)
> + umcg_do_context_switch(server_tid, 0);
> + else
> + umcg_idle_loop(0);

I think a pretty nasty aspect here is that if we're deep in some
kernel code that is holding some mutex, and we have to briefly
schedule() for a kmalloc() call or something like that, and userspace
does something stupid like suddenly setting our ->state to
UMCG_TASK_IDLE, then we'll go into umcg_idle_loop() and potentially
sleep forever, even though we might be holding some important mutex
that half the system needs to function, and then the whole machine
deadlocks or something like that.

An even more nasty situation would be if we're holding some kernel
mutex that the thread which UMCG-preempted us is holding, so when we
get scheduled we'll reschedule over to that other thread, but when
that thread gets scheduled it notices that the mutex is still held, so
then it'll reschedule back over to us, and then we'll reschedule back
over... that'd be even worse than Priority Inversion, where at least
*some* work gets done, even if it's not the most high-priority work;
we'd just get stuck in an unproductive livelock.


I think umcg_idle_loop() should never be called from scheduler
callbacks (meaning umcg_wq_worker_running()), only from UMCG syscalls.
There also shouldn't be any mechanism in a scheduler callback that
would allow userspace to prevent the task from executing for a long
amount of time. For comparison, when you use sched_setscheduler() to
set the priority of a task to SCHED_IDLE, the task will still run very
infrequently; see
<https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html#scheduling-policies>:
"SCHED_IDLE: This is even weaker than nice 19, but its not a true idle
timer scheduler in order to avoid to get into priority inversion
problems which would deadlock the machine."

Maybe UMCG_TF_PREEMPTED should always only take effect once we're out
of whatever syscall or pagefault handler we were in?

(A more complicated/messy/ugly compromise would be to add a field to
task_struct that keeps track of how often we have switched to another
thread in the scheduler callback because someone set UMCG_TF_PREEMPTED
on us, and reset that counter when we've reached a safe point where
we're sure we can't be holding any locks (e.g. when returning to
userspace?); and if the counter gets too big, temporarily ignore
UMCG_TF_PREEMPTED until we've reached a safe point. But I'm not a
scheduler expert, I don't know how much sense that makes in the bigger
picture. I guess it would only make sense if the preempting task is
likely to finish its work before the scheduler gets back to us?)

(Or maybe yet another approach would be to temporarily drop the
current thread to SCHED_IDLE somehow while it is preempted, but then
only explicitly reschedule once, and then let the kernel figure out
how to deal with the locking issues just like it would for a normal
SCHED_IDLE thread - in other words, by accepting that there will
potentially be a priority inversion situation for several seconds
before the idle thread drops its lock... Probably also not ideal...)


> + } while (true);
> +
> + current->flags |= PF_UMCG_WORKER;
> +}

2021-09-09 02:37:01

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Thu, Sep 9, 2021 at 1:38 AM Jann Horn <[email protected]> wrote:
> On Wed, Sep 8, 2021 at 8:49 PM Peter Oskolkov <[email protected]> wrote:
> > Add helper functions to work atomically with userspace 32/64 bit values -
> > there are some .*futex.* named helpers, but they are not exactly
> > what is needed for UMCG; I haven't found what else I could use, so I
> > rolled these.
[...]
> You of course won't be able to simply traverse userspace pointers in
> such a situation, only access the specific userspace object that
> you've prepared beforehand, but luckily it looks like:
>
> * idle_server_tid_ptr is only accessed in get_idle_server()
> -> which is used from process_waking_worker()
> -> which runs in sleepable context
> * idle_workers_ptr is accessed from:
> -> enqueue_idle_worker
> -> which is also used from process_waking_worker()

Ah, I guess I got that wrong: process_waking_worker() is sleepable,
but it might be holding the mmap lock, so it can't fault, right? Which
means this would actually be problematic...

2021-09-09 19:09:21

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Wed, Sep 8, 2021 at 4:39 PM Jann Horn <[email protected]> wrote:

Thanks a lot for the reviews, Jann!

I understand how to address most of your comments. However, one issue
I'm not sure what to do about:

[...]

> If this function is not allowed to sleep, as the comment says...

[...]

> ... then I'm pretty sure you can't call fix_pagefault() here, which
> acquires the mmap semaphore (which may involve sleeping) and then goes
> through the pagefault handling path (which can also sleep for various
> reasons, like allocating memory for pagetables, loading pages from
> disk / NFS / FUSE, and so on).

<quote from peterz@ from
https://lore.kernel.org/lkml/[email protected]/>:
So a PF_UMCG_WORKER would be added to sched_submit_work()'s PF_*_WORKER
path to capture these tasks blocking. The umcg_sleeping() hook added
there would:

put_user(BLOCKED, umcg_task->umcg_status);
...
</quote>

Which is basically what I am doing here: in sched_submit_work() I need
to read/write to userspace; and we cannot sleep in
sched_submit_work(), I believe.

If you are right that it is impossible to deal with pagefaults from
within non-sleepable contexts, I see two options:

Option 1: as you suggest, pin pages holding struct umcg_task in sys_umcg_ctl;

or

Option 2: add more umcg-related kernel state to task_struct so that
reading/writing to userspace is not necessary in sched_submit_work().

The first option sounds much better from the code simplicity point of
view, but I'm not sure if it is a viable approach, i.e. I'm afraid
we'll get a hard NACK here, as a non-privileged process will be able
to force the kernel to pin a page per task/thread. We may get around
it by first pinning a limited number of pages, then having the
userspace allocate structs umcg_task on those pages, so that a pinned
page would cover more than a single task/thread. And have a sysctl
that limits the number of pinned pages per MM.

Peter Z., could you, please, comment here? Do you think pinning pages
to hold structs umcg_task is acceptable?

[...]

2021-09-09 21:32:15

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Thu, Sep 9, 2021 at 9:07 PM Peter Oskolkov <[email protected]> wrote:
> On Wed, Sep 8, 2021 at 4:39 PM Jann Horn <[email protected]> wrote:
>
> Thanks a lot for the reviews, Jann!
>
> I understand how to address most of your comments. However, one issue
> I'm not sure what to do about:
>
> [...]
>
> > If this function is not allowed to sleep, as the comment says...
>
> [...]
>
> > ... then I'm pretty sure you can't call fix_pagefault() here, which
> > acquires the mmap semaphore (which may involve sleeping) and then goes
> > through the pagefault handling path (which can also sleep for various
> > reasons, like allocating memory for pagetables, loading pages from
> > disk / NFS / FUSE, and so on).
>
> <quote from peterz@ from
> https://lore.kernel.org/lkml/[email protected]/>:
> So a PF_UMCG_WORKER would be added to sched_submit_work()'s PF_*_WORKER
> path to capture these tasks blocking. The umcg_sleeping() hook added
> there would:
>
> put_user(BLOCKED, umcg_task->umcg_status);
> ...
> </quote>
>
> Which is basically what I am doing here: in sched_submit_work() I need
> to read/write to userspace; and we cannot sleep in
> sched_submit_work(), I believe.
>
> If you are right that it is impossible to deal with pagefaults from
> within non-sleepable contexts, I see two options:
>
> Option 1: as you suggest, pin pages holding struct umcg_task in sys_umcg_ctl;

FWIW, there is a variant on this that might also be an option:

You can create a new memory mapping from kernel code and stuff pages
into it that were originally allocated as normal kernel pages. This is
done in a bunch of places, e.g.:

This has the advantage that it avoids pinning random pages that were
originally allocated from ZONE_MOVABLE blocks. (Or pinning hugepages,
or something like that.)
The downsides are that it reduces userspace's freedom to place the
UAPI structs wherever it wants (so userspace e.g. probably can't
directly put the struct in thread-local storage, instead it has to
store a pointer to the struct), and that you need to write a bunch of
code to create the mapping and allocate slots in these pages for
userspace threads.

Maybe UMCG_CTL_REGISTER could do something vaguely like this (utterly
untested, I just scribbled this down in my mail client)?


#define UMCG_TASKS_PER_PAGE (sizeof(struct umcg_task) / PAGE_SIZE)
struct umcg_page {
struct page *page;
struct umcg_task *__user user_base;
struct vm_special_mapping special_mapping;
DECLARE_BITMAP(inuse_bits, UMCG_TASKS_PER_PAGE);
};
struct mm_umcg {
struct umcg_page *slot_pages;
size_t slot_page_count;
size_t used_slots;
};
struct mm_struct {
...
#ifdef CONFIG_UMCG
struct mm_umcg umcg;
#endif
...
};

static int deny_mremap(struct vm_area_struct *new_vma)
{
return -EINVAL;
}

BUILD_BUG_ON(UMCG_TASKS_PER_PAGE < 1);
struct mm_struct *mm = current->mm;
size_t page_idx, free_idx_in_page;
if (!mmap_lock_killable(mm))
return -EINTR;
if (mm->umcg.used_slots == mm->umcg.slot_page_count * UMCG_TASKS_PER_PAGE) {
unsigned long addr;
struct page *new_page;
struct umcg_page *slot_pages_new;

addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
if (IS_ERR_VALUE(addr)) {
ret = addr;
goto unlock;
}

slot_pages_new = krealloc_array(mm->umcg.slot_page_count + 1,
sizeof(struct umcg_page), GFP_KERNEL);
if (!slot_pages_new) {
ret = -ENOMEM;
goto unlock;
}
mm->umcg.slot_pages = slot_pages_new;
new_page = alloc_page(GFP_USER | __GFP_ACCOUNT);
if (!new_page) {
ret = -ENOMEM;
goto unlock;
}
mm->umcg.slot_pages[mm->umcg.slot_page_count].page = new_page;
mm->umcg.slot_pages[mm->umcg.slot_page_count].user_base = addr;
mm->umcg.slot_pages[mm->umcg.slot_page_count].special_mapping.name = "[umcg]";
mm->umcg.slot_pages[mm->umcg.slot_page_count].special_mapping.pages
= &mm->umcg.slot_pages[mm->umcg.slot_page_count].page;
mm->umcg.slot_pages[mm->umcg.slot_page_count].special_mapping.mremap
= deny_mremap;
if (IS_ERR(_install_special_mapping(mm, addr, PAGE_SIZE,
VM_READ|VM_WRITE|VM_MAYREAD|VM_MAYWRITE|VM_DONTCOPY,
&new_page->special_mapping)))
... free new_page and return error ...
mm->umcg.slot_page_count++;
}

for (page_idx = 0; 1; page_idx++) {
if (page_idx == mm->umcg->slot_page_count)
... WARN() and bail out, shouldn't happen...
free_idx_in_page =
find_first_zero_bit(mm->umcg->slot_pages[page_idx],
UMCG_TASKS_PER_PAGE);
if (free_idx_in_page != UMCG_TASKS_PER_PAGE)
break;
}
set_bit(free_idx_in_page, mm->umcg->slot_pages[page_idx]);
mm->umcg.used_slots++;
current->umcg_user_mapping = mm->umcg->slot_pages[page_idx].user_base
+ free_idx_in_page;
current->umcg_kernel_mapping = (struct umcg_task
*)page_to_virt(mm->umcg->slot_pages[page_idx].page) +
free_idx_in_page;
current->umcg_index = page_idx * UMCG_TASKS_PER_PAGE + free_idx_in_page;
unlock:
mmap_unlock(mm);

... and then when a task exits, you'd pretty much just take the
mmap_lock and do clear_bit(current->umcg_index % UMCG_TASKS_PER_PAGE,
current->mm->umcg.slot_pages[current->umcg_index /
UMCG_TASKS_PER_PAGE].inuse_bits)?

The pages would stay allocated as long as the process is running, but
given how fragmented those pages are going to get, that's probably
inevitable. And when the process exits (more precisely, when the
mm_struct is torn down), you could free all this stuff?

Note that what I'm suggesting here is a bit unusual - normally only
the vDSO is a "special mapping", other APIs tend to use mappings that
are backed by files. But I think we probably don't want to have a file
involved here...

If you decide to go this route, you should probably CC
[email protected] (for general memory management) and Andy Lutomirski
(who has tinkered around in vDSO-related code a lot).

> or
>
> Option 2: add more umcg-related kernel state to task_struct so that
> reading/writing to userspace is not necessary in sched_submit_work().
>
> The first option sounds much better from the code simplicity point of
> view, but I'm not sure if it is a viable approach, i.e. I'm afraid
> we'll get a hard NACK here, as a non-privileged process will be able
> to force the kernel to pin a page per task/thread.

To clarify: It's entirely normal that userspace processes can force
the kernel to hold on to some amounts of memory that can't be paged
out - consider e.g. pagetables and kernel objects referenced by file
descriptors. So an API that pins limited amounts of memory that are
also mapped in userspace isn't inherently special. But pinning pages
that were originally allocated as normal userspace memory can be more
problematic because that memory might be hugepages, or file pages, or
it might prevent the hugepaged from being able to defragment memory
because the pinned page was allocated in ZONE_MOVABLE.


> We may get around
> it by first pinning a limited number of pages, then having the
> userspace allocate structs umcg_task on those pages, so that a pinned
> page would cover more than a single task/thread. And have a sysctl
> that limits the number of pinned pages per MM.

I think that you wouldn't necessarily need a sysctl for that if the
kernel can enforce that you don't have more pages allocated than you
need for the maximum number of threads that have ever been running
under the process, and you also use __GFP_ACCOUNT so that cgroups can
correctly attribute the memory usage.

> Peter Z., could you, please, comment here? Do you think pinning pages
> to hold structs umcg_task is acceptable?

2021-09-09 22:17:25

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Thu, Sep 9, 2021 at 2:21 PM Jann Horn <[email protected]> wrote:
>

[...]

> >
> > Option 1: as you suggest, pin pages holding struct umcg_task in sys_umcg_ctl;
>
> FWIW, there is a variant on this that might also be an option:
>
> You can create a new memory mapping from kernel code and stuff pages
> into it that were originally allocated as normal kernel pages. This is
> done in a bunch of places, e.g.:
>
> This has the advantage that it avoids pinning random pages that were
> originally allocated from ZONE_MOVABLE blocks. (Or pinning hugepages,
> or something like that.)
> The downsides are that it reduces userspace's freedom to place the
> UAPI structs wherever it wants (so userspace e.g. probably can't
> directly put the struct in thread-local storage, instead it has to
> store a pointer to the struct), and that you need to write a bunch of
> code to create the mapping and allocate slots in these pages for
> userspace threads.

Thanks again, Jann! Why do you think using custom mapping like this is
preferable to doing just kzalloc(size, GFP_USER), or maybe
alloc_page(GFP_USER)?

The documentation here
https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html
says:

"GFP_USER means that the allocated memory is not movable and it must
be directly accessible by the kernel", which sounds exactly what we
need here.

[...]

2021-09-09 23:16:19

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Fri, Sep 10, 2021 at 12:10 AM Peter Oskolkov <[email protected]> wrote:
> On Thu, Sep 9, 2021 at 2:21 PM Jann Horn <[email protected]> wrote:
> > > Option 1: as you suggest, pin pages holding struct umcg_task in sys_umcg_ctl;
> >
> > FWIW, there is a variant on this that might also be an option:
> >
> > You can create a new memory mapping from kernel code and stuff pages
> > into it that were originally allocated as normal kernel pages. This is
> > done in a bunch of places, e.g.:
> >
> > This has the advantage that it avoids pinning random pages that were
> > originally allocated from ZONE_MOVABLE blocks. (Or pinning hugepages,
> > or something like that.)
> > The downsides are that it reduces userspace's freedom to place the
> > UAPI structs wherever it wants (so userspace e.g. probably can't
> > directly put the struct in thread-local storage, instead it has to
> > store a pointer to the struct), and that you need to write a bunch of
> > code to create the mapping and allocate slots in these pages for
> > userspace threads.
>
> Thanks again, Jann! Why do you think using custom mapping like this is
> preferable to doing just kzalloc(size, GFP_USER), or maybe
> alloc_page(GFP_USER)?

kzalloc() / alloc_page() just give you kernel memory allocations; but
if you want userspace to be able to directly read/write that memory,
you have to also map the same physical memory into the userspace
pagetables somehow (at a separate address), which requires that you
set up a VMA to tell the MM subsystem that that userspace address
range is in use, and to specify what should happen when userspace
calls memory management syscalls on it, or when pagefaults occur in
it.

Also, when allocating memory that should also be mapped into
userspace, you have to use alloc_page(); memory from kzalloc() (in
other words, slab memory) can't be mapped into userspace. (Technically
it could be mapped into userspace with PFNMAP, but doing that would be
weird.)

> The documentation here
> https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html
> says:
>
> "GFP_USER means that the allocated memory is not movable and it must
> be directly accessible by the kernel", which sounds exactly what we
> need here.

If you look at the actual definitions of GFP_KERNEL and GFP_USER:

#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)

you can see that the only difference between them is the
__GFP_HARDWALL flag, which is documented as follows:

* %__GFP_HARDWALL enforces the cpuset memory allocation policy.

So this basically just determines whether memory allocations fail if
the task's memory allocation policy says they should fail because no
memory is available on the right nodes. The choice of GFP flags
doesn't influence whether userspace ends up getting access to the
allocated memory. (On 32-bit machines, it does influence whether the
kernel can easily access the memory though: Normal userspace anonymous
memory is GFP_HIGHUSER, which includes __GFP_HIGHMEM, meaning that the
returned page doesn't have to be mapped in the kernel's linear
mapping, it can be in "high memory".)

2021-09-14 08:12:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Thu, Sep 09, 2021 at 12:06:58PM -0700, Peter Oskolkov wrote:
> On Wed, Sep 8, 2021 at 4:39 PM Jann Horn <[email protected]> wrote:
>
> Thanks a lot for the reviews, Jann!
>
> I understand how to address most of your comments. However, one issue
> I'm not sure what to do about:
>
> [...]
>
> > If this function is not allowed to sleep, as the comment says...
>
> [...]
>
> > ... then I'm pretty sure you can't call fix_pagefault() here, which
> > acquires the mmap semaphore (which may involve sleeping) and then goes
> > through the pagefault handling path (which can also sleep for various
> > reasons, like allocating memory for pagetables, loading pages from
> > disk / NFS / FUSE, and so on).
>
> <quote from peterz@ from
> https://lore.kernel.org/lkml/[email protected]/>:
> So a PF_UMCG_WORKER would be added to sched_submit_work()'s PF_*_WORKER
> path to capture these tasks blocking. The umcg_sleeping() hook added
> there would:
>
> put_user(BLOCKED, umcg_task->umcg_status);
> ...
> </quote>
>
> Which is basically what I am doing here: in sched_submit_work() I need
> to read/write to userspace; and we cannot sleep in
> sched_submit_work(), I believe.
>
> If you are right that it is impossible to deal with pagefaults from
> within non-sleepable contexts, I see two options:
>
> Option 1: as you suggest, pin pages holding struct umcg_task in sys_umcg_ctl;
>
> or
>
> Option 2: add more umcg-related kernel state to task_struct so that
> reading/writing to userspace is not necessary in sched_submit_work().

Durr.. so yeah this is a bit of a chicken and egg problem here. We need
a userspace page to notify we're blocked, but at the same time,
accessing said page can get us blocked.

And then worse, as Jann said, we cannot do this in the appropriate spot
because we could be blocking on mmap_sem, so we must not require
mmap_sem to make progress etc.. :/

Now, in reality actually taking a fault for these pages is extremely
unlikely, but if we do, there's really no option but to block and wait
for it without notification. Tought luck there.

So what we can do, is use get_user_page() on the appropriate pages
(alignment ensure the whole umcg struct must be in a single page etc..)
the moment a umcg task enters the kernel. For this we need some
SYSCALL_WORK_ENTER flag.

So normally a task would have ->umcg_page and ->umcg_server_page be
NULL, the above SYSCALL_WORK_SYSCALL_UMCG flag would get_user_page() the
self and server pages. If get_user_page() blocks, these fields would
still be NULL and sched_submit_work() would not do anything, c'est la
vie.

Once we have the pages, any actual blocking hitting sched_submit_work()
can do the updates without further blocking. It can then also put_page()
and clear the ->umcg_{,server_}page pointers, because the task_work that
will set RUNNABLE *can* suffer mmap_sem (again, unlikely, again tough
luck if it does).

The reason for put'ing the pages on blocking, is that this guarantees
the pages are only pinned for a short amount of time, and 'never' by a
blocked task. IOW, it's a proper transient pin and doesn't require extra
special care or accounting.



Also, can you *please* convert that RST crud to a text file, it's
absolutely unreadable gunk. Those documentation files should be readable
as plain text first and foremost. That whole rendering to html crap is
nonsense. Using a browser to read a test file is insane.

2021-09-14 16:31:25

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Tue, Sep 14, 2021 at 1:09 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Sep 09, 2021 at 12:06:58PM -0700, Peter Oskolkov wrote:
> > On Wed, Sep 8, 2021 at 4:39 PM Jann Horn <[email protected]> wrote:

[...]

>
> Durr.. so yeah this is a bit of a chicken and egg problem here. We need
> a userspace page to notify we're blocked, but at the same time,
> accessing said page can get us blocked.
>
> And then worse, as Jann said, we cannot do this in the appropriate spot
> because we could be blocking on mmap_sem, so we must not require
> mmap_sem to make progress etc.. :/
>
> Now, in reality actually taking a fault for these pages is extremely
> unlikely, but if we do, there's really no option but to block and wait
> for it without notification. Tought luck there.

In the version of the patchset that I'm preparing to send I've decided
to punt on the issue and just ask the userspace to deal with locking
the memory as it sees fit: mlock() is available and as far as I can
tell RLIMIT_MLOCK is decently sized by default (6MB on Ubuntu, so
locked memory can contain more than 100k of structs umcg_task if
nothing else uses it); and if it is not enough for some special case,
it can be adjusted at a higher level in the userspace. If we get a
pagefault when we access struct umcg_task in the kernel, we just kill
the task.

Does the approach seem reasonable for the initial version of the patchset?

>
> So what we can do, is use get_user_page() on the appropriate pages
> (alignment ensure the whole umcg struct must be in a single page etc..)
> the moment a umcg task enters the kernel. For this we need some
> SYSCALL_WORK_ENTER flag.
>
> So normally a task would have ->umcg_page and ->umcg_server_page be
> NULL, the above SYSCALL_WORK_SYSCALL_UMCG flag would get_user_page() the
> self and server pages. If get_user_page() blocks, these fields would
> still be NULL and sched_submit_work() would not do anything, c'est la
> vie.
>
> Once we have the pages, any actual blocking hitting sched_submit_work()
> can do the updates without further blocking. It can then also put_page()
> and clear the ->umcg_{,server_}page pointers, because the task_work that
> will set RUNNABLE *can* suffer mmap_sem (again, unlikely, again tough
> luck if it does).
>
> The reason for put'ing the pages on blocking, is that this guarantees
> the pages are only pinned for a short amount of time, and 'never' by a
> blocked task. IOW, it's a proper transient pin and doesn't require extra
> special care or accounting.

I'd prefer to defer this smart/transient pinning of pages until later
if mlock() will solve the issue at the moment.

> Also, can you *please* convert that RST crud to a text file, it's
> absolutely unreadable gunk. Those documentation files should be readable
> as plain text first and foremost. That whole rendering to html crap is
> nonsense. Using a browser to read a test file is insane.

Will do. Maybe we can have both an RST and a TXT version of the
document? I think most files in /Documentation are RST...

Thanks,
Peter

2021-09-14 16:37:38

by Tao Zhou

[permalink] [raw]
Subject: Re: [PATCH 4/4 v0.5] sched/umcg: add Documentation/userspace-api/umcg.rst

Hi Peter,

On Wed, Sep 08, 2021 at 11:49:05AM -0700, Peter Oskolkov wrote:
> Document User Managed Concurrency Groups syscalls, data structures,
> state transitions, etc.
>
> Signed-off-by: Peter Oskolkov <[email protected]>
> ---
> Documentation/userspace-api/umcg.rst | 546 +++++++++++++++++++++++++++
> 1 file changed, 546 insertions(+)
> create mode 100644 Documentation/userspace-api/umcg.rst
>
> diff --git a/Documentation/userspace-api/umcg.rst b/Documentation/userspace-api/umcg.rst
> new file mode 100644
> index 000000000000..b3e84cef212c
> --- /dev/null
> +++ b/Documentation/userspace-api/umcg.rst
> @@ -0,0 +1,546 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================
> +UMCG Userspace API
> +=====================================
> +
> +User Managed Concurrency Groups (UMCG) is an M:N threading
> +subsystem/toolkit that lets user space application developers
> +implement in-process user space schedulers.
> +
> +.. contents:: :local:
> +
> +Why? Heterogeneous in-process workloads
> +=======================================
> +Linux kernel's CFS scheduler is designed for the "common" use case,
> +with efficiency/throughput in mind. Work isolation and workloads of
> +different "urgency" are addressed by tools such as cgroups, CPU
> +affinity, priorities, etc., which are difficult or impossible to
> +efficiently use in-process.
> +
> +For example, a single DBMS process may receive tens of thousands
> +requests per second; some of these requests may have strong response
> +latency requirements as they serve live user requests (e.g. login
> +authentication); some of these requests may not care much about
> +latency but must be served within a certain time period (e.g. an
> +hourly aggregate usage report); some of these requests are to be
> +served only on a best-effort basis and can be NACKed under high load
> +(e.g. an exploratory research/hypothesis testing workload).
> +
> +Beyond different work item latency/throughput requirements as outlined
> +above, the DBMS may need to provide certain guarantees to different
> +users; for example, user A may "reserve" 1 CPU for their
> +high-priority/low latency requests, 2 CPUs for mid-level throughput
> +workloads, and be allowed to send as many best-effort requests as
> +possible, which may or may not be served, depending on the DBMS load.
> +Besides, the best-effort work, started when the load was low, may need
> +to be delayed if suddenly a large amount of higher-priority work
> +arrives. With hundreds or thousands of users like this, it is very
> +difficult to guarantee the application's responsiveness using standard
> +Linux tools while maintaining high CPU utilization.
> +
> +Gaming is another use case: some in-process work must be completed
> +before a certain deadline dictated by frame rendering schedule, while
> +other work items can be delayed; some work may need to be
> +cancelled/discarded because the deadline has passed; etc.
> +
> +User Managed Concurrency Groups is an M:N threading toolkit that
> +allows constructing user space schedulers designed to efficiently
> +manage heterogeneous in-process workloads described above while
> +maintaining high CPU utilization (95%+).
> +
> +Requirements
> +============
> +One relatively established way to design high-efficiency, low-latency
> +systems is to split all work into small on-cpu work items, with
> +asynchronous I/O and continuations, all executed on a thread pool with
> +the number of threads not exceeding the number of available CPUs.
> +Although this approach works, it is quite difficult to develop and
> +maintain such a system, as, for example, small continuations are
> +difficult to piece together when debugging. Besides, such asynchronous
> +callback-based systems tend to be somewhat cache-inefficient, as
> +continuations can get scheduled on any CPU regardless of cache
> +locality.
> +
> +M:N threading and cooperative user space scheduling enables controlled
> +CPU usage (minimal OS preemption), synchronous coding style, and
> +better cache locality.
> +
> +Specifically:
> +
> +- a variable/fluctuating number M of "application" threads should be
> + "scheduled over" a relatively fixed number N of "kernel" threads,
> + where N is less than or equal to the number of CPUs available;
> +- only those application threads that are attached to kernel threads
> + are scheduled "on CPU";
> +- application threads should be able to cooperatively yield to each other;
> +- when an application thread blocks in kernel (e.g. in I/O), this
> + becomes a scheduling event ("block") that the userspace scheduler
> + should be able to efficiently detect, and reassign a waiting
> + application thread to the freeded "kernel" thread;
> +- when a blocked application thread wakes (e.g. its I/O operation
> + completes), this even ("wake") should also be detectable by the
> + userspace scheduler, which should be able to either quickly dispatch
> + the newly woken thread to an idle "kernel" thread or, if all "kernel"
> + threads are busy, put it in the waiting queue;
> +- in addition to the above, it would be extremely useful for a
> + separate in-process "watchdog" facility to be able to monitor the
> + state of each of the M+N threads, and to intervene in case of runaway
> + workloads (interrupt/preempt).
> +
> +
> +UMCG kernel API
> +===============
> +Based on the requrements above, UMCG *kernel* API is build around
> +the following ideas:
> +
> +- *UMCG server*: a task/thread representing "kernel threads", or CPUs
> + from the requirements above;
> +- *UMCG worker*: a task/thread representing "application threads", to
> + be scheduled over servers;
> +- UMCG *task state*: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG
> + task (a server or a worker) can be in;
> +- UMCG task *state flag*: LOCKED, PREEMPTED: additional state flags
> + that can be ORed with the task state to communicate additional information
> + to the kernel;
> +- ``struct umcg_task``: a per-task userspace set of data fields, usually
> + residing in the TLS, that fully reflects the current task's UMCG
> + state and controls the way the kernel manages the task;
> +- ``sys_umcg_ctl()``: a syscall used to register the current task/thread
> + as a server or a worker, or to unregister a UMCG task;
> +- ``sys_umcg_wait()``: a syscall used to put the current task to
> + sleep and/or wake another task, pontentially context-switching
> + between the two tasks on-CPU synchronously.
> +
> +
> +Servers
> +=======
> +
> +When a task/thread is registered as a server, it is in RUNNING
> +state and behaves like any other normal task/thread. In addition,
> +servers can interact with other UMCG tasks via sys_umcg_wait():
> +
> +- servers can voluntarily suspend their execution (wait), becoming IDLE;
> +- servers can wake other IDLE servers;
> +- servers can context-switch between each other.
> +
> +Note that if a server blocks in the kernel *not* via sys_umcg_wait(),
> +it still retains its RUNNING state.
> +
> +Also note that servers can be used for fast on-CPU context switching
> +across process boundaries; server-worker interactions assume they
> +belong to the same mm.
> +
> +See the next section on how servers interact with workers.
> +
> +Workers
> +=======
> +
> +A worker cannot be RUNNING without having a server associated
> +with it, so when a task is first registered as a worker, it enters
> +the IDLE state.
> +
> +- a worker becomes RUNNING when a server calls sys_umcg_wait to
> + context-switch into it; the server goes IDLE, and the worker becomes
> + RUNNING in its place;
> +- when a running worker blocks in the kernel, it becomes BLOCKED,
> + its associated server becomes RUNNING and the server's
> + sys_umcg_wait() call from the bullet above returns; this transition
> + is sometimes called "block detection";
> +- when the syscall on which a BLOCKED worker completes, the worker
> + becomes IDLE and is added to the list of idle workers; if there
> + is an idle server waiting, the kernel wakes it; this transition
> + is sometimes called "wake detection";
> +- running workers can voluntarily suspend their execution (wait),
> + becoming IDLE; their associated servers are woken;
> +- a RUNNING worker can context-switch with an IDLE worker; the server
> + of the switched-out worker is transferred to the switched-in worker;
> +- any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
> + this is a server running the worker as described in the first bullet
> + in this list, the worker remain IDLE but is added to the idle workers
> + list; this "wake" operation exists for completeness, to make sure
> + wait/wake/context-switch operations are available for all UMCG tasks;
> +- the userspace can preempt a RUNNING worker by marking it
> + ``RUNNING|PREEMPTED`` and sending a signal to it; the userspace should
> + have installed a NOP signal handler for the signal; the kernel will
> + then transition the worker into ``IDLE|PREEMPTED`` state and wake
> + its associated server.
> +
> +UMCG task states
> +================
> +
> +Important: all state transitions described below involve at least
> +two steps: the change of the state field in ``struct umcg_task``,
> +for example ``RUNNING`` to ``IDLE``, and the corresponding change in
> +``struct task_struct`` state, for example a transition between the task
> +running on CPU and being descheduled and removed from the kernel runqueue.
> +The key principle of UMCG API design is that the party initiating
> +the state transition modifies the state variable.
> +
> +For example, a task going ``IDLE`` first changes its state from ``RUNNING``
> +to ``IDLE`` in the userpace and then calls ``sys_umcg_wait()``, which
> +completes the transition.
> +
> +Note on documentation: in ``include/uapi/linux/umcg.h``, task states
> +have the form ``UMCG_TASK_RUNNING``, ``UMCG_TASK_BLOCKED``, etc. In
> +this document these are usually referred to simply ``RUNNING`` and
> +``BLOCKED``, unless it creates ambiguity. Task state flags, e.g.
> +``UMCG_TF_PREEMPTED``, are treated similarly.
> +
> +UMCG task states reflect the view from the userspace, rather than from
> +the kernel. There are three fundamental task states:
> +
> +- ``RUNNING``: indicates that the task is schedulable by the kernel; applies
> + to both servers and workers;
> +- ``IDLE``: indicates that the task is *not* schedulable by the kernel
> + (see ``umcg_idle_loop()`` in ``kernel/sched/umcg.c``); applies to
> + both servers and workers;
> +- ``BLOCKED``: indicates that the worker is blocked in the kernel;
> + does not apply to servers.
> +
> +In addition to the three states above, two state flags help with
> +state transitions:
> +
> +- ``LOCKED``: the userspace is preparing the worker for a state transition
> + and "locks" the worker until the worker is ready for the kernel to
> + act on the state transition; used similarly to preempt_disable or
> + irq_disable in the kernel; applies only to workers in ``RUNNING`` or
> + ``IDLE`` state; ``RUNNING|LOCKED`` means "this worker is about to
> + become ``RUNNING``, while ``IDLE|LOCKED`` means "this worker is about
> + to become ``IDLE`` or unregister;
> +- ``PREEMPTED``: the userspace indicates it wants the worker to be
> + preempted; there are no situations when both ``LOCKED`` and ``PREEMPTED``
> + flags are set at the same time.
> +
> +struct umcg_task
> +================
> +
> +From ``include/uapi/linux/umcg.h``:
> +
> +.. code-block:: C
> +
> + struct umcg_task {
> + uint32_t state; /* r/w */
> + uint32_t next_tid; /* r */
> + uint64_t idle_workers_ptr; /* r/w */
> + uint64_t idle_server_tid_ptr; /* r* */
> + };
> +
> +Each UMCG task is identified by ``struct umcg_task``, which is provided
> +to the kernel when the task is registered via ``sys_umcg_ctl()``.
> +
> +- ``uint32_t state``: the current state of the task this struct identifies,
> + as described in the previous section. Readable/writable by both the kernel
> + and the userspace.
> +
> + - bits 0 - 7: task state (RUNNING, IDLE, BLOCKED);
> + - bits 8 - 15: state flags (LOCKED, PREEMPTED);
> + - bits 16 - 23: reserved; must be zeroes;
> + - bits 24 - 31: for userspace use.
> +
> +- ``uint32_t next_tid``: contains the TID of the task to context-switch-into
> + in ``sys_umcg_wait()``; can be zero; writable by the userspace, readable
> + by the kernel; if this is a RUNNING worker, this field contains
> + the TID of the server that should be woken when this worker blocks;
> + see ``sys_umcg_wait()`` for more details;
> +
> +- ``uint64_t idle_workers_ptr``: this field forms a single-linked list
> + of idle workers: all RUNNING workers have this field set to point
> + to the head of the list (a pointer variable in the userspace).
> +
> + When a worker's blocking operation in the kernel completes, the kernel
> + changes the worker's state from ``BLOCKED`` to ``IDLE`` and adds the worker
> + to the top of the list of idle workers using this logic:
> +
> + .. code-block:: C
> +
> + /* kernel side */
> + /**
> + * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
> + *
> + * Returns true on success, false on a fatal failure.
> + */
> + static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
> + {
> + u64 __user *node = &ut_worker->idle_workers_ptr;
> + u64 __user *head_ptr;
> + u64 first = (u64)node;
> + u64 head;
> +
> + if (get_user_nosleep(head, node) || !head)
> + return false;
> +
> + head_ptr = (u64 __user *)head;
> +
> + if (put_user_nosleep(1ULL, node))
> + return false;
> +
> + if (xchg_user_64(head_ptr, &first))
> + return false;
> +
> + if (put_user_nosleep(first, node))
> + return false;
> +
> + return true;
> + }
> +
> +
> + In the userspace the list is cleared atomically using this logic:
> +
> + .. code-block:: C
> +
> + /* userspace side */
> + uint64_t *idle_workers = (uint64_t *)*head;
> +
> + atomic_exchange(&idle_workers, NULL);
> +
> + The userspace re-points workers' idle_workers_ptr to the list head
> + variable before the worker is allowed to become RUNNING again.
> +
> +- ``uint64_t idle_server_tid_ptr``: points to a pointer variable in the
> + userspace that points to an idle server, i.e. a server in IDLE state waiting
> + in sys_umcg_wait(); read-only; workers must have this field set; not used
> + in servers.
> +
> + When a worker's blocking operation in the kernel completes, the kernel
> + changes the worker's state from ``BLOCKED`` to ``IDLE``, adds the worker
> + to the list of idle workers, and checks whether
> + ``*idle_server_tid_ptr`` is not zero. If not, the kernel tries to cmpxchg()
> + it with zero; if cmpxchg() succeeds, the kernel will then wake the server.
> + See `State transitions`_ below for more details.
> +
> +sys_umcg_ctl()
> +==============
> +
> +``int sys_umcg_ctl(uint32_t flags, struct umcg_task *self)`` is used to
> +register or unregister the current task as a worker or server. Flags
> +can be one of the following:
> +
> +- ``UMCG_CTL_REGISTER``: register a server;
> +- ``UMCG_CTL_REGISTER | UMCG_CTL_WORKER``: register a worker;
> +- ``UMCG_CTL_UNREGISTER``: unregister the current server or worker.
> +
> +When registering a task, ``self`` must point to ``struct umcg_task``
> +describing this server or worker; the pointer must remain valid until
> +the task is unregistered.
> +
> +When registering a server, ``self->state`` must be ``RUNNING``; all other
> +fields in ``self`` must be zeroes.
> +
> +When registering a worker, ``self->state`` must be ``BLOCKED``;
> +``self->idle_server_tid_ptr`` and ``self->idle_workers_ptr`` must be
> +valid pointers as described in `struct umcg_task`_; ``self->next_tid`` must
> +be zero.
> +
> +When unregistering a task, ``self`` must be ``NULL``.
> +
> +sys_umcg_wait()
> +===============
> +
> +``int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout)`` operates
> +on registered UMCG servers and workers: ``struct umcg_task *self`` provided
> +to ``sys_umcg_ctl()`` when registering the current task is consulted
> +in addition to ``flags`` and ``abs_timeout`` parameters.
> +
> +The function can be used to perform one of the three operations:
> +
> +- wait: if ``self->next_tid`` is zero, ``sys_umcg_wait()`` puts the current
> + server or worker to sleep;
> +- wake: if ``self->next_tid`` is not zero, and ``flags & UMCG_WAIT_WAKE_ONLY``,
> + the task identified by ``next_tid`` is woken (must be in ``IDLE`` state);
> +- context switch: if ``self->next_tid`` is not zero, and
> + ``!(flags & UMCG_WAIT_WAKE_ONLY)``, the current task is put to sleep and
> + the next task is woken, synchronously switching between the tasks on the
> + current CPU on the fast path.
> +
> +Flags can be zero or a combination of the following values:
> +
> +- ``UMCG_WAIT_WAKE_ONLY``: wake the next task, don't put the current task
> + to sleep;
> +- ``UMCG_WAIT_WF_CURRENT_CPU``: wake the next task on the curent CPU;
> + this flag has an effect only if ``UMCG_WAIT_WAKE_ONLY`` is set: context
> + switching is always attempted to happen on the curent CPU.
> +
> +The section below provides more details on how servers and workers interact
> +via ``sys_umcg_wait()``, during worker block/wake events, and during
> +worker preemption.
> +
> +State transitions
> +=================
> +
> +As mentioned above, the key principle of UMCG state transitions is that
> +**the party initiating the state transition modifies the state of affected
> +tasks**.
> +
> +Below, "``TASK:STATE``" indicates a task T, where T can be either W for
> +worker or S for server, in state S, where S can be one of the three states,
> +potentially ORed with a state flag. Each individual state transition
> +is an atomic operation (cmpxchg) unless indicated otherwise. Also note
> +that **the order of state transitions is important and is part of the
> +contract between the userspace and the kernel. The kernel is free
> +to kill the task (SIGSEGV) if the contract is broken.**
> +
> +Some worker state transitions below include adding ``LOCKED`` flag to
> +worker state. This is done to indicate to the kernel that the worker
> +is transitioning state and should not participate in the block/wake
> +detection routines, which can happen due to interrupts/pagefaults/signals.
> +
> +``IDLE|LOCKED`` means that a running worker is preparing to sleep, so
> +interrupts should not lead to server wakeup; ``RUNNING|LOCKED`` means that
> +an idle worker is going to be "scheduled to run", but may not yet have its
> +server set up properly.
> +
> +Key state transitions:
> +
> +- server to worker context switch ("schedule a worker to run"):
> + ``S:RUNNING+W:IDLE => S:IDLE+W:RUNNING``:
> +
> + - in the userspace, in the context of the server S running:
> +
> + - ``S:RUNNING => S:IDLE`` (mark self as idle)
> + - ``W:IDLE => W:RUNNING|LOCKED`` (mark the worker as running)
> + - ``W.next_tid := S.tid; S.next_tid := W.tid``
> + (link the server with the worker)
> + - ``W:RUNNING|LOCKED => W:RUNNING`` (unlock the worker)
> + - ``S: sys_umcg_wait()`` (make the syscall)
> +
> + - the kernel context switches from the server to the worker; the server
> + sleeps until it becomes ``RUNNING`` during one of the transitions below;
> +
> +- worker to server context switch (worker "yields"):
> + ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE``:
> +
> + - in the userspace, in the context of the worker W running (note that
> + a running worker has its ``next_tid`` set to point to its server):
> +
> + - ``W:RUNNING => W:IDLE|LOCKED`` (mark self as idle)
> + - ``S:IDLE => S:RUNNING`` (mark the server as running)
> + - ``W: sys_umcg_wait()`` (make the syscall)
> +
> + - the kernel removes the ``LOCKED`` flag from the worker's state and
> + context switches from the worker to the server; the worker
> + sleeps until it becomes ``RUNNING``;
> +
> +- worker to worker context switch:
> + ``W1:RUNNING+W2:IDLE => W1:IDLE+W2:RUNNING``:
> +
> + - in the userspace, in the context of W1 running:
> +
> + - ``W2:IDLE => W2:RUNNING|LOCKED`` (mark W2 as running)
> + - ``W1:RUNNING => W1:IDLE|LOCKED`` (mark self as idle)
> + - ``W2.next_tid := W1.next_tid; S.next_tid := W2.next_tid``
^^^^^^^^
W2.next_tid is a server. S.next_tid should be a worker; say:

S.next_tid := W2.tid

Not sure.

> + (transfer the server W1 => W2)
> + - ``W1:next_tid := W2.tid`` (indicate that W1 should
> + context-switch into W2)
> + - ``W2:RUNNING|LOCKED => W2:RUNNING`` (unlock W2)
> + - ``W1: sys_umcg_wait()`` (make the syscall)
> +
> + - same as above, the kernel removes the ``LOCKED`` flag from the W1's state
> + and context switches to next_tid;
> +
> +- worker wakeup: ``W:IDLE => W:RUNNING``:
> +
> + - in the userspace, a server S can wake a worker W without "running" it:
> +
> + - ``S:next_tid :=W.tid``
> + - ``W:next_tid := 0``
> + - ``W:IDLE => W:RUNNING``
> + - ``sys_umcg_wait(UMCG_WAIT_WAKE_ONLY)`` (make the syscall)
> +
> + - the kernel will wake the worker W; as the worker does not have a server
> + assigned, "wake detection" will happen, the worker will be immediately
> + marked as ``IDLE`` and added to idle workers list; an idle server, if any,
> + will be woken (see 'wake detection' below);
> + - Note: if needed, it is possible for a worker to wake another worker:
> + the waker marks itself "IDLE|LOCKED", points its next_tid to the wakee,
> + makes the syscall, restores its server in next_tid, marks itself
> + as ``RUNNING``.
> +
> +- block detection: worker blocks in the kernel: ``S:IDLE+W:RUNNING => S:RUNNING+W:BLOCKED``:
> +
> + - when a worker blocks in the kernel in ``RUNNING`` state (not ``LOCKED``),
> + before descheduling the task from the CPU the kernel performs these
> + operations:
> +
> + - ``W:RUNNING => W:BLOCKED``
> + - ``S := W.next_tid``
> + - ``S:IDLE => S:RUNNING``
> + - ``try_to_wake_up(S)``
> +
> + - if any of the first three operations above fail, the worker is killed via
> + ``SIGSEGV``. Note that ``ttwu(S)`` is not required to succeed, as the
> + server may still be transitioning to sleep in ``sys_umcg_wait()``; before
> + actually putting the server to sleep its UMCG state is checked and, if
> + it is ``RUNNING``, sys_umcg_wait() returns to the userspace;
> + - if the worker has its ``LOCKED`` flag set, block detection does not trigger,
> + as the worker is assumed to be in the userspace scheduling code.
> +
> +- wake detection: worker wakes in the kernel: ``W:BLOCKED => W:IDLE``:
> +
> + - all workers' returns to the userspace are intercepted:
> +
> + - ``start:`` (a label)
> + - if ``W:RUNNING & W.next_tid != 0``: let the worker exit to the userspace,
> + as this is a ``RUNNING`` worker with a server;
> + - ``W:* => W:IDLE`` (previously blocked or woken without servers workers
> + are not allowed to return to the userspace);
> + - the worker is appended to ``W.idle_workers_ptr`` idle workers list;
> + - ``S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S.RUNNING; ttwu(S)``
> + - ``idle_loop(W)``: this is the same idle loop that ``sys_umcg_wait()``
> + uses: it breaks only when the worker becomes ``RUNNING``; when the
> + idle loop exits, it is assumed that the userspace has properly
> + removed the worker from the idle workers list before marking it
> + ``RUNNING``;
> + - ``goto start;`` (repeat from the beginning).
> +
> + - the logic above is a bit more complicated in the presence of ``LOCKED`` or
> + ``PREEMPTED`` flags, but the main invariants stay the same:
> +
> + - only ``RUNNING`` workers with servers assigned are allowed to run
> + in the userspace (unless ``LOCKED``);
> + - newly ``IDLE`` workers are added to the idle workers list; any
> + user-initiated state change assumes the userspace properly removed
> + the worker from the list;
> + - as with wake detection, any "breach of contract" by the userspace
> + will result in the task termination via ``SIGSEGV``.
> +
> +- worker preemption: ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED``:
> +
> + - when the userspace wants to preempt a ``RUNNING`` worker, it changes
> + it state, atomically, ``RUNNING => RUNNING|PREEMPTED`` and sends a signal
> + to the worker via ``tgkill()``; the signal handler, previously set up
> + by the userspace, can be a NOP (note that only ``RUNNING`` workers can be
> + preempted);
> + - if the worker, at the moment the signal arrived, continued to be running
> + on-CPU in the userspace, the "wake detection" code will be triggered that,
> + in addition to what was described above, will check if the worker is in
> + ``RUNNING|PREEMPTED`` state:
> +
> + - ``W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED``
> + - ``S := W.next_tid``
> + - ``S:IDLE => S:RUNNING``
> + - ``try_to_wakeup(S)``

- ``try_to_wake_up(S)``

> +
> + - if the signal arrives after the worker blocks in the kernel, the "block
> + detection" happened as described above, with the following change:
> +
> + - ``W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED``
> + - ``S := W.next_tid``
> + - ``S:IDLE => S:RUNNING``
> + - ``try_to_wake_up(S)``
> +
> + - in any case, the worker's server is woken, with its attached worker
> + (``S.next_tid``) either in ``BLOCKED|PREEMPTED`` or ``IDLE|PREEMPTED``
> + state.
> +
> +Server-only use cases
> +=====================
> +
> +Some workloads/applications may benefit from fast and synchronous on-CPU
> +user-initiated context switches without the need for full userspace
> +scheduling (block/wake detection). These applications can use "standalone"
> +UMCG servers to wait/wake/context-switch, including across process boundaries.
> +
> +These "worker-less" operations involve trivial ``RUNNING`` <==> ``IDLE``
> +state changes, not discussed here for brevity.
> +
> --
> 2.25.1
>


Thanks,
Tao

2021-09-14 16:54:09

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH 3/4 v0.5] sched/umcg: RFC: implement UMCG syscalls

On Wed, Sep 8, 2021 at 6:40 PM Jann Horn <[email protected]> wrote:

[...]

> I think umcg_idle_loop() should never be called from scheduler
> callbacks (meaning umcg_wq_worker_running()), only from UMCG syscalls.

I'm moving umcg_wq_worker_running() out of
core.c/sched_update_worker() and into
/kernel/entry/common.c/exit_to_user_mode_loop()
(and will rename the function appropriately).
It seems rescheduling/sleeping there is fine.

I'm not yet sure if this is all that is needed to deal with
UMCG_TF_PREEMPTED flag; but I don't expect to see
any locks held when the task truly returns to the userspace.

Maybe I'll need to set TIF_NOTIFY_RESUME in sched_update_worker()...

[...]

2021-09-14 16:54:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers



On Thu, Sep 9, 2021, at 2:20 PM, Jann Horn wrote:
> On Thu, Sep 9, 2021 at 9:07 PM Peter Oskolkov <[email protected]> wrote:
> > On Wed, Sep 8, 2021 at 4:39 PM Jann Horn <[email protected]> wrote:
> >
> > Thanks a lot for the reviews, Jann!
> >
> > I understand how to address most of your comments. However, one issue
> > I'm not sure what to do about:
> >
> > [...]
> >
> > > If this function is not allowed to sleep, as the comment says...
> >
> > [...]
> >
> > > ... then I'm pretty sure you can't call fix_pagefault() here, which
> > > acquires the mmap semaphore (which may involve sleeping) and then goes
> > > through the pagefault handling path (which can also sleep for various
> > > reasons, like allocating memory for pagetables, loading pages from
> > > disk / NFS / FUSE, and so on).
> >
> > <quote from peterz@ from
> > https://lore.kernel.org/lkml/[email protected]/>:
> > So a PF_UMCG_WORKER would be added to sched_submit_work()'s PF_*_WORKER
> > path to capture these tasks blocking. The umcg_sleeping() hook added
> > there would:
> >
> > put_user(BLOCKED, umcg_task->umcg_status);
> > ...
> > </quote>
> >
> > Which is basically what I am doing here: in sched_submit_work() I need
> > to read/write to userspace; and we cannot sleep in
> > sched_submit_work(), I believe.
> >
> > If you are right that it is impossible to deal with pagefaults from
> > within non-sleepable contexts, I see two options:
> >
> > Option 1: as you suggest, pin pages holding struct umcg_task in sys_umcg_ctl;
>
> FWIW, there is a variant on this that might also be an option:
>
> You can create a new memory mapping from kernel code and stuff pages
> into it that were originally allocated as normal kernel pages. This is
> done in a bunch of places, e.g.:

With a custom mapping, you don’t need to pin pages at all, I think. As long as you can reconstruct the contents of the shared page and you’re willing to do some slightly careful synchronization, you can detect that the page is missing when you try to update it and skip the update. The vm_ops->fault handler can repopulate the page the next time it’s accessed.

All that being said, I feel like I’m missing something. The point of this is to send what the old M:N folks called “scheduler activations”, right? Wouldn’t it be more efficient to explicitly wake something blockable/pollable and write the message into a more efficient data structure? Polling one page per task from userspace seems like it will have inherently high latency due to the polling interval and will also have very poor locality. Or am I missing something?

>
>
> Note that what I'm suggesting here is a bit unusual - normally only
> the vDSO is a "special mapping", other APIs tend to use mappings that
> are backed by files. But I think we probably don't want to have a file
> involved here...
>

A file would be weird — the lifetime and SCM_RIGHTS interactions may be unpleasant.

> If you decide to go this route, you should probably CC
> [email protected] (for general memory management) and Andy Lutomirski
> (who has tinkered around in vDSO-related code a lot).
>

Who’s that? :)

> > or
> >
> > Option 2: add more umcg-related kernel state to task_struct so that
> > reading/writing to userspace is not necessary in sched_submit_work().
> >
> > The first option sounds much better from the code simplicity point of
> > view, but I'm not sure if it is a viable approach, i.e. I'm afraid
> > we'll get a hard NACK here, as a non-privileged process will be able
> > to force the kernel to pin a page per task/thread.
>
> To clarify: It's entirely normal that userspace processes can force
> the kernel to hold on to some amounts of memory that can't be paged
> out - consider e.g. pagetables and kernel objects referenced by file
> descriptors. So an API that pins limited amounts of memory that are
> also mapped in userspace isn't inherently special. But pinning pages
> that were originally allocated as normal userspace memory can be more
> problematic because that memory might be hugepages, or file pages, or
> it might prevent the hugepaged from being able to defragment memory
> because the pinned page was allocated in ZONE_MOVABLE.
>
>
> > We may get around
> > it by first pinning a limited number of pages, then having the
> > userspace allocate structs umcg_task on those pages, so that a pinned
> > page would cover more than a single task/thread. And have a sysctl
> > that limits the number of pinned pages per MM.
>
> I think that you wouldn't necessarily need a sysctl for that if the
> kernel can enforce that you don't have more pages allocated than you
> need for the maximum number of threads that have ever been running
> under the process, and you also use __GFP_ACCOUNT so that cgroups can
> correctly attribute the memory usage.
>
> > Peter Z., could you, please, comment here? Do you think pinning pages
> > to hold structs umcg_task is acceptable?
>

2021-09-14 16:59:13

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH 4/4 v0.5] sched/umcg: add Documentation/userspace-api/umcg.rst

On Tue, Sep 14, 2021 at 9:34 AM Tao Zhou <[email protected]> wrote:

[...]

> > +- worker to worker context switch:
> > + ``W1:RUNNING+W2:IDLE => W1:IDLE+W2:RUNNING``:
> > +
> > + - in the userspace, in the context of W1 running:
> > +
> > + - ``W2:IDLE => W2:RUNNING|LOCKED`` (mark W2 as running)
> > + - ``W1:RUNNING => W1:IDLE|LOCKED`` (mark self as idle)
> > + - ``W2.next_tid := W1.next_tid; S.next_tid := W2.next_tid``
> ^^^^^^^^
> W2.next_tid is a server. S.next_tid should be a worker; say:
>
> S.next_tid := W2.tid

You are definitely right here. I'll fix the doc in the next patchset.

2021-09-14 18:07:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Tue, Sep 14, 2021 at 09:29:00AM -0700, Peter Oskolkov wrote:
> In the version of the patchset that I'm preparing to send I've decided
> to punt on the issue and just ask the userspace to deal with locking
> the memory as it sees fit: mlock() is available and as far as I can

Sadly mlock() does not imply no faults. Someone had a too literal
reading of the POSIX-RT spec (of which mlock is part) and figured that
all that was required was to keep the page in memory, not avoid faults.

Linux has had this bahviour for ages, PREEMPT_RT has tried to change
this, but so far to no avail. At some point sys_mpin() was proposed to
meet the original POSIX-RT intent, but afaict that never actually
happened.

In short, mlock() does not avoid minor faults, or even migration faults,
which can take a fair while to resolve.

2021-09-14 18:14:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Tue, Sep 14, 2021 at 09:52:08AM -0700, Andy Lutomirski wrote:
> With a custom mapping, you don’t need to pin pages at all, I think.
> As long as you can reconstruct the contents of the shared page and
> you’re willing to do some slightly careful synchronization, you can
> detect that the page is missing when you try to update it and skip the
> update. The vm_ops->fault handler can repopulate the page the next
> time it’s accessed.

The point is that the moment we know we need to do this user-poke, is
schedule(), which could be called while holding mmap_sem (it being a
preemptable lock). Which means we cannot go and do faults.

> All that being said, I feel like I’m missing something. The point of
> this is to send what the old M:N folks called “scheduler activations”,
> right? Wouldn’t it be more efficient to explicitly wake something
> blockable/pollable and write the message into a more efficient data
> structure? Polling one page per task from userspace seems like it
> will have inherently high latency due to the polling interval and will
> also have very poor locality. Or am I missing something?

The idea was to link the user structures together in a (single) linked
list. The server structure gets a list of all the blocked tasks. This
avoids having to a full N iteration (like Java, they're talking stupid
number of N).

Polling should not happen, once we run out of runnable tasks, the server
task gets ran again and it can instantly pick up all the blocked
notifications.

2021-09-14 18:19:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Tue, Sep 14, 2021 at 08:04:55PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 14, 2021 at 09:29:00AM -0700, Peter Oskolkov wrote:
> > In the version of the patchset that I'm preparing to send I've decided
> > to punt on the issue and just ask the userspace to deal with locking
> > the memory as it sees fit: mlock() is available and as far as I can
>
> Sadly mlock() does not imply no faults. Someone had a too literal
> reading of the POSIX-RT spec (of which mlock is part) and figured that
> all that was required was to keep the page in memory, not avoid faults.
>
> Linux has had this bahviour for ages, PREEMPT_RT has tried to change
> this, but so far to no avail. At some point sys_mpin() was proposed to
> meet the original POSIX-RT intent, but afaict that never actually
> happened.
>
> In short, mlock() does not avoid minor faults, or even migration faults,
> which can take a fair while to resolve.

Also, even if it did, that would still not be acceptible because
userspace could fail to call mlock() at which point mis-behaving
userspace can deadlock the kernel, which is a no-no.

2021-09-14 18:32:45

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Tue, Sep 14, 2021 at 11:05 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Sep 14, 2021 at 09:29:00AM -0700, Peter Oskolkov wrote:
> > In the version of the patchset that I'm preparing to send I've decided
> > to punt on the issue and just ask the userspace to deal with locking
> > the memory as it sees fit: mlock() is available and as far as I can
>
> Sadly mlock() does not imply no faults. Someone had a too literal
> reading of the POSIX-RT spec (of which mlock is part) and figured that
> all that was required was to keep the page in memory, not avoid faults.
>
> Linux has had this bahviour for ages, PREEMPT_RT has tried to change
> this, but so far to no avail. At some point sys_mpin() was proposed to
> meet the original POSIX-RT intent, but afaict that never actually
> happened.
>
> In short, mlock() does not avoid minor faults, or even migration faults,
> which can take a fair while to resolve.

Ok, I'll go with transiently pinning pages in
__syscall_enter_from_user_work(), as you suggested. Seems easy enough
to do.

Thanks for the suggestion!

2021-09-14 18:42:50

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers



On Tue, Sep 14, 2021, at 11:11 AM, Peter Zijlstra wrote:
> On Tue, Sep 14, 2021 at 09:52:08AM -0700, Andy Lutomirski wrote:
> > With a custom mapping, you don’t need to pin pages at all, I think.
> > As long as you can reconstruct the contents of the shared page and
> > you’re willing to do some slightly careful synchronization, you can
> > detect that the page is missing when you try to update it and skip the
> > update. The vm_ops->fault handler can repopulate the page the next
> > time it’s accessed.
>
> The point is that the moment we know we need to do this user-poke, is
> schedule(), which could be called while holding mmap_sem (it being a
> preemptable lock). Which means we cannot go and do faults.

That’s fine. The page would be in one or two states: present and writable by kernel or completely gone. If its present, the scheduler writes it. If it’s gone, the scheduler skips the write and the next fault fills it in.

>
> > All that being said, I feel like I’m missing something. The point of
> > this is to send what the old M:N folks called “scheduler activations”,
> > right? Wouldn’t it be more efficient to explicitly wake something
> > blockable/pollable and write the message into a more efficient data
> > structure? Polling one page per task from userspace seems like it
> > will have inherently high latency due to the polling interval and will
> > also have very poor locality. Or am I missing something?
>
> The idea was to link the user structures together in a (single) linked
> list. The server structure gets a list of all the blocked tasks. This
> avoids having to a full N iteration (like Java, they're talking stupid
> number of N).
>
> Polling should not happen, once we run out of runnable tasks, the server
> task gets ran again and it can instantly pick up all the blocked
> notifications.
>

How does the server task know when to read the linked list? And what’s wrong with a ring buffer or a syscall?

2021-09-14 18:53:31

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Tue, Sep 14, 2021 at 11:29 AM Peter Oskolkov <[email protected]> wrote:
>
> On Tue, Sep 14, 2021 at 11:05 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Tue, Sep 14, 2021 at 09:29:00AM -0700, Peter Oskolkov wrote:
> > > In the version of the patchset that I'm preparing to send I've decided
> > > to punt on the issue and just ask the userspace to deal with locking
> > > the memory as it sees fit: mlock() is available and as far as I can
> >
> > Sadly mlock() does not imply no faults. Someone had a too literal
> > reading of the POSIX-RT spec (of which mlock is part) and figured that
> > all that was required was to keep the page in memory, not avoid faults.
> >
> > Linux has had this bahviour for ages, PREEMPT_RT has tried to change
> > this, but so far to no avail. At some point sys_mpin() was proposed to
> > meet the original POSIX-RT intent, but afaict that never actually
> > happened.
> >
> > In short, mlock() does not avoid minor faults, or even migration faults,
> > which can take a fair while to resolve.
>
> Ok, I'll go with transiently pinning pages in
> __syscall_enter_from_user_work(), as you suggested. Seems easy enough
> to do.

Actually, I think pinning these pages when the worker exits to the
userspace (i.e. is scheduled on a CPU) and releasing them when the
worker is descheduled (blocks) would be better - this way we will be
able to wake the server not only on blocking syscalls but also on
pagefaults (on other pages) as well.

Do you think this approach is acceptable?

>
> Thanks for the suggestion!

2021-09-15 15:47:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Tue, Sep 14, 2021 at 11:40:01AM -0700, Andy Lutomirski wrote:
>
>
> On Tue, Sep 14, 2021, at 11:11 AM, Peter Zijlstra wrote:
> > On Tue, Sep 14, 2021 at 09:52:08AM -0700, Andy Lutomirski wrote:
> > > With a custom mapping, you don’t need to pin pages at all, I think.
> > > As long as you can reconstruct the contents of the shared page and
> > > you’re willing to do some slightly careful synchronization, you can
> > > detect that the page is missing when you try to update it and skip the
> > > update. The vm_ops->fault handler can repopulate the page the next
> > > time it’s accessed.
> >
> > The point is that the moment we know we need to do this user-poke, is
> > schedule(), which could be called while holding mmap_sem (it being a
> > preemptable lock). Which means we cannot go and do faults.
>
> That’s fine. The page would be in one or two states: present and
> writable by kernel or completely gone. If its present, the scheduler
> writes it. If it’s gone, the scheduler skips the write and the next
> fault fills it in.

That's non-deterministic, and as such not suitable.

> > > All that being said, I feel like I’m missing something. The point of
> > > this is to send what the old M:N folks called “scheduler activations”,
> > > right? Wouldn’t it be more efficient to explicitly wake something
> > > blockable/pollable and write the message into a more efficient data
> > > structure? Polling one page per task from userspace seems like it
> > > will have inherently high latency due to the polling interval and will
> > > also have very poor locality. Or am I missing something?
> >
> > The idea was to link the user structures together in a (single) linked
> > list. The server structure gets a list of all the blocked tasks. This
> > avoids having to a full N iteration (like Java, they're talking stupid
> > number of N).
> >
> > Polling should not happen, once we run out of runnable tasks, the server
> > task gets ran again and it can instantly pick up all the blocked
> > notifications.
> >
>
> How does the server task know when to read the linked list? And
> what’s wrong with a ring buffer or a syscall?

Same problem, ring-buffer has the case where it's full and events get
dropped, at which point you've completely lost state. If it is at all
possible to recover from that, doing so is non-deterministic.

I really want this stuff to work for realtime workloads too.

2021-09-15 16:52:07

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Wed, Sep 15, 2021 at 8:45 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Sep 14, 2021 at 11:40:01AM -0700, Andy Lutomirski wrote:
> >
> >
> > On Tue, Sep 14, 2021, at 11:11 AM, Peter Zijlstra wrote:
> > > On Tue, Sep 14, 2021 at 09:52:08AM -0700, Andy Lutomirski wrote:
> > > > With a custom mapping, you don’t need to pin pages at all, I think.
> > > > As long as you can reconstruct the contents of the shared page and
> > > > you’re willing to do some slightly careful synchronization, you can
> > > > detect that the page is missing when you try to update it and skip the
> > > > update. The vm_ops->fault handler can repopulate the page the next
> > > > time it’s accessed.
> > >
> > > The point is that the moment we know we need to do this user-poke, is
> > > schedule(), which could be called while holding mmap_sem (it being a
> > > preemptable lock). Which means we cannot go and do faults.
> >
> > That’s fine. The page would be in one or two states: present and
> > writable by kernel or completely gone. If its present, the scheduler
> > writes it. If it’s gone, the scheduler skips the write and the next
> > fault fills it in.
>
> That's non-deterministic, and as such not suitable.

What's the precise problem? The code would be roughly:

if (try_pin_the_page) {
write it;
unpin;
} else {
do nothing -- .fault will fill in the correct contents.
}

The time this takes is nondeterministic, but it's bounded and short.

>
> > > > All that being said, I feel like I’m missing something. The point of
> > > > this is to send what the old M:N folks called “scheduler activations”,
> > > > right? Wouldn’t it be more efficient to explicitly wake something
> > > > blockable/pollable and write the message into a more efficient data
> > > > structure? Polling one page per task from userspace seems like it
> > > > will have inherently high latency due to the polling interval and will
> > > > also have very poor locality. Or am I missing something?
> > >
> > > The idea was to link the user structures together in a (single) linked
> > > list. The server structure gets a list of all the blocked tasks. This
> > > avoids having to a full N iteration (like Java, they're talking stupid
> > > number of N).
> > >
> > > Polling should not happen, once we run out of runnable tasks, the server
> > > task gets ran again and it can instantly pick up all the blocked
> > > notifications.
> > >
> >
> > How does the server task know when to read the linked list? And
> > what’s wrong with a ring buffer or a syscall?
>
> Same problem, ring-buffer has the case where it's full and events get
> dropped, at which point you've completely lost state. If it is at all
> possible to recover from that, doing so is non-deterministic.
>
> I really want this stuff to work for realtime workloads too.

A ring buffer would have a bounded size -- one word (of whatever size)
per user thread.

2021-09-15 19:14:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Wed, Sep 15, 2021 at 09:50:41AM -0700, Andy Lutomirski wrote:
> What's the precise problem? The code would be roughly:
>
> if (try_pin_the_page) {
> write it;
> unpin;
> } else {
> do nothing -- .fault will fill in the correct contents.
> }
>
> The time this takes is nondeterministic, but it's bounded and short.

You can't do this from sched_submit_work(), you might already hold
mmap_sem. Hence the suggestion to, for UMCG registered threads, pin the
page(s) on syscall-enter and unpin on either schedule() or
syscall-return.