User Managed Concurrency Groups (UMCG) is an M:N threading
subsystem/toolkit that lets user space application developers implement
in-process user space schedulers.
Key changes from the previous patchset
https://lore.kernel.org/all/[email protected]/ :
- userspace atomic helpers moved into mm/
- UMCG task states are now tagged with timestamps
- cross-mm interactions (wakeups) are not permitted
- several smaller fixes and refactorings
These big things remain to be addressed (in no particular order):
- support tracing/debugging
- make context switches faster (see umcg_do_context_switch in umcg.c)
- support other architectures
- cleanup and post userspace support in tools/lib/umcg/
- cleanup and post selftests in tools/testing/selftests/umcg/
- allow cross-mm wakeups (securely)
I'm working on finalizing libumcg and kselftests.
Signed-off-by: Peter Oskolkov <[email protected]>
Peter Oskolkov (5):
sched/umcg: add WF_CURRENT_CPU and externise ttwu
mm, x86/uaccess: add userspace atomic helpers
sched/umcg: implement UMCG syscalls
sched/umcg: add Documentation/userspace-api/umcg.rst
sched/umcg: add Documentation/userspace-api/umcg.txt
Documentation/userspace-api/umcg.rst | 611 ++++++++++++++++
Documentation/userspace-api/umcg.txt | 594 ++++++++++++++++
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
arch/x86/include/asm/uaccess_64.h | 93 +++
fs/exec.c | 1 +
include/linux/sched.h | 71 ++
include/linux/syscalls.h | 3 +
include/linux/uaccess.h | 46 ++
include/uapi/asm-generic/unistd.h | 6 +-
include/uapi/linux/umcg.h | 137 ++++
init/Kconfig | 10 +
kernel/entry/common.c | 4 +-
kernel/exit.c | 5 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 12 +-
kernel/sched/fair.c | 4 +
kernel/sched/sched.h | 15 +-
kernel/sched/umcg.c | 926 +++++++++++++++++++++++++
kernel/sys_ni.c | 4 +
mm/maccess.c | 264 +++++++
20 files changed, 2797 insertions(+), 12 deletions(-)
create mode 100644 Documentation/userspace-api/umcg.rst
create mode 100644 Documentation/userspace-api/umcg.txt
create mode 100644 include/uapi/linux/umcg.h
create mode 100644 kernel/sched/umcg.c
--
2.25.1
Define struct umcg_task and two syscalls: sys_umcg_ctl and sys_umcg_wait.
User Managed Concurrency Groups is an M:N threading toolkit that allows
constructing user space schedulers designed to efficiently manage
heterogeneous in-process workloads while maintaining high CPU
utilization (95%+).
In addition, M:N threading and cooperative user space scheduling
enable a synchronous coding style and better cache locality when
compared to the asynchronous callback/continuation style of programming.
The UMCG kernel API is built around the following ideas:
* UMCG server: a task/thread representing "kernel threads", or (v)CPUs;
* UMCG worker: a task/thread representing "application threads", to be
scheduled over servers;
* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
server or a worker) can be in;
* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
can be ORed with the task state to communicate additional information to
the kernel;
* struct umcg_task: a per-task userspace set of data fields, usually
residing in the TLS, that fully reflects the current task's UMCG state
and controls the way the kernel manages the task;
* sys_umcg_ctl(): a syscall used to register the current task/thread as a
server or a worker, or to unregister a UMCG task;
* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
wake another task, potentially context-switching between the two tasks
on-CPU synchronously.
In short, servers can be thought of as CPUs over which application
threads (workers) are scheduled; at any one time a worker is in one of
three states:
- RUNNING: has a server and is schedulable by the kernel;
- BLOCKED: blocked in the kernel (e.g. on I/O, or a futex);
- IDLE: is not blocked, but cannot be scheduled by the kernel to
run because it has no server assigned to it (e.g. because all
available servers are busy "running" other workers).
Usually the number of servers in a process is equal to the number of
CPUs available to the kernel if the process is supposed to consume
the whole machine, or less than the number of CPUs available if the
process is sharing the machine with other workloads. The number of
workers in a process can grow very large: tens of thousands is normal;
hundreds of thousands and more (millions) is something that would
be desirable to achieve in the future, as lightweight userspace
threads in Java and Go easily scale to millions, and UMCG workers
are (intended to be) conceptually similar to those.
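For illustration, a minimal registration sketch (not part of this patch;
it assumes the uapi header added below and the x86_64 syscall number
defined by this series):

  #include <linux/umcg.h>        /* struct umcg_task, UMCG_CTL_* */
  #include <stdint.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef __NR_umcg_ctl
  #define __NR_umcg_ctl 449      /* x86_64 number in this series */
  #endif

  /* One struct umcg_task per thread, usually placed in TLS. */
  static __thread struct umcg_task umcg_self;

  static long umcg_register_server(void)
  {
          memset(&umcg_self, 0, sizeof(umcg_self));
          /* Registration requires state == RUNNING and next_tid == 0. */
          umcg_self.state_ts = UMCG_TASK_RUNNING;
          return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &umcg_self);
  }

A worker additionally passes UMCG_CTL_WORKER and must first point
idle_workers_ptr and idle_server_tid_ptr at the scheduler's shared
variables; it is then put to sleep as IDLE until a server runs it.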
Detailed use cases and API behavior are provided in
Documentation/userspace-api/umcg.[txt|rst] (see sibling patches).
Some high-level implementation notes:
UMCG tasks (workers and servers) are "tagged" with struct umcg_task
residing in userspace (usually in TLS) to facilitate kernel/userspace
communication. This makes the kernel-side code much simpler (see e.g.
the implementation of sys_umcg_wait), but also requires some careful
uaccess handling and page pinning (see below).
The main UMCG server/worker interaction looks like:
a. worker W1 is RUNNING, with a server S attached to it sleeping
in IDLE state;
b. worker W1 blocks in the kernel, e.g. on I/O;
c. the kernel marks W1 as BLOCKED, the attached server S
as RUNNING, and wakes S (the "block detection" event);
d. the server now picks another IDLE worker W2 to run: marks
W2 as RUNNING, itself as IDLE, and calls sys_umcg_wait() (sketched below);
e. when the blocking operation of W1 completes, the worker
is marked by the kernel as IDLE and added to the idle workers list
(see struct umcg_task) for the userspace to pick up and
later run (the "wake detection" event).
While there are additional operations such as worker-to-worker
context switch, preemption, workers "yielding", etc., the "workflow"
above is the main worker/server interaction that drives the
implementation.
Specifically:
- most operations are conceptually context switches:
- scheduling a worker: a running server goes to sleep and "runs"
a worker in its place;
- block detection: worker is descheduled, and its server is woken;
- wake detection: the woken worker, running in the kernel, is descheduled,
and if there is an idle server, it is woken to process the wake
detection event;
- to facilitate low scheduling latencies and cache locality, most
server/worker interactions described above are performed synchronously
"on CPU" via WF_CURRENT_CPU flag passed to ttwu; while at the moment
the context switches are simulated by putting the switched-out task to
sleep and waking the switched-into task on the same CPU, it is very much
the long-term goal of this project to make the context switch much
lighter, by tweaking runtime accounting and, maybe, even bypassing
__schedule();
- worker blocking is detected in a hook to sched_submit_work; as mentioned
above, the server is to be woken on the same CPU, synchronously;
this code may not pagefault, so to access worker's and server's
userspace memory (struct umcg_task), memory pages containing the worker's
and the server's structs umcg_task are pinned when the worker is
exiting to the userspace, and unpinned when the worker is descheduled;
- worker wakeup is detected in a hook to sched_update_worker, and processed
in the exit to usermode loop (via TIF_NOTIFY_RESUME); workers CAN
pagefault on the wakeup path;
- worker preemption is implemented by the userspace tagging the worker
with the UMCG_TF_PREEMPTED state flag and sending it a signal (whose
handler should be a no-op);
on the exit to usermode the worker is intercepted and its server is woken
(see Documentation/userspace-api/umcg.[txt|rst] for more details);
- each state change is tagged with a unique timestamp (of MONOTONIC
variety), so that
- scheduling instrumentation is naturally available;
- racing state changes are easily detected and ABA issues are
avoided;
see umcg_update_state() in umcg.c for implementation details, and
Documentation/userspace-api/umcg.[txt|rst] for a higher-level
description.
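For example, a userspace consumer can recover both the state and the
time of the last state change from state_ts with the constants defined
in include/uapi/linux/umcg.h added below (a sketch only):

  /* Kernel-owned state bits: task state plus state flags. */
  static inline uint64_t umcg_state(uint64_t state_ts)
  {
          return state_ts & UMCG_TASK_STATE_MASK_FULL;
  }

  /* Approximate CLOCK_MONOTONIC time of the last state change, in ns
   * (truncated: the upper bits of the clock are not stored). */
  static inline uint64_t umcg_state_change_ns(uint64_t state_ts)
  {
          return (state_ts >> (64 - UMCG_STATE_TIMESTAMP_BITS))
                          << UMCG_STATE_TIMESTAMP_GRANULARITY;
  }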
The previous version of the patchset can be found at
https://lore.kernel.org/all/[email protected]/
containing some additional context and links to earlier discussions.
More details are available in Documentation/userspace-api/umcg.[txt|rst]
in sibling patches, and in doc-comments in the code.
Signed-off-by: Peter Oskolkov <[email protected]>
---
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
fs/exec.c | 1 +
include/linux/sched.h | 71 ++
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 6 +-
include/uapi/linux/umcg.h | 137 ++++
init/Kconfig | 10 +
kernel/entry/common.c | 4 +-
kernel/exit.c | 5 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 9 +-
kernel/sched/umcg.c | 926 +++++++++++++++++++++++++
kernel/sys_ni.c | 4 +
13 files changed, 1175 insertions(+), 4 deletions(-)
create mode 100644 include/uapi/linux/umcg.h
create mode 100644 kernel/sched/umcg.c
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 18b5500ea8bf..cb71f383060f 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -370,6 +370,8 @@
446 common landlock_restrict_self sys_landlock_restrict_self
447 common memfd_secret sys_memfd_secret
448 common process_mrelease sys_process_mrelease
+449 common umcg_ctl sys_umcg_ctl
+450 common umcg_wait sys_umcg_wait
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/exec.c b/fs/exec.c
index a098c133d8d7..dfa24bb99a97 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1840,6 +1840,7 @@ static int bprm_execve(struct linux_binprm *bprm,
current->fs->in_exec = 0;
current->in_execve = 0;
rseq_execve(current);
+ umcg_execve(current);
acct_update_integrals(current);
task_numa_free(current, false);
return retval;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 343603f77f8b..c7e812ceec3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
+struct umcg_task;
/*
* Task state bitmask. NOTE! These bits are also
@@ -1296,6 +1297,12 @@ struct task_struct {
unsigned long rseq_event_mask;
#endif
+#ifdef CONFIG_UMCG
+ struct umcg_task __user *umcg_task;
+ struct page *pinned_umcg_worker_page; /* self */
+ struct page *pinned_umcg_server_page;
+#endif
+
struct tlbflush_unmap_batch tlb_ubc;
union {
@@ -1688,6 +1695,13 @@ extern struct pid *cad_pid;
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+
+#ifdef CONFIG_UMCG
+#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */
+#else
+#define PF_UMCG_WORKER 0x00000000
+#endif
+
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
@@ -2275,6 +2289,63 @@ static inline void rseq_execve(struct task_struct *t)
#endif
+#ifdef CONFIG_UMCG
+
+void umcg_handle_resuming_worker(void);
+void umcg_handle_exiting_worker(void);
+void umcg_clear_child(struct task_struct *tsk);
+
+/* Called by bprm_execve() in fs/exec.c. */
+static inline void umcg_execve(struct task_struct *tsk)
+{
+ if (tsk->umcg_task)
+ umcg_clear_child(tsk);
+}
+
+/* Called by exit_to_user_mode_loop() in kernel/entry/common.c.*/
+static inline void umcg_handle_notify_resume(void)
+{
+ if (current->flags & PF_UMCG_WORKER)
+ umcg_handle_resuming_worker();
+}
+
+/* Called by do_exit() in kernel/exit.c. */
+static inline void umcg_handle_exit(void)
+{
+ if (current->flags & PF_UMCG_WORKER)
+ umcg_handle_exiting_worker();
+}
+
+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+void umcg_wq_worker_sleeping(struct task_struct *tsk);
+void umcg_wq_worker_running(struct task_struct *tsk);
+
+#else /* CONFIG_UMCG */
+
+static inline void umcg_clear_child(struct task_struct *tsk)
+{
+}
+static inline void umcg_execve(struct task_struct *tsk)
+{
+}
+static inline void umcg_handle_notify_resume(void)
+{
+}
+static inline void umcg_handle_exit(void)
+{
+}
+static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+}
+static inline void umcg_wq_worker_running(struct task_struct *tsk)
+{
+}
+
+#endif
+
#ifdef CONFIG_DEBUG_RSEQ
void rseq_syscall(struct pt_regs *regs);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 252243c7783d..97a05879da41 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -71,6 +71,7 @@ struct open_how;
struct mount_attr;
struct landlock_ruleset_attr;
enum landlock_rule_type;
+struct umcg_task;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -1052,6 +1053,8 @@ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type ru
const void __user *rule_attr, __u32 flags);
asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
asmlinkage long sys_memfd_secret(unsigned int flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1c5fb86d455a..3e3d50de5137 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -879,9 +879,13 @@ __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
#endif
#define __NR_process_mrelease 448
__SYSCALL(__NR_process_mrelease, sys_process_mrelease)
+#define __NR_umcg_ctl 449
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 450
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
#undef __NR_syscalls
-#define __NR_syscalls 449
+#define __NR_syscalls 451
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
new file mode 100644
index 000000000000..ce4c7980b837
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,137 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/limits.h>
+#include <linux/types.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ * sys_umcg_ctl() - register/unregister UMCG tasks;
+ * sys_umcg_wait() - wait/wake/context-switch.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ *
+ * See Documentation/userspace-api/umcg.[txt|rst] for details.
+ */
+
+/*
+ * UMCG task states, the first 6 bits of struct umcg_task.state_ts.
+ * The states represent the user space point of view.
+ */
+#define UMCG_TASK_NONE 0ULL
+#define UMCG_TASK_RUNNING 1ULL
+#define UMCG_TASK_IDLE 2ULL
+#define UMCG_TASK_BLOCKED 3ULL
+
+/* UMCG task state flags, bits 7-8 */
+
+/*
+ * UMCG_TF_LOCKED: locked by the userspace in preparation to calling umcg_wait.
+ */
+#define UMCG_TF_LOCKED (1ULL << 6)
+
+/*
+ * UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted.
+ */
+#define UMCG_TF_PREEMPTED (1ULL << 7)
+
+/* The first six bits: RUNNING, IDLE, or BLOCKED. */
+#define UMCG_TASK_STATE_MASK 0x3fULL
+
+/* The full kernel state mask: the first 13 bits. */
+#define UMCG_TASK_STATE_MASK_FULL 0x1fffULL
+
+/*
+ * The number of bits reserved for UMCG state timestamp in
+ * struct umcg_task.state_ts.
+ */
+#define UMCG_STATE_TIMESTAMP_BITS 46
+
+/* The number of bits truncated from UMCG state timestamp. */
+#define UMCG_STATE_TIMESTAMP_GRANULARITY 4
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+ /**
+ * @state_ts: the current state of the UMCG task described by
+ * this struct, with a unique timestamp indicating
+ * when the last state change happened.
+ *
+ * Readable/writable by both the kernel and the userspace.
+ *
+ * UMCG task state:
+ * bits 0 - 5: task state;
+ * bits 6 - 7: state flags;
+ * bits 8 - 12: reserved; must be zeroes;
+ * bits 13 - 17: for userspace use;
+ * bits 18 - 63: timestamp (see below).
+ *
+ * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
+ * See Documentation/userspace-api/umcg.[txt|rst] for details.
+ */
+ uint64_t state_ts; /* r/w */
+
+ /**
+ * @next_tid: the TID of the UMCG task that should be context-switched
+ * into in sys_umcg_wait(). Can be zero.
+ *
+ * Running UMCG workers must have next_tid set to point to IDLE
+ * UMCG servers.
+ *
+ * Read-only for the kernel, read/write for the userspace.
+ */
+ uint32_t next_tid; /* r */
+
+ uint32_t flags; /* Reserved; must be zero. */
+
+ /**
+ * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
+ *
+ * Readable/writable by both the kernel and the userspace: the
+ * kernel adds items to the list, the userspace removes them.
+ */
+ uint64_t idle_workers_ptr; /* r/w */
+
+ /**
+ * @idle_server_tid_ptr: a pointer pointing to a single idle server.
+ * Readonly.
+ */
+ uint64_t idle_server_tid_ptr; /* r */
+} __attribute__((packed, aligned(8 * sizeof(__u64))));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER: register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER: register the current task as a UMCG worker
+ */
+enum umcg_ctl_flag {
+ UMCG_CTL_REGISTER = 0x00001,
+ UMCG_CTL_UNREGISTER = 0x00002,
+ UMCG_CTL_WORKER = 0x10000,
+};
+
+/**
+ * enum umcg_wait_flag - flags to pass to sys_umcg_wait
+ * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
+ * @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
+ * (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
+ * must be set.
+ */
+enum umcg_wait_flag {
+ UMCG_WAIT_WAKE_ONLY = 1,
+ UMCG_WAIT_WF_CURRENT_CPU = 2,
+};
+
+/* See Documentation/userspace-api/umcg.[txt|rst].*/
+#define UMCG_IDLE_NODE_PENDING (1ULL)
+
+#endif /* _UAPI_LINUX_UMCG_H */
diff --git a/init/Kconfig b/init/Kconfig
index 11f8a845f259..b52a79cfb130 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1688,6 +1688,16 @@ config MEMBARRIER
If unsure, say Y.
+config UMCG
+ bool "Enable User Managed Concurrency Groups API"
+ depends on X86_64
+ default n
+ help
+ Enable User Managed Concurrency Groups API, which form the basis
+ for an in-process M:N userspace scheduling framework.
+ At the moment this is an experimental/RFC feature that is not
+ guaranteed to be backward-compatible.
+
config KALLSYMS
bool "Load all symbols for debugging/ksymoops" if EXPERT
default y
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d5a61d565ad5..62453772a0c7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -171,8 +171,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
handle_signal_work(regs, ti_work);
- if (ti_work & _TIF_NOTIFY_RESUME)
+ if (ti_work & _TIF_NOTIFY_RESUME) {
+ umcg_handle_notify_resume();
tracehook_notify_resume(regs);
+ }
/* Architecture specific TIF work */
arch_exit_to_user_mode_work(regs, ti_work);
diff --git a/kernel/exit.c b/kernel/exit.c
index 63851320ae73..c55f9df430c8 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -745,6 +745,10 @@ void __noreturn do_exit(long code)
if (unlikely(!tsk->pid))
panic("Attempted to kill the idle task!");
+ /* Turn off UMCG sched hooks. */
+ if (unlikely(tsk->flags & PF_UMCG_WORKER))
+ tsk->flags &= ~PF_UMCG_WORKER;
+
/*
* If do_exit is called because this processes oopsed, it's possible
* that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
@@ -781,6 +785,7 @@ void __noreturn do_exit(long code)
io_uring_files_cancel();
exit_signals(tsk); /* sets PF_EXITING */
+ umcg_handle_exit();
/* sync mm's RSS info before statistics gathering */
if (tsk->mm)
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 978fcfca5871..e4e481eee1b7 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
obj-$(CONFIG_PSI) += psi.o
obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d6da1efb5ce6..9ff63e32544a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4236,6 +4236,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
#endif
+ umcg_clear_child(p);
}
DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6265,9 +6266,11 @@ static inline void sched_submit_work(struct task_struct *tsk)
* If a worker goes to sleep, notify and ask workqueue whether it
* wants to wake up a task to maintain concurrency.
*/
- if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
if (task_flags & PF_WQ_WORKER)
wq_worker_sleeping(tsk);
+ else if (task_flags & PF_UMCG_WORKER)
+ umcg_wq_worker_sleeping(tsk);
else
io_wq_worker_sleeping(tsk);
}
@@ -6285,9 +6288,11 @@ static inline void sched_submit_work(struct task_struct *tsk)
static void sched_update_worker(struct task_struct *tsk)
{
- if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
if (tsk->flags & PF_WQ_WORKER)
wq_worker_running(tsk);
+ else if (tsk->flags & PF_UMCG_WORKER)
+ umcg_wq_worker_running(tsk);
else
io_wq_worker_running(tsk);
}
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
new file mode 100644
index 000000000000..bc4eeb3f5dd7
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,926 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ *
+ * See Documentation/userspace-api/umcg.[txt|rst] for details.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+
+#include "sched.h"
+
+/**
+ * get_user_nofault - get user value without sleeping.
+ *
+ * get_user() might sleep and therefore cannot be used in preempt-disabled
+ * regions.
+ */
+#define get_user_nofault(out, uaddr) \
+({ \
+ int ret = -EFAULT; \
+ \
+ if (access_ok((uaddr), sizeof(*(uaddr)))) { \
+ pagefault_disable(); \
+ \
+ if (!__get_user((out), (uaddr))) \
+ ret = 0; \
+ \
+ pagefault_enable(); \
+ } \
+ ret; \
+})
+
+/**
+ * umcg_pin_pages: pin pages containing struct umcg_task of this worker
+ * and its server.
+ *
+ * The pages are pinned when the worker exits to the userspace and unpinned
+ * when the worker is in sched_submit_work(), i.e. when the worker is
+ * about to be removed from its runqueue. Thus at most NR_CPUS UMCG pages
+ * are pinned at any one time across the whole system.
+ *
+ * The pinning is needed so that going-to-sleep workers can access
+ * their and their servers' userspace umcg_task structs without page faults,
+ * as the code path can be executed in the context of a pagefault, with
+ * mm lock held.
+ */
+static int umcg_pin_pages(u32 server_tid)
+{
+ struct umcg_task __user *worker_ut = current->umcg_task;
+ struct umcg_task __user *server_ut = NULL;
+ struct task_struct *tsk;
+
+ rcu_read_lock();
+ tsk = find_task_by_vpid(server_tid);
+ /* Server/worker interaction is allowed only within the same mm. */
+ if (tsk && current->mm == tsk->mm)
+ server_ut = READ_ONCE(tsk->umcg_task);
+ rcu_read_unlock();
+
+ if (!server_ut)
+ return -EINVAL;
+
+ tsk = current;
+
+ /* worker_ut is stable, don't need to repin */
+ if (!tsk->pinned_umcg_worker_page)
+ if (1 != pin_user_pages_fast((unsigned long)worker_ut, 1, 0,
+ &tsk->pinned_umcg_worker_page))
+ return -EFAULT;
+
+ /* server_ut may change, need to repin */
+ if (tsk->pinned_umcg_server_page) {
+ unpin_user_page(tsk->pinned_umcg_server_page);
+ tsk->pinned_umcg_server_page = NULL;
+ }
+
+ if (1 != pin_user_pages_fast((unsigned long)server_ut, 1, 0,
+ &tsk->pinned_umcg_server_page))
+ return -EFAULT;
+
+ return 0;
+}
+
+static void umcg_unpin_pages(void)
+{
+ struct task_struct *tsk = current;
+
+ if (tsk->pinned_umcg_worker_page)
+ unpin_user_page(tsk->pinned_umcg_worker_page);
+ if (tsk->pinned_umcg_server_page)
+ unpin_user_page(tsk->pinned_umcg_server_page);
+
+ tsk->pinned_umcg_worker_page = NULL;
+ tsk->pinned_umcg_server_page = NULL;
+}
+
+static void umcg_clear_task(struct task_struct *tsk)
+{
+ /*
+ * This is either called for the current task, or for a newly forked
+ * task that is not yet running, so we don't need strict atomicity
+ * below.
+ */
+ if (tsk->umcg_task) {
+ WRITE_ONCE(tsk->umcg_task, NULL);
+
+ /* These can be simple writes - see the comment above. */
+ tsk->pinned_umcg_worker_page = NULL;
+ tsk->pinned_umcg_server_page = NULL;
+ tsk->flags &= ~PF_UMCG_WORKER;
+ }
+}
+
+/* Called for a forked or execve-ed child. */
+void umcg_clear_child(struct task_struct *tsk)
+{
+ umcg_clear_task(tsk);
+}
+
+/* Called by both normally (unregister) and abnormally exiting workers. */
+void umcg_handle_exiting_worker(void)
+{
+ umcg_unpin_pages();
+ umcg_clear_task(current);
+}
+
+/**
+ * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
+ * @state_ts - points to the state_ts member of struct umcg_task to update;
+ * @expected - the expected value of state_ts, including the timestamp;
+ * @desired - the desired value of state_ts, state part only;
+ * @may_fault - whether to use normal or _nofault cmpxchg.
+ *
+ * The function is basically cmpxchg(state_ts, expected, desired), with extra
+ * code to set the timestamp in @desired.
+ */
+static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
+ bool may_fault)
+{
+ u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
+ u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
+
+ /* Cut higher order bits. */
+ next_ts &= ((1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1);
+
+ if (next_ts == curr_ts)
+ ++next_ts;
+
+ /* Remove an old timestamp, if any. */
+ desired &= ((1ULL << (64 - UMCG_STATE_TIMESTAMP_BITS)) - 1);
+
+ /* Set the new timestamp. */
+ desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
+
+ if (may_fault)
+ return cmpxchg_user_64(state_ts, expected, desired);
+
+ return cmpxchg_user_64_nofault(state_ts, expected, desired);
+}
+
+/**
+ * sys_umcg_ctl: (un)register the current task as a UMCG task.
+ * @flags: ORed values from enum umcg_ctl_flag; see below;
+ * @self: a pointer to struct umcg_task that describes this
+ * task and governs the behavior of sys_umcg_wait if
+ * registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ * UMCG workers:
+ * - @flags & UMCG_CTL_WORKER
+ * UMCG servers:
+ * - !(@flags & UMCG_CTL_WORKER)
+ *
+ * All tasks:
+ * - self->state must be UMCG_TASK_RUNNING
+ * - self->next_tid must be zero
+ *
+ * If the conditions above are met, sys_umcg_ctl() immediately returns
+ * if the registered task is a server; a worker will be added to
+ * idle_workers_ptr, and the worker put to sleep; an idle server
+ * from idle_server_tid_ptr will be woken, if present.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
+ * is a UMCG worker, the userspace is responsible for waking its
+ * server (before or after calling sys_umcg_ctl).
+ *
+ * Return:
+ * 0 - success
+ * -EFAULT - failed to read @self
+ * -EINVAL - some other error occurred
+ */
+SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self)
+{
+ struct umcg_task ut;
+
+ if (flags == UMCG_CTL_UNREGISTER) {
+ if (self || !current->umcg_task)
+ return -EINVAL;
+
+ if (current->flags & PF_UMCG_WORKER)
+ umcg_handle_exiting_worker();
+ else
+ umcg_clear_task(current);
+
+ return 0;
+ }
+
+ /* Register the current task as a UMCG task. */
+ if (!(flags & UMCG_CTL_REGISTER))
+ return -EINVAL;
+
+ flags &= ~UMCG_CTL_REGISTER;
+ if (flags && flags != UMCG_CTL_WORKER)
+ return -EINVAL;
+
+ if (current->umcg_task || !self)
+ return -EINVAL;
+
+ if (copy_from_user(&ut, self, sizeof(ut)))
+ return -EFAULT;
+
+ if (ut.next_tid)
+ return -EINVAL;
+
+ if ((ut.state_ts & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_RUNNING)
+ return -EINVAL;
+
+ WRITE_ONCE(current->umcg_task, self);
+
+ if (flags == UMCG_CTL_WORKER) {
+ current->flags |= PF_UMCG_WORKER;
+
+ /* Trigger umcg_handle_resuming_worker() */
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ }
+
+ return 0;
+}
+
+/**
+ * handle_timedout_worker - make sure the worker is added to idle_workers
+ * upon a "clean" timeout.
+ */
+static int handle_timedout_worker(struct umcg_task __user *self)
+{
+ u64 curr_state, next_state;
+ int ret;
+
+ if (get_user(curr_state, &self->state_ts))
+ return -EFAULT;
+
+ if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE) {
+ /* TODO: should we care here about TF_LOCKED or TF_PREEMPTED? */
+
+ next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+ next_state |= UMCG_TASK_BLOCKED;
+
+ ret = umcg_update_state(&self->state_ts, &curr_state, next_state, true);
+ if (ret)
+ return ret;
+
+ return -ETIMEDOUT;
+ }
+
+ return 0; /* Not really timed out. */
+}
+
+/**
+ * umcg_idle_loop - sleep until the current task becomes RUNNING or a timeout
+ * @abs_timeout - absolute timeout in nanoseconds; zero => no timeout
+ *
+ * The function marks the current task as INTERRUPTIBLE and calls
+ * freezable_schedule(). It returns when either the timeout expires or
+ * the UMCG state of the task becomes RUNNING.
+ *
+ * Note: because UMCG workers should not be running WITHOUT attached servers,
+ * and because servers should not be running WITH attached workers,
+ * the function returns only on fatal signal pending and ignores/flushes
+ * all other signals.
+ */
+static int umcg_idle_loop(u64 abs_timeout)
+{
+ int ret;
+ struct page *pinned_page = NULL;
+ struct hrtimer_sleeper timeout;
+ struct umcg_task __user *self = current->umcg_task;
+
+ if (abs_timeout) {
+ hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
+ HRTIMER_MODE_ABS);
+
+ hrtimer_set_expires_range_ns(&timeout.timer, (s64)abs_timeout,
+ current->timer_slack_ns);
+ }
+
+ while (true) {
+ u64 umcg_state;
+
+ /*
+ * We need to read from userspace _after_ the task is marked
+ * TASK_INTERRUPTIBLE, to properly handle concurrent wakeups;
+ * but faulting is not allowed; so we try a fast no-fault read,
+ * and if it fails, pin the page temporarily.
+ */
+retry_once:
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ /* Order set_current_state above with get_user_nofault below. */
+ smp_mb();
+ ret = -EFAULT;
+ if (get_user_nofault(umcg_state, &self->state_ts)) {
+ set_current_state(TASK_RUNNING);
+
+ if (pinned_page)
+ goto out;
+ else if (1 != pin_user_pages_fast((unsigned long)self,
+ 1, 0, &pinned_page))
+ goto out;
+
+ goto retry_once;
+ }
+
+ if (pinned_page) {
+ unpin_user_page(pinned_page);
+ pinned_page = NULL;
+ }
+
+ ret = 0;
+ if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
+ set_current_state(TASK_RUNNING);
+ goto out;
+ }
+
+ if (abs_timeout)
+ hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
+
+ if (!abs_timeout || timeout.task) {
+ /* Clear PF_UMCG_WORKER to elide workqueue handlers. */
+ const bool worker = current->flags & PF_UMCG_WORKER;
+
+ if (worker)
+ current->flags &= ~PF_UMCG_WORKER;
+
+ freezable_schedule();
+
+ if (worker)
+ current->flags |= PF_UMCG_WORKER;
+ }
+ __set_current_state(TASK_RUNNING);
+
+ /*
+ * Check for timeout before checking the state, as workers
+ * are not going to return from schedule() unless
+ * they are RUNNING.
+ */
+ ret = -ETIMEDOUT;
+ if (abs_timeout && !timeout.task)
+ goto out;
+
+ ret = -EFAULT;
+ if (get_user(umcg_state, &self->state_ts))
+ goto out;
+
+ ret = 0;
+ if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING)
+ goto out;
+
+ ret = -EINTR;
+ if (fatal_signal_pending(current))
+ goto out;
+
+ if (signal_pending(current))
+ flush_signals(current);
+ }
+
+out:
+ if (pinned_page) {
+ unpin_user_page(pinned_page);
+ pinned_page = NULL;
+ }
+
+ if (abs_timeout) {
+ hrtimer_cancel(&timeout.timer);
+ destroy_hrtimer_on_stack(&timeout.timer);
+ }
+
+ /* Workers must go through workqueue handlers upon wakeup. */
+ if (current->flags & PF_UMCG_WORKER) {
+ if (ret == -ETIMEDOUT)
+ ret = handle_timedout_worker(self);
+
+ set_tsk_need_resched(current);
+ }
+
+ return ret;
+}
+
+/**
+ * umcg_wakeup_allowed - check whether @current can wake @tsk.
+ *
+ * Currently a placeholder that allows wakeups within a single process
+ * only (same mm). In the future the requirement will be relaxed (securely).
+ */
+static bool umcg_wakeup_allowed(struct task_struct *tsk)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ if (tsk->mm && tsk->mm == current->mm && READ_ONCE(tsk->umcg_task))
+ return true;
+
+ return false;
+}
+
+/*
+ * Try to wake up. May be called with preempt_disable set. May be called
+ * cross-process.
+ *
+ * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
+ * ordering logic.
+ */
+static int umcg_ttwu(u32 next_tid, int wake_flags)
+{
+ struct task_struct *next;
+
+ rcu_read_lock();
+ next = find_task_by_vpid(next_tid);
+ if (!next || !umcg_wakeup_allowed(next)) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+
+ /* The result of ttwu below is ignored. */
+ try_to_wake_up(next, TASK_NORMAL, wake_flags);
+ rcu_read_unlock();
+
+ return 0;
+}
+
+/*
+ * At the moment, umcg_do_context_switch simply wakes up @next with
+ * WF_CURRENT_CPU and puts the current task to sleep.
+ *
+ * In the future an optimization will be added to adjust runtime accounting
+ * so that from the kernel scheduling perspective the two tasks are
+ * essentially treated as one. In addition, the context switch may be performed
+ * right here on the fast path, instead of going through the wake/wait pair.
+ */
+static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout)
+{
+ int ret;
+
+ ret = umcg_ttwu(next_tid, WF_CURRENT_CPU);
+ if (ret)
+ return ret;
+
+ return umcg_idle_loop(abs_timeout);
+}
+
+/**
+ * sys_umcg_wait: put the current task to sleep and/or wake another task.
+ * @flags: zero or a value from enum umcg_wait_flag.
+ * @abs_timeout: when to wake the task, in nanoseconds; zero for no timeout.
+ *
+ * @self->state_ts must be UMCG_TASK_IDLE (where @self is current->umcg_task)
+ * if !(@flags & UMCG_WAIT_WAKE_ONLY).
+ *
+ * If @self->next_tid is not zero, it must point to an IDLE UMCG task.
+ * The userspace must have changed its state from IDLE to RUNNING
+ * before calling sys_umcg_wait() in the current task. This "next"
+ * task will be woken (context-switched-to on the fast path) when the
+ * current task is put to sleep.
+ *
+ * See Documentation/userspace-api/umcg.[txt|rst] for details.
+ *
+ * Return:
+ * 0 - OK;
+ * -ETIMEDOUT - the timeout expired;
+ * -EFAULT - failed accessing struct umcg_task __user of the current
+ * task;
+ * -ESRCH - the task to wake not found or not a UMCG task;
+ * -EINVAL - another error happened (e.g. bad @flags, or the current
+ * task is not a UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout)
+{
+ struct umcg_task __user *self = current->umcg_task;
+ u32 next_tid;
+
+ if (!self)
+ return -EINVAL;
+
+ if (get_user(next_tid, &self->next_tid))
+ return -EFAULT;
+
+ if (flags & UMCG_WAIT_WAKE_ONLY) {
+ if (!next_tid || abs_timeout)
+ return -EINVAL;
+
+ flags &= ~UMCG_WAIT_WAKE_ONLY;
+ if (flags & ~UMCG_WAIT_WF_CURRENT_CPU)
+ return -EINVAL;
+
+ return umcg_ttwu(next_tid, flags & UMCG_WAIT_WF_CURRENT_CPU ?
+ WF_CURRENT_CPU : 0);
+ }
+
+ /* Unlock the worker, if locked. */
+ if (current->flags & PF_UMCG_WORKER) {
+ u64 umcg_state;
+
+ if (get_user(umcg_state, &self->state_ts))
+ return -EFAULT;
+
+ if ((umcg_state & UMCG_TF_LOCKED) && umcg_update_state(
+ &self->state_ts, &umcg_state,
+ umcg_state & ~UMCG_TF_LOCKED, true))
+ return -EFAULT;
+ }
+
+ if (next_tid)
+ return umcg_do_context_switch(next_tid, abs_timeout);
+
+ return umcg_idle_loop(abs_timeout);
+}
+
+/*
+ * NOTE: all code below is called from workqueue submit/update, or
+ * syscall exit to usermode loop, so all errors result in the
+ * termination of the current task (via SIGKILL).
+ */
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server_nofault(u32 server_tid)
+{
+ struct umcg_task __user *ut_server = NULL;
+ struct task_struct *tsk;
+ int ret = -EINVAL;
+ u64 state;
+
+ rcu_read_lock();
+
+ tsk = find_task_by_vpid(server_tid);
+ /* Server/worker interaction is allowed only within the same mm. */
+ if (tsk && current->mm == tsk->mm)
+ ut_server = READ_ONCE(tsk->umcg_task);
+
+ if (!ut_server)
+ goto out_rcu;
+
+ ret = -EFAULT;
+ if (get_user_nofault(state, &ut_server->state_ts))
+ goto out_rcu;
+
+ ret = -EAGAIN;
+ if ((state & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_IDLE)
+ goto out_rcu;
+
+ ret = umcg_update_state(&ut_server->state_ts, &state,
+ UMCG_TASK_RUNNING, false);
+
+ if (ret)
+ goto out_rcu;
+
+ try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
+ ret = 0;
+
+out_rcu:
+ rcu_read_unlock();
+ return ret;
+}
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server_may_fault(u32 server_tid)
+{
+ struct umcg_task __user *ut_server = NULL;
+ struct task_struct *tsk;
+ int ret = -EINVAL;
+ u64 state;
+
+ rcu_read_lock();
+ tsk = find_task_by_vpid(server_tid);
+ if (tsk && current->mm == tsk->mm)
+ ut_server = READ_ONCE(tsk->umcg_task);
+ rcu_read_unlock();
+
+ if (!ut_server)
+ return -EINVAL;
+
+ if (get_user(state, &ut_server->state_ts))
+ return -EFAULT;
+
+ if ((state & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_IDLE)
+ return -EAGAIN;
+
+ ret = umcg_update_state(&ut_server->state_ts, &state,
+ UMCG_TASK_RUNNING, true);
+ if (ret)
+ return ret;
+
+ /*
+ * umcg_ttwu will call find_task_by_vpid again; but we cannot
+ * elide this, as we cannot do get_user() from an rcu-locked
+ * code block.
+ */
+ return umcg_ttwu(server_tid, WF_CURRENT_CPU);
+}
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server(u32 server_tid, bool may_fault)
+{
+ int ret = umcg_wake_idle_server_nofault(server_tid);
+
+ if (!ret)
+ return 0;
+
+ if (!may_fault || ret != -EFAULT)
+ return ret;
+
+ return umcg_wake_idle_server_may_fault(server_tid);
+}
+
+/*
+ * Called in sched_submit_work() context for UMCG workers. In the common case,
+ * the worker's state changes RUNNING => BLOCKED, and its server's state
+ * changes IDLE => RUNNING, and the server is ttwu-ed.
+ *
+ * Under some conditions (e.g. the worker is "locked", see
+ * /Documentation/userspace-api/umcg.[txt|rst] for more details), the
+ * function does nothing.
+ *
+ * The function is called with preempt disabled to make sure the retry_once
+ * logic below works correctly.
+ */
+static void process_sleeping_worker(struct task_struct *tsk, u32 *server_tid)
+{
+ struct umcg_task __user *ut_worker = tsk->umcg_task;
+ u64 curr_state, next_state;
+ bool retried = false;
+ u32 tid;
+ int ret;
+
+ *server_tid = 0;
+
+ if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid UMCG worker."))
+ return;
+
+ /* If the worker has no server, do nothing. */
+ if (unlikely(!tsk->pinned_umcg_server_page))
+ return;
+
+ if (get_user_nofault(curr_state, &ut_worker->state_ts))
+ goto die;
+
+ /*
+ * The userspace is allowed to concurrently change a RUNNING worker's
+ * state only once in a "short" period of time, so we retry state
+ * change at most once. As this retry block is within a
+ * preempt_disable region, "short" is truly short here.
+ *
+ * See Documentation/userspace-api/umcg.[txt|rst] for details.
+ */
+retry_once:
+ if (curr_state & UMCG_TF_LOCKED)
+ return;
+
+ if (WARN_ONCE((curr_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING,
+ "Unexpected UMCG worker state."))
+ goto die;
+
+ next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+ next_state |= UMCG_TASK_BLOCKED;
+
+ ret = umcg_update_state(&ut_worker->state_ts, &curr_state, next_state, false);
+ if (ret == -EAGAIN) {
+ if (retried)
+ goto die;
+
+ retried = true;
+ goto retry_once;
+ }
+ if (ret)
+ goto die;
+
+ if (get_user_nofault(tid, &ut_worker->next_tid))
+ goto die;
+
+ *server_tid = tid;
+ return;
+
+die:
+ pr_warn("%s: killing task %d\n", __func__, current->pid);
+ force_sig(SIGKILL);
+}
+
+/* Called from sched_submit_work(). Must not fault/sleep. */
+void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+ u32 server_tid;
+
+ /*
+ * Disable preemption so that retry_once in process_sleeping_worker
+ * works properly.
+ */
+ preempt_disable();
+ process_sleeping_worker(tsk, &server_tid);
+ preempt_enable();
+
+ if (server_tid) {
+ int ret = umcg_wake_idle_server_nofault(server_tid);
+
+ if (ret && ret != -EAGAIN)
+ goto die;
+ }
+
+ goto out;
+
+die:
+ pr_warn("%s: killing task %d\n", __func__, current->pid);
+ force_sig(SIGKILL);
+out:
+ umcg_unpin_pages();
+}
+
+/**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
+ *
+ * Returns true on success, false on a fatal failure.
+ *
+ * See Documentation/userspace-api/umcg.[txt|rst] for details.
+ */
+static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+{
+ u64 __user *node = &ut_worker->idle_workers_ptr;
+ u64 __user *head_ptr;
+ u64 first = (u64)node;
+ u64 head;
+
+ if (get_user(head, node) || !head)
+ return false;
+
+ head_ptr = (u64 __user *)head;
+
+ if (put_user(UMCG_IDLE_NODE_PENDING, node))
+ return false;
+
+ if (xchg_user_64(head_ptr, &first))
+ return false;
+
+ if (put_user(first, node))
+ return false;
+
+ return true;
+}
+
+/**
+ * get_idle_server - retrieve an idle server, if present.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+static bool get_idle_server(struct umcg_task __user *ut_worker, u32 *server_tid)
+{
+ u64 server_tid_ptr;
+ u32 tid;
+
+ /* Empty result is OK. */
+ *server_tid = 0;
+
+ if (get_user(server_tid_ptr, &ut_worker->idle_server_tid_ptr))
+ return false;
+
+ if (!server_tid_ptr)
+ return false;
+
+ tid = 0;
+ if (xchg_user_32((u32 __user *)server_tid_ptr, &tid))
+ return false;
+
+ *server_tid = tid;
+ return true;
+}
+
+/*
+ * Returns true to wait for the userspace to schedule this worker, false
+ * to return to the userspace.
+ *
+ * In the common case, a BLOCKED worker is marked IDLE and enqueued
+ * to idle_workers_ptr list. The idle server is woken (if present).
+ *
+ * If a RUNNING worker is preempted, this function will trigger, in which
+ * case the worker is moved to IDLE state and its server is woken.
+ *
+ * Sets @server_tid to point to the server to be woken if the worker
+ * is going to sleep; sets @server_tid to point to the server assigned
+ * to this RUNNING worker if the worker is to return to the userspace.
+ */
+static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid)
+{
+ struct umcg_task __user *ut_worker = tsk->umcg_task;
+ u64 curr_state, next_state;
+
+ *server_tid = 0;
+
+ if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+ return false;
+
+ if (fatal_signal_pending(tsk))
+ return false;
+
+ if (get_user(curr_state, &ut_worker->state_ts))
+ goto die;
+
+ if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
+ u32 tid;
+
+ /* Wakeup: wait but don't enqueue. */
+ if (curr_state & UMCG_TF_LOCKED)
+ return true;
+
+ smp_rmb(); /* Order getting state and getting server_tid */
+ if (get_user(tid, &ut_worker->next_tid))
+ goto die;
+
+ if (tid) {
+ *server_tid = tid;
+
+ /* pass-through: RUNNING with a server. */
+ if (!(curr_state & UMCG_TF_PREEMPTED))
+ return false;
+ } else if (curr_state & UMCG_TF_PREEMPTED)
+ /* PREEMPTED workers must have servers. */
+ goto die;
+
+ /*
+ * Fallthrough to mark the worker IDLE: the worker is
+ * PREEMPTED, or the worker is RUNNING, but has no server
+ * (which happens via UMCG_WAIT_WAKE_ONLY).
+ */
+ } else if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
+ (curr_state & UMCG_TF_LOCKED)))
+ /* The worker prepares to sleep or to unregister. */
+ return false;
+
+ if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE))
+ goto die;
+
+ next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+ next_state |= UMCG_TASK_IDLE;
+
+ if (umcg_update_state(&ut_worker->state_ts, &curr_state,
+ next_state, true))
+ goto die;
+
+ if (!enqueue_idle_worker(ut_worker))
+ goto die;
+
+ smp_mb(); /* Order enqueuing the worker with getting the server. */
+ if (!(*server_tid) && !get_idle_server(ut_worker, server_tid))
+ goto die;
+
+ return true;
+
+die:
+ pr_warn("umcg_process_waking_worker: killing task %d\n", current->pid);
+ force_sig(SIGKILL);
+ return false;
+}
+
+/*
+ * Called from sched_update_worker(): defer all work until later, as
+ * sched_update_worker() may be called with in-kernel locks held.
+ */
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+ set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
+}
+
+/* Called via TIF_NOTIFY_RESUME flag from exit_to_user_mode_loop. */
+void umcg_handle_resuming_worker(void)
+{
+ u32 server_tid;
+
+ /* Avoid recursion by removing PF_UMCG_WORKER */
+ current->flags &= ~PF_UMCG_WORKER;
+
+ do {
+ bool should_wait;
+
+ should_wait = process_waking_worker(current, &server_tid);
+
+ if (!should_wait)
+ break;
+
+ if (server_tid) {
+ int ret = umcg_wake_idle_server(server_tid, true);
+
+ if (ret && ret != -EAGAIN)
+ goto die;
+ }
+
+ umcg_idle_loop(0);
+ } while (true);
+
+ if (!server_tid)
+ /* No server => no reason to pin pages. */
+ umcg_unpin_pages();
+ else if (umcg_pin_pages(server_tid))
+ goto die;
+
+ goto out;
+
+die:
+ pr_warn("%s: killing task %d\n", __func__, current->pid);
+ force_sig(SIGKILL);
+out:
+ current->flags |= PF_UMCG_WORKER;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index f43d89d92860..682261d78ee7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -272,6 +272,10 @@ COND_SYSCALL(landlock_create_ruleset);
COND_SYSCALL(landlock_add_rule);
COND_SYSCALL(landlock_restrict_self);
+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+
/* arch/example/kernel/sys_example.c */
/* mm/fadvise.c */
--
2.25.1
Document User Managed Concurrency Groups syscalls, data structures,
state transitions, etc.
Signed-off-by: Peter Oskolkov <[email protected]>
---
Documentation/userspace-api/umcg.rst | 611 +++++++++++++++++++++++++++
1 file changed, 611 insertions(+)
create mode 100644 Documentation/userspace-api/umcg.rst
diff --git a/Documentation/userspace-api/umcg.rst b/Documentation/userspace-api/umcg.rst
new file mode 100644
index 000000000000..04206490fea2
--- /dev/null
+++ b/Documentation/userspace-api/umcg.rst
@@ -0,0 +1,611 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+UMCG Userspace API
+=====================================
+
+User Managed Concurrency Groups (UMCG) is an M:N threading
+subsystem/toolkit that lets user space application developers
+implement in-process user space schedulers.
+
+.. contents:: :local:
+
+Why? Heterogeneous in-process workloads
+=======================================
+The Linux kernel's CFS scheduler is designed for the "common" use case,
+with efficiency/throughput in mind. Work isolation and workloads of
+different "urgency" are addressed by tools such as cgroups, CPU
+affinity, priorities, etc., which are difficult or impossible to
+efficiently use in-process.
+
+For example, a single DBMS process may receive tens of thousands of
+requests per second; some of these requests may have strong response
+latency requirements as they serve live user requests (e.g. login
+authentication); some of these requests may not care much about
+latency but must be served within a certain time period (e.g. an
+hourly aggregate usage report); some of these requests are to be
+served only on a best-effort basis and can be NACKed under high load
+(e.g. an exploratory research/hypothesis testing workload).
+
+Beyond different work item latency/throughput requirements as outlined
+above, the DBMS may need to provide certain guarantees to different
+users; for example, user A may "reserve" 1 CPU for their
+high-priority/low latency requests, 2 CPUs for mid-level throughput
+workloads, and be allowed to send as many best-effort requests as
+possible, which may or may not be served, depending on the DBMS load.
+Besides, the best-effort work, started when the load was low, may need
+to be delayed if suddenly a large amount of higher-priority work
+arrives. With hundreds or thousands of users like this, it is very
+difficult to guarantee the application's responsiveness using standard
+Linux tools while maintaining high CPU utilization.
+
+Gaming is another use case: some in-process work must be completed
+before a certain deadline dictated by frame rendering schedule, while
+other work items can be delayed; some work may need to be
+cancelled/discarded because the deadline has passed; etc.
+
+User Managed Concurrency Groups is an M:N threading toolkit that
+allows constructing user space schedulers designed to efficiently
+manage heterogeneous in-process workloads described above while
+maintaining high CPU utilization (95%+).
+
+Requirements
+============
+One relatively established way to design high-efficiency, low-latency
+systems is to split all work into small on-cpu work items, with
+asynchronous I/O and continuations, all executed on a thread pool with
+the number of threads not exceeding the number of available CPUs.
+Although this approach works, it is quite difficult to develop and
+maintain such a system, as, for example, small continuations are
+difficult to piece together when debugging. Besides, such asynchronous
+callback-based systems tend to be somewhat cache-inefficient, as
+continuations can get scheduled on any CPU regardless of cache
+locality.
+
+M:N threading and cooperative user space scheduling enables controlled
+CPU usage (minimal OS preemption), synchronous coding style, and
+better cache locality.
+
+Specifically:
+
+- a variable/fluctuating number M of "application" threads should be
+ "scheduled over" a relatively fixed number N of "kernel" threads,
+ where N is less than or equal to the number of CPUs available;
+- only those application threads that are attached to kernel threads
+ are scheduled "on CPU";
+- application threads should be able to cooperatively yield to each other;
+- when an application thread blocks in kernel (e.g. in I/O), this
+ becomes a scheduling event ("block") that the userspace scheduler
+ should be able to efficiently detect, and reassign a waiting
+ application thread to the freed "kernel" thread;
+- when a blocked application thread wakes (e.g. its I/O operation
+ completes), this event ("wake") should also be detectable by the
+ userspace scheduler, which should be able to either quickly dispatch
+ the newly woken thread to an idle "kernel" thread or, if all "kernel"
+ threads are busy, put it in the waiting queue;
+- in addition to the above, it would be extremely useful for a
+ separate in-process "watchdog" facility to be able to monitor the
+ state of each of the M+N threads, and to intervene in case of runaway
+ workloads (interrupt/preempt).
+
+
+UMCG kernel API
+===============
+Based on the requirements above, the UMCG *kernel* API is built around
+the following ideas:
+
+- *UMCG server*: a task/thread representing "kernel threads", or CPUs
+ from the requirements above;
+- *UMCG worker*: a task/thread representing "application threads", to
+ be scheduled over servers;
+- UMCG *task state*: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG
+ task (a server or a worker) can be in;
+- UMCG task *state flag*: LOCKED, PREEMPTED: additional state flags
+ that can be ORed with the task state to communicate additional information
+ to the kernel;
+- ``struct umcg_task``: a per-task userspace set of data fields, usually
+ residing in the TLS, that fully reflects the current task's UMCG
+ state and controls the way the kernel manages the task;
+- ``sys_umcg_ctl()``: a syscall used to register the current task/thread
+ as a server or a worker, or to unregister a UMCG task;
+- ``sys_umcg_wait()``: a syscall used to put the current task to
+ sleep and/or wake another task, potentially context-switching
+ between the two tasks on-CPU synchronously.
+
+
+Servers
+=======
+
+When a task/thread is registered as a server, it is in RUNNING
+state and behaves like any other normal task/thread. In addition,
+servers can interact with other UMCG tasks via sys_umcg_wait():
+
+- servers can voluntarily suspend their execution (wait), becoming IDLE;
+- servers can wake other IDLE servers;
+- servers can context-switch between each other.
+
+Note that if a server blocks in the kernel *not* via sys_umcg_wait(),
+it still retains its RUNNING state.
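+
+A sketch of one server waking an IDLE peer (illustration only;
+``umcg_update_state_user()`` stands in for a hypothetical userspace
+helper that does a cmpxchg on ``state_ts`` and stamps a fresh timestamp,
+mirroring the kernel's ``umcg_update_state()``):
+
+.. code-block:: C
+
+   /* On the waking server; peer/peer_tid identify the IDLE server. */
+   static long wake_idle_server(struct umcg_task *self,
+                                struct umcg_task *peer, uint32_t peer_tid)
+   {
+           /* The party initiating the transition flips the state. */
+           if (umcg_update_state_user(&peer->state_ts,
+                                      UMCG_TASK_IDLE, UMCG_TASK_RUNNING))
+                   return -1;
+
+           self->next_tid = peer_tid;
+           /* Wake next_tid without putting the caller to sleep. */
+           return syscall(__NR_umcg_wait, UMCG_WAIT_WAKE_ONLY, 0);
+   }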
+
+
+Workers
+=======
+
+A worker cannot be RUNNING without having a server associated
+with it, so when a task is first registered as a worker, it enters
+the IDLE state.
+
+- a worker becomes RUNNING when a server calls sys_umcg_wait to
+ context-switch into it; the server goes IDLE, and the worker becomes
+ RUNNING in its place;
+- when a running worker blocks in the kernel, it becomes BLOCKED,
+ its associated server becomes RUNNING and the server's
+ sys_umcg_wait() call from the bullet above returns; this transition
+ is sometimes called "block detection";
+- when the blocking operation of a BLOCKED worker completes, the worker
+ becomes IDLE and is added to the list of idle workers; if there
+ is an idle server waiting, the kernel wakes it; this transition
+ is sometimes called "wake detection";
+- running workers can voluntarily suspend their execution (wait),
+ becoming IDLE; their associated servers are woken;
+- a RUNNING worker can context-switch with an IDLE worker; the server
+ of the switched-out worker is transferred to the switched-in worker;
+- any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
+ this is a server running the worker as described in the first bullet
+ in this list, the worker remains IDLE but is added to the idle workers
+ list; this "wake" operation exists for completeness, to make sure
+ wait/wake/context-switch operations are available for all UMCG tasks;
+- the userspace can preempt a RUNNING worker by marking it
+ ``RUNNING|PREEMPTED`` and sending a signal to it; the userspace should
+ have installed a NOP signal handler for the signal; the kernel will
+ then transition the worker into ``IDLE|PREEMPTED`` state and wake
+ its associated server.
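+
+A preemption sketch (illustration only; the preemption signal is the
+scheduler's choice, a no-op handler for it is assumed to have been
+installed via ``sigaction()``, and ``umcg_update_state_user()`` is the
+same hypothetical helper as above):
+
+.. code-block:: C
+
+   /* Mark a RUNNING worker RUNNING|PREEMPTED and kick it with a signal;
+    * the kernel then moves it to IDLE|PREEMPTED and wakes its server. */
+   static long preempt_worker(struct umcg_task *worker, uint32_t worker_tid)
+   {
+           if (umcg_update_state_user(&worker->state_ts, UMCG_TASK_RUNNING,
+                                      UMCG_TASK_RUNNING | UMCG_TF_PREEMPTED))
+                   return -1;  /* Not RUNNING, or lost a race. */
+
+           return syscall(SYS_tgkill, getpid(), worker_tid, SIGUSR1);
+   }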
+
+UMCG task states
+================
+
+Important: all state transitions described below involve at least
+two steps: the change of the state field in ``struct umcg_task``,
+for example ``RUNNING`` to ``IDLE``, and the corresponding change in
+``struct task_struct`` state, for example a transition between the task
+running on CPU and being descheduled and removed from the kernel runqueue.
+The key principle of UMCG API design is that the party initiating
+the state transition modifies the state variable.
+
+For example, a task going ``IDLE`` first changes its state from ``RUNNING``
+to ``IDLE`` in the userspace and then calls ``sys_umcg_wait()``, which
+completes the transition.
+
+Note on documentation: in ``include/uapi/linux/umcg.h``, task states
+have the form ``UMCG_TASK_RUNNING``, ``UMCG_TASK_BLOCKED``, etc. In
+this document these are usually referred to simply as ``RUNNING`` and
+``BLOCKED``, unless it creates ambiguity. Task state flags, e.g.
+``UMCG_TF_PREEMPTED``, are treated similarly.
+
+UMCG task states reflect the view from the userspace, rather than from
+the kernel. There are three fundamental task states:
+
+- ``RUNNING``: indicates that the task is schedulable by the kernel; applies
+ to both servers and workers;
+- ``IDLE``: indicates that the task is *not* schedulable by the kernel
+ (see ``umcg_idle_loop()`` in ``kernel/sched/umcg.c``); applies to
+ both servers and workers;
+- ``BLOCKED``: indicates that the worker is blocked in the kernel;
+ does not apply to servers.
+
+In addition to the three states above, two state flags help with
+state transitions:
+
+- ``LOCKED``: the userspace is preparing the worker for a state transition
+ and "locks" the worker until the worker is ready for the kernel to
+ act on the state transition; used similarly to preempt_disable or
+ irq_disable in the kernel; applies only to workers in ``RUNNING`` or
+ ``IDLE`` state; ``RUNNING|LOCKED`` means "this worker is about to
+ become ``RUNNING``", while ``IDLE|LOCKED`` means "this worker is about
+ to become ``IDLE`` or unregister";
+- ``PREEMPTED``: the userspace indicates it wants the worker to be
+ preempted; there are no situations when both ``LOCKED`` and ``PREEMPTED``
+ flags are set at the same time.
+
+struct umcg_task
+================
+
+From ``include/uapi/linux/umcg.h``:
+
+.. code-block:: C
+
+ struct umcg_task {
+ uint64_t state_ts; /* r/w */
+ uint32_t next_tid; /* r */
+ uint32_t flags; /* reserved */
+ uint64_t idle_workers_ptr; /* r/w */
+        uint64_t idle_server_tid_ptr; /* r */
+ };
+
+Each UMCG task is identified by ``struct umcg_task``, which is provided
+to the kernel when the task is registered via ``sys_umcg_ctl()``.
+
+- ``uint64_t state_ts``: the current state of the task this struct
+ identifies, as described in the previous section, combined with a
+ unique timestamp indicating when the last state change happened.
+
+ Readable/writable by both the kernel and the userspace.
+
+ - bits 0 - 5: task state (RUNNING, IDLE, BLOCKED);
+ - bits 6 - 7: state flags (LOCKED, PREEMPTED);
+ - bits 8 - 12: reserved; must be zeroes;
+ - bits 13 - 17: for userspace use;
+ - bits 18 - 63: timestamp.
+
+ Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
+
+  It is highly beneficial to tag each state change with a unique
+ timestamp:
+
+ - timestamps will naturally provide instrumentation to measure
+ scheduling delays, both in the kernel and in the userspace;
+  - uniqueness of timestamps (modulo overflow) guarantees that state
+ change races, especially ABA races, are easily detected and avoided.
+
+ Each timestamp represents the moment in time the state change happened,
+  in nanoseconds, with the lower 4 bits and the upper 14 bits stripped.
+
+  In this document ``umcg_task.state`` is often used to refer to the
+  ``umcg_task.state_ts`` field, as timestamps do not carry semantic
+ meaning at the moment.
+
+ This is how umcg_task.state_ts is updated in the kernel:
+
+ .. code-block:: C
+
+ /* kernel side */
+ /**
+ * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
+ * @state_ts - points to the state_ts member of struct umcg_task to update;
+ * @expected - the expected value of state_ts, including the timestamp;
+ * @desired - the desired value of state_ts, state part only;
+ * @may_fault - whether to use normal or _nofault cmpxchg.
+ *
+ * The function is basically cmpxchg(state_ts, expected, desired), with extra
+ * code to set the timestamp in @desired.
+ */
+ static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
+ bool may_fault)
+ {
+ u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
+ u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
+
+ /* Cut higher order bits. */
+ next_ts &= ((1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1);
+
+ if (next_ts == curr_ts)
+ ++next_ts;
+
+ /* Remove an old timestamp, if any. */
+ desired &= ((1ULL << (64 - UMCG_STATE_TIMESTAMP_BITS)) - 1);
+
+ /* Set the new timestamp. */
+ desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
+
+ if (may_fault)
+ return cmpxchg_user_64(state_ts, expected, desired);
+
+ return cmpxchg_user_64_nofault(state_ts, expected, desired);
+ }
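+
+  A userspace counterpart might look like the sketch below (not part of
+  the patchset; the constants and helpers are local to this document,
+  mirror the bit layout above, and are reused by later sketches):
+
+  .. code-block:: C
+
+      /* userspace side; a sketch only */
+      /* assumes <stdint.h>, <stdbool.h>, <time.h> */
+      #define UMCG_SKETCH_TS_SHIFT  18  /* the timestamp lives in bits 18-63 */
+      #define UMCG_SKETCH_TS_GRAN   4   /* 16ns resolution */
+      #define UMCG_SKETCH_LOW_MASK  ((1ULL << UMCG_SKETCH_TS_SHIFT) - 1)
+
+      /* Compose a state+flags value with a fresh CLOCK_MONOTONIC timestamp. */
+      static uint64_t umcg_make_state(uint64_t state)
+      {
+              struct timespec now;
+              uint64_t ts;
+
+              clock_gettime(CLOCK_MONOTONIC, &now);
+              ts = (uint64_t)now.tv_sec * 1000000000ULL + (uint64_t)now.tv_nsec;
+              ts >>= UMCG_SKETCH_TS_GRAN;
+
+              return (state & UMCG_SKETCH_LOW_MASK) | (ts << UMCG_SKETCH_TS_SHIFT);
+      }
+
+      /*
+       * Compare-and-swap the state+flags bits; the timestamp is ignored on
+       * the compare and refreshed on the swap, loosely mirroring the
+       * kernel's umcg_update_state() above. Hypothetical helper.
+       */
+      static bool umcg_cas_state(uint64_t *state_ts, uint64_t from, uint64_t to)
+      {
+              uint64_t prev = __atomic_load_n(state_ts, __ATOMIC_ACQUIRE);
+
+              if ((prev & UMCG_SKETCH_LOW_MASK) != from)
+                      return false;
+
+              return __atomic_compare_exchange_n(state_ts, &prev,
+                                                 umcg_make_state(to), false,
+                                                 __ATOMIC_SEQ_CST,
+                                                 __ATOMIC_SEQ_CST);
+      }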
+
+- ``uint32_t next_tid``: contains the TID of the task to context-switch-into
+ in ``sys_umcg_wait()``; can be zero; writable by the userspace, readable
+ by the kernel; if this is a RUNNING worker, this field contains
+ the TID of the server that should be woken when this worker blocks;
+ see ``sys_umcg_wait()`` for more details;
+
+- ``uint32_t flags``: reserved; must be zero.
+
+- ``uint64_t idle_workers_ptr``: this field forms a single-linked list
+ of idle workers: all RUNNING workers have this field set to point
+ to the head of the list (a pointer variable in the userspace).
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from ``BLOCKED`` to ``IDLE`` and adds the worker
+ to the top of the list of idle workers using this logic:
+
+ .. code-block:: C
+
+ /* kernel side */
+ /**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+ static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+ {
+ u64 __user *node = &ut_worker->idle_workers_ptr;
+ u64 __user *head_ptr;
+ u64 first = (u64)node;
+ u64 head;
+
+ if (get_user_nosleep(head, node) || !head)
+ return false;
+
+ head_ptr = (u64 __user *)head;
+
+ if (put_user_nosleep(UMCG_IDLE_NODE_PENDING, node))
+ return false;
+
+ if (xchg_user_64(head_ptr, &first))
+ return false;
+
+ if (put_user_nosleep(first, node))
+ return false;
+
+ return true;
+ }
+
+
+ In the userspace the list is cleared atomically using this logic:
+
+ .. code-block:: C
+
+ /* userspace side */
+        /* 'head' points to the list head variable. */
+        uint64_t *idle_workers;
+
+        idle_workers = (uint64_t *)atomic_exchange(head, NULL);
+
+ The userspace re-points workers' idle_workers_ptr to the list head
+ variable before the worker is allowed to become RUNNING again.
+
+ When processing the idle workers list, the userspace should wait for
+ workers marked as UMCG_IDLE_NODE_PENDING to have the flag cleared
+ (see ``enqueue_idle_worker()`` above).
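+
+  A minimal sketch of the userspace side of this protocol is shown below
+  (not part of the patchset): ``UMCG_IDLE_NODE_PENDING`` is assumed to come
+  from ``include/uapi/linux/umcg.h``, ``schedule_worker()`` is a
+  hypothetical callback into the userspace scheduler, and the head variable
+  is assumed to start at zero:
+
+  .. code-block:: C
+
+      /* userspace side; a sketch only */
+      /* assumes <stddef.h>, <stdint.h>, <linux/umcg.h> */
+      static void drain_idle_workers(uint64_t *head)
+      {
+              /* Detach the whole stack; the kernel keeps pushing new nodes. */
+              uint64_t node = __atomic_exchange_n(head, 0, __ATOMIC_ACQ_REL);
+
+              while (node) {
+                      uint64_t *link = (uint64_t *)node; /* &worker->idle_workers_ptr */
+                      struct umcg_task *worker = (struct umcg_task *)((char *)link -
+                                      offsetof(struct umcg_task, idle_workers_ptr));
+                      uint64_t next;
+
+                      /* Wait for the kernel to finish enqueue_idle_worker(). */
+                      do {
+                              next = __atomic_load_n(link, __ATOMIC_ACQUIRE);
+                      } while (next == UMCG_IDLE_NODE_PENDING);
+
+                      /* Re-point the node at the head variable before the
+                       * worker is allowed to become RUNNING again. */
+                      __atomic_store_n(link, (uint64_t)head, __ATOMIC_RELEASE);
+
+                      schedule_worker(worker); /* hypothetical */
+                      node = next;
+              }
+      }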
+
+- ``uint64_t idle_server_tid_ptr``: points to a variable in the
+ userspace that points to an idle server, i.e. a server in IDLE state waiting
+ in sys_umcg_wait(); read-only; workers must have this field set; not used
+ in servers.
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from ``BLOCKED`` to ``IDLE``, adds the worker
+ to the list of idle workers, and wakes the idle server if present;
+ the kernel atomically exchanges ``(*idle_server_tid_ptr)`` with 0,
+ thus waking the idle server, if present, only once.
+ See `State transitions`_ below for more details.
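+
+  For example, a server with no work to do might advertise itself via this
+  field before going to sleep (a sketch, not part of the patchset;
+  ``umcg_cas_state()`` is the hypothetical helper sketched under
+  ``state_ts`` above, and ``__NR_umcg_wait`` is assumed to be the syscall
+  number added by this patchset):
+
+  .. code-block:: C
+
+      /* userspace side; a sketch only; one possible ordering */
+      static void server_wait_for_events(struct umcg_task *self,
+                                         uint32_t self_tid,
+                                         uint64_t *idle_server_tid)
+      {
+              /* The initiating party changes the state: RUNNING => IDLE. */
+              if (!umcg_cas_state(&self->state_ts, UMCG_TASK_RUNNING,
+                                  UMCG_TASK_IDLE))
+                      return;
+
+              /* Advertise this server as the idle server workers point at. */
+              __atomic_store_n(idle_server_tid, (uint64_t)self_tid,
+                               __ATOMIC_RELEASE);
+
+              /* Sleep; sys_umcg_wait() returns once the kernel (wake
+               * detection) or another task marks this server RUNNING. */
+              syscall(__NR_umcg_wait, 0, 0);
+      }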
+
+sys_umcg_ctl()
+==============
+
+``int sys_umcg_ctl(uint32_t flags, struct umcg_task *self)`` is used to
+register or unregister the current task as a worker or server. Flags
+can be one of the following:
+
+- ``UMCG_CTL_REGISTER``: register a server;
+- ``UMCG_CTL_REGISTER | UMCG_CTL_WORKER``: register a worker;
+- ``UMCG_CTL_UNREGISTER``: unregister the current server or worker.
+
+When registering a task, ``self`` must point to ``struct umcg_task``
+describing this server or worker; the pointer must remain valid until
+the task is unregistered.
+
+When registering a server, ``self->state`` must be ``RUNNING``; all other
+fields in ``self`` must be zeroes.
+
+When registering a worker, ``self->state`` must be ``RUNNING``;
+``self->idle_server_tid_ptr`` and ``self->idle_workers_ptr`` must be
+valid pointers as described in `struct umcg_task`_; ``self->next_tid`` must
+be zero.
+
+When unregistering a task, ``self`` must be ``NULL``.
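+
+Below is a registration sketch (not part of the patchset). It assumes the
+definitions from ``include/uapi/linux/umcg.h`` and that the syscall numbers
+added by this patchset are visible as ``__NR_umcg_ctl``/``__NR_umcg_wait``;
+since this document does not specify whether the timestamp bits of
+``state_ts`` must be populated at registration time, they are left at zero:
+
+.. code-block:: C
+
+    /* userspace side; a sketch only */
+    #include <stdint.h>
+    #include <string.h>
+    #include <unistd.h>
+    #include <sys/syscall.h>
+    #include <linux/umcg.h>
+
+    /* Usually lives in TLS and must stay valid until unregistration. */
+    static __thread struct umcg_task umcg_self;
+
+    static long register_server(void)
+    {
+            memset(&umcg_self, 0, sizeof(umcg_self));
+            umcg_self.state_ts = UMCG_TASK_RUNNING;
+
+            return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &umcg_self);
+    }
+
+    static long register_worker(uint64_t *idle_workers_head,
+                                uint64_t *idle_server_tid)
+    {
+            memset(&umcg_self, 0, sizeof(umcg_self));
+            umcg_self.state_ts = UMCG_TASK_RUNNING;
+            umcg_self.idle_workers_ptr = (uint64_t)idle_workers_head;
+            umcg_self.idle_server_tid_ptr = (uint64_t)idle_server_tid;
+
+            return syscall(__NR_umcg_ctl,
+                           UMCG_CTL_REGISTER | UMCG_CTL_WORKER, &umcg_self);
+    }
+
+    static long unregister_current(void)
+    {
+            return syscall(__NR_umcg_ctl, UMCG_CTL_UNREGISTER, NULL);
+    }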
+
+sys_umcg_wait()
+===============
+
+``int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout)`` operates
+on registered UMCG servers and workers: ``struct umcg_task *self`` provided
+to ``sys_umcg_ctl()`` when registering the current task is consulted
+in addition to ``flags`` and ``abs_timeout`` parameters.
+
+The function can be used to perform one of the three operations:
+
+- wait: if ``self->next_tid`` is zero, ``sys_umcg_wait()`` puts the current
+ task to sleep;
+- wake: if ``self->next_tid`` is not zero, and ``flags & UMCG_WAIT_WAKE_ONLY``,
+ the task identified by ``next_tid`` is woken;
+- context switch: if ``self->next_tid`` is not zero, and
+ ``!(flags & UMCG_WAIT_WAKE_ONLY)``, the current task is put to sleep and
+ the next task is woken, synchronously switching between the tasks on the
+ current CPU on the fast path.
+
+Flags can be zero or a combination of the following values:
+
+- ``UMCG_WAIT_WAKE_ONLY``: wake the next task, don't put the current task
+ to sleep;
+- ``UMCG_WAIT_WF_CURRENT_CPU``: wake the next task on the current CPU;
+  this flag has an effect only if ``UMCG_WAIT_WAKE_ONLY`` is set: context
+  switching is always attempted on the current CPU.
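+
+For illustration, below is a sketch (not part of the patchset) of a
+wake-only operation: one server waking another, currently IDLE, server.
+``umcg_cas_state()`` is the hypothetical helper sketched in the
+`struct umcg_task`_ section, ``__NR_umcg_wait`` is assumed to be the syscall
+number added by this patchset, and, per the `State transitions`_ rules, the
+waker (the initiating party) updates the wakee's state before making the
+syscall:
+
+.. code-block:: C
+
+    /* userspace side; a sketch only */
+    static long wake_idle_server(struct umcg_task *self,
+                                 struct umcg_task *target, uint32_t target_tid)
+    {
+            /* The initiating party flips the wakee: IDLE => RUNNING. */
+            if (!umcg_cas_state(&target->state_ts, UMCG_TASK_IDLE,
+                                UMCG_TASK_RUNNING))
+                    return -1;
+
+            self->next_tid = target_tid;
+
+            /* Wake only: the current task keeps running. Dropping the flag
+             * (and first marking self IDLE) would instead request a
+             * synchronous context switch into the target. */
+            return syscall(__NR_umcg_wait, UMCG_WAIT_WAKE_ONLY, 0);
+    }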
+
+The section below provides more details on how servers and workers interact
+via ``sys_umcg_wait()``, during worker block/wake events, and during
+worker preemption.
+
+State transitions
+=================
+
+As mentioned above, the key principle of UMCG state transitions is that
+**the party initiating the state transition modifies the state of affected
+tasks**.
+
+Below, "``TASK:STATE``" indicates a task T, where T can be either W for
+worker or S for server, in state S, where S can be one of the three states,
+potentially ORed with a state flag. Each individual state transition
+is an atomic operation (cmpxchg) unless indicated otherwise. Also note
+that **the order of state transitions is important and is part of the
+contract between the userspace and the kernel. The kernel is free
+to kill the task (SIGKILL) if the contract is broken.**
+
+Some worker state transitions below include adding ``LOCKED`` flag to
+worker state. This is done to indicate to the kernel that the worker
+is transitioning state and should not participate in the block/wake
+detection routines, which can happen due to interrupts/pagefaults/signals.
+
+``IDLE|LOCKED`` means that a running worker is preparing to sleep, so
+interrupts should not lead to server wakeup; ``RUNNING|LOCKED`` means that
+an idle worker is going to be "scheduled to run", but may not yet have its
+server set up properly.
+
+Key state transitions:
+
+- server to worker context switch ("schedule a worker to run"):
+  ``S:RUNNING+W:IDLE => S:IDLE+W:RUNNING`` (a userspace sketch of this
+  sequence is shown after this list):
+
+ - in the userspace, in the context of the server S running:
+
+ - ``S:RUNNING => S:IDLE`` (mark self as idle)
+ - ``W:IDLE => W:RUNNING|LOCKED`` (mark the worker as running)
+ - ``W.next_tid := S.tid; S.next_tid := W.tid``
+ (link the server with the worker)
+ - ``W:RUNNING|LOCKED => W:RUNNING`` (unlock the worker)
+ - ``S: sys_umcg_wait()`` (make the syscall)
+
+ - the kernel context switches from the server to the worker; the server
+ sleeps until it becomes ``RUNNING`` during one of the transitions below;
+
+- worker to server context switch (worker "yields"):
+ ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE``:
+
+ - in the userspace, in the context of the worker W running (note that
+ a running worker has its ``next_tid`` set to point to its server):
+
+ - ``W:RUNNING => W:IDLE|LOCKED`` (mark self as idle)
+ - ``S:IDLE => S:RUNNING`` (mark the server as running)
+ - ``W: sys_umcg_wait()`` (make the syscall)
+
+ - the kernel removes the ``LOCKED`` flag from the worker's state and
+ context switches from the worker to the server; the worker
+ sleeps until it becomes ``RUNNING``;
+
+- worker to worker context switch:
+ ``W1:RUNNING+W2:IDLE => W1:IDLE+W2:RUNNING``:
+
+ - in the userspace, in the context of W1 running:
+
+ - ``W2:IDLE => W2:RUNNING|LOCKED`` (mark W2 as running)
+ - ``W1:RUNNING => W1:IDLE|LOCKED`` (mark self as idle)
+ - ``W2.next_tid := W1.next_tid; S.next_tid := W2.tid``
+ (transfer the server W1 => W2)
+    - ``W1.next_tid := W2.tid`` (indicate that W1 should
+ context-switch into W2)
+ - ``W2:RUNNING|LOCKED => W2:RUNNING`` (unlock W2)
+ - ``W1: sys_umcg_wait()`` (make the syscall)
+
+ - same as above, the kernel removes the ``LOCKED`` flag from the W1's state
+ and context switches to next_tid;
+
+- worker wakeup: ``W:IDLE => W:RUNNING``:
+
+ - in the userspace, a server S can wake a worker W without "running" it:
+
+    - ``S.next_tid := W.tid``
+    - ``W.next_tid := 0``
+ - ``W:IDLE => W:RUNNING``
+ - ``sys_umcg_wait(UMCG_WAIT_WAKE_ONLY)`` (make the syscall)
+
+ - the kernel will wake the worker W; as the worker does not have a server
+ assigned, "wake detection" will happen, the worker will be immediately
+    marked as ``IDLE`` and added to the idle workers list; an idle server, if any,
+ will be woken (see 'wake detection' below);
+ - Note: if needed, it is possible for a worker to wake another worker:
+ the waker marks itself "IDLE|LOCKED", points its next_tid to the wakee,
+ makes the syscall, restores its server in next_tid, marks itself
+ as ``RUNNING``.
+
+- block detection: worker blocks in the kernel: ``S:IDLE+W:RUNNING => S:RUNNING+W:BLOCKED``:
+
+ - when a worker blocks in the kernel in ``RUNNING`` state (not ``LOCKED``),
+ before descheduling the task from the CPU the kernel performs these
+ operations:
+
+ - ``W:RUNNING => W:BLOCKED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+ - ``try_to_wake_up(S)``
+
+ - if any of the first three operations above fail, the worker is killed via
+ ``SIGKILL``. Note that ``ttwu(S)`` is not required to succeed, as the
+ server may still be transitioning to sleep in ``sys_umcg_wait()``; before
+ actually putting the server to sleep its UMCG state is checked and, if
+ it is ``RUNNING``, sys_umcg_wait() returns to the userspace;
+ - if the worker has its ``LOCKED`` flag set, block detection does not trigger,
+ as the worker is assumed to be in the userspace scheduling code.
+
+- wake detection: worker wakes in the kernel: ``W:BLOCKED => W:IDLE``:
+
+ - all workers' returns to the userspace are intercepted:
+
+ - ``start:`` (a label)
+ - if ``W:RUNNING & W.next_tid != 0``: let the worker exit to the userspace,
+ as this is a ``RUNNING`` worker with a server;
+    - ``W:* => W:IDLE`` (previously blocked workers, or workers woken
+      without a server, are not allowed to return to the userspace);
+ - the worker is appended to ``W.idle_workers_ptr`` idle workers list;
+    - ``S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S:RUNNING; ttwu(S)``
+ - ``idle_loop(W)``: this is the same idle loop that ``sys_umcg_wait()``
+ uses: it breaks only when the worker becomes ``RUNNING``; when the
+ idle loop exits, it is assumed that the userspace has properly
+ removed the worker from the idle workers list before marking it
+ ``RUNNING``;
+ - ``goto start;`` (repeat from the beginning).
+
+ - the logic above is a bit more complicated in the presence of ``LOCKED`` or
+ ``PREEMPTED`` flags, but the main invariants stay the same:
+
+ - only ``RUNNING`` workers with servers assigned are allowed to run
+ in the userspace (unless ``LOCKED``);
+ - newly ``IDLE`` workers are added to the idle workers list; any
+ user-initiated state change assumes the userspace properly removed
+ the worker from the list;
+    - as with block detection, any "breach of contract" by the userspace
+ will result in the task termination via ``SIGKILL``.
+
+- worker preemption: ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED``:
+
+ - when the userspace wants to preempt a ``RUNNING`` worker, it changes
+    its state, atomically, ``RUNNING => RUNNING|PREEMPTED``, and sends a signal
+ to the worker via ``tgkill()``; the signal handler, previously set up
+ by the userspace, can be a NOP (note that only ``RUNNING`` workers can be
+ preempted);
+  - if the worker, at the moment the signal arrives, is still running
+    on-CPU in the userspace, the "wake detection" code is triggered; in
+    addition to what was described above, it checks whether the worker is in
+    ``RUNNING|PREEMPTED`` state:
+
+ - ``W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+    - ``try_to_wake_up(S)``
+
+ - if the signal arrives after the worker blocks in the kernel, the "block
+    detection" happens as described above, with the following change:
+
+ - ``W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+ - ``try_to_wake_up(S)``
+
+ - in any case, the worker's server is woken, with its attached worker
+ (``S.next_tid``) either in ``BLOCKED|PREEMPTED`` or ``IDLE|PREEMPTED``
+ state.
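+
+As an illustration, the first transition above ("schedule a worker to run")
+might look like the userspace sketch below (not part of the patchset).
+``umcg_cas_state()`` is the hypothetical helper sketched in the
+`struct umcg_task`_ section, the constants and syscall number are assumed
+to come from the uapi headers added by this patchset, and error
+handling/rollback is omitted:
+
+.. code-block:: C
+
+    /* userspace side; a sketch only */
+    static long run_worker(struct umcg_task *server, uint32_t server_tid,
+                           struct umcg_task *worker, uint32_t worker_tid)
+    {
+            /* S:RUNNING => S:IDLE (mark self as idle) */
+            if (!umcg_cas_state(&server->state_ts, UMCG_TASK_RUNNING,
+                                UMCG_TASK_IDLE))
+                    return -1;
+
+            /* W:IDLE => W:RUNNING|LOCKED (mark the worker as running) */
+            if (!umcg_cas_state(&worker->state_ts, UMCG_TASK_IDLE,
+                                UMCG_TASK_RUNNING | UMCG_TF_LOCKED))
+                    return -1;
+
+            /* Link the server with the worker. */
+            worker->next_tid = server_tid;
+            server->next_tid = worker_tid;
+
+            /* W:RUNNING|LOCKED => W:RUNNING (unlock the worker) */
+            if (!umcg_cas_state(&worker->state_ts,
+                                UMCG_TASK_RUNNING | UMCG_TF_LOCKED,
+                                UMCG_TASK_RUNNING))
+                    return -1;
+
+            /* Context switch into the worker; returns when this server
+             * becomes RUNNING again. */
+            return syscall(__NR_umcg_wait, 0, 0);
+    }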
+
+Server-only use cases
+=====================
+
+Some workloads/applications may benefit from fast and synchronous on-CPU
+user-initiated context switches without the need for full userspace
+scheduling (block/wake detection). These applications can use "standalone"
+UMCG servers to wait/wake/context-switch. At the moment only in-process
+operations are allowed. In the future this restriction will be lifted,
+and wait/wake/context-switch operations between servers in related processes
+will be permitted (when it is safe to do so, e.g. if the processes belong
+to the same user and/or cgroup).
+
+These "worker-less" operations involve trivial ``RUNNING`` <==> ``IDLE``
+state changes, not discussed here for brevity.
+
--
2.25.1
Document User Managed Concurrency Groups syscalls, data structures,
state transitions, etc.
This is a text version of umcg.rst.
Signed-off-by: Peter Oskolkov <[email protected]>
---
Documentation/userspace-api/umcg.txt | 594 +++++++++++++++++++++++++++
1 file changed, 594 insertions(+)
create mode 100644 Documentation/userspace-api/umcg.txt
diff --git a/Documentation/userspace-api/umcg.txt b/Documentation/userspace-api/umcg.txt
new file mode 100644
index 000000000000..cabaa6f4aaad
--- /dev/null
+++ b/Documentation/userspace-api/umcg.txt
@@ -0,0 +1,594 @@
+UMCG USERSPACE API
+
+User Managed Concurrency Groups (UMCG) is an M:N threading
+subsystem/toolkit that lets user space application developers implement
+in-process user space schedulers.
+
+
+CONTENTS
+
+ WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
+ REQUIREMENTS
+ UMCG KERNEL API
+ SERVERS
+ WORKERS
+ UMCG TASK STATES
+ STRUCT UMCG_TASK
+ SYS_UMCG_CTL()
+ SYS_UMCG_WAIT()
+ STATE TRANSITIONS
+ SERVER-ONLY USE CASES
+
+
+WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
+
+Linux kernel's CFS scheduler is designed for the "common" use case, with
+efficiency/throughput in mind. Work isolation and workloads of different
+"urgency" are addressed by tools such as cgroups, CPU affinity, priorities,
+etc., which are difficult or impossible to efficiently use in-process.
+
+For example, a single DBMS process may receive tens of thousands requests
+per second; some of these requests may have strong response latency
+requirements as they serve live user requests (e.g. login authentication);
+some of these requests may not care much about latency but must be served
+within a certain time period (e.g. an hourly aggregate usage report); some
+of these requests are to be served only on a best-effort basis and can be
+NACKed under high load (e.g. an exploratory research/hypothesis testing
+workload).
+
+Beyond different work item latency/throughput requirements as outlined
+above, the DBMS may need to provide certain guarantees to different users;
+for example, user A may "reserve" 1 CPU for their high-priority/low latency
+requests, 2 CPUs for mid-level throughput workloads, and be allowed to send
+as many best-effort requests as possible, which may or may not be served,
+depending on the DBMS load. Besides, the best-effort work, started when the
+load was low, may need to be delayed if suddenly a large amount of
+higher-priority work arrives. With hundreds or thousands of users like
+this, it is very difficult to guarantee the application's responsiveness
+using standard Linux tools while maintaining high CPU utilization.
+
+Gaming is another use case: some in-process work must be completed before a
+certain deadline dictated by frame rendering schedule, while other work
+items can be delayed; some work may need to be cancelled/discarded because
+the deadline has passed; etc.
+
+User Managed Concurrency Groups is an M:N threading toolkit that allows
+constructing user space schedulers designed to efficiently manage
+heterogeneous in-process workloads described above while maintaining high
+CPU utilization (95%+).
+
+
+REQUIREMENTS
+
+One relatively established way to design high-efficiency, low-latency
+systems is to split all work into small on-cpu work items, with
+asynchronous I/O and continuations, all executed on a thread pool with the
+number of threads not exceeding the number of available CPUs. Although this
+approach works, it is quite difficult to develop and maintain such a
+system, as, for example, small continuations are difficult to piece
+together when debugging. Besides, such asynchronous callback-based systems
+tend to be somewhat cache-inefficient, as continuations can get scheduled
+on any CPU regardless of cache locality.
+
+M:N threading and cooperative user space scheduling enables controlled CPU
+usage (minimal OS preemption), synchronous coding style, and better cache
+locality.
+
+Specifically:
+
+* a variable/fluctuating number M of "application" threads should be
+ "scheduled over" a relatively fixed number N of "kernel" threads, where
+ N is less than or equal to the number of CPUs available;
+* only those application threads that are attached to kernel threads are
+ scheduled "on CPU";
+* application threads should be able to cooperatively
+ yield to each other;
+* when an application thread blocks in kernel (e.g. in I/O), this becomes
+ a scheduling event ("block") that the userspace scheduler should be able
+ to efficiently detect, and reassign a waiting application thread to the
+  freed "kernel" thread;
+* when a blocked application thread wakes (e.g. its I/O operation
+  completes), this event ("wake") should also be detectable by the
+ userspace scheduler, which should be able to either quickly dispatch the
+ newly woken thread to an idle "kernel" thread or, if all "kernel"
+ threads are busy, put it in the waiting queue;
+* in addition to the above, it would be extremely useful for a separate
+ in-process "watchdog" facility to be able to monitor the state of each
+ of the M+N threads, and to intervene in case of runaway workloads
+ (interrupt/preempt).
+
+
+UMCG KERNEL API
+
+Based on the requirements above, the UMCG kernel API is built around the
+following ideas:
+
+* UMCG server: a task/thread representing "kernel threads", or CPUs from
+ the requirements above;
+* UMCG worker: a task/thread representing "application threads", to be
+ scheduled over servers;
+* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
+ server or a worker) can be in;
+* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
+ can be ORed with the task state to communicate additional information to
+ the kernel;
+* struct umcg_task: a per-task userspace set of data fields, usually
+ residing in the TLS, that fully reflects the current task's UMCG state
+ and controls the way the kernel manages the task;
+* sys_umcg_ctl(): a syscall used to register the current task/thread as a
+ server or a worker, or to unregister a UMCG task;
+* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
+  wake another task, potentially context-switching between the two tasks
+ on-CPU synchronously.
+
+
+SERVERS
+
+When a task/thread is registered as a server, it is in RUNNING state and
+behaves like any other normal task/thread. In addition, servers can
+interact with other UMCG tasks via sys_umcg_wait():
+
+* servers can voluntarily suspend their execution (wait), becoming IDLE;
+* servers can wake other IDLE servers;
+* servers can context-switch between each other.
+
+Note that if a server blocks in the kernel not via sys_umcg_wait(), it
+still retains its RUNNING state.
+
+
+WORKERS
+
+A worker cannot be RUNNING without having a server associated with it, so
+when a task is first registered as a worker, it enters the IDLE state.
+
+* a worker becomes RUNNING when a server calls sys_umcg_wait to
+ context-switch into it; the server goes IDLE, and the worker becomes
+ RUNNING in its place;
+* when a running worker blocks in the kernel, it becomes BLOCKED, its
+ associated server becomes RUNNING and the server's sys_umcg_wait() call
+ from the bullet above returns; this transition is sometimes called
+ "block detection";
+* when the syscall on which a BLOCKED worker is blocked completes, the worker
+ becomes IDLE and is added to the list of idle workers; if there is an
+ idle server waiting, the kernel wakes it; this transition is sometimes
+ called "wake detection";
+* running workers can voluntarily suspend their execution (wait),
+ becoming IDLE; their associated servers are woken;
+* a RUNNING worker can context-switch with an IDLE worker; the server of
+ the switched-out worker is transferred to the switched-in worker;
+* any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
+ this is a server running the worker as described in the first bullet in
+  this list, the worker remains IDLE but is added to the idle workers list;
+ this "wake" operation exists for completeness, to make sure
+ wait/wake/context-switch operations are available for all UMCG tasks;
+* the userspace can preempt a RUNNING worker by marking it
+ RUNNING|PREEMPTED and sending a signal to it; the userspace should have
+ installed a NOP signal handler for the signal; the kernel will then
+ transition the worker into IDLE|PREEMPTED state and wake its associated
+ server.
+
+
+UMCG TASK STATES
+
+Important: all state transitions described below involve at least two
+steps: the change of the state field in struct umcg_task, for example
+RUNNING to IDLE, and the corresponding change in struct task_struct state,
+for example a transition between the task running on CPU and being
+descheduled and removed from the kernel runqueue. The key principle of UMCG
+API design is that the party initiating the state transition modifies the
+state variable.
+
+For example, a task going IDLE first changes its state from RUNNING to IDLE
+in the userspace and then calls sys_umcg_wait(), which completes the
+transition.
+
+Note on documentation: in include/uapi/linux/umcg.h, task states have the
+form UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, etc. In this document these are
+usually referred to simply as RUNNING and BLOCKED, unless it creates
+ambiguity. Task state flags, e.g. UMCG_TF_PREEMPTED, are treated similarly.
+
+UMCG task states reflect the view from the userspace, rather than from the
+kernel. There are three fundamental task states:
+
+* RUNNING: indicates that the task is schedulable by the kernel; applies
+ to both servers and workers;
+* IDLE: indicates that the task is not schedulable by the kernel (see
+ umcg_idle_loop() in kernel/sched/umcg.c); applies to both servers and
+ workers;
+* BLOCKED: indicates that the worker is blocked in the kernel; does not
+ apply to servers.
+
+In addition to the three states above, two state flags help with state
+transitions:
+
+* LOCKED: the userspace is preparing the worker for a state transition
+ and "locks" the worker until the worker is ready for the kernel to act
+ on the state transition; used similarly to preempt_disable or
+ irq_disable in the kernel; applies only to workers in RUNNING or IDLE
+  state; RUNNING|LOCKED means "this worker is about to become RUNNING",
+  while IDLE|LOCKED means "this worker is about to become IDLE or
+  unregister";
+* PREEMPTED: the userspace indicates it wants the worker to be preempted;
+ there are no situations when both LOCKED and PREEMPTED flags are set at
+ the same time.
+
+
+STRUCT UMCG_TASK
+
+From include/uapi/linux/umcg.h:
+
+struct umcg_task {
+ uint64_t state_ts; /* r/w */
+ uint32_t next_tid; /* r */
+ uint32_t flags; /* reserved */
+ uint64_t idle_workers_ptr; /* r/w */
+ uint64_t idle_server_tid_ptr; /* r */
+};
+
+Each UMCG task is identified by struct umcg_task, which is provided to the
+kernel when the task is registered via sys_umcg_ctl().
+
+* uint64_t state_ts: the current state of the task this struct
+ identifies, as described in the previous section, combined with a
+ unique timestamp indicating when the last state change happened.
+
+ Readable/writable by both the kernel and the userspace.
+
+ bits 0 - 5: task state (RUNNING, IDLE, BLOCKED);
+ bits 6 - 7: state flags (LOCKED, PREEMPTED);
+ bits 8 - 12: reserved; must be zeroes;
+ bits 13 - 17: for userspace use;
+ bits 18 - 63: timestamp.
+
+ Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
+
+ It is highly beneficial to tag each state change with a unique
+ timestamp:
+
+ - timestamps will naturally provide instrumentation to measure
+ scheduling delays, both in the kernel and in the userspace;
+ - uniqueness of timestamps (modulo overflow) guarantees that state
+ change races, especially ABA races, are easily detected and avoided.
+
+ Each timestamp represents the moment in time the state change happened,
+ in nanoseconds, with the lower 4 bits and the upper 14 bits stripped.
+
+ In this document 'umcg_task.state' is often used to talk about
+ 'umcg_task.state_ts' field, as timestamps do not carry semantic
+ meaning at the moment.
+
+ This is how umcg_task.state_ts is updated in the kernel:
+
+ /* kernel side */
+ /**
+ * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
+ * @state_ts - points to the state_ts member of struct umcg_task to update;
+ * @expected - the expected value of state_ts, including the timestamp;
+ * @desired - the desired value of state_ts, state part only;
+ * @may_fault - whether to use normal or _nofault cmpxchg.
+ *
+ * The function is basically cmpxchg(state_ts, expected, desired), with extra
+ * code to set the timestamp in @desired.
+ */
+ static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
+ bool may_fault)
+ {
+ u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
+ u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
+
+ /* Cut higher order bits. */
+ next_ts &= ((1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1);
+
+ if (next_ts == curr_ts)
+ ++next_ts;
+
+ /* Remove an old timestamp, if any. */
+ desired &= ((1ULL << (64 - UMCG_STATE_TIMESTAMP_BITS)) - 1);
+
+ /* Set the new timestamp. */
+ desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
+
+ if (may_fault)
+ return cmpxchg_user_64(state_ts, expected, desired);
+
+ return cmpxchg_user_64_nofault(state_ts, expected, desired);
+ }
+
+* uint32_t next_tid: contains the TID of the task to context-switch-into
+ in sys_umcg_wait(); can be zero; writable by the userspace, readable by
+ the kernel; if this is a RUNNING worker, this field contains the TID of
+ the server that should be woken when this worker blocks; see
+ sys_umcg_wait() for more details;
+
+* uint32_t flags: reserved; must be zero.
+
+* uint64_t idle_workers_ptr: this field forms a single-linked list of
+ idle workers: all RUNNING workers have this field set to point to the
+ head of the list (a pointer variable in the userspace).
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from BLOCKED to IDLE and adds the worker to
+ the top of the list of idle workers using this logic:
+
+ /* kernel side */
+ /**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr
+ * list/stack.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+ static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+ {
+ u64 __user *node = &ut_worker->idle_workers_ptr;
+ u64 __user *head_ptr;
+ u64 first = (u64)node;
+ u64 head;
+
+ if (get_user_nosleep(head, node) || !head)
+ return false;
+
+ head_ptr = (u64 __user *)head;
+
+ if (put_user_nosleep(UMCG_IDLE_NODE_PENDING, node))
+ return false;
+
+ if (xchg_user_64(head_ptr, &first))
+ return false;
+
+ if (put_user_nosleep(first, node))
+ return false;
+
+ return true;
+ }
+
+ In the userspace the list is cleared atomically using this logic:
+
+ /* userspace side */
+ uint64_t *idle_workers;
+
+ idle_workers = (uint64_t *)atomic_exchange(head, NULL);
+
+ The userspace re-points workers' idle_workers_ptr to the list head
+ variable before the worker is allowed to become RUNNING again.
+
+ When processing the idle workers list, the userspace should wait for
+ workers marked as UMCG_IDLE_NODE_PENDING to have the flag cleared (see
+ enqueue_idle_worker() above).
+
+* uint64_t idle_server_tid_ptr: points to a variable in the userspace
+ that points to an idle server, i.e. a server in IDLE state waiting in
+ sys_umcg_wait(); read-only; workers must have this field set; not used
+ in servers.
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from BLOCKED to IDLE, adds the worker to the
+ list of idle workers, and wakes the idle server if present; the kernel
+ atomically exchanges (*idle_server_tid_ptr) with 0, thus waking the idle
+ server, if present, only once. See State transitions below for more
+ details.
+
+
+SYS_UMCG_CTL()
+
+int sys_umcg_ctl(uint32_t flags, struct umcg_task *self) is used to
+register or unregister the current task as a worker or server. Flags can be
+one of the following:
+
+ UMCG_CTL_REGISTER: register a server;
+ UMCG_CTL_REGISTER | UMCG_CTL_WORKER: register a worker;
+ UMCG_CTL_UNREGISTER: unregister the current server or worker.
+
+When registering a task, self must point to struct umcg_task describing
+this server or worker; the pointer must remain valid until the task is
+unregistered.
+
+When registering a server, self->state must be RUNNING; all other fields in
+self must be zeroes.
+
+When registering a worker, self->state must be RUNNING;
+self->idle_server_tid_ptr and self->idle_workers_ptr must be valid pointers
+as described in struct umcg_task; self->next_tid must be zero.
+
+When unregistering a task, self must be NULL.
+
+
+SYS_UMCG_WAIT()
+
+int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout) operates on
+registered UMCG servers and workers: struct umcg_task *self provided to
+sys_umcg_ctl() when registering the current task is consulted in addition
+to flags and abs_timeout parameters.
+
+The function can be used to perform one of the three operations:
+
+* wait: if self->next_tid is zero, sys_umcg_wait() puts the current
+ task to sleep;
+* wake: if self->next_tid is not zero, and flags & UMCG_WAIT_WAKE_ONLY,
+ the task identified by next_tid is woken;
+* context switch: if self->next_tid is not zero, and !(flags &
+ UMCG_WAIT_WAKE_ONLY), the current task is put to sleep and the next task
+ is woken, synchronously switching between the tasks on the current CPU
+ on the fast path.
+
+Flags can be zero or a combination of the following values:
+
+* UMCG_WAIT_WAKE_ONLY: wake the next task, don't put the current task to
+ sleep;
+* UMCG_WAIT_WF_CURRENT_CPU: wake the next task on the current CPU; this
+  flag has an effect only if UMCG_WAIT_WAKE_ONLY is set: context switching
+  is always attempted on the current CPU.
+
+The section below provides more details on how servers and workers interact
+via sys_umcg_wait(), during worker block/wake events, and during worker
+preemption.
+
+
+STATE TRANSITIONS
+
+As mentioned above, the key principle of UMCG state transitions is that the
+party initiating the state transition modifies the state of affected tasks.
+
+Below, "TASK:STATE" indicates a task T, where T can be either W for worker
+or S for server, in state S, where S can be one of the three states,
+potentially ORed with a state flag. Each individual state transition is an
+atomic operation (cmpxchg) unless indicated otherwise. Also note that the
+order of state transitions is important and is part of the contract between
+the userspace and the kernel. The kernel is free to kill the task (SIGKILL)
+if the contract is broken.
+
+Some worker state transitions below include adding LOCKED flag to worker
+state. This is done to indicate to the kernel that the worker is
+transitioning state and should not participate in the block/wake detection
+routines, which can happen due to interrupts/pagefaults/signals.
+
+IDLE|LOCKED means that a running worker is preparing to sleep, so
+interrupts should not lead to server wakeup; RUNNING|LOCKED means that an
+idle worker is going to be "scheduled to run", but may not yet have its
+server set up properly.
+
+Key state transitions:
+
+* server to worker context switch ("schedule a worker to run"):
+ S:RUNNING+W:IDLE => S:IDLE+W:RUNNING:
+ in the userspace, in the context of the server S running:
+ S:RUNNING => S:IDLE (mark self as idle)
+ W:IDLE => W:RUNNING|LOCKED (mark the worker as running)
+ W.next_tid := S.tid; S.next_tid := W.tid (link the server with
+ the worker)
+ W:RUNNING|LOCKED => W:RUNNING (unlock the worker)
+ S: sys_umcg_wait() (make the syscall)
+ the kernel context switches from the server to the worker; the
+ server sleeps until it becomes RUNNING during one of the
+ transitions below;
+
+* worker to server context switch (worker "yields"): S:IDLE+W:RUNNING =>
+S:RUNNING+W:IDLE:
+ in the userspace, in the context of the worker W running (note that
+ a running worker has its next_tid set to point to its server):
+ W:RUNNING => W:IDLE|LOCKED (mark self as idle)
+ S:IDLE => S:RUNNING (mark the server as running)
+ W: sys_umcg_wait() (make the syscall)
+ the kernel removes the LOCKED flag from the worker's state and
+ context switches from the worker to the server; the worker sleeps
+ until it becomes RUNNING;
+
+* worker to worker context switch: W1:RUNNING+W2:IDLE =>
+ W1:IDLE+W2:RUNNING:
+ in the userspace, in the context of W1 running:
+ W2:IDLE => W2:RUNNING|LOCKED (mark W2 as running)
+ W1:RUNNING => W1:IDLE|LOCKED (mark self as idle)
+ W2.next_tid := W1.next_tid; S.next_tid := W2.tid (transfer the
+ server W1 => W2)
+ W1:next_tid := W2.tid (indicate that W1 should context-switch
+ into W2)
+ W2:RUNNING|LOCKED => W2:RUNNING (unlock W2)
+ W1: sys_umcg_wait() (make the syscall)
+ same as above, the kernel removes the LOCKED flag from the W1's
+ state and context switches to next_tid;
+
+* worker wakeup: W:IDLE => W:RUNNING:
+ in the userspace, a server S can wake a worker W without "running"
+ it:
+ S.next_tid := W.tid
+ W.next_tid := 0
+ W:IDLE => W:RUNNING
+ sys_umcg_wait(UMCG_WAIT_WAKE_ONLY) (make the syscall)
+ the kernel will wake the worker W; as the worker does not have a
+ server assigned, "wake detection" will happen, the worker will be
+ immediately marked as IDLE and added to idle workers list; an idle
+ server, if any, will be woken (see 'wake detection' below);
+
+ Note: if needed, it is possible for a worker to wake another
+ worker: the waker marks itself "IDLE|LOCKED", points its next_tid
+ to the wakee, makes the syscall, restores its server in next_tid,
+ marks itself as RUNNING.
+
+* block detection: worker blocks in the kernel: S:IDLE+W:RUNNING =>
+ S:RUNNING+W:BLOCKED:
+ when a worker blocks in the kernel in RUNNING state (not LOCKED),
+ before descheduling the task from the CPU the kernel performs
+ these operations:
+ W:RUNNING => W:BLOCKED
+ S := W.next_tid
+ S:IDLE => S:RUNNING
+ try_to_wake_up(S)
+ if any of the first three operations above fail, the worker is
+ killed via SIGKILL. Note that ttwu(S) is not required to succeed,
+ as the server may still be transitioning to sleep in
+ sys_umcg_wait(); before actually putting the server to sleep its
+ UMCG state is checked and, if it is RUNNING, sys_umcg_wait()
+ returns to the userspace;
+ if the worker has its LOCKED flag set, block detection does not
+ trigger, as the worker is assumed to be in the userspace
+ scheduling code.
+
+* wake detection: worker wakes in the kernel: W:BLOCKED => W:IDLE:
+ all workers' returns to the userspace are intercepted:
+ start: (a label)
+ if W:RUNNING & W.next_tid != 0: let the worker exit to the
+ userspace, as this is a RUNNING worker with a server;
+ W:* => W:IDLE (previously blocked workers, or workers woken
+ without a server, are not allowed to return to the userspace);
+ the worker is appended to W.idle_workers_ptr idle workers list;
+ S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S.RUNNING;
+ ttwu(S)
+ idle_loop(W): this is the same idle loop that sys_umcg_wait()
+ uses: it breaks only when the worker becomes RUNNING; when
+ the idle loop exits, it is assumed that the userspace has
+ properly removed the worker from the idle workers list
+ before marking it RUNNING;
+ goto start; (repeat from the beginning).
+
+ the logic above is a bit more complicated in the presence of
+ LOCKED or PREEMPTED flags, but the main invariants
+ stay the same:
+ only RUNNING workers with servers assigned are allowed to run
+ in the userspace (unless LOCKED);
+ newly IDLE workers are added to the idle workers list; any
+ user-initiated state change assumes the userspace
+ properly removed the worker from the list;
+ as with block detection, any "breach of contract" by the
+ userspace will result in the task termination via SIGKILL.
+
+* worker preemption: S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED:
+ when the userspace wants to preempt a RUNNING worker, it changes its
+ state, atomically, RUNNING => RUNNING|PREEMPTED and sends a
+ signal to the worker via tgkill(); the signal handler, previously
+ set up by the userspace, can be a NOP (note that only RUNNING
+ workers can be preempted);
+
+ if the worker, at the moment the signal arrives, is still running
+ on-CPU in the userspace, the "wake detection" code is triggered;
+ in addition to what was described above, it checks whether the
+ worker is in RUNNING|PREEMPTED state:
+ W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED
+ S := W.next_tid
+ S:IDLE => S:RUNNING
+ try_to_wake_up(S)
+
+ if the signal arrives after the worker blocks in the kernel,
+ the "block detection" happens as described above, with the
+ following change:
+ W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED
+ S := W.next_tid
+ S:IDLE => S:RUNNING
+ try_to_wake_up(S)
+
+ in any case, the worker's server is woken, with its attached
+ worker (S.next_tid) either in BLOCKED|PREEMPTED or IDLE|PREEMPTED
+ state.
+
+
+SERVER-ONLY USE CASES
+
+Some workloads/applications may benefit from fast and synchronous on-CPU
+user-initiated context switches without the need for full userspace
+scheduling (block/wake detection). These applications can use "standalone"
+UMCG servers to wait/wake/context-switch. At the moment only in-process
+operations are allowed. In the future this restriction will be lifted,
+and wait/wake/context-switch operations between servers in related processes
+will be permitted (when it is safe to do so, e.g. if the processes belong
+to the same user and/or cgroup).
+
+These "worker-less" operations involve trivial RUNNING <==> IDLE state
+changes, not discussed here for brevity.
--
2.25.1
Hi Peter,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on hnaz-mm/master]
[cannot apply to tip/master tip/x86/core arnd-asm-generic/master linus/master tip/x86/asm tip/core/entry v5.15-rc5 next-20211013]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211013-072710
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git b2d5b9cec60fecc72a13191c2c6c05acf60975a5
config: x86_64-randconfig-r003-20211012 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/0dcffc800cf354f683ff5c495baa0070dfb60afb
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211013-072710
git checkout 0dcffc800cf354f683ff5c495baa0070dfb60afb
# save the attached .config to linux build tree
mkdir build_dir
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
In file included from <command-line>:32:
>> ./usr/include/linux/umcg.h:80:2: error: unknown type name 'uint64_t'
80 | uint64_t state_ts; /* r/w */
| ^~~~~~~~
>> ./usr/include/linux/umcg.h:91:2: error: unknown type name 'uint32_t'
91 | uint32_t next_tid; /* r */
| ^~~~~~~~
./usr/include/linux/umcg.h:93:2: error: unknown type name 'uint32_t'
93 | uint32_t flags; /* Reserved; must be zero. */
| ^~~~~~~~
./usr/include/linux/umcg.h:101:2: error: unknown type name 'uint64_t'
101 | uint64_t idle_workers_ptr; /* r/w */
| ^~~~~~~~
./usr/include/linux/umcg.h:107:2: error: unknown type name 'uint64_t'
107 | uint64_t idle_server_tid_ptr; /* r */
| ^~~~~~~~
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Hi Peter,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on hnaz-mm/master]
[cannot apply to tip/master tip/x86/core arnd-asm-generic/master linus/master tip/x86/asm tip/core/entry v5.15-rc5 next-20211013]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211013-072710
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git b2d5b9cec60fecc72a13191c2c6c05acf60975a5
config: x86_64-randconfig-a012-20211012 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project adf55ac6657693f7bfbe3087b599b4031a765a44)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/0dcffc800cf354f683ff5c495baa0070dfb60afb
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211013-072710
git checkout 0dcffc800cf354f683ff5c495baa0070dfb60afb
# save the attached .config to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
In file included from <built-in>:1:
>> ./usr/include/linux/umcg.h:80:2: error: unknown type name 'uint64_t'
uint64_t state_ts; /* r/w */
^
>> ./usr/include/linux/umcg.h:91:2: error: unknown type name 'uint32_t'
uint32_t next_tid; /* r */
^
./usr/include/linux/umcg.h:93:2: error: unknown type name 'uint32_t'
uint32_t flags; /* Reserved; must be zero. */
^
./usr/include/linux/umcg.h:101:2: error: unknown type name 'uint64_t'
uint64_t idle_workers_ptr; /* r/w */
^
./usr/include/linux/umcg.h:107:2: error: unknown type name 'uint64_t'
uint64_t idle_server_tid_ptr; /* r */
^
5 errors generated.
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Hi Peter,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on tip/sched/core]
[cannot apply to tip/master tip/x86/core hnaz-mm/master arnd-asm-generic/master linus/master tip/x86/asm tip/core/entry v5.15-rc5 next-20211015]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211013-072710
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git b2d5b9cec60fecc72a13191c2c6c05acf60975a5
config: arm64-randconfig-r015-20211014 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project acb3b187c4c88650a6a717a1bcb234d27d0d7f54)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm64 cross compiling tool for clang build
# apt-get install binutils-aarch64-linux-gnu
# https://github.com/0day-ci/linux/commit/0dcffc800cf354f683ff5c495baa0070dfb60afb
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211013-072710
git checkout 0dcffc800cf354f683ff5c495baa0070dfb60afb
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=arm64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:141:1: note: expanded from here
__arm64_sys_mremap
^
kernel/sys_ni.c:262:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:265:1: warning: no previous prototype for function '__arm64_sys_add_key' [-Wmissing-prototypes]
COND_SYSCALL(add_key);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:142:1: note: expanded from here
__arm64_sys_add_key
^
kernel/sys_ni.c:265:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:266:1: warning: no previous prototype for function '__arm64_sys_request_key' [-Wmissing-prototypes]
COND_SYSCALL(request_key);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:143:1: note: expanded from here
__arm64_sys_request_key
^
kernel/sys_ni.c:266:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:267:1: warning: no previous prototype for function '__arm64_sys_keyctl' [-Wmissing-prototypes]
COND_SYSCALL(keyctl);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:144:1: note: expanded from here
__arm64_sys_keyctl
^
kernel/sys_ni.c:267:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:268:1: warning: no previous prototype for function '__arm64_compat_sys_keyctl' [-Wmissing-prototypes]
COND_SYSCALL_COMPAT(keyctl);
^
arch/arm64/include/asm/syscall_wrapper.h:41:25: note: expanded from macro 'COND_SYSCALL_COMPAT'
asmlinkage long __weak __arm64_compat_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:145:1: note: expanded from here
__arm64_compat_sys_keyctl
^
kernel/sys_ni.c:268:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:41:13: note: expanded from macro 'COND_SYSCALL_COMPAT'
asmlinkage long __weak __arm64_compat_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:271:1: warning: no previous prototype for function '__arm64_sys_landlock_create_ruleset' [-Wmissing-prototypes]
COND_SYSCALL(landlock_create_ruleset);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:146:1: note: expanded from here
__arm64_sys_landlock_create_ruleset
^
kernel/sys_ni.c:271:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:272:1: warning: no previous prototype for function '__arm64_sys_landlock_add_rule' [-Wmissing-prototypes]
COND_SYSCALL(landlock_add_rule);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:147:1: note: expanded from here
__arm64_sys_landlock_add_rule
^
kernel/sys_ni.c:272:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:273:1: warning: no previous prototype for function '__arm64_sys_landlock_restrict_self' [-Wmissing-prototypes]
COND_SYSCALL(landlock_restrict_self);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:148:1: note: expanded from here
__arm64_sys_landlock_restrict_self
^
kernel/sys_ni.c:273:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
>> kernel/sys_ni.c:276:1: warning: no previous prototype for function '__arm64_sys_umcg_ctl' [-Wmissing-prototypes]
COND_SYSCALL(umcg_ctl);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:149:1: note: expanded from here
__arm64_sys_umcg_ctl
^
kernel/sys_ni.c:276:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
>> kernel/sys_ni.c:277:1: warning: no previous prototype for function '__arm64_sys_umcg_wait' [-Wmissing-prototypes]
COND_SYSCALL(umcg_wait);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:150:1: note: expanded from here
__arm64_sys_umcg_wait
^
kernel/sys_ni.c:277:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:282:1: warning: no previous prototype for function '__arm64_sys_fadvise64_64' [-Wmissing-prototypes]
COND_SYSCALL(fadvise64_64);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:151:1: note: expanded from here
__arm64_sys_fadvise64_64
^
kernel/sys_ni.c:282:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:285:1: warning: no previous prototype for function '__arm64_sys_swapon' [-Wmissing-prototypes]
COND_SYSCALL(swapon);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:152:1: note: expanded from here
__arm64_sys_swapon
^
kernel/sys_ni.c:285:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:286:1: warning: no previous prototype for function '__arm64_sys_swapoff' [-Wmissing-prototypes]
COND_SYSCALL(swapoff);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:153:1: note: expanded from here
__arm64_sys_swapoff
^
kernel/sys_ni.c:286:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:287:1: warning: no previous prototype for function '__arm64_sys_mprotect' [-Wmissing-prototypes]
COND_SYSCALL(mprotect);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:154:1: note: expanded from here
__arm64_sys_mprotect
^
kernel/sys_ni.c:287:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:288:1: warning: no previous prototype for function '__arm64_sys_msync' [-Wmissing-prototypes]
COND_SYSCALL(msync);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:155:1: note: expanded from here
__arm64_sys_msync
^
kernel/sys_ni.c:288:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:289:1: warning: no previous prototype for function '__arm64_sys_mlock' [-Wmissing-prototypes]
COND_SYSCALL(mlock);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:156:1: note: expanded from here
__arm64_sys_mlock
^
kernel/sys_ni.c:289:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
kernel/sys_ni.c:290:1: warning: no previous prototype for function '__arm64_sys_munlock' [-Wmissing-prototypes]
COND_SYSCALL(munlock);
^
arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs) \
^
<scratch space>:157:1: note: expanded from here
__arm64_sys_munlock
^
kernel/sys_ni.c:290:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
vim +/__arm64_sys_umcg_ctl +276 kernel/sys_ni.c
274
275 /* kernel/sched/umcg.c */
> 276 COND_SYSCALL(umcg_ctl);
> 277 COND_SYSCALL(umcg_wait);
278
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Hi Peter,
On Tue, Oct 12, 2021 at 04:25:22PM -0700, Peter Oskolkov wrote:
> Document User Managed Concurrency Groups syscalls, data structures,
> state transitions, etc.
>
> This is a text version of umcg.rst.
>
> Signed-off-by: Peter Oskolkov <[email protected]>
> ---
> Documentation/userspace-api/umcg.txt | 594 +++++++++++++++++++++++++++
> 1 file changed, 594 insertions(+)
> create mode 100644 Documentation/userspace-api/umcg.txt
>
> diff --git a/Documentation/userspace-api/umcg.txt b/Documentation/userspace-api/umcg.txt
> new file mode 100644
> index 000000000000..cabaa6f4aaad
> --- /dev/null
> +++ b/Documentation/userspace-api/umcg.txt
> @@ -0,0 +1,594 @@
> +UMCG USERSPACE API
> +
> +User Managed Concurrency Groups (UMCG) is an M:N threading
> +subsystem/toolkit that lets user space application developers implement
> +in-process user space schedulers.
> +
> +
> +CONTENTS
> +
> + WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
> + REQUIREMENTS
> + UMCG KERNEL API
> + SERVERS
> + WORKERS
> + UMCG TASK STATES
> + STRUCT UMCG_TASK
> + SYS_UMCG_CTL()
> + SYS_UMCG_WAIT()
> + STATE TRANSITIONS
> + SERVER-ONLY USE CASES
> +
> +
> +WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
> +
> +The Linux kernel's CFS scheduler is designed for the "common" use case, with
> +efficiency/throughput in mind. Work isolation and workloads of different
> +"urgency" are addressed by tools such as cgroups, CPU affinity, priorities,
> +etc., which are difficult or impossible to efficiently use in-process.
> +
> +For example, a single DBMS process may receive tens of thousands of requests
> +per second; some of these requests may have strong response latency
> +requirements as they serve live user requests (e.g. login authentication);
> +some of these requests may not care much about latency but must be served
> +within a certain time period (e.g. an hourly aggregate usage report); some
> +of these requests are to be served only on a best-effort basis and can be
> +NACKed under high load (e.g. an exploratory research/hypothesis testing
> +workload).
> +
> +Beyond different work item latency/throughput requirements as outlined
> +above, the DBMS may need to provide certain guarantees to different users;
> +for example, user A may "reserve" 1 CPU for their high-priority/low latency
^^^^^^^^^^^
low-latency
> +requests, 2 CPUs for mid-level throughput workloads, and be allowed to send
> +as many best-effort requests as possible, which may or may not be served,
> +depending on the DBMS load. Besides, the best-effort work, started when the
> +load was low, may need to be delayed if suddenly a large amount of
> +higher-priority work arrives. With hundreds or thousands of users like
> +this, it is very difficult to guarantee the application's responsiveness
> +using standard Linux tools while maintaining high CPU utilization.
> +
> +Gaming is another use case: some in-process work must be completed before a
> +certain deadline dictated by frame rendering schedule, while other work
> +items can be delayed; some work may need to be cancelled/discarded because
> +the deadline has passed; etc.
> +
> +User Managed Concurrency Groups is an M:N threading toolkit that allows
> +constructing user space schedulers designed to efficiently manage
> +heterogeneous in-process workloads described above while maintaining high
> +CPU utilization (95%+).
> +
> +
> +REQUIREMENTS
> +
> +One relatively established way to design high-efficiency, low-latency
> +systems is to split all work into small on-cpu work items, with
> +asynchronous I/O and continuations, all executed on a thread pool with the
> +number of threads not exceeding the number of available CPUs. Although this
> +approach works, it is quite difficult to develop and maintain such a
> +system, as, for example, small continuations are difficult to piece
> +together when debugging. Besides, such asynchronous callback-based systems
> +tend to be somewhat cache-inefficient, as continuations can get scheduled
> +on any CPU regardless of cache locality.
> +
> +M:N threading and cooperative user space scheduling enables controlled CPU
> +usage (minimal OS preemption), synchronous coding style, and better cache
> +locality.
> +
> +Specifically:
> +
> +* a variable/fluctuating number M of "application" threads should be
> + "scheduled over" a relatively fixed number N of "kernel" threads, where
> + N is less than or equal to the number of CPUs available;
> +* only those application threads that are attached to kernel threads are
> + scheduled "on CPU";
> +* application threads should be able to cooperatively
> + yield to each other;
The above two lines can be in one line.
* application threads should be able to cooperatively yield to each other;
> +* when an application thread blocks in kernel (e.g. in I/O), this becomes
> + a scheduling event ("block") that the userspace scheduler should be able
> + to efficiently detect, and reassign a waiting application thread to the
> + freed "kernel" thread;
> +* when a blocked application thread wakes (e.g. its I/O operation
> + completes), this even ("wake") should also be detectable by the
^^^^
event
> + userspace scheduler, which should be able to either quickly dispatch the
> + newly woken thread to an idle "kernel" thread or, if all "kernel"
> + threads are busy, put it in the waiting queue;
> +* in addition to the above, it would be extremely useful for a separate
> + in-process "watchdog" facility to be able to monitor the state of each
> + of the M+N threads, and to intervene in case of runaway workloads
> + (interrupt/preempt).
> +
> +
> +UMCG KERNEL API
> +
> +Based on the requirements above, the UMCG kernel API is built around the
> +following ideas:
> +
> +* UMCG server: a task/thread representing "kernel threads", or CPUs from
> + the requirements above;
> +* UMCG worker: a task/thread representing "application threads", to be
> + scheduled over servers;
> +* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
> + server or a worker) can be in;
> +* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
> + can be ORed with the task state to communicate additional information to
> + the kernel;
> +* struct umcg_task: a per-task userspace set of data fields, usually
> + residing in the TLS, that fully reflects the current task's UMCG state
> + and controls the way the kernel manages the task;
> +* sys_umcg_ctl(): a syscall used to register the current task/thread as a
> + server or a worker, or to unregister a UMCG task;
> +* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
> + wake another task, potentially context-switching between the two tasks
> + on-CPU synchronously.
> +
> +
> +SERVERS
> +
> +When a task/thread is registered as a server, it is in RUNNING state and
> +behaves like any other normal task/thread. In addition, servers can
> +interact with other UMCG tasks via sys_umcg_wait():
> +
> +* servers can voluntarily suspend their execution (wait), becoming IDLE;
> +* servers can wake other IDLE servers;
> +* servers can context-switch between each other.
> +
> +Note that if a server blocks in the kernel not via sys_umcg_wait(), it
> +still retains its RUNNING state.
> +
> +
> +WORKERS
> +
> +A worker cannot be RUNNING without having a server associated with it, so
> +when a task is first registered as a worker, it enters the IDLE state.
> +
> +* a worker becomes RUNNING when a server calls sys_umcg_wait to
> + context-switch into it; the server goes IDLE, and the worker becomes
> + RUNNING in its place;
> +* when a running worker blocks in the kernel, it becomes BLOCKED, its
> + associated server becomes RUNNING and the server's sys_umcg_wait() call
> + from the bullet above returns; this transition is sometimes called
> + "block detection";
> +* when the blocking operation of a BLOCKED worker completes, the worker
> + becomes IDLE and is added to the list of idle workers; if there is an
> + idle server waiting, the kernel wakes it; this transition is sometimes
> + called "wake detection";
> +* running workers can voluntarily suspend their execution (wait),
^^^^^^^
RUNNING
> + becoming IDLE; their associated servers are woken;
> +* a RUNNING worker can context-switch with an IDLE worker; the server of
> + the switched-out worker is transferred to the switched-in worker;
> +* any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
> + this is a server running the worker as described in the first bullet in
> + this list, the worker remains IDLE but is added to the idle workers list;
> + this "wake" operation exists for completeness, to make sure
> + wait/wake/context-switch operations are available for all UMCG tasks;
> +* the userspace can preempt a RUNNING worker by marking it
> + RUNNING|PREEMPTED and sending a signal to it; the userspace should have
> + installed a NOP signal handler for the signal; the kernel will then
> + transition the worker into IDLE|PREEMPTED state and wake its associated
> + server.
> +
> +
> +UMCG TASK STATES
> +
> +Important: all state transitions described below involve at least two
> +steps: the change of the state field in struct umcg_task, for example
> +RUNNING to IDLE, and the corresponding change in struct task_struct state,
> +for example a transition between the task running on CPU and being
> +descheduled and removed from the kernel runqueue. The key principle of UMCG
> +API design is that the party initiating the state transition modifies the
> +state variable.
> +
> +For example, a task going IDLE first changes its state from RUNNING to IDLE
> +in the userspace and then calls sys_umcg_wait(), which completes the
> +transition.
> +
> +Note on documentation: in include/uapi/linux/umcg.h, task states have the
> +form UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, etc. In this document these are
> +usually referred to simply as RUNNING and BLOCKED, unless it creates
> +ambiguity. Task state flags, e.g. UMCG_TF_PREEMPTED, are treated similarly.
> +
> +UMCG task states reflect the view from the userspace, rather than from the
> +kernel. There are three fundamental task states:
> +
> +* RUNNING: indicates that the task is schedulable by the kernel; applies
> + to both servers and workers;
> +* IDLE: indicates that the task is not schedulable by the kernel (see
> + umcg_idle_loop() in kernel/sched/umcg.c); applies to both servers and
> + workers;
> +* BLOCKED: indicates that the worker is blocked in the kernel; does not
> + apply to servers.
> +
> +In addition to the three states above, two state flags help with state
> +transitions:
> +
> +* LOCKED: the userspace is preparing the worker for a state transition
> + and "locks" the worker until the worker is ready for the kernel to act
> + on the state transition; used similarly to preempt_disable or
> + irq_disable in the kernel; applies only to workers in RUNNING or IDLE
> + state; RUNNING|LOCKED means "this worker is about to become RUNNING",
> + while IDLE|LOCKED means "this worker is about to become IDLE or
> + unregister";
> +* PREEMPTED: the userspace indicates it wants the worker to be preempted;
> + there are no situations when both LOCKED and PREEMPTED flags are set at
> + the same time.
> +
> +
> +STRUCT UMCG_TASK
> +
> +From include/uapi/linux/umcg.h:
> +
> +struct umcg_task {
> + uint64_t state_ts; /* r/w */
> + uint32_t next_tid; /* r */
> + uint32_t flags; /* reserved */
> + uint64_t idle_workers_ptr; /* r/w */
> + uint64_t idle_server_tid_ptr; /* r */
> +};
> +
> +Each UMCG task is identified by struct umcg_task, which is provided to the
> +kernel when the task is registered via sys_umcg_ctl().
> +
> +* uint64_t state_ts: the current state of the task this struct
> + identifies, as described in the previous section, combined with a
> + unique timestamp indicating when the last state change happened.
> +
> + Readable/writable by both the kernel and the userspace.
> +
> + bits 0 - 5: task state (RUNNING, IDLE, BLOCKED);
> + bits 6 - 7: state flags (LOCKED, PREEMPTED);
> + bits 8 - 12: reserved; must be zeroes;
> + bits 13 - 17: for userspace use;
> + bits 18 - 63: timestamp.
> +
> + Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> +
> + It is highly beneficial to tag each state change with a unique
> + timestamp:
> +
> + - timestamps will naturally provide instrumentation to measure
> + scheduling delays, both in the kernel and in the userspace;
> + - uniqueness of timestamps (modulo overflow) guarantees that state
> + change races, especially ABA races, are easily detected and avoided.
> +
> + Each timestamp represents the moment in time the state change happened,
> + in nanoseconds, with the lower 4 bits and the upper 14 bits stripped.
> +
> + In this document 'umcg_task.state' is often used to talk about
> + 'umcg_task.state_ts' field, as timestamps do not carry semantic
> + meaning at the moment.
> +
> + This is how umcg_task.state_ts is updated in the kernel:
> +
> + /* kernel side */
> + /**
> + * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
> + * @state_ts - points to the state_ts member of struct umcg_task to update;
> + * @expected - the expected value of state_ts, including the timestamp;
> + * @desired - the desired value of state_ts, state part only;
> + * @may_fault - whether to use normal or _nofault cmpxchg.
> + *
> + * The function is basically cmpxchg(state_ts, expected, desired), with extra
> + * code to set the timestamp in @desired.
> + */
> + static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
> + bool may_fault)
> + {
> + u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
> + u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
> +
> + /* Cut higher order bits. */
> + next_ts &= ((1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1);
> +
> + if (next_ts == curr_ts)
> + ++next_ts;
> +
> + /* Remove an old timestamp, if any. */
> + desired &= ((1ULL << (64 - UMCG_STATE_TIMESTAMP_BITS)) - 1);
> +
> + /* Set the new timestamp. */
> + desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
> +
> + if (may_fault)
> + return cmpxchg_user_64(state_ts, expected, desired);
> +
> + return cmpxchg_user_64_nofault(state_ts, expected, desired);
> + }
> +
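Not part of the patch, but to make the bit layout above concrete: a userspace
runtime could unpack state_ts and flip the state bits roughly as in the sketch
below. It uses the uapi names added by this series; umcg_set_state() is a
hypothetical helper that, unlike the kernel code above, keeps the old timestamp
and userspace bits for brevity.

    #include <stdbool.h>
    #include <stdint.h>
    #include <linux/umcg.h>        /* the uapi header added by this series */

    static inline uint64_t umcg_state(uint64_t state_ts)
    {
            return state_ts & UMCG_TASK_STATE_MASK;         /* bits 0 - 5 */
    }

    static inline uint64_t umcg_state_flags(uint64_t state_ts)
    {
            return state_ts & (UMCG_TF_LOCKED | UMCG_TF_PREEMPTED);
    }

    static inline uint64_t umcg_timestamp_ns(uint64_t state_ts)
    {
            /* 46-bit CLOCK_MONOTONIC value at 16ns granularity. */
            return (state_ts >> (64 - UMCG_STATE_TIMESTAMP_BITS))
                            << UMCG_STATE_TIMESTAMP_GRANULARITY;
    }

    /*
     * Compare-and-swap the state/flag bits only; a real runtime would also
     * refresh the timestamp, as umcg_update_state() does on the kernel side.
     */
    static inline bool umcg_set_state(uint64_t *state_ts, uint64_t expected,
                                      uint64_t desired)
    {
            uint64_t prev = __atomic_load_n(state_ts, __ATOMIC_ACQUIRE);

            do {
                    if ((prev & UMCG_TASK_STATE_MASK_FULL) != expected)
                            return false;
            } while (!__atomic_compare_exchange_n(state_ts, &prev,
                            (prev & ~UMCG_TASK_STATE_MASK_FULL) | desired,
                            false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));

            return true;
    }
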
> +* uint32_t next_tid: contains the TID of the task to context-switch-into
> + in sys_umcg_wait(); can be zero; writable by the userspace, readable by
> + the kernel; if this is a RUNNING worker, this field contains the TID of
> + the server that should be woken when this worker blocks; see
> + sys_umcg_wait() for more details;
> +
> +* uint32_t flags: reserved; must be zero.
> +
> +* uint64_t idle_workers_ptr: this field forms a single-linked list of
> + idle workers: all RUNNING workers have this field set to point to the
> + head of the list (a pointer variable in the userspace).
> +
> + When a worker's blocking operation in the kernel completes, the kernel
> + changes the worker's state from BLOCKED to IDLE and adds the worker to
> + the top of the list of idle workers using this logic:
> +
> + /* kernel side */
> + /**
> + * enqueue_idle_worker - push an idle worker onto idle_workers_ptr
> + * list/stack.
> + *
> + * Returns true on success, false on a fatal failure.
> + */
> + static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
> + {
> + u64 __user *node = &ut_worker->idle_workers_ptr;
> + u64 __user *head_ptr;
> + u64 first = (u64)node;
> + u64 head;
> +
> + if (get_user_nosleep(head, node) || !head)
> + return false;
> +
> + head_ptr = (u64 __user *)head;
> +
> + if (put_user_nosleep(UMCG_IDLE_NODE_PENDING, node))
> + return false;
> +
> + if (xchg_user_64(head_ptr, &first))
> + return false;
> +
> + if (put_user_nosleep(first, node))
> + return false;
> +
> + return true;
> + }
> +
> + In the userspace the list is cleared atomically using this logic:
> +
> + /* userspace side */
> + uint64_t *idle_workers;
> +
> + idle_workers = (uint64_t *)atomic_exchange(head, NULL);
> +
> + The userspace re-points workers' idle_workers_ptr to the list head
> + variable before the worker is allowed to become RUNNING again.
> +
> + When processing the idle workers list, the userspace should wait for
> + workers marked as UMCG_IDLE_NODE_PENDING to have the flag cleared (see
> + enqueue_idle_worker() above).
> +
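Not part of the patch, but to make the protocol concrete: the userspace side
could drain the stack roughly as below. This is a sketch; drain_idle_workers()
is hypothetical, and it assumes the head variable is NULL between drains, so
the oldest node's link value of 0 terminates the walk (consistent with
enqueue_idle_worker() above).

    #include <stdatomic.h>
    #include <stdint.h>
    #include <linux/umcg.h>

    static void drain_idle_workers(_Atomic uint64_t *head)
    {
            /* Atomically take the whole stack; the kernel keeps pushing any
             * newly idle workers onto *head afterwards. */
            uint64_t node = atomic_exchange(head, 0);

            while (node) {
                    _Atomic uint64_t *field = (_Atomic uint64_t *)node;
                    uint64_t next;

                    /* Wait for the kernel to finish linking this node,
                     * see enqueue_idle_worker() above. */
                    while ((next = atomic_load(field)) == UMCG_IDLE_NODE_PENDING)
                            ;

                    /*
                     * 'field' is &worker->idle_workers_ptr: hand this worker
                     * to the scheduler's own run queue here, and re-point the
                     * field at 'head' before letting the worker run again.
                     */

                    node = next;
            }
    }
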
> +* uint64_t idle_server_tid_ptr: points to a variable in the userspace
> + that points to an idle server, i.e. a server in IDLE state waiting in
> + sys_umcg_wait(); read-only; workers must have this field set; not used
> + in servers.
> +
> + When a worker's blocking operation in the kernel completes, the kernel
> + changes the worker's state from BLOCKED to IDLE, adds the worker to the
> + list of idle workers, and wakes the idle server if present; the kernel
> + atomically exchanges (*idle_server_tid_ptr) with 0, thus waking the idle
> + server, if present, only once. See State transitions below for more
> + details.
> +
> +
> +SYS_UMCG_CTL()
> +
> +int sys_umcg_ctl(uint32_t flags, struct umcg_task *self) is used to
> +register or unregister the current task as a worker or server. Flags can be
> +one of the following:
> +
> + UMCG_CTL_REGISTER: register a server;
> + UMCG_CTL_REGISTER | UMCG_CTL_WORKER: register a worker;
> + UMCG_CTL_UNREGISTER: unregister the current server or worker.
> +
> +When registering a task, self must point to struct umcg_task describing
> +this server or worker; the pointer must remain valid until the task is
> +unregistered.
> +
> +When registering a server, self->state must be RUNNING; all other fields in
> +self must be zeroes.
> +
> +When registering a worker, self->state must be RUNNING;
^^^^^^^
IDLE
After looking through the document and code, I feel a newly registered worker's
state should be IDLE.
+
+A worker cannot be RUNNING without having a server associated with it, so
+when a task is first registered as a worker, it enters the IDLE state.
+
> +self->idle_server_tid_ptr and self->idle_workers_ptr must be valid pointers
> +as described in struct umcg_task; self->next_tid must be zero.
> +
> +When unregistering a task, self must be NULL.
> +
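For illustration only, registering the calling thread as a server might look
like the sketch below. It assumes the uapi header and the provisional x86_64
syscall number (449) added by this series; umcg_register_server() is a
hypothetical helper.

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/umcg.h>

    #ifndef __NR_umcg_ctl
    #define __NR_umcg_ctl 449       /* from this series; may change */
    #endif

    /* One struct umcg_task per thread, usually in TLS; the uapi struct is
     * already 64-byte aligned. */
    static __thread struct umcg_task umcg_self;

    static long umcg_register_server(void)
    {
            memset(&umcg_self, 0, sizeof(umcg_self));
            umcg_self.state_ts = UMCG_TASK_RUNNING; /* required at registration */

            return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &umcg_self);
    }
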
> +
> +SYS_UMCG_WAIT()
> +
> +int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout) operates on
> +registered UMCG servers and workers: struct umcg_task *self provided to
> +sys_umcg_ctl() when registering the current task is consulted in addition
> +to flags and abs_timeout parameters.
> +
> +The function can be used to perform one of the three operations:
> +
> +* wait: if self->next_tid is zero, sys_umcg_wait() puts the current
> + task to sleep;
> +* wake: if self->next_tid is not zero, and flags & UMCG_WAIT_WAKE_ONLY,
> + the task identified by next_tid is woken;
> +* context switch: if self->next_tid is not zero, and !(flags &
> + UMCG_WAIT_WAKE_ONLY), the current task is put to sleep and the next task
> + is woken, synchronously switching between the tasks on the current CPU
> + on the fast path.
> +
> +Flags can be zero or a combination of the following values:
> +
> +* UMCG_WAIT_WAKE_ONLY: wake the next task, don't put the current task to
> + sleep;
> +* UMCG_WAIT_WF_CURRENT_CPU: wake the next task on the current CPU; this
> + flag has an effect only if UMCG_WAIT_WAKE_ONLY is set: context switching
> + is always attempted to happen on the current CPU.
> +
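Continuing the registration sketch above, the "wake" operation could then be
issued as below (again a sketch, with the provisional syscall number 450 from
this series; umcg_wake() is hypothetical).

    #ifndef __NR_umcg_wait
    #define __NR_umcg_wait 450      /* from this series; may change */
    #endif

    static long umcg_wake(struct umcg_task *self, uint32_t target_tid)
    {
            /* The kernel takes the wakee's TID from self->next_tid. */
            __atomic_store_n(&self->next_tid, target_tid, __ATOMIC_RELEASE);

            return syscall(__NR_umcg_wait, UMCG_WAIT_WAKE_ONLY,
                           /* abs_timeout */ 0ULL);
    }
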
> +The section below provides more details on how servers and workers interact
> +via sys_umcg_wait(), during worker block/wake events, and during worker
> +preemption.
> +
> +
> +STATE TRANSITIONS
> +
> +As mentioned above, the key principle of UMCG state transitions is that the
> +party initiating the state transition modifies the state of affected tasks.
> +
> +Below, "TASK:STATE" indicates a task T, where T can be either W for worker
> +or S for server, in state S, where S can be one of the three states,
> +potentially ORed with a state flag. Each individual state transition is an
> +atomic operation (cmpxchg) unless indicated otherwise. Also note that the
> +order of state transitions is important and is part of the contract between
> +the userspace and the kernel. The kernel is free to kill the task (SIGKILL)
> +if the contract is broken.
> +
> +Some worker state transitions below include adding LOCKED flag to worker
> +state. This is done to indicate to the kernel that the worker is
..worker is +in the+
> +transitioning state and should not participate in the block/wake detection
> +routines, which can happen due to interrupts/pagefaults/signals.
> +
> +IDLE|LOCKED means that a running worker is preparing to sleep, so
> +interrupts should not lead to server wakeup; RUNNING|LOCKED means that an
> +idle worker is going to be "scheduled to run", but may not yet have its
> +server set up properly.
> +
> +Key state transitions:
> +
> +* server to worker context switch ("schedule a worker to run"):
> + S:RUNNING+W:IDLE => S:IDLE+W:RUNNING:
> + in the userspace, in the context of the server S running:
> + S:RUNNING => S:IDLE (mark self as idle)
> + W:IDLE => W:RUNNING|LOCKED (mark the worker as running)
> + W.next_tid := S.tid; S.next_tid := W.tid (link the server with
> + the worker)
> + W:RUNNING|LOCKED => W:RUNNING (unlock the worker)
> + S: sys_umcg_wait() (make the syscall)
> + the kernel context switches from the server to the worker; the
> + server sleeps until it becomes RUNNING during one of the
> + transitions below;
> +
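For illustration, the same transition expressed in C, building on the
hypothetical umcg_set_state() helper and syscall-number defines from the
earlier sketches (error handling and the timestamp refresh a real runtime
would need are omitted):

    static long server_run_worker(struct umcg_task *server, uint32_t server_tid,
                                  struct umcg_task *worker, uint32_t worker_tid)
    {
            if (!umcg_set_state(&server->state_ts, UMCG_TASK_RUNNING,
                                UMCG_TASK_IDLE))
                    return -1;      /* mark self (the server) as idle */
            if (!umcg_set_state(&worker->state_ts, UMCG_TASK_IDLE,
                                UMCG_TASK_RUNNING | UMCG_TF_LOCKED))
                    return -1;      /* mark the worker as running, locked */

            worker->next_tid = server_tid;  /* link the worker with its server */
            server->next_tid = worker_tid;  /* context-switch target */

            umcg_set_state(&worker->state_ts, UMCG_TASK_RUNNING | UMCG_TF_LOCKED,
                           UMCG_TASK_RUNNING);  /* unlock the worker */

            /* Sleeps until this server is marked RUNNING again, e.g. by
             * block detection or a worker yield. */
            return syscall(__NR_umcg_wait, 0, 0ULL);
    }
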
> +* worker to server context switch (worker "yields"): S:IDLE+W:RUNNING =>
> +S:RUNNING+W:IDLE:
> + in the userspace, in the context of the worker W running (note that
> + a running worker has its next_tid set to point to its server):
> + W:RUNNING => W:IDLE|LOCKED (mark self as idle)
> + S:IDLE => S:RUNNING (mark the server as running)
> + W: sys_umcg_wait() (make the syscall)
> + the kernel removes the LOCKED flag from the worker's state and
> + context switches from the worker to the server; the worker sleeps
> + until it becomes RUNNING;
> +
> +* worker to worker context switch: W1:RUNNING+W2:IDLE =>
> + W1:IDLE+W2:RUNNING:
> + in the userspace, in the context of W1 running:
> + W2:IDLE => W2:RUNNING|LOCKED (mark W2 as running)
> + W1:RUNNING => W1:IDLE|LOCKED (mark self as idle)
> + W2.next_tid := W1.next_tid; S.next_tid := W2.tid (transfer the
> + server W1 => W2)
> + W1:next_tid := W2.tid (indicate that W1 should context-switch
> + into W2)
> + W2:RUNNING|LOCKED => W2:RUNNING (unlock W2)
> + W1: sys_umcg_wait() (make the syscall)
> + same as above, the kernel removes the LOCKED flag from the W1's
> + state and context switches to next_tid;
> +
> +* worker wakeup: W:IDLE => W:RUNNING:
> + in the userspace, a server S can wake a worker W without "running"
> + it:
> + S:next_tid :=W.tid
> + W:next_tid := 0
> + W:IDLE => W:RUNNING
> + sys_umcg_wait(UMCG_WAIT_WAKE_ONLY) (make the syscall)
> + the kernel will wake the worker W; as the worker does not have a
> + server assigned, "wake detection" will happen, the worker will be
> + immediately marked as IDLE and added to idle workers list; an idle
> + server, if any, will be woken (see 'wake detection' below);
> +
> + Note: if needed, it is possible for a worker to wake another
> + worker: the waker marks itself "IDLE|LOCKED", points its next_tid
> + to the wakee, makes the syscall, restores its server in next_tid,
> + marks itself as RUNNING.
> +
> +* block detection: worker blocks in the kernel: S:IDLE+W:RUNNING =>
> + S:RUNNING+W:BLOCKED:
> + when a worker blocks in the kernel in RUNNING state (not LOCKED),
> + before descheduling the task from the CPU the kernel performs
> + these operations:
> + W:RUNNING => W:BLOCKED
> + S := W.next_tid
> + S:IDLE => S:RUNNING
> + try_to_wake_up(S)
> + if any of the first three operations above fail, the worker is
> + killed via SIGKILL. Note that ttwu(S) is not required to succeed,
> + as the server may still be transitioning to sleep in
> + sys_umcg_wait(); before actually putting the server to sleep its
> + UMCG state is checked and, if it is RUNNING, sys_umcg_wait()
> + returns to the userspace;
> + if the worker has its LOCKED flag set, block detection does not
> + trigger, as the worker is assumed to be in the userspace
> + scheduling code.
> +
> +* wake detection: worker wakes in the kernel: W:BLOCKED => W:IDLE:
> + all workers' returns to the userspace are intercepted:
> + start: (a label)
> + if W:RUNNING & W.next_tid != 0: let the worker exit to the
> + userspace, as this is a RUNNING worker with a server;
> + W:* => W:IDLE (previously blocked workers, or workers woken
> + without servers, are not allowed to return to the userspace);
> + the worker is appended to W.idle_workers_ptr idle workers list;
> + S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S.RUNNING;
> + ttwu(S)
> + idle_loop(W): this is the same idle loop that sys_umcg_wait()
> + uses: it breaks only when the worker becomes RUNNING; when
> + the idle loop exits, it is assumed that the userspace has
> + properly removed the worker from the idle workers list
> + before marking it RUNNING;
> + goto start; (repeat from the beginning).
> +
> + the logic above is a bit more complicated in the presence of
> + LOCKED or PREEMPTED flags, but the main invariants
> + stay the same:
> + only RUNNING workers with servers assigned are allowed to run
> + in the userspace (unless LOCKED);
> + newly IDLE workers are added to the idle workers list; any
> + user-initiated state change assumes the userspace
> + properly removed the worker from the list;
> + as with wake detection, any "breach of contract" by the
> + userspace will result in the task termination via SIGKILL.
> +
> +* worker preemption: S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED:
> + when the userspace wants to preempt a RUNNING worker, it changes it
> + state, atomically, RUNNING => RUNNING|PREEMPTED and sends a
> + signal to the worker via tgkill(); the signal handler, previously
> + set up by the userspace, can be a NOP (note that only RUNNING
> + workers can be preempted);
> +
> + if the worker, at the moment the signal arrived, continued to be
> + running on-CPU in the userspace, the "wake detection" code will be
> + triggered that, in addition to what was described above, will
> + check if the worker is in RUNNING|PREEMPTED state:
> + W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED
> + S := W.next_tid
> + S:IDLE => S:RUNNING
> + try_to_wake_up(S)
> +
> + if the signal arrives after the worker blocks in the kernel,
> + the "block detection" happened as described above, with the
> + following change:
> + W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED
> + S := W.next_tid
> + S:IDLE => S:RUNNING
> + try_to_wake_up(S)
> +
> + in any case, the worker's server is woken, with its attached
> + worker (S.next_tid) either in BLOCKED|PREEMPTED or IDLE|PREEMPTED
> + state.
> +
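A rough userspace-side sketch of the preemption step above, continuing the
earlier sketches (SIGUSR1 is an arbitrary choice for the signal with a no-op
handler; umcg_set_state() is the hypothetical helper that ignores timestamps):

    #include <signal.h>
    #include <sys/types.h>

    static bool umcg_preempt_worker(struct umcg_task *worker,
                                    pid_t tgid, pid_t worker_tid)
    {
            /* Only a RUNNING worker (with no state flags set) can be
             * preempted. */
            if (!umcg_set_state(&worker->state_ts, UMCG_TASK_RUNNING,
                                UMCG_TASK_RUNNING | UMCG_TF_PREEMPTED))
                    return false;

            /* Kick the worker; on its way out of the kernel it is moved to
             * IDLE|PREEMPTED (or BLOCKED|PREEMPTED) and its server woken. */
            return syscall(SYS_tgkill, tgid, worker_tid, SIGUSR1) == 0;
    }
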
> +
> +SERVER-ONLY USE CASES
> +
> +Some workloads/applications may benefit from fast and synchronous on-CPU
> +user-initiated context switches without the need for full userspace
> +scheduling (block/wake detection). These applications can use "standalone"
> +UMCG servers to wait/wake/context-switch. At the moment only in-process
> +operations are allowed. In the future this restriction will be lifted,
> +and wait/wake/context-switch operations between servers in related processes
> +will be permitted (when it is safe to do so, e.g. if the processes belong
> +to the same user and/or cgroup).
> +
> +These "worker-less" operations involve trivial RUNNING <==> IDLE state
> +changes, not discussed here for brevity.
> --
> 2.25.1
>
Thanks,
Tao
On Tue, Oct 12, 2021 at 04:25:20PM -0700, Peter Oskolkov wrote:
> Define struct umcg_task and two syscalls: sys_umcg_ctl sys_umcg_wait.
>
> User Managed Concurrency Groups is an M:N threading toolkit that allows
> constructing user space schedulers designed to efficiently manage
> heterogeneous in-process workloads while maintaining high CPU
> utilization (95%+).
>
> In addition, M:N threading and cooperative user space scheduling
> enables synchronous coding style and better cache locality when
> compared to asynchronous callback/continuation style of programming.
>
> The UMCG kernel API is built around the following ideas:
>
> * UMCG server: a task/thread representing "kernel threads", or (v)CPUs;
> * UMCG worker: a task/thread representing "application threads", to be
> scheduled over servers;
> * UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
> server or a worker) can be in;
> * UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
> can be ORed with the task state to communicate additional information to
> the kernel;
> * struct umcg_task: a per-task userspace set of data fields, usually
> residing in the TLS, that fully reflects the current task's UMCG state
> and controls the way the kernel manages the task;
> * sys_umcg_ctl(): a syscall used to register the current task/thread as a
> server or a worker, or to unregister a UMCG task;
> * sys_umcg_wait(): a syscall used to put the current task to sleep and/or
> wake another task, potentially context-switching between the two tasks
> on-CPU synchronously.
>
> In short, servers can be thought of as CPUs over which application
> threads (workers) are scheduled; at any one time a worker is either:
> - RUNNING: has a server and is schedulable by the kernel;
> - BLOCKED: blocked in the kernel (e.g. on I/O, or a futex);
> - IDLE: is not blocked, but cannot be scheduled by the kernel to
> run because it has no server assigned to it (e.g. because all
> available servers are busy "running" other workers).
>
> Usually the number of servers in a process is equal to the number of
> CPUs available to the kernel if the process is supposed to consume
> the whole machine, or less than the number of CPUs available if the
> process is sharing the machine with other workloads. The number of
> workers in a process can grow very large: tens of thousands is normal;
> hundreds of thousands and more (millions) is something that would
> be desirable to achieve in the future, as lightweight userspace
> threads in Java and Go easily scale to millions, and UMCG workers
> are (intended to be) conceptually similar to those.
>
> Detailed use cases and API behavior are provided in
> Documentation/userspace-api/umcg.[txt|rst] (see sibling patches).
>
> Some high-level implementation notes:
>
> UMCG tasks (workers and servers) are "tagged" with struct umcg_task
> residing in userspace (usually in TLS) to facilitate kernel/userspace
> communication. This makes the kernel-side code much simpler (see e.g.
> the implementation of sys_umcg_wait), but also requires some careful
> uaccess handling and page pinning (see below).
>
> The main UMCG server/worker interaction looks like:
>
> a. worker W1 is RUNNING, with a server S attached to it sleeping
> in IDLE state;
> b. worker W1 blocks in the kernel, e.g. on I/O;
> c. the kernel marks W1 as BLOCKED, the attached server S
> as RUNNING, and wakes S (the "block detection" event);
> d. the server now picks another IDLE worker W2 to run: marks
> W2 as RUNNING, itself as IDLE, and calls sys_umcg_wait();
> e. when the blocking operation of W1 completes, the worker
> is marked by the kernel as IDLE and added to idle workers list
> (see struct umcg_task) for the userspace to pick up and
> later run (the "wake detection" event).
>
> While there are additional operations such as worker-to-worker
> context switch, preemption, workers "yielding", etc., the "workflow"
> above is the main worker/server interaction that drives the
> implementation.
>
> Specifically:
>
> - most operations are conceptually context switches:
> - scheduling a worker: a running server goes to sleep and "runs"
> a worker in its place;
> - block detection: worker is descheduled, and its server is woken;
> - wake detection: woken worker, running in the kernel, is descheduled,
> and if there is an idle server, it is woken to process the wake
> detection event;
> - to facilitate low scheduling latencies and cache locality, most
> server/worker interactions described above are performed synchronously
> "on CPU" via WF_CURRENT_CPU flag passed to ttwu; while at the moment
> the context switches are simulated by putting the switch-out task to
> sleep and waking the switch-into task on the same cpu, it is very much
> the long-term goal of this project to make the context switch much
> lighter, by tweaking runtime accounting and, maybe, even bypassing
> __schedule();
> - worker blocking is detected in a hook to sched_submit_work; as mentioned
> above, the server is to be woken on the same CPU, synchronously;
> this code may not pagefault, so to access worker's and server's
> userspace memory (struct umcg_task), memory pages containing the worker's
> and the server's structs umcg_task are pinned when the worker is
> exiting to the userspace, and unpinned when the worker is descheduled;
> - worker wakeup is detected in a hook to sched_update_worker, and processed
> in the exit to usermode loop (via TIF_NOTIFY_RESUME); workers CAN
> pagefault on the wakeup path;
> - worker preemption is implemented by the userspace tagging the worker
> with UMCG_TF_PREEMPTED state flag and sending a NOOP signal to it;
> on the exit to usermode the worker is intercepted and its server is woken
> (see Documentation/userspace-api/umcg.[txt|rst] for more details);
> - each state change is tagged with a unique timestamp (of MONOTONIC
> variety), so that
> - scheduling instrumentation is naturally available;
> - racing state changes are easily detected and ABA issues are
> avoided;
> see umcg_update_state() in umcg.c for implementation details, and
> Documentation/userspace-api/umcg.[txt|rst] for a higher-level
> description.
>
> The previous version of the patchset can be found at
> https://lore.kernel.org/all/[email protected]/
> containing some additional context and links to earlier discussions.
>
> More details are available in Documentation/userspace-api/umcg.[txt|rst]
> in sibling patches, and in doc-comments in the code.
>
> Signed-off-by: Peter Oskolkov <[email protected]>
> ---
> arch/x86/entry/syscalls/syscall_64.tbl | 2 +
> fs/exec.c | 1 +
> include/linux/sched.h | 71 ++
> include/linux/syscalls.h | 3 +
> include/uapi/asm-generic/unistd.h | 6 +-
> include/uapi/linux/umcg.h | 137 ++++
> init/Kconfig | 10 +
> kernel/entry/common.c | 4 +-
> kernel/exit.c | 5 +
> kernel/sched/Makefile | 1 +
> kernel/sched/core.c | 9 +-
> kernel/sched/umcg.c | 926 +++++++++++++++++++++++++
> kernel/sys_ni.c | 4 +
> 13 files changed, 1175 insertions(+), 4 deletions(-)
> create mode 100644 include/uapi/linux/umcg.h
> create mode 100644 kernel/sched/umcg.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 18b5500ea8bf..cb71f383060f 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -370,6 +370,8 @@
> 446 common landlock_restrict_self sys_landlock_restrict_self
> 447 common memfd_secret sys_memfd_secret
> 448 common process_mrelease sys_process_mrelease
> +449 common umcg_ctl sys_umcg_ctl
> +450 common umcg_wait sys_umcg_wait
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/fs/exec.c b/fs/exec.c
> index a098c133d8d7..dfa24bb99a97 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1840,6 +1840,7 @@ static int bprm_execve(struct linux_binprm *bprm,
> current->fs->in_exec = 0;
> current->in_execve = 0;
> rseq_execve(current);
> + umcg_execve(current);
> acct_update_integrals(current);
> task_numa_free(current, false);
> return retval;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 343603f77f8b..c7e812ceec3c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -67,6 +67,7 @@ struct sighand_struct;
> struct signal_struct;
> struct task_delay_info;
> struct task_group;
> +struct umcg_task;
>
> /*
> * Task state bitmask. NOTE! These bits are also
> @@ -1296,6 +1297,12 @@ struct task_struct {
> unsigned long rseq_event_mask;
> #endif
>
> +#ifdef CONFIG_UMCG
> + struct umcg_task __user *umcg_task;
> + struct page *pinned_umcg_worker_page; /* self */
> + struct page *pinned_umcg_server_page;
> +#endif
> +
> struct tlbflush_unmap_batch tlb_ubc;
>
> union {
> @@ -1688,6 +1695,13 @@ extern struct pid *cad_pid;
> #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
> #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> +
> +#ifdef CONFIG_UMCG
> +#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */
> +#else
> +#define PF_UMCG_WORKER 0x00000000
> +#endif
> +
> #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
> #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> #define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
> @@ -2275,6 +2289,63 @@ static inline void rseq_execve(struct task_struct *t)
>
> #endif
>
> +#ifdef CONFIG_UMCG
> +
> +void umcg_handle_resuming_worker(void);
> +void umcg_handle_exiting_worker(void);
> +void umcg_clear_child(struct task_struct *tsk);
> +
> +/* Called by bprm_execve() in fs/exec.c. */
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> + if (tsk->umcg_task)
> + umcg_clear_child(tsk);
> +}
> +
> +/* Called by exit_to_user_mode_loop() in kernel/entry/common.c.*/
> +static inline void umcg_handle_notify_resume(void)
> +{
> + if (current->flags & PF_UMCG_WORKER)
> + umcg_handle_resuming_worker();
> +}
> +
> +/* Called by do_exit() in kernel/exit.c. */
> +static inline void umcg_handle_exit(void)
> +{
> + if (current->flags & PF_UMCG_WORKER)
> + umcg_handle_exiting_worker();
> +}
> +
> +/*
> + * umcg_wq_worker_[sleeping|running] are called in core.c by
> + * sched_submit_work() and sched_update_worker().
> + */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk);
> +void umcg_wq_worker_running(struct task_struct *tsk);
> +
> +#else /* CONFIG_UMCG */
> +
> +static inline void umcg_clear_child(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_handle_notify_resume(void)
> +{
> +}
> +static inline void umcg_handle_exit(void)
> +{
> +}
> +static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> +}
> +
> +#endif
> +
> #ifdef CONFIG_DEBUG_RSEQ
>
> void rseq_syscall(struct pt_regs *regs);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 252243c7783d..97a05879da41 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -71,6 +71,7 @@ struct open_how;
> struct mount_attr;
> struct landlock_ruleset_attr;
> enum landlock_rule_type;
> +struct umcg_task;
>
> #include <linux/types.h>
> #include <linux/aio_abi.h>
> @@ -1052,6 +1053,8 @@ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type ru
> const void __user *rule_attr, __u32 flags);
> asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
> asmlinkage long sys_memfd_secret(unsigned int flags);
> +asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
> +asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
>
> /*
> * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 1c5fb86d455a..3e3d50de5137 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -879,9 +879,13 @@ __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
> #endif
> #define __NR_process_mrelease 448
> __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
> +#define __NR_umcg_ctl 449
> +__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
> +#define __NR_umcg_wait 450
> +__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 449
> +#define __NR_syscalls 451
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
> new file mode 100644
> index 000000000000..ce4c7980b837
> --- /dev/null
> +++ b/include/uapi/linux/umcg.h
> @@ -0,0 +1,137 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UMCG_H
> +#define _UAPI_LINUX_UMCG_H
> +
> +#include <linux/limits.h>
> +#include <linux/types.h>
> +
> +/*
> + * UMCG: User Managed Concurrency Groups.
> + *
> + * Syscalls (see kernel/sched/umcg.c):
> + * sys_umcg_ctl() - register/unregister UMCG tasks;
> + * sys_umcg_wait() - wait/wake/context-switch.
> + *
> + * struct umcg_task (below): controls the state of UMCG tasks.
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for details.
> + */
> +
> +/*
> + * UMCG task states, the first 6 bits of struct umcg_task.state_ts.
> + * The states represent the user space point of view.
> + */
> +#define UMCG_TASK_NONE 0ULL
> +#define UMCG_TASK_RUNNING 1ULL
> +#define UMCG_TASK_IDLE 2ULL
> +#define UMCG_TASK_BLOCKED 3ULL
> +
> +/* UMCG task state flags, bits 6-7 */
> +
> +/*
> + * UMCG_TF_LOCKED: locked by the userspace in preparation to calling umcg_wait.
> + */
> +#define UMCG_TF_LOCKED (1ULL << 6)
> +
> +/*
> + * UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted.
> + */
> +#define UMCG_TF_PREEMPTED (1ULL << 7)
> +
> +/* The first six bits: RUNNING, IDLE, or BLOCKED. */
> +#define UMCG_TASK_STATE_MASK 0x3fULL
> +
> +/* The full kernel state mask: the first 13 bits. */
> +#define UMCG_TASK_STATE_MASK_FULL 0x1fffULL
> +
> +/*
> + * The number of bits reserved for UMCG state timestamp in
> + * struct umcg_task.state_ts.
> + */
> +#define UMCG_STATE_TIMESTAMP_BITS 46
> +
> +/* The number of bits truncated from UMCG state timestamp. */
> +#define UMCG_STATE_TIMESTAMP_GRANULARITY 4
> +
> +/**
> + * struct umcg_task - controls the state of UMCG tasks.
> + *
> + * The struct is aligned at 64 bytes to ensure that it fits into
> + * a single cache line.
> + */
> +struct umcg_task {
> + /**
> + * @state_ts: the current state of the UMCG task described by
> + * this struct, with a unique timestamp indicating
> + * when the last state change happened.
> + *
> + * Readable/writable by both the kernel and the userspace.
> + *
> + * UMCG task state:
> + * bits 0 - 5: task state;
> + * bits 6 - 7: state flags;
> + * bits 8 - 12: reserved; must be zeroes;
> + * bits 13 - 17: for userspace use;
> + * bits 18 - 63: timestamp (see below).
> + *
> + * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> + * See Documentation/userspace-api/umcg.[txt|rst] for detals.
^^^^^^^
details.
> + */
> + uint64_t state_ts; /* r/w */
> +
> + /**
> + * @next_tid: the TID of the UMCG task that should be context-switched
> + * into in sys_umcg_wait(). Can be zero.
> + *
> + * Running UMCG workers must have next_tid set to point to IDLE
> + * UMCG servers.
> + *
> + * Read-only for the kernel, read/write for the userspace.
> + */
> + uint32_t next_tid; /* r */
> +
> + uint32_t flags; /* Reserved; must be zero. */
> +
> + /**
> + * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
> + *
> + * Readable/writable by both the kernel and the userspace: the
> + * kernel adds items to the list, the userspace removes them.
> + */
> + uint64_t idle_workers_ptr; /* r/w */
> +
> + /**
> + * @idle_server_tid_ptr: a pointer pointing to a single idle server.
> + * Readonly.
> + */
> + uint64_t idle_server_tid_ptr; /* r */
> +} __attribute__((packed, aligned(8 * sizeof(__u64))));
> +
> +/**
> + * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
> + * @UMCG_CTL_REGISTER: register the current task as a UMCG task
> + * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
> + * @UMCG_CTL_WORKER: register the current task as a UMCG worker
> + */
> +enum umcg_ctl_flag {
> + UMCG_CTL_REGISTER = 0x00001,
> + UMCG_CTL_UNREGISTER = 0x00002,
> + UMCG_CTL_WORKER = 0x10000,
> +};
> +
> +/**
> + * enum umcg_wait_flag - flags to pass to sys_umcg_wait
> + * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
> + * @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
> + * (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
> + * must be set.
> + */
> +enum umcg_wait_flag {
> + UMCG_WAIT_WAKE_ONLY = 1,
> + UMCG_WAIT_WF_CURRENT_CPU = 2,
> +};
> +
> +/* See Documentation/userspace-api/umcg.[txt|rst].*/
> +#define UMCG_IDLE_NODE_PENDING (1ULL)
> +
> +#endif /* _UAPI_LINUX_UMCG_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 11f8a845f259..b52a79cfb130 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1688,6 +1688,16 @@ config MEMBARRIER
>
> If unsure, say Y.
>
> +config UMCG
> + bool "Enable User Managed Concurrency Groups API"
> + depends on X86_64
> + default n
> + help
> + Enable User Managed Concurrency Groups API, which form the basis
> + for an in-process M:N userspace scheduling framework.
> + At the moment this is an experimental/RFC feature that is not
> + guaranteed to be backward-compatible.
> +
> config KALLSYMS
> bool "Load all symbols for debugging/ksymoops" if EXPERT
> default y
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index d5a61d565ad5..62453772a0c7 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -171,8 +171,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
> handle_signal_work(regs, ti_work);
>
> - if (ti_work & _TIF_NOTIFY_RESUME)
> + if (ti_work & _TIF_NOTIFY_RESUME) {
> + umcg_handle_notify_resume();
> tracehook_notify_resume(regs);
> + }
>
> /* Architecture specific TIF work */
> arch_exit_to_user_mode_work(regs, ti_work);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 63851320ae73..c55f9df430c8 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -745,6 +745,10 @@ void __noreturn do_exit(long code)
> if (unlikely(!tsk->pid))
> panic("Attempted to kill the idle task!");
>
> + /* Turn off UMCG sched hooks. */
> + if (unlikely(tsk->flags & PF_UMCG_WORKER))
> + tsk->flags &= ~PF_UMCG_WORKER;
> +
> /*
> * If do_exit is called because this processes oopsed, it's possible
> * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> @@ -781,6 +785,7 @@ void __noreturn do_exit(long code)
>
> io_uring_files_cancel();
> exit_signals(tsk); /* sets PF_EXITING */
> + umcg_handle_exit();
>
> /* sync mm's RSS info before statistics gathering */
> if (tsk->mm)
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 978fcfca5871..e4e481eee1b7 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
> obj-$(CONFIG_CPU_ISOLATION) += isolation.o
> obj-$(CONFIG_PSI) += psi.o
> obj-$(CONFIG_SCHED_CORE) += core_sched.o
> +obj-$(CONFIG_UMCG) += umcg.o
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d6da1efb5ce6..9ff63e32544a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4236,6 +4236,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> p->wake_entry.u_flags = CSD_TYPE_TTWU;
> p->migration_pending = NULL;
> #endif
> + umcg_clear_child(p);
> }
>
> DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> @@ -6265,9 +6266,11 @@ static inline void sched_submit_work(struct task_struct *tsk)
> * If a worker goes to sleep, notify and ask workqueue whether it
> * wants to wake up a task to maintain concurrency.
> */
> - if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> + if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
> if (task_flags & PF_WQ_WORKER)
> wq_worker_sleeping(tsk);
> + else if (task_flags & PF_UMCG_WORKER)
> + umcg_wq_worker_sleeping(tsk);
> else
> io_wq_worker_sleeping(tsk);
> }
> @@ -6285,9 +6288,11 @@ static inline void sched_submit_work(struct task_struct *tsk)
>
> static void sched_update_worker(struct task_struct *tsk)
> {
> - if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> + if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
> if (tsk->flags & PF_WQ_WORKER)
> wq_worker_running(tsk);
> + else if (tsk->flags & PF_UMCG_WORKER)
> + umcg_wq_worker_running(tsk);
> else
> io_wq_worker_running(tsk);
> }
> diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
> new file mode 100644
> index 000000000000..bc4eeb3f5dd7
> --- /dev/null
> +++ b/kernel/sched/umcg.c
> @@ -0,0 +1,926 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * User Managed Concurrency Groups (UMCG).
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for details.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/umcg.h>
> +
> +#include "sched.h"
> +
> +/**
> + * get_user_nofault - get user value without sleeping.
> + *
> + * get_user() might sleep and therefore cannot be used in preempt-disabled
> + * regions.
> + */
> +#define get_user_nofault(out, uaddr) \
> +({ \
> + int ret = -EFAULT; \
> + \
> + if (access_ok((uaddr), sizeof(*(uaddr)))) { \
> + pagefault_disable(); \
> + \
> + if (!__get_user((out), (uaddr))) \
> + ret = 0; \
> + \
> + pagefault_enable(); \
> + } \
> + ret; \
> +})
> +
> +/**
> + * umcg_pin_pages: pin pages containing struct umcg_task of this worker
> + * and its server.
> + *
> + * The pages are pinned when the worker exits to the userspace and unpinned
> + * when the worker is in sched_submit_work(), i.e. when the worker is
> + * about to be removed from its runqueue. Thus at most NR_CPUS UMCG pages
> + * are pinned at any one time across the whole system.
> + *
> + * The pinning is needed so that going-to-sleep workers can access
> + * their and their servers' userspace umcg_task structs without page faults,
> + * as the code path can be executed in the context of a pagefault, with
> + * mm lock held.
> + */
> +static int umcg_pin_pages(u32 server_tid)
> +{
> + struct umcg_task __user *worker_ut = current->umcg_task;
> + struct umcg_task __user *server_ut = NULL;
> + struct task_struct *tsk;
> +
> + rcu_read_lock();
> + tsk = find_task_by_vpid(server_tid);
> + /* Server/worker interaction is allowed only within the same mm. */
> + if (tsk && current->mm == tsk->mm)
> + server_ut = READ_ONCE(tsk->umcg_task);
> + rcu_read_unlock();
> +
> + if (!server_ut)
> + return -EINVAL;
> +
> + tsk = current;
> +
> + /* worker_ut is stable, don't need to repin */
> + if (!tsk->pinned_umcg_worker_page)
> + if (1 != pin_user_pages_fast((unsigned long)worker_ut, 1, 0,
> + &tsk->pinned_umcg_worker_page))
> + return -EFAULT;
> +
> + /* server_ut may change, need to repin */
> + if (tsk->pinned_umcg_server_page) {
> + unpin_user_page(tsk->pinned_umcg_server_page);
> + tsk->pinned_umcg_server_page = NULL;
> + }
> +
> + if (1 != pin_user_pages_fast((unsigned long)server_ut, 1, 0,
> + &tsk->pinned_umcg_server_page))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +static void umcg_unpin_pages(void)
> +{
> + struct task_struct *tsk = current;
> +
> + if (tsk->pinned_umcg_worker_page)
> + unpin_user_page(tsk->pinned_umcg_worker_page);
> + if (tsk->pinned_umcg_server_page)
> + unpin_user_page(tsk->pinned_umcg_server_page);
> +
> + tsk->pinned_umcg_worker_page = NULL;
> + tsk->pinned_umcg_server_page = NULL;
> +}
> +
> +static void umcg_clear_task(struct task_struct *tsk)
> +{
> + /*
> + * This is either called for the current task, or for a newly forked
> + * task that is not yet running, so we don't need strict atomicity
> + * below.
> + */
> + if (tsk->umcg_task) {
> + WRITE_ONCE(tsk->umcg_task, NULL);
> +
> + /* These can be simple writes - see the comment above. */
> + tsk->pinned_umcg_worker_page = NULL;
> + tsk->pinned_umcg_server_page = NULL;
> + tsk->flags &= ~PF_UMCG_WORKER;
> + }
> +}
> +
> +/* Called for a forked or execve-ed child. */
> +void umcg_clear_child(struct task_struct *tsk)
> +{
> + umcg_clear_task(tsk);
> +}
> +
> +/* Called both by normally (unregister) and abnormally exiting workers. */
> +void umcg_handle_exiting_worker(void)
> +{
> + umcg_unpin_pages();
> + umcg_clear_task(current);
> +}
> +
> +/**
> + * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
> + * @state_ts - points to the state_ts member of struct umcg_task to update;
> + * @expected - the expected value of state_ts, including the timestamp;
> + * @desired - the desired value of state_ts, state part only;
> + * @may_fault - whether to use normal or _nofault cmpxchg.
> + *
> + * The function is basically cmpxchg(state_ts, expected, desired), with extra
> + * code to set the timestamp in @desired.
> + */
> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
> + bool may_fault)
> +{
> + u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
> + u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
> +
> + /* Cut higher order bits. */
> + next_ts &= ((1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1);
> +
> + if (next_ts == curr_ts)
> + ++next_ts;
> +
> + /* Remove an old timestamp, if any. */
> + desired &= ((1ULL << (64 - UMCG_STATE_TIMESTAMP_BITS)) - 1);
> +
> + /* Set the new timestamp. */
> + desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
> +
> + if (may_fault)
> + return cmpxchg_user_64(state_ts, expected, desired);
> +
> + return cmpxchg_user_64_nofault(state_ts, expected, desired);
> +}
> +
> +/**
> + * sys_umcg_ctl: (un)register the current task as a UMCG task.
> + * @flags: ORed values from enum umcg_ctl_flag; see below;
> + * @self: a pointer to struct umcg_task that describes this
> + * task and governs the behavior of sys_umcg_wait if
> + * registering; must be NULL if unregistering.
> + *
> + * @flags & UMCG_CTL_REGISTER: register a UMCG task:
> + * UMCG workers:
> + * - @flags & UMCG_CTL_WORKER
> + * UMCG servers:
> + * - !(@flags & UMCG_CTL_WORKER)
> + *
> + * All tasks:
> + * - self->state must be UMCG_TASK_RUNNING
> + * - self->next_tid must be zero
> + *
> + * If the conditions above are met, sys_umcg_ctl() immediately returns
> + * if the registered task is a server; a worker will be added to
> + * idle_workers_ptr, and the worker put to sleep; an idle server
> + * from idle_server_tid_ptr will be woken, if present.
> + *
> + * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
> + * is a UMCG worker, the userspace is responsible for waking its
> + * server (before or after calling sys_umcg_ctl).
> + *
> + * Return:
> + * 0 - success
> + * -EFAULT - failed to read @self
> + * -EINVAL - some other error occurred
> + */
> +SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self)
> +{
> + struct umcg_task ut;
> +
> + if (flags == UMCG_CTL_UNREGISTER) {
> + if (self || !current->umcg_task)
> + return -EINVAL;
> +
> + if (current->flags & PF_UMCG_WORKER)
> + umcg_handle_exiting_worker();
> + else
> + umcg_clear_task(current);
> +
> + return 0;
> + }
> +
> + /* Register the current task as a UMCG task. */
> + if (!(flags & UMCG_CTL_REGISTER))
> + return -EINVAL;
> +
> + flags &= ~UMCG_CTL_REGISTER;
> + if (flags && flags != UMCG_CTL_WORKER)
> + return -EINVAL;
> +
> + if (current->umcg_task || !self)
> + return -EINVAL;
> +
> + if (copy_from_user(&ut, self, sizeof(ut)))
> + return -EFAULT;
> +
> + if (ut.next_tid)
> + return -EINVAL;
> +
> + if ((ut.state_ts & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_RUNNING)
> + return -EINVAL;
> +
> + WRITE_ONCE(current->umcg_task, self);
> +
> + if (flags == UMCG_CTL_WORKER) {
> + current->flags |= PF_UMCG_WORKER;
> +
> + /* Trigger umcg_handle_resuming_worker() */
> + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> + }
> +
> + return 0;
> +}
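For completeness, here is a minimal userspace sketch of registration. It is
not part of the patch; it assumes the series' uapi header is installed as
<linux/umcg.h> and that __NR_umcg_ctl is wired up for the target
architecture, and the wrapper names are invented for illustration.

#include <linux/umcg.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The per-task UMCG control block; typically placed in TLS. */
static __thread struct umcg_task umcg_self;

/* Hypothetical wrapper: register the calling thread as a server or worker. */
static long umcg_register(int worker)
{
	memset(&umcg_self, 0, sizeof(umcg_self));
	umcg_self.state_ts = UMCG_TASK_RUNNING;	/* required initial state */
	/* next_tid stays zero, as sys_umcg_ctl() requires. */

	return syscall(__NR_umcg_ctl,
		       UMCG_CTL_REGISTER | (worker ? UMCG_CTL_WORKER : 0),
		       &umcg_self);
}

/* Hypothetical wrapper: unregister the calling thread (@self must be NULL). */
static long umcg_unregister(void)
{
	return syscall(__NR_umcg_ctl, UMCG_CTL_UNREGISTER, NULL);
}

Per the kernel-doc above, a registering worker is parked on the idle-workers
list and does not resume in userspace until a server marks it RUNNING.
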
> +
> +/**
> + * handle_timedout_worker - make sure the worker is added to idle_workers
> + * upon a "clean" timeout.
> + */
> +static int handle_timedout_worker(struct umcg_task __user *self)
> +{
> + u64 curr_state, next_state;
> + int ret;
> +
> + if (get_user(curr_state, &self->state_ts))
> + return -EFAULT;
> +
> + if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE) {
> + /* TODO: should we care here about TF_LOCKED or TF_PREEMPTED? */
> +
> + next_state = curr_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_BLOCKED;
> +
> + ret = umcg_update_state(&self->state_ts, &curr_state, next_state, true);
> + if (ret)
> + return ret;
> +
> + return -ETIMEDOUT;
> + }
> +
> + return 0; /* Not really timed out. */
> +}
> +
> +/**
> + * umcg_idle_loop - sleep until the current task becomes RUNNING or the timeout expires
> + * @abs_timeout: absolute timeout in nanoseconds; zero => no timeout
> + *
> + * The function marks the current task as INTERRUPTIBLE and calls
> + * freezable_schedule(). It returns when either the timeout expires or
The last version used schedule() here, and it had a comment explaining why;
I don't know the freezable API well enough to judge this change.
> + * the UMCG state of the task becomes RUNNING.
> + *
> + * Note: because UMCG workers should not be running WITHOUT attached servers,
> + * and because servers should not be running WITH attached workers,
> + * the function returns early only when a fatal signal is pending, and it
> + * ignores/flushes all other signals.
> + */
> +static int umcg_idle_loop(u64 abs_timeout)
> +{
> + int ret;
> + struct page *pinned_page = NULL;
> + struct hrtimer_sleeper timeout;
> + struct umcg_task __user *self = current->umcg_task;
> +
> + if (abs_timeout) {
> + hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
> + HRTIMER_MODE_ABS);
> +
> + hrtimer_set_expires_range_ns(&timeout.timer, (s64)abs_timeout,
> + current->timer_slack_ns);
> + }
> +
> + while (true) {
> + u64 umcg_state;
> +
> + /*
> + * We need to read from userspace _after_ the task is marked
> + * TASK_INTERRUPTIBLE, to properly handle concurrent wakeups;
> + * but faulting is not allowed; so we try a fast no-fault read,
> + * and if it fails, pin the page temporarily.
> + */
> +retry_once:
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + /* Order set_current_state above with get_user_nofault below. */
> + smp_mb();
> + ret = -EFAULT;
> + if (get_user_nofault(umcg_state, &self->state_ts)) {
> + set_current_state(TASK_RUNNING);
> +
> + if (pinned_page)
> + goto out;
> + else if (1 != pin_user_pages_fast((unsigned long)self,
> + 1, 0, &pinned_page))
> + goto out;
> +
> + goto retry_once;
> + }
> +
> + if (pinned_page) {
> + unpin_user_page(pinned_page);
> + pinned_page = NULL;
> + }
> +
> + ret = 0;
> + if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
> + set_current_state(TASK_RUNNING);
> + goto out;
> + }
> +
> + if (abs_timeout)
> + hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> +
> + if (!abs_timeout || timeout.task) {
> + /* Clear PF_UMCG_WORKER to elide workqueue handlers. */
> + const bool worker = current->flags & PF_UMCG_WORKER;
> +
> + if (worker)
> + current->flags &= ~PF_UMCG_WORKER;
> +
> + freezable_schedule();
> +
> + if (worker)
> + current->flags |= PF_UMCG_WORKER;
> + }
> + __set_current_state(TASK_RUNNING);
> +
> + /*
> + * Check for timeout before checking the state, as workers
> + * are not going to return from schedule() unless
^^^^^^^^^^
freezable_schedule()
You switched to freezable_schedule() in this version, so this comment should
say freezable_schedule(); the related comment from the last version was also
lost. Just a small note.
> + * they are RUNNING.
> + */
> + ret = -ETIMEDOUT;
> + if (abs_timeout && !timeout.task)
> + goto out;
> +
> + ret = -EFAULT;
> + if (get_user(umcg_state, &self->state_ts))
> + goto out;
> +
> + ret = 0;
> + if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING)
> + goto out;
> +
> + ret = -EINTR;
> + if (fatal_signal_pending(current))
> + goto out;
> +
> + if (signal_pending(current))
> + flush_signals(current);
> + }
> +
> +out:
> + if (pinned_page) {
> + unpin_user_page(pinned_page);
> + pinned_page = NULL;
> + }
> +
> + if (abs_timeout) {
> + hrtimer_cancel(&timeout.timer);
> + destroy_hrtimer_on_stack(&timeout.timer);
> + }
> +
> + /* Workers must go through workqueue handlers upon wakeup. */
> + if (current->flags & PF_UMCG_WORKER) {
> + if (ret == -ETIMEDOUT)
> + ret = handle_timedout_worker(self);
> +
> + set_tsk_need_resched(current);
> + }
> +
> + return ret;
> +}
> +
> +/**
> + * umcg_wakeup_allowed - check whether @current can wake @tsk.
> + *
> + * Currently a placeholder that allows wakeups within a single process
> + * only (same mm). In the future the requirement will be relaxed (securely).
> + */
> +static bool umcg_wakeup_allowed(struct task_struct *tsk)
> +{
> + WARN_ON_ONCE(!rcu_read_lock_held());
> +
> + if (tsk->mm && tsk->mm == current->mm && READ_ONCE(tsk->umcg_task))
> + return true;
> +
> + return false;
> +}
> +
> +/*
> + * Try to wake up @next_tid. May be called with preemption disabled, and may
> + * be called for a task outside the caller's thread group, but the target
> + * must share the caller's mm (see umcg_wakeup_allowed()).
> + *
> + * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
> + * ordering logic.
> + */
> +static int umcg_ttwu(u32 next_tid, int wake_flags)
> +{
> + struct task_struct *next;
> +
> + rcu_read_lock();
> + next = find_task_by_vpid(next_tid);
> + if (!next || !umcg_wakeup_allowed(next)) {
> + rcu_read_unlock();
> + return -ESRCH;
> + }
> +
> + /* The result of ttwu below is ignored. */
> + try_to_wake_up(next, TASK_NORMAL, wake_flags);
> + rcu_read_unlock();
> +
> + return 0;
> +}
> +
> +/*
> + * At the moment, umcg_do_context_switch simply wakes up @next with
> + * WF_CURRENT_CPU and puts the current task to sleep.
> + *
> + * In the future an optimization will be added to adjust runtime accounting
> + * so that from the kernel scheduling perspective the two tasks are
> + * essentially treated as one. In addition, the context switch may be performed
> + * right here on the fast path, instead of going through the wake/wait pair.
> + */
> +static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout)
> +{
> + int ret;
> +
> + ret = umcg_ttwu(next_tid, WF_CURRENT_CPU);
> + if (ret)
> + return ret;
> +
> + return umcg_idle_loop(abs_timeout);
> +}
> +
> +/**
> + * sys_umcg_wait: put the current task to sleep and/or wake another task.
> + * @flags: zero or a value from enum umcg_wait_flag.
> + * @abs_timeout: when to wake the task, in nanoseconds; zero for no timeout.
> + *
> + * @self->state_ts must be UMCG_TASK_IDLE (where @self is current->umcg_task)
> + * if !(@flags & UMCG_WAIT_WAKE_ONLY).
> + *
> + * If @self->next_tid is not zero, it must be the tid of an IDLE UMCG task.
> + * The userspace must have changed that task's state from IDLE to RUNNING
> + * before the current task calls sys_umcg_wait(). This "next"
> + * task will be woken (context-switched-to on the fast path) when the
> + * current task is put to sleep.
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for details.
> + *
> + * Return:
> + * 0 - OK;
> + * -ETIMEDOUT - the timeout expired;
> + * -EFAULT - failed accessing struct umcg_task __user of the current
> + * task;
> + * -ESRCH - the task to wake not found or not a UMCG task;
> + * -EINVAL - another error happened (e.g. bad @flags, or the current
> + * task is not a UMCG task, etc.)
> + */
> +SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout)
> +{
> + struct umcg_task __user *self = current->umcg_task;
> + u32 next_tid;
> +
> + if (!self)
> + return -EINVAL;
> +
> + if (get_user(next_tid, &self->next_tid))
> + return -EFAULT;
> +
> + if (flags & UMCG_WAIT_WAKE_ONLY) {
> + if (!next_tid || abs_timeout)
> + return -EINVAL;
> +
> + flags &= ~UMCG_WAIT_WAKE_ONLY;
> + if (flags & ~UMCG_WAIT_WF_CURRENT_CPU)
> + return -EINVAL;
> +
> + return umcg_ttwu(next_tid, flags & UMCG_WAIT_WF_CURRENT_CPU ?
> + WF_CURRENT_CPU : 0);
> + }
> +
> + /* Unlock the worker, if locked. */
> + if (current->flags & PF_UMCG_WORKER) {
> + u64 umcg_state;
> +
> + if (get_user(umcg_state, &self->state_ts))
> + return -EFAULT;
> +
> + if ((umcg_state & UMCG_TF_LOCKED) && umcg_update_state(
> + &self->state_ts, &umcg_state,
> + umcg_state & ~UMCG_TF_LOCKED, true))
> + return -EFAULT;
> + }
> +
> + if (next_tid)
> + return umcg_do_context_switch(next_tid, abs_timeout);
> +
> + return umcg_idle_loop(abs_timeout);
> +}
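Again as a non-patch illustration, the two main sys_umcg_wait() usages from
userspace might look like the sketch below. It assumes the registered
struct umcg_task from the earlier registration sketch, __NR_umcg_wait being
wired up, and GCC/Clang __atomic builtins; the timestamped cmpxchg that
umcg_update_state() expects is deliberately simplified.

#include <linux/umcg.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

extern __thread struct umcg_task umcg_self;	/* registered via sys_umcg_ctl() */

/* Wake-only: wake @next_tid without sleeping ourselves. */
static long umcg_wake(uint32_t next_tid, int same_cpu)
{
	__atomic_store_n(&umcg_self.next_tid, next_tid, __ATOMIC_RELEASE);
	return syscall(__NR_umcg_wait,
		       UMCG_WAIT_WAKE_ONLY |
		       (same_cpu ? UMCG_WAIT_WF_CURRENT_CPU : 0),
		       0ull);
}

/* Cooperative switch: name @next_tid, mark ourselves IDLE, and sleep.
 * Userspace must have flipped @next_tid's state from IDLE to RUNNING
 * before this call, per the kernel-doc above. */
static long umcg_switch_to(uint32_t next_tid, uint64_t abs_timeout_ns)
{
	__atomic_store_n(&umcg_self.next_tid, next_tid, __ATOMIC_RELEASE);
	/* Simplified: a real implementation updates state_ts with a cmpxchg
	 * that also refreshes the timestamp bits (see umcg_update_state()). */
	__atomic_store_n(&umcg_self.state_ts, (uint64_t)UMCG_TASK_IDLE,
			 __ATOMIC_RELEASE);
	return syscall(__NR_umcg_wait, 0, abs_timeout_ns);
}
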
> +
> +/*
> + * NOTE: all code below is called from workqueue submit/update, or
> + * syscall exit to usermode loop, so all errors result in the
> + * termination of the current task (via SIGKILL).
> + */
> +
> +/*
> + * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
> + */
> +static int umcg_wake_idle_server_nofault(u32 server_tid)
> +{
> + struct umcg_task __user *ut_server = NULL;
> + struct task_struct *tsk;
> + int ret = -EINVAL;
> + u64 state;
> +
> + rcu_read_lock();
> +
> + tsk = find_task_by_vpid(server_tid);
> + /* Server/worker interaction is allowed only within the same mm. */
> + if (tsk && current->mm == tsk->mm)
> + ut_server = READ_ONCE(tsk->umcg_task);
> +
> + if (!ut_server)
> + goto out_rcu;
> +
> + ret = -EFAULT;
> + if (get_user_nofault(state, &ut_server->state_ts))
> + goto out_rcu;
> +
> + ret = -EAGAIN;
> + if ((state & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_IDLE)
> + goto out_rcu;
> +
> + ret = umcg_update_state(&ut_server->state_ts, &state,
> + UMCG_TASK_RUNNING, false);
> +
> + if (ret)
> + goto out_rcu;
> +
> + try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
> + ret = 0;
> +
> +out_rcu:
> + rcu_read_unlock();
> + return ret;
> +}
> +
> +/*
> + * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
> + */
> +static int umcg_wake_idle_server_may_fault(u32 server_tid)
> +{
> + struct umcg_task __user *ut_server = NULL;
> + struct task_struct *tsk;
> + int ret = -EINVAL;
> + u64 state;
> +
> + rcu_read_lock();
> + tsk = find_task_by_vpid(server_tid);
> + if (tsk && current->mm == tsk->mm)
> + ut_server = READ_ONCE(tsk->umcg_task);
> + rcu_read_unlock();
> +
> + if (!ut_server)
> + return -EINVAL;
> +
> + if (get_user(state, &ut_server->state_ts))
> + return -EFAULT;
> +
> + if ((state & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_IDLE)
> + return -EAGAIN;
> +
> + ret = umcg_update_state(&ut_server->state_ts, &state,
> + UMCG_TASK_RUNNING, true);
> + if (ret)
> + return ret;
> +
> + /*
> + * umcg_ttwu will call find_task_by_vpid again; but we cannot
> + * elide this, as we cannot do get_user() from an rcu-locked
> + * code block.
> + */
> + return umcg_ttwu(server_tid, WF_CURRENT_CPU);
> +}
> +
> +/*
> + * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
> + */
> +static int umcg_wake_idle_server(u32 server_tid, bool may_fault)
> +{
> + int ret = umcg_wake_idle_server_nofault(server_tid);
> +
> + if (!ret)
> + return 0;
> +
> + if (!may_fault || ret != -EFAULT)
> + return ret;
> +
> + return umcg_wake_idle_server_may_fault(server_tid);
> +}
> +
> +/*
> + * Called in sched_submit_work() context for UMCG workers. In the common case,
> + * the worker's state changes RUNNING => BLOCKED, and its server's state
> + * changes IDLE => RUNNING, and the server is ttwu-ed.
> + *
> + * Under some conditions (e.g. the worker is "locked", see
> + * Documentation/userspace-api/umcg.[txt|rst] for more details), the
> + * function does nothing.
> + *
> + * The function is called with preempt disabled to make sure the retry_once
> + * logic below works correctly.
> + */
> +static void process_sleeping_worker(struct task_struct *tsk, u32 *server_tid)
> +{
> + struct umcg_task __user *ut_worker = tsk->umcg_task;
> + u64 curr_state, next_state;
> + bool retried = false;
> + u32 tid;
> + int ret;
> +
> + *server_tid = 0;
> +
> + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid UMCG worker."))
> + return;
> +
> + /* If the worker has no server, do nothing. */
> + if (unlikely(!tsk->pinned_umcg_server_page))
> + return;
> +
> + if (get_user_nofault(curr_state, &ut_worker->state_ts))
> + goto die;
> +
> + /*
> + * The userspace is allowed to concurrently change a RUNNING worker's
> + * state only once in a "short" period of time, so we retry state
> + * change at most once. As this retry block is within a
> + * preempt_disable region, "short" is truly short here.
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for details.
> + */
> +retry_once:
> + if (curr_state & UMCG_TF_LOCKED)
> + return;
> +
> + if (WARN_ONCE((curr_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING,
> + "Unexpected UMCG worker state."))
> + goto die;
> +
> + next_state = curr_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_BLOCKED;
> +
> + ret = umcg_update_state(&ut_worker->state_ts, &curr_state, next_state, false);
> + if (ret == -EAGAIN) {
> + if (retried)
> + goto die;
> +
> + retried = true;
> + goto retry_once;
> + }
> + if (ret)
> + goto die;
> +
> + if (get_user_nofault(tid, &ut_worker->next_tid))
> + goto die;
> +
> + *server_tid = tid;
> + return;
> +
> +die:
> + pr_warn("%s: killing task %d\n", __func__, current->pid);
> + force_sig(SIGKILL);
> +}
> +
> +/* Called from sched_submit_work(). Must not fault/sleep. */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> + u32 server_tid;
> +
> + /*
> + * Disable preemption so that retry_once in process_sleeping_worker
> + * works properly.
> + */
> + preempt_disable();
> + process_sleeping_worker(tsk, &server_tid);
> + preempt_enable();
> +
> + if (server_tid) {
> + int ret = umcg_wake_idle_server_nofault(server_tid);
> +
> + if (ret && ret != -EAGAIN)
> + goto die;
> + }
> +
> + goto out;
> +
> +die:
> + pr_warn("%s: killing task %d\n", __func__, current->pid);
> + force_sig(SIGKILL);
> +out:
> + umcg_unpin_pages();
> +}
> +
> +/**
> + * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
> + *
> + * Returns true on success, false on a fatal failure.
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for details.
> + */
> +static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
> +{
> + u64 __user *node = &ut_worker->idle_workers_ptr;
> + u64 __user *head_ptr;
> + u64 first = (u64)node;
> + u64 head;
> +
> + if (get_user(head, node) || !head)
> + return false;
> +
> + head_ptr = (u64 __user *)head;
> +
> + if (put_user(UMCG_IDLE_NODE_PENDING, node))
> + return false;
> +
> + if (xchg_user_64(head_ptr, &first))
> + return false;
> +
> + if (put_user(first, node))
> + return false;
> +
> + return true;
> +}
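What this implies for userspace (a sketch, not part of the patch): every
worker's idle_workers_ptr must hold the address of a shared list head before
the worker can block, since enqueue_idle_worker() reads that field to find
the head. Popping the list and handling UMCG_IDLE_NODE_PENDING are the
server's job and are described in Documentation/userspace-api/umcg.[txt|rst];
only the setup side is shown here, with an invented helper name.

#include <linux/umcg.h>
#include <stdint.h>

/* Shared head of the single-linked idle-workers stack (assumed layout;
 * zero means the list is empty). */
static uint64_t idle_workers_head;

extern __thread struct umcg_task umcg_self;	/* registered via sys_umcg_ctl() */

/* Hypothetical setup: point this worker's idle_workers_ptr at the shared
 * head so the kernel can push the worker onto the stack when it blocks. */
static void umcg_worker_attach_to_idle_list(void)
{
	umcg_self.idle_workers_ptr = (uint64_t)&idle_workers_head;
}
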
> +
> +/**
> + * get_idle_server - retrieve an idle server, if present.
> + *
> + * Returns true on success, false on a fatal failure.
> + */
> +static bool get_idle_server(struct umcg_task __user *ut_worker, u32 *server_tid)
> +{
> + u64 server_tid_ptr;
> + u32 tid;
> +
> + /* Empty result is OK. */
> + *server_tid = 0;
> +
> + if (get_user(server_tid_ptr, &ut_worker->idle_server_tid_ptr))
> + return false;
> +
> + if (!server_tid_ptr)
> + return false;
> +
> + tid = 0;
> + if (xchg_user_32((u32 __user *)server_tid_ptr, &tid))
> + return false;
> +
> + *server_tid = tid;
> + return true;
> +}
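Correspondingly, the userspace side of the idle-server slot might look like
the sketch below (not part of the patch). Every worker's idle_server_tid_ptr
holds the address of a shared u32 slot; an idle server publishes its tid
there, and get_idle_server() consumes it with an atomic exchange. The full
ordering protocol (when the server flips its state to IDLE, when it re-checks
the idle-workers list) lives in the umcg.[txt|rst] documents and is omitted;
the helper names are invented.

#include <linux/umcg.h>
#include <stdint.h>

/* Shared "one idle server" slot (assumed layout: a plain u32 tid, 0 = none). */
static uint32_t idle_server_tid;

extern __thread struct umcg_task umcg_self;	/* registered via sys_umcg_ctl() */

/* Hypothetical setup, done for each worker before it can block. */
static void umcg_attach_idle_server_slot(void)
{
	umcg_self.idle_server_tid_ptr = (uint64_t)&idle_server_tid;
}

/* Hypothetical publish step, done by a server before it goes IDLE. */
static void umcg_publish_idle_server(uint32_t my_tid)
{
	__atomic_store_n(&idle_server_tid, my_tid, __ATOMIC_SEQ_CST);
}
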
> +
> +/*
> + * Returns true to wait for the userspace to schedule this worker, false
> + * to return to the userspace.
> + *
> + * In the common case, a BLOCKED worker is marked IDLE and enqueued
> + * to idle_workers_ptr list. The idle server is woken (if present).
> + *
> + * If a RUNNING worker is preempted, this function will trigger, in which
> + * case the worker is moved to IDLE state and its server is woken.
> + *
> + * Sets @server_tid to the tid of the server to be woken if the worker
> + * is going to sleep; sets @server_tid to the tid of the server assigned
> + * to this RUNNING worker if the worker is to return to the userspace.
> + */
> +static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid)
> +{
> + struct umcg_task __user *ut_worker = tsk->umcg_task;
> + u64 curr_state, next_state;
> +
> + *server_tid = 0;
> +
> + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
> + return false;
> +
> + if (fatal_signal_pending(tsk))
> + return false;
> +
> + if (get_user(curr_state, &ut_worker->state_ts))
> + goto die;
> +
> + if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
> + u32 tid;
> +
> + /* Wakeup: wait but don't enqueue. */
> + if (curr_state & UMCG_TF_LOCKED)
> + return true;
> +
> + smp_rmb(); /* Order getting state and getting server_tid */
> + if (get_user(tid, &ut_worker->next_tid))
> + goto die;
> +
> + if (tid) {
> + *server_tid = tid;
> +
> + /* pass-through: RUNNING with a server. */
> + if (!(curr_state & UMCG_TF_PREEMPTED))
> + return false;
> + } else if (curr_state & UMCG_TF_PREEMPTED)
> + /* PREEMPTED workers must have servers. */
> + goto die;
> +
> + /*
> + * Fallthrough to mark the worker IDLE: the worker is
> + * PREEMPTED, or the worker is RUNNING, but has no server
> + * (which happens via UMCG_WAIT_WAKE_ONLY).
> + */
> + } else if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
> + (curr_state & UMCG_TF_LOCKED)))
> + /* The worker prepares to sleep or to unregister. */
> + return false;
> +
> + if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE))
> + goto die;
> +
> + next_state = curr_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_IDLE;
> +
> + if (umcg_update_state(&ut_worker->state_ts, &curr_state,
> + next_state, true))
> + goto die;
> +
> + if (!enqueue_idle_worker(ut_worker))
> + goto die;
> +
> + smp_mb(); /* Order enqueuing the worker with getting the server. */
> + if (!(*server_tid) && !get_idle_server(ut_worker, server_tid))
> + goto die;
> +
> + return true;
> +
> +die:
> + pr_warn("umcg_process_waking_worker: killing task %d\n", current->pid);
> + force_sig(SIGKILL);
> + return false;
> +}
> +
> +/*
> + * Called from sched_update_worker(): defer all work until later, as
> + * sched_update_worker() may be called with in-kernel locks held.
> + */
> +void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> + set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
> +}
> +
> +/* Called via TIF_NOTIFY_RESUME flag from exit_to_user_mode_loop. */
> +void umcg_handle_resuming_worker(void)
> +{
> + u32 server_tid;
> +
> + /* Avoid recursion by removing PF_UMCG_WORKER */
> + current->flags &= ~PF_UMCG_WORKER;
> +
> + do {
> + bool should_wait;
> +
> + should_wait = process_waking_worker(current, &server_tid);
> +
> + if (!should_wait)
> + break;
> +
> + if (server_tid) {
> + int ret = umcg_wake_idle_server(server_tid, true);
> +
> + if (ret && ret != -EAGAIN)
> + goto die;
> + }
> +
> + umcg_idle_loop(0);
> + } while (true);
> +
> + if (!server_tid)
> + /* No server => no reason to pin pages. */
> + umcg_unpin_pages();
> + else if (umcg_pin_pages(server_tid))
> + goto die;
> +
> + goto out;
> +
> +die:
> + pr_warn("%s: killing task %d\n", __func__, current->pid);
> + force_sig(SIGKILL);
> +out:
> + current->flags |= PF_UMCG_WORKER;
> +}
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index f43d89d92860..682261d78ee7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -272,6 +272,10 @@ COND_SYSCALL(landlock_create_ruleset);
> COND_SYSCALL(landlock_add_rule);
> COND_SYSCALL(landlock_restrict_self);
>
> +/* kernel/sched/umcg.c */
> +COND_SYSCALL(umcg_ctl);
> +COND_SYSCALL(umcg_wait);
> +
> /* arch/example/kernel/sys_example.c */
>
> /* mm/fadvise.c */
> --
> 2.25.1
>
Thanks,
Tao