This is an update on v0.3:
https://lore.kernel.org/patchwork/cover/1461708/
Key changes v0.3 => v0.4:
- simplified the idle workers list logic
- only a single idle server is now tracked
- removed two fields from struct umcg_task
- added timeout handling
- added worker preemption
- added a doc patch that now documents the syscalls and state
transitions.
More details on the state of the patchset and future work are provided
in the patch 4 commit message.
Peter Oskolkov (4):
sched: add WF_CURRENT_CPU and externise ttwu
sched/umcg: RFC: add userspace atomic helpers
sched/umcg: add Documentation/userspace-api/umcg.rst
sched/umcg: RFC: implement UMCG syscalls
Documentation/userspace-api/umcg.rst | 532 ++++++++++++++++++++++
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
include/linux/sched.h | 6 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 8 +-
include/uapi/linux/umcg.h | 114 +++++
init/Kconfig | 10 +
kernel/exit.c | 7 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 20 +-
kernel/sched/fair.c | 4 +
kernel/sched/sched.h | 15 +-
kernel/sched/umcg.c | 601 +++++++++++++++++++++++++
kernel/sched/umcg.h | 211 +++++++++
kernel/sys_ni.c | 4 +
15 files changed, 1528 insertions(+), 11 deletions(-)
create mode 100644 Documentation/userspace-api/umcg.rst
create mode 100644 include/uapi/linux/umcg.h
create mode 100644 kernel/sched/umcg.c
create mode 100644 kernel/sched/umcg.h
--
2.25.1
Add a WF_CURRENT_CPU wake flag that advises the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases such as UMCG.
In addition, make ttwu external rather than static so that
the flag can be passed to it from outside of sched/core.c.
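For illustration, a caller outside of sched/core.c could use the flag
roughly as follows (a minimal sketch, not part of this patch; the helper
name is made up):

    /* Wake @next and hint the scheduler to keep it on the current CPU. */
    static void wake_on_current_cpu(struct task_struct *next)
    {
            /*
             * WF_CURRENT_CPU is only a hint: select_task_rq_fair() honors
             * it only if the current CPU is allowed by next->cpus_ptr.
             */
            try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU);
    }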
Signed-off-by: Peter Oskolkov <[email protected]>
---
kernel/sched/core.c | 3 +--
kernel/sched/fair.c | 4 ++++
kernel/sched/sched.h | 15 +++++++++------
3 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c22cd026440..293f5801bf81 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3680,8 +3680,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
* Return: %true if @p->state changes (an actual wakeup was done),
* %false otherwise.
*/
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
unsigned long flags;
int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11d22943753f..16a9c93e6e82 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6836,6 +6836,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (wake_flags & WF_TTWU) {
record_wakee(p);
+ if ((wake_flags & WF_CURRENT_CPU) &&
+ cpumask_test_cpu(cpu, p->cpus_ptr))
+ return cpu;
+
if (sched_energy_enabled()) {
new_cpu = find_energy_efficient_cpu(p, prev_cpu);
if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a1c6aeb9165..80de6836f8ae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2031,13 +2031,14 @@ static inline int task_on_rq_migrating(struct task_struct *p)
}
/* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC 0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK 0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */
+#define WF_EXEC 0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK 0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */
-#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
-#define WF_ON_CPU 0x40 /* Wakee is on_cpu */
+#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
+#define WF_ON_CPU 0x40 /* Wakee is on_cpu */
+#define WF_CURRENT_CPU 0x80 /* Prefer to move the wakee to the current CPU. */
#ifdef CONFIG_SMP
static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3037,6 +3038,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
extern void swake_up_all_locked(struct swait_queue_head *q);
extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
#ifdef CONFIG_PREEMPT_DYNAMIC
extern int preempt_dynamic_mode;
extern int sched_dynamic_mode(const char *str);
--
2.25.1
Add helper functions to work atomically with userspace 32/64-bit values -
there are some .*futex.* named helpers, but they are not exactly
what is needed for UMCG; I haven't found what else I could use, so I
rolled these.
At the moment only X86_64 is supported.
Note: the helpers should probably go into arch/ somewhere; I have
them in kernel/sched/umcg.h temporarily for convenience. Please
let me know where I should put them.
Note: the current code follows the suggestions here:
https://lore.kernel.org/lkml/[email protected]/
with the exception that I couldn't combine __try_cmpxchg_user_32/64 functions
into a macro, as my asm-fu is not too strong. I'll continue trying to make
the macro work, but for the moment I've decided to post this RFC so that
other areas of the patchset could be reviewed.
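For reference, a typical use later in the series looks roughly like this
(a minimal sketch; the helper below is illustrative and relies on the UMCG
uapi definitions added in a later patch):

    /* Atomically move a UMCG task's state from IDLE to RUNNING. */
    static int mark_running(struct umcg_task __user *ut)
    {
            u32 prev = UMCG_TASK_IDLE;

            /*
             * Returns 0 on success; -EAGAIN if the state was not IDLE
             * (the current value is then in @prev); -EFAULT if the page
             * could not be faulted in.
             */
            return cmpxchg_user_32(&ut->state, &prev, UMCG_TASK_RUNNING);
    }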
Changelog:
v0.3->v0.4:
- added put_user_nosleep;
- removed linked list/stack operations patch;
v0.2->v0.3:
- renamed and refactored the helpers a bit, as described above;
- moved linked list/stack operations into a separate patch.
Signed-off-by: Peter Oskolkov <[email protected]>
---
kernel/sched/umcg.h | 198 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 198 insertions(+)
create mode 100644 kernel/sched/umcg.h
diff --git a/kernel/sched/umcg.h b/kernel/sched/umcg.h
new file mode 100644
index 000000000000..1db283071ca6
--- /dev/null
+++ b/kernel/sched/umcg.h
@@ -0,0 +1,198 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _KERNEL_SCHED_UMCG_H
+#define _KERNEL_SCHED_UMCG_H
+
+#ifdef CONFIG_X86_64
+
+#include <linux/uaccess.h>
+
+#include <asm/asm.h>
+#include <linux/atomic.h>
+
+/* TODO: move atomic operations below into arch/ headers */
+static inline int __try_cmpxchg_user_32(u32 *uval, u32 __user *uaddr,
+ u32 oldval, u32 newval)
+{
+ int ret = 0;
+
+ asm volatile("\n"
+ "1:\t" LOCK_PREFIX "cmpxchgl %4, %2\n"
+ "2:\n"
+ "\t.section .fixup, \"ax\"\n"
+ "3:\tmov %3, %0\n"
+ "\tjmp 2b\n"
+ "\t.previous\n"
+ _ASM_EXTABLE_UA(1b, 3b)
+ : "+r" (ret), "=a" (oldval), "+m" (*uaddr)
+ : "i" (-EFAULT), "r" (newval), "1" (oldval)
+ : "memory"
+ );
+ *uval = oldval;
+ return ret;
+}
+
+static inline int __try_cmpxchg_user_64(u64 *uval, u64 __user *uaddr,
+ u64 oldval, u64 newval)
+{
+ int ret = 0;
+
+ asm volatile("\n"
+ "1:\t" LOCK_PREFIX "cmpxchgq %4, %2\n"
+ "2:\n"
+ "\t.section .fixup, \"ax\"\n"
+ "3:\tmov %3, %0\n"
+ "\tjmp 2b\n"
+ "\t.previous\n"
+ _ASM_EXTABLE_UA(1b, 3b)
+ : "+r" (ret), "=a" (oldval), "+m" (*uaddr)
+ : "i" (-EFAULT), "r" (newval), "1" (oldval)
+ : "memory"
+ );
+ *uval = oldval;
+ return ret;
+}
+
+static inline int fix_pagefault(unsigned long uaddr, bool write_fault)
+{
+ struct mm_struct *mm = current->mm;
+ int ret;
+
+ mmap_read_lock(mm);
+ ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
+ NULL);
+ mmap_read_unlock(mm);
+
+ return ret < 0 ? ret : 0;
+}
+
+/**
+ * cmpxchg_user_32 - compare_exchange a 32-bit userspace value
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ * -EAGAIN: @old did not match the value at @uaddr; @old is updated
+ *          to the current value at @uaddr
+ */
+static inline int cmpxchg_user_32(u32 __user *uaddr, u32 *old, u32 new)
+{
+ int ret = -EFAULT;
+ u32 __old = *old;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ while (true) {
+ __uaccess_begin_nospec();
+ ret = __try_cmpxchg_user_32(old, uaddr, __old, new);
+ user_access_end();
+
+ if (!ret) {
+ ret = *old == __old ? 0 : -EAGAIN;
+ break;
+ }
+
+ if (fix_pagefault((unsigned long)uaddr, true) < 0)
+ break;
+ }
+
+ pagefault_enable();
+ return ret;
+}
+
+/**
+ * cmpxchg_user_64 - compare_exchange a 64-bit userspace value
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ * -EAGAIN: @old did not match the value at @uaddr; @old is updated
+ *          to the current value at @uaddr
+ */
+static inline int cmpxchg_user_64(u64 __user *uaddr, u64 *old, u64 new)
+{
+ int ret = -EFAULT;
+ u64 __old = *old;
+
+ if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+ return -EFAULT;
+
+ pagefault_disable();
+
+ while (true) {
+ __uaccess_begin_nospec();
+ ret = __try_cmpxchg_user_64(old, uaddr, __old, new);
+ user_access_end();
+
+ if (!ret) {
+ ret = *old == __old ? 0 : -EAGAIN;
+ break;
+ }
+
+ if (fix_pagefault((unsigned long)uaddr, true) < 0)
+ break;
+ }
+
+ pagefault_enable();
+
+ return ret;
+}
+
+/**
+ * get_user_nosleep - get user value with inline fixup without sleeping.
+ *
+ * get_user() might sleep and therefore cannot be used in preempt-disabled
+ * regions.
+ */
+#define get_user_nosleep(out, uaddr) \
+({ \
+ int ret = -EFAULT; \
+ \
+ if (access_ok((uaddr), sizeof(*(uaddr)))) { \
+ pagefault_disable(); \
+ \
+ while (true) { \
+ if (!__get_user((out), (uaddr))) { \
+ ret = 0; \
+ break; \
+ } \
+ \
+ if (fix_pagefault((unsigned long)(uaddr), false) < 0) \
+ break; \
+ } \
+ \
+ pagefault_enable(); \
+ } \
+ ret; \
+})
+
+/**
+ * put_user_nosleep - put user value with inline fixup without sleeping.
+ *
+ * put_user() might sleep and therefore cannot be used in preempt-disabled
+ * regions.
+ */
+#define put_user_nosleep(out, uaddr) \
+({ \
+ int ret = -EFAULT; \
+ \
+ if (access_ok((uaddr), sizeof(*(uaddr)))) { \
+ pagefault_disable(); \
+ \
+ while (true) { \
+ if (!__put_user((out), (uaddr))) { \
+ ret = 0; \
+ break; \
+ } \
+ \
+ if (fix_pagefault((unsigned long)(uaddr), true) < 0) \
+ break; \
+ } \
+ \
+ pagefault_enable(); \
+ } \
+ ret; \
+})
+
+#endif /* CONFIG_X86_64 */
+#endif /* _KERNEL_SCHED_UMCG_H */
--
2.25.1
Document User Managed Concurrency Groups syscalls, data structures,
state transitions, etc.
Signed-off-by: Peter Oskolkov <[email protected]>
---
Documentation/userspace-api/umcg.rst | 532 +++++++++++++++++++++++++++
1 file changed, 532 insertions(+)
create mode 100644 Documentation/userspace-api/umcg.rst
diff --git a/Documentation/userspace-api/umcg.rst b/Documentation/userspace-api/umcg.rst
new file mode 100644
index 000000000000..680bf336bfdc
--- /dev/null
+++ b/Documentation/userspace-api/umcg.rst
@@ -0,0 +1,532 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+UMCG Userspace API
+=====================================
+
+User Managed Concurrency Groups (UMCG) is an M:N threading
+subsystem/toolkit that lets user space application developers
+implement in-process user space schedulers.
+
+.. contents:: :local:
+
+Why? Heterogeneous in-process workloads
+=======================================
+The Linux kernel's CFS scheduler is designed for the "common" use case,
+with efficiency/throughput in mind. Work isolation and workloads of
+different "urgency" are addressed by tools such as cgroups, CPU
+affinity, priorities, etc., which are difficult or impossible to
+efficiently use in-process.
+
+For example, a single DBMS process may receive tens of thousands of
+requests per second; some of these requests may have strong response
+latency requirements as they serve live user requests (e.g. login
+authentication); some of these requests may not care much about
+latency but must be served within a certain time period (e.g. an
+hourly aggregate usage report); some of these requests are to be
+served only on a best-effort basis and can be NACKed under high load
+(e.g. an exploratory research/hypothesis testing workload).
+
+Beyond different work item latency/throughput requirements as outlined
+above, the DBMS may need to provide certain guarantees to different
+users; for example, user A may "reserve" 1 CPU for their
+high-priority/low latency requests, 2 CPUs for mid-level throughput
+workloads, and be allowed to send as many best-effort requests as
+possible, which may or may not be served, depending on the DBMS load.
+Besides, the best-effort work, started when the load was low, may need
+to be delayed if suddenly a large amount of higher-priority work
+arrives. With hundreds or thousands of users like this, it is very
+difficult to guarantee the application's responsiveness using standard
+Linux tools while maintaining high CPU utilization.
+
+Gaming is another use case: some in-process work must be completed
+before a certain deadline dictated by frame rendering schedule, while
+other work items can be delayed; some work may need to be
+cancelled/discarded because the deadline has passed; etc.
+
+User Managed Concurrency Groups is an M:N threading toolkit that
+allows constructing user space schedulers designed to efficiently
+manage heterogeneous in-process workloads described above while
+maintaining high CPU utilization (95%+).
+
+Requirements
+============
+One relatively established way to design high-efficiency, low-latency
+systems is to split all work into small on-cpu work items, with
+asynchronous I/O and continuations, all executed on a thread pool with
+the number of threads not exceeding the number of available CPUs.
+Although this approach works, it is quite difficult to develop and
+maintain such a system, as, for example, small continuations are
+difficult to piece together when debugging. Besides, such asynchronous
+callback-based systems tend to be somewhat cache-inefficient, as
+continuations can get scheduled on any CPU regardless of cache
+locality.
+
+M:N threading and cooperative user space scheduling enables controlled
+CPU usage (minimal OS preemption), synchronous coding style, and
+better cache locality.
+
+Specifically:
+
+- a variable/fluctuating number M of "application" threads should be
+ "scheduled over" a relatively fixed number N of "kernel" threads,
+ where N is less than or equal to the number of CPUs available;
+- only those application threads that are attached to kernel threads
+ are scheduled "on CPU";
+- application threads should be able to cooperatively yield to each other;
+- when an application thread blocks in kernel (e.g. in I/O), this
+ becomes a scheduling event ("block") that the userspace scheduler
+ should be able to efficiently detect, and reassign a waiting
+ application thread to the freed "kernel" thread;
+- when a blocked application thread wakes (e.g. its I/O operation
+ completes), this event ("wake") should also be detectable by the
+ userspace scheduler, which should be able to either quickly dispatch
+ the newly woken thread to an idle "kernel" thread or, if all "kernel"
+ threads are busy, put it in the waiting queue;
+- in addition to the above, it would be extremely useful for a
+ separate in-process "watchdog" facility to be able to monitor the
+ state of each of the M+N threads, and to intervene in case of runaway
+ workloads (interrupt/preempt).
+
+
+UMCG kernel API
+===============
+Based on the requirements above, the UMCG *kernel* API is built around
+the following ideas:
+
+- *UMCG server*: a task/thread representing "kernel threads", or CPUs
+ from the requirements above;
+- *UMCG worker*: a task/thread representing "application threads", to
+ be scheduled over servers;
+- UMCG *task state*: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG
+ task (a server or a worker) can be in;
+- UMCG task *state flag*: LOCKED, PREEMPTED: additional state flags
+ that can be ORed with the task state to communicate additional information
+ to the kernel;
+- ``struct umcg_task``: a per-task userspace set of data fields, usually
+ residing in the TLS, that fully reflects the current task's UMCG
+ state and controls the way the kernel manages the task;
+- ``sys_umcg_ctl()``: a syscall used to register the current task/thread
+ as a server or a worker, or to unregister a UMCG task;
+- ``sys_umcg_wait()``: a syscall used to put the current task to
+ sleep and/or wake another task, potentially context-switching
+ between the two tasks on-CPU synchronously.
+
+
+Servers
+=======
+
+When a task/thread is registered as a server, it is in RUNNING
+state and behaves like any other normal task/thread. In addition,
+servers can interact with other UMCG tasks via sys_umcg_wait():
+
+- servers can voluntarily suspend their execution (wait), becoming IDLE;
+- servers can wake other IDLE servers;
+- servers can context-switch between each other.
+
+Note that if a server blocks in the kernel *not* via sys_umcg_wait(),
+it still retains its RUNNING state.
+
+Also note that servers can be used for fast on-CPU context switching
+across process boundaries; server-worker interactions assume they
+belong to the same mm.
+
+See the next section on how servers interact with workers.
+
+Workers
+=======
+
+A worker cannot be RUNNING without having a server associated
+with it, so when a task is first registered as a worker, it enters
+the IDLE state.
+
+- a worker becomes RUNNING when a server calls sys_umcg_wait to
+ context-switch into it; the server goes IDLE, and the worker becomes
+ RUNNING in its place;
+- when a running worker blocks in the kernel, it becomes BLOCKED,
+ its associated server becomes RUNNING and the server's
+ sys_umcg_wait() call from the bullet above returns; this transition
+ is sometimes called "block detection";
+- when a BLOCKED worker's blocking operation completes, the worker
+ becomes IDLE and is added to the list of idle workers; if there
+ is an idle server waiting, the kernel wakes it; this transition
+ is sometimes called "wake detection";
+- running workers can voluntarily suspend their execution (wait),
+ becoming IDLE; their associated servers are woken;
+- a RUNNING worker can context-switch with an IDLE worker; the server
+ of the switched-out worker is transferred to the switched-in worker;
+- any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
+ this is a server running the worker as described in the first bullet
+ in this list, the worker remains IDLE but is added to the idle workers
+ list; this "wake" operation exists for completeness, to make sure
+ wait/wake/context-switch operations are available for all UMCG tasks;
+- the userspace can preempt a RUNNING worker by marking it
+ ``RUNNING|PREEMPTED`` and sending a signal to it; the userspace should
+ have installed a NOP signal handler for the signal; the kernel will
+ then transition the worker into ``IDLE|PREEMPTED`` state and wake
+ its associated server.
+
+UMCG task states
+================
+
+Important: all state transitions described below involve at least
+two steps: the change of the state field in ``struct umcg_task``,
+for example ``RUNNING`` to ``IDLE``, and the corresponding change in
+``struct task_struct`` state, for example a transition between the task
+running on CPU and being descheduled and removed from the kernel runqueue.
+The key principle of UMCG API design is that the party initiating
+the state transition modifies the state variable.
+
+For example, a task going ``IDLE`` first changes its state from ``RUNNING``
+to ``IDLE`` in the userspace and then calls ``sys_umcg_wait()``, which
+completes the transition.
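+
+A minimal userspace sketch of this transition (the ``umcg_wait()`` wrapper
+below stands for a raw ``sys_umcg_wait()`` call and is illustrative only):
+
+.. code-block:: C
+
+   uint32_t expected = UMCG_TASK_RUNNING;
+
+   /* Step 1: the initiating party (the task itself) updates the state. */
+   if (__atomic_compare_exchange_n(&self->state, &expected, UMCG_TASK_IDLE,
+                                   false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
+           umcg_wait(0, 0);  /* Step 2: the kernel completes the transition. */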
+
+Note on documentation: in ``include/uapi/linux/umcg.h``, task states
+have the form ``UMCG_TASK_RUNNING``, ``UMCG_TASK_BLOCKED``, etc. In
+this document these are usually referred to simply as ``RUNNING`` and
+``BLOCKED``, unless it creates ambiguity. Task state flags, e.g.
+``UMCG_TF_PREEMPTED``, are treated similarly.
+
+UMCG task states reflect the view from the userspace, rather than from
+the kernel. There are three fundamental task states:
+
+- ``RUNNING``: indicates that the task is schedulable by the kernel; applies
+ to both servers and workers;
+- ``IDLE``: indicates that the task is *not* schedulable by the kernel
+ (see ``umcg_idle_loop()`` in ``kernel/sched/umcg.c``); applies to
+ both servers and workers;
+- ``BLOCKED``: indicates that the worker is blocked in the kernel;
+ does not apply to servers.
+
+In addition to the three states above, two state flags help with
+state transitions:
+
+- ``LOCKED``: the userspace is preparing the worker for a state transition
+ and "locks" the worker until the worker is ready for the kernel to
+ act on the state transition; used similarly to preempt_disable or
+ irq_disable in the kernel; applies only to workers in ``RUNNING`` or
+ ``IDLE`` state; ``RUNNING|LOCKED`` means "this worker is about to
+ become ``RUNNING``", while ``IDLE|LOCKED`` means "this worker is about
+ to become ``IDLE`` or unregister";
+- ``PREEMPTED``: the userspace indicates it wants the worker to be
+ preempted; there are no situations when both ``LOCKED`` and ``PREEMPTED``
+ flags are set at the same time.
+
+struct umcg_task
+================
+
+From ``include/uapi/linux/umcg.h``:
+
+.. code-block:: C
+
+ struct umcg_task {
+ uint32_t state; /* r/w */
+ uint32_t next_tid; /* r */
+ uint64_t idle_workers_ptr; /* r/w */
+ uint64_t idle_server_tid_ptr; /* r */
+ };
+
+Each UMCG task is identified by ``struct umcg_task``, which is provided
+to the kernel when the task is registered via ``sys_umcg_ctl()``.
+
+- ``uint32_t state``: the current state of the task this struct identifies,
+ as described in the previous section. Readable/writable by both the kernel
+ and the userspace.
+
+ - bits 0 - 7: task state (RUNNING, IDLE, BLOCKED);
+ - bits 8 - 15: state flags (LOCKED, PREEMPTED);
+ - bits 16 - 23: reserved; must be zeroes;
+ - bits 24 - 31: for userspace use.
+
+- ``uint32_t next_tid``: contains the TID of the task to context-switch-into
+ in ``sys_umcg_wait()``; can be zero; writable by the userspace, readable
+ by the kernel; if this is a RUNNING worker, this field contains
+ the TID of the server that should be woken when this worker blocks;
+ see ``sys_umcg_wait()`` for more details;
+- ``uint64_t idle_workers_ptr``: this field forms a singly-linked list
+ of idle workers: all RUNNING workers have this field set to point
+ to the head of the list (a pointer variable in the userspace).
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from ``BLOCKED`` to ``IDLE`` and adds the worker
+ to the top of the list of idle workers using this logic:
+
+ .. code-block:: C
+
+ /* kernel side */
+ u64 *head = (u64 *)(worker->idle_workers_ptr); /* get the head pointer */
+ u64 *first = (u64 *)*head; /* get the first element */
+
+ /* make the worker's ptr point to the first element */
+ worker->idle_workers_ptr = first;
+
+ /* make the head pointer point to this worker */
+ if (cmpxchg(head, &first, &worker->idle_workers_ptr))
+ /* success */
+ else
+ /* retry, with exponential back-off */
+
+
+ In the userspace the list is cleared atomically using this logic:
+
+ .. code-block:: C
+
+ /* userspace side */
+ uint64_t *idle_workers = (uint64_t *)*head;
+
+ /* move the list from the global head to the local idle_workers */
+ if (cmpxchg(&head, &idle_workers, NULL))
+ /* success: process idle_workers */
+ else
+ /* retry */
+
+ The userspace re-points workers' idle_workers_ptr to the list head
+ variable before the worker is allowed to become RUNNING again.
+
+- ``uint64_t idle_server_tid_ptr``: points to a pointer variable in the
+ userspace that points to an idle server, i.e. a server in IDLE state waiting
+ in sys_umcg_wait(); read-only; workers must have this field set; not used
+ in servers.
+
+ When a worker's blocking operation in the kernel completes, the kernel
+ changes the worker's state from ``BLOCKED`` to ``IDLE``, adds the worker
+ to the list of idle workers, and checks whether
+ ``*idle_server_tid_ptr`` is not zero. If not, the kernel tries to cmpxchg()
+ it with zero; if cmpxchg() succeeds, the kernel will then wake the server.
+ See `State transitions`_ below for more details.
+
+sys_umcg_ctl()
+==============
+
+``int sys_umcg_ctl(uint32_t flags, struct umcg_task *self)`` is used to
+register or unregister the current task as a worker or server. Flags
+can be one of the following:
+
+- ``UMCG_CTL_REGISTER``: register a server;
+- ``UMCG_CTL_REGISTER | UMCG_CTL_WORKER``: register a worker;
+- ``UMCG_CTL_UNREGISTER``: unregister the current server or worker.
+
+When registering a task, ``self`` must point to ``struct umcg_task``
+describing this server or worker; the pointer must remain valid until
+the task is unregistered.
+
+When registering a server, ``self->state`` must be ``RUNNING``; all other
+fields in ``self`` must be zeroes.
+
+When registering a worker, ``self->state`` must be ``IDLE``;
+``self->idle_server_tid_ptr`` and ``self->idle_workers_ptr`` must be
+valid pointers as described in `struct umcg_task`_; ``self->next_tid`` must
+be zero.
+
+When unregistering a task, ``self`` must be ``NULL``.
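+
+A minimal registration sketch (raw ``syscall(2)`` is used because no libc
+wrappers exist yet; error handling is omitted):
+
+.. code-block:: C
+
+   #include <linux/umcg.h>
+   #include <stdint.h>
+   #include <string.h>
+   #include <sys/syscall.h>
+   #include <unistd.h>
+
+   /* Must remain valid until the task is unregistered. */
+   static __thread struct umcg_task self;
+
+   void register_server(void)
+   {
+           memset(&self, 0, sizeof(self));
+           self.state = UMCG_TASK_RUNNING;
+           syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &self);
+   }
+
+   void register_worker(uint64_t *idle_workers_head, uint32_t *idle_server_tid)
+   {
+           memset(&self, 0, sizeof(self));
+           self.state = UMCG_TASK_IDLE;
+           self.idle_workers_ptr = (uint64_t)idle_workers_head;
+           self.idle_server_tid_ptr = (uint64_t)idle_server_tid;
+           syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER | UMCG_CTL_WORKER, &self);
+   }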
+
+sys_umcg_wait()
+===============
+
+``int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout)`` operates
+on registered UMCG servers and workers: ``struct umcg_task *self`` provided
+to ``sys_umcg_ctl()`` when registering the current task is consulted
+in addition to ``flags`` and ``abs_timeout`` parameters.
+
+The function can be used to perform one of the three operations:
+
+- wait: if ``self->next_tid`` is zero, ``sys_umcg_wait()`` puts the current
+ server or worker to sleep;
+- wake: if ``self->next_tid`` is not zero, and ``flags & UMCG_WAIT_WAKE_ONLY``,
+ the task identified by ``next_tid`` is woken (must be in ``IDLE`` state);
+- context switch: if ``self->next_tid`` is not zero, and
+ ``!(flags & UMCG_WAIT_WAKE_ONLY)``, the current task is put to sleep and
+ the next task is woken, synchronously switching between the tasks on the
+ current CPU on the fast path.
+
+Flags can be zero or a combination of the following values:
+
+- ``UMCG_WAIT_WAKE_ONLY``: wake the next task, don't put the current task
+ to sleep;
+- ``UMCG_WAIT_WF_CURRENT_CPU``: wake the next task on the current CPU;
+ this flag has an effect only if ``UMCG_WAIT_WAKE_ONLY`` is set: context
+ switching is always attempted on the current CPU.
+
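+For example, waking an ``IDLE`` task without running it could look roughly
+like this (a sketch reusing the includes from the registration example above;
+``wakee_tid`` is the wakee's thread id, and the wakee's UMCG state must
+already have been updated as described in `State transitions`_ below):
+
+.. code-block:: C
+
+   self->next_tid = wakee_tid;
+   syscall(__NR_umcg_wait, UMCG_WAIT_WAKE_ONLY | UMCG_WAIT_WF_CURRENT_CPU, 0);
+   self->next_tid = 0;  /* or restore the server's TID, for a running worker */
+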
+The section below provides more details on how servers and workers interact
+via ``sys_umcg_wait()``, during worker block/wake events, and during
+worker preemption.
+
+State transitions
+=================
+
+As mentioned above, the key principle of UMCG state transitions is that
+**the party initiating the state transition modifies the state of affected
+tasks**.
+
+Below, "``TASK:STATE``" indicates a task T, where T can be either W for
+worker or S for server, in state ST, where ST can be one of the three states,
+potentially ORed with a state flag. Each individual state transition
+is an atomic operation (cmpxchg) unless indicated otherwise. Also note
+that **the order of state transitions is important and is part of the
+contract between the userspace and the kernel. The kernel is free
+to kill the task (SIGSEGV) if the contract is broken.**
+
+Some worker state transitions below include adding ``LOCKED`` flag to
+worker state. This is done to indicate to the kernel that the worker
+is transitioning state and should not participate in the block/wake
+detection routines, which can happen due to interrupts/pagefaults/signals.
+
+``IDLE|LOCKED`` means that a running worker is preparing to sleep, so
+interrupts should not lead to server wakeup; ``RUNNING|LOCKED`` means that
+an idle worker is going to be "scheduled to run", but may not yet have its
+server set up properly.
+
+Key state transitions:
+
+- server to worker context switch ("schedule a worker to run"):
+ ``S:RUNNING+W:IDLE => S:IDLE+W:RUNNING``:
+
+ - in the userspace, in the context of the server S running:
+
+ - ``S:RUNNING => S:IDLE`` (mark self as idle)
+ - ``W:IDLE => W:RUNNING|LOCKED`` (mark the worker as running)
+ - ``W.next_tid := S.tid; S.next_tid := W.tid``
+ (link the server with the worker)
+ - ``W:RUNNING|LOCKED => W:RUNNING`` (unlock the worker)
+ - ``S: sys_umcg_wait()`` (make the syscall)
+
+ - the kernel context switches from the server to the worker; the server
+ sleeps until it becomes ``RUNNING`` during one of the transitions below;
+
+- worker to server context switch (worker "yields"):
+ ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE``:
+
+ - in the userspace, in the context of the worker W running (note that
+ a running worker has its ``next_tid`` set to point to its server):
+
+ - ``W:RUNNING => W:IDLE|LOCKED`` (mark self as idle)
+ - ``S:IDLE => S:RUNNING`` (mark the server as running)
+ - ``W: sys_umcg_wait()`` (make the syscall)
+
+ - the kernel removes the ``LOCKED`` flag from the worker's state and
+ context switches from the worker to the server; the worker
+ sleeps until it becomes ``RUNNING``;
+
+- worker to worker context switch:
+ ``W1:RUNNING+W2:IDLE => W1:IDLE+W2:RUNNING``:
+
+ - in the userspace, in the context of W1 running:
+
+ - ``W2:IDLE => W2:RUNNING|LOCKED`` (mark W2 as running)
+ - ``W1:RUNNING => W1:IDLE|LOCKED`` (mark self as idle)
+ - ``W2.next_tid := W1.next_tid; S.next_tid := W2.tid``
+ (transfer the server W1 => W2)
+ - ``W1.next_tid := W2.tid`` (indicate that W1 should
+ context-switch into W2)
+ - ``W2:RUNNING|LOCKED => W2:RUNNING`` (unlock W2)
+ - ``W1: sys_umcg_wait()`` (make the syscall)
+
+ - same as above, the kernel removes the ``LOCKED`` flag from the W1's state
+ and context switches to next_tid;
+
+- worker wakeup: ``W:IDLE => W:RUNNING``:
+
+ - in the userspace, a server S can wake a worker W without "running" it:
+
+ - ``S.next_tid := W.tid``
+ - ``W.next_tid := 0``
+ - ``W:IDLE => W:RUNNING``
+ - ``sys_umcg_wait(UMCG_WAIT_WAKE_ONLY)`` (make the syscall)
+
+ - the kernel will wake the worker W; as the worker does not have a server
+ assigned, "wake detection" will happen, the worker will be immediately
+ marked as ``IDLE`` and added to the idle workers list; an idle server, if any,
+ will be woken (see 'wake detection' below);
+ - Note: if needed, it is possible for a worker to wake another worker:
+ the waker marks itself ``IDLE|LOCKED``, points its next_tid to the wakee,
+ makes the syscall, restores its server in next_tid, marks itself
+ as ``RUNNING``.
+
+- block detection: worker blocks in the kernel: ``S:IDLE+W:RUNNING => S:RUNNING+W:BLOCKED``:
+
+ - when a worker blocks in the kernel in ``RUNNING`` state (not ``LOCKED``),
+ before descheduling the task from the CPU the kernel performs these
+ operations:
+
+ - ``W:RUNNING => W:BLOCKED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+ - ``try_to_wake_up(S)``
+
+ - if any of the first three operations above fail, the worker is killed via
+ ``SIGSEGV``. Note that ``ttwu(S)`` is not required to succeed, as the
+ server may still be transitioning to sleep in ``sys_umcg_wait()``; before
+ actually putting the server to sleep its UMCG state is checked and, if
+ it is ``RUNNING``, sys_umcg_wait() returns to the userspace;
+ - if the worker has its ``LOCKED`` flag set, block detection does not trigger,
+ as the worker is assumed to be in the userspace scheduling code.
+
+- wake detection: worker wakes in the kernel: ``W:BLOCKED => W:IDLE``:
+
+ - all workers' returns to the userspace are intercepted:
+
+ - ``start:`` (a label)
+ - if ``W:RUNNING & W.next_tid != 0``: let the worker exit to the userspace,
+ as this is a ``RUNNING`` worker with a server;
+ - ``W:* => W:IDLE`` (workers that were previously blocked, or woken without
+ servers, are not allowed to return to the userspace);
+ - the worker is appended to ``W.idle_workers_ptr`` idle workers list;
+ - ``S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S:RUNNING; ttwu(S)``
+ - ``idle_loop(W)``: this is the same idle loop that ``sys_umcg_wait()``
+ uses: it breaks only when the worker becomes ``RUNNING``; when the
+ idle loop exits, it is assumed that the userspace has properly
+ removed the worker from the idle workers list before marking it
+ ``RUNNING``;
+ - ``goto start;`` (repeat from the beginning).
+
+ - the logic above is a bit more complicated in the presence of ``LOCKED`` or
+ ``PREEMPTED`` flags, but the main invariants stay the same:
+
+ - only ``RUNNING`` workers with servers assigned are allowed to run
+ in the userspace (unless ``LOCKED``);
+ - newly ``IDLE`` workers are added to the idle workers list; any
+ user-initiated state change assumes the userspace properly removed
+ the worker from the list;
+ - as with block detection, any "breach of contract" by the userspace
+ will result in the task termination via ``SIGSEGV``.
+
+- worker preemption: ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED``:
+
+ - when the userspace wants to preempt a ``RUNNING`` worker, it changes
+ its state, atomically, ``RUNNING => RUNNING|PREEMPTED``, and sends a signal
+ to the worker via ``tgkill()`` (see the sketch after this list); the
+ signal handler, previously set up
+ by the userspace, can be a NOP (note that only ``RUNNING`` workers can be
+ preempted);
+ - if the worker, at the moment the signal arrives, is still running
+ on-CPU in the userspace, the "wake detection" code will be triggered; in
+ addition to what was described above, it will check whether the worker is in
+ ``RUNNING|PREEMPTED`` state:
+
+ - ``W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+ - ``try_to_wakeup(S)``
+
+ - if the signal arrives after the worker blocks in the kernel, the "block
+ detection" happens as described above, with the following change:
+
+ - ``W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED``
+ - ``S := W.next_tid``
+ - ``S:IDLE => S:RUNNING``
+ - ``try_to_wake_up(S)``
+
+ - in any case, the worker's server is woken, with its attached worker
+ (``S.next_tid``) either in ``BLOCKED|PREEMPTED`` or ``IDLE|PREEMPTED``
+ state.
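+
+A userspace preemption sketch (``SIGUSR1`` and the helper below are
+illustrative; ``<signal.h>`` and ``<sys/syscall.h>`` are assumed; the only
+requirements are the atomic state change and the signal):
+
+.. code-block:: C
+
+   void preempt_worker(struct umcg_task *w, pid_t tgid, pid_t worker_tid)
+   {
+           uint32_t expected = UMCG_TASK_RUNNING;
+
+           /* Only RUNNING workers may be preempted. */
+           if (__atomic_compare_exchange_n(&w->state, &expected,
+                           UMCG_TASK_RUNNING | UMCG_TF_PREEMPTED, false,
+                           __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
+                   syscall(SYS_tgkill, tgid, worker_tid, SIGUSR1);
+   }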
+
+Server-only use cases
+=====================
+
+Some workloads/applications may benefit from fast and synchronous on-CPU
+user-initiated context switches without the need for full userspace
+scheduling (block/wake detection). These applications can use "standalone"
+UMCG servers to wait/wake/context-switch, including across process boundaries.
+
+These "worker-less" operations involve trivial ``RUNNING`` <==> ``IDLE``
+state changes, not discussed here for brevity.
+
--
2.25.1
Define struct umcg_task and two syscalls: sys_umcg_ctl and sys_umcg_wait.
All key operations, such as wait/wake/context-switch, as well as
timeouts and block/wake detection, are working quite reliably.
In addition, the userspace can now force the kernel to preempt
a running worker by changing its state from RUNNING to
RUNNING | PREEMPTED and sending a signal to it. This new functionality
is less well tested than the key operations above, but is working
well in the common case of the worker being busy in the userspace.
The main challenge with preemption is to carefully navigate the
race when (1) the worker goes to block in the kernel via e.g.
calling nanosleep and (2) the userspace tries to preempt the worker
simultaneously: there is a brief time window when the worker
is already in INTERRUPTIBLE task state but still has its UMCG
state as RUNNING (not yet changed to BLOCKED); the _state_ change
is handled robustly; but if the actual signal arrives while the
state change is being handled, things go wild.
At the moment this is dealt with by wrapping the state-change-handling
code in local_irq_disable/enable calls and completely ignoring non-fatal
signals, but probably a better, more direct handling of signals is
possible. Any suggestions here are welcome.
Peter Z. suggested a different approach in
https://lore.kernel.org/lkml/[email protected]/
but I'm not sure how a new syscall (called from a signal handler?)
would improve things: at least now the interrupter can be reasonably
assured that if the state change RUNNING => RUNNING | PREEMPTED
succeeded and the signal was delivered, the worker will be interrupted;
am I missing something?
Other than making signal/preemption handling more thought-out and
robust, these big things remain to be addressed:
- tracing/debugging
- faster context switches (see umcg_do_context_switch in umcg.c)
- other architectures (we will need at least arm64 in addition to amd64)
- tools/lib/umcg for userspace
- kselftests
I obviously have the last two items (libumcg and selftests) to a
certain extent, but they are a bit rough at the moment - I'll
clean them up and post to LKML once things on the kernel side are
in good shape.
See Documentation/userspace-api/umcg.rst for API usage and other details.
v0.3->v0.4 changes:
- removed server_tid and api_version fields from struct umcg_task;
- added timeout handling to sys_umcg_wait();
- implemented worker preemption via signals;
- handling idle workers and servers is changed again (see umcg.rst).
v0.2->v0.3 changes:
- the overall approach is now based on peterz@'s suggestion in
https://lore.kernel.org/patchwork/cover/1433967/
(should I add Suggested-by?)
- new protocol for working with idle workers and servers is used, to avoid
spinning in the kernel;
- waking a UMCG task now does not require spinning.
Signed-off-by: Peter Oskolkov <[email protected]>
---
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
include/linux/sched.h | 6 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 8 +-
include/uapi/linux/umcg.h | 114 +++++
init/Kconfig | 10 +
kernel/exit.c | 7 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 17 +-
kernel/sched/umcg.c | 601 +++++++++++++++++++++++++
kernel/sched/umcg.h | 13 +
kernel/sys_ni.c | 4 +
12 files changed, 784 insertions(+), 3 deletions(-)
create mode 100644 include/uapi/linux/umcg.h
create mode 100644 kernel/sched/umcg.c
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ce18119ea0d0..0c6c7fd72b0b 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -368,6 +368,8 @@
444 common landlock_create_ruleset sys_landlock_create_ruleset
445 common landlock_add_rule sys_landlock_add_rule
446 common landlock_restrict_self sys_landlock_restrict_self
+447 common umcg_ctl sys_umcg_ctl
+448 common umcg_wait sys_umcg_wait
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 50db9496c99d..185ad1cdde77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -66,6 +66,7 @@ struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
+struct umcg_task;
/*
* Task state bitmask. NOTE! These bits are also
@@ -1223,6 +1224,10 @@ struct task_struct {
unsigned long rseq_event_mask;
#endif
+#ifdef CONFIG_UMCG
+ struct umcg_task __user *umcg_task;
+#endif
+
struct tlbflush_unmap_batch tlb_ubc;
union {
@@ -1599,6 +1604,7 @@ extern struct pid *cad_pid;
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..f3e1ef8d842f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -71,6 +71,7 @@ struct open_how;
struct mount_attr;
struct landlock_ruleset_attr;
enum landlock_rule_type;
+struct umcg_task;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -1050,6 +1051,9 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _
asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
const void __user *rule_attr, __u32 flags);
asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
+
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 6de5a7fc066b..1a4c9ac0e296 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -873,8 +873,14 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
#define __NR_landlock_restrict_self 446
__SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
+#define __NR_umcg_ctl 447
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 448
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
+
+
#undef __NR_syscalls
-#define __NR_syscalls 447
+#define __NR_syscalls 449
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
new file mode 100644
index 000000000000..4faeb77b393c
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/limits.h>
+#include <linux/types.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ * sys_umcg_ctl() - register/unregister UMCG tasks;
+ * sys_umcg_wait() - wait/wake/context-switch.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ *
+ * See Documentation/userspace-api/umcg.rst for details.
+ */
+
+/*
+ * UMCG task states, the first 8 bits. The states represent the user space
+ * point of view.
+ */
+#define UMCG_TASK_NONE 0
+#define UMCG_TASK_RUNNING 1
+#define UMCG_TASK_IDLE 2
+#define UMCG_TASK_BLOCKED 3
+
+/* The first byte: RUNNING, IDLE, or BLOCKED. */
+#define UMCG_TASK_STATE_MASK 0xff
+
+/* UMCG task state flags, bits 8-15 */
+
+/*
+ * UMCG_TF_LOCKED: locked by the userspace in preparation to calling umcg_wait.
+ */
+#define UMCG_TF_LOCKED (1 << 8)
+
+/*
+ * UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted.
+ */
+#define UMCG_TF_PREEMPTED (1 << 9)
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+ /**
+ * @state: the current state of the UMCG task described by this struct.
+ *
+ * Readable/writable by both the kernel and the userspace.
+ *
+ * UMCG task state:
+ * bits 0 - 7: task state;
+ * bits 8 - 15: state flags;
+ * bits 16 - 23: reserved; must be zeroes;
+ * bits 24 - 31: for userspace use.
+ */
+ uint32_t state; /* r/w */
+
+ /**
+ * @next_tid: the TID of the UMCG task that should be context-switched
+ * into in sys_umcg_wait(). Can be zero.
+ *
+ * Running UMCG workers must have next_tid set to point to IDLE
+ * UMCG servers.
+ *
+ * Read-only for the kernel, read/write for the userspace.
+ */
+ uint32_t next_tid; /* r */
+
+ /**
+ * @idle_workers_ptr: a singly-linked list of idle workers. Can be NULL.
+ *
+ * Readable/writable by both the kernel and the userspace: the
+ * kernel adds items to the list, the userspace removes them.
+ */
+ uint64_t idle_workers_ptr; /* r/w */
+
+ /**
+ * @idle_server_tid_ptr: the address of a userspace variable holding the
+ * TID of a single idle server. Readonly.
+ */
+ uint64_t idle_server_tid_ptr; /* r */
+} __attribute__((packed, aligned(8 * sizeof(__u64))));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER: register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER: register the current task as a UMCG worker
+ */
+enum umcg_ctl_flag {
+ UMCG_CTL_REGISTER = 0x00001,
+ UMCG_CTL_UNREGISTER = 0x00002,
+ UMCG_CTL_WORKER = 0x10000,
+};
+
+/**
+ * enum umcg_wait_flag - flags to pass to sys_umcg_wait
+ * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
+ * @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
+ * (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
+ * must be set.
+ */
+enum umcg_wait_flag {
+ UMCG_WAIT_WAKE_ONLY = 1,
+ UMCG_WAIT_WF_CURRENT_CPU = 2,
+};
+
+#endif /* _UAPI_LINUX_UMCG_H */
diff --git a/init/Kconfig b/init/Kconfig
index a61c92066c2e..c15a50a61ba6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1662,6 +1662,16 @@ config MEMBARRIER
If unsure, say Y.
+config UMCG
+ bool "Enable User Managed Concurrency Groups API"
+ depends on X86_64
+ default n
+ help
+ Enable User Managed Concurrency Groups API, which form the basis
+ for an in-process M:N userspace scheduling framework.
+ At the moment this is an experimental/RFC feature that is not
+ guaranteed to be backward-compatible.
+
config KALLSYMS
bool "Load all symbols for debugging/ksymoops" if EXPERT
default y
diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..dc8398558d87 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -744,6 +744,13 @@ void __noreturn do_exit(long code)
if (unlikely(!tsk->pid))
panic("Attempted to kill the idle task!");
+#ifdef CONFIG_UMCG
+ if (unlikely(tsk->flags & PF_UMCG_WORKER))
+ tsk->flags &= ~PF_UMCG_WORKER;
+
+ tsk->umcg_task = NULL;
+#endif
+
/*
* If do_exit is called because this processes oopsed, it's possible
* that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 978fcfca5871..e4e481eee1b7 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
obj-$(CONFIG_PSI) += psi.o
obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 293f5801bf81..f7ddeed72e30 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -26,6 +26,7 @@
#include "pelt.h"
#include "smp.h"
+#include "umcg.h"
/*
* Export tracepoints that act as a bare tracehook (ie: have no trace event
@@ -3961,6 +3962,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
#endif
+#ifdef CONFIG_UMCG
+ p->umcg_task = NULL;
+ p->flags &= ~PF_UMCG_WORKER;
+#endif
}
DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -5975,10 +5980,14 @@ static inline void sched_submit_work(struct task_struct *tsk)
* in the possible wakeup of a kworker and because wq_worker_sleeping()
* requires it.
*/
- if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
preempt_disable();
if (task_flags & PF_WQ_WORKER)
wq_worker_sleeping(tsk);
+#ifdef CONFIG_UMCG
+ else if (task_flags & PF_UMCG_WORKER)
+ umcg_wq_worker_sleeping(tsk);
+#endif
else
io_wq_worker_sleeping(tsk);
preempt_enable_no_resched();
@@ -5997,9 +6006,13 @@ static inline void sched_submit_work(struct task_struct *tsk)
static void sched_update_worker(struct task_struct *tsk)
{
- if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
if (tsk->flags & PF_WQ_WORKER)
wq_worker_running(tsk);
+#ifdef CONFIG_UMCG
+ else if (tsk->flags & PF_UMCG_WORKER)
+ umcg_wq_worker_running(tsk);
+#endif
else
io_wq_worker_running(tsk);
}
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
new file mode 100644
index 000000000000..c36902ffef5e
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,601 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ *
+ * See Documentation/userspace-api/umcg.rst for details.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+#include <linux/freezer.h>
+
+#include "sched.h"
+#include "umcg.h"
+
+/**
+ * sys_umcg_ctl: (un)register the current task as a UMCG task.
+ * @flags: ORed values from enum umcg_ctl_flag; see below;
+ * @self: a pointer to struct umcg_task that describes this
+ * task and governs the behavior of sys_umcg_wait if
+ * registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ * UMCG workers:
+ * - self->state must be UMCG_TASK_IDLE
+ * - @flags & UMCG_CTL_WORKER
+ * UMCG servers:
+ * - self->state must be UMCG_TASK_RUNNING
+ * - !(@flags & UMCG_CTL_WORKER)
+ *
+ * All tasks:
+ * - self->next_tid must be zero
+ *
+ * If the conditions above are met, sys_umcg_ctl() immediately returns
+ * if the registered task is a server; a worker will be added to
+ * idle_workers_ptr, and the worker put to sleep; an idle server
+ * from idle_server_tid_ptr will be woken, if present.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
+ * is a UMCG worker, the userspace is responsible for waking its
+ * server (before or after calling sys_umcg_ctl).
+ *
+ * Return:
+ * 0 - success
+ * -EFAULT - failed to read @self
+ * -EINVAL - some other error occurred
+ */
+SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self)
+{
+ struct umcg_task ut;
+
+ if (flags == UMCG_CTL_UNREGISTER) {
+ if (self || !current->umcg_task)
+ return -EINVAL;
+
+ if (current->flags & PF_UMCG_WORKER)
+ current->flags &= ~PF_UMCG_WORKER;
+
+ current->umcg_task = NULL;
+ return 0;
+ }
+
+ /* Register the current task as a UMCG task. */
+ if (!(flags & UMCG_CTL_REGISTER))
+ return -EINVAL;
+
+ flags &= ~UMCG_CTL_REGISTER;
+ if (flags && flags != UMCG_CTL_WORKER)
+ return -EINVAL;
+
+ if (current->umcg_task || !self)
+ return -EINVAL;
+
+ if (copy_from_user(&ut, self, sizeof(ut)))
+ return -EFAULT;
+
+ if (ut.next_tid)
+ return -EINVAL;
+
+ if (flags == UMCG_CTL_WORKER) {
+ if (ut.state != UMCG_TASK_IDLE)
+ return -EINVAL;
+
+ current->umcg_task = self;
+ current->flags |= PF_UMCG_WORKER;
+
+ umcg_wq_worker_running(current); /* Will sleep. */
+ return 0;
+ }
+
+ /* This is a server task. */
+ if (ut.state != UMCG_TASK_RUNNING)
+ return -EINVAL;
+
+ current->umcg_task = self;
+ return 0;
+}
+
+/**
+ * umcg_idle_loop - sleep until the current task becomes RUNNING or a timeout expires
+ * @abs_timeout - absolute timeout in nanoseconds; zero => no timeout
+ *
+ * The function marks the current task as INTERRUPTIBLE and calls
+ * freezable_schedule(). It returns when either the timeout expires or
+ * the UMCG state of the task becomes RUNNING.
+ *
+ * Note: because UMCG workers should not be running WITHOUT attached servers,
+ * and because servers should not be running WITH attached workers,
+ * the function returns early only when a fatal signal is pending, and
+ * ignores/flushes all other signals.
+ */
+static int umcg_idle_loop(u64 abs_timeout)
+{
+ int ret;
+ struct hrtimer_sleeper timeout;
+ struct umcg_task __user *self = current->umcg_task;
+
+ if (abs_timeout) {
+ hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
+ HRTIMER_MODE_ABS);
+
+ hrtimer_set_expires_range_ns(&timeout.timer, (s64)abs_timeout,
+ current->timer_slack_ns);
+ }
+
+ while (true) {
+ u32 umcg_state;
+
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ smp_mb(); /* Order with set_current_state() above. */
+ ret = -EFAULT;
+ if (get_user(umcg_state, &self->state)) {
+ set_current_state(TASK_RUNNING);
+ goto out;
+ }
+
+ ret = 0;
+ if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
+ set_current_state(TASK_RUNNING);
+ goto out;
+ }
+
+ if (abs_timeout) {
+ hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
+
+ if (!timeout.task) {
+ ret = -ETIMEDOUT;
+ __set_current_state(TASK_RUNNING);
+
+ if (current->flags & PF_UMCG_WORKER)
+ umcg_wq_worker_running(current);
+ break;
+ }
+ }
+
+ freezable_schedule();
+
+ /*
+ * Check for timeout before checking the state, as workers
+ * are not going to return from freezable_schedule() unless
+ * they are RUNNING.
+ */
+ ret = -ETIMEDOUT;
+ if (abs_timeout && !timeout.task)
+ goto out;
+
+ ret = -EFAULT;
+ if (get_user(umcg_state, &self->state))
+ goto out;
+
+ ret = 0;
+ if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING)
+ goto out;
+
+ ret = -EINTR;
+ if (fatal_signal_pending(current))
+ goto out;
+
+ if (signal_pending(current))
+ flush_signals(current);
+ }
+
+out:
+ if (abs_timeout) {
+ hrtimer_cancel(&timeout.timer);
+ destroy_hrtimer_on_stack(&timeout.timer);
+ }
+ return ret;
+}
+
+/*
+ * Try to wake up. May be called with preempt_disable set.
+ *
+ * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
+ * ordering logic.
+ */
+static int umcg_ttwu(u32 next_tid, int wake_flags)
+{
+ struct task_struct *next;
+
+ rcu_read_lock();
+ next = find_task_by_vpid(next_tid);
+ if (!next || !(READ_ONCE(next->umcg_task))) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+
+ try_to_wake_up(next, TASK_NORMAL, wake_flags); /* Result ignored. */
+ rcu_read_unlock();
+
+ return 0;
+}
+
+/*
+ * At the moment, umcg_do_context_switch simply wakes up @next with
+ * WF_CURRENT_CPU and puts the current task to sleep.
+ *
+ * In the future an optimization will be added to adjust runtime accounting
+ * so that from the kernel scheduling perspective the two tasks are
+ * essentially treated as one. In addition, the context switch may be performed
+ * right here on the fast path, instead of going through the wake/wait pair.
+ */
+static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout)
+{
+ struct task_struct *next;
+
+ rcu_read_lock();
+ next = find_task_by_vpid(next_tid);
+ if (!next) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+
+ /* TODO: instead of wake + sleep, do a context switch. */
+ try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU); /* Result ignored. */
+ rcu_read_unlock();
+
+ return umcg_idle_loop(abs_timeout);
+}
+
+/**
+ * sys_umcg_wait: put the current task to sleep and/or wake another task.
+ * @flags: zero or a value from enum umcg_wait_flag.
+ * @abs_timeout: when to wake the task, in nanoseconds; zero for no timeout.
+ *
+ * @self->state must be UMCG_TASK_IDLE (where @self is current->umcg_task)
+ * if !(@flags & UMCG_WAIT_WAKE_ONLY).
+ *
+ * If @self->next_tid is not zero, it must point to an IDLE UMCG task.
+ * The userspace must have changed its state from IDLE to RUNNING
+ * before calling sys_umcg_wait() in the current task. This "next"
+ * task will be woken (context-switched-to on the fast path) when the
+ * current task is put to sleep.
+ *
+ * See Documentation/userspace-api/umcg.rst for details.
+ *
+ * Return:
+ * 0 - OK;
+ * -ETIMEDOUT - the timeout expired;
+ * -EFAULT - failed accessing struct umcg_task __user of the current
+ * task;
+ * -ESRCH - the task to wake not found or not a UMCG task;
+ * -EINVAL - another error happened (e.g. bad @flags, or the current
+ * task is not a UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout)
+{
+ struct umcg_task __user *self = current->umcg_task;
+ u32 next_tid;
+
+ if (!self)
+ return -EINVAL;
+
+ if (get_user(next_tid, &self->next_tid))
+ return -EFAULT;
+
+ if (flags & UMCG_WAIT_WAKE_ONLY) {
+ if (!next_tid || abs_timeout)
+ return -EINVAL;
+
+ flags &= ~UMCG_WAIT_WAKE_ONLY;
+ if (flags & ~UMCG_WAIT_WF_CURRENT_CPU)
+ return -EINVAL;
+
+ return umcg_ttwu(next_tid, flags & UMCG_WAIT_WF_CURRENT_CPU ?
+ WF_CURRENT_CPU : 0);
+ }
+
+ /* Unlock the worker, if locked. */
+ if (current->flags & PF_UMCG_WORKER) {
+ u32 umcg_state;
+
+ if (get_user(umcg_state, &self->state))
+ return -EFAULT;
+
+ if ((umcg_state & UMCG_TF_LOCKED) && cmpxchg_user_32(
+ &self->state, &umcg_state,
+ umcg_state & ~UMCG_TF_LOCKED))
+ return -EFAULT;
+ }
+
+ if (next_tid)
+ return umcg_do_context_switch(next_tid, abs_timeout);
+
+ return umcg_idle_loop(abs_timeout);
+}
+
+/*
+ * NOTE: all code below is called from workqueue submit/update, so all
+ * errors result in the termination of the current task (via SIGSEGV).
+ */
+
+/* Returns true on success, false on _any_ error. */
+static bool mark_server_running(u32 server_tid)
+{
+ struct umcg_task __user *ut_server = NULL;
+ u32 state = UMCG_TASK_IDLE;
+ struct task_struct *tsk;
+
+ rcu_read_lock();
+ tsk = find_task_by_vpid(server_tid);
+ if (tsk)
+ ut_server = READ_ONCE(tsk->umcg_task);
+ rcu_read_unlock();
+
+ if (!ut_server)
+ return false;
+
+ return !cmpxchg_user_32(&ut_server->state, &state, UMCG_TASK_RUNNING);
+}
+
+/*
+ * In the common case, change @tsk RUNNING => BLOCKED. Called from
+ * preempt_disable() and local_irq_disable() context.
+ */
+static void __umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+ struct umcg_task __user *ut_worker = tsk->umcg_task;
+ u32 prev_state, next_state, server_tid;
+ bool preempted = false;
+ int ret;
+
+ if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+ return;
+
+ smp_mb(); /* Guard the read below. */
+ if (get_user_nosleep(prev_state, &ut_worker->state))
+ goto die; /* EFAULT */
+
+ if (prev_state & UMCG_TF_LOCKED)
+ return;
+
+ if ((prev_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING)
+ return; /* the worker is in umcg_wait */
+
+retry_once:
+ next_state = prev_state & ~UMCG_TASK_STATE_MASK;
+ next_state |= UMCG_TASK_BLOCKED;
+ preempted = prev_state & UMCG_TF_PREEMPTED;
+
+ ret = cmpxchg_user_32(&ut_worker->state, &prev_state, next_state);
+ if (ret == -EAGAIN) {
+ if (preempted)
+ goto die; /* Preemption can only happen once. */
+
+ if (prev_state != (UMCG_TASK_RUNNING | UMCG_TF_PREEMPTED))
+ goto die; /* Only preemption can happen. */
+
+ preempted = true;
+ goto retry_once;
+ }
+ if (ret)
+ goto die; /* EFAULT */
+
+ if (get_user_nosleep(server_tid, &ut_worker->next_tid))
+ goto die; /* EFAULT */
+
+ if (!server_tid)
+ goto die; /* The user broke the contract. */
+
+ if (!mark_server_running(server_tid))
+ goto die; /* The user broke the contract. */
+
+ /* TODO: make a smarter context switch when available. */
+ umcg_ttwu(server_tid, WF_CURRENT_CPU);
+ return;
+
+die:
+ force_sig(SIGSEGV);
+}
+
+/* Called from sched_submit_work() with preempt_disable. */
+void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+ /*
+ * Although UMCG preemption state change (UMCG_TF_PREEMPTED) racing
+ * with the worker blocking in a syscall is handled correctly in
+ * __umcg_wq_worker_sleeping() above, an actual signal delivered to
+ * the worker during the execution of this function might be causing
+ * issues, based on some observed test failures. Disabling IRQs
+ * makes the failures go away.
+ */
+ local_irq_disable();
+ __umcg_wq_worker_sleeping(tsk);
+ local_irq_enable();
+}
+
+/**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
+ *
+ * The function tries several times to enqueue the worker; if all attempts
+ * fail due to contention, the function sleeps with a timeout (exponential
+ * back-off).
+ *
+ * Note: the function is called from local_irq_disable() context, so it
+ * re-enables IRQs before calling umcg_idle_loop().
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+{
+ u64 __user *node = &ut_worker->idle_workers_ptr;
+ u64 sleep_timeout = 500; /* ns */
+ u64 __user *head_ptr;
+ u64 head;
+
+ if (get_user_nosleep(head, node) || !head)
+ return false;
+
+ head_ptr = (u64 __user *)head;
+
+ while (true) {
+ int step, ret;
+
+ for (step = 0; step < nr_cpu_ids; ++step) {
+ u64 first;
+
+ if (get_user_nosleep(first, head_ptr))
+ return false;
+
+ if (put_user_nosleep(first, node))
+ return false;
+
+ ret = cmpxchg_user_64(head_ptr, &first, (u64)node);
+ if (!ret)
+ return true;
+
+ if (ret != -EAGAIN)
+ return false;
+ }
+
+ local_irq_enable(); /* This is called with IRQs disabled. */
+ ret = umcg_idle_loop(ktime_get_real() + sleep_timeout);
+ local_irq_disable();
+ if (fatal_signal_pending(current))
+ return false;
+
+ if (ret != -ETIMEDOUT)
+ return false;
+
+ if (sleep_timeout < 50000)
+ sleep_timeout += sleep_timeout;
+ }
+}
+
+/**
+ * get_idle_server - retrieve an idle server, if present.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+static bool get_idle_server(struct umcg_task __user *ut_worker, u32 *server_tid)
+{
+ u64 head;
+ u32 tid;
+ int ret;
+
+ *server_tid = 0; /* Empty result is OK. */
+
+ if (get_user_nosleep(head, &ut_worker->idle_server_tid_ptr))
+ return false;
+
+ if (!head)
+ return false;
+
+ if (get_user(tid, (u32 __user *)head))
+ return false;
+
+ if (!tid)
+ return true; /* No idle servers. */
+
+ ret = cmpxchg_user_32((u32 __user *)head, &tid, 0);
+ if (ret == -EAGAIN)
+ return true; /* Another worker got it. */
+
+ if (ret)
+ return false;
+
+ *server_tid = tid;
+ return mark_server_running(tid);
+}
+
+/*
+ * Returns true to wait, false to return to the userspace. Called with IRQs
+ * disabled. In the common case, enqueues the worker to idle_workers_ptr list
+ * and wakes the idle server (if present).
+ */
+static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid)
+{
+ struct umcg_task __user *ut_worker = tsk->umcg_task;
+ u32 prev_state, next_state;
+ int ret = 0;
+
+ *server_tid = 0;
+
+ if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+ return false;
+
+ if (fatal_signal_pending(tsk))
+ return false;
+
+ smp_mb(); /* Guard the read below. */
+ if (get_user_nosleep(prev_state, &ut_worker->state))
+ goto die;
+
+ if ((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
+ u32 tid;
+
+ if (prev_state & UMCG_TF_LOCKED)
+ return true; /* Wakeup: wait but don't enqueue. */
+
+ smp_mb(); /* Order getting state and getting server_tid */
+
+ if (get_user_nosleep(tid, &ut_worker->next_tid))
+ goto die;
+
+ if (prev_state & UMCG_TF_PREEMPTED) {
+ if (!tid)
+ goto die; /* PREEMPTED workers must have a server. */
+
+ /* Always enqueue preempted workers. */
+ if (!mark_server_running(tid))
+ goto die;
+
+ *server_tid = tid;
+ } else if (tid)
+ return false; /* pass-through: RUNNING with a server. */
+
+ /* If !PREEMPTED, the worker gets here via UMCG_WAIT_WAKE_ONLY */
+ } else if (unlikely((prev_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
+ (prev_state & UMCG_TF_LOCKED)))
+ return false; /* The worker prepares to sleep or to unregister. */
+
+ next_state = prev_state & ~UMCG_TASK_STATE_MASK;
+ next_state |= UMCG_TASK_IDLE;
+
+ if (prev_state != next_state)
+ ret = cmpxchg_user_32(&ut_worker->state, &prev_state, next_state);
+ if (ret)
+ goto die;
+
+ if (!enqueue_idle_worker(ut_worker))
+ goto die;
+
+ smp_mb(); /* Order enqueuing the worker with getting the server. */
+ if (!(*server_tid) && !get_idle_server(ut_worker, server_tid))
+ goto die;
+
+ return true;
+
+die:
+ force_sig(SIGSEGV);
+ return false;
+}
+
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+ /* Avoid recursion by removing PF_UMCG_WORKER */
+ current->flags &= ~PF_UMCG_WORKER;
+
+ do {
+ bool should_wait;
+ u32 server_tid;
+
+ local_irq_disable(); /* See comment in umcg_wq_worker_sleeping */
+ should_wait = process_waking_worker(tsk, &server_tid);
+ local_irq_enable();
+
+ if (!should_wait)
+ break;
+
+ if (server_tid)
+ umcg_do_context_switch(server_tid, 0);
+ else
+ umcg_idle_loop(0);
+ } while (true);
+
+ current->flags |= PF_UMCG_WORKER;
+}
diff --git a/kernel/sched/umcg.h b/kernel/sched/umcg.h
index 1db283071ca6..8129db57df20 100644
--- a/kernel/sched/umcg.h
+++ b/kernel/sched/umcg.h
@@ -9,6 +9,19 @@
#include <asm/asm.h>
#include <linux/atomic.h>
+#ifdef CONFIG_UMCG
+
+struct task_struct;
+
+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+void umcg_wq_worker_sleeping(struct task_struct *tsk);
+void umcg_wq_worker_running(struct task_struct *tsk);
+
+#endif /* CONFIG_UMCG */
+
/* TODO: move atomic operations below into arch/ headers */
static inline int __try_cmpxchg_user_32(u32 *uval, u32 __user *uaddr,
u32 oldval, u32 newval)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..cd1be6356e42 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -272,6 +272,10 @@ COND_SYSCALL(landlock_create_ruleset);
COND_SYSCALL(landlock_add_rule);
COND_SYSCALL(landlock_restrict_self);
+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+
/* arch/example/kernel/sys_example.c */
/* mm/fadvise.c */
--
2.25.1
I've attached the rendering of the doc/rst file, in case somebody
prefers it this way.
On Sun, Aug 1, 2021 at 1:06 PM Peter Oskolkov <[email protected]> wrote:
>
> Document User Managed Concurrency Groups syscalls, data structures,
> state transitions, etc.
>
> Signed-off-by: Peter Oskolkov <[email protected]>
> ---
> Documentation/userspace-api/umcg.rst | 532 +++++++++++++++++++++++++++
> 1 file changed, 532 insertions(+)
> create mode 100644 Documentation/userspace-api/umcg.rst
>
> diff --git a/Documentation/userspace-api/umcg.rst b/Documentation/userspace-api/umcg.rst
> new file mode 100644
> index 000000000000..680bf336bfdc
> --- /dev/null
> +++ b/Documentation/userspace-api/umcg.rst
> @@ -0,0 +1,532 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================
> +UMCG Userspace API
> +=====================================
> +
> +User Managed Concurrency Groups (UMCG) is an M:N threading
> +subsystem/toolkit that lets user space application developers
> +implement in-process user space schedulers.
> +
> +.. contents:: :local:
> +
> +Why? Heterogeneous in-process workloads
> +=======================================
> +The Linux kernel's CFS scheduler is designed for the "common" use case,
> +with efficiency/throughput in mind. Work isolation and workloads of
> +different "urgency" are addressed by tools such as cgroups, CPU
> +affinity, priorities, etc., which are difficult or impossible to
> +efficiently use in-process.
> +
> +For example, a single DBMS process may receive tens of thousands of
> +requests per second; some of these requests may have strong response
> +latency requirements as they serve live user requests (e.g. login
> +authentication); some of these requests may not care much about
> +latency but must be served within a certain time period (e.g. an
> +hourly aggregate usage report); some of these requests are to be
> +served only on a best-effort basis and can be NACKed under high load
> +(e.g. an exploratory research/hypothesis testing workload).
> +
> +Beyond different work item latency/throughput requirements as outlined
> +above, the DBMS may need to provide certain guarantees to different
> +users; for example, user A may "reserve" 1 CPU for their
> +high-priority/low latency requests, 2 CPUs for mid-level throughput
> +workloads, and be allowed to send as many best-effort requests as
> +possible, which may or may not be served, depending on the DBMS load.
> +Besides, the best-effort work, started when the load was low, may need
> +to be delayed if suddenly a large amount of higher-priority work
> +arrives. With hundreds or thousands of users like this, it is very
> +difficult to guarantee the application's responsiveness using standard
> +Linux tools while maintaining high CPU utilization.
> +
> +Gaming is another use case: some in-process work must be completed
> +before a certain deadline dictated by frame rendering schedule, while
> +other work items can be delayed; some work may need to be
> +cancelled/discarded because the deadline has passed; etc.
> +
> +User Managed Concurrency Groups is an M:N threading toolkit that
> +allows constructing user space schedulers designed to efficiently
> +manage heterogeneous in-process workloads described above while
> +maintaining high CPU utilization (95%+).
> +
> +Requirements
> +============
> +One relatively established way to design high-efficiency, low-latency
> +systems is to split all work into small on-cpu work items, with
> +asynchronous I/O and continuations, all executed on a thread pool with
> +the number of threads not exceeding the number of available CPUs.
> +Although this approach works, it is quite difficult to develop and
> +maintain such a system, as, for example, small continuations are
> +difficult to piece together when debugging. Besides, such asynchronous
> +callback-based systems tend to be somewhat cache-inefficient, as
> +continuations can get scheduled on any CPU regardless of cache
> +locality.
> +
> +M:N threading and cooperative user space scheduling enables controlled
> +CPU usage (minimal OS preemption), synchronous coding style, and
> +better cache locality.
> +
> +Specifically:
> +
> +- a variable/fluctuating number M of "application" threads should be
> + "scheduled over" a relatively fixed number N of "kernel" threads,
> + where N is less than or equal to the number of CPUs available;
> +- only those application threads that are attached to kernel threads
> + are scheduled "on CPU";
> +- application threads should be able to cooperatively yield to each other;
> +- when an application thread blocks in kernel (e.g. in I/O), this
> + becomes a scheduling event ("block") that the userspace scheduler
> + should be able to efficiently detect, and reassign a waiting
> + application thread to the freed "kernel" thread;
> +- when a blocked application thread wakes (e.g. its I/O operation
> + completes), this event ("wake") should also be detectable by the
> + userspace scheduler, which should be able to either quickly dispatch
> + the newly woken thread to an idle "kernel" thread or, if all "kernel"
> + threads are busy, put it in the waiting queue;
> +- in addition to the above, it would be extremely useful for a
> + separate in-process "watchdog" facility to be able to monitor the
> + state of each of the M+N threads, and to intervene in case of runaway
> + workloads (interrupt/preempt).
> +
> +
> +UMCG kernel API
> +===============
> +Based on the requirements above, the UMCG *kernel* API is built around
> +the following ideas:
> +
> +- *UMCG server*: a task/thread representing "kernel threads", or CPUs
> + from the requirements above;
> +- *UMCG worker*: a task/thread representing "application threads", to
> + be scheduled over servers;
> +- UMCG *task state*: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG
> + task (a server or a worker) can be in;
> +- UMCG task *state flag*: LOCKED, PREEMPTED: additional state flags
> + that can be ORed with the task state to communicate additional information
> + to the kernel;
> +- ``struct umcg_task``: a per-task userspace set of data fields, usually
> + residing in the TLS, that fully reflects the current task's UMCG
> + state and controls the way the kernel manages the task;
> +- ``sys_umcg_ctl()``: a syscall used to register the current task/thread
> + as a server or a worker, or to unregister a UMCG task;
> +- ``sys_umcg_wait()``: a syscall used to put the current task to
> + sleep and/or wake another task, potentially context-switching
> + between the two tasks on-CPU synchronously.
> +
> +
> +Servers
> +=======
> +
> +When a task/thread is registered as a server, it is in RUNNING
> +state and behaves like any other normal task/thread. In addition,
> +servers can interact with other UMCG tasks via sys_umcg_wait():
> +
> +- servers can voluntarily suspend their execution (wait), becoming IDLE;
> +- servers can wake other IDLE servers;
> +- servers can context-switch between each other.
> +
> +Note that if a server blocks in the kernel *not* via sys_umcg_wait(),
> +it still retains its RUNNING state.
> +
> +Also note that servers can be used for fast on-CPU context switching
> +across process boundaries; server-worker interactions assume they
> +belong to the same mm.
> +
> +See the next section on how servers interact with workers.
> +
> +Workers
> +=======
> +
> +A worker cannot be RUNNING without having a server associated
> +with it, so when a task is first registered as a worker, it enters
> +the IDLE state.
> +
> +- a worker becomes RUNNING when a server calls sys_umcg_wait to
> + context-switch into it; the server goes IDLE, and the worker becomes
> + RUNNING in its place;
> +- when a running worker blocks in the kernel, it becomes BLOCKED,
> + its associated server becomes RUNNING and the server's
> + sys_umcg_wait() call from the bullet above returns; this transition
> + is sometimes called "block detection";
> +- when the blocking operation of a BLOCKED worker completes, the worker
> + becomes IDLE and is added to the list of idle workers; if there
> + is an idle server waiting, the kernel wakes it; this transition
> + is sometimes called "wake detection";
> +- running workers can voluntarily suspend their execution (wait),
> + becoming IDLE; their associated servers are woken;
> +- a RUNNING worker can context-switch with an IDLE worker; the server
> + of the switched-out worker is transferred to the switched-in worker;
> +- any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
> + this is a server running the worker as described in the first bullet
> + in this list, the worker remains IDLE but is added to the idle workers
> + list; this "wake" operation exists for completeness, to make sure
> + wait/wake/context-switch operations are available for all UMCG tasks;
> +- the userspace can preempt a RUNNING worker by marking it
> + ``RUNNING|PREEMPTED`` and sending a signal to it; the userspace should
> + have installed a NOP signal handler for the signal; the kernel will
> + then transition the worker into ``IDLE|PREEMPTED`` state and wake
> + its associated server.
> +
> +UMCG task states
> +================
> +
> +Important: all state transitions described below involve at least
> +two steps: the change of the state field in ``struct umcg_task``,
> +for example ``RUNNING`` to ``IDLE``, and the corresponding change in
> +``struct task_struct`` state, for example a transition between the task
> +running on CPU and being descheduled and removed from the kernel runqueue.
> +The key principle of UMCG API design is that the party initiating
> +the state transition modifies the state variable.
> +
> +For example, a task going ``IDLE`` first changes its state from ``RUNNING``
> +to ``IDLE`` in the userspace and then calls ``sys_umcg_wait()``, which
> +completes the transition.
> +
> +Note on documentation: in ``include/uapi/linux/umcg.h``, task states
> +have the form ``UMCG_TASK_RUNNING``, ``UMCG_TASK_BLOCKED``, etc. In
> +this document these are usually referred to simply as ``RUNNING`` and
> +``BLOCKED``, unless it creates ambiguity. Task state flags, e.g.
> +``UMCG_TF_PREEMPTED``, are treated similarly.
> +
> +UMCG task states reflect the view from the userspace, rather than from
> +the kernel. There are three fundamental task states:
> +
> +- ``RUNNING``: indicates that the task is schedulable by the kernel; applies
> + to both servers and workers;
> +- ``IDLE``: indicates that the task is *not* schedulable by the kernel
> + (see ``umcg_idle_loop()`` in ``kernel/sched/umcg.c``); applies to
> + both servers and workers;
> +- ``BLOCKED``: indicates that the worker is blocked in the kernel;
> + does not apply to servers.
> +
> +In addition to the three states above, two state flags help with
> +state transitions:
> +
> +- ``LOCKED``: the userspace is preparing the worker for a state transition
> + and "locks" the worker until the worker is ready for the kernel to
> + act on the state transition; used similarly to preempt_disable or
> + irq_disable in the kernel; applies only to workers in ``RUNNING`` or
> + ``IDLE`` state; ``RUNNING|LOCKED`` means "this worker is about to
> + become ``RUNNING``", while ``IDLE|LOCKED`` means "this worker is about
> + to become ``IDLE`` or unregister";
> +- ``PREEMPTED``: the userspace indicates it wants the worker to be
> + preempted; there are no situations when both ``LOCKED`` and ``PREEMPTED``
> + flags are set at the same time.
> +
> +struct umcg_task
> +================
> +
> +From ``include/uapi/linux/umcg.h``:
> +
> +.. code-block:: C
> +
> + struct umcg_task {
> + uint32_t state; /* r/w */
> + uint32_t next_tid; /* r */
> + uint64_t idle_workers_ptr; /* r/w */
> + uint64_t idle_server_tid_ptr; /* r */
> + };
> +
> +Each UMCG task is identified by ``struct umcg_task``, which is provided
> +to the kernel when the task is registered via ``sys_umcg_ctl()``.
> +
> +- ``uint32_t state``: the current state of the task this struct identifies,
> + as described in the previous section. Readable/writable by both the kernel
> + and the userspace.
> +
> + - bits 0 - 7: task state (RUNNING, IDLE, BLOCKED);
> + - bits 8 - 15: state flags (LOCKED, PREEMPTED);
> + - bits 16 - 23: reserved; must be zeroes;
> + - bits 24 - 31: for userspace use.
> +
> +- ``uint32_t next_tid``: contains the TID of the task to context-switch-into
> + in ``sys_umcg_wait()``; can be zero; writable by the userspace, readable
> + by the kernel; if this is a RUNNING worker, this field contains
> + the TID of the server that should be woken when this worker blocks;
> + see ``sys_umcg_wait()`` for more details;
> +- ``uint64_t idle_workers_ptr``: this field forms a singly-linked list
> + of idle workers: all RUNNING workers have this field set to point
> + to the head of the list (a pointer variable in the userspace).
> +
> + When a worker's blocking operation in the kernel completes, the kernel
> + changes the worker's state from ``BLOCKED`` to ``IDLE`` and adds the worker
> + to the top of the list of idle workers using this logic:
> +
> + .. code-block:: C
> +
> + /* kernel side */
> + u64 *head = (u64 *)(worker->idle_workers_ptr); /* get the head pointer */
> + u64 *first = (u64 *)*head; /* get the first element */
> +
> + /* make the worker's ptr point to the first element */
> + worker->idle_workers_ptr = first;
> +
> + /* make the head pointer point to this worker */
> + if (cmpxchg(head, &first, &worker->idle_workers_ptr))
> + /* success */
> + else
> + /* retry, with exponential back-off */
> +
> +
> + In the userspace the list is cleared atomically using this logic:
> +
> + .. code-block:: C
> +
> + /* userspace side */
> + uint64_t *idle_workers = (uint64_t *)*head;
> +
> + /* move the list from the global head to the local idle_workers */
> + if (cmpxchg(&head, &idle_workers, NULL))
> + /* success: process idle_workers */
> + else
> + /* retry */
> +
> + The userspace re-points workers' idle_workers_ptr to the list head
> + variable before the worker is allowed to become RUNNING again.
> +
> +- ``uint64_t idle_server_tid_ptr``: points to a pointer variable in the
> + userspace that points to an idle server, i.e. a server in IDLE state waiting
> + in sys_umcg_wait(); read-only; workers must have this field set; not used
> + in servers.
> +
> + When a worker's blocking operation in the kernel completes, the kernel
> + changes the worker's state from ``BLOCKED`` to ``IDLE``, adds the worker
> + to the list of idle workers, and checks whether
> + ``*idle_server_tid_ptr`` is not zero. If not, the kernel tries to cmpxchg()
> + it with zero; if cmpxchg() succeeds, the kernel will then wake the server.
> + See `State transitions`_ below for more details.
> +
> +sys_umcg_ctl()
> +==============
> +
> +``int sys_umcg_ctl(uint32_t flags, struct umcg_task *self)`` is used to
> +register or unregister the current task as a worker or server. Flags
> +can be one of the following:
> +
> +- ``UMCG_CTL_REGISTER``: register a server;
> +- ``UMCG_CTL_REGISTER | UMCG_CTL_WORKER``: register a worker;
> +- ``UMCG_CTL_UNREGISTER``: unregister the current server or worker.
> +
> +When registering a task, ``self`` must point to ``struct umcg_task``
> +describing this server or worker; the pointer must remain valid until
> +the task is unregistered.
> +
> +When registering a server, ``self->state`` must be ``RUNNING``; all other
> +fields in ``self`` must be zeroes.
> +
> +When registering a worker, ``self->state`` must be ``IDLE``;
> +``self->idle_server_tid_ptr`` and ``self->idle_workers_ptr`` must be
> +valid pointers as described in `struct umcg_task`_; ``self->next_tid`` must
> +be zero.
> +
> +When unregistering a task, ``self`` must be ``NULL``.
> +
> +sys_umcg_wait()
> +===============
> +
> +``int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout)`` operates
> +on registered UMCG servers and workers: ``struct umcg_task *self`` provided
> +to ``sys_umcg_ctl()`` when registering the current task is consulted
> +in addition to ``flags`` and ``abs_timeout`` parameters.
> +
> +The function can be used to perform one of the three operations:
> +
> +- wait: if ``self->next_tid`` is zero, ``sys_umcg_wait()`` puts the current
> + server or worker to sleep;
> +- wake: if ``self->next_tid`` is not zero, and ``flags & UMCG_WAIT_WAKE_ONLY``,
> + the task identified by ``next_tid`` is woken (must be in ``IDLE`` state);
> +- context switch: if ``self->next_tid`` is not zero, and
> + ``!(flags & UMCG_WAIT_WAKE_ONLY)``, the current task is put to sleep and
> + the next task is woken, synchronously switching between the tasks on the
> + current CPU on the fast path.
> +
> +Flags can be zero or a combination of the following values:
> +
> +- ``UMCG_WAIT_WAKE_ONLY``: wake the next task, don't put the current task
> + to sleep;
> +- ``UMCG_WAIT_WF_CURRENT_CPU``: wake the next task on the current CPU;
> + this flag has an effect only if ``UMCG_WAIT_WAKE_ONLY`` is set: context
> + switching is always attempted on the current CPU.
> +
> +The section below provides more details on how servers and workers interact
> +via ``sys_umcg_wait()``, during worker block/wake events, and during
> +worker preemption.
> +
> +State transitions
> +=================
> +
> +As mentioned above, the key principle of UMCG state transitions is that
> +**the party initiating the state transition modifies the state of affected
> +tasks**.
> +
> +Below, "``TASK:STATE``" indicates a task T, where T can be either W for
> +worker or S for server, in state S, where S can be one of the three states,
> +potentially ORed with a state flag. Each individual state transition
> +is an atomic operation (cmpxchg) unless indicated otherwise. Also note
> +that **the order of state transitions is important and is part of the
> +contract between the userspace and the kernel. The kernel is free
> +to kill the task (SIGSEGV) if the contract is broken.**
> +
> +Some worker state transitions below include adding ``LOCKED`` flag to
> +worker state. This is done to indicate to the kernel that the worker
> +is transitioning state and should not participate in the block/wake
> +detection routines, which can happen due to interrupts/pagefaults/signals.
> +
> +``IDLE|LOCKED`` means that a running worker is preparing to sleep, so
> +interrupts should not lead to server wakeup; ``RUNNING|LOCKED`` means that
> +an idle worker is going to be "scheduled to run", but may not yet have its
> +server set up properly.
> +
> +Key state transitions:
> +
> +- server to worker context switch ("schedule a worker to run"):
> + ``S:RUNNING+W:IDLE => S:IDLE+W:RUNNING``:
> +
> + - in the userspace, in the context of the server S running:
> +
> + - ``S:RUNNING => S:IDLE`` (mark self as idle)
> + - ``W:IDLE => W:RUNNING|LOCKED`` (mark the worker as running)
> + - ``W.next_tid := S.tid; S.next_tid := W.tid``
> + (link the server with the worker)
> + - ``W:RUNNING|LOCKED => W:RUNNING`` (unlock the worker)
> + - ``S: sys_umcg_wait()`` (make the syscall)
> +
> + - the kernel context switches from the server to the worker; the server
> + sleeps until it becomes ``RUNNING`` during one of the transitions below;
> +
> +- worker to server context switch (worker "yields"):
> + ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE``:
> +
> + - in the userspace, in the context of the worker W running (note that
> + a running worker has its ``next_tid`` set to point to its server):
> +
> + - ``W:RUNNING => W:IDLE|LOCKED`` (mark self as idle)
> + - ``S:IDLE => S:RUNNING`` (mark the server as running)
> + - ``W: sys_umcg_wait()`` (make the syscall)
> +
> + - the kernel removes the ``LOCKED`` flag from the worker's state and
> + context switches from the worker to the server; the worker
> + sleeps until it becomes ``RUNNING``;
> +
> +- worker to worker context switch:
> + ``W1:RUNNING+W2:IDLE => W1:IDLE+W2:RUNNING``:
> +
> + - in the userspace, in the context of W1 running:
> +
> + - ``W2:IDLE => W2:RUNNING|LOCKED`` (mark W2 as running)
> + - ``W1:RUNNING => W1:IDLE|LOCKED`` (mark self as idle)
> + - ``W2.next_tid := W1.next_tid; S.next_tid := W2.tid``
> + (transfer the server W1 => W2)
> + - ``W1.next_tid := W2.tid`` (indicate that W1 should
> + context-switch into W2)
> + - ``W2:RUNNING|LOCKED => W2:RUNNING`` (unlock W2)
> + - ``W1: sys_umcg_wait()`` (make the syscall)
> +
> + - same as above, the kernel removes the ``LOCKED`` flag from the W1's state
> + and context switches to next_tid;
> +
> +- worker wakeup: ``W:IDLE => W:RUNNING``:
> +
> + - in the userspace, a server S can wake a worker W without "running" it:
> +
> + - ``S.next_tid := W.tid``
> + - ``W.next_tid := 0``
> + - ``W:IDLE => W:RUNNING``
> + - ``sys_umcg_wait(UMCG_WAIT_WAKE_ONLY)`` (make the syscall)
> +
> + - the kernel will wake the worker W; as the worker does not have a server
> + assigned, "wake detection" will happen, the worker will be immediately
> + marked as ``IDLE`` and added to idle workers list; an idle server, if any,
> + will be woken (see 'wake detection' below);
> + - Note: if needed, it is possible for a worker to wake another worker:
> + the waker marks itself "IDLE|LOCKED", points its next_tid to the wakee,
> + makes the syscall, restores its server in next_tid, marks itself
> + as ``RUNNING``.
> +
> +- block detection: worker blocks in the kernel: ``S:IDLE+W:RUNNING => S:RUNNING+W:BLOCKED``:
> +
> + - when a worker blocks in the kernel in ``RUNNING`` state (not ``LOCKED``),
> + before descheduling the task from the CPU the kernel performs these
> + operations:
> +
> + - ``W:RUNNING => W:BLOCKED``
> + - ``S := W.next_tid``
> + - ``S:IDLE => S:RUNNING``
> + - ``try_to_wake_up(S)``
> +
> + - if any of the first three operations above fail, the worker is killed via
> + ``SIGSEGV``. Note that ``ttwu(S)`` is not required to succeed, as the
> + server may still be transitioning to sleep in ``sys_umcg_wait()``; before
> + actually putting the server to sleep its UMCG state is checked and, if
> + it is ``RUNNING``, sys_umcg_wait() returns to the userspace;
> + - if the worker has its ``LOCKED`` flag set, block detection does not trigger,
> + as the worker is assumed to be in the userspace scheduling code.
> +
> +- wake detection: worker wakes in the kernel: ``W:BLOCKED => W:IDLE``:
> +
> + - every worker's return to the userspace is intercepted:
> +
> + - ``start:`` (a label)
> + - if ``W:RUNNING & W.next_tid != 0``: let the worker exit to the userspace,
> + as this is a ``RUNNING`` worker with a server;
> + - ``W:* => W:IDLE`` (previously blocked workers, or workers woken without
> + a server, are not allowed to return to the userspace);
> + - the worker is appended to ``W.idle_workers_ptr`` idle workers list;
> + - ``S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S:RUNNING; ttwu(S)``
> + - ``idle_loop(W)``: this is the same idle loop that ``sys_umcg_wait()``
> + uses: it breaks only when the worker becomes ``RUNNING``; when the
> + idle loop exits, it is assumed that the userspace has properly
> + removed the worker from the idle workers list before marking it
> + ``RUNNING``;
> + - ``goto start;`` (repeat from the beginning).
> +
> + - the logic above is a bit more complicated in the presence of ``LOCKED`` or
> + ``PREEMPTED`` flags, but the main invariants stay the same:
> +
> + - only ``RUNNING`` workers with servers assigned are allowed to run
> + in the userspace (unless ``LOCKED``);
> + - newly ``IDLE`` workers are added to the idle workers list; any
> + user-initiated state change assumes the userspace properly removed
> + the worker from the list;
> + - as with block detection, any "breach of contract" by the userspace
> + will result in the termination of the task via ``SIGSEGV``.
> +
> +- worker preemption: ``S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED``:
> +
> + - when the userspace wants to preempt a ``RUNNING`` worker, it changes
> + its state, atomically, ``RUNNING => RUNNING|PREEMPTED`` and sends a signal
> + to the worker via ``tgkill()``; the signal handler, previously set up
> + by the userspace, can be a NOP (note that only ``RUNNING`` workers can be
> + preempted);
> + - if the worker, at the moment the signal arrives, is still running
> + on-CPU in the userspace, the "wake detection" code will be triggered;
> + in addition to what was described above, it will check whether the worker is in
> + ``RUNNING|PREEMPTED`` state:
> +
> + - ``W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED``
> + - ``S := W.next_tid``
> + - ``S:IDLE => S:RUNNING``
> + - ``try_to_wake_up(S)``
> +
> + - if the signal arrives after the worker blocks in the kernel, "block
> + detection" happens as described above, with the following change:
> +
> + - ``W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED``
> + - ``S := W.next_tid``
> + - ``S:IDLE => S:RUNNING``
> + - ``try_to_wake_up(S)``
> +
> + - in any case, the worker's server is woken, with its attached worker
> + (``S.next_tid``) either in ``BLOCKED|PREEMPTED`` or ``IDLE|PREEMPTED``
> + state.
> +
> +Server-only use cases
> +=====================
> +
> +Some workloads/applications may benefit from fast and synchronous on-CPU
> +user-initiated context switches without the need for full userspace
> +scheduling (block/wake detection). These applications can use "standalone"
> +UMCG servers to wait/wake/context-switch, including across process boundaries.
> +
> +These "worker-less" operations involve trivial ``RUNNING`` <==> ``IDLE``
> +state changes, not discussed here for brevity.
> +
> --
> 2.25.1
>
These state transition descriptions are very helpful, but what is not
clear is the details of these transitions when there are concurrent
wake/waits. I do not know enough about the kernel code to be able to
read the implementation and answer my own questions.
For example, imagine two worker threads W1 and W2. W1 adds itself to a
concurrent list and calls umcg_wait(next_tid = 0). W2 pops from the list
and calls umcg_wait(UMCG_WAIT_WAKE_ONLY | UMCG_WAIT_WF_CURRENT_CPU) on the
popped worker, W1 in this example.
If W1 calls umcg_wait first, W2 context-switches to W1 and W2's state
changes to IDLE. My understanding is that wake detection/block does not
apply to this case.
If W2 calls umcg_wait first, what happens? I can imagine two different
behaviour in this case:
1. W2 waits for W1 to call umcg_wait, by spinning or blocking, and then
execution proceeds like the first ordering.
2. W2 sets W1's state to RUNNING. When W1 eventually calls umcg_wait, it
simply notices the state change and returns without context-switching.
In this case, W1 is not migrated to W2's CPU.
Behaviour 1 makes me uncomfortable since it means umcg_wait must wait for
cooperation that potentially never comes.
But in Behaviour 2, the state of W2 after both calls to umcg_wait is not
clear to me, either. I could imagine that W2 is set to IDLE, but since W1
is not migrated, W2 could also simply be left RUNNING.
Which behaviour is correct and in what state does W2 end up?
Thierry
On Wed, Aug 4, 2021 at 12:13 PM Thierry Delisle <[email protected]> wrote:
>
> These state transition descriptions are very helpful, but what is not
> clear is the details of these transitions when there are concurrent
> wake/waits. I do not know enough about the kernel code to be able to
> read the implementation and answer my own questions.
>
> For example, imagine two worker threads W1 and W2. W1 adds itself to a
> concurrent list and calls umcg_wait(next_tid = 0). W2 pops from the list
> and calls umcg_wait(UMCG_WAIT_WAKE_ONLY | UMCG_WAIT_WF_CURRENT_CPU) on the
> popped worker, W1 in this example.
All _umcg_ state changes here happen in the userspace before
sys_umcg_wait() is called. So:
W1: cmpxchg W1:RUNNING => W1:IDLE
- if OK, call sys_umcg_wait()
- if failed, do something else (notes below)
W2: cmpxchg W1:IDLE => W1:RUNNING
- if OK, lock itself, set W2:next_tid to W1, call sys_umcg_wait()
(will not block nor spin), restore next_tid and state/unlock upon
syscall return
- if failed, do something else
So assuming the cmpxchg() calls succeeded, sys_umcg_wait() will be called.
W1 sys_umcg_wait(): (W1 sleeping):
- (1) mark itself as TASK_INTERRUPTIBLE (sleeping)
- (2) check _umcg_ state
- (a) if UMCG_RUNNING, mark itself as TASK_RUNNING, return to userspace
- (b) if still UMCG_IDLE, sleep
- (3) upon wakeup, go to step (1)
W2 sys_umcg_wait(): (wake W1):
- call try_to_wake_up(W1): if W1 is INTERRUPTIBLE, change it to
TASK_RUNNING, wake
- return
Note the ordering and interplay of UMCG state changes
(UMCG_IDLE/UMCG_RUNNING) and TASK state changes
(TASK_INTERRUPTIBLE/TASK_RUNNING).
As you can see, W2 does not block nor spin. W1 will either catch the
_umcg_ state change to UMCG_RUNNING and abort the sleep, or ttwu() will
wake it _after_ it is marked as UMCG_RUNNING.
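In pseudo-kernel-C, the wait side follows the usual prepare-to-wait
pattern; this is only a simplified sketch (the real umcg_idle_loop()
also handles timeouts, pagefaults while marked for sleep, signals and
worker-specific logic, all omitted here):

/* Simplified sketch only, not the actual umcg_idle_loop(). */
static int umcg_idle_loop_sketch(void)
{
	struct umcg_task __user *self = current->umcg_task;
	u32 state;

	while (true) {
		/* (1) Mark self as going to sleep *before* checking state. */
		set_current_state(TASK_INTERRUPTIBLE);

		/* (2) Re-check the UMCG state. */
		if (get_user(state, &self->state)) {
			__set_current_state(TASK_RUNNING);
			return -EFAULT;
		}

		/* (2a) The waker marked us UMCG_RUNNING: return to userspace. */
		if ((state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
			__set_current_state(TASK_RUNNING);
			return 0;
		}

		/* (2b) Still UMCG_IDLE: sleep until ttwu() from the waker. */
		schedule();

		/* (3) Woken up: loop back and re-check the UMCG state. */
	}
}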
Now what happens if the cmpxchg() calls above fail? That means that W1
has not yet changed its state RUNNING => IDLE when W2 tries to wake it.
This is a race in the userspace, and two options are available:
- the userspace spins waiting for W1 to become IDLE (note that the
userspace spins, not the kernel)
- the userspace "queues the wakeup" and returns; the sleeper (W1) sees
wakeup_queued and does not go to sleep; this is the solution I
have/use. See the previous version here:
https://lore.kernel.org/patchwork/patch/1433971/, search for
UMCG_TF_WAKEUP_QUEUED.
In the current version 0.4 WAKEUP_QUEUED is a purely userspace flag,
so it is not documented in the _kernel API_ doc. I'll post the
userspace parts (libumcg, selftests) a bit later. In short, wait/wake
races do not result in spinning, and sometimes even elide syscalls by
using the WAKEUP_QUEUED userspace state flag.
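For illustration, the userspace side can look roughly like this (a
sketch only: the constants and helper names below are made up for the
example and are not part of the kernel ABI; the actual logic lives in
libumcg, to be posted separately):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values: bits 0-7 hold the task state, bits 24-31 are
 * reserved for userspace use per the documented state layout. */
#define XTASK_RUNNING		1u
#define XTASK_IDLE		2u
#define XTASK_STATE_MASK	0xffu
#define XTF_WAKEUP_QUEUED	(1u << 24)

/* Waker: mark the target RUNNING if it is IDLE, otherwise queue the wakeup. */
static bool wake_or_queue(_Atomic uint32_t *state)
{
	uint32_t old = atomic_load(state);

	while (true) {
		uint32_t new;

		if ((old & XTASK_STATE_MASK) == XTASK_IDLE)
			new = (old & ~XTASK_STATE_MASK) | XTASK_RUNNING;
		else
			new = old | XTF_WAKEUP_QUEUED;

		if (atomic_compare_exchange_weak(state, &old, new))
			/*
			 * True means we actually made the target RUNNING; the
			 * caller should then do sys_umcg_wait(WAKE_ONLY).
			 */
			return (new & XTASK_STATE_MASK) == XTASK_RUNNING;
	}
}

/* Sleeper: go IDLE, unless a wakeup was already queued. */
static bool prepare_wait(_Atomic uint32_t *state)
{
	uint32_t old = atomic_load(state);

	while (true) {
		if (old & XTF_WAKEUP_QUEUED) {
			/* Consume the queued wakeup; elide sys_umcg_wait(). */
			if (atomic_compare_exchange_weak(state, &old,
						old & ~XTF_WAKEUP_QUEUED))
				return false;
			continue;
		}
		if (atomic_compare_exchange_weak(state, &old,
				(old & ~XTASK_STATE_MASK) | XTASK_IDLE))
			return true;	/* now call sys_umcg_wait() */
	}
}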
>
> If W1 calls umcg_wait first, W2 context-switches to W1 and W2's state
> changes to IDLE. My understanding is that wake detection/block does not
> apply to this case.
>
> If W2 calls umcg_wait first, what happens? I can imagine two different
> behaviour in this case:
>
> 1. W2 waits for W1 to call umcg_wait, by spinning or blocking, and then
> execution proceed like the first ordering.
>
> 2. W2 sets W1's state to RUNNING. When W1 eventually calls umcg_wait, it
> simply notices the state change and returns without context-switching.
> In this case, W1 is not migrated to W2's CPU.
>
> Behaviour 1 makes me uncomfortable since it means umcg_wait must wait for
> cooperation that potentially never comes.
>
> But in Behaviour 2, the state of W2 after both calls to umcg_wait is not
> clear to me, either. I could imagine that W2 is set to IDLE, but since W1
> is not migrated, W2 could also simply be left RUNNING.
>
> Which behaviour is correct and in what state does W2 end up?
W2 will always end up RUNNING, as everything here is about W1. W2 will
never sleep nor spin. Just a couple of atomic ops and maybe a syscall
that does the same.
>
> Thierry
>
I have attached an atomic stack implementation I wrote. I believe it would
be applicable here. It is very similar except the kernel side no longer
needs a retry loop; the looping is moved to the user space after the pop.
Using it instead of the code you have in enqueue_idle_worker means the
timeout is no longer needed.
> - ``uint64_t idle_server_tid_ptr``: points to a pointer variable in the
> userspace that points to an idle server, i.e. a server in IDLE state
> waiting in sys_umcg_wait(); read-only; workers must have this field set;
> not used in servers.
>
> When a worker's blocking operation in the kernel completes, the kernel
> changes the worker's state from ``BLOCKED`` to ``IDLE``, adds the worker
> to the list of idle workers, and checks whether ``*idle_server_tid_ptr``
> is not zero. If not, the kernel tries to cmpxchg() it with zero; if
> cmpxchg() succeeds, the kernel will then wake the server.
> See `State transitions`_ below for more details.
In this case, I believe cmpxchg is not necessary and xchg suffices.
On Wed, Aug 4, 2021 at 3:05 PM Thierry Delisle <[email protected]> wrote:
>
> I have attached an atomic stack implementation I wrote. I believe it would
> be applicable here. It is very similar except the kernel side no longer
> needs a retry loop, the looping is moved to the user-space after the pop.
> Using it instead of the code you have in enqueue_idle_worker means the
> timeout is no longer needed.
Thanks, Thierry! Your implementation seems reasonable - I'll test it
and will use it in the future versions of the patchset if everything
is OK.
>
> > - ``uint64_t idle_server_tid_ptr``: points to a pointer variable in the
> > userspace that points to an idle server, i.e. a server in IDLE state
> > waiting in sys_umcg_wait(); read-only; workers must have this field set;
> > not used in servers.
> >
> > When a worker's blocking operation in the kernel completes, the kernel
> > changes the worker's state from ``BLOCKED`` to ``IDLE``, adds the worker
> > to the list of idle workers, and checks whether ``*idle_server_tid_ptr``
> > is not zero. If not, the kernel tries to cmpxchg() it with zero; if
> > cmpxchg() succeeds, the kernel will then wake the server.
> > See `State transitions`_ below for more details.
>
> In this case, I believe cmpxchg is not necessary and xchg suffices.
Right! I'll need to roll another atomic_xchg helper for this and the
atomic stack above...
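For example, something along these lines could do as a stop-gap (a
sketch only, emulating xchg on top of the existing cmpxchg_user_32()
helper; a proper helper would use a dedicated atomic xchg instruction
in arch code instead of the retry loop):

/* Sketch only: emulate a 32-bit user-memory xchg via cmpxchg_user_32(). */
static inline int xchg_user_32(u32 __user *uaddr, u32 *val)
{
	u32 old;
	int ret;

	if (get_user(old, uaddr))
		return -EFAULT;

	do {
		/* cmpxchg_user_32() updates 'old' on -EAGAIN. */
		ret = cmpxchg_user_32(uaddr, &old, *val);
	} while (ret == -EAGAIN);

	if (!ret)
		*val = old;	/* hand back the previous value, xchg-style */

	return ret;
}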
> All _umcg_ state changes here happen in the userspace before
> sys_umcg_wait() is called. So:
>
> W1: cmpxchg W1:RUNNING => W1:IDLE
> - if OK, call sys_umcg_wait()
> - if failed, do something else (notes below)
>
> W2: cmpxchg W1:IDLE => W1:RUNNING
> - if OK, lock itself, set W2:next_tid to W1, call sys_umcg_wait()
> (will not block nor spin), restore next_tid and state/unlock upon
> syscall return
> - if failed, do something else
I am talking about the case where both cmpxchg() succeed and W2 sets
*both* UMCG_WAIT_WAKE_ONLY and UMCG_WAIT_WF_CURRENT_CPU. More
specifically, if things are ordered like so: (ideally use monospace font)
- w1 - - w2 -
w1:RUNNING => w1:IDLE|L |
S:IDLE => S:RUNNING |
sys_umcg_wait(): |
set ~UMCG_TF_LOCKED |
| w1:IDLE => w1:RUNNING|L
| w2:RUNNING => w2:IDLE|L
| w2:next_tid := W1.tid
| w1:RUNNING|L => w1:RUNNING
| sys_umcg_wait():
| set ~UMCG_TF_LOCKED
| do_context_switch()
idle_loop() |
What does ttwu() do with w1 if it has not reached idle_loop yet?
I am not familiar with the details of kernel context-switching, but in
user-level threading, more specifically Cforall, this would be a problem.
Between the call to do_context_switch() and idle_loop(), there is a
window where 2 CPUs run the same thread at the same time. One thread is
performing the front side of the context switch and the other thread
wakes up on the back side of the context switch. This behaviour invariably
corrupts the program stack of that thread. Again, I do not know if that
applies here. I am not exactly sure how the program stack is handled when
inside a system call.
On Fri, Aug 6, 2021 at 9:52 AM Thierry Delisle <[email protected]> wrote:
>
> > All _umcg_ state changes here happen in the userspace before
> > sys_umcg_wait() is called. So:
> >
> > W1: cmpxchg W1:RUNNING => W1:IDLE
> > - if OK, call sys_umcg_wait()
> > - if failed, do something else (notes below)
> >
> > W2: cmpxchg W1:IDLE => W1:RUNNING
> > - if OK, lock itself, set W2:next_tid to W1, call sys_umcg_wait()
> > (will not block nor spin), restore next_tid and state/unlock upon
> > syscall return
> > - if failed, do something else
>
> I am talking about the case where both cmpxchg() succeed and W2 sets
> *both* UMCG_WAIT_WAKE_ONLY and UMCG_WAIT_WF_CURRENT_CPU. More
> specifically, if things are ordered like so: (ideally use monospace font)
>
> - w1 - - w2 -
>
> w1:RUNNING => w1:IDLE|L |
> S:IDLE => S:RUNNING |
> sys_umcg_wait(): |
> set ~UMCG_TF_LOCKED |
> | w1:IDLE => w1:RUNNING|L
> | w2:RUNNING => w2:IDLE|L
> | w2:next_tid := W1.tid
> | w1:RUNNING|L => w1:RUNNING
> | sys_umcg_wait():
> | set ~UMCG_TF_LOCKED
> | do_context_switch()
> idle_loop() |
>
> What does ttwu() do with w1 if it has not reached idle_loop yet?
If both cmpxchg() succeeded, but W1 was never put to sleep, ttwu()
will do nothing and W1 will continue running on its initial CPU, while
W2 will continue running on its own CPU. WF_CURRENT_CPU is an advisory
flag, and in this situation it will not do anything.
>
> I am not familiar with the details of kernel context-switching, but in
> user-level threading, more specifically Cforall, this would be a problem.
> Between the call to do_context_switch() and and idle_loop(), there is a
> window where 2 CPUs run the same thread at the same time. One thread is
> performing the front side of the context switch and the other threads
> wakes up on the backside of the context switch. This behaviour invariably
> corrupts the program stack of that thread. Again, I do not know if that
> applies here. I am not exactly sure how the program stack is handled when
> inside a system call.
This is a wake, not a context switch, right? I'm not sure why you are
concerned with context switching here. And even if it were a context
switch, the kernel manages thread stacks properly, there's nothing to
worry about.
Am I missing something?
> This is a wake, not a context switch, right?
I followed the "worker to worker context switch" procedure in the
documentation.
> I'm not sure why you are concerned with context switching here. And even
> if it were a context switch, the kernel manages thread stacks properly,
> there's nothing to worry about.
The reason I'm interested in this particular operation is because the
outcome is decided by an invisible race (between W1 and W2) in the
kernel. W2 might context-switch to W1 and it might not. Note I don't mean
race in the problematic sense, just that there are two possible outcomes
that are decided by relative speed. I'm wondering how many outcomes the
users needs to program for and if they may have to back-track anything.
For example, if W2 wants to do a "yield to", it must enqueue itself in the
user scheduler before the system call. But if the system call doesn't
context-switch and W2 keeps running, it may need to undo the enqueue.
I agree the comment about the stack was a tangent and I expected the
kernel to handle it. But, I believe, how the kernel handles this case
affects the number of outcomes for this scenario.
> If both cmpxchg() succeeded, but W1 was never put to sleep, ttwu()
> will do nothing and W1 will continue running on its initial CPU, while
> W2 will continue running on its own CPU. WF_CURRENT_CPU is an advisory
> flag, and in this situation it will not do anything.
This does not sound right to me. If ttwu does nothing, W1 and W2 keep
running. Who sets W2's state back to RUNNING?
Is W2 responsible for doing that? It's not "the party initiating
the state transition" in this case.
Since there is no way for W2 to tell if it did context-switch to W1, does
that mean that W2 should always cmpxchg() its state to RUNNING after a
sys_umcg_wait()?