Date: Mon, 18 Oct 2021 23:23:57 +0800
From: Tao Zhou
To: Peter Oskolkov
Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andrew Morton,
    Dave Hansen, Andy Lutomirski, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
    Paul Turner, Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
    Thierry Delisle, Tao Zhou
Subject: Re: [PATCH v0.7 3/5] sched/umcg: implement UMCG syscalls
In-Reply-To: <20211012232522.714898-4-posk@google.com>
References: <20211012232522.714898-1-posk@google.com>
 <20211012232522.714898-4-posk@google.com>

On Tue, Oct 12, 2021 at 04:25:20PM -0700, Peter Oskolkov wrote:
> Define struct umcg_task and two syscalls: sys_umcg_ctl sys_umcg_wait.
>
> User Managed Concurrency Groups is an M:N threading toolkit that allows
> constructing user space schedulers designed to efficiently manage
> heterogeneous in-process workloads while maintaining high CPU
> utilization (95%+).
>
> In addition, M:N threading and cooperative user space scheduling
> enables synchronous coding style and better cache locality when
> compared to asynchronous callback/continuation style of programming.
>
> UMCG kernel API is build around the following ideas:
>
> * UMCG server: a task/thread representing "kernel threads", or (v)CPUs;
> * UMCG worker: a task/thread representing "application threads", to be
>   scheduled over servers;
> * UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
>   server or a worker) can be in;
> * UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
>   can be ORed with the task state to communicate additional information
>   to the kernel;
> * struct umcg_task: a per-task userspace set of data fields, usually
>   residing in the TLS, that fully reflects the current task's UMCG state
>   and controls the way the kernel manages the task;
> * sys_umcg_ctl(): a syscall used to register the current task/thread as
>   a server or a worker, or to unregister a UMCG task;
> * sys_umcg_wait(): a syscall used to put the current task to sleep
>   and/or wake another task, pontentially context-switching between the
>   two tasks on-CPU synchronously.
>
> In short, servers can be thought of as CPUs over which application
> threads (workers) are scheduled; at any one time a worker is either:
> - RUNNING: has a server and is schedulable by the kernel;
> - BLOCKED: blocked in the kernel (e.g. on I/O, or a futex);
> - IDLE: is not blocked, but cannot be scheduled by the kernel to
>   run because it has no server assigned to it (e.g. because all
>   available servers are busy "running" other workers).
>
> Usually the number of servers in a process is equal to the number of
> CPUs available to the kernel if the process is supposed to consume
> the whole machine, or less than the number of CPUs available if the
> process is sharing the machine with other workloads. The number of
> workers in a process can grow very large: tens of thousands is normal;
> hundreds of thousands and more (millions) is something that would
> be desirable to achieve in the future, as lightweight userspace
> threads in Java and Go easily scale to millions, and UMCG workers
> are (intended to be) conceptually similar to those.
>
> Detailed use cases and API behavior are provided in
> Documentation/userspace-api/umcg.[txt|rst] (see sibling patches).
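
Just to check that I read the registration API right, below is a minimal,
untested userspace sketch of how a thread registers as a server or a worker.
Error handling and the idle_workers_ptr/idle_server_tid_ptr list setup that
a worker needs before registering are omitted; the syscall numbers are the
ones this patch adds to the tables, and <linux/umcg.h> is the uapi header
added here:

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/umcg.h>         /* struct umcg_task and UMCG_* from this patch */

#ifndef __NR_umcg_ctl
#define __NR_umcg_ctl   449     /* from the syscall table hunks below */
#define __NR_umcg_wait  450
#endif

/* One struct umcg_task per UMCG thread, usually in TLS. */
static __thread struct umcg_task umcg_self;

/* Register the current thread as a UMCG server or worker. */
static long umcg_register(int is_worker)
{
        uint32_t flags = UMCG_CTL_REGISTER | (is_worker ? UMCG_CTL_WORKER : 0);

        memset(&umcg_self, 0, sizeof(umcg_self));

        /* sys_umcg_ctl() requires state == RUNNING and next_tid == 0. */
        umcg_self.state_ts = UMCG_TASK_RUNNING;

        /*
         * A worker must also have idle_workers_ptr (and usually
         * idle_server_tid_ptr) pointing at valid userspace memory
         * before this call; that setup is omitted here.
         */
        return syscall(__NR_umcg_ctl, flags, &umcg_self);
}

static long umcg_unregister(void)
{
        /* Unregistering passes NULL for @self. */
        return syscall(__NR_umcg_ctl, UMCG_CTL_UNREGISTER, NULL);
}
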
>
> Some high-level implementation notes:
>
> UMCG tasks (workers and servers) are "tagged" with struct umcg_task
> residing in userspace (usually in TLS) to facilitate kernel/userspace
> communication. This makes the kernel-side code much simpler (see e.g.
> the implementation of sys_umcg_wait), but also requires some careful
> uaccess handling and page pinning (see below).
>
> The main UMCG server/worker interaction looks like:
>
> a. worker W1 is RUNNING, with a server S attached to it sleeping
>    in IDLE state;
> b. worker W1 blocks in the kernel, e.g. on I/O;
> c. the kernel marks W1 as BLOCKED, the attached server S
>    as RUNNING, and wakes S (the "block detection" event);
> d. the server now picks another IDLE worker W2 to run: marks
>    W2 as RUNNING, itself as IDLE, ands calls sys_umcg_wait();
> e. when the blocking operation of W1 completes, the worker
>    is marked by the kernel as IDLE and added to idle workers list
>    (see struct umcg_task) for the userspace to pick up and
>    later run (the "wake detection" event).
>
> While there are additional operations such as worker-to-worker
> context switch, preemption, workers "yielding", etc., the "workflow"
> above is the main worker/server interaction that drives the
> implementation.
>
> Specifically:
>
> - most operations are conceptually context switches:
>     - scheduling a worker: a running server goes to sleep and "runs"
>       a worker in its place;
>     - block detection: worker is descheduled, and its server is woken;
>     - wake detection: woken worker, running in the kernel, is
>       descheduled, and if there is an idle server, it is woken to
>       process the wake detection event;
> - to faciliate low scheduling latencies and cache locality, most
>   server/worker interactions described above are performed synchronously
>   "on CPU" via WF_CURRENT_CPU flag passed to ttwu; while at the moment
>   the context switches are simulated by putting the switch-out task to
>   sleep and waking the switch-into task on the same cpu, it is very much
>   the long-term goal of this project to make the context switch much
>   lighter, by tweaking runtime accounting and, maybe, even bypassing
>   __schedule();
> - worker blocking is detected in a hook to sched_submit_work; as
>   mentioned above, the server is to be woken on the same CPU,
>   synchronously; this code may not pagefault, so to access worker's and
>   server's userspace memory (struct umcg_task), memory pages containing
>   the worker's and the server's structs umcg_task are pinned when the
>   worker is exiting to the userspace, and unpinned when the worker is
>   descheduled;
> - worker wakeup is detected in a hook to sched_update_worker, and
>   processed in the exit to usermode loop (via TIF_NOTIFY_RESUME);
>   workers CAN pagefault on the wakeup path;
> - worker preemption is implemented by the userspace tagging the worker
>   with UMCG_TF_PREEMPTED state flag and sending a NOOP signal to it;
>   on the exit to usermode the worker is intercepted and its server is
>   woken (see Documentation/userspace-api/umcg.[txt|rst] for more
>   details);
> - each state change is tagged with a unique timestamp (of MONOTONIC
>   variety), so that
>     - scheduling instrumentation is naturally available;
>     - racing state changes are easily detected and ABA issues are
>       avoided;
>   see umcg_update_state() in umcg.c for implementation details, and
>   Documentation/userspace-api/umcg.[txt|rst] for a higher-level
>   description.
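
If I follow the workflow above, step "d" on the server side looks roughly
like the sketch below (reusing the includes and __NR_* defines from the
earlier sketch in this mail). pick_idle_worker() and umcg_set_state() are
application-side placeholders; a real umcg_set_state() would have to update
state_ts atomically together with the timestamp bits, like
umcg_update_state() does on the kernel side:

/* Application-specific placeholders, not part of the kernel API. */
extern struct umcg_task *pick_idle_worker(uint32_t *worker_tid);
extern void umcg_set_state(struct umcg_task *ut, uint64_t state);

/* Step "d" above: the server schedules an IDLE worker in its place. */
static void server_run_one(struct umcg_task *server, uint32_t server_tid)
{
        uint32_t worker_tid;
        struct umcg_task *worker = pick_idle_worker(&worker_tid);

        /* The worker will run with this server attached. */
        worker->next_tid = server_tid;
        umcg_set_state(worker, UMCG_TASK_RUNNING);      /* IDLE -> RUNNING */

        /* The server goes IDLE and asks to context-switch into the worker. */
        server->next_tid = worker_tid;
        umcg_set_state(server, UMCG_TASK_IDLE);         /* RUNNING -> IDLE */

        /*
         * Sleeps until a block/wake event marks this server RUNNING again
         * (steps "c"/"e" above); no timeout.
         */
        syscall(__NR_umcg_wait, 0, 0ULL);
}
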
> > The previous version of the patchset can be found at > https://lore.kernel.org/all/20210917180323.278250-1-posk@google.com/ > containing some additional context and links to earlier discussions. > > More details are available in Documentation/userspace-api/umcg.[txt|rst] > in sibling patches, and in doc-comments in the code. > > Signed-off-by: Peter Oskolkov > --- > arch/x86/entry/syscalls/syscall_64.tbl | 2 + > fs/exec.c | 1 + > include/linux/sched.h | 71 ++ > include/linux/syscalls.h | 3 + > include/uapi/asm-generic/unistd.h | 6 +- > include/uapi/linux/umcg.h | 137 ++++ > init/Kconfig | 10 + > kernel/entry/common.c | 4 +- > kernel/exit.c | 5 + > kernel/sched/Makefile | 1 + > kernel/sched/core.c | 9 +- > kernel/sched/umcg.c | 926 +++++++++++++++++++++++++ > kernel/sys_ni.c | 4 + > 13 files changed, 1175 insertions(+), 4 deletions(-) > create mode 100644 include/uapi/linux/umcg.h > create mode 100644 kernel/sched/umcg.c > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index 18b5500ea8bf..cb71f383060f 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -370,6 +370,8 @@ > 446 common landlock_restrict_self sys_landlock_restrict_self > 447 common memfd_secret sys_memfd_secret > 448 common process_mrelease sys_process_mrelease > +449 common umcg_ctl sys_umcg_ctl > +450 common umcg_wait sys_umcg_wait > > # > # Due to a historical design error, certain syscalls are numbered differently > diff --git a/fs/exec.c b/fs/exec.c > index a098c133d8d7..dfa24bb99a97 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1840,6 +1840,7 @@ static int bprm_execve(struct linux_binprm *bprm, > current->fs->in_exec = 0; > current->in_execve = 0; > rseq_execve(current); > + umcg_execve(current); > acct_update_integrals(current); > task_numa_free(current, false); > return retval; > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 343603f77f8b..c7e812ceec3c 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -67,6 +67,7 @@ struct sighand_struct; > struct signal_struct; > struct task_delay_info; > struct task_group; > +struct umcg_task; > > /* > * Task state bitmask. NOTE! These bits are also > @@ -1296,6 +1297,12 @@ struct task_struct { > unsigned long rseq_event_mask; > #endif > > +#ifdef CONFIG_UMCG > + struct umcg_task __user *umcg_task; > + struct page *pinned_umcg_worker_page; /* self */ > + struct page *pinned_umcg_server_page; > +#endif > + > struct tlbflush_unmap_batch tlb_ubc; > > union { > @@ -1688,6 +1695,13 @@ extern struct pid *cad_pid; > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ > #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ > + > +#ifdef CONFIG_UMCG > +#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */ > +#else > +#define PF_UMCG_WORKER 0x00000000 > +#endif > + > #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */ > #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ > #define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */ > @@ -2275,6 +2289,63 @@ static inline void rseq_execve(struct task_struct *t) > > #endif > > +#ifdef CONFIG_UMCG > + > +void umcg_handle_resuming_worker(void); > +void umcg_handle_exiting_worker(void); > +void umcg_clear_child(struct task_struct *tsk); > + > +/* Called by bprm_execve() in fs/exec.c. 
*/ > +static inline void umcg_execve(struct task_struct *tsk) > +{ > + if (tsk->umcg_task) > + umcg_clear_child(tsk); > +} > + > +/* Called by exit_to_user_mode_loop() in kernel/entry/common.c.*/ > +static inline void umcg_handle_notify_resume(void) > +{ > + if (current->flags & PF_UMCG_WORKER) > + umcg_handle_resuming_worker(); > +} > + > +/* Called by do_exit() in kernel/exit.c. */ > +static inline void umcg_handle_exit(void) > +{ > + if (current->flags & PF_UMCG_WORKER) > + umcg_handle_exiting_worker(); > +} > + > +/* > + * umcg_wq_worker_[sleeping|running] are called in core.c by > + * sched_submit_work() and sched_update_worker(). > + */ > +void umcg_wq_worker_sleeping(struct task_struct *tsk); > +void umcg_wq_worker_running(struct task_struct *tsk); > + > +#else /* CONFIG_UMCG */ > + > +static inline void umcg_clear_child(struct task_struct *tsk) > +{ > +} > +static inline void umcg_execve(struct task_struct *tsk) > +{ > +} > +static inline void umcg_handle_notify_resume(void) > +{ > +} > +static inline void umcg_handle_exit(void) > +{ > +} > +static inline void umcg_wq_worker_sleeping(struct task_struct *tsk) > +{ > +} > +static inline void umcg_wq_worker_running(struct task_struct *tsk) > +{ > +} > + > +#endif > + > #ifdef CONFIG_DEBUG_RSEQ > > void rseq_syscall(struct pt_regs *regs); > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index 252243c7783d..97a05879da41 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -71,6 +71,7 @@ struct open_how; > struct mount_attr; > struct landlock_ruleset_attr; > enum landlock_rule_type; > +struct umcg_task; > > #include > #include > @@ -1052,6 +1053,8 @@ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type ru > const void __user *rule_attr, __u32 flags); > asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags); > asmlinkage long sys_memfd_secret(unsigned int flags); > +asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self); > +asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout); > > /* > * Architecture-specific system calls > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > index 1c5fb86d455a..3e3d50de5137 100644 > --- a/include/uapi/asm-generic/unistd.h > +++ b/include/uapi/asm-generic/unistd.h > @@ -879,9 +879,13 @@ __SYSCALL(__NR_memfd_secret, sys_memfd_secret) > #endif > #define __NR_process_mrelease 448 > __SYSCALL(__NR_process_mrelease, sys_process_mrelease) > +#define __NR_umcg_ctl 449 > +__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl) > +#define __NR_umcg_wait 450 > +__SYSCALL(__NR_umcg_wait, sys_umcg_wait) > > #undef __NR_syscalls > -#define __NR_syscalls 449 > +#define __NR_syscalls 451 > > /* > * 32 bit systems traditionally used different > diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h > new file mode 100644 > index 000000000000..ce4c7980b837 > --- /dev/null > +++ b/include/uapi/linux/umcg.h > @@ -0,0 +1,137 @@ > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ > +#ifndef _UAPI_LINUX_UMCG_H > +#define _UAPI_LINUX_UMCG_H > + > +#include > +#include > + > +/* > + * UMCG: User Managed Concurrency Groups. > + * > + * Syscalls (see kernel/sched/umcg.c): > + * sys_umcg_ctl() - register/unregister UMCG tasks; > + * sys_umcg_wait() - wait/wake/context-switch. > + * > + * struct umcg_task (below): controls the state of UMCG tasks. > + * > + * See Documentation/userspace-api/umcg.[txt|rst] for detals. 
> + */ > + > +/* > + * UMCG task states, the first 6 bits of struct umcg_task.state_ts. > + * The states represent the user space point of view. > + */ > +#define UMCG_TASK_NONE 0ULL > +#define UMCG_TASK_RUNNING 1ULL > +#define UMCG_TASK_IDLE 2ULL > +#define UMCG_TASK_BLOCKED 3ULL > + > +/* UMCG task state flags, bits 7-8 */ > + > +/* > + * UMCG_TF_LOCKED: locked by the userspace in preparation to calling umcg_wait. > + */ > +#define UMCG_TF_LOCKED (1ULL << 6) > + > +/* > + * UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted. > + */ > +#define UMCG_TF_PREEMPTED (1ULL << 7) > + > +/* The first six bits: RUNNING, IDLE, or BLOCKED. */ > +#define UMCG_TASK_STATE_MASK 0x3fULL > + > +/* The full kernel state mask: the first 13 bits. */ > +#define UMCG_TASK_STATE_MASK_FULL 0x1fffULL > + > +/* > + * The number of bits reserved for UMCG state timestamp in > + * struct umcg_task.state_ts. > + */ > +#define UMCG_STATE_TIMESTAMP_BITS 46 > + > +/* The number of bits truncated from UMCG state timestamp. */ > +#define UMCG_STATE_TIMESTAMP_GRANULARITY 4 > + > +/** > + * struct umcg_task - controls the state of UMCG tasks. > + * > + * The struct is aligned at 64 bytes to ensure that it fits into > + * a single cache line. > + */ > +struct umcg_task { > + /** > + * @state_ts: the current state of the UMCG task described by > + * this struct, with a unique timestamp indicating > + * when the last state change happened. > + * > + * Readable/writable by both the kernel and the userspace. > + * > + * UMCG task state: > + * bits 0 - 5: task state; > + * bits 6 - 7: state flags; > + * bits 8 - 12: reserved; must be zeroes; > + * bits 13 - 17: for userspace use; > + * bits 18 - 63: timestamp (see below). > + * > + * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution. > + * See Documentation/userspace-api/umcg.[txt|rst] for detals. ^^^^^^^ details. > + */ > + uint64_t state_ts; /* r/w */ > + > + /** > + * @next_tid: the TID of the UMCG task that should be context-switched > + * into in sys_umcg_wait(). Can be zero. > + * > + * Running UMCG workers must have next_tid set to point to IDLE > + * UMCG servers. > + * > + * Read-only for the kernel, read/write for the userspace. > + */ > + uint32_t next_tid; /* r */ > + > + uint32_t flags; /* Reserved; must be zero. */ > + > + /** > + * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL. > + * > + * Readable/writable by both the kernel and the userspace: the > + * kernel adds items to the list, the userspace removes them. > + */ > + uint64_t idle_workers_ptr; /* r/w */ > + > + /** > + * @idle_server_tid_ptr: a pointer pointing to a single idle server. > + * Readonly. > + */ > + uint64_t idle_server_tid_ptr; /* r */ > +} __attribute__((packed, aligned(8 * sizeof(__u64)))); > + > +/** > + * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl > + * @UMCG_CTL_REGISTER: register the current task as a UMCG task > + * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task > + * @UMCG_CTL_WORKER: register the current task as a UMCG worker > + */ > +enum umcg_ctl_flag { > + UMCG_CTL_REGISTER = 0x00001, > + UMCG_CTL_UNREGISTER = 0x00002, > + UMCG_CTL_WORKER = 0x10000, > +}; > + > +/** > + * enum umcg_wait_flag - flags to pass to sys_umcg_wait > + * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep; > + * @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU > + * (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY > + * must be set. 
> + */ > +enum umcg_wait_flag { > + UMCG_WAIT_WAKE_ONLY = 1, > + UMCG_WAIT_WF_CURRENT_CPU = 2, > +}; > + > +/* See Documentation/userspace-api/umcg.[txt|rst].*/ > +#define UMCG_IDLE_NODE_PENDING (1ULL) > + > +#endif /* _UAPI_LINUX_UMCG_H */ > diff --git a/init/Kconfig b/init/Kconfig > index 11f8a845f259..b52a79cfb130 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -1688,6 +1688,16 @@ config MEMBARRIER > > If unsure, say Y. > > +config UMCG > + bool "Enable User Managed Concurrency Groups API" > + depends on X86_64 > + default n > + help > + Enable User Managed Concurrency Groups API, which form the basis > + for an in-process M:N userspace scheduling framework. > + At the moment this is an experimental/RFC feature that is not > + guaranteed to be backward-compatible. > + > config KALLSYMS > bool "Load all symbols for debugging/ksymoops" if EXPERT > default y > diff --git a/kernel/entry/common.c b/kernel/entry/common.c > index d5a61d565ad5..62453772a0c7 100644 > --- a/kernel/entry/common.c > +++ b/kernel/entry/common.c > @@ -171,8 +171,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, > if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) > handle_signal_work(regs, ti_work); > > - if (ti_work & _TIF_NOTIFY_RESUME) > + if (ti_work & _TIF_NOTIFY_RESUME) { > + umcg_handle_notify_resume(); > tracehook_notify_resume(regs); > + } > > /* Architecture specific TIF work */ > arch_exit_to_user_mode_work(regs, ti_work); > diff --git a/kernel/exit.c b/kernel/exit.c > index 63851320ae73..c55f9df430c8 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -745,6 +745,10 @@ void __noreturn do_exit(long code) > if (unlikely(!tsk->pid)) > panic("Attempted to kill the idle task!"); > > + /* Turn off UMCG sched hooks. */ > + if (unlikely(tsk->flags & PF_UMCG_WORKER)) > + tsk->flags &= ~PF_UMCG_WORKER; > + > /* > * If do_exit is called because this processes oopsed, it's possible > * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before > @@ -781,6 +785,7 @@ void __noreturn do_exit(long code) > > io_uring_files_cancel(); > exit_signals(tsk); /* sets PF_EXITING */ > + umcg_handle_exit(); > > /* sync mm's RSS info before statistics gathering */ > if (tsk->mm) > diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile > index 978fcfca5871..e4e481eee1b7 100644 > --- a/kernel/sched/Makefile > +++ b/kernel/sched/Makefile > @@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o > obj-$(CONFIG_CPU_ISOLATION) += isolation.o > obj-$(CONFIG_PSI) += psi.o > obj-$(CONFIG_SCHED_CORE) += core_sched.o > +obj-$(CONFIG_UMCG) += umcg.o > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index d6da1efb5ce6..9ff63e32544a 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4236,6 +4236,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) > p->wake_entry.u_flags = CSD_TYPE_TTWU; > p->migration_pending = NULL; > #endif > + umcg_clear_child(p); > } > > DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); > @@ -6265,9 +6266,11 @@ static inline void sched_submit_work(struct task_struct *tsk) > * If a worker goes to sleep, notify and ask workqueue whether it > * wants to wake up a task to maintain concurrency. 
> */ > - if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) { > + if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) { > if (task_flags & PF_WQ_WORKER) > wq_worker_sleeping(tsk); > + else if (task_flags & PF_UMCG_WORKER) > + umcg_wq_worker_sleeping(tsk); > else > io_wq_worker_sleeping(tsk); > } > @@ -6285,9 +6288,11 @@ static inline void sched_submit_work(struct task_struct *tsk) > > static void sched_update_worker(struct task_struct *tsk) > { > - if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) { > + if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) { > if (tsk->flags & PF_WQ_WORKER) > wq_worker_running(tsk); > + else if (tsk->flags & PF_UMCG_WORKER) > + umcg_wq_worker_running(tsk); > else > io_wq_worker_running(tsk); > } > diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c > new file mode 100644 > index 000000000000..bc4eeb3f5dd7 > --- /dev/null > +++ b/kernel/sched/umcg.c > @@ -0,0 +1,926 @@ > +// SPDX-License-Identifier: GPL-2.0-only > + > +/* > + * User Managed Concurrency Groups (UMCG). > + * > + * See Documentation/userspace-api/umcg.[txt|rst] for detals. > + */ > + > +#include > +#include > +#include > +#include > + > +#include "sched.h" > + > +/** > + * get_user_nofault - get user value without sleeping. > + * > + * get_user() might sleep and therefore cannot be used in preempt-disabled > + * regions. > + */ > +#define get_user_nofault(out, uaddr) \ > +({ \ > + int ret = -EFAULT; \ > + \ > + if (access_ok((uaddr), sizeof(*(uaddr)))) { \ > + pagefault_disable(); \ > + \ > + if (!__get_user((out), (uaddr))) \ > + ret = 0; \ > + \ > + pagefault_enable(); \ > + } \ > + ret; \ > +}) > + > +/** > + * umcg_pin_pages: pin pages containing struct umcg_task of this worker > + * and its server. > + * > + * The pages are pinned when the worker exits to the userspace and unpinned > + * when the worker is in sched_submit_work(), i.e. when the worker is > + * about to be removed from its runqueue. Thus at most NR_CPUS UMCG pages > + * are pinned at any one time across the whole system. > + * > + * The pinning is needed so that going-to-sleep workers can access > + * their and their servers' userspace umcg_task structs without page faults, > + * as the code path can be executed in the context of a pagefault, with > + * mm lock held. > + */ > +static int umcg_pin_pages(u32 server_tid) > +{ > + struct umcg_task __user *worker_ut = current->umcg_task; > + struct umcg_task __user *server_ut = NULL; > + struct task_struct *tsk; > + > + rcu_read_lock(); > + tsk = find_task_by_vpid(server_tid); > + /* Server/worker interaction is allowed only within the same mm. 
*/ > + if (tsk && current->mm == tsk->mm) > + server_ut = READ_ONCE(tsk->umcg_task); > + rcu_read_unlock(); > + > + if (!server_ut) > + return -EINVAL; > + > + tsk = current; > + > + /* worker_ut is stable, don't need to repin */ > + if (!tsk->pinned_umcg_worker_page) > + if (1 != pin_user_pages_fast((unsigned long)worker_ut, 1, 0, > + &tsk->pinned_umcg_worker_page)) > + return -EFAULT; > + > + /* server_ut may change, need to repin */ > + if (tsk->pinned_umcg_server_page) { > + unpin_user_page(tsk->pinned_umcg_server_page); > + tsk->pinned_umcg_server_page = NULL; > + } > + > + if (1 != pin_user_pages_fast((unsigned long)server_ut, 1, 0, > + &tsk->pinned_umcg_server_page)) > + return -EFAULT; > + > + return 0; > +} > + > +static void umcg_unpin_pages(void) > +{ > + struct task_struct *tsk = current; > + > + if (tsk->pinned_umcg_worker_page) > + unpin_user_page(tsk->pinned_umcg_worker_page); > + if (tsk->pinned_umcg_server_page) > + unpin_user_page(tsk->pinned_umcg_server_page); > + > + tsk->pinned_umcg_worker_page = NULL; > + tsk->pinned_umcg_server_page = NULL; > +} > + > +static void umcg_clear_task(struct task_struct *tsk) > +{ > + /* > + * This is either called for the current task, or for a newly forked > + * task that is not yet running, so we don't need strict atomicity > + * below. > + */ > + if (tsk->umcg_task) { > + WRITE_ONCE(tsk->umcg_task, NULL); > + > + /* These can be simple writes - see the commment above. */ > + tsk->pinned_umcg_worker_page = NULL; > + tsk->pinned_umcg_server_page = NULL; > + tsk->flags &= ~PF_UMCG_WORKER; > + } > +} > + > +/* Called for a forked or execve-ed child. */ > +void umcg_clear_child(struct task_struct *tsk) > +{ > + umcg_clear_task(tsk); > +} > + > +/* Called both by normally (unregister) and abnormally exiting workers. */ > +void umcg_handle_exiting_worker(void) > +{ > + umcg_unpin_pages(); > + umcg_clear_task(current); > +} > + > +/** > + * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp. > + * @state_ts - points to the state_ts member of struct umcg_task to update; > + * @expected - the expected value of state_ts, including the timestamp; > + * @desired - the desired value of state_ts, state part only; > + * @may_fault - whether to use normal or _nofault cmpxchg. > + * > + * The function is basically cmpxchg(state_ts, expected, desired), with extra > + * code to set the timestamp in @desired. > + */ > +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired, > + bool may_fault) > +{ > + u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS); > + u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY; > + > + /* Cut higher order bits. */ > + next_ts &= ((1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1); > + > + if (next_ts == curr_ts) > + ++next_ts; > + > + /* Remove an old timestamp, if any. */ > + desired &= ((1ULL << (64 - UMCG_STATE_TIMESTAMP_BITS)) - 1); > + > + /* Set the new timestamp. */ > + desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS)); > + > + if (may_fault) > + return cmpxchg_user_64(state_ts, expected, desired); > + > + return cmpxchg_user_64_nofault(state_ts, expected, desired); > +} > + > +/** > + * sys_umcg_ctl: (un)register the current task as a UMCG task. > + * @flags: ORed values from enum umcg_ctl_flag; see below; > + * @self: a pointer to struct umcg_task that describes this > + * task and governs the behavior of sys_umcg_wait if > + * registering; must be NULL if unregistering. 
> + * > + * @flags & UMCG_CTL_REGISTER: register a UMCG task: > + * UMCG workers: > + * - @flags & UMCG_CTL_WORKER > + * UMCG servers: > + * - !(@flags & UMCG_CTL_WORKER) > + * > + * All tasks: > + * - self->state must be UMCG_TASK_RUNNING > + * - self->next_tid must be zero > + * > + * If the conditions above are met, sys_umcg_ctl() immediately returns > + * if the registered task is a server; a worker will be added to > + * idle_workers_ptr, and the worker put to sleep; an idle server > + * from idle_server_tid_ptr will be woken, if present. > + * > + * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task > + * is a UMCG worker, the userspace is responsible for waking its > + * server (before or after calling sys_umcg_ctl). > + * > + * Return: > + * 0 - success > + * -EFAULT - failed to read @self > + * -EINVAL - some other error occurred > + */ > +SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self) > +{ > + struct umcg_task ut; > + > + if (flags == UMCG_CTL_UNREGISTER) { > + if (self || !current->umcg_task) > + return -EINVAL; > + > + if (current->flags & PF_UMCG_WORKER) > + umcg_handle_exiting_worker(); > + else > + umcg_clear_task(current); > + > + return 0; > + } > + > + /* Register the current task as a UMCG task. */ > + if (!(flags & UMCG_CTL_REGISTER)) > + return -EINVAL; > + > + flags &= ~UMCG_CTL_REGISTER; > + if (flags && flags != UMCG_CTL_WORKER) > + return -EINVAL; > + > + if (current->umcg_task || !self) > + return -EINVAL; > + > + if (copy_from_user(&ut, self, sizeof(ut))) > + return -EFAULT; > + > + if (ut.next_tid) > + return -EINVAL; > + > + if ((ut.state_ts & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_RUNNING) > + return -EINVAL; > + > + WRITE_ONCE(current->umcg_task, self); > + > + if (flags == UMCG_CTL_WORKER) { > + current->flags |= PF_UMCG_WORKER; > + > + /* Trigger umcg_handle_resuming_worker() */ > + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); > + } > + > + return 0; > +} > + > +/** > + * handle_timedout_worker - make sure the worker is added to idle_workers > + * upon a "clean" timeout. > + */ > +static int handle_timedout_worker(struct umcg_task __user *self) > +{ > + u64 curr_state, next_state; > + int ret; > + > + if (get_user(curr_state, &self->state_ts)) > + return -EFAULT; > + > + if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE) { > + /* TODO: should we care here about TF_LOCKED or TF_PREEMPTED? */ > + > + next_state = curr_state & ~UMCG_TASK_STATE_MASK; > + next_state |= UMCG_TASK_BLOCKED; > + > + ret = umcg_update_state(&self->state_ts, &curr_state, next_state, true); > + if (ret) > + return ret; > + > + return -ETIMEDOUT; > + } > + > + return 0; /* Not really timed out. */ > +} > + > +/** > + * umcg_idle_loop - sleep until the current task becomes RUNNING or a timeout > + * @abs_timeout - absolute timeout in nanoseconds; zero => no timeout > + * > + * The function marks the current task as INTERRUPTIBLE and calls > + * freezable_schedule(). It returns when either the timeout expires or Last version use schedule() here. And there are comment there and I don't know freeable enough. > + * the UMCG state of the task becomes RUNNING. > + * > + * Note: because UMCG workers should not be running WITHOUT attached servers, > + * and because servers should not be running WITH attached workers, > + * the function returns only on fatal signal pending and ignores/flushes > + * all other signals. 
> + */ > +static int umcg_idle_loop(u64 abs_timeout) > +{ > + int ret; > + struct page *pinned_page = NULL; > + struct hrtimer_sleeper timeout; > + struct umcg_task __user *self = current->umcg_task; > + > + if (abs_timeout) { > + hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME, > + HRTIMER_MODE_ABS); > + > + hrtimer_set_expires_range_ns(&timeout.timer, (s64)abs_timeout, > + current->timer_slack_ns); > + } > + > + while (true) { > + u64 umcg_state; > + > + /* > + * We need to read from userspace _after_ the task is marked > + * TASK_INTERRUPTIBLE, to properly handle concurrent wakeups; > + * but faulting is not allowed; so we try a fast no-fault read, > + * and if it fails, pin the page temporarily. > + */ > +retry_once: > + set_current_state(TASK_INTERRUPTIBLE); > + > + /* Order set_current_state above with get_user_nofault below. */ > + smp_mb(); > + ret = -EFAULT; > + if (get_user_nofault(umcg_state, &self->state_ts)) { > + set_current_state(TASK_RUNNING); > + > + if (pinned_page) > + goto out; > + else if (1 != pin_user_pages_fast((unsigned long)self, > + 1, 0, &pinned_page)) > + goto out; > + > + goto retry_once; > + } > + > + if (pinned_page) { > + unpin_user_page(pinned_page); > + pinned_page = NULL; > + } > + > + ret = 0; > + if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) { > + set_current_state(TASK_RUNNING); > + goto out; > + } > + > + if (abs_timeout) > + hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS); > + > + if (!abs_timeout || timeout.task) { > + /* Clear PF_UMCG_WORKER to elide workqueue handlers. */ > + const bool worker = current->flags & PF_UMCG_WORKER; > + > + if (worker) > + current->flags &= ~PF_UMCG_WORKER; > + > + freezable_schedule(); > + > + if (worker) > + current->flags |= PF_UMCG_WORKER; > + } > + __set_current_state(TASK_RUNNING); > + > + /* > + * Check for timeout before checking the state, as workers > + * are not going to return from schedule() unless ^^^^^^^^^^ freezable_schedule() Because you changed to use freezable_schedule() in this verison and the comment in last version is lost. Just a little note here. > + * they are RUNNING. > + */ > + ret = -ETIMEDOUT; > + if (abs_timeout && !timeout.task) > + goto out; > + > + ret = -EFAULT; > + if (get_user(umcg_state, &self->state_ts)) > + goto out; > + > + ret = 0; > + if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) > + goto out; > + > + ret = -EINTR; > + if (fatal_signal_pending(current)) > + goto out; > + > + if (signal_pending(current)) > + flush_signals(current); > + } > + > +out: > + if (pinned_page) { > + unpin_user_page(pinned_page); > + pinned_page = NULL; > + } > + > + if (abs_timeout) { > + hrtimer_cancel(&timeout.timer); > + destroy_hrtimer_on_stack(&timeout.timer); > + } > + > + /* Workers must go through workqueue handlers upon wakeup. */ > + if (current->flags & PF_UMCG_WORKER) { > + if (ret == -ETIMEDOUT) > + ret = handle_timedout_worker(self); > + > + set_tsk_need_resched(current); > + } > + > + return ret; > +} > + > +/** > + * umcg_wakeup_allowed - check whether @current can wake @tsk. > + * > + * Currently a placeholder that allows wakeups within a single process > + * only (same mm). In the future the requirement will be relaxed (securely). > + */ > +static bool umcg_wakeup_allowed(struct task_struct *tsk) > +{ > + WARN_ON_ONCE(!rcu_read_lock_held()); > + > + if (tsk->mm && tsk->mm == current->mm && READ_ONCE(tsk->umcg_task)) > + return true; > + > + return false; > +} > + > +/* > + * Try to wake up. May be called with preempt_disable set. 
May be called > + * cross-process. > + * > + * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state > + * ordering logic. > + */ > +static int umcg_ttwu(u32 next_tid, int wake_flags) > +{ > + struct task_struct *next; > + > + rcu_read_lock(); > + next = find_task_by_vpid(next_tid); > + if (!next || !umcg_wakeup_allowed(next)) { > + rcu_read_unlock(); > + return -ESRCH; > + } > + > + /* The result of ttwu below is ignored. */ > + try_to_wake_up(next, TASK_NORMAL, wake_flags); > + rcu_read_unlock(); > + > + return 0; > +} > + > +/* > + * At the moment, umcg_do_context_switch simply wakes up @next with > + * WF_CURRENT_CPU and puts the current task to sleep. > + * > + * In the future an optimization will be added to adjust runtime accounting > + * so that from the kernel scheduling perspective the two tasks are > + * essentially treated as one. In addition, the context switch may be performed > + * right here on the fast path, instead of going through the wake/wait pair. > + */ > +static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout) > +{ > + int ret; > + > + ret = umcg_ttwu(next_tid, WF_CURRENT_CPU); > + if (ret) > + return ret; > + > + return umcg_idle_loop(abs_timeout); > +} > + > +/** > + * sys_umcg_wait: put the current task to sleep and/or wake another task. > + * @flags: zero or a value from enum umcg_wait_flag. > + * @abs_timeout: when to wake the task, in nanoseconds; zero for no timeout. > + * > + * @self->state_ts must be UMCG_TASK_IDLE (where @self is current->umcg_task) > + * if !(@flags & UMCG_WAIT_WAKE_ONLY). > + * > + * If @self->next_tid is not zero, it must point to an IDLE UMCG task. > + * The userspace must have changed its state from IDLE to RUNNING > + * before calling sys_umcg_wait() in the current task. This "next" > + * task will be woken (context-switched-to on the fast path) when the > + * current task is put to sleep. > + * > + * See Documentation/userspace-api/umcg.[txt|rst] for detals. > + * > + * Return: > + * 0 - OK; > + * -ETIMEDOUT - the timeout expired; > + * -EFAULT - failed accessing struct umcg_task __user of the current > + * task; > + * -ESRCH - the task to wake not found or not a UMCG task; > + * -EINVAL - another error happened (e.g. bad @flags, or the current > + * task is not a UMCG task, etc.) > + */ > +SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout) > +{ > + struct umcg_task __user *self = current->umcg_task; > + u32 next_tid; > + > + if (!self) > + return -EINVAL; > + > + if (get_user(next_tid, &self->next_tid)) > + return -EFAULT; > + > + if (flags & UMCG_WAIT_WAKE_ONLY) { > + if (!next_tid || abs_timeout) > + return -EINVAL; > + > + flags &= ~UMCG_WAIT_WAKE_ONLY; > + if (flags & ~UMCG_WAIT_WF_CURRENT_CPU) > + return -EINVAL; > + > + return umcg_ttwu(next_tid, flags & UMCG_WAIT_WF_CURRENT_CPU ? > + WF_CURRENT_CPU : 0); > + } > + > + /* Unlock the worker, if locked. */ > + if (current->flags & PF_UMCG_WORKER) { > + u64 umcg_state; > + > + if (get_user(umcg_state, &self->state_ts)) > + return -EFAULT; > + > + if ((umcg_state & UMCG_TF_LOCKED) && umcg_update_state( > + &self->state_ts, &umcg_state, > + umcg_state & ~UMCG_TF_LOCKED, true)) > + return -EFAULT; > + } > + > + if (next_tid) > + return umcg_do_context_switch(next_tid, abs_timeout); > + > + return umcg_idle_loop(abs_timeout); > +} > + > +/* > + * NOTE: all code below is called from workqueue submit/update, or > + * syscall exit to usermode loop, so all errors result in the > + * termination of the current task (via SIGKILL). 
> + */ > + > +/* > + * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu. > + */ > +static int umcg_wake_idle_server_nofault(u32 server_tid) > +{ > + struct umcg_task __user *ut_server = NULL; > + struct task_struct *tsk; > + int ret = -EINVAL; > + u64 state; > + > + rcu_read_lock(); > + > + tsk = find_task_by_vpid(server_tid); > + /* Server/worker interaction is allowed only within the same mm. */ > + if (tsk && current->mm == tsk->mm) > + ut_server = READ_ONCE(tsk->umcg_task); > + > + if (!ut_server) > + goto out_rcu; > + > + ret = -EFAULT; > + if (get_user_nofault(state, &ut_server->state_ts)) > + goto out_rcu; > + > + ret = -EAGAIN; > + if ((state & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_IDLE) > + goto out_rcu; > + > + ret = umcg_update_state(&ut_server->state_ts, &state, > + UMCG_TASK_RUNNING, false); > + > + if (ret) > + goto out_rcu; > + > + try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU); > + ret = 0; > + > +out_rcu: > + rcu_read_unlock(); > + return ret; > +} > + > +/* > + * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu. > + */ > +static int umcg_wake_idle_server_may_fault(u32 server_tid) > +{ > + struct umcg_task __user *ut_server = NULL; > + struct task_struct *tsk; > + int ret = -EINVAL; > + u64 state; > + > + rcu_read_lock(); > + tsk = find_task_by_vpid(server_tid); > + if (tsk && current->mm == tsk->mm) > + ut_server = READ_ONCE(tsk->umcg_task); > + rcu_read_unlock(); > + > + if (!ut_server) > + return -EINVAL; > + > + if (get_user(state, &ut_server->state_ts)) > + return -EFAULT; > + > + if ((state & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_IDLE) > + return -EAGAIN; > + > + ret = umcg_update_state(&ut_server->state_ts, &state, > + UMCG_TASK_RUNNING, true); > + if (ret) > + return ret; > + > + /* > + * umcg_ttwu will call find_task_by_vpid again; but we cannot > + * elide this, as we cannot do get_user() from an rcu-locked > + * code block. > + */ > + return umcg_ttwu(server_tid, WF_CURRENT_CPU); > +} > + > +/* > + * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu. > + */ > +static int umcg_wake_idle_server(u32 server_tid, bool may_fault) > +{ > + int ret = umcg_wake_idle_server_nofault(server_tid); > + > + if (!ret) > + return 0; > + > + if (!may_fault || ret != -EFAULT) > + return ret; > + > + return umcg_wake_idle_server_may_fault(server_tid); > +} > + > +/* > + * Called in sched_submit_work() context for UMCG workers. In the common case, > + * the worker's state changes RUNNING => BLOCKED, and its server's state > + * changes IDLE => RUNNING, and the server is ttwu-ed. > + * > + * Under some conditions (e.g. the worker is "locked", see > + * /Documentation/userspace-api/umcg.[txt|rst] for more details), the > + * function does nothing. > + * > + * The function is called with preempt disabled to make sure the retry_once > + * logic below works correctly. > + */ > +static void process_sleeping_worker(struct task_struct *tsk, u32 *server_tid) > +{ > + struct umcg_task __user *ut_worker = tsk->umcg_task; > + u64 curr_state, next_state; > + bool retried = false; > + u32 tid; > + int ret; > + > + *server_tid = 0; > + > + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid UMCG worker.")) > + return; > + > + /* If the worker has no server, do nothing. 
*/ > + if (unlikely(!tsk->pinned_umcg_server_page)) > + return; > + > + if (get_user_nofault(curr_state, &ut_worker->state_ts)) > + goto die; > + > + /* > + * The userspace is allowed to concurrently change a RUNNING worker's > + * state only once in a "short" period of time, so we retry state > + * change at most once. As this retry block is within a > + * preempt_disable region, "short" is truly short here. > + * > + * See Documentation/userspace-api/umcg.[txt|rst] for details. > + */ > +retry_once: > + if (curr_state & UMCG_TF_LOCKED) > + return; > + > + if (WARN_ONCE((curr_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING, > + "Unexpected UMCG worker state.")) > + goto die; > + > + next_state = curr_state & ~UMCG_TASK_STATE_MASK; > + next_state |= UMCG_TASK_BLOCKED; > + > + ret = umcg_update_state(&ut_worker->state_ts, &curr_state, next_state, false); > + if (ret == -EAGAIN) { > + if (retried) > + goto die; > + > + retried = true; > + goto retry_once; > + } > + if (ret) > + goto die; > + > + if (get_user_nofault(tid, &ut_worker->next_tid)) > + goto die; > + > + *server_tid = tid; > + return; > + > +die: > + pr_warn("%s: killing task %d\n", __func__, current->pid); > + force_sig(SIGKILL); > +} > + > +/* Called from sched_submit_work(). Must not fault/sleep. */ > +void umcg_wq_worker_sleeping(struct task_struct *tsk) > +{ > + u32 server_tid; > + > + /* > + * Disable preemption so that retry_once in process_sleeping_worker > + * works properly. > + */ > + preempt_disable(); > + process_sleeping_worker(tsk, &server_tid); > + preempt_enable(); > + > + if (server_tid) { > + int ret = umcg_wake_idle_server_nofault(server_tid); > + > + if (ret && ret != -EAGAIN) > + goto die; > + } > + > + goto out; > + > +die: > + pr_warn("%s: killing task %d\n", __func__, current->pid); > + force_sig(SIGKILL); > +out: > + umcg_unpin_pages(); > +} > + > +/** > + * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack. > + * > + * Returns true on success, false on a fatal failure. > + * > + * See Documentation/userspace-api/umcg.[txt|rst] for details. > + */ > +static bool enqueue_idle_worker(struct umcg_task __user *ut_worker) > +{ > + u64 __user *node = &ut_worker->idle_workers_ptr; > + u64 __user *head_ptr; > + u64 first = (u64)node; > + u64 head; > + > + if (get_user(head, node) || !head) > + return false; > + > + head_ptr = (u64 __user *)head; > + > + if (put_user(UMCG_IDLE_NODE_PENDING, node)) > + return false; > + > + if (xchg_user_64(head_ptr, &first)) > + return false; > + > + if (put_user(first, node)) > + return false; > + > + return true; > +} > + > +/** > + * get_idle_server - retrieve an idle server, if present. > + * > + * Returns true on success, false on a fatal failure. > + */ > +static bool get_idle_server(struct umcg_task __user *ut_worker, u32 *server_tid) > +{ > + u64 server_tid_ptr; > + u32 tid; > + > + /* Empty result is OK. */ > + *server_tid = 0; > + > + if (get_user(server_tid_ptr, &ut_worker->idle_server_tid_ptr)) > + return false; > + > + if (!server_tid_ptr) > + return false; > + > + tid = 0; > + if (xchg_user_32((u32 __user *)server_tid_ptr, &tid)) > + return false; > + > + *server_tid = tid; > + return true; > +} > + > +/* > + * Returns true to wait for the userspace to schedule this worker, false > + * to return to the userspace. > + * > + * In the common case, a BLOCKED worker is marked IDLE and enqueued > + * to idle_workers_ptr list. The idle server is woken (if present). 
> + * > + * If a RUNNING worker is preempted, this function will trigger, in which > + * case the worker is moved to IDLE state and its server is woken. > + * > + * Sets @server_tid to point to the server to be woken if the worker > + * is going to sleep; sets @server_tid to point to the server assigned > + * to this RUNNING worker if the worker is to return to the userspace. > + */ > +static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid) > +{ > + struct umcg_task __user *ut_worker = tsk->umcg_task; > + u64 curr_state, next_state; > + > + *server_tid = 0; > + > + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker")) > + return false; > + > + if (fatal_signal_pending(tsk)) > + return false; > + > + if (get_user(curr_state, &ut_worker->state_ts)) > + goto die; > + > + if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) { > + u32 tid; > + > + /* Wakeup: wait but don't enqueue. */ > + if (curr_state & UMCG_TF_LOCKED) > + return true; > + > + smp_rmb(); /* Order getting state and getting server_tid */ > + if (get_user(tid, &ut_worker->next_tid)) > + goto die; > + > + if (tid) { > + *server_tid = tid; > + > + /* pass-through: RUNNING with a server. */ > + if (!(curr_state & UMCG_TF_PREEMPTED)) > + return false; > + } else if (curr_state & UMCG_TF_PREEMPTED) > + /* PREEMPTED workers must have servers. */ > + goto die; > + > + /* > + * Fallthrough to mark the worker IDLE: the worker is > + * PREEMPTED, or the worker is RUNNING, but has no server > + * (which happens via UMCG_WAIT_WAKE_ONLY). > + */ > + } else if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE && > + (curr_state & UMCG_TF_LOCKED))) > + /* The worker prepares to sleep or to unregister. */ > + return false; > + > + if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE)) > + goto die; > + > + next_state = curr_state & ~UMCG_TASK_STATE_MASK; > + next_state |= UMCG_TASK_IDLE; > + > + if (umcg_update_state(&ut_worker->state_ts, &curr_state, > + next_state, true)) > + goto die; > + > + if (!enqueue_idle_worker(ut_worker)) > + goto die; > + > + smp_mb(); /* Order enqueuing the worker with getting the server. */ > + if (!(*server_tid) && !get_idle_server(ut_worker, server_tid)) > + goto die; > + > + return true; > + > +die: > + pr_warn("umcg_process_waking_worker: killing task %d\n", current->pid); > + force_sig(SIGKILL); > + return false; > +} > + > +/* > + * Called from sched_update_worker(): defer all work until later, as > + * sched_update_worker() may be called with in-kernel locks held. > + */ > +void umcg_wq_worker_running(struct task_struct *tsk) > +{ > + set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME); > +} > + > +/* Called via TIF_NOTIFY_RESUME flag from exit_to_user_mode_loop. */ > +void umcg_handle_resuming_worker(void) > +{ > + u32 server_tid; > + > + /* Avoid recursion by removing PF_UMCG_WORKER */ > + current->flags &= ~PF_UMCG_WORKER; > + > + do { > + bool should_wait; > + > + should_wait = process_waking_worker(current, &server_tid); > + > + if (!should_wait) > + break; > + > + if (server_tid) { > + int ret = umcg_wake_idle_server(server_tid, true); > + > + if (ret && ret != -EAGAIN) > + goto die; > + } > + > + umcg_idle_loop(0); > + } while (true); > + > + if (!server_tid) > + /* No server => no reason to pin pages. 
*/ > + umcg_unpin_pages(); > + else if (umcg_pin_pages(server_tid)) > + goto die; > + > + goto out; > + > +die: > + pr_warn("%s: killing task %d\n", __func__, current->pid); > + force_sig(SIGKILL); > +out: > + current->flags |= PF_UMCG_WORKER; > +} > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index f43d89d92860..682261d78ee7 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -272,6 +272,10 @@ COND_SYSCALL(landlock_create_ruleset); > COND_SYSCALL(landlock_add_rule); > COND_SYSCALL(landlock_restrict_self); > > +/* kernel/sched/umcg.c */ > +COND_SYSCALL(umcg_ctl); > +COND_SYSCALL(umcg_wait); > + > /* arch/example/kernel/sys_example.c */ > > /* mm/fadvise.c */ > -- > 2.25.1 > Thanks, Tao