Date: Mon, 18 Oct 2021 22:50:42 +0800
From: Tao Zhou
To: Peter Oskolkov
Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andrew Morton,
    Dave Hansen, Andy Lutomirski, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
    Paul Turner, Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
    Thierry Delisle, Tao Zhou
Subject: Re: [PATCH v0.7 5/5] sched/umcg: add Documentation/userspace-api/umcg.txt
References: <20211012232522.714898-1-posk@google.com>
 <20211012232522.714898-6-posk@google.com>
In-Reply-To: <20211012232522.714898-6-posk@google.com>

Hi Peter,

On Tue, Oct 12, 2021 at 04:25:22PM -0700, Peter Oskolkov wrote:
> Document User Managed Concurrency Groups syscalls, data structures,
> state transitions, etc.
>
> This is a text version of umcg.rst.
>
> Signed-off-by: Peter Oskolkov
> ---
>  Documentation/userspace-api/umcg.txt | 594 +++++++++++++++++++++++++++
>  1 file changed, 594 insertions(+)
>  create mode 100644 Documentation/userspace-api/umcg.txt
>
> diff --git a/Documentation/userspace-api/umcg.txt b/Documentation/userspace-api/umcg.txt
> new file mode 100644
> index 000000000000..cabaa6f4aaad
> --- /dev/null
> +++ b/Documentation/userspace-api/umcg.txt
> @@ -0,0 +1,594 @@
> +UMCG USERSPACE API
> +
> +User Managed Concurrency Groups (UMCG) is an M:N threading
> +subsystem/toolkit that lets user space application developers implement
> +in-process user space schedulers.
> +
> +
> +CONTENTS
> +
> +    WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
> +    REQUIREMENTS
> +    UMCG KERNEL API
> +    SERVERS
> +    WORKERS
> +    UMCG TASK STATES
> +    STRUCT UMCG_TASK
> +    SYS_UMCG_CTL()
> +    SYS_UMCG_WAIT()
> +    STATE TRANSITIONS
> +    SERVER-ONLY USE CASES
> +
> +
> +WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
> +
> +The Linux kernel's CFS scheduler is designed for the "common" use case,
> +with efficiency/throughput in mind. Work isolation and workloads of
> +different "urgency" are addressed by tools such as cgroups, CPU affinity,
> +priorities, etc., which are difficult or impossible to efficiently use
> +in-process.
> +
> +For example, a single DBMS process may receive tens of thousands of
> +requests per second; some of these requests may have strong response
> +latency requirements as they serve live user requests (e.g. login
> +authentication); some of these requests may not care much about latency
> +but must be served within a certain time period (e.g. an hourly aggregate
> +usage report); some of these requests are to be served only on a
> +best-effort basis and can be NACKed under high load (e.g. an exploratory
> +research/hypothesis testing workload).
> +
> +Beyond different work item latency/throughput requirements as outlined
> +above, the DBMS may need to provide certain guarantees to different users;
> +for example, user A may "reserve" 1 CPU for their high-priority/low latency
                                                                   ^^^^^^^^^^^
                                                                   low-latency
> +requests, 2 CPUs for mid-level throughput workloads, and be allowed to send
> +as many best-effort requests as possible, which may or may not be served,
> +depending on the DBMS load. Besides, the best-effort work, started when the
> +load was low, may need to be delayed if suddenly a large amount of
> +higher-priority work arrives. With hundreds or thousands of users like
> +this, it is very difficult to guarantee the application's responsiveness
> +using standard Linux tools while maintaining high CPU utilization.
> +
> +Gaming is another use case: some in-process work must be completed before
> +a certain deadline dictated by the frame rendering schedule, while other
> +work items can be delayed; some work may need to be cancelled/discarded
> +because the deadline has passed; etc.
> +
> +User Managed Concurrency Groups is an M:N threading toolkit that allows
> +constructing user space schedulers designed to efficiently manage the
> +heterogeneous in-process workloads described above while maintaining high
> +CPU utilization (95%+).
> +
> +
> +REQUIREMENTS
> +
> +One relatively established way to design high-efficiency, low-latency
> +systems is to split all work into small on-CPU work items, with
> +asynchronous I/O and continuations, all executed on a thread pool with the
> +number of threads not exceeding the number of available CPUs. Although this
> +approach works, it is quite difficult to develop and maintain such a
> +system, as, for example, small continuations are difficult to piece
> +together when debugging. Besides, such asynchronous callback-based systems
> +tend to be somewhat cache-inefficient, as continuations can get scheduled
> +on any CPU regardless of cache locality.
> +
> +M:N threading and cooperative user space scheduling enable controlled CPU
> +usage (minimal OS preemption), synchronous coding style, and better cache
> +locality.
> +
> +Specifically:
> +
> +* a variable/fluctuating number M of "application" threads should be
> +  "scheduled over" a relatively fixed number N of "kernel" threads, where
> +  N is less than or equal to the number of CPUs available;
> +* only those application threads that are attached to kernel threads are
> +  scheduled "on CPU";
> +* application threads should be able to cooperatively
> +  yield to each other;

The above two lines can be in one line.

* application threads should be able to cooperatively yield to each other;

> +* when an application thread blocks in the kernel (e.g. in I/O), this
> +  becomes a scheduling event ("block") that the userspace scheduler
> +  should be able to efficiently detect, and reassign a waiting
> +  application thread to the freed "kernel" thread;
> +* when a blocked application thread wakes (e.g. its I/O operation
> +  completes), this even ("wake") should also be detectable by the
                      ^^^^
                      event
> +  userspace scheduler, which should be able to either quickly dispatch the
> +  newly woken thread to an idle "kernel" thread or, if all "kernel"
> +  threads are busy, put it in the waiting queue;
> +* in addition to the above, it would be extremely useful for a separate
> +  in-process "watchdog" facility to be able to monitor the state of each
> +  of the M+N threads, and to intervene in case of runaway workloads
> +  (interrupt/preempt).
> +
> +
> +UMCG KERNEL API
> +
> +Based on the requirements above, the UMCG kernel API is built around the
> +following ideas:
> +
> +* UMCG server: a task/thread representing "kernel threads", or CPUs from
> +  the requirements above;
> +* UMCG worker: a task/thread representing "application threads", to be
> +  scheduled over servers;
> +* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
> +  server or a worker) can be in;
> +* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
> +  can be ORed with the task state to communicate additional information to
> +  the kernel;
> +* struct umcg_task: a per-task userspace set of data fields, usually
> +  residing in the TLS, that fully reflects the current task's UMCG state
> +  and controls the way the kernel manages the task;
> +* sys_umcg_ctl(): a syscall used to register the current task/thread as a
> +  server or a worker, or to unregister a UMCG task;
> +* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
> +  wake another task, potentially context-switching between the two tasks
> +  on-CPU synchronously.
> +
> +
> +SERVERS
> +
> +When a task/thread is registered as a server, it is in RUNNING state and
> +behaves like any other normal task/thread. In addition, servers can
> +interact with other UMCG tasks via sys_umcg_wait():
> +
> +* servers can voluntarily suspend their execution (wait), becoming IDLE;
> +* servers can wake other IDLE servers;
> +* servers can context-switch between each other.
> +
> +Note that if a server blocks in the kernel not via sys_umcg_wait(), it
> +still retains its RUNNING state.
> +
> +
> +WORKERS
> +
> +A worker cannot be RUNNING without having a server associated with it, so
> +when a task is first registered as a worker, it enters the IDLE state.
> +
> +* a worker becomes RUNNING when a server calls sys_umcg_wait() to
> +  context-switch into it; the server goes IDLE, and the worker becomes
> +  RUNNING in its place;
> +* when a running worker blocks in the kernel, it becomes BLOCKED, its
> +  associated server becomes RUNNING and the server's sys_umcg_wait() call
> +  from the bullet above returns; this transition is sometimes called
> +  "block detection";
> +* when the syscall on which a BLOCKED worker is blocked completes, the
> +  worker becomes IDLE and is added to the list of idle workers; if there
> +  is an idle server waiting, the kernel wakes it; this transition is
> +  sometimes called "wake detection";
> +* running workers can voluntarily suspend their execution (wait),
     ^^^^^^^
     RUNNING
> +  becoming IDLE; their associated servers are woken;
> +* a RUNNING worker can context-switch with an IDLE worker; the server of
> +  the switched-out worker is transferred to the switched-in worker;
> +* any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
> +  this is a server running the worker as described in the first bullet in
> +  this list, the worker remains IDLE but is added to the idle workers
> +  list; this "wake" operation exists for completeness, to make sure
> +  wait/wake/context-switch operations are available for all UMCG tasks;
> +* the userspace can preempt a RUNNING worker by marking it
> +  RUNNING|PREEMPTED and sending a signal to it; the userspace should have
> +  installed a NOP signal handler for the signal; the kernel will then
> +  transition the worker into IDLE|PREEMPTED state and wake its associated
> +  server.
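
Not a correction, just a note for other readers: since these are new
syscalls with no libc wrappers yet, I found it easier to reason about the
API above with raw wrappers in front of me. A minimal sketch of my own;
the syscall numbers are placeholders, the real values come from the
earlier patches in this series:

  /* Raw wrappers for the two UMCG syscalls described above (sketch).
   * The __NR_* values below are placeholders; use the numbers wired up
   * for your architecture by this patch series.
   */
  #include <stdint.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  #ifndef __NR_umcg_ctl
  #define __NR_umcg_ctl  450  /* placeholder */
  #endif
  #ifndef __NR_umcg_wait
  #define __NR_umcg_wait 451  /* placeholder */
  #endif

  struct umcg_task;  /* include/uapi/linux/umcg.h, added by this series */

  static inline int umcg_ctl(uint32_t flags, struct umcg_task *self)
  {
          return syscall(__NR_umcg_ctl, flags, self);
  }

  static inline int umcg_wait(uint32_t flags, uint64_t abs_timeout)
  {
          return syscall(__NR_umcg_wait, flags, abs_timeout);
  }

I will use umcg_ctl()/umcg_wait() in the sketches further down.
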
> +
> +
> +UMCG TASK STATES
> +
> +Important: all state transitions described below involve at least two
> +steps: the change of the state field in struct umcg_task, for example
> +RUNNING to IDLE, and the corresponding change in struct task_struct state,
> +for example a transition between the task running on CPU and being
> +descheduled and removed from the kernel runqueue. The key principle of UMCG
> +API design is that the party initiating the state transition modifies the
> +state variable.
> +
> +For example, a task going IDLE first changes its state from RUNNING to IDLE
> +in the userspace and then calls sys_umcg_wait(), which completes the
> +transition.
> +
> +Note on documentation: in include/uapi/linux/umcg.h, task states have the
> +form UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, etc. In this document these are
> +usually referred to simply as RUNNING and BLOCKED, unless it creates
> +ambiguity. Task state flags, e.g. UMCG_TF_PREEMPTED, are treated similarly.
> +
> +UMCG task states reflect the view from the userspace, rather than from the
> +kernel. There are three fundamental task states:
> +
> +* RUNNING: indicates that the task is schedulable by the kernel; applies
> +  to both servers and workers;
> +* IDLE: indicates that the task is not schedulable by the kernel (see
> +  umcg_idle_loop() in kernel/sched/umcg.c); applies to both servers and
> +  workers;
> +* BLOCKED: indicates that the worker is blocked in the kernel; does not
> +  apply to servers.
> +
> +In addition to the three states above, two state flags help with state
> +transitions:
> +
> +* LOCKED: the userspace is preparing the worker for a state transition
> +  and "locks" the worker until the worker is ready for the kernel to act
> +  on the state transition; used similarly to preempt_disable or
> +  irq_disable in the kernel; applies only to workers in RUNNING or IDLE
> +  state; RUNNING|LOCKED means "this worker is about to become RUNNING",
> +  while IDLE|LOCKED means "this worker is about to become IDLE or
> +  unregister";
> +* PREEMPTED: the userspace indicates it wants the worker to be preempted;
> +  there are no situations when both LOCKED and PREEMPTED flags are set at
> +  the same time.
> +
> +
> +STRUCT UMCG_TASK
> +
> +From include/uapi/linux/umcg.h:
> +
> +struct umcg_task {
> +        uint64_t state_ts;            /* r/w */
> +        uint32_t next_tid;            /* r   */
> +        uint32_t flags;               /* reserved */
> +        uint64_t idle_workers_ptr;    /* r/w */
> +        uint64_t idle_server_tid_ptr; /* r   */
> +};
> +
> +Each UMCG task is identified by struct umcg_task, which is provided to the
> +kernel when the task is registered via sys_umcg_ctl().
> +
> +* uint64_t state_ts: the current state of the task this struct
> +  identifies, as described in the previous section, combined with a
> +  unique timestamp indicating when the last state change happened.
> +
> +  Readable/writable by both the kernel and the userspace.
> +
> +      bits  0 -  5: task state (RUNNING, IDLE, BLOCKED);
> +      bits  6 -  7: state flags (LOCKED, PREEMPTED);
> +      bits  8 - 12: reserved; must be zeroes;
> +      bits 13 - 17: for userspace use;
> +      bits 18 - 63: timestamp.
> +
> +  Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> +
> +  It is highly beneficial to tag each state change with a unique
> +  timestamp:
> +
> +  - timestamps will naturally provide instrumentation to measure
> +    scheduling delays, both in the kernel and in the userspace;
> +  - uniqueness of timestamps (modulo overflow) guarantees that state
> +    change races, especially ABA races, are easily detected and avoided.
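
The bit layout above is easy to get wrong, so here is how I read it, as
code. The helper names are mine; a sketch that assumes the state and flag
values occupy bits 0-5 and 6-7 exactly as documented:

  /* Decompose umcg_task.state_ts per the layout quoted above (sketch). */
  #include <stdint.h>

  #define UMCG_TS_STATE_MASK 0x3fULL /* bits  0 -  5 */
  #define UMCG_TS_FLAGS_MASK 0xc0ULL /* bits  6 -  7 */
  #define UMCG_TS_SHIFT      18      /* bits 18 - 63 */

  static inline uint64_t umcg_state(uint64_t state_ts)
  {
          return state_ts & UMCG_TS_STATE_MASK; /* RUNNING/IDLE/BLOCKED */
  }

  static inline uint64_t umcg_flags(uint64_t state_ts)
  {
          return state_ts & UMCG_TS_FLAGS_MASK; /* LOCKED/PREEMPTED */
  }

  static inline uint64_t umcg_timestamp_ns(uint64_t state_ts)
  {
          /* The 46-bit timestamp is in 16ns units: shift back up by 4. */
          return (state_ts >> UMCG_TS_SHIFT) << 4;
  }

The scheduling-delay instrumentation mentioned above is then just a
CLOCK_MONOTONIC "now" minus umcg_timestamp_ns(), modulo the wraparound.
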
> +
> +  Each timestamp represents the moment in time the state change happened,
> +  in nanoseconds, with the lower 4 bits and the upper 14 bits stripped.
> +
> +  In this document 'umcg_task.state' is often used to refer to the
> +  'umcg_task.state_ts' field, as timestamps do not carry semantic
> +  meaning at the moment.
> +
> +  This is how umcg_task.state_ts is updated in the kernel:
> +
> +  /* kernel side */
> +  /**
> +   * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
> +   * @state_ts   - points to the state_ts member of struct umcg_task to update;
> +   * @expected   - the expected value of state_ts, including the timestamp;
> +   * @desired    - the desired value of state_ts, state part only;
> +   * @may_fault  - whether to use normal or _nofault cmpxchg.
> +   *
> +   * The function is basically cmpxchg(state_ts, expected, desired), with extra
> +   * code to set the timestamp in @desired.
> +   */
> +  static int umcg_update_state(u64 __user *state_ts, u64 *expected,
> +                               u64 desired, bool may_fault)
> +  {
> +          u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
> +          u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
> +
> +          /* Cut higher order bits. */
> +          next_ts &= ((1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1);
> +
> +          if (next_ts == curr_ts)
> +                  ++next_ts;
> +
> +          /* Remove an old timestamp, if any. */
> +          desired &= ((1ULL << (64 - UMCG_STATE_TIMESTAMP_BITS)) - 1);
> +
> +          /* Set the new timestamp. */
> +          desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
> +
> +          if (may_fault)
> +                  return cmpxchg_user_64(state_ts, expected, desired);
> +
> +          return cmpxchg_user_64_nofault(state_ts, expected, desired);
> +  }
> +
> +* uint32_t next_tid: contains the TID of the task to context-switch into
> +  in sys_umcg_wait(); can be zero; writable by the userspace, readable by
> +  the kernel; if this is a RUNNING worker, this field contains the TID of
> +  the server that should be woken when this worker blocks; see
> +  sys_umcg_wait() for more details;
> +
> +* uint32_t flags: reserved; must be zero.
> +
> +* uint64_t idle_workers_ptr: this field forms a singly-linked list of
> +  idle workers: all RUNNING workers have this field set to point to the
> +  head of the list (a pointer variable in the userspace).
> +
> +  When a worker's blocking operation in the kernel completes, the kernel
> +  changes the worker's state from BLOCKED to IDLE and adds the worker to
> +  the top of the list of idle workers using this logic:
> +
> +  /* kernel side */
> +  /**
> +   * enqueue_idle_worker - push an idle worker onto idle_workers_ptr
> +   * list/stack.
> +   *
> +   * Returns true on success, false on a fatal failure.
> +   */
> +  static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
> +  {
> +          u64 __user *node = &ut_worker->idle_workers_ptr;
> +          u64 __user *head_ptr;
> +          u64 first = (u64)node;
> +          u64 head;
> +
> +          if (get_user_nosleep(head, node) || !head)
> +                  return false;
> +
> +          head_ptr = (u64 __user *)head;
> +
> +          if (put_user_nosleep(UMCG_IDLE_NODE_PENDING, node))
> +                  return false;
> +
> +          if (xchg_user_64(head_ptr, &first))
> +                  return false;
> +
> +          if (put_user_nosleep(first, node))
> +                  return false;
> +
> +          return true;
> +  }
> +
> +  In the userspace the list is cleared atomically using this logic:
> +
> +  /* userspace side */
> +  uint64_t *idle_workers = (uint64_t *)*head;
> +
> +  atomic_exchange(&idle_workers, NULL);
> +
> +  The userspace re-points workers' idle_workers_ptr to the list head
> +  variable before the worker is allowed to become RUNNING again.
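
The two-line "userspace side" snippet above is quite schematic; it took
me a moment to see how the drain interacts with a concurrent
enqueue_idle_worker(). Here is my understanding as a fuller sketch: the
helper name and callback are mine, and I assume UMCG_IDLE_NODE_PENDING
and struct umcg_task come from the series' uapi header:

  /* Drain the idle-workers stack built by enqueue_idle_worker() above
   * (userspace sketch). 'head' is the list head variable that all
   * RUNNING workers' idle_workers_ptr points to.
   */
  #include <stdatomic.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <linux/umcg.h> /* struct umcg_task etc., from this series */

  extern _Atomic uint64_t head;

  static void drain_idle_workers(void (*on_idle)(struct umcg_task *))
  {
          /* Grab the whole stack; the kernel may start a new one at
           * once, since a fresh push reads the head variable itself. */
          uint64_t node = atomic_exchange(&head, 0);

          while (node) {
                  _Atomic uint64_t *link = (_Atomic uint64_t *)node;
                  uint64_t next;

                  /* An in-flight push publishes the node before linking
                   * it: spin until the kernel replaces the PENDING mark
                   * with the real next pointer (see the code above). */
                  do {
                          next = atomic_load(link);
                  } while (next == UMCG_IDLE_NODE_PENDING);

                  /* Each node is the worker's own idle_workers_ptr. */
                  on_idle((struct umcg_task *)(node -
                          offsetof(struct umcg_task, idle_workers_ptr)));
                  node = next;
          }
  }

on_idle() would be where the scheduler re-points idle_workers_ptr back
at the head variable, as the paragraph above requires, before letting
the worker become RUNNING again.
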
> +
> +  When processing the idle workers list, the userspace should wait for
> +  workers marked as UMCG_IDLE_NODE_PENDING to have the flag cleared (see
> +  enqueue_idle_worker() above).
> +
> +* uint64_t idle_server_tid_ptr: points to a variable in the userspace
> +  that points to an idle server, i.e. a server in IDLE state waiting in
> +  sys_umcg_wait(); read-only; workers must have this field set; not used
> +  in servers.
> +
> +  When a worker's blocking operation in the kernel completes, the kernel
> +  changes the worker's state from BLOCKED to IDLE, adds the worker to the
> +  list of idle workers, and wakes the idle server if present; the kernel
> +  atomically exchanges (*idle_server_tid_ptr) with 0, thus waking the idle
> +  server, if present, only once. See STATE TRANSITIONS below for more
> +  details.
> +
> +
> +SYS_UMCG_CTL()
> +
> +int sys_umcg_ctl(uint32_t flags, struct umcg_task *self) is used to
> +register or unregister the current task as a worker or server. Flags can be
> +one of the following:
> +
> +    UMCG_CTL_REGISTER: register a server;
> +    UMCG_CTL_REGISTER | UMCG_CTL_WORKER: register a worker;
> +    UMCG_CTL_UNREGISTER: unregister the current server or worker.
> +
> +When registering a task, self must point to struct umcg_task describing
> +this server or worker; the pointer must remain valid until the task is
> +unregistered.
> +
> +When registering a server, self->state must be RUNNING; all other fields in
> +self must be zeroes.
> +
> +When registering a worker, self->state must be RUNNING;
                                                  ^^^^^^^
                                                  IDLE

After looking through the document and the code, I feel a newly
registered worker's state should be IDLE:

+
+A worker cannot be RUNNING without having a server associated with it, so
+when a task is first registered as a worker, it enters the IDLE state.
+

> +self->idle_server_tid_ptr and self->idle_workers_ptr must be valid pointers
> +as described in struct umcg_task; self->next_tid must be zero.
> +
> +When unregistering a task, self must be NULL.
> +
> +
> +SYS_UMCG_WAIT()
> +
> +int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout) operates on
> +registered UMCG servers and workers: struct umcg_task *self provided to
> +sys_umcg_ctl() when registering the current task is consulted in addition
> +to the flags and abs_timeout parameters.
> +
> +The function can be used to perform one of three operations:
> +
> +* wait: if self->next_tid is zero, sys_umcg_wait() puts the current
> +  task to sleep;
> +* wake: if self->next_tid is not zero, and flags & UMCG_WAIT_WAKE_ONLY,
> +  the task identified by next_tid is woken;
> +* context switch: if self->next_tid is not zero, and !(flags &
> +  UMCG_WAIT_WAKE_ONLY), the current task is put to sleep and the next task
> +  is woken, synchronously switching between the tasks on the current CPU
> +  on the fast path.
> +
> +Flags can be zero or a combination of the following values:
> +
> +* UMCG_WAIT_WAKE_ONLY: wake the next task, don't put the current task to
> +  sleep;
> +* UMCG_WAIT_WF_CURRENT_CPU: wake the next task on the current CPU; this
> +  flag has an effect only if UMCG_WAIT_WAKE_ONLY is set: context switching
> +  is always attempted on the current CPU.
> +
> +The section below provides more details on how servers and workers interact
> +via sys_umcg_wait(), during worker block/wake events, and during worker
> +preemption.
> +
> +
> +STATE TRANSITIONS
> +
> +As mentioned above, the key principle of UMCG state transitions is that the
> +party initiating the state transition modifies the state of affected tasks.
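
To double-check my reading of the two syscall sections above, here is
registration in terms of the umcg_ctl() wrapper from my earlier comment.
A sketch only; whether state_ts needs a valid timestamp at registration
time is not spelled out in the text, so I left it zero:

  /* Register the calling thread as a UMCG server, then unregister
   * (sketch; error handling elided; UMCG_* constants from the series'
   * uapi header).
   */
  #include <string.h>
  #include <linux/umcg.h>

  static __thread struct umcg_task self; /* valid until unregistered */

  static int register_server(void)
  {
          memset(&self, 0, sizeof(self));    /* all other fields zero */
          self.state_ts = UMCG_TASK_RUNNING; /* servers start RUNNING */
          return umcg_ctl(UMCG_CTL_REGISTER, &self);
  }

  static int unregister_task(void)
  {
          return umcg_ctl(UMCG_CTL_UNREGISTER, NULL); /* self == NULL */
  }

A worker registration would additionally point idle_server_tid_ptr and
idle_workers_ptr at the shared variables described in STRUCT UMCG_TASK
(and, per my comment above, presumably start as IDLE rather than
RUNNING).
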
> +
> +Below, "TASK:STATE" indicates a task T, where T can be either W for worker
> +or S for server, in state S, where S can be one of the three states,
> +potentially ORed with a state flag. Each individual state transition is an
> +atomic operation (cmpxchg) unless indicated otherwise. Also note that the
> +order of state transitions is important and is part of the contract between
> +the userspace and the kernel. The kernel is free to kill the task (SIGKILL)
> +if the contract is broken.
> +
> +Some worker state transitions below include adding the LOCKED flag to the
> +worker state. This is done to indicate to the kernel that the worker is

..worker is +in the+

> +transitioning state and should not participate in the block/wake detection
> +routines, which can happen due to interrupts/pagefaults/signals.
> +
> +IDLE|LOCKED means that a running worker is preparing to sleep, so
> +interrupts should not lead to server wakeup; RUNNING|LOCKED means that an
> +idle worker is going to be "scheduled to run", but may not yet have its
> +server set up properly.
> +
> +Key state transitions:
> +
> +* server to worker context switch ("schedule a worker to run"):
> +  S:RUNNING+W:IDLE => S:IDLE+W:RUNNING:
> +    in the userspace, in the context of the server S running:
> +        S:RUNNING => S:IDLE (mark self as idle)
> +        W:IDLE => W:RUNNING|LOCKED (mark the worker as running)
> +        W.next_tid := S.tid; S.next_tid := W.tid (link the server with
> +            the worker)
> +        W:RUNNING|LOCKED => W:RUNNING (unlock the worker)
> +        S: sys_umcg_wait() (make the syscall)
> +    the kernel context switches from the server to the worker; the
> +    server sleeps until it becomes RUNNING during one of the
> +    transitions below;
> +
> +* worker to server context switch (worker "yields"):
> +  S:IDLE+W:RUNNING => S:RUNNING+W:IDLE:
> +    in the userspace, in the context of the worker W running (note that
> +    a running worker has its next_tid set to point to its server):
> +        W:RUNNING => W:IDLE|LOCKED (mark self as idle)
> +        S:IDLE => S:RUNNING (mark the server as running)
> +        W: sys_umcg_wait() (make the syscall)
> +    the kernel removes the LOCKED flag from the worker's state and
> +    context switches from the worker to the server; the worker sleeps
> +    until it becomes RUNNING;
> +
> +* worker to worker context switch:
> +  W1:RUNNING+W2:IDLE => W1:IDLE+W2:RUNNING:
> +    in the userspace, in the context of W1 running:
> +        W2:IDLE => W2:RUNNING|LOCKED (mark W2 as running)
> +        W1:RUNNING => W1:IDLE|LOCKED (mark self as idle)
> +        W2.next_tid := W1.next_tid; S.next_tid := W2.tid (transfer the
> +            server W1 => W2)
> +        W1.next_tid := W2.tid (indicate that W1 should context-switch
> +            into W2)
> +        W2:RUNNING|LOCKED => W2:RUNNING (unlock W2)
> +        W1: sys_umcg_wait() (make the syscall)
> +    same as above, the kernel removes the LOCKED flag from W1's
> +    state and context switches to next_tid;
> +
> +* worker wakeup: W:IDLE => W:RUNNING:
> +    in the userspace, a server S can wake a worker W without "running"
> +    it:
> +        S.next_tid := W.tid
> +        W.next_tid := 0
> +        W:IDLE => W:RUNNING
> +        sys_umcg_wait(UMCG_WAIT_WAKE_ONLY) (make the syscall)
> +    the kernel will wake the worker W; as the worker does not have a
> +    server assigned, "wake detection" will happen, the worker will be
> +    immediately marked as IDLE and added to idle workers list; an idle
> +    server, if any, will be woken (see 'wake detection' below);
> +
> +    Note: if needed, it is possible for a worker to wake another
> +    worker: the waker marks itself "IDLE|LOCKED", points its next_tid
> +    to the wakee, makes the syscall, restores its server in next_tid,
> +    and marks itself as RUNNING.
> +
> +* block detection: worker blocks in the kernel:
> +  S:IDLE+W:RUNNING => S:RUNNING+W:BLOCKED:
> +    when a worker blocks in the kernel in RUNNING state (not LOCKED),
> +    before descheduling the task from the CPU the kernel performs
> +    these operations:
> +        W:RUNNING => W:BLOCKED
> +        S := W.next_tid
> +        S:IDLE => S:RUNNING
> +        try_to_wake_up(S)
> +    if any of the first three operations above fail, the worker is
> +    killed via SIGKILL. Note that ttwu(S) is not required to succeed,
> +    as the server may still be transitioning to sleep in
> +    sys_umcg_wait(); before actually putting the server to sleep its
> +    UMCG state is checked and, if it is RUNNING, sys_umcg_wait()
> +    returns to the userspace;
> +    if the worker has its LOCKED flag set, block detection does not
> +    trigger, as the worker is assumed to be in the userspace
> +    scheduling code.
> +
> +* wake detection: worker wakes in the kernel: W:BLOCKED => W:IDLE:
> +    all workers' returns to the userspace are intercepted:
> +        start: (a label)
> +        if W:RUNNING & W.next_tid != 0: let the worker exit to the
> +            userspace, as this is a RUNNING worker with a server;
> +        W:* => W:IDLE (previously blocked workers, or workers woken
> +            without a server, are not allowed to return to the
> +            userspace);
> +        the worker is appended to the W.idle_workers_ptr idle workers
> +            list;
> +        S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S:RUNNING;
> +            ttwu(S)
> +        idle_loop(W): this is the same idle loop that sys_umcg_wait()
> +            uses: it breaks only when the worker becomes RUNNING; when
> +            the idle loop exits, it is assumed that the userspace has
> +            properly removed the worker from the idle workers list
> +            before marking it RUNNING;
> +        goto start; (repeat from the beginning).
> +
> +    the logic above is a bit more complicated in the presence of
> +    LOCKED or PREEMPTED flags, but the main invariants stay the same:
> +        only RUNNING workers with servers assigned are allowed to run
> +            in the userspace (unless LOCKED);
> +        newly IDLE workers are added to the idle workers list; any
> +            user-initiated state change assumes the userspace properly
> +            removed the worker from the list;
> +        as with block detection, any "breach of contract" by the
> +            userspace will result in the task termination via SIGKILL.
> +
> +* worker preemption: S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED:
> +    when the userspace wants to preempt a RUNNING worker, it changes its
> +    state, atomically, RUNNING => RUNNING|PREEMPTED and sends a signal
> +    to the worker via tgkill(); the signal handler, previously set up
> +    by the userspace, can be a NOP (note that only RUNNING workers can
> +    be preempted);
> +
> +    if the worker, at the moment the signal arrived, continued to be
> +    running on-CPU in the userspace, the "wake detection" code will be
> +    triggered; in addition to what was described above, it will check
> +    if the worker is in RUNNING|PREEMPTED state:
> +        W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED
> +        S := W.next_tid
> +        S:IDLE => S:RUNNING
> +        try_to_wake_up(S)
> +
> +    if the signal arrives after the worker blocks in the kernel,
> +    "block detection" happens as described above, with the following
> +    change:
> +        W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED
> +        S := W.next_tid
> +        S:IDLE => S:RUNNING
> +        try_to_wake_up(S)
> +
> +    in any case, the worker's server is woken, with its attached
> +    worker (S.next_tid) either in BLOCKED|PREEMPTED or IDLE|PREEMPTED
> +    state.
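
The preemption flow reads correctly to me; spelling out the userspace
half helped, so here it is as a sketch. The signal choice and the helper
names are mine; the masks assume the bit layout from STRUCT UMCG_TASK:

  /* Preempt RUNNING worker w_tid from a server/watchdog thread (sketch).
   * Any signal with a NOP handler works; SIGUSR1 is an arbitrary pick.
   */
  #include <signal.h>
  #include <stdatomic.h>
  #include <stdint.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/umcg.h> /* UMCG_TASK_RUNNING, UMCG_TF_PREEMPTED */

  static void nop_handler(int sig) { (void)sig; }

  static void install_preempt_handler(void) /* once per process */
  {
          struct sigaction sa = { .sa_handler = nop_handler };

          sigemptyset(&sa.sa_mask);
          sigaction(SIGUSR1, &sa, NULL);
  }

  static int preempt_worker(_Atomic uint64_t *state_ts, pid_t tgid,
                            pid_t w_tid)
  {
          uint64_t cur = atomic_load(state_ts);

          /* Only a RUNNING worker with no flags set may be preempted. */
          if ((cur & 0xffULL) != UMCG_TASK_RUNNING)
                  return -1;

          /* W:RUNNING => W:RUNNING|PREEMPTED, atomically. */
          if (!atomic_compare_exchange_strong(state_ts, &cur,
                                              cur | UMCG_TF_PREEMPTED))
                  return -1; /* raced with another state change */

          /* Force a kernel entry; block/wake detection does the rest. */
          return syscall(SYS_tgkill, tgid, w_tid, SIGUSR1);
  }

One question, though: should the cmpxchg here also set a fresh
timestamp, as umcg_update_state() does on the kernel side? The text says
every state change is tagged with a unique timestamp, but the preemption
paragraph only mentions the flag.
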
> +
> +
> +SERVER-ONLY USE CASES
> +
> +Some workloads/applications may benefit from fast and synchronous on-CPU
> +user-initiated context switches without the need for full userspace
> +scheduling (block/wake detection). These applications can use "standalone"
> +UMCG servers to wait/wake/context-switch. At the moment only in-process
> +operations are allowed. In the future this restriction will be lifted,
> +and wait/wake/context-switch operations between servers in related
> +processes will be permitted (when it is safe to do so, e.g. if the
> +processes belong to the same user and/or cgroup).
> +
> +These "worker-less" operations involve trivial RUNNING <==> IDLE state
> +changes, not discussed here for brevity.
> --
> 2.25.1
>

Thanks,
Tao