Date: Mon, 12 Feb 2018 15:49:37
+0000 (UTC) From: Mathieu Desnoyers To: Alexander Viro Cc: linux-kernel , linux-api , "Paul E. McKenney" , Andy Lutomirski , Boqun Feng , Dave Watson , Peter Zijlstra , Paul Turner , Andrew Morton , Russell King , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Andrew Hunter , Andi Kleen , Chris Lameter , Ben Maurer , rostedt , Josh Triplett , Linus Torvalds , Catalin Marinas , Will Deacon , Michael Kerrisk Message-ID: <1489334073.20147.1518450577745.JavaMail.zimbra@efficios.com> In-Reply-To: <20171214161403.30643-11-mathieu.desnoyers@efficios.com> References: <20171214161403.30643-1-mathieu.desnoyers@efficios.com> <20171214161403.30643-11-mathieu.desnoyers@efficios.com> Subject: Re: [RFC PATCH for 4.16 10/21] cpu_opv: Provide cpu_opv system call (v5) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8BIT X-Originating-IP: [167.114.142.141] X-Mailer: Zimbra 8.7.11_GA_1854 (ZimbraWebClient - FF52 (Linux)/8.7.11_GA_1854) Thread-Topic: cpu_opv: Provide cpu_opv system call (v5) Thread-Index: PvoCl1m7M9uuLb+yq+IUQ9uXmUj+7Q== Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Al, Your feedback on this new cpu_opv system call would be welcome. This series is now aiming at the next merge window (4.17). The whole restartable sequences series can be fetched at: https://git.kernel.org/pub/scm/linux/kernel/git/rseq/linux-rseq.git/ tag: v4.15-rc9-rseq-20180122 Thanks! Mathieu ----- On Dec 14, 2017, at 11:13 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote: > The cpu_opv system call executes a vector of operations on behalf of > user-space on a specific CPU with preemption disabled. It is inspired > by readv() and writev() system calls which take a "struct iovec" > array as argument. > > The operations available are: comparison, memcpy, add, or, and, xor, > left shift, right shift, and memory barrier. The system call receives > a CPU number from user-space as argument, which is the CPU on which > those operations need to be performed. All pointers in the ops must > have been set up to point to the per CPU memory of the CPU on which > the operations should be executed. The "comparison" operation can be > used to check that the data used in the preparation step did not > change between preparation of system call inputs and operation > execution within the preempt-off critical section. > > The reason why we require all pointer offsets to be calculated by > user-space beforehand is because we need to use get_user_pages_fast() > to first pin all pages touched by each operation. This takes care of > faulting-in the pages. Then, preemption is disabled, and the > operations are performed atomically with respect to other thread > execution on that CPU, without generating any page fault. > > An overall maximum of 4216 bytes in enforced on the sum of operation > length within an operation vector, so user-space cannot generate a > too long preempt-off critical section (cache cold critical section > duration measured as 4.7µs on x86-64). Each operation is also limited > a length of 4096 bytes, meaning that an operation can touch a > maximum of 4 pages (memcpy: 2 pages for source, 2 pages for > destination if addresses are not aligned on page boundaries). > > If the thread is not running on the requested CPU, it is migrated to > it. 
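As an illustration of the calling convention described above, a minimal user-space sketch of the per-cpu counter fallback (use-case 2 further below) could look as follows. This is not part of the patch: it assumes a kernel carrying this series with its uapi header installed as <linux/cpu_opv.h>, a wired-up __NR_cpu_opv syscall number, and a 64-bit build so the LINUX_FIELD_u32_u64() pointer fields can be assigned directly (32-bit callers would use the CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macro mentioned in the changelog below).

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/cpu_opv.h>

/* Thin wrapper: glibc provides no wrapper for this RFC system call. */
static int cpu_opv(struct cpu_op *opv, int cpuopcnt, int cpu, int flags)
{
        return syscall(__NR_cpu_opv, opv, cpuopcnt, cpu, flags);
}

/* Add "count" to the counter of CPU "cpu" in a simple per-cpu array. */
static int percpu_counter_add(int64_t *counters, int cpu, int64_t count)
{
        struct cpu_op opv[1];

        memset(opv, 0, sizeof(opv));
        opv[0].op = CPU_ADD_OP;
        opv[0].len = sizeof(counters[cpu]);
        opv[0].u.arithmetic_op.p = (uint64_t)(uintptr_t)&counters[cpu];
        opv[0].u.arithmetic_op.count = count;
        opv[0].u.arithmetic_op.expect_fault_p = 0;
        return cpu_opv(opv, 1, cpu, 0);
}

A return value of 0 means the vector executed on the requested CPU; -1 with errno set to EAGAIN means the caller should simply retry the call.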
> > **** Justification for cpu_opv **** > > Here are a few reasons justifying why the cpu_opv system call is > needed in addition to rseq: > > 1) Allow algorithms to perform per-cpu data migration without relying on > sched_setaffinity() > > The use-cases are migrating memory between per-cpu memory free-lists, or > stealing tasks from other per-cpu work queues: each require that > accesses to remote per-cpu data structures are performed. > > Just rseq is not enough to cover those use-cases without additionally > relying on sched_setaffinity, which is unfortunately not > CPU-hotplug-safe. > > The cpu_opv system call receives a CPU number as argument, and migrates > the current task to the right CPU to perform the operation sequence. If > the requested CPU is offline, it performs the operations from the > current CPU while preventing CPU hotplug, and with a mutex held. > > 2) Handling single-stepping from tools > > Tools like debuggers, and simulators use single-stepping to run through > existing programs. If core libraries start to use restartable sequences > for e.g. memory allocation, this means pre-existing programs cannot be > single-stepped, simply because the underlying glibc or jemalloc has > changed. > > The rseq user-space does expose a __rseq_table section for the sake of > debuggers, so they can skip over the rseq critical sections if they > want. However, this requires upgrading tools, and still breaks > single-stepping in case where glibc or jemalloc is updated, but not the > tooling. > > Having a performance-related library improvement break tooling is likely > to cause a big push-back against wide adoption of rseq. > > 3) Forward-progress guarantee > > Having a piece of user-space code that stops progressing due to external > conditions is pretty bad. Developers are used to think of fast-path and > slow-path (e.g. for locking), where the contended vs uncontended cases > have different performance characteristics, but each need to provide > some level of progress guarantees. > > There are concerns about proposing just "rseq" without the associated > slow-path (cpu_opv) that guarantees progress. It's just asking for > trouble when real-life will happen: page faults, uprobes, and other > unforeseen conditions that would seldom cause a rseq fast-path to never > progress. > > 4) Handling page faults > > It's pretty easy to come up with corner-case scenarios where rseq does > not progress without the help from cpu_opv. For instance, a system with > swap enabled which is under high memory pressure could trigger page > faults at pretty much every rseq attempt. Although this scenario > is extremely unlikely, rseq becomes the weak link of the chain. > > 5) Comparison with LL/SC > > The layman versed in the load-link/store-conditional instructions in > RISC architectures will notice the similarity between rseq and LL/SC > critical sections. The comparison can even be pushed further: since > debuggers can handle those LL/SC critical sections, they should be > able to handle rseq c.s. in the same way. > > First, the way gdb recognises LL/SC c.s. patterns is very fragile: > it's limited to specific common patterns, and will miss the pattern > in all other cases. But fear not, having the rseq c.s. expose a > __rseq_table to debuggers removes that guessing part. > > The main difference between LL/SC and rseq is that debuggers had > to support single-stepping through LL/SC critical sections from the > get go in order to support a given architecture. 
For rseq, we're > adding critical sections into pre-existing applications/libraries, > so the user expectation is that tools don't break due to a library > optimization. > > 6) Perform maintenance operations on per-cpu data > > rseq c.s. are quite limited feature-wise: they need to end with a > *single* commit instruction that updates a memory location. On the other > hand, the cpu_opv system call can combine a sequence of operations that > need to be executed with preemption disabled. While slower than rseq, > this allows for more complex maintenance operations to be performed on > per-cpu data concurrently with rseq fast-paths, in cases where it's not > possible to map those sequences of ops to a rseq. > > 7) Use cpu_opv as generic implementation for architectures not > implementing rseq assembly code > > rseq critical sections require architecture-specific user-space code to > be crafted in order to port an algorithm to a given architecture. In > addition, it requires that the kernel architecture implementation adds > hooks into signal delivery and resume to user-space. > > In order to facilitate integration of rseq into user-space, cpu_opv can > provide a (relatively slower) architecture-agnostic implementation of > rseq. This means that user-space code can be ported to all architectures > through use of cpu_opv initially, and have the fast-path use rseq > whenever the asm code is implemented. > > 8) Allow libraries with multi-part algorithms to work on same per-cpu > data without affecting the allowed cpu mask > > The lttng-ust tracer presents an interesting use-case for per-cpu > buffers: the algorithm needs to update a "reserve" counter, serialize > data into the buffer, and then update a "commit" counter _on the same > per-cpu buffer_. Using rseq for both reserve and commit can bring > significant performance benefits. > > Clearly, if rseq reserve fails, the algorithm can retry on a different > per-cpu buffer. However, it's not that easy for the commit. It needs to > be performed on the same per-cpu buffer as the reserve. > > The cpu_opv system call solves that problem by receiving the cpu number > on which the operation needs to be performed as argument. It can push > the task to the right CPU if needed, and perform the operations there > with preemption disabled. > > Changing the allowed cpu mask for the current thread is not an > acceptable alternative for a tracing library, because the application > being traced does not expect that mask to be changed by libraries. > > 9) Ensure that data structures don't need store-release/load-acquire > semantic to handle fall-back > > cpu_opv performs the fall-back on the requested CPU by migrating the > task to that CPU. Executing the slow-path on the right CPU ensures that > store-release/load-acquire semantic is not required neither on the > fast-path nor slow-path. > > **** rseq and cpu_opv use-cases **** > > 1) per-cpu spinlock > > A per-cpu spinlock can be implemented as a rseq consisting of a > comparison operation (== 0) on a word, and a word store (1), followed > by an acquire barrier after control dependency. The unlock path can be > performed with a simple store-release of 0 to the word, which does > not require rseq. > > The cpu_opv fallback requires a single-word comparison (== 0) and a > single-word store (1). > > 2) per-cpu statistics counters > > A per-cpu statistics counters can be implemented as a rseq consisting > of a final "add" instruction on a word as commit. 
> > The cpu_opv fallback can be implemented as a "ADD" operation. > > Besides statistics tracking, these counters can be used to implement > user-space RCU per-cpu grace period tracking for both single and > multi-process user-space RCU. > > 3) per-cpu LIFO linked-list (unlimited size stack) > > A per-cpu LIFO linked-list has a "push" and "pop" operation, > which respectively adds an item to the list, and removes an > item from the list. > > The "push" operation can be implemented as a rseq consisting of > a word comparison instruction against head followed by a word store > (commit) to head. Its cpu_opv fallback can be implemented as a > word-compare followed by word-store as well. > > The "pop" operation can be implemented as a rseq consisting of > loading head, comparing it against NULL, loading the next pointer > at the right offset within the head item, and the next pointer as > a new head, returning the old head on success. > > The cpu_opv fallback for "pop" differs from its rseq algorithm: > considering that cpu_opv requires to know all pointers at system > call entry so it can pin all pages, so cpu_opv cannot simply load > head and then load the head->next address within the preempt-off > critical section. User-space needs to pass the head and head->next > addresses to the kernel, and the kernel needs to check that the > head address is unchanged since it has been loaded by user-space. > However, when accessing head->next in a ABA situation, it's > possible that head is unchanged, but loading head->next can > result in a page fault due to a concurrently freed head object. > This is why the "expect_fault" operation field is introduced: if a > fault is triggered by this access, "-EAGAIN" will be returned by > cpu_opv rather than -EFAULT, thus indicating the the operation > vector should be attempted again. The "pop" operation can thus be > implemented as a word comparison of head against the head loaded > by user-space, followed by a load of the head->next pointer (which > may fault), and a store of that pointer as a new head. > > 4) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack) > > This structure is useful for passing around allocated objects > by passing pointers through per-cpu fixed-sized stack. > > The "push" side can be implemented with a check of the current > offset against the maximum buffer length, followed by a rseq > consisting of a comparison of the previously loaded offset > against the current offset, a word "try store" operation into the > next ring buffer array index (it's OK to abort after a try-store, > since it's not the commit, and its side-effect can be overwritten), > then followed by a word-store to increment the current offset (commit). > > The "push" cpu_opv fallback can be done with the comparison, and > two consecutive word stores, all within the preempt-off section. > > The "pop" side can be implemented with a check that offset is not > 0 (whether the buffer is empty), a load of the "head" pointer before the > offset array index, followed by a rseq consisting of a word > comparison checking that the offset is unchanged since previously > loaded, another check ensuring that the "head" pointer is unchanged, > followed by a store decrementing the current offset. > > The cpu_opv "pop" can be implemented with the same algorithm > as the rseq fast-path (compare, compare, store). 
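A hypothetical sketch of the linked-list "pop" fallback described in 3) above, reusing the cpu_opv() wrapper and assumptions from the earlier sketch (the struct layout and helper name are illustrative only, not from the patch):

struct percpu_list_node {
        struct percpu_list_node *next;
        /* ... payload ... */
};

/*
 * "head" points to the per-cpu list head of CPU "cpu"; "expect" is the
 * non-NULL head value previously loaded and checked by user-space.
 */
static int percpu_list_pop_fallback(struct percpu_list_node **head, int cpu,
                                    struct percpu_list_node *expect)
{
        struct cpu_op opv[2];

        memset(opv, 0, sizeof(opv));
        /* Stop (return 1) if *head changed since user-space loaded it. */
        opv[0].op = CPU_COMPARE_EQ_OP;
        opv[0].len = sizeof(*head);
        opv[0].u.compare_op.a = (uint64_t)(uintptr_t)head;
        opv[0].u.compare_op.b = (uint64_t)(uintptr_t)&expect;
        /*
         * *head = expect->next. The load of expect->next may fault if the
         * node was concurrently freed and unmapped (ABA), so the source is
         * marked expect_fault: the kernel then returns -EAGAIN, not -EFAULT.
         */
        opv[1].op = CPU_MEMCPY_OP;
        opv[1].len = sizeof(*head);
        opv[1].u.memcpy_op.dst = (uint64_t)(uintptr_t)head;
        opv[1].u.memcpy_op.src = (uint64_t)(uintptr_t)&expect->next;
        opv[1].u.memcpy_op.expect_fault_src = 1;
        return cpu_opv(opv, 2, cpu, 0);
}

Here 0 means "expect" was popped, 1 means the head changed (reload it and retry), and -1 with errno set to EAGAIN means the whole attempt should be retried.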
> > 5) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack) > supporting "peek" from remote CPU > > In order to implement work queues with work-stealing between CPUs, it is > useful to ensure the offset "commit" in scenario 4) "push" have a > store-release semantic, thus allowing remote CPU to load the offset > with acquire semantic, and load the top pointer, in order to check if > work-stealing should be performed. The task (work queue item) existence > should be protected by other means, e.g. RCU. > > If the peek operation notices that work-stealing should indeed be > performed, a thread can use cpu_opv to move the task between per-cpu > workqueues, by first invoking cpu_opv passing the remote work queue > cpu number as argument to pop the task, and then again as "push" with > the target work queue CPU number. > > 6) per-cpu LIFO ring buffer with data copy (fixed-sized stack) > (with and without acquire-release) > > This structure is useful for passing around data without requiring > memory allocation by copying the data content into per-cpu fixed-sized > stack. > > The "push" operation is performed with an offset comparison against > the buffer size (figuring out if the buffer is full), followed by > a rseq consisting of a comparison of the offset, a try-memcpy attempting > to copy the data content into the buffer (which can be aborted and > overwritten), and a final store incrementing the offset. > > The cpu_opv fallback needs to same operations, except that the memcpy > is guaranteed to complete, given that it is performed with preemption > disabled. This requires a memcpy operation supporting length up to 4kB. > > The "pop" operation is similar to the "push, except that the offset > is first compared to 0 to ensure the buffer is not empty. The > copy source is the ring buffer, and the destination is an output > buffer. > > 7) per-cpu FIFO ring buffer (fixed-sized queue) > > This structure is useful wherever a FIFO behavior (queue) is needed. > One major use-case is tracer ring buffer. > > An implementation of this ring buffer has a "reserve", followed by > serialization of multiple bytes into the buffer, ended by a "commit". > The "reserve" can be implemented as a rseq consisting of a word > comparison followed by a word store. The reserve operation moves the > producer "head". The multi-byte serialization can be performed > non-atomically. Finally, the "commit" update can be performed with > a rseq "add" commit instruction with store-release semantic. The > ring buffer consumer reads the commit value with load-acquire > semantic to know whenever it is safe to read from the ring buffer. > > This use-case requires that both "reserve" and "commit" operations > be performed on the same per-cpu ring buffer, even if a migration > happens between those operations. In the typical case, both operations > will happens on the same CPU and use rseq. In the unlikely event of a > migration, the cpu_opv system call will ensure the commit can be > performed on the right CPU by migrating the task to that CPU. > > On the consumer side, an alternative to using store-release and > load-acquire on the commit counter would be to use cpu_opv to > ensure the commit counter load is performed on the right CPU. This > effectively allows moving a consumer thread between CPUs to execute > close to the ring buffer cache lines it will read. > > Signed-off-by: Mathieu Desnoyers > CC: "Paul E. 
McKenney" > CC: Peter Zijlstra > CC: Paul Turner > CC: Thomas Gleixner > CC: Andrew Hunter > CC: Andy Lutomirski > CC: Andi Kleen > CC: Dave Watson > CC: Chris Lameter > CC: Ingo Molnar > CC: "H. Peter Anvin" > CC: Ben Maurer > CC: Steven Rostedt > CC: Josh Triplett > CC: Linus Torvalds > CC: Andrew Morton > CC: Russell King > CC: Catalin Marinas > CC: Will Deacon > CC: Michael Kerrisk > CC: Boqun Feng > CC: linux-api@vger.kernel.org > --- > Changes since v1: > - handle CPU hotplug, > - cleanup implementation using function pointers: We can use function > pointers to implement the operations rather than duplicating all the > user-access code. > - refuse device pages: Performing cpu_opv operations on io map'd pages > with preemption disabled could generate long preempt-off critical > sections, which leads to unwanted scheduler latency. Return EFAULT if > a device page is received as parameter > - restrict op vector to 4216 bytes length sum: Restrict the operation > vector to length sum of: > - 4096 bytes (typical page size on most architectures, should be > enough for a string, or structures) > - 15 * 8 bytes (typical operations on integers or pointers). > The goal here is to keep the duration of preempt off critical section > short, so we don't add significant scheduler latency. > - Add INIT_ONSTACK macro: Introduce the > CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users > correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their > stack to 0 on 32-bit architectures. > - Add CPU_MB_OP operation: > Use-cases with: > - two consecutive stores, > - a mempcy followed by a store, > require a memory barrier before the final store operation. A typical > use-case is a store-release on the final store. Given that this is a > slow path, just providing an explicit full barrier instruction should > be sufficient. > - Add expect fault field: > The use-case of list_pop brings interesting challenges. With rseq, we > can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer, > compare it against NULL, add an offset, and load the target "next" > pointer from the object, all within a single req critical section. > > Life is not so easy for cpu_opv in this use-case, mainly because we > need to pin all pages we are going to touch in the preempt-off > critical section beforehand. So we need to know the target object (in > which we apply an offset to fetch the next pointer) when we pin pages > before disabling preemption. > > So the approach is to load the head pointer and compare it against > NULL in user-space, before doing the cpu_opv syscall. User-space can > then compute the address of the head->next field, *without loading it*. > > The cpu_opv system call will first need to pin all pages associated > with input data. This includes the page backing the head->next object, > which may have been concurrently deallocated and unmapped. Therefore, > in this case, getting -EFAULT when trying to pin those pages may > happen: it just means they have been concurrently unmapped. This is > an expected situation, and should just return -EAGAIN to user-space, > to user-space can distinguish between "should retry" type of > situations and actual errors that should be handled with extreme > prejudice to the program (e.g. abort()). > > Therefore, add "expect_fault" fields along with op input address > pointers, so user-space can identify whether a fault when getting a > field should return EAGAIN rather than EFAULT. 
> - Add compiler barrier between operations: Adding a compiler barrier > between store operations in a cpu_opv sequence can be useful when > paired with membarrier system call. > > An algorithm with a paired slow path and fast path can use > sys_membarrier on the slow path to replace fast-path memory barriers > by compiler barrier. > > Adding an explicit compiler barrier between operations allows > cpu_opv to be used as fallback for operations meant to match > the membarrier system call. > > Changes since v2: > > - Fix memory leak by introducing struct cpu_opv_pinned_pages. > Suggested by Boqun Feng. > - Cast argument 1 passed to access_ok from integer to void __user *, > fixing sparse warning. > > Changes since v3: > > - Fix !SMP by adding push_task_to_cpu() empty static inline. > - Add missing sys_cpu_opv() asmlinkage declaration to > include/linux/syscalls.h. > > Changes since v4: > > - Cleanup based on Thomas Gleixner's feedback. > - Handle retry in case where the scheduler migrates the thread away > from the target CPU after migration within the syscall rather than > returning EAGAIN to user-space. > - Move push_task_to_cpu() to its own patch. > - New scheme for touching user-space memory: > 1) get_user_pages_fast() to pin/get all pages (which can sleep), > 2) vm_map_ram() those pages > 3) grab mmap_sem (read lock) > 4) __get_user_pages_fast() (or get_user_pages() on failure) > -> Confirm that the same page pointers are returned. This > catches cases where COW mappings are changed concurrently. > -> If page pointers differ, or on gup failure, release mmap_sem, > vm_unmap_ram/put_page and retry from step (1). > -> perform put_page on the extra reference immediately for each > page. > 5) preempt disable > 6) Perform operations on vmap. Those operations are normal > loads/stores/memcpy. > 7) preempt enable > 8) release mmap_sem > 9) vm_unmap_ram() all virtual addresses > 10) put_page() all pages > - Handle architectures with VIVT caches along with vmap(): call > flush_kernel_vmap_range() after each "write" operation. This > ensures that the user-space mapping and vmap reach a consistent > state between each operation. > - Depend on MMU for is_zero_pfn(). e.g. Blackfin and SH architectures > don't provide the zero_pfn symbol. > > --- > Man page associated: > > CPU_OPV(2) Linux Programmer's Manual CPU_OPV(2) > > NAME > cpu_opv - CPU preempt-off operation vector system call > > SYNOPSIS > #include > > int cpu_opv(struct cpu_op * cpu_opv, int cpuopcnt, int cpu, int flags); > > DESCRIPTION > The cpu_opv system call executes a vector of operations on behalf > of user-space on a specific CPU with preemption disabled. > > The operations available are: comparison, memcpy, add, or, and, > xor, left shift, right shift, and memory barrier. The system call > receives a CPU number from user-space as argument, which is the > CPU on which those operations need to be performed. All pointers > in the ops must have been set up to point to the per CPU memory > of the CPU on which the operations should be executed. The "com‐ > parison" operation can be used to check that the data used in the > preparation step did not change between preparation of system > call inputs and operation execution within the preempt-off criti‐ > cal section. > > An overall maximum of 4216 bytes in enforced on the sum of opera‐ > tion length within an operation vector, so user-space cannot gen‐ > erate a too long preempt-off critical section. Each operation is > also limited a length of 4096 bytes. 
A maximum limit of 16 opera‐ > tions per cpu_opv syscall invocation is enforced. > > If the thread is not running on the requested CPU, it is migrated > to it. > > The layout of struct cpu_opv is as follows: > > Fields > > op Operation of type enum cpu_op_type to perform. This opera‐ > tion type selects the associated "u" union field. > > len > Length (in bytes) of data to consider for this operation. > > u.compare_op > For a CPU_COMPARE_EQ_OP , and CPU_COMPARE_NE_OP , contains > the a and b pointers to compare. The expect_fault_a and > expect_fault_b fields indicate whether a page fault should > be expected for each of those pointers. If expect_fault_a > , or expect_fault_b is set, EAGAIN is returned on fault, > else EFAULT is returned. The len field is allowed to take > values from 0 to 4096 for comparison operations. > > u.memcpy_op > For a CPU_MEMCPY_OP , contains the dst and src pointers, > expressing a copy of src into dst. The expect_fault_dst > and expect_fault_src fields indicate whether a page fault > should be expected for each of those pointers. If > expect_fault_dst , or expect_fault_src is set, EAGAIN is > returned on fault, else EFAULT is returned. The len field > is allowed to take values from 0 to 4096 for memcpy opera‐ > tions. > > u.arithmetic_op > For a CPU_ADD_OP , contains the p , count , and > expect_fault_p fields, which are respectively a pointer to > the memory location to increment, the 64-bit signed inte‐ > ger value to add, and whether a page fault should be > expected for p . If expect_fault_p is set, EAGAIN is > returned on fault, else EFAULT is returned. The len field > is allowed to take values of 1, 2, 4, 8 bytes for arith‐ > metic operations. > > u.bitwise_op > For a CPU_OR_OP , CPU_AND_OP , and CPU_XOR_OP , contains > the p , mask , and expect_fault_p fields, which are > respectively a pointer to the memory location to target, > the mask to apply, and whether a page fault should be > expected for p . If expect_fault_p is set, EAGAIN is > returned on fault, else EFAULT is returned. The len field > is allowed to take values of 1, 2, 4, 8 bytes for bitwise > operations. > > u.shift_op > For a CPU_LSHIFT_OP , and CPU_RSHIFT_OP , contains the p , > bits , and expect_fault_p fields, which are respectively a > pointer to the memory location to target, the number of > bits to shift either left of right, and whether a page > fault should be expected for p . If expect_fault_p is > set, EAGAIN is returned on fault, else EFAULT is returned. > The len field is allowed to take values of 1, 2, 4, 8 > bytes for shift operations. The bits field is allowed to > take values between 0 and 63. > > The enum cpu_op_types contains the following operations: > > · CPU_COMPARE_EQ_OP: Compare whether two memory locations are > equal, > > · CPU_COMPARE_NE_OP: Compare whether two memory locations differ, > > · CPU_MEMCPY_OP: Copy a source memory location into a destina‐ > tion, > > · CPU_ADD_OP: Increment a target memory location of a given > count, > > · CPU_OR_OP: Apply a "or" mask to a memory location, > > · CPU_AND_OP: Apply a "and" mask to a memory location, > > · CPU_XOR_OP: Apply a "xor" mask to a memory location, > > · CPU_LSHIFT_OP: Shift a memory location left of a given number > of bits, > > · CPU_RSHIFT_OP: Shift a memory location right of a given number > of bits. > > · CPU_MB_OP: Issue a memory barrier. > > All of the operations above provide single-copy atomicity guar‐ > antees for word-sized, word-aligned target pointers, for both > loads and stores. 
> > The cpuopcnt argument is the number of elements in the cpu_opv > array. It can take values from 0 to 16. > > The cpu argument is the CPU number on which the operation > sequence needs to be executed. > > The flags argument is expected to be 0. > > RETURN VALUE > A return value of 0 indicates success. On error, -1 is returned, > and errno is set appropriately. If a comparison operation fails, > execution of the operation vector is stopped, and the return > value is the index after the comparison operation (values between > 1 and 16). > > ERRORS > EAGAIN cpu_opv() system call should be attempted again. > > EINVAL Either flags contains an invalid value, or cpu contains an > invalid value or a value not allowed by the current > thread's allowed cpu mask, or cpuopcnt contains an invalid > value, or the cpu_opv operation vector contains an invalid > op value, or the cpu_opv operation vector contains an > invalid len value, or the cpu_opv operation vector sum of > len values is too large. > > ENOSYS The cpu_opv() system call is not implemented by this ker‐ > nel. > > EFAULT cpu_opv is an invalid address, or a pointer contained > within an operation is invalid (and a fault is not > expected for that pointer). > > VERSIONS > The cpu_opv() system call was added in Linux 4.X (TODO). > > CONFORMING TO > cpu_opv() is Linux-specific. > > SEE ALSO > membarrier(2), rseq(2) > > Linux 2017-11-10 CPU_OPV(2) > --- > MAINTAINERS | 7 + > include/linux/syscalls.h | 3 + > include/uapi/linux/cpu_opv.h | 114 +++++ > init/Kconfig | 16 + > kernel/Makefile | 1 + > kernel/cpu_opv.c | 1078 ++++++++++++++++++++++++++++++++++++++++++ > kernel/sys_ni.c | 1 + > 7 files changed, 1220 insertions(+) > create mode 100644 include/uapi/linux/cpu_opv.h > create mode 100644 kernel/cpu_opv.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index 4ede6c16d49f..36c5246b385b 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -3732,6 +3732,13 @@ B: https://bugzilla.kernel.org > F: drivers/cpuidle/* > F: include/linux/cpuidle.h > > +CPU NON-PREEMPTIBLE OPERATION VECTOR SUPPORT > +M: Mathieu Desnoyers > +L: linux-kernel@vger.kernel.org > +S: Supported > +F: kernel/cpu_opv.c > +F: include/uapi/linux/cpu_opv.h > + > CRAMFS FILESYSTEM > M: Nicolas Pitre > S: Maintained > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index 340650b4ec54..32d289f41f62 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -67,6 +67,7 @@ struct perf_event_attr; > struct file_handle; > struct sigaltstack; > struct rseq; > +struct cpu_op; > union bpf_attr; > > #include > @@ -943,5 +944,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, > unsigned flags, > unsigned mask, struct statx __user *buffer); > asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len, > int flags, uint32_t sig); > +asmlinkage long sys_cpu_opv(struct cpu_op __user *ucpuopv, int cpuopcnt, > + int cpu, int flags); > > #endif > diff --git a/include/uapi/linux/cpu_opv.h b/include/uapi/linux/cpu_opv.h > new file mode 100644 > index 000000000000..ccd8167fc189 > --- /dev/null > +++ b/include/uapi/linux/cpu_opv.h > @@ -0,0 +1,114 @@ > +#ifndef _UAPI_LINUX_CPU_OPV_H > +#define _UAPI_LINUX_CPU_OPV_H > + > +/* > + * linux/cpu_opv.h > + * > + * CPU preempt-off operation vector system call API > + * > + * Copyright (c) 2017 Mathieu Desnoyers > + * > + * Permission is hereby granted, free of charge, to any person obtaining a copy > + * of this software and associated documentation files (the "Software"), to > deal > + * in the Software 
without restriction, including without limitation the rights > + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell > + * copies of the Software, and to permit persons to whom the Software is > + * furnished to do so, subject to the following conditions: > + * > + * The above copyright notice and this permission notice shall be included in > + * all copies or substantial portions of the Software. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE > + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING > FROM, > + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN > THE > + * SOFTWARE. > + */ > + > +#ifdef __KERNEL__ > +# include > +#else > +# include > +#endif > + > +#include > + > +#define CPU_OP_VEC_LEN_MAX 16 > +#define CPU_OP_ARG_LEN_MAX 24 > +/* Maximum data len per operation. */ > +#define CPU_OP_DATA_LEN_MAX 4096 > +/* > + * Maximum data len for overall vector. Restrict the amount of user-space > + * data touched by the kernel in non-preemptible context, so it does not > + * introduce long scheduler latencies. > + * This allows one copy of up to 4096 bytes, and 15 operations touching 8 > + * bytes each. > + * This limit is applied to the sum of length specified for all operations > + * in a vector. > + */ > +#define CPU_OP_MEMCPY_EXPECT_LEN 4096 > +#define CPU_OP_EXPECT_LEN 8 > +#define CPU_OP_VEC_DATA_LEN_MAX \ > + (CPU_OP_MEMCPY_EXPECT_LEN + \ > + (CPU_OP_VEC_LEN_MAX - 1) * CPU_OP_EXPECT_LEN) > + > +enum cpu_op_type { > + /* compare */ > + CPU_COMPARE_EQ_OP, > + CPU_COMPARE_NE_OP, > + /* memcpy */ > + CPU_MEMCPY_OP, > + /* arithmetic */ > + CPU_ADD_OP, > + /* bitwise */ > + CPU_OR_OP, > + CPU_AND_OP, > + CPU_XOR_OP, > + /* shift */ > + CPU_LSHIFT_OP, > + CPU_RSHIFT_OP, > + /* memory barrier */ > + CPU_MB_OP, > +}; > + > +/* Vector of operations to perform. Limited to 16. */ > +struct cpu_op { > + /* enum cpu_op_type. */ > + int32_t op; > + /* data length, in bytes. */ > + uint32_t len; > + union { > + struct { > + LINUX_FIELD_u32_u64(a); > + LINUX_FIELD_u32_u64(b); > + uint8_t expect_fault_a; > + uint8_t expect_fault_b; > + } compare_op; > + struct { > + LINUX_FIELD_u32_u64(dst); > + LINUX_FIELD_u32_u64(src); > + uint8_t expect_fault_dst; > + uint8_t expect_fault_src; > + } memcpy_op; > + struct { > + LINUX_FIELD_u32_u64(p); > + int64_t count; > + uint8_t expect_fault_p; > + } arithmetic_op; > + struct { > + LINUX_FIELD_u32_u64(p); > + uint64_t mask; > + uint8_t expect_fault_p; > + } bitwise_op; > + struct { > + LINUX_FIELD_u32_u64(p); > + uint32_t bits; > + uint8_t expect_fault_p; > + } shift_op; > + char __padding[CPU_OP_ARG_LEN_MAX]; > + } u; > +}; > + > +#endif /* _UAPI_LINUX_CPU_OPV_H */ > diff --git a/init/Kconfig b/init/Kconfig > index 88e36395390f..8a4995ed1d19 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -1404,6 +1404,7 @@ config RSEQ > bool "Enable rseq() system call" if EXPERT > default y > depends on HAVE_RSEQ > + select CPU_OPV > select MEMBARRIER > help > Enable the restartable sequences system call. It provides a > @@ -1414,6 +1415,21 @@ config RSEQ > > If unsure, say Y. 
> > +# CPU_OPV depends on MMU for is_zero_pfn() > +config CPU_OPV > + bool "Enable cpu_opv() system call" if EXPERT > + default y > + depends on MMU > + help > + Enable the CPU preempt-off operation vector system call. > + It allows user-space to perform a sequence of operations on > + per-cpu data with preemption disabled. Useful as > + single-stepping fall-back for restartable sequences, and for > + performing more complex operations on per-cpu data that would > + not be otherwise possible to do with restartable sequences. > + > + If unsure, say Y. > + > config EMBEDDED > bool "Embedded system" > option allnoconfig_y > diff --git a/kernel/Makefile b/kernel/Makefile > index 3574669dafd9..cac8855196ff 100644 > --- a/kernel/Makefile > +++ b/kernel/Makefile > @@ -113,6 +113,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o > > obj-$(CONFIG_HAS_IOMEM) += memremap.o > obj-$(CONFIG_RSEQ) += rseq.o > +obj-$(CONFIG_CPU_OPV) += cpu_opv.o > > $(obj)/configs.o: $(obj)/config_data.h > > diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c > new file mode 100644 > index 000000000000..965fbf0a86b0 > --- /dev/null > +++ b/kernel/cpu_opv.c > @@ -0,0 +1,1078 @@ > +/* > + * CPU preempt-off operation vector system call > + * > + * It allows user-space to perform a sequence of operations on per-cpu > + * data with preemption disabled. Useful as single-stepping fall-back > + * for restartable sequences, and for performing more complex operations > + * on per-cpu data that would not be otherwise possible to do with > + * restartable sequences. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * Copyright (C) 2017, EfficiOS Inc., > + * Mathieu Desnoyers > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "sched/sched.h" > + > +/* > + * Typical invocation of cpu_opv need few virtual address pointers. Keep > + * those in an array on the stack of the cpu_opv system call up to > + * this limit, beyond which the array is dynamically allocated. > + */ > +#define NR_VADDR_ON_STACK 8 > + > +/* Maximum pages per op. */ > +#define CPU_OP_MAX_PAGES 4 > + > +/* Maximum number of virtual addresses per op. */ > +#define CPU_OP_VEC_MAX_ADDR (2 * CPU_OP_VEC_LEN_MAX) > + > +union op_fn_data { > + uint8_t _u8; > + uint16_t _u16; > + uint32_t _u32; > + uint64_t _u64; > +#if (BITS_PER_LONG < 64) > + uint32_t _u64_split[2]; > +#endif > +}; > + > +struct vaddr { > + unsigned long mem; > + unsigned long uaddr; > + struct page *pages[2]; > + unsigned int nr_pages; > + int write; > +}; > + > +struct cpu_opv_vaddr { > + struct vaddr *addr; > + size_t nr_vaddr; > + bool is_kmalloc; > +}; > + > +typedef int (*op_fn_t)(union op_fn_data *data, uint64_t v, uint32_t len); > + > +/* > + * Provide mutual exclution for threads executing a cpu_opv against an > + * offline CPU. 
> + */ > +static DEFINE_MUTEX(cpu_opv_offline_lock); > + > +/* > + * The cpu_opv system call executes a vector of operations on behalf of > + * user-space on a specific CPU with preemption disabled. It is inspired > + * by readv() and writev() system calls which take a "struct iovec" > + * array as argument. > + * > + * The operations available are: comparison, memcpy, add, or, and, xor, > + * left shift, right shift, and memory barrier. The system call receives > + * a CPU number from user-space as argument, which is the CPU on which > + * those operations need to be performed. All pointers in the ops must > + * have been set up to point to the per CPU memory of the CPU on which > + * the operations should be executed. The "comparison" operation can be > + * used to check that the data used in the preparation step did not > + * change between preparation of system call inputs and operation > + * execution within the preempt-off critical section. > + * > + * The reason why we require all pointer offsets to be calculated by > + * user-space beforehand is because we need to use get_user_pages_fast() > + * to first pin all pages touched by each operation. This takes care of > + * faulting-in the pages. Then, preemption is disabled, and the > + * operations are performed atomically with respect to other thread > + * execution on that CPU, without generating any page fault. > + * > + * An overall maximum of 4216 bytes in enforced on the sum of operation > + * length within an operation vector, so user-space cannot generate a > + * too long preempt-off critical section (cache cold critical section > + * duration measured as 4.7µs on x86-64). Each operation is also limited > + * a length of 4096 bytes, meaning that an operation can touch a > + * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for > + * destination if addresses are not aligned on page boundaries). > + * > + * If the thread is not running on the requested CPU, it is migrated to > + * it. > + */ > + > +static unsigned long cpu_op_range_nr_pages(unsigned long addr, > + unsigned long len) > +{ > + return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1; > +} > + > +static int cpu_op_count_pages(unsigned long addr, unsigned long len) > +{ > + unsigned long nr_pages; > + > + if (!len) > + return 0; > + nr_pages = cpu_op_range_nr_pages(addr, len); > + if (nr_pages > 2) { > + WARN_ON(1); > + return -EINVAL; > + } > + return nr_pages; > +} > + > +static struct vaddr *cpu_op_alloc_vaddr_vector(int nr_vaddr) > +{ > + return kzalloc(nr_vaddr * sizeof(struct vaddr), GFP_KERNEL); > +} > + > +/* > + * Check operation types and length parameters. Count number of pages. > + */ > +static int cpu_opv_check_op(struct cpu_op *op, int *nr_vaddr, uint32_t *sum) > +{ > + int ret; > + > + switch (op->op) { > + case CPU_MB_OP: > + break; > + default: > + *sum += op->len; > + } > + > + /* Validate inputs. 
*/ > + switch (op->op) { > + case CPU_COMPARE_EQ_OP: > + case CPU_COMPARE_NE_OP: > + case CPU_MEMCPY_OP: > + if (op->len > CPU_OP_DATA_LEN_MAX) > + return -EINVAL; > + break; > + case CPU_ADD_OP: > + case CPU_OR_OP: > + case CPU_AND_OP: > + case CPU_XOR_OP: > + switch (op->len) { > + case 1: > + case 2: > + case 4: > + case 8: > + break; > + default: > + return -EINVAL; > + } > + break; > + case CPU_LSHIFT_OP: > + case CPU_RSHIFT_OP: > + switch (op->len) { > + case 1: > + if (op->u.shift_op.bits > 7) > + return -EINVAL; > + break; > + case 2: > + if (op->u.shift_op.bits > 15) > + return -EINVAL; > + break; > + case 4: > + if (op->u.shift_op.bits > 31) > + return -EINVAL; > + break; > + case 8: > + if (op->u.shift_op.bits > 63) > + return -EINVAL; > + break; > + default: > + return -EINVAL; > + } > + break; > + case CPU_MB_OP: > + break; > + default: > + return -EINVAL; > + } > + > + /* Count pages and virtual addresses. */ > + switch (op->op) { > + case CPU_COMPARE_EQ_OP: > + case CPU_COMPARE_NE_OP: > + ret = cpu_op_count_pages(op->u.compare_op.a, op->len); > + if (ret < 0) > + return ret; > + ret = cpu_op_count_pages(op->u.compare_op.b, op->len); > + if (ret < 0) > + return ret; > + *nr_vaddr += 2; > + break; > + case CPU_MEMCPY_OP: > + ret = cpu_op_count_pages(op->u.memcpy_op.dst, op->len); > + if (ret < 0) > + return ret; > + ret = cpu_op_count_pages(op->u.memcpy_op.src, op->len); > + if (ret < 0) > + return ret; > + *nr_vaddr += 2; > + break; > + case CPU_ADD_OP: > + ret = cpu_op_count_pages(op->u.arithmetic_op.p, op->len); > + if (ret < 0) > + return ret; > + (*nr_vaddr)++; > + break; > + case CPU_OR_OP: > + case CPU_AND_OP: > + case CPU_XOR_OP: > + ret = cpu_op_count_pages(op->u.bitwise_op.p, op->len); > + if (ret < 0) > + return ret; > + (*nr_vaddr)++; > + break; > + case CPU_LSHIFT_OP: > + case CPU_RSHIFT_OP: > + ret = cpu_op_count_pages(op->u.shift_op.p, op->len); > + if (ret < 0) > + return ret; > + (*nr_vaddr)++; > + break; > + case CPU_MB_OP: > + break; > + default: > + return -EINVAL; > + } > + return 0; > +} > + > +/* > + * Check operation types and length parameters. Count number of pages. > + */ > +static int cpu_opv_check(struct cpu_op *cpuopv, int cpuopcnt, int *nr_vaddr) > +{ > + uint32_t sum = 0; > + int i, ret; > + > + for (i = 0; i < cpuopcnt; i++) { > + ret = cpu_opv_check_op(&cpuopv[i], nr_vaddr, &sum); > + if (ret) > + return ret; > + } > + if (sum > CPU_OP_VEC_DATA_LEN_MAX) > + return -EINVAL; > + return 0; > +} > + > +static int cpu_op_check_page(struct page *page, int write) > +{ > + struct address_space *mapping; > + > + if (is_zone_device_page(page)) > + return -EFAULT; > + > + /* > + * The page lock protects many things but in this context the page > + * lock stabilizes mapping, prevents inode freeing in the shared > + * file-backed region case and guards against movement to swap > + * cache. > + * > + * Strictly speaking the page lock is not needed in all cases being > + * considered here and page lock forces unnecessarily serialization > + * From this point on, mapping will be re-verified if necessary and > + * page lock will be acquired only if it is unavoidable > + * > + * Mapping checks require the head page for any compound page so the > + * head page and mapping is looked up now. 
> + */ > + page = compound_head(page); > + mapping = READ_ONCE(page->mapping); > + > + /* > + * If page->mapping is NULL, then it cannot be a PageAnon page; > + * but it might be the ZERO_PAGE (which is OK to read from), or > + * in the gate area or in a special mapping (for which this > + * check should fail); or it may have been a good file page when > + * get_user_pages_fast found it, but truncated or holepunched or > + * subjected to invalidate_complete_page2 before the page lock > + * is acquired (also cases which should fail). Given that a > + * reference to the page is currently held, refcount care in > + * invalidate_complete_page's remove_mapping prevents > + * drop_caches from setting mapping to NULL concurrently. > + * > + * The case to guard against is when memory pressure cause > + * shmem_writepage to move the page from filecache to swapcache > + * concurrently: an unlikely race, but a retry for page->mapping > + * is required in that situation. > + */ > + if (!mapping) { > + int shmem_swizzled; > + > + /* > + * Check again with page lock held to guard against > + * memory pressure making shmem_writepage move the page > + * from filecache to swapcache. > + */ > + lock_page(page); > + shmem_swizzled = PageSwapCache(page) || page->mapping; > + unlock_page(page); > + if (shmem_swizzled) > + return -EAGAIN; > + /* > + * It is valid to read from, but invalid to write to the > + * ZERO_PAGE. > + */ > + if (!(is_zero_pfn(page_to_pfn(page)) || > + is_huge_zero_page(page)) || write) { > + return -EFAULT; > + } > + } > + return 0; > +} > + > +static int cpu_op_check_pages(struct page **pages, > + unsigned long nr_pages, > + int write) > +{ > + unsigned long i; > + > + for (i = 0; i < nr_pages; i++) { > + int ret; > + > + ret = cpu_op_check_page(pages[i], write); > + if (ret) > + return ret; > + } > + return 0; > +} > + > +static int cpu_op_pin_pages(unsigned long addr, unsigned long len, > + struct cpu_opv_vaddr *vaddr_ptrs, > + unsigned long *vaddr, int write) > +{ > + struct page *pages[2]; > + int ret, nr_pages, nr_put_pages, n; > + unsigned long _vaddr; > + struct vaddr *va; > + > + nr_pages = cpu_op_count_pages(addr, len); > + if (!nr_pages) > + return 0; > +again: > + ret = get_user_pages_fast(addr, nr_pages, write, pages); > + if (ret < nr_pages) { > + if (ret >= 0) { > + nr_put_pages = ret; > + ret = -EFAULT; > + } else { > + nr_put_pages = 0; > + } > + goto error; > + } > + ret = cpu_op_check_pages(pages, nr_pages, write); > + if (ret) { > + nr_put_pages = nr_pages; > + goto error; > + } > + va = &vaddr_ptrs->addr[vaddr_ptrs->nr_vaddr++]; > + _vaddr = (unsigned long)vm_map_ram(pages, nr_pages, numa_node_id(), > + PAGE_KERNEL); > + if (!_vaddr) { > + nr_put_pages = nr_pages; > + ret = -ENOMEM; > + goto error; > + } > + va->mem = _vaddr; > + va->uaddr = addr; > + for (n = 0; n < nr_pages; n++) > + va->pages[n] = pages[n]; > + va->nr_pages = nr_pages; > + va->write = write; > + *vaddr = _vaddr + (addr & ~PAGE_MASK); > + return 0; > + > +error: > + for (n = 0; n < nr_put_pages; n++) > + put_page(pages[n]); > + /* > + * Retry if a page has been faulted in, or is being swapped in. 
> + */ > + if (ret == -EAGAIN) > + goto again; > + return ret; > +} > + > +static int cpu_opv_pin_pages_op(struct cpu_op *op, > + struct cpu_opv_vaddr *vaddr_ptrs, > + bool *expect_fault) > +{ > + int ret; > + unsigned long vaddr = 0; > + > + switch (op->op) { > + case CPU_COMPARE_EQ_OP: > + case CPU_COMPARE_NE_OP: > + ret = -EFAULT; > + *expect_fault = op->u.compare_op.expect_fault_a; > + if (!access_ok(VERIFY_READ, > + (void __user *)op->u.compare_op.a, > + op->len)) > + return ret; > + ret = cpu_op_pin_pages(op->u.compare_op.a, op->len, > + vaddr_ptrs, &vaddr, 0); > + if (ret) > + return ret; > + op->u.compare_op.a = vaddr; > + ret = -EFAULT; > + *expect_fault = op->u.compare_op.expect_fault_b; > + if (!access_ok(VERIFY_READ, > + (void __user *)op->u.compare_op.b, > + op->len)) > + return ret; > + ret = cpu_op_pin_pages(op->u.compare_op.b, op->len, > + vaddr_ptrs, &vaddr, 0); > + if (ret) > + return ret; > + op->u.compare_op.b = vaddr; > + break; > + case CPU_MEMCPY_OP: > + ret = -EFAULT; > + *expect_fault = op->u.memcpy_op.expect_fault_dst; > + if (!access_ok(VERIFY_WRITE, > + (void __user *)op->u.memcpy_op.dst, > + op->len)) > + return ret; > + ret = cpu_op_pin_pages(op->u.memcpy_op.dst, op->len, > + vaddr_ptrs, &vaddr, 1); > + if (ret) > + return ret; > + op->u.memcpy_op.dst = vaddr; > + ret = -EFAULT; > + *expect_fault = op->u.memcpy_op.expect_fault_src; > + if (!access_ok(VERIFY_READ, > + (void __user *)op->u.memcpy_op.src, > + op->len)) > + return ret; > + ret = cpu_op_pin_pages(op->u.memcpy_op.src, op->len, > + vaddr_ptrs, &vaddr, 0); > + if (ret) > + return ret; > + op->u.memcpy_op.src = vaddr; > + break; > + case CPU_ADD_OP: > + ret = -EFAULT; > + *expect_fault = op->u.arithmetic_op.expect_fault_p; > + if (!access_ok(VERIFY_WRITE, > + (void __user *)op->u.arithmetic_op.p, > + op->len)) > + return ret; > + ret = cpu_op_pin_pages(op->u.arithmetic_op.p, op->len, > + vaddr_ptrs, &vaddr, 1); > + if (ret) > + return ret; > + op->u.arithmetic_op.p = vaddr; > + break; > + case CPU_OR_OP: > + case CPU_AND_OP: > + case CPU_XOR_OP: > + ret = -EFAULT; > + *expect_fault = op->u.bitwise_op.expect_fault_p; > + if (!access_ok(VERIFY_WRITE, > + (void __user *)op->u.bitwise_op.p, > + op->len)) > + return ret; > + ret = cpu_op_pin_pages(op->u.bitwise_op.p, op->len, > + vaddr_ptrs, &vaddr, 1); > + if (ret) > + return ret; > + op->u.bitwise_op.p = vaddr; > + break; > + case CPU_LSHIFT_OP: > + case CPU_RSHIFT_OP: > + ret = -EFAULT; > + *expect_fault = op->u.shift_op.expect_fault_p; > + if (!access_ok(VERIFY_WRITE, > + (void __user *)op->u.shift_op.p, > + op->len)) > + return ret; > + ret = cpu_op_pin_pages(op->u.shift_op.p, op->len, > + vaddr_ptrs, &vaddr, 1); > + if (ret) > + return ret; > + op->u.shift_op.p = vaddr; > + break; > + case CPU_MB_OP: > + break; > + default: > + return -EINVAL; > + } > + return 0; > +} > + > +static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt, > + struct cpu_opv_vaddr *vaddr_ptrs) > +{ > + int ret, i; > + bool expect_fault = false; > + > + /* Check access, pin pages. */ > + for (i = 0; i < cpuopcnt; i++) { > + ret = cpu_opv_pin_pages_op(&cpuop[i], vaddr_ptrs, > + &expect_fault); > + if (ret) > + goto error; > + } > + return 0; > + > +error: > + /* > + * If faulting access is expected, return EAGAIN to user-space. > + * It allows user-space to distinguish between a fault caused by > + * an access which is expect to fault (e.g. due to concurrent > + * unmapping of underlying memory) from an unexpected fault from > + * which a retry would not recover. 
> + */ > + if (ret == -EFAULT && expect_fault) > + return -EAGAIN; > + return ret; > +} > + > +static int __op_get(union op_fn_data *data, void *p, size_t len) > +{ > + switch (len) { > + case 1: > + data->_u8 = READ_ONCE(*(uint8_t *)p); > + break; > + case 2: > + data->_u16 = READ_ONCE(*(uint16_t *)p); > + break; > + case 4: > + data->_u32 = READ_ONCE(*(uint32_t *)p); > + break; > + case 8: > +#if (BITS_PER_LONG == 64) > + data->_u64 = READ_ONCE(*(uint64_t *)p); > +#else > + { > + data->_u64_split[0] = READ_ONCE(*(uint32_t *)p); > + data->_u64_split[1] = READ_ONCE(*((uint32_t *)p + 1)); > + } > +#endif > + break; > + default: > + return -EINVAL; > + } > + return 0; > +} > + > +static int __op_put(union op_fn_data *data, void *p, size_t len) > +{ > + switch (len) { > + case 1: > + WRITE_ONCE(*(uint8_t *)p, data->_u8); > + break; > + case 2: > + WRITE_ONCE(*(uint16_t *)p, data->_u16); > + break; > + case 4: > + WRITE_ONCE(*(uint32_t *)p, data->_u32); > + break; > + case 8: > +#if (BITS_PER_LONG == 64) > + WRITE_ONCE(*(uint64_t *)p, data->_u64); > +#else > + { > + WRITE_ONCE(*(uint32_t *)p, data->_u64_split[0]); > + WRITE_ONCE(*((uint32_t *)p + 1), data->_u64_split[1]); > + } > +#endif > + break; > + default: > + return -EINVAL; > + } > + flush_kernel_vmap_range(p, len); > + return 0; > +} > + > +/* Return 0 if same, > 0 if different, < 0 on error. */ > +static int do_cpu_op_compare(unsigned long _a, unsigned long _b, uint32_t len) > +{ > + void *a = (void *)_a; > + void *b = (void *)_b; > + union op_fn_data tmp[2]; > + int ret; > + > + switch (len) { > + case 1: > + case 2: > + case 4: > + case 8: > + if (!IS_ALIGNED(_a, len) || !IS_ALIGNED(_b, len)) > + goto memcmp; > + break; > + default: > + goto memcmp; > + } > + > + ret = __op_get(&tmp[0], a, len); > + if (ret) > + return ret; > + ret = __op_get(&tmp[1], b, len); > + if (ret) > + return ret; > + > + switch (len) { > + case 1: > + ret = !!(tmp[0]._u8 != tmp[1]._u8); > + break; > + case 2: > + ret = !!(tmp[0]._u16 != tmp[1]._u16); > + break; > + case 4: > + ret = !!(tmp[0]._u32 != tmp[1]._u32); > + break; > + case 8: > + ret = !!(tmp[0]._u64 != tmp[1]._u64); > + break; > + default: > + return -EINVAL; > + } > + return ret; > + > +memcmp: > + if (memcmp(a, b, len)) > + return 1; > + return 0; > +} > + > +/* Return 0 on success, < 0 on error. 
*/ > +static int do_cpu_op_memcpy(unsigned long _dst, unsigned long _src, > + uint32_t len) > +{ > + void *dst = (void *)_dst; > + void *src = (void *)_src; > + union op_fn_data tmp; > + int ret; > + > + switch (len) { > + case 1: > + case 2: > + case 4: > + case 8: > + if (!IS_ALIGNED(_dst, len) || !IS_ALIGNED(_src, len)) > + goto memcpy; > + break; > + default: > + goto memcpy; > + } > + > + ret = __op_get(&tmp, src, len); > + if (ret) > + return ret; > + return __op_put(&tmp, dst, len); > + > +memcpy: > + memcpy(dst, src, len); > + flush_kernel_vmap_range(dst, len); > + return 0; > +} > + > +static int op_add_fn(union op_fn_data *data, uint64_t count, uint32_t len) > +{ > + switch (len) { > + case 1: > + data->_u8 += (uint8_t)count; > + break; > + case 2: > + data->_u16 += (uint16_t)count; > + break; > + case 4: > + data->_u32 += (uint32_t)count; > + break; > + case 8: > + data->_u64 += (uint64_t)count; > + break; > + default: > + return -EINVAL; > + } > + return 0; > +} > + > +static int op_or_fn(union op_fn_data *data, uint64_t mask, uint32_t len) > +{ > + switch (len) { > + case 1: > + data->_u8 |= (uint8_t)mask; > + break; > + case 2: > + data->_u16 |= (uint16_t)mask; > + break; > + case 4: > + data->_u32 |= (uint32_t)mask; > + break; > + case 8: > + data->_u64 |= (uint64_t)mask; > + break; > + default: > + return -EINVAL; > + } > + return 0; > +} > + > +static int op_and_fn(union op_fn_data *data, uint64_t mask, uint32_t len) > +{ > + switch (len) { > + case 1: > + data->_u8 &= (uint8_t)mask; > + break; > + case 2: > + data->_u16 &= (uint16_t)mask; > + break; > + case 4: > + data->_u32 &= (uint32_t)mask; > + break; > + case 8: > + data->_u64 &= (uint64_t)mask; > + break; > + default: > + return -EINVAL; > + } > + return 0; > +} > + > +static int op_xor_fn(union op_fn_data *data, uint64_t mask, uint32_t len) > +{ > + switch (len) { > + case 1: > + data->_u8 ^= (uint8_t)mask; > + break; > + case 2: > + data->_u16 ^= (uint16_t)mask; > + break; > + case 4: > + data->_u32 ^= (uint32_t)mask; > + break; > + case 8: > + data->_u64 ^= (uint64_t)mask; > + break; > + default: > + return -EINVAL; > + } > + return 0; > +} > + > +static int op_lshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len) > +{ > + switch (len) { > + case 1: > + data->_u8 <<= (uint8_t)bits; > + break; > + case 2: > + data->_u16 <<= (uint16_t)bits; > + break; > + case 4: > + data->_u32 <<= (uint32_t)bits; > + break; > + case 8: > + data->_u64 <<= (uint64_t)bits; > + break; > + default: > + return -EINVAL; > + } > + return 0; > +} > + > +static int op_rshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len) > +{ > + switch (len) { > + case 1: > + data->_u8 >>= (uint8_t)bits; > + break; > + case 2: > + data->_u16 >>= (uint16_t)bits; > + break; > + case 4: > + data->_u32 >>= (uint32_t)bits; > + break; > + case 8: > + data->_u64 >>= (uint64_t)bits; > + break; > + default: > + return -EINVAL; > + } > + return 0; > +} > + > +/* Return 0 on success, < 0 on error. */ > +static int do_cpu_op_fn(op_fn_t op_fn, unsigned long _p, uint64_t v, > + uint32_t len) > +{ > + union op_fn_data tmp; > + void *p = (void *)_p; > + int ret; > + > + ret = __op_get(&tmp, p, len); > + if (ret) > + return ret; > + ret = op_fn(&tmp, v, len); > + if (ret) > + return ret; > + ret = __op_put(&tmp, p, len); > + if (ret) > + return ret; > + return 0; > +} > + > +/* > + * Return negative value on error, positive value if comparison > + * fails, 0 on success. 
> +/*
> + * Return negative value on error, positive value if comparison
> + * fails, 0 on success.
> + */
> +static int __do_cpu_opv_op(struct cpu_op *op)
> +{
> +	/* Guarantee a compiler barrier between each operation. */
> +	barrier();
> +
> +	switch (op->op) {
> +	case CPU_COMPARE_EQ_OP:
> +		return do_cpu_op_compare(op->u.compare_op.a,
> +					 op->u.compare_op.b,
> +					 op->len);
> +	case CPU_COMPARE_NE_OP:
> +	{
> +		int ret;
> +
> +		ret = do_cpu_op_compare(op->u.compare_op.a,
> +					op->u.compare_op.b,
> +					op->len);
> +		if (ret < 0)
> +			return ret;
> +		/*
> +		 * Stop execution, return positive value if comparison
> +		 * is identical.
> +		 */
> +		if (ret == 0)
> +			return 1;
> +		return 0;
> +	}
> +	case CPU_MEMCPY_OP:
> +		return do_cpu_op_memcpy(op->u.memcpy_op.dst,
> +					op->u.memcpy_op.src,
> +					op->len);
> +	case CPU_ADD_OP:
> +		return do_cpu_op_fn(op_add_fn, op->u.arithmetic_op.p,
> +				    op->u.arithmetic_op.count, op->len);
> +	case CPU_OR_OP:
> +		return do_cpu_op_fn(op_or_fn, op->u.bitwise_op.p,
> +				    op->u.bitwise_op.mask, op->len);
> +	case CPU_AND_OP:
> +		return do_cpu_op_fn(op_and_fn, op->u.bitwise_op.p,
> +				    op->u.bitwise_op.mask, op->len);
> +	case CPU_XOR_OP:
> +		return do_cpu_op_fn(op_xor_fn, op->u.bitwise_op.p,
> +				    op->u.bitwise_op.mask, op->len);
> +	case CPU_LSHIFT_OP:
> +		return do_cpu_op_fn(op_lshift_fn, op->u.shift_op.p,
> +				    op->u.shift_op.bits, op->len);
> +	case CPU_RSHIFT_OP:
> +		return do_cpu_op_fn(op_rshift_fn, op->u.shift_op.p,
> +				    op->u.shift_op.bits, op->len);
> +	case CPU_MB_OP:
> +		/* Memory barrier provided by this operation. */
> +		smp_mb();
> +		return 0;
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
> +{
> +	int i, ret;
> +
> +	for (i = 0; i < cpuopcnt; i++) {
> +		ret = __do_cpu_opv_op(&cpuop[i]);
> +		/* If comparison fails, stop execution and return index + 1. */
> +		if (ret > 0)
> +			return i + 1;
> +		/* On error, stop execution. */
> +		if (ret < 0)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
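Note the return convention implemented by __do_cpu_opv(): 0 means the whole
vector executed, a negative value is an error, and a positive value N means the
comparison at index N - 1 failed and execution stopped there. A thin user-space
wrapper would therefore look roughly like this (sketch; __NR_cpu_opv comes from
the syscall number wired up elsewhere in this series, and the usual syscall(2)
convention of returning -1 with errno set applies):

	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	static int cpu_opv(struct cpu_op *ops, int nr_ops, int cpu, int flags)
	{
		int ret = syscall(__NR_cpu_opv, ops, nr_ops, cpu, flags);

		if (ret < 0)
			return -errno;	/* e.g. -EINVAL, -EFAULT, -ENOMEM */
		return ret;		/* 0: done, N > 0: ops[N - 1] compare failed */
	}
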
> +/*
> + * Check that the page pointers pinned by get_user_pages_fast()
> + * are still in the page table. Invoked with mmap_sem held.
> + * Return 0 if pointers match, -EAGAIN if they don't.
> + */
> +static int vaddr_check(struct vaddr *vaddr)
> +{
> +	struct page *pages[2];
> +	int ret, n;
> +
> +	ret = __get_user_pages_fast(vaddr->uaddr, vaddr->nr_pages,
> +				    vaddr->write, pages);
> +	for (n = 0; n < ret; n++)
> +		put_page(pages[n]);
> +	if (ret < vaddr->nr_pages) {
> +		ret = get_user_pages(vaddr->uaddr, vaddr->nr_pages,
> +				     vaddr->write ? FOLL_WRITE : 0,
> +				     pages, NULL);
> +		if (ret < 0)
> +			return -EAGAIN;
> +		for (n = 0; n < ret; n++)
> +			put_page(pages[n]);
> +		if (ret < vaddr->nr_pages)
> +			return -EAGAIN;
> +	}
> +	for (n = 0; n < vaddr->nr_pages; n++) {
> +		if (pages[n] != vaddr->pages[n])
> +			return -EAGAIN;
> +	}
> +	return 0;
> +}
> +
> +static int vaddr_ptrs_check(struct cpu_opv_vaddr *vaddr_ptrs)
> +{
> +	int i;
> +
> +	for (i = 0; i < vaddr_ptrs->nr_vaddr; i++) {
> +		int ret;
> +
> +		ret = vaddr_check(&vaddr_ptrs->addr[i]);
> +		if (ret)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt,
> +		      struct cpu_opv_vaddr *vaddr_ptrs, int cpu)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int ret;
> +
> +retry:
> +	if (cpu != raw_smp_processor_id()) {
> +		ret = push_task_to_cpu(current, cpu);
> +		if (ret)
> +			goto check_online;
> +	}
> +	down_read(&mm->mmap_sem);
> +	ret = vaddr_ptrs_check(vaddr_ptrs);
> +	if (ret)
> +		goto end;
> +	preempt_disable();
> +	if (cpu != smp_processor_id()) {
> +		preempt_enable();
> +		up_read(&mm->mmap_sem);
> +		goto retry;
> +	}
> +	ret = __do_cpu_opv(cpuop, cpuopcnt);
> +	preempt_enable();
> +end:
> +	up_read(&mm->mmap_sem);
> +	return ret;
> +
> +check_online:
> +	if (!cpu_possible(cpu))
> +		return -EINVAL;
> +	get_online_cpus();
> +	if (cpu_online(cpu)) {
> +		put_online_cpus();
> +		goto retry;
> +	}
> +	/*
> +	 * CPU is offline. Perform operation from the current CPU with
> +	 * cpu_online read lock held, preventing that CPU from coming online,
> +	 * and with mutex held, providing mutual exclusion against other
> +	 * CPUs also finding out about an offline CPU.
> +	 */
> +	down_read(&mm->mmap_sem);
> +	ret = vaddr_ptrs_check(vaddr_ptrs);
> +	if (ret)
> +		goto offline_end;
> +	mutex_lock(&cpu_opv_offline_lock);
> +	ret = __do_cpu_opv(cpuop, cpuopcnt);
> +	mutex_unlock(&cpu_opv_offline_lock);
> +offline_end:
> +	up_read(&mm->mmap_sem);
> +	put_online_cpus();
> +	return ret;
> +}
> +
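As the comment below states, user-space passes the CPU number it wants to
target, normally the CPU it believes it is currently running on, since the
operation pointers must reference that CPU's per-CPU data. A typical caller
therefore re-reads the CPU number and re-prepares the vector on every attempt,
along these lines (sketch building on the cpu_opv() wrapper above; prepare_ops()
is a placeholder for whatever rebuilds the vector from freshly read per-CPU
state):

	for (;;) {
		int cpu = sched_getcpu();	/* or the rseq TLS cpu_id field */
		int nr_ops = prepare_ops(ops, cpu);
		int ret = cpu_opv(ops, nr_ops, cpu, 0);

		if (ret <= 0)
			return ret;	/* 0: success, < 0: hard error */
		/* A comparison failed: state changed, loop and re-prepare. */
	}
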
> +/*
> + * cpu_opv - execute operation vector on a given CPU with preempt off.
> + *
> + * Userspace should pass current CPU number as parameter.
> + */
> +SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
> +		int, cpu, int, flags)
> +{
> +	struct vaddr vaddr_on_stack[NR_VADDR_ON_STACK];
> +	struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
> +	struct cpu_opv_vaddr vaddr_ptrs = {
> +		.addr = vaddr_on_stack,
> +		.nr_vaddr = 0,
> +		.is_kmalloc = false,
> +	};
> +	int ret, i, nr_vaddr = 0;
> +	bool retry = false;
> +
> +	if (unlikely(flags))
> +		return -EINVAL;
> +	if (unlikely(cpu < 0))
> +		return -EINVAL;
> +	if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
> +		return -EINVAL;
> +	if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
> +		return -EFAULT;
> +	ret = cpu_opv_check(cpuopv, cpuopcnt, &nr_vaddr);
> +	if (ret)
> +		return ret;
> +	if (nr_vaddr > NR_VADDR_ON_STACK) {
> +		vaddr_ptrs.addr = cpu_op_alloc_vaddr_vector(nr_vaddr);
> +		if (!vaddr_ptrs.addr) {
> +			ret = -ENOMEM;
> +			goto end;
> +		}
> +		vaddr_ptrs.is_kmalloc = true;
> +	}
> +again:
> +	ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &vaddr_ptrs);
> +	if (ret)
> +		goto end;
> +	ret = do_cpu_opv(cpuopv, cpuopcnt, &vaddr_ptrs, cpu);
> +	if (ret == -EAGAIN)
> +		retry = true;
> +end:
> +	for (i = 0; i < vaddr_ptrs.nr_vaddr; i++) {
> +		struct vaddr *vaddr = &vaddr_ptrs.addr[i];
> +		int j;
> +
> +		vm_unmap_ram((void *)vaddr->mem, vaddr->nr_pages);
> +		for (j = 0; j < vaddr->nr_pages; j++) {
> +			if (vaddr->write)
> +				set_page_dirty(vaddr->pages[j]);
> +			put_page(vaddr->pages[j]);
> +		}
> +	}
> +	if (retry) {
> +		retry = false;
> +		vaddr_ptrs.nr_vaddr = 0;
> +		goto again;
> +	}
> +	if (vaddr_ptrs.is_kmalloc)
> +		kfree(vaddr_ptrs.addr);
> +	return ret;
> +}
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index bfa1ee1bf669..59e622296dc3 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -262,3 +262,4 @@ cond_syscall(sys_pkey_free);
>  
>  /* restartable sequence */
>  cond_syscall(sys_rseq);
> +cond_syscall(sys_cpu_opv);
> -- 
> 2.11.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com