Received: by 10.213.65.68 with SMTP id h4csp688858imn; Wed, 28 Mar 2018 10:56:52 -0700 (PDT) X-Google-Smtp-Source: AIpwx48aPvum2GD1LnFax4q+6bla3XPcIB9ChIb9jbHBo0fCLq2fEMDiMNyMlEUHC1WFXxOuE8qI X-Received: by 10.101.68.82 with SMTP id e18mr3187055pgq.329.1522259812489; Wed, 28 Mar 2018 10:56:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1522259812; cv=none; d=google.com; s=arc-20160816; b=U6kqJuhQpsN3usjU2pfO3VqI4MKVIQtQd5dqmZ9YvHnTI4DieOANOOnJNI4/cnQG9e jUiDBU/rqz6PKhQQn9paa8aa6GeHZJE9fa2sjiFMj66cf65e1ih9ygCGM/IcLe/eL83E JA9Kjfp6HisKAa70XlMST6g7xlx+2uRM9KIbWKwiJVUfiY2WSs/d9xbtwf1s1MB7JtOq JKnQkwhQCHcRXKYin/xIqcVIoga2PM0y+is03SrWvYKUQqGxiD2tW+PVm6A7jpm4rlKr v2IE95isrJOblSbXV8lGKAiVFgjgpOGTLMtAPuucfqH1ZBeXAqyi8nUctKOAENqTZpU2 AnLw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:thread-index:thread-topic :content-transfer-encoding:mime-version:subject:references :in-reply-to:message-id:cc:to:from:date:arc-authentication-results; bh=NkBUhU88pwap1CA+XlrMCNzhW8K0jlV2BvfDFfBZy1M=; b=uvfF/yigFW8AXUJHrqHkOC9Ar+KEbCME2t//ALAfl2/TLOpiR5nYjsfq22kov967Db Yv8KaDL2xfSzT1kd9+jSnIMiSclPCVA7fMgrKJsYkmCb1PeUaG/Sup68UdprXWStgGbh wbWIZyaPcBcXTxA1SZi+O33gFpmhniw2TQ3PZsi4yfmFCIgJsR/gdttIVzt+6Y0CazxK eogjd58lSICHE4OWEZ0m1gB0yWsg304ltEIXaCxK2muHDSTICAA6Frg1/vWQxZOx8gQP Mdk+VCW7tkAO2fT84e31j5P1VCeqpVwTKp9Gc0W4FHdVB60NwDFIe9QZ+L5JICsyYynE mDug== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id v10-v6si3889660ply.681.2018.03.28.10.56.37; Wed, 28 Mar 2018 10:56:52 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753389AbeC1Ryo (ORCPT + 99 others); Wed, 28 Mar 2018 13:54:44 -0400 Received: from mail.efficios.com ([167.114.142.138]:35354 "EHLO mail.efficios.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753349AbeC1Rym (ORCPT ); Wed, 28 Mar 2018 13:54:42 -0400 Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id 225401A8A3A; Wed, 28 Mar 2018 13:54:42 -0400 (EDT) Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail02.efficios.com [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id H-B4tO8k8aQd; Wed, 28 Mar 2018 13:54:41 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id A3A961A8A36; Wed, 28 Mar 2018 13:54:41 -0400 (EDT) X-Virus-Scanned: amavisd-new at efficios.com Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail02.efficios.com [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id XzcrlCWPUzGH; Wed, 28 Mar 2018 13:54:41 -0400 (EDT) Received: from mail02.efficios.com (mail02.efficios.com [167.114.142.138]) by mail.efficios.com (Postfix) with ESMTP id 836661A8A1F; Wed, 28 Mar 2018 13:54:41 -0400 (EDT) Date: Wed, 28 Mar 2018 13:54:41 -0400 (EDT) From: Mathieu Desnoyers To: Peter Zijlstra Cc: "Paul E. McKenney" , Boqun Feng , Andy Lutomirski , Dave Watson , linux-kernel , linux-api , Paul Turner , Andrew Morton , Russell King , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Andrew Hunter , Andi Kleen , Chris Lameter , Ben Maurer , rostedt , Josh Triplett , Linus Torvalds , Catalin Marinas , Will Deacon , Michael Kerrisk Message-ID: <1109208604.169.1522259681295.JavaMail.zimbra@efficios.com> In-Reply-To: <20180328152203.GW4043@hirez.programming.kicks-ass.net> References: <20180327160542.28457-1-mathieu.desnoyers@efficios.com> <20180327160542.28457-11-mathieu.desnoyers@efficios.com> <20180328152203.GW4043@hirez.programming.kicks-ass.net> Subject: Re: [RFC PATCH for 4.17 10/21] cpu_opv: Provide cpu_opv system call (v6) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [167.114.142.138] X-Mailer: Zimbra 8.8.7_GA_1964 (ZimbraWebClient - FF52 (Linux)/8.8.7_GA_1964) Thread-Topic: cpu_opv: Provide cpu_opv system call (v6) Thread-Index: 8s+ruNV5RmeQN6e3c0jumjOKFuybzg== Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org ----- On Mar 28, 2018, at 11:22 AM, Peter Zijlstra peterz@infradead.org wrote: > On Tue, Mar 27, 2018 at 12:05:31PM -0400, Mathieu Desnoyers wrote: > >> 1) Allow algorithms to perform per-cpu data migration without relying on >> sched_setaffinity() >> >> The use-cases are migrating memory between per-cpu memory free-lists, or >> stealing tasks from other per-cpu work queues: each require that >> accesses to remote per-cpu data structures are performed. > > I think that one completely reduces to the per-cpu (spin)lock case, > right? Because, as per the below, your logging case (8) can 'easily' be > done without the cpu_opv monstrosity. > > And if you can construct a per-cpu lock, that can be used to construct > aribtrary logic. The per-cpu spinlock does not have the same performance characteristics as lock-free alternatives for various operations. A rseq compare-and-store is faster than a rseq spinlock for linked-list operations. > > And the difficult case for the per-cpu lock is the remote acquire; all > the other cases are (relatively) trivial. > > I've not really managed to get anything sensible to work, I've tried > several variations of split lock, but you invariably end up with > barriers in the fast (local) path, which sucks. > > But I feel this should be solvable without cpu_opv. As in, I really hate > that thing ;-) I have not developed cpu_opv out of any kind of love for that solution. I just realized that it did solve all my issues after failing for quite some time to implement acceptable solutions for the remote access problem, and for ensuring progress of single-stepping with current debuggers that don't know about the rseq_table section. > >> 8) Allow libraries with multi-part algorithms to work on same per-cpu >> data without affecting the allowed cpu mask >> >> The lttng-ust tracer presents an interesting use-case for per-cpu >> buffers: the algorithm needs to update a "reserve" counter, serialize >> data into the buffer, and then update a "commit" counter _on the same >> per-cpu buffer_. Using rseq for both reserve and commit can bring >> significant performance benefits. >> >> Clearly, if rseq reserve fails, the algorithm can retry on a different >> per-cpu buffer. However, it's not that easy for the commit. It needs to >> be performed on the same per-cpu buffer as the reserve. >> >> The cpu_opv system call solves that problem by receiving the cpu number >> on which the operation needs to be performed as argument. It can push >> the task to the right CPU if needed, and perform the operations there >> with preemption disabled. >> >> Changing the allowed cpu mask for the current thread is not an >> acceptable alternative for a tracing library, because the application >> being traced does not expect that mask to be changed by libraries. > > We talked about this use-case, and it can be solved without cpu_opv if > you keep a dual commit counter, one local and one (atomic) remote. Right. > > We retain the cpu_id from the first rseq, and the second part will, when > it (unlikely) finds it runs remotely, do an atomic increment on the > remote counter. The consumer of the counter will then have to sum both > the local and remote counter parts. Yes, I did a prototype of this specific case with split-counters a while ago. However, if we need cpu_opv as fallback for other reasons (e.g. remote accesses), then the split-counters are not needed, and there is no need to change the layout of user-space data to accommodate the extra per-cpu counter. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com