Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757594AbcDGUzt (ORCPT ); Thu, 7 Apr 2016 16:55:49 -0400
Received: from mail.efficios.com ([78.47.125.74]:47621 "EHLO mail.efficios.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1756577AbcDGUzr (ORCPT ); Thu, 7 Apr 2016 16:55:47 -0400
Date: Thu, 7 Apr 2016 20:55:41 +0000 (UTC)
From: Mathieu Desnoyers
To: Andi Kleen
Cc: Linus Torvalds, Peter Zijlstra, Florian Weimer, "H. Peter Anvin",
	Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	Linux Kernel Mailing List, linux-api, Paul Turner, Andrew Hunter,
	Andy Lutomirski, Dave Watson, Chris Lameter, Ben Maurer, rostedt,
	"Paul E. McKenney", Josh Triplett, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Boqun Feng
Message-ID: <1353194988.49705.1460062541205.JavaMail.zimbra@efficios.com>
In-Reply-To: <20160407202232.GF9407@two.firstfloor.org>
References: <1459789313-4917-1-git-send-email-mathieu.desnoyers@efficios.com>
	<20160407103158.GP3430@twins.programming.kicks-ass.net>
	<570638D9.7010108@redhat.com>
	<20160407111938.GR3430@twins.programming.kicks-ass.net>
	<1025228632.49344.1460054592801.JavaMail.zimbra@efficios.com>
	<20160407202232.GF9407@two.firstfloor.org>
Subject: Re: [RFC PATCH v6 1/5] Thread-local ABI system call: cache CPU number of running thread
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [78.47.125.74]
X-Mailer: Zimbra 8.6.0_GA_1178 (ZimbraWebClient - FF45 (Linux)/8.6.0_GA_1178)
Thread-Topic: Thread-local ABI system call: cache CPU number of running thread
Thread-Index: qWcggJdraAMeHkg9rcM/5t3A9ZuBRA==
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 2668
Lines: 63

----- On Apr 7, 2016, at 4:22 PM, Andi Kleen andi@firstfloor.org wrote:

>> One basic use of cpu id cache is to speed up the sched_getcpu(3)
>> implementation in glibc.
>> This is why I'm proposing it as a stand-alone
>
> I don't think rseq is needed for faster getcpu.

I agree that rseq is not needed for faster getcpu. This is why I was
proposing to make the "cpu_id" feature configurable separately from the
rseq feature. E.g. a kernel configuration that doesn't want to take the
hit of rseq handling in signal delivery and preemption could just enable
the cpu_id feature, and thus only need to add work in the migration code
path and when returning to userspace. Also, if a thread only registers
the cpu_id feature, the kernel can skip the rseq code quickly in signal
delivery and preemption too.

>
> User space has to be able handle stale return values anyways, as it
> has no way to lock itself to a cpu while it is using the return value.
> So it can be only a hint.
>
> The original version of getcpu just had a jiffies based cache. The CPU
> value was valid up to a jiffie (the next time jiffie changes), and then it
> gets looked up again.
>
> Processes are unlikely to switch CPUs more often than a jiffie, so it's
> good enough as a hint.

One example use-case where this would hurt: we use the CPU id heavily
when tracing to a ring buffer in user-space. Having one event written
into the wrong buffer once in a while is not a big deal, but tracing a
whole burst of events within a jiffy (e.g. 4ms at 250Hz) to the wrong
cpu buffer whenever the thread migrates is really an unwanted
side-effect latency-wise.

>
> This doesn't need any new kernel interfaces at all because jiffies is already
> exported to the vdso.

My understanding is that your assumptions about the availability of
those features in the vdso are true for x86 32/64, but do not currently
apply to ARM32. ARM32 is my main target architecture for the CPU id
cache work. x86 32/64 simply happen to benefit from that work too (see
my benchmark numbers in the changelog of patch 1/5).

> It just needs a new entry point into the vdso that handles the jiffie
> check.
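
The jiffies-based check quoted above can be sketched roughly as follows.
This is a minimal user-space simulation, not vDSO code: fake_jiffies,
fake_current_cpu, slow_getcpu() and struct cpu_hint_cache are all
hypothetical names introduced only for illustration, and the tick
counter is a plain variable rather than anything exported by the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the tick counter and the slow CPU lookup. */
static unsigned long fake_jiffies;   /* simulated tick counter */
static int fake_current_cpu;         /* simulated CPU the thread runs on */
static int slow_lookups;             /* how many times the slow path ran */

/* Simulated expensive lookup (a system call in a real implementation). */
static int slow_getcpu(void)
{
	slow_lookups++;
	return fake_current_cpu;
}

struct cpu_hint_cache {
	unsigned long stamp;  /* tick value when cpu was cached */
	int cpu;              /* cached CPU number, -1 if never filled */
};

/*
 * Return the cached CPU number while the tick counter is unchanged,
 * and refresh it through the slow path once the tick has moved on.
 */
static int cached_getcpu(struct cpu_hint_cache *c)
{
	unsigned long now = fake_jiffies;

	assert(c != NULL);
	if (c->cpu < 0 || c->stamp != now) {
		c->cpu = slow_getcpu();
		c->stamp = now;
	}
	return c->cpu;
}
```

In this sketch, if fake_current_cpu changes while fake_jiffies stays the
same, cached_getcpu() keeps returning the old CPU number until the next
tick. That is the window in which a burst of trace events would all land
in the wrong per-cpu buffer.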

This would likely require extending the ARM vdso page to expose the
jiffies counter to user-space, and updating user-space libraries to use
this counter in sched_getcpu. But it would still be slower than the
cpu_id cache I propose, due to the required function call to
sched_getcpu, unless you want to open-code the jiffies check within all
applications as an ABI. It would also be bad for fast bursts of cpu id
use (e.g. per-cpu ring buffers).

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
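
As a rough illustration of the per-cpu ring buffer case mentioned in the
message, the sketch below picks a buffer using a thread-local CPU number
read with a plain load instead of a function call. It is a simulation
under stated assumptions: cpu_id_cache, simulate_migration(),
trace_event() and the buffer layout are hypothetical names, not the
interface from the patch series, and the "kernel update on migration" is
faked by an ordinary store.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS   4
#define BUF_SLOTS 8

/* Hypothetical per-thread word that the kernel would update on migration. */
static _Thread_local volatile int cpu_id_cache;

struct ring_buffer {
	int events[BUF_SLOTS];
	size_t head;          /* total number of events written so far */
};

static struct ring_buffer per_cpu_buf[NR_CPUS];

/* Stand-in for the kernel rewriting the cached value when the thread moves. */
static void simulate_migration(int new_cpu)
{
	cpu_id_cache = new_cpu;
}

/*
 * Record one event in the buffer of the CPU named by the cached value.
 * Returns the CPU number that was used. The read is a single load from
 * thread-local storage, so it can be inlined at every trace point.
 */
static inline int trace_event(int payload)
{
	int cpu = cpu_id_cache;
	struct ring_buffer *rb;

	assert(cpu >= 0 && cpu < NR_CPUS);
	rb = &per_cpu_buf[cpu];
	rb->events[rb->head % BUF_SLOTS] = payload;
	rb->head++;
	return cpu;
}
```

Because the cached value changes as soon as the simulated migration
happens, the next trace_event() call already targets the new buffer. The
sketch does not make the buffer update itself safe against migration
between the load and the store; it only shows the cost of obtaining the
CPU number.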