Subject: Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
From: Andrew Hunter
To: Andy Lutomirski
Cc: Michael Kerrisk, Mathieu Desnoyers, Paul Turner, Ben Maurer, Linux Kernel, Peter Zijlstra, Ingo Molnar, Steven Rostedt, "Paul E. McKenney", Josh Triplett, Lai Jiangshan, Linus Torvalds, Andrew Morton, Linux API
Date: Fri, 22 May 2015 15:06:47 -0700
In-Reply-To / References: <1432219487-13364-1-git-send-email-mathieu.desnoyers@efficios.com>

On Fri, May 22, 2015 at 1:53 PM, Andy Lutomirski wrote:
> Create an array of user-managed locks, one per cpu. Call them lock[i]
> for 0 <= i < ncpus.
>
> To acquire, look up your CPU number. Then, atomically, check that
> lock[cpu] isn't held and, if so, mark it held and record both your tid
> and your lock acquisition count. If you learn that the lock *was*
> held after all, signal the holder (with kill or your favorite other
> mechanism), telling it which lock acquisition count is being aborted.
> Then atomically steal the lock, but only if the lock acquisition count
> hasn't changed.

We had to deploy the userspace percpu API (percpu sharded locks, {double,}compare-and-swap, atomic increment, etc.) universally across the fleet without waiting for 100% kernel penetration, not to mention wanting to be able to disable the kernel acceleration in case of kernel bugs.
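For concreteness, the scheme Andy describes above can be sketched in userspace C11 roughly as below. This is only an illustration, not anyone's actual implementation: the 64-bit tid+count packing, the struct, and the function names are all hypothetical, and the "signal the holder with kill" step is elided to a comment since delivering and handling the abort signal is the hard part.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical encoding: one 64-bit word per cpu, holding the owner's
 * tid (0 = unlocked) in the high half and that owner's lock-acquisition
 * count in the low half. */
struct percpu_lock { _Atomic uint64_t word; };

static uint64_t pack(uint32_t tid, uint32_t count)
{
    return ((uint64_t)tid << 32) | count;
}

/* Try to take a per-cpu lock.  Returns 1 on success, 0 if the steal
 * raced with the holder making progress (caller would retry). */
static int percpu_lock_acquire(struct percpu_lock *l,
                               uint32_t tid, uint32_t my_count)
{
    uint64_t old = 0;
    if (atomic_compare_exchange_strong(&l->word, &old,
                                       pack(tid, my_count)))
        return 1;               /* lock was free: clean acquire */

    /* Lock was held: here we would signal the holder (kill/tgkill),
     * telling it which acquisition count is being aborted, then steal
     * the lock -- but only if the word (tid + count) is unchanged,
     * i.e. the holder didn't complete and re-acquire in the meantime.
     * After the failed CAS, 'old' holds the observed holder word. */
    return atomic_compare_exchange_strong(&l->word, &old,
                                          pack(tid, my_count));
}
```

The key property is that both the clean acquire and the steal are single compare-and-swaps against the full word, so a holder that finished its critical section (bumping its count or releasing) makes any in-flight steal fail harmlessly.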
(Since this is mostly used in core userspace infrastructure -- malloc, various statistics platforms, etc. -- checking for availability isn't feasible. The primitives have to work 100% of the time, or they would be too complicated for our developers to bother using.)

So we did basically this (without the lock stealing...): we have a single per-cpu spin lock manipulated with atomics, which we take very briefly to implement (e.g.) compare-and-swap. The performance is hugely worse: typical overheads are in the 10x range _without_ any on-cpu contention. Uncontended atomics are much cheaper than they were on pre-Nehalem chips, but they still can't hold a candle to unsynchronized instructions.

As a fallback path for userspace, this is fine -- if 5% of binaries on busted kernels aren't quite as fast, we can work with that in exchange for being able to write a percpu op without worrying about what to do on -ENOSYS. But it's just not fast enough to compete as the intended way to do things.

AHH