Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751520AbbEZVS6 (ORCPT ); Tue, 26 May 2015 17:18:58 -0400 Received: from mail-la0-f47.google.com ([209.85.215.47]:36543 "EHLO mail-la0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751188AbbEZVSz (ORCPT ); Tue, 26 May 2015 17:18:55 -0400 MIME-Version: 1.0 In-Reply-To: <821493560.8531.1432674243321.JavaMail.zimbra@efficios.com> References: <1432219487-13364-1-git-send-email-mathieu.desnoyers@efficios.com> <757752240.6470.1432330487312.JavaMail.zimbra@efficios.com> <1839774559.6579.1432400944032.JavaMail.zimbra@efficios.com> <1184354091.7499.1432578613872.JavaMail.zimbra@efficios.com> <821493560.8531.1432674243321.JavaMail.zimbra@efficios.com> From: Andy Lutomirski Date: Tue, 26 May 2015 14:18:32 -0700 Message-ID: Subject: Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections To: Mathieu Desnoyers Cc: Andi Kleen , Borislav Petkov , "H. Peter Anvin" , Lai Jiangshan , Ben Maurer , "Paul E. McKenney" , Ingo Molnar , Josh Triplett , Andrew Morton , Michael Kerrisk , Linux API , Linux Kernel , Paul Turner , Peter Zijlstra , Linus Torvalds , Steven Rostedt , Andrew Hunter Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4677 Lines: 109 On Tue, May 26, 2015 at 2:04 PM, Mathieu Desnoyers wrote: > ----- Original Message ----- >> On May 25, 2015 11:54 AM, "Andy Lutomirski" wrote: > [...] >> I might be guilty of being too x86-centric here. On x86, as long as >> the lock and unlock primitives are sufficiently atomic, everything >> should be okay. On other architectures, though, a primitive that >> gives lock, unlock, and abort of a per-cpu lock without checking that >> you're still on the right cpu at unlock time may not be sufficient. >> If the primitive is implemented purely with loads and stores, then >> even if you take the lock, migrate, finish your work, and unlock >> without anyone else contending from the lock (and hence don't abort), >> the next thread to take the same lock will end up unsynchronized >> unless there's appropriate memory ordering. For example, if taking >> the lock were an acquire and unlocking were a release, we'd be fine. > > If we look at x86-TSO, where stores reordered after loads is pretty > much the only thing we need to care about, we have: > > CPU 0 CPU 1 > > load, test lock[1] > store lock[1] > loads/stores to data[1] > [preempted, migrate to cpu 0] > [run other thread] > load, test lock[1] (loop) > loads/stores to data[1] (A) > store 0 to lock[1] (B) > load, test lock[1] (C) > store lock[1] > loads/stores to data[1] (D) > > What we want to ensure is that load/stores to data[1] from CPU 0 and > 1 don't get interleaved. > > On CPU 0 doing unlock, loads and stores to data[1] (A) cannot be reordered > against the store to lock[1] (B) due to TSO (all we care about is stores > reordered after loads). > > On CPU 1 grabbing the 2nd lock, we care about loads/store from/to > data[1] (C) being reordered before load, test of lock[1] (C). Again, > thanks to TSO, loads/stores from/to data[1] (D) cannot be reordered > wrt (C). > > So on x86-TSO we should be OK, but as you say, we need to be careful > to have the right barriers in place on other architectures. > >> >> Your RFC design certainly works (in principle -- I haven't looked at >> the code in detail), but I can't shake the feeling that it's overkill > > If we can find a simpler way to achieve the same result, I'm all ears :) > >> and that it could be improved to avoid unnecessary aborts every time >> the lock holder is scheduled out. > > The restart is only performed if preemption occurs when the thread is > trying to grab the lock. Once the thread has the lock (it's now the > lock holder), it will not be aborted even if it is migrated. Then maybe this does have the same acquire/release issue after all. > >> >> This isn't a problem in your RFC design, but if we wanted to come up >> with tighter primitives, we'd have to be quite careful to document >> exactly what memory ordering guarantees they come with. > > Indeed. > >> >> It may be that all architectures for which you care about the >> performance boost already have efficient acquire and release >> operations. Certainly x86 does, and I don't know how fast the new ARM >> instructions are, but I imagine they're pretty good. > > We may even need memory barriers around lock and unlock on ARM. :/ ARM > smp_store_release() and smp_load_acquire() currently implements those with > smp_mb(). I would have expected it to use LDA/STL on new enough cpus: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/CIHDEACA.html > >> >> It's too bad that not all architectures have a single-instruction >> unlocked compare-and-exchange. > > Based on my benchmarks, it's not clear that single-instruction > unlocked CAS is actually faster than doing the same with many > instructions. True, but with a single instruction the user can't get preempted in the middle. Looking at your code, it looks like percpu_user_sched_in has some potentially nasty issues with page faults. Avoiding touching user memory from the scheduler would be quite nice from an implementation POV, and the x86-specific gs hack wins in that regard. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/