Date: Sun, 3 Apr 2016 12:05:20 +0200 (CEST)
From: Thomas Gleixner
To: Rasmus Villemoes
Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra, Ingo Molnar,
    Michael Kerrisk, Davidlohr Bueso, Chris Mason, Carlos O'Donell,
    Torvald Riegel, Eric Dumazet
Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sun, 3 Apr 2016, Rasmus Villemoes wrote:
> On Sat, Apr 02 2016, Thomas Gleixner wrote:
>
> > The standard futex mechanism in the Linux kernel uses a global hash to store
> > transient state. Collisions on that hash can lead to performance degradation
> > and on real-time enabled kernels even to priority inversions.
> >
> > To guarantee futexes without collisions on the global kernel hash, we provide
> > a mechanism to attach to a futex. This creates futex private state which
> > avoids hash collisions and on NUMA systems also cross node memory access.
>
> Hi,
>
> A few minor comments inline below, and a question about the design:
>
> How is an application supposed to handle it when the kernel fails to
> achieve the no-collision goal? With any reasonable upper bound on the
> size of the local hash table (which of course has to be there, whether
> sysctl'ed or not), and regardless of the hashing scheme used, it
> seems inevitable that someone is going to get -ENOSPC when trying to
> attach. Moreover, since different threads can attach to different sets
> of futexes, one thread may successfully attach to a futex, while another
> fails - the second thread is then permanently prevented from operating
> on that futex (?).

Yes. There is not much we can do about that except adding it to the
documentation.

> Why not use some sort of tree instead? Or fall back to a traditional
> chained hash table once we're no longer allowed to increase the table
> size? Of course these have worse lookup performance, and maybe failing
> the attach in the rare case is better than penalizing the common case,
> but it would be nice to have some mention of this in the change log.

The lookup performance is critical for the futex ops (wait/wake ....)
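Just to spell that out: with a collision free per task table the lookup on
the wait/wake path is one hash plus one compare. Roughly like this (toy
sketch, not the code from this series - hash_local_futex() here has a
made-up body and the slot layout is invented for the example):

#include <stddef.h>
#include <stdint.h>

struct futex_state;				/* opaque per-futex state */

struct futex_cache_slot {
	uint32_t		*uaddr;		/* user space futex address */
	struct futex_state	*state;		/* attached state, NULL if free */
};

/* Stand-in for the real hash_local_futex(): mod prime over the word offset */
static unsigned int hash_local_futex(uint32_t *uaddr, unsigned int prime)
{
	return ((uintptr_t)uaddr / sizeof(uint32_t)) % prime;
}

/* Collision free fast path: one hash, one compare, done. */
static struct futex_state *lookup_attached(struct futex_cache_slot *slots,
					   unsigned int prime, uint32_t *uaddr)
{
	struct futex_cache_slot *s = &slots[hash_local_futex(uaddr, prime)];

	return s->uaddr == uaddr ? s->state : NULL;
}

A chained hash or a tree turns that single compare into a pointer chase,
which is exactly what we want to avoid on the hot path.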
> Alternatively [this is not really thought through], maybe one could move
> the decision and the complexity to userspace: On successful FUTEX_ATTACH,
> return an index into a small per-task array of struct futex_state*. On
> subsequent FUTEX_ATTACHED operations on that futex, userspace passes in
> this index somehow (either instead of uaddr, in which case the kernel
> array would have to include this in addition to the futex_state pointer,
> or by making uaddr actually point to a struct { int *futex_addr; int
> attach_idx; }, or...) Then each thread would have to maintain a (futex
> address => index) mapping, but that's more or less what the kernel
> otherwise has to do.

Right. We tried this as a first attempt. That moves the complexity of
hashing futex address to index per thread to user space. It can be done
simply enough, though the resulting

> > +	case 4096: return 4093;
> > +	}
> > +}
> > +

> There should probably be some mention of TASK_CACHE_{BASE,MAX}_SIZE here
> so that anyone updating those would also look here, so we don't end up
> using 13 out of 8192 slots... BUILD_BUG_ON(TASK_CACHE_{BASE,MAX}_SIZE
> != {16,4096}) should do it.

As Peter pointed out there is hash_ptr() already so this will go away.

> > +	slot = find_first_bit(map, size);
> > +	for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
> > +		newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
> > +		/*
> > +		 * Paranoia. Rehashing to a larger cache should not result in
> > +		 * collisions which did not exist in the small one.
> > +		 */

> This doesn't sound right when you're doing mod prime hashing; two
> numbers can easily have the same remainder mod 31 while having distinct
> remainders mod 13.

I know, but as far as my extensive testing went I never hit that case.

> > +
> > +	/* Populate the new cache and return the slot number */
> > +	current->futex_cache = tcnew;
> > +	return slot;
> > +}

> I may be misreading it, but this seems to leak the old ->futex_cache?

Duh, you are right.

Thanks,

	tglx
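P.S.: The fixup is presumably just to drop the old cache once the new one
has been populated, along these lines (untested sketch against the quoted
hunk; it assumes the cache is a plain kzalloc() allocation and is only ever
touched by current - otherwise the free has to match the real allocator
and/or use RCU):

	/*
	 * The old cache was fully copied into tcnew above, so drop it.
	 * Assumption: kzalloc()'ed and only visible to current.
	 */
	kfree(current->futex_cache);

	/* Populate the new cache and return the slot number */
	current->futex_cache = tcnew;
	return slot;
}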