From: Rasmus Villemoes
To: Thomas Gleixner
Cc: LKML, Sebastian Andrzej Siewior, Darren Hart, Peter Zijlstra,
	Ingo Molnar, Michael Kerrisk, Davidlohr Bueso, Chris Mason,
	"Carlos O'Donell", Torvald Riegel, Eric Dumazet
Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes
Date: Sun, 03 Apr 2016 01:48:07 +0200
Message-ID: <87zitbmv2g.fsf@rasmusvillemoes.dk>
In-Reply-To: <20160402110035.753145539@linutronix.de>
References: <20160402095108.894519835@linutronix.de>
	<20160402110035.753145539@linutronix.de>

On Sat, Apr 02 2016, Thomas Gleixner wrote:

> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> and on real-time enabled kernels even to priority inversions.
>
> To guarantee futexes without collisions on the global kernel hash, we provide
> a mechanism to attach to a futex. This creates futex private state which
> avoids hash collisions and on NUMA systems also cross node memory access.

Hi,

A few minor comments inline below, and a question about the design: How
is an application supposed to handle it when the kernel fails to achieve
the no-collision goal? With any reasonable upper bound on the size of
the local hash table (which of course has to be there, whether sysctl'ed
or not), and regardless of the hashing scheme used, it seems inevitable
that someone is going to get -ENOSPC when trying to attach.
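Concretely, I'd expect every caller to end up with something like the
sketch below. [Only the FUTEX_ATTACH name is from this series; the
syscall argument layout and the fallback policy are my guesses.]

#include <linux/futex.h>	/* FUTEX_ATTACH: needs this series' headers */
#include <sys/syscall.h>
#include <unistd.h>

/* Did *this* thread manage to attach to the futex? */
static __thread int attached;

static void try_attach(int *uaddr)
{
	if (syscall(SYS_futex, uaddr, FUTEX_ATTACH, 0, NULL, NULL, 0) == 0)
		attached = 1;
	/*
	 * Otherwise (-ENOSPC, -ENOMEM) this thread must fall back to
	 * plain hashed ops for this futex - which seems hard to get
	 * right if other threads did attach successfully.
	 */
}

I.e., the "did attaching work" state has to live in each thread, for
each futex.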
Moreover, since different threads can attach to different sets of
futexes, one thread may successfully attach to a futex, while another
fails - the second thread is then permanently prevented from operating
on that futex (?).

Why not use some sort of tree instead? Or fall back to a traditional
chained hash table once we're no longer allowed to increase the table
size? Of course these have worse lookup performance, and maybe failing
the attach in the rare case is better than penalizing the common case,
but it would be nice to have some mention of this in the change log.

Alternatively [this is not really thought through], maybe one could move
the decision and the complexity to userspace: On successful
FUTEX_ATTACH, return an index into a small per-task array of struct
futex_state*. On subsequent FUTEX_ATTACHED operations on that futex,
userspace passes in this index somehow (either instead of uaddr, in
which case the kernel array would have to include this in addition to
the futex_state pointer, or by making uaddr actually point to a struct {
int *futex_addr; int attach_idx; }, or...). Then each thread would have
to maintain a (futex address => index) mapping, but that's more or less
what the kernel otherwise has to do.

> +
> +static unsigned int hash_prime(unsigned int size)
> +{
> +	switch(size) {
> +	case 16:
> +	default: return 13;
> +	case 32: return 31;
> +	case 64: return 61;
> +	case 128: return 127;
> +	case 256: return 251;
> +	case 512: return 509;
> +	case 1024: return 1021;
> +	case 2048: return 2039;
> +	case 4096: return 4093;
> +	}
> +}
> +

There should probably be some mention of TASK_CACHE_{BASE,MAX}_SIZE here
so that anyone updating those would also look here, so we don't end up
using 13 out of 8192 slots...
BUILD_BUG_ON(TASK_CACHE_{BASE,MAX}_SIZE != {16,4096}) should do it.

> +static struct futex_cache *futex_alloc_cache(int cache_size)
> +{
> +	struct futex_cache *tc;
> +	size_t size;
> +
> +	/* Allocate a new task cache */
> +	size = sizeof(*tc) + cache_size * sizeof(struct futex_cache_slot);
> +	tc = kzalloc_node(size, GFP_KERNEL, numa_node_id());
> +	if (tc) {
> +		tc->hash_prime = hash_prime(cache_size);
> +		tc->cache_size = cache_size;
> +	}
> +	return tc;
> +}
> +
> +static int
> +futex_rehash_task_cache(struct futex_cache *tc, struct futex_cache *tcnew)
> +{
> +	unsigned long *newmap = tcnew->cache_map;
> +	unsigned int prime = tcnew->hash_prime;
> +	unsigned long *map = tc->cache_map;
> +	unsigned int size = tc->cache_size;
> +	unsigned int slot, newslot;
> +
> +	slot = find_first_bit(map, size);
> +	for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
> +		newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
> +		/*
> +		 * Paranoia. Rehashing to a larger cache should not result in
> +		 * collisions which did not exist in the small one.
> +		 */

This doesn't sound right when you're doing mod prime hashing; two
numbers can easily have the same remainder mod 31 while having distinct
remainders mod 13.
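E.g., with plain numbers rather than whatever hash_local_futex()
actually produces (the final reduction is mod hash_prime, so the point
is the same):

#include <assert.h>

int main(void)
{
	/* No collision in the size-16 table (prime 13)... */
	assert(13 % 13 != 44 % 13);	/* 0 vs. 5 */
	/* ...but a collision after growing to size 32 (prime 31). */
	assert(13 % 31 == 44 % 31);	/* both 13 */
	return 0;
}

So the -ENOSPC below is not mere paranoia - the retry loop in
futex_get_task_cache_slot() actually depends on that path being taken.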
> +		if (__test_and_set_bit(newslot, newmap))
> +			return -ENOSPC;
> +		/* Copy uaddr and futex state pointer */
> +		tcnew->slots[newslot] = tc->slots[slot];
> +	}
> +	return 0;
> +}
> +
> +/**
> + * futex_get_task_cache_slot - Get a slot in the tasks local cache
> + *
> + * If the cache is not yet available it's allocated. If the existing cache is
> + * too small the cache is extended.
> + *
> + * Returns a valid slot or an error code
> + */
> +static int futex_get_task_cache_slot(u32 __user *uaddr)
> +{
> +	struct futex_cache *tcnew, *tc = current->futex_cache;
> +	unsigned int cache_size;
> +	int slot;
> +
> +	/* First caller allocates the initial cache */
> +	if (!tc) {
> +		tc = futex_alloc_cache(TASK_CACHE_BASE_SIZE);
> +		if (!tc)
> +			return -ENOMEM;
> +		current->futex_cache = tc;
> +		slot = hash_local_futex(uaddr, tc->hash_prime);
> +		return slot;
> +	}
> +
> +	slot = hash_local_futex(uaddr, tc->hash_prime);
> +
> +	/* Check whether the slot is populated already */
> +	if (!test_bit(slot, tc->cache_map))
> +		return slot;
> +
> +	/* Was this futex attached already ? */
> +	if (tc->slots[slot].uaddr == uaddr)
> +		return -EEXIST;
> +
> +	cache_size = tc->cache_size;
> +retry:
> +	/* Task has reached max cache size? */
> +	if (cache_size >= TASK_CACHE_MAX_SIZE)
> +		return -ENOSPC;
> +
> +	cache_size *= 2;
> +	tcnew = futex_alloc_cache(cache_size);
> +	if (!tcnew)
> +		return -ENOMEM;
> +
> +	/*
> +	 * If the rehashing fails or the slot for uaddr is busy after
> +	 * rehashing, try with a larger cache.
> +	 */
> +	slot = hash_local_futex(uaddr, tcnew->hash_prime);
> +
> +	if (futex_rehash_task_cache(tc, tcnew) ||
> +	    test_bit(slot, tcnew->cache_map)) {
> +		kfree(tcnew);
> +		goto retry;
> +	}
> +
> +	/* Populate the new cache and return the slot number */
> +	current->futex_cache = tcnew;
> +	return slot;
> +}

I may be misreading it, but this seems to leak the old ->futex_cache?
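I'd have expected the tail of the function to read something like this
(assuming nothing else can hold a reference to the old cache, since it
is only ever installed and replaced by current):

	/* Populate the new cache and return the slot number */
	current->futex_cache = tcnew;
	/* The old cache is now unreachable, so free it. */
	kfree(tc);
	return slot;

Rasmus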