Message-ID: <20230807123323.504975124@infradead.org>
User-Agent: quilt/0.66
Date: Mon, 07 Aug 2023 14:18:54 +0200
From: Peter Zijlstra
To: tglx@linutronix.de, axboe@kernel.dk
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, mingo@redhat.com,
 dvhart@infradead.org, dave@stgolabs.net, andrealmeid@igalia.com,
 Andrew Morton, urezki@gmail.com, hch@infradead.org, lstoakes@gmail.com,
 Arnd Bergmann, linux-api@vger.kernel.org, linux-mm@kvack.org,
 linux-arch@vger.kernel.org, malteskarupke@web.de
Subject: [PATCH v2 11/14] futex: Implement FUTEX2_NUMA
References: <20230807121843.710612856@infradead.org>

Extend the futex2 interface to be numa aware.

When FUTEX2_NUMA is specified for a futex, the user value is extended
to two words (of the same size). The first is the user value we all
know, the second one will be the node to place this futex on.

  struct futex_numa_32 {
	u32 val;
	u32 node;
  };

When node is set to ~0, WAIT will set it to the current node_id such
that WAKE knows where to find it.
If userspace corrupts the node value between WAIT and WAKE, the futex
will not be found and no wakeup will happen.

When FUTEX2_NUMA is not set, the node is simply an extension of the
hash, such that traditional futexes are still interleaved over the
nodes.

This is done to avoid having to have a separate !numa hash-table.

Signed-off-by: Peter Zijlstra (Intel)
---
 include/linux/futex.h   |    3 +
 kernel/futex/core.c     |  129 +++++++++++++++++++++++++++++++++++++++---------
 kernel/futex/futex.h    |   25 +++++++--
 kernel/futex/syscalls.c |    2
 4 files changed, 128 insertions(+), 31 deletions(-)

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -34,6 +34,7 @@ union futex_key {
 		u64 i_seq;
 		unsigned long pgoff;
 		unsigned int offset;
+		/* unsigned int node; */
 	} shared;
 	struct {
 		union {
@@ -42,11 +43,13 @@ union futex_key {
 		};
 		unsigned long address;
 		unsigned int offset;
+		/* unsigned int node; */
 	} private;
 	struct {
 		u64 ptr;
 		unsigned long word;
 		unsigned int offset;
+		unsigned int node;	/* NOT hashed! */
 	} both;
 };

--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -34,7 +34,8 @@
 #include
 #include
 #include
-#include
+#include
+#include
 #include
 #include

@@ -47,12 +48,14 @@
  * reside in the same cacheline.
  */
 static struct {
-	struct futex_hash_bucket *queues;
 	unsigned long hashsize;
+	unsigned int hashshift;
+	struct futex_hash_bucket *queues[MAX_NUMNODES];
 } __futex_data __read_mostly __aligned(2*sizeof(long));
-#define futex_queues (__futex_data.queues)
-#define futex_hashsize (__futex_data.hashsize)
+#define futex_hashsize	(__futex_data.hashsize)
+#define futex_hashshift	(__futex_data.hashshift)
+#define futex_queues	(__futex_data.queues)


 /*
  * Fault injections for futexes.
@@ -105,6 +108,26 @@ late_initcall(fail_futex_debugfs);

 #endif /* CONFIG_FAIL_FUTEX */

+static int futex_get_value(u32 *val, u32 __user *from, unsigned int flags)
+{
+	switch (futex_size(flags)) {
+	case 1: return __get_user(*val, (u8 __user *)from);
+	case 2: return __get_user(*val, (u16 __user *)from);
+	case 4: return __get_user(*val, (u32 __user *)from);
+	default: BUG();
+	}
+}
+
+static int futex_put_value(u32 val, u32 __user *to, unsigned int flags)
+{
+	switch (futex_size(flags)) {
+	case 1: return __put_user(val, (u8 __user *)to);
+	case 2: return __put_user(val, (u16 __user *)to);
+	case 4: return __put_user(val, (u32 __user *)to);
+	default: BUG();
+	}
+}
+
 /**
  * futex_hash - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
@@ -114,10 +137,29 @@ late_initcall(fail_futex_debugfs);
  */
 struct futex_hash_bucket *futex_hash(union futex_key *key)
 {
-	u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
+	u32 hash = jhash2((u32 *)key,
+			  offsetof(typeof(*key), both.offset) / sizeof(u32),
 			  key->both.offset);
+	int node = key->both.node;
+
+	if (node == -1) {
+		/*
+		 * In case of !FLAGS_NUMA, use some unused hash bits to pick a
+		 * node -- this ensures regular futexes are interleaved across
+		 * the nodes and avoids having to allocate multiple
+		 * hash-tables.
+		 *
+		 * NOTE: this isn't perfectly uniform, but it is fast and
+		 * handles sparse node masks.
+		 */
+		node = (hash >> futex_hashshift) % nr_node_ids;
+		if (!node_possible(node)) {
+			node = find_next_bit_wrap(node_possible_map.bits,
+						  nr_node_ids, node);
+		}
+	}

-	return &futex_queues[hash & (futex_hashsize - 1)];
+	return &futex_queues[node][hash & (futex_hashsize - 1)];
 }


@@ -217,32 +259,56 @@ static u64 get_inode_sequence_number(str
  *
  * lock_page() might sleep, the caller should not hold a spinlock.
  */
-int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
+int get_futex_key(void __user *uaddr, unsigned int flags, union futex_key *key,
 		  enum futex_access rw)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
 	struct page *page, *tail;
 	struct address_space *mapping;
-	int err, ro = 0;
+	int node, err, size, ro = 0;
 	bool fshared;

 	fshared = flags & FLAGS_SHARED;
+	size = futex_size(flags);
+	if (flags & FLAGS_NUMA)
+		size *= 2;

 	/*
 	 * The futex address must be "naturally" aligned.
 	 */
 	key->both.offset = address % PAGE_SIZE;
-	if (unlikely((address % sizeof(u32)) != 0))
+	if (unlikely((address % size) != 0))
 		return -EINVAL;
 	address -= key->both.offset;

-	if (unlikely(!access_ok(uaddr, sizeof(u32))))
+	if (unlikely(!access_ok(uaddr, size)))
 		return -EFAULT;

 	if (unlikely(should_fail_futex(fshared)))
 		return -EFAULT;

+	if (flags & FLAGS_NUMA) {
+		void __user *naddr = uaddr + size / 2;
+
+		if (futex_get_value(&node, naddr, flags))
+			return -EFAULT;
+
+		if (node == -1) {
+			node = numa_node_id();
+			if (futex_put_value(node, naddr, flags))
+				return -EFAULT;
+
+		} else if (node >= MAX_NUMNODES || !node_possible(node)) {
+			return -EINVAL;
+		}
+
+		key->both.node = node;
+
+	} else {
+		key->both.node = -1;
+	}
+
 	/*
 	 * PROCESS_PRIVATE futexes are fast.
 	 * As the mm cannot disappear under us and the 'key' only needs
@@ -1125,27 +1191,42 @@ void futex_exit_release(struct task_stru

 static int __init futex_init(void)
 {
-	unsigned int futex_shift;
-	unsigned long i;
+	unsigned int order, n;
+	unsigned long size, i;

 #if CONFIG_BASE_SMALL
 	futex_hashsize = 16;
 #else
-	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+	futex_hashsize = 256 * num_possible_cpus();
+	futex_hashsize /= num_possible_nodes();
+	futex_hashsize = roundup_pow_of_two(futex_hashsize);
 #endif
+	futex_hashshift = ilog2(futex_hashsize);
+	size = sizeof(struct futex_hash_bucket) * futex_hashsize;
+	order = get_order(size);
+
+	for_each_node(n) {
+		struct futex_hash_bucket *table;
+
+		if (order > MAX_ORDER)
+			table = vmalloc_huge_node(size, GFP_KERNEL, n);
+		else
+			table = alloc_pages_exact_nid(n, size, GFP_KERNEL);
+
+		BUG_ON(!table);
+
+		for (i = 0; i < futex_hashsize; i++) {
+			atomic_set(&table[i].waiters, 0);
+			spin_lock_init(&table[i].lock);
+			plist_head_init(&table[i].chain);
+		}

-	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
-					       futex_hashsize, 0,
-					       futex_hashsize < 256 ? HASH_SMALL : 0,
-					       &futex_shift, NULL,
-					       futex_hashsize, futex_hashsize);
-	futex_hashsize = 1UL << futex_shift;
-
-	for (i = 0; i < futex_hashsize; i++) {
-		atomic_set(&futex_queues[i].waiters, 0);
-		plist_head_init(&futex_queues[i].chain);
-		spin_lock_init(&futex_queues[i].lock);
+		futex_queues[n] = table;
 	}
+	pr_info("futex hash table, %d nodes, %ld entries (order: %d, %lu bytes)\n",
+		num_possible_nodes(),
+		futex_hashsize, order,
+		sizeof(struct futex_hash_bucket) * futex_hashsize);

 	return 0;
 }
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -65,6 +65,11 @@ static inline unsigned int futex2_to_fla
 	return flags;
 }

+static inline unsigned int futex_size(unsigned int flags)
+{
+	return 1 << (flags & FLAGS_SIZE_MASK);
+}
+
 static inline bool futex_flags_valid(unsigned int flags)
 {
 	/* Only 64bit futexes for 64bit code */
@@ -77,12 +82,20 @@ static inline bool futex_flags_valid(uns
 	if ((flags & FLAGS_SIZE_MASK) != FLAGS_SIZE_32)
 		return false;

-	return true;
-}
+	/*
+	 * Must be able to represent both NUMA_NO_NODE and every valid nodeid
+	 * in a futex word.
+	 */
+	if (flags & FLAGS_NUMA) {
+		int bits = 8 * futex_size(flags);
+		u64 max = ~0ULL;

-static inline unsigned int futex_size(unsigned int flags)
-{
-	return 1 << (flags & FLAGS_SIZE_MASK);
+		max >>= 64 - bits;
+		if (nr_node_ids >= max)
+			return false;
+	}
+
+	return true;
 }

 static inline bool futex_validate_input(unsigned int flags, u64 val)
@@ -183,7 +196,7 @@ enum futex_access {
 	FUTEX_WRITE
 };

-extern int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
+extern int get_futex_key(void __user *uaddr, unsigned int flags, union futex_key *key,
 			 enum futex_access rw);

 extern struct hrtimer_sleeper *
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -179,7 +179,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uad
 	return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
 }

-#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_PRIVATE)
+#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE)

 /**
  * futex_parse_waitv - Parse a waitv array from userspace
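
For readers who want to poke at the arithmetic outside the kernel, the
following is a rough userspace model of four rules in the patch: the
two-word layout, the doubled alignment requirement, the "node ids must
fit in a futex word" check, and the node pick used when FUTEX2_NUMA is
not set. It is only a sketch and not part of the patch. The helper
names (numa_futex_aligned, node_ids_fit, pick_node, per_node_hashsize)
and FUTEX_NO_NODE are made up for illustration, the possible-node map
is modelled as a plain 64-bit mask, and at most 64 node ids are
assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the layout given in the changelog. */
struct futex_numa_32 {
	uint32_t val;	/* the usual futex word */
	uint32_t node;	/* node id, or ~0 to let WAIT fill in the current node */
};

/* Hypothetical name for the ~0 "pick a node for me" value. */
#define FUTEX_NO_NODE	UINT32_MAX

/* Two words of the same size, node directly after val. */
static_assert(sizeof(struct futex_numa_32) == 2 * sizeof(uint32_t),
	      "val and node must be adjacent words");
static_assert(offsetof(struct futex_numa_32, node) == sizeof(uint32_t),
	      "node lives one word after val");

/* Models the get_futex_key() check: with NUMA the object is 2 words wide. */
static inline bool numa_futex_aligned(uintptr_t addr, unsigned int word_size)
{
	return addr % (2u * word_size) == 0;
}

/* Models futex_flags_valid(): every node id plus ~0 must fit in one word. */
static inline bool node_ids_fit(unsigned int word_size, uint64_t nr_node_ids)
{
	uint64_t max = ~0ULL >> (64 - 8 * word_size);

	return nr_node_ids < max;
}

/*
 * Models the !NUMA path of futex_hash(): take hash bits above the bucket
 * index, reduce modulo the number of node ids, and if that node is not
 * possible walk forward (wrapping) to the next one that is.
 * Assumes nr_node_ids <= 64 and a non-empty possible_mask.
 */
static inline unsigned int pick_node(uint32_t hash, unsigned int hashshift,
				     uint64_t possible_mask,
				     unsigned int nr_node_ids)
{
	unsigned int node = (hash >> hashshift) % nr_node_ids;
	unsigned int i;

	for (i = 0; i < nr_node_ids; i++) {
		unsigned int cand = (node + i) % nr_node_ids;

		if (possible_mask & (1ULL << cand))
			return cand;
	}
	return node;	/* only reached if possible_mask is empty */
}

/* Models the futex_init() sizing: 256 buckets per CPU, split over nodes. */
static inline unsigned long per_node_hashsize(unsigned long cpus,
					      unsigned long nodes)
{
	unsigned long want = 256 * cpus / nodes;
	unsigned long size = 1;

	while (size < want)	/* round up to a power of two */
		size <<= 1;
	return size;
}
```

With nodes 0 and 2 possible out of four ids (mask 0x5) and a shift of 8,
pick_node(0x300, 8, 0x5, 4) starts at node 3, wraps around, and lands on
node 0. A one-byte futex word can describe at most 254 node ids under
this check, since 255 is taken by the "no node" value.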