From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, ast@kernel.org, Alexei Starovoitov, Daniel Borkmann
Subject: [PATCH 4.14 66/71] bpf: avoid false sharing of map refcount with max_entries
Date: Mon, 29 Jan 2018 13:57:34 +0100
Message-Id: <20180129123832.040554129@linuxfoundation.org>
In-Reply-To: <20180129123827.271171825@linuxfoundation.org>
References: <20180129123827.271171825@linuxfoundation.org>
User-Agent: quilt/0.65
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
X-Mailing-List: linux-kernel@vger.kernel.org

4.14-stable review patch.  If anyone has any objections, please let me know.
------------------

From: Daniel Borkmann

[ upstream commit be95a845cc4402272994ce290e3ad928aff06cb9 ]

In addition to commit b2157399cc98 ("bpf: prevent out-of-bounds speculation"),
also change the layout of struct bpf_map such that false sharing of fast-path
members like max_entries is avoided when the map's reference counter is
altered. Therefore enforce them to be placed into separate cachelines.

pahole dump after change:

  struct bpf_map {
          const struct bpf_map_ops  *        ops;                  /*     0     8 */
          struct bpf_map *                   inner_map_meta;       /*     8     8 */
          void *                             security;             /*    16     8 */
          enum bpf_map_type                  map_type;             /*    24     4 */
          u32                                key_size;             /*    28     4 */
          u32                                value_size;           /*    32     4 */
          u32                                max_entries;          /*    36     4 */
          u32                                map_flags;            /*    40     4 */
          u32                                pages;                /*    44     4 */
          u32                                id;                   /*    48     4 */
          int                                numa_node;            /*    52     4 */
          bool                               unpriv_array;         /*    56     1 */

          /* XXX 7 bytes hole, try to pack */

          /* --- cacheline 1 boundary (64 bytes) --- */
          struct user_struct *               user;                 /*    64     8 */
          atomic_t                           refcnt;               /*    72     4 */
          atomic_t                           usercnt;              /*    76     4 */
          struct work_struct                 work;                 /*    80    32 */
          char                               name[16];             /*   112    16 */

          /* --- cacheline 2 boundary (128 bytes) --- */

          /* size: 128, cachelines: 2, members: 17 */
          /* sum members: 121, holes: 1, sum holes: 7 */
  };

Now all entries in the first cacheline are read-only throughout the lifetime
of the map, set up once during map creation. Overall struct size and number
of cachelines don't change from the reordering. struct bpf_map is usually the
first member, embedded in the map structs of specific map implementations, so
also avoid letting those members sit at the end, where they could potentially
share a cacheline with the first map values, e.g. in the array, since remote
CPUs could trigger map updates just as well for those (easily dirtying
members like max_entries intentionally as well) while having subsequent
values in cache.
Quoting from Google's Project Zero blog [1]:

  Additionally, at least on the Intel machine on which this was tested,
  bouncing modified cache lines between cores is slow, apparently because
  the MESI protocol is used for cache coherence [8]. Changing the reference
  counter of an eBPF array on one physical CPU core causes the cache line
  containing the reference counter to be bounced over to that CPU core,
  making reads of the reference counter on all other CPU cores slow until
  the changed reference counter has been written back to memory. Because the
  length and the reference counter of an eBPF array are stored in the same
  cache line, this also means that changing the reference counter on one
  physical CPU core causes reads of the eBPF array's length to be slow on
  other physical CPU cores (intentional false sharing).

While this doesn't 'control' the out-of-bounds speculation through masking
the index as in commit b2157399cc98, triggering a manipulation of the map's
reference counter is really trivial, so let's not allow it to easily affect
max_entries. Splitting into separate cachelines also generally makes sense
from a performance perspective anyway, in that the fast path won't take a
cache miss if the map gets pinned, reused in other progs, etc. out of the
control path, thus also avoiding unintentional false sharing.

[1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html

Signed-off-by: Daniel Borkmann
Signed-off-by: Alexei Starovoitov
Signed-off-by: Greg Kroah-Hartman
---
 include/linux/bpf.h |   21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -42,7 +42,14 @@ struct bpf_map_ops {
 };
 
 struct bpf_map {
-	atomic_t refcnt;
+	/* 1st cacheline with read-mostly members of which some
+	 * are also accessed in fast-path (e.g. ops, max_entries).
+	 */
+	const struct bpf_map_ops *ops ____cacheline_aligned;
+	struct bpf_map *inner_map_meta;
+#ifdef CONFIG_SECURITY
+	void *security;
+#endif
 	enum bpf_map_type map_type;
 	u32 key_size;
 	u32 value_size;
@@ -52,11 +59,15 @@ struct bpf_map {
 	u32 id;
 	int numa_node;
 	bool unpriv_array;
-	struct user_struct *user;
-	const struct bpf_map_ops *ops;
-	struct work_struct work;
+	/* 7 bytes hole */
+
+	/* 2nd cacheline with misc members to avoid false sharing
+	 * particularly with refcounting.
+	 */
+	struct user_struct *user ____cacheline_aligned;
+	atomic_t refcnt;
 	atomic_t usercnt;
-	struct bpf_map *inner_map_meta;
+	struct work_struct work;
 };
 
 /* function argument constraints */