From: Marco Elver
Date: Thu, 8 Feb 2024 08:37:24 +0100
Subject: Re: [PATCH bpf-next v2] bpf: Allow compiler to inline most of bpf_local_storage_lookup()
To: Yonghong Song
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
 Song Liu, John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
 Mykola Lysenko, Shuah Khan, Ilya Leoshkevich, Yafang Shao, Tejun Heo,
 bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org
References: <20240207122626.3508658-1-elver@google.com> <289242c3-052b-436d-8c7c-b0fa5ae45bce@linux.dev>
In-Reply-To: <289242c3-052b-436d-8c7c-b0fa5ae45bce@linux.dev>

On Thu, 8 Feb 2024 at 00:58, Yonghong Song wrote:
> On 2/7/24 4:26 AM, Marco Elver wrote:
> > In various performance profiles of kernels with BPF programs attached,
> > bpf_local_storage_lookup() appears as a significant portion of CPU
> > cycles spent. To enable the compiler to generate more optimal code, turn
> > bpf_local_storage_lookup() into a static inline function, where only the
> > cache insertion code path is outlined.
> >
> > Notably, outlining cache insertion helps avoid bloating callers by
> > duplicating the setup of calls to raw_spin_lock_irqsave() and
> > raw_spin_unlock_irqrestore() (on architectures which do not inline
> > spin_lock/unlock, such as x86), which would cause the compiler to
> > produce worse code by deciding to outline otherwise inlinable functions.
> > The call overhead is neutral, because we make 2 calls either way: either
> > calling raw_spin_lock_irqsave() and raw_spin_unlock_irqrestore(); or
> > calling __bpf_local_storage_insert_cache(), which calls
> > raw_spin_lock_irqsave(), followed by a tail-call to
> > raw_spin_unlock_irqrestore() where the compiler can perform TCO and (in
> > optimized uninstrumented builds) turn it into a plain jump. The call to
> > __bpf_local_storage_insert_cache() can be elided entirely if
> > cacheit_lockit is a false constant expression.
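To make the shape of the change concrete, the outlined helper ends up looking
roughly like the sketch below. This is a simplified illustration rather than
the exact hunk from the patch: the local_storage->lock field and the
selem_linked_to_storage() re-check are assumptions based on the existing
bpf_local_storage code, not something shown in the quoted diff.

void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
				      struct bpf_local_storage_map *smap,
				      struct bpf_local_storage_elem *selem)
{
	unsigned long flags;

	/* Take the storage lock so we do not publish an selem that a
	 * parallel delete is about to free (assumed re-check below).
	 */
	raw_spin_lock_irqsave(&local_storage->lock, flags);
	if (selem_linked_to_storage(selem))
		rcu_assign_pointer(local_storage->cache[smap->cache_idx],
				   SDATA(selem));
	/* Last call in the function: in optimized, uninstrumented builds
	 * the compiler can emit this as a tail call, i.e. a plain jump.
	 */
	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
}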
> >
> > Based on results from './benchs/run_bench_local_storage.sh' (21 trials,
> > reboot between each trial; x86 defconfig + BPF, clang 16) this produces
> > improvements in throughput and latency in the majority of cases, with an
> > average (geomean) improvement of 8%:
[...]
> >  include/linux/bpf_local_storage.h             | 30 ++++++++++-
> >  kernel/bpf/bpf_local_storage.c                | 52 +++++--------------
> >  .../bpf/prog_tests/task_local_storage.c       |  6 ---
> >  .../selftests/bpf/progs/cgrp_ls_recursion.c   | 26 ----------
> >  .../selftests/bpf/progs/task_ls_recursion.c   | 17 ------
> >  5 files changed, 41 insertions(+), 90 deletions(-)
> >
> > diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> > index 173ec7f43ed1..dcddb0aef7d8 100644
> > --- a/include/linux/bpf_local_storage.h
> > +++ b/include/linux/bpf_local_storage.h
> > @@ -129,10 +129,36 @@ bpf_local_storage_map_alloc(union bpf_attr *attr,
> >  			    struct bpf_local_storage_cache *cache,
> >  			    bool bpf_ma);
> >
> > -struct bpf_local_storage_data *
> > +void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
> > +				      struct bpf_local_storage_map *smap,
> > +				      struct bpf_local_storage_elem *selem);
> > +/* If cacheit_lockit is false, this lookup function is lockless */
> > +static inline struct bpf_local_storage_data *
> >  bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
> >  			 struct bpf_local_storage_map *smap,
> > -			 bool cacheit_lockit);
> > +			 bool cacheit_lockit)
> > +{
> > +	struct bpf_local_storage_data *sdata;
> > +	struct bpf_local_storage_elem *selem;
> > +
> > +	/* Fast path (cache hit) */
> > +	sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx],
> > +				      bpf_rcu_lock_held());
> > +	if (sdata && rcu_access_pointer(sdata->smap) == smap)
> > +		return sdata;

> I think we should focus on the fast path (your v1 patch), and I suppose
> most production environments want to hit the fast path most of the time.
> In your production environment did you see more than 16 local storage
> maps per entity (task/sk/inode)?

I think having more than 16 local storage maps isn't entirely unlikely as
eBPF usage grows. But at the moment, it should be rare.

> In the fast path, the memory accesses are two from
> local_storage->cache[smap->cache_idx] and one from sdata->smap.
>
> > +
> > +	/* Slow path (cache miss) */
> > +	hlist_for_each_entry_rcu(selem, &local_storage->list, snode,
> > +				 rcu_read_lock_trace_held())
> > +		if (rcu_access_pointer(SDATA(selem)->smap) == smap)
> > +			break;

> But if we reach the slow path here, which means we have more than 16
> local storage maps, then traversing the list and getting
> SDATA(selem)->smap will be very expensive, in addition to the memory
> accesses in the fast path.
>
> I suppose here we mostly care about socket local storage, since it is
> totally possible for a production workload to have millions of sockets.
> To improve performance, the fast path should hit in most cases. If there
> are too many sk local storage maps, some kind of sharing can be done so
> that multiple applications might be using a single sk local storage.
>
> Your above inlining/outlining analysis also shows how tricky it is for
> compilation optimization. Without profiling, it is totally possible that
> the compiler might do the optimization differently in the future.

Sure, but it's usually the case that we have to help the compiler a little
to produce more optimal code - if the compiler becomes stupid in the
future, we need to either fix the compiler or help it some more.
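To illustrate what "helping the compiler" buys here: once
bpf_local_storage_lookup() is a static inline, a caller passing a
compile-time constant cacheit_lockit lets the compiler specialize the body.
The two wrappers below are hypothetical, purely for illustration, and not
part of the patch:

/* Hypothetical callers, for illustration only: with cacheit_lockit == false
 * known at compile time, the inlined bpf_local_storage_lookup() reduces to
 * the cache check plus the lockless list walk, and no call to
 * __bpf_local_storage_insert_cache() (and hence no lock setup) is emitted.
 * With cacheit_lockit == true, the only extra cost is the single outlined
 * call on the cache-miss path.
 */
static struct bpf_local_storage_data *
lookup_lockless(struct bpf_local_storage *local_storage,
		struct bpf_local_storage_map *smap)
{
	return bpf_local_storage_lookup(local_storage, smap, false);
}

static struct bpf_local_storage_data *
lookup_and_cache(struct bpf_local_storage *local_storage,
		 struct bpf_local_storage_map *smap)
{
	return bpf_local_storage_lookup(local_storage, smap, true);
}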
> So here is my suggestion, let us do inlining for the fast path and focus
> on the performance of the fast path.

The slow path (iterating the list without cache insertion) is still
relatively small - it's a pointer-chasing loop and a compare - and I
decided that inlining it can be justified. Martin asked in v1 why there
were slowdowns above 16 local maps; I analyzed that and concluded that
inlining most of the function is needed to fix it, and that doing so does
not hurt performance: in fact, the current version is better than v1 in
all cases (even for 16 maps or below).

Let me know which version you prefer, and I'll change it. However, based
on the results, I would prefer the current version.

Thanks,
-- Marco