From: Toke Høiland-Jørgensen
To: Sebastian Andrzej Siewior
Cc: Alexei Starovoitov, LKML, Network Development, "David S. Miller",
    Boqun Feng, Daniel Borkmann, Eric Dumazet, Frederic Weisbecker,
    Ingo Molnar, Jakub Kicinski, Paolo Abeni, Peter Zijlstra,
    Thomas Gleixner, Waiman Long, Will Deacon, Alexei Starovoitov,
    Andrii Nakryiko, Cong Wang, Hao Luo, Jamal Hadi Salim,
    Jesper Dangaard Brouer, Jiri Olsa, Jiri Pirko, John Fastabend,
    KP Singh, Martin KaFai Lau, Ronak Doshi, Song Liu,
    Stanislav Fomichev, VMware PV-Drivers Reviewers, Yonghong Song, bpf
Subject: Re: [PATCH net-next 15/24] net: Use nested-BH locking for XDP redirect.
In-Reply-To: <20240112174138.tMmUs11o@linutronix.de>
References: <20231215171020.687342-1-bigeasy@linutronix.de>
 <20231215171020.687342-16-bigeasy@linutronix.de>
 <87r0iw524h.fsf@toke.dk> <20240112174138.tMmUs11o@linutronix.de>
X-Clacks-Overhead: GNU Terry Pratchett
Date: Wed, 17 Jan 2024 17:37:29 +0100
Message-ID: <87ttnb6hme.fsf@toke.dk>

Sebastian Andrzej Siewior writes:

> On 2024-01-04 20:29:02 [+0100], Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov writes:
>>
>> >> @@ -3925,6 +3926,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>> >>
>> >>  	fl = rcu_dereference_bh(qe->filter_chain);
>> >>
>> >> +	guard(local_lock_nested_bh)(&bpf_run_lock.redirect_lock);
>> >>  	switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
>> >>  	case TC_ACT_SHOT:
>> >>  		qdisc_qstats_drop(sch);
>> >
>> > Here and in all other places this patch adds locks that
>> > will kill performance of XDP, tcx and everything else in networking.
>> >
>> > I'm surprised Jesper and other folks are not jumping in with nacks.
>> > We measure performance in nanoseconds here.
>> > Extra lock is no go.
>> > Please find a different way without ruining performance.
>>
>> I'll add that while all this compiles out as no-ops on !PREEMPT_RT, I do
>> believe there are people who are using XDP on PREEMPT_RT kernels and
>> still expect decent performance. And to achieve that it is absolutely
>> imperative that we can amortise expensive operations (such as locking)
>> over multiple packets.
>>
>> I realise there's a fundamental trade-off between the amount of
>> amortisation and the latency hit that we take from holding locks for
>> longer, but tuning the batch size (while still keeping some amount of
>> batching) may be a way forward? I suppose Jakub's suggestion in the
>> other part of the thread, of putting the locks around napi->poll(), is a
>> step towards something like this.
>
> The RT requirements are usually different. Networking as in CAN might be
> important, but Ethernet may be used only for remote communication and so
> be "not" important. People complained that they need to wait for Ethernet
> to be done before the CAN packet can be injected into the stack.
> With that expectation you would like to pause Ethernet immediately and
> switch over to the CAN interrupt thread.
>
> But if someone managed to set up XDP then it is likely to be important.
> With RT traffic it is usually not the throughput that matters but the
> latency. You are likely in the position to receive a packet, say every
> 1 ms, and need to respond immediately. XDP would be used to inspect the
> packet and either hand it over to the stack or process it.

I am not contesting that latency is important, but it's a pretty
fundamental trade-off, and we don't want to kill throughput entirely
either. Especially since this is global to the whole kernel; there are
definitely people who want to use XDP on an RT kernel and still achieve
high PPS rates. (Whether those people really, strictly speaking, need to
be running an RT kernel is maybe debatable, but it does happen.)

> I expected the lock operation (under RT) to always succeed and not
> cause any delay because it should not be contended.

A lock does cause delay even when it's not contended.
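To make that concrete, here is a minimal user-space sketch (not from the
patch series, and only a rough analogy for the kernel-side local_lock
being discussed) that measures the fast-path cost of taking and releasing
an uncontended lock in a tight loop:

    /* Hypothetical micro-benchmark: cost of an uncontended lock/unlock
     * pair. The absolute number depends on the CPU, but it is not zero. */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
            pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
            struct timespec a, b;
            const long iters = 10000000;

            clock_gettime(CLOCK_MONOTONIC, &a);
            for (long i = 0; i < iters; i++) {
                    pthread_mutex_lock(&m);   /* fast path: one atomic compare-and-swap */
                    pthread_mutex_unlock(&m); /* fast path: one atomic store/exchange */
            }
            clock_gettime(CLOCK_MONOTONIC, &b);

            double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
            printf("~%.1f ns per uncontended lock/unlock pair\n", ns / iters);
            return 0;
    }

On typical x86 hardware this lands somewhere in the range of a few to a
few tens of nanoseconds per pair, which is exactly the kind of per-packet
overhead being weighed against the budget below.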
Bear in mind that at 10 Gbps line rate, we have a budget of 64
nanoseconds to process each packet (for 64-byte packets). So just the
atomic op to figure out whether there's any contention (around 10 ns on
the Intel processors I usually test on) will blow a huge chunk of the
total processing budget. We can't actually do the full processing needed
in those 64 nanoseconds (not to mention the 6.4 nanoseconds we have
available at 100 Gbps), which is why it's essential to amortise as much
as we can over multiple packets.

These are all back-of-the-envelope calculations, of course. Having some
actual numbers to look at would be great; I don't suppose you have a
setup where you can run xdp-bench and see how your patches affect the
throughput?

> It should only block if something with higher priority preempted the
> current interrupt thread _and_ also happens to use XDP on the same CPU.
> In that case (XDP is needed) it would flush the current user out of
> the locked section before the higher-prio thread could take over.
> Doing bulk and allowing the low-priority thread to complete would
> delay the high-priority thread. Maybe I am too pessimistic here and
> having two XDP programs on one CPU is unlikely to happen.
>
> Adding the lock on a per-NAPI basis would allow packets to be batched.
> Acquiring the lock only if XDP is supported would not block the CAN
> drivers, since they don't support XDP. But that sounds like a hack.

I chatted with Jesper about this, and he had an idea not too far from
this: split up the XDP and regular stack processing into two stages,
each with its own batching. So whereas right now we're doing something
like:

run_napi()
  bh_disable()
  for pkt in budget:
    act = run_xdp(pkt)
    if (act == XDP_PASS)
      run_netstack(pkt)  // this is the expensive bit
  bh_enable()

we could instead do:

run_napi()
  bh_disable()
  for pkt in budget:
    act = run_xdp(pkt)
    if (act == XDP_PASS)
      add_to_list(pkt, to_stack_list)
  bh_enable()
  // sched point
  bh_disable()
  for pkt in to_stack_list:
    run_netstack(pkt)
  bh_enable()

This would limit the batching that blocks everything to only the XDP
processing itself, which should limit the maximum time spent in the
blocking state significantly compared to what we have today. The caveat
is that rearranging things like this is potentially a pretty major
refactoring task that needs to touch all the drivers (even if some of
the logic can be moved into the core code in the process). So I'm not
really sure this approach is feasible, TBH.

> Daniel said netkit doesn't need this locking because it does not
> support this redirect, and it made me think. Would it work to make
> the redirect structures part of the bpf_prog structure instead of
> per-CPU? My understanding is that an eBPF program's data structures
> are part of it and contain locking that allows one eBPF program to
> preempt another one.
> Making the redirect structures part of the program would obsolete the
> locking. Am I missing anything?

This won't work, unfortunately: the same XDP program can be attached to
multiple interfaces simultaneously, and for hardware with multiple
receive queues (which is most of the hardware that supports XDP), it can
even run simultaneously on multiple CPUs on the same interface. This is
the reason why all of this is being kept in per-CPU variables today.

-Toke
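For reference, the per-CPU arrangement described in that last paragraph
looks roughly like the following sketch (hypothetical, simplified names;
the real kernel state carries more fields than this):

    #include <linux/percpu.h>
    #include <linux/types.h>

    /* Hypothetical, simplified sketch: the pending redirect target chosen
     * by bpf_redirect() has to live in per-CPU storage, because the same
     * XDP program may be running on several CPUs (and interfaces) at once,
     * so storing it in the bpf_prog itself would let concurrent runs
     * clobber each other. */
    struct redirect_state {
            u32 target_ifindex;   /* interface chosen by the program */
            u32 flags;
    };
    static DEFINE_PER_CPU(struct redirect_state, redirect_state);

    static struct redirect_state *this_cpu_redirect_state(void)
    {
            /* Each CPU only ever touches its own instance, so no cross-CPU
             * locking is needed on this fast path. */
            return this_cpu_ptr(&redirect_state);
    }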