Message-ID: <751fd5bb13a49583b1593fa209bfabc4917290ae.camel@redhat.com>
Subject: Re: [PATCH net-next] net/core: add optional threading for backlog processing
From: Paolo Abeni
To: Felix Fietkau, Jakub Kicinski
Cc: netdev@vger.kernel.org, Jonathan Corbet, "David S. Miller", Eric Dumazet, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Tue, 28 Mar 2023 11:29:24 +0200
In-Reply-To: <20230324104733.571466bc@kernel.org>
References: <20230324171314.73537-1-nbd@nbd.name>
 <20230324102038.7d91355c@kernel.org>
 <2d251879-1cf4-237d-8e62-c42bb4feb047@nbd.name>
 <20230324104733.571466bc@kernel.org>
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2023-03-24 at 18:57 +0100, Felix Fietkau wrote:
> On 24.03.23 18:47, Jakub Kicinski wrote:
> > On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
> > > I'm primarily testing this on routers with 2 or 4 CPUs and limited
> > > processing power, handling routing/NAT. RPS is typically needed to
> > > properly distribute the load across all available CPUs. When there is
> > > only a small number of flows pushing a lot of traffic, a static
> > > RPS assignment often leaves some CPUs idle, whereas others become a
> > > bottleneck by being fully loaded.
> > > Threaded NAPI reduces this a bit, but
> > > CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
> >
> > The NAPI thread becomes a bottleneck with RPS enabled?
>
> The devices that I work with often only have a single rx queue. That can
> easily become a bottleneck.
>
> > > Making backlog processing threaded helps split up the processing work
> > > even more and distribute it onto the remaining idle CPUs.
> >
> > You'd want to have both threaded NAPI and threaded backlog enabled?
>
> Yes
>
> > > It can basically be used to make RPS a bit more dynamic and
> > > configurable, because you can assign multiple backlog threads to a set
> > > of CPUs and selectively steer packets from specific devices / rx queues
> >
> > Can you give an example?
> >
> > With the 4 CPU example, in case 2 queues are very busy - you're trying
> > to make sure that the RPS does not end up landing on the same CPU as
> > the other busy queue?
>
> In this part I'm thinking about bigger systems where you want to have a
> group of CPUs dedicated to dealing with network traffic, without
> assigning a fixed function (e.g. NAPI processing or RPS target) to each
> one, allowing for more dynamic processing.
>
> > > to them and allow the scheduler to take care of the rest.
> >
> > You trust the scheduler much more than I do, I think :)
>
> In my tests it brings down latency (both avg and p99) considerably in
> some cases. I posted some numbers here:
> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/

It's still not 110% clear to me why/how this additional thread could
reduce latency. What/which threads are competing for the busy CPU[s]?

I suspect it could be easier/cleaner to move the other (non-RPS) threads
away instead.

Cheers,

Paolo
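[Editor's note] For readers unfamiliar with the knobs discussed in this thread: static RPS assignment is configured per rx queue by writing a CPU bitmask to `/sys/class/net/<dev>/queues/rx-<n>/rps_cpus`, and threaded NAPI is toggled per device via `/sys/class/net/<dev>/threaded`. The sketch below is illustrative only (the device name, queue number, and CPU choices are assumptions, not taken from the thread); the `cpus_to_mask` helper is hypothetical, but the sysfs paths are the standard kernel interfaces.

```python
def cpus_to_mask(cpus):
    """Build the hex bitmask that rps_cpus expects, e.g. [1, 2] -> '6'."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

def rps_command(dev, queue, cpus):
    """Shell command that statically steers one rx queue's RPS work."""
    return (f"echo {cpus_to_mask(cpus)} > "
            f"/sys/class/net/{dev}/queues/rx-{queue}/rps_cpus")

if __name__ == "__main__":
    # Example static assignment on a 4-CPU router: steer rx-0's RPS
    # work to CPUs 1-3, leaving CPU 0 for the NAPI/rx-interrupt side.
    print(rps_command("eth0", 0, [1, 2, 3]))
    # Threaded NAPI is a separate, per-device knob:
    print("echo 1 > /sys/class/net/eth0/threaded")
```

This illustrates the static nature Felix describes: the mask is fixed, so when only a few heavy flows hash onto the same target CPU, that CPU saturates while the others in the mask stay idle.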