From: Thomas Gleixner
To: Eric Dumazet
Cc: LKML, Linus Torvalds, x86@kernel.org, Wangyang Guo, Arjan van De Ven,
    "David S. Miller", Jakub Kicinski, Paolo Abeni, netdev@vger.kernel.org,
    Will Deacon, Peter Zijlstra, Boqun Feng, Mark Rutland, Marc Zyngier
Subject: Re: [patch 0/3] net, refcount: Address dst_entry reference count scalability issues
Date: Tue, 28 Feb 2023 17:38:25 +0100
Message-ID: <87h6v5n3su.ffs@tglx>
References: <20230228132118.978145284@linutronix.de>

Eric!

On Tue, Feb 28 2023 at 16:07, Eric Dumazet wrote:
> On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner wrote:
>>
>> Hi!
>>
>> Wangyang and Arjan reported a bottleneck in the networking code related
>> to struct dst_entry::__refcnt. Performance tanks massively when
>> concurrency on a dst_entry increases.
>
> We have per-cpu or per-tcp-socket dst though.
>
> Input path is RCU and does not touch dst refcnt.
>
> In real workloads (200Gbit NIC and above), we do not observe
> contention on a dst refcnt.
>
> So it would be nice to know in which case you noticed some issues;
> maybe there is something wrong in some layer.

Two lines further down I explained which benchmark was used, no?

>> This happens when there is a large number of connections to or from
>> the same IP address.
>> The memtier benchmark, when run on the same host as memcached,
>> amplifies this massively. But even over real network connections this
>> issue can be observed, albeit at an obviously smaller scale (due to the
>> network bandwidth limitations in my setup, i.e. 1Gb).
>>
>> atomic_inc_not_zero() is implemented via an atomic_try_cmpxchg() loop,
>> which exposes O(N^2) behaviour under contention with N concurrent
>> operations.
>>
>> Lightweight instrumentation exposed an average of 8!! retry loops per
>> atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
>> running concurrently on 112 CPUs.
>
> User space benchmark <> kernel space.

I know that. The point was to illustrate the non-scalability.

> And we tend not to use 112 cpus for kernel stack processing.
>
> Again, concurrent dst->refcnt changes are quite unlikely.

So unlikely that they stand out in that particular benchmark.

>> The overall gain of both changes for localhost memtier ranges from 1.2X
>> to 3.2X, and from +2% to +5% for networked operations on a 1Gb
>> connection.
>>
>> A micro benchmark which enforces maximized concurrency shows a gain
>> between 1.2X and 4.7X!!!
>
> Can you elaborate on what networking benchmark you have used,
> and what actual gains you got ?

I'm happy to repeat here that it was memtier/memcached, as I explained
more than once in the cover letter.

> In which path access to dst->lwtstate proved to be a problem ?

ip_finish_output2()

    if (lwtunnel_xmit_redirect(dst->lwtstate))   <- This read

> To me, this looks like someone wanted to push a new piece of infra
> (include/linux/rcuref.h) and decided that dst->refcnt would be a
> perfect place.
>
> Not the other way around (noticing there is an issue, enquiring with
> networking folks about it, before designing a solution)

We looked at this because the reference count operations stood out in
perf top, and we analyzed it down to the false sharing _and_ the
non-scalability of atomic_inc_not_zero().

That's what made me think about a different implementation, and yes, the
case at hand, which happens to be in the network code, allowed me to
verify that it actually works and scales better.

I'm terribly sorry that I failed to first ask the network people for
permission to think. Won't happen again.

Thanks,

        tglx
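P.S.: For anyone who wants to see where the retries come from, here is a
minimal userspace sketch of such an inc-not-zero loop using C11 atomics.
It mirrors the structure of atomic_inc_not_zero() built on top of a
try_cmpxchg-style compare-exchange, but it is only an illustration, not
the kernel code, and inc_not_zero() below is a made-up helper name:

#include <stdatomic.h>
#include <stdbool.h>

/* Try to increment @v unless it is zero. Returns false if the counter
 * has already dropped to zero, true if the increment succeeded. */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load_explicit(v, memory_order_relaxed);

	do {
		if (old == 0)
			return false;
		/*
		 * On failure the compare-exchange refreshes @old with the
		 * current counter value, so every concurrent modification
		 * forces another round trip on the contended cache line.
		 * That is the retry behaviour which averaged ~8 loops per
		 * invocation with 112 CPUs hammering one counter.
		 */
	} while (!atomic_compare_exchange_weak(v, &old, old + 1));

	return true;
}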