MIME-Version: 1.0
References: <20230228132118.978145284@linutronix.de> <CANn89iL2pYt2QA2sS4KkXrCSjprz9byE_p+Geom3MTNPMzfFDw@mail.gmail.com>
 <87h6v5n3su.ffs@tglx>
In-Reply-To: <87h6v5n3su.ffs@tglx>
From:   Eric Dumazet <edumazet@google.com>
Date:   Tue, 28 Feb 2023 17:59:52 +0100
Message-ID: <CANn89iL_ey=S=FjkhJ+mk7gabOdVag6ENKnu9GnZkcF31qOaZA@mail.gmail.com>
Subject: Re: [patch 0/3] net, refcount: Address dst_entry reference count
 scalability issues
To:     Thomas Gleixner <tglx@linutronix.de>
Cc:     LKML <linux-kernel@vger.kernel.org>,
        Linus Torvalds <torvalds@linuxfoundation.org>, x86@kernel.org,
        Wangyang Guo <wangyang.guo@intel.com>,
        Arjan van De Ven <arjan@linux.intel.com>,
        "David S. Miller" <davem@davemloft.net>,
        Jakub Kicinski <kuba@kernel.org>,
        Paolo Abeni <pabeni@redhat.com>, netdev@vger.kernel.org,
        Will Deacon <will@kernel.org>,
        Peter Zijlstra <peterz@infradead.org>,
        Boqun Feng <boqun.feng@gmail.com>,
        Mark Rutland <mark.rutland@arm.com>,
        Marc Zyngier <maz@kernel.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Precedence: bulk

On Tue, Feb 28, 2023 at 5:38=E2=80=AFPM Thomas Gleixner <tglx@linutronix.de=
> wrote:
>
> Eric!
>
> On Tue, Feb 28 2023 at 16:07, Eric Dumazet wrote:
> > On Tue, Feb 28, 2023 at 3:33=E2=80=AFPM Thomas Gleixner <tglx@linutroni=
x.de> wrote:
> >>
> >> Hi!
> >>
> >> Wangyang and Arjan reported a bottleneck in the networking code relate=
d to
> >> struct dst_entry::__refcnt. Performance tanks massively when concurren=
cy on
> >> a dst_entry increases.
> >
> > We have per-cpu or per-tcp-socket dst though.
> >
> > Input path is RCU and does not touch dst refcnt.
> >
> > In real workloads (200Gbit NIC and above), we do not observe
> > contention on a dst refcnt.
> >
> > So it would be nice knowing in which case you noticed some issues,
> > maybe there is something wrong in some layer.
>
> Two lines further down I explained which benchmark was used, no?
>
> >> This happens when there are a large amount of connections to or from t=
he
> >> same IP address. The memtier benchmark when run on the same host as
> >> memcached amplifies this massively. But even over real network connect=
ions
> >> this issue can be observed at an obviously smaller scale (due to the
> >> network bandwith limitations in my setup, i.e. 1Gb).
> >>       atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() lo=
op,
> >>       which exposes O(N^2) behaviour under contention with N concurren=
t
> >>       operations.
> >>
> >>       Lightweight instrumentation exposed an average of 8!! retry loop=
s per
> >>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
> >>       running concurrently on 112 CPUs.
> >
> > User space benchmark <> kernel space.
>
> I know that. The point was to illustrate the non-scalability.
>
> > And we tend not using 112 cpus for kernel stack processing.
> >
> > Again, concurrent dst->refcnt changes are quite unlikely.
>
> So unlikely that they stand out in that particular benchmark.
>
> >> The overall gain of both changes for localhost memtier ranges from 1.2=
X to
> >> 3.2X and from +2% to %5% range for networked operations on a 1Gb conne=
ction.
> >>
> >> A micro benchmark which enforces maximized concurrency shows a gain be=
tween
> >> 1.2X and 4.7X!!!
> >
> > Can you elaborate on what networking benchmark you have used,
> > and what actual gains you got ?
>
> I'm happy to repeat here that it was memtier/memcached as I explained
> more than once in the cover letter.
>
> > In which path access to dst->lwtstate proved to be a problem ?
>
> ip_finish_output2()
>    if (lwtunnel_xmit_redirect(dst->lwtstate)) <- This read

This change alone should be easy to measure, please do this ?

Oftentimes, moving a field looks sane, but the cache line access is
simply done later.
For example when refcnt is changed :)

Making dsts one cache line bigger has a performance impact.

>
> > To me, this looks like someone wanted to push a new piece of infra
> > (include/linux/rcuref.h)
> > and decided that dst->refcnt would be a perfect place.
> >
> > Not the other way (noticing there is an issue, enquire networking
> > folks about it, before designing a solution)
>
> We looked at this because the reference count operations stood out in
> perf top and we analyzed it down to the false sharing _and_ the
> non-scalability of atomic_inc_not_zero().
>

Please share your recipe and perf results.

We must have been very lucky to not see this at Google.

tcp_rr workloads show dst_mtu() costs (60k GCU in Google fleet) ,
which outperform the dst refcnt you are mentioning here.