MIME-Version: 1.0
References: <20201103121506.1533-1-liqiang64@huawei.com> <20201103121506.1533-2-liqiang64@huawei.com>
 <20201104175742.GA846@sol.localdomain> <2dad168c-f6cb-103c-04ce-cc3c2561e01b@huawei.com>
In-Reply-To: <2dad168c-f6cb-103c-04ce-cc3c2561e01b@huawei.com>
From:   Ard Biesheuvel <ardb@kernel.org>
Date:   Thu, 5 Nov 2020 08:51:08 +0100
Message-ID: <CAMj1kXG+YJvHLFDMk7ABAD=WthxLx5Uh0LAXCP6+2tXEySj7eg@mail.gmail.com>
Subject: Re: [PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.
To:     Li Qiang <liqiang64@huawei.com>
Cc:     Eric Biggers <ebiggers@kernel.org>,
        Herbert Xu <herbert@gondor.apana.org.au>,
        "David S. Miller" <davem@davemloft.net>,
        Catalin Marinas <catalin.marinas@arm.com>,
        Will Deacon <will@kernel.org>,
        Maxime Coquelin <mcoquelin.stm32@gmail.com>,
        Alexandre Torgue <alexandre.torgue@st.com>,
        Linux ARM <linux-arm-kernel@lists.infradead.org>,
        Linux Crypto Mailing List <linux-crypto@vger.kernel.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Precedence: bulk

On Thu, 5 Nov 2020 at 03:50, Li Qiang <liqiang64@huawei.com> wrote:
>
> Hi Eric,
>
> On 2020/11/5 1:57, Eric Biggers wrote:
> > On Tue, Nov 03, 2020 at 08:15:06PM +0800, l00374334 wrote:
> >> From: liqiang <liqiang64@huawei.com>
> >>
> >>      In the libz library, the checksum algorithm adler32 usually occupies
> >>      a relatively high hot spot, and the SVE instruction set can easily
> >>      accelerate it, so that the performance of libz library will be
> >>      significantly improved.
> >>
> >>      We can divide buf into blocks according to the bit width of SVE,
> >>      and then use vector registers to perform operations in units of blocks
> >>      to achieve the purpose of acceleration.
> >>
> >>      On machines that support ARM64 sve instructions, this algorithm is
> >>      about 3~4 times faster than the algorithm implemented in C language
> >>      in libz. The wider the SVE instruction, the better the acceleration effect.
> >>
> >>      Measured on a Taishan 1951 machine that supports 256bit width SVE,
> >>      below are the results of my measured random data of 1M and 10M:
> >>
> >>              [root@xxx adler32]# ./benchmark 1000000
> >>              Libz alg: Time used:    608 us, 1644.7 Mb/s.
> >>              SVE  alg: Time used:    166 us, 6024.1 Mb/s.
> >>
> >>              [root@xxx adler32]# ./benchmark 10000000
> >>              Libz alg: Time used:   6484 us, 1542.3 Mb/s.
> >>              SVE  alg: Time used:   2034 us, 4916.4 Mb/s.
> >>
> >>      The blocks can be of any size, so the algorithm can automatically adapt
> >>      to SVE hardware with different bit widths without modifying the code.
> >>
> >>
> >> Signed-off-by: liqiang <liqiang64@huawei.com>
> >
> > Note that this patch does nothing to actually wire up the kernel's copy of libz
> > (lib/zlib_{deflate,inflate}/) to use this implementation of Adler32.  To do so,
> > libz would either need to be changed to use the shash API, or you'd need to
> > implement an adler32() function in lib/crypto/ that automatically uses an
> > accelerated implementation if available, and make libz call it.
> >
> > Also, in either case a C implementation would be required too.  There can't be
> > just an architecture-specific implementation.
>
> Okay, thank you for the problems and suggestions you gave. I will continue to
> improve my code.
>
> >
> > Also as others have pointed out, there's probably not much point in having a SVE
> > implementation of Adler32 when there isn't even a NEON implementation yet.  It's
> > not too hard to implement Adler32 using NEON, and there are already several
> > permissively-licensed NEON implementations out there that could be used as a
> > reference, e.g. my implementation using NEON intrinsics here:
> > https://github.com/ebiggers/libdeflate/blob/v1.6/lib/arm/adler32_impl.h
> >
> > - Eric
> >
>
> I am very happy to get this NEON implementation code. :)
>

Note that NEON intrinsics can be compiled for 32-bit ARM as well (with
a bit of care - please refer to lib/raid6/recov_neon_inner.c for an
example of how to deal with intrinsics that are only available on
arm64) and are less error-prone than hand-written assembly, so
intrinsics should be preferred if feasible.
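
To illustrate the kind of care I mean, here is a rough standalone
sketch (userspace C, not kernel code, and not taken from
recov_neon_inner.c or libdeflate). All sketch_* names are made up for
this mail. It wraps vaddvq_u32(), which only exists on arm64, so the
same block loop builds for 32-bit ARM too, and it falls back to plain C
where NEON is not available at all. The per-block horizontal adds keep
the code short; I have not tuned or benchmarked it.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BASE 65521u	/* largest prime below 2^16 */
#define SKETCH_NMAX 5552u	/* max bytes before the 32-bit sums need reducing */

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Horizontal add of four u32 lanes; vaddvq_u32() is arm64-only. */
static inline uint32_t sketch_hsum_u32(uint32x4_t v)
{
#if defined(__aarch64__)
	return vaddvq_u32(v);
#else
	uint32x2_t s = vadd_u32(vget_low_u32(v), vget_high_u32(v));

	s = vpadd_u32(s, s);
	return vget_lane_u32(s, 0);
#endif
}

/* Fold whole 16-byte blocks into *a and *b; returns bytes consumed. */
static size_t sketch_blocks(uint32_t *a, uint32_t *b,
			    const uint8_t *p, size_t len)
{
	static const uint8_t w[16] = { 16, 15, 14, 13, 12, 11, 10, 9,
				       8, 7, 6, 5, 4, 3, 2, 1 };
	const uint8x16_t wv = vld1q_u8(w);
	size_t done = 0;

	for (; len - done >= 16; done += 16) {
		uint8x16_t d = vld1q_u8(p + done);
		uint16x8_t lo = vmull_u8(vget_low_u8(d), vget_low_u8(wv));
		uint16x8_t hi = vmull_u8(vget_high_u8(d), vget_high_u8(wv));

		/* b gains 16 * a plus the position-weighted byte sum */
		*b += 16 * *a + sketch_hsum_u32(vaddq_u32(vpaddlq_u16(lo),
							   vpaddlq_u16(hi)));
		*a += sketch_hsum_u32(vpaddlq_u16(vpaddlq_u8(d)));
	}
	return done;
}
#else
/* No NEON: consume nothing and let the scalar loop do all the work. */
static size_t sketch_blocks(uint32_t *a, uint32_t *b,
			    const uint8_t *p, size_t len)
{
	(void)a; (void)b; (void)p; (void)len;
	return 0;
}
#endif

uint32_t sketch_adler32(uint32_t adler, const uint8_t *p, size_t len)
{
	uint32_t a = adler & 0xffff, b = adler >> 16;

	while (len) {
		size_t n = len < SKETCH_NMAX ? len : SKETCH_NMAX;
		size_t i = sketch_blocks(&a, &b, p, n);

		for (; i < n; i++) {	/* scalar tail (or everything) */
			a += p[i];
			b += a;
		}
		a %= SKETCH_BASE;
		b %= SKETCH_BASE;
		p += n;
		len -= n;
	}
	return (b << 16) | a;
}
```

Starting from the usual initial value 1, sketch_adler32(1, "abc", 3)
should give 0x024d0127 on either path. In the kernel, the NEON path
would additionally have to run between kernel_neon_begin() and
kernel_neon_end().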

However, you have still not explained how optimizing Adler32 makes a
difference for a real-world use case. Where is libdeflate used on a
hot path?