Received-SPF: pass (google.com: domain of linux-crypto-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
MIME-Version: 1.0
References: <CGME20230103111359eucas1p137cb823bdc80d790544de20c3835faf2@eucas1p1.samsung.com>
 <Y7NizHFsWfMW/cC2@sol.localdomain> <dleftjbknfoopx.fsf%l.stelmach@samsung.com>
In-Reply-To: <dleftjbknfoopx.fsf%l.stelmach@samsung.com>
From:   Ard Biesheuvel <ardb@kernel.org>
Date:   Tue, 3 Jan 2023 15:01:47 +0100
Message-ID: <CAMj1kXGY4zZovOKY5kD54pFEXeOoX=3JCuHVCDpQJf+Wo+oBiw@mail.gmail.com>
Subject: Re: xor_blocks() assumptions
To:     Lukasz Stelmach <l.stelmach@samsung.com>
Cc:     Eric Biggers <ebiggers@kernel.org>,
        Herbert Xu <herbert@gondor.apana.org.au>,
        "David S. Miller" <davem@davemloft.net>,
        linux-crypto@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Tue, 3 Jan 2023 at 12:14, Lukasz Stelmach <l.stelmach@samsung.com> wrote:
>
> It was <2023-01-02 pon 15:03>, when Eric Biggers wrote:
> > On Mon, Jan 02, 2023 at 11:44:35PM +0100, Lukasz Stelmach wrote:
> >> I am researching possibility to use xor_blocks() in crypto_xor() and
> >> crypto_xor_cpy(). What I've found already is that different architecture
> >> dependent xor functions work on different blocks between 16 and 512
> >> (Intel AVX) bytes long. There is a hint in the comment for
> >> async_xor_offs() that src_cnt (as passed to do_sync_xor_offs()) counts
> >> pages. Thus, it is assumed, that the smallest chunk xor_blocks() gets is
> >> a single page. Am I right?
> >>
> >> Do you think adding block_len field to struct xor_block_template (and
> >> maybe some information about required alignment) and using it to call
> >> do_2 from crypto_xor() may work? I am thinking especially about disk
> >> encryption where sectors of 512~4096 are handled.
> >>
> >
> > Taking a step back, it sounds like you think the word-at-a-time XOR in
> > crypto_xor() is not performant enough, so you want to use a SIMD (e.g. NEON,
> > SSE, or AVX) implementation instead.
>
> Yes.
>
> > Have you tested that this would actually give a benefit on the input
> > sizes in question,
>
> --8<---------------cut here---------------start------------->8---
> [    0.938006] xor: measuring software checksum speed
> [    0.947383]    crypto          :  1052 MB/sec
> [    0.953299]    arm4regs        :  1689 MB/sec
> [    0.960674]    8regs           :  1352 MB/sec
> [    0.968033]    32regs          :  1352 MB/sec
> [    0.972078]    neon            :  2448 MB/sec
> --8<---------------cut here---------------end--------------->8---
>

This doesn't really answer the question. This only tells us that NEON
is faster on this core when XOR'ing the same cache-hot page multiple
times in succession.

So the question is really which crypto algorithm you intend to
accelerate with this change, and the input sizes it operates on.

If the algo in question is the generic XTS template wrapped around a
h/w accelerated implementation of ECB, I suspect that we would be
better off wiring up xor_blocks() into xts.c, rather than modifying
crypto_xor() and crypto_xor_cpy(). Many other calls to crypto_xor()
operate on small buffers where this optimization is unlikely to help.
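
To make the small-buffer point concrete, here is a minimal userspace
sketch (not kernel code) of a length-based dispatch. xor_small(),
xor_bulk(), xor_dispatch() and XOR_BULK_THRESHOLD are hypothetical
names, and the 512-byte cut-off is an assumption picked only because it
matches the smallest sector size mentioned in this thread; it is not a
measured value.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-off: an assumption for illustration, not a measurement. */
#define XOR_BULK_THRESHOLD 512

/* Counts how often the bulk path is taken, so the dispatch can be observed. */
static unsigned long bulk_calls;

/* Byte-wise XOR: stands in for the existing path used on small buffers. */
static void xor_small(uint8_t *dst, const uint8_t *src, size_t len)
{
	while (len--)
		*dst++ ^= *src++;
}

/*
 * Stand-in for a block-oriented routine such as xor_blocks().
 * Plain C here; a real one would have block-size and alignment rules.
 */
static void xor_bulk(uint8_t *dst, const uint8_t *src, size_t len)
{
	size_t i;

	bulk_calls++;
	for (i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/* Only inputs at or above the threshold are sent to the bulk routine. */
static void xor_dispatch(uint8_t *dst, const uint8_t *src, size_t len)
{
	if (len >= XOR_BULK_THRESHOLD)
		xor_bulk(dst, src, len);
	else
		xor_small(dst, src, len);
}
```

With a check like this in a shared helper, every small call pays for the
extra branch, which is one more reason the change may belong in the one
caller that actually sees large inputs.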

So to rephrase the question: which invocation of crypto_xor() is the
bottleneck in the use case you are trying to optimize?


> (Linux 6.2.0-rc1 running on Odroid XU3 board with Arm Cortex-A15)
>
> The patch below copies, adapts and plugs in __crypto_xor() as
> xor_block_crypto.do_2. You can see its results labeled "crypto" above.
> Disk encryption is comparable to RAID5 checksumming so the results above
> should be adequate.
>
> > especially considering that SIMD can only be used in the kernel if
> > kernel_fpu_begin() is executed first?
>
> That depends on architecture. As far as I can tell this applies to Intel
> only.
>

On ARM, you must use kernel_neon_begin()/kernel_neon_end() when using
SIMD in kernel mode, which comes down to the same thing.
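
A rough sketch of the calling pattern, written so that it builds in
userspace: may_use_simd(), kernel_neon_begin() and kernel_neon_end()
are stubbed out below (the kernel provides the real ones), and
xor_scalar(), xor_neon_stub() and xor_maybe_simd() are hypothetical
names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stubs (assumption): the real functions live in the kernel. */
static int neon_depth;          /* > 0 while inside a begin/end section */
static int neon_sections;       /* number of sections entered so far */
static bool simd_usable = true; /* lets a caller simulate "SIMD not allowed" */

static bool may_use_simd(void) { return simd_usable; }
static void kernel_neon_begin(void) { neon_depth++; neon_sections++; }
static void kernel_neon_end(void) { neon_depth--; }

/* Scalar fallback for contexts where SIMD may not be used. */
static void xor_scalar(uint8_t *dst, const uint8_t *src, size_t len)
{
	while (len--)
		*dst++ ^= *src++;
}

/* Stand-in for a NEON routine; plain C so the sketch runs anywhere. */
static void xor_neon_stub(uint8_t *dst, const uint8_t *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/* Returns true if the (stubbed) SIMD path was taken. */
static bool xor_maybe_simd(uint8_t *dst, const uint8_t *src, size_t len)
{
	if (!may_use_simd()) {
		xor_scalar(dst, src, len);
		return false;
	}

	kernel_neon_begin();            /* enter the SIMD section */
	xor_neon_stub(dst, src, len);
	kernel_neon_end();              /* leave it again */
	return true;
}
```

The begin/end pair is a fixed cost per call, so it only pays off when
each call covers enough data.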

> > It also would be worth considering just optimizing crypto_xor() by
> > unrolling the word-at-a-time loop to 4x or so.
>
> If I understand correctly the generic 8regs and 32regs implementations
> in include/asm-generic/xor.h are what you mean. Using xor_blocks() in
> crypto_xor() could enable them for free on architectures lacking SIMD or
> vector instructions.
>
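
For reference, the 4x unrolling suggested above could look roughly like
the userspace sketch below. xor_unrolled4() is a hypothetical name, and
this is not the code in crypto_xor() or in include/asm-generic/xor.h;
memcpy() is used for the word loads and stores so that the sketch makes
no alignment assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR src into dst, four machine words per loop iteration. */
static void xor_unrolled4(uint8_t *dst, const uint8_t *src, size_t len)
{
	while (len >= 4 * sizeof(unsigned long)) {
		unsigned long d[4], s[4];

		/* Copy via memcpy() so unaligned pointers are safe. */
		memcpy(d, dst, sizeof(d));
		memcpy(s, src, sizeof(s));

		d[0] ^= s[0];
		d[1] ^= s[1];
		d[2] ^= s[2];
		d[3] ^= s[3];

		memcpy(dst, d, sizeof(d));

		dst += sizeof(d);
		src += sizeof(d);
		len -= sizeof(d);
	}

	/* Tail: whatever is left, one byte at a time. */
	while (len--)
		*dst++ ^= *src++;
}
```

Unlike a SIMD routine, this needs no special section around it and
works for any length, including the short buffers most callers pass.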

-- 
Ard.