Date:   Tue, 3 Jan 2023 23:46:44 -0800
From:   Eric Biggers <ebiggers@kernel.org>
To:     Lukasz Stelmach <l.stelmach@samsung.com>
Cc:     Herbert Xu <herbert@gondor.apana.org.au>,
        "David S. Miller" <davem@davemloft.net>,
        linux-crypto@vger.kernel.org
Subject: Re: xor_blocks() assumptions
Message-ID: <Y7Uu5GkxfrejPJXL@sol.localdomain>
References: <Y7NizHFsWfMW/cC2@sol.localdomain>
 <CGME20230103111359eucas1p137cb823bdc80d790544de20c3835faf2@eucas1p1.samsung.com>
 <dleftjbknfoopx.fsf%l.stelmach@samsung.com>
In-Reply-To: <dleftjbknfoopx.fsf%l.stelmach@samsung.com>

On Tue, Jan 03, 2023 at 12:13:30PM +0100, Lukasz Stelmach wrote:
> > It also would be worth considering just optimizing crypto_xor() by
> > unrolling the word-at-a-time loop to 4x or so.
> 
> If I understand correctly the generic 8regs and 32regs implementations
> in include/asm-generic/xor.h are what you mean. Using xor_blocks() in
> crypto_xor() could enable them for free on architectures lacking SIMD or
> vector instructions.

I actually meant exactly what I said -- unrolling the word-at-a-time loop in
crypto_xor().  Not using xor_blocks().  Something like this:

diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
index 61b327206b557..c0b90f14cae18 100644
--- a/include/crypto/algapi.h
+++ b/include/crypto/algapi.h
@@ -167,7 +167,18 @@ static inline void crypto_xor(u8 *dst, const u8 *src, unsigned int size)
 		unsigned long *s = (unsigned long *)src;
 		unsigned long l;
 
-		while (size > 0) {
+		while (size >= 4 * sizeof(unsigned long)) {
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			size -= 4 * sizeof(unsigned long);
+		}
+		while (size > 0) {
 			l = get_unaligned(d) ^ get_unaligned(s++);
 			put_unaligned(l, d++);
 			size -= sizeof(unsigned long);

Actually, the compiler might unroll the loop automatically anyway, so the above
change might not even be necessary.  The point is, I expect that a proper scalar
implementation will perform well for pretty much anything other than large input
sizes.
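
For anyone who wants to experiment outside the kernel, here is a minimal
userspace sketch of the same 4x unrolled word-at-a-time idea.  It is not the
kernel code: load_word() and store_word() are hypothetical stand-ins for
get_unaligned() and put_unaligned(), built on memcpy(), and the function
assumes the size is a multiple of the word size.  Since up to three words can
remain after the unrolled loop, the tail is handled with a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's get_unaligned(). */
static unsigned long load_word(const unsigned char *p)
{
	unsigned long v;

	memcpy(&v, p, sizeof(v));	/* safe for any alignment */
	return v;
}

/* Hypothetical userspace stand-in for the kernel's put_unaligned(). */
static void store_word(unsigned char *p, unsigned long v)
{
	memcpy(p, &v, sizeof(v));
}

/* dst ^= src.  Assumes size is a multiple of sizeof(unsigned long). */
static void xor_words_unrolled(unsigned char *dst, const unsigned char *src,
			       size_t size)
{
	const size_t w = sizeof(unsigned long);

	assert(size % w == 0);

	/* Main loop: four words per iteration. */
	while (size >= 4 * w) {
		store_word(dst,         load_word(dst)         ^ load_word(src));
		store_word(dst + w,     load_word(dst + w)     ^ load_word(src + w));
		store_word(dst + 2 * w, load_word(dst + 2 * w) ^ load_word(src + 2 * w));
		store_word(dst + 3 * w, load_word(dst + 3 * w) ^ load_word(src + 3 * w));
		dst += 4 * w;
		src += 4 * w;
		size -= 4 * w;
	}

	/* Tail: zero to three remaining words. */
	while (size > 0) {
		store_word(dst, load_word(dst) ^ load_word(src));
		dst += w;
		src += w;
		size -= w;
	}
}
```

Comparing the compiler output for this against the plain one-word loop at -O2
is a quick way to see whether the manual unrolling buys anything on a given
target.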

It's only for large input sizes that xor_blocks() might be worth it, considering
the significant overhead of the indirect call in xor_blocks() as well as
entering a SIMD code section.  (Note that indirect calls are very expensive
these days, due to the speculative execution mitigations.)
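
To make that trade-off concrete, here is a rough userspace sketch of size-based
dispatch: small inputs stay on an inline scalar path, and only large inputs pay
for the indirect call.  Everything in it is hypothetical and for illustration
only: XOR_BULK_THRESHOLD, xor_dispatch() and xor_bulk_standin() are not kernel
APIs, and a real cutoff would have to come from measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff; a real value would have to come from benchmarks. */
#define XOR_BULK_THRESHOLD 512

typedef void (*xor_bulk_fn)(unsigned char *dst, const unsigned char *src,
			    size_t size);

/* Counts bulk-path calls so the dispatch decision can be observed. */
static unsigned int bulk_calls;

/* Stand-in for an optimized routine reached through a function pointer. */
static void xor_bulk_standin(unsigned char *dst, const unsigned char *src,
			     size_t size)
{
	bulk_calls++;
	for (size_t i = 0; i < size; i++)
		dst[i] ^= src[i];
}

/* Inline scalar path: no indirect call and no SIMD section to enter. */
static inline void xor_scalar(unsigned char *dst, const unsigned char *src,
			      size_t size)
{
	for (size_t i = 0; i < size; i++)
		dst[i] ^= src[i];
}

/* Take the indirect call only when the input is large enough to amortize it. */
static void xor_dispatch(unsigned char *dst, const unsigned char *src,
			 size_t size, xor_bulk_fn bulk)
{
	assert(dst != NULL && src != NULL);

	if (bulk != NULL && size >= XOR_BULK_THRESHOLD)
		bulk(dst, src, size);
	else
		xor_scalar(dst, src, size);
}
```

With this shape, a 16-byte XOR never touches the function pointer at all, while
a multi-kilobyte buffer takes the bulk path once per call.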

Of course, the real question is what real-world scenario you are actually trying
to optimize for...

- Eric