2018-09-04 08:06:42

by Ondrej Mosnacek

[permalink] [raw]
Subject: [PATCH v2] crypto: xts - Drop use of auxiliary buffer

Since commit acb9b159c784 ("crypto: gf128mul - define gf128mul_x_* in
gf128mul.h"), the gf128mul_x_*() functions are very fast and therefore
caching the computed XTS tweaks has only negligible advantage over
computing them twice.

In fact, since the current caching implementation limits the size of
the calls to the child ecb(...) algorithm to PAGE_SIZE (usually 4096 B),
it is often actually slower than the simple recomputing implementation.

This patch simplifies the XTS template to recompute the XTS tweaks from
scratch in the second pass and thus also removes the need to allocate a
dynamic buffer using kmalloc().

As discussed at [1], the use of kmalloc causes deadlocks with dm-crypt.

PERFORMANCE RESULTS
I measured time to encrypt/decrypt a memory buffer of varying sizes with
xts(ecb-aes-aesni) using a tool I wrote ([2]) and the results suggest
that after this patch the performance is either better or comparable for
both small and large buffers. Note that there is a lot of noise in the
measurements, but the overall difference is easy to see.

Old code:
ALGORITHM KEY (b) DATA (B) TIME ENC (ns) TIME DEC (ns)
xts(aes) 256 64 331 328
xts(aes) 384 64 332 333
xts(aes) 512 64 338 348
xts(aes) 256 512 889 920
xts(aes) 384 512 1019 993
xts(aes) 512 512 1032 990
xts(aes) 256 4096 2152 2292
xts(aes) 384 4096 2453 2597
xts(aes) 512 4096 3041 2641
xts(aes) 256 16384 9443 8027
xts(aes) 384 16384 8536 8925
xts(aes) 512 16384 9232 9417
xts(aes) 256 32768 16383 14897
xts(aes) 384 32768 17527 16102
xts(aes) 512 32768 18483 17322

New code:
ALGORITHM KEY (b) DATA (B) TIME ENC (ns) TIME DEC (ns)
xts(aes) 256 64 328 324
xts(aes) 384 64 324 319
xts(aes) 512 64 320 322
xts(aes) 256 512 476 473
xts(aes) 384 512 509 492
xts(aes) 512 512 531 514
xts(aes) 256 4096 2132 1829
xts(aes) 384 4096 2357 2055
xts(aes) 512 4096 2178 2027
xts(aes) 256 16384 6920 6983
xts(aes) 384 16384 8597 7505
xts(aes) 512 16384 7841 8164
xts(aes) 256 32768 13468 12307
xts(aes) 384 32768 14808 13402
xts(aes) 512 32768 15753 14636

[1] https://lkml.org/lkml/2018/8/23/1315
[2] https://gitlab.com/omos/linux-crypto-bench

Cc: Mikulas Patocka <[email protected]>
Signed-off-by: Ondrej Mosnacek <[email protected]>
---
crypto/xts.c | 258 ++++++---------------------------------------------
1 file changed, 30 insertions(+), 228 deletions(-)

v1: https://www.spinics.net/lists/linux-crypto/msg34776.html
Changes in v2:
- rebase to latest cryptodev tree

diff --git a/crypto/xts.c b/crypto/xts.c
index ccf55fbb8bc2..6c49e76df8ca 100644
--- a/crypto/xts.c
+++ b/crypto/xts.c
@@ -26,8 +26,6 @@
#include <crypto/b128ops.h>
#include <crypto/gf128mul.h>

-#define XTS_BUFFER_SIZE 128u
-
struct priv {
struct crypto_skcipher *child;
struct crypto_cipher *tweak;
@@ -39,19 +37,7 @@ struct xts_instance_ctx {
};

struct rctx {
- le128 buf[XTS_BUFFER_SIZE / sizeof(le128)];
-
le128 t;
-
- le128 *ext;
-
- struct scatterlist srcbuf[2];
- struct scatterlist dstbuf[2];
- struct scatterlist *src;
- struct scatterlist *dst;
-
- unsigned int left;
-
struct skcipher_request subreq;
};

@@ -96,265 +82,81 @@ static int setkey(struct crypto_skcipher *parent, const u8 *key,
return err;
}

-static int post_crypt(struct skcipher_request *req)
+static int xor_tweak(struct rctx *rctx, struct skcipher_request *req)
{
- struct rctx *rctx = skcipher_request_ctx(req);
- le128 *buf = rctx->ext ?: rctx->buf;
- struct skcipher_request *subreq;
const int bs = XTS_BLOCK_SIZE;
struct skcipher_walk w;
- struct scatterlist *sg;
- unsigned offset;
+ le128 t = rctx->t;
int err;

- subreq = &rctx->subreq;
- err = skcipher_walk_virt(&w, subreq, false);
+ err = skcipher_walk_virt(&w, req, false);

while (w.nbytes) {
unsigned int avail = w.nbytes;
+ le128 *wsrc;
le128 *wdst;

+ wsrc = w.src.virt.addr;
wdst = w.dst.virt.addr;

do {
- le128_xor(wdst, buf++, wdst);
- wdst++;
+ le128_xor(wdst++, &t, wsrc++);
+ gf128mul_x_ble(&t, &t);
} while ((avail -= bs) >= bs);

err = skcipher_walk_done(&w, avail);
}

- rctx->left -= subreq->cryptlen;
-
- if (err || !rctx->left)
- goto out;
-
- rctx->dst = rctx->dstbuf;
-
- scatterwalk_done(&w.out, 0, 1);
- sg = w.out.sg;
- offset = w.out.offset;
-
- if (rctx->dst != sg) {
- rctx->dst[0] = *sg;
- sg_unmark_end(rctx->dst);
- scatterwalk_crypto_chain(rctx->dst, sg_next(sg), 2);
- }
- rctx->dst[0].length -= offset - sg->offset;
- rctx->dst[0].offset = offset;
-
-out:
return err;
}

-static int pre_crypt(struct skcipher_request *req)
+static void crypt_done(struct crypto_async_request *areq, int err)
{
+ struct skcipher_request *req = areq->data;
struct rctx *rctx = skcipher_request_ctx(req);
- le128 *buf = rctx->ext ?: rctx->buf;
- struct skcipher_request *subreq;
- const int bs = XTS_BLOCK_SIZE;
- struct skcipher_walk w;
- struct scatterlist *sg;
- unsigned cryptlen;
- unsigned offset;
- bool more;
- int err;
-
- subreq = &rctx->subreq;
- cryptlen = subreq->cryptlen;
-
- more = rctx->left > cryptlen;
- if (!more)
- cryptlen = rctx->left;
-
- skcipher_request_set_crypt(subreq, rctx->src, rctx->dst,
- cryptlen, NULL);
-
- err = skcipher_walk_virt(&w, subreq, false);
-
- while (w.nbytes) {
- unsigned int avail = w.nbytes;
- le128 *wsrc;
- le128 *wdst;
-
- wsrc = w.src.virt.addr;
- wdst = w.dst.virt.addr;
-
- do {
- *buf++ = rctx->t;
- le128_xor(wdst++, &rctx->t, wsrc++);
- gf128mul_x_ble(&rctx->t, &rctx->t);
- } while ((avail -= bs) >= bs);
-
- err = skcipher_walk_done(&w, avail);
- }
-
- skcipher_request_set_crypt(subreq, rctx->dst, rctx->dst,
- cryptlen, NULL);
+ struct skcipher_request *subreq = &rctx->subreq;

- if (err || !more)
- goto out;
+ if (!err)
+ err = xor_tweak(rctx, subreq);

- rctx->src = rctx->srcbuf;
-
- scatterwalk_done(&w.in, 0, 1);
- sg = w.in.sg;
- offset = w.in.offset;
-
- if (rctx->src != sg) {
- rctx->src[0] = *sg;
- sg_unmark_end(rctx->src);
- scatterwalk_crypto_chain(rctx->src, sg_next(sg), 2);
- }
- rctx->src[0].length -= offset - sg->offset;
- rctx->src[0].offset = offset;
-
-out:
- return err;
+ skcipher_request_complete(req, err);
}

-static int init_crypt(struct skcipher_request *req, crypto_completion_t done)
+static void init_crypt(struct skcipher_request *req)
{
struct priv *ctx = crypto_skcipher_ctx(crypto_skcipher_reqtfm(req));
struct rctx *rctx = skcipher_request_ctx(req);
- struct skcipher_request *subreq;
- gfp_t gfp;
+ struct skcipher_request *subreq = &rctx->subreq;

- subreq = &rctx->subreq;
skcipher_request_set_tfm(subreq, ctx->child);
- skcipher_request_set_callback(subreq, req->base.flags, done, req);
-
- gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ? GFP_KERNEL :
- GFP_ATOMIC;
- rctx->ext = NULL;
-
- subreq->cryptlen = XTS_BUFFER_SIZE;
- if (req->cryptlen > XTS_BUFFER_SIZE) {
- unsigned int n = min(req->cryptlen, (unsigned int)PAGE_SIZE);
-
- rctx->ext = kmalloc(n, gfp);
- if (rctx->ext)
- subreq->cryptlen = n;
- }
-
- rctx->src = req->src;
- rctx->dst = req->dst;
- rctx->left = req->cryptlen;
+ skcipher_request_set_callback(subreq, req->base.flags, crypt_done, req);
+ skcipher_request_set_crypt(subreq, req->dst, req->dst,
+ req->cryptlen, NULL);

/* calculate first value of T */
crypto_cipher_encrypt_one(ctx->tweak, (u8 *)&rctx->t, req->iv);
-
- return 0;
-}
-
-static void exit_crypt(struct skcipher_request *req)
-{
- struct rctx *rctx = skcipher_request_ctx(req);
-
- rctx->left = 0;
-
- if (rctx->ext)
- kzfree(rctx->ext);
-}
-
-static int do_encrypt(struct skcipher_request *req, int err)
-{
- struct rctx *rctx = skcipher_request_ctx(req);
- struct skcipher_request *subreq;
-
- subreq = &rctx->subreq;
-
- while (!err && rctx->left) {
- err = pre_crypt(req) ?:
- crypto_skcipher_encrypt(subreq) ?:
- post_crypt(req);
-
- if (err == -EINPROGRESS || err == -EBUSY)
- return err;
- }
-
- exit_crypt(req);
- return err;
-}
-
-static void encrypt_done(struct crypto_async_request *areq, int err)
-{
- struct skcipher_request *req = areq->data;
- struct skcipher_request *subreq;
- struct rctx *rctx;
-
- rctx = skcipher_request_ctx(req);
-
- if (err == -EINPROGRESS) {
- if (rctx->left != req->cryptlen)
- return;
- goto out;
- }
-
- subreq = &rctx->subreq;
- subreq->base.flags &= CRYPTO_TFM_REQ_MAY_BACKLOG;
-
- err = do_encrypt(req, err ?: post_crypt(req));
- if (rctx->left)
- return;
-
-out:
- skcipher_request_complete(req, err);
}

static int encrypt(struct skcipher_request *req)
-{
- return do_encrypt(req, init_crypt(req, encrypt_done));
-}
-
-static int do_decrypt(struct skcipher_request *req, int err)
{
struct rctx *rctx = skcipher_request_ctx(req);
- struct skcipher_request *subreq;
-
- subreq = &rctx->subreq;
+ struct skcipher_request *subreq = &rctx->subreq;

- while (!err && rctx->left) {
- err = pre_crypt(req) ?:
- crypto_skcipher_decrypt(subreq) ?:
- post_crypt(req);
-
- if (err == -EINPROGRESS || err == -EBUSY)
- return err;
- }
-
- exit_crypt(req);
- return err;
-}
-
-static void decrypt_done(struct crypto_async_request *areq, int err)
-{
- struct skcipher_request *req = areq->data;
- struct skcipher_request *subreq;
- struct rctx *rctx;
-
- rctx = skcipher_request_ctx(req);
-
- if (err == -EINPROGRESS) {
- if (rctx->left != req->cryptlen)
- return;
- goto out;
- }
-
- subreq = &rctx->subreq;
- subreq->base.flags &= CRYPTO_TFM_REQ_MAY_BACKLOG;
-
- err = do_decrypt(req, err ?: post_crypt(req));
- if (rctx->left)
- return;
-
-out:
- skcipher_request_complete(req, err);
+ init_crypt(req);
+ return xor_tweak(rctx, req) ?:
+ crypto_skcipher_encrypt(subreq) ?:
+ xor_tweak(rctx, subreq);
}

static int decrypt(struct skcipher_request *req)
{
- return do_decrypt(req, init_crypt(req, decrypt_done));
+ struct rctx *rctx = skcipher_request_ctx(req);
+ struct skcipher_request *subreq = &rctx->subreq;
+
+ init_crypt(req);
+ return xor_tweak(rctx, req) ?:
+ crypto_skcipher_decrypt(subreq) ?:
+ xor_tweak(rctx, subreq);
}

static int init_tfm(struct crypto_skcipher *tfm)
--
2.17.1


2018-09-05 08:35:54

by Ondrej Mosnacek

[permalink] [raw]
Subject: Re: [PATCH v2] crypto: xts - Drop use of auxiliary buffer

Hi Eric,

On Wed, Sep 5, 2018 at 8:32 AM Eric Biggers <[email protected]> wrote:
> Hi Ondrej,
>
> On Tue, Sep 04, 2018 at 10:06:42AM +0200, Ondrej Mosnacek wrote:
> > Since commit acb9b159c784 ("crypto: gf128mul - define gf128mul_x_* in
> > gf128mul.h"), the gf128mul_x_*() functions are very fast and therefore
> > caching the computed XTS tweaks has only negligible advantage over
> > computing them twice.
> >
> > In fact, since the current caching implementation limits the size of
> > the calls to the child ecb(...) algorithm to PAGE_SIZE (usually 4096 B),
> > it is often actually slower than the simple recomputing implementation.
> >
> > This patch simplifies the XTS template to recompute the XTS tweaks from
> > scratch in the second pass and thus also removes the need to allocate a
> > dynamic buffer using kmalloc().
> >
> > As discussed at [1], the use of kmalloc causes deadlocks with dm-crypt.
> >
> > PERFORMANCE RESULTS
> > I measured time to encrypt/decrypt a memory buffer of varying sizes with
> > xts(ecb-aes-aesni) using a tool I wrote ([2]) and the results suggest
> > that after this patch the performance is either better or comparable for
> > both small and large buffers. Note that there is a lot of noise in the
> > measurements, but the overall difference is easy to see.
> >
> > Old code:
> > ALGORITHM KEY (b) DATA (B) TIME ENC (ns) TIME DEC (ns)
> > xts(aes) 256 64 331 328
> > xts(aes) 384 64 332 333
> > xts(aes) 512 64 338 348
> > xts(aes) 256 512 889 920
> > xts(aes) 384 512 1019 993
> > xts(aes) 512 512 1032 990
> > xts(aes) 256 4096 2152 2292
> > xts(aes) 384 4096 2453 2597
> > xts(aes) 512 4096 3041 2641
> > xts(aes) 256 16384 9443 8027
> > xts(aes) 384 16384 8536 8925
> > xts(aes) 512 16384 9232 9417
> > xts(aes) 256 32768 16383 14897
> > xts(aes) 384 32768 17527 16102
> > xts(aes) 512 32768 18483 17322
> >
> > New code:
> > ALGORITHM KEY (b) DATA (B) TIME ENC (ns) TIME DEC (ns)
> > xts(aes) 256 64 328 324
> > xts(aes) 384 64 324 319
> > xts(aes) 512 64 320 322
> > xts(aes) 256 512 476 473
> > xts(aes) 384 512 509 492
> > xts(aes) 512 512 531 514
> > xts(aes) 256 4096 2132 1829
> > xts(aes) 384 4096 2357 2055
> > xts(aes) 512 4096 2178 2027
> > xts(aes) 256 16384 6920 6983
> > xts(aes) 384 16384 8597 7505
> > xts(aes) 512 16384 7841 8164
> > xts(aes) 256 32768 13468 12307
> > xts(aes) 384 32768 14808 13402
> > xts(aes) 512 32768 15753 14636
>
> Can you align the headers of these tables?

Sure.

>
> > +static int xor_tweak(struct rctx *rctx, struct skcipher_request *req)
> > {
> > - struct rctx *rctx = skcipher_request_ctx(req);
> > - le128 *buf = rctx->ext ?: rctx->buf;
> > - struct skcipher_request *subreq;
> > const int bs = XTS_BLOCK_SIZE;
> > struct skcipher_walk w;
> > - struct scatterlist *sg;
> > - unsigned offset;
> > + le128 t = rctx->t;
> > int err;
>
> Maybe you could add a brief comment above xor_tweak() explaining the design
> choice for posterity, e.g.:
>
> /*
> * We compute the tweak masks twice (both before and after the ECB encryption or
> * decryption) to avoid having to allocate a temporary buffer, which usually
> * would be slower than just doing the gf128mul_x_ble() calls again.
> */

Definitely, that's a good idea! I'll put something like that into v3.

>
> Otherwise this looks good. Thanks for doing this!
>
> The new implementation isn't *guaranteed* to be faster, but it should be most of
> the time, and it's definitely much simpler. And the current one has had bugs.

Yes, it's not guaranteed, but the complexity of gfmul should be always
small enough when compared to a block cipher call. If it is small
enough compared to AES-NI, which is already crazy-fast, then should be
fine in all realistic cases. Either way, the massive simplification
should be worth it even with a minor slowdown.

>
> Note that if ever needed there's also still room for optimizing the GF(2^128)
> multiplications further, e.g. multiplying by 'x' and 'x^2' in parallel, or maybe
> having a version specialized for 32-bit processors.
>
> FYI, I think that 'subreq' can have an alignmask insufficient for 'le128', which
> can cause misaligned accesses during the second xor_tweak(). But, the current
> version has that bug too...

That's a good point, I haven't thought of that... I think I could fix
it with a clever one-liner in my version, but ideally someone should
write a patch for the old version, too, so it can get fixed in stable
as well... but that would be more work than I planned to spend on
this, I should really be doing other things right now :)

BTW, I noticed later that crypto/lrw.c uses a very similar pattern
(with kmalloc and calls to ECB). I am now trying to simplify it in the
same way, but in this case the xor_tweak operation seems to be slower
and the difference is more noticeable. I have managed to optimize it
quite a bit and bring the difference down, hopefully enough to mandate
the simplification. I will send a patch later with some concrete
numbers.

>
> Reviewed-by: Eric Biggers <[email protected]>
>
> - Eric

Thanks,

--
Ondrej Mosnacek <omosnace at redhat dot com>
Associate Software Engineer, Security Technologies
Red Hat, Inc.

2018-09-05 15:30:14

by Samuel Neves

[permalink] [raw]
Subject: Re: [PATCH v2] crypto: xts - Drop use of auxiliary buffer

On Wed, Sep 5, 2018 at 7:32 AM, Eric Biggers <[email protected]> wrote:
> Note that if ever needed there's also still room for optimizing the GF(2^128)
> multiplications further, e.g. multiplying by 'x' and 'x^2' in parallel, or maybe
> having a version specialized for 32-bit processors.

Given that this is used to encrypt small buffers only, skipping ahead
seems like it may also be a viable strategy. For example, for the XTS
polynomial x^128 + x^7 + x^2 + x + 1 one can multiply by x^64 very
efficiently with

u128 skip64(u128 x)
{
u128 b64 = (x >> 64);
u128 b63 = (x >> 63) & ~(u128)0x01;
u128 b62 = (x >> 62) & ~(u128)0x03;
u128 b57 = (x >> 57) & ~(u128)0x7f;
return (x << 64) ^ (b64 ^ b63 ^ b62 ^ b57);
}

Calling this twice skips exactly 128 blocks, in which case we can xor
both halves of a 4096-byte sector in parallel.