2017-03-31 09:28:06

by Ondrej Mosnáček

Subject: [PATCH v3] crypto: gf128mul - define gf128mul_x_* in gf128mul.h

The gf128mul_x_ble function is currently defined in gf128mul.c, because
it depends on the gf128mul_table_be multiplication table.

However, since the function is very small and only uses two values from
the table, it is better for it to be defined as an inline function in
gf128mul.h. That way, the function can be inlined by the compiler for
better performance.

For consistency, the other gf128mul_x_* functions are also moved to the
header file. In addition, the code is rewritten to be constant-time.

After this change, the speed of the generic 'xts(aes)' implementation
increased from ~225 MiB/s to ~235 MiB/s (measured using 'cryptsetup
benchmark -c aes-xts-plain64' on an Intel system with CRYPTO_AES_X86_64
and CRYPTO_AES_NI_INTEL disabled).

Signed-off-by: Ondrej Mosnacek <[email protected]>
Cc: Eric Biggers <[email protected]>
---
v2 -> v3: constant-time implementation
v1 -> v2: move all _x_ functions to the header, not just gf128mul_x_ble

crypto/gf128mul.c | 33 +---------------------------
include/crypto/gf128mul.h | 55 +++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 54 insertions(+), 34 deletions(-)

diff --git a/crypto/gf128mul.c b/crypto/gf128mul.c
index 04facc0..dc01212 100644
--- a/crypto/gf128mul.c
+++ b/crypto/gf128mul.c
@@ -130,43 +130,12 @@ static const u16 gf128mul_table_le[256] = gf128mul_dat(xda_le);
 static const u16 gf128mul_table_be[256] = gf128mul_dat(xda_be);

 /*
- * The following functions multiply a field element by x or by x^8 in
+ * The following functions multiply a field element by x^8 in
  * the polynomial field representation. They use 64-bit word operations
  * to gain speed but compensate for machine endianness and hence work
  * correctly on both styles of machine.
  */

-static void gf128mul_x_lle(be128 *r, const be128 *x)
-{
-	u64 a = be64_to_cpu(x->a);
-	u64 b = be64_to_cpu(x->b);
-	u64 _tt = gf128mul_table_le[(b << 7) & 0xff];
-
-	r->b = cpu_to_be64((b >> 1) | (a << 63));
-	r->a = cpu_to_be64((a >> 1) ^ (_tt << 48));
-}
-
-static void gf128mul_x_bbe(be128 *r, const be128 *x)
-{
-	u64 a = be64_to_cpu(x->a);
-	u64 b = be64_to_cpu(x->b);
-	u64 _tt = gf128mul_table_be[a >> 63];
-
-	r->a = cpu_to_be64((a << 1) | (b >> 63));
-	r->b = cpu_to_be64((b << 1) ^ _tt);
-}
-
-void gf128mul_x_ble(be128 *r, const be128 *x)
-{
-	u64 a = le64_to_cpu(x->a);
-	u64 b = le64_to_cpu(x->b);
-	u64 _tt = gf128mul_table_be[b >> 63];
-
-	r->a = cpu_to_le64((a << 1) ^ _tt);
-	r->b = cpu_to_le64((b << 1) | (a >> 63));
-}
-EXPORT_SYMBOL(gf128mul_x_ble);
-
 static void gf128mul_x8_lle(be128 *x)
 {
 	u64 a = be64_to_cpu(x->a);
diff --git a/include/crypto/gf128mul.h b/include/crypto/gf128mul.h
index 0bc9b5f..6e43be5 100644
--- a/include/crypto/gf128mul.h
+++ b/include/crypto/gf128mul.h
@@ -49,6 +49,7 @@
 #ifndef _CRYPTO_GF128MUL_H
 #define _CRYPTO_GF128MUL_H

+#include <asm/byteorder.h>
 #include <crypto/b128ops.h>
 #include <linux/slab.h>

@@ -163,8 +164,58 @@ void gf128mul_lle(be128 *a, const be128 *b);

 void gf128mul_bbe(be128 *a, const be128 *b);

-/* multiply by x in ble format, needed by XTS */
-void gf128mul_x_ble(be128 *a, const be128 *b);
+/*
+ * The following functions multiply a field element by x in
+ * the polynomial field representation. They use 64-bit word operations
+ * to gain speed but compensate for machine endianness and hence work
+ * correctly on both styles of machine.
+ *
+ * They are defined here for performance.
+ */
+
+static inline u64 gf128mul_mask_from_bit(u64 x, int which)
+{
+	/* a constant-time version of 'x & ((u64)1 << which) ? (u64)-1 : 0' */
+	return ((s64)(x << (63 - which)) >> 63);
+}
+
+static inline void gf128mul_x_lle(be128 *r, const be128 *x)
+{
+	u64 a = be64_to_cpu(x->a);
+	u64 b = be64_to_cpu(x->b);
+
+	/* equivalent to gf128mul_table_le[(b << 7) & 0xff] >> 8
+	 * (see crypto/gf128mul.c): */
+	u64 _tt = gf128mul_mask_from_bit(b, 0) & 0xe1;
+
+	r->b = cpu_to_be64((b >> 1) | (a << 63));
+	r->a = cpu_to_be64((a >> 1) ^ (_tt << 56));
+}
+
+static inline void gf128mul_x_bbe(be128 *r, const be128 *x)
+{
+	u64 a = be64_to_cpu(x->a);
+	u64 b = be64_to_cpu(x->b);
+
+	/* equivalent to gf128mul_table_be[a >> 63] (see crypto/gf128mul.c): */
+	u64 _tt = gf128mul_mask_from_bit(a, 63) & 0x87;
+
+	r->a = cpu_to_be64((a << 1) | (b >> 63));
+	r->b = cpu_to_be64((b << 1) ^ _tt);
+}
+
+/* needed by XTS */
+static inline void gf128mul_x_ble(be128 *r, const be128 *x)
+{
+	u64 a = le64_to_cpu(x->a);
+	u64 b = le64_to_cpu(x->b);
+
+	/* equivalent to gf128mul_table_be[b >> 63] (see crypto/gf128mul.c): */
+	u64 _tt = gf128mul_mask_from_bit(b, 63) & 0x87;
+
+	r->a = cpu_to_le64((a << 1) ^ _tt);
+	r->b = cpu_to_le64((b << 1) | (a >> 63));
+}

 /* 4k table optimization */

--
2.9.3
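
[Editor's illustration, not part of the patch.] The constant-time trick in
gf128mul_mask_from_bit() can be tried outside the kernel. The sketch below is
a userspace approximation under stated assumptions: 'struct gf128', the
'mul_x_ble_*' names and 'check_same_result' are hypothetical, the two halves
are kept in host byte order (so the cpu_to_le64/le64_to_cpu conversions are
omitted), and the signed right shift is assumed to be arithmetic, as it is on
mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's 128-bit type, host-order halves. */
struct gf128 {
	uint64_t lo;	/* bits 0..63 of the field element */
	uint64_t hi;	/* bits 64..127 of the field element */
};

/*
 * All-ones if bit 'which' of x is set, zero otherwise. Mirrors the idea of
 * the patch's gf128mul_mask_from_bit(). Assumption: right-shifting a
 * negative int64_t is arithmetic (implementation-defined in ISO C).
 */
static uint64_t mask_from_bit(uint64_t x, int which)
{
	return (uint64_t)((int64_t)(x << (63 - which)) >> 63);
}

/* Reference version: selects the reduction constant with a conditional. */
static void mul_x_ble_branchy(struct gf128 *r, const struct gf128 *x)
{
	uint64_t lo = x->lo, hi = x->hi;
	uint64_t tt = (hi >> 63) ? 0x87 : 0;

	r->lo = (lo << 1) ^ tt;
	r->hi = (hi << 1) | (lo >> 63);
}

/* Mask version: no data-dependent branch or table lookup. */
static void mul_x_ble_ct(struct gf128 *r, const struct gf128 *x)
{
	uint64_t lo = x->lo, hi = x->hi;
	uint64_t tt = mask_from_bit(hi, 63) & 0x87;

	r->lo = (lo << 1) ^ tt;
	r->hi = (hi << 1) | (lo >> 63);
}

/* Asserts that both versions agree for one input. */
static void check_same_result(uint64_t lo, uint64_t hi)
{
	struct gf128 x = { lo, hi }, r1, r2;

	mul_x_ble_branchy(&r1, &x);
	mul_x_ble_ct(&r2, &x);
	assert(r1.lo == r2.lo && r1.hi == r2.hi);
}
```

Calling check_same_result() on inputs with the top bit of 'hi' both set and
clear exercises the two cases the old table lookup distinguished.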


2017-04-01 03:44:10

by Eric Biggers

Subject: Re: [PATCH v3] crypto: gf128mul - define gf128mul_x_* in gf128mul.h

On Fri, Mar 31, 2017 at 11:27:03AM +0200, Ondrej Mosnacek wrote:
> The gf128mul_x_ble function is currently defined in gf128mul.c, because
> it depends on the gf128mul_table_be multiplication table.
>
> However, since the function is very small and only uses two values from
> the table, it is better for it to be defined as an inline function in
> gf128mul.h. That way, the function can be inlined by the compiler for
> better performance.
>
> For consistency, the other gf128mul_x_* functions are also moved to the
> header file. In addition, the code is rewritten to be constant-time.
>
> After this change, the speed of the generic 'xts(aes)' implementation
> increased from ~225 MiB/s to ~235 MiB/s (measured using 'cryptsetup
> benchmark -c aes-xts-plain64' on an Intel system with CRYPTO_AES_X86_64
> and CRYPTO_AES_NI_INTEL disabled).
>
> Signed-off-by: Ondrej Mosnacek <[email protected]>
> Cc: Eric Biggers <[email protected]>

Reviewed-by: Eric Biggers <[email protected]>

Also, I realized that for gf128mul_x_lle() now that we aren't using the table we
don't need to shift '_tt' but rather can use the constant 0xe100000000000000:

	/* equivalent to (u64)gf128mul_table_le[(b << 7) & 0xff] << 48
	 * (see crypto/gf128mul.c): */
	u64 _tt = gf128mul_mask_from_bit(b, 0) & 0xe100000000000000;

	r->b = cpu_to_be64((b >> 1) | (a << 63));
	r->a = cpu_to_be64((a >> 1) ^ _tt);

I think that would be better and you could send a v4 to do it that way if you
want. It's not a huge deal though.

Thanks!

- Eric
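
[Editor's illustration, not part of the thread.] The two forms of the lle
reduction discussed above should give identical results, since masking with
0xe1 and shifting left by 56 equals masking with 0xe100000000000000 directly.
The sketch below checks that in userspace. The names 'struct gf128_lle',
'mask_from_bit0', 'mul_x_lle_shift', 'mul_x_lle_const' and
'check_same_reduction' are hypothetical, the halves are held in host byte
order (the be64 conversions are omitted), and the mask is built with unsigned
negation rather than the patch's signed shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-order stand-in for be128 as used by the lle code. */
struct gf128_lle {
	uint64_t a;	/* first 8 bytes, read as a big-endian value */
	uint64_t b;	/* last 8 bytes, read as a big-endian value */
};

/* All-ones if bit 0 of x is set, zero otherwise (unsigned wraparound). */
static uint64_t mask_from_bit0(uint64_t x)
{
	return 0 - (x & 1);
}

/* v3 form: small constant, then a shift into the top byte. */
static void mul_x_lle_shift(struct gf128_lle *r, const struct gf128_lle *x)
{
	uint64_t a = x->a, b = x->b;
	uint64_t tt = mask_from_bit0(b) & 0xe1;

	r->b = (b >> 1) | (a << 63);
	r->a = (a >> 1) ^ (tt << 56);
}

/* Suggested form: constant already positioned, no shift needed. */
static void mul_x_lle_const(struct gf128_lle *r, const struct gf128_lle *x)
{
	uint64_t a = x->a, b = x->b;
	uint64_t tt = mask_from_bit0(b) & 0xe100000000000000ULL;

	r->b = (b >> 1) | (a << 63);
	r->a = (a >> 1) ^ tt;
}

/* Asserts that both forms agree for one input. */
static void check_same_reduction(uint64_t a, uint64_t b)
{
	struct gf128_lle x = { a, b }, r1, r2;

	mul_x_lle_shift(&r1, &x);
	mul_x_lle_const(&r2, &x);
	assert(r1.a == r2.a && r1.b == r2.b);
}
```

Whether a given compiler folds the shift into the constant by itself is a
separate question that only the generated assembly can answer.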

2017-04-01 15:14:13

by Ondrej Mosnáček

Subject: Re: [PATCH v3] crypto: gf128mul - define gf128mul_x_* in gf128mul.h

2017-04-01 5:44 GMT+02:00 Eric Biggers <[email protected]>:
> Also, I realized that for gf128mul_x_lle() now that we aren't using the table we
> don't need to shift '_tt' but rather can use the constant 0xe100000000000000:
>
>	/* equivalent to (u64)gf128mul_table_le[(b << 7) & 0xff] << 48
>	 * (see crypto/gf128mul.c): */
>	u64 _tt = gf128mul_mask_from_bit(b, 0) & 0xe100000000000000;
>
>	r->b = cpu_to_be64((b >> 1) | (a << 63));
>	r->a = cpu_to_be64((a >> 1) ^ _tt);
>
> I think that would be better and you could send a v4 to do it that way if you
> want. It's not a huge deal though.

Yes, I was hoping the compiler would be wise enough to fold the shift
into the constant, but I didn't actually check the assembly output...
I took the time to write a quick benchmark and the version without
shift is indeed notably faster.

That said, I'll go the extra mile and send a v4.

Thanks for the review!

O.M.