2020-10-21 10:04:51

by Arvind Sankar

Subject: [PATCH v2 0/6] crypto: lib/sha256 - cleanup/optimization

Patch 1 -- Use memzero_explicit() instead of structure assignment/plain
memset() to clear sensitive state.

Patch 2 -- I am not sure about this one: currently the temporary
variables used in the generic sha256 implementation are cleared, but the
clearing is optimized away due to the lack of compiler barriers. I don't
think it's really necessary to clear them, but I'm not a cryptanalyst,
so I would like comments on whether it's indeed safe not to clear them,
or whether we should instead add the required barriers to force the
clearing. A sketch of the issue follows.
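
To make the dead-store problem concrete: a memset() (or plain
assignment) to a local that is never read again is a dead store the
compiler may legally elide, while memzero_explicit() follows the stores
with a compiler barrier so they survive optimization. A minimal sketch,
using a hypothetical helper that is not part of these patches:

	#include <linux/types.h>
	#include <linux/string.h>

	static void f(void)
	{
		u32 tmp[16];

		/* ... compute sensitive intermediate values in tmp ... */

		/* Dead store: the compiler is free to drop this. */
		memset(tmp, 0, sizeof(tmp));

		/* Barriered: the compiler must emit the stores. */
		memzero_explicit(tmp, sizeof(tmp));
	}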

The last four patches are optimizations for generic sha256.

v2:
- Add patch to combine K and W arrays, suggested by David
- Reformat SHA256_ROUND() macro a little

Arvind Sankar (6):
crypto: Use memzero_explicit() for clearing state
crypto: lib/sha256 - Don't clear temporary variables
crypto: lib/sha256 - Clear W[] in sha256_update() instead of
sha256_transform()
crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64
crypto: lib/sha256 - Unroll LOAD and BLEND loops
crypto: lib/sha - Combine round constants and message schedule

include/crypto/sha1_base.h | 3 +-
include/crypto/sha256_base.h | 3 +-
include/crypto/sha512_base.h | 3 +-
include/crypto/sm3_base.h | 3 +-
lib/crypto/sha256.c | 211 +++++++++++------------------------
5 files changed, 71 insertions(+), 152 deletions(-)

--
2.26.2


2020-10-21 10:05:37

by Arvind Sankar

Subject: [PATCH v2 4/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64

This reduces code size substantially (on x86_64 with gcc-10 the size of
sha256_update() goes from 7593 bytes to 1952 bytes including the new
SHA256_K array), and on x86 is slightly faster than the full unroll
(tesed on Broadwell Xeon).

Signed-off-by: Arvind Sankar <[email protected]>
---
lib/crypto/sha256.c | 166 ++++++++------------------------------------
1 file changed, 30 insertions(+), 136 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index c6bfeacc5b81..5efd390706c6 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -18,6 +18,17 @@
#include <crypto/sha.h>
#include <asm/unaligned.h>

+static const u32 SHA256_K[] = {
+ 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+ 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+ 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+ 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+ 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+ 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+ 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+ 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
+};
+
static inline u32 Ch(u32 x, u32 y, u32 z)
{
return z ^ (x & (y ^ z));
@@ -43,9 +54,17 @@ static inline void BLEND_OP(int I, u32 *W)
W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
}

+#define SHA256_ROUND(i, a, b, c, d, e, f, g, h) do { \
+ u32 t1, t2; \
+ t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i]; \
+ t2 = e0(a) + Maj(a, b, c); \
+ d += t1; \
+ h = t1 + t2; \
+} while (0)
+
static void sha256_transform(u32 *state, const u8 *input, u32 *W)
{
- u32 a, b, c, d, e, f, g, h, t1, t2;
+ u32 a, b, c, d, e, f, g, h;
int i;

/* load the input */
@@ -61,141 +80,16 @@ static void sha256_transform(u32 *state, const u8 *input, u32 *W)
e = state[4]; f = state[5]; g = state[6]; h = state[7];

/* now iterate */
- t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
- t2 = e0(a) + Maj(a, b, c); d += t1; h = t1 + t2;
- t1 = g + e1(d) + Ch(d, e, f) + 0x71374491 + W[1];
- t2 = e0(h) + Maj(h, a, b); c += t1; g = t1 + t2;
- t1 = f + e1(c) + Ch(c, d, e) + 0xb5c0fbcf + W[2];
- t2 = e0(g) + Maj(g, h, a); b += t1; f = t1 + t2;
- t1 = e + e1(b) + Ch(b, c, d) + 0xe9b5dba5 + W[3];
- t2 = e0(f) + Maj(f, g, h); a += t1; e = t1 + t2;
- t1 = d + e1(a) + Ch(a, b, c) + 0x3956c25b + W[4];
- t2 = e0(e) + Maj(e, f, g); h += t1; d = t1 + t2;
- t1 = c + e1(h) + Ch(h, a, b) + 0x59f111f1 + W[5];
- t2 = e0(d) + Maj(d, e, f); g += t1; c = t1 + t2;
- t1 = b + e1(g) + Ch(g, h, a) + 0x923f82a4 + W[6];
- t2 = e0(c) + Maj(c, d, e); f += t1; b = t1 + t2;
- t1 = a + e1(f) + Ch(f, g, h) + 0xab1c5ed5 + W[7];
- t2 = e0(b) + Maj(b, c, d); e += t1; a = t1 + t2;
-
- t1 = h + e1(e) + Ch(e, f, g) + 0xd807aa98 + W[8];
- t2 = e0(a) + Maj(a, b, c); d += t1; h = t1 + t2;
- t1 = g + e1(d) + Ch(d, e, f) + 0x12835b01 + W[9];
- t2 = e0(h) + Maj(h, a, b); c += t1; g = t1 + t2;
- t1 = f + e1(c) + Ch(c, d, e) + 0x243185be + W[10];
- t2 = e0(g) + Maj(g, h, a); b += t1; f = t1 + t2;
- t1 = e + e1(b) + Ch(b, c, d) + 0x550c7dc3 + W[11];
- t2 = e0(f) + Maj(f, g, h); a += t1; e = t1 + t2;
- t1 = d + e1(a) + Ch(a, b, c) + 0x72be5d74 + W[12];
- t2 = e0(e) + Maj(e, f, g); h += t1; d = t1 + t2;
- t1 = c + e1(h) + Ch(h, a, b) + 0x80deb1fe + W[13];
- t2 = e0(d) + Maj(d, e, f); g += t1; c = t1 + t2;
- t1 = b + e1(g) + Ch(g, h, a) + 0x9bdc06a7 + W[14];
- t2 = e0(c) + Maj(c, d, e); f += t1; b = t1 + t2;
- t1 = a + e1(f) + Ch(f, g, h) + 0xc19bf174 + W[15];
- t2 = e0(b) + Maj(b, c, d); e += t1; a = t1 + t2;
-
- t1 = h + e1(e) + Ch(e, f, g) + 0xe49b69c1 + W[16];
- t2 = e0(a) + Maj(a, b, c); d += t1; h = t1 + t2;
- t1 = g + e1(d) + Ch(d, e, f) + 0xefbe4786 + W[17];
- t2 = e0(h) + Maj(h, a, b); c += t1; g = t1 + t2;
- t1 = f + e1(c) + Ch(c, d, e) + 0x0fc19dc6 + W[18];
- t2 = e0(g) + Maj(g, h, a); b += t1; f = t1 + t2;
- t1 = e + e1(b) + Ch(b, c, d) + 0x240ca1cc + W[19];
- t2 = e0(f) + Maj(f, g, h); a += t1; e = t1 + t2;
- t1 = d + e1(a) + Ch(a, b, c) + 0x2de92c6f + W[20];
- t2 = e0(e) + Maj(e, f, g); h += t1; d = t1 + t2;
- t1 = c + e1(h) + Ch(h, a, b) + 0x4a7484aa + W[21];
- t2 = e0(d) + Maj(d, e, f); g += t1; c = t1 + t2;
- t1 = b + e1(g) + Ch(g, h, a) + 0x5cb0a9dc + W[22];
- t2 = e0(c) + Maj(c, d, e); f += t1; b = t1 + t2;
- t1 = a + e1(f) + Ch(f, g, h) + 0x76f988da + W[23];
- t2 = e0(b) + Maj(b, c, d); e += t1; a = t1 + t2;
-
- t1 = h + e1(e) + Ch(e, f, g) + 0x983e5152 + W[24];
- t2 = e0(a) + Maj(a, b, c); d += t1; h = t1 + t2;
- t1 = g + e1(d) + Ch(d, e, f) + 0xa831c66d + W[25];
- t2 = e0(h) + Maj(h, a, b); c += t1; g = t1 + t2;
- t1 = f + e1(c) + Ch(c, d, e) + 0xb00327c8 + W[26];
- t2 = e0(g) + Maj(g, h, a); b += t1; f = t1 + t2;
- t1 = e + e1(b) + Ch(b, c, d) + 0xbf597fc7 + W[27];
- t2 = e0(f) + Maj(f, g, h); a += t1; e = t1 + t2;
- t1 = d + e1(a) + Ch(a, b, c) + 0xc6e00bf3 + W[28];
- t2 = e0(e) + Maj(e, f, g); h += t1; d = t1 + t2;
- t1 = c + e1(h) + Ch(h, a, b) + 0xd5a79147 + W[29];
- t2 = e0(d) + Maj(d, e, f); g += t1; c = t1 + t2;
- t1 = b + e1(g) + Ch(g, h, a) + 0x06ca6351 + W[30];
- t2 = e0(c) + Maj(c, d, e); f += t1; b = t1 + t2;
- t1 = a + e1(f) + Ch(f, g, h) + 0x14292967 + W[31];
- t2 = e0(b) + Maj(b, c, d); e += t1; a = t1 + t2;
-
- t1 = h + e1(e) + Ch(e, f, g) + 0x27b70a85 + W[32];
- t2 = e0(a) + Maj(a, b, c); d += t1; h = t1 + t2;
- t1 = g + e1(d) + Ch(d, e, f) + 0x2e1b2138 + W[33];
- t2 = e0(h) + Maj(h, a, b); c += t1; g = t1 + t2;
- t1 = f + e1(c) + Ch(c, d, e) + 0x4d2c6dfc + W[34];
- t2 = e0(g) + Maj(g, h, a); b += t1; f = t1 + t2;
- t1 = e + e1(b) + Ch(b, c, d) + 0x53380d13 + W[35];
- t2 = e0(f) + Maj(f, g, h); a += t1; e = t1 + t2;
- t1 = d + e1(a) + Ch(a, b, c) + 0x650a7354 + W[36];
- t2 = e0(e) + Maj(e, f, g); h += t1; d = t1 + t2;
- t1 = c + e1(h) + Ch(h, a, b) + 0x766a0abb + W[37];
- t2 = e0(d) + Maj(d, e, f); g += t1; c = t1 + t2;
- t1 = b + e1(g) + Ch(g, h, a) + 0x81c2c92e + W[38];
- t2 = e0(c) + Maj(c, d, e); f += t1; b = t1 + t2;
- t1 = a + e1(f) + Ch(f, g, h) + 0x92722c85 + W[39];
- t2 = e0(b) + Maj(b, c, d); e += t1; a = t1 + t2;
-
- t1 = h + e1(e) + Ch(e, f, g) + 0xa2bfe8a1 + W[40];
- t2 = e0(a) + Maj(a, b, c); d += t1; h = t1 + t2;
- t1 = g + e1(d) + Ch(d, e, f) + 0xa81a664b + W[41];
- t2 = e0(h) + Maj(h, a, b); c += t1; g = t1 + t2;
- t1 = f + e1(c) + Ch(c, d, e) + 0xc24b8b70 + W[42];
- t2 = e0(g) + Maj(g, h, a); b += t1; f = t1 + t2;
- t1 = e + e1(b) + Ch(b, c, d) + 0xc76c51a3 + W[43];
- t2 = e0(f) + Maj(f, g, h); a += t1; e = t1 + t2;
- t1 = d + e1(a) + Ch(a, b, c) + 0xd192e819 + W[44];
- t2 = e0(e) + Maj(e, f, g); h += t1; d = t1 + t2;
- t1 = c + e1(h) + Ch(h, a, b) + 0xd6990624 + W[45];
- t2 = e0(d) + Maj(d, e, f); g += t1; c = t1 + t2;
- t1 = b + e1(g) + Ch(g, h, a) + 0xf40e3585 + W[46];
- t2 = e0(c) + Maj(c, d, e); f += t1; b = t1 + t2;
- t1 = a + e1(f) + Ch(f, g, h) + 0x106aa070 + W[47];
- t2 = e0(b) + Maj(b, c, d); e += t1; a = t1 + t2;
-
- t1 = h + e1(e) + Ch(e, f, g) + 0x19a4c116 + W[48];
- t2 = e0(a) + Maj(a, b, c); d += t1; h = t1 + t2;
- t1 = g + e1(d) + Ch(d, e, f) + 0x1e376c08 + W[49];
- t2 = e0(h) + Maj(h, a, b); c += t1; g = t1 + t2;
- t1 = f + e1(c) + Ch(c, d, e) + 0x2748774c + W[50];
- t2 = e0(g) + Maj(g, h, a); b += t1; f = t1 + t2;
- t1 = e + e1(b) + Ch(b, c, d) + 0x34b0bcb5 + W[51];
- t2 = e0(f) + Maj(f, g, h); a += t1; e = t1 + t2;
- t1 = d + e1(a) + Ch(a, b, c) + 0x391c0cb3 + W[52];
- t2 = e0(e) + Maj(e, f, g); h += t1; d = t1 + t2;
- t1 = c + e1(h) + Ch(h, a, b) + 0x4ed8aa4a + W[53];
- t2 = e0(d) + Maj(d, e, f); g += t1; c = t1 + t2;
- t1 = b + e1(g) + Ch(g, h, a) + 0x5b9cca4f + W[54];
- t2 = e0(c) + Maj(c, d, e); f += t1; b = t1 + t2;
- t1 = a + e1(f) + Ch(f, g, h) + 0x682e6ff3 + W[55];
- t2 = e0(b) + Maj(b, c, d); e += t1; a = t1 + t2;
-
- t1 = h + e1(e) + Ch(e, f, g) + 0x748f82ee + W[56];
- t2 = e0(a) + Maj(a, b, c); d += t1; h = t1 + t2;
- t1 = g + e1(d) + Ch(d, e, f) + 0x78a5636f + W[57];
- t2 = e0(h) + Maj(h, a, b); c += t1; g = t1 + t2;
- t1 = f + e1(c) + Ch(c, d, e) + 0x84c87814 + W[58];
- t2 = e0(g) + Maj(g, h, a); b += t1; f = t1 + t2;
- t1 = e + e1(b) + Ch(b, c, d) + 0x8cc70208 + W[59];
- t2 = e0(f) + Maj(f, g, h); a += t1; e = t1 + t2;
- t1 = d + e1(a) + Ch(a, b, c) + 0x90befffa + W[60];
- t2 = e0(e) + Maj(e, f, g); h += t1; d = t1 + t2;
- t1 = c + e1(h) + Ch(h, a, b) + 0xa4506ceb + W[61];
- t2 = e0(d) + Maj(d, e, f); g += t1; c = t1 + t2;
- t1 = b + e1(g) + Ch(g, h, a) + 0xbef9a3f7 + W[62];
- t2 = e0(c) + Maj(c, d, e); f += t1; b = t1 + t2;
- t1 = a + e1(f) + Ch(f, g, h) + 0xc67178f2 + W[63];
- t2 = e0(b) + Maj(b, c, d); e += t1; a = t1 + t2;
+ for (i = 0; i < 64; i += 8) {
+ SHA256_ROUND(i + 0, a, b, c, d, e, f, g, h);
+ SHA256_ROUND(i + 1, h, a, b, c, d, e, f, g);
+ SHA256_ROUND(i + 2, g, h, a, b, c, d, e, f);
+ SHA256_ROUND(i + 3, f, g, h, a, b, c, d, e);
+ SHA256_ROUND(i + 4, e, f, g, h, a, b, c, d);
+ SHA256_ROUND(i + 5, d, e, f, g, h, a, b, c);
+ SHA256_ROUND(i + 6, c, d, e, f, g, h, a, b);
+ SHA256_ROUND(i + 7, b, c, d, e, f, g, h, a);
+ }

state[0] += a; state[1] += b; state[2] += c; state[3] += d;
state[4] += e; state[5] += f; state[6] += g; state[7] += h;
--
2.26.2

2020-10-21 10:05:58

by Arvind Sankar

Subject: [PATCH v2 5/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops

Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
(tested on Broadwell Xeon) while not increasing code size too much.

Signed-off-by: Arvind Sankar <[email protected]>
---
lib/crypto/sha256.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 5efd390706c6..3a8802d5f747 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -68,12 +68,28 @@ static void sha256_transform(u32 *state, const u8 *input, u32 *W)
int i;

/* load the input */
- for (i = 0; i < 16; i++)
- LOAD_OP(i, W, input);
+ for (i = 0; i < 16; i += 8) {
+ LOAD_OP(i + 0, W, input);
+ LOAD_OP(i + 1, W, input);
+ LOAD_OP(i + 2, W, input);
+ LOAD_OP(i + 3, W, input);
+ LOAD_OP(i + 4, W, input);
+ LOAD_OP(i + 5, W, input);
+ LOAD_OP(i + 6, W, input);
+ LOAD_OP(i + 7, W, input);
+ }

/* now blend */
- for (i = 16; i < 64; i++)
- BLEND_OP(i, W);
+ for (i = 16; i < 64; i += 8) {
+ BLEND_OP(i + 0, W);
+ BLEND_OP(i + 1, W);
+ BLEND_OP(i + 2, W);
+ BLEND_OP(i + 3, W);
+ BLEND_OP(i + 4, W);
+ BLEND_OP(i + 5, W);
+ BLEND_OP(i + 6, W);
+ BLEND_OP(i + 7, W);
+ }

/* load the state into our registers */
a = state[0]; b = state[1]; c = state[2]; d = state[3];
--
2.26.2

2020-10-21 10:06:26

by Arvind Sankar

Subject: [PATCH v2 3/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform()

The temporary W[] array is currently zeroed out once every call to
sha256_transform(), i.e. once every 64 bytes of input data. Moving it to
sha256_update() instead so that it is cleared only once per update can
save about 2-3% of the total time taken to compute the digest, with a
reasonable memset() implementation, and considerably more (~20%) with a
bad one (e.g. the x86 purgatory currently uses a memset() coded in C).

Signed-off-by: Arvind Sankar <[email protected]>
---
lib/crypto/sha256.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 099cd11f83c1..c6bfeacc5b81 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -43,10 +43,9 @@ static inline void BLEND_OP(int I, u32 *W)
W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
}

-static void sha256_transform(u32 *state, const u8 *input)
+static void sha256_transform(u32 *state, const u8 *input, u32 *W)
{
u32 a, b, c, d, e, f, g, h, t1, t2;
- u32 W[64];
int i;

/* load the input */
@@ -200,15 +199,13 @@ static void sha256_transform(u32 *state, const u8 *input)

state[0] += a; state[1] += b; state[2] += c; state[3] += d;
state[4] += e; state[5] += f; state[6] += g; state[7] += h;
-
- /* clear any sensitive info... */
- memzero_explicit(W, 64 * sizeof(u32));
}

void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
{
unsigned int partial, done;
const u8 *src;
+ u32 W[64];

partial = sctx->count & 0x3f;
sctx->count += len;
@@ -223,11 +220,13 @@ void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
}

do {
- sha256_transform(sctx->state, src);
+ sha256_transform(sctx->state, src, W);
done += 64;
src = data + done;
} while (done + 63 < len);

+ memzero_explicit(W, sizeof(W));
+
partial = 0;
}
memcpy(sctx->buf + partial, src, len - done);
--
2.26.2

2020-10-21 10:06:57

by Arvind Sankar

Subject: [PATCH v2 2/6] crypto: lib/sha256 - Don't clear temporary variables

The assignments to clear a through h and t1/t2 are optimized out by the
compiler because they are unused after the assignments.

These variables shouldn't be very sensitive: t1/t2 can be calculated
from a through h, so they don't reveal any additional information.
Knowing a through h is equivalent to knowing one 64-byte block's SHA256
hash (with non-standard initial value) which, assuming SHA256 is secure,
doesn't reveal any information about the input.

Signed-off-by: Arvind Sankar <[email protected]>
---
lib/crypto/sha256.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index d43bc39ab05e..099cd11f83c1 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -202,7 +202,6 @@ static void sha256_transform(u32 *state, const u8 *input)
state[4] += e; state[5] += f; state[6] += g; state[7] += h;

/* clear any sensitive info... */
- a = b = c = d = e = f = g = h = t1 = t2 = 0;
memzero_explicit(W, 64 * sizeof(u32));
}

--
2.26.2

2020-10-22 08:24:44

by Eric Biggers

Subject: Re: [PATCH v2 2/6] crypto: lib/sha256 - Don't clear temporary variables

On Tue, Oct 20, 2020 at 04:39:53PM -0400, Arvind Sankar wrote:
> The assignments to clear a through h and t1/t2 are optimized out by the
> compiler because they are unused after the assignments.
>
> These variables shouldn't be very sensitive: t1/t2 can be calculated
> from a through h, so they don't reveal any additional information.
> Knowing a through h is equivalent to knowing one 64-byte block's SHA256
> hash (with non-standard initial value) which, assuming SHA256 is secure,
> doesn't reveal any information about the input.
>
> Signed-off-by: Arvind Sankar <[email protected]>

I don't entirely buy the second paragraph. It could be the case that the input
is less than or equal to one SHA-256 block (64 bytes), in which case leaking
'a' through 'h' would reveal the final SHA-256 hash if the input length is
known. And note that callers might consider either the input, the resulting
hash, or both to be sensitive information -- it depends.

> ---
> lib/crypto/sha256.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> index d43bc39ab05e..099cd11f83c1 100644
> --- a/lib/crypto/sha256.c
> +++ b/lib/crypto/sha256.c
> @@ -202,7 +202,6 @@ static void sha256_transform(u32 *state, const u8 *input)
> state[4] += e; state[5] += f; state[6] += g; state[7] += h;
>
> /* clear any sensitive info... */
> - a = b = c = d = e = f = g = h = t1 = t2 = 0;
> memzero_explicit(W, 64 * sizeof(u32));
> }

Your change itself is fine, though. As you mentioned, these assignments get
optimized out, so they weren't accomplishing anything.

The fact is, there just isn't any way to guarantee in C code that all sensitive
variables get cleared.

So we shouldn't (and generally don't) bother trying to clear individual u32's,
ints, etc. like this, but rather only structs and arrays, as clearing those is
more likely to work as intended.
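
(As a sketch of that preferred pattern -- illustrative, not a literal
hunk from patch 1:

	memzero_explicit(sctx, sizeof(*sctx));	/* whole struct: clearing survives */

rather than zeroing each scalar temporary by assignment.)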

- Eric

2020-10-22 08:25:01

by Eric Biggers

Subject: Re: [PATCH v2 3/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform()

On Tue, Oct 20, 2020 at 04:39:54PM -0400, Arvind Sankar wrote:
> The temporary W[] array is currently zeroed out once every call to
> sha256_transform(), i.e. once every 64 bytes of input data. Moving it to
> sha256_update() instead so that it is cleared only once per update can
> save about 2-3% of the total time taken to compute the digest, with a
> reasonable memset() implementation, and considerably more (~20%) with a
> bad one (e.g. the x86 purgatory currently uses a memset() coded in C).
>
> Signed-off-by: Arvind Sankar <[email protected]>

Looks good,

Reviewed-by: Eric Biggers <[email protected]>

2020-10-22 08:25:30

by Eric Biggers

Subject: Re: [PATCH v2 4/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64

On Tue, Oct 20, 2020 at 04:39:55PM -0400, Arvind Sankar wrote:
> This reduces code size substantially (on x86_64 with gcc-10 the size of
> sha256_update() goes from 7593 bytes to 1952 bytes including the new
> SHA256_K array), and on x86 is slightly faster than the full unroll
> (tesed on Broadwell Xeon).

tesed => tested

>
> Signed-off-by: Arvind Sankar <[email protected]>
> ---
> lib/crypto/sha256.c | 166 ++++++++------------------------------------
> 1 file changed, 30 insertions(+), 136 deletions(-)
>
> diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> index c6bfeacc5b81..5efd390706c6 100644
> --- a/lib/crypto/sha256.c
> +++ b/lib/crypto/sha256.c
> @@ -18,6 +18,17 @@
> #include <crypto/sha.h>
> #include <asm/unaligned.h>
>
> +static const u32 SHA256_K[] = {
> + 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
> + 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
> + 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
> + 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
> + 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
> + 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
> + 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
> + 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
> +};

Limit this to 80 columns?

Otherwise this looks good.

- Eric

2020-10-22 08:25:30

by Eric Biggers

Subject: Re: [PATCH v2 5/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops

On Tue, Oct 20, 2020 at 04:39:56PM -0400, Arvind Sankar wrote:
> Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> (tested on Broadwell Xeon) while not increasing code size too much.
>
> Signed-off-by: Arvind Sankar <[email protected]>
> ---

Looks good,

Reviewed-by: Eric Biggers <[email protected]>

2020-10-23 07:49:58

by Arvind Sankar

Subject: Re: [PATCH v2 4/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64

On Wed, Oct 21, 2020 at 10:02:19PM -0700, Eric Biggers wrote:
> On Tue, Oct 20, 2020 at 04:39:55PM -0400, Arvind Sankar wrote:
> > This reduces code size substantially (on x86_64 with gcc-10 the size of
> > sha256_update() goes from 7593 bytes to 1952 bytes including the new
> > SHA256_K array), and on x86 is slightly faster than the full unroll
> > (tesed on Broadwell Xeon).
>
> tesed => tested
>
> >
> > Signed-off-by: Arvind Sankar <[email protected]>
> > ---
> > lib/crypto/sha256.c | 166 ++++++++------------------------------------
> > 1 file changed, 30 insertions(+), 136 deletions(-)
> >
> > diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> > index c6bfeacc5b81..5efd390706c6 100644
> > --- a/lib/crypto/sha256.c
> > +++ b/lib/crypto/sha256.c
> > @@ -18,6 +18,17 @@
> > #include <crypto/sha.h>
> > #include <asm/unaligned.h>
> >
> > +static const u32 SHA256_K[] = {
> > + 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
> > + 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
> > + 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
> > + 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
> > + 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
> > + 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
> > + 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
> > + 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
> > +};
>
> Limit this to 80 columns?

I was aiming for 8 columns per line to match all the other groupings by
eight. It does slightly exceed 100 columns, but can this be an exception,
or should I maybe make it 4 columns per line?
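
For illustration, at 4 per line the table would start:

	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
	0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	...

which fits comfortably within 80 columns.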

>
> Otherwise this looks good.
>
> - Eric

2020-10-23 07:50:22

by Arvind Sankar

Subject: Re: [PATCH v2 2/6] crypto: lib/sha256 - Don't clear temporary variables

On Wed, Oct 21, 2020 at 09:58:50PM -0700, Eric Biggers wrote:
> On Tue, Oct 20, 2020 at 04:39:53PM -0400, Arvind Sankar wrote:
> > The assignments to clear a through h and t1/t2 are optimized out by the
> > compiler because they are unused after the assignments.
> >
> > These variables shouldn't be very sensitive: t1/t2 can be calculated
> > from a through h, so they don't reveal any additional information.
> > Knowing a through h is equivalent to knowing one 64-byte block's SHA256
> > hash (with non-standard initial value) which, assuming SHA256 is secure,
> > doesn't reveal any information about the input.
> >
> > Signed-off-by: Arvind Sankar <[email protected]>
>
> I don't entirely buy the second paragraph. It could be the case that the input
> is less than or equal to one SHA-256 block (64 bytes), in which case leaking
> 'a' through 'h' would reveal the final SHA-256 hash if the input length is
> known. And note that callers might consider either the input, the resulting
> hash, or both to be sensitive information -- it depends.

The "non-standard initial value" was just parenthetical -- my thinking
was that revealing the hash, whether the real SHA hash or an
intermediate one starting at some other initial value, shouldn't reveal
the input; not that you get any additional security from being an
intermediate block. But if the hash itself could be sensitive, yeah then
a-h are sensitive anyway.

>
> > ---
> > lib/crypto/sha256.c | 1 -
> > 1 file changed, 1 deletion(-)
> >
> > diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> > index d43bc39ab05e..099cd11f83c1 100644
> > --- a/lib/crypto/sha256.c
> > +++ b/lib/crypto/sha256.c
> > @@ -202,7 +202,6 @@ static void sha256_transform(u32 *state, const u8 *input)
> > state[4] += e; state[5] += f; state[6] += g; state[7] += h;
> >
> > /* clear any sensitive info... */
> > - a = b = c = d = e = f = g = h = t1 = t2 = 0;
> > memzero_explicit(W, 64 * sizeof(u32));
> > }
>
> Your change itself is fine, though. As you mentioned, these assignments get
> optimized out, so they weren't accomplishing anything.
>
> The fact is, there just isn't any way to guarantee in C code that all sensitive
> variables get cleared.
>
> So we shouldn't (and generally don't) bother trying to clear individual u32's,
> ints, etc. like this, but rather only structs and arrays, as clearing those is
> more likely to work as intended.
>
> - Eric

Ok, I'll just drop the second paragraph from the commit message then.

2020-10-23 07:51:24

by Herbert Xu

Subject: Re: [PATCH v2 4/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64

On Thu, Oct 22, 2020 at 11:12:36PM -0400, Arvind Sankar wrote:
>
> I was aiming for 8 columns per line to match all the other groupings by
> eight. It does slightly exceed 100 columns but can this be an exception,
> or should I maybe make it 4 columns per line?

Please limit it to 4 columns.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt