2024-06-03 18:39:52

by Eric Biggers

Subject: [PATCH v4 0/8] Optimize dm-verity and fsverity using multibuffer hashing

On many modern CPUs, it is possible to compute the SHA-256 hashes of two
equal-length messages in about the same time as a single message, if all
the instructions are interleaved. This is because a single SHA-256
computation (like most other cryptographic hash functions) is an
inherently serial chain of dependent operations and therefore cannot by
itself take advantage of the CPU's full instruction throughput.

An earlier attempt to support multibuffer hashing in Linux was based
around the ahash API, with requests queued up and executed on a global
queue; that approach proved complex, buggy, and hard to use. This
patchset instead takes a much simpler approach of just adding a
synchronous API for hashing equal-length messages.

This works well for dm-verity and fsverity, which use Merkle trees and
therefore hash large numbers of equal-length messages.
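
For reference, the new API added in patch 1 is a single synchronous call
that finishes hashing num_msgs equal-length messages, all starting from a
shared state:

    int crypto_shash_finup_mb(struct shash_desc *desc,
                              const u8 * const data[], unsigned int len,
                              u8 * const outs[], unsigned int num_msgs);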

This patchset is organized as follows:

- Patches 1-3 add crypto_shash_finup_mb() and tests for it.
- Patches 4-5 implement finup_mb on x86_64 and arm64, using an
interleaving factor of 2.
- Patches 6-8 update fsverity and dm-verity to use crypto_shash_finup_mb()
to hash pairs of data blocks when possible. Note: the patch
"dm-verity: hash blocks with shash import+finup when possible" is
revived from its original submission
(https://lore.kernel.org/dm-devel/[email protected]/)
because this new work provides a new motivation for it.

On CPUs that can compute multiple SHA-256 hashes concurrently (all
arm64 CPUs I tested, and AMD Zen CPUs), raw SHA-256 hashing throughput increases by
70-98%, and the throughput of cold-cache reads from dm-verity and
fsverity increases by very roughly 35%.

Changed in v4:
- Reorganized the fsverity and dm-verity code to have a unified code
path for single-block vs. multi-block processing. For data blocks
they now use only crypto_shash_finup_mb().

Changed in v3:
- Change API from finup2x to finup_mb. It now takes arrays of data
buffers and output buffers, avoiding hardcoding 2x in the API.

Changed in v2:
- Rebase onto cryptodev/master
- Add more comments to assembly
- Reorganize some of the assembly slightly
- Fix the claimed throughput improvement on arm64
- Fix incorrect kunmap order in fs/verity/verify.c
- Adjust testmgr generation logic slightly
- Explicitly check for INT_MAX before casting unsigned int to int
- Mention SHA3 based parallel hashes
- Mention AVX512-based approach

Eric Biggers (8):
crypto: shash - add support for finup_mb
crypto: testmgr - generate power-of-2 lengths more often
crypto: testmgr - add tests for finup_mb
crypto: x86/sha256-ni - add support for finup_mb
crypto: arm64/sha256-ce - add support for finup_mb
fsverity: improve performance by using multibuffer hashing
dm-verity: hash blocks with shash import+finup when possible
dm-verity: improve performance by using multibuffer hashing

arch/arm64/crypto/sha2-ce-core.S | 281 +++++++++++++-
arch/arm64/crypto/sha2-ce-glue.c | 40 ++
arch/x86/crypto/sha256_ni_asm.S | 368 +++++++++++++++++++
arch/x86/crypto/sha256_ssse3_glue.c | 39 ++
crypto/shash.c | 60 +++
crypto/testmgr.c | 90 ++++-
drivers/md/dm-verity-fec.c | 31 +-
drivers/md/dm-verity-fec.h | 7 +-
drivers/md/dm-verity-target.c | 545 ++++++++++++++++++++--------
drivers/md/dm-verity.h | 43 +--
fs/verity/fsverity_private.h | 7 +
fs/verity/hash_algs.c | 8 +-
fs/verity/verify.c | 173 +++++++--
include/crypto/hash.h | 45 ++-
14 files changed, 1494 insertions(+), 243 deletions(-)


base-commit: aabbf2135f9a9526991f17cb0c78cf1ec878f1c2
--
2.45.1



2024-06-03 18:39:55

by Eric Biggers

Subject: [PATCH v4 1/8] crypto: shash - add support for finup_mb

From: Eric Biggers <[email protected]>

Most cryptographic hash functions are serialized, in the sense that they
have an internal block size and the blocks must be processed serially.
(BLAKE3 is a notable exception that has tree-based hashing built-in, but
all the more common choices such as the SHAs and BLAKE2 are serialized.
ParallelHash and Sakura are parallel hashes based on SHA3, but SHA3 is
much slower than SHA256 in software even with the ARMv8 SHA3 extension.)

This limits the performance of computing a single hash. Yet, computing
multiple hashes simultaneously does not have this limitation. Modern
CPUs are superscalar and often can execute independent instructions in
parallel. As a result, on many modern CPUs, it is possible to hash two
equal-length messages in about the same time as a single message, if all
the instructions are interleaved.

Meanwhile, a very common use case for hashing in the Linux kernel is
dm-verity and fs-verity. Both use a Merkle tree that has a fixed block
size, usually 4096 bytes with an empty or 32-byte salt prepended. The
hash algorithm is usually SHA-256. Usually, many blocks need to be
hashed at a time. This is an ideal scenario for multibuffer hashing.

Linux actually used to support SHA-256 multibuffer hashing on x86_64,
before it was removed by commit ab8085c130ed ("crypto: x86 - remove SHA
multibuffer routines and mcryptd"). However, it was integrated with the
crypto API in a weird way, where it behaved as an asynchronous hash that
queued up and executed all requests on a global queue. This made it
very complex, buggy, and virtually unusable.

This patch takes a new approach of just adding an API
crypto_shash_finup_mb() that synchronously computes the hash of multiple
equal-length messages, starting from a common state that represents the
(possibly empty) common prefix shared by the messages.

The new API is part of the "shash" algorithm type, as it does not make
sense in "ahash". It does a "finup" operation rather than a "digest"
operation in order to support the salt that is used by dm-verity and
fs-verity. The data and output buffers are provided in arrays of length
@num_msgs in order to make the API itself extensible to interleaving
factors other than 2. (Though, initially only 2x will actually be used.
There are some platforms in which a higher factor could help, but there
are significant trade-offs.)
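
As an illustration (this is just a sketch, not part of the patch; "tfm"
is assumed to be a SHA-256 shash, "salted_state" a previously exported
state containing the hashed salt, and block_a/block_b, hash_a/hash_b are
placeholder buffers), a Merkle-tree style caller would use the API
roughly as follows:

    SHASH_DESC_ON_STACK(desc, tfm);
    const u8 *data[2] = { block_a, block_b }; /* two 4096-byte data blocks */
    u8 *outs[2] = { hash_a, hash_b };
    int err;

    desc->tfm = tfm;

    /* Fork both messages from the common prefix (here, the salt)... */
    err = crypto_shash_import(desc, salted_state);
    /* ...or use crypto_shash_init(desc) if there is no common prefix. */
    if (err)
            return err;

    /* num_msgs must not exceed crypto_shash_mb_max_msgs(tfm). */
    err = crypto_shash_finup_mb(desc, data, 4096, outs, 2);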

Signed-off-by: Eric Biggers <[email protected]>
---
crypto/shash.c | 60 +++++++++++++++++++++++++++++++++++++++++++
include/crypto/hash.h | 45 +++++++++++++++++++++++++++++++-
2 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/crypto/shash.c b/crypto/shash.c
index 301ab42bf849..5a2352933fbf 100644
--- a/crypto/shash.c
+++ b/crypto/shash.c
@@ -73,10 +73,57 @@ int crypto_shash_finup(struct shash_desc *desc, const u8 *data,
{
return crypto_shash_alg(desc->tfm)->finup(desc, data, len, out);
}
EXPORT_SYMBOL_GPL(crypto_shash_finup);

+static noinline_for_stack int
+shash_finup_mb_fallback(struct shash_desc *desc, const u8 * const data[],
+ unsigned int len, u8 * const outs[],
+ unsigned int num_msgs)
+{
+ struct crypto_shash *tfm = desc->tfm;
+ SHASH_DESC_ON_STACK(desc2, tfm);
+ unsigned int i;
+ int err;
+
+ for (i = 0; i < num_msgs - 1; i++) {
+ desc2->tfm = tfm;
+ memcpy(shash_desc_ctx(desc2), shash_desc_ctx(desc),
+ crypto_shash_descsize(tfm));
+ err = crypto_shash_finup(desc2, data[i], len, outs[i]);
+ if (err)
+ return err;
+ }
+ return crypto_shash_finup(desc, data[i], len, outs[i]);
+}
+
+int crypto_shash_finup_mb(struct shash_desc *desc, const u8 * const data[],
+ unsigned int len, u8 * const outs[],
+ unsigned int num_msgs)
+{
+ struct shash_alg *alg = crypto_shash_alg(desc->tfm);
+ int err;
+
+ if (num_msgs == 1)
+ return crypto_shash_finup(desc, data[0], len, outs[0]);
+
+ if (num_msgs == 0)
+ return 0;
+
+ if (WARN_ON_ONCE(num_msgs > alg->mb_max_msgs))
+ goto fallback;
+
+ err = alg->finup_mb(desc, data, len, outs, num_msgs);
+ if (unlikely(err == -EOPNOTSUPP))
+ goto fallback;
+ return err;
+
+fallback:
+ return shash_finup_mb_fallback(desc, data, len, outs, num_msgs);
+}
+EXPORT_SYMBOL_GPL(crypto_shash_finup_mb);
+
static int shash_default_digest(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
struct shash_alg *shash = crypto_shash_alg(desc->tfm);

@@ -312,10 +359,20 @@ static int shash_prepare_alg(struct shash_alg *alg)
return -EINVAL;

if ((alg->export && !alg->import) || (alg->import && !alg->export))
return -EINVAL;

+ if (alg->mb_max_msgs) {
+ if (alg->mb_max_msgs > HASH_MAX_MB_MSGS)
+ return -EINVAL;
+ if (!alg->finup_mb)
+ return -EINVAL;
+ } else {
+ if (alg->finup_mb)
+ return -EINVAL;
+ }
+
err = hash_prepare_alg(&alg->halg);
if (err)
return err;

base->cra_type = &crypto_shash_type;
@@ -339,10 +396,13 @@ static int shash_prepare_alg(struct shash_alg *alg)
if (!alg->export)
alg->halg.statesize = alg->descsize;
if (!alg->setkey)
alg->setkey = shash_no_setkey;

+ if (!alg->mb_max_msgs)
+ alg->mb_max_msgs = 1;
+
return 0;
}

int crypto_register_shash(struct shash_alg *alg)
{
diff --git a/include/crypto/hash.h b/include/crypto/hash.h
index 2d5ea9f9ff43..002099610755 100644
--- a/include/crypto/hash.h
+++ b/include/crypto/hash.h
@@ -154,11 +154,13 @@ struct ahash_alg {
struct shash_desc {
struct crypto_shash *tfm;
void *__ctx[] __aligned(ARCH_SLAB_MINALIGN);
};

-#define HASH_MAX_DIGESTSIZE 64
+#define HASH_MAX_DIGESTSIZE 64
+
+#define HASH_MAX_MB_MSGS 2 /* max value of crypto_shash_mb_max_msgs() */

/*
* Worst case is hmac(sha3-224-generic). Its context is a nested 'shash_desc'
* containing a 'struct sha3_state'.
*/
@@ -177,10 +179,19 @@ struct shash_desc {
* @finup: see struct ahash_alg
* @digest: see struct ahash_alg
* @export: see struct ahash_alg
* @import: see struct ahash_alg
* @setkey: see struct ahash_alg
+ * @finup_mb: **[optional]** Multibuffer hashing support. Finish calculating
+ * the digests of multiple messages, interleaving the instructions to
+ * potentially achieve better performance than hashing each message
+ * individually. The num_msgs argument will be between 2 and
+ * @mb_max_msgs inclusively. If there are particular values of len
+ * or num_msgs, or a particular calling context (e.g. no-SIMD) that
+ * the implementation does not support with this method, the
+ * implementation may return -EOPNOTSUPP from this method in those
+ * cases to cause the crypto API to fall back to repeated finups.
* @init_tfm: Initialize the cryptographic transformation object.
* This function is called only once at the instantiation
* time, right after the transformation context was
* allocated. In case the cryptographic hardware has
* some special requirements which need to be handled
@@ -192,10 +203,11 @@ struct shash_desc {
* various changes set in @init_tfm.
* @clone_tfm: Copy transform into new object, may allocate memory.
* @descsize: Size of the operational state for the message digest. This state
* size is the memory size that needs to be allocated for
* shash_desc.__ctx
+ * @mb_max_msgs: Maximum supported value of num_msgs argument to @finup_mb
* @halg: see struct hash_alg_common
* @HASH_ALG_COMMON: see struct hash_alg_common
*/
struct shash_alg {
int (*init)(struct shash_desc *desc);
@@ -208,15 +220,19 @@ struct shash_alg {
unsigned int len, u8 *out);
int (*export)(struct shash_desc *desc, void *out);
int (*import)(struct shash_desc *desc, const void *in);
int (*setkey)(struct crypto_shash *tfm, const u8 *key,
unsigned int keylen);
+ int (*finup_mb)(struct shash_desc *desc, const u8 * const data[],
+ unsigned int len, u8 * const outs[],
+ unsigned int num_msgs);
int (*init_tfm)(struct crypto_shash *tfm);
void (*exit_tfm)(struct crypto_shash *tfm);
int (*clone_tfm)(struct crypto_shash *dst, struct crypto_shash *src);

unsigned int descsize;
+ unsigned int mb_max_msgs;

union {
struct HASH_ALG_COMMON;
struct hash_alg_common halg;
};
@@ -750,10 +766,20 @@ static inline unsigned int crypto_shash_digestsize(struct crypto_shash *tfm)
static inline unsigned int crypto_shash_statesize(struct crypto_shash *tfm)
{
return crypto_shash_alg(tfm)->statesize;
}

+/*
+ * Return the maximum supported multibuffer hashing interleaving factor, i.e.
+ * the maximum num_msgs that can be passed to crypto_shash_finup_mb(). The
+ * return value will be between 1 and HASH_MAX_MB_MSGS inclusively.
+ */
+static inline unsigned int crypto_shash_mb_max_msgs(struct crypto_shash *tfm)
+{
+ return crypto_shash_alg(tfm)->mb_max_msgs;
+}
+
static inline u32 crypto_shash_get_flags(struct crypto_shash *tfm)
{
return crypto_tfm_get_flags(crypto_shash_tfm(tfm));
}

@@ -843,10 +869,27 @@ int crypto_shash_digest(struct shash_desc *desc, const u8 *data,
* Return: 0 on success; < 0 if an error occurred.
*/
int crypto_shash_tfm_digest(struct crypto_shash *tfm, const u8 *data,
unsigned int len, u8 *out);

+/**
+ * crypto_shash_finup_mb() - multibuffer message hashing
+ * @desc: the starting state that is forked for each message. It contains the
+ * state after hashing a (possibly-empty) common prefix of the messages.
+ * @data: the data of each message (not including any common prefix from @desc)
+ * @len: length of each data buffer in bytes
+ * @outs: output buffer for each message digest
+ * @num_msgs: number of messages, i.e. the number of entries in @data and @outs.
+ * This can't be more than crypto_shash_mb_max_msgs().
+ *
+ * Context: Any context.
+ * Return: 0 on success; a negative errno value on failure.
+ */
+int crypto_shash_finup_mb(struct shash_desc *desc, const u8 * const data[],
+ unsigned int len, u8 * const outs[],
+ unsigned int num_msgs);
+
/**
* crypto_shash_export() - extract operational state for message digest
* @desc: reference to the operational state handle whose state is exported
* @out: output buffer of sufficient size that can hold the hash state
*
--
2.45.1


2024-06-03 18:40:04

by Eric Biggers

Subject: [PATCH v4 3/8] crypto: testmgr - add tests for finup_mb

From: Eric Biggers <[email protected]>

Update the shash self-tests to test the new finup_mb method when
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y.
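
The multibuffer coverage works by hashing the real test message at a
randomly chosen index within a randomly sized batch, with the remaining
slots pointing at a dummy buffer, so that the result can still be
compared against the standard test vectors. In outline (simplified from
do_finup_mb() in the diff below):

    num_msgs = 1 + (cfg->multibuffer_count % crypto_shash_mb_max_msgs(tfm));
    msg_idx = cfg->multibuffer_index % num_msgs;
    for (i = 0; i < num_msgs; i++) {
            datas[i] = unused_data;   /* dummy message */
            outs[i] = unused_result;
    }
    datas[msg_idx] = data;            /* the real test message */
    outs[msg_idx] = result;
    err = crypto_shash_finup_mb(desc, datas, len, outs, num_msgs);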

Signed-off-by: Eric Biggers <[email protected]>
---
crypto/testmgr.c | 74 +++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 67 insertions(+), 7 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 2c57ebcaf368..3253dc1501e4 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -227,10 +227,11 @@ enum flush_type {

/* finalization function for hash algorithms */
enum finalization_type {
FINALIZATION_TYPE_FINAL, /* use final() */
FINALIZATION_TYPE_FINUP, /* use finup() */
+ FINALIZATION_TYPE_FINUP_MB, /* use finup_mb() */
FINALIZATION_TYPE_DIGEST, /* use digest() */
};

/*
* Whether the crypto operation will occur in-place, and if so whether the
@@ -290,10 +291,15 @@ struct test_sg_division {
* the @iv_offset
* @key_offset: misalignment of the key, where 0 is default alignment
* @key_offset_relative_to_alignmask: if true, add the algorithm's alignmask to
* the @key_offset
* @finalization_type: what finalization function to use for hashes
+ * @multibuffer_index: random number used to generate the message index to use
+ * for finup_mb (if finup_mb is used).
+ * @multibuffer_count: random number used to generate the num_msgs parameter to
+ * finup_mb (if finup_mb is used).
+ *
* @nosimd: execute with SIMD disabled? Requires !CRYPTO_TFM_REQ_MAY_SLEEP.
*/
struct testvec_config {
const char *name;
enum inplace_mode inplace_mode;
@@ -303,10 +309,12 @@ struct testvec_config {
unsigned int iv_offset;
unsigned int key_offset;
bool iv_offset_relative_to_alignmask;
bool key_offset_relative_to_alignmask;
enum finalization_type finalization_type;
+ unsigned int multibuffer_index;
+ unsigned int multibuffer_count;
bool nosimd;
};

#define TESTVEC_CONFIG_NAMELEN 192

@@ -1109,19 +1117,27 @@ static void generate_random_testvec_config(struct rnd_state *rng,
if (prandom_bool(rng)) {
cfg->req_flags |= CRYPTO_TFM_REQ_MAY_SLEEP;
p += scnprintf(p, end - p, " may_sleep");
}

- switch (prandom_u32_below(rng, 4)) {
+ switch (prandom_u32_below(rng, 8)) {
case 0:
+ case 1:
cfg->finalization_type = FINALIZATION_TYPE_FINAL;
p += scnprintf(p, end - p, " use_final");
break;
- case 1:
+ case 2:
cfg->finalization_type = FINALIZATION_TYPE_FINUP;
p += scnprintf(p, end - p, " use_finup");
break;
+ case 3:
+ case 4:
+ cfg->finalization_type = FINALIZATION_TYPE_FINUP_MB;
+ cfg->multibuffer_index = prandom_u32_state(rng);
+ cfg->multibuffer_count = prandom_u32_state(rng);
+ p += scnprintf(p, end - p, " use_finup_mb");
+ break;
default:
cfg->finalization_type = FINALIZATION_TYPE_DIGEST;
p += scnprintf(p, end - p, " use_digest");
break;
}
@@ -1270,10 +1286,37 @@ static inline int check_shash_op(const char *op, int err,
pr_err("alg: shash: %s %s() failed with err %d on test vector %s, cfg=\"%s\"\n",
driver, op, err, vec_name, cfg->name);
return err;
}

+static int do_finup_mb(struct shash_desc *desc,
+ const u8 *data, unsigned int len, u8 *result,
+ const struct testvec_config *cfg,
+ const struct test_sglist *tsgl)
+{
+ struct crypto_shash *tfm = desc->tfm;
+ const u8 *unused_data = tsgl->bufs[XBUFSIZE - 1];
+ u8 unused_result[HASH_MAX_DIGESTSIZE];
+ const u8 *datas[HASH_MAX_MB_MSGS];
+ u8 *outs[HASH_MAX_MB_MSGS];
+ unsigned int num_msgs;
+ unsigned int msg_idx;
+ unsigned int i;
+
+ num_msgs = 1 + (cfg->multibuffer_count % crypto_shash_mb_max_msgs(tfm));
+ if (WARN_ON_ONCE(num_msgs > HASH_MAX_MB_MSGS))
+ return -EINVAL;
+ msg_idx = cfg->multibuffer_index % num_msgs;
+ for (i = 0; i < num_msgs; i++) {
+ datas[i] = unused_data;
+ outs[i] = unused_result;
+ }
+ datas[msg_idx] = data;
+ outs[msg_idx] = result;
+ return crypto_shash_finup_mb(desc, datas, len, outs, num_msgs);
+}
+
/* Test one hash test vector in one configuration, using the shash API */
static int test_shash_vec_cfg(const struct hash_testvec *vec,
const char *vec_name,
const struct testvec_config *cfg,
struct shash_desc *desc,
@@ -1346,11 +1389,14 @@ static int test_shash_vec_cfg(const struct hash_testvec *vec,
return -EINVAL;
}
goto result_ready;
}

- /* Using init(), zero or more update(), then final() or finup() */
+ /*
+ * Using init(), zero or more update(), then either final(), finup(), or
+ * finup_mb().
+ */

if (cfg->nosimd)
crypto_disable_simd_for_test();
err = crypto_shash_init(desc);
if (cfg->nosimd)
@@ -1358,28 +1404,42 @@ static int test_shash_vec_cfg(const struct hash_testvec *vec,
err = check_shash_op("init", err, driver, vec_name, cfg);
if (err)
return err;

for (i = 0; i < tsgl->nents; i++) {
+ const u8 *data = sg_virt(&tsgl->sgl[i]);
+ unsigned int len = tsgl->sgl[i].length;
+
if (i + 1 == tsgl->nents &&
cfg->finalization_type == FINALIZATION_TYPE_FINUP) {
if (divs[i]->nosimd)
crypto_disable_simd_for_test();
- err = crypto_shash_finup(desc, sg_virt(&tsgl->sgl[i]),
- tsgl->sgl[i].length, result);
+ err = crypto_shash_finup(desc, data, len, result);
if (divs[i]->nosimd)
crypto_reenable_simd_for_test();
err = check_shash_op("finup", err, driver, vec_name,
cfg);
if (err)
return err;
goto result_ready;
}
+ if (i + 1 == tsgl->nents &&
+ cfg->finalization_type == FINALIZATION_TYPE_FINUP_MB) {
+ if (divs[i]->nosimd)
+ crypto_disable_simd_for_test();
+ err = do_finup_mb(desc, data, len, result, cfg, tsgl);
+ if (divs[i]->nosimd)
+ crypto_reenable_simd_for_test();
+ err = check_shash_op("finup_mb", err, driver, vec_name,
+ cfg);
+ if (err)
+ return err;
+ goto result_ready;
+ }
if (divs[i]->nosimd)
crypto_disable_simd_for_test();
- err = crypto_shash_update(desc, sg_virt(&tsgl->sgl[i]),
- tsgl->sgl[i].length);
+ err = crypto_shash_update(desc, data, len);
if (divs[i]->nosimd)
crypto_reenable_simd_for_test();
err = check_shash_op("update", err, driver, vec_name, cfg);
if (err)
return err;
--
2.45.1


2024-06-03 18:40:17

by Eric Biggers

Subject: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

From: Eric Biggers <[email protected]>

When supported by the hash algorithm, use crypto_shash_finup_mb() to
interleave the hashing of pairs of data blocks. On some CPUs this
nearly doubles hashing performance. The increase in overall throughput
of cold-cache fsverity reads that I'm seeing on arm64 and x86_64 is
roughly 35% (though this metric is hard to measure as it jumps around a
lot).

For now this is only done on the verification path, and only for data
blocks, not Merkle tree blocks. We could use finup_mb on Merkle tree
blocks too, but that is less important as there aren't as many Merkle
tree blocks as data blocks, and that would require some additional code
restructuring. We could also use finup_mb to accelerate building the
Merkle tree, but verification performance is more important.
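
In outline, the flush helper added by this patch
(fsverity_verify_pending_blocks() in the diff below) hashes all queued
blocks with one interleaved call and then walks the Merkle tree for each
result. Simplified, with error logging and the unsalted-init path
omitted:

    for (i = 0; i < ctx->num_pending; i++) {
            data[i] = ctx->pending_blocks[i].data;
            outs[i] = ctx->pending_blocks[i].hash;
    }
    desc->tfm = params->hash_alg->tfm;
    if (crypto_shash_import(desc, params->hashstate) != 0 ||  /* salt prefix */
        crypto_shash_finup_mb(desc, data, params->block_size, outs,
                              ctx->num_pending) != 0)
            return false;
    for (i = 0; i < ctx->num_pending; i++)
            if (!verify_data_block(inode, vi, &ctx->pending_blocks[i],
                                   ctx->max_ra_pages))
                    return false;
    fsverity_clear_pending_blocks(ctx);
    return true;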

Signed-off-by: Eric Biggers <[email protected]>
---
fs/verity/fsverity_private.h | 7 ++
fs/verity/hash_algs.c | 8 +-
fs/verity/verify.c | 173 +++++++++++++++++++++++++++++------
3 files changed, 156 insertions(+), 32 deletions(-)

diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index b3506f56e180..7535c9d9b516 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -27,10 +27,15 @@ struct fsverity_hash_alg {
/*
* The HASH_ALGO_* constant for this algorithm. This is different from
* FS_VERITY_HASH_ALG_*, which uses a different numbering scheme.
*/
enum hash_algo algo_id;
+ /*
+ * The maximum supported interleaving factor for multibuffer hashing, or
+ * 1 if the algorithm doesn't support multibuffer hashing
+ */
+ int mb_max_msgs;
};

/* Merkle tree parameters: hash algorithm, initial hash state, and topology */
struct merkle_tree_params {
const struct fsverity_hash_alg *hash_alg; /* the hash algorithm */
@@ -150,8 +155,10 @@ static inline void fsverity_init_signature(void)
}
#endif /* !CONFIG_FS_VERITY_BUILTIN_SIGNATURES */

/* verify.c */

+#define FS_VERITY_MAX_PENDING_DATA_BLOCKS 2
+
void __init fsverity_init_workqueue(void);

#endif /* _FSVERITY_PRIVATE_H */
diff --git a/fs/verity/hash_algs.c b/fs/verity/hash_algs.c
index 6b08b1d9a7d7..f24d7c295455 100644
--- a/fs/verity/hash_algs.c
+++ b/fs/verity/hash_algs.c
@@ -82,12 +82,16 @@ const struct fsverity_hash_alg *fsverity_get_hash_alg(const struct inode *inode,
if (WARN_ON_ONCE(alg->digest_size != crypto_shash_digestsize(tfm)))
goto err_free_tfm;
if (WARN_ON_ONCE(alg->block_size != crypto_shash_blocksize(tfm)))
goto err_free_tfm;

- pr_info("%s using implementation \"%s\"\n",
- alg->name, crypto_shash_driver_name(tfm));
+ alg->mb_max_msgs = min(crypto_shash_mb_max_msgs(tfm),
+ FS_VERITY_MAX_PENDING_DATA_BLOCKS);
+
+ pr_info("%s using implementation \"%s\"%s\n",
+ alg->name, crypto_shash_driver_name(tfm),
+ alg->mb_max_msgs > 1 ? " (multibuffer)" : "");

/* pairs with smp_load_acquire() above */
smp_store_release(&alg->tfm, tfm);
goto out_unlock;

diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index 4fcad0825a12..e9fb299ffa77 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -8,10 +8,32 @@
#include "fsverity_private.h"

#include <crypto/hash.h>
#include <linux/bio.h>

+struct fsverity_pending_block {
+ const void *data;
+ u64 pos;
+ u8 hash[FS_VERITY_MAX_DIGEST_SIZE];
+};
+
+struct fsverity_verification_context {
+ struct inode *inode;
+ struct fsverity_info *vi;
+ unsigned long max_ra_pages;
+
+ /*
+ * This is the queue of data blocks that are pending verification. We
+ * allow multiple blocks to be queued up in order to support hash
+ * algorithm implementations that provide support for multibuffer
+ * hashing, i.e. interleaving the hashing of multiple messages. On many
+ * CPUs this improves performance significantly.
+ */
+ int num_pending;
+ struct fsverity_pending_block pending_blocks[FS_VERITY_MAX_PENDING_DATA_BLOCKS];
+};
+
static struct workqueue_struct *fsverity_read_workqueue;

/*
* Returns true if the hash block with index @hblock_idx in the tree, located in
* @hpage, has already been verified.
@@ -77,23 +99,25 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
SetPageChecked(hpage);
return false;
}

/*
- * Verify a single data block against the file's Merkle tree.
+ * Verify the hash of a single data block against the file's Merkle tree.
*
* In principle, we need to verify the entire path to the root node. However,
* for efficiency the filesystem may cache the hash blocks. Therefore we need
* only ascend the tree until an already-verified hash block is seen, and then
* verify the path to that block.
*
* Return: %true if the data block is valid, else %false.
*/
static bool
verify_data_block(struct inode *inode, struct fsverity_info *vi,
- const void *data, u64 data_pos, unsigned long max_ra_pages)
+ const struct fsverity_pending_block *dblock,
+ unsigned long max_ra_pages)
{
+ const u64 data_pos = dblock->pos;
const struct merkle_tree_params *params = &vi->tree_params;
const unsigned int hsize = params->digest_size;
int level;
u8 _want_hash[FS_VERITY_MAX_DIGEST_SIZE];
const u8 *want_hash;
@@ -113,23 +137,27 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
* The index of the previous level's block within that level; also the
* index of that block's hash within the current level.
*/
u64 hidx = data_pos >> params->log_blocksize;

- /* Up to 1 + FS_VERITY_MAX_LEVELS pages may be mapped at once */
- BUILD_BUG_ON(1 + FS_VERITY_MAX_LEVELS > KM_MAX_IDX);
+ /*
+ * Up to FS_VERITY_MAX_PENDING_DATA_BLOCKS + FS_VERITY_MAX_LEVELS pages
+ * may be mapped at once.
+ */
+ BUILD_BUG_ON(FS_VERITY_MAX_PENDING_DATA_BLOCKS +
+ FS_VERITY_MAX_LEVELS > KM_MAX_IDX);

if (unlikely(data_pos >= inode->i_size)) {
/*
* This can happen in the data page spanning EOF when the Merkle
* tree block size is less than the page size. The Merkle tree
* doesn't cover data blocks fully past EOF. But the entire
* page spanning EOF can be visible to userspace via a mmap, and
* any part past EOF should be all zeroes. Therefore, we need
* to verify that any data blocks fully past EOF are all zeroes.
*/
- if (memchr_inv(data, 0, params->block_size)) {
+ if (memchr_inv(dblock->data, 0, params->block_size)) {
fsverity_err(inode,
"FILE CORRUPTED! Data past EOF is not zeroed");
return false;
}
return true;
@@ -219,55 +247,122 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
want_hash = _want_hash;
kunmap_local(haddr);
put_page(hpage);
}

- /* Finally, verify the data block. */
- if (fsverity_hash_block(params, inode, data, real_hash) != 0)
- goto error;
- if (memcmp(want_hash, real_hash, hsize) != 0)
+ /* Finally, verify the hash of the data block. */
+ if (memcmp(want_hash, dblock->hash, hsize) != 0)
goto corrupted;
return true;

corrupted:
fsverity_err(inode,
"FILE CORRUPTED! pos=%llu, level=%d, want_hash=%s:%*phN, real_hash=%s:%*phN",
data_pos, level - 1,
params->hash_alg->name, hsize, want_hash,
- params->hash_alg->name, hsize, real_hash);
+ params->hash_alg->name, hsize,
+ level == 0 ? dblock->hash : real_hash);
error:
for (; level > 0; level--) {
kunmap_local(hblocks[level - 1].addr);
put_page(hblocks[level - 1].page);
}
return false;
}

+static inline void
+fsverity_init_verification_context(struct fsverity_verification_context *ctx,
+ struct inode *inode,
+ unsigned long max_ra_pages)
+{
+ ctx->inode = inode;
+ ctx->vi = inode->i_verity_info;
+ ctx->max_ra_pages = max_ra_pages;
+ ctx->num_pending = 0;
+}
+
+static inline void
+fsverity_clear_pending_blocks(struct fsverity_verification_context *ctx)
+{
+ int i;
+
+ for (i = ctx->num_pending - 1; i >= 0; i--) {
+ kunmap_local(ctx->pending_blocks[i].data);
+ ctx->pending_blocks[i].data = NULL;
+ }
+ ctx->num_pending = 0;
+}
+
static bool
-verify_data_blocks(struct folio *data_folio, size_t len, size_t offset,
- unsigned long max_ra_pages)
+fsverity_verify_pending_blocks(struct fsverity_verification_context *ctx)
{
- struct inode *inode = data_folio->mapping->host;
- struct fsverity_info *vi = inode->i_verity_info;
- const unsigned int block_size = vi->tree_params.block_size;
- u64 pos = (u64)data_folio->index << PAGE_SHIFT;
+ struct inode *inode = ctx->inode;
+ struct fsverity_info *vi = ctx->vi;
+ const struct merkle_tree_params *params = &vi->tree_params;
+ SHASH_DESC_ON_STACK(desc, params->hash_alg->tfm);
+ const u8 *data[FS_VERITY_MAX_PENDING_DATA_BLOCKS];
+ u8 *outs[FS_VERITY_MAX_PENDING_DATA_BLOCKS];
+ int i;
+ int err;
+
+ if (ctx->num_pending == 0)
+ return true;
+
+ for (i = 0; i < ctx->num_pending; i++) {
+ data[i] = ctx->pending_blocks[i].data;
+ outs[i] = ctx->pending_blocks[i].hash;
+ }
+
+ desc->tfm = params->hash_alg->tfm;
+ if (params->hashstate)
+ err = crypto_shash_import(desc, params->hashstate);
+ else
+ err = crypto_shash_init(desc);
+ if (err) {
+ fsverity_err(inode, "Error %d importing hash state", err);
+ return false;
+ }
+ err = crypto_shash_finup_mb(desc, data, params->block_size, outs,
+ ctx->num_pending);
+ if (err) {
+ fsverity_err(inode, "Error %d computing block hashes", err);
+ return false;
+ }
+
+ for (i = 0; i < ctx->num_pending; i++) {
+ if (!verify_data_block(inode, vi, &ctx->pending_blocks[i],
+ ctx->max_ra_pages))
+ return false;
+ }
+
+ fsverity_clear_pending_blocks(ctx);
+ return true;
+}
+
+static bool
+fsverity_add_data_blocks(struct fsverity_verification_context *ctx,
+ struct folio *data_folio, size_t len, size_t offset)
+{
+ struct fsverity_info *vi = ctx->vi;
+ const struct merkle_tree_params *params = &vi->tree_params;
+ const unsigned int block_size = params->block_size;
+ const int mb_max_msgs = params->hash_alg->mb_max_msgs;
+ u64 pos = ((u64)data_folio->index << PAGE_SHIFT) + offset;

if (WARN_ON_ONCE(len <= 0 || !IS_ALIGNED(len | offset, block_size)))
return false;
if (WARN_ON_ONCE(!folio_test_locked(data_folio) ||
folio_test_uptodate(data_folio)))
return false;
do {
- void *data;
- bool valid;
-
- data = kmap_local_folio(data_folio, offset);
- valid = verify_data_block(inode, vi, data, pos + offset,
- max_ra_pages);
- kunmap_local(data);
- if (!valid)
+ ctx->pending_blocks[ctx->num_pending].data =
+ kmap_local_folio(data_folio, offset);
+ ctx->pending_blocks[ctx->num_pending].pos = pos;
+ if (++ctx->num_pending == mb_max_msgs &&
+ !fsverity_verify_pending_blocks(ctx))
return false;
+ pos += block_size;
offset += block_size;
len -= block_size;
} while (len);
return true;
}
@@ -284,11 +379,19 @@ verify_data_blocks(struct folio *data_folio, size_t len, size_t offset,
*
* Return: %true if the data is valid, else %false.
*/
bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset)
{
- return verify_data_blocks(folio, len, offset, 0);
+ struct fsverity_verification_context ctx;
+
+ fsverity_init_verification_context(&ctx, folio->mapping->host, 0);
+
+ if (fsverity_add_data_blocks(&ctx, folio, len, offset) &&
+ fsverity_verify_pending_blocks(&ctx))
+ return true;
+ fsverity_clear_pending_blocks(&ctx);
+ return false;
}
EXPORT_SYMBOL_GPL(fsverity_verify_blocks);

#ifdef CONFIG_BLOCK
/**
@@ -305,10 +408,12 @@ EXPORT_SYMBOL_GPL(fsverity_verify_blocks);
* filesystems) must instead call fsverity_verify_page() directly on each page.
* All filesystems must also call fsverity_verify_page() on holes.
*/
void fsverity_verify_bio(struct bio *bio)
{
+ struct inode *inode = bio_first_folio_all(bio)->mapping->host;
+ struct fsverity_verification_context ctx;
struct folio_iter fi;
unsigned long max_ra_pages = 0;

if (bio->bi_opf & REQ_RAHEAD) {
/*
@@ -321,17 +426,25 @@ void fsverity_verify_bio(struct bio *bio)
* reduces the number of I/O requests made to the Merkle tree.
*/
max_ra_pages = bio->bi_iter.bi_size >> (PAGE_SHIFT + 2);
}

+ fsverity_init_verification_context(&ctx, inode, max_ra_pages);
+
bio_for_each_folio_all(fi, bio) {
- if (!verify_data_blocks(fi.folio, fi.length, fi.offset,
- max_ra_pages)) {
- bio->bi_status = BLK_STS_IOERR;
- break;
- }
+ if (!fsverity_add_data_blocks(&ctx, fi.folio, fi.length,
+ fi.offset))
+ goto ioerr;
}
+
+ if (!fsverity_verify_pending_blocks(&ctx))
+ goto ioerr;
+ return;
+
+ioerr:
+ fsverity_clear_pending_blocks(&ctx);
+ bio->bi_status = BLK_STS_IOERR;
}
EXPORT_SYMBOL_GPL(fsverity_verify_bio);
#endif /* CONFIG_BLOCK */

/**
--
2.45.1


2024-06-03 18:40:26

by Eric Biggers

Subject: [PATCH v4 8/8] dm-verity: improve performance by using multibuffer hashing

From: Eric Biggers <[email protected]>

When supported by the hash algorithm, use crypto_shash_finup_mb() to
interleave the hashing of pairs of data blocks. On some CPUs this
nearly doubles hashing performance. The increase in overall throughput
of cold-cache dm-verity reads that I'm seeing on arm64 and x86_64 is
roughly 35% (though this metric is hard to measure as it jumps around a
lot).

For now this is only done on data blocks, not Merkle tree blocks. We
could use finup_mb on Merkle tree blocks too, but that is less important
as there aren't as many Merkle tree blocks as data blocks, and that
would require some additional code restructuring.
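
Concretely, the shash path in verity_verify_io() now queues each data
block and flushes the queue through one crypto_shash_finup_mb() call
once it holds v->mb_max_msgs blocks. Simplified from the diff below,
with error handling omitted:

    /* For each data block: queue it, flushing when the queue is full. */
    block->data = bvec_kmap_local(&bv);
    if (++ctx.num_pending == v->mb_max_msgs)
            r = verity_verify_pending_blocks(&ctx); /* import + finup_mb */
    bio_advance_iter(bio, iter, block_size);

    /* After the last block, flush any partially filled queue. */
    r = verity_verify_pending_blocks(&ctx);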

Signed-off-by: Eric Biggers <[email protected]>
---
drivers/md/dm-verity-fec.c | 24 +--
drivers/md/dm-verity-fec.h | 7 +-
drivers/md/dm-verity-target.c | 345 ++++++++++++++++++++++------------
drivers/md/dm-verity.h | 28 +--
4 files changed, 247 insertions(+), 157 deletions(-)

diff --git a/drivers/md/dm-verity-fec.c b/drivers/md/dm-verity-fec.c
index b436b8e4d750..c1677137a682 100644
--- a/drivers/md/dm-verity-fec.c
+++ b/drivers/md/dm-verity-fec.c
@@ -184,18 +184,18 @@ static int fec_decode_bufs(struct dm_verity *v, struct dm_verity_io *io,
* Locate data block erasures using verity hashes.
*/
static int fec_is_erasure(struct dm_verity *v, struct dm_verity_io *io,
u8 *want_digest, u8 *data)
{
+ u8 real_digest[HASH_MAX_DIGESTSIZE];
+
if (unlikely(verity_compute_hash_virt(v, io, data,
1 << v->data_dev_block_bits,
- verity_io_real_digest(v, io),
- true)))
+ real_digest, true)))
return 0;

- return memcmp(verity_io_real_digest(v, io), want_digest,
- v->digest_size) != 0;
+ return memcmp(real_digest, want_digest, v->digest_size) != 0;
}

/*
* Read data blocks that are part of the RS block and deinterleave as much as
* fits into buffers. Check for erasure locations if @neras is non-NULL.
@@ -362,14 +362,15 @@ static void fec_init_bufs(struct dm_verity *v, struct dm_verity_fec_io *fio)
* (indicated by @offset) in fio->output. If @use_erasures is non-zero, uses
* hashes to locate erasures.
*/
static int fec_decode_rsb(struct dm_verity *v, struct dm_verity_io *io,
struct dm_verity_fec_io *fio, u64 rsb, u64 offset,
- bool use_erasures)
+ const u8 *want_digest, bool use_erasures)
{
int r, neras = 0;
unsigned int pos;
+ u8 real_digest[HASH_MAX_DIGESTSIZE];

r = fec_alloc_bufs(v, fio);
if (unlikely(r < 0))
return r;

@@ -389,16 +390,15 @@ static int fec_decode_rsb(struct dm_verity *v, struct dm_verity_io *io,
}

/* Always re-validate the corrected block against the expected hash */
r = verity_compute_hash_virt(v, io, fio->output,
1 << v->data_dev_block_bits,
- verity_io_real_digest(v, io), true);
+ real_digest, true);
if (unlikely(r < 0))
return r;

- if (memcmp(verity_io_real_digest(v, io), verity_io_want_digest(v, io),
- v->digest_size)) {
+ if (memcmp(real_digest, want_digest, v->digest_size)) {
DMERR_LIMIT("%s: FEC %llu: failed to correct (%d erasures)",
v->data_dev->name, (unsigned long long)rsb, neras);
return -EILSEQ;
}

@@ -419,12 +419,12 @@ static int fec_bv_copy(struct dm_verity *v, struct dm_verity_io *io, u8 *data,
/*
* Correct errors in a block. Copies corrected block to dest if non-NULL,
* otherwise to a bio_vec starting from iter.
*/
int verity_fec_decode(struct dm_verity *v, struct dm_verity_io *io,
- enum verity_block_type type, sector_t block, u8 *dest,
- struct bvec_iter *iter)
+ enum verity_block_type type, sector_t block,
+ const u8 *want_digest, u8 *dest, struct bvec_iter *iter)
{
int r;
struct dm_verity_fec_io *fio = fec_io(io);
u64 offset, res, rsb;

@@ -463,13 +463,13 @@ int verity_fec_decode(struct dm_verity *v, struct dm_verity_io *io,
/*
* Locating erasures is slow, so attempt to recover the block without
* them first. Do a second attempt with erasures if the corruption is
* bad enough.
*/
- r = fec_decode_rsb(v, io, fio, rsb, offset, false);
+ r = fec_decode_rsb(v, io, fio, rsb, offset, want_digest, false);
if (r < 0) {
- r = fec_decode_rsb(v, io, fio, rsb, offset, true);
+ r = fec_decode_rsb(v, io, fio, rsb, offset, want_digest, true);
if (r < 0)
goto done;
}

if (dest)
diff --git a/drivers/md/dm-verity-fec.h b/drivers/md/dm-verity-fec.h
index 8454070d2824..57c3f674cae9 100644
--- a/drivers/md/dm-verity-fec.h
+++ b/drivers/md/dm-verity-fec.h
@@ -68,11 +68,12 @@ struct dm_verity_fec_io {

extern bool verity_fec_is_enabled(struct dm_verity *v);

extern int verity_fec_decode(struct dm_verity *v, struct dm_verity_io *io,
enum verity_block_type type, sector_t block,
- u8 *dest, struct bvec_iter *iter);
+ const u8 *want_digest, u8 *dest,
+ struct bvec_iter *iter);

extern unsigned int verity_fec_status_table(struct dm_verity *v, unsigned int sz,
char *result, unsigned int maxlen);

extern void verity_fec_finish_io(struct dm_verity_io *io);
@@ -97,12 +98,12 @@ static inline bool verity_fec_is_enabled(struct dm_verity *v)
return false;
}

static inline int verity_fec_decode(struct dm_verity *v,
struct dm_verity_io *io,
- enum verity_block_type type,
- sector_t block, u8 *dest,
+ enum verity_block_type type, sector_t block,
+ const u8 *want_digest, u8 *dest,
struct bvec_iter *iter)
{
return -EOPNOTSUPP;
}

diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
index 2dd15f5e91b7..ec0a8c51d99e 100644
--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c
@@ -40,10 +40,12 @@
#define DM_VERITY_OPT_TASKLET_VERIFY "try_verify_in_tasklet"

#define DM_VERITY_OPTS_MAX (4 + DM_VERITY_OPTS_FEC + \
DM_VERITY_ROOT_HASH_VERIFICATION_OPTS)

+#define DM_VERITY_MAX_PENDING_DATA_BLOCKS 2
+
static unsigned int dm_verity_prefetch_cluster = DM_VERITY_DEFAULT_PREFETCH_SIZE;

module_param_named(prefetch_cluster, dm_verity_prefetch_cluster, uint, 0644);

/* Is at least one dm-verity instance using the bh workqueue? */
@@ -300,16 +302,16 @@ static int verity_handle_err(struct dm_verity *v, enum verity_block_type type,

/*
* Verify hash of a metadata block pertaining to the specified data block
* ("block" argument) at a specified level ("level" argument).
*
- * On successful return, verity_io_want_digest(v, io) contains the hash value
- * for a lower tree level or for the data block (if we're at the lowest level).
+ * On successful return, want_digest contains the hash value for a lower tree
+ * level or for the data block (if we're at the lowest level).
*
* If "skip_unverified" is true, unverified buffer is skipped and 1 is returned.
* If "skip_unverified" is false, unverified buffer is hashed and verified
- * against current value of verity_io_want_digest(v, io).
+ * against current value of want_digest.
*/
static int verity_verify_level(struct dm_verity *v, struct dm_verity_io *io,
sector_t block, int level, bool skip_unverified,
u8 *want_digest)
{
@@ -318,10 +320,11 @@ static int verity_verify_level(struct dm_verity *v, struct dm_verity_io *io,
u8 *data;
int r;
sector_t hash_block;
unsigned int offset;
struct bio *bio = dm_bio_from_per_bio_data(io, v->ti->per_io_data_size);
+ u8 real_digest[HASH_MAX_DIGESTSIZE];

verity_hash_at_level(v, block, level, &hash_block, &offset);

if (static_branch_unlikely(&use_bh_wq_enabled) && io->in_bh) {
data = dm_bufio_get(v->bufio, hash_block, &buf);
@@ -349,27 +352,26 @@ static int verity_verify_level(struct dm_verity *v, struct dm_verity_io *io,
goto release_ret_r;
}

r = verity_compute_hash_virt(v, io, data,
1 << v->hash_dev_block_bits,
- verity_io_real_digest(v, io),
- !io->in_bh);
+ real_digest, !io->in_bh);
if (unlikely(r < 0))
goto release_ret_r;

- if (likely(memcmp(verity_io_real_digest(v, io), want_digest,
- v->digest_size) == 0))
+ if (likely(!memcmp(real_digest, want_digest, v->digest_size)))
aux->hash_verified = 1;
else if (static_branch_unlikely(&use_bh_wq_enabled) && io->in_bh) {
/*
* Error handling code (FEC included) cannot be run in a
* tasklet since it may sleep, so fallback to work-queue.
*/
r = -EAGAIN;
goto release_ret_r;
} else if (verity_fec_decode(v, io, DM_VERITY_BLOCK_TYPE_METADATA,
- hash_block, data, NULL) == 0)
+ hash_block, want_digest,
+ data, NULL) == 0)
aux->hash_verified = 1;
else if (verity_handle_err(v,
DM_VERITY_BLOCK_TYPE_METADATA,
hash_block)) {
struct bio *bio =
@@ -473,71 +475,10 @@ static int verity_ahash_update_block(struct dm_verity *v,
} while (todo);

return 0;
}

-static int verity_compute_hash(struct dm_verity *v, struct dm_verity_io *io,
- struct bvec_iter *iter, u8 *digest,
- bool may_sleep)
-{
- int r;
-
- if (static_branch_unlikely(&ahash_enabled) && !v->shash_tfm) {
- struct ahash_request *req = verity_io_hash_req(v, io);
- struct crypto_wait wait;
-
- r = verity_ahash_init(v, req, &wait, may_sleep);
- if (unlikely(r))
- goto error;
-
- r = verity_ahash_update_block(v, io, iter, &wait);
- if (unlikely(r))
- goto error;
-
- r = verity_ahash_final(v, req, digest, &wait);
- if (unlikely(r))
- goto error;
- } else {
- struct shash_desc *desc = verity_io_hash_req(v, io);
- struct bio *bio =
- dm_bio_from_per_bio_data(io, v->ti->per_io_data_size);
- struct bio_vec bv = bio_iter_iovec(bio, *iter);
- const unsigned int len = 1 << v->data_dev_block_bits;
- const void *virt;
-
- if (unlikely(len > bv.bv_len)) {
- /*
- * Data block spans pages. This should not happen,
- * since this code path is not used if the data block
- * size is greater than the page size, and all I/O
- * should be data block aligned because dm-verity sets
- * logical_block_size to the data block size.
- */
- DMERR_LIMIT("unaligned io (data block spans pages)");
- return -EIO;
- }
-
- desc->tfm = v->shash_tfm;
- r = crypto_shash_import(desc, v->initial_hashstate);
- if (unlikely(r))
- goto error;
-
- virt = bvec_kmap_local(&bv);
- r = crypto_shash_finup(desc, virt, len, digest);
- kunmap_local(virt);
- if (unlikely(r))
- goto error;
-
- bio_advance_iter(bio, iter, len);
- }
- return 0;
-
-error:
- DMERR("Error hashing block from bio iter: %d", r);
- return r;
-}
-
/*
* Calls function process for 1 << v->data_dev_block_bits bytes in the bio_vec
* starting from iter.
*/
int verity_for_bv_block(struct dm_verity *v, struct dm_verity_io *io,
@@ -581,41 +522,42 @@ static int verity_recheck_copy(struct dm_verity *v, struct dm_verity_io *io,
io->recheck_buffer += len;

return 0;
}

-static noinline int verity_recheck(struct dm_verity *v, struct dm_verity_io *io,
- struct bvec_iter start, sector_t cur_block)
+static int verity_recheck(struct dm_verity *v, struct dm_verity_io *io,
+ struct bvec_iter start, sector_t blkno,
+ const u8 *want_digest)
{
struct page *page;
void *buffer;
int r;
struct dm_io_request io_req;
struct dm_io_region io_loc;
+ u8 real_digest[HASH_MAX_DIGESTSIZE];

page = mempool_alloc(&v->recheck_pool, GFP_NOIO);
buffer = page_to_virt(page);

io_req.bi_opf = REQ_OP_READ;
io_req.mem.type = DM_IO_KMEM;
io_req.mem.ptr.addr = buffer;
io_req.notify.fn = NULL;
io_req.client = v->io;
io_loc.bdev = v->data_dev->bdev;
- io_loc.sector = cur_block << (v->data_dev_block_bits - SECTOR_SHIFT);
+ io_loc.sector = blkno << (v->data_dev_block_bits - SECTOR_SHIFT);
io_loc.count = 1 << (v->data_dev_block_bits - SECTOR_SHIFT);
r = dm_io(&io_req, 1, &io_loc, NULL, IOPRIO_DEFAULT);
if (unlikely(r))
goto free_ret;

r = verity_compute_hash_virt(v, io, buffer, 1 << v->data_dev_block_bits,
- verity_io_real_digest(v, io), true);
+ real_digest, true);
if (unlikely(r))
goto free_ret;

- if (memcmp(verity_io_real_digest(v, io),
- verity_io_want_digest(v, io), v->digest_size)) {
+ if (memcmp(real_digest, want_digest, v->digest_size)) {
r = -EIO;
goto free_ret;
}

io->recheck_buffer = buffer;
@@ -647,22 +589,144 @@ static inline void verity_bv_skip_block(struct dm_verity *v,
struct bio *bio = dm_bio_from_per_bio_data(io, v->ti->per_io_data_size);

bio_advance_iter(bio, iter, 1 << v->data_dev_block_bits);
}

+static noinline int
+verity_handle_data_hash_mismatch(struct dm_verity *v, struct dm_verity_io *io,
+ struct bio *bio, struct bvec_iter *start,
+ sector_t blkno, const u8 *want_digest)
+{
+ if (static_branch_unlikely(&use_bh_wq_enabled) && io->in_bh) {
+ /*
+ * Error handling code (FEC included) cannot be run in the
+ * BH workqueue, so fallback to a standard workqueue.
+ */
+ return -EAGAIN;
+ }
+ if (verity_recheck(v, io, *start, blkno, want_digest) == 0) {
+ if (v->validated_blocks)
+ set_bit(blkno, v->validated_blocks);
+ return 0;
+ }
+#if defined(CONFIG_DM_VERITY_FEC)
+ if (verity_fec_decode(v, io, DM_VERITY_BLOCK_TYPE_DATA, blkno,
+ want_digest, NULL, start) == 0)
+ return 0;
+#endif
+ if (bio->bi_status)
+ return -EIO; /* Error correction failed; Just return error */
+
+ if (verity_handle_err(v, DM_VERITY_BLOCK_TYPE_DATA, blkno)) {
+ dm_audit_log_bio(DM_MSG_PREFIX, "verify-data", bio, blkno, 0);
+ return -EIO;
+ }
+ return 0;
+}
+
+struct pending_block {
+ const void *data;
+ sector_t blkno;
+ struct bvec_iter start;
+ u8 want_digest[HASH_MAX_DIGESTSIZE];
+ u8 real_digest[HASH_MAX_DIGESTSIZE];
+};
+
+struct verification_context {
+ struct dm_verity *v;
+ struct dm_verity_io *io;
+ struct bio *bio;
+ struct pending_block pending_blocks[DM_VERITY_MAX_PENDING_DATA_BLOCKS];
+ int num_pending;
+};
+
+static void verity_clear_pending_blocks(struct verification_context *ctx)
+{
+ int i;
+
+ for (i = ctx->num_pending - 1; i >= 0; i--) {
+ kunmap_local(ctx->pending_blocks[i].data);
+ ctx->pending_blocks[i].data = NULL;
+ }
+ ctx->num_pending = 0;
+}
+
+static __always_inline int
+verity_check_data_block_hash(struct dm_verity *v, struct dm_verity_io *io,
+ struct bio *bio, struct pending_block *block)
+{
+ if (likely(memcmp(block->real_digest, block->want_digest,
+ v->digest_size) == 0)) {
+ if (v->validated_blocks)
+ set_bit(block->blkno, v->validated_blocks);
+ return 0;
+ }
+ return verity_handle_data_hash_mismatch(v, io, bio, &block->start,
+ block->blkno,
+ block->want_digest);
+}
+
+static int verity_verify_pending_blocks(struct verification_context *ctx)
+{
+ struct dm_verity *v = ctx->v;
+ struct dm_verity_io *io = ctx->io;
+ struct bio *bio = ctx->bio;
+ const u8 *data[DM_VERITY_MAX_PENDING_DATA_BLOCKS];
+ u8 *outs[DM_VERITY_MAX_PENDING_DATA_BLOCKS];
+ struct shash_desc *desc = verity_io_hash_req(v, io);
+ int i;
+ int r;
+
+ if (ctx->num_pending == 0)
+ return 0;
+
+ for (i = 0; i < ctx->num_pending; i++) {
+ data[i] = ctx->pending_blocks[i].data;
+ outs[i] = ctx->pending_blocks[i].real_digest;
+ }
+
+ desc->tfm = v->shash_tfm;
+ r = crypto_shash_import(desc, v->initial_hashstate);
+ if (unlikely(r)) {
+ DMERR("Error importing hash state: %d", r);
+ return r;
+ }
+ r = crypto_shash_finup_mb(desc, data, 1 << v->data_dev_block_bits, outs,
+ ctx->num_pending);
+ if (unlikely(r)) {
+ DMERR("Error hashing data blocks: %d", r);
+ return r;
+ }
+
+ for (i = 0; i < ctx->num_pending; i++) {
+ r = verity_check_data_block_hash(v, io, bio,
+ &ctx->pending_blocks[i]);
+ if (unlikely(r))
+ return r;
+ }
+ verity_clear_pending_blocks(ctx);
+ return 0;
+}
+
/*
* Verify one "dm_verity_io" structure.
*/
static int verity_verify_io(struct dm_verity_io *io)
{
- bool is_zero;
struct dm_verity *v = io->v;
- struct bvec_iter start;
+ const unsigned int block_size = 1 << v->data_dev_block_bits;
struct bvec_iter iter_copy;
struct bvec_iter *iter;
struct bio *bio = dm_bio_from_per_bio_data(io, v->ti->per_io_data_size);
+ struct verification_context ctx;
unsigned int b;
+ int r;
+
+ ctx.v = v;
+ ctx.io = io;
+ ctx.bio = bio;
+ ctx.num_pending = 0;

if (static_branch_unlikely(&use_bh_wq_enabled) && io->in_bh) {
/*
* Copy the iterator in case we need to restart
* verification in a work-queue.
@@ -671,82 +735,98 @@ static int verity_verify_io(struct dm_verity_io *io)
iter = &iter_copy;
} else
iter = &io->iter;

for (b = 0; b < io->n_blocks; b++) {
- int r;
- sector_t cur_block = io->block + b;
+ sector_t blkno = io->block + b;
+ struct pending_block *block;
+ bool is_zero;

if (v->validated_blocks && bio->bi_status == BLK_STS_OK &&
- likely(test_bit(cur_block, v->validated_blocks))) {
+ likely(test_bit(blkno, v->validated_blocks))) {
verity_bv_skip_block(v, io, iter);
continue;
}

- r = verity_hash_for_block(v, io, cur_block,
- verity_io_want_digest(v, io),
+ block = &ctx.pending_blocks[ctx.num_pending];
+ block->blkno = blkno;
+ block->start = *iter;
+
+ r = verity_hash_for_block(v, io, blkno, block->want_digest,
&is_zero);
if (unlikely(r < 0))
- return r;
+ goto error;

if (is_zero) {
/*
* If we expect a zero block, don't validate, just
* return zeros.
*/
r = verity_for_bv_block(v, io, iter,
verity_bv_zero);
if (unlikely(r < 0))
- return r;
+ goto error;

continue;
}

- start = *iter;
- r = verity_compute_hash(v, io, iter,
- verity_io_real_digest(v, io),
- !io->in_bh);
- if (unlikely(r < 0))
- return r;
+ if (static_branch_unlikely(&ahash_enabled) && !v->shash_tfm) {
+ /* Hash and verify one data block using ahash. */
+ struct ahash_request *req = verity_io_hash_req(v, io);
+ struct crypto_wait wait;
+
+ r = verity_ahash_init(v, req, &wait, !io->in_bh) ?:
+ verity_ahash_update_block(v, io, iter, &wait) ?:
+ verity_ahash_final(v, req, block->real_digest,
+ &wait);
+ if (unlikely(r)) {
+ DMERR("Error hashing data block: %d", r);
+ goto error;
+ }

- if (likely(memcmp(verity_io_real_digest(v, io),
- verity_io_want_digest(v, io), v->digest_size) == 0)) {
- if (v->validated_blocks)
- set_bit(cur_block, v->validated_blocks);
- continue;
- } else if (static_branch_unlikely(&use_bh_wq_enabled) && io->in_bh) {
- /*
- * Error handling code (FEC included) cannot be run in a
- * tasklet since it may sleep, so fallback to work-queue.
- */
- return -EAGAIN;
- } else if (verity_recheck(v, io, start, cur_block) == 0) {
- if (v->validated_blocks)
- set_bit(cur_block, v->validated_blocks);
- continue;
-#if defined(CONFIG_DM_VERITY_FEC)
- } else if (verity_fec_decode(v, io, DM_VERITY_BLOCK_TYPE_DATA,
- cur_block, NULL, &start) == 0) {
- continue;
-#endif
+ r = verity_check_data_block_hash(v, io, bio, block);
+ if (unlikely(r))
+ goto error;
} else {
- if (bio->bi_status) {
+ /* Queue up one block to be hashed with shash. */
+ struct bio_vec bv = bio_iter_iovec(bio, *iter);
+
+ if (unlikely(bv.bv_len < block_size)) {
/*
- * Error correction failed; Just return error
+ * Data block spans pages. This should not
+ * happen, since this code path is not used if
+ * the data block size is greater than the page
+ * size, and all I/O should be data block
+ * aligned because dm-verity sets
+ * logical_block_size to the data block size.
*/
- return -EIO;
+ DMERR_LIMIT("unaligned io (data block spans pages)");
+ r = -EIO;
+ goto error;
}
- if (verity_handle_err(v, DM_VERITY_BLOCK_TYPE_DATA,
- cur_block)) {
- dm_audit_log_bio(DM_MSG_PREFIX, "verify-data",
- bio, cur_block, 0);
- return -EIO;
+
+ block->data = bvec_kmap_local(&bv);
+ if (++ctx.num_pending == v->mb_max_msgs) {
+ /* Queue is full. Verify the blocks. */
+ r = verity_verify_pending_blocks(&ctx);
+ if (r)
+ goto error;
+
}
+ bio_advance_iter(bio, iter, block_size);
}
}

+ r = verity_verify_pending_blocks(&ctx);
+ if (r)
+ goto error;
+
return 0;
+
+error:
+ verity_clear_pending_blocks(&ctx);
+ return r;
}

/*
* Skip verity work in response to I/O error when system is shutting down.
*/
@@ -1321,10 +1401,34 @@ static int verity_setup_hash_alg(struct dm_verity *v, const char *alg_name)
if (!v->alg_name) {
ti->error = "Cannot allocate algorithm name";
return -ENOMEM;
}

+ /*
+ * Allocate the hash transformation object that this dm-verity instance
+ * will use. We have a choice of two APIs: shash and ahash. Most
+ * dm-verity users use CPU-based hashing, and for this shash is optimal
+ * since it matches the underlying algorithm implementations and also
+ * allows the use of fast multibuffer hashing (crypto_shash_finup_mb()).
+ * ahash adds support for off-CPU hash offloading. It also provides
+ * access to shash algorithms, but does so less efficiently.
+ *
+ * Meanwhile, hashing a block in dm-verity in general requires an
+ * init+update+final sequence with multiple updates. However, usually
+ * the salt is prepended to the block rather than appended, and the data
+ * block size is not greater than the page size. In this very common
+ * case, the sequence can be optimized to import+finup, where the first
+ * step imports the pre-computed state after init+update(salt). This
+ * can reduce the crypto API overhead significantly.
+ *
+ * To provide optimal performance for the vast majority of dm-verity
+ * users while still supporting off-CPU hash offloading and the rarer
+ * dm-verity settings, we therefore have two code paths: one using shash
+ * where we use import+finup_mb, and one using ahash where we use
+ * init+update(s)+final. We use the former code path when it's possible
+ * to use and shash gives the same algorithm as ahash.
+ */
ahash = crypto_alloc_ahash(alg_name, 0,
v->use_bh_wq ? CRYPTO_ALG_ASYNC : 0);
if (IS_ERR(ahash)) {
ti->error = "Cannot initialize hash function";
return PTR_ERR(ahash);
@@ -1345,14 +1449,17 @@ static int verity_setup_hash_alg(struct dm_verity *v, const char *alg_name)
}
if (!IS_ERR_OR_NULL(shash)) {
crypto_free_ahash(ahash);
ahash = NULL;
v->shash_tfm = shash;
+ v->mb_max_msgs = min(crypto_shash_mb_max_msgs(shash),
+ DM_VERITY_MAX_PENDING_DATA_BLOCKS);
v->digest_size = crypto_shash_digestsize(shash);
v->hash_reqsize = sizeof(struct shash_desc) +
crypto_shash_descsize(shash);
- DMINFO("%s using shash \"%s\"", alg_name, driver_name);
+ DMINFO("%s using shash \"%s\"%s", alg_name, driver_name,
+ v->mb_max_msgs > 1 ? " (multibuffer)" : "");
} else {
v->ahash_tfm = ahash;
static_branch_inc(&ahash_enabled);
v->digest_size = crypto_ahash_digestsize(ahash);
v->hash_reqsize = sizeof(struct ahash_request) +
diff --git a/drivers/md/dm-verity.h b/drivers/md/dm-verity.h
index 15ffb0881cc9..932b0c437d21 100644
--- a/drivers/md/dm-verity.h
+++ b/drivers/md/dm-verity.h
@@ -55,10 +55,11 @@ struct dm_verity {
unsigned char hash_per_block_bits; /* log2(hashes in hash block) */
unsigned char levels; /* the number of tree levels */
unsigned char version;
bool hash_failed:1; /* set if hash of any block failed */
bool use_bh_wq:1; /* try to verify in BH wq before normal work-queue */
+ unsigned char mb_max_msgs; /* max multibuffer hashing interleaving factor */
unsigned int digest_size; /* digest size for the current hash algorithm */
unsigned int hash_reqsize; /* the size of temporary space for crypto */
enum verity_mode mode; /* mode for handling verification errors */
unsigned int corrupted_errs;/* Number of errors for corrupted blocks */

@@ -92,42 +93,23 @@ struct dm_verity_io {
struct work_struct bh_work;

char *recheck_buffer;

/*
- * Three variably-size fields follow this struct:
- *
- * u8 hash_req[v->hash_reqsize];
- * u8 real_digest[v->digest_size];
- * u8 want_digest[v->digest_size];
- *
- * To access them use: verity_io_hash_req(), verity_io_real_digest()
- * and verity_io_want_digest().
- *
- * hash_req is either a struct ahash_request or a struct shash_desc,
- * depending on whether ahash_tfm or shash_tfm is being used.
+ * This struct is followed by a variable-sized hash request of size
+ * v->hash_reqsize, either a struct ahash_request or a struct shash_desc
+ * (depending on whether ahash_tfm or shash_tfm is being used). To
+ * access it, use verity_io_hash_req().
*/
};

static inline void *verity_io_hash_req(struct dm_verity *v,
struct dm_verity_io *io)
{
return io + 1;
}

-static inline u8 *verity_io_real_digest(struct dm_verity *v,
- struct dm_verity_io *io)
-{
- return (u8 *)(io + 1) + v->hash_reqsize;
-}
-
-static inline u8 *verity_io_want_digest(struct dm_verity *v,
- struct dm_verity_io *io)
-{
- return (u8 *)(io + 1) + v->hash_reqsize + v->digest_size;
-}
-
extern int verity_for_bv_block(struct dm_verity *v, struct dm_verity_io *io,
struct bvec_iter *iter,
int (*process)(struct dm_verity *v,
struct dm_verity_io *io,
u8 *data, size_t len));
--
2.45.1


2024-06-04 18:42:27

by Eric Biggers

Subject: Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

On Tue, Jun 04, 2024 at 05:37:36PM +0800, Herbert Xu wrote:
> On Mon, Jun 03, 2024 at 11:37:29AM -0700, Eric Biggers wrote:
> >
> > + for (i = 0; i < ctx->num_pending; i++) {
> > + data[i] = ctx->pending_blocks[i].data;
> > + outs[i] = ctx->pending_blocks[i].hash;
> > + }
> > +
> > + desc->tfm = params->hash_alg->tfm;
> > + if (params->hashstate)
> > + err = crypto_shash_import(desc, params->hashstate);
> > + else
> > + err = crypto_shash_init(desc);
> > + if (err) {
> > + fsverity_err(inode, "Error %d importing hash state", err);
> > + return false;
> > + }
> > + err = crypto_shash_finup_mb(desc, data, params->block_size, outs,
> > + ctx->num_pending);
> > + if (err) {
> > + fsverity_err(inode, "Error %d computing block hashes", err);
> > + return false;
> > + }
>
> So with ahash operating in synchronous mode (callback == NULL), this
> would look like:
>
> struct ahash_request *reqs[FS_VERITY_MAX_PENDING_DATA_BLOCKS];
>
> for (i = 0; i < ctx->num_pending; i++) {
> reqs[i] = fsverity_alloc_hash_request();
> if (!req) {
> free all reqs;
> return false;
> }
>
> if (params->hashstate)
> err = crypto_ahash_import(&reqs[i], params->hashstate);
> else
> err = crypto_ahash_init(&reqs[i]);
>
> if (err) {
> fsverity_err(inode, "Error %d importing hash state", err);
> free all reqs;
> return false;
> }
> }
>
> for (i = 0; i < ctx->num_pending; i++) {
> unsigned more;
>
> if (params->hashstate)
> err = crypto_ahash_import(reqs[i], params->hashstate);
> else
> err = crypto_ahash_init(reqs[i]);
>
> if (err) {
> fsverity_err(inode, "Error %d importing hash state", err);
> free all requests;
> return false;
> }
>
> more = 0;
> if (i + 1 < ctx->num_pending)
> more = CRYPTO_TFM_REQ_MORE;
> ahash_request_set_callback(reqs[i], CRYPTO_TFM_REQ_MAY_SLEEP | more,
> NULL, NULL);
> ahash_request_set_crypt(reqs[i], ctx->pending_blocks[i].sg,
> ctx->pending_blocks[i].hash,
> params->block_size);
>
> err = crypto_ahash_finup(reqs[i]);
> if (err) {
> fsverity_err(inode, "Error %d computing block hashes", err);
> free all requests;
> return false;
> }
> }
>
> You're hiding some of the complexity by not allocating memory
> explicitly for each hash state. This might fit on the stack
> for two requests, but eventually you will have to allocate memory.
>
> With the ahash API, the allocation is explicit.
>

This doesn't make any sense, though. First, the requests need to be enqueued
for the task, but crypto_ahash_finup() would only have the ability to enqueue them
in a queue associated with the tfm, which is shared by many tasks. So it can't
actually work unless the tfm maintained a separate queue for each task, which
would be really complex. Second, it adds a memory allocation per block which is
very undesirable. You claim that it's needed anyway, but actually it's not;
with my API there is only one initial hash state regardless of how high the
interleaving factor is. In fact, if multiple initial states were allowed,
multibuffer hashing would become much more complex because the underlying
algorithm would need to validate that these different states are synced up. My
proposal is much simpler and avoids all this unnecessary overhead.
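
To make the comparison concrete, here is a rough sketch of the caller side with
my proposal (identifiers such as "tfm", "blocks", "nr", and "block_size" are
illustrative, not actual fsverity code). There is a single shash_desc holding
the one starting state, and the per-message data and output pointers are simply
passed as arrays:

	SHASH_DESC_ON_STACK(desc, tfm);
	const u8 *data[HASH_MAX_MB_MSGS];
	u8 *outs[HASH_MAX_MB_MSGS];
	unsigned int i;
	int err;

	desc->tfm = tfm;
	err = crypto_shash_init(desc);	/* or import a precomputed salted state */
	if (err)
		return err;
	for (i = 0; i < nr; i++) {
		data[i] = blocks[i].data;
		outs[i] = blocks[i].hash;
	}
	return crypto_shash_finup_mb(desc, data, block_size, outs, nr);

No per-message request allocation, no scatterlists, and nothing for the
algorithm to cross-check between requests.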

Really the only reason to even consider ahash at all would be to try to support
software hashing and off-CPU hardware accelerators using the "same" code.
However, your proposal would not achieve that either, as it would not use the
async callback. Note, as far as I know no one actually cares about off-CPU
hardware accelerator support in fsverity anyway...

- Eric

2024-06-04 19:07:46

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 1/8] crypto: shash - add support for finup_mb

On Mon, 3 Jun 2024 at 20:39, Eric Biggers <[email protected]> wrote:
>
> From: Eric Biggers <[email protected]>
>
> Most cryptographic hash functions are serialized, in the sense that they
> have an internal block size and the blocks must be processed serially.
> (BLAKE3 is a notable exception that has tree-based hashing built-in, but
> all the more common choices such as the SHAs and BLAKE2 are serialized.
> ParallelHash and Sakura are parallel hashes based on SHA3, but SHA3 is
> much slower than SHA256 in software even with the ARMv8 SHA3 extension.)
>
> This limits the performance of computing a single hash. Yet, computing
> multiple hashes simultaneously does not have this limitation. Modern
> CPUs are superscalar and often can execute independent instructions in
> parallel. As a result, on many modern CPUs, it is possible to hash two
> equal-length messages in about the same time as a single message, if all
> the instructions are interleaved.
>

It's not only about out-of-order/superscalar execution. In some cases
(at least on ARM), it takes more than a cycle for the result of an
instruction to become available to the next one, even if the
computation itself completes in a single cycle, and this affects
in-order cores as well.

The crux here is that the associated benefit only exists if the
independent inputs can be interleaved at the instruction level. OoO
cores will have some tolerance for deviations from this, but in the
general case, this kind of multi-stream processing requires meticulous
parallelization.

That also means that it is substantially different from the
asynchronous accelerator use case where a single IP block may have
different queues that can be used in parallel. For these, it might
make sense to provide some infrastructure to mix inputs from disparate
sources, but the same logic is unlikely to be useful for the CPU based
parallel hashing case.

...
>
> This patch takes a new approach of just adding an API
> crypto_shash_finup_mb() that synchronously computes the hash of multiple
> equal-length messages, starting from a common state that represents the
> (possibly empty) common prefix shared by the messages.
>

This is an independent optimization, right? This could be useful even
for sequential hashing, and is not a fundamental aspect of parallel
hashing?

> The new API is part of the "shash" algorithm type, as it does not make
> sense in "ahash". It does a "finup" operation rather than a "digest"
> operation in order to support the salt that is used by dm-verity and
> fs-verity. The data and output buffers are provided in arrays of length
> @num_msgs in order to make the API itself extensible to interleaving
> factors other than 2. (Though, initially only 2x will actually be used.
> There are some platforms in which a higher factor could help, but there
> are significant trade-offs.)
>

I could imagine cases where 3-way would have an advantage over 2-way -
it is highly uarch dependent, though, so I wouldn't spend too much
time accommodating this before a use case actually materializes.

> Signed-off-by: Eric Biggers <[email protected]>
> ---
> crypto/shash.c | 60 +++++++++++++++++++++++++++++++++++++++++++
> include/crypto/hash.h | 45 +++++++++++++++++++++++++++++++-
> 2 files changed, 104 insertions(+), 1 deletion(-)
>
> diff --git a/crypto/shash.c b/crypto/shash.c
> index 301ab42bf849..5a2352933fbf 100644
> --- a/crypto/shash.c
> +++ b/crypto/shash.c
> @@ -73,10 +73,57 @@ int crypto_shash_finup(struct shash_desc *desc, const u8 *data,
> {
> return crypto_shash_alg(desc->tfm)->finup(desc, data, len, out);
> }
> EXPORT_SYMBOL_GPL(crypto_shash_finup);
>
> +static noinline_for_stack int
> +shash_finup_mb_fallback(struct shash_desc *desc, const u8 * const data[],
> + unsigned int len, u8 * const outs[],
> + unsigned int num_msgs)
> +{
> + struct crypto_shash *tfm = desc->tfm;
> + SHASH_DESC_ON_STACK(desc2, tfm);
> + unsigned int i;
> + int err;
> +
> + for (i = 0; i < num_msgs - 1; i++) {
> + desc2->tfm = tfm;
> + memcpy(shash_desc_ctx(desc2), shash_desc_ctx(desc),
> + crypto_shash_descsize(tfm));
> + err = crypto_shash_finup(desc2, data[i], len, outs[i]);
> + if (err)
> + return err;
> + }
> + return crypto_shash_finup(desc, data[i], len, outs[i]);
> +}
> +
> +int crypto_shash_finup_mb(struct shash_desc *desc, const u8 * const data[],
> + unsigned int len, u8 * const outs[],
> + unsigned int num_msgs)
> +{
> + struct shash_alg *alg = crypto_shash_alg(desc->tfm);
> + int err;
> +
> + if (num_msgs == 1)
> + return crypto_shash_finup(desc, data[0], len, outs[0]);
> +
> + if (num_msgs == 0)
> + return 0;
> +
> + if (WARN_ON_ONCE(num_msgs > alg->mb_max_msgs))
> + goto fallback;
> +
> + err = alg->finup_mb(desc, data, len, outs, num_msgs);
> + if (unlikely(err == -EOPNOTSUPP))
> + goto fallback;
> + return err;
> +
> +fallback:
> + return shash_finup_mb_fallback(desc, data, len, outs, num_msgs);
> +}
> +EXPORT_SYMBOL_GPL(crypto_shash_finup_mb);
> +
> static int shash_default_digest(struct shash_desc *desc, const u8 *data,
> unsigned int len, u8 *out)
> {
> struct shash_alg *shash = crypto_shash_alg(desc->tfm);
>
> @@ -312,10 +359,20 @@ static int shash_prepare_alg(struct shash_alg *alg)
> return -EINVAL;
>
> if ((alg->export && !alg->import) || (alg->import && !alg->export))
> return -EINVAL;
>
> + if (alg->mb_max_msgs) {
> + if (alg->mb_max_msgs > HASH_MAX_MB_MSGS)
> + return -EINVAL;
> + if (!alg->finup_mb)
> + return -EINVAL;
> + } else {
> + if (alg->finup_mb)
> + return -EINVAL;
> + }
> +
> err = hash_prepare_alg(&alg->halg);
> if (err)
> return err;
>
> base->cra_type = &crypto_shash_type;
> @@ -339,10 +396,13 @@ static int shash_prepare_alg(struct shash_alg *alg)
> if (!alg->export)
> alg->halg.statesize = alg->descsize;
> if (!alg->setkey)
> alg->setkey = shash_no_setkey;
>
> + if (!alg->mb_max_msgs)
> + alg->mb_max_msgs = 1;
> +
> return 0;
> }
>
> int crypto_register_shash(struct shash_alg *alg)
> {
> diff --git a/include/crypto/hash.h b/include/crypto/hash.h
> index 2d5ea9f9ff43..002099610755 100644
> --- a/include/crypto/hash.h
> +++ b/include/crypto/hash.h
> @@ -154,11 +154,13 @@ struct ahash_alg {
> struct shash_desc {
> struct crypto_shash *tfm;
> void *__ctx[] __aligned(ARCH_SLAB_MINALIGN);
> };
>
> -#define HASH_MAX_DIGESTSIZE 64
> +#define HASH_MAX_DIGESTSIZE 64
> +
> +#define HASH_MAX_MB_MSGS 2 /* max value of crypto_shash_mb_max_msgs() */
>
> /*
> * Worst case is hmac(sha3-224-generic). Its context is a nested 'shash_desc'
> * containing a 'struct sha3_state'.
> */
> @@ -177,10 +179,19 @@ struct shash_desc {
> * @finup: see struct ahash_alg
> * @digest: see struct ahash_alg
> * @export: see struct ahash_alg
> * @import: see struct ahash_alg
> * @setkey: see struct ahash_alg
> + * @finup_mb: **[optional]** Multibuffer hashing support. Finish calculating
> + * the digests of multiple messages, interleaving the instructions to
> + * potentially achieve better performance than hashing each message
> + * individually. The num_msgs argument will be between 2 and
> + * @mb_max_msgs inclusively. If there are particular values of len
> + * or num_msgs, or a particular calling context (e.g. no-SIMD) that
> + * the implementation does not support with this method, the
> + * implementation may return -EOPNOTSUPP from this method in those
> + * cases to cause the crypto API to fall back to repeated finups.
> * @init_tfm: Initialize the cryptographic transformation object.
> * This function is called only once at the instantiation
> * time, right after the transformation context was
> * allocated. In case the cryptographic hardware has
> * some special requirements which need to be handled
> @@ -192,10 +203,11 @@ struct shash_desc {
> * various changes set in @init_tfm.
> * @clone_tfm: Copy transform into new object, may allocate memory.
> * @descsize: Size of the operational state for the message digest. This state
> * size is the memory size that needs to be allocated for
> * shash_desc.__ctx
> + * @mb_max_msgs: Maximum supported value of num_msgs argument to @finup_mb
> * @halg: see struct hash_alg_common
> * @HASH_ALG_COMMON: see struct hash_alg_common
> */
> struct shash_alg {
> int (*init)(struct shash_desc *desc);
> @@ -208,15 +220,19 @@ struct shash_alg {
> unsigned int len, u8 *out);
> int (*export)(struct shash_desc *desc, void *out);
> int (*import)(struct shash_desc *desc, const void *in);
> int (*setkey)(struct crypto_shash *tfm, const u8 *key,
> unsigned int keylen);
> + int (*finup_mb)(struct shash_desc *desc, const u8 * const data[],
> + unsigned int len, u8 * const outs[],
> + unsigned int num_msgs);
> int (*init_tfm)(struct crypto_shash *tfm);
> void (*exit_tfm)(struct crypto_shash *tfm);
> int (*clone_tfm)(struct crypto_shash *dst, struct crypto_shash *src);
>
> unsigned int descsize;
> + unsigned int mb_max_msgs;
>
> union {
> struct HASH_ALG_COMMON;
> struct hash_alg_common halg;
> };
> @@ -750,10 +766,20 @@ static inline unsigned int crypto_shash_digestsize(struct crypto_shash *tfm)
> static inline unsigned int crypto_shash_statesize(struct crypto_shash *tfm)
> {
> return crypto_shash_alg(tfm)->statesize;
> }
>
> +/*
> + * Return the maximum supported multibuffer hashing interleaving factor, i.e.
> + * the maximum num_msgs that can be passed to crypto_shash_finup_mb(). The
> + * return value will be between 1 and HASH_MAX_MB_MSGS inclusively.
> + */
> +static inline unsigned int crypto_shash_mb_max_msgs(struct crypto_shash *tfm)
> +{
> + return crypto_shash_alg(tfm)->mb_max_msgs;
> +}
> +
> static inline u32 crypto_shash_get_flags(struct crypto_shash *tfm)
> {
> return crypto_tfm_get_flags(crypto_shash_tfm(tfm));
> }
>
> @@ -843,10 +869,27 @@ int crypto_shash_digest(struct shash_desc *desc, const u8 *data,
> * Return: 0 on success; < 0 if an error occurred.
> */
> int crypto_shash_tfm_digest(struct crypto_shash *tfm, const u8 *data,
> unsigned int len, u8 *out);
>
> +/**
> + * crypto_shash_finup_mb() - multibuffer message hashing
> + * @desc: the starting state that is forked for each message. It contains the
> + * state after hashing a (possibly-empty) common prefix of the messages.
> + * @data: the data of each message (not including any common prefix from @desc)
> + * @len: length of each data buffer in bytes
> + * @outs: output buffer for each message digest
> + * @num_msgs: number of messages, i.e. the number of entries in @data and @outs.
> + * This can't be more than crypto_shash_mb_max_msgs().
> + *
> + * Context: Any context.
> + * Return: 0 on success; a negative errno value on failure.
> + */
> +int crypto_shash_finup_mb(struct shash_desc *desc, const u8 * const data[],
> + unsigned int len, u8 * const outs[],
> + unsigned int num_msgs);
> +
> /**
> * crypto_shash_export() - extract operational state for message digest
> * @desc: reference to the operational state handle whose state is exported
> * @out: output buffer of sufficient size that can hold the hash state
> *
> --
> 2.45.1
>
>

2024-06-04 19:30:00

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v4 1/8] crypto: shash - add support for finup_mb

On Tue, Jun 04, 2024 at 08:55:48PM +0200, Ard Biesheuvel wrote:
> >
> > This patch takes a new approach of just adding an API
> > crypto_shash_finup_mb() that synchronously computes the hash of multiple
> > equal-length messages, starting from a common state that represents the
> > (possibly empty) common prefix shared by the messages.
> >
>
> This is an independent optimization, right? This could be useful even
> for sequential hashing, and is not a fundamental aspect of parallel
> hashing?

If you're referring to the part about using a common starting state, that's not
an independent optimization. Only multibuffer hashing processes multiple
messages in one call and therefore has an opportunity to share a starting
shash_desc for finup. This isn't just an optimization but it also makes the
multibuffer hashing API and its implementation much simpler.

With single-buffer there has to be one shash_desc per message as usual.

If you're asking if crypto_shash_finup_mb() can be used even without multibuffer
hashing support, the answer is yes. This patchset makes crypto_shash_finup_mb()
fall back to crypto_shash_finup() as needed, and this is used by fsverity and
dm-verity to have one code path that uses crypto_shash_finup_mb() instead of
separate code paths that use crypto_shash_finup_mb() and crypto_shash_finup().
This just makes things a bit simpler and isn't an optimization; note that the
fallback has to copy the shash_desc for each message beyond the first.
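
A rough sketch of that unified path (names are illustrative, not the actual
fsverity code): query the interleaving factor once, batch up that many
equal-length blocks, and make a single call.

	unsigned int max_pending = crypto_shash_mb_max_msgs(tfm); /* 1 if no mb support */

	/* ... accumulate up to max_pending equal-length data blocks ... */

	if (num_pending == max_pending) {
		err = crypto_shash_finup_mb(desc, data, block_size,
					    outs, num_pending);
		num_pending = 0;
	}

With max_pending == 1 this degenerates to a plain crypto_shash_finup() via the
num_msgs == 1 case in crypto_shash_finup_mb(), so no separate single-buffer
path is needed.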

- Eric

2024-06-05 18:58:38

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

On Wed, Jun 05, 2024 at 05:19:01PM +0800, Herbert Xu wrote:
> On Tue, Jun 04, 2024 at 11:42:20AM -0700, Eric Biggers wrote:
> >
> > This doesn't make any sense, though. First, the requests need to be enqueued
> > for the task, but crypto_ahash_finup() would only have the ability to enqueue it
> > in a queue associated with the tfm, which is shared by many tasks. So it can't
>
> OK I screwed up that one. But that's not hard to fix. We could
> simply add request chaining:
>
> ahash_request_chain(req1, req2);
> ahash_request_chain(req2, req3);
> ...
> ahash_request_chain(reqn1, reqn);
>
> err = crypto_ahash_finup(req1);

So after completely changing several times, your proposal is getting a bit closer
to mine, but it still uses a very clumsy API based around ahash that would be
much harder to use and implement correctly. It also says nothing about what the
API would look like on the shash side, which would be needed anyway because
ahash is almost always just a pointless wrapper for shash. Is there a different
API that you are asking for on the shash side? If so, what?

> > actually work unless the tfm maintained a separate queue for each task, which
> > would be really complex. Second, it adds a memory allocation per block which is
> > very undesirable. You claim that it's needed anyway, but actually it's not;
> > with my API there is only one initial hash state regardless of how high the
> > interleaving factor is. In fact, if multiple initial states were allowed,
>
> Sure you don't need it for two interleaved requests. But as you
> scale up to 16 and beyond, surely at some point you're going to
> want to move the hash states off the stack.

To reiterate, with my proposal there is only one state in memory. It's a simple
API that can't be misused by providing inconsistent properties in the requests.
Yes, separate states would be needed if we were to support arbitrary updates,
but why add all that complexity before it's actually needed?

Also, "16 and beyond" is highly unlikely to be useful for kernel use cases.

> > multibuffer hashing would become much more complex because the underlying
> > algorithm would need to validate that these different states are synced up. My
> > proposal is much simpler and avoids all this unnecessary overhead.
>
> We could simply state that these chained requests must be on block
> boundaries, similar to how we handle block ciphers. This is not a
> big deal.

... which would make it useless for most dm-verity users, as dm-verity uses a
32-byte salt with SHA-256 (which has a 64-byte block size) by default.
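
To illustrate (a rough sketch, not the actual dm-verity code): the shared
prefix is just the salt, so the starting state already contains a partial
SHA-256 block, and every per-block finup begins mid-block.

	/* Done once at table load: absorb the salt and save the state. */
	desc->tfm = tfm;
	err = crypto_shash_init(desc);
	if (!err)
		err = crypto_shash_update(desc, salt, salt_size); /* 32 bytes, not a multiple of 64 */
	if (!err)
		err = crypto_shash_export(desc, hashstate);

	/* Done per batch of data blocks: resume from the salted state. */
	err = crypto_shash_import(desc, hashstate);
	if (!err)
		err = crypto_shash_finup_mb(desc, data, block_size, outs, 2);

A "requests must start on a block boundary" rule would therefore exclude the
default dm-verity configuration.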

>
> > Really the only reason to even consider ahash at all would be try to support
> > software hashing and off-CPU hardware accelerators using the "same" code.
> > However, your proposal would not achieve that either, as it would not use the
> > async callback. Note, as far as I know no one actually cares about off-CPU
> > hardware accelerator support in fsverity anyway...
>
> The other thing is that verity doesn't benefit from shash at all.
> It appears to be doing kmap's on every single request.

The diff from switching fsverity from ahash to shash clearly demonstrates
otherwise. Yes, fsverity has to map the pages to pass into shash, but that's a
very minor thing compared to all the complexity of ahash that was saved. And
fsverity already had to map most of the pages anyway to access them.

- Eric

2024-06-05 19:14:17

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

On Wed, Jun 05, 2024 at 05:46:27PM +0800, Herbert Xu wrote:
> On Wed, Jun 05, 2024 at 05:22:21PM +0800, Herbert Xu wrote:
> >
> > However, I really dislike the idea of shoehorning this into shash.
> > I know you really like shash, but I think there are some clear
> > benefits to be had by coupling this with ahash.
>
> If we do this properly, we should be able to immediately use the
> mb code with IPsec. In the network stack, we already aggregate
> the data prior to IPsec with GSO. So at the boundary between
> IPsec and the Crypto API, it's dividing chunks of data up to 64K
> into 1500-byte packets and feeding them to crypto one at a time.
>
> It really should be sending the whole chain of packets to us as
> a unit.
>
> Once we have a proper mb interface, we can fix that and immediately
> get the benefit of mb hashing.
>

This would at most apply to AH, not to ESP. Is AH commonly used these days?
Also even for AH, the IPsec code would need to be significantly restructured to
make use of multibuffer hashing. See how the segmentation happens in
xfrm_output_gso(), but the ahash calls happen much lower in the stack.

I'm guessing that you've had the AH use case in mind since your earlier
comments. You were originally pushing for this to be supported using the
existing async support in the ahash API (which would have required fewer code
changes on the AH side), but we now agree that is not feasible, so maybe it is time
to reconsider whether it would still be worthwhile to make all the changes to
the AH code needed to support this?

Also, even if it would be worthwhile and would use ahash, ahash is almost always
just a wrapper for shash. So the shash support would be needed anyway.

- Eric

2024-06-06 05:28:10

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

On Thu, Jun 06, 2024 at 10:00:05AM +0800, Herbert Xu wrote:
> On Wed, Jun 05, 2024 at 12:14:10PM -0700, Eric Biggers wrote:
> >
> > This would at most apply to AH, not to ESP. Is AH commonly used these days?
>
> No AH is completely useless. However, this applies perfectly to
> ESP, in conjunction with authenc. Obviously we would need to add
> request linking to authenc (AEAD) as well so that it can pass it
> along to sha.
>
> BTW, does any of this interleaving apply to AES? If so we should
> explore adding request linking to skcipher as well.
>

With AES, interleaving would only help with non-parallelizable modes such as CBC
encryption. Anyone who cares about IPsec performance should of course be using
AES-GCM, which is parallelizable. Especially since my other patch
https://lore.kernel.org/linux-crypto/[email protected]/
is making AES-GCM twice as fast...

With hashing we unfortunately don't have the luxury of there being widely used
and accepted parallelizable algorithms. In particular, all the SHAs are
serialized. So that's why interleaving makes sense there.

In any case, it seems that what you're asking for at this point is far beyond
the scope of this patchset.

- Eric

2024-06-06 06:59:06

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

On Thu, 6 Jun 2024 at 07:41, Herbert Xu <[email protected]> wrote:
>
> On Wed, Jun 05, 2024 at 10:28:01PM -0700, Eric Biggers wrote:
> >
> > With AES, interleaving would only help with non-parallelizable modes such as CBC
> > encryption. Anyone who cares about IPsec performance should of course be using
> > AES-GCM, which is parallelizable. Especially since my other patch
> > https://lore.kernel.org/linux-crypto/[email protected]/
> > is making AES-GCM twice as fast...
>
> Algorithm selection may be limited by peer capability. For IPsec,
> if SHA is being used, then most likely CBC is also being used.
>

IPSec users relying on software crypto and authenc() and caring about
performance seems like a rather niche use case to me.

> > In any case, it seems that what you're asking for at this point is far beyond
> > the scope of this patchset.
>
> I'm more than happy to take this over if you don't wish to extend
> it beyond the storage usage cases. According to the original Intel
> sha2-mb submission, this should result in at least a two-fold
> speed-up.
>

I'm struggling to follow this debate. Surely, if this functionality
needs to live in ahash, the shash fallbacks need to implement this
parallel scheme too, or ahash would end up just feeding the requests
into shash sequentially, defeating the purpose. It is then up to the
API client to choose between ahash or shash, just as it can today.

So Eric has a pretty strong case for his shash implementation;
kmap_local() is essentially a NOP on architectures that anyone still
cares about (unlike kmap_atomic() which still disables preemption), so
I don't have a problem with the caller relying on that in order to be
able to use shash directly. The whole scatterlist / request alloc
dance is just too tedious and pointless, given that in practice, it
all gets relegated to shash anyway.
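
Concretely, the shash path for one block is just (rough sketch, identifiers
illustrative):

	void *virt = kmap_local_page(page);

	err = crypto_shash_finup(desc, virt + offset, block_size, out);
	kunmap_local(virt);

versus allocating a request, building a scatterlist and going through the
ahash entry points for what ends up being the same shash call underneath.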

But my point is that even if we go with Herbert's proposal for the
ahash, we'll still need something like Eric's code on the shash side.

For the true async accelerator use case, none of this should make any
difference, right? If the caller already tolerates async (but
in-order) completion, implementing this request chaining doesn't buy
it anything. So only when the caller is sync and the implementation is
async, we might be able to do something smart where the caller waits
on a single completion that signals the completion of a set of inputs.
But this is also rather niche, so not worth holding up this effort.

So Herbert, what would the ahash_to_shash plumbing look like for the
ahash API that you are proposing? What changes would it require to
shash, and how much do they deviate from what Eric is suggesting?

2024-06-06 07:56:16

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

On Thu, 6 Jun 2024 at 09:34, Herbert Xu <[email protected]> wrote:
>
> On Thu, Jun 06, 2024 at 08:58:47AM +0200, Ard Biesheuvel wrote:
> >
> > IPSec users relying on software crypto and authenc() and caring about
> > performance seems like a rather niche use case to me.
>
> It's no more niche than fs/verity and dm-integrity.

fsverity is used by Android, which is the diametrical opposite of
niche when it comes to Linux distros.

> this could potentially be used for all algorithms. Just the
> reduction in the number of function calls may produce enough
> of a benefit (this is something I observed when adding GSO,
> even without any actual hardware offload, simply aggregating
> packets into larger units produced a visible benefit).
>

Sure. But my point is that anyone who cares enough about IPsec
performance would have stopped using authenc(cbc(aes),hmac(sha2)) long
ago and moved to GCM or even ChaChaPoly. This is not just a matter of
efficiency in the implementation - using a general purpose hash
function such as SHA-256 [twice] rather than GHASH or Poly1305 is just
overkill.

So complicating the performance optimization of something that runs on
every (non-Apple) phone for the benefit of something that is rarely
used in a context where the performance matters seems unreasonable to
me.

> > I'm struggling to follow this debate. Surely, if this functionality
> > needs to live in ahash, the shash fallbacks need to implement this
> > parallel scheme too, or ahash would end up just feeding the requests
> > into shash sequentially, defeating the purpose. It is then up to the
> > API client to choose between ahash or shash, just as it can today.
>
> I've never suggested it adding it to shash at all. In fact
> that's my primary problem with this interface. I think it
> should be ahash only. Just like skcipher and aead.
>

So again, how would that work for ahash falling back to shash. Are you
saying every existing shash implementation should be duplicated into
an ahash so that the multibuffer optimization can be added? shash is a
public interface so we cannot just remove the existing ones and we'll
end up carrying both forever.

> My reasoning is that this should cater mostly to bulk data, i.e.,
> multiple pages (e.g., for IPsec we're talking about 64K chunks,
> actually that (the 64K limit) is something that we should really
> extend, it's not 2014 anymore). These typically will be more
> easily accessed as a number of distinct pages rather than as a
> contiguous mapping.
>

Sure, but the block I/O world is very different. Forcing it to use an
API modeled after how IPsec might use it seems, again, unreasonable.

So these 64k chunks are made up of 1500 byte frames that can be hashed
in parallel?

2024-06-06 08:33:51

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

On Thu, 6 Jun 2024 at 10:08, Herbert Xu <[email protected]> wrote:
>
> On Thu, Jun 06, 2024 at 09:55:56AM +0200, Ard Biesheuvel wrote:
> >
> > So again, how would that work for ahash falling back to shash. Are you
> > saying every existing shash implementation should be duplicated into
> > an ahash so that the multibuffer optimization can be added? shash is a
> > public interface so we cannot just remove the existing ones and we'll
> > end up carrying both forever.
>
> It should do the same thing for ahash algorithms that do not support
> multiple requests. IOW it should process the requests one by one.
>

That is not what I am asking.

Are you suggesting that, e.g., the arm64 sha2 shash implementation
that is modified by this series should instead expose both an shash as
before, and an ahash built around the same asm code that exposes the
multibuffer capability?

> > Sure, but the block I/O world is very different. Forcing it to use an
> > API modeled after how IPsec might use it seems, again, unreasonable.
>
> It's not different at all. You can see that by the proliferation
> of kmap calls in fs/verity. It's a fundamental issue. You can't
> consistently get a large contiguous allocation beyond one page due
> to fragmentation. So large data is always going to be scattered.
>

I don't think this is true for many uses of the block layer.

> BTW, I'm all for elminating the overhead when you already have a
> linear address for scattered memory, e.g., through vmalloc. We
> should definitely improve our interface for ahash/skcipher/aead so
> that vmalloc addresses (as well as kmalloc virtual addresses by
> extension) are supported as first class citizens, and we don't turn
> them into SG lists unless it's necessary for DMA.
>

Yes, this is something I've been pondering for a while. An
ahash/skcipher/aead with CRYPTO_ALG_ASYNC cleared (which would
guarantee that any provided VA would not be referenced after the algo
invocation returns) should be able to consume a request that carries
virtual addresses rather than SG lists. Given that it is up to the
caller to choose between sync and async, it would be in a good
position also to judge whether it wants to use stack/vmalloc
addresses.

I'll have a stab at this.

2024-06-10 16:51:30

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

Hi Herbert,

On Thu, Jun 06, 2024 at 05:15:16PM +0800, Herbert Xu wrote:
> On Thu, Jun 06, 2024 at 10:33:15AM +0200, Ard Biesheuvel wrote:
> >
> > Are you suggesting that, e.g., the arm64 sha2 shash implementation
> > that is modified by this series should instead expose both an shash as
> > before, and an ahash built around the same asm code that exposes the
> > multibuffer capability?
>
> Yes the multi-request handling should be implemented as ahash
> only.
>

That is not a reasonable way to do it. It would be much more complex, more
error-prone, and slower than my proposal. Also your proposal wouldn't actually
bring you much closer to being able to optimize your authenc use case.

First, each algorithm implementation that wants to support your scheme would
need to implement the ahash API alongside shash. This would be a lot of
duplicated code, and it would need to support all quirks of the ahash API
including scatterlists.

Second, allowing an ahash_request to actually be a list of requests creates an
ambiguity where it will often be unclear when code is supposed to operate on an
ahash_request individually vs. on the whole list. This is error-prone, as it
invites bugs where a crypto operation is done on only one request of the list.
This is a bad design, especially for a cryptographic API. We would have
crypto_ahash_*() do some checks to prevent multi-requests from reaching
algorithms that don't support the particular kind of request being made. But
there will be a lot of cases to consider.

Third, it would be hard to actually implement multibuffer hashing given an
arbitrary list of requests. For multibuffer hashing to work, the lengths of all
the buffers must be synced up, including the internal buffers in the hash states
as well as every buffer that is produced while walking the scatterlists. We can
place constraints on what is supported, but those constraints will need to be
clearly specified, and each algorithm will actually need to check for them and
do something sane (error or fallback) when they are not met. Note that it would
be possible for the messages to get out of sync in the middle of walking the
scatterlists, which would be difficult to handle. All this would add a lot of
complexity and overhead compared to my proposal, which naturally expresses the
same-length constraints in the API.

Fourth, having the API be ahash-only would also force fsverity to switch back to
the ahash API, which would add complexity and overhead. The shash API is easier
to use even when working with pages, as the diffstat of commit 8fcd94add6c5
clearly shows. The ahash API also has more overhead, including in doing the
needed memory allocation, setting up a scatterlist, initializing required but
irrelevant or redundant fields in the ahash_request, and shash_ahash_finup()
running the fully generalized scatterlist walker on the scatterlist just to
finally end up with a pointer which the caller could have provided directly.
All this overhead adds up, even when hashing 4K blocks. Back when CPU-based
crypto was slow this didn't really matter, but now it's fast and these small
overheads are significant when trying to keep up with storage device speeds
(which are also fast now). E.g., even disregarding the memory allocation,
hashing a 4K block is about 5% faster with shash than with ahash on x86_64.
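
To spell out where that overhead comes from, here is roughly what hashing one
data block looks like through each API (sketch only; state import and error
handling omitted, identifiers illustrative):

	/* ahash */
	struct ahash_request *req = ahash_request_alloc(ahash_tfm, GFP_NOIO);
	struct scatterlist sg;

	sg_init_one(&sg, data, block_size);
	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
	ahash_request_set_crypt(req, &sg, out, block_size);
	err = crypto_ahash_finup(req);
	ahash_request_free(req);

	/* shash */
	err = crypto_shash_finup(desc, data, block_size, out);

With shash-backed implementations, the ahash version ends up in
shash_ahash_finup() walking the scatterlist just to recover the pointer the
caller had in the first place.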

Of course, I'd need to support ahash in fsverity anyway if someone were to
actually need support for non-inline hardware offload in fsverity (and if I
decide to accept that despite many hardware drivers being buggy). But really
the best way to do this would be to support ahash alongside shash, like what I'm
proposing in dm-verity -- perhaps with ahash support limited to the data blocks
only, as that's the most performance critical part. The ahash code path would
use the async callback to actually properly support offload, which would mean
the code would be substantially different anyway due to the fundamentally
different computational model, especially vs. multibuffer hashing. A "single"
API, that works for both hardware offload and for the vast majority of users
that simply need very low overhead software crypto, would be nice in theory.
But I'm afraid it's been shown repeatedly that it just doesn't work... The
existence of lib/crypto/, shash, lskcipher, and scomp all reflect this.

I understand that you think the ahash based API would make it easier to add
multibuffer support to "authenc(hmac(sha256),cbc(aes))" for IPsec, which seems
to be a very important use case for you (though it isn't relevant to nearly as
many systems as dm-verity and fsverity are). Regardless, the reality is that it
would be much more difficult to take advantage of multibuffer crypto in the
IPsec authenc use case than in dm-verity and fsverity. authenc uses multiple
underlying algorithms, AES-CBC and HMAC-SHA256, that would both have to use
multibuffer crypto in order to see a significant benefit, seeing as even if the
SHA-256 support could be wired up through HMAC-SHA256, encryption would be
bottlenecked on AES-CBC, especially on Intel CPUs. It also looks like the IPsec
code would need a lot of updates to support multibuffer crypto.

At the same time, an easier way to optimize "authenc(hmac(sha256),cbc(aes))"
would be to add an implementation of it to arch/x86/crypto/ that interleaves the
AES-CBC and SHA-256 and also avoids the overhead associated with the template
based approach (such as all the extra indirect calls). Of course, that would
require writing assembly, but so would multibuffer crypto. It seems unlikely
that someone would step in to do all the work to implement a multibuffer
optimization for this algorithm and its use in IPsec, when no one has ever
bothered to optimize the single-buffer case in the first place, which has been
possible all along and would require no API or IPsec changes...

In any case, any potential multi-request support in ahash, skcipher, or aead
should be separate considerations from the simple function in shash that I'm
proposing. It makes sense to have the shash support regardless.

Ultimately, I need to have dm-verity and fsverity be properly optimized in the
downstreams that are most relevant to me. If you're not going to allow the
upstream crypto API to provide the needed functionality in a reasonable way,
then I'll need to shift my focus to getting this patchset into downstream
kernels such as Android and Chrome OS instead.

- Eric

2024-06-11 15:46:23

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

On Tue, 11 Jun 2024 at 17:21, Herbert Xu <[email protected]> wrote:
>
> On Mon, Jun 10, 2024 at 09:42:58AM -0700, Eric Biggers wrote:
> >
> > I understand that you think the ahash based API would make it easier to add
> > multibuffer support to "authenc(hmac(sha256),cbc(aes))" for IPsec, which seems
> > to be a very important use case for you (though it isn't relevant to nearly as
> > many systems as dm-verity and fsverity are). Regardless, the reality is that it
> > would be much more difficult to take advantage of multibuffer crypto in the
> > IPsec authenc use case than in dm-verity and fsverity. authenc uses multiple
> > underlying algorithms, AES-CBC and HMAC-SHA256, that would both have to use
> > multibuffer crypto in order to see a significant benefit, seeing as even if the
> > SHA-256 support could be wired up through HMAC-SHA256, encryption would be
> > bottlenecked on AES-CBC, especially on Intel CPUs. It also looks like the IPsec
> > code would need a lot of updates to support multibuffer crypto.
>
> The linked-request thing feeds nicely into networking. In fact
> that's where I got the idea of linking them from. In networking
> a large GSO (currently limited to 64K but theoretically we could
> make it unlimited) packet is automatically split up into a linked
> list of MTU-sized skb's.
>
> Therefore if we switched to a linked-list API networking could
> give us the buffers with minimal changes.
>
> BTW, I found an old Intel paper that claims through their multi-
> buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM.
> I wonder if we could still replicate this today:
>
> https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf
>

This looks like the whitepaper that describes the buggy multibuffer
code that we ripped out.

> > Ultimately, I need to have dm-verity and fsverity be properly optimized in the
> > downstreams that are most relevant to me. If you're not going to allow the
> > upstream crypto API to provide the needed functionality in a reasonable way,
> > then I'll need to shift my focus to getting this patchset into downstream
> > kernels such as Android and Chrome OS instead.
>
> I totally understand that this is your priority. But please give
> me some time to see if we can devise something that works for both
> scenarios.
>

The issue here is that the CPU based multibuffer approach has rather
tight constraints in terms of input length and the shared prefix, and
so designing a more generic API based on ahash doesn't help at all.
The intel multibuffer code went off into the weeds entirely attempting
to apply this parallel scheme to arbitrary combinations of inputs, so
this is something we know we should avoid.

2024-06-11 20:19:08

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote:
>
> BTW, I found an old Intel paper that claims through their multi-
> buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM.
> I wonder if we could still replicate this today:
>
> https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf

No, not even close. Even assuming that the lack of parallelizability in AES-CBC
and AES-XCBC can be entirely compensated for via multibuffer crypto (which
really it can't -- consider single packets, for example), doing AES twice is
much more expensive than doing AES and GHASH. GHASH is a universal hash
function, and computing a universal hash function is inherently cheaper than
computing a cryptographic hash function. But also modern Intel CPUs have very
fast carryless multiplication, and it uses a different execution port from what
AES uses. So the overhead of AES + GHASH over AES alone is very small. By
doing AES twice, you'd be entirely bottlenecked by the ports that can execute
the AES instructions, while the other ports go nearly unused. So it would
probably be approaching twice as slow as AES-GCM.

Westmere (2010) through Ivy Bridge (2012) are the only Intel CPUs where
multibuffer AES-CBC-XCBC could plausibly be faster than AES-GCM (given a
sufficiently large number of messages at once), due to the very slow pclmulqdq
instruction on those CPUs. This is long since fixed, as pclmulqdq became much
faster in Haswell (2013), and faster still in Broadwell. This is exactly what
that Intel paper shows; they show AES-GCM becoming fastest in "Gen 4", i.e.
Haswell. The paper is from 2012, so of course they don't show anything after
that. But AES-GCM has only pulled ahead even more since then.

In theory something like AES-CBC + SHA-256 could be slightly more competitive
than AES-CBC + AES-XCBC. But it would still be worse than simply doing AES-GCM
-- which again, doesn't need multibuffer, and my recent patches have already
fully optimized for recent x86_64 CPUs.

- Eric

2024-06-11 20:32:19

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

On Tue, Jun 11, 2024 at 11:39:08PM +0800, Herbert Xu wrote:
> On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote:
> >
> > Therefore if we switched to a linked-list API networking could
> > give us the buffers with minimal changes.
>
> BTW, this is not just about parallelising hashing. Just as one of
> the most significant benefits of GSO does not come from hardware
> offload, but rather the amortisation of (network) stack overhead.
> IOW you're traversing a very deep stack once instead of 40 times
> (this is the factor for 64K vs MTU, if we extend beyond 64K (which
> we absolute should do) the benefit would increase as well).
>
> The same should apply to the Crypto API. So even if this was a
> purely software solution with no assembly code at all, it may well
> improve GCM performance (at least for users able to feed us bulk
> data, like networking).
>

At best this would save an indirect call per message, if the underlying
algorithm explicitly added support for it and the user of the API migrated to
the multi-request model. This alone doesn't seem worth the effort of migrating
to multi-request, especially considering the many other already-possible
optimizations that would not require API changes or migrating users to
multi-request. The x86_64 AES-GCM is pretty well optimized now after my recent
patches, but there's still an indirect call associated with the use of the SIMD
helper which could be eliminated, saving one per message (already as much as we
could hope to get from multi-request). authenc on the other hand is almost
totally unoptimized, as I mentioned before; it makes little sense to talk about
any sort of multi-request optimization for it at this point.

- Eric