2020-02-12 13:21:42

by Horia Geanta

[permalink] [raw]
Subject: Re: [PATCH v7 5/9] crypto: caam - simplify RNG implementation

On 1/27/2020 6:57 PM, Andrey Smirnov wrote:
> Rework CAAM RNG implementation as follows:
>
> - Make use of the fact that HWRNG supports partial reads and will
> handle such cases gracefully by removing recursion in caam_read()
>
> - Convert blocking caam_read() codepath to do a single blocking job
> read directly into requested buffer, bypassing any intermediary
> buffers
>
> - Convert async caam_read() codepath into a simple single
> reader/single writer FIFO use-case, thus simplifying concurrency
> handling and delegating buffer read/write position management to KFIFO
> subsystem.
>
> - Leverage the same low level RNG data extraction code for both async
> and blocking caam_read() scenarios, get rid of the shared job
> descriptor and make non-shared one as a simple as possible (just
> HEADER + ALGORITHM OPERATION + FIFO STORE)
>
> - Split private context from DMA related memory, so that the former
> could be allocated without GFP_DMA.
>
> NOTE: On its face value this commit decreased throughput numbers
> reported by
>
> dd if=/dev/hwrng of=/dev/null bs=1 count=100K [iflag=nonblock]
>
> by about 15%, however commits that enable prediction resistance and
Running dd as mentioned above, on a i.MX8MM board I see:
~ 20% decrease in non-blocking case (525 kB/s vs. 662 kB/s)
~ 75% decrease in blocking case (170 kB/s vs. 657 kB/s)

bs=1 is a bit drastic.
Using bs=16 the numbers look better in terms of overall speed,
however the relative degradation is still there:
~ 66% decrease in blocking case (3.5 MB/s vs. 10.1 MB/s)

> limit JR total size impact the performance so much and move the
> bottleneck such as to make this regression irrelevant.
>
Yes, performance is greatly impacted by moving from a DRBG configuration
to a TRNG one.

The speed that I get with this patch set (1.3 kB/s)
is ~ 20% lower than theoretical output (1.583 kB/s) (see below).
Seeing this and also the relative decrease in case of DRBG
makes me wonder whether the SW overhead could be lowered.

Theoretical TRNG output speed in this configuration
can be computed as:
Speed = (SZ x CAAM_CLK_FREQ) / (RTSDCTL[ENT_DLY] x RTSDCTL[SAMP_SIZE]) [bps]

SZ is sample taken from the DRBG, b/w two consecutive reseedings.
As previously discussed, this is limited to 128 bits (16 bytes),
such that the DRBG behaves as a TRNG.

If:
-CAAM_CLK_FREQ = 166 MHz (as for i.MXM*)
-RTSDCTL[ENT_DLY] = 3200 clocks (default / POR value)
-RTSDCTL[SAMP_SIZE] = 512 (recommended; default / POR value is 2500)
then theoretical speed is 1.583 kB/s.

> @@ -45,38 +22,34 @@
> #include "jr.h"
> #include "error.h"
>
> +/* length of descriptors */
This comment is misplaced, length of descriptors (CAAM_RNG_DESC_LEN)
is further below.

> +#define CAAM_RNG_MAX_FIFO_STORE_SIZE U16_MAX
> +
> +#define CAAM_RNG_FIFO_LEN SZ_32K /* Must be a multiple of 2 */
> +
> /*
> - * Maximum buffer size: maximum number of random, cache-aligned bytes that
> - * will be generated and moved to seq out ptr (extlen not allowed)
> + * See caam_init_desc()
> */
> -#define RN_BUF_SIZE (0xffff / L1_CACHE_BYTES * \
> - L1_CACHE_BYTES)
> +#define CAAM_RNG_DESC_LEN (CAAM_CMD_SZ + \
> + CAAM_CMD_SZ + \
> + CAAM_CMD_SZ + CAAM_PTR_SZ_MAX)

> +typedef u8 caam_rng_desc[CAAM_RNG_DESC_LEN];
Is this really necessary?

> -static int caam_read(struct hwrng *rng, void *data, size_t max, bool wait)
> +static int caam_rng_read_one(struct device *jrdev,
> + void *dst, int len,
> + void *desc,
> + struct completion *done)
[...]
> + len = min_t(int, len, CAAM_RNG_MAX_FIFO_STORE_SIZE);
For the blocking case, i.e. caam_read() -> caam_rng_read_one(),
"len" is at least 32B - cf. include/linux/hw_random.h:
* @read: New API. drivers can fill up to max bytes of data
* into the buffer. The buffer is aligned for any type
* and max is a multiple of 4 and >= 32 bytes.

For reducing the SW overhead, it might be worth optimizing this path.
For example, considering
min_t(int, len, CAAM_RNG_MAX_FIFO_STORE_SIZE) = CAAM_RNG_MAX_FIFO_STORE_SIZE
this means length is fixed, thus also ctx->desc[DESC_SYNC] descriptor is fixed
and its generation could be moved out of the hot path.

Horia