2023-10-16 11:27:52

by Mikulas Patocka

[permalink] [raw]
Subject: [RFC PATCH] crypto: de-prioritize QAT

Hi

I created this kernel module that stress-tests the crypto API:
https://people.redhat.com/~mpatocka/benchmarks/qat/tools/module-multithreaded.c

It shows that QAT underperforms significantly compared to AES-NI (for
large requests it is 10 times slower; for small requests it is even worse)
- see the second table in this document:
https://people.redhat.com/~mpatocka/benchmarks/qat/kernel-module.txt

QAT has higher priority than AES-NI, so the kernel prefers it (it is not
used for dm-crypt because it has the flag "CRYPTO_ALG_ALLOCATES_MEMORY",
but it is preferred over AES-NI in other cases).

AES-NI has priority 300 or 400, unaccelerated AES has 100, so I suggest to
lower the priority of QAT from 4001 to 201.

Signed-off-by: Mikulas Patocka <[email protected]>

---
drivers/crypto/intel/qat/qat_common/qat_algs.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6/drivers/crypto/intel/qat/qat_common/qat_algs.c
===================================================================
--- linux-2.6.orig/drivers/crypto/intel/qat/qat_common/qat_algs.c
+++ linux-2.6/drivers/crypto/intel/qat/qat_common/qat_algs.c
@@ -1277,7 +1277,7 @@ static struct aead_alg qat_aeads[] = { {
.base = {
.cra_name = "authenc(hmac(sha1),cbc(aes))",
.cra_driver_name = "qat_aes_cbc_hmac_sha1",
- .cra_priority = 4001,
+ .cra_priority = 201,
.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
@@ -1294,7 +1294,7 @@ static struct aead_alg qat_aeads[] = { {
.base = {
.cra_name = "authenc(hmac(sha256),cbc(aes))",
.cra_driver_name = "qat_aes_cbc_hmac_sha256",
- .cra_priority = 4001,
+ .cra_priority = 201,
.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
@@ -1311,7 +1311,7 @@ static struct aead_alg qat_aeads[] = { {
.base = {
.cra_name = "authenc(hmac(sha512),cbc(aes))",
.cra_driver_name = "qat_aes_cbc_hmac_sha512",
- .cra_priority = 4001,
+ .cra_priority = 201,
.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
@@ -1329,7 +1329,7 @@ static struct aead_alg qat_aeads[] = { {
static struct skcipher_alg qat_skciphers[] = { {
.base.cra_name = "cbc(aes)",
.base.cra_driver_name = "qat_aes_cbc",
- .base.cra_priority = 4001,
+ .base.cra_priority = 201,
.base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
.base.cra_blocksize = AES_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct qat_alg_skcipher_ctx),
@@ -1347,7 +1347,7 @@ static struct skcipher_alg qat_skciphers
}, {
.base.cra_name = "ctr(aes)",
.base.cra_driver_name = "qat_aes_ctr",
- .base.cra_priority = 4001,
+ .base.cra_priority = 201,
.base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct qat_alg_skcipher_ctx),
@@ -1365,7 +1365,7 @@ static struct skcipher_alg qat_skciphers
}, {
.base.cra_name = "xts(aes)",
.base.cra_driver_name = "qat_aes_xts",
- .base.cra_priority = 4001,
+ .base.cra_priority = 201,
.base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK |
CRYPTO_ALG_ALLOCATES_MEMORY,
.base.cra_blocksize = AES_BLOCK_SIZE,


2023-10-20 12:59:01

by Cabiddu, Giovanni

[permalink] [raw]
Subject: Re: [RFC PATCH] crypto: de-prioritize QAT

On Mon, Oct 16, 2023 at 01:26:47PM +0200, Mikulas Patocka wrote:
> Hi
>
> I created this kernel module that stress-tests the crypto API:
> https://people.redhat.com/~mpatocka/benchmarks/qat/tools/module-multithreaded.c
>
> It shows that QAT underperforms significantly compared to AES-NI (for
> large requests it is 10 times slower; for small requests it is even worse)
> - see the second table in this document:
> https://people.redhat.com/~mpatocka/benchmarks/qat/kernel-module.txt
>
> QAT has higher priority than AES-NI, so the kernel prefers it (it is not
> used for dm-crypt because it has the flag "CRYPTO_ALG_ALLOCATES_MEMORY",
> but it is preferred over AES-NI in other cases).
Probably you can get better performance by modifying your configuration
and test.
From your test application I can infer that you are using a single QAT
device. The driver allocates a ring pair per TFM and it loads balances
allocations between devices.
In addition, jobs are submitted synchronously. This way the cost of
offload is not amortised between requests.

Regards,

--
Giovanni

2023-10-20 20:02:38

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [RFC PATCH] crypto: de-prioritize QAT



On Fri, 20 Oct 2023, Giovanni Cabiddu wrote:

> On Mon, Oct 16, 2023 at 01:26:47PM +0200, Mikulas Patocka wrote:
> > Hi
> >
> > I created this kernel module that stress-tests the crypto API:
> > https://people.redhat.com/~mpatocka/benchmarks/qat/tools/module-multithreaded.c
> >
> > It shows that QAT underperforms significantly compared to AES-NI (for
> > large requests it is 10 times slower; for small requests it is even worse)
> > - see the second table in this document:
> > https://people.redhat.com/~mpatocka/benchmarks/qat/kernel-module.txt
> >
> > QAT has higher priority than AES-NI, so the kernel prefers it (it is not
> > used for dm-crypt because it has the flag "CRYPTO_ALG_ALLOCATES_MEMORY",
> > but it is preferred over AES-NI in other cases).
> Probably you can get better performance by modifying your configuration
> and test.
> >From your test application I can infer that you are using a single QAT
> device. The driver allocates a ring pair per TFM and it loads balances
> allocations between devices.
> In addition, jobs are submitted synchronously. This way the cost of
> offload is not amortised between requests.
>
> Regards,
>
> --
> Giovanni

Yes, I thought about using more TFMs, but I don't have access to the dual
Xeon machine anymore (I would have to request access and wait until people
who are using it release it).

We can run more tests, but it would be best to batch all the tests in a
small timeframe, to minimize blocking the machine for other people.

Regarding synchronous submission - the current test submits 112 requests
concurrently. Do you think that it is too small and we should submit more?
What would be the appropriate number of requests to submit concurrently?

I did synchronous submission because it was easier to write it than
asynchronous submission, but if you think that 112 requests is too small
and we need to submit more, I can try to do it.

Mikulas