2021-06-04 16:51:08

by Thara Gopinath

Subject: Qualcomm Crypto Engine performance numbers on mainline kernel


Hi All,

Below are the performance numbers from running "cryptsetup benchmark" on
CE algorithms in the mainline kernel. All numbers are in MiB/s. The
platform used is RB3 for SDM845 and MTPs for the rest of them.


                      SDM845     SM8150     SM8250     SM8350
AES-CBC (128)
  Encrypt / Decrypt   114/106    36/48      120/188    133/197

AES-XTS (256)
  Encrypt / Decrypt   100/102    49/48      186/187    n/a


--
Warm Regards
Thara (She/Her/Hers)


2021-06-05 15:33:47

by Ard Biesheuvel

Subject: Re: Qualcomm Crypto Engine performance numbers on mainline kernel

Hello Thara,

On Fri, 4 Jun 2021 at 18:49, Thara Gopinath wrote:
>
>
> Hi All,
>
> Below are the performance numbers from running "cryptsetup benchmark" on
> CE algorithms in the mainline kernel. All numbers are in MiB/s. The
> platform used is RB3 for SDM845 and MTPs for the rest of them.
>
>
>                       SDM845     SM8150     SM8250     SM8350
> AES-CBC (128)
>   Encrypt / Decrypt   114/106    36/48      120/188    133/197
>
> AES-XTS (256)
>   Encrypt / Decrypt   100/102    49/48      186/187    n/a
>

The CPU instruction based ones are apparently an order of magnitude
faster, and are synchronous so their latency should be lower.

So, as Eric already pointed out IIRC, there doesn't seem to be much
value in enabling this IP in Linux - it should not be the default
choice/highest priority, and it is not obvious to me whether/when you
would prefer this implementation over the CPU based one. Do you have
any idea how many queues it has, or how much data it can process in
parallel? Are there other features that stand out?
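
For context on "default choice/highest priority": the Linux kernel crypto API lets several implementations register under one generic algorithm name, each with a priority. A lookup by the generic name picks the highest-priority implementation, while a specific one can still be requested by its driver name. The following is only a toy Python model of that selection rule; the driver names and priority values are hypothetical and are not taken from any real driver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impl:
    name: str      # generic algorithm name, e.g. "cbc(aes)"
    driver: str    # implementation-specific name (hypothetical here)
    priority: int  # higher value wins on a lookup by generic name


# Hypothetical registrations: a CPU-instruction implementation and an
# offload engine that is supported but deliberately not the default.
REGISTRY = [
    Impl("cbc(aes)", "cbc-aes-cpu-insn", 300),
    Impl("cbc(aes)", "cbc-aes-offload-engine", 100),
]


def lookup(requested, registry=REGISTRY):
    """Return the implementation for a driver name or a generic name."""
    # An exact driver-name match selects that implementation directly.
    for impl in registry:
        if impl.driver == requested:
            return impl
    # Otherwise pick the highest-priority provider of the generic name.
    candidates = [impl for impl in registry if impl.name == requested]
    if not candidates:
        raise LookupError(requested)
    return max(candidates, key=lambda impl: impl.priority)


default_impl = lookup("cbc(aes)")
explicit_impl = lookup("cbc-aes-offload-engine")
```

In this model the engine is never chosen by default, yet a user who prefers it for a specific workload can still ask for it by driver name.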

2021-06-06 06:51:38

by Gilad Ben-Yossef

Subject: Re: Qualcomm Crypto Engine performance numbers on mainline kernel

Hi,

On Sat, Jun 5, 2021 at 6:33 PM Ard Biesheuvel wrote:
>
> Hello Thara,
>
> On Fri, 4 Jun 2021 at 18:49, Thara Gopinath wrote:
> >
> >
> > Hi All,
> >
> > Below are the performance numbers from running "cryptsetup benchmark" on
> > CE algorithms in the mainline kernel. All numbers are in MiB/s. The
> > platform used is RB3 for SDM845 and MTPs for the rest of them.
> >
> >
> >                       SDM845     SM8150     SM8250     SM8350
> > AES-CBC (128)
> >   Encrypt / Decrypt   114/106    36/48      120/188    133/197
> >
> > AES-XTS (256)
> >   Encrypt / Decrypt   100/102    49/48      186/187    n/a
> >
>
> The CPU instruction based ones are apparently an order of magnitude
> faster, and are synchronous so their latency should be lower.
>
> So, as Eric already pointed out IIRC, there doesn't seem to be much
> value in enabling this IP in Linux - it should not be the default
> choice/highest priority, and it is not obvious to me whether/when you
> would prefer this implementation over the CPU based one. Do you have
> any idea how many queues it has, or how much data it can process in
> parallel? Are there other features that stand out?

One thing to consider with a separate hardware block versus CPU
instruction based implementations is that the goal is often not the
best possible performance in the specific task at hand, but good
enough performance while freeing the CPU for other tasks, which
results in better overall system performance. Power can be a further
consideration: using the lower performance engine may give better
power consumption. Less often, lower jitter is more important than
peak performance. I've seen this with encrypted video decoding, for
example.

Sadly, whether any of these considerations is applicable is very much
system and work load specific.

So my 2c contribution would be to include support for this, even if
it is not made the default.

Gilad




--
Gilad Ben-Yossef
Chief Coffee Drinker

values of β will give rise to dom!

2021-06-06 10:42:49

by Christian Lamparter

Subject: Re: Qualcomm Crypto Engine performance numbers on mainline kernel

On 05/06/2021 17:32, Ard Biesheuvel wrote:
> Hello Thara,
>
> On Fri, 4 Jun 2021 at 18:49, Thara Gopinath wrote:
>>
>>
>> Hi All,
>>
>> Below are the performance numbers from running "cryptsetup benchmark" on
>> CE algorithms in the mainline kernel. All numbers are in MiB/s. The
>> platform used is RB3 for SDM845 and MTPs for the rest of them.
>>
>>
>>                       SDM845     SM8150     SM8250     SM8350
>> AES-CBC (128)
>>   Encrypt / Decrypt   114/106    36/48      120/188    133/197
>>
>> AES-XTS (256)
>>   Encrypt / Decrypt   100/102    49/48      186/187    n/a
>>
>
> The CPU instruction based ones are apparently an order of magnitude
> faster, and are synchronous so their latency should be lower.
>
> So, as Eric already pointed out IIRC, there doesn't seem to be much
> value in enabling this IP in Linux - it should not be the default
> choice/highest priority, and it is not obvious to me whether/when you
> would prefer this implementation over the CPU based one. Do you have
> any idea how many queues it has, or how much data it can process in
> parallel? Are there other features that stand out?

While I can't say much about qce-crypto, I do know that "cryptsetup
benchmark" isn't the best way to pit hardware accelerated crypto
against the CPU in some instances.

In my case (crypto4xx / CPU is a PowerPC 464 800MHz - Hardware is a
Western Digital My Book Live - NAS) the "benchmark" results look
exceptionally poor:
# Algorithm |  Key | Encryption | Decryption
    aes-cbc   128b    8.0 MiB/s    8.7 MiB/s
    aes-cbc   256b    8.7 MiB/s    8.7 MiB/s
    aes-xts   256b    5.3 MiB/s    7.9 MiB/s
    aes-xts   512b    7.9 MiB/s    7.9 MiB/s
(Hardware doesn't have cts/xts, but aes-cbc, aes-ctr and aes-gcm)

(for comparison, these are numbers that are produced by only the
800 MHz PowerPC CPU)
    aes-cbc   128b   15.8 MiB/s   16.3 MiB/s
    aes-cbc   256b   12.3 MiB/s   12.8 MiB/s
    aes-xts   256b   12.5 MiB/s   15.1 MiB/s
    aes-xts   512b   11.9 MiB/s   12.0 MiB/s


and OpenSSL in software (openssl speed -evp aes-128-cbc --elapsed -seconds 3)
manages similar numbers:

type           16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes 16384 bytes
aes-128-cbc   12646.42k   16806.66k   18349.31k   18762.07k   18896.21k   18879.83k
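
The openssl figures are in thousands of bytes per second, while cryptsetup reports MiB/s, so the two tables are not directly comparable as printed. A small sketch of the unit conversion, using only the numbers from the openssl line above (the variable and function names are made up for illustration):

```python
OPENSSL_UNIT = 1000   # "k" in openssl speed output: 1000s of bytes per second
MIB = 1024 * 1024     # cryptsetup benchmark reports MiB/s

# aes-128-cbc results from the openssl run, keyed by block size in bytes
openssl_aes_128_cbc_k = {
    16: 12646.42,
    64: 16806.66,
    256: 18349.31,
    1024: 18762.07,
    8192: 18896.21,
    16384: 18879.83,
}


def k_to_mib_per_s(value_k):
    """Convert an openssl 'k' figure (1000 bytes/s) to MiB/s."""
    return value_k * OPENSSL_UNIT / MIB


openssl_mib = {
    block: round(k_to_mib_per_s(value), 1)
    for block, value in openssl_aes_128_cbc_k.items()
}

for block, mib in openssl_mib.items():
    print(f"{block:>6} bytes: {mib} MiB/s")
```

Converted this way, the openssl results land at roughly 12 to 18 MiB/s, the same range as the 15.8 MiB/s that cryptsetup reports for CPU-only aes-cbc 128b.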

However, when I format a partition on the NAS HDD with
cryptsetup + crypto4xx and measure read throughput with hdparm -t / dd:

# hdparm -t /dev/mapper/aes-cbc-hw-test

/dev/mapper/aes-cbc-hw-test:
Timing buffered disk reads: 96 MB in 3.05 seconds = 31.46 MB/sec

# dd if=/dev/mapper/aes-cbc-hw-test of=/dev/null bs=8M status=progress
5318377472 bytes (5.3 GB, 5.0 GiB) copied, 143 s, 37.2 MB/s^C
639+0 records in
638+0 records out
5351931904 bytes (5.4 GB, 5.0 GiB) copied, 144.246 s, 37.1 MB/s

whereas without crypto4xx:

# hdparm -t /dev/mapper/aes-cbc-hw-test

/dev/mapper/aes-cbc-hw-test:
Timing buffered disk reads: 34 MB in 3.14 seconds = 10.82 MB/sec

# dd if=/dev/mapper/aes-cbc-hw-test of=/dev/null bs=8M status=progress
46+0 records in
45+0 records out
377487360 bytes (377 MB, 360 MiB) copied, 33.1952 s, 11.4 MB/s

With crypto4xx, the throughput is roughly three times what the CPU
alone could do.
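
As a quick check of that ratio, the speedup can be worked out from the MB/s figures in the hdparm and dd runs above. This is only the arithmetic, with the values copied from those transcripts; the dictionary layout and names are illustrative.

```python
# Throughput in MB/s, copied from the hdparm -t and dd outputs above
runs = {
    "hdparm -t": {"crypto4xx": 31.46, "cpu_only": 10.82},
    "dd bs=8M": {"crypto4xx": 37.1, "cpu_only": 11.4},
}


def speedup(hw_mb_s, sw_mb_s):
    """Ratio of hardware-assisted to CPU-only throughput."""
    return hw_mb_s / sw_mb_s


ratios = {
    tool: round(speedup(r["crypto4xx"], r["cpu_only"]), 2)
    for tool, r in runs.items()
}

for tool, ratio in ratios.items():
    print(f"{tool}: {ratio}x")
```

Both tools give a factor of about three, even though "cryptsetup benchmark" on the same machine showed the hardware path at roughly half the CPU-only speed.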

@Thara, do you have a USB 3.0 port and a fast USB 3.0 stick? If so,
try formatting a partition on it with cryptsetup and run the test there.

Cheers,
Christian
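
For anyone repeating the dd-style read test without dd, a minimal Python sketch of the same measurement follows: read the target sequentially in 8 MiB blocks and report MB/s (10^6 bytes per second, as dd does). The function name and defaults are assumptions for illustration. Like plain dd without direct I/O, it goes through the page cache, so a second run over the same data can look faster than the device really is.

```python
import time


def read_throughput(path, block_size=8 * 1024 * 1024):
    """Sequentially read `path` and return (bytes_read, seconds, MB_per_s)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 gives raw reads of up to block_size bytes, similar to dd bs=8M
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:  # empty bytes object signals end of file/device
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mb_per_s = total / elapsed / 1e6 if elapsed > 0 else float("inf")
    return total, elapsed, mb_per_s


# Example (needs read permission on the mapped device):
#   total, secs, rate = read_throughput("/dev/mapper/aes-cbc-hw-test")
#   print(f"{total} bytes copied, {secs:.3f} s, {rate:.1f} MB/s")
```

Running it once against the dm-crypt mapping with the hardware driver loaded and once without gives the same kind of comparison as the two dd runs above.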