2023-08-22 05:28:42

by Eric Biggers

Subject: Re: [EXTERNAL] Re: [PATCH v2 0/6] Add support for Texas Instruments MCRC64 engine

On Fri, Aug 18, 2023 at 02:36:34PM +0530, Kamlesh Gurudasani wrote:
> Hi Eric,
>
> We are more interested in offload than performance, with splice system
> call and DMA mode in driver(will be implemented after this series gets
> merged), good amount of cpu cycles will be saved.

So it's for power usage, then? Or freeing up CPU for other tasks?

> There is one more mode(auto mode) in mcrc64 which helps to verify crc64
> values against pre calculated crc64, saving the efforts of comparing in
> userspace.

Is there any path forward to actually support this?

>
> Current generic implementation of crc64-iso(part of this series)
> gives 173 Mb/s of speed as opposed to mcrc64 which gives speed of 812
> Mb/s when tested with tcrypt.

This doesn't answer my question, which to reiterate was:

How does performance compare to a properly optimized software CRC
implementation on your platform, i.e. an implementation using carryless
multiplication instructions (e.g. ARMv8 CE) if available on your platform,
otherwise an implementation using the slice-by-8 or slice-by-16 method?

The implementation you tested was slice-by-1. Compared to that, it's common for
slice-by-8 to speed up CRCs by about 4 times and for folding with carryless
multiplication to speed up CRCs by 10-30 times, sometimes limited only by memory
bandwidth. I don't know what specific results you would get on your specific
CPU and for this specific CRC, and you could certainly see something different
if you e.g. have some low-end embedded CPU. But those are the typical results
I've seen for other CRCs on different CPUs. So, a software implementation may
be more attractive than you realize. It could very well be the case that a
PMULL based CRC implementation actually ends up with less CPU load than your
"hardware offload", when taking into account syscall, algif_hash, and driver overhead...

- Eric
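
To make the slice-by-1 versus slice-by-8 distinction in the message above concrete, here is a minimal C sketch of both table-driven methods. It is only an illustration, not code from the patch series: it assumes a bit-reflected CRC64 with the ISO 3309 polynomial (x^64 + x^4 + x^3 + x + 1), leaves the initial value to the caller, and applies no final XOR, and all function and table names are hypothetical. The exact parameters of the crc64-iso variant in the series may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: reflected form of the ISO 3309 polynomial (0x1B bit-reversed). */
#define CRC64_POLY_REFLECTED 0xD800000000000000ULL

static uint64_t crc64_table[8][256];

/* Table 0 is computed bit by bit; table s holds the CRC of a byte
 * followed by s zero bytes, derived from table s-1. */
static void crc64_build_tables(void)
{
    for (unsigned i = 0; i < 256; i++) {
        uint64_t crc = i;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ CRC64_POLY_REFLECTED : crc >> 1;
        crc64_table[0][i] = crc;
    }
    for (unsigned i = 0; i < 256; i++) {
        for (int s = 1; s < 8; s++) {
            uint64_t prev = crc64_table[s - 1][i];
            crc64_table[s][i] = (prev >> 8) ^ crc64_table[0][prev & 0xff];
        }
    }
}

/* Slice-by-1: one dependent table lookup per input byte. */
static uint64_t crc64_slice1(uint64_t crc, const uint8_t *p, size_t len)
{
    while (len--)
        crc = (crc >> 8) ^ crc64_table[0][(crc ^ *p++) & 0xff];
    return crc;
}

/* Slice-by-8: eight independent lookups per 8-byte word, which the CPU
 * can overlap instead of waiting on one lookup per byte. */
static uint64_t crc64_slice8(uint64_t crc, const uint8_t *p, size_t len)
{
    while (len >= 8) {
        uint64_t w = 0;
        for (int i = 0; i < 8; i++)          /* endian-neutral LE load */
            w |= (uint64_t)p[i] << (8 * i);
        w ^= crc;
        crc = crc64_table[7][w & 0xff] ^
              crc64_table[6][(w >> 8) & 0xff] ^
              crc64_table[5][(w >> 16) & 0xff] ^
              crc64_table[4][(w >> 24) & 0xff] ^
              crc64_table[3][(w >> 32) & 0xff] ^
              crc64_table[2][(w >> 40) & 0xff] ^
              crc64_table[1][(w >> 48) & 0xff] ^
              crc64_table[0][w >> 56];
        p += 8;
        len -= 8;
    }
    return crc64_slice1(crc, p, len);        /* leftover tail bytes */
}

/* Both variants must produce the same CRC for the same input. */
static void crc64_selfcheck(void)
{
    static const uint8_t msg[] = "slice-by-1 and slice-by-8 must agree";
    crc64_build_tables();
    assert(crc64_slice1(0, msg, sizeof(msg) - 1) ==
           crc64_slice8(0, msg, sizeof(msg) - 1));
}
```

Call crc64_build_tables() once before using either function. The two variants compute the same value; only the number of dependent lookups per byte changes, which is where the typical speedup over slice-by-1 comes from.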


2023-08-30 20:38:54

by Kamlesh Gurudasani

Subject: Re: [EXTERNAL] Re: [EXTERNAL] Re: [PATCH v2 0/6] Add support for Texas Instruments MCRC64 engine

Eric Biggers writes:

> On Fri, Aug 18, 2023 at 02:36:34PM +0530, Kamlesh Gurudasani wrote:
>> Hi Eric,
>>
>> We are more interested in offload than performance, with splice system
>> call and DMA mode in driver(will be implemented after this series gets
>> merged), good amount of cpu cycles will be saved.
>
> So it's for power usage, then? Or freeing up CPU for other tasks?
>
It's for freeing up CPU for other tasks.
>> There is one more mode(auto mode) in mcrc64 which helps to verify crc64
>> values against pre calculated crc64, saving the efforts of comparing in
>> userspace.
>
> Is there any path forward to actually support this?
>
>>
>> Current generic implementation of crc64-iso(part of this series)
>> gives 173 Mb/s of speed as opposed to mcrc64 which gives speed of 812
>> Mb/s when tested with tcrypt.
>
> This doesn't answer my question, which to reiterate was:
>
> How does performance compare to a properly optimized software CRC
> implementation on your platform, i.e. an implementation using carryless
> multiplication instructions (e.g. ARMv8 CE) if available on your platform,
> otherwise an implementation using the slice-by-8 or slice-by-16 method?
>
> The implementation you tested was slice-by-1. Compared to that, it's common for
> slice-by-8 to speed up CRCs by about 4 times and for folding with carryless
> multiplication to speed up CRCs by 10-30 times, sometimes limited only by memory
> bandwidth. I don't know what specific results you would get on your specific
> CPU and for this specific CRC, and you could certainly see something different
> if you e.g. have some low-end embedded CPU. But those are the typical results
> I've seen for other CRCs on different CPUs. So, a software implementation may
> be more attractive than you realize. It could very well be the case that a
> PMULL based CRC implementation actually ends up with less CPU load than your
> "hardware offload", when taking into account syscall, algif_hash, and driver overhead...
>
> - Eric

Hi Eric, thanks for your detailed and valuable inputs.

As per your suggestion, we did some profiling.

The use case is to calculate crc32/crc64 over file input from user space.

Instead of directly implementing a PMULL-based CRC64, we first compared the following three cases:
Case 1.
CRC32 (splice() + kernel space SW driver)
https://gist.github.com/ti-kamlesh/5be75dbde292e122135ddf795fad9f21

Case 2.
CRC32 (mmap() + userspace ARMv8 crc32 instruction implementation)
(we also tried read() to get the file contents, but it was slower than mmap(), so those numbers are not listed here)
https://gist.github.com/ti-kamlesh/002df094dd522422c6cb62069e15c40d

Case 3.
CRC64 (splice() + MCRC64 HW)
https://gist.github.com/ti-kamlesh/98b1fc36c9a7c3defcc2dced4136b8a0


Overall, the overhead of userspace + af_alg + driver in Case 1 and Case 3 is ~0.025s, which is constant for any file size.
It is calculated as: real time to calculate the CRC - driver time (time spent inside init() + update() + final()) = overhead (~0.025s)
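
As a worked example of the overhead formula above, the following C sketch applies it to the 120mb figures reported in the table in this message (Case 1 and Case 3). The struct and function names are hypothetical and exist only for illustration; the timings are copied from the table, not measured again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One measurement: wall-clock time and time spent inside the driver's
 * init() + update() + final(), both in seconds. */
struct crc_timing {
    const char *label;
    double real_s;
    double driver_s;
};

/* overhead = real time - driver time (userspace + af_alg + driver glue). */
static double af_alg_overhead(const struct crc_timing *t)
{
    assert(t->real_s >= t->driver_s);
    return t->real_s - t->driver_s;
}

/* Print the overhead for the two 120mb runs reported in the table. */
static void print_overheads(void)
{
    static const struct crc_timing runs[] = {
        { "CRC32 (Case 1), 120mb", 0.18, 0.155 },
        { "CRC64 (Case 3), 120mb", 0.41, 0.385 },
    };

    for (size_t i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
        printf("%s: overhead %.3fs\n", runs[i].label,
               af_alg_overhead(&runs[i]));
}
```

Both rows come out at about 0.025s, which is the constant overhead quoted above.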

Here, if we assume a PMULL-based crc64 implementation would give numbers similar to crc32 (Case 2), we save a good number of CPU cycles using mcrc64
for files bigger than 5-10 MB, as most of the time is spent in the HW offload.

╔═══════════════════╦═════════════════════════════╦═══════════════════════╦════════════════════════╦════════════════════════╗
║  File size        ║  120 MB (ideal size for us) ║  20 MB                ║  15 MB                 ║  5 MB                  ║
╠═══════════════════╬═════════════════════════════╬═══════════════════════╬════════════════════════╬════════════════════════╣
║  CRC32 (Case 1)   ║  Driver time 0.155s         ║  Driver time 0.0325s  ║  Driver time 0.019s    ║  Driver time 0.0062s   ║
║                   ║  Real time 0.18s            ║  Real time 0.06s      ║  Real time 0.04s       ║  Real time 0.03s       ║
║                   ║  Overhead 0.025s            ║  Overhead 0.025s      ║  Overhead 0.021s       ║  Overhead ~0.023s      ║
╠═══════════════════╬═════════════════════════════╬═══════════════════════╬════════════════════════╬════════════════════════╣
║  CRC32 (Case 2)   ║  Real time 0.30s            ║  Real time 0.05s      ║  Real time 0.04s       ║  Real time 0.02s       ║
╠═══════════════════╬═════════════════════════════╬═══════════════════════╬════════════════════════╬════════════════════════╣
║  CRC64 (Case 3)   ║  Driver time 0.385s         ║  Driver time 0.0665s  ║  Driver time 0.0515s   ║  Driver time 0.019s    ║
║                   ║  Real time 0.41s            ║  Real time 0.09s      ║  Real time 0.08s       ║  Real time 0.04s       ║
║                   ║  Overhead 0.025s            ║  Overhead 0.025s      ║  Overhead ~0.025s      ║  Overhead ~0.021s      ║
╚═══════════════════╩═════════════════════════════╩═══════════════════════╩════════════════════════╩════════════════════════╝

2023-08-30 22:10:27

by Kamlesh Gurudasani

Subject: Re: [EXTERNAL] Re: [EXTERNAL] Re: [PATCH v2 0/6] Add support for Texas Instruments MCRC64 engine

Eric Biggers writes:

> On Fri, Aug 18, 2023 at 02:36:34PM +0530, Kamlesh Gurudasani wrote:
>> Hi Eric,
>>
>> We are more interested in offload than performance, with splice system
>> call and DMA mode in driver(will be implemented after this series gets
>> merged), good amount of cpu cycles will be saved.
>
> So it's for power usage, then? Or freeing up CPU for other tasks?
>

It's for freeing up CPU for other tasks

>> There is one more mode(auto mode) in mcrc64 which helps to verify crc64
>> values against pre calculated crc64, saving the efforts of comparing in
>> userspace.
>
> Is there any path forward to actually support this?
>
>>
>> Current generic implementation of crc64-iso(part of this series)
>> gives 173 Mb/s of speed as opposed to mcrc64 which gives speed of 812
>> Mb/s when tested with tcrypt.
>
> This doesn't answer my question, which to reiterate was:
>
> How does performance compare to a properly optimized software CRC
> implementation on your platform, i.e. an implementation using carryless
> multiplication instructions (e.g. ARMv8 CE) if available on your platform,
> otherwise an implementation using the slice-by-8 or slice-by-16 method?
>
> The implementation you tested was slice-by-1. Compared to that, it's common for
> slice-by-8 to speed up CRCs by about 4 times and for folding with carryless
> multiplication to speed up CRCs by 10-30 times, sometimes limited only by memory
> bandwidth. I don't know what specific results you would get on your specific
> CPU and for this specific CRC, and you could certainly see something different
> if you e.g. have some low-end embedded CPU. But those are the typical results
> I've seen for other CRCs on different CPUs. So, a software implementation may
> be more attractive than you realize. It could very well be the case that a
> PMULL based CRC implementation actually ends up with less CPU load than your
> "hardware offload", when taking into account syscall, algif_hash, and driver overhead...
>
> - Eric

Hi Eric, thanks for your detailed and valuable inputs.

As per your suggestion, we did some profiling.

The use case is to calculate crc32/crc64 over file input from user space.

Instead of directly implementing a PMULL-based CRC64, we first compared the following three cases:
Case 1.
CRC32 (splice() + kernel space SW driver)
https://gist.github.com/ti-kamlesh/5be75dbde292e122135ddf795fad9f21

Case 2.
CRC32 (mmap() + userspace ARMv8 crc32 instruction implementation)
(we also tried read() to get the file contents, but it was slower than mmap(), so those numbers are not listed here)
https://gist.github.com/ti-kamlesh/002df094dd522422c6cb62069e15c40d

Case 3.
CRC64 (splice() + MCRC64 HW)
https://gist.github.com/ti-kamlesh/98b1fc36c9a7c3defcc2dced4136b8a0


Overall, the overhead of userspace + af_alg + driver in Case 1 and Case 3 is ~0.025s, which is constant for any file size.
It is calculated as: real time to calculate the CRC - driver time (time spent inside init() + update() + final()) = overhead (~0.025s)

Here, if we assume a PMULL-based crc64 implementation would give numbers similar to crc32 (Case 2), we save a good number of CPU cycles using mcrc64
for files bigger than 5-10 MB, as most of the time is spent in the HW offload.

+-------------------+-----------------------------+-----------------------+------------------------+------------------------+
|  File size        |  120 MB (ideal size for us) |  20 MB                |  15 MB                 |  5 MB                  |
+===================+=============================+=======================+========================+========================+
|  CRC32 (Case 1)   |  Driver time 0.155s         |  Driver time 0.0325s  |  Driver time 0.019s    |  Driver time 0.0062s   |
|                   |  Real time 0.18s            |  Real time 0.06s      |  Real time 0.04s       |  Real time 0.03s       |
|                   |  Overhead 0.025s            |  Overhead 0.025s      |  Overhead 0.021s       |  Overhead ~0.023s      |
+-------------------+-----------------------------+-----------------------+------------------------+------------------------+
|  CRC32 (Case 2)   |  Real time 0.30s            |  Real time 0.05s      |  Real time 0.04s       |  Real time 0.02s       |
+-------------------+-----------------------------+-----------------------+------------------------+------------------------+
|  CRC64 (Case 3)   |  Driver time 0.385s         |  Driver time 0.0665s  |  Driver time 0.0515s   |  Driver time 0.019s    |
|                   |  Real time 0.41s            |  Real time 0.09s      |  Real time 0.08s       |  Real time 0.04s       |
|                   |  Overhead 0.025s            |  Overhead 0.025s      |  Overhead ~0.025s      |  Overhead ~0.021s      |
+-------------------+-----------------------------+-----------------------+------------------------+------------------------+
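
The claim that the offload starts to pay off for files above 5-10mb can be checked against the table with a short C sketch. It rests on two assumptions that are not measured here: in Case 2 the CPU is busy for the whole real time, and in Case 3 the CPU is busy only for the reported overhead, with the driver time spent in the hardware (the premise stated in this message, which presumably depends on the planned DMA mode). The struct and function names are hypothetical, and the figures are copied from the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One file size from the table: Case 2 real time (software CRC32) and
 * Case 3 overhead (the CPU-side cost of the mcrc64 offload), in seconds. */
struct size_point {
    int size_mb;
    double sw_real_s;
    double hw_overhead_s;
};

static const struct size_point points[] = {
    { 120, 0.30, 0.025 },
    {  20, 0.05, 0.025 },
    {  15, 0.04, 0.025 },
    {   5, 0.02, 0.021 },
};

/* Estimated CPU time saved by offloading, under the assumptions above. */
static double estimated_cpu_saving(const struct size_point *p)
{
    return p->sw_real_s - p->hw_overhead_s;
}

/* Smallest measured file size for which the estimate is positive,
 * or -1 if offloading never wins in the given data. */
static int smallest_size_with_saving(const struct size_point *pts, size_t n)
{
    int best = -1;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        if (estimated_cpu_saving(&pts[i]) > 0.0 &&
            (best < 0 || pts[i].size_mb < best))
            best = pts[i].size_mb;
    }
    return best;
}

/* Print the estimate for every file size in the table. */
static void print_savings(void)
{
    for (size_t i = 0; i < sizeof(points) / sizeof(points[0]); i++)
        printf("%3dmb: estimated CPU time saved %+.3fs\n",
               points[i].size_mb, estimated_cpu_saving(&points[i]));
}
```

With these figures the estimate is slightly negative at 5mb and positive from 15mb upward, which is consistent with the 5-10mb break-even range mentioned above. It does not answer how a PMULL-based CRC64 would actually perform, since Case 2 uses the ARMv8 crc32 instruction as a stand-in.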