Subject: [RFC 0/2] AES ablkcipher driver for SPUs

Hello Herbert,

This driver adds support for AES on the SPU. The patch is for review only
because some parts of the code are not upstream yet.
Patch one contains the main driver (which uses ablkcipher_ctx_cast()),
patch two is for clarity (the parts of the missing API that are used).

Currently only ECB block mode is supported. I plan to support CBC but the
way the IV is currently handled is unfavorable (more on that later).

aes_spu_wrap.c and kspu_helper.c run in the kernel, spu_main.c will run on
an SPU (my hardware for computing :)). The SPU can access kernel memory
(even virtual) via asynchronous DMA transfers.
All requests from the crypto user end up in a linked list which is managed
by the kspu module (non-crypto requests will end up there as well, but
currently the AES driver is the only user). The AES callback function
(aes_queue_work_items()) is called to queue the request in a ring buffer
which is located on the hardware. Once some requests are enqueued, the SPU
is started.
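
The queueing step can be pictured with a minimal fixed-size ring buffer in C.
This is only a sketch and is not taken from the patch: struct work_item,
struct ring, ring_put(), ring_get() and RING_SLOTS are hypothetical names, and
the memory barriers a real kernel/SPU producer-consumer pair would need are
left out.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u  /* must be a power of two; the size is an assumption */

/* Hypothetical work item; the driver's real layout is not shown in the mail. */
struct work_item {
    uint64_t src;  /* address of the input data */
    uint64_t dst;  /* address for the output data */
    uint32_t len;  /* length in bytes */
};

struct ring {
    struct work_item slot[RING_SLOTS];
    unsigned int head;  /* next slot the producer (kernel side) fills */
    unsigned int tail;  /* next slot the consumer (SPU side) reads */
};

/* Number of items currently queued. */
static unsigned int ring_used(const struct ring *r)
{
    return r->head - r->tail;
}

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int ring_put(struct ring *r, const struct work_item *w)
{
    if (ring_used(r) == RING_SLOTS)
        return -1;
    r->slot[r->head & (RING_SLOTS - 1)] = *w;
    r->head++;
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
static int ring_get(struct ring *r, struct work_item *w)
{
    if (ring_used(r) == 0)
        return -1;
    *w = r->slot[r->tail & (RING_SLOTS - 1)];
    r->tail++;
    return 0;
}
```

In this picture the callback would call ring_put() until the ring is full or
the list is empty, and then start the consumer.
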
The SPU requests the first couple of blocks via DMA (init_get_data()).
This request may not be satisfied immediately; the command does not
block. Once all requests (DMA_BUFFERS of them) have been issued, the SPU
waits for the first buffer to complete and starts processing (via
spu_funcs()).
Ideally there are always transfers running in the background (copying new
data from main storage to the SPU and processed data from the SPU back to
main storage) while the SPU is processing a block of data.
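
The overlap of transfer and computation can be sketched as a small
multi-buffering loop in C. Only the name DMA_BUFFERS comes from this mail; its
value, BUF_SIZE, dma_get(), dma_put(), process() and pipeline() are
hypothetical. The "DMA" here is a plain memcpy() that completes immediately,
so the sketch shows the buffer rotation, not real asynchronous behaviour.

```c
#include <stddef.h>
#include <string.h>

#define DMA_BUFFERS 2   /* buffers kept in flight; the value is an assumption */
#define BUF_SIZE    16  /* bytes per buffer; an assumption for the sketch */

/* Stand-in for an asynchronous DMA "get": it simply copies right away. */
static void dma_get(unsigned char *local, const unsigned char *main_mem,
                    size_t len)
{
    memcpy(local, main_mem, len);
}

/* Stand-in for a DMA "put" back to main storage. */
static void dma_put(unsigned char *main_mem, const unsigned char *local,
                    size_t len)
{
    memcpy(main_mem, local, len);
}

/* Placeholder for the real work (AES in the driver): invert every byte. */
static void process(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)~buf[i];
}

/* Process nbufs * BUF_SIZE bytes from src into dst. */
static void pipeline(unsigned char *dst, const unsigned char *src,
                     size_t nbufs)
{
    unsigned char local[DMA_BUFFERS][BUF_SIZE];
    size_t fetched = 0;

    /* Prime the pipeline: issue up to DMA_BUFFERS fetches up front. */
    for (; fetched < DMA_BUFFERS && fetched < nbufs; fetched++)
        dma_get(local[fetched % DMA_BUFFERS], src + fetched * BUF_SIZE,
                BUF_SIZE);

    for (size_t done = 0; done < nbufs; done++) {
        unsigned char *cur = local[done % DMA_BUFFERS];

        /* Real hardware would wait here for cur's transfer to complete. */
        process(cur, BUF_SIZE);
        dma_put(dst + done * BUF_SIZE, cur, BUF_SIZE);

        /* Reuse the slot that was just drained for the next fetch. */
        if (fetched < nbufs) {
            dma_get(cur, src + fetched * BUF_SIZE, BUF_SIZE);
            fetched++;
        }
    }
}
```

With truly asynchronous transfers, the put from a slot would have to finish
before the next get overwrites that slot.
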
This is where my problems with the IV start. Currently I have to
request the IV from main storage, wait for it, then I can use it, and once
I have processed the block, I must write it back.
What about a different handling of the IV with two functions like
ablk_set_iv()
ablk_get_iv()
With something like this, I could store the IV in the SPU (in my key
struct for instance) and would not have to transfer it on every request
(similar to what I do now with the key). I don't know if there are any
crypto users that have multiple IVs per key, but in such a case I could
cache IVs like I cache keys now. Any comments on that?
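
A rough sketch of what such a pair of functions might look like, in C. Only
the names ablk_set_iv() and ablk_get_iv() come from this mail; the signatures,
struct iv_cache, IV_SLOTS and the slot-based lookup are assumptions made for
illustration and are not part of any existing crypto API.

```c
#include <stdint.h>
#include <string.h>

#define IV_SIZE  16  /* AES block size in bytes */
#define IV_SLOTS  4  /* number of cached IVs; an assumption */

/* Hypothetical device-side cache, analogous to the key cache. */
struct iv_cache {
    uint8_t iv[IV_SLOTS][IV_SIZE];
    uint8_t valid[IV_SLOTS];
};

/* Hypothetical signature: store an IV in a slot. Returns 0 or -1. */
static int ablk_set_iv(struct iv_cache *c, unsigned int slot,
                       const uint8_t *iv)
{
    if (slot >= IV_SLOTS)
        return -1;
    memcpy(c->iv[slot], iv, IV_SIZE);
    c->valid[slot] = 1;
    return 0;
}

/* Hypothetical signature: read the current IV of a slot. Returns 0 or -1. */
static int ablk_get_iv(const struct iv_cache *c, unsigned int slot,
                       uint8_t *iv)
{
    if (slot >= IV_SLOTS || !c->valid[slot])
        return -1;
    memcpy(iv, c->iv[slot], IV_SIZE);
    return 0;
}
```

A request would then carry only a slot number instead of the IV itself, and
the user would call ablk_get_iv() only when the updated IV is actually needed.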

Sebastian
--


2007-06-27 10:24:56

by Evgeniy Polyakov

Subject: Re: [RFC 0/2] AES ablkcipher driver for SPUs

On Wed, Jun 27, 2007 at 12:59:52AM +0200, Sebastian Siewior ([email protected]) wrote:
> This driver adds support for AES on the SPU. The patch is for review only
> because some parts of the code are not upstream yet.
> Patch one contains the main driver (which uses ablkcipher_ctx_cast()),
> patch two is for clarity (the parts of the missing API that are used).
>
> Currently only ECB block mode is supported. I plan to support CBC but the
> way the IV is currently handled is unfavorable (more on that later).

Interesting. Do you have any benchmark of the SPU handling AES crypto?

--
Evgeniy Polyakov

Subject: Re: [RFC 0/2] AES ablkcipher driver for SPUs

* Evgeniy Polyakov | 2007-06-27 14:24:20 [+0400]:

>On Wed, Jun 27, 2007 at 12:59:52AM +0200, Sebastian Siewior ([email protected]) wrote:
>> This driver adds support for AES on the SPU. The patch is for review only
>> because some parts of the code are not upstream yet.
>> Patch one contains the main driver (which uses ablkcipher_ctx_cast()),
>> patch two is for clarity (the parts of the missing API that are used).
>>
>> Currently only ECB block mode is supported. I plan to support CBC but the
>> way the IV is currently handled is unfavorable (more on that later).
>
>Interesting. Do you have any benchmark of the SPU handling AES crypto?
Yes I do. Those numbers were gathered on a PS3 with a sync
interface. Sync means the SPU is idle, I queue the request, start the
SPU, the SPU requests the data, waits for completion, computes it,
transfers it back and finally the SPU stops (idle again). Oh, and only
one SPU is used.
The test is driven by a simple module that allocates four pages (16
KiB) and calls the SPU crypto code over and over again until approx. 156
MB of memory has been processed. From the time and total size I get my
KiB/sec.
Diagram [1] is exactly that. SIMD is my SIMD version of AES on the SPU,
generic is the already present version of AES (crypto/aes.c, modified
to fit the required signature for the encryption; decryption has been
left out since it is the same code, only with different tables), also on
the SPU.
Diagram [2] shows how relevant the transfer size actually is. I still
transfer 156 MB of data but in different transfer sizes. A smaller
transfer size means more transfers, more waiting, more sbox reloading and
more start/stop cycles.
Generic-PPU is "wrong". This speed is taken from the first diagram. With
more loops it should get slightly slower (at least due to branches). The
operation is ECB + encryption + 128-bit key.
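
The arithmetic behind these figures can be written down as two small C
helpers. This is a sketch for illustration only: calls_needed() and
throughput_kib_s() are hypothetical names, not part of the test module, and
"MB" is taken to mean 2^20 bytes, which is an assumption.

```c
#include <stdint.h>

/* Calls needed to push total_bytes through in chunk_bytes pieces. */
static uint64_t calls_needed(uint64_t total_bytes, uint64_t chunk_bytes)
{
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* Throughput in KiB/s from bytes processed and elapsed time in seconds. */
static double throughput_kib_s(uint64_t bytes, double seconds)
{
    return (double)bytes / 1024.0 / seconds;
}
```

Under that assumption, 156 MB in 16 KiB pieces is 9984 calls, and halving the
transfer size doubles the number of start/stop cycles for the same total.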

I did not measure how my SIMD code behaves if the buffers are already
there and I never have to start the SPU. Maybe later this week (as well
as fixing/completing diagram 2).

Ach, one last thing: everything is in ECB mode. From experience with VMX I
must say that that one little xor operation in CBC makes no difference
at all.

[1] http://breakpoint.cc/spu_aes/spu_code.png
[2] http://breakpoint.cc/spu_aes/spu_sync_blocksize.png

>--
> Evgeniy Polyakov
Sebastian

2007-06-28 10:51:06

by Evgeniy Polyakov

Subject: Re: [RFC 0/2] AES ablkcipher driver for SPUs

On Wed, Jun 27, 2007 at 01:41:59PM +0200, Sebastian Siewior ([email protected]) wrote:
> Yes I do. Those numbers were gathered on a PS3 with a sync
> interface. Sync means the SPU is idle, I queue the request, start the
> SPU, the SPU requests the data, waits for completion, computes it,
> transfers it back and finally the SPU stops (idle again). Oh, and only
> one SPU is used.
> The test is driven by a simple module that allocates four pages (16
> KiB) and calls the SPU crypto code over and over again until approx. 156
> MB of memory has been processed. From the time and total size I get my
> KiB/sec.
...
> [1] http://breakpoint.cc/spu_aes/spu_code.png
> [2] http://breakpoint.cc/spu_aes/spu_sync_blocksize.png

Mmm, looks really good. Did the powerpc folks ack these changes?

--
Evgeniy Polyakov

Subject: Re: [RFC 0/2] AES ablkcipher driver for SPUs

* Evgeniy Polyakov | 2007-06-28 14:50:36 [+0400]:

>On Wed, Jun 27, 2007 at 01:41:59PM +0200, Sebastian Siewior ([email protected]) wrote:
>> Yes I do. Those numbers were gathered on a PS3 with a sync
>> interface. Sync means the SPU is idle, I queue the request, start the
>> SPU, the SPU requests the data, waits for completion, computes it,
>> transfers it back and finally the SPU stops (idle again). Oh, and only
>> one SPU is used.
>> The test is driven by a simple module that allocates four pages (16
>> KiB) and calls the SPU crypto code over and over again until approx. 156
>> MB of memory has been processed. From the time and total size I get my
>> KiB/sec.
>...
>> [1] http://breakpoint.cc/spu_aes/spu_code.png
>> [2] http://breakpoint.cc/spu_aes/spu_sync_blocksize.png
>
>Mmm, looks really good. Did the powerpc folks ack these changes?
I submitted some patches a week or two ago to the cbe-oss-dev
ml and I did not get a nack, just comments on style, naming and that sort
of thing. I plan to clean those up (address all issues) and post them once
again. Maybe the IV problem will be solved by then :)

>--
> Evgeniy Polyakov
Sebastian

2007-07-12 07:50:52

by Herbert Xu

Subject: Re: [RFC 0/2] AES ablkcipher driver for SPUs

On Thu, Jul 12, 2007 at 09:18:33AM +0200, Sebastian Siewior wrote:
>
> If there is a new IV on *every* request and I don't have to write the IV back
> then I put it in my request struct next to the source address :). This indeed
> solves my problem. What about jumbo frames in IPsec? Do I get 16 contiguous

You mean not writing back the final IV is good enough for you?
If so we can probably add a request flag for that.

> pages (16 * 4KiB for almost 64KiB which may be encrypted) or is this what you
> mean by "the IV is going to change for almost every request"?

They won't be contiguous so yes caching would help here. However
this case is exceedingly rare in the wild.

> Oh Oh. Where do I get random numbers from? When do you expect me to do
> this (after request received, request completed or in callback) and what
> is this useful for (or how random must the number really be)? I could

Well I was hoping that you had a built-in RNG or some such :)

> The only thing I'm scared of is that I can't use the updated IV for
> following requests since it is not updated at enqueue time (my
> assumption was that I need this for requests like jumbo frames where the
> whole 64k frame is split into multiple 8k (smaller) requests and the IV from
> the previous 8k is required for the following 8k).

If your hardware can't chain them up for you then yes you'd
have to wait for each chunk to finish before processing the
next one. But as I said before this isn't that common at all.
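
The dependency between chunks can be shown with a small CBC sketch in C. It
uses a toy block function in place of AES (NOT secure, for illustration
only); toy_encrypt_block() and cbc_encrypt() are hypothetical names and not
part of the crypto API. The point is that the IV for a chunk is the last
ciphertext block of the previous chunk, so a chunk cannot start until the one
before it has finished.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16  /* AES block size in bytes */

/* Toy stand-in for AES, NOT secure: xor with the key, rotate each byte. */
static void toy_encrypt_block(const uint8_t key[BLK], uint8_t blk[BLK])
{
    for (int i = 0; i < BLK; i++) {
        uint8_t v = blk[i] ^ key[i];
        blk[i] = (uint8_t)((v << 3) | (v >> 5));
    }
}

/*
 * CBC-encrypt len bytes (a multiple of BLK) in place. On return, iv holds
 * the last ciphertext block, which is the IV for a following chunk.
 */
static void cbc_encrypt(const uint8_t key[BLK], uint8_t iv[BLK],
                        uint8_t *data, size_t len)
{
    for (size_t off = 0; off < len; off += BLK) {
        for (int i = 0; i < BLK; i++)
            data[off + i] ^= iv[i];       /* the one extra xor of CBC */
        toy_encrypt_block(key, data + off);
        memcpy(iv, data + off, BLK);      /* chain into the next block */
    }
}
```

Encrypting a buffer in one call, or in two calls that pass the written-back
IV along, gives the same ciphertext; skipping the write-back is only safe when
every request brings its own fresh IV.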

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt