2020-07-08 08:58:14

by Van Leeuwen, Pascal

Subject: question regarding crypto driver DMA issue

Hi,

I have a question on behalf of a customer of ours trying to use the inside-secure crypto
API driver. They are experiencing issues with result data not arriving in the result buffer.
This seems to have something to do with not being able to DMA to said buffer, as they
can work around the issue by explicitly allocating a DMA buffer on the fly and copying
data from there to the original destination.

The problem I have is that I do not have access to their hardware and the driver seems
to work just fine on any hardware (both x64 and ARM64) I have available here, so I
have to approach this purely theoretically ...

For the situation where this problem is occurring, the actual buffers are stored inside
the ahash_req structure. So my question is: is there any reason why this structure may
not be DMA-able on some systems? (as I have a hunch that may be the problem ...)

Regards,
Pascal van Leeuwen
Silicon IP Architect Multi-Protocol Engines, Rambus Security
Rambus ROTW Holding BV
+31-73 6581953

Note: The Inside Secure/Verimatrix Silicon IP team was recently acquired by Rambus.
Please be so kind to update your e-mail address book with my new e-mail address.


** This message and any attachments are for the sole use of the intended recipient(s). It may contain information that is confidential and privileged. If you are not the intended recipient of this message, you are prohibited from printing, copying, forwarding or saving it. Please delete the message and attachments and notify the sender immediately. **

Rambus Inc.<http://www.rambus.com>


2020-07-08 09:16:13

by Ard Biesheuvel

Subject: Re: question regarding crypto driver DMA issue

On Wed, 8 Jul 2020 at 11:56, Van Leeuwen, Pascal <[email protected]> wrote:
>
> Hi,
>
> I have a question on behalf of a customer of ours trying to use the inside-secure crypto
> API driver. They are experiencing issues with result data not arriving in the result buffer.
> This seems to have something to do with not being able to DMA to said buffer, as they
> can work around the issue by explicitly allocating a DMA buffer on the fly and copying
> data from there to the original destination.
>
> The problem I have is that I do not have access to their hardware and the driver seems
> to work just fine on any hardware (both x64 and ARM64) I have available here, so I
> have to approach this purely theoretically ...
>
> For the situation where this problem is occurring, the actual buffers are stored inside
> the ahash_req structure. So my question is: is there any reason why this structure may
> not be DMA-able on some systems? (as I have a hunch that may be the problem ...)
>

If DMA is non-coherent, and the ahash_req structure is also modified
by the CPU while it is mapped for DMA, you are likely to get a
conflict.

It should help if you align the DMA-able fields sufficiently, and make
sure you never touch them while they are mapped for writing by the
device.

2020-07-08 13:37:15

by Van Leeuwen, Pascal

Subject: RE: question regarding crypto driver DMA issue

Hi Ard,

Thanks for responding!

> > For the situation where this problem is occurring, the actual buffers are stored inside
> > the ahash_req structure. So my question is: is there any reason why this structure may
> > not be DMA-able on some systems? (as I have a hunch that may be the problem ...)
> >
>
> If DMA is non-coherent, and the ahash_req structure is also modified
> by the CPU while it is mapped for DMA, you are likely to get a
> conflict.
>
Ah ... I get it. If I dma_map TO_DEVICE then all relevant cachelines are flushed, then
if the CPU accesses any other data sharing those cachelines, they get read back into
the cache. Any subsequent access of the actual result will then read stale data from
the cache.
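
The cache line sharing described above can be checked from field offsets. The following is a minimal sketch, not the driver's actual layout: `struct demo_req_ctx`, `DEMO_LINE_SIZE` and `shares_cache_line()` are hypothetical names, the 64-byte line size is an assumption, and the offsets are taken relative to a struct that is assumed to start on a line boundary.

```c
#include <stddef.h>

#define DEMO_LINE_SIZE 64 /* assumed line size; real values are system-specific */

/* Hypothetical request context: CPU-owned state right next to a DMA target. */
struct demo_req_ctx {
    unsigned int flags;       /* touched by the CPU while the request is in flight */
    unsigned char result[32]; /* written by the device via DMA */
};

/* Return 1 if byte ranges [a, a+alen) and [b, b+blen) touch a common cache line. */
static int shares_cache_line(size_t a, size_t alen, size_t b, size_t blen,
                             size_t line)
{
    size_t a_first = a / line, a_last = (a + alen - 1) / line;
    size_t b_first = b / line, b_last = (b + blen - 1) / line;

    return a_first <= b_last && b_first <= a_last;
}
```

Calling `shares_cache_line()` with the offsets of `flags` and `result` reports an overlap here, which is the situation in which a CPU access to `flags` can pull the line holding `result` back into the cache.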

> It should help if you align the DMA-able fields sufficiently, and make
> sure you never touch them while they are mapped for writing by the
> device.
>
Yes, I guess that is the only way. I also toyed with the idea of using dedicated properly
dma_alloc'ed buffers with pointers in the ahash_request structure, but I don't see how
I can allocate per-request buffers as there is no callback to the driver on req creation.

So ... is there any magical way within the Linux kernel to cacheline-align members of
a structure? Considering cacheline size is very system-specific?

Regards,
Pascal van Leeuwen
Silicon IP Architect Multi-Protocol Engines, Rambus Security
Rambus ROTW Holding BV
+31-73 6581953


2020-07-08 13:42:31

by Ard Biesheuvel

Subject: Re: question regarding crypto driver DMA issue

On Wed, 8 Jul 2020 at 16:35, Van Leeuwen, Pascal <[email protected]> wrote:
>
> Hi Ard,
>
> Thanks for responding!
>
> > > For the situation where this problem is occurring, the actual buffers are stored inside
> > > the ahash_req structure. So my question is: is there any reason why this structure may
> > > not be DMA-able on some systems? (as I have a hunch that may be the problem ...)
> > >
> >
> > If DMA is non-coherent, and the ahash_req structure is also modified
> > by the CPU while it is mapped for DMA, you are likely to get a
> > conflict.
> >
> Ah ... I get it. If I dma_map TO_DEVICE then all relevant cachelines are flushed, then
> if the CPU accesses any other data sharing those cachelines, they get read back into
> the cache. Any subsequent access of the actual result will then read stale data from
> the cache.
>
> > It should help if you align the DMA-able fields sufficiently, and make
> > sure you never touch them while they are mapped for writing by the
> > device.
> >
> Yes, I guess that is the only way. I also toyed with the idea of using dedicated properly
> dma_alloc'ed buffers with pointers in the ahash_request structure, but I don't see how
> I can allocate per-request buffers as there is no callback to the driver on req creation.
>
> So ... is there any magical way within the Linux kernel to cacheline-align members of
> a structure? Considering cacheline size is very system-specific?
>

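You can use ____cacheline_aligned as a modifier on struct members that
are accessed by the device (__cacheline_aligned, with two underscores,
additionally places the object in a dedicated data section and is meant
for global variables). However, the alignment it applies is a typical
cache line size, not a worst-case one, and since it is fixed at compile
time, you really need a worst-case value.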
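
As an illustration of compile-time member alignment, here is a minimal userspace sketch that uses C11 `alignas` as a stand-in for the kernel attribute. The struct, its fields and the 128-byte `DEMO_DMA_ALIGN` value are assumptions chosen for the example, not taken from the driver.

```c
#include <stdalign.h>
#include <stddef.h>

#define DEMO_DMA_ALIGN 128 /* assumed worst-case alignment, fixed at compile time */

/* Hypothetical request context with the device-written field on its own lines. */
struct demo_req_ctx {
    unsigned int flags;                               /* CPU-owned state */
    alignas(DEMO_DMA_ALIGN) unsigned char result[32]; /* device-written */
    alignas(DEMO_DMA_ALIGN) unsigned int tail;        /* CPU-owned, starts a new line */
};

/* The DMA target starts on an alignment boundary within the struct. */
_Static_assert(offsetof(struct demo_req_ctx, result) % DEMO_DMA_ALIGN == 0,
               "result must start on an alignment boundary");
```

Note that the member offsets only translate into aligned addresses if the memory holding the struct is itself aligned at least that strictly, and that the padding grows with the chosen constant.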

On arm64, the maximum CWG (Cache Writeback Granule) value is 2k, which
is a bit excessive, so it might help to do this at runtime. One thing
you might do is increase the reqsize at TFM init time (in which case
you could also check whether the device is cache coherent for DMA),
and have a helper that gives you the address of the sub-struct inside
the request struct based on the current cache alignment.
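
The runtime approach suggested above could look roughly like the following userspace sketch. All `demo_*` names and both structs are hypothetical. In a driver, the alignment would presumably come from something like `dma_get_cache_alignment()`, the size would be registered with `crypto_ahash_set_reqsize()` at TFM init time, and the raw context pointer would come from `ahash_request_ctx()`; those calls are only named here, not used.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU-owned bookkeeping at the start of the request context. */
struct demo_cpu_state {
    unsigned int flags;
};

/* Hypothetical sub-struct that the device writes to. */
struct demo_dma_area {
    unsigned char result[64];
};

/* Round x up to a multiple of align (align must be a power of two). */
static size_t demo_round_up(size_t x, size_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/*
 * Size to request at init time: the CPU state, up to align - 1 bytes of
 * slack to reach the next boundary, and the DMA area padded to whole lines
 * so that nothing behind it shares its last line.
 */
static size_t demo_reqsize(size_t align)
{
    return sizeof(struct demo_cpu_state) + (align - 1) +
           demo_round_up(sizeof(struct demo_dma_area), align);
}

/* Helper: address of the DMA area inside the raw request context. */
static struct demo_dma_area *demo_dma_area(void *raw_ctx, size_t align)
{
    uintptr_t p = (uintptr_t)raw_ctx + sizeof(struct demo_cpu_state);

    p = (p + align - 1) & ~((uintptr_t)align - 1);
    return (struct demo_dma_area *)p;
}
```

With this layout the padding cost scales with the alignment the running system actually needs, and a cache-coherent device could simply pass an alignment of 1.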