2016-12-20 09:41:20

by Binoy Jayan

Subject: dm-crypt optimization

At a high level the goal is to maximize the size of data blocks that get passed
to hardware accelerators, minimizing the overhead of setting up and tearing
down operations in the hardware. Currently dm-crypt itself is a big blocker, as
it manually implements ESSIV and similar algorithms that allow per-block
encryption of the data, so the low-level operations from the crypto API can
only operate on a single block. This is done because the crypto API currently
has no software implementations of these algorithms itself, so dm-crypt
can't rely on it to provide the functionality. The plan to address this is to
provide software implementations in the crypto API, then update dm-crypt to
rely on those. Even for a pure software implementation with no hardware
acceleration this should provide a small optimization, since we call into the
crypto API less often, but it is likely to be marginal given the overhead of
the crypto itself; the real win would be on a system with an accelerator that
can replace the software implementation.
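
For reference, ESSIV derives the per-sector IV by encrypting the sector
number with a cipher keyed by a hash of the volume key. A minimal sketch of
the generation step, modelled on what dm-crypt does today (simplified, no
error handling; 'essiv_tfm' is assumed to be an ECB cipher already keyed with
e.g. the sha256 of the volume key):

#include <linux/crypto.h>

/* IV = E_{H(key)}(sector): fill the IV with the little-endian sector
 * number and encrypt it in place with the salt-keyed cipher.
 */
static void essiv_iv(struct crypto_cipher *essiv_tfm, u8 *iv,
                     unsigned int ivsize, u64 sector)
{
        memset(iv, 0, ivsize);
        *(__le64 *)iv = cpu_to_le64(sector);
        crypto_cipher_encrypt_one(essiv_tfm, iv, iv);
}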

Currently dm-crypt handles data only in single blocks. This means it can't
make good use of hardware cryptography engines, since there is an overhead to
each transaction with the engine but transfers must be split into block-sized
chunks. Allowing the transfer of larger blocks, e.g. a whole 'struct bio',
could mitigate these costs and improve performance in operating systems
with encrypted filesystems. Although Qualcomm chipsets support another variant
of the device mapper, dm-req-crypt, it is not generic and not in a
mainline-able state. Also, it only supports the 'XTS-AES' mode of encryption
and is not compatible with the other modes supported by dm-crypt.
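
To make the cost concrete, here is a heavily simplified sketch of the kind of
per-sector loop dm-crypt effectively runs today (illustrative, not the actual
dm-crypt source; 'req', 'in_page'/'in_off', 'out_page'/'out_off' and
'generate_iv' are stand-ins for the real per-request state and the configured
IV mode):

struct scatterlist sg_in, sg_out;
u8 iv[16];
u64 sector = bio->bi_iter.bi_sector;
unsigned int remaining = bio->bi_iter.bi_size;

while (remaining) {
        /* one single-entry scatterlist per 512-byte sector */
        sg_init_table(&sg_in, 1);
        sg_set_page(&sg_in, in_page, SECTOR_SIZE, in_off);
        sg_init_table(&sg_out, 1);
        sg_set_page(&sg_out, out_page, SECTOR_SIZE, out_off);

        generate_iv(iv, sector);                /* e.g. essiv, plain64 */
        skcipher_request_set_crypt(req, &sg_in, &sg_out, SECTOR_SIZE, iv);
        crypto_skcipher_encrypt(req);           /* one call per 512 bytes */

        sector++;
        remaining -= SECTOR_SIZE;
        /* ... advance in_page/in_off and out_page/out_off ... */
}

A 128K bio therefore costs 256 round trips into the crypto API, and with a
hardware engine, 256 separate setup/teardown cycles.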

However, there are some challenges and a few possibilities to address this. I
request you to provide your suggestions on whether the points mentioned below
make sense and whether they could be done differently.

1. Move the 'real' IV generation algorithms to the crypto layer (e.g. essiv).
2. Increase the 'length' of the scatterlist nodes used in the crypto API. It
can be made equal to the size of a main memory segment (as defined in
'struct bio'), as segments are physically contiguous.
3. Represent the multiple segments in a 'struct bio' as a single scatterlist
covering all of them (a sketch of this mapping follows the list).

4. Move the algorithms 'lmk' and 'tcw' (which are IVs combined with hacks to
the cbc mode) into customized cbc algorithms, implemented in separate
files (e.g. cbc_lmk.c/cbc_tcw.c). As Milan suggested, these can't be treated
as real IVs, as they include hacks to the cbc mode (and directly manipulate
the encrypted data).

5. Move the key selection logic to user space, or always assume a keycount of
'1' (as mentioned in the dm-crypt param format below), so that the key
selection logic does not depend on the sector number. This is necessary
because the key is otherwise selected based on the sector number:

key_index = sector & (key_count - 1)

If the block size for scatterlist nodes is increased beyond the sector
boundary (which is what we plan to achieve, for performance), the key set for
each cipher operation cannot be changed at the sector level.

dm-crypt param format: cipher[:keycount]-mode-iv:ivopts
Example: aes:2-cbc-essiv:sha256

Also, as Milan suggested, it is not wise to move the key selection logic into
the crypto layer, as that would prevent any changes to the key structure later.
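
As mentioned in point 3, here is a sketch of mapping a whole bio into one
scatterlist (assumed code, not current dm-crypt; the caller would size
'max_ents' from e.g. bio_segments()):

#include <linux/bio.h>
#include <linux/scatterlist.h>

/* Map every segment of a bio into one scatterlist so that a single
 * crypto request can cover the whole bio instead of one request per
 * 512-byte sector.
 */
static unsigned int bio_to_sgl(struct bio *bio, struct scatterlist *sgl,
                               unsigned int max_ents)
{
        struct bio_vec bvec;
        struct bvec_iter iter;
        unsigned int n = 0;

        sg_init_table(sgl, max_ents);
        bio_for_each_segment(bvec, bio, iter) {
                if (n >= max_ents)
                        return 0;       /* fall back to per-sector path */
                sg_set_page(&sgl[n++], bvec.bv_page, bvec.bv_len,
                            bvec.bv_offset);
        }
        if (n)
                sg_mark_end(&sgl[n - 1]);
        return n;
}

A single crypto request over this scatterlist then covers the whole bio,
provided the IV handling of point 1 has moved into the crypto layer.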

The following is a reference to an earlier patchset. It had the cipher mode
'cbc' mixed up with the IV algorithms, which is not the preferred way.

Reference:
https://lkml.org/lkml/2016/12/13/65
https://lkml.org/lkml/2016/12/13/66


2016-12-21 12:47:26

by Milan Broz

Subject: Re: dm-crypt optimization

On 12/20/2016 10:41 AM, Binoy Jayan wrote:
> At a high level the goal is to maximize the size of data blocks that get passed
> to hardware accelerators, minimizing the overhead of setting up and tearing
> down operations in the hardware. Currently dm-crypt itself is a big blocker, as
> it manually implements ESSIV and similar algorithms that allow per-block
> encryption of the data, so the low-level operations from the crypto API can
> only operate on a single block. This is done because the crypto API currently
> has no software implementations of these algorithms itself, so dm-crypt
> can't rely on it to provide the functionality. The plan to address this is to
> provide software implementations in the crypto API, then update dm-crypt to
> rely on those. Even for a pure software implementation with no hardware
> acceleration this should provide a small optimization, since we call into the
> crypto API less often, but it is likely to be marginal given the overhead of
> the crypto itself; the real win would be on a system with an accelerator that
> can replace the software implementation.
>
> Currently dm-crypt handles data only in single blocks. This means it can't
> make good use of hardware cryptography engines, since there is an overhead to
> each transaction with the engine but transfers must be split into block-sized
> chunks. Allowing the transfer of larger blocks, e.g. a whole 'struct bio',
> could mitigate these costs and improve performance in operating systems
> with encrypted filesystems. Although Qualcomm chipsets support another variant
> of the device mapper, dm-req-crypt, it is not generic and not in a
> mainline-able state. Also, it only supports the 'XTS-AES' mode of encryption
> and is not compatible with the other modes supported by dm-crypt.

So the core problem is that your crypto accelerator can operate efficiently only
with bigger batch sizes.

How big do the blocks need to be for your crypto hw to operate more
efficiently? What about 4k blocks (no batches), could that be a usable
trade-off?

With some (backward-incompatible) changes in the LUKS format I would like to
see support for encryption blocks equivalent to the sector size, which for a
4k drive basically means a 4k encryption block.
(This should decrease overhead; currently everything is processed in 512-byte
blocks only.)

Support for bigger block sizes would be unsafe without an additional
mechanism that provides atomic writes of multiple sectors. (Maybe that
applies to 4k as well on some devices, though...)

The above is not going against your proposal; I am just curious whether this
is enough to provide better performance on your hw accelerator or not.

Milan

> However, there are some challenges and a few possibilities to address this. I
> request you to provide your suggestions on whether the points mentioned below
> make sense and whether they could be done differently.
>
> 1. Move the 'real' IV generation algorithms to the crypto layer (e.g. essiv).
> 2. Increase the 'length' of the scatterlist nodes used in the crypto API. It
> can be made equal to the size of a main memory segment (as defined in
> 'struct bio'), as segments are physically contiguous.
> 3. Represent the multiple segments in a 'struct bio' as a single scatterlist
> covering all of them.
>
> 4. Move the algorithms 'lmk' and 'tcw' (which are IVs combined with hacks to
> the cbc mode) into customized cbc algorithms, implemented in separate
> files (e.g. cbc_lmk.c/cbc_tcw.c). As Milan suggested, these can't be treated
> as real IVs, as they include hacks to the cbc mode (and directly manipulate
> the encrypted data).
>
> 5. Move the key selection logic to user space, or always assume a keycount of
> '1' (as mentioned in the dm-crypt param format below), so that the key
> selection logic does not depend on the sector number. This is necessary
> because the key is otherwise selected based on the sector number:
>
> key_index = sector & (key_count - 1)
>
> If the block size for scatterlist nodes is increased beyond the sector
> boundary (which is what we plan to achieve, for performance), the key set for
> each cipher operation cannot be changed at the sector level.
>
> dm-crypt param format: cipher[:keycount]-mode-iv:ivopts
> Example: aes:2-cbc-essiv:sha256
>
> Also, as Milan suggested, it is not wise to move the key selection logic into
> the crypto layer, as that would prevent any changes to the key structure later.
>
> The following is a reference to an earlier patchset. It had the cipher mode
> 'cbc' mixed up with the IV algorithms, which is not the preferred way.
>
> Reference:
> https://lkml.org/lkml/2016/12/13/65
> https://lkml.org/lkml/2016/12/13/66
>

2016-12-22 08:26:03

by Binoy Jayan

Subject: Re: dm-crypt optimization

Hi Milan,

On 21 December 2016 at 18:17, Milan Broz <[email protected]> wrote:

> So the core problem is that your crypto accelerator can operate efficiently only
> with bigger batch sizes.

Thank you for the reply. Yes, having bigger block sizes would indeed be an
improvement.

> How big blocks your crypto hw need to be able to operate more efficiently?
> What about 4k blocks (no batches), could it be usable trade-off?

The benchmark results for Qualcomm Snapdragon SoCs (mentioned below) show a
significant improvement with 4K blocks, but in batches of all such contiguous
segments in the block layer's request queue, in the form of a chained
scatterlist. However, it uses the algorithm 'aes-xts' instead of the
conventional 'essiv-cbc-aes' used in dm-crypt. Also, it uses the device
mapper dm-req-crypt instead of dm-crypt.

http://nelenkov.blogspot.in/2015/05/hardware-accelerated-disk-encryption-in.html
Section: 'Performance'

It reports an IO rate of 46.3 MB/s, compared to an IO rate of 25.1 MB/s when
using software-based FDE (based on dm-crypt). But I am not sure how genuine
this data is or how it was tested.

Since Qualcomm SoCs use a hardware-backed keystore for managing keys, and
since there is no easy way to make dm-crypt work with Qualcomm's engines, I do
not have solid benchmark data showing improved performance with 4k blocks.

> With some (backward-incompatible) changes in the LUKS format I would like to
> see support for encryption blocks equivalent to the sector size, which for a
> 4k drive basically means a 4k encryption block.
> (This should decrease overhead; currently everything is processed in 512-byte
> blocks only.)
>
> Support for bigger block sizes would be unsafe without an additional
> mechanism that provides atomic writes of multiple sectors. (Maybe that
> applies to 4k as well on some devices, though...)

Did you mean writes to the crypto output buffers or the actual disk write?
I didn't quite understand how the block size for encryption affects atomic
writes, as it is the block layer which handles them. As far as dm-crypt is
concerned, it just encrypts/decrypts a 'struct bio' instance and submits the
IO operation to the block layer.

> The above is not going against your proposal; I am just curious whether this
> is enough to provide better performance on your hw accelerator or not.

Maybe I can procure an open crypto board and get back to you with some
results. Or maybe show even a marginal improvement when using a software
algorithm, by avoiding the crypto overhead for every 512 bytes.

-Binoy

2016-12-22 09:00:29

by Herbert Xu

Subject: Re: dm-crypt optimization

On Thu, Dec 22, 2016 at 01:55:59PM +0530, Binoy Jayan wrote:
>
> > Support for bigger block sizes would be unsafe without an additional
> > mechanism that provides atomic writes of multiple sectors. (Maybe that
> > applies to 4k as well on some devices, though...)
>
> Did you mean writes to the crypto output buffers or the actual disk write?
> I didn't quite understand how the block size for encryption affects atomic
> writes, as it is the block layer which handles them. As far as dm-crypt is
> concerned, it just encrypts/decrypts a 'struct bio' instance and submits the
> IO operation to the block layer.

I think Milan's talking about increasing the real block size, which
would obviously require the hardware to be able to write that out
atomically, as otherwise it breaks the crypto.

But if we can instead do the IV generation within the crypto API, then the
block size won't be an issue at all, because you can supply as many blocks as
you want and they will be processed block by block.
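
To illustrate the idea (a purely hypothetical sketch, not an existing crypto
API: 'geniv_encrypt', 'generate_iv', 'encrypt_range', MAX_IV_SIZE and the way
the starting sector is passed in are all made up for the example), such a
template could walk one large request in sector-sized blocks internally:

static int geniv_encrypt(struct skcipher_request *req, u64 start_sector)
{
        u8 iv[MAX_IV_SIZE];
        u64 sector = start_sector;
        unsigned int done = 0;

        while (done < req->cryptlen) {
                unsigned int len = min_t(unsigned int, SECTOR_SIZE,
                                         req->cryptlen - done);

                /* regenerate the IV for every sector-sized block */
                generate_iv(iv, sector);
                /* encrypt bytes [done, done + len) of req->src into
                 * req->dst with this IV, e.g. via a subrequest on
                 * the inner cipher */
                encrypt_range(req, done, len, iv);

                sector++;
                done += len;
        }
        return 0;
}

The caller just submits the whole bio's worth of data plus the starting
sector; the size of the request no longer matters for correctness.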

Now there is a disadvantage to this approach, and that is you have to wait
for the whole thing to be encrypted before you can start doing the IO. I'm
not sure how big a problem that is, but if it is bad enough to affect
performance, we can look into adding some form of partial completion to the
crypto API.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2016-12-22 10:16:17

by Ofir Drang

Subject: RE: dm-crypt optimization




On Thu, Dec 22, 2016 at 01:55:59PM +0530, Binoy Jayan wrote:
>>
>> > Support for bigger block sizes would be unsafe without an additional
>> > mechanism that provides atomic writes of multiple sectors. (Maybe that
>> > applies to 4k as well on some devices, though...)
>>
>> Did you mean writes to the crypto output buffers or the actual disk write?
>> I didn't quite understand how the block size for encryption affects
>> atomic writes, as it is the block layer which handles them. As far as
>> dm-crypt is concerned, it just encrypts/decrypts a 'struct bio' instance
>> and submits the IO operation to the block layer.

>I think Milan's talking about increasing the real block size, which would
>obviously require the hardware to be able to write that out atomically, as
>otherwise it breaks the crypto.
>
>But if we can instead do the IV generation within the crypto API, then the
>block size won't be an issue at all, because you can supply as many blocks
>as you want and they will be processed block by block.
>
>Now there is a disadvantage to this approach, and that is you have to wait
>for the whole thing to be encrypted before you can start doing the IO. I'm
>not sure how big a problem that is, but if it is bad enough to affect
>performance, we can look into adding some form of partial completion to the
>crypto API.
>
>Cheers,

But assuming we have a hardware accelerator that knows how to handle the IV
generation for each sector, it would make sense to send the hardware the
maximum block size, as this would allow us to better utilize the hardware and
offload the software. So, if possible, we need to provide a generic interface
that can make the best use of such hardware accelerators.

Thx Ofir