The crypto akcipher interface has exactly one user, the keyring
subsystem. That user only deals with kernel pointers, not SG lists.
Therefore the use of SG lists in the akcipher interface is
completely pointless.
As there is only one user, changing it isn't that hard. This
patch series is a first step in that direction. It introduces
a new interface for encryption and decryption without SG lists:
    int crypto_akcipher_sync_encrypt(struct crypto_akcipher *tfm,
                                     const void *src, unsigned int slen,
                                     void *dst, unsigned int dlen);

    int crypto_akcipher_sync_decrypt(struct crypto_akcipher *tfm,
                                     const void *src, unsigned int slen,
                                     void *dst, unsigned int dlen);
I've decided to split out signing and verification because most
(all but one) of our signature algorithms do not support encryption
or decryption. These can now be accessed through the dsa interface:
    int crypto_dsa_sign(struct crypto_dsa *tfm,
                        const void *src, unsigned int slen,
                        void *dst, unsigned int dlen);

    int crypto_dsa_verify(struct crypto_dsa *tfm,
                          const void *src, unsigned int slen,
                          const void *digest, unsigned int dlen);
The keyring system has been converted to this interface.
The next step would be to convert the code within the Crypto API so
that SG lists are not used at all on the software path. This
would eliminate the unnecessary copying that currently happens.
Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu <[email protected]> wrote:
> I've decided to split out signing and verification because most
> (all but one) of our signature algorithms do not support encryption
> or decryption. These can now be accessed through the dsa interface:
That feels wrongly named as there's a DSA public key algorithm.
David
On Tue, Jun 13, 2023 at 01:53:28PM +0100, David Howells wrote:
> That feels wrongly named as there's a DSA public key algorithm.
You're quite right. This is indeed confusing. I'll rename it
to sig instead. So we'll have crypto_sig_sign and crypto_sig_verify.
Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
v2 changes:
- Rename dsa to sig.
- Add braces around else clause.
The crypto akcipher interface has exactly one user, the keyring
subsystem. That user only deals with kernel pointers, not SG lists.
Therefore the use of SG lists in the akcipher interface is
completely pointless.
As there is only one user, changing it isn't that hard. This
patch series is a first step in that direction. It introduces
a new interface for encryption and decryption without SG lists:
    int crypto_akcipher_sync_encrypt(struct crypto_akcipher *tfm,
                                     const void *src, unsigned int slen,
                                     void *dst, unsigned int dlen);

    int crypto_akcipher_sync_decrypt(struct crypto_akcipher *tfm,
                                     const void *src, unsigned int slen,
                                     void *dst, unsigned int dlen);
I've decided to split out signing and verification because most
(all but one) of our signature algorithms do not support encryption
or decryption. These can now be accessed through the sig interface:
    int crypto_sig_sign(struct crypto_sig *tfm,
                        const void *src, unsigned int slen,
                        void *dst, unsigned int dlen);

    int crypto_sig_verify(struct crypto_sig *tfm,
                          const void *src, unsigned int slen,
                          const void *digest, unsigned int dlen);
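For illustration only, the calling convention above (plain pointers and lengths in, plain pointers out, no scatterlist setup or completion handling) can be modelled in userspace C with a toy transform. Everything below (`toy_tfm`, `toy_sync_encrypt`, the XOR "cipher") is a made-up stand-in, not kernel API:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for a transform context; the real crypto_akcipher tfm
 * lives in the kernel.  Everything here is illustrative only. */
struct toy_tfm { unsigned char key; };

/* Models the new linear-buffer convention: the caller passes plain
 * pointers and lengths, with no scatterlist setup and no completion
 * callback.  Returns the number of bytes written, or -1 if dst is
 * too small. */
static int toy_sync_encrypt(struct toy_tfm *tfm,
                            const void *src, unsigned int slen,
                            void *dst, unsigned int dlen)
{
	const unsigned char *s = src;
	unsigned char *d = dst;
	unsigned int i;

	if (dlen < slen)
		return -1;
	for (i = 0; i < slen; i++)
		d[i] = s[i] ^ tfm->key;	/* toy cipher, not real crypto */
	return (int)slen;
}

/* Round-trip self-check: XOR is its own inverse, so "decrypt" is the
 * same call with the same key. */
static int toy_demo(void)
{
	struct toy_tfm tfm = { 0x5a };
	unsigned char ct[5], pt[5];

	if (toy_sync_encrypt(&tfm, "hello", 5, ct, sizeof(ct)) != 5)
		return -1;
	if (toy_sync_encrypt(&tfm, ct, 5, pt, sizeof(pt)) != 5)
		return -1;
	return memcmp(pt, "hello", 5) ? -1 : 0;
}
```

The point of the sketch is only the shape of the signature: a keyring-style caller that already holds kernel pointers can call it directly, with none of the SG-list packing the old interface required.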
The keyring system has been converted to this interface.
The next step would be to convert the code within the Crypto API so
that SG lists are not used at all on the software path. This
would eliminate the unnecessary copying that currently happens.
Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Thu, 15 Jun 2023 at 12:26, Herbert Xu <[email protected]> wrote:
>
> v2 changes:
>
> - Rename dsa to sig.
> - Add braces around else clause.
>
> The crypto akcipher interface has exactly one user, the keyring
> subsystem. That user only deals with kernel pointers, not SG lists.
> Therefore the use of SG lists in the akcipher interface is
> completely pointless.
>
> As there is only one user, changing it isn't that hard. This
> patch series is a first step in that direction. It introduces
> a new interface for encryption and decryption without SG lists:
>
>     int crypto_akcipher_sync_encrypt(struct crypto_akcipher *tfm,
>                                      const void *src, unsigned int slen,
>                                      void *dst, unsigned int dlen);
>
>     int crypto_akcipher_sync_decrypt(struct crypto_akcipher *tfm,
>                                      const void *src, unsigned int slen,
>                                      void *dst, unsigned int dlen);
>
> I've decided to split out signing and verification because most
> (all but one) of our signature algorithms do not support encryption
> or decryption. These can now be accessed through the sig interface:
>
>     int crypto_sig_sign(struct crypto_sig *tfm,
>                         const void *src, unsigned int slen,
>                         void *dst, unsigned int dlen);
>
>     int crypto_sig_verify(struct crypto_sig *tfm,
>                           const void *src, unsigned int slen,
>                           const void *digest, unsigned int dlen);
>
> The keyring system has been converted to this interface.
>
This looks like a worthwhile improvement to me.
As I asked before, could we do the same for the acomp API? The only
existing user blocks on the completion, and the vast majority of
implementations are software-only.
On Mon, Jun 26, 2023 at 11:21:20AM +0200, Ard Biesheuvel wrote:
>
> As I asked before, could we do the same for the acomp API? The only
> existing user blocks on the completion, and the vast majority of
> implementations are software-only.
We already have scomp. It's not exported, but we could export it.
However, compression is quite different because it was one of the
original network stack algorithms (IPcomp). So SG will always be
needed.
In fact, if anything SG fits quite well with decompression because
it would allow us to allocate pages as we go, avoiding the need
to allocate a large chunk of memory at the start as we do today.
We don't make use of that ability today but that's something that
I'd like to address.
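The allocate-as-you-go idea can be sketched in userspace C (all names here are hypothetical stand-ins, with fixed-size chunks playing the role of pages): output is appended to a chunk list, and a new chunk is allocated only when the current one fills up, so nothing is reserved up front.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 16	/* stand-in for PAGE_SIZE */

struct chunk {
	unsigned char data[CHUNK_SIZE];
	unsigned int used;
	struct chunk *next;
};

/* Append len bytes of output, allocating a new chunk only when the
 * current tail is full.  *headp/*tailp start out NULL.
 * Returns 0 on success, -1 on allocation failure. */
static int sink_write(struct chunk **headp, struct chunk **tailp,
		      const unsigned char *buf, unsigned int len)
{
	while (len) {
		struct chunk *t = *tailp;
		unsigned int n;

		if (!t || t->used == CHUNK_SIZE) {
			struct chunk *c = calloc(1, sizeof(*c));

			if (!c)
				return -1;
			if (t)
				t->next = c;
			else
				*headp = c;
			*tailp = t = c;
		}
		n = CHUNK_SIZE - t->used;
		if (n > len)
			n = len;
		memcpy(t->data + t->used, buf, n);
		t->used += n;
		buf += n;
		len -= n;
	}
	return 0;
}

static int count_chunks(const struct chunk *c)
{
	int n = 0;

	for (; c; c = c->next)
		n++;
	return n;
}

/* Self-check: 40 bytes into 16-byte chunks -> 3 chunks, tail half full. */
static int sink_demo(void)
{
	struct chunk *head = NULL, *tail = NULL;
	unsigned char buf[40];

	memset(buf, 'x', sizeof(buf));
	if (sink_write(&head, &tail, buf, sizeof(buf)))
		return -1;
	if (count_chunks(head) != 3 || head->used != 16 || tail->used != 8)
		return -1;
	return 0;
}
```

In the kernel the chunks would be freshly allocated pages appended to the destination SG list as decompression produces output, rather than a malloc'd list.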
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Mon, 26 Jun 2023 at 11:53, Herbert Xu <[email protected]> wrote:
>
> On Mon, Jun 26, 2023 at 11:21:20AM +0200, Ard Biesheuvel wrote:
> >
> > As I asked before, could we do the same for the acomp API? The only
> > existing user blocks on the completion, and the vast majority of
> > implementations are software-only.
>
> We already have scomp. It's not exported, but we could export it.
>
> However, compression is quite different because it was one of the
> original network stack algorithms (IPcomp). So SG will always be
> needed.
>
> In fact, if anything SG fits quite well with decompression because
> it would allow us to allocate pages as we go, avoiding the need
> to allocate a large chunk of memory at the start as we do today.
>
> We don't make use of that ability today but that's something that
> I'd like to address.
>
OK, so you are saying that you'd like to see the code in
net/xfrm/xfrm_ipcomp.c ported to acomp, so that IPcomp can be backed
by h/w offload?
Would that be a tidying-up exercise, or does it actually address a
real-world use case?
In any case, what I would like to see addressed is the horrid scomp to
acomp layer that ties up megabytes of memory in scratch space, just to
emulate the acomp interface on top of scomp drivers, while no code
exists that makes use of the async nature. Do you have an idea on how
we might address this particular issue?
On Mon, Jun 26, 2023 at 12:03:04PM +0200, Ard Biesheuvel wrote:
>
> In any case, what I would like to see addressed is the horrid scomp to
> acomp layer that ties up megabytes of memory in scratch space, just to
> emulate the acomp interface on top of scomp drivers, while no code
> exists that makes use of the async nature. Do you have an idea on how
> we might address this particular issue?
The whole reason we need to allocate megabytes of memory is because
of the lack of SG lists in the underlying algorithm. If they
actually used SG lists and allocated pages as they went during
decompression, then we wouldn't need to pre-allocate any memory
at all.
It's not going to be easy, of course, since we would need to convert
every single algorithm. Perhaps what we could do in the meantime is
to have the scomp implementations do exponential retries (e.g., start
by allocating size * 2; if that fails, double it, up to whatever we
currently pre-allocate).
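A userspace sketch of that retry loop (the error code, the cap, and the toy decompressor below are all made up for illustration; a real scomp backend would report "destination too small" through its own return value):

```c
#include <assert.h>
#include <stdlib.h>

#define ETOOSMALL	2		/* made-up "dst too small" code */
#define RETRY_CAP	(1u << 20)	/* stand-in for today's scratch size */

typedef int (*decompress_fn)(const void *src, unsigned int slen,
			     void *dst, unsigned int *dlen);

/* Start with a small guess and double the destination on each
 * "too small" failure, up to RETRY_CAP.  On success *dst_out owns a
 * malloc'd buffer holding *dlen_out bytes. */
static int decompress_retry(decompress_fn fn,
			    const void *src, unsigned int slen,
			    void **dst_out, unsigned int *dlen_out)
{
	unsigned int dlen = slen ? slen * 2 : 64;

	for (;;) {
		unsigned int out = dlen;
		void *dst = malloc(dlen);
		int ret;

		if (!dst)
			return -1;
		ret = fn(src, slen, dst, &out);
		if (ret == 0) {
			*dst_out = dst;
			*dlen_out = out;
			return 0;
		}
		free(dst);
		if (ret != -ETOOSMALL || dlen >= RETRY_CAP)
			return ret;
		dlen *= 2;
	}
}

/* Toy decompressor for the self-check: expands each input byte 8x. */
static int toy_inflate(const void *src, unsigned int slen,
		       void *dst, unsigned int *dlen)
{
	const unsigned char *s = src;
	unsigned char *d = dst;
	unsigned int need = slen * 8, i;

	if (*dlen < need)
		return -ETOOSMALL;
	for (i = 0; i < need; i++)
		d[i] = s[i / 8];
	*dlen = need;
	return 0;
}

/* 4 input bytes expand to 32: the loop fails at 8 and 16 before
 * succeeding with a 32-byte buffer. */
static int retry_demo(void)
{
	void *dst;
	unsigned int dlen;

	if (decompress_retry(toy_inflate, "abcd", 4, &dst, &dlen))
		return -1;
	if (dlen != 32 ||
	    ((unsigned char *)dst)[0] != 'a' ||
	    ((unsigned char *)dst)[31] != 'd') {
		free(dst);
		return -1;
	}
	free(dst);
	return 0;
}
```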
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Mon, Jun 26, 2023 at 06:13:44PM +0800, Herbert Xu wrote:
> On Mon, Jun 26, 2023 at 12:03:04PM +0200, Ard Biesheuvel wrote:
> >
> > In any case, what I would like to see addressed is the horrid scomp to
> > acomp layer that ties up megabytes of memory in scratch space, just to
> > emulate the acomp interface on top of scomp drivers, while no code
> > exists that makes use of the async nature. Do you have an idea on how
> > we might address this particular issue?
>
> The whole reason we need to allocate megabytes of memory is because
> of the lack of SG lists in the underlying algorithm. If they
> actually used SG lists and allocated pages as they went during
> decompression, then we wouldn't need to pre-allocate any memory
> at all.
I don't think that is a realistic expectation. Decompressors generally need a
contiguous buffer for decompressed data anyway, up to a certain size which is
32KB for DEFLATE but can be much larger for the more modern algorithms. This is
because they decode "matches" that refer to previously decompressed data by
offset, and it has to be possible to index the data efficiently.
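For illustration, the match decoding that imposes this requirement looks roughly like the following (a toy sketch, not any particular decompressor):

```c
#include <assert.h>
#include <string.h>

/* Decode one LZ-style match: copy len bytes starting offset bytes
 * back in the *already produced* output.  This is why a decompressor
 * must be able to index its recent output (the "window") directly.
 * The byte-at-a-time copy is deliberate: overlapping matches
 * (offset < len) are legal and behave like run-length encoding. */
static unsigned int lz_copy_match(unsigned char *out, unsigned int pos,
				  unsigned int offset, unsigned int len)
{
	unsigned int i;

	for (i = 0; i < len; i++, pos++)
		out[pos] = out[pos - offset];
	return pos;
}

/* Self-check: "abc" + match(offset 3, len 6) -> "abcabcabc"; an
 * overlapping match(offset 1, len 4) then repeats the last byte. */
static int lz_demo(void)
{
	unsigned char out[32];
	unsigned int pos;

	memcpy(out, "abc", 3);
	pos = lz_copy_match(out, 3, 3, 6);
	if (pos != 9 || memcmp(out, "abcabcabc", 9))
		return -1;
	pos = lz_copy_match(out, 9, 1, 4);
	if (pos != 13 || memcmp(out + 9, "cccc", 4))
		return -1;
	return 0;
}
```

If the window-sized history were scattered across non-contiguous pages, every `out[pos - offset]` access would need address translation, which is exactly the cost Eric is pointing at.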
(Some decompressors, e.g. zlib, provide "streaming" APIs where you can read
arbitrary amounts. But that works by actually decompressing into an internal
buffer that has sufficient size, then copying to the user provided buffer.)
The same applies to compressors too, with regards to the original data.
I think the "input/output is a list of pages" model just fundamentally does not
work well for software compression and decompression. To support it, either
large temporary buffers are needed (they might be hidden inside the
(de)compressor, but they are there), or vmap() or vm_map_ram() is needed.
FWIW, f2fs compression uses vm_map_ram() and skips the crypto API entirely...
If acomp has to be kept for the hardware support, then maybe its scomp backend
should use vm_map_ram() instead of scratch buffers?
- Eric
On Wed, 28 Jun 2023 at 08:21, Eric Biggers <[email protected]> wrote:
>
> On Mon, Jun 26, 2023 at 06:13:44PM +0800, Herbert Xu wrote:
> > On Mon, Jun 26, 2023 at 12:03:04PM +0200, Ard Biesheuvel wrote:
> > >
> > > In any case, what I would like to see addressed is the horrid scomp to
> > > acomp layer that ties up megabytes of memory in scratch space, just to
> > > emulate the acomp interface on top of scomp drivers, while no code
> > > exists that makes use of the async nature. Do you have an idea on how
> > > we might address this particular issue?
> >
> > The whole reason we need to allocate megabytes of memory is because
> > of the lack of SG lists in the underlying algorithm. If they
> > actually used SG lists and allocated pages as they went during
> > decompression, then we wouldn't need to pre-allocate any memory
> > at all.
>
> I don't think that is a realistic expectation. Decompressors generally need a
> contiguous buffer for decompressed data anyway, up to a certain size which is
> 32KB for DEFLATE but can be much larger for the more modern algorithms. This is
> because they decode "matches" that refer to previously decompressed data by
> offset, and it has to be possible to index the data efficiently.
>
> (Some decompressors, e.g. zlib, provide "streaming" APIs where you can read
> arbitrary amounts. But that works by actually decompressing into an internal
> buffer that has sufficient size, then copying to the user provided buffer.)
>
> The same applies to compressors too, with regards to the original data.
>
> I think the "input/output is a list of pages" model just fundamentally does not
> work well for software compression and decompression. To support it, either
> large temporary buffers are needed (they might be hidden inside the
> (de)compressor, but they are there), or vmap() or vm_map_ram() is needed.
>
> FWIW, f2fs compression uses vm_map_ram() and skips the crypto API entirely...
>
> If acomp has to be kept for the hardware support, then maybe its scomp backend
> should use vm_map_ram() instead of scratch buffers?
>
Yeah, but we'll run into similar issues related to the fact that
scatterlists can describe arbitrary sequences of sub-page size memory
chunks, which means vmap()ing the pages may not be sufficient to get a
virtual linear representation of the buffers.
With zswap being the only current user, which uses a single contiguous
buffer for out-of-place decompression and blocks on the completion, the
level of additional complexity we have in the acomp stack is mind
boggling. And the scomp-to-acomp adaptation layer, with its fixed-size
per-CPU input and output buffers (implying that acomp input/output has
a hardcoded size limit) which are never freed, makes it rather
unpalatable to me tbh.
On Wed, Jun 28, 2023 at 06:58:58PM +0200, Ard Biesheuvel wrote:
> On Wed, 28 Jun 2023 at 08:21, Eric Biggers <[email protected]> wrote:
> >
> > On Mon, Jun 26, 2023 at 06:13:44PM +0800, Herbert Xu wrote:
> > > On Mon, Jun 26, 2023 at 12:03:04PM +0200, Ard Biesheuvel wrote:
> > > >
> > > > In any case, what I would like to see addressed is the horrid scomp to
> > > > acomp layer that ties up megabytes of memory in scratch space, just to
> > > > emulate the acomp interface on top of scomp drivers, while no code
> > > > exists that makes use of the async nature. Do you have an idea on how
> > > > we might address this particular issue?
> > >
> > > The whole reason we need to allocate megabytes of memory is because
> > > of the lack of SG lists in the underlying algorithm. If they
> > > actually used SG lists and allocated pages as they went during
> > > decompression, then we wouldn't need to pre-allocate any memory
> > > at all.
> >
> > I don't think that is a realistic expectation. Decompressors generally need a
> > contiguous buffer for decompressed data anyway, up to a certain size which is
> > 32KB for DEFLATE but can be much larger for the more modern algorithms. This is
> > because they decode "matches" that refer to previously decompressed data by
> > offset, and it has to be possible to index the data efficiently.
> >
> > (Some decompressors, e.g. zlib, provide "streaming" APIs where you can read
> > arbitrary amounts. But that works by actually decompressing into an internal
> > buffer that has sufficient size, then copying to the user provided buffer.)
> >
> > The same applies to compressors too, with regards to the original data.
> >
> > I think the "input/output is a list of pages" model just fundamentally does not
> > work well for software compression and decompression. To support it, either
> > large temporary buffers are needed (they might be hidden inside the
> > (de)compressor, but they are there), or vmap() or vm_map_ram() is needed.
> >
> > FWIW, f2fs compression uses vm_map_ram() and skips the crypto API entirely...
> >
> > If acomp has to be kept for the hardware support, then maybe its scomp backend
> > should use vm_map_ram() instead of scratch buffers?
> >
>
> Yeah, but we'll run into similar issues related to the fact that
> scatterlists can describe arbitrary sequences of sub-page size memory
> chunks, which means vmap()ing the pages may not be sufficient to get a
> virtual linear representation of the buffers.
Yes, that is annoying... Maybe the acomp API should not support arbitrary
scatterlists, but rather only ones that can be mapped contiguously?
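One way to express that restriction, as a purely userspace model with (offset, len) pairs standing in for per-page scatterlist entries (all names hypothetical):

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096

/* (offset, len) of a segment within its backing page. */
struct seg {
	unsigned int offset;
	unsigned int len;
};

/* Returns 1 if mapping the backing pages back-to-back (as vmap()
 * would) yields a linear view of the data: only the first segment may
 * start mid-page and only the last may end mid-page; anything else
 * leaves holes inside the mapped range. */
static int segs_linearizable(const struct seg *s, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (i > 0 && s[i].offset != 0)
			return 0;
		if (i < n - 1 && s[i].offset + s[i].len != TOY_PAGE_SIZE)
			return 0;
	}
	return 1;
}

/* Self-check: partial first/last pages are fine; a short middle
 * segment is not. */
static int segs_demo(void)
{
	struct seg ok[3] = { { 100, 3996 }, { 0, 4096 }, { 0, 50 } };
	struct seg bad[2] = { { 0, 2048 }, { 0, 2048 } }; /* hole after 0 */

	return (segs_linearizable(ok, 3) == 1 &&
		segs_linearizable(bad, 2) == 0 &&
		segs_linearizable(ok, 1) == 1) ? 0 : -1;
}
```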
>
> With zswap being the only current user, which uses a single contiguous
> buffer for out-of-place decompression and blocks on the completion, the
> level of additional complexity we have in the acomp stack is mind
> boggling. And the scomp-to-acomp adaptation layer, with its fixed-size
> per-CPU input and output buffers (implying that acomp input/output has
> a hardcoded size limit) which are never freed, makes it rather
> unpalatable to me tbh.
Either way, I think the way out may be that zswap should support *both* the
scomp and acomp APIs. It should use scomp by default, and if someone *really*
wants to use a hardware (de)compression accelerator, they can configure it to
use acomp.
Also, I see that crypto/scompress.c currently allocates scratch buffers (256KB
per CPU!) whenever any crypto_scomp tfm is initialized. That seems wrong. It
should only happen when a crypto_acomp tfm that is backed by a crypto_scomp tfm
is initialized. Then, if acomp is not used, all this craziness will be avoided.
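A minimal sketch of that lifetime rule, assuming a refcount tied to scomp-backed acomp tfms (all names hypothetical, and the real code would of course need locking around the counter):

```c
#include <assert.h>
#include <stdlib.h>

#define SCRATCH_SIZE (128 * 1024)	/* stand-in for per-CPU buffers */

static void *scratch;
static int scratch_users;

/* Allocate the scratch space only when the first scomp-backed acomp
 * tfm comes up, not on every scomp tfm init. */
static int scomp_backed_acomp_init(void)
{
	if (scratch_users++ == 0) {
		scratch = malloc(SCRATCH_SIZE);
		if (!scratch) {
			scratch_users = 0;
			return -1;
		}
	}
	return 0;
}

/* Free it again when the last such tfm goes away, so pure scomp users
 * (or systems with no compression at all) pay nothing. */
static void scomp_backed_acomp_exit(void)
{
	if (--scratch_users == 0) {
		free(scratch);
		scratch = NULL;
	}
}

/* Self-check: scratch exists exactly while at least one user is live. */
static int scratch_demo(void)
{
	if (scomp_backed_acomp_init() || scomp_backed_acomp_init())
		return -1;
	if (!scratch || scratch_users != 2)
		return -1;
	scomp_backed_acomp_exit();
	if (!scratch || scratch_users != 1)
		return -1;
	scomp_backed_acomp_exit();
	return (scratch == NULL && scratch_users == 0) ? 0 : -1;
}
```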
- Eric
On Wed, 28 Jun 2023 at 19:33, Eric Biggers <[email protected]> wrote:
>
> On Wed, Jun 28, 2023 at 06:58:58PM +0200, Ard Biesheuvel wrote:
> > On Wed, 28 Jun 2023 at 08:21, Eric Biggers <[email protected]> wrote:
> > >
> > > On Mon, Jun 26, 2023 at 06:13:44PM +0800, Herbert Xu wrote:
> > > > On Mon, Jun 26, 2023 at 12:03:04PM +0200, Ard Biesheuvel wrote:
> > > > >
> > > > > In any case, what I would like to see addressed is the horrid scomp to
> > > > > acomp layer that ties up megabytes of memory in scratch space, just to
> > > > > emulate the acomp interface on top of scomp drivers, while no code
> > > > > exists that makes use of the async nature. Do you have an idea on how
> > > > > we might address this particular issue?
> > > >
> > > > The whole reason we need to allocate megabytes of memory is because
> > > > of the lack of SG lists in the underlying algorithm. If they
> > > > actually used SG lists and allocated pages as they went during
> > > > decompression, then we wouldn't need to pre-allocate any memory
> > > > at all.
> > >
> > > I don't think that is a realistic expectation. Decompressors generally need a
> > > contiguous buffer for decompressed data anyway, up to a certain size which is
> > > 32KB for DEFLATE but can be much larger for the more modern algorithms. This is
> > > because they decode "matches" that refer to previously decompressed data by
> > > offset, and it has to be possible to index the data efficiently.
> > >
> > > (Some decompressors, e.g. zlib, provide "streaming" APIs where you can read
> > > arbitrary amounts. But that works by actually decompressing into an internal
> > > buffer that has sufficient size, then copying to the user provided buffer.)
> > >
> > > The same applies to compressors too, with regards to the original data.
> > >
> > > I think the "input/output is a list of pages" model just fundamentally does not
> > > work well for software compression and decompression. To support it, either
> > > large temporary buffers are needed (they might be hidden inside the
> > > (de)compressor, but they are there), or vmap() or vm_map_ram() is needed.
> > >
> > > FWIW, f2fs compression uses vm_map_ram() and skips the crypto API entirely...
> > >
> > > If acomp has to be kept for the hardware support, then maybe its scomp backend
> > > should use vm_map_ram() instead of scratch buffers?
> > >
> >
> > Yeah, but we'll run into similar issues related to the fact that
> > scatterlists can describe arbitrary sequences of sub-page size memory
> > chunks, which means vmap()ing the pages may not be sufficient to get a
> > virtual linear representation of the buffers.
>
> Yes, that is annoying... Maybe the acomp API should not support arbitrary
> scatterlists, but rather only ones that can be mapped contiguously?
>
Herbert seems to think that this could be useful for IPcomp as well,
so it would depend on whether that would also work under this
restriction.
> >
> > With zswap being the only current user, which uses a single contiguous
> > buffer for out-of-place decompression and blocks on the completion, the
> > level of additional complexity we have in the acomp stack is mind
> > boggling. And the scomp-to-acomp adaptation layer, with its fixed-size
> > per-CPU input and output buffers (implying that acomp input/output has
> > a hardcoded size limit) which are never freed, makes it rather
> > unpalatable to me tbh.
>
> Either way, I think the way out may be that zswap should support *both* the
> scomp and acomp APIs. It should use scomp by default, and if someone *really*
> wants to use a hardware (de)compression accelerator, they can configure it to
> use acomp.
>
Both the sole acomp h/w implementation and the zswap acomp conversion
were contributed by HiSilicon, and the code is synchronous.
So perhaps that driver should expose (blocking) scomp as well as
acomp, and zswap can just use scomp with virtual addresses.
> Also, I see that crypto/scompress.c currently allocates scratch buffers (256KB
> per CPU!) whenever any crypto_scomp tfm is initialized. That seems wrong. It
> should only happen when a crypto_acomp tfm that is backed by a crypto_scomp tfm
> is initialized. Then, if acomp is not used, all this craziness will be avoided.
>
Yeah, this is why I am trying to do something about this. I have a
32-core, 4-way SMT ThunderX2 workstation, so 32 MiB of physical memory
is permanently tied up this way, which is just silly. (IIRC this is
due to pstore compression allocating a compression TFM up front so it
can compress dmesg on a panic - I've already argued with Kees that
pstore's selection of compression algorithms is rather pointless in
any case, and that it should just use the GZIP library.)
On Wed, 28 Jun 2023 at 10:45, Ard Biesheuvel <[email protected]> wrote:
>
> Both the sole acomp h/w implementation and the zswap acomp conversion
> were contributed by HiSilicon, and the code is synchronous.
I really think people need to realize that the age of async external
accelerators is long long gone.
Yes, it was cool in the 80s.
But dammit, so was big hair, mullets, and Sony Walkmans.
Just give it up. It's a complete failure, and it is not making a come-back.
What is still relevant is:
(a) inline accelerators
Doing checksumming, big packets, encryption etc ON THE NETWORK
CARD as it is being sent out and received is still very much relevant.
But this crazy "CPU gives data to external accelerator, does
something else, and is notified of the result and goes back to look at
it" model is bogus and completely wrong.
(b) synchronous CPU accelerated routines
Whether this is using just vector instructions, or special
hardware, they are synchronous, and they use CPU virtual addresses
directly.
and nothing else matters.
Christ, people, Metallica even wrote one of the greatest songs of all
time about it! How much more proof do you need?
I seriously believe that the crypto layer should basically aim to get
rid of the silly async interface entirely. It adds no real value, and
it has caused endless amounts of pain.
Linus
What about something like the Intel on-die accelerators (e.g. IAA and QAT)? I
think they can do async compression.
David
On Wed, 28 Jun 2023 at 11:34, David Howells <[email protected]> wrote:
>
> What about something like the Intel on-die accelerators (e.g. IAA and QAT)? I
> think they can do async compression.
I'm sure they can. And for some made-up benchmark it might even help.
Do people use it in real life?
The *big* wins come from being able to do compression/encryption
inline, when you don't need to do double-buffering etc.
Anything else is completely broken, imnsho. Once you need to
double-buffer your IO, you've already lost the whole point.
Linus
Hi Linus,
On 2023/6/29 04:10, Linus Torvalds wrote:
> On Wed, 28 Jun 2023 at 11:34, David Howells <[email protected]> wrote:
>>
>> What about something like the Intel on-die accelerators (e.g. IAA and QAT)? I
>> think they can do async compression.
>
> I'm sure they can. And for some made-up benchmark it might even help.
> Do people use it in real life?
>
> The *big* wins come from being able to do compression/encryption
> inline, when you don't need to do double-buffering etc.
>
> Anything else is completely broken, imnsho. Once you need to
> double-buffer your IO, you've already lost the whole point.
I'm not sure I can say much about this for now, but we're slowly
evaluating the Intel IAA built-in DEFLATE engine for our cloud
workloads, and we don't have end-to-end numbers yet.
Inline storage accelerators are great, especially for inline
{en,de}cryption, since that consumes very little on-chip memory, but
afaik the story for inline (de)compression engines is different: they
need more SRAM space for the LZ sliding windows used for matching
(e.g. 32 KB per channel for DEFLATE, 64 KB per channel for LZ4, and
much more for Zstd, LZMA, etc. - I think those are quite expensive to
integrate), in addition to some extra memory for Huffman/FSE tables.
So in production, inline (de)compression accelerators are rarely seen
as part of storage, at least in the end-consumer market.
Intel already has its on-die accelerators (IAA very recently). Yes,
it still needs double-buffered I/O, but we're considering using it at
least for async writeback/readahead as a start for bulk async I/O,
which is not very latency sensitive. Intel has also shown their zswap
work [1], but I won't dive into that since I'm only focusing on
storage use cases.
As for the current crypto APIs (whether acomp or scomp), I won't say
much more either, since I mostly agree with what Eric said
previously.
Thanks,
Gao Xiang
[1] https://lore.kernel.org/r/[email protected]/