2019-05-23 19:32:57

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> -----Original Message-----
> From: Eric Biggers [mailto:[email protected]]
> Sent: Thursday, May 23, 2019 8:59 PM
> To: Pascal Van Leeuwen <[email protected]>
> Cc: [email protected]
> Subject: Re: another testmgr question
>
> On Thu, May 23, 2019 at 01:07:25PM +0000, Pascal Van Leeuwen wrote:
> > Eric,
> >
> > I'm running into some trouble with some random vectors that do *zero*
> > length operations. Now you can go all formal about how the API does
> > not explictly disallow this, but how much sense does it really make
> > to essentially encrypt, hash or authenticate absolutely *nothing*?
> >
> > It makes so little sense that we never bothered to support it in any
> > of our hardware developed over the past two decades ... and no
> > customer has ever complained about this, to the best of my knowledge.
> >
> > Can't you just remove those zero length tests?
> >
> > Regards,
> >
> > Pascal van Leeuwen
> > Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
> > http://www.insidesecure.com
>
> Please send this to linux-crypto (not linux-crypto-owner) so this can be
> discussed on the list.
>

Oops ... looks like I've been copy-pasting the wrong e-mail list address.
Mea culpa. Now forwarded to the correct list.

Regards,

Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com



2019-05-23 20:06:57

by Eric Biggers

[permalink] [raw]
Subject: Re: another testmgr question

On Thu, May 23, 2019 at 01:07:25PM +0000, Pascal Van Leeuwen wrote:
> Eric,
>
> I'm running into some trouble with some random vectors that do *zero*
> length operations. Now you can go all formal about how the API does
> not explictly disallow this, but how much sense does it really make
> to essentially encrypt, hash or authenticate absolutely *nothing*?
>
> It makes so little sense that we never bothered to support it in any
> of our hardware developed over the past two decades ... and no
> customer has ever complained about this, to the best of my knowledge.
>
> Can't you just remove those zero length tests?
>

For hashes this is absolutely a valid case. Try this:

$ touch file
$ sha256sum file
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 file

That shows the SHA-256 digest of the empty message.

For AEADs it's a valid case too. You still get an authenticated ciphertext even
if the plaintext and/or AAD are empty, telling you that the (plaintext, AAD)
pair is authentically from someone with the key.

It's really only skciphers (length preserving encryption) where it's
questionable, since for those an empty input can only map to an empty output.

Regardless of what we do, I think it's really important that the behavior is
*consistent*, so users see the same behavior no matter what implementation of
the algorithm is used.

Allowing empty messages works out naturally for most skcipher implementations,
and it also conceptually simplifies the length restrictions of the API (e.g. for
most block cipher modes: just need nbytes % blocksize == 0, as opposed to that
*and* nbytes != 0). So that seems to be how we ended up with it.

If we do change this, IMO we need to make it the behavior for all
implementations, not make it implementation-defined.

Note that it's not necessary that your *hardware* supports empty messages, since
you can simply do this in the driver instead:

if (req->cryptlen == 0)
return 0;

- Eric

2019-05-23 21:46:00

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> -----Original Message-----
> From: Eric Biggers [mailto:[email protected]]
> Sent: Thursday, May 23, 2019 10:06 PM
> To: Pascal Van Leeuwen <[email protected]>
> Cc: [email protected]
> Subject: Re: another testmgr question
>
> On Thu, May 23, 2019 at 01:07:25PM +0000, Pascal Van Leeuwen wrote:
> > Eric,
> >
> > I'm running into some trouble with some random vectors that do *zero*
> > length operations. Now you can go all formal about how the API does
> > not explictly disallow this, but how much sense does it really make
> > to essentially encrypt, hash or authenticate absolutely *nothing*?
> >
> > It makes so little sense that we never bothered to support it in any
> > of our hardware developed over the past two decades ... and no
> > customer has ever complained about this, to the best of my knowledge.
> >
> > Can't you just remove those zero length tests?
> >
>
> For hashes this is absolutely a valid case. Try this:
>
> $ touch file
> $ sha256sum file
> e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 file
>
> That shows the SHA-256 digest of the empty message.
>
Valid? A totally fabricated case, if you ask me. Yes, you could do that,
but is it *useful* at all? Really?
No, it's not because a file of length 0 is a file of length 0, the length
in itself is sufficient guarantee of its contents. The hash does not add
*anything* in this case. It's a constant anyway, the same value for *any*
zero-length file. It doesn't tell you anything you didn't already know.
IMHO the tool should just return a message stating "hashing an empty file
does not make any sense at all ...".

>
> For AEADs it's a valid case too. You still get an authenticated ciphertext
> even
> if the plaintext and/or AAD are empty, telling you that the (plaintext, AAD)
> pair is authentically from someone with the key.
>
Again, you could *theoretically* do that, but I don't know of any *practicle*
use case (protocol, application) where you can have *and* 0 length AAD *and*
0 length payload (but do correct me when I'm wrong there!)
In any case, that would result in a value *only* depending on the key (same
thing applies to zero-length HMAC), which is likely some sort of security
risk anyway.

As I mentioned before, we made a lot of hashing and authentication hardware
over the past 20+ years that has never been capable of doing zero-length
operations and this has *never* proved to be a problem to *anyone*. Not a
single support question has *ever* been raised on the subject.

>
> It's really only skciphers (length preserving encryption) where it's
> questionable, since for those an empty input can only map to an empty output.
>
True, although that's also the least problematic case to handle.
Doing nothing at all is not so hard ...

> Regardless of what we do, I think it's really important that the behavior is
> *consistent*, so users see the same behavior no matter what implementation of
> the algorithm is used.
>
Consistency should only be required for *legal* ranges of input parameters.
Which then obviously need to be properly specified somewhere.
It should be fair to put some reasonable restrictions on these inputs as to
not burden implementions with potentially difficult to handle fringe cases.

> Allowing empty messages works out naturally for most skcipher
> implementations,
> and it also conceptually simplifies the length restrictions of the API (e.g.
> for
> most block cipher modes: just need nbytes % blocksize == 0, as opposed to
> that
> *and* nbytes != 0). So that seems to be how we ended up with it.
>
I fail to see the huge burden of the extra len>0 restriction.
Especially if you compare it to the burden of adding all the code for
handling such useless exception cases to all individual drivers.

> If we do change this, IMO we need to make it the behavior for all
> implementations, not make it implementation-defined.
>
I don't see how disallowing 0 length inputs would affect implementations that
ARE capable of processing them. Unless you would require them to start
throwing errors for these cases, which I'm not suggesting.

> Note that it's not necessary that your *hardware* supports empty messages,
> since you can simply do this in the driver instead:
>
> if (req->cryptlen == 0)
> return 0;
>
For skciphers, yes, it's not such a problem. Neither for basic hash.
(And thanks for the code suggestion BTW, this will be a lot more efficient
then what I'm doing now for this particular case :-)
For HMAC, however, where you would have to return a value depending on the
key ... not so easy to solve. I don't have a solution for that yet :-(

And I'm pretty sure this affects all Inside Secure HW drivers in the tree:
inside-secure, amcc, mediatek and omap ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines, Inside Secure
http://www.insidesecure.com

2019-05-23 23:50:06

by Eric Biggers

[permalink] [raw]
Subject: Re: another testmgr question

On Thu, May 23, 2019 at 09:43:53PM +0000, Pascal Van Leeuwen wrote:
> > -----Original Message-----
> > From: Eric Biggers [mailto:[email protected]]
> > Sent: Thursday, May 23, 2019 10:06 PM
> > To: Pascal Van Leeuwen <[email protected]>
> > Cc: [email protected]
> > Subject: Re: another testmgr question
> >
> > On Thu, May 23, 2019 at 01:07:25PM +0000, Pascal Van Leeuwen wrote:
> > > Eric,
> > >
> > > I'm running into some trouble with some random vectors that do *zero*
> > > length operations. Now you can go all formal about how the API does
> > > not explictly disallow this, but how much sense does it really make
> > > to essentially encrypt, hash or authenticate absolutely *nothing*?
> > >
> > > It makes so little sense that we never bothered to support it in any
> > > of our hardware developed over the past two decades ... and no
> > > customer has ever complained about this, to the best of my knowledge.
> > >
> > > Can't you just remove those zero length tests?
> > >
> >
> > For hashes this is absolutely a valid case. Try this:
> >
> > $ touch file
> > $ sha256sum file
> > e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 file
> >
> > That shows the SHA-256 digest of the empty message.
> >
> Valid? A totally fabricated case, if you ask me. Yes, you could do that,
> but is it *useful* at all? Really?
> No, it's not because a file of length 0 is a file of length 0, the length
> in itself is sufficient guarantee of its contents. The hash does not add
> *anything* in this case. It's a constant anyway, the same value for *any*
> zero-length file. It doesn't tell you anything you didn't already know.
> IMHO the tool should just return a message stating "hashing an empty file
> does not make any sense at all ...".
>

Of course it's useful. It means that *every* possible file has a SHA-256
digest. So if you're validating a file, you just check the SHA-256 digest; or
if you're creating a manifest, you just hash the file and list the SHA-256
digest. Making everyone handle empty files specially would be insane.

> >
> > For AEADs it's a valid case too. You still get an authenticated ciphertext
> > even
> > if the plaintext and/or AAD are empty, telling you that the (plaintext, AAD)
> > pair is authentically from someone with the key.
> >
> Again, you could *theoretically* do that, but I don't know of any *practicle*
> use case (protocol, application) where you can have *and* 0 length AAD *and*
> 0 length payload (but do correct me when I'm wrong there!)
> In any case, that would result in a value *only* depending on the key (same
> thing applies to zero-length HMAC), which is likely some sort of security
> risk anyway.
>
> As I mentioned before, we made a lot of hashing and authentication hardware
> over the past 20+ years that has never been capable of doing zero-length
> operations and this has *never* proved to be a problem to *anyone*. Not a
> single support question has *ever* been raised on the subject.
>

The standard attack model for MACs assumes the attacker can send an arbitrary
(message, MAC) pair. Depending on the protocol there may be nothing preventing
them from sending an empty message, e.g. maybe it's just a file on the
filesystem which can be empty. So it makes perfect sense for the HMAC of an
empty message to be defined so that it can be checked without a special case for
empty messages, and indeed the HMAC specification
(https://csrc.nist.gov/csrc/media/publications/fips/198/1/final/documents/fips-198-1_final.pdf)
clearly says that 0 is an allowed input length. Note that the algorithmic
description of HMAC handles this case naturally; indeed, it would be a special
case if 0 were *not* allowed.

Essentially the same applies for AEADs.

> >
> > It's really only skciphers (length preserving encryption) where it's
> > questionable, since for those an empty input can only map to an empty output.
> >
> True, although that's also the least problematic case to handle.
> Doing nothing at all is not so hard ...
>
> > Regardless of what we do, I think it's really important that the behavior is
> > *consistent*, so users see the same behavior no matter what implementation of
> > the algorithm is used.
> >
> Consistency should only be required for *legal* ranges of input parameters.
> Which then obviously need to be properly specified somewhere.
> It should be fair to put some reasonable restrictions on these inputs as to
> not burden implementions with potentially difficult to handle fringe cases.
>

People can develop weird dependencies on corner cases of APIs, so it's best to
avoid cases where the behavior differs depending on which implementation of the
API is used. So far it's been pretty straightforward to get all the crypto
implementations consistent, so IMO we should simply continue to do that.

What might make sense is moving more checks into the common code so that
implementations need to handle less, e.g. see how
https://patchwork.kernel.org/patch/10949189/ proposed to check the message
length alignment for skciphers (though that particular patch is broken as-is).

- Eric

2019-05-24 05:25:19

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: another testmgr question

On Thu, 23 May 2019 at 22:44, Pascal Van Leeuwen
<[email protected]> wrote:
>
> > -----Original Message-----
> > From: Eric Biggers [mailto:[email protected]]
> > Sent: Thursday, May 23, 2019 10:06 PM
> > To: Pascal Van Leeuwen <[email protected]>
> > Cc: [email protected]
> > Subject: Re: another testmgr question
> >
> > On Thu, May 23, 2019 at 01:07:25PM +0000, Pascal Van Leeuwen wrote:
> > > Eric,
> > >
> > > I'm running into some trouble with some random vectors that do *zero*
> > > length operations. Now you can go all formal about how the API does
> > > not explictly disallow this, but how much sense does it really make
> > > to essentially encrypt, hash or authenticate absolutely *nothing*?
> > >
> > > It makes so little sense that we never bothered to support it in any
> > > of our hardware developed over the past two decades ... and no
> > > customer has ever complained about this, to the best of my knowledge.
> > >
> > > Can't you just remove those zero length tests?
> > >
> >
> > For hashes this is absolutely a valid case. Try this:
> >
> > $ touch file
> > $ sha256sum file
> > e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 file
> >
> > That shows the SHA-256 digest of the empty message.
> >
> Valid? A totally fabricated case, if you ask me. Yes, you could do that,
> but is it *useful* at all? Really?

Yes, really. The likelihood of a test vector occurring in practice is
entirely irrelevant. What matters is that the test vectors provide
known outputs for known inputs, and many algorithm specifications
explicitly contain the empty message as one of the documented test
vectors.

In fact, given the above, I am slightly shocked that your hardware
does not handle empty messages correctly. Are you sure it is a
hardware problem and not a driver problem?

In any case, as Eric points out as well, nothing is stopping you from
adding a special case to your driver that falls back to the software
implementation for known broken test cases.

Removing test cases from the set because of broken hardware is out of
the question IMO. It doesn't actually fix the problem, and it may
actually result in breakage, especially for h/w accelerated crypto
exposed to userland, of which we have no idea whatsoever how it is
being used, and whether correct handling of zero length vectors is
likely to break anything or not.

> No, it's not because a file of length 0 is a file of length 0, the length
> in itself is sufficient guarantee of its contents. The hash does not add
> *anything* in this case. It's a constant anyway, the same value for *any*
> zero-length file. It doesn't tell you anything you didn't already know.
> IMHO the tool should just return a message stating "hashing an empty file
> does not make any sense at all ...".
>

You are making assumptions about how the crypto is being used at a
higher level. Eric's example may not make sense to you, but arguing
that *any* use of sha256sum on empty files is guaranteed to be
non-sensical in all imaginable cases is ridiculous.


> >
> > For AEADs it's a valid case too. You still get an authenticated ciphertext
> > even
> > if the plaintext and/or AAD are empty, telling you that the (plaintext, AAD)
> > pair is authentically from someone with the key.
> >
> Again, you could *theoretically* do that, but I don't know of any *practicle*
> use case (protocol, application) where you can have *and* 0 length AAD *and*
> 0 length payload (but do correct me when I'm wrong there!)
> In any case, that would result in a value *only* depending on the key (same
> thing applies to zero-length HMAC), which is likely some sort of security
> risk anyway.
>
> As I mentioned before, we made a lot of hashing and authentication hardware
> over the past 20+ years that has never been capable of doing zero-length
> operations and this has *never* proved to be a problem to *anyone*. Not a
> single support question has *ever* been raised on the subject.
>

Again, that is shocking to me, since it means nobody has ever bothered
to run, e.g., the documented SHA-xxx test vectors on them. Weird.

> >
> > It's really only skciphers (length preserving encryption) where it's
> > questionable, since for those an empty input can only map to an empty output.
> >
> True, although that's also the least problematic case to handle.
> Doing nothing at all is not so hard ...
>
> > Regardless of what we do, I think it's really important that the behavior is
> > *consistent*, so users see the same behavior no matter what implementation of
> > the algorithm is used.
> >
> Consistency should only be required for *legal* ranges of input parameters.
> Which then obviously need to be properly specified somewhere.
> It should be fair to put some reasonable restrictions on these inputs as to
> not burden implementions with potentially difficult to handle fringe cases.
>
> > Allowing empty messages works out naturally for most skcipher
> > implementations,
> > and it also conceptually simplifies the length restrictions of the API (e.g.
> > for
> > most block cipher modes: just need nbytes % blocksize == 0, as opposed to
> > that
> > *and* nbytes != 0). So that seems to be how we ended up with it.
> >
> I fail to see the huge burden of the extra len>0 restriction.
> Especially if you compare it to the burden of adding all the code for
> handling such useless exception cases to all individual drivers.
>
> > If we do change this, IMO we need to make it the behavior for all
> > implementations, not make it implementation-defined.
> >
> I don't see how disallowing 0 length inputs would affect implementations that
> ARE capable of processing them. Unless you would require them to start
> throwing errors for these cases, which I'm not suggesting.
>
> > Note that it's not necessary that your *hardware* supports empty messages,
> > since you can simply do this in the driver instead:
> >
> > if (req->cryptlen == 0)
> > return 0;
> >
> For skciphers, yes, it's not such a problem. Neither for basic hash.
> (And thanks for the code suggestion BTW, this will be a lot more efficient
> then what I'm doing now for this particular case :-)
> For HMAC, however, where you would have to return a value depending on the
> key ... not so easy to solve. I don't have a solution for that yet :-(
>
> And I'm pretty sure this affects all Inside Secure HW drivers in the tree:
> inside-secure, amcc, mediatek and omap ...
>
> Regards,
> Pascal van Leeuwen
> Silicon IP Architect, Multi-Protocol Engines, Inside Secure
> http://www.insidesecure.com

2019-05-24 06:15:56

by Jeffrey Walton

[permalink] [raw]
Subject: Re: another testmgr question

On Thu, May 23, 2019 at 4:06 PM Eric Biggers <[email protected]> wrote:
>
> On Thu, May 23, 2019 at 01:07:25PM +0000, Pascal Van Leeuwen wrote:
> >
> > I'm running into some trouble with some random vectors that do *zero*
> > length operations. Now you can go all formal about how the API does
> > not explictly disallow this, but how much sense does it really make
> > to essentially encrypt, hash or authenticate absolutely *nothing*?
> >
> > It makes so little sense that we never bothered to support it in any
> > of our hardware developed over the past two decades ... and no
> > customer has ever complained about this, to the best of my knowledge.
> >
> > Can't you just remove those zero length tests?
>
> For hashes this is absolutely a valid case. Try this:
>
> $ touch file
> $ sha256sum file
> e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 file
>
> That shows the SHA-256 digest of the empty message.
>
> For AEADs it's a valid case too. You still get an authenticated ciphertext even
> if the plaintext and/or AAD are empty, telling you that the (plaintext, AAD)
> pair is authentically from someone with the key.
>
> It's really only skciphers (length preserving encryption) where it's
> questionable, since for those an empty input can only map to an empty output.
>
> Regardless of what we do, I think it's really important that the behavior is
> *consistent*, so users see the same behavior no matter what implementation of
> the algorithm is used.
>
> Allowing empty messages works out naturally for most skcipher implementations,
> and it also conceptually simplifies the length restrictions of the API (e.g. for
> most block cipher modes: just need nbytes % blocksize == 0, as opposed to that
> *and* nbytes != 0). So that seems to be how we ended up with it.
>
> If we do change this, IMO we need to make it the behavior for all
> implementations, not make it implementation-defined.
>
> Note that it's not necessary that your *hardware* supports empty messages, since
> you can simply do this in the driver instead:
>
> if (req->cryptlen == 0)
> return 0;

+1. It seems like a firmware update in the hardware or a software
update to the driver are the ways to proceed.

Why isn't the driver able to work around the hardware bugs?

I don't think it is wise to remove tests from the Test Manager.

Jeff

2019-05-24 06:22:02

by Christophe Leroy

[permalink] [raw]
Subject: Re: another testmgr question

Hi Pascal,

Le 23/05/2019 à 23:43, Pascal Van Leeuwen a écrit :
>> -----Original Message-----
>> From: Eric Biggers [mailto:[email protected]]

[...]

>> Note that it's not necessary that your *hardware* supports empty messages,
>> since you can simply do this in the driver instead:
>>
>> if (req->cryptlen == 0)
>> return 0;
>>
> For skciphers, yes, it's not such a problem. Neither for basic hash.
> (And thanks for the code suggestion BTW, this will be a lot more efficient
> then what I'm doing now for this particular case :-)
> For HMAC, however, where you would have to return a value depending on the
> key ... not so easy to solve. I don't have a solution for that yet :-(

I had the same issue when porting the SEC2 Talitos driver to also
support SEC1. See following commit to see the way it has been fixed:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d02905ebd22c0271a25e424ab209c8b7067be67

Christophe

>
> And I'm pretty sure this affects all Inside Secure HW drivers in the tree:
> inside-secure, amcc, mediatek and omap ...
>
> Regards,
> Pascal van Leeuwen
> Silicon IP Architect, Multi-Protocol Engines, Inside Secure
> http://www.insidesecure.com
>

2019-05-24 07:05:16

by Kamil Konieczny

[permalink] [raw]
Subject: Re: another testmgr question

On 24.05.2019 07:24, Ard Biesheuvel wrote:
> On Thu, 23 May 2019 at 22:44, Pascal Van Leeuwen
> <[email protected]> wrote:
>>
> [...]
> In fact, given the above, I am slightly shocked that your hardware
> does not handle empty messages correctly. [...]

Imho hardware handles it in the way it was designed, for Exynos
writing non-zero value to len message register starts hash/hmac.
Writing zero and waiting for result will cause hang, as there will be
no result ever.
It can be done in HW to handle zero with additional register and more wires,
but it was left to software driver as trivial case.

--
Best regards,
Kamil Konieczny
Samsung R&D Institute Poland

2019-05-24 07:47:42

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> > Valid? A totally fabricated case, if you ask me. Yes, you could do that,
> > but is it *useful* at all? Really?
>
> Yes, really. The likelihood of a test vector occurring in practice is
> entirely irrelevant. What matters is that the test vectors provide
> known outputs for known inputs, and many algorithm specifications
> explicitly contain the empty message as one of the documented test
> vectors.
>
I wholeheartedly disagree - in the context of hardware design, anyway.
When you implement something, you care about practicle usability.
Wasting gates / timing / power / cycles / whatever on something that
has no use relevant use case is just silly. (In the particular case of
zero length messages, this is not so trivial to implement in a data
driven HW architecture! Data driven ... But no data ...)
As long as you properly specify your hardware's limitations, it should
not need to be an issue. If the HW doesn't match your use case, then
don't use it for that ... HW almost always has such limitations.

For FIPS certification, zero length vectors are considered *optional*.
Probably because they realized most HW out there can't do it ...

> In fact, given the above, I am slightly shocked that your hardware
> does not handle empty messages correctly. Are you sure it is a
> hardware problem and not a driver problem?
>
As a matter of fact, pretty sure, yeah, as I'm actually "the HW guy".

Nothing really shocking about that. We mainly do network protocol
acceleration anyway, being able to do some basic operations is just a
bonus. We do specify very extensively what we can and cannot support.

> In any case, as Eric points out as well, nothing is stopping you from
> adding a special case to your driver that falls back to the software
> implementation for known broken test cases.
>
Sure. But my point is that you end up with a driver full of special cases.
Which is just bloat slowing it down and blowing it up.

Besides, in order to be able to fallback to software for this case
I would have to maintain a shadow software context for *every* HW context
just because someone *might* do some zero length operation somewhere in
the future. Because by the time the context is established (setkey),
I cannot predict that I will get such a zero length request yet. Yuck.

> Removing test cases from the set because of broken hardware is out of
> the question IMO. It doesn't actually fix the problem, and it may
> actually result in breakage, especially for h/w accelerated crypto
> exposed to userland, of which we have no idea whatsoever how it is
> being used, and whether correct handling of zero length vectors is
> likely to break anything or not.
>
The driver could check for it and return an -EINVAL error code.
That would not break anything besides the application itself trying
to do this. Which could then fail gracefully.

> > No, it's not because a file of length 0 is a file of length 0, the length
> > in itself is sufficient guarantee of its contents. The hash does not add
> > *anything* in this case. It's a constant anyway, the same value for *any*
> > zero-length file. It doesn't tell you anything you didn't already know.
> > IMHO the tool should just return a message stating "hashing an empty file
> > does not make any sense at all ...".
> >
>
> You are making assumptions about how the crypto is being used at a
> higher level. Eric's example may not make sense to you, but arguing
> that *any* use of sha256sum on empty files is guaranteed to be
> non-sensical in all imaginable cases is ridiculous.
>
Actually, the thought occurred to me in the shower this morning that that
*might* be useful in case you don't have expectations of the length
whatsoever and only know the expected sum. In which case that would
validate that a zero-length file is indeed what you should have ...

Still - is that something the crypto API is *currently* being used for?
And if not, do we ever *intend* to use it for something like that?
If not, then we could just specify it to be illegal such that no one
would ever attempt to do it. You have that freedom as long as there are
no existing applications depending on it ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines

Tel. : +31 (0)73 65 81 900

http://www.insidesecure.com

2019-05-24 08:01:07

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> > you can simply do this in the driver instead:
> >
> > if (req->cryptlen == 0)
> > return 0;
>
> +1. It seems like a firmware update in the hardware or a software
> update to the driver are the ways to proceed.
>

Hardware typically does not involve firmware for things like this.
And you cannot update silicon.

>
> Why isn't the driver able to work around the hardware bugs?
>

1) It's NOT a bug. A bug is something that is not intentional, but
this is well specified behavior of the hardware.
Hardware always has limitations for the simple reason that some
things are simply too complex, costly or risky to implement in HW.

2) Of course you can always workaround things in the driver, but
breaking out all those exception cases slows down the normal cases.
Plus it adds tremendous complexity and bloat to the driver itself.

There should be SOME margin for not having to support everything
but the kitchen sink. Also keeping in mind that most of this HW
existed long before the Crypto API was even conceived.

> I don't think it is wise to remove tests from the Test Manager.
>
Well, in this particular case they are tests intended for development
and not for production use, so I guess it's OK to keep them for that.
Although I would like to see the testing continue if a test fails,
such that I have the option to ignore the ones I don't care about ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-24 08:06:22

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

>
> I had the same issue when porting the SEC2 Talitos driver to also
> support SEC1. See following commit to see the way it has been fixed:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id
> =2d02905ebd22c0271a25e424ab209c8b7067be67
>
> Christophe
>
Cristophe,

Thanks for the help! I just had a quick look and this indeed looks like a good
solution that I could use for my driver as well.

(Actually, just before reading your mail I had a discussion with a coworker
here and he came up with the exact same solution. But still: thanks :-)

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-24 08:45:22

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> > Valid? A totally fabricated case, if you ask me. Yes, you could do that,
> > but is it *useful* at all? Really?
> > No, it's not because a file of length 0 is a file of length 0, the length
> > in itself is sufficient guarantee of its contents. The hash does not add
> > *anything* in this case. It's a constant anyway, the same value for *any*
> > zero-length file. It doesn't tell you anything you didn't already know.
> > IMHO the tool should just return a message stating "hashing an empty file
> > does not make any sense at all ...".
> >
>
> Of course it's useful. It means that *every* possible file has a SHA-256
> digest. So if you're validating a file, you just check the SHA-256 digest;
> or
> if you're creating a manifest, you just hash the file and list the SHA-256
> digest. Making everyone handle empty files specially would be insane.
>
As I already mentioned in another thread somewhere, this morning in the
shower I realised that this may be useful if you have no expectation of
the length itself. But it's still a pretty specific use case which was
never considered for our hardware. And our HW doesn't seem to be alone in
this.
Does shaXXXsum or md5sum use the kernel crypto API though?

>
> The standard attack model for MACs assumes the attacker can send an arbitrary
> (message, MAC) pair. Depending on the protocol there may be nothing
> preventing
> them from sending an empty message, e.g. maybe it's just a file on the
> filesystem which can be empty. So it makes perfect sense for the HMAC of an
> empty message to be defined so that it can be checked without a special case
> for> empty messages, and indeed the HMAC specification
> (https://csrc.nist.gov/csrc/media/publications/fips/198/1/final/documents/fip
> s-198-1_final.pdf)
> clearly says that 0 is an allowed input length. Note that the algorithmic
> description of HMAC handles this case naturally; indeed, it would be a
> special case if 0 were *not* allowed.
>
> Essentially the same applies for AEADs.
>
Oh, it's defined all right. I never argued that it wasn't.
But just because the *algorithm* allows it doesn't necessarily mean that you
*have* to support it from your API. Even FIPS allows for not supporting zero
length for validation - zero length support is optional there.

And I'm just not aware of any real use case - usually some standard non-zero
header is included as AAD - so why bother going out of your way to support it.
If something fails because of this, you can always fix it right there and then.

>
> People can develop weird dependencies on corner cases of APIs, so it's best
> to
> avoid cases where the behavior differs depending on which implementation of
> the
> API is used. So far it's been pretty straightforward to get all the crypto
> implementations consistent, so IMO we should simply continue to do that.
>
I want to bet the people having to implement all these workarounds and fixes
in those drivers would argue that it's not so straightforward at all ...

> What might make sense is moving more checks into the common code so that
> implementations need to handle less, e.g. see how
> https://patchwork.kernel.org/patch/10949189/ proposed to check the message
> length alignment for skciphers (though that particular patch is broken as-
> is).
>
That goes without saying.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-24 08:47:21

by Christophe Leroy

[permalink] [raw]
Subject: Re: another testmgr question



Le 24/05/2019 à 10:44, Pascal Van Leeuwen a écrit :
>>> Valid? A totally fabricated case, if you ask me. Yes, you could do that,
>>> but is it *useful* at all? Really?
>>> No, it's not because a file of length 0 is a file of length 0, the length
>>> in itself is sufficient guarantee of its contents. The hash does not add
>>> *anything* in this case. It's a constant anyway, the same value for *any*
>>> zero-length file. It doesn't tell you anything you didn't already know.
>>> IMHO the tool should just return a message stating "hashing an empty file
>>> does not make any sense at all ...".
>>>
>>
>> Of course it's useful. It means that *every* possible file has a SHA-256
>> digest. So if you're validating a file, you just check the SHA-256 digest;
>> or
>> if you're creating a manifest, you just hash the file and list the SHA-256
>> digest. Making everyone handle empty files specially would be insane.
>>
> As I already mentioned in another thread somewhere, this morning in the
> shower I realised that this may be useful if you have no expectation of
> the length itself. But it's still a pretty specific use case which was
> never considered for our hardware. And our HW doesn't seem to be alone in
> this.
> Does shaXXXsum or md5sum use the kernel crypto API though?

The ones from libkcapi do (http://www.chronox.de/libkcapi.html)

Christophe

2019-05-24 08:50:18

by Jeffrey Walton

[permalink] [raw]
Subject: Re: another testmgr question

On Fri, May 24, 2019 at 4:47 AM Christophe Leroy
<[email protected]> wrote:
> ...
> > As I already mentioned in another thread somewhere, this morning in the
> > shower I realised that this may be useful if you have no expectation of
> > the length itself. But it's still a pretty specific use case which was
> > never considered for our hardware. And our HW doesn't seem to be alone in
> > this.
> > Does shaXXXsum or md5sum use the kernel crypto API though?
>
> The ones from libkcapi do (http://www.chronox.de/libkcapi.html)

And they can be loaded into OpenSSL through the afalg interface.

Lots of potential uses cases. I would not cross my fingers no one is using them.

Jeff

2019-05-24 09:16:52

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: another testmgr question

On Fri, 24 May 2019 at 09:47, Pascal Van Leeuwen
<[email protected]> wrote:
>
> > > Valid? A totally fabricated case, if you ask me. Yes, you could do that,
> > > but is it *useful* at all? Really?
> >
> > Yes, really. The likelihood of a test vector occurring in practice is
> > entirely irrelevant. What matters is that the test vectors provide
> > known outputs for known inputs, and many algorithm specifications
> > explicitly contain the empty message as one of the documented test
> > vectors.
> >
> I wholeheartedly disagree - in the context of hardware design, anyway.
> When you implement something, you care about practicle usability.
> Wasting gates / timing / power / cycles / whatever on something that
> has no use relevant use case is just silly. (In the particular case of
> zero length messages, this is not so trivial to implement in a data
> driven HW architecture! Data driven ... But no data ...)
> As long as you properly specify your hardware's limitations, it should
> not need to be an issue. If the HW doesn't match your use case, then
> don't use it for that ... HW almost always has such limitations.
>
> For FIPS certification, zero length vectors are considered *optional*.
> Probably because they realized most HW out there can't do it ...
>
> > In fact, given the above, I am slightly shocked that your hardware
> > does not handle empty messages correctly. Are you sure it is a
> > hardware problem and not a driver problem?
> >
> As a matter of fact, pretty sure, yeah, as I'm actually "the HW guy".
>

Apologies, I did not mean to imply that you don't understand your own hardware.

> Nothing really shocking about that. We mainly do network protocol
> acceleration anyway, being able to do some basic operations is just a
> bonus. We do specify very extensively what we can and cannot support.
>
> > In any case, as Eric points out as well, nothing is stopping you from
> > adding a special case to your driver that falls back to the software
> > implementation for known broken test cases.
> >
> Sure. But my point is that you end up with a driver full of special cases.
> Which is just bloat slowing it down and blowing it up.
>

Those are usually the consequences of a HW guy deciding to punt
something to software. The software never looks better for it :-)

> Besides, in order to be able to fallback to software for this case
> I would have to maintain a shadow software context for *every* HW context
> just because someone *might* do some zero length operation somewhere in
> the future. Because by the time the context is established (setkey),
> I cannot predict that I will get such a zero length request yet. Yuck.
>

Yuck indeed. Sacrificing correctness for reduced silicon area comes with a cost.

> > Removing test cases from the set because of broken hardware is out of
> > the question IMO. It doesn't actually fix the problem, and it may
> > actually result in breakage, especially for h/w accelerated crypto
> > exposed to userland, of which we have no idea whatsoever how it is
> > being used, and whether correct handling of zero length vectors is
> > likely to break anything or not.
> >
> The driver could check for it and return an -EINVAL error code.
> That would not break anything besides the application itself trying
> to do this. Which could then fail gracefully.
>

This is intractible. Software already exists that does not treat the
zero length vector as a special case. Such software may be plumbed
into the kernel crypto API via its AF_ALG interface. So the only
wiggling room we have is in the kernel driver.

> > > No, it's not because a file of length 0 is a file of length 0, the length
> > > in itself is sufficient guarantee of its contents. The hash does not add
> > > *anything* in this case. It's a constant anyway, the same value for *any*
> > > zero-length file. It doesn't tell you anything you didn't already know.
> > > IMHO the tool should just return a message stating "hashing an empty file
> > > does not make any sense at all ...".
> > >
> >
> > You are making assumptions about how the crypto is being used at a
> > higher level. Eric's example may not make sense to you, but arguing
> > that *any* use of sha256sum on empty files is guaranteed to be
> > non-sensical in all imaginable cases is ridiculous.
> >
> Actually, the thought occurred to me in the shower this morning that that
> *might* be useful in case you don't have expectations of the length
> whatsoever and only know the expected sum. In which case that would
> validate that a zero-length file is indeed what you should have ...
>
> Still - is that something the crypto API is *currently* being used for?
> And if not, do we ever *intend* to use it for something like that?
> If not, then we could just specify it to be illegal such that no one
> would ever attempt to do it. You have that freedom as long as there are
> no existing applications depending on it ...
>

As others have pointed out as well, h/w accelerated crypto is exposed
to userland, so nobody knows how exactly it is being used in the
field.

2019-05-24 09:22:52

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> > As I already mentioned in another thread somewhere, this morning in the
> > shower I realised that this may be useful if you have no expectation of
> > the length itself. But it's still a pretty specific use case which was
> > never considered for our hardware. And our HW doesn't seem to be alone in
> > this.
> > Does shaXXXsum or md5sum use the kernel crypto API though?
>
> The ones from libkcapi do (http://www.chronox.de/libkcapi.html)
>
> Christophe
>
I was not aware of that, so thanks for pointing that out.
Do they use the async calls (_aio) though? Because otherwise they shouldn't
end up being hardware accelerated anyway ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-24 09:26:32

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: another testmgr question

On Fri, 24 May 2019 at 11:21, Pascal Van Leeuwen
<[email protected]> wrote:
>
> > > As I already mentioned in another thread somewhere, this morning in the
> > > shower I realised that this may be useful if you have no expectation of
> > > the length itself. But it's still a pretty specific use case which was
> > > never considered for our hardware. And our HW doesn't seem to be alone in
> > > this.
> > > Does shaXXXsum or md5sum use the kernel crypto API though?
> >
> > The ones from libkcapi do (http://www.chronox.de/libkcapi.html)
> >
> > Christophe
> >
> I was not aware of that, so thanks for pointing that out.
> Do they use the async calls (_aio) though? Because otherwise they shouldn't
> end up being hardware accelerated anyway ...
>

All userland clients of the in-kernel crypto use it specifically to
access h/w accelerators, given that software crypto doesn't require
the higher privilege level (no point in issuing those AES CPU
instructions from the kernel if you can issue them in your program
directly)

Basically, what is used is a socket interface that can block on
read()/write(). So the userspace program doesn't need to be aware of
the asynchronous nature, it is just frozen while the calls are being
handled by the hardware.

2019-05-24 09:28:00

by Stephan Mueller

[permalink] [raw]
Subject: Re: another testmgr question

Am Freitag, 24. Mai 2019, 11:21:30 CEST schrieb Pascal Van Leeuwen:

Hi Pascal,

> I was not aware of that, so thanks for pointing that out.
> Do they use the async calls (_aio) though? Because otherwise they shouldn't
> end up being hardware accelerated anyway ...

Yes, AIO is supported: http://chronox.de//libkcapi/html/ch02s09.html


Ciao
Stephan


2019-05-24 09:29:08

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

>
> Apologies, I did not mean to imply that you don't understand your own
> hardware.
>
Not needed at all. That's not how I read it and besides that I'm not
easily offended anyway. Conversely, I don't intend to offend others,
I just enjoy heated discussions ;-) Which I don't mind losing at all,
BTW as I accept that I'm only human and don't know everything about
everything ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-24 09:35:37

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> All userland clients of the in-kernel crypto use it specifically to
> access h/w accelerators, given that software crypto doesn't require
> the higher privilege level (no point in issuing those AES CPU
> instructions from the kernel if you can issue them in your program
> directly)
>
> Basically, what is used is a socket interface that can block on
> read()/write(). So the userspace program doesn't need to be aware of
> the asynchronous nature, it is just frozen while the calls are being
> handled by the hardware.
>
With all due respect, but if the userland application is indeed
*frozen* while the calls are being handled, then that seems like its
pretty useless - for symmetric crypto, anyway - as performance would be
dominated by latency, not throughput.
Hardware acceleration would almost always lose that compared to a local
software implementation.
I certainly wouldn't want such an operation to end up at my driver!

Which brings up a question: is there some way for a driver to indicate
something like "don't use me unless you can seriously pipeline your
requests"?

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-24 09:43:16

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> On Fri, May 24, 2019 at 4:47 AM Christophe Leroy
> <[email protected]> wrote:
> > ...
> > > As I already mentioned in another thread somewhere, this morning in the
> > > shower I realised that this may be useful if you have no expectation of
> > > the length itself. But it's still a pretty specific use case which was
> > > never considered for our hardware. And our HW doesn't seem to be alone in
> > > this.
> > > Does shaXXXsum or md5sum use the kernel crypto API though?
> >
> > The ones from libkcapi do (http://www.chronox.de/libkcapi.html)
>
> And they can be loaded into OpenSSL through the afalg interface.
>

Hmm ... yeah ... OpenSSL ... yet another synchronous API that doesn't stand a
chance of being HW accelerated (although Intel contributed some async API
fairly recently - but I don't know of any application actually ported to use
that - would love to hear about any if they do exist).

Sorry, open wound ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-24 09:46:03

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: another testmgr question

On Fri, 24 May 2019 at 11:34, Pascal Van Leeuwen
<[email protected]> wrote:
>
> > All userland clients of the in-kernel crypto use it specifically to
> > access h/w accelerators, given that software crypto doesn't require
> > the higher privilege level (no point in issuing those AES CPU
> > instructions from the kernel if you can issue them in your program
> > directly)
> >
> > Basically, what is used is a socket interface that can block on
> > read()/write(). So the userspace program doesn't need to be aware of
> > the asynchronous nature, it is just frozen while the calls are being
> > handled by the hardware.
> >
> With all due respect, but if the userland application is indeed
> *frozen* while the calls are being handled, then that seems like its
> pretty useless - for symmetric crypto, anyway - as performance would be
> dominated by latency, not throughput.
> Hardware acceleration would almost always lose that compared to a local
> software implementation.
> I certainly wouldn't want such an operation to end up at my driver!
>

Again, you are making assumptions here that don't always hold. Note that
- a frozen process frees up the CPU to do other things while the
crypto is in progress;
- h/w crypto is typically more power efficient than CPU crypto;
- several userland programs and in-kernel users may be active at the
same time, so the fact that a single user sleeps doesn't mean the
hardware is used inefficiently

> Which brings up a question: is there some way for a driver to indicate
> something like "don't use me unless you can seriously pipeline your
> requests"?
>
> Regards,
> Pascal van Leeuwen
> Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
> http://www.insidesecure.com
>

2019-05-24 09:57:47

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> Again, you are making assumptions here that don't always hold. Note that
> - a frozen process frees up the CPU to do other things while the
> crypto is in progress;
> - h/w crypto is typically more power efficient than CPU crypto;
>
True. Those are the "other" reasons - besides acceleration - to use hardware
offload which we often use to sell our IP.
But the honest story there is that that only works out for situations
where there's enough work to do to make the software overhead for actually
starting and managing that work insignificant.

And even then, it's only a valid use case if that is your *intention*.
If you *just* needed the highest performance, you don't want to go through
the HW in this case (unless you have a *very* weak CPU perhaps, or a
huge amount of data to process in one go).

The catch is in the "always". But how do you make an informed decision
here? The current Crypto API does not really seem to provide a mechanism
for doing so. In which case MY approach would be "if I'm not SURE that
the HW can do it better, then I probably shouldn't be doing in on the HW".

> - several userland programs and in-kernel users may be active at the
> same time, so the fact that a single user sleeps doesn't mean the
> hardware is used inefficiently
>
I'm not worried about the *HW* being used inefficiently.
I'm worried about using the HW not being an improvement.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-24 11:10:20

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: another testmgr question

On Fri, 24 May 2019 at 11:57, Pascal Van Leeuwen
<[email protected]> wrote:
>
> > Again, you are making assumptions here that don't always hold. Note that
> > - a frozen process frees up the CPU to do other things while the
> > crypto is in progress;
> > - h/w crypto is typically more power efficient than CPU crypto;
> >
> True. Those are the "other" reasons - besides acceleration - to use hardware
> offload which we often use to sell our IP.
> But the honest story there is that that only works out for situations
> where there's enough work to do to make the software overhead for actually
> starting and managing that work insignificant.
>
> And even then, it's only a valid use case if that is your *intention*.
> If you *just* needed the highest performance, you don't want to go through
> the HW in this case (unless you have a *very* weak CPU perhaps, or a
> huge amount of data to process in one go).
>
> The catch is in the "always". But how do you make an informed decision
> here? The current Crypto API does not really seem to provide a mechanism
> for doing so. In which case MY approach would be "if I'm not SURE that
> the HW can do it better, then I probably shouldn't be doing in on the HW".
>

It becomes even more complicated to reason about if you take into
account that you may have 10s or 100s of instances of the CPU crypto
logic (one for each CPU) while the number of h/w IP blocks and/or
parallel processing queues typically doesn't scale in the same way.

But we are going down a rabbit hole here: even if you and I would
agree that it never makes any sense whatsoever to use h/w accelerators
from userland, the reality is that this is happening today, and so we
have to ensure that all drivers expose an interface that produces the
correct result for all imaginable corner cases.

> > - several userland programs and in-kernel users may be active at the
> > same time, so the fact that a single user sleeps doesn't mean the
> > hardware is used inefficiently
> >
> I'm not worried about the *HW* being used inefficiently.
> I'm worried about using the HW not being an improvement.
>

Evidently, it requires some care to use the AF_ALG interface
meaningfully. But that does not mean it cannot ever be used in the
right way.

2019-05-25 01:26:57

by Eric Biggers

[permalink] [raw]
Subject: Re: another testmgr question

On Fri, May 24, 2019 at 11:25:52AM +0200, Ard Biesheuvel wrote:
>
> All userland clients of the in-kernel crypto use it specifically to
> access h/w accelerators, given that software crypto doesn't require
> the higher privilege level (no point in issuing those AES CPU
> instructions from the kernel if you can issue them in your program
> directly)

Unfortunately people also use AF_ALG because they're too lazy to use a userspace
crypto library, e.g. systemd uses it for HMAC-SHA256, and iproute2 uses it for
SHA-1.

- Eric

2019-05-27 09:47:12

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> -----Original Message-----
> From: Ard Biesheuvel [mailto:[email protected]]
> Sent: Friday, May 24, 2019 11:45 AM
> To: Pascal Van Leeuwen <[email protected]>
> Cc: Christophe Leroy <[email protected]>; [email protected]
> Subject: Re: another testmgr question
>
> On Fri, 24 May 2019 at 11:34, Pascal Van Leeuwen
> <[email protected]> wrote:
> >
> > > All userland clients of the in-kernel crypto use it specifically to
> > > access h/w accelerators, given that software crypto doesn't require
> > > the higher privilege level (no point in issuing those AES CPU
> > > instructions from the kernel if you can issue them in your program
> > > directly)
> > >
> > > Basically, what is used is a socket interface that can block on
> > > read()/write(). So the userspace program doesn't need to be aware of
> > > the asynchronous nature, it is just frozen while the calls are being
> > > handled by the hardware.
> > >
> > With all due respect, but if the userland application is indeed
> > *frozen* while the calls are being handled, then that seems like its
> > pretty useless - for symmetric crypto, anyway - as performance would be
> > dominated by latency, not throughput.
> > Hardware acceleration would almost always lose that compared to a local
> > software implementation.
> > I certainly wouldn't want such an operation to end up at my driver!
> >
>
> Again, you are making assumptions here that don't always hold. Note that
> - a frozen process frees up the CPU to do other things while the
> crypto is in progress;
> - h/w crypto is typically more power efficient than CPU crypto;
> - several userland programs and in-kernel users may be active at the
> same time, so the fact that a single user sleeps doesn't mean the
> hardware is used inefficiently
>
With all due respect, but you are making assumptions as well. You are
making the assumption that reducing CPU load and/or reducing power
consumption is *more* important than absolute application performance or
latency. Which is certainly not always the case.

In addition to the assumption that using the hardware will actually
*achieve* this, while that really depends on the ratio of driver overhead
(which can be quite significant, unfortunately, especially if the API was
not really created from the get-go with HW in mind) vs hardware processing
time.

In many cases where only small amounts of data are processed sequentially,
the hardware will simply lose on all accounts ... So Linus actually did
have a point there. Hardware only wins for specific use cases. It's
important to realize that and not try and use hardware for everything.

Regards,

Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-27 09:50:21

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: another testmgr question

On Mon, 27 May 2019 at 11:44, Pascal Van Leeuwen
<[email protected]> wrote:
>
> > -----Original Message-----
> > From: Ard Biesheuvel [mailto:[email protected]]
> > Sent: Friday, May 24, 2019 11:45 AM
> > To: Pascal Van Leeuwen <[email protected]>
> > Cc: Christophe Leroy <[email protected]>; [email protected]
> > Subject: Re: another testmgr question
> >
> > On Fri, 24 May 2019 at 11:34, Pascal Van Leeuwen
> > <[email protected]> wrote:
> > >
> > > > All userland clients of the in-kernel crypto use it specifically to
> > > > access h/w accelerators, given that software crypto doesn't require
> > > > the higher privilege level (no point in issuing those AES CPU
> > > > instructions from the kernel if you can issue them in your program
> > > > directly)
> > > >
> > > > Basically, what is used is a socket interface that can block on
> > > > read()/write(). So the userspace program doesn't need to be aware of
> > > > the asynchronous nature, it is just frozen while the calls are being
> > > > handled by the hardware.
> > > >
> > > With all due respect, but if the userland application is indeed
> > > *frozen* while the calls are being handled, then that seems like its
> > > pretty useless - for symmetric crypto, anyway - as performance would be
> > > dominated by latency, not throughput.
> > > Hardware acceleration would almost always lose that compared to a local
> > > software implementation.
> > > I certainly wouldn't want such an operation to end up at my driver!
> > >
> >
> > Again, you are making assumptions here that don't always hold. Note that
> > - a frozen process frees up the CPU to do other things while the
> > crypto is in progress;
> > - h/w crypto is typically more power efficient than CPU crypto;
> > - several userland programs and in-kernel users may be active at the
> > same time, so the fact that a single user sleeps doesn't mean the
> > hardware is used inefficiently
> >
> With all due respect, but you are making assumptions as well. You are
> making the assumption that reducing CPU load and/or reducing power
> consumption is *more* important than absolute application performance or
> latency. Which is certainly not always the case.
>

I never said power consumption is *always* more important. You were
assuming it never is.

> In addition to the assumption that using the hardware will actually
> *achieve* this, while that really depends on the ratio of driver overhead
> (which can be quite significant, unfortunately, especially if the API was
> not really created from the get-go with HW in mind) vs hardware processing
> time.
>

Of course.

> In many cases where only small amounts of data are processed sequentially,
> the hardware will simply lose on all accounts ... So Linus actually did
> have a point there. Hardware only wins for specific use cases. It's
> important to realize that and not try and use hardware for everything.
>

True. But we have already painted ourselves into a corner here, since
whatever we expose to userland today cannot simply be revoked.

I guess you could argue that your particular driver should not be
exposed to userland, and I think we may even have a CRYPTO_ALG_xxx
flag for that. But even if that does happen, it doesn't mean you can
stop caring about zero length inputs :-)

2019-05-27 09:52:49

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> -----Original Message-----
> From: Ard Biesheuvel [mailto:[email protected]]
> Sent: Friday, May 24, 2019 1:10 PM
> To: Pascal Van Leeuwen <[email protected]>
> Cc: Christophe Leroy <[email protected]>; [email protected]
> Subject: Re: another testmgr question
>
> On Fri, 24 May 2019 at 11:57, Pascal Van Leeuwen
> <[email protected]> wrote:
> >
> > > Again, you are making assumptions here that don't always hold. Note that
> > > - a frozen process frees up the CPU to do other things while the
> > > crypto is in progress;
> > > - h/w crypto is typically more power efficient than CPU crypto;
> > >
> > True. Those are the "other" reasons - besides acceleration - to use
> hardware
> > offload which we often use to sell our IP.
> > But the honest story there is that that only works out for situations
> > where there's enough work to do to make the software overhead for actually
> > starting and managing that work insignificant.
> >
> > And even then, it's only a valid use case if that is your *intention*.
> > If you *just* needed the highest performance, you don't want to go through
> > the HW in this case (unless you have a *very* weak CPU perhaps, or a
> > huge amount of data to process in one go).
> >
> > The catch is in the "always". But how do you make an informed decision
> > here? The current Crypto API does not really seem to provide a mechanism
> > for doing so. In which case MY approach would be "if I'm not SURE that
> > the HW can do it better, then I probably shouldn't be doing in on the HW".
> >
>
> It becomes even more complicated to reason about if you take into
> account that you may have 10s or 100s of instances of the CPU crypto
> logic (one for each CPU) while the number of h/w IP blocks and/or
> parallel processing queues typically doesn't scale in the same way.
>
> But we are going down a rabbit hole here: even if you and I would
> agree that it never makes any sense whatsoever to use h/w accelerators
> from userland, the reality is that this is happening today, and so we
> have to ensure that all drivers expose an interface that produces the
> correct result for all imaginable corner cases.
>
> > > - several userland programs and in-kernel users may be active at the
> > > same time, so the fact that a single user sleeps doesn't mean the
> > > hardware is used inefficiently
> > >
> > I'm not worried about the *HW* being used inefficiently.
> > I'm worried about using the HW not being an improvement.
> >
>
> Evidently, it requires some care to use the AF_ALG interface
> meaningfully. But that does not mean it cannot ever be used in the
> right way.
>
I think we agree on the big picture.
As I already argued in a different thread, the best approach is probably
to not select the hardware by default at all, but make it an explicit
choice.

I suppose that could be achieved within the current framework by just
giving all hardware drivers a low cra_priority figure.

Thing is, that requires providing some confidence to hardware vendors
that ALL crypto consumers will indeed provide some configuration option
to select the driver explicitly.

In which case hardware vendors could simply recommend using their driver
for certain applications that have been tested and known to be usefully
accelerated (or achieve lower power consumption or lower CPU load,
whatever is your specific requirement). Or to NOT use their drivers for
certain applications that don't benefit.

Another psychological barrier, though, is that cra_priority can be
perceived as some indication of how "good" the hardware is. And obviously
no hardware vendor would want to make his/her hardware look worse than
the competion. Or worse than software implementations for that matter.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-27 09:57:10

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> -----Original Message-----
> From: Eric Biggers [mailto:[email protected]]
> Sent: Saturday, May 25, 2019 3:23 AM
> To: Ard Biesheuvel <[email protected]>
> Cc: Pascal Van Leeuwen <[email protected]>; Christophe Leroy
> <[email protected]>; [email protected]
> Subject: Re: another testmgr question
>
> On Fri, May 24, 2019 at 11:25:52AM +0200, Ard Biesheuvel wrote:
> >
> > All userland clients of the in-kernel crypto use it specifically to
> > access h/w accelerators, given that software crypto doesn't require
> > the higher privilege level (no point in issuing those AES CPU
> > instructions from the kernel if you can issue them in your program
> > directly)
>
> Unfortunately people also use AF_ALG because they're too lazy to use a
> userspace
> crypto library, e.g. systemd uses it for HMAC-SHA256, and iproute2 uses it
> for
> SHA-1.
>
> - Eric
>

Are they really too lazy or just under the impression they will benefit from
that somehow? Maybe someone education is in order ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-27 10:06:24

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> > With all due respect, but you are making assumptions as well. You are
> > making the assumption that reducing CPU load and/or reducing power
> > consumption is *more* important than absolute application performance or
> > latency. Which is certainly not always the case.
> >
>
> I never said power consumption is *always* more important. You were
> assuming it never is.
>
Nooooo I wasn't. Not on purpose, anyway. Power consumption is a major selling
point for us. If you got that impression, then that's some misunderstanding.
My argument was simply that there *may* be other requirements. You don't know,
so you shouldn't make assumptions in the other direction either.

> > In many cases where only small amounts of data are processed sequentially,
> > the hardware will simply lose on all accounts ... So Linus actually did
> > have a point there. Hardware only wins for specific use cases. It's
> > important to realize that and not try and use hardware for everything.
> >
>
> True. But we have already painted ourselves into a corner here, since
> whatever we expose to userland today cannot simply be revoked.
>
> I guess you could argue that your particular driver should not be
> exposed to userland, and I think we may even have a CRYPTO_ALG_xxx
>
Well, I understood from someone else on this list that CRYPTO_ALG can
do async operations in which case I would assume it doesn't block??

I would rather see a flag denoting "do not use me in a synchronous
fashion on relatively small datablocks, only use me if you intend to
seriously pipeline your requests". Or somesuch.

But then again that would still be too simplistic to select to best
driver under all possible circumstances ... so why even bother.

> flag for that. But even if that does happen, it doesn't mean you can
> stop caring about zero length inputs :-)
>
If the selection of the hardware driver becomes explicit and not
automatic, you could argue for a case where the driver does NOT have
to implement all dark corners of the API. As, as a hardware vendor,
we could simply recommend NOT to use it for application XYZ because
it does things - like zero length messages - we don't support.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-27 10:30:07

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: another testmgr question

On Mon, 27 May 2019 at 12:04, Pascal Van Leeuwen
<[email protected]> wrote:
>
> > > With all due respect, but you are making assumptions as well. You are
> > > making the assumption that reducing CPU load and/or reducing power
> > > consumption is *more* important than absolute application performance or
> > > latency. Which is certainly not always the case.
> > >
> >
> > I never said power consumption is *always* more important. You were
> > assuming it never is.
> >
> Nooooo I wasn't. Not on purpose, anyway. Power consumption is a major selling
> point for us. If you got that impression, then that's some misunderstanding.
> My argument was simply that there *may* be other requirements. You don't know,
> so you shouldn't make assumptions in the other direction either.
>

That was basically *my* point. But you're welcome to use it :-)

> > > In many cases where only small amounts of data are processed sequentially,
> > > the hardware will simply lose on all accounts ... So Linus actually did
> > > have a point there. Hardware only wins for specific use cases. It's
> > > important to realize that and not try and use hardware for everything.
> > >
> >
> > True. But we have already painted ourselves into a corner here, since
> > whatever we expose to userland today cannot simply be revoked.
> >
> > I guess you could argue that your particular driver should not be
> > exposed to userland, and I think we may even have a CRYPTO_ALG_xxx
> >
> Well, I understood from someone else on this list that CRYPTO_ALG can
> do async operations in which case I would assume it doesn't block??
>
> I would rather see a flag denoting "do not use me in a synchronous
> fashion on relatively small datablocks, only use me if you intend to
> seriously pipeline your requests". Or somesuch.
>

As I explained before, what appears synchronous to a single userland
process may look very differently from the OS and h/w perspective. But
of course, I take your point that h/w is not faster than the CPU in
the general case, and so care must be taken.

This is made worse by the priority scheme, which does not really
convery information like this.

> But then again that would still be too simplistic to select to best
> driver under all possible circumstances ... so why even bother.
>
> > flag for that. But even if that does happen, it doesn't mean you can
> > stop caring about zero length inputs :-)
> >
> If the selection of the hardware driver becomes explicit and not
> automatic, you could argue for a case where the driver does NOT have
> to implement all dark corners of the API. As, as a hardware vendor,
> we could simply recommend NOT to use it for application XYZ because
> it does things - like zero length messages - we don't support.
>

Spoken like a true h/w guy :-)

Our crypto s/w stack and the storage, networking and other subsystems
that are layered on top of it are complex enough that we shouldn't try
to cater for non-compliant hardware. This is why you need to fix this
in your driver: to prevent the issue from leaking into other layers,
making it even more difficult to do testing and validation.

2019-05-27 10:45:21

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> As I explained before, what appears synchronous to a single userland
> process may look very differently from the OS and h/w perspective. But
> of course, I take your point that h/w is not faster than the CPU in
> the general case, and so care must be taken.
>
"Synchronous" in this context was referring to a use case where the
application does a request and then *waits* for the result of that
request before continuing.
While "asynchronous" would refer to a case where the application fires
off a request and then merrily goes off doing "other" things (which
could be, but is not limited to, preparing and posting new requests).

So I'm strictly viewing it from the applications' perspective here.

> This is made worse by the priority scheme, which does not really
> convery information like this.
>
Yes, the priority scheme is far too simplistic to cover all details
regarding hardware acceleration. Which why we probably shouldn't use
it to select hardware drivers at all.

> > But then again that would still be too simplistic to select to best
> > driver under all possible circumstances ... so why even bother.
> >
> > > flag for that. But even if that does happen, it doesn't mean you can
> > > stop caring about zero length inputs :-)
> > >
> > If the selection of the hardware driver becomes explicit and not
> > automatic, you could argue for a case where the driver does NOT have
> > to implement all dark corners of the API. As, as a hardware vendor,
> > we could simply recommend NOT to use it for application XYZ because
> > it does things - like zero length messages - we don't support.
> >
>
> Spoken like a true h/w guy :-)
>
Guilty as charged. I AM a true H/W guy and not a software engineer at all.
But have you ever stopped to wonder WHY all hardware guys talk like that?
Maybe, just maybe, they have a damn good reason to do so ...

> Our crypto s/w stack and the storage, networking and other subsystems
> that are layered on top of it are complex enough that we shouldn't try
> to cater for non-compliant hardware. This is why you need to fix this
> in your driver: to prevent the issue from leaking into other layers,
> making it even more difficult to do testing and validation.
>
Now where am I suggesting that applications should cater for non-compliant
hardware? I'm simply suggesting that you should NOT use the hardware for
such an application at all. If you make it explicit, you can do that.

And besides, who decides what is "compliant" and what the rules are?
Please keep in mind that existing hardware cannot be changed. So why
wasn't the API designed around the limitations of *existing* hardware?
It can take several years for a hardware fix to reach the end user ...

As for testing and validation: if the selection is explicit, then the
responsibility for the testing and validation can move to the HW vendor.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-27 10:59:25

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: another testmgr question

On Mon, 27 May 2019 at 12:43, Pascal Van Leeuwen
<[email protected]> wrote:
>
> > As I explained before, what appears synchronous to a single userland
> > process may look very differently from the OS and h/w perspective. But
> > of course, I take your point that h/w is not faster than the CPU in
> > the general case, and so care must be taken.
> >
> "Synchronous" in this context was referring to a use case where the
> application does a request and then *waits* for the result of that
> request before continuing.
> While "asynchronous" would refer to a case where the application fires
> off a request and then merrily goes off doing "other" things (which
> could be, but is not limited to, preparing and posting new requests).
>
> So I'm strictly viewing it from the applications' perspective here.
>

I understand that. But even if the application is synchronous, it does
not mean that the whole world stops and nothing is using the
accelerator in the mean time.

> > This is made worse by the priority scheme, which does not really
> > convery information like this.
> >
> Yes, the priority scheme is far too simplistic to cover all details
> regarding hardware acceleration. Which why we probably shouldn't use
> it to select hardware drivers at all.
>
> > > But then again that would still be too simplistic to select to best
> > > driver under all possible circumstances ... so why even bother.
> > >
> > > > flag for that. But even if that does happen, it doesn't mean you can
> > > > stop caring about zero length inputs :-)
> > > >
> > > If the selection of the hardware driver becomes explicit and not
> > > automatic, you could argue for a case where the driver does NOT have
> > > to implement all dark corners of the API. As, as a hardware vendor,
> > > we could simply recommend NOT to use it for application XYZ because
> > > it does things - like zero length messages - we don't support.
> > >
> >
> > Spoken like a true h/w guy :-)
> >
> Guilty as charged. I AM a true H/W guy and not a software engineer at all.
> But have you ever stopped to wonder WHY all hardware guys talk like that?
> Maybe, just maybe, they have a damn good reason to do so ...
>

Of course. And so do we. And that is why we meet in the middle to compromise.

> > Our crypto s/w stack and the storage, networking and other subsystems
> > that are layered on top of it are complex enough that we shouldn't try
> > to cater for non-compliant hardware. This is why you need to fix this
> > in your driver: to prevent the issue from leaking into other layers,
> > making it even more difficult to do testing and validation.
> >
> Now where am I suggesting that applications should cater for non-compliant
> hardware? I'm simply suggesting that you should NOT use the hardware for
> such an application at all. If you make it explicit, you can do that.
>
> And besides, who decides what is "compliant" and what the rules are?

If the algorithm in question is defined for zero length inputs, but
the h/w chooses not to implement that case, I think non-compliant is a
rather nice way to say 'broken'. I know there is a gradient here going
from hashes, AEADs to symmetric ciphers, but I think this applies to
all of them.

> Please keep in mind that existing hardware cannot be changed. So why
> wasn't the API designed around the limitations of *existing* hardware?

From a software point of view, adding special cases for zero length
inputs amounts to what you are trying to avoid: using more 'silicon
area'.

Proper validation requires coverage based testing, i.e., that all
statements in a program can be proven to be exercised by some use
case, and produce the correct result.

This means that, if we have to add 'if (message_length > 0) { do this;
} else { do that; }' everywhere, we are moving the effort from your
corner to mine. Of course I am going to oppose to that :-)

> It can take several years for a hardware fix to reach the end user ...
>

While software implementations can sometimes be fixed quickly,
software APIs have *really* long lifetimes as well, especially in the
server space. And until you have reached sufficient coverage with your
updated API, you are stuck with both the old one and the new one, so
you have even more code to worry about.

So a crypto API where zero length inputs are not permitted or treated
specially is not the way to fix this.

> As for testing and validation: if the selection is explicit, then the
> responsibility for the testing and validation can move to the HW vendor.
>

I think the bottom line is still to fix the driver and be done with
it. I honestly don't care about what exactly your h/w supports, as
long as the driver that encapsulates it addresses the impedance
mismatch between what the h/w layer provides and what the upper layer
expects.

2019-05-27 12:23:00

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

>
> I understand that. But even if the application is synchronous, it does
> not mean that the whole world stops and nothing is using the
> accelerator in the mean time.
>
I understand that as well. But that doesn't change the fact that the
application may be waiting for a loooooong (relatively speaking) time
for it's results. As latency through hardware may be several orders of
a magnitude larger than the time it actually takes to *process* the
request. So when used synchronously the HW may appear to work at a mere
fraction of its true performance.

And if your main interest is in that application, you may not care so
much about what the rest of the system does, even if it can use the
remaining bandwidth of the accelerator.

In which case it may be desirable *not* to use the accelerator for that
application at all due to *very* poor performance (for that application).

Which would make even more cycles on the accelerator available to the
other applications in the system, so that knife cuts both ways ...

> > > This is made worse by the priority scheme, which does not really
> > > convery information like this.
> > >
> > Yes, the priority scheme is far too simplistic to cover all details
> > regarding hardware acceleration. Which why we probably shouldn't use
> > it to select hardware drivers at all.
> >
> > > > But then again that would still be too simplistic to select to best
> > > > driver under all possible circumstances ... so why even bother.
> > > >
> > > > > flag for that. But even if that does happen, it doesn't mean you can
> > > > > stop caring about zero length inputs :-)
> > > > >
> > > > If the selection of the hardware driver becomes explicit and not
> > > > automatic, you could argue for a case where the driver does NOT have
> > > > to implement all dark corners of the API. As, as a hardware vendor,
> > > > we could simply recommend NOT to use it for application XYZ because
> > > > it does things - like zero length messages - we don't support.
> > > >
> > >
> > > Spoken like a true h/w guy :-)
> > >
> > Guilty as charged. I AM a true H/W guy and not a software engineer at all.
> > But have you ever stopped to wonder WHY all hardware guys talk like that?
> > Maybe, just maybe, they have a damn good reason to do so ...
> >
>
> Of course. And so do we. And that is why we meet in the middle to compromise.
>
Yes, we try where we can. But you have to remember that ultimately hardware
is bound by the limitations of the physical world. Which doesn't compromise :-)
And compromises have consequences that need to be carefully considered.

Adding tons of workarounds to drivers, for example, slows them down, makes them
use more CPU cycles and more power, and ultimately defeats the purpose of having
a hardware accelerator at all. That is actually my concern.
And as an aside, once workarounds have been implemented and proven to "work", the
underlying issue rarely makes it to the HW guys so we're stuck with it forever.

> > > Our crypto s/w stack and the storage, networking and other subsystems
> > > that are layered on top of it are complex enough that we shouldn't try
> > > to cater for non-compliant hardware. This is why you need to fix this
> > > in your driver: to prevent the issue from leaking into other layers,
> > > making it even more difficult to do testing and validation.
> > >
> > Now where am I suggesting that applications should cater for non-compliant
> > hardware? I'm simply suggesting that you should NOT use the hardware for
> > such an application at all. If you make it explicit, you can do that.
> >
> > And besides, who decides what is "compliant" and what the rules are?
>
> If the algorithm in question is defined for zero length inputs, but
> the h/w chooses not to implement that case, I think non-compliant is a
> rather nice way to say 'broken'.
>
NO. Hardware is broken if it doesn't comply to its own specifications -
which *may* include references to industry standards it must comply with.
If I intentionally specify that zero length hashes are not supported, and
I don't pretend to comply with any industry standard that requires them,
then that's just a *limitation* of the hardware, most certainly not a bug.
Which may be perfectly valid as hardware is usually created for specific
use cases.
In the case of the Inside Secure HW/driver: mainly IPsec and perhaps disk
encryption, but certainly not Ye Olde's basic random crypto request.

Hardware necessarily *always* has limitations because of all kinds of
constraints: area, power, complexity. And even something as mundane as a
schedule constraint where you simply can't fit all desired features in the
desired schedule. Which is usually very solid due to timeslots being
planned in a fab etc. We don't have the luxury of extending our schedule
forever like SW guys tend to do ... we're very proud of our track record
of always meeting our promised schedules. Plus - silicon can't be patched,
so what's done is done and you have to live with it. For many years to
come, usually.

> I know there is a gradient here going
> from hashes, AEADs to symmetric ciphers, but I think this applies to
> all of them.
>
> > Please keep in mind that existing hardware cannot be changed. So why
> > wasn't the API designed around the limitations of *existing* hardware?
>
> From a software point of view, adding special cases for zero length
> inputs amounts to what you are trying to avoid: using more 'silicon
> area'.
>
No, that's actually not the reason at all in this case. We're trying to
avoid significant extra complexity and effort on both the hardware itself
and the verification thereof. Silicon area is not even in the picture as
a concern for something as "small" as this.

Adding zero length support to our hardware architecture is not a trivial
exercise. And then you have to weigh added complexity - =added risk, when
you talk about hardware with multi-million dollar mask sets in play -
against usefulness. Zero-length support was - and still is! - simply not
worth the added risk and effort.

> Proper validation requires coverage based testing, i.e., that all
> statements in a program can be proven to be exercised by some use
> case, and produce the correct result.
>
> This means that, if we have to add 'if (message_length > 0) { do this;
> } else { do that; }' everywhere, we are moving the effort from your
> corner to mine. Of course I am going to oppose to that :-)
>
> > It can take several years for a hardware fix to reach the end user ...
> >
>
> While software implementations can sometimes be fixed quickly,
> software APIs have *really* long lifetimes as well, especially in the
> server space. And until you have reached sufficient coverage with your
> updated API, you are stuck with both the old one and the new one, so
> you have even more code to worry about.
>
> So a crypto API where zero length inputs are not permitted or treated
> specially is not the way to fix this.
>
Well, for one thing even FIPS certification allows zero lengths not to be
supported by an implementation. So there's definitely prior art to that.
You could handle this by means of capability flags or profiles or whatever.
But I was not even going that far in my suggestions.

I was merely suggesting that IF a driver needs to be explicitly selected to
be used, THEN you could allow that driver to be not fully compliant to some
extent. And then the driver could come with a README or so - maintained by
the HW vendor - detailing which use cases have actually been validated with
it.

> > As for testing and validation: if the selection is explicit, then the
> > responsibility for the testing and validation can move to the HW vendor.
> >
>
> I think the bottom line is still to fix the driver and be done with
> it. I honestly don't care about what exactly your h/w supports, as
> long as the driver that encapsulates it addresses the impedance
> mismatch between what the h/w layer provides and what the upper layer
> expects.
>
And if you go that naive route, just fix everything in the driver, then
you simply end up with something terribly inefficient because all those
corner case checks end up in the fast path and eating up code space.

For a someone claiming to "meet in the middle to compromise" you're
surely not compromising anything at all ... No offense.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Inside Secure
http://www.insidesecure.com

2019-05-27 15:01:19

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: another testmgr question

On Mon, 27 May 2019 at 14:22, Pascal Van Leeuwen
<[email protected]> wrote:
>
> >
> > I understand that. But even if the application is synchronous, it does
> > not mean that the whole world stops and nothing is using the
> > accelerator in the mean time.
> >
> I understand that as well. But that doesn't change the fact that the
> application may be waiting for a loooooong (relatively speaking) time
> for it's results. As latency through hardware may be several orders of
> a magnitude larger than the time it actually takes to *process* the
> request. So when used synchronously the HW may appear to work at a mere
> fraction of its true performance.
>

Of course. Sometimes you care about that, and sometimes you don't.

> And if your main interest is in that application, you may not care so
> much about what the rest of the system does, even if it can use the
> remaining bandwidth of the accelerator.
>

100s of instances of that application, thread, etc could be running at
the same time, and throughput may be more important than latency.

> In which case it may be desirable *not* to use the accelerator for that
> application at all due to *very* poor performance (for that application).
>
> Which would make even more cycles on the accelerator available to the
> other applications in the system, so that knife cuts both ways ...
>

Single thread perfomance is only one metric, and it may not be the one
you care about most.

> > > > This is made worse by the priority scheme, which does not really
> > > > convery information like this.
> > > >
> > > Yes, the priority scheme is far too simplistic to cover all details
> > > regarding hardware acceleration. Which why we probably shouldn't use
> > > it to select hardware drivers at all.
> > >
> > > > > But then again that would still be too simplistic to select to best
> > > > > driver under all possible circumstances ... so why even bother.
> > > > >
> > > > > > flag for that. But even if that does happen, it doesn't mean you can
> > > > > > stop caring about zero length inputs :-)
> > > > > >
> > > > > If the selection of the hardware driver becomes explicit and not
> > > > > automatic, you could argue for a case where the driver does NOT have
> > > > > to implement all dark corners of the API. As, as a hardware vendor,
> > > > > we could simply recommend NOT to use it for application XYZ because
> > > > > it does things - like zero length messages - we don't support.
> > > > >
> > > >
> > > > Spoken like a true h/w guy :-)
> > > >
> > > Guilty as charged. I AM a true H/W guy and not a software engineer at all.
> > > But have you ever stopped to wonder WHY all hardware guys talk like that?
> > > Maybe, just maybe, they have a damn good reason to do so ...
> > >
> >
> > Of course. And so do we. And that is why we meet in the middle to compromise.
> >
> Yes, we try where we can. But you have to remember that ultimately hardware
> is bound by the limitations of the physical world. Which doesn't compromise :-)
> And compromises have consequences that need to be carefully considered.
>

Of course.

> Adding tons of workarounds to drivers, for example, slows them down, makes them
> use more CPU cycles and more power, and ultimately defeats the purpose of having
> a hardware accelerator at all. That is actually my concern.

If the workaround is in a driver and not on a hot path, we don't
really care about the memory footprint.

> And as an aside, once workarounds have been implemented and proven to "work", the
> underlying issue rarely makes it to the HW guys so we're stuck with it forever.
>

Well, the starting point of the argument was that you deliberately
omitted handling of zero length inputs to save silicon area. So the
issue would already be known to the h/w guys, and they decided it was
something they'd punt to the software instead.

> > > > Our crypto s/w stack and the storage, networking and other subsystems
> > > > that are layered on top of it are complex enough that we shouldn't try
> > > > to cater for non-compliant hardware. This is why you need to fix this
> > > > in your driver: to prevent the issue from leaking into other layers,
> > > > making it even more difficult to do testing and validation.
> > > >
> > > Now where am I suggesting that applications should cater for non-compliant
> > > hardware? I'm simply suggesting that you should NOT use the hardware for
> > > such an application at all. If you make it explicit, you can do that.
> > >
> > > And besides, who decides what is "compliant" and what the rules are?
> >
> > If the algorithm in question is defined for zero length inputs, but
> > the h/w chooses not to implement that case, I think non-compliant is a
> > rather nice way to say 'broken'.
> >
> NO. Hardware is broken if it doesn't comply to its own specifications -
> which *may* include references to industry standards it must comply with.
> If I intentionally specify that zero length hashes are not supported, and
> I don't pretend to comply with any industry standard that requires them,
> then that's just a *limitation* of the hardware, most certainly not a bug.

Fair enough. But if you want to integrate that h/w in a system that
does aim to comply, it is up to the software to fix the impedance
mismatch.

> Which may be perfectly valid as hardware is usually created for specific
> use cases.
> In the case of the Inside Secure HW/driver: mainly IPsec and perhaps disk
> encryption, but certainly not Ye Olde's basic random crypto request.
>

Sure.

> Hardware necessarily *always* has limitations because of all kinds of
> constraints: area, power, complexity. And even something as mundane as a
> schedule constraint where you simply can't fit all desired features in the
> desired schedule. Which is usually very solid due to timeslots being
> planned in a fab etc. We don't have the luxury of extending our schedule
> forever like SW guys tend to do ... we're very proud of our track record
> of always meeting our promised schedules. Plus - silicon can't be patched,
> so what's done is done and you have to live with it. For many years to
> come, usually.
>

This is all pretty well understood. We all have different interests to
balance against each other, which is why we are perfectly fine with
handling some corner cases in the driver. What we are not prepared to
do is let those corner cases leak into the core crypto layer as cases
that require special handling.

> > I know there is a gradient here going
> > from hashes, AEADs to symmetric ciphers, but I think this applies to
> > all of them.
> >
> > > Please keep in mind that existing hardware cannot be changed. So why
> > > wasn't the API designed around the limitations of *existing* hardware?
> >
> > From a software point of view, adding special cases for zero length
> > inputs amounts to what you are trying to avoid: using more 'silicon
> > area'.
> >
> No, that's actually not the reason at all in this case. We're trying to
> avoid significant extra complexity and effort on both the hardware itself
> and the verification thereof. Silicon area is not even in the picture as
> a concern for something as "small" as this.
>
> Adding zero length support to our hardware architecture is not a trivial
> exercise. And then you have to weigh added complexity - =added risk, when
> you talk about hardware with multi-million dollar mask sets in play -
> against usefulness. Zero-length support was - and still is! - simply not
> worth the added risk and effort.
>

Of course. That is why it is perfectly fine to handle this in your driver.

> > Proper validation requires coverage based testing, i.e., that all
> > statements in a program can be proven to be exercised by some use
> > case, and produce the correct result.
> >
> > This means that, if we have to add 'if (message_length > 0) { do this;
> > } else { do that; }' everywhere, we are moving the effort from your
> > corner to mine. Of course I am going to oppose to that :-)
> >
> > > It can take several years for a hardware fix to reach the end user ...
> > >
> >
> > While software implementations can sometimes be fixed quickly,
> > software APIs have *really* long lifetimes as well, especially in the
> > server space. And until you have reached sufficient coverage with your
> > updated API, you are stuck with both the old one and the new one, so
> > you have even more code to worry about.
> >
> > So a crypto API where zero length inputs are not permitted or treated
> > specially is not the way to fix this.
> >
> Well, for one thing even FIPS certification allows zero lengths not to be
> supported by an implementation. So there's definitely prior art to that.
> You could handle this by means of capability flags or profiles or whatever.
> But I was not even going that far in my suggestions.
>
> I was merely suggesting that IF a driver needs to be explicitly selected to
> be used, THEN you could allow that driver to be not fully compliant to some
> extent. And then the driver could come with a README or so - maintained by
> the HW vendor - detailing which use cases have actually been validated with
> it.
>

That is also fine. If you choose to expose your hardware via a
different subsystem than the crypto subsystem, there is obviously no
need to abide by the crypto subsystem's really. But if you claim to
implement these algorithms, your driver must do so without special
corner cases.

> > > As for testing and validation: if the selection is explicit, then the
> > > responsibility for the testing and validation can move to the HW vendor.
> > >
> >
> > I think the bottom line is still to fix the driver and be done with
> > it. I honestly don't care about what exactly your h/w supports, as
> > long as the driver that encapsulates it addresses the impedance
> > mismatch between what the h/w layer provides and what the upper layer
> > expects.
> >
> And if you go that naive route, just fix everything in the driver, then
> you simply end up with something terribly inefficient because all those
> corner case checks end up in the fast path and eating up code space.
>

This is *exactly* the reason why we want this workaround in your
driver, because if it is not in your driver, we will have to put it in
generic code where it affects everybody.

> For a someone claiming to "meet in the middle to compromise" you're
> surely not compromising anything at all ... No offense.
>

None taken. I am really trying to work with you here, but changing
core code to address the limitations of one particular h/w
implementation is not something we do lightly.

2019-05-27 15:59:09

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> > I understand that as well. But that doesn't change the fact that the
> > application may be waiting for a loooooong (relatively speaking) time
> > for it's results. As latency through hardware may be several orders of
> > a magnitude larger than the time it actually takes to *process* the
> > request. So when used synchronously the HW may appear to work at a mere
> > fraction of its true performance.
> >
>
> Of course. Sometimes you care about that, and sometimes you don't.
>
> > And if your main interest is in that application, you may not care so
> > much about what the rest of the system does, even if it can use the
> > remaining bandwidth of the accelerator.
> >
>
> 100s of instances of that application, thread, etc could be running at
> the same time, and throughput may be more important than latency.
>
> > In which case it may be desirable *not* to use the accelerator for that
> > application at all due to *very* poor performance (for that application).
> >
> > Which would make even more cycles on the accelerator available to the
> > other applications in the system, so that knife cuts both ways ...
> >
>
> Single thread perfomance is only one metric, and it may not be the one
> you care about most.
>

How relevant is the fact that there may be (other) situations where latency
is not relevant for someone being in a situation where it is relevant?

I was just talking about that situation where it actually is relevant and
therefore you do *not* wish the hardware driver to be used.

Ok, let's phrase it such that can be no further misunderstandings:

If you want performance from a single-threaded single application that does
synchronous, blocking, crypto API calls, then you do not want those to end
up at our hardware. Or, very likely, pretty much any other hardware.
The cra_priority mechanism does not allow the driver to convey such a thing.

> > Adding tons of workarounds to drivers, for example, slows them down, makes
> them
> > use more CPU cycles and more power, and ultimately defeats the purpose of
> having
> > a hardware accelerator at all. That is actually my concern.
>
> If the workaround is in a driver and not on a hot path, we don't
> really care about the memory footprint.
>
These workarounds are on a hot path by definition, as they have to filter
out specific requests coming in, i.e. it affects every single request done
to the driver. As for memory footprint: that is still relevant for embedded
systems even today. Besides, memory footprint affects instruction cache hit
ratio and therefore, indirectly, performance as well.

> > And as an aside, once workarounds have been implemented and proven to
> "work", the
> > underlying issue rarely makes it to the HW guys so we're stuck with it
> forever.
> >
>
> Well, the starting point of the argument was that you deliberately
> omitted handling of zero length inputs to save silicon area.
>
Where did I ever say that we omitted that to save silicon area?
You're putting words in my mouth (or fingers) now. I never said that,
that is not the reason at all.

> So the
> issue would already be known to the h/w guys, and they decided it was
> something they'd punt to the software instead.
>
NO. We never decided any such thing. We decided that it was not a relevant
use case that we needed to support at all. Neither in the hardware nor in
the driver. Our own generic OS driver does not contain any such workarounds.
In fact, based on this Linux driver thing we had a new internal discussion
on it and the outcome did not change: not a relevant use case for us.

> > NO. Hardware is broken if it doesn't comply to its own specifications -
> > which *may* include references to industry standards it must comply with.
> > If I intentionally specify that zero length hashes are not supported, and
> > I don't pretend to comply with any industry standard that requires them,
> > then that's just a *limitation* of the hardware, most certainly not a bug.
>
> Fair enough. But if you want to integrate that h/w in a system that
> does aim to comply, it is up to the software to fix the impedance
> mismatch.
>
Comply with what exactly? You can't "comply" with algorithms ... you just
implement whatever subset makes sense and specify the constraints. You can
comply with protocol specifications, and that's what we do. None of those
requires zero length hashing, HMAC, cipher or AEAD operations.
Many algorithms are unbounded anyway and hardware is bounded by definition.

> > Hardware necessarily *always* has limitations because of all kinds of
> > constraints: area, power, complexity. And even something as mundane as a
> > schedule constraint where you simply can't fit all desired features in the
> > desired schedule. Which is usually very solid due to timeslots being
> > planned in a fab etc. We don't have the luxury of extending our schedule
> > forever like SW guys tend to do ... we're very proud of our track record
> > of always meeting our promised schedules. Plus - silicon can't be patched,
> > so what's done is done and you have to live with it. For many years to
> > come, usually.
> >
>
> This is all pretty well understood. We all have different interests to
> balance against each other, which is why we are perfectly fine with
> handling some corner cases in the driver. What we are not prepared to
> do is let those corner cases leak into the core crypto layer as cases
> that require special handling.
>
Which can be avoided by not selecting a driver for an application it
does not support ... if the corner case is not exercised, then no harm is
done. No need for any "leaking" through the crypto layers.

> > > I know there is a gradient here going
> > > from hashes, AEADs to symmetric ciphers, but I think this applies to
> > > all of them.
> > >
> > > > Please keep in mind that existing hardware cannot be changed. So why
> > > > wasn't the API designed around the limitations of *existing* hardware?
> > >
> > > From a software point of view, adding special cases for zero length
> > > inputs amounts to what you are trying to avoid: using more 'silicon
> > > area'.
> > >
> > No, that's actually not the reason at all in this case. We're trying to
> > avoid significant extra complexity and effort on both the hardware itself
> > and the verification thereof. Silicon area is not even in the picture as
> > a concern for something as "small" as this.
> >
> > Adding zero length support to our hardware architecture is not a trivial
> > exercise. And then you have to weigh added complexity - =added risk, when
> > you talk about hardware with multi-million dollar mask sets in play -
> > against usefulness. Zero-length support was - and still is! - simply not
> > worth the added risk and effort.
> >
>
> Of course. That is why it is perfectly fine to handle this in your driver.
>
Perfectly fine for you, maybe, but not so much for me.

Why:
Performance loss. Driver complexity vs maintenance.
Wasted effort implementing totally irrelevant cases.

> > And if you go that naive route, just fix everything in the driver, then
> > you simply end up with something terribly inefficient because all those
> > corner case checks end up in the fast path and eating up code space.
> >
>
> This is *exactly* the reason why we want this workaround in your
> driver, because if it is not in your driver, we will have to put it in
> generic code where it affects everybody.
>
> > For a someone claiming to "meet in the middle to compromise" you're
> > surely not compromising anything at all ... No offense.
> >
>
> None taken. I am really trying to work with you here, but changing
> core code to address the limitations of one particular h/w
> implementation is not something we do lightly.
>
Well, for one thing it is not "one particular h/w implementation".
As for the zero length thing, I believe it's almost all of them, based
on comments from other driver maintainers. I've only seen 1 comment saying
that the HW *did* support it. And that was a "maybe".

And my main suggestion does not require any core code changes at all.


Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines
http://www.insidesecure.com

2019-05-27 16:21:42

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: another testmgr question

On Mon, 27 May 2019 at 17:56, Pascal Van Leeuwen
<[email protected]> wrote:
>
> > > I understand that as well. But that doesn't change the fact that the
> > > application may be waiting for a loooooong (relatively speaking) time
> > > for it's results. As latency through hardware may be several orders of
> > > a magnitude larger than the time it actually takes to *process* the
> > > request. So when used synchronously the HW may appear to work at a mere
> > > fraction of its true performance.
> > >
> >
> > Of course. Sometimes you care about that, and sometimes you don't.
> >
> > > And if your main interest is in that application, you may not care so
> > > much about what the rest of the system does, even if it can use the
> > > remaining bandwidth of the accelerator.
> > >
> >
> > 100s of instances of that application, thread, etc could be running at
> > the same time, and throughput may be more important than latency.
> >
> > > In which case it may be desirable *not* to use the accelerator for that
> > > application at all due to *very* poor performance (for that application).
> > >
> > > Which would make even more cycles on the accelerator available to the
> > > other applications in the system, so that knife cuts both ways ...
> > >
> >
> > Single thread perfomance is only one metric, and it may not be the one
> > you care about most.
> >
>
> How relevant is the fact that there may be (other) situations where latency
> is not relevant for someone being in a situation where it is relevant?
>
> I was just talking about that situation where it actually is relevant and
> therefore you do *not* wish the hardware driver to be used.
>
> Ok, let's phrase it such that can be no further misunderstandings:
>
> If you want performance from a single-threaded single application that does
> synchronous, blocking, crypto API calls, then you do not want those to end
> up at our hardware. Or, very likely, pretty much any other hardware.
> The cra_priority mechanism does not allow the driver to convey such a thing.
>

Agreed.

But we came to this point in response to your assertion that a
userland application can only make meaningful use of the hardware
accelerator if it uses some kind of asynchronous API like AIO, and I
tried to explain that this is not the case.

> > > Adding tons of workarounds to drivers, for example, slows them down, makes
> > them
> > > use more CPU cycles and more power, and ultimately defeats the purpose of
> > having
> > > a hardware accelerator at all. That is actually my concern.
> >
> > If the workaround is in a driver and not on a hot path, we don't
> > really care about the memory footprint.
> >
> These workarounds are on a hot path by definition, as they have to filter
> out specific requests coming in, i.e. it affects every single request done
> to the driver. As for memory footprint: that is still relevant for embedded
> systems even today. Besides, memory footprint affects instruction cache hit
> ratio and therefore, indirectly, performance as well.
>

Of course. But what appears on the actual hot path is a single 'cbz'
instruction that is always predicted correctly, and the actual code
lives somewhere else in the binary. That is why I said *memory*
footprint not *cache* footprint, since it only affects the former and
not the latter.

> > > And as an aside, once workarounds have been implemented and proven to
> > "work", the
> > > underlying issue rarely makes it to the HW guys so we're stuck with it
> > forever.
> > >
> >
> > Well, the starting point of the argument was that you deliberately
> > omitted handling of zero length inputs to save silicon area.
> >
> Where did I ever say that we omitted that to save silicon area?
> You're putting words in my mouth (or fingers) now. I never said that,
> that is not the reason at all.
>

Fair enough. But I did understand correctly that this was a deliberate
decision, no?

> > So the
> > issue would already be known to the h/w guys, and they decided it was
> > something they'd punt to the software instead.
> >
> NO. We never decided any such thing. We decided that it was not a relevant
> use case that we needed to support at all. Neither in the hardware nor in
> the driver. Our own generic OS driver does not contain any such workarounds.
> In fact, based on this Linux driver thing we had a new internal discussion
> on it and the outcome did not change: not a relevant use case for us.
>

RIght. So how does this this relate to your remark above that working
workarounds prevent issues from being known to the h/w guys?

> > > NO. Hardware is broken if it doesn't comply to its own specifications -
> > > which *may* include references to industry standards it must comply with.
> > > If I intentionally specify that zero length hashes are not supported, and
> > > I don't pretend to comply with any industry standard that requires them,
> > > then that's just a *limitation* of the hardware, most certainly not a bug.
> >
> > Fair enough. But if you want to integrate that h/w in a system that
> > does aim to comply, it is up to the software to fix the impedance
> > mismatch.
> >
> Comply with what exactly? You can't "comply" with algorithms ... you just
> implement whatever subset makes sense and specify the constraints. You can
> comply with protocol specifications, and that's what we do. None of those
> requires zero length hashing, HMAC, cipher or AEAD operations.
> Many algorithms are unbounded anyway and hardware is bounded by definition.
>

I'll ignore the remark about boundedness since it has no bearing
whatsoever on this discussion.

As for compliance, many of the zero length test vectors were sourced
from FIPS or NIST documents, so i don't care what you call it, but it
is a perfectly reasonable requirement that new implementations work as
expected for test vectors that have been published along with the
algorithm.

Again, I am not saying your hardware should do this. I am only saying
that, from a software engineering perspective, your driver is where we
fix up the differences, not anywhere else.


> > > Hardware necessarily *always* has limitations because of all kinds of
> > > constraints: area, power, complexity. And even something as mundane as a
> > > schedule constraint where you simply can't fit all desired features in the
> > > desired schedule. Which is usually very solid due to timeslots being
> > > planned in a fab etc. We don't have the luxury of extending our schedule
> > > forever like SW guys tend to do ... we're very proud of our track record
> > > of always meeting our promised schedules. Plus - silicon can't be patched,
> > > so what's done is done and you have to live with it. For many years to
> > > come, usually.
> > >
> >
> > This is all pretty well understood. We all have different interests to
> > balance against each other, which is why we are perfectly fine with
> > handling some corner cases in the driver. What we are not prepared to
> > do is let those corner cases leak into the core crypto layer as cases
> > that require special handling.
> >
> Which can be avoided by not selecting a driver for an application it
> does not support ... if the corner case is not exercised, then no harm is
> done. No need for any "leaking" through the crypto layers.
>

True. So again, if you choose to support your hardware as part of a
subsystem that does not have these requirements, I am perfectly fine
with that.

> > > > I know there is a gradient here going
> > > > from hashes, AEADs to symmetric ciphers, but I think this applies to
> > > > all of them.
> > > >
> > > > > Please keep in mind that existing hardware cannot be changed. So why
> > > > > wasn't the API designed around the limitations of *existing* hardware?
> > > >
> > > > From a software point of view, adding special cases for zero length
> > > > inputs amounts to what you are trying to avoid: using more 'silicon
> > > > area'.
> > > >
> > > No, that's actually not the reason at all in this case. We're trying to
> > > avoid significant extra complexity and effort on both the hardware itself
> > > and the verification thereof. Silicon area is not even in the picture as
> > > a concern for something as "small" as this.
> > >
> > > Adding zero length support to our hardware architecture is not a trivial
> > > exercise. And then you have to weigh added complexity - =added risk, when
> > > you talk about hardware with multi-million dollar mask sets in play -
> > > against usefulness. Zero-length support was - and still is! - simply not
> > > worth the added risk and effort.
> > >
> >
> > Of course. That is why it is perfectly fine to handle this in your driver.
> >
> Perfectly fine for you, maybe, but not so much for me.
>
> Why:
> Performance loss.

Negligible

> Driver complexity vs maintenance.

Yes, but again, this complexity has to live *somewhere*, and we don't
want it in the generic code.

> Wasted effort implementing totally irrelevant cases.
>

I agree that it is unfortunate that we have to spend time on this.

> > > And if you go that naive route, just fix everything in the driver, then
> > > you simply end up with something terribly inefficient because all those
> > > corner case checks end up in the fast path and eating up code space.
> > >
> >
> > This is *exactly* the reason why we want this workaround in your
> > driver, because if it is not in your driver, we will have to put it in
> > generic code where it affects everybody.
> >
> > > For a someone claiming to "meet in the middle to compromise" you're
> > > surely not compromising anything at all ... No offense.
> > >
> >
> > None taken. I am really trying to work with you here, but changing
> > core code to address the limitations of one particular h/w
> > implementation is not something we do lightly.
> >
> Well, for one thing it is not "one particular h/w implementation".
> As for the zero length thing, I believe it's almost all of them, based
> on comments from other driver maintainers. I've only seen 1 comment saying
> that the HW *did* support it. And that was a "maybe".
>
> And my main suggestion does not require any core code changes at all.
>

So what exactly are you proposing? Mind you, we cannot optimize this
away, so we will have to add checks /somewhere/ that we are not
calling into the crypto code with length values it doesn't support.

2019-05-27 20:16:53

by Pascal Van Leeuwen

[permalink] [raw]
Subject: RE: another testmgr question

> >
> > How relevant is the fact that there may be (other) situations where latency
> > is not relevant for someone being in a situation where it is relevant?
> >
> > I was just talking about that situation where it actually is relevant and
> > therefore you do *not* wish the hardware driver to be used.
> >
> > Ok, let's phrase it such that can be no further misunderstandings:
> >
> > If you want performance from a single-threaded single application that does
> > synchronous, blocking, crypto API calls, then you do not want those to end
> > up at our hardware. Or, very likely, pretty much any other hardware.
> > The cra_priority mechanism does not allow the driver to convey such a
> thing.
> >
>
> Agreed.
>
> But we came to this point in response to your assertion that a
> userland application can only make meaningful use of the hardware
> accelerator if it uses some kind of asynchronous API like AIO, and I
> tried to explain that this is not the case.
>
Ah OK, now I get your point. I did not intend to say that this is never
useful, just that in a fair number (trying to be careful with the wording
here :-) of cases it really is not what you want.

> > These workarounds are on a hot path by definition, as they have to filter
> > out specific requests coming in, i.e. it affects every single request done
> > to the driver. As for memory footprint: that is still relevant for embedded
> > systems even today. Besides, memory footprint affects instruction cache hit
> > ratio and therefore, indirectly, performance as well.
> >
>
> Of course. But what appears on the actual hot path is a single 'cbz'
> instruction that is always predicted correctly, and the actual code
> lives somewhere else in the binary. That is why I said *memory*
> footprint not *cache* footprint, since it only affects the former and
> not the latter.
>
Well:
a) That would be the case for single corner case, but they all add up.
b) For zero-length HMAC it really is not that simple (for skcipher, it is).
Well, you could theoretically implement it like that, I suppose, by making
it completely independent paths, but then you'd have to duplicate a lot code
and complexity. Not maintainable. In reality, it's quite a few compares and
condition checks all over the hot path ... I suppose it could be optimized
further, but that would make it even harder to follow.

> > >
> > Where did I ever say that we omitted that to save silicon area?
> > You're putting words in my mouth (or fingers) now. I never said that,
> > that is not the reason at all.
> >
>
> Fair enough. But I did understand correctly that this was a deliberate
> decision, no?
>
Absolutely. It's a huge amount of extra complexity for the hardware.
You have no idea. It would take many manmonths to just add that the
existing design and properly verify it out, covering all the tricky
arbitration corner cases and pipeline issues it introduces.

I guess that's hard to understand for a software engineer, but the
problem really is NOT computing the zero length hash or HMAC. The
algorithm core can actually already do that. The problem is making the
whole host interface, ring management, DMA infrastructure and data
buffers and control logic deal with input packets of length zero ...


> > NO. We never decided any such thing. We decided that it was not a relevant
> > use case that we needed to support at all. Neither in the hardware nor in
> > the driver. Our own generic OS driver does not contain any such
> workarounds.
> > In fact, based on this Linux driver thing we had a new internal discussion
> > on it and the outcome did not change: not a relevant use case for us.
> >
>
> RIght. So how does this this relate to your remark above that working
> workarounds prevent issues from being known to the h/w guys?
>
I guess your point being: you know about the issue now and you're not
going to fix it. Obviously, we have to weigh every issue seperately.
If it was something we could easily fix, we would do it. I already
implemented HMAC continue - which we couldn't do either - for the next
release of this hardware. Because a) we considered it pretty useful and
b) it was something that wasn't very risky or very much effort to do.

Zero-length hash/HMAC is simpy not very useful - doing shasum's on files is
not really not a use case we're interested in - and very difficult to
implement, risking, potentially, breaking some critical proven-on-silicon
infrastructure.

> > Comply with what exactly? You can't "comply" with algorithms ... you just
> > implement whatever subset makes sense and specify the constraints. You can
> > comply with protocol specifications, and that's what we do. None of those
> > requires zero length hashing, HMAC, cipher or AEAD operations.
> > Many algorithms are unbounded anyway and hardware is bounded by definition.
> >
>
> I'll ignore the remark about boundedness since it has no bearing
> whatsoever on this discussion.
>
It has some bearing. On one end you have the length zero, on the other
end you have a 2^64-1 bit limitation. Which we also cannot do, by the way.
So those are the bounds and we "comply" with neither ...

By the way, the algorithm is specified for *bits*, not *bytes*, so to
be "compliant" you'd have to be able to process arbitrary numbers of *bits*.
So in that respect, I think the whole hash API is non-compliant :-P

> As for compliance, many of the zero length test vectors were sourced
> from FIPS or NIST documents, so i don't care what you call it, but it
> is a perfectly reasonable requirement that new implementations work as
> expected for test vectors that have been published along with the
> algorithm.
>
That's interesting though, as I don't hit these zero-length vectors until
I enable the fuzzing tests. So they're not part of the standard compliance
vectors. And I know for fact - as we deal with compliance testing - that,
for FIPS, zero length is *optional*.

> Again, I am not saying your hardware should do this. I am only saying
> that, from a software engineering perspective, your driver is where we
> fix up the differences, not anywhere else.
>
And I continue my argument that there is no real need to fix up such
differences. At least, not all of them, and not for every driver. As they
may not be relevant for any cases where you *want* to use that driver.
What I think is a mistake is the idea that you should be able to use
every driver with every possible use case. That simply makes no sense.
If you don't automatically select it, then that no longer needs to be a
requirement.

Example:
If I document that my driver should only be used with the IPsec stack,
then why should I need to worry about zero length operations, HMAC continue,
IV output values, etc. That's all not relevant there.

> > > This is all pretty well understood. We all have different interests to
> > > balance against each other, which is why we are perfectly fine with
> > > handling some corner cases in the driver. What we are not prepared to
> > > do is let those corner cases leak into the core crypto layer as cases
> > > that require special handling.
> > >
> > Which can be avoided by not selecting a driver for an application it
> > does not support ... if the corner case is not exercised, then no harm is
> > done. No need for any "leaking" through the crypto layers.
> >
>
> True. So again, if you choose to support your hardware as part of a
> subsystem that does not have these requirements, I am perfectly fine
> with that.
>
But what if that subsystem (IPsec) builds on top of the crypto API?
Because that is the exact situation at hand here.

> Negligible
>
First, that depends entirely on the complexity and amount of the workarounds.
It's often not as simple as a single conditional branch somewhere ...

We're doing performance stuff and often have to meet certain specific
hard performance targets (example: *must* keep up with 10/100/400 Gbps line
rate). Where it's really all or nothing: either you achieve it or you don't.
So losing a single cycle somewhere can sometimes make a HUGE difference.

> > Driver complexity vs maintenance.
>
> Yes, but again, this complexity has to live *somewhere*, and we don't
> want it in the generic code.
>
Of course it shouldn't be put in the generic code. It just shouldn't be
put anywhere, that's the whole point.

> > Well, for one thing it is not "one particular h/w implementation".
> > As for the zero length thing, I believe it's almost all of them, based
> > on comments from other driver maintainers. I've only seen 1 comment saying
> > that the HW *did* support it. And that was a "maybe".
> >
> > And my main suggestion does not require any core code changes at all.
> >
>
> So what exactly are you proposing? Mind you, we cannot optimize this
> away, so we will have to add checks /somewhere/ that we are not
> calling into the crypto code with length values it doesn't support.
>
Why do you have to add checks /somewhere/?
If you try to use a driver with an application it cannot support, things just
fail. As long as using driver X with application Y is a concious choice and
not something automatic, I don't see the huge problem with that. Just read
the documentation and/or follow the instructions from you h/w vendor.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines
http://www.insidesecure.com