2019-01-18 14:28:01

by Jarkko Sakkinen

[permalink] [raw]
Subject: Getting weird TPM error after rebasing my tree to security/next-general

I get this on a Geminilake NUC after rebasing my maintainer trees:

tpm tpm0: A TPM error (-1) occurred attempting the self test

I checked the latest commit ID from drivers/char/tpm to make sure
that I did not put anything broken to my last PR [1]. It works
without issues.

In addition [2] gives me an empty diff.

Something outside of the TPM driver must have happened that breaks
the driver. Any ideas?

[1] commit 9488585b21bef0df1217e510c7134905d1d376a7
[2] git diff 9488585b21bef0df1217e510c7134905d1d376a7 master drivers/char/tpm/

/Jarkko


2019-01-18 22:11:20

by James Bottomley

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Fri, 2019-01-18 at 16:25 +0200, Jarkko Sakkinen wrote:
> I get this on a Geminilake NUC after rebasing my maintainer trees:
>
> tpm tpm0: A TPM error (-1) occurred attempting the self test
>
> I checked the latest commit ID from drivers/char/tpm to make sure
> that I did not put anything broken to my last PR [1]. It works
> without issues.
>
> In addition [2] gives me an empty diff.
>
> Something outside of the TPM driver must have happened that breaks
> the driver. Any ideas?
>
> [1] commit 9488585b21bef0df1217e510c7134905d1d376a7
> [2] git diff 9488585b21bef0df1217e510c7134905d1d376a7 master
> drivers/char/tpm/

I'm afraid you're going to have to bisect to find the offending in-
kernel commit, which is going to be painful since it seems to depend on
physical hardware. My first instinct is that we're getting a zero
length read somewhere, but I still can't see anything in the merge
window that would cause that behaviour.

James


2019-01-20 16:06:04

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Fri, Jan 18, 2019 at 02:09:18PM -0800, James Bottomley wrote:
> On Fri, 2019-01-18 at 16:25 +0200, Jarkko Sakkinen wrote:
> > I get this on a Geminilake NUC after rebasing my maintainer trees:
> >
> > tpm tpm0: A TPM error (-1) occurred attempting the self test
> >
> > I checked the latest commit ID from drivers/char/tpm to make sure
> > that I did not put anything broken to my last PR [1]. It works
> > without issues.
> >
> > In addition [2] gives me an empty diff.
> >
> > Something outside of the TPM driver must have happened that breaks
> > the driver. Any ideas?
> >
> > [1] commit 9488585b21bef0df1217e510c7134905d1d376a7
> > [2] git diff 9488585b21bef0df1217e510c7134905d1d376a7 master
> > drivers/char/tpm/
>
> I'm afraid you're going to have to bisect to find the offending in-
> kernel commit, which is going to be painful since it seems to depend on
> physical hardware. My first instinct is that we're getting a zero
> length read somewhere, but I still can't see anything in the merge
> window that would cause that behaviour.

Yeah, I've started to bisect it (still 9 rounds to go).

/Jarkko

2019-01-22 01:03:47

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Sun, Jan 20, 2019 at 06:04:13PM +0200, Jarkko Sakkinen wrote:
> On Fri, Jan 18, 2019 at 02:09:18PM -0800, James Bottomley wrote:
> > On Fri, 2019-01-18 at 16:25 +0200, Jarkko Sakkinen wrote:
> > > I get this on a Geminilake NUC after rebasing my maintainer trees:
> > >
> > > tpm tpm0: A TPM error (-1) occurred attempting the self test
> > >
> > > I checked the latest commit ID from drivers/char/tpm to make sure
> > > that I did not put anything broken to my last PR [1]. It works
> > > without issues.
> > >
> > > In addition [2] gives me an empty diff.
> > >
> > > Something outside of the TPM driver must have happened that breaks
> > > the driver. Any ideas?
> > >
> > > [1] commit 9488585b21bef0df1217e510c7134905d1d376a7
> > > [2] git diff 9488585b21bef0df1217e510c7134905d1d376a7 master
> > > drivers/char/tpm/
> >
> > I'm afraid you're going to have to bisect to find the offending in-
> > kernel commit, which is going to be painful since it seems to depend on
> > physical hardware. My first instinct is that we're getting a zero
> > length read somewhere, but I still can't see anything in the merge
> > window that would cause that behaviour.
>
> Yeah, I've started to bisect it (still 9 rounds to go).

Fails on commit 170d13ca3a2fdaaa0283399247631b76b441cca2. Still works on
preceding commit a959dc88f9c8900296ccf13e2f3e1cbc555a8917.

Double-checked this after bisecting by compiling the kernel with these
commits as tip.

/Jarkko

2019-01-22 02:59:58

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Tue, Jan 22, 2019 at 03:02:18AM +0200, Jarkko Sakkinen wrote:
> On Sun, Jan 20, 2019 at 06:04:13PM +0200, Jarkko Sakkinen wrote:
> > On Fri, Jan 18, 2019 at 02:09:18PM -0800, James Bottomley wrote:
> > > On Fri, 2019-01-18 at 16:25 +0200, Jarkko Sakkinen wrote:
> > > > I get this on a Geminilake NUC after rebasing my maintainer trees:
> > > >
> > > > tpm tpm0: A TPM error (-1) occurred attempting the self test
> > > >
> > > > I checked the latest commit ID from drivers/char/tpm to make sure
> > > > that I did not put anything broken to my last PR [1]. It works
> > > > without issues.
> > > >
> > > > In addition [2] gives me an empty diff.
> > > >
> > > > Something outside of the TPM driver must have happened that breaks
> > > > the driver. Any ideas?
> > > >
> > > > [1] commit 9488585b21bef0df1217e510c7134905d1d376a7
> > > > [2] git diff 9488585b21bef0df1217e510c7134905d1d376a7 master
> > > > drivers/char/tpm/
> > >
> > > I'm afraid you're going to have to bisect to find the offending in-
> > > kernel commit, which is going to be painful since it seems to depend on
> > > physical hardware. My first instinct is that we're getting a zero
> > > length read somewhere, but I still can't see anything in the merge
> > > window that would cause that behaviour.
> >
> > Yeah, I've started to bisect it (still 9 rounds to go).
>
> Fails on commit 170d13ca3a2fdaaa0283399247631b76b441cca2. Still works on
> preceding commit a959dc88f9c8900296ccf13e2f3e1cbc555a8917.

This changes the IO access pattern in memcpy_to/fromio.. Presumably
CRB HW doesn't like the new 4 byte move? Swap each one in crb to
memcpy to confirm..

If the HW requires particular access patterns you can't use
memcpy_to/fromio

Jason

2019-01-22 13:31:57

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Mon, Jan 21, 2019 at 07:58:36PM -0700, Jason Gunthorpe wrote:
> On Tue, Jan 22, 2019 at 03:02:18AM +0200, Jarkko Sakkinen wrote:
> > On Sun, Jan 20, 2019 at 06:04:13PM +0200, Jarkko Sakkinen wrote:
> > > On Fri, Jan 18, 2019 at 02:09:18PM -0800, James Bottomley wrote:
> > > > On Fri, 2019-01-18 at 16:25 +0200, Jarkko Sakkinen wrote:
> > > > > I get this on a Geminilake NUC after rebasing my maintainer trees:
> > > > >
> > > > > tpm tpm0: A TPM error (-1) occurred attempting the self test
> > > > >
> > > > > I checked the latest commit ID from drivers/char/tpm to make sure
> > > > > that I did not put anything broken to my last PR [1]. It works
> > > > > without issues.
> > > > >
> > > > > In addition [2] gives me an empty diff.
> > > > >
> > > > > Something outside of the TPM driver must have happened that breaks
> > > > > the driver. Any ideas?
> > > > >
> > > > > [1] commit 9488585b21bef0df1217e510c7134905d1d376a7
> > > > > [2] git diff 9488585b21bef0df1217e510c7134905d1d376a7 master
> > > > > drivers/char/tpm/
> > > >
> > > > I'm afraid you're going to have to bisect to find the offending in-
> > > > kernel commit, which is going to be painful since it seems to depend on
> > > > physical hardware. My first instinct is that we're getting a zero
> > > > length read somewhere, but I still can't see anything in the merge
> > > > window that would cause that behaviour.
> > >
> > > Yeah, I've started to bisect it (still 9 rounds to go).
> >
> > Fails on commit 170d13ca3a2fdaaa0283399247631b76b441cca2. Still works on
> > preceding commit a959dc88f9c8900296ccf13e2f3e1cbc555a8917.
>
> This changes the IO access pattern in memcpy_to/fromio.. Presumably
> CRB HW doesn't like the new 4 byte move? Swap each one in crb to
> memcpy to confirm..
>
> If the HW requires particular access patterns you can't use
> memcpy_to/fromio

Did not have time to look at the commit at all but your deduction
is correct. I know it without testing.

Memory controller will feed 1's on unaligned read from IO memory,
and as we can see from the TPM header, this change causes two of
those:

struct tpm_output_header {
__be16 tag;
__be32 length;
__be32 return_code;
} __packed;

/Jarkko

2019-01-22 18:29:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Wed, Jan 23, 2019 at 2:29 AM Jarkko Sakkinen
<[email protected]> wrote:
>
> > >
> > > Fails on commit 170d13ca3a2fdaaa0283399247631b76b441cca2. Still works on
> > > preceding commit a959dc88f9c8900296ccf13e2f3e1cbc555a8917.
> >
> > This changes the IO access pattern in memcpy_to/fromio.. Presumably
> > CRB HW doesn't like the new 4 byte move? Swap each one in crb to
> > memcpy to confirm..
> >
> > If the HW requires particular access patterns you can't use
> > memcpy_to/fromio
>
> Did not have time to look at the commit at all but your deduction
> is correct. I know it without testing.
>
> Memory controller will feed 1's on unaligned read from IO memory,
> and as we can see from the TPM header, this change causes two of
> those:

Funky. But how did it work before then?

The new memcpy_fromio() is designed to have _predictable_ access
patterns. Not necessarily the best, but at least consistent.

Prevously, we used whatever random "memcpy()" implementation we
happened to pick, which *could* be aligned (particularly "rep movsb" -
absolutely horrible performance for MMIO, but by doing IO one byte at
a time it was certainly aligned ;), but most of our x86 memcpy
implementations don't actually try all that hard to align the source.
And the manual version will actually copy things *backwards* for some
cases.

Is it just that this particular hardware always happened to trigger
the ERMS case (ie "rep movsb")?

Anyway, Jason is correct that if a device has particular IO pattern
requirements, you shouldn't use "memcpy_fromio()" and friends, but
it's interesting how it apparently *happened* to work before.

Linus

2019-01-23 15:38:40

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Wed, Jan 23, 2019 at 07:26:42AM +1300, Linus Torvalds wrote:
> On Wed, Jan 23, 2019 at 2:29 AM Jarkko Sakkinen
> <[email protected]> wrote:
> >
> > > >
> > > > Fails on commit 170d13ca3a2fdaaa0283399247631b76b441cca2. Still works on
> > > > preceding commit a959dc88f9c8900296ccf13e2f3e1cbc555a8917.
> > >
> > > This changes the IO access pattern in memcpy_to/fromio.. Presumably
> > > CRB HW doesn't like the new 4 byte move? Swap each one in crb to
> > > memcpy to confirm..
> > >
> > > If the HW requires particular access patterns you can't use
> > > memcpy_to/fromio
> >
> > Did not have time to look at the commit at all but your deduction
> > is correct. I know it without testing.
> >
> > Memory controller will feed 1's on unaligned read from IO memory,
> > and as we can see from the TPM header, this change causes two of
> > those:
>
> Funky. But how did it work before then?
>
> The new memcpy_fromio() is designed to have _predictable_ access
> patterns. Not necessarily the best, but at least consistent.
>
> Prevously, we used whatever random "memcpy()" implementation we
> happened to pick, which *could* be aligned (particularly "rep movsb" -
> absolutely horrible performance for MMIO, but by doing IO one byte at
> a time it was certainly aligned ;), but most of our x86 memcpy
> implementations don't actually try all that hard to align the source.
> And the manual version will actually copy things *backwards* for some
> cases.
>
> Is it just that this particular hardware always happened to trigger
> the ERMS case (ie "rep movsb")?

This is the particular snippet in question:

memcpy_fromio(buf, priv->rsp, 6);
expected = be32_to_cpup((__be32 *) &buf[2]);
if (expected > count || expected < 6)
return -EIO;

memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);

I guess it did in the first memcpy_fromio operation since it is less
than a quad word, right? Not sure why the 2nd memcpy_fromio() operation
has worked, though.

> Anyway, Jason is correct that if a device has particular IO pattern
> requirements, you shouldn't use "memcpy_fromio()" and friends, but
> it's interesting how it apparently *happened* to work before.
>
> Linus

Sure, I'll prepare a fix ASAP.

/Jarkko

2019-01-23 18:44:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 24, 2019 at 4:36 AM Jarkko Sakkinen
<[email protected]> wrote:
> >
> > Is it just that this particular hardware always happened to trigger
> > the ERMS case (ie "rep movsb")?
>
> This is the particular snippet in question:
>
> memcpy_fromio(buf, priv->rsp, 6);
> expected = be32_to_cpup((__be32 *) &buf[2]);
> if (expected > count || expected < 6)
> return -EIO;

Ok, strange.

So what *used* to happen is that the memcpy_fromio() would just expand
as a "memcpy()", and in this case, gcc would then inline the memcpy().
In fact, gcc does it as a 4-byte access and a two-byte access from
what I can tell.

Which is actually exactly the same as memcpy_fromio() should do, just
using a different code sequence.

> memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);

This one gets turned into an out-of-line "memcpy()" in the old world
order, which depending on size will do different things, but might be
a "rep movsb". Or it might be the software expansion that does
overlapping accesses and/or backwards copies.

In the new world order, it's the "memcpy_fromio()" that willdo first
4-byte accesses for the main bulk of the copy, and then end up with a
two-byte and single-byte move to pad out the end.

> I guess it did in the first memcpy_fromio operation since it is less
> than a quad word, right? Not sure why the 2nd memcpy_fromio() operation
> has worked, though.

The first one seems to do the same thing now as it used to do, so I
don't *think* it should have mattered.

The second one looks like it is unaligned (offset 6) and doing the
4-byte io reads would fail if that device needs aligned accesses. The
old memcpy() *might* have done it with a "rep movsb" that would just
work (?).

Linus

2019-01-23 20:11:28

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Wed, Jan 23, 2019 at 05:36:38PM +0200, Jarkko Sakkinen wrote:
> On Wed, Jan 23, 2019 at 07:26:42AM +1300, Linus Torvalds wrote:
> > On Wed, Jan 23, 2019 at 2:29 AM Jarkko Sakkinen
> > <[email protected]> wrote:
> > >
> > > > >
> > > > > Fails on commit 170d13ca3a2fdaaa0283399247631b76b441cca2. Still works on
> > > > > preceding commit a959dc88f9c8900296ccf13e2f3e1cbc555a8917.
> > > >
> > > > This changes the IO access pattern in memcpy_to/fromio.. Presumably
> > > > CRB HW doesn't like the new 4 byte move? Swap each one in crb to
> > > > memcpy to confirm..
> > > >
> > > > If the HW requires particular access patterns you can't use
> > > > memcpy_to/fromio
> > >
> > > Did not have time to look at the commit at all but your deduction
> > > is correct. I know it without testing.
> > >
> > > Memory controller will feed 1's on unaligned read from IO memory,
> > > and as we can see from the TPM header, this change causes two of
> > > those:
> >
> > Funky. But how did it work before then?
> >
> > The new memcpy_fromio() is designed to have _predictable_ access
> > patterns. Not necessarily the best, but at least consistent.
> >
> > Prevously, we used whatever random "memcpy()" implementation we
> > happened to pick, which *could* be aligned (particularly "rep movsb" -
> > absolutely horrible performance for MMIO, but by doing IO one byte at
> > a time it was certainly aligned ;), but most of our x86 memcpy
> > implementations don't actually try all that hard to align the source.
> > And the manual version will actually copy things *backwards* for some
> > cases.
> >
> > Is it just that this particular hardware always happened to trigger
> > the ERMS case (ie "rep movsb")?
>
> This is the particular snippet in question:
>
> memcpy_fromio(buf, priv->rsp, 6);
> expected = be32_to_cpup((__be32 *) &buf[2]);
> if (expected > count || expected < 6)
> return -EIO;
>
> memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);
>
> I guess it did in the first memcpy_fromio operation since it is less
> than a quad word, right? Not sure why the 2nd memcpy_fromio() operation
> has worked, though.

And I wonder why 32-bit has worked before.

Tomas, you've been more involved with ME and fTPM runs there. Do you
have any clues where this could be rooted?

/Jarkko

2019-01-29 13:20:41

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 24, 2019 at 07:43:30AM +1300, Linus Torvalds wrote:
> On Thu, Jan 24, 2019 at 4:36 AM Jarkko Sakkinen
> <[email protected]> wrote:
> > >
> > > Is it just that this particular hardware always happened to trigger
> > > the ERMS case (ie "rep movsb")?
> >
> > This is the particular snippet in question:
> >
> > memcpy_fromio(buf, priv->rsp, 6);
> > expected = be32_to_cpup((__be32 *) &buf[2]);
> > if (expected > count || expected < 6)
> > return -EIO;
>
> Ok, strange.
>
> So what *used* to happen is that the memcpy_fromio() would just expand
> as a "memcpy()", and in this case, gcc would then inline the memcpy().
> In fact, gcc does it as a 4-byte access and a two-byte access from
> what I can tell.

I verified, and it is exactly as you stated:

0xffffffff814aaa33 <+51>: mov (%rax),%edx
0xffffffff814aaa35 <+53>: mov %edx,0x0(%rbp)
0xffffffff814aaa38 <+56>: movzwl 0x4(%rax),%eax
0xffffffff814aaa3c <+60>: mov %ax,0x4(%rbp)

And your new version does exactly the same thing to the first six bytes
(with different opcode, but the same memory access pattern).

/Jarkko

2019-01-31 12:29:42

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Tue, Jan 29, 2019 at 03:20:16PM +0200, Jarkko Sakkinen wrote:
> On Thu, Jan 24, 2019 at 07:43:30AM +1300, Linus Torvalds wrote:
> > On Thu, Jan 24, 2019 at 4:36 AM Jarkko Sakkinen
> > <[email protected]> wrote:
> > > >
> > > > Is it just that this particular hardware always happened to trigger
> > > > the ERMS case (ie "rep movsb")?
> > >
> > > This is the particular snippet in question:
> > >
> > > memcpy_fromio(buf, priv->rsp, 6);
> > > expected = be32_to_cpup((__be32 *) &buf[2]);
> > > if (expected > count || expected < 6)
> > > return -EIO;
> >
> > Ok, strange.
> >
> > So what *used* to happen is that the memcpy_fromio() would just expand
> > as a "memcpy()", and in this case, gcc would then inline the memcpy().
> > In fact, gcc does it as a 4-byte access and a two-byte access from
> > what I can tell.
>
> I verified, and it is exactly as you stated:
>
> 0xffffffff814aaa33 <+51>: mov (%rax),%edx
> 0xffffffff814aaa35 <+53>: mov %edx,0x0(%rbp)
> 0xffffffff814aaa38 <+56>: movzwl 0x4(%rax),%eax
> 0xffffffff814aaa3c <+60>: mov %ax,0x4(%rbp)
>
> And your new version does exactly the same thing to the first six bytes
> (with different opcode, but the same memory access pattern).

I think I have found the root cause:

memcpy_fromio(&__rsp_pa, &priv->regs_t->ctrl_rsp_pa, 8);

This is from crb_map_io(). This should be read as quad word.

I'll change it to ioread64() and see what happens. I don't know why it
even has used memcpy_fromio() in the first place. I guess, when I first
implemented the driver, I used that for no logical reason, and it has
worked since up until now.

/Jarkko

2019-01-31 16:07:04

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 02:26:06PM +0200, Jarkko Sakkinen wrote:
> On Tue, Jan 29, 2019 at 03:20:16PM +0200, Jarkko Sakkinen wrote:
> > On Thu, Jan 24, 2019 at 07:43:30AM +1300, Linus Torvalds wrote:
> > > On Thu, Jan 24, 2019 at 4:36 AM Jarkko Sakkinen
> > > <[email protected]> wrote:
> > > > >
> > > > > Is it just that this particular hardware always happened to trigger
> > > > > the ERMS case (ie "rep movsb")?
> > > >
> > > > This is the particular snippet in question:
> > > >
> > > > memcpy_fromio(buf, priv->rsp, 6);
> > > > expected = be32_to_cpup((__be32 *) &buf[2]);
> > > > if (expected > count || expected < 6)
> > > > return -EIO;
> > >
> > > Ok, strange.
> > >
> > > So what *used* to happen is that the memcpy_fromio() would just expand
> > > as a "memcpy()", and in this case, gcc would then inline the memcpy().
> > > In fact, gcc does it as a 4-byte access and a two-byte access from
> > > what I can tell.
> >
> > I verified, and it is exactly as you stated:
> >
> > 0xffffffff814aaa33 <+51>: mov (%rax),%edx
> > 0xffffffff814aaa35 <+53>: mov %edx,0x0(%rbp)
> > 0xffffffff814aaa38 <+56>: movzwl 0x4(%rax),%eax
> > 0xffffffff814aaa3c <+60>: mov %ax,0x4(%rbp)
> >
> > And your new version does exactly the same thing to the first six bytes
> > (with different opcode, but the same memory access pattern).
>
> I think I have found the root cause:
>
> memcpy_fromio(&__rsp_pa, &priv->regs_t->ctrl_rsp_pa, 8);
>
> This is from crb_map_io(). This should be read as quad word.
>
> I'll change it to ioread64() and see what happens. I don't know why it
> even has used memcpy_fromio() in the first place. I guess, when I first
> implemented the driver, I used that for no logical reason, and it has
> worked since up until now.

No, cannot be it. If you couldn't read it in two dwords, then it would
have been always broken with 32-bit build.

Anyway, just in case, I will check what address it prints out.

/Jarkko

2019-01-31 17:35:41

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 06:04:37PM +0200, Jarkko Sakkinen wrote:
> On Thu, Jan 31, 2019 at 02:26:06PM +0200, Jarkko Sakkinen wrote:
> > On Tue, Jan 29, 2019 at 03:20:16PM +0200, Jarkko Sakkinen wrote:
> > > On Thu, Jan 24, 2019 at 07:43:30AM +1300, Linus Torvalds wrote:
> > > > On Thu, Jan 24, 2019 at 4:36 AM Jarkko Sakkinen
> > > > <[email protected]> wrote:
> > > > > >
> > > > > > Is it just that this particular hardware always happened to trigger
> > > > > > the ERMS case (ie "rep movsb")?
> > > > >
> > > > > This is the particular snippet in question:
> > > > >
> > > > > memcpy_fromio(buf, priv->rsp, 6);
> > > > > expected = be32_to_cpup((__be32 *) &buf[2]);
> > > > > if (expected > count || expected < 6)
> > > > > return -EIO;
> > > >
> > > > Ok, strange.
> > > >
> > > > So what *used* to happen is that the memcpy_fromio() would just expand
> > > > as a "memcpy()", and in this case, gcc would then inline the memcpy().
> > > > In fact, gcc does it as a 4-byte access and a two-byte access from
> > > > what I can tell.
> > >
> > > I verified, and it is exactly as you stated:
> > >
> > > 0xffffffff814aaa33 <+51>: mov (%rax),%edx
> > > 0xffffffff814aaa35 <+53>: mov %edx,0x0(%rbp)
> > > 0xffffffff814aaa38 <+56>: movzwl 0x4(%rax),%eax
> > > 0xffffffff814aaa3c <+60>: mov %ax,0x4(%rbp)
> > >
> > > And your new version does exactly the same thing to the first six bytes
> > > (with different opcode, but the same memory access pattern).
> >
> > I think I have found the root cause:
> >
> > memcpy_fromio(&__rsp_pa, &priv->regs_t->ctrl_rsp_pa, 8);
> >
> > This is from crb_map_io(). This should be read as quad word.
> >
> > I'll change it to ioread64() and see what happens. I don't know why it
> > even has used memcpy_fromio() in the first place. I guess, when I first
> > implemented the driver, I used that for no logical reason, and it has
> > worked since up until now.
>
> No, cannot be it. If you couldn't read it in two dwords, then it would
> have been always broken with 32-bit build.
>
> Anyway, just in case, I will check what address it prints out.

Found something that *does* fix the issue. If I replace memcpy_*io()
calls with regular memcpy(), the driver works and all my tests pass.

/Jarkko


2019-01-31 17:47:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 9:06 AM Jarkko Sakkinen
<[email protected]> wrote:
>
> Found something that *does* fix the issue. If I replace memcpy_*io()
> calls with regular memcpy(), the driver works and all my tests pass.

That's not surprising, since that's what we used to do. And it's
horribly wrong because "memcpy()" can do things that are horribly
wrong on IO accesses. Like doing them twice, but alternatively also
"copy one byte at a time" which generally works, but is horrendously
slow for IO.

Can you check *which* memcpy_*io() triggers the issue? Maybe by
"bisecting" them (first perhaps on a file-by-file basis, and then
within a file).

Linus

2019-01-31 18:36:00

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 07:06:03PM +0200, Jarkko Sakkinen wrote:
> Found something that *does* fix the issue. If I replace memcpy_*io()
> calls with regular memcpy(), the driver works and all my tests pass.

OK, so the length of the response is not trashed, but only the error
code. The attached patch fully fixes the issue.

Here's the header again:

struct tpm_output_header {
__be16 tag;
__be32 length;
__be32 return_code;
} __packed;

The first to fields *are* read correctly and the last field get 1's
(thus TPM error -1).

With the attached the patch things work properly, but still
unsatisfactory fix (return to old behavior because it seems to
work).

/Jarkko


Attachments:
(No filename) (694.00 B)
0001-tpm-tpm_crb-Revert-to-memcpy-for-copying-tail-of-the.patch (930.00 B)
Download all attachments

2019-01-31 18:49:43

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 09:43:42AM -0800, Linus Torvalds wrote:
> On Thu, Jan 31, 2019 at 9:06 AM Jarkko Sakkinen
> <[email protected]> wrote:
> >
> > Found something that *does* fix the issue. If I replace memcpy_*io()
> > calls with regular memcpy(), the driver works and all my tests pass.
>
> That's not surprising, since that's what we used to do. And it's
> horribly wrong because "memcpy()" can do things that are horribly
> wrong on IO accesses. Like doing them twice, but alternatively also
> "copy one byte at a time" which generally works, but is horrendously
> slow for IO.

Yup, was just a sanity check.

> Can you check *which* memcpy_*io() triggers the issue? Maybe by
> "bisecting" them (first perhaps on a file-by-file basis, and then
> within a file).

Already did that. See my follow-up.

/Jarkko

2019-01-31 18:53:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 10:35 AM Jarkko Sakkinen
<[email protected]> wrote:
>
> OK, so the length of the response is not trashed, but only the error
> code. The attached patch fully fixes the issue.
>
> Here's the header again:
>
> struct tpm_output_header {
> __be16 tag;
> __be32 length;
> __be32 return_code;
> } __packed;
>
> The first to fields *are* read correctly and the last field get 1's
> (thus TPM error -1).

Ok, so this makes sense, even though that patch is (I think) completely wrong.

What happens is that the 32-bit fields are mis-aligned: the "tag" is
obviously properly 16-bit aligned, but then both "length" and
"return_code" are 32-bit fields that are only aligned on a 16-bit
alignment.

What happens is that first you copy the two first fields:

memcpy_fromio(buf, priv->rsp, 6);

which copies "tag" and "length", but it copies them by reading then as
a 4-byte and then 2-byte value (in that order). So it actually reads
'tag' and 'first two bytes of 'length', and then the second access
reads the last two bytes of 'length'

And it all works, because the accesses are aligned by address of
access, even though they are *not* aligned in the 'struct
tpm_output_header' fields.

But then later on, when you read 'return_code', and do

memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);

you now do a 4-byte memcpy at offset 6. So it does a 4-byte access,
bit it's not 4-byte aligned.

Honestly, memcpy() itself shouldn't have worked *either*, but you
lucked out. Gcc doesn't know that it's a 4-byte access, so it actually
calls out to the memcpy() routine. And that one happened to be "rep
movsb" on your machine. And that happened to work.

But it's really not supposed to work, and it really *wouldn't* have
worked if somebody disabled the rep-string functions.

In fact, we have another patch (that isn't applied) that makes even
the memcpy_erms() just call the sw version of memcpy() for short
copies (because "rep movsb" is slow for those cases). That would also
have broken your driver.

Linus

2019-01-31 18:55:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 10:51 AM Linus Torvalds
<[email protected]> wrote:
>
> In fact, we have another patch (that isn't applied) that makes even
> the memcpy_erms() just call the sw version of memcpy() for short
> copies (because "rep movsb" is slow for those cases). That would also
> have broken your driver.

I think what I should do is to just make "memcpy_*io()" do the "align
naturally" thing.

Let me cook up a patch for you to test.

Linus

2019-01-31 19:11:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 10:52 AM Linus Torvalds
<[email protected]> wrote:
>
> I think what I should do is to just make "memcpy_*io()" do the "align
> naturally" thing.
>
> Let me cook up a patch for you to test.

Does this work for you?

I haven't tested it at all, but I verified that the generated code
seems to make at least some amount of sense.

Linus


Attachments:
patch.diff (1.56 kB)

2019-01-31 19:54:15

by Tomas Winkler

[permalink] [raw]
Subject: RE: Getting weird TPM error after rebasing my tree to security/next-general


>
> On Thu, Jan 31, 2019 at 10:52 AM Linus Torvalds <torvalds@linux-
> foundation.org> wrote:
> >
> > I think what I should do is to just make "memcpy_*io()" do the "align
> > naturally" thing.
> >
> > Let me cook up a patch for you to test.
>
> Does this work for you?
>
> I haven't tested it at all, but I verified that the generated code seems to make
> at least some amount of sense.
>
> Linus

So dig into the spec and I think this is a bit relevant.

TPM TCG according the spec requires that all buffer access is done sequentially from the start to end of the payload,
Simply In case of skipping or going back the transaction is aborted.
The write transactions should be 1 or power of 2. So in general 6 byte read should not work. But I'm sure our hw really obey this restriction.

Thanks
Tomas

2019-01-31 20:07:32

by Tomas Winkler

[permalink] [raw]
Subject: RE: Getting weird TPM error after rebasing my tree to security/next-general



> -----Original Message-----
> From: Winkler, Tomas
> Sent: Thursday, January 31, 2019 21:47
> To: 'Linus Torvalds' <[email protected]>; Jarkko Sakkinen
> <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>; James Bottomley
> <[email protected]>; [email protected];
> [email protected]; Linux List Kernel Mailing <linux-
> [email protected]>
> Subject: RE: Getting weird TPM error after rebasing my tree to security/next-
> general
>
>
> >
> > On Thu, Jan 31, 2019 at 10:52 AM Linus Torvalds <torvalds@linux-
> > foundation.org> wrote:
> > >
> > > I think what I should do is to just make "memcpy_*io()" do the
> > > "align naturally" thing.
> > >
> > > Let me cook up a patch for you to test.
> >
> > Does this work for you?
> >
> > I haven't tested it at all, but I verified that the generated code
> > seems to make at least some amount of sense.
> >
> > Linus
>
> So dig into the spec and I think this is a bit relevant.
>
> TPM TCG according the spec requires that all buffer access is done sequentially
> from the start to end of the payload, Simply In case of skipping or going back
> the transaction is aborted.
> The write transactions should be 1 or power of 2. So in general 6 byte read
> should not work. But I'm sure our hw really obey this restriction.

For alignment of register access the spec specifies:

Some instantiating hardware may have size and alignment restrictions when accessing any of the fields in this interface. Some hardware may require access to any of data within a field or buffer be performed by reads or writes which are naturally aligned. For this reason, software should access the fields and buffers defined in this interface using instructions which do not cause an access to cross an alignment boundary and should not use string move instructions

This is for register address, I'm not sure this spans also for request response buffer, but from the behavior looks like it applies there two.


Thanks
Tomas

2019-01-31 22:38:52

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 10:51:03AM -0800, Linus Torvalds wrote:
> On Thu, Jan 31, 2019 at 10:35 AM Jarkko Sakkinen
> <[email protected]> wrote:
> >
> > OK, so the length of the response is not trashed, but only the error
> > code. The attached patch fully fixes the issue.
> >
> > Here's the header again:
> >
> > struct tpm_output_header {
> > __be16 tag;
> > __be32 length;
> > __be32 return_code;
> > } __packed;
> >
> > The first to fields *are* read correctly and the last field get 1's
> > (thus TPM error -1).
>
> Ok, so this makes sense, even though that patch is (I think) completely wrong.

I don't disagree on that :-) Just pinpointed the location where it
fails.

> What happens is that the 32-bit fields are mis-aligned: the "tag" is
> obviously properly 16-bit aligned, but then both "length" and
> "return_code" are 32-bit fields that are only aligned on a 16-bit
> alignment.
>
> What happens is that first you copy the two first fields:
>
> memcpy_fromio(buf, priv->rsp, 6);
>
> which copies "tag" and "length", but it copies them by reading then as
> a 4-byte and then 2-byte value (in that order). So it actually reads
> 'tag' and 'first two bytes of 'length', and then the second access
> reads the last two bytes of 'length'
>
> And it all works, because the accesses are aligned by address of
> access, even though they are *not* aligned in the 'struct
> tpm_output_header' fields.

Right, they are still naturally aligned accesses.

> But then later on, when you read 'return_code', and do
>
> memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);
>
> you now do a 4-byte memcpy at offset 6. So it does a 4-byte access,
> bit it's not 4-byte aligned.
>
> Honestly, memcpy() itself shouldn't have worked *either*, but you
> lucked out. Gcc doesn't know that it's a 4-byte access, so it actually
> calls out to the memcpy() routine. And that one happened to be "rep
> movsb" on your machine. And that happened to work.

I understand what you mean. Just surprised that this hasn't failed
before to anyone (the same driver has been even successfully used
on ARM64 with TrustZone based fTPM implementation). It has been in
for three years now.

> But it's really not supposed to work, and it really *wouldn't* have
> worked if somebody disabled the rep-string functions.
>
> In fact, we have another patch (that isn't applied) that makes even
> the memcpy_erms() just call the sw version of memcpy() for short
> copies (because "rep movsb" is slow for those cases). That would also
> have broken your driver.
>
> Linus

/Jarkko

2019-01-31 22:40:03

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 11:10:20AM -0800, Linus Torvalds wrote:
> On Thu, Jan 31, 2019 at 10:52 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > I think what I should do is to just make "memcpy_*io()" do the "align
> > naturally" thing.
> >
> > Let me cook up a patch for you to test.
>
> Does this work for you?
>
> I haven't tested it at all, but I verified that the generated code
> seems to make at least some amount of sense.

I'll try it first thing when I wake up tomorrow (11PM in Finland ATM).
Appreciate for taking time on this. Thank you.

/Jarkko

2019-01-31 23:05:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 12:47 PM Jarkko Sakkinen
<[email protected]> wrote:
>
> I'll try it first thing when I wake up tomorrow (11PM in Finland ATM).

Thanks.

> Appreciate for taking time on this.

Hey, it was my commit that broke it for you. Even if it happened to
work before, and only did so by pure luck, it was a functional
regression.

I get very upset when other developers don't step up when *their*
changes break something, and I don't consider "it shouldn't have
worked in the first place" to be a valid excuse. You broke it, you'd
better fix it.

So I had better fix my own mess too, in order to not look too hypocritical.

And I was very aware that hardcoding the memcpy_*io() access patterns
might break something. I just _hoped_ it wouldn't, because we actually
ended up going back to the very original access patterns (but it was
from a long long time ago).

In fact, while it's slightly annoying, in many ways it's actually good
that we found breakage, and could pinpoint exactly *why* it broke.
That does validate the whole "we shouldn't just depend on the random
implementation detail of 'memcpy()'" argument.

So I'll wait to hear back whether that patch fixes things for you, but
I _think_ it will, and we'll be better off in the long range with this
whole thing.

Linus

2019-01-31 23:35:25

by Jerry Snitselaar

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu Jan 31 19, Linus Torvalds wrote:
>On Thu, Jan 31, 2019 at 12:47 PM Jarkko Sakkinen
><[email protected]> wrote:
>>
>> I'll try it first thing when I wake up tomorrow (11PM in Finland ATM).
>
>Thanks.
>
>> Appreciate for taking time on this.
>
>Hey, it was my commit that broke it for you. Even if it happened to
>work before, and only did so by pure luck, it was a functional
>regression.
>
>I get very upset when other developers don't step up when *their*
>changes break something, and I don't consider "it shouldn't have
>worked in the first place" to be a valid excuse. You broke it, you'd
>better fix it.
>
>So I had better fix my own mess too, in order to not look too hypocritical.
>
>And I was very aware that hardcoding the memcpy_*io() access patterns
>might break something. I just _hoped_ it wouldn't, because we actually
>ended up going back to the very original access patterns (but it was
>from a long long time ago).
>
>In fact, while it's slightly annoying, in many ways it's actually good
>that we found breakage, and could pinpoint exactly *why* it broke.
>That does validate the whole "we shouldn't just depend on the random
>implementation detail of 'memcpy()'" argument.
>
>So I'll wait to hear back whether that patch fixes things for you, but
>I _think_ it will, and we'll be better off in the long range with this
>whole thing.
>
> Linus

I just did a quick test here of booting and running a couple commands
(tpm2_getcap, tpm2_pcrlist), and the patch seems to work for me. I was
seeing the error during tpm_crb initialization without the patch.

2019-02-01 08:14:05

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 07:47:19PM +0000, Winkler, Tomas wrote:
>
> >
> > On Thu, Jan 31, 2019 at 10:52 AM Linus Torvalds <torvalds@linux-
> > foundation.org> wrote:
> > >
> > > I think what I should do is to just make "memcpy_*io()" do the "align
> > > naturally" thing.
> > >
> > > Let me cook up a patch for you to test.
> >
> > Does this work for you?
> >
> > I haven't tested it at all, but I verified that the generated code seems to make
> > at least some amount of sense.
> >
> > Linus
>
> So dig into the spec and I think this is a bit relevant.
>
> TPM TCG according the spec requires that all buffer access is done
> sequentially from the start to end of the payload, Simply In case of
> skipping or going back the transaction is aborted. The write
> transactions should be 1 or power of 2. So in general 6 byte read
> should not work. But I'm sure our hw really obey this restriction.

We could easily read the first 8 bytes instead of 6, and then check the
length. The benefit in doing that would be to be able to backport the
patch to stable kernels.

I'm fine with either solution.

Applying Linus' fix might anyway a good idea since given that tpm_crb
broke down, there is a finite probability that there might be some other
drivers that break down for similar reasons, but no one has noticed so
far.

Thus, I might even propose doing both...

/Jarkko

2019-02-01 12:08:57

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 04:31:36PM -0700, Jerry Snitselaar wrote:
> On Thu Jan 31 19, Linus Torvalds wrote:
> > On Thu, Jan 31, 2019 at 12:47 PM Jarkko Sakkinen
> > <[email protected]> wrote:
> > >
> > > I'll try it first thing when I wake up tomorrow (11PM in Finland ATM).
> >
> > Thanks.
> >
> > > Appreciate for taking time on this.
> >
> > Hey, it was my commit that broke it for you. Even if it happened to
> > work before, and only did so by pure luck, it was a functional
> > regression.
> >
> > I get very upset when other developers don't step up when *their*
> > changes break something, and I don't consider "it shouldn't have
> > worked in the first place" to be a valid excuse. You broke it, you'd
> > better fix it.
> >
> > So I had better fix my own mess too, in order to not look too hypocritical.
> >
> > And I was very aware that hardcoding the memcpy_*io() access patterns
> > might break something. I just _hoped_ it wouldn't, because we actually
> > ended up going back to the very original access patterns (but it was
> > from a long long time ago).
> >
> > In fact, while it's slightly annoying, in many ways it's actually good
> > that we found breakage, and could pinpoint exactly *why* it broke.
> > That does validate the whole "we shouldn't just depend on the random
> > implementation detail of 'memcpy()'" argument.
> >
> > So I'll wait to hear back whether that patch fixes things for you, but
> > I _think_ it will, and we'll be better off in the long range with this
> > whole thing.
> >
> > Linus
>
> I just did a quick test here of booting and running a couple commands
> (tpm2_getcap, tpm2_pcrlist), and the patch seems to work for me. I was
> seeing the error during tpm_crb initialization without the patch.

Works for me too. Just submitted tpm_crb level fix too for the reasons
explained in accompanied commit message.

Reviewed-by: Jarkko Sakkinen <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>

/Jarkko

2019-02-01 18:24:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Thu, Jan 31, 2019 at 12:45 PM Jarkko Sakkinen
<[email protected]> wrote:
>
> I understand what you mean. Just surprised that this hasn't failed
> before to anyone (the same driver has been even successfully used
> on ARM64 with TrustZone based fTPM implementation). It has been in
> for three years now.

Just to finish this thread off: it turns out that both ARM and ARM64
worked fine, because neither did a memcpy(), but had explicit IO copy
routines.

And in those explicit routines, 32-bit ARM did only byte accesses, and
64-bit ARM did 8-byte accesses for the bulk transfer part, but byte
accesses for the unaligned head and tail of the IO area.

So I think it's all good. x86 used to work by luck (either because all
machines that used that TPM chip always had ERMS, or because the
people who didn't have it never cared), and ARM just worked because it
would never do unaligned IO accesses anyway (well, I guess you can
force them with "readl()" on an unaligned address, but then you just
have yourself to blame).

Linus

2019-02-04 14:17:38

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: Getting weird TPM error after rebasing my tree to security/next-general

On Fri, Feb 01, 2019 at 10:04:35AM -0800, Linus Torvalds wrote:
> On Thu, Jan 31, 2019 at 12:45 PM Jarkko Sakkinen
> <[email protected]> wrote:
> >
> > I understand what you mean. Just surprised that this hasn't failed
> > before to anyone (the same driver has been even successfully used
> > on ARM64 with TrustZone based fTPM implementation). It has been in
> > for three years now.
>
> Just to finish this thread off: it turns out that both ARM and ARM64
> worked fine, because neither did a memcpy(), but had explicit IO copy
> routines.
>
> And in those explicit routines, 32-bit ARM did only byte accesses, and
> 64-bit ARM did 8-byte accesses for the bulk transfer part, but byte
> accesses for the unaligned head and tail of the IO area.
>
> So I think it's all good. x86 used to work by luck (either because all
> machines that used that TPM chip always had ERMS, or because the
> people who didn't have it never cared), and ARM just worked because it
> would never do unaligned IO accesses anyway (well, I guess you can
> force them with "readl()" on an unaligned address, but then you just
> have yourself to blame).
>
> Linus

OK, thanks for the summary. This kind of answered to my question. Should
be sufficient to include the tpm_crb fix to the 5.1 PR.

/Jarkko