2022-12-28 20:51:25

by Vlastimil Babka

[permalink] [raw]
Subject: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

Ugh, while the problem [1] was fixed in 6.1, it's now happening again on
the T460 with 6.2-rc1. Except I didn't see any oops message or
"tpm_try_transmit" error this time. The first indication of a problem is
this during a resume from suspend to ram:

tpm tpm0: A TPM error (28) occurred continue selftest

and then periodically

tpm tpm0: A TPM error (28) occurred attempting get random

and further suspend to ram attempts fail:

tpm tpm0: Error (28) sending savestate before suspend
tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
tpm_tis 00:08: PM: failed to suspend: error 28
PM: Some devices failed to suspend, or early wake event detected

Unfortunately I doubt I would be able to bisect it as any "good" kernel might
be a false negative.

[1] https://lore.kernel.org/all/[email protected]/

#regzbot introduced: v6.1..v6.2-rc1


2022-12-28 23:31:44

by James Bottomley

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> Ugh, while the problem [1] was fixed in 6.1, it's now happening again
> on the T460 with 6.2-rc1. Except I didn't see any oops message or
> "tpm_try_transmit" error this time. The first indication of a problem
> is this during a resume from suspend to ram:
>
> tpm tpm0: A TPM error (28) occurred continue selftest
>
> and then periodically
>
> tpm tpm0: A TPM error (28) occurred attempting get random

That's a TPM 1.2 error which means the TPM failed the selftest. The
original problem was reported against TPM 2.0 because of a missing
try_get_ops(). The tpm 1.2 command path was never changed to require
this (and in fact hasn't changed for ages, TPM 1.2 being a bit
obsolete). So this looks like a new problem with TPM 1.2 and
suspend/resume, likely in the BIOS of your system.

James

2022-12-29 04:52:02

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Thu, Dec 29, 2022 at 05:03:44AM +0100, Jason A. Donenfeld wrote:
> I recall seeing something pretty similar to this report with the
> selftest as well. Basically, the call to tpm1_do_selftest can race with
> the call to tpm1_get_random, presumably because tpm1_get_random doesn't
> do any locking on its own. So it might be a good idea to make sure that
> tpm1_get_random() isn't running before tpm1_do_selftest() or any other
> TPM commands run.

The other locking angle is that tpm1_pm_suspend() should wait for
tpm1_get_random() to finish or cancel tpm1_get_random(), if that's not
already happening. IIRC, the selftest gets tripped up when it's
triggered on resume due to an already in-flight tpm1_get_random() from
prior to sleep, that never completed.

2022-12-29 04:55:34

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> > Ugh, while the problem [1] was fixed in 6.1, it's now happening again
> > on the T460 with 6.2-rc1. Except I didn't see any oops message or
> > "tpm_try_transmit" error this time. The first indication of a problem
> > is this during a resume from suspend to ram:
> >
> > tpm tpm0: A TPM error (28) occurred continue selftest
> >
> > and then periodically
> >
> > tpm tpm0: A TPM error (28) occurred attempting get random
>
> That's a TPM 1.2 error which means the TPM failed the selftest. The
> original problem was reported against TPM 2.0 because of a missing
> try_get_ops().

No, I'm pretty sure the original bug, which was fixed by "char: tpm:
Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
considering it's the same hardware from Vlastimil causing this. I also
recall seeing this in 1.2 when I ran this with the TPM emulator. So
that's not correct.

> The tpm 1.2 command path was never changed to require
> this (and in fact hasn't changed for ages, TPM 1.2 being a bit
> obsolete).

False. I'll just copy and paste the diff of that commit:

diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
index 1621ce818705..d69905233aff 100644
--- a/drivers/char/tpm/tpm-interface.c
+++ b/drivers/char/tpm/tpm-interface.c
@@ -401,13 +401,14 @@ int tpm_pm_suspend(struct device *dev)
!pm_suspend_via_firmware())
goto suspended;

- if (!tpm_chip_start(chip)) {
+ rc = tpm_try_get_ops(chip);
+ if (!rc) {
if (chip->flags & TPM_CHIP_FLAG_TPM2)
tpm2_shutdown(chip, TPM2_SU_STATE);
else
rc = tpm1_pm_suspend(chip, tpm_suspend_pcr);

- tpm_chip_stop(chip);
+ tpm_put_ops(chip);
}

suspended:


So, as you can see, this affects the call to tpm1_pm_suspend.

> So this looks like a new problem with TPM 1.2 and
> suspend/resume, likely in the BIOS of your system.

No, this is not a BIOS problem. This is a kernel bug in the TPM 1.2
driver. Yes, it'd be very nice to wave this all away and blame it on the
BIOS, but I really don't think that's the case, especially considering
this is all reproducible in the emulator.

I recall seeing something pretty similar to this report with the
selftest as well. Basically, the call to tpm1_do_selftest can race with
the call to tpm1_get_random, presumably because tpm1_get_random doesn't
do any locking on its own. So it might be a good idea to make sure that
tpm1_get_random() isn't running before tpm1_do_selftest() or any other
TPM commands run.

I spent a long time working through the TPM code when this came up
during 6.1. I set up the TPM emulator with QEMU and reproduced this and
had a whole test setup and S3 fuzzer. It took a long time, and when I was
done, I paged it all out of my brain. When I found that patch from Jan
that fixed the problem most of the time (but not all the time), I wasted
tons of time trying to get the TPM maintainers to take the patch and
send it to Linus as part of rc7 or rc8. But they all ignored me, and
eventually Linus just took that patch directly.

So I don't think I want to go down another rabbit hole here, having
experienced the TPM maintainers not really caring much, and that sucking
away the remaining energy I had before to keep looking at the issue and
its edge cases not handled by Jan's patch.

On the contrary, it'd make a big difference if the TPM maintainers could
actually help analyze the code that they're most familiar with, so that
we can get to the bottom of this. That's a lot better than some random
drive-by patches from a non-TPM person like me; before the 6.1 bug, I'd
never even looked at these drivers.

My plan is to therefore be available to help and analyze and test and
maybe even write some code, if the TPM maintainers take the lead on
getting to the bottom of this. But if this hits neglect again like last
time, I'll just send a `depends on BROKEN if PM` patch to the TPM
hw_random driver and see what happens... That's a really awful solution
though, so I hope the maintainers will wake up this cycle.

Regards,
Jason

2023-01-04 09:21:43

by Johannes Altmanninger

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Wed, Dec 28, 2022 at 09:22:56PM +0100, Vlastimil Babka wrote:
> Ugh, while the problem [1] was fixed in 6.1, it's now happening again on
> the T460 with 6.2-rc1. Except I didn't see any oops message or
> "tpm_try_transmit" error this time. The first indication of a problem is
> this during a resume from suspend to ram:
>
> tpm tpm0: A TPM error (28) occurred continue selftest
>
> and then periodically
>
> tpm tpm0: A TPM error (28) occurred attempting get random
>
> and further suspend to ram attempts fail:
>
> tpm tpm0: Error (28) sending savestate before suspend
> tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
> tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
> tpm_tis 00:08: PM: failed to suspend: error 28
> PM: Some devices failed to suspend, or early wake event detected
>
> Unfortunately I doubt I would be able to bisect it as any "good" kernel might
> be a false negative.
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> #regzbot introduced: v6.1..v6.2-rc1

I see almost exactly the same symptoms with v6.1.1 on a T460s.
No "tpm_try_transmit" etc. The only difference is that I get 0x20 instead
of 0x10 (that's probably immaterial).

Also, I have this line immediately before the tpm error.

psmouse serio2: Failed to disable mouse on synaptics-pt/serio0
tpm tpm0: Error (28) sending savestate before suspend
[...]

In the past I have had similar problems on another Thinkpad where a touchpad
driver prevented suspend. Unloading the module (I think psmouse, not sure)
helped. I haven't tried this here since it only happens sometimes.

2023-01-05 14:16:39

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On 29.12.22 05:03, Jason A. Donenfeld wrote:
> On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
>> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
>>> Ugh, while the problem [1] was fixed in 6.1, it's now happening again
>>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
>>> "tpm_try_transmit" error this time. The first indication of a problem
>>> is this during a resume from suspend to ram:
>>>
>>> tpm tpm0: A TPM error (28) occurred continue selftest
>>>
>>> and then periodically
>>>
>>> tpm tpm0: A TPM error (28) occurred attempting get random
>>
>> That's a TPM 1.2 error which means the TPM failed the selftest. The
>> original problem was reported against TPM 2.0 because of a missing
>> try_get_ops().
>
> No, I'm pretty sure the original bug, which was fixed by "char: tpm:
> Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
> considering it's the same hardware from Vlastimil causing this. I also
> recall seeing this in 1.2 when I ran this with the TPM emulator. So
> that's not correct.

James, are you or some other TPM developer looking into this? Or is this
deadlocked now? And if so: how can we get this unstuck to get this
regression solved?

Side note: I wonder if the problem that Johannes reported yesterday in
this thread (
https://lore.kernel.org/all/[email protected]/
) is related or something else, as it seems his issue happens with 6.1,
while Vlastimil's problems should be fixed there. Or am I missing something?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

>> The tpm 1.2 command path was never changed to require
>> this (and in fact hasn't changed for ages, TPM 1.2 being a bit
>> obsolete).
>
> False. I'll just copy and paste the diff of that commit:
>
> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> index 1621ce818705..d69905233aff 100644
> --- a/drivers/char/tpm/tpm-interface.c
> +++ b/drivers/char/tpm/tpm-interface.c
> @@ -401,13 +401,14 @@ int tpm_pm_suspend(struct device *dev)
> !pm_suspend_via_firmware())
> goto suspended;
>
> - if (!tpm_chip_start(chip)) {
> + rc = tpm_try_get_ops(chip);
> + if (!rc) {
> if (chip->flags & TPM_CHIP_FLAG_TPM2)
> tpm2_shutdown(chip, TPM2_SU_STATE);
> else
> rc = tpm1_pm_suspend(chip, tpm_suspend_pcr);
>
> - tpm_chip_stop(chip);
> + tpm_put_ops(chip);
> }
>
> suspended:
>
>
> So, as you can see, this affects the call to tpm1_pm_suspend.
>
>> So this looks like a new problem with TPM 1.2 and
>> suspend/resume, likely in the BIOS of your system.
>
> No, this is not a BIOS problem. This is a kernel bug in the TPM 1.2
> driver. Yes, it'd be very nice to wave this all away and blame it on the
> BIOS, but I really don't think that's the case, especially considering
> this is all reproducible in the emulator.
>
> I recall seeing something pretty similar to this report with the
> selftest as well. Basically, the call to tpm1_do_selftest can race with
> the call to tpm1_get_random, presumably because tpm1_get_random doesn't
> do any locking on its own. So it might be a good idea to make sure that
> tpm1_get_random() isn't running before tpm1_do_selftest() or any other
> TPM commands run.
>
> I spent a long time working through the TPM code when this came up
> during 6.1. I set up the TPM emulator with QEMU and reproduced this and
> had a whole test setup and S3 fuzzer. It took a long time, and when I was
> done, I paged it all out of my brain. When I found that patch from Jan
> that fixed the problem most of the time (but not all the time), I wasted
> tons of time trying to get the TPM maintainers to take the patch and
> send it to Linus as part of rc7 or rc8. But they all ignored me, and
> eventually Linus just took that patch directly.
>
> So I don't think I want to go down another rabbit hole here, having
> experienced the TPM maintainers not really caring much, and that sucking
> away the remaining energy I had before to keep looking at the issue and
> its edge cases not handled by Jan's patch.
>
> On the contrary, it'd make a big difference if the TPM maintainers could
> actually help analyze the code that they're most familiar with, so that
> we can get to the bottom of this. That's a lot better than some random
> drive-by patches from a non-TPM person like me; before the 6.1 bug, I'd
> never even looked at these drivers.
>
> My plan is to therefore be available to help and analyze and test and
> maybe even write some code, if the TPM maintainers take the lead on
> getting to the bottom of this. But if this hits neglect again like last
> time, I'll just send a `depends on BROKEN if PM` patch to the TPM
> hw_random driver and see what happens... That's a really awful solution
> though, so I hope the maintainers will wake up this cycle.
>
> Regards,
> Jason
>
>

2023-01-05 14:44:22

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On 1/5/23 14:59, Thorsten Leemhuis wrote:
> On 29.12.22 05:03, Jason A. Donenfeld wrote:
>> On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
>>> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
>>>> Ugh, while the problem [1] was fixed in 6.1, it's now happening again
>>>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
>>>> "tpm_try_transmit" error this time. The first indication of a problem
>>>> is this during a resume from suspend to ram:
>>>>
>>>> tpm tpm0: A TPM error (28) occurred continue selftest
>>>>
>>>> and then periodically
>>>>
>>>> tpm tpm0: A TPM error (28) occurred attempting get random
>>>
>>> That's a TPM 1.2 error which means the TPM failed the selftest. The
>>> original problem was reported against TPM 2.0 because of a missing
>>> try_get_ops().
>>
>> No, I'm pretty sure the original bug, which was fixed by "char: tpm:
>> Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
>> considering it's the same hardware from Vlastimil causing this. I also
>> recall seeing this in 1.2 when I ran this with the TPM emulator. So
>> that's not correct.
>
> James, are you or some other TPM developer looking into this? Or is this
> deadlocked now? And if so: how can we get this unstuck to get this
> regression solved?
>
> Side note: I wonder if the problem that Johannes reported yesterday in
> this thread (
> https://lore.kernel.org/all/[email protected]/
> ) is related or something else, as it seems his issue happens with 6.1,
> while Vlastimil's problems should be fixed there. Or am I missing something?

Yaeh, as Jason noted, the fix in 6.1 probably covered only the most common
race scenario, yet other(s) were still possible, just more rare. So most
likely I just happened not to trigger it anymore with 6.1, only after
switching to 6.2 - but it's not really a new bug in 6.2. And 6.1 also only
started to exhibit it because hwrng started using the tpm, which would just
sit unused before that.

> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>

2023-01-05 14:54:13

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH] tpm: Disable hwrng for TPM 1 if PM_SLEEP is enabled

TPM 1's support for its hardware RNG is broken across system suspends,
due to races or locking issues or something else that haven't been
diagnosed or fixed yet. These issues prevent the system from actually
suspending. So disable the driver in this case. Later, when this is
fixed properly, we can remove this.

Current breakage amounts to something like:

tpm tpm0: A TPM error (28) occurred continue selftest
...
tpm tpm0: A TPM error (28) occurred attempting get random
...
tpm tpm0: Error (28) sending savestate before suspend
tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
tpm_tis 00:08: PM: failed to suspend: error 28
PM: Some devices failed to suspend, or early wake event detected

This issue was partially fixed by 23393c646142 ("char: tpm: Protect
tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
directly because the TPM maintainers weren't available. However, it
seems like this just addresses the most common cases of the bug, rather
than addressing it entirely. So there are more things to fix still,
apparently.

The hwrng driver appears already to be occasionally disabled due to
other conditions, so this shouldn't be too large of a surprise.

Link: https://lore.kernel.org/lkml/[email protected]/
Cc: [email protected] # 6.1+
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
drivers/char/tpm/tpm-chip.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
index 741d8f3e8fb3..eed67ea8d3a7 100644
--- a/drivers/char/tpm/tpm-chip.c
+++ b/drivers/char/tpm/tpm-chip.c
@@ -524,6 +524,14 @@ static int tpm_add_hwrng(struct tpm_chip *chip)
if (!IS_ENABLED(CONFIG_HW_RANDOM_TPM) || tpm_is_firmware_upgrade(chip))
return 0;

+ /*
+ * This driver's support for using the RNG across suspend is broken on
+ * TPM1. Until somebody fixes this, just stop registering a HWRNG in
+ * that case.
+ */
+ if (!(chip->flags & TPM_CHIP_FLAG_TPM2) && IS_ENABLED(CONFIG_PM_SLEEP))
+ return 0;
+
snprintf(chip->hwrng_name, sizeof(chip->hwrng_name),
"tpm-rng-%d", chip->dev_num);
chip->hwrng.name = chip->hwrng_name;
--
2.39.0

2023-01-05 15:45:04

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH] tpm: Disable hwrng for TPM 1 if PM_SLEEP is enabled

On Thu, Jan 05, 2023 at 03:47:42PM +0100, Jason A. Donenfeld wrote:
> TPM 1's support for its hardware RNG is broken across system suspends,
> due to races or locking issues or something else that haven't been
> diagnosed or fixed yet. These issues prevent the system from actually
> suspending. So disable the driver in this case. Later, when this is
> fixed properly, we can remove this.
>
> Current breakage amounts to something like:
>
> tpm tpm0: A TPM error (28) occurred continue selftest
> ...
> tpm tpm0: A TPM error (28) occurred attempting get random
> ...
> tpm tpm0: Error (28) sending savestate before suspend
> tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
> tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
> tpm_tis 00:08: PM: failed to suspend: error 28
> PM: Some devices failed to suspend, or early wake event detected
>
> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
>
> The hwrng driver appears already to be occasionally disabled due to
> other conditions, so this shouldn't be too large of a surprise.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Cc: [email protected] # 6.1+
> Signed-off-by: Jason A. Donenfeld <[email protected]>

Quoting from my previous email:

| I spent a long time working through the TPM code when this came up
| during 6.1. I set up the TPM emulator with QEMU and reproduced this and
| had a whole test setup and S3 fuzzer. It took a long time, and when I was
| done, I paged it all out of my brain. When I found that patch from Jan
| that fixed the problem most of the time (but not all the time), I wasted
| tons of time trying to get the TPM maintainers to take the patch and
| send it to Linus as part of rc7 or rc8. But they all ignored me, and
| eventually Linus just took that patch directly.
|
| So I don't think I want to go down another rabbit hole here, having
| experienced the TPM maintainers not really caring much, and that sucking
| away the remaining energy I had before to keep looking at the issue and
| its edge cases not handled by Jan's patch.
|
| On the contrary, it'd make a big difference if the TPM maintainers could
| actually help analyze the code that they're most familiar with, so that
| we can get to the bottom of this. That's a lot better than some random
| drive-by patches from a non-TPM person like me; before the 6.1 bug, I'd
| never even looked at these drivers.
|
| My plan is to therefore be available to help and analyze and test and
| maybe even write some code, if the TPM maintainers take the lead on
| getting to the bottom of this. But if this hits neglect again like last
| time, I'll just send a `depends on BROKEN if PM` patch to the TPM
| hw_random driver and see what happens... That's a really awful solution
| though, so I hope the maintainers will wake up this cycle.

Seeing as there's still no life from the TPM maintainers, here's the
patch to make the problem go away until they wake up. When they do wake
up, though, I will be available to start looking into this again in
whatever capacity I might be useful.

Jason

2023-01-05 15:52:05

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Thu, Jan 05, 2023 at 04:27:02PM +0100, Jason A. Donenfeld wrote:
> On Thu, Jan 05, 2023 at 10:17:57AM -0500, James Bottomley wrote:
> > On Thu, 2023-01-05 at 14:59 +0100, Thorsten Leemhuis wrote:
> > > On 29.12.22 05:03, Jason A. Donenfeld wrote:
> > > > On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> > > > > On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> > > > > > Ugh, while the problem [1] was fixed in 6.1, it's now happening
> > > > > > again on the T460 with 6.2-rc1. Except I didn't see any oops
> > > > > > message or "tpm_try_transmit" error this time. The first
> > > > > > indication of a problem is this during a resume from suspend to
> > > > > > ram:
> > > > > >
> > > > > > tpm tpm0: A TPM error (28) occurred continue selftest
> > > > > >
> > > > > > and then periodically
> > > > > >
> > > > > > tpm tpm0: A TPM error (28) occurred attempting get random
> > > > >
> > > > > That's a TPM 1.2 error which means the TPM failed the selftest. 
> > > > > The original problem was reported against TPM 2.0  because of a
> > > > > missing try_get_ops().
> > > >
> > > > No, I'm pretty sure the original bug, which was fixed by "char:
> > > > tpm: Protect tpm_pm_suspend with locks" regards 1.2 as well,
> > > > especially considering it's the same hardware from Vlastimil
> > > > causing this. I also recall seeing this in 1.2 when I ran this with
> > > > the TPM emulator. So that's not correct.
> > >
> > > James, are you or some other TPM developer looking into this? Or is
> > > this deadlocked now?
> >
> > Not really: TPM 1.2 way predates my interest in the TPM subsystem, and
> > I've only ever done patches to 2.0. I can look at the paths
> > theoretically, but I don't have any hardware. Self Test failures tend
> > to be hardware specific, so even poking around in the emulator is
> > unlikely to give what might be the cause.
> >
> > > And if so: how can we get this unstuck to get this regression
> > > solved?
> >
> > One of the TPM maintainers with hardware (possibly the specific TPM ...
> > what is it, by the way?) needs to get involved.
>
> I already wrote in my last email [1] that this simply isn't the case.
> The issue reproduces in QEMU + the emulator. It's far more likely to be
> a locking/race situation [2] than some kind of obscure hardware bug.
>
> Jason
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/

I should add, though, that I don't fault you for lack of interest in TPM
1.2. Folks work on what they care about. If the rest of the TPM team has
a similar feeling, maybe it's best to just merge [3], rather than let
the bug linger forever? And possibly talk about sunsetting TPM 1 support
entirely if there's nobody to work on it and fix issues that inevitably
will come up? Just trying to look at the big picture here, you know...

[3] https://lore.kernel.org/lkml/[email protected]/

2023-01-05 16:08:23

by James Bottomley

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Thu, 2023-01-05 at 14:59 +0100, Thorsten Leemhuis wrote:
> On 29.12.22 05:03, Jason A. Donenfeld wrote:
> > On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> > > On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> > > > Ugh, while the problem [1] was fixed in 6.1, it's now happening
> > > > again on the T460 with 6.2-rc1. Except I didn't see any oops
> > > > message or "tpm_try_transmit" error this time. The first
> > > > indication of a problem is this during a resume from suspend to
> > > > ram:
> > > >
> > > > tpm tpm0: A TPM error (28) occurred continue selftest
> > > >
> > > > and then periodically
> > > >
> > > > tpm tpm0: A TPM error (28) occurred attempting get random
> > >
> > > That's a TPM 1.2 error which means the TPM failed the selftest. 
> > > The original problem was reported against TPM 2.0  because of a
> > > missing try_get_ops().
> >
> > No, I'm pretty sure the original bug, which was fixed by "char:
> > tpm: Protect tpm_pm_suspend with locks" regards 1.2 as well,
> > especially considering it's the same hardware from Vlastimil
> > causing this. I also recall seeing this in 1.2 when I ran this with
> > the TPM emulator. So that's not correct.
>
> James, are you or some other TPM developer looking into this? Or is
> this deadlocked now?

Not really: TPM 1.2 way predates my interest in the TPM subsystem, and
I've only ever done patches to 2.0. I can look at the paths
theoretically, but I don't have any hardware. Self Test failures tend
to be hardware specific, so even poking around in the emulator is
unlikely to give what might be the cause.

> And if so: how can we get this unstuck to get this regression
> solved?

One of the TPM maintainers with hardware (possibly the specific TPM ...
what is it, by the way?) needs to get involved.

James



2023-01-05 16:08:35

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Thu, Jan 05, 2023 at 10:17:57AM -0500, James Bottomley wrote:
> On Thu, 2023-01-05 at 14:59 +0100, Thorsten Leemhuis wrote:
> > On 29.12.22 05:03, Jason A. Donenfeld wrote:
> > > On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> > > > On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> > > > > Ugh, while the problem [1] was fixed in 6.1, it's now happening
> > > > > again on the T460 with 6.2-rc1. Except I didn't see any oops
> > > > > message or "tpm_try_transmit" error this time. The first
> > > > > indication of a problem is this during a resume from suspend to
> > > > > ram:
> > > > >
> > > > > tpm tpm0: A TPM error (28) occurred continue selftest
> > > > >
> > > > > and then periodically
> > > > >
> > > > > tpm tpm0: A TPM error (28) occurred attempting get random
> > > >
> > > > That's a TPM 1.2 error which means the TPM failed the selftest. 
> > > > The original problem was reported against TPM 2.0  because of a
> > > > missing try_get_ops().
> > >
> > > No, I'm pretty sure the original bug, which was fixed by "char:
> > > tpm: Protect tpm_pm_suspend with locks" regards 1.2 as well,
> > > especially considering it's the same hardware from Vlastimil
> > > causing this. I also recall seeing this in 1.2 when I ran this with
> > > the TPM emulator. So that's not correct.
> >
> > James, are you or some other TPM developer looking into this? Or is
> > this deadlocked now?
>
> Not really: TPM 1.2 way predates my interest in the TPM subsystem, and
> I've only ever done patches to 2.0. I can look at the paths
> theoretically, but I don't have any hardware. Self Test failures tend
> to be hardware specific, so even poking around in the emulator is
> unlikely to give what might be the cause.
>
> > And if so: how can we get this unstuck to get this regression
> > solved?
>
> One of the TPM maintainers with hardware (possibly the specific TPM ...
> what is it, by the way?) needs to get involved.

I already wrote in my last email [1] that this simply isn't the case.
The issue reproduces in QEMU + the emulator. It's far more likely to be
a locking/race situation [2] than some kind of obscure hardware bug.

Jason

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

2023-01-05 22:52:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] tpm: Disable hwrng for TPM 1 if PM_SLEEP is enabled

On Thu, Jan 5, 2023 at 6:48 AM Jason A. Donenfeld <[email protected]> wrote:
>
> TPM 1's support for its hardware RNG is broken across system suspends,
> due to races or locking issues or something else that haven't been
> diagnosed or fixed yet. These issues prevent the system from actually
> suspending. So disable the driver in this case. Later, when this is
> fixed properly, we can remove this.

How about just keeping it enabled, but not making it a fatal error if
the TPM saving doesn't work? IOW, just print the warning, and then
"return 0" from the suspend function.

I doubt anybody cares, but your patch disables that TPM device just
because PM is *enabled*. That's basically "all the time".

Imagine being on a desktop with a distro kernel that enables suspend -
because that kernel obviously is expected to work on laptops too.
You're never actually going to suspend things on that machine, but
maybe you still want to register it as a source of hw random data?

Linus

2023-01-05 22:56:58

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH] tpm: Disable hwrng for TPM 1 if PM_SLEEP is enabled

On Thu, Jan 05, 2023 at 01:58:48PM -0800, Linus Torvalds wrote:
> On Thu, Jan 5, 2023 at 6:48 AM Jason A. Donenfeld <[email protected]> wrote:
> >
> > TPM 1's support for its hardware RNG is broken across system suspends,
> > due to races or locking issues or something else that haven't been
> > diagnosed or fixed yet. These issues prevent the system from actually
> > suspending. So disable the driver in this case. Later, when this is
> > fixed properly, we can remove this.
>
> How about just keeping it enabled, but not making it a fatal error if
> the TPM saving doesn't work? IOW, just print the warning, and then
> "return 0" from the suspend function.

You're right that returning 0 from the pm notifier would make the
problem that users actually care about -- laptop doesn't sleep when you
close the lid -- go away.

From a random.c perspective, the RNG is already initialized when the
driver loads, which will be before suspend bricks the driver. So even if
the behavior afterwards is a buggy driver handing all zeros to random.c,
it won't really matter much; random.c can deal with that
cryptographically. I have no idea if this is actually the case with the
driver's error condition. But if it is, it's good that it doesn't
matter.

So okay, I'll roll a patch to do that when I get home. I'm writing on my
phone now, but from memory it's just changing a 'return rc;' into
'return 0;'.

Then the TPM folks can fix the underlying issue at their leisure
whenever.

Jason

2023-01-06 03:42:52

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

TPM 1 is sometimes broken across system suspends, due to races or
locking issues or something else that haven't been diagnosed or fixed
yet, most likely having to do with concurrent reads from the TPM's
hardware random number generator driver. These issues prevent the system
from actually suspending, with errors like:

tpm tpm0: A TPM error (28) occurred continue selftest
...
tpm tpm0: A TPM error (28) occurred attempting get random
...
tpm tpm0: Error (28) sending savestate before suspend
tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
tpm_tis 00:08: PM: failed to suspend: error 28
PM: Some devices failed to suspend, or early wake event detected

This issue was partially fixed by 23393c646142 ("char: tpm: Protect
tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
directly because the TPM maintainers weren't available. However, it
seems like this just addresses the most common cases of the bug, rather
than addressing it entirely. So there are more things to fix still,
apparently.

In lieu of actually fixing the underlying bug, just allow system suspend
to continue, so that laptops still go to sleep fine. Later, this can be
reverted when the real bug is fixed.

Link: https://lore.kernel.org/lkml/[email protected]/
Cc: [email protected] # 6.1+
Reported-by: Vlastimil Babka <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
This is basically untested and I haven't worked out if there are any
awful implications of letting the system sleep when TPM suspend fails.
Maybe some PCRs get cleared and that will make everything explode on
resume? Maybe it doesn't matter? Somebody well versed in TPMology should
probably [n]ack this approach.

drivers/char/tpm/tpm-interface.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
index d69905233aff..6df9067ef7f9 100644
--- a/drivers/char/tpm/tpm-interface.c
+++ b/drivers/char/tpm/tpm-interface.c
@@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
}

suspended:
- return rc;
+ if (rc)
+ pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
+ chip->dev_num, rc);
+ return 0;
}
EXPORT_SYMBOL_GPL(tpm_pm_suspend);

--
2.39.0

2023-01-06 16:19:22

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

Hi Todd & ChromeOS folks,

On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> TPM 1 is sometimes broken across system suspends, due to races or
> locking issues or something else that haven't been diagnosed or fixed
> yet, most likely having to do with concurrent reads from the TPM's
> hardware random number generator driver. These issues prevent the system
> from actually suspending, with errors like:
>
> tpm tpm0: A TPM error (28) occurred continue selftest
> ...
> tpm tpm0: A TPM error (28) occurred attempting get random
> ...
> tpm tpm0: Error (28) sending savestate before suspend
> tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
> tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
> tpm_tis 00:08: PM: failed to suspend: error 28
> PM: Some devices failed to suspend, or early wake event detected
>
> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
>
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Cc: [email protected] # 6.1+
> Reported-by: Vlastimil Babka <[email protected]>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Jason A. Donenfeld <[email protected]>
> ---
> This is basically untested and I haven't worked out if there are any
> awful implications of letting the system sleep when TPM suspend fails.
> Maybe some PCRs get cleared and that will make everything explode on
> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> probably [n]ack this approach.

When idling scrolling on my telephone to try to see what the
implications of skipping TPM_ORD_SAVESTATE could be, I stumbled across
some ChromeOS commits related to it, and realized that, ah-hah, finally
there's an obvious group of stakeholders who make heavy use of the TPM
and have likely amassed some expertise on it.

So I was wondering if you'd take a look at this patch briefly to make
sure it won't break ChromeOS laptops.

Jason

2023-01-06 17:41:59

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

On Fri, Jan 6, 2023 at 6:04 PM Luigi Semenzato <[email protected]> wrote:
>
> I worked a fair amount on TPM 1.0 about 10 years ago and I even vaguely remember suspend-related problems. I'd be happy to take a look. The linked thread shows that Peter Huewe was copied. I know Peter well, his opinion can be trusted. Unfortunately I don't immediately see a link to a patch, can you help?

Sorry, I should have included that:
https://lore.kernel.org/lkml/[email protected]/
Instead of blocking system suspend when TPM_ORD_SAVESTATE fails, it
just lets the system sleep anyway. This means that presumably the
system might sleep without having called TPM_ORD_SAVESTATE. Trying to
figure out how bad that is.

And yes, Peter Huewe certainly knows about TPMs, especially as he
maintains the code in Linux, but the maintainers haven't been so much
available, unfortunately. This bug happens to intersect with something
mostly related that I work on (the rng), so I'm motivated to at least
prevent the worst of the breakage, but I otherwise don't know anything
about the Linux TPM driver.

Jason

2023-01-06 19:35:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

On Thu, Jan 5, 2023 at 7:02 PM Jason A. Donenfeld <[email protected]> wrote:
>
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.

So the patch looks fine to me, but since there's still the ChromeOS
discussion pending I'll wait for that to finish.

Perhaps re-send or at least remind me if/when it does?

Also, a query about the printout:

> + if (rc)
> + pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> + chip->dev_num, rc);

so I suspect that 99% of the time the dev_num isn't actually all that
useful, but what *might* be useful is which tpm driver it is.

Just comparing the error dmesg output you had:

..
tpm tpm0: Error (28) sending savestate before suspend
tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
..

that "tpm tpm0" output is kind of useless compared to the "tpm_tis 00:08" one.

So I think "dev_err(dev, ...)" would be more useful here.

Finally - and maybe I'm just being difficult here, I will note here
again that TPM2 devices don't have this issue, because the TPM2 path
for suspend doesn't do any of this at all.

It just does

tpm_transmit_cmd(..);

with a TPM2_CC_SHUTDOWN TPM_SU_STATE command, and doesn't even check
the return value. In fact, the tpm2 code *used* to have this comment:

/* In places where shutdown command is sent there's no much we can do
* except print the error code on a system failure.
*/
if (rc < 0 && rc != -EPIPE)
dev_warn(&chip->dev, "transmit returned %d while
stopping the TPM",
rc);

but it was summarily removed when doing some re-organization around
buffer handling.

So just by looking at what tpm2 does, I'm not 100% convinced that tpm1
should do this dance at all.

But having a dev_err() is probably a good idea at least as a transitional thing.

Linus

2023-01-06 20:20:30

by Luigi Semenzato

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

I think it's fine to go ahead with your change, for multiple reasons.

1. I doubt that any of the ChromeOS devices using TPM 1.2 are still
being updated.
2. If the SAVESTATE command fails, it is probably better to continue
the transition to S3, and fail at resume, than to block the suspend.
The suspend is often triggered by closing the lid, so users would not
see what's going on and might put their running laptop in a backpack,
where it could overheat.
3. I don't recall bugs due to failures of TPM suspend, and I didn't
find any such bug in our database. Many (most?) ChromeOS devices left
the TPM powered on in S3, so didn't use the suspend/resume path.

Thank you for asking!


On Fri, Jan 6, 2023 at 11:00 AM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Jan 5, 2023 at 7:02 PM Jason A. Donenfeld <[email protected]> wrote:
> >
> > In lieu of actually fixing the underlying bug, just allow system suspend
> > to continue, so that laptops still go to sleep fine. Later, this can be
> > reverted when the real bug is fixed.
>
> So the patch looks fine to me, but since there's still the ChromeOS
> discussion pending I'll wait for that to finish.
>
> Perhaps re-send or at least remind me if/when it does?
>
> Also, a query about the printout:
>
> > + if (rc)
> > + pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> > + chip->dev_num, rc);
>
> so I suspect that 99% of the time the dev_num isn't actually all that
> useful, but what *might* be useful is which tpm driver it is.
>
> Just comparing the error dmesg output you had:
>
> ..
> tpm tpm0: Error (28) sending savestate before suspend
> tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
> ..
>
> that "tpm tpm0" output is kind of useless compared to the "tpm_tis 00:08" one.
>
> So I think "dev_err(dev, ...)" would be more useful here.
>
> Finally - and maybe I'm just being difficult here, I will note here
> again that TPM2 devices don't have this issue, because the TPM2 path
> for suspend doesn't do any of this at all.
>
> It just does
>
> tpm_transmit_cmd(..);
>
> with a TPM2_CC_SHUTDOWN TPM_SU_STATE command, and doesn't even check
> the return value. In fact, the tpm2 code *used* to have this comment:
>
> /* In places where shutdown command is sent there's no much we can do
> * except print the error code on a system failure.
> */
> if (rc < 0 && rc != -EPIPE)
> dev_warn(&chip->dev, "transmit returned %d while
> stopping the TPM",
> rc);
>
> but it was summarily removed when doing some re-organization around
> buffer handling.
>
> So just by looking at what tpm2 does, I'm not 100% convinced that tpm1
> should do this dance at all.
>
> But having a dev_err() is probably a good idea at least as a transitional thing.
>
> Linus

2023-01-06 23:04:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

On Fri, Jan 6, 2023 at 12:04 PM Luigi Semenzato <[email protected]> wrote:
>
> I think it's fine to go ahead with your change, for multiple reasons.

Ok, I've applied the patch (although I did end up editing it to use
dev_err() before doing that just to make myself happier about the
printout).

Linus

2023-01-09 16:13:39

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

Hi Thorsten,

On Thu, Jan 05, 2023 at 02:59:15PM +0100, Thorsten Leemhuis wrote:
> On 29.12.22 05:03, Jason A. Donenfeld wrote:
> > On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> >> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> >>> Ugh, while the problem [1] was fixed in 6.1, it's now happening again
> >>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
> >>> "tpm_try_transmit" error this time. The first indication of a problem
> >>> is this during a resume from suspend to ram:
> >>>
> >>> tpm tpm0: A TPM error (28) occurred continue selftest
> >>>
> >>> and then periodically
> >>>
> >>> tpm tpm0: A TPM error (28) occurred attempting get random
> >>
> >> That's a TPM 1.2 error which means the TPM failed the selftest. The
> >> original problem was reported against TPM 2.0 because of a missing
> >> try_get_ops().
> >
> > No, I'm pretty sure the original bug, which was fixed by "char: tpm:
> > Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
> > considering it's the same hardware from Vlastimil causing this. I also
> > recall seeing this in 1.2 when I ran this with the TPM emulator. So
> > that's not correct.
>
> James, are you or some other TPM developer looking into this? Or is this
> deadlocked now? And if so: how can we get this unstuck to get this
> regression solved?
>
> Side note: I wonder if the problem that Johannes reported yesterday in
> this thread (
> https://lore.kernel.org/all/[email protected]/
> ) is related or something else, as it seems his issue happens with 6.1,
> while Vlastimil's problems should be fixed there. Or am I missing something?

So, this is now in rc3:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1382999aa0548a171a272ca817f6c38e797c458c

That should help avoid the worst of the issue -- laptop not sleeping.
But the race or whatever it is still does exist. So you might want to
keep this in your tracker to periodically nudge the TPM folks about it.

Jason


2023-01-09 16:39:44

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

On Fri, Jan 06, 2023 at 02:28:00PM -0800, Linus Torvalds wrote:
> On Fri, Jan 6, 2023 at 12:04 PM Luigi Semenzato <[email protected]> wrote:
> >
> > I think it's fine to go ahead with your change, for multiple reasons.
>
> Ok, I've applied the patch (although I did end up editing it to use
> dev_err() before doing that just to make myself happier about the
> printout).

Thanks for fixing it up to use dev_err(). Final commit looks good to me.

Jason

2023-01-10 17:44:47

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On 1/9/23 17:08, Jason A. Donenfeld wrote:
> Hi Thorsten,
>
> On Thu, Jan 05, 2023 at 02:59:15PM +0100, Thorsten Leemhuis wrote:
>> On 29.12.22 05:03, Jason A. Donenfeld wrote:
>>> On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
>>>> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
>>>>> Ugh, while the problem [1] was fixed in 6.1, it's now happening again
>>>>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
>>>>> "tpm_try_transmit" error this time. The first indication of a problem
>>>>> is this during a resume from suspend to ram:
>>>>>
>>>>> tpm tpm0: A TPM error (28) occurred continue selftest
>>>>>
>>>>> and then periodically
>>>>>
>>>>> tpm tpm0: A TPM error (28) occurred attempting get random
>>>>
>>>> That's a TPM 1.2 error which means the TPM failed the selftest. The
>>>> original problem was reported against TPM 2.0 because of a missing
>>>> try_get_ops().
>>>
>>> No, I'm pretty sure the original bug, which was fixed by "char: tpm:
>>> Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
>>> considering it's the same hardware from Vlastimil causing this. I also
>>> recall seeing this in 1.2 when I ran this with the TPM emulator. So
>>> that's not correct.
>>
>> James, are you or some other TPM developer looking into this? Or is this
>> deadlocked now? And if so: how can we get this unstuck to get this
>> regression solved?
>>
>> Side note: I wonder if the problem that Johannes reported yesterday in
>> this thread (
>> https://lore.kernel.org/all/[email protected]/
>> ) is related or something else, as it seems his issue happens with 6.1,
>> while Vlastimil's problems should be fixed there. Or am I missing something?
>
> So, this is now in rc3:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1382999aa0548a171a272ca817f6c38e797c458c
>
> That should help avoid the worst of the issue -- laptop not sleeping.
> But the race or whatever it is still does exist. So you might want to
> keep this in your tracker to periodically nudge the TPM folks about it.

Heh, booted rc3 and managed to hit it on very first suspend to ram attempt:

tpm tpm0: A TPM error (28) occurred continue selftest

But thanks to the patch, the next suspend worked:

[ 236.598900] tpm tpm0: Error (28) sending savestate before suspend
[ 236.598915] tpm_tis 00:08: Ignoring error 28 while suspending

and on resume again:

[ 238.196645] tpm tpm0: A TPM error (28) occurred continue selftest

and indeed now I keep getting (as expected)

[ 399.671077] tpm tpm0: A TPM error (28) occurred attempting get random

So hopefully somebody will look into the root cause at some point.

> Jason
>
>

2023-01-16 09:27:54

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> TPM 1 is sometimes broken across system suspends, due to races or
> locking issues or something else that haven't been diagnosed or fixed
> yet, most likely having to do with concurrent reads from the TPM's
> hardware random number generator driver. These issues prevent the system
> from actually suspending, with errors like:
>
> tpm tpm0: A TPM error (28) occurred continue selftest
> ...
> tpm tpm0: A TPM error (28) occurred attempting get random
> ...
> tpm tpm0: Error (28) sending savestate before suspend
> tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
> tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
> tpm_tis 00:08: PM: failed to suspend: error 28
> PM: Some devices failed to suspend, or early wake event detected
>
> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
>
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Cc: [email protected] # 6.1+
> Reported-by: Vlastimil Babka <[email protected]>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Jason A. Donenfeld <[email protected]>
> ---
> This is basically untested and I haven't worked out if there are any
> awful implications of letting the system sleep when TPM suspend fails.
> Maybe some PCRs get cleared and that will make everything explode on
> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> probably [n]ack this approach.
>
> drivers/char/tpm/tpm-interface.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> index d69905233aff..6df9067ef7f9 100644
> --- a/drivers/char/tpm/tpm-interface.c
> +++ b/drivers/char/tpm/tpm-interface.c
> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
> }
>
> suspended:
> - return rc;
> + if (rc)
> + pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> + chip->dev_num, rc);
> + return 0;
> }
> EXPORT_SYMBOL_GPL(tpm_pm_suspend);
>
> --
> 2.39.0
>

Let me read all the threads through starting from the original report. I've
had emails piling up because of getting sick before holiday, and holiday
season after that.

This looks sane

Apologies for the lack of response.

BR, Jarkko

2023-01-16 12:03:14

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> TPM 1 is sometimes broken across system suspends, due to races or
> locking issues or something else that haven't been diagnosed or fixed
> yet, most likely having to do with concurrent reads from the TPM's
> hardware random number generator driver. These issues prevent the system
> from actually suspending, with errors like:
>
> tpm tpm0: A TPM error (28) occurred continue selftest
> ...

<REMOVE>

> tpm tpm0: A TPM error (28) occurred attempting get random
> ...
> tpm tpm0: Error (28) sending savestate before suspend
> tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
> tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
> tpm_tis 00:08: PM: failed to suspend: error 28
> PM: Some devices failed to suspend, or early wake event detected

</REMOVE>

Unrelated to thix particular fix.

> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
>
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Cc: [email protected] # 6.1+
> Reported-by: Vlastimil Babka <[email protected]>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Jason A. Donenfeld <[email protected]>
> ---
> This is basically untested and I haven't worked out if there are any
> awful implications of letting the system sleep when TPM suspend fails.
> Maybe some PCRs get cleared and that will make everything explode on
> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> probably [n]ack this approach.
>
> drivers/char/tpm/tpm-interface.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> index d69905233aff..6df9067ef7f9 100644
> --- a/drivers/char/tpm/tpm-interface.c
> +++ b/drivers/char/tpm/tpm-interface.c
> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
> }
>
> suspended:
> - return rc;
> + if (rc)
> + pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> + chip->dev_num, rc);
> + return 0;
> }
> EXPORT_SYMBOL_GPL(tpm_pm_suspend);
>
> --
> 2.39.0
>

This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing
the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and
the call is located in tpm_tis_resume() [*].

[*] https://lore.kernel.org/lkml/[email protected]/

BR, Jarkko

2023-01-16 12:26:13

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Wed, Dec 28, 2022 at 09:22:56PM +0100, Vlastimil Babka wrote:
> Ugh, while the problem [1] was fixed in 6.1, it's now happening again on
> the T460 with 6.2-rc1. Except I didn't see any oops message or
> "tpm_try_transmit" error this time. The first indication of a problem is
> this during a resume from suspend to ram:
>
> tpm tpm0: A TPM error (28) occurred continue selftest

This failure occurs in tpm_tis_resume().

tpm1_do_selftest() is called for the first time during a power cycle in
tpm_chip_register(), which does not spawn anything klog, does it?

Because they wrap the call differently tpm_tis_resume() does not disable
clkrun protocol, which is at least obvious semantical difference.

I'd suggest to workaround the bug by replacing tpm1_do_selftest() with

1. tpm_chip_start()
2. tpm1_auto_startup()
3. tpm_chip_stop()

tpm1_auto_startup() has already convergent rollback semantics as in

https://lore.kernel.org/lkml/[email protected]/

To add, even without bug, it makes a lot of commons to make semantics
convergent.

BR, Jarkko

2023-01-16 14:42:50

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

Hi Jarkko,

On Mon, Jan 16, 2023 at 9:12 AM Jarkko Sakkinen <[email protected]> wrote:
> > diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> > index d69905233aff..6df9067ef7f9 100644
> > --- a/drivers/char/tpm/tpm-interface.c
> > +++ b/drivers/char/tpm/tpm-interface.c
> > @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
> > }
> >
> > suspended:
> > - return rc;
> > + if (rc)
> > + pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> > + chip->dev_num, rc);
> > + return 0;
> > }
> > EXPORT_SYMBOL_GPL(tpm_pm_suspend);
> >
> > --
> > 2.39.0
> >
>
> Let me read all the threads through starting from the original report. I've
> had emails piling up because of getting sick before holiday, and holiday
> season after that.
>
> This looks sane

No, not really. I mean, it was sane under the circumstances of, "I'm
not going to spend time fixing this for real if the maintainers aren't
around," and it fixed the suspend issue. But it doesn't actually fix
any real tpm issue. The real issue, AFAICT, is there's some sort of
race between the tpm rng read command and either suspend or wakeup or
selftest. One of these is missing some locking. And then commands step
on each other and the tpm gets upset. This is probably something that
should be fixed. I assume the "Fixes: ..." tag will actually go quite
far back, with recent things only unearthing a somewhat old bug. But
just a hunch.

Jason

2023-01-16 14:42:59

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

On 1/16/23 12:44, Jarkko Sakkinen wrote:
> On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
>> TPM 1 is sometimes broken across system suspends, due to races or
>> locking issues or something else that haven't been diagnosed or fixed
>> yet, most likely having to do with concurrent reads from the TPM's
>> hardware random number generator driver. These issues prevent the system
>> from actually suspending, with errors like:
>>
>> tpm tpm0: A TPM error (28) occurred continue selftest
>> ...
>
> <REMOVE>
>
>> tpm tpm0: A TPM error (28) occurred attempting get random
>> ...
>> tpm tpm0: Error (28) sending savestate before suspend
>> tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>> tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
>> tpm_tis 00:08: PM: failed to suspend: error 28
>> PM: Some devices failed to suspend, or early wake event detected
>
> </REMOVE>
>
> Unrelated to thix particular fix.

Not sure I understand.
AFAIK this is not a proper fix, but a workaround for when laptop suspend no
longer works because TPM fails to suspend. The error messages quoted above
are very much related to the problem of suspend not working, and this patch
did work as advertised at least for me. I see errors but they don't prevent
suspend anymore:

https://lore.kernel.org/all/[email protected]/

>> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
>> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
>> directly because the TPM maintainers weren't available. However, it
>> seems like this just addresses the most common cases of the bug, rather
>> than addressing it entirely. So there are more things to fix still,
>> apparently.
>>
>> In lieu of actually fixing the underlying bug, just allow system suspend
>> to continue, so that laptops still go to sleep fine. Later, this can be
>> reverted when the real bug is fixed.
>>
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Cc: [email protected] # 6.1+
>> Reported-by: Vlastimil Babka <[email protected]>
>> Suggested-by: Linus Torvalds <[email protected]>
>> Signed-off-by: Jason A. Donenfeld <[email protected]>
>> ---
>> This is basically untested and I haven't worked out if there are any
>> awful implications of letting the system sleep when TPM suspend fails.
>> Maybe some PCRs get cleared and that will make everything explode on
>> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
>> probably [n]ack this approach.
>>
>> drivers/char/tpm/tpm-interface.c | 5 ++++-
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
>> index d69905233aff..6df9067ef7f9 100644
>> --- a/drivers/char/tpm/tpm-interface.c
>> +++ b/drivers/char/tpm/tpm-interface.c
>> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
>> }
>>
>> suspended:
>> - return rc;
>> + if (rc)
>> + pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
>> + chip->dev_num, rc);
>> + return 0;
>> }
>> EXPORT_SYMBOL_GPL(tpm_pm_suspend);
>>
>> --
>> 2.39.0
>>
>
> This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing
> the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and
> the call is located in tpm_tis_resume() [*].
>
> [*] https://lore.kernel.org/lkml/[email protected]/

Yes the changelog at the top does say "due to races or locking issues or
something else that haven't been diagnosed or fixed yet"

I don't know what causes the TPM to start returning error 28 on resume and
never recover from it. But it didn't happen before hwrng started using the
TPM. Before that, it was probably just the selftest ever doing anything with
the TPM, and on its own I don't recall it ever (before 6.1) failing and
preventing further suspend/resume.

> BR, Jarkko

2023-01-21 00:10:45

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

On Mon, Jan 16, 2023 at 03:00:03PM +0100, Vlastimil Babka wrote:
> On 1/16/23 12:44, Jarkko Sakkinen wrote:
> > On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> >> TPM 1 is sometimes broken across system suspends, due to races or
> >> locking issues or something else that haven't been diagnosed or fixed
> >> yet, most likely having to do with concurrent reads from the TPM's
> >> hardware random number generator driver. These issues prevent the system
> >> from actually suspending, with errors like:
> >>
> >> tpm tpm0: A TPM error (28) occurred continue selftest
> >> ...
> >
> > <REMOVE>
> >
> >> tpm tpm0: A TPM error (28) occurred attempting get random
> >> ...
> >> tpm tpm0: Error (28) sending savestate before suspend
> >> tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
> >> tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
> >> tpm_tis 00:08: PM: failed to suspend: error 28
> >> PM: Some devices failed to suspend, or early wake event detected
> >
> > </REMOVE>
> >
> > Unrelated to thix particular fix.
>
> Not sure I understand.
> AFAIK this is not a proper fix, but a workaround for when laptop suspend no
> longer works because TPM fails to suspend. The error messages quoted above
> are very much related to the problem of suspend not working, and this patch
> did work as advertised at least for me. I see errors but they don't prevent
> suspend anymore:
>
> https://lore.kernel.org/all/[email protected]/
>
> >> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> >> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> >> directly because the TPM maintainers weren't available. However, it
> >> seems like this just addresses the most common cases of the bug, rather
> >> than addressing it entirely. So there are more things to fix still,
> >> apparently.
> >>
> >> In lieu of actually fixing the underlying bug, just allow system suspend
> >> to continue, so that laptops still go to sleep fine. Later, this can be
> >> reverted when the real bug is fixed.
> >>
> >> Link: https://lore.kernel.org/lkml/[email protected]/
> >> Cc: [email protected] # 6.1+
> >> Reported-by: Vlastimil Babka <[email protected]>
> >> Suggested-by: Linus Torvalds <[email protected]>
> >> Signed-off-by: Jason A. Donenfeld <[email protected]>
> >> ---
> >> This is basically untested and I haven't worked out if there are any
> >> awful implications of letting the system sleep when TPM suspend fails.
> >> Maybe some PCRs get cleared and that will make everything explode on
> >> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> >> probably [n]ack this approach.
> >>
> >> drivers/char/tpm/tpm-interface.c | 5 ++++-
> >> 1 file changed, 4 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> >> index d69905233aff..6df9067ef7f9 100644
> >> --- a/drivers/char/tpm/tpm-interface.c
> >> +++ b/drivers/char/tpm/tpm-interface.c
> >> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
> >> }
> >>
> >> suspended:
> >> - return rc;
> >> + if (rc)
> >> + pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> >> + chip->dev_num, rc);
> >> + return 0;
> >> }
> >> EXPORT_SYMBOL_GPL(tpm_pm_suspend);
> >>
> >> --
> >> 2.39.0
> >>
> >
> > This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing
> > the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and
> > the call is located in tpm_tis_resume() [*].
> >
> > [*] https://lore.kernel.org/lkml/[email protected]/
>
> Yes the changelog at the top does say "due to races or locking issues or
> something else that haven't been diagnosed or fixed yet"
>
> I don't know what causes the TPM to start returning error 28 on resume and
> never recover from it. But it didn't happen before hwrng started using the
> TPM. Before that, it was probably just the selftest ever doing anything with
> the TPM, and on its own I don't recall it ever (before 6.1) failing and
> preventing further suspend/resume.

Would it be possible to test this theory by commenting out tpm_add_hwrng()
call from tpm_chip_register()?

Since they are called sequentially any sort of concurrency issue can be
probably ruled out.

One thing that I noticed is that probably it would be more safe-play to
move tpm_add_hwrng() call after creating the character device, as there's
no need to do it before anything else.

BR, Jarkko

2023-01-21 00:16:36

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Tue, Jan 10, 2023 at 06:19:48PM +0100, Vlastimil Babka wrote:
> On 1/9/23 17:08, Jason A. Donenfeld wrote:
> > Hi Thorsten,
> >
> > On Thu, Jan 05, 2023 at 02:59:15PM +0100, Thorsten Leemhuis wrote:
> >> On 29.12.22 05:03, Jason A. Donenfeld wrote:
> >>> On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> >>>> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> >>>>> Ugh, while the problem [1] was fixed in 6.1, it's now happening again
> >>>>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
> >>>>> "tpm_try_transmit" error this time. The first indication of a problem
> >>>>> is this during a resume from suspend to ram:
> >>>>>
> >>>>> tpm tpm0: A TPM error (28) occurred continue selftest
> >>>>>
> >>>>> and then periodically
> >>>>>
> >>>>> tpm tpm0: A TPM error (28) occurred attempting get random
> >>>>
> >>>> That's a TPM 1.2 error which means the TPM failed the selftest. The
> >>>> original problem was reported against TPM 2.0 because of a missing
> >>>> try_get_ops().
> >>>
> >>> No, I'm pretty sure the original bug, which was fixed by "char: tpm:
> >>> Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
> >>> considering it's the same hardware from Vlastimil causing this. I also
> >>> recall seeing this in 1.2 when I ran this with the TPM emulator. So
> >>> that's not correct.
> >>
> >> James, are you or some other TPM developer looking into this? Or is this
> >> deadlocked now? And if so: how can we get this unstuck to get this
> >> regression solved?
> >>
> >> Side note: I wonder if the problem that Johannes reported yesterday in
> >> this thread (
> >> https://lore.kernel.org/all/[email protected]/
> >> ) is related or something else, as it seems his issue happens with 6.1,
> >> while Vlastimil's problems should be fixed there. Or am I missing something?
> >
> > So, this is now in rc3:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1382999aa0548a171a272ca817f6c38e797c458c
> >
> > That should help avoid the worst of the issue -- laptop not sleeping.
> > But the race or whatever it is still does exist. So you might want to
> > keep this in your tracker to periodically nudge the TPM folks about it.
>
> Heh, booted rc3 and managed to hit it on very first suspend to ram attempt:
>
> tpm tpm0: A TPM error (28) occurred continue selftest
>
> But thanks to the patch, the next suspend worked:
>
> [ 236.598900] tpm tpm0: Error (28) sending savestate before suspend
> [ 236.598915] tpm_tis 00:08: Ignoring error 28 while suspending
>
> and on resume again:
>
> [ 238.196645] tpm tpm0: A TPM error (28) occurred continue selftest
>
> and indeed now I keep getting (as expected)
>
> [ 399.671077] tpm tpm0: A TPM error (28) occurred attempting get random
>
> So hopefully somebody will look into the root cause at some point.

I would start by using tpm1_auto_startup() also here, and see
the effect:

https://lore.kernel.org/linux-integrity/[email protected]/

BR, Jarkko

2023-01-21 00:30:18

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails

On Mon, Jan 16, 2023 at 03:03:17PM +0100, Jason A. Donenfeld wrote:
> Hi Jarkko,
>
> On Mon, Jan 16, 2023 at 9:12 AM Jarkko Sakkinen <[email protected]> wrote:
> > > diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> > > index d69905233aff..6df9067ef7f9 100644
> > > --- a/drivers/char/tpm/tpm-interface.c
> > > +++ b/drivers/char/tpm/tpm-interface.c
> > > @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
> > > }
> > >
> > > suspended:
> > > - return rc;
> > > + if (rc)
> > > + pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> > > + chip->dev_num, rc);
> > > + return 0;
> > > }
> > > EXPORT_SYMBOL_GPL(tpm_pm_suspend);
> > >
> > > --
> > > 2.39.0
> > >
> >
> > Let me read all the threads through starting from the original report. I've
> > had emails piling up because of getting sick before holiday, and holiday
> > season after that.
> >
> > This looks sane
>
> No, not really. I mean, it was sane under the circumstances of, "I'm
> not going to spend time fixing this for real if the maintainers aren't
> around," and it fixed the suspend issue. But it doesn't actually fix
> any real tpm issue. The real issue, AFAICT, is there's some sort of
> race between the tpm rng read command and either suspend or wakeup or
> selftest. One of these is missing some locking. And then commands step
> on each other and the tpm gets upset. This is probably something that
> should be fixed. I assume the "Fixes: ..." tag will actually go quite
> far back, with recent things only unearthing a somewhat old bug. But
> just a hunch.
>
> Jason

See my response to Vlastimil:

https://lore.kernel.org/linux-integrity/[email protected]/

Can you try what happens if you do not call tpm_add_hwrng()?

BR, Jarkko

2023-03-14 09:35:50

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On 09.01.23 17:08, Jason A. Donenfeld wrote:
> On Thu, Jan 05, 2023 at 02:59:15PM +0100, Thorsten Leemhuis wrote:
>> On 29.12.22 05:03, Jason A. Donenfeld wrote:
>>> On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
>>>> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
>>>>> Ugh, while the problem [1] was fixed in 6.1, it's now happening again
>>>>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
>>>>> "tpm_try_transmit" error this time. The first indication of a problem
>>>>> is this during a resume from suspend to ram:
>>>>> tpm tpm0: A TPM error (28) occurred continue selftest
>>>>> and then periodically
>>>>> tpm tpm0: A TPM error (28) occurred attempting get random
>>>>
>>>> That's a TPM 1.2 error which means the TPM failed the selftest. The
>>>> original problem was reported against TPM 2.0 because of a missing
>>>> try_get_ops().
>>>
>>> No, I'm pretty sure the original bug, which was fixed by "char: tpm:
>>> Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
>>> considering it's the same hardware from Vlastimil causing this. I also
>>> recall seeing this in 1.2 when I ran this with the TPM emulator. So
>>> that's not correct.
> [...]
> So, this is now in rc3:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1382999aa0548a171a272ca817f6c38e797c458c
>
> That should help avoid the worst of the issue -- laptop not sleeping.
> But the race or whatever it is still does exist. So you might want to
> keep this in your tracker to periodically nudge the TPM folks about it.

I did, and with -rc2 out now is a good time to remind everybody about
it. Jarkko even looked into it, but no real fix emerged afaics. Or did it?

Ciao, Thorsten

2023-03-14 12:30:16

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Tue, Mar 14, 2023 at 10:35:33AM +0100, Thorsten Leemhuis wrote:
> On 09.01.23 17:08, Jason A. Donenfeld wrote:
> > On Thu, Jan 05, 2023 at 02:59:15PM +0100, Thorsten Leemhuis wrote:
> >> On 29.12.22 05:03, Jason A. Donenfeld wrote:
> >>> On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> >>>> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> >>>>> Ugh, while the problem [1] was fixed in 6.1, it's now happening again
> >>>>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
> >>>>> "tpm_try_transmit" error this time. The first indication of a problem
> >>>>> is this during a resume from suspend to ram:
> >>>>> tpm tpm0: A TPM error (28) occurred continue selftest
> >>>>> and then periodically
> >>>>> tpm tpm0: A TPM error (28) occurred attempting get random
> >>>>
> >>>> That's a TPM 1.2 error which means the TPM failed the selftest. The
> >>>> original problem was reported against TPM 2.0 because of a missing
> >>>> try_get_ops().
> >>>
> >>> No, I'm pretty sure the original bug, which was fixed by "char: tpm:
> >>> Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
> >>> considering it's the same hardware from Vlastimil causing this. I also
> >>> recall seeing this in 1.2 when I ran this with the TPM emulator. So
> >>> that's not correct.
> > [...]
> > So, this is now in rc3:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1382999aa0548a171a272ca817f6c38e797c458c
> >
> > That should help avoid the worst of the issue -- laptop not sleeping.
> > But the race or whatever it is still does exist. So you might want to
> > keep this in your tracker to periodically nudge the TPM folks about it.
>
> I did, and with -rc2 out now is a good time to remind everybody about
> it. Jarkko even looked into it, but no real fix emerged afaics. Or did it?

Jason's workaround was picked. I asked some questions in the thread but
have not received any responses.

BR, Jarkko

2023-03-14 13:00:10

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On 3/14/23, Jarkko Sakkinen <[email protected]> wrote:
> On Tue, Mar 14, 2023 at 10:35:33AM +0100, Thorsten Leemhuis wrote:
>> On 09.01.23 17:08, Jason A. Donenfeld wrote:
>> > On Thu, Jan 05, 2023 at 02:59:15PM +0100, Thorsten Leemhuis wrote:
>> >> On 29.12.22 05:03, Jason A. Donenfeld wrote:
>> >>> On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
>> >>>> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
>> >>>>> Ugh, while the problem [1] was fixed in 6.1, it's now happening
>> >>>>> again
>> >>>>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
>> >>>>> "tpm_try_transmit" error this time. The first indication of a
>> >>>>> problem
>> >>>>> is this during a resume from suspend to ram:
>> >>>>> tpm tpm0: A TPM error (28) occurred continue selftest
>> >>>>> and then periodically
>> >>>>> tpm tpm0: A TPM error (28) occurred attempting get random
>> >>>>
>> >>>> That's a TPM 1.2 error which means the TPM failed the selftest. The
>> >>>> original problem was reported against TPM 2.0 because of a missing
>> >>>> try_get_ops().
>> >>>
>> >>> No, I'm pretty sure the original bug, which was fixed by "char: tpm:
>> >>> Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
>> >>> considering it's the same hardware from Vlastimil causing this. I
>> >>> also
>> >>> recall seeing this in 1.2 when I ran this with the TPM emulator. So
>> >>> that's not correct.
>> > [...]
>> > So, this is now in rc3:
>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1382999aa0548a171a272ca817f6c38e797c458c
>> >
>> > That should help avoid the worst of the issue -- laptop not sleeping.
>> > But the race or whatever it is still does exist. So you might want to
>> > keep this in your tracker to periodically nudge the TPM folks about it.
>>
>> I did, and with -rc2 out now is a good time to remind everybody about
>> it. Jarkko even looked into it, but no real fix emerged afaics. Or did
>> it?
>
> Jason's workaround was picked. I asked some questions in the thread but
> have not received any responses.

As I've written several times now, that patch doesn't fix the issue.
It makes it less common but it still exists and needs to be addressed.
Please re-read my various messages describing this. I have nothing new
at all to add; you just need to review my prior comments. There's a
bug that probably needs to be fixed here by somebody who understands
the tpm1 code.

2023-03-14 13:10:29

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Tue, Mar 14, 2023 at 01:47:38PM +0100, Jason A. Donenfeld wrote:
> On 3/14/23, Jarkko Sakkinen <[email protected]> wrote:
> > On Tue, Mar 14, 2023 at 10:35:33AM +0100, Thorsten Leemhuis wrote:
> >> On 09.01.23 17:08, Jason A. Donenfeld wrote:
> >> > On Thu, Jan 05, 2023 at 02:59:15PM +0100, Thorsten Leemhuis wrote:
> >> >> On 29.12.22 05:03, Jason A. Donenfeld wrote:
> >> >>> On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> >> >>>> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> >> >>>>> Ugh, while the problem [1] was fixed in 6.1, it's now happening
> >> >>>>> again
> >> >>>>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
> >> >>>>> "tpm_try_transmit" error this time. The first indication of a
> >> >>>>> problem
> >> >>>>> is this during a resume from suspend to ram:
> >> >>>>> tpm tpm0: A TPM error (28) occurred continue selftest
> >> >>>>> and then periodically
> >> >>>>> tpm tpm0: A TPM error (28) occurred attempting get random
> >> >>>>
> >> >>>> That's a TPM 1.2 error which means the TPM failed the selftest. The
> >> >>>> original problem was reported against TPM 2.0 because of a missing
> >> >>>> try_get_ops().
> >> >>>
> >> >>> No, I'm pretty sure the original bug, which was fixed by "char: tpm:
> >> >>> Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
> >> >>> considering it's the same hardware from Vlastimil causing this. I
> >> >>> also
> >> >>> recall seeing this in 1.2 when I ran this with the TPM emulator. So
> >> >>> that's not correct.
> >> > [...]
> >> > So, this is now in rc3:
> >> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1382999aa0548a171a272ca817f6c38e797c458c
> >> >
> >> > That should help avoid the worst of the issue -- laptop not sleeping.
> >> > But the race or whatever it is still does exist. So you might want to
> >> > keep this in your tracker to periodically nudge the TPM folks about it.
> >>
> >> I did, and with -rc2 out now is a good time to remind everybody about
> >> it. Jarkko even looked into it, but no real fix emerged afaics. Or did
> >> it?
> >
> > Jason's workaround was picked. I asked some questions in the thread but
> > have not received any responses.
>
> As I've written several times now, that patch doesn't fix the issue.
> It makes it less common but it still exists and needs to be addressed.
> Please re-read my various messages describing this. I have nothing new
> at all to add; you just need to review my prior comments. There's a
> bug that probably needs to be fixed here by somebody who understands
> the tpm1 code.

I'll try qemu path to see if I can reproduce it with/without the already
merged workaround.

BR, Jarkko

2023-03-14 13:13:00

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Tue, Mar 14, 2023 at 03:06:47PM +0200, Jarkko Sakkinen wrote:
> On Tue, Mar 14, 2023 at 01:47:38PM +0100, Jason A. Donenfeld wrote:
> > On 3/14/23, Jarkko Sakkinen <[email protected]> wrote:
> > > On Tue, Mar 14, 2023 at 10:35:33AM +0100, Thorsten Leemhuis wrote:
> > >> On 09.01.23 17:08, Jason A. Donenfeld wrote:
> > >> > On Thu, Jan 05, 2023 at 02:59:15PM +0100, Thorsten Leemhuis wrote:
> > >> >> On 29.12.22 05:03, Jason A. Donenfeld wrote:
> > >> >>> On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> > >> >>>> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> > >> >>>>> Ugh, while the problem [1] was fixed in 6.1, it's now happening
> > >> >>>>> again
> > >> >>>>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
> > >> >>>>> "tpm_try_transmit" error this time. The first indication of a
> > >> >>>>> problem
> > >> >>>>> is this during a resume from suspend to ram:
> > >> >>>>> tpm tpm0: A TPM error (28) occurred continue selftest
> > >> >>>>> and then periodically
> > >> >>>>> tpm tpm0: A TPM error (28) occurred attempting get random
> > >> >>>>
> > >> >>>> That's a TPM 1.2 error which means the TPM failed the selftest. The
> > >> >>>> original problem was reported against TPM 2.0 because of a missing
> > >> >>>> try_get_ops().
> > >> >>>
> > >> >>> No, I'm pretty sure the original bug, which was fixed by "char: tpm:
> > >> >>> Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
> > >> >>> considering it's the same hardware from Vlastimil causing this. I
> > >> >>> also
> > >> >>> recall seeing this in 1.2 when I ran this with the TPM emulator. So
> > >> >>> that's not correct.
> > >> > [...]
> > >> > So, this is now in rc3:
> > >> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1382999aa0548a171a272ca817f6c38e797c458c
> > >> >
> > >> > That should help avoid the worst of the issue -- laptop not sleeping.
> > >> > But the race or whatever it is still does exist. So you might want to
> > >> > keep this in your tracker to periodically nudge the TPM folks about it.
> > >>
> > >> I did, and with -rc2 out now is a good time to remind everybody about
> > >> it. Jarkko even looked into it, but no real fix emerged afaics. Or did
> > >> it?
> > >
> > > Jason's workaround was picked. I asked some questions in the thread but
> > > have not received any responses.
> >
> > As I've written several times now, that patch doesn't fix the issue.
> > It makes it less common but it still exists and needs to be addressed.
> > Please re-read my various messages describing this. I have nothing new
> > at all to add; you just need to review my prior comments. There's a
> > bug that probably needs to be fixed here by somebody who understands
> > the tpm1 code.
>
> I'll try qemu path to see if I can reproduce it with/without the already
> merged workaround.

BTW, what sort of environment you had for your qemu run? I'm creating a
simple initramfs with buildroot for this.

BR, Jarkko

2023-03-14 13:55:37

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Tue, Mar 14, 2023 at 03:08:21PM +0200, Jarkko Sakkinen wrote:
> On Tue, Mar 14, 2023 at 03:06:47PM +0200, Jarkko Sakkinen wrote:
> > On Tue, Mar 14, 2023 at 01:47:38PM +0100, Jason A. Donenfeld wrote:
> > > On 3/14/23, Jarkko Sakkinen <[email protected]> wrote:
> > > > On Tue, Mar 14, 2023 at 10:35:33AM +0100, Thorsten Leemhuis wrote:
> > > >> On 09.01.23 17:08, Jason A. Donenfeld wrote:
> > > >> > On Thu, Jan 05, 2023 at 02:59:15PM +0100, Thorsten Leemhuis wrote:
> > > >> >> On 29.12.22 05:03, Jason A. Donenfeld wrote:
> > > >> >>> On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> > > >> >>>> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> > > >> >>>>> Ugh, while the problem [1] was fixed in 6.1, it's now happening
> > > >> >>>>> again
> > > >> >>>>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
> > > >> >>>>> "tpm_try_transmit" error this time. The first indication of a
> > > >> >>>>> problem
> > > >> >>>>> is this during a resume from suspend to ram:
> > > >> >>>>> tpm tpm0: A TPM error (28) occurred continue selftest
> > > >> >>>>> and then periodically
> > > >> >>>>> tpm tpm0: A TPM error (28) occurred attempting get random
> > > >> >>>>
> > > >> >>>> That's a TPM 1.2 error which means the TPM failed the selftest. The
> > > >> >>>> original problem was reported against TPM 2.0 because of a missing
> > > >> >>>> try_get_ops().
> > > >> >>>
> > > >> >>> No, I'm pretty sure the original bug, which was fixed by "char: tpm:
> > > >> >>> Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
> > > >> >>> considering it's the same hardware from Vlastimil causing this. I
> > > >> >>> also
> > > >> >>> recall seeing this in 1.2 when I ran this with the TPM emulator. So
> > > >> >>> that's not correct.
> > > >> > [...]
> > > >> > So, this is now in rc3:
> > > >> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1382999aa0548a171a272ca817f6c38e797c458c
> > > >> >
> > > >> > That should help avoid the worst of the issue -- laptop not sleeping.
> > > >> > But the race or whatever it is still does exist. So you might want to
> > > >> > keep this in your tracker to periodically nudge the TPM folks about it.
> > > >>
> > > >> I did, and with -rc2 out now is a good time to remind everybody about
> > > >> it. Jarkko even looked into it, but no real fix emerged afaics. Or did
> > > >> it?
> > > >
> > > > Jason's workaround was picked. I asked some questions in the thread but
> > > > have not received any responses.
> > >
> > > As I've written several times now, that patch doesn't fix the issue.
> > > It makes it less common but it still exists and needs to be addressed.
> > > Please re-read my various messages describing this. I have nothing new
> > > at all to add; you just need to review my prior comments. There's a
> > > bug that probably needs to be fixed here by somebody who understands
> > > the tpm1 code.
> >
> > I'll try qemu path to see if I can reproduce it with/without the already
> > merged workaround.
>
> BTW, what sort of environment you had for your qemu run? I'm creating a
> simple initramfs with buildroot for this.

Nothing special at all in the userspace.

I think details of my test bed might be in some other thread from when
that original patch went in or when the original bug report came, but
from memory, I believe what I did to reliably reproduce various issues
was comment out the sleep in random.c so that it keeps asking the TPM
for more bytes from the kthread, like this:

diff --git a/drivers/char/random.c b/drivers/char/random.c
index ce3ccd172cc8..708110c780aa 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -934,20 +934,20 @@ EXPORT_SYMBOL(add_device_randomness);
void add_hwgenerator_randomness(const void *buf, size_t len, size_t entropy, bool sleep_after)
{
mix_pool_bytes(buf, len);
credit_init_bits(entropy);

/*
* Throttle writing to once every reseed interval, unless we're not yet
* initialized or no entropy is credited.
*/
- if (sleep_after && !kthread_should_stop() && (crng_ready() || !entropy))
- schedule_timeout_interruptible(crng_reseed_interval());
+// if (sleep_after && !kthread_should_stop() && (crng_ready() || !entropy))
+// schedule_timeout_interruptible(crng_reseed_interval());
}
EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);

/*
* Handle random seed passed by bootloader, and credit it depending
* on the command line option 'random.trust_bootloader'.
*/
void __init add_bootloader_randomness(const void *buf, size_t len)
{

Then I hooked the tpm emulator up to qemu and put it in tpm1 mode. I had
userspace `echo mem > /sys/power/state` every couple of seconds (or
continuously maybe?), and then I used the qemu monitor interface to wake
the system from sleep. And kaboom.

Jason

2023-03-14 14:24:39

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Tue, Mar 14, 2023 at 02:53:11PM +0100, Jason A. Donenfeld wrote:
> On Tue, Mar 14, 2023 at 03:08:21PM +0200, Jarkko Sakkinen wrote:
> > On Tue, Mar 14, 2023 at 03:06:47PM +0200, Jarkko Sakkinen wrote:
> > > On Tue, Mar 14, 2023 at 01:47:38PM +0100, Jason A. Donenfeld wrote:
> > > > On 3/14/23, Jarkko Sakkinen <[email protected]> wrote:
> > > > > On Tue, Mar 14, 2023 at 10:35:33AM +0100, Thorsten Leemhuis wrote:
> > > > >> On 09.01.23 17:08, Jason A. Donenfeld wrote:
> > > > >> > On Thu, Jan 05, 2023 at 02:59:15PM +0100, Thorsten Leemhuis wrote:
> > > > >> >> On 29.12.22 05:03, Jason A. Donenfeld wrote:
> > > > >> >>> On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> > > > >> >>>> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> > > > >> >>>>> Ugh, while the problem [1] was fixed in 6.1, it's now happening
> > > > >> >>>>> again
> > > > >> >>>>> on the T460 with 6.2-rc1. Except I didn't see any oops message or
> > > > >> >>>>> "tpm_try_transmit" error this time. The first indication of a
> > > > >> >>>>> problem
> > > > >> >>>>> is this during a resume from suspend to ram:
> > > > >> >>>>> tpm tpm0: A TPM error (28) occurred continue selftest
> > > > >> >>>>> and then periodically
> > > > >> >>>>> tpm tpm0: A TPM error (28) occurred attempting get random
> > > > >> >>>>
> > > > >> >>>> That's a TPM 1.2 error which means the TPM failed the selftest. The
> > > > >> >>>> original problem was reported against TPM 2.0 because of a missing
> > > > >> >>>> try_get_ops().
> > > > >> >>>
> > > > >> >>> No, I'm pretty sure the original bug, which was fixed by "char: tpm:
> > > > >> >>> Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
> > > > >> >>> considering it's the same hardware from Vlastimil causing this. I
> > > > >> >>> also
> > > > >> >>> recall seeing this in 1.2 when I ran this with the TPM emulator. So
> > > > >> >>> that's not correct.
> > > > >> > [...]
> > > > >> > So, this is now in rc3:
> > > > >> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1382999aa0548a171a272ca817f6c38e797c458c
> > > > >> >
> > > > >> > That should help avoid the worst of the issue -- laptop not sleeping.
> > > > >> > But the race or whatever it is still does exist. So you might want to
> > > > >> > keep this in your tracker to periodically nudge the TPM folks about it.
> > > > >>
> > > > >> I did, and with -rc2 out now is a good time to remind everybody about
> > > > >> it. Jarkko even looked into it, but no real fix emerged afaics. Or did
> > > > >> it?
> > > > >
> > > > > Jason's workaround was picked. I asked some questions in the thread but
> > > > > have not received any responses.
> > > >
> > > > As I've written several times now, that patch doesn't fix the issue.
> > > > It makes it less common but it still exists and needs to be addressed.
> > > > Please re-read my various messages describing this. I have nothing new
> > > > at all to add; you just need to review my prior comments. There's a
> > > > bug that probably needs to be fixed here by somebody who understands
> > > > the tpm1 code.
> > >
> > > I'll try qemu path to see if I can reproduce it with/without the already
> > > merged workaround.
> >
> > BTW, what sort of environment you had for your qemu run? I'm creating a
> > simple initramfs with buildroot for this.
>
> Nothing special at all in the userspace.
>
> I think details of my test bed might be in some other thread from when
> that original patch went in or when the original bug report came, but
> from memory, I believe what I did to reliably reproduce various issues
> was comment out the sleep in random.c so that it keeps asking the TPM
> for more bytes from the kthread, like this:
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index ce3ccd172cc8..708110c780aa 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -934,20 +934,20 @@ EXPORT_SYMBOL(add_device_randomness);
> void add_hwgenerator_randomness(const void *buf, size_t len, size_t entropy, bool sleep_after)
> {
> mix_pool_bytes(buf, len);
> credit_init_bits(entropy);
>
> /*
> * Throttle writing to once every reseed interval, unless we're not yet
> * initialized or no entropy is credited.
> */
> - if (sleep_after && !kthread_should_stop() && (crng_ready() || !entropy))
> - schedule_timeout_interruptible(crng_reseed_interval());
> +// if (sleep_after && !kthread_should_stop() && (crng_ready() || !entropy))
> +// schedule_timeout_interruptible(crng_reseed_interval());
> }
> EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
>
> /*
> * Handle random seed passed by bootloader, and credit it depending
> * on the command line option 'random.trust_bootloader'.
> */
> void __init add_bootloader_randomness(const void *buf, size_t len)
> {
>
> Then I hooked the tpm emulator up to qemu and put it in tpm1 mode. I had
> userspace `echo mem > /sys/power/state` every couple of seconds (or
> continuously maybe?), and then I used the qemu monitor interface to wake
> the system from sleep. And kaboom.

Thanks for the tips! I'll try various patch combinations and see what
happens.

BR, Jarkko

2023-04-21 15:09:56

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Tue, Mar 14, 2023 at 04:23:54PM +0200, Jarkko Sakkinen wrote:
> > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > index ce3ccd172cc8..708110c780aa 100644
> > --- a/drivers/char/random.c
> > +++ b/drivers/char/random.c
> > @@ -934,20 +934,20 @@ EXPORT_SYMBOL(add_device_randomness);
> > void add_hwgenerator_randomness(const void *buf, size_t len, size_t entropy, bool sleep_after)
> > {
> > mix_pool_bytes(buf, len);
> > credit_init_bits(entropy);
> >
> > /*
> > * Throttle writing to once every reseed interval, unless we're not yet
> > * initialized or no entropy is credited.
> > */
> > - if (sleep_after && !kthread_should_stop() && (crng_ready() || !entropy))
> > - schedule_timeout_interruptible(crng_reseed_interval());
> > +// if (sleep_after && !kthread_should_stop() && (crng_ready() || !entropy))
> > +// schedule_timeout_interruptible(crng_reseed_interval());
> > }
> > EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
> >
> > /*
> > * Handle random seed passed by bootloader, and credit it depending
> > * on the command line option 'random.trust_bootloader'.
> > */
> > void __init add_bootloader_randomness(const void *buf, size_t len)
> > {
> >
> > Then I hooked the tpm emulator up to qemu and put it in tpm1 mode. I had
> > userspace `echo mem > /sys/power/state` every couple of seconds (or
> > continuously maybe?), and then I used the qemu monitor interface to wake
> > the system from sleep. And kaboom.
>
> Thanks for the tips! I'll try various patch combinations and see what
> happens.

I created a VM using libvirt with the following configuration [*]

<tpm model='tpm-tis'>
<backend type='emulator' version='1.2'/>
<alias name='tpm0'/>
</tpm>

I do not have issues with suspend so far with unmodified v6.3-rc7.

It would be nice to have a script that gives a guaranteed crash with
unmodified kernel even if run for some hours.

Instead of using QEMU monitor interface and /sys/power/state, I wonder
if a stress test could be based on looping rtcwake for easier comparison
of results?

[*] full description: https://gist.github.com/jarkkojs/66ea0d8d6a927311e8abb8b35e1ddbcd

BR, Jarkko

2023-04-21 18:39:23

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

Did you use the patch I sent you and suspend and resume according to
the instructions I gave you? If not, I don't have much to add.

2023-04-23 15:43:17

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Fri Apr 21, 2023 at 9:27 PM EEST, Jason A. Donenfeld wrote:
> Did you use the patch I sent you and suspend and resume according to
> the instructions I gave you? If not, I don't have much to add.

Finally, I got it reproduced at my side with TPM 1.2:

[ 0.379677] tpm_tis 00:00: 1.2 TPM (device-id 0x1, rev-id 1)
[ 32.453447] tpm tpm0: tpm_transmit: tpm_recv: error -5
[ 33.450601] tpm tpm0: Unable to read header
[ 33.450607] tpm tpm0: tpm_transmit: tpm_recv: error -62

I'll look at this further after I've sent v6.3 PR.

BR, Jarkko

2023-04-25 23:40:16

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Sun Apr 23, 2023 at 6:34 PM EEST, Jarkko Sakkinen wrote:
> On Fri Apr 21, 2023 at 9:27 PM EEST, Jason A. Donenfeld wrote:
> > Did you use the patch I sent you and suspend and resume according to
> > the instructions I gave you? If not, I don't have much to add.
>
> Finally, I got it reproduced at my side with TPM 1.2:
>
> [ 0.379677] tpm_tis 00:00: 1.2 TPM (device-id 0x1, rev-id 1)
> [ 32.453447] tpm tpm0: tpm_transmit: tpm_recv: error -5
> [ 33.450601] tpm tpm0: Unable to read header
> [ 33.450607] tpm tpm0: tpm_transmit: tpm_recv: error -62
>
> I'll look at this further after I've sent v6.3 PR.

OK, so this gives the exact tpm_transmit call where it fails:

$ sudo bpftrace -e 'kprobe:tpm_transmit { @[kstack] = count(); }'
[sudo] password for jarkko:
Attaching 1 probe...
^C

@[
tpm_transmit+1
tpm1_pcr_read+177
tpm1_do_selftest+287
tpm_tis_resume+443
pnp_bus_resume+102
dpm_run_callback+81
device_resume+173
dpm_resume+238
dpm_resume_end+17
suspend_devices_and_enter+473
enter_state+563
pm_suspend+68
state_store+43
kobj_attr_store+15
sysfs_kf_write+59
kernfs_fop_write_iter+304
vfs_write+590
ksys_write+115
__x64_sys_write+25
do_syscall_64+88
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
tpm_transmit+1
tpm1_do_selftest+179
tpm_tis_resume+443
pnp_bus_resume+102
dpm_run_callback+81
device_resume+173
dpm_resume+238
dpm_resume_end+17
suspend_devices_and_enter+473
enter_state+563
pm_suspend+68
state_store+43
kobj_attr_store+15
sysfs_kf_write+59
kernfs_fop_write_iter+304
vfs_write+590
ksys_write+115
__x64_sys_write+25
do_syscall_64+88
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
tpm_transmit+1
tpm1_pm_suspend+203
tpm_pm_suspend+131
__pnp_bus_suspend+65
pnp_bus_suspend+19
dpm_run_callback+81
__device_suspend+329
dpm_suspend+432
dpm_suspend_start+155
suspend_devices_and_enter+370
enter_state+563
pm_suspend+68
state_store+43
kobj_attr_store+15
sysfs_kf_write+59
kernfs_fop_write_iter+304
vfs_write+590
ksys_write+115
__x64_sys_write+25
do_syscall_64+88
entry_SYSCALL_64_after_hwframe+114
]: 1
@[
tpm_transmit+1
tpm1_get_random+206
tpm_get_random+70
tpm_hwrng_read+21
hwrng_fillfn+234
kthread+230
ret_from_fork+41
]: 75897

So it is the very first PCR read in tpm1_do_selftest.

There is a bug at plain sight in tpm1_tis_resume(): before
tpm_tis_resume() calls tpm1_do_selftest(), it only requests
and relinquishes locality. This is not sufficient: it should
also disable clkrun protocol.

tpm1_do_selftest() is called also during the driver initialization
successfully, the difference being that clkrun protocol is disabled.

I'm compiling now a kernel with a test fix that calls tpm_chip_start()
and tpm_chip_stop() as a substitute for request/relinquish locality.
These should be used anyway instead of ad-hoc code.

BR, Jarkko

BR, Jarkko

2023-04-26 01:43:38

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

Nice work! Happy that you got it figured out. That trace seems
consistent with what I was seeing in my traces, so I think you've
spotted the right bug.

Jason

2023-04-26 16:17:23

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Wed Apr 26, 2023 at 4:32 AM EEST, Jason A. Donenfeld wrote:
> Nice work! Happy that you got it figured out. That trace seems
> consistent with what I was seeing in my traces, so I think you've
> spotted the right bug.
>
> Jason

Unfortunately there might be two bugs. Unless I interpreted logs
incorrectly also hwrng can race with resume (sorry, I forgot to
save it).

Looking at drivers/char/hw_random/core.c there seems to be no
binding to the PM so I guess that this it supports what I'm
observing [*].

So there's two ways to fix the issue:

1. Unregister hwrng for the course of suspend
2. Add something like TPM_CHIP_SUSPENDED, which is set by suspend and
cleared by the resume.

I try the 2nd option first because I see it less complicated.
Probably would make sense to turn chip flags as atomic while
at it.

[*] https://elixir.bootlin.com/linux/latest/source/drivers/char/hw_random/core.c

BR, Jarkko

2023-04-26 17:05:33

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors

On Wed Apr 26, 2023 at 7:07 PM EEST, Jarkko Sakkinen wrote:
> On Wed Apr 26, 2023 at 4:32 AM EEST, Jason A. Donenfeld wrote:
> > Nice work! Happy that you got it figured out. That trace seems
> > consistent with what I was seeing in my traces, so I think you've
> > spotted the right bug.
> >
> > Jason
>
> Unfortunately there might be two bugs. Unless I interpreted logs
> incorrectly also hwrng can race with resume (sorry, I forgot to
> save it).
>
> Looking at drivers/char/hw_random/core.c there seems to be no
> binding to the PM so I guess that this it supports what I'm
> observing [*].
>
> So there's two ways to fix the issue:
>
> 1. Unregister hwrng for the course of suspend
> 2. Add something like TPM_CHIP_SUSPENDED, which is set by suspend and
> cleared by the resume.
>
> I try the 2nd option first because I see it less complicated.
> Probably would make sense to turn chip flags as atomic while
> at it.
>
> [*] https://elixir.bootlin.com/linux/latest/source/drivers/char/hw_random/core.c

OK, so I implemented fix also for hwrng and now I get a clean
resume. I'll add all necessary tags etc. and send for review.

BR, Jarkko