2022-11-28 08:34:08

by Vlastimil Babka

[permalink] [raw]
Subject: [REGRESSION] suspend to ram fails in 6.1 due to tpm errors

Hi,

I've noticed my Lenovo T460 developed a failure to suspend to ram related to
TPM errors. Seems at each suspend/resume cycle there's a chance the errors
develop and then further suspends are blocked by the stuck TPM (or its driver?)
I can say for sure it never happened before 6.1, however I didn't test all 6.1
RCs and suspend/resume enough times to pinpoint it was 6.1-rc1 (which would be
naturally suspicious). Bisecting would also be time consuming and unreliable,
so hopefully this historical account will be sufficient:

The tpm messages on boot are always the same, here from 6.0.0-rc7:

tpm_tis 00:08: 1.2 TPM (device-id 0x1B, rev-id 16)
tpm tpm0:a TPM is disabled/deactivated (0x6)
tpm tpm0: tpm_read_log_acpi: TCPA log area empty

And normally during resume from suspend to ram I can see:

tpm tpm0: TPM is disabled/deactivated (0x6)

With 6.1-rc3 (the first 6.1-rcX I've tried on this laptop) this was still
behaving OK, suspend/resume went fine 4 times until I updated the kernel
and rebooted. Maybe it just wasn't enough cycles to hit the issue.

With 6.1-rc4, there were initially 3 resumes OK, but on 4th resume I saw:

tpm tpm0: tpm_try_transmit: send(): error -5
tpm tpm0: invalid TPM_STS.x 0xff, dumping stack for forensics
CPU: 2 PID: 15299 Comm: systemd-sleep Not tainted 6.1.0-rc4-2.gc03e512-default #1 openSUSE Tumbleweed (unreleased) 232cc11569ae1616983f707f1010e2c19601c7ee
Hardware name: LENOVO 20FMS27W03/20FMS27W03, BIOS R06ET71W (1.45 ) 02/21/2022
Call Trace:
<TASK>
dump_stack_lvl+0x44/0x5c
tpm_tis_status.cold+0x19/0x1e
wait_for_tpm_stat+0x120/0x200
? tpm_tcg_read_bytes+0x8f/0xa0
tpm_tis_send_data+0x8c/0x260
tpm_tis_send_main+0x33/0x130
tpm_transmit+0xd4/0x3e0
tpm_transmit_cmd+0x25/0x90
tpm1_do_selftest+0x88/0x130
? set_next_entity+0xda/0x150
tpm_tis_resume+0x8d/0x1b0
pnp_bus_resume+0x60/0xa0
? pnp_bus_suspend+0x10/0x10
dpm_run_callback+0x47/0x150
device_resume+0x104/0x270
? dpm_show_time.cold+0x62/0x62
dpm_resume+0x191/0x2e0
dpm_resume_end+0xd/0x20
suspend_devices_and_enter+0x1c3/0x880
pm_suspend.cold+0x2d2/0x35e
state_store+0x68/0xd0
kernfs_fop_write_iter+0x11b/0x1f0
vfs_write+0x1f4/0x3b0
ksys_write+0x63/0xe0
do_syscall_64+0x58/0x80
? do_syscall_64+0x67/0x80
? do_syscall_64+0x67/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f6197f079d4
Code: ff eb b7 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 90 90 80 3d cd 0f 0f 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
RSP: 002b:00007ffdaf9a92d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f6197f079d4
RDX: 0000000000000004 RSI: 00007ffdaf9a93c0 RDI: 0000000000000004
RBP: 00007ffdaf9a93c0 R08: 0000563fe8e33650 R09: 0000000000000073
R10: 00000000ffffffff R11: 0000000000000202 R12: 0000000000000004
R13: 0000563fe8e152d0 R14: 0000000000000004 R15: 00007f6197fe69e0
</TASK>

and a second later

tpm tpm0: tpm_try_transmit: send(): error -62

But that appears not to be the main issue. I've just noticed it while gathering
the info now, and didn't notice it back then as there were another 5 resumes ok
continuing the same kernel boot.

But then on another resume I got:

tpm tpm0: A TPM error (28) occurred continue selftest

And afterwards, many messages scattered in the log:

tpm tpm0: A TPM error (28) occurred attempting get random

And since then, suspend to ram no longer works and I see this:

tpm tpm0: Error (28) sending savestate before suspend
tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
tpm_tis 00:08: PM: failed to suspend: error 28
PM: Some devices failed to suspend, or early wake event detected

After reboot to 6.1-rc6, initially 3 resumes ok, and then again on 4th resume:

tpm tpm0: A TPM error (28) occurred continue selftest

and same story with the errors attempting get random, and suspend failing.
Notably this was without the tpm_try_transmit splat above, so that is
probably indeed not tha main issue. The moment things go wrong is
the "A TPM error (28) occurred continue selftest" during resume.

Dominik on IRC pointed me to commit b006c439d58d ("hwrng: core - start hwrng
kthread also for untrusted sources"), which could make sense if the TPM was not
used at all before and now it's used for randomness. But then it probably "just"
uncovered a pre-existing issue? Maybe there's a race with getting the randomness
and suspend? Could it be exactly what this patch is attempting to fix?
https://lore.kernel.org/all/[email protected]/

Thanks,
Vlastimil

#regzbot introduced: v6.0..v6.1-rc4


2022-11-28 12:13:28

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.1 due to tpm errors

Hi Vlastimil,

Thanks for CC'ing me.

On Mon, Nov 28, 2022 at 09:15:33AM +0100, Vlastimil Babka wrote:
> Dominik on IRC pointed me to commit b006c439d58d ("hwrng: core - start hwrng
> kthread also for untrusted sources"), which could make sense if the TPM was not
> used at all before and now it's used for randomness. But then it probably "just"
> uncovered a pre-existing issue? Maybe there's a race with getting the randomness
> and suspend? Could it be exactly what this patch is attempting to fix?
> https://lore.kernel.org/all/[email protected]/

The commit you referenced just turns on some functionality that most
people previously didn't have (but maybe could have, manually). So this
is a case of a new commit unearthing a bug in some old code. We had
something similar happen with a raspi driver and fixed the bug there.
From the looks of the lore link you provided, it sounds like the same
thing has happened already there to tpm? That this is already fixed by
that commit? I think if you have a system that's readily exhibiting the
issue, the best thing to do would be to try that series, and report
back, maybe even providing your `Tested-by:` line if it works.

I'll play around and see if I can repro too. Looks like you have a T460?
I don't have exactly that but maybe something close enough.

Jason

2022-11-28 14:38:02

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.1 due to tpm errors

On 11/28/22 13:03, Jason A. Donenfeld wrote:
> Hi Vlastimil,
>
> Thanks for CC'ing me.
>
> On Mon, Nov 28, 2022 at 09:15:33AM +0100, Vlastimil Babka wrote:
>> Dominik on IRC pointed me to commit b006c439d58d ("hwrng: core - start hwrng
>> kthread also for untrusted sources"), which could make sense if the TPM was not
>> used at all before and now it's used for randomness. But then it probably "just"
>> uncovered a pre-existing issue? Maybe there's a race with getting the randomness
>> and suspend? Could it be exactly what this patch is attempting to fix?
>> https://lore.kernel.org/all/[email protected]/
>
> The commit you referenced just turns on some functionality that most
> people previously didn't have (but maybe could have, manually). So this
> is a case of a new commit unearthing a bug in some old code. We had
> something similar happen with a raspi driver and fixed the bug there.

Right, I suspected as much.

> From the looks of the lore link you provided, it sounds like the same
> thing has happened already there to tpm? That this is already fixed by
> that commit? I think if you have a system that's readily exhibiting the

Well it was a patch series where a v3 was promised with more info about the
bug it fixes, but I haven't found a v3. So if the author or TPM maintainers
can confirm that it does (or can) indeed fix such a bug I'm observing, I'd
hope for it to be fasttracked to 6.1.

> issue, the best thing to do would be to try that series, and report
> back, maybe even providing your `Tested-by:` line if it works.

Yeah, I have compiled rc7 with patch 1/3, without the rest as that didn't
seem necessary, and there was some review feedback that some parts need
fixing. So far it didn't exhibit the bug when suspending/resuming 10 times,
but that's not yet a guarantee due to nondeterministic nature of the issue
and usually the suspends weren't happening in such rapid succession... I'll
keep using that kernel for now on the laptop, fwiw.

> I'll play around and see if I can repro too. Looks like you have a T460?
> I don't have exactly that but maybe something close enough.

Great, thanks!

> Jason

2022-11-28 18:09:05

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.1 due to tpm errors

On Mon, Nov 28, 2022 at 01:03:31PM +0100, Jason A. Donenfeld wrote:
> I'll play around and see if I can repro too. Looks like you have a T460?
> I don't have exactly that but maybe something close enough.

I reproduced it, and then tried that commit you linked, and it appears
to fix the problem. So maybe just poke the author/maintainer there and
make sure that gets in somewhere.

Jason

2022-11-28 20:17:56

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [REGRESSION] suspend to ram fails in 6.1 due to tpm errors


On 11/28/22 14:35, Vlastimil Babka wrote:
> On 11/28/22 13:03, Jason A. Donenfeld wrote:
>> Hi Vlastimil,
>>
>> Thanks for CC'ing me.
>>
>> On Mon, Nov 28, 2022 at 09:15:33AM +0100, Vlastimil Babka wrote:
>>> Dominik on IRC pointed me to commit b006c439d58d ("hwrng: core - start hwrng
>>> kthread also for untrusted sources"), which could make sense if the TPM was not
>>> used at all before and now it's used for randomness. But then it probably "just"
>>> uncovered a pre-existing issue? Maybe there's a race with getting the randomness
>>> and suspend? Could it be exactly what this patch is attempting to fix?
>>> https://lore.kernel.org/all/[email protected]/
>>
>> The commit you referenced just turns on some functionality that most
>> people previously didn't have (but maybe could have, manually). So this
>> is a case of a new commit unearthing a bug in some old code. We had
>> something similar happen with a raspi driver and fixed the bug there.
>
> Right, I suspected as much.
>
>> From the looks of the lore link you provided, it sounds like the same
>> thing has happened already there to tpm? That this is already fixed by
>> that commit? I think if you have a system that's readily exhibiting the
>
> Well it was a patch series where a v3 was promised with more info about the
> bug it fixes, but I haven't found a v3. So if the author or TPM maintainers
> can confirm that it does (or can) indeed fix such a bug I'm observing, I'd
> hope for it to be fasttracked to 6.1.
>
>> issue, the best thing to do would be to try that series, and report
>> back, maybe even providing your `Tested-by:` line if it works.
>
> Yeah, I have compiled rc7 with patch 1/3, without the rest as that didn't
> seem necessary, and there was some review feedback that some parts need
> fixing. So far it didn't exhibit the bug when suspending/resuming 10 times,
> but that's not yet a guarantee due to nondeterministic nature of the issue
> and usually the suspends weren't happening in such rapid succession... I'll
> keep using that kernel for now on the laptop, fwiw.

With the hack you provided on IRC I was able to more reliably reproduce this
on rc7+hack, and not reproduce anymore with rc7+hack+Jan's patch so it seems
like the proper fix to me, thanks.

#regzbot monitor: https://lore.kernel.org/all/[email protected]/ fix

>> I'll play around and see if I can repro too. Looks like you have a T460?
>> I don't have exactly that but maybe something close enough.
>
> Great, thanks!
>
>> Jason
>