2022-07-07 20:50:02

by Todd Brandt

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

Adding printk maintainers to the thread


On Thu, 2022-07-07 at 13:31 -0700, Todd Brandt wrote:
> Ever since 5.19.0-rc1 the serial device has taken almost a second
> longer in both suspend and resume. This effect is witnessed in half
> the
> machines in our lab (~10 machines). It occurs on dell, asus, gigabyte
> and other machines so it's not hardware specific.
>
> I opened a bugzilla issue which includes the git bisect:
> https://bugzilla.kernel.org/show_bug.cgi?id=216216
>
>
> Bad commit:
>
> commit 3b604ca81202eea2a917eb6491e90f610fba0ec7
> Author: John Ogness <[email protected]>
> Date: Thu Apr 21 23:28:46 2022 +0206
>
> printk: add pr_flush()
>
> Provide a might-sleep function to allow waiting for console
> printers
> to catch up to the latest logged message.
>
> Use pr_flush() whenever it is desirable to get buffered messages
> printed before continuing: suspend_console(), resume_console(),
> console_stop(), console_start(), console_unblank().
>
> Signed-off-by: John Ogness <[email protected]>
> Reviewed-by: Petr Mladek <[email protected]>
> Signed-off-by: Petr Mladek <[email protected]>
> Link:
>
https://lore.kernel.org/r/[email protected]
>
> include/linux/printk.h | 7 +++++
> kernel/printk/printk.c | 83
> ++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 90 insertions(+)


2022-07-08 08:25:53

by John Ogness

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

Hi Todd,

Thanks for reporting this.

On 2022-07-07, Todd Brandt <[email protected]> wrote:
> Ever since 5.19.0-rc1 the serial device has taken almost a second
> longer in both suspend and resume. This effect is witnessed in half
> the machines in our lab (~10 machines). It occurs on dell, asus,
> gigabyte and other machines so it's not hardware specific.

Please provide a full dmesg log if you can. The important parts of the
log would be:

- boot arguments
- any lines with "printk" in it
- timing information
- messages during suspend and resume

pr_flush() uses a timeout of 1 second. I assume your console driver is
not printing and therefore not making any forward progress.

John Ogness

2022-07-08 21:49:21

by Todd Brandt

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

On Fri, 2022-07-08 at 10:07 +0206, John Ogness wrote:
> Hi Todd,
>
> Thanks for reporting this.
>
> On 2022-07-07, Todd Brandt <[email protected]> wrote:
> > Ever since 5.19.0-rc1 the serial device has taken almost a second
> > longer in both suspend and resume. This effect is witnessed in half
> > the machines in our lab (~10 machines). It occurs on dell, asus,
> > gigabyte and other machines so it's not hardware specific.
>
> Please provide a full dmesg log if you can. The important parts of
> the
> log would be:
>
> - boot arguments
> - any lines with "printk" in it
> - timing information
> - messages during suspend and resume
>
> pr_flush() uses a timeout of 1 second. I assume your console driver
> is
> not printing and therefore not making any forward progress.
>
> John Ogness

The dmesg logs are in the html timelines themselves, just click the
"dmesg" button in the upper right hand corner. The log button shows all
kinds of system info as well.

2022-07-09 20:49:15

by John Ogness

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

On 2022-07-08, Todd Brandt <[email protected]> wrote:
> The dmesg logs are in the html timelines themselves, just click the
> "dmesg" button in the upper right hand corner. The log button shows
> all kinds of system info as well.

Since the beginning of the kernel log is missing, I still do not see
information about which serial driver you are using. But since it is
x86, I'll assume it is an 8250.

Looking at freeze-5.19.0-rc1-bad.html, at 3431.221039 we see that
suspend_console() was called. The additional 1-second delay you are
referring to would be 3432.436187, where serial is suspended. pr_flush()
would have been satisfied when the message at 3431.221039 was
printed. So the question is, why is there still printing going on?

@Todd: I assume you actually see all the messages up to and including
3431.506391 when it suspends? But the next message does not appear on
suspend?

I wonder if the console_lock()/console_unlock() within __pr_flush() is
allowing extra printing to occur, for example if another CPU was calling
printk.

@Todd: Could you build your kernel with CONFIG_PRINTK_CALLER enabled?
Also, please provide the last line you _see_ after a suspend. And
finally, please make sure you are using the latest 5.19-rc. Thanks!

John Ogness

2022-07-10 19:59:14

by Andy Shevchenko

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

On Sat, Jul 9, 2022 at 10:48 PM John Ogness <[email protected]> wrote:
>
> On 2022-07-08, Todd Brandt <[email protected]> wrote:
> > The dmesg logs are in the html timelines themselves, just click the
> > "dmesg" button in the upper right hand corner. The log button shows
> > all kinds of system info as well.
>
> Since the beginning of the kernel log is missing, I still do not see
> information about which serial driver you are using. But since it is
> x86, I'll assume it is an 8250.
>
> Looking at freeze-5.19.0-rc1-bad.html, at 3431.221039 we see that
> suspend_console() was called. The additional 1-second delay you are
> referring to would be 3432.436187, where serial is suspended. pr_flush()
> would have been satisfied when the message at 3431.221039 was
> printed. So the question is, why is there still printing going on?

It might be no_console_suspend hack. Are you, btw, aware of this ugly
hack in the kernel?

--
With Best Regards,
Andy Shevchenko

2022-07-10 20:56:42

by John Ogness

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

On 2022-07-10, Andy Shevchenko <[email protected]> wrote:
>> Looking at freeze-5.19.0-rc1-bad.html, at 3431.221039 we see that
>> suspend_console() was called. The additional 1-second delay you are
>> referring to would be 3432.436187, where serial is
>> suspended. pr_flush() would have been satisfied when the message at
>> 3431.221039 was printed. So the question is, why is there still
>> printing going on?
>
> It might be no_console_suspend hack. Are you, btw, aware of this ugly
> hack in the kernel?

I am aware of it. There are some cases where it actually works. But it
is not being used here. The boot args are:

BOOT_IMAGE=/boot/vmlinuz-5.19.0-rc1+ root=UUID=1dfec046-baf6-4f38-8b5e-a8f438a48038 ro rw quiet console=ttyS0,115200 console=tty0 i915.enable_psr=1 initcall_debug log_buf_len=32M quiet splash console=tty0 console=ttyS0,115200n8 vt.handoff=7

I am curious if Todd sees this problem with 5.19-rc4 or later (the
kthread printers were removed).

John

2022-07-11 08:18:14

by Petr Mladek

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

On Sun 2022-07-10 22:10:45, John Ogness wrote:
> On 2022-07-10, Andy Shevchenko <[email protected]> wrote:
> >> Looking at freeze-5.19.0-rc1-bad.html, at 3431.221039 we see that
> >> suspend_console() was called. The additional 1-second delay you are
> >> referring to would be 3432.436187, where serial is
> >> suspended. pr_flush() would have been satisfied when the message at
> >> 3431.221039 was printed. So the question is, why is there still
> >> printing going on?
> >
> > It might be no_console_suspend hack. Are you, btw, aware of this ugly
> > hack in the kernel?
>
> I am aware of it. There are some cases where it actually works. But it
> is not being used here. The boot args are:
>
> BOOT_IMAGE=/boot/vmlinuz-5.19.0-rc1+ root=UUID=1dfec046-baf6-4f38-8b5e-a8f438a48038 ro rw quiet console=ttyS0,115200 console=tty0 i915.enable_psr=1 initcall_debug log_buf_len=32M quiet splash console=tty0 console=ttyS0,115200n8 vt.handoff=7
>
> I am curious if Todd sees this problem with 5.19-rc4 or later (the
> kthread printers were removed).

We removed the kthreads but not pr_flush(). The commit
3b604ca81202eea2a91 ("printk: add pr_flush()") is still there.

It seems that __pr_flush() does not check whether all consoles
are suspended. In this case the progress is not possible and
it has to wait the entire timeout.

Best Regards,
Petr

2022-07-11 11:44:02

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

On (22/07/11 10:13), Petr Mladek wrote:
>
> It seems that __pr_flush() does not check whether all consoles
> are suspended. In this case the progress is not possible and
> it has to wait the entire timeout.

But isn't console_suspended set after pr_flush() call?

2022-07-13 10:28:25

by John Ogness

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

On 2022-07-11, Sergey Senozhatsky <[email protected]> wrote:
>> It seems that __pr_flush() does not check whether all consoles are
>> suspended. In this case the progress is not possible and it has to
>> wait the entire timeout.
>
> But isn't console_suspended set after pr_flush() call?

There should not be any printing after the suspend_console() message. If
Todd's report is coming from 5.19-rc1, then it is likely a kthread
issue, where the kthread is not respecting @console_suspended. (This
would still need to be fixed for the kthreads, but would not be relevant
for 5.19.)

John

2022-07-13 15:24:46

by Todd Brandt

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

On Wed, 2022-07-13 at 11:57 +0206, John Ogness wrote:
> On 2022-07-11, Sergey Senozhatsky <[email protected]> wrote:
> > > It seems that __pr_flush() does not check whether all consoles
> > > are
> > > suspended. In this case the progress is not possible and it has
> > > to
> > > wait the entire timeout.
> >
> > But isn't console_suspended set after pr_flush() call?
>
> There should not be any printing after the suspend_console() message.
> If
> Todd's report is coming from 5.19-rc1, then it is likely a kthread
> issue, where the kthread is not respecting @console_suspended. (This
> would still need to be fixed for the kthreads, but would not be
> relevant
> for 5.19.)
>
> John

It appears to still be there in rc5 (our lab system is usually 6 days
behind, so rc6 hasn't been run yet). I'm building a new rc6 kernel with
CONFIG_PRINTK_CALLER today and will run it on all affected machines,
I'll send out the data this afternoon Oregon Time. I'll include
timelines for all affected machines to give you a better sampling.

2022-07-13 17:22:27

by Todd Brandt

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

I've updated the bugzilla entry with new data:
https://bugzilla.kernel.org/show_bug.cgi?id=216216

I just added 3 new tests for 5.19.0-rc6 on 3 machines that see this
issue: otcpl-asus-e200-cht (cherry trail), otcpl-aml-y (amber lake),
and otcpl-whl-u (whiskey lake). The kernel has the CONFIG_PRINTK_CALLER
option enabled.

The test is a S2idle and is run thusly:
%> sleepgraph -dev -m freeze -rtcwake 15

I've included the dmesg boot logs for all three. The dmesg
suspend/resume logs are included in the html timelines by clicking the
"dmesg" button in the upper right hand corner of the timeline. There's
a "log" button as well that shows other system into.

These files are attached to the bugzilla entry.
otcpl-aml-y-5.19.0-rc6-boot-dmesg.txt
otcpl-aml-y-5.19.0-rc6-freeze.html
otcpl-asus-e200-cht-5.19.0-rc6-boot-dmesg.txt
otcpl-asus-e200-cht-5.19.0-rc6-freeze.html
otcpl-whl-u-5.19.0-rc6-boot-dmesg.txt
otcpl-whl-u-5.19.0-rc6-freeze.html

If possible can we move this thread to the bugzilla comment section?


On Wed, 2022-07-13 at 11:57 +0206, John Ogness wrote:
> On 2022-07-11, Sergey Senozhatsky <[email protected]> wrote:
> > > It seems that __pr_flush() does not check whether all consoles
> > > are
> > > suspended. In this case the progress is not possible and it has
> > > to
> > > wait the entire timeout.
> >
> > But isn't console_suspended set after pr_flush() call?
>
> There should not be any printing after the suspend_console() message.
> If
> Todd's report is coming from 5.19-rc1, then it is likely a kthread
> issue, where the kthread is not respecting @console_suspended. (This
> would still need to be fixed for the kthreads, but would not be
> relevant
> for 5.19.)
>
> John

2022-07-13 18:50:58

by Todd Brandt

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

URGENT: Removing the commit FIXES the issue.

I just ran a 5.19.0-rc6 kernel with the offending commit removed and it
fixed the problem completely on all 3 machines. To be clear the
offending commit is:

commit 3b604ca81202eea2a917eb6491e90f610fba0ec7

I strongly recommend that this commit be pulled (or fixed very quickly)
before the 5.19 release or 1 in 10 linux machines running 5.19 will
take up to 2 seconds longer in suspend/resume.

commit 3b604ca81202eea2a917eb6491e90f610fba0ec7
Author: John Ogness <[email protected]>
Date: Thu Apr 21 23:28:46 2022 +0206

printk: add pr_flush()

Provide a might-sleep function to allow waiting for console
printers
to catch up to the latest logged message.

Use pr_flush() whenever it is desirable to get buffered messages
printed before continuing: suspend_console(), resume_console(),
console_stop(), console_start(), console_unblank().

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
Link:
https://lore.kernel.org/r/[email protected]

include/linux/printk.h | 7 +++++
kernel/printk/printk.c | 83
++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 90 insertions(+)



On Wed, 2022-07-13 at 10:11 -0700, Todd Brandt wrote:
> I've updated the bugzilla entry with new data:
> https://bugzilla.kernel.org/show_bug.cgi?id=216216
>
> I just added 3 new tests for 5.19.0-rc6 on 3 machines that see this
> issue: otcpl-asus-e200-cht (cherry trail), otcpl-aml-y (amber lake),
> and otcpl-whl-u (whiskey lake). The kernel has the
> CONFIG_PRINTK_CALLER
> option enabled.
>
> The test is a S2idle and is run thusly:
> %> sleepgraph -dev -m freeze -rtcwake 15
>
> I've included the dmesg boot logs for all three. The dmesg
> suspend/resume logs are included in the html timelines by clicking
> the
> "dmesg" button in the upper right hand corner of the timeline.
> There's
> a "log" button as well that shows other system into.
>
> These files are attached to the bugzilla entry.
> otcpl-aml-y-5.19.0-rc6-boot-dmesg.txt
> otcpl-aml-y-5.19.0-rc6-freeze.html
> otcpl-asus-e200-cht-5.19.0-rc6-boot-dmesg.txt
> otcpl-asus-e200-cht-5.19.0-rc6-freeze.html
> otcpl-whl-u-5.19.0-rc6-boot-dmesg.txt
> otcpl-whl-u-5.19.0-rc6-freeze.html
>
> If possible can we move this thread to the bugzilla comment section?
>
>
> On Wed, 2022-07-13 at 11:57 +0206, John Ogness wrote:
> > On 2022-07-11, Sergey Senozhatsky <[email protected]> wrote:
> > > > It seems that __pr_flush() does not check whether all consoles
> > > > are
> > > > suspended. In this case the progress is not possible and it has
> > > > to
> > > > wait the entire timeout.
> > >
> > > But isn't console_suspended set after pr_flush() call?
> >
> > There should not be any printing after the suspend_console()
> > message.
> > If
> > Todd's report is coming from 5.19-rc1, then it is likely a kthread
> > issue, where the kthread is not respecting @console_suspended.
> > (This
> > would still need to be fixed for the kthreads, but would not be
> > relevant
> > for 5.19.)
> >
> > John

2022-07-13 21:42:09

by John Ogness

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

On 2022-07-13, Todd Brandt <[email protected]> wrote:
> URGENT: Removing the commit FIXES the issue.
>
> I just ran a 5.19.0-rc6 kernel with the offending commit removed and
> it fixed the problem completely on all 3 machines.

I believe to have found the issue. A patch was posted to bugzilla:

https://bugzilla.kernel.org/attachment.cgi?id=301413

Please test this. Thanks.

John Ogness

2022-07-13 22:57:52

by John Ogness

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

Hi,

I have posted a v2 to bugzilla:

https://bugzilla.kernel.org/attachment.cgi?id=301414

This version allows console_stop()/console_start() to flush for the
non-suspend case.

Please test. Thanks.

John Ogness

2022-07-14 22:38:41

by Todd Brandt

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

On Wed, 2022-07-13 at 23:28 +0206, John Ogness wrote:
> On 2022-07-13, Todd Brandt <[email protected]> wrote:
> > URGENT: Removing the commit FIXES the issue.
> >
> > I just ran a 5.19.0-rc6 kernel with the offending commit removed
> > and
> > it fixed the problem completely on all 3 machines.
>
> I believe to have found the issue. A patch was posted to bugzilla:
>
> https://bugzilla.kernel.org/attachment.cgi?id=301413
>
> Please test this. Thanks.
>
> John Ogness

I ran with your fix on 30 systems. On the 7 machines where this problem
appears, the patch fixes it completely. On the 23 machines where this
problem does not appear, there are no issues or regressions.

Fix verified, thank you!

2022-07-15 06:39:29

by John Ogness

[permalink] [raw]
Subject: Re: PNP0501 serial driver takes almost 2 seconds to suspend/resume (printk issue)

On 2022-07-14, Todd Brandt <[email protected]> wrote:
> I ran with your fix on 30 systems. On the 7 machines where this
> problem appears, the patch fixes it completely. On the 23 machines
> where this problem does not appear, there are no issues or
> regressions.
>
> Fix verified, thank you!

I have posted [0] an official patch for 5.19-rc7. For the posted
version, I added more information to the commit message and comments.

Todd, thanks for reporting and testing!

John

[0] https://lore.kernel.org/lkml/[email protected]