2023-09-20 05:02:14

by John Ogness

Subject: [PATCH printk v2 00/11] wire up nbcon atomic printing

Hi,

This is v2 of a series to wire up the nbcon consoles so that
they actually perform atomic printing. This series is only a
subset of the original v1 [0]. In particular, this series
represents patches 14 and 16-18 of the v1 series. For
information about the motivation of the atomic consoles,
please read the cover letter of v1.

This series focuses on providing the functionality for marking
atomic printing sections, as well as on wiring up nbcon
atomic printing to the general console_flush_all() function.
This series does _not_ include threaded printing or nbcon
drivers. Those features will be added in separate follow-up
series.

Note that the behavior of atomic printing in priority-elevated
atomic printing sections differs from the legacy consoles. With
atomic printing, the full set of urgent messages
(WARN/OOPS/PANIC) is first stored into the ringbuffer and only
flushed afterwards. This makes sure the full backtraces are
available in the ringbuffer, even if they do not make it out to
the console(s). This is in accordance with what was discussed
at LPC 2022 [1].
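
As a usage sketch (the marking API is added in patch 4; the
actual flush on exit is wired up in patch 6):

   enum nbcon_prio prev_prio;

   /* Begin an urgent section: elevate this CPU's printing priority. */
   prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);

   /* printk() calls here only store records into the ringbuffer. */
   printk(KERN_WARNING "emergency information...\n");

   /*
    * Exiting the outermost section flushes the accumulated
    * records to the nbcon consoles.
    */
   nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, prev_prio);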

A lot has changed since v1 and the patches no longer correlate
1:1. Here is an attempt to list the changes:

- Rather than flushing the full ringbuffer with one nbcon
console before moving to the next nbcon console, adopt the
same flushing strategy as the legacy consoles: rotate
through the consoles, one record at a time.

- Introduce nbcon_atomic_emit_one() to perform the "lock, emit
one record, unlock" pattern. (This is only a helper
function.)

- Introduce nbcon_console_emit_next_record() to act as the
nbcon variant of console_emit_next_record(). This allows
straightforward integration of nbcon consoles into
console_flush_all(). (See the sketch after this list.)

- nbcon_atomic_flush() no longer takes any arguments. These
were awkward for the caller. The context object is now
hidden from the caller.

- Introduce __nbcon_atomic_flush_all() as an internal helper
function in order to hide the ability to attempt unsafe
hostile takeovers from outside nbcon.c.

- For printk_trigger_flush(), migration is disabled instead of
preemption.

- Add atomic write enforcement to oops and lockdep.

- Comments and kerneldoc updated.
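
To illustrate the console_flush_all() integration mentioned
above (that patch is not reproduced in this thread, so the
exact signatures are assumptions), the per-record rotation
roughly takes this shape:

   for_each_console_srcu(con) {
           short flags = console_srcu_read_flags(con);
           bool progress;

           if (!console_is_usable(con, flags))
                   continue;

           /* Emit at most one record per console per rotation. */
           if (flags & CON_NBCON)
                   progress = nbcon_console_emit_next_record(con);
           else
                   progress = console_emit_next_record(con, handover, cookie);

           if (progress)
                   any_progress = true;
   }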

John Ogness

[0] https://lore.kernel.org/lkml/[email protected]

[1] https://lore.kernel.org/lkml/[email protected]

John Ogness (7):
printk: Make console_is_usable() available to nbcon
printk: Let console_is_usable() handle nbcon
printk: Add @flags argument for console_is_usable()
printk: nbcon: Wire up nbcon into console_flush_all()
panic: Add atomic write enforcement to oops
rcu: Add atomic write enforcement for rcu stalls
lockdep: Add atomic write enforcement for lockdep splats

Thomas Gleixner (4):
printk: nbcon: Provide functions to mark atomic write sections
printk: nbcon: Provide function for atomic flushing
printk: nbcon: Wire up nbcon console atomic flushing
panic: Add atomic write enforcement to warn/panic

include/linux/console.h | 4 +
include/linux/printk.h | 6 +
kernel/locking/lockdep.c | 7 ++
kernel/panic.c | 66 +++++++++++
kernel/printk/internal.h | 37 ++++++
kernel/printk/nbcon.c | 246 ++++++++++++++++++++++++++++++++++++++-
kernel/printk/printk.c | 67 +++++------
kernel/rcu/tree_stall.h | 6 +
8 files changed, 396 insertions(+), 43 deletions(-)


base-commit: 9757acd0a700ba4a0d16dde4ba820eb052aba1a7
--
2.39.2


2023-09-20 06:07:32

by John Ogness

Subject: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

Invoke the atomic write enforcement functions for oops to
ensure that the information gets out to the consoles.

Since there is no single general function that calls both
oops_enter() and oops_exit(), the nesting feature of atomic
write sections is taken advantage of in order to guarantee
full coverage between the first oops_enter() and the last
oops_exit().

It is important to note that if there are any legacy consoles
registered, they will be attempting to directly print from the
printk-caller context, which may jeopardize the reliability of
the atomic consoles. Optimally there should be no legacy
consoles registered.

Signed-off-by: John Ogness <[email protected]>
---
kernel/panic.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

diff --git a/kernel/panic.c b/kernel/panic.c
index 86ed71ba8c4d..e2879098645d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -614,6 +614,10 @@ bool oops_may_print(void)
return pause_on_oops_flag == 0;
}

+static atomic_t oops_cpu = ATOMIC_INIT(-1);
+static int oops_nesting;
+static enum nbcon_prio oops_prev_prio;
+
/*
* Called when the architecture enters its oops handler, before it prints
* anything. If this is the first CPU to oops, and it's oopsing the first
@@ -630,6 +634,36 @@ bool oops_may_print(void)
*/
void oops_enter(void)
{
+ enum nbcon_prio prev_prio;
+ int cpu = -1;
+
+ /*
+ * If this turns out to be the first CPU in oops, this is the
+ * beginning of the outermost atomic section. Otherwise it is
+ * the beginning of an inner atomic section.
+ */
+ prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
+
+ if (atomic_try_cmpxchg_relaxed(&oops_cpu, &cpu, smp_processor_id())) {
+ /*
+ * This is the first CPU in oops. Save the outermost
+ * @prev_prio in order to restore it on the outermost
+ * matching oops_exit(), when @oops_nesting == 0.
+ */
+ oops_prev_prio = prev_prio;
+
+ /*
+ * Enter an inner atomic section that ends at the end of this
+ * function. In this case, the nbcon_atomic_enter() above
+ * began the outermost atomic section.
+ */
+ prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
+ }
+
+ /* Track nesting when this CPU is the owner. */
+ if (cpu == -1 || cpu == smp_processor_id())
+ oops_nesting++;
+
tracing_off();
/* can't trust the integrity of the kernel anymore: */
debug_locks_off();
@@ -637,6 +671,9 @@ void oops_enter(void)

if (sysctl_oops_all_cpu_backtrace)
trigger_all_cpu_backtrace();
+
+ /* Exit inner atomic section. */
+ nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, prev_prio);
}

static void print_oops_end_marker(void)
@@ -652,6 +689,18 @@ void oops_exit(void)
{
do_oops_enter_exit();
print_oops_end_marker();
+
+ if (atomic_read(&oops_cpu) == smp_processor_id()) {
+ oops_nesting--;
+ if (oops_nesting == 0) {
+ atomic_set(&oops_cpu, -1);
+
+ /* Exit outermost atomic section. */
+ nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, oops_prev_prio);
+ }
+ }
+ put_cpu();
+
kmsg_dump(KMSG_DUMP_OOPS);
}

--
2.39.2

2023-09-20 07:42:42

by John Ogness

Subject: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

From: Thomas Gleixner <[email protected]>

WARN/OOPS/PANIC require printing out immediately since the
regular printing method (and in the future, the printing
threads) might not be able to run.

Add per-CPU state to denote the priority/urgency of the output
and provide functions to mark the beginning and end of sections
where the urgent messages are generated.

Note that when a CPU is in a priority elevated state, flushing
only occurs when dropping back to a lower priority. This allows
the full set of printk records (WARN/OOPS/PANIC output) to be
stored in the ringbuffer before beginning to flush the backlog.

Co-developed-by: John Ogness <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Signed-off-by: Thomas Gleixner (Intel) <[email protected]>
---
include/linux/console.h | 4 ++
kernel/printk/nbcon.c | 89 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 93 insertions(+)

diff --git a/include/linux/console.h b/include/linux/console.h
index e4fc6f7c1496..25a3ddd39083 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -452,10 +452,14 @@ static inline bool console_is_registered(const struct console *con)
hlist_for_each_entry(con, &console_list, node)

#ifdef CONFIG_PRINTK
+extern enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio);
+extern void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio);
extern bool nbcon_can_proceed(struct nbcon_write_context *wctxt);
extern bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt);
extern bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt);
#else
+static inline enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio) { return NBCON_PRIO_NONE; }
+static inline void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio) { }
static inline bool nbcon_can_proceed(struct nbcon_write_context *wctxt) { return false; }
static inline bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt) { return false; }
static inline bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt) { return false; }
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index b96077152f49..9359906b575b 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -961,6 +961,95 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
return nbcon_context_exit_unsafe(ctxt);
}

+/**
+ * struct nbcon_cpu_state - Per CPU printk context state
+ * @prio: The current context priority level
+ * @nesting: Per priority nest counter
+ */
+struct nbcon_cpu_state {
+ enum nbcon_prio prio;
+ int nesting[NBCON_PRIO_MAX];
+};
+
+static DEFINE_PER_CPU(struct nbcon_cpu_state, nbcon_pcpu_state);
+static struct nbcon_cpu_state early_nbcon_pcpu_state __initdata;
+
+/**
+ * nbcon_get_cpu_state - Get the per CPU console state pointer
+ *
+ * Returns either a pointer to the per CPU state of the current CPU or to
+ * the init data state during early boot.
+ */
+static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
+{
+ if (!printk_percpu_data_ready())
+ return &early_nbcon_pcpu_state;
+
+ return this_cpu_ptr(&nbcon_pcpu_state);
+}
+
+/**
+ * nbcon_atomic_enter - Enter a context that enforces atomic printing
+ * @prio: Priority of the context
+ *
+ * Return: The previous priority that needs to be fed into
+ * the corresponding nbcon_atomic_exit()
+ * Context: Any context. Disables migration.
+ */
+enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio)
+{
+ struct nbcon_cpu_state *cpu_state;
+ enum nbcon_prio prev_prio;
+
+ migrate_disable();
+
+ cpu_state = nbcon_get_cpu_state();
+
+ prev_prio = cpu_state->prio;
+ if (prio > prev_prio)
+ cpu_state->prio = prio;
+
+ /*
+ * Increment the nesting on @cpu_state->prio (instead of
+ * @prio) so that a WARN() nested within a panic printout
+ * does not attempt to scribble state.
+ */
+ cpu_state->nesting[cpu_state->prio]++;
+
+ return prev_prio;
+}
+
+/**
+ * nbcon_atomic_exit - Exit a context that enforces atomic printing
+ * @prio: Priority of the context to leave
+ * @prev_prio: Priority of the previous context for restore
+ *
+ * Context: Any context. Enables migration.
+ *
+ * @prev_prio is the priority returned by the corresponding
+ * nbcon_atomic_enter().
+ */
+void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
+{
+ struct nbcon_cpu_state *cpu_state;
+
+ cpu_state = nbcon_get_cpu_state();
+
+ /*
+ * Undo the nesting of nbcon_atomic_enter() at the CPU state
+ * priority.
+ */
+ cpu_state->nesting[cpu_state->prio]--;
+
+ /*
+ * Restore the previous priority, which was returned by
+ * nbcon_atomic_enter().
+ */
+ cpu_state->prio = prev_prio;
+
+ migrate_enable();
+}
+
/**
* nbcon_alloc - Allocate buffers needed by the nbcon console
* @con: Console to allocate buffers for
--
2.39.2

2023-09-20 08:46:55

by John Ogness

Subject: [PATCH printk v2 05/11] printk: nbcon: Provide function for atomic flushing

From: Thomas Gleixner <[email protected]>

Provide nbcon_atomic_flush() to perform atomic write flushing
of all registered nbcon consoles. Like with legacy consoles,
the nbcon consoles are flushed one record per console. This
allows all nbcon consoles to make progress pseudo-simultaneously,
rather than one console waiting for the full ringbuffer to be
dumped to another console before printing anything.

Note that if the current CPU is in a nested elevated priority
state (EMERGENCY/PANIC), nbcon_atomic_flush() does nothing.
This ensures that if the printing itself generates urgent
messages (OOPS/WARN/PANIC), those messages are fully stored into
the ringbuffer before any printing resumes.

Co-developed-by: John Ogness <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Signed-off-by: Thomas Gleixner (Intel) <[email protected]>
---
include/linux/printk.h | 6 +++
kernel/printk/nbcon.c | 101 ++++++++++++++++++++++++++++++++++++++++-
2 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index 8ef499ab3c1e..58e5f34d6df1 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -192,6 +192,7 @@ void show_regs_print_info(const char *log_lvl);
extern asmlinkage void dump_stack_lvl(const char *log_lvl) __cold;
extern asmlinkage void dump_stack(void) __cold;
void printk_trigger_flush(void);
+extern void nbcon_atomic_flush_all(void);
#else
static inline __printf(1, 0)
int vprintk(const char *s, va_list args)
@@ -271,6 +272,11 @@ static inline void dump_stack(void)
static inline void printk_trigger_flush(void)
{
}
+
+static inline void nbcon_atomic_flush_all(void)
+{
+}
+
#endif

#ifdef CONFIG_SMP
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index 9359906b575b..2a9ff18fc78c 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -571,7 +571,6 @@ static struct printk_buffers panic_nbcon_pbufs;
* in an unsafe state. Otherwise, on success the caller may assume
* the console is not in an unsafe state.
*/
-__maybe_unused
static bool nbcon_context_try_acquire(struct nbcon_context *ctxt)
{
unsigned int cpu = smp_processor_id();
@@ -873,7 +872,6 @@ EXPORT_SYMBOL_GPL(nbcon_exit_unsafe);
* When true is returned, @wctxt->ctxt.backlog indicates whether there are
* still records pending in the ringbuffer,
*/
-__maybe_unused
static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
{
struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
@@ -988,6 +986,105 @@ static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
return this_cpu_ptr(&nbcon_pcpu_state);
}

+/**
+ * nbcon_atomic_emit_one - Print one record for a console in atomic mode
+ * @wctxt: An initialized write context struct to use
+ * for this context
+ *
+ * Returns false if the given console could not print a record or there are
+ * no more records to print, otherwise true.
+ *
+ * This is an internal helper to handle the locking of the console before
+ * calling nbcon_emit_next_record().
+ */
+static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
+{
+ struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
+
+ if (!nbcon_context_try_acquire(ctxt))
+ return false;
+
+ /*
+ * nbcon_emit_next_record() returns false when the console was
+ * handed over or taken over. In both cases the context is no
+ * longer valid.
+ */
+ if (!nbcon_emit_next_record(wctxt))
+ return false;
+
+ nbcon_context_release(ctxt);
+
+ return prb_read_valid(prb, ctxt->seq, NULL);
+}
+
+/**
+ * __nbcon_atomic_flush_all - Flush all nbcon consoles in atomic mode
+ * @allow_unsafe_takeover: True, to allow unsafe hostile takeovers
+ */
+static void __nbcon_atomic_flush_all(bool allow_unsafe_takeover)
+{
+ struct nbcon_write_context wctxt = { };
+ struct nbcon_context *ctxt = &ACCESS_PRIVATE(&wctxt, ctxt);
+ struct nbcon_cpu_state *cpu_state;
+ struct console *con;
+ bool any_progress;
+ int cookie;
+
+ cpu_state = nbcon_get_cpu_state();
+
+ /*
+ * Let the outermost flush of this priority print. This avoids
+ * nasty hackery for nested WARN() where the printing itself
+ * generates one and ensures such nested messages are stored to
+ * the ringbuffer before any printing resumes.
+ *
+ * cpu_state->prio <= NBCON_PRIO_NORMAL is not subject to nesting
+ * and can proceed in order to allow any atomic printing for
+ * regular kernel messages.
+ */
+ if (cpu_state->prio > NBCON_PRIO_NORMAL &&
+ cpu_state->nesting[cpu_state->prio] != 1)
+ return;
+
+ do {
+ any_progress = false;
+
+ cookie = console_srcu_read_lock();
+ for_each_console_srcu(con) {
+ short flags = console_srcu_read_flags(con);
+ bool progress;
+
+ if (!(flags & CON_NBCON))
+ continue;
+
+ if (!console_is_usable(con, flags))
+ continue;
+
+ memset(ctxt, 0, sizeof(*ctxt));
+ ctxt->console = con;
+ ctxt->spinwait_max_us = 2000;
+ ctxt->prio = cpu_state->prio;
+ ctxt->allow_unsafe_takeover = allow_unsafe_takeover;
+
+ progress = nbcon_atomic_emit_one(&wctxt);
+ if (!progress)
+ continue;
+ any_progress = true;
+ }
+ console_srcu_read_unlock(cookie);
+ } while (any_progress);
+}
+
+/**
+ * nbcon_atomic_flush_all - Flush all nbcon consoles in atomic mode
+ *
+ * Context: Any context where migration is disabled.
+ */
+void nbcon_atomic_flush_all(void)
+{
+ __nbcon_atomic_flush_all(false);
+}
+
/**
* nbcon_atomic_enter - Enter a context that enforces atomic printing
* @prio: Priority of the context
--
2.39.2

2023-09-20 09:24:05

by John Ogness

Subject: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

From: Thomas Gleixner <[email protected]>

Perform nbcon console atomic flushing at key call sites:

nbcon_atomic_exit() - When exiting from the outermost atomic
printing section. If the priority was NBCON_PRIO_PANIC,
attempt a second flush allowing unsafe hostile
takeovers.

console_flush_on_panic() - Called from several call sites to
trigger ringbuffer dumping in an urgent situation.

console_flush_on_panic() - Typically the panic() function will
take care of atomic flushing the nbcon consoles on
panic. However, there are several users of
console_flush_on_panic() outside of panic().

printk_trigger_flush() - Used in urgent situations to trigger a
dump in an irq_work context. However, the atomic
flushing part happens in the calling context since an
alternative context is not required.

Co-developed-by: John Ogness <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Signed-off-by: Thomas Gleixner (Intel) <[email protected]>
---
kernel/printk/nbcon.c | 10 ++++++++++
kernel/printk/printk.c | 6 ++++++
2 files changed, 16 insertions(+)

diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index 2a9ff18fc78c..82e6a1678363 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -1092,6 +1092,11 @@ void nbcon_atomic_flush_all(void)
* Return: The previous priority that needs to be fed into
* the corresponding nbcon_atomic_exit()
* Context: Any context. Disables migration.
+ *
+ * When within an atomic printing section, no atomic printing occurs. This
+ * is to allow all emergency messages to be dumped into the ringbuffer before
+ * flushing the ringbuffer. The actual atomic printing occurs when exiting
+ * the outermost atomic printing section.
*/
enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio)
{
@@ -1130,8 +1135,13 @@ void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
{
struct nbcon_cpu_state *cpu_state;

+ __nbcon_atomic_flush_all(false);
+
cpu_state = nbcon_get_cpu_state();

+ if (cpu_state->prio == NBCON_PRIO_PANIC)
+ __nbcon_atomic_flush_all(true);
+
/*
* Undo the nesting of nbcon_atomic_enter() at the CPU state
* priority.
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 6ef33cefa4d0..419c0fabc481 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3159,6 +3159,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
console_srcu_read_unlock(cookie);
}

+ nbcon_atomic_flush_all();
+
console_flush_all(false, &next_seq, &handover);
}

@@ -3903,6 +3905,10 @@ void defer_console_output(void)

void printk_trigger_flush(void)
{
+ migrate_disable();
+ nbcon_atomic_flush_all();
+ migrate_enable();
+
defer_console_output();
}

--
2.39.2

2023-09-20 14:34:33

by John Ogness

Subject: Re: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

On 2023-09-20, Andy Shevchenko <[email protected]> wrote:
> On Wed, Sep 20, 2023 at 01:14:54AM +0206, John Ogness wrote:
>> Invoke the atomic write enforcement functions for oops to
>> ensure that the information gets out to the consoles.
>>
>> Since there is no single general function that calls both
>> oops_enter() and oops_exit(), the nesting feature of atomic
>> write sections is taken advantage of in order to guarantee
>> full coverage between the first oops_enter() and the last
>> oops_exit().
>>
>> It is important to note that if there are any legacy consoles
>> registered, they will be attempting to directly print from the
>> printk-caller context, which may jeopardize the reliability of
>> the atomic consoles. Optimally there should be no legacy
>> consoles registered.
>
> ...
>
>> + if (atomic_read(&oops_cpu) == smp_processor_id()) {
>> + oops_nesting--;
>> + if (oops_nesting == 0) {
>> + atomic_set(&oops_cpu, -1);
>
> Between read and set the variable can change, can't it?

CPU migration is disabled. @oops_cpu contains the CPU ID of the only CPU
that is printing the oops. (Perhaps the variable should be called
"oops_printing_cpu"?)

If this matches smp_processor_id(), then the current CPU is the only one
that is allowed to change it back to -1. So no, if the first condition
is true, it cannot change before atomic_set(). And if the second
condition is true, this is the only CPU+context that is allowed to
change it back to -1.

> If not, why this variable is atomic then? Or, why it's not a problem?
> If the latter is the case, perhaps a comment to explain this?

If not atomic, it would be a data race, since one CPU might be
changing @oops_cpu while another is reading it. For type "int" such a
data race would be fine because it doesn't matter which side of the
race the reader is on: neither value will match the current CPU ID.

The reason that I didn't implement it using cmpxchg(),
data_race(READ_ONCE()), and WRITE_ONCE() is because I once learned that
you should never mix cmpxchg() with READ_ONCE()/WRITE_ONCE() because
there are architectures that do not support cmpxchg() as an atomic
instruction. The answer was always: "use atomic_t instead... that is
what it is for".

But AFAICT for this case it would be fine because obviously cmpxchg()
will not race with itself. And successfully reading a matching CPU ID
means there cannot be any cmpxchg() in progress. And writing only occurs
after seeing a matching CPU ID.

So I can change it from atomic_t to int. Although I do feel like that
might require explanation about why the data race is safe.
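
A sketch of the plain-int variant described above (not the posted
patch):

   static int oops_cpu = -1;

   /* oops_enter(): cmpxchg() cannot race with itself. */
   if (cmpxchg(&oops_cpu, -1, cur_cpu) == -1) {
           /* this CPU is the first in oops */
   }

   /* oops_exit(): */
   /*
    * A plain racy read is fine: a reader on either side of a
    * concurrent update sees a value that does not match its own
    * CPU ID.
    */
   if (data_race(READ_ONCE(oops_cpu)) == smp_processor_id()) {
           oops_nesting--;
           if (oops_nesting == 0)
                   WRITE_ONCE(oops_cpu, -1);
   }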

Or perhaps it is enough just to have something like this:

/**
* oops_printing_cpu - The ID of the CPU responsible for printing the
* OOPS message(s) to the consoles.
*
* This is atomic_t because multiple CPUs can read this variable
* simultaneously when exiting OOPS while another CPU can be
* modifying this variable to begin or end its printing duties.
*/
static atomic_t oops_printing_cpu = ATOMIC_INIT(-1);

John Ogness

2023-09-20 16:20:45

by John Ogness

Subject: Re: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

On 2023-09-20, John Ogness <[email protected]> wrote:
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 86ed71ba8c4d..e2879098645d 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -630,6 +634,36 @@ bool oops_may_print(void)
> */
> void oops_enter(void)
> {
> + enum nbcon_prio prev_prio;
> + int cpu = -1;

oops_exit() calls put_cpu(). Here used to be the corresponding
get_cpu(). Somehow I managed to drop it. Here are the inline fixups.

int cur_cpu = get_cpu();
int old_cpu = -1;

> +
> + /*
> + * If this turns out to be the first CPU in oops, this is the
> + * beginning of the outermost atomic section. Otherwise it is
> + * the beginning of an inner atomic section.
> + */
> + prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
> +
> + if (atomic_try_cmpxchg_relaxed(&oops_cpu, &cpu, smp_processor_id())) {

if (atomic_try_cmpxchg_relaxed(&oops_cpu, &old_cpu, cur_cpu)) {

> + /*
> + * This is the first CPU in oops. Save the outermost
> + * @prev_prio in order to restore it on the outermost
> + * matching oops_exit(), when @oops_nesting == 0.
> + */
> + oops_prev_prio = prev_prio;
> +
> + /*
> + * Enter an inner atomic section that ends at the end of this
> + * function. In this case, the nbcon_atomic_enter() above
> + * began the outermost atomic section.
> + */
> + prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
> + }
> +
> + /* Track nesting when this CPU is the owner. */
> + if (cpu == -1 || cpu == smp_processor_id())

if (old_cpu == -1 || old_cpu == cur_cpu)

> + oops_nesting++;
> +
> tracing_off();
> /* can't trust the integrity of the kernel anymore: */
> debug_locks_off();
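
For readability, this is roughly how the start of oops_enter() looks
with the fixups folded in:

   void oops_enter(void)
   {
           enum nbcon_prio prev_prio;
           int cur_cpu = get_cpu();
           int old_cpu = -1;

           prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);

           if (atomic_try_cmpxchg_relaxed(&oops_cpu, &old_cpu, cur_cpu)) {
                   oops_prev_prio = prev_prio;
                   prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
           }

           /* Track nesting when this CPU is the owner. */
           if (old_cpu == -1 || old_cpu == cur_cpu)
                   oops_nesting++;

           tracing_off();
           /* ... remainder unchanged ... */
   }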

2023-09-20 17:34:49

by Andy Shevchenko

Subject: Re: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

On Wed, Sep 20, 2023 at 01:14:54AM +0206, John Ogness wrote:
> Invoke the atomic write enforcement functions for oops to
> ensure that the information gets out to the consoles.
>
> Since there is no single general function that calls both
> oops_enter() and oops_exit(), the nesting feature of atomic
> write sections is taken advantage of in order to guarantee
> full coverage between the first oops_enter() and the last
> oops_exit().
>
> It is important to note that if there are any legacy consoles
> registered, they will be attempting to directly print from the
> printk-caller context, which may jeopardize the reliability of
> the atomic consoles. Optimally there should be no legacy
> consoles registered.

...

> + if (atomic_read(&oops_cpu) == smp_processor_id()) {
> + oops_nesting--;
> + if (oops_nesting == 0) {
> + atomic_set(&oops_cpu, -1);

Between read and set the variable can change, can't it?
If not, why this variable is atomic then? Or, why it's not a problem?
If the latter is the case, perhaps a comment to explain this?

> + /* Exit outmost atomic section. */
> + nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, oops_prev_prio);
> + }
> + }
> + put_cpu();

--
With Best Regards,
Andy Shevchenko


2023-09-21 04:43:23

by Andy Shevchenko

Subject: Re: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

On Wed, Sep 20, 2023 at 04:26:12PM +0206, John Ogness wrote:
> On 2023-09-20, Andy Shevchenko <[email protected]> wrote:
> > On Wed, Sep 20, 2023 at 01:14:54AM +0206, John Ogness wrote:

...

> >> + if (atomic_read(&oops_cpu) == smp_processor_id()) {
> >> + oops_nesting--;
> >> + if (oops_nesting == 0) {
> >> + atomic_set(&oops_cpu, -1);
> >
> > Between read and set the variable can change, can't it?
>
> CPU migration is disabled. @oops_cpu contains the CPU ID of the only CPU
> that is printing the oops. (Perhaps the variable should be called
> "oops_printing_cpu"?)
>
> If this matches smp_processor_id(), then the current CPU is the only one
> that is allowed to change it back to -1. So no, if the first condition
> is true, it cannot change before atomic_set(). And if the second
> condition is true, this is the only CPU+context that is allowed to
> change it back to -1;
>
> > If not, why this variable is atomic then? Or, why it's not a problem?
> > If the latter is the case, perhaps a comment to explain this?
>
> If not atomic, it will be a data race since one CPU might be changing
> @oops_cpu and another is reading it. For type "int" such a data race
> would be fine because it doesn't matter which side of the race the
> reader was on, both values will not match the current CPU ID.
>
> The reason that I didn't implement it using cmpxchg(),
> data_race(READ_ONCE()), and WRITE_ONCE() is because I once learned that
> you should never mix cmpxchg() with READ_ONCE()/WRITE_ONCE() because
> there are architectures that do not support cmpxchg() as an atomic
> instruction. The answer was always: "use atomic_t instead... that is
> what it is for".
>
> But AFAICT for this case it would be fine because obviously cmpxchg()
> will not race with itself. And successfully reading a matching CPU ID
> means there cannot be any cmpxchg() in progress. And writing only occurs
> after seeing a matching CPU ID.
>
> So I can change it from atomic_t to int. Although I do feel like that
> might require explanation about why the data race is safe.

Either way a comment is needed, but I think the usage of atomic above
is a bit confusing as you see I immediately rose the concern.

> Or perhaps it is enough just to have something like this:
>
> /**
> * oops_printing_cpu - The ID of the CPU responsible for printing the
> * OOPS message(s) to the consoles.
> *
> * This is atomic_t because multiple CPUs can read this variable
> * simultaneously when exiting OOPS while another CPU can be
> * modifying this variable to begin or end its printing duties.
> */
> static atomic_t oops_printing_cpu = ATOMIC_INIT(-1);

--
With Best Regards,
Andy Shevchenko


2023-09-22 09:43:57

by Petr Mladek

Subject: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On Wed 2023-09-20 01:14:49, John Ogness wrote:
> From: Thomas Gleixner <[email protected]>
>
> WARN/OOPS/PANIC require printing out immediately since the
> regular printing method (and in the future, the printing
> threads) might not be able to run.
>
> Add per-CPU state to denote the priority/urgency of the output
> and provide functions to mark the beginning and end of sections
> where the urgent messages are generated.
>
> Note that when a CPU is in a priority elevated state, flushing
> only occurs when dropping back to a lower priority. This allows
> the full set of printk records (WARN/OOPS/PANIC output) to be
> stored in the ringbuffer before beginning to flush the backlog.

The above paragraph is a bit confusing. The code added by this patch
does not do any flushing. I guess that this last paragraph is supposed
to explain why the "nesting" array is needed. I would write
something like:

"The state also counts nesting of printing contexts per-priority.
It will be later used to prevent flushing in nested contexts."

That said, I am not sure if the special handling of nested contexts
is needed. But let's discuss it in the patch introducing the flush
functions.

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -961,6 +961,95 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
> return nbcon_context_exit_unsafe(ctxt);
> }
>
> +/**
> + * struct nbcon_cpu_state - Per CPU printk context state
> + * @prio: The current context priority level
> + * @nesting: Per priority nest counter
> + */
> +struct nbcon_cpu_state {
> + enum nbcon_prio prio;
> + int nesting[NBCON_PRIO_MAX];
> +};
> +
> +static DEFINE_PER_CPU(struct nbcon_cpu_state, nbcon_pcpu_state);
> +static struct nbcon_cpu_state early_nbcon_pcpu_state __initdata;
> +
> +/**
> + * nbcon_get_cpu_state - Get the per CPU console state pointer
> + *
> + * Returns either a pointer to the per CPU state of the current CPU or to
> + * the init data state during early boot.
> + */
> +static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
> +{
> + if (!printk_percpu_data_ready())
> + return &early_nbcon_pcpu_state;

My first thought was that this was racy. I was afraid that
printk_percpu_data_ready() could change its value inside the
atomic_enter()/exit() area. But that actually could not happen.
Anyway, it might be worth a comment. Something like:

/*
* The value of __printk_percpu_data_ready is modified in normal
* context. As a result it could never change inside a nbcon
* atomic context.
*/
if (!printk_percpu_data_ready())
return &early_nbcon_pcpu_state;

> +
> + return this_cpu_ptr(&nbcon_pcpu_state);
> +}
> +
> +/**
> + * nbcon_atomic_exit - Exit a context that enforces atomic printing
> + * @prio: Priority of the context to leave
> + * @prev_prio: Priority of the previous context for restore
> + *
> + * Context: Any context. Enables migration.
> + *
> + * @prev_prio is the priority returned by the corresponding
> + * nbcon_atomic_enter().
> + */
> +void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
> +{
> + struct nbcon_cpu_state *cpu_state;
> +
> + cpu_state = nbcon_get_cpu_state();

I would add a consistency check:

WARN_ON_ONCE(cpu_state->nesting[cpu_state->prio] <= 0)

> + /*
> + * Undo the nesting of nbcon_atomic_enter() at the CPU state
> + * priority.
> + */
> + cpu_state->nesting[cpu_state->prio]--;
> +
> + /*
> + * Restore the previous priority, which was returned by
> + * nbcon_atomic_enter().
> + */
> + cpu_state->prio = prev_prio;
> +
> + migrate_enable();
> +}

Best Regards,
Petr

2023-09-22 18:23:47

by Petr Mladek

Subject: Re: [PATCH printk v2 05/11] printk: nbcon: Provide function for atomic flushing

On Wed 2023-09-20 01:14:50, John Ogness wrote:
> From: Thomas Gleixner <[email protected]>
>
> Provide nbcon_atomic_flush() to perform atomic write flushing
> of all registered nbcon consoles. Like with legacy consoles,
> the nbcon consoles are flushed one record per console. This
> allows all nbcon consoles to generate pseudo-simultaneously,
> rather than one console waiting for the full ringbuffer to
> dump to another console before printing anything.
>
> Note that if the current CPU is in a nested elevated priority
> state (EMERGENCY/PANIC), nbcon_atomic_flush() does nothing.

This confused me a bit. It was not clear to me if it was
"nested and elevated" or "the elevated priority was nested".
Well, it is probably because I am not a native speaker.

I would describe it another way, see below.

> This is in case the printing itself generates urgent messages
> (OOPS/WARN/PANIC), that those messages are fully stored into
> the ringbuffer before any printing resumes.

This feels like it was an advantage. But I would say that it is
a limitation. IMHO, it simply works this way and we should describe
it as a limitation. See below.

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -988,6 +986,105 @@ static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
> return this_cpu_ptr(&nbcon_pcpu_state);
> }
>
> +/**
> + * nbcon_atomic_emit_one - Print one record for a console in atomic mode
> + * @wctxt: An initialized write context struct to use
> + * for this context
> + *
> + * Returns false if the given console could not print a record or there are
> + * no more records to print, otherwise true.
> + *
> + * This is an internal helper to handle the locking of the console before
> + * calling nbcon_emit_next_record().
> + */
> +static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
> +{
> + struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
> +
> + if (!nbcon_context_try_acquire(ctxt))
> + return false;
> +
> + /*
> + * nbcon_emit_next_record() returns false when the console was
> + * handed over or taken over. In both cases the context is no
> + * longer valid.
> + */
> + if (!nbcon_emit_next_record(wctxt))
> + return false;
> +
> + nbcon_context_release(ctxt);
> +
> + return prb_read_valid(prb, ctxt->seq, NULL);

IMHO, it should be enough to check ctxt->backlog. I mean to do:

return !!ctxt->backlog;

We are here only when nbcon_emit_next_record() owned the context and
was able to call printk_get_next_message(). nbcon_emit_next_record()
and nbcon_atomic_emit_one() would work consistently especially
when prb_read_valid() is not called under the console lock here.


> +}
> +
> +/**
> + * __nbcon_atomic_flush_all - Flush all nbcon consoles in atomic mode
> + * @allow_unsafe_takeover: True, to allow unsafe hostile takeovers
> + */
> +static void __nbcon_atomic_flush_all(bool allow_unsafe_takeover)
> +{
> + struct nbcon_write_context wctxt = { };
> + struct nbcon_context *ctxt = &ACCESS_PRIVATE(&wctxt, ctxt);
> + struct nbcon_cpu_state *cpu_state;
> + struct console *con;
> + bool any_progress;
> + int cookie;
> +
> + cpu_state = nbcon_get_cpu_state();
> +
> + /*
> + * Let the outermost flush of this priority print. This avoids
> + * nasty hackery for nested WARN() where the printing itself
> + * generates one and ensures such nested messages are stored to
> + * the ringbuffer before any printing resumes.

It is not clear to me what hackery was meant. The fact is that
only printk_once() or WARN_ONCE() should be used in the console
emit/flush code paths. Any non-once printk might block consoles
and even these nesting checks probably would not help much.

Anyway, I believe that we do not need this nesting counter.
The nesting is already prevented by nbcon_context_try_acquire().
It would not allow to take the nested lock with the same priority.

I guess that this extra nesting counter made some sense only in the earlier
variants of the per-cpu console lock.

I would personally just describe the behavior in the commit message
and in the comment above __nbcon_atomic_flush_all():

* The messages are flushed only when this context is able to
* get the per-console lock. Namely, it works only when the
* lock is free or when this context has a higher priority
* than the current owner.

> + *
> + * cpu_state->prio <= NBCON_PRIO_NORMAL is not subject to nesting
> + * and can proceed in order to allow any atomic printing for
> + * regular kernel messages.
> + */
> + if (cpu_state->prio > NBCON_PRIO_NORMAL &&
> + cpu_state->nesting[cpu_state->prio] != 1)
> + return;
> +
> + do {
> + any_progress = false;
> +
> + cookie = console_srcu_read_lock();
> + for_each_console_srcu(con) {
> + short flags = console_srcu_read_flags(con);
> + bool progress;
> +
> + if (!(flags & CON_NBCON))
> + continue;
> +
> + if (!console_is_usable(con, flags))
> + continue;
> +
> + memset(ctxt, 0, sizeof(*ctxt));
> + ctxt->console = con;
> + ctxt->spinwait_max_us = 2000;
> + ctxt->prio = cpu_state->prio;
> + ctxt->allow_unsafe_takeover = allow_unsafe_takeover;
> +
> + progress = nbcon_atomic_emit_one(&wctxt);
> + if (!progress)
> + continue;
> + any_progress = true;
> + }
> + console_srcu_read_unlock(cookie);
> + } while (any_progress);
> +}
> +
> +/**
> + * nbcon_atomic_flush_all - Flush all nbcon consoles in atomic mode
> + *
> + * Context: Any context where migration is disabled.

We should make it more clear what migration is meant here. For
example:

* Context: Any context which could not be migrated to another CPU.

> + */
> +void nbcon_atomic_flush_all(void)
> +{
> + __nbcon_atomic_flush_all(false);
> +}
> +

Best Regards,
Petr

2023-09-23 03:31:08

by Petr Mladek

Subject: Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

On Wed 2023-09-20 01:14:51, John Ogness wrote:
> From: Thomas Gleixner <[email protected]>
>
> Perform nbcon console atomic flushing at key call sites:
>
> nbcon_atomic_exit() - When exiting from the outermost atomic
> printing section.

IMHO, it would not help because it all depends on
whether nbcon_context_try_acquire() succeeds or not, see below.

> If the priority was NBCON_PRIO_PANIC,
> attempt a second flush allowing unsafe hostile
> takeovers.

I was first scared that this would be called by any printk()
in panic(). But it seems that nbcon_atomic_exit() is called
only once after disabling printk(). It might deserve a comment
where it is supposed to be used. See a proposal below.


> console_flush_on_panic() - Called from several call sites to
> trigger ringbuffer dumping in an urgent situation.
>
> console_flush_on_panic() - Typically the panic() function will

This is a second description of console_flush_on_panic(), which
looks like a mistake.

> take care of atomic flushing the nbcon consoles on
> panic. However, there are several users of
> console_flush_on_panic() outside of panic().

The generic panic() seems to use console_flush_on_panic() correctly
at the very end.

Hmm, I see that console_flush_on_panic() is called also in
arch/loongarch/kernel/reset.c and arch/powerpc/kernel/traps.c.

The loongarch code uses it in halt(). IMHO, it would be
better to use the normal flush. But maybe they accept the risk.
I know nothing about this architecture and who uses it.

The function defined on powerpc is then used in:

+ arch/powerpc/platforms/powernv/opal.c:

IMHO, it should be replaced there by normal flush. It seems
to call the generic panic() later.

+ arch/powerpc/platforms/ps3/setup.c:

This seems to be used as a panic notifier ppc_panic_platform_handler().
I am not completely sure but it does not look like the final flush.

+ arch/powerpc/platforms/pseries/setup.c:

Same as ps3/setup.c. Also "just" a panic notifier.

Anyway, we should make clear that console_flush_on_panic() might break
the system and should be called as the last attempt to flush consoles.
The above arch-specific users are worth review.

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -1092,6 +1092,11 @@ void nbcon_atomic_flush_all(void)
> * Return: The previous priority that needs to be fed into
> * the corresponding nbcon_atomic_exit()
> * Context: Any context. Disables migration.
> + *
> + * When within an atomic printing section, no atomic printing occurs. This
> + * is to allow all emergency messages to be dumped into the ringbuffer before
> + * flushing the ringbuffer.

The comment sounds like it is an advantage. But I think that it would be
a disadvantage.

Fortunately, this is not true. Even the atomic context tries
to flush the messages immediately when it is able to get
the per-cpu lock. It happes via console_flush_all() called
from console_unlock().

> + * The actual atomic printing occurs when exiting
> + * the outermost atomic printing section.
> */
> enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio)
> {
> @@ -1130,8 +1135,13 @@ void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
> {
> struct nbcon_cpu_state *cpu_state;
>
> + __nbcon_atomic_flush_all(false);

IMHO, this would not help. Either this context was able to acquire the
lock and flush each message directly. Or it would fail to get the lock
even here.

> +
> cpu_state = nbcon_get_cpu_state();
>
> + if (cpu_state->prio == NBCON_PRIO_PANIC)
> + __nbcon_atomic_flush_all(true);

It would deserve a comment that nbcon_atomic_exit() is used
in panic() to calm down the consoles completely, similar effect
as setting the variable "suppress_printk".

> +
> /*
> * Undo the nesting of nbcon_atomic_enter() at the CPU state
> * priority.
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 6ef33cefa4d0..419c0fabc481 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3159,6 +3159,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
> console_srcu_read_unlock(cookie);
> }
>
> + nbcon_atomic_flush_all();
> +
> console_flush_all(false, &next_seq, &handover);

It seems that console_flush_all() tries to flush nbcon consoles
as well. And nbcon_atomic_flush_all() is explicitly called
just above this flush_on_panic(). This is from panic()
after this patchset is applied.


void panic(const char *fmt, ...)
{
[...]
nbcon_atomic_flush_all();

console_unblank();

/*
* We may have ended up stopping the CPU holding the lock (in
* smp_send_stop()) while still having some valuable data in the console
* buffer. Try to acquire the lock then release it regardless of the
* result. The release will also print the buffers out. Locks debug
* should be disabled to avoid reporting bad unlock balance when
* panic() is not being callled from OOPS.
*/
debug_locks_off();
console_flush_on_panic(CONSOLE_FLUSH_PENDING);


In other words, we try to flush the nbcon consoles 3 times almost
immediately after each other. I believe that there is a motivation
to do so. Anyway, I want to make sure that it was on purpose.

It would deserve some comment about what the purpose is. Otherwise,
people would tend to remove the "redundant" calls.

> }
>
> @@ -3903,6 +3905,10 @@ void defer_console_output(void)
>
> void printk_trigger_flush(void)
> {
> + migrate_disable();
> + nbcon_atomic_flush_all();
> + migrate_enable();

Can this be actually called in NMI?

> +
> defer_console_output();
> }

This function would deserve some description because it looks strange.
There are three names which contradict each other:

+ trigger_flush: it sounds like it tells someone to do the flush

+ nbcon_atomic_flush_all: does the flush immediately

+ defer_console_output: sounds like it prevents the current context
from flushing the consoles immediately

We should either choose better names and/or add comments:

/**
* console_flush_or_trigger() - Make sure that the consoles will get flushed.
*
* Try to flush consoles when possible or queue flushing consoles like
* in the deferred printk.
*
* Context: Can be used in any context (including NMI?)
*/
void printk_flush_or_trigger(void)
{
/*
* Try to flush consoles which do not depend on console_lock()
* and support safe atomic_write()
*/
migrate_disable();
nbcon_atomic_flush_all();
migrate_enable();

/* Try flushing legacy consoles or queue the deferred handling. */
if (!in_nmi() && console_trylock())
console_unlock();
else
defer_console_output();
}


All in all. I feel a bit overwhelmed by the many *flush*() functions at
the moment. Some flush only nbcon consoles. Some flush all. Some
use only the safe takeover/handover and some also allow a
harsh takeover. Some ignore port->lock because of bust_spinlocks().
Some are ignoring console_lock. They are called on different
locations. The nbcon variants are called explicitly and also
inside the generic *flush*() functions.

It is partly because it is Friday evening. Anyway, I need to think
more about it.

Best Regards,
Petr

2023-09-25 14:31:27

by John Ogness

Subject: Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

On 2023-09-22, Petr Mladek <[email protected]> wrote:
>> Perform nbcon console atomic flushing at key call sites:
>>
>> nbcon_atomic_exit() - When exiting from the outermost atomic
>> printing section.
>
> IMHO, it would not help because it all depends
> whether nbcon_context_try_acquire() succeeds or now, see below.

Sure, but it first attempts a _safe_ flush on all nbcon consoles that
allow it, _then_ it falls back to an _unsafe_ flush (which cannot fail,
but might explode).

>> If the priority was NBCON_PRIO_PANIC,
>> attempt a second flush allowing unsafe hostile
>> takeovers.
>
> I was first scared that this would be called by any printk()
> in panic(). But it seems that nbcon_atomic_exit() is called
> only one after disabling printk(). It might deserve a comment
> where it is supposed to be used. See a proposal below.

OK.

>> console_flush_on_panic() - Called from several call sites to
>> trigger ringbuffer dumping in an urgent situation.
>>
>> console_flush_on_panic() - Typically the panic() function will
>
> This is a second description of console_flush_of_panic() which
> looks like a mistake.

Oops. The first one should not be there.

>> take care of atomic flushing the nbcon consoles on
>> panic. However, there are several users of
>> console_flush_on_panic() outside of panic().
>
> The generic panic() seems to use console_flush_on_panic() correctly
> at the very end.
>
> Hmm, I see that console_flush_on_panic() is called also in

[...]

> Anyway, we should make clear that console_flush_on_panic() might break
> the system and should be called as the last attempt to flush consoles.
> The above arch-specific users are worth review.

In an upcoming series you will see that console_flush_on_panic() only
takes the console lock if there are legacy consoles. Ideally, eventually
there will only be nbcon consoles, so your concern would disappear.

And if those users continue to use legacy consoles, then the risks will
be the same as now.

>> * Return: The previous priority that needs to be fed into
>> * the corresponding nbcon_atomic_exit()
>> * Context: Any context. Disables migration.
>> + *
>> + * When within an atomic printing section, no atomic printing occurs. This
>> + * is to allow all emergency messages to be dumped into the ringbuffer before
>> + * flushing the ringbuffer.
>
> The comment sounds like it is an advantage. But I think that it would be
> a disadvantage.

Please explain. At LPC2022 we agreed it was an advantage and it is
implemented on purpose using the atomic printing sections.

> Fortunately, this is not true. Even the atomic context tries
> to flush the messages immediately when it is able to get
> the per-cpu lock. It happes via console_flush_all() called
> from console_unlock().

You are talking about legacy consoles.

The goal of the nbcon consoles is to introduce a _new_ console type to
support controlled decisions for emergency printing. Legacy consoles
will continue to work (or not work) as before.

>> @@ -1130,8 +1135,13 @@ void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
>> {
>> struct nbcon_cpu_state *cpu_state;
>>
>> + __nbcon_atomic_flush_all(false);
>
> IMHO, this would not help. Either this context was able to acquire the
> lock and flush each message directly.

What do you mean by "help"? We are exiting an atomic printing
section. This is where the accumulated emergency messages are to be
printed. If this is a nested atomic printing section, it does nothing
because the outermost atomic printing section will flush.

> Or it would fail to get the lock even here.

If it fails to get the lock, we are talking about the worst case
scenario. That scenario demands unsafe printing, which is only allowed
in panic. If we are not in panic, it is assumed the current owner will
take care of it.

>> +
>> cpu_state = nbcon_get_cpu_state();
>>
>> + if (cpu_state->prio == NBCON_PRIO_PANIC)
>> + __nbcon_atomic_flush_all(true);
>
> It would deserve a comment that nbcon_atomic_exit() is used
> in panic() to calm down the consoles completely, similar effect
> as setting the variable "suppress_printk".

Actually, it is nbcon_atomic_enter() that does the calming down, but
only for the emergency CPU. Other CPUs are allowed to print as wildly as
they want. And all CPUs are allowed to continue to fill the ringbuffer.

@suppress_printk is quite different as it globally blocks any new
messages from being stored into the ringbuffer. This is not calming down
the consoles, but rather putting a finite limit on what they produce.
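
(For reference, @suppress_printk is checked at store time. Roughly,
vprintk_emit() bails out before anything reaches the ringbuffer:

   if (unlikely(suppress_printk))
           return 0;

so it limits message production globally rather than calming any
particular console.)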

>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index 6ef33cefa4d0..419c0fabc481 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -3159,6 +3159,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
>> console_srcu_read_unlock(cookie);
>> }
>>
>> + nbcon_atomic_flush_all();
>> +
>> console_flush_all(false, &next_seq, &handover);
>
> It seems that console_flush_all() tries to flush nbcon conosoles
> as well. And nbcon_atomic_flush_all() is explicitely called
> close above this flush_on_panic(). This is from panic()
> after this patchset is applied.
>
>
> void panic(const char *fmt, ...)
> {
> [...]
> nbcon_atomic_flush_all();
>
> console_unblank();
>
> /*
> * We may have ended up stopping the CPU holding the lock (in
> * smp_send_stop()) while still having some valuable data in the console
> * buffer. Try to acquire the lock then release it regardless of the
> * result. The release will also print the buffers out. Locks debug
> * should be disabled to avoid reporting bad unlock balance when
> * panic() is not being callled from OOPS.
> */
> debug_locks_off();
> console_flush_on_panic(CONSOLE_FLUSH_PENDING);
>
>
> By other words, we try to flush nbcon consoles 3 times almost
> immediately after each other.

"Almost immediately" != "immediately". Keep in mind that this is in
atomic printing context. Generally speaking, no messages will be seen on
the consoles yet. Any time there is a significant piece (such as
console_unblank()) it would probably be wise to flush.

We will probably want even more flush points. But I wanted to at least
start with the risky points (such as console_unblank() when legacy
consoles still exist).

> It would deserve some comment what is the purpose. Otherwise, people
> would tend to remove the "redundant" calls.

OK.

>> @@ -3903,6 +3905,10 @@ void defer_console_output(void)
>>
>> void printk_trigger_flush(void)
>> {
>> + migrate_disable();
>> + nbcon_atomic_flush_all();
>> + migrate_enable();
>
> Can this be actually called in NMI?

No. I need to add in_nmi() checks.

>> defer_console_output();
>> }
>
> This function would deserve some description because it looks strange.
> There are three names which are contradicting each other:
>
> + trigger_flush: it sounds like it tells someone to do the flush
>
> + nbcon_atomic_flush_all: does the flush immediately
>
> + defer_console_output: sounds like it prevents the current context
> to flush the consoles immediately

Agreed.

> We should either choose better names and/or add comments:
>
> /**
> * console_flush_or_trigger() - Make sure that the consoles will get flushed.
> *
> * Try to flush consoles when possible or queue flushing consoles like
> * in the deferred printk.
> *
> * Context: Can be used in any context (including NMI?)
> */
> void printk_flush_or_trigger(void)
> {
> /*
> * Try to flush consoles which do not depend on console_lock()
> * and support safe atomic_write()
> */
if (!in_nmi()) migrate_disable();
> nbcon_atomic_flush_all();
if (!in_nmi()) migrate_enable();
>
> /* Try flushing legacy consoles or queue the deferred handling. */
> if (!in_nmi() && console_trylock())
> console_unlock();
> else
> defer_console_output();
> }

I would be OK with this name and (fixed up for NMI) implementation.

> All in all. I feel a bit overwhelmed by the many *flush*() functions at
> the moment. Some flush only nbcon consoles. Some flush all. Some
> are using only the safe takeover/handover and some allow also
> harsh takeover. Some ignore port->lock because of bust_spinlock().
> Some are ignoring console_lock. They are called on different
> locations. The nbcon variants are called explicitly and also
> inside the generic *flush*() functions.

Agreed. This madness exists in part because we are continuing to support
legacy consoles. I tried to keep the two separate as much as
possible. But when it comes to flushing, there will be some overlap.

When working on the code, I tend to either look at the legacy consoles
_or_ the nbcon consoles. If you try to visualize the whole picture on a
system with legacy and nbcon consoles, it is a bit overwhelming. I
recommend focussing on them separately. I.e. when talking about nbcon
consoles, there is no reason to be concerned about the state of the
console lock or the port lock. When talking about legacy consoles, there
is no reason to be concerned about CPU atomic printing states.

John

2023-09-25 15:28:24

by John Ogness

Subject: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On 2023-09-22, Petr Mladek <[email protected]> wrote:
>> Note that when a CPU is in a priority elevated state, flushing
>> only occurs when dropping back to a lower priority. This allows
>> the full set of printk records (WARN/OOPS/PANIC output) to be
>> stored in the ringbuffer before beginning to flush the backlog.
>
> The above paragraph is a bit confusing. The code added by this patch
> does not do any flushing.

You are right. I should put this patch after patch 5 "printk: nbcon:
Provide function for atomic flushing" to simplify the introduction.

> I guess that this last paragraph is supposed to explain why the
> "nesting" array is needed.

No, it is explaining how this feature works in general. The term
"priority elevated state" means the CPU is in an atomic write section.

The "nesting" array is needed in order to support a feature that is not
explained in the commit message: If nested OOPS/WARN/PANIC occur, only
the outermost OOPS/WARN/PANIC will do the flushing. I will add this
information to the commit message.
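
To trace the per-CPU state through a hypothetical nested case on one
CPU (with the flush-on-exit behavior from patch 6):

   p0 = nbcon_atomic_enter(NBCON_PRIO_PANIC);
           /* prio = PANIC, nesting[PANIC] = 1 */

   /* the panic printing itself triggers a WARN() */
   p1 = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
           /* prio stays PANIC (EMERGENCY < PANIC), nesting[PANIC] = 2 */

   nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, p1);
           /* nesting[PANIC] is 2 at the flush check: not the
              outermost exit, so no flush; then nesting[PANIC] = 1 */

   nbcon_atomic_exit(NBCON_PRIO_PANIC, p0);
           /* nesting[PANIC] is 1: outermost exit, flush happens */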

>> +static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
>> +{
>> + if (!printk_percpu_data_ready())
>> + return &early_nbcon_pcpu_state;
>
> it might worth a comment. Something like:
>
> /*
> * The value of __printk_percpu_data_ready is modified in normal
> * context. As a result it could never change inside a nbcon
> * atomic context.
> */
> if (!printk_percpu_data_ready())
> return &early_nbcon_pcpu_state;

OK.

>> +void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
>> +{
>> + struct nbcon_cpu_state *cpu_state;
>> +
>> + cpu_state = nbcon_get_cpu_state();
>
> I would add a consistency check:
>
> WARN_ON_ONCE(cpu_state->nesting[cpu_state->prio] <= 0)

OK.

John

2023-09-25 17:24:07

by John Ogness

Subject: Re: [PATCH printk v2 05/11] printk: nbcon: Provide function for atomic flushing

On 2023-09-22, Petr Mladek <[email protected]> wrote:
>> Note that if the current CPU is in a nested elevated priority
>> state (EMERGENCY/PANIC), nbcon_atomic_flush() does nothing.
>
> This confused me a bit. It was not clear to me if it was
> "nested and elevated" or "the elevated priority was nested".

Elevated priority within an elevated priority. Or put another way: an
atomic printing section within an atomic printing section. Maybe the
"elevated priority" terminology is confusing. I can just use "atomic
printing section" instead if that helps.

>> This is in case the printing itself generates urgent messages
>> (OOPS/WARN/PANIC), so that those messages are fully stored into
>> the ringbuffer before any printing resumes.
>
> This feels like it was an advantage. But I would say that it is
> a limitation. IMHO, it simply works this way and we should describe
> it as a limitation.

The "atomic printing section" feature was the result of designing this
advantage. It "simply works this way" because that is how it was
designed.

Actually, this is explaining the nesting variable that you asked about
in the previous patch commit message. When I reverse the patch order,
this paragraph will be moved into that patch commit message.

>> +/**
>> + * nbcon_atomic_emit_one - Print one record for a console in atomic mode
>> + * @wctxt: An initialized write context struct to use
>> + * for this context
>> + *
>> + * Returns false if the given console could not print a record or there are
>> + * no more records to print, otherwise true.
>> + *
>> + * This is an internal helper to handle the locking of the console before
>> + * calling nbcon_emit_next_record().
>> + */
>> +static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
>> +{
>> +	struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
>> +
>> +	if (!nbcon_context_try_acquire(ctxt))
>> +		return false;
>> +
>> +	/*
>> +	 * nbcon_emit_next_record() returns false when the console was
>> +	 * handed over or taken over. In both cases the context is no
>> +	 * longer valid.
>> +	 */
>> +	if (!nbcon_emit_next_record(wctxt))
>> +		return false;
>> +
>> +	nbcon_context_release(ctxt);
>> +
>> +	return prb_read_valid(prb, ctxt->seq, NULL);
>
> IMHO, it should be enough to check ctxt->backlog. I mean to do:
>
> return !!ctxt->backlog;
>
> We are here only when nbcon_emit_next_record() owned the context and
> was able to call printk_get_next_message().

Yes, but ctxt->backlog is set before the printing begins. If any nested
atomic printing occurs (i.e. just adding records to the ringbuffer),
these also need to be atomically printed.

For example, console_unlock() deals with that situation with:

	/*
	 * Some context may have added new records after
	 * console_flush_all() but before unlocking the console.
	 * Re-check if there is a new record to flush. If the trylock
	 * fails, another context is already handling the printing.
	 */
} while (prb_read_valid(prb, next_seq, NULL) && console_trylock());

The prb_read_valid() here corresponds to the prb_read_valid() in
console_unlock(). I can add a similar comment here for that.
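
For example (the comment wording is only an illustration of that
suggestion):

	/*
	 * Nested contexts may have stored new records while this
	 * context was printing. This corresponds to the
	 * prb_read_valid() check in console_unlock(), so re-check
	 * the ringbuffer rather than relying on ctxt->backlog.
	 */
	return prb_read_valid(prb, ctxt->seq, NULL);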

>> +static void __nbcon_atomic_flush_all(bool allow_unsafe_takeover)
>> +{
>> +	struct nbcon_write_context wctxt = { };
>> +	struct nbcon_context *ctxt = &ACCESS_PRIVATE(&wctxt, ctxt);
>> +	struct nbcon_cpu_state *cpu_state;
>> +	struct console *con;
>> +	bool any_progress;
>> +	int cookie;
>> +
>> +	cpu_state = nbcon_get_cpu_state();
>> +
>> +	/*
>> +	 * Let the outermost flush of this priority print. This avoids
>> +	 * nasty hackery for nested WARN() where the printing itself
>> +	 * generates one and ensures such nested messages are stored to
>> +	 * the ringbuffer before any printing resumes.
>
> It is not clear to me what hackery was meant.

Hackery = Trying to implement this feature without tracking CPU state
priorities.

> The fact is that only printk_once() or WARN_ONCE() should be used in
> the console emit/flush code paths. Any non-once printk might block
> consoles and even these nesting checks probably would not help much.

I am not sure what that has to do with it. This is a flush function,
which (for example) will be called when a warning is hit. We do _not_
want to flush the console if something more important (a panic) is
already in the process of being added to the ringbuffer.

> Anyway, I believe that we do not need this nesting counter.
> The nesting is already prevented by nbcon_context_try_acquire().
> It would not allow taking the nested lock with the same priority.

You are mixing 2 different things:

The acquire is related to ownership of a console.

The nesting is related to the urgency state of a CPU.

> I would personally just describe the behavior in the commit message
> and in the comment above __nbcon_atomic_flush_all():
>
> * The messages are flushed only when this context is able to
> * get the per-console lock. Namely, it works only when the
> * lock is free or when this context has a higher priority
> * than the current owner.

Your comment is stating the obvious. All messages are only written by a
context when that context can acquire ownership.

What the check here is doing is refusing to write messages even if it
_could_ acquire ownership. It isn't about console ownership. It is about
not _wanting_ to print in nested atomic printing sections.

>> +	if (cpu_state->prio > NBCON_PRIO_NORMAL &&
>> +	    cpu_state->nesting[cpu_state->prio] != 1)
>> +		return;
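
Annotated, the intent of that check is:

	/*
	 * This CPU is in an elevated priority state. Only the
	 * outermost atomic printing section of that priority may
	 * flush; nested sections only store their messages.
	 */
	if (cpu_state->prio > NBCON_PRIO_NORMAL &&
	    cpu_state->nesting[cpu_state->prio] != 1)
		return;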

[...]

>> +/**
>> + * nbcon_atomic_flush_all - Flush all nbcon consoles in atomic mode
>> + *
>> + * Context: Any context where migration is disabled.
>
> We should make it more clear what migration is meant here. For
> example:
>
> * Context: Any context which could not be migrated to another CPU.

OK.

John

2023-09-26 00:52:17

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On Mon 2023-09-25 11:31:54, John Ogness wrote:
> On 2023-09-22, Petr Mladek <[email protected]> wrote:
> >> Note that when a CPU is in a priority elevated state, flushing
> >> only occurs when dropping back to a lower priority. This allows
> >> the full set of printk records (WARN/OOPS/PANIC output) to be
> >> stored in the ringbuffer before beginning to flush the backlog.
> >
> > The above paragraph is a bit confusing. The code added by this patch
> > does not do any flushing.
>
> You are right. I should put this patch after patch 5 "printk: nbcon:
> Provide function for atomic flushing" to simplify the introduction.
>
> > I guess that this last paragraph is supposed to explain why the
> > "nesting" array is needed.
>
> No, it is explaining how this feature works in general. The term
> "priority elevated state" means the CPU is in an atomic write section.

This did not help me much at first. But I got it later after
connecting more dots ;-)

IMHO, the problem was that the commit message introduced the terms
in the following order:

+ WARN/OOPS/PANIC require printing out immediately

+ per-CPU state to denote the priority/urgency

+ functions to mark the beginning/end where the urgent messages
are generated

+ when CPU is in priority elevated state, flushing only occurs
when dropping back to a lower priority

From my POV:

+ It did not mention/explain "atomic write" at all

+ It said that the urgent messages required immediate printing.
And later, it said that they would get flushed later. These
two statements contradict each other.

+ The handling of priorities is not only about CPU nesting.
The same rules should also apply when another CPU is printing
messages in a higher priority context.

My proposal:

+ Use the term "higher priority context" for all WARN/OOPS/PANIC
messages.

+ Use "emergency priority context" or "nbcon emergency context"
when talking about NBCON_PRIO_EMERGENCY context to avoid
confusion with the printk log levels.

+ use "panic priority context or "nbcon panic context" for the panic
one if we really want to add enter/exit for the panic context.

+ We must somewhere explain the "atomic context" and the "atomic_write"
callback. One important question is why it is atomic. Is it because it

+ _can_ be called in atomic context?
+ _must_ be called in atomic context?

It is also called from console_unlock() for boot messages,
so it need not be called in atomic context.

What about renaming it to "nbcon_write" to avoid this confusion?


> The "nesting" array is needed in order to support a feature that is not
> explained in the commit message: If nested OOPS/WARN/PANIC occur, only
> the outermost OOPS/WARN/PANIC will do the flushing. I will add this
> information to the commit message.

What is the motivation for the feature, please?

1. Either we want to flush the messages in the higher priority context
ASAP.

The per-CPU lock has been designed exactly for this purpose. It
would not need any extra nesting counter. And it would work
consistently also when the lock is acquired on another CPU.

It is simple, the context will either get the per-console lock or
not. The (nested) context might get the lock only when it has higher
priority. The flush would be called directly from vprintk_emit().

I always thought that this was the planned behavior.

IMHO, we want it. A nested panic() should be able to takeover
the console and flush messages ASAP. It will never return.


2. Or we want to wait until all messages in the higher priority context
are stored into the ring buffer and have the caller flush them.

Who would actually flush the higher messages?
WARN() caller?
The timer callback which detected softlockup?
Or a completely different context?
Who would flush panic() messages when panic() interrupted
locked CPU?


My proposal:

There are only two higher priority contexts:

+ NBCON_PRIO_PANIC should be used when panic_cpu == raw_smp_processor_id()

+ NBCON_PRIO_EMERGENCY context would require some enter/exit wrappers
and tracking. But it does not necessarily need to be a per-CPU
variable.

I think about adding "int printk_state" into struct task_struct.
It might be useful also for other things, e.g. for storing the last
log level of non-finished message. Entering section with enforced
minimal loglevel or so.


Then we could have:

#define PRINTK_STATE_EMERGENCY_MASK 0x000000ff
#define PRINTK_STATE_EMERGENCY_OFFSET 0x00000001

void nbcon_emergency_enter(int *printk_state)
{
	*printk_state = current->printk_state;

	WARN_ON_ONCE((*printk_state & PRINTK_STATE_EMERGENCY_MASK) == PRINTK_STATE_EMERGENCY_MASK);

	current->printk_state = *printk_state + PRINTK_STATE_EMERGENCY_OFFSET;
}

void nbcon_emergency_exit(int printk_state)
{
	WARN_ON_ONCE(!(current->printk_state & PRINTK_STATE_EMERGENCY_MASK));

	current->printk_state = printk_state;
}

enum nbcon_prio nbcon_get_default_prio(void)
{
	if (panic_cpu == raw_smp_processor_id())
		return NBCON_PRIO_PANIC;

	/* Does current always exist? What about fork? */
	if (current && (current->printk_state & PRINTK_STATE_EMERGENCY_MASK))
		return NBCON_PRIO_EMERGENCY;

	return NBCON_PRIO_NORMAL;
}
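
Usage would then be as simple as (illustration only):

	int printk_state;

	nbcon_emergency_enter(&printk_state);
	/* ... generate the emergency (WARN/OOPS) messages ... */
	nbcon_emergency_exit(printk_state);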

IMHO, it should be race free. And we could use it to initialize
struct nbcon_context. The final decision will be left for
the nbcon_context_try_acquire().

Best Regards,
Petr

2023-09-26 22:46:38

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

On Mon 2023-09-25 15:43:03, John Ogness wrote:
> On 2023-09-22, Petr Mladek <[email protected]> wrote:
> >> console_flush_on_panic() - Called from several call sites to
> >> trigger ringbuffer dumping in an urgent situation.
> >>
> >> console_flush_on_panic() - Typically the panic() function will
> >
> > This is a second description of console_flush_on_panic() which
> > looks like a mistake.
>
> Oops. The first one should not be there.
>
> >> take care of atomic flushing the nbcon consoles on
> >> panic. However, there are several users of
> >> console_flush_on_panic() outside of panic().
> >
> > The generic panic() seems to use console_flush_on_panic() correctly
> > at the very end.
> >
> > Hmm, I see that console_flush_on_panic() is called also in
>
> [...]
>
> > Anyway, we should make clear that console_flush_on_panic() might break
> > the system and should be called as the last attempt to flush consoles.
> > The above arch-specific users are worth review.
>
> In an upcoming series you will see that console_flush_on_panic() only
> takes the console lock if there are legacy consoles. Ideally, eventually
> there will only be nbcon consoles, so your concern would disappear.

The legacy consoles have two risk levels:

1. port->lock is ignored after bust_spinlocks()
2. even console_lock is ignored in console_flush_on_panic()

The nbcon consoles have only one risk level:

1. unsafe takeover is allowed

First, I thought that we wanted to allow the unsafe takeover in
console_flush_on_panic(). In that case, this function would
be dangerous even for nbcon consoles.

Now, I remember that we wanted to allow it only before entering
the infinite loop (blinking diodes). In this case,
console_flush_on_panic() would be really safe for nbcon consoles.


> And if those users continue to use legacy consoles, then the risks will
> be the same as now.
>
> >> * Return: The previous priority that needs to be fed into
> >> * the corresponding nbcon_atomic_exit()
> >> * Context: Any context. Disables migration.
> >> + *
> >> + * When within an atomic printing section, no atomic printing occurs. This
> >> + * is to allow all emergency messages to be dumped into the ringbuffer before
> >> + * flushing the ringbuffer.
> >
> > The comment sounds like it is an advantage. But I think that it would be
> > a disadvantage.
>
> Please explain. At LPC2022 we agreed it was an advantage and it is
> implemented on purpose using the atomic printing sections.

I am sorry but I do not remember the details. Do you remember
the motivation, please?

In any case, we can't just say that this works by design
because someone somewhere agreed on it. We must explain
why this is better and I do not see it at the moment.

I am terribly sorry if I agreed with this and I disagree now.
I have never been good at live discussion because there is
not enough time to think about all the consequences.

Anyway, the proposed behavior (agreed on LPC2022) seems
to contradict the original plan from LPC 2019, see
https://lore.kernel.org/all/[email protected]/
Namely:

--- cut ---
3. Rather than defining emergency _messages_, we define an emergency
_state_ where the kernel wants to flush the messages immediately before
dying. Unlike oops_in_progress, this state will not be visible to
anything outside of the printk infrastructure.

4. When in emergency state, the kernel will use a new console callback
write_atomic() to flush the messages in whatever context the CPU is in
at that moment. Only consoles that implement the NMI-safe write_atomic()
will be able to flush in this state.
--- cut ---

We wanted to flush it ASAP.

I wonder if we discussed some limitations where the messages
could not be flushed immediately. Maybe, we discussed a scenario
when there are many pending messages which would delay
flushing the emergency ones. But we need to flush them anyway.

Now, I do not see any real advantage to first store all messages
and flush them later in the same context.

OK, flushing them immediately might cause a delay when flushing the
first emergency one. But storing them all might cause overwriting of
the first emergency messages.

I hope that my proposed change would actually make things easier
and will not affect the upcoming patchsets that much.

Best Regards,
Petr

2023-09-27 16:01:14

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

On Wed 2023-09-20 01:14:54, John Ogness wrote:
> Invoke the atomic write enforcement functions for oops to
> ensure that the information gets out to the consoles.
>
> Since there is no single general function that calls both
> oops_enter() and oops_exit(), the nesting feature of atomic
> write sections is taken advantage of in order to guarantee
> full coverage between the first oops_enter() and the last
> oops_exit().
>
> It is important to note that if there are any legacy consoles
> registered, they will be attempting to directly print from the
> printk-caller context, which may jeopardize the reliability of
> the atomic consoles. Optimally there should be no legacy
> consoles registered.
>
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -630,6 +634,36 @@ bool oops_may_print(void)
> */
> void oops_enter(void)
> {
> +	enum nbcon_prio prev_prio;
> +	int cpu = -1;
> +
> +	/*
> +	 * If this turns out to be the first CPU in oops, this is the
> +	 * beginning of the outermost atomic section. Otherwise it is
> +	 * the beginning of an inner atomic section.
> +	 */

This sounds strange. What is the advantage of having the inner
atomic context, please? It covers only messages printed inside
oops_enter() and not the whole oops_enter()/exit(). Also see below.

> +	prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
> +
> +	if (atomic_try_cmpxchg_relaxed(&oops_cpu, &cpu, smp_processor_id())) {
> +		/*
> +		 * This is the first CPU in oops. Save the outermost
> +		 * @prev_prio in order to restore it on the outermost
> +		 * matching oops_exit(), when @oops_nesting == 0.
> +		 */
> +		oops_prev_prio = prev_prio;
> +
> +		/*
> +		 * Enter an inner atomic section that ends at the end of this
> +		 * function. In this case, the nbcon_atomic_enter() above
> +		 * began the outermost atomic section.
> +		 */
> +		prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
> +	}
> +
> +	/* Track nesting when this CPU is the owner. */
> +	if (cpu == -1 || cpu == smp_processor_id())
> +		oops_nesting++;
> +
> 	tracing_off();
> 	/* can't trust the integrity of the kernel anymore: */
> 	debug_locks_off();
> @@ -637,6 +671,9 @@ void oops_enter(void)
>
> 	if (sysctl_oops_all_cpu_backtrace)
> 		trigger_all_cpu_backtrace();
> +
> +	/* Exit inner atomic section. */
> +	nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, prev_prio);

This will not flush the messages when:

+ This CPU owns oops_cpu. The flush will have to wait for exiting
the outer loop.

In this case, the inner atomic context is not needed.


+ oops_cpu is owned by another CPU, and the other CPU is
just flushing the messages and blocking the per-console
lock.

The good thing is that the messages printed by this oops_enter()
would likely get flushed by the other CPU.

The bad thing is that oops_exit() on this CPU won't call
nbcon_atomic_exit() so that the following OOPS messages
from this CPU might need to wait for the printk kthread.
IMHO, this is not what we want.


One solution would be to store prev_prio in a per-CPU array
so that each CPU could call its own nbcon_atomic_exit().

But I start liking more and more the idea with storing
and counting nested emergency contexts in struct task_struct.
It is the alternative implementation in reply to the 7th patch,
https://lore.kernel.org/r/ZRLBxsXPCym2NC5Q@alley

Then it will be enough to simply call:

+ nbcon_emergency_enter() in oops_enter()
+ nbcon_emergency_exit() in oops_exit()

Best Regards,
Petr

PS: I just hope that you didn't add all this complexity just because
we preferred this behavior at LPC 2022. Especially I hope
that it was not me who proposed and preferred this.

2023-10-05 14:37:21

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On 2023-09-25, Petr Mladek <[email protected]> wrote:
> From my POV:
>
> + It did not mention/explain "atomic write" at all

Agreed.

> + It said that the urgent messages required immediate printing.
> And later, it said that they would get flushed later. These
> two statements contradict each other.

I agree that the wording needs to be dramatically improved. The term
"urgent message" was not meant to represent a single printk() call.

> + The handling of priorities is not only about CPU nesting.
> The same rules should also apply when another CPU is printing
> messages in a higher priority context.

Sorry, I do not understand what you mean here. Each CPU is responsible
for its own state. If another CPU is in a higher priority state, that
CPU will be responsible for ensuring its own WARN/OOPS is stored and
flushed. (From v2 I see that the CPU does not try hard enough. I would
fix that for v3.)

> My proposal:
>
> + Use the term "higher priority context" for all WARN/OOPS/PANIC
> messages.
>
> + Use "emergency priority context" or "nbcon emergency context"
> when talking about NBCON_PRIO_EMERGENCY context to avoid
> confusion with the printk log levels.
>
> + use "panic priority context or "nbcon panic context" for the panic
> one if we really want to add enter/exit for the panic context.

There are 3 different types of priority:

1. The printk record priority: KERN_ERR, KERN_WARNING, etc.

2. The priority of a console owner: used for controlled takeovers.

3. The priority state of a CPU: only elevated for urgent messages, used
to store all urgent messages and then afterwards print directly by
taking ownership of consoles.

I need to choose terminology carefully to make it easy to distinguish
between these 3 types. v2 failed to name, describe, and document this
correctly.

> + We must somewhere explain the "atomic context" and the "atomic_write"
> callback. One important question is why it is atomic. Is it because it
>
> + _can_ be called in atomic context?
> + _must_ be called in atomic context?

Its main reason for existing is because it can be called from atomic
(including NMI) contexts. But like atomic_t, it can be used from any
context.

> It is also called from console_unlock() for boot messages,
> so it need not be called in atomic context.
>
> What about renaming it to "nbcon_write" to avoid this confusion?

When we introduce the threads, there will be a 2nd callback (currently
planned write_thread()). This callback is guaranteed to be called from a
printing kthread, which for console drivers like fbcon will prove to be
helpful in cleaning up its code.

I will reserve the word "atomic" _only_ when talking about which
printing callback is used. That should help to avoid associating the
callback with a certain context or priority. But I think the name
"write_atomic" is appropriate.

>> The "nesting" array is needed in order to support a feature that is not
>> explained in the commit message: If nested OOPS/WARN/PANIC occur, only
>> the outermost OOPS/WARN/PANIC will do the flushing. I will add this
>> information to the commit message.
>
> What is the motivation for the feature, please?

During the demo at LPC2022 we had the situation that there was a large
backlog when a WARN was hit. With current mainline the first line of the
WARN is put into the ringbuffer and then the entire backlog is flushed
before storing the rest of the WARN into the ringbuffer. At the time it
was obvious that we should finish storing the WARN message and then
start flushing the backlog.

> 1. Either we want to flush the messages in the higher priority context
> ASAP.
>
> The per-CPU lock has been designed exactly for this purpose. It
> would not need any extra nesting counter. And it would work
> consistently also when the lock is acquired on another CPU.
>
> It is simple, the context will either get the per-console lock or
> not. The (nested) context might get the lock only when it has higher
> priority. The flush would be called directly from vprintk_emit().
>
> I always thought that this was the planned behavior.

It was. But then it was suggested by Thomas and acked by Linus that we
were taking the wrong approach. Rather than a global state for all the
consoles, each console should have separate owners with states, allowing
handovers/takeovers. And CPUs should have urgency states to control how
the ringbuffer is stored and when direct printing should take place for
that CPU. This idea is similar to the cpu_sync approach, but with
timeouts, handovers/takeovers, and is per-console.
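
As a minimal sketch of the ownership rule being described here (the
names are illustrative, not the actual nbcon code):

	/*
	 * An owned console may only be acquired by a context with a
	 * strictly higher priority. The real code additionally handles
	 * handover requests, timeouts and unsafe takeovers.
	 */
	if (ctxt->prio <= console_owner_prio)
		return false;	/* keep the current owner */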

This approach gives us a lot of flexibility to enforce logical and safe
policies for storing and printing in different situations. And if
named/documented correctly, I think it will be easy to understand.

> 2. Or we want to wait until all messages in the higher priority context
> are stored into the ring buffer and have the caller flush them.

Yes.

> Who would actually flush the higher messages?
> WARN() caller?

Yes.

> The timer callback which detected softlockup?

Yes.

> Or a completely different context?

Another CPU could do some helpful backlog printing if it sees new
messages and is able to become an owner. But it is not the CPU that is
responsible for making certain that the messages have been printed.

> Who would flush panic() messages when panic() interrupted
> locked CPU?

The panic CPU can takeover (as a last resort, unsafely if
necessary). The panic CPU is responsible for getting those messages out.

> My proposal:
>
> There are only two higher priority contexts:
>
> + NBCON_PRIO_PANIC should be used when panic_cpu == raw_smp_processor_id()

This is the case with v2.

> + NBCON_PRIO_EMERGENCY context would require some enter/exit wrappers
> and tracking. But it does not necessarily need to be a per-CPU
> variable.
>
> I think about adding "int printk_state" into struct task_struct.
> It might be useful also for other things, e.g. for storing the last
> log level of non-finished message. Entering section with enforced
> minimal loglevel or so.

printk() calls are quite rare. And warning states are even more
rare. IMHO adding such a field to every task is a huge price to
pay. Also, printk operates in non-task contexts (hardirq, nmi). Although
@current probably points to something existing, there could be some
tricky corner cases.

IMHO per-cpu urgency state variables and per-console owner state
variables allow a much simpler picture in all contexts.

My proposal:

You have provided excellent feedback regarding naming and
documentation. Allow me to fix these things to clarify the various
functions and their roles.

John

2023-10-05 14:37:37

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

On 2023-09-26, Petr Mladek <[email protected]> wrote:
> Anyway, the proposed behavior (agreed on LPC2022) seems
> to contradict the original plan from LPC 2019, see
> https://lore.kernel.org/all/[email protected]/
> Namely:
>
> --- cut ---
> 3. Rather than defining emergency _messages_, we define an emergency
> _state_ where the kernel wants to flush the messages immediately before
> dying. Unlike oops_in_progress, this state will not be visible to
> anything outside of the printk infrastructure.
>
> 4. When in emergency state, the kernel will use a new console callback
> write_atomic() to flush the messages in whatever context the CPU is in
> at that moment. Only consoles that implement the NMI-safe write_atomic()
> will be able to flush in this state.
> --- cut ---
>
> We wanted to flush it ASAP.

In 2019 we didn't have all the code yet. Yes, we assumed that flushing
each individual printk() call would be how to do it. But in 2022, during
a live demo with real code, we saw that it was a bad idea.

I disagree that the 2022 decisions are contradicting the 2019
decisions. We have added more details because now we have a real
implementation.

v2 establishes an emergency state before dying. v2 flushes the messages
safely before dying (and unsafely _immediately_ before dying). v2
flushes the messages in whatever context the CPU is in. v2 uses
write_atomic() to perform the flush.

John

2023-10-06 12:53:51

by Petr Mladek

[permalink] [raw]
Subject: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

(2nd attempt, with Linus really in Cc).

Adding Linus into Cc. I would like to be sure about the flushing
of atomic consoles in panic context.

> During the demo at LPC2022 we had the situation that there was a large
> backlog when a WARN was hit. With current mainline the first line of the
> WARN is put into the ringbuffer and then the entire backlog is flushed
> before storing the rest of the WARN into the ringbuffer. At the time it
> was obvious that we should finish storing the WARN message and then
> start flushing the backlog.

This talks about the "emergency" context (WARN/OOPS/watchdog).
The system might be in big troubles but it would still try to continue.

Do we really want to defer the flush also for panic() context?

I ask because I was not on LPC 2022 in person and I do not remember
all details.

Anyway, the deferred flush works relatively well for the "emergency" context:

+ flushed from nbcon_atomic_exit()
+ printk kthread might emit the messages while they are being added

But it is tricky in panic(), see 8th patch at
https://lore.kernel.org/r/[email protected]

+ nbcon_atomic_exit() is called only in one code path.

+ nbcon_atomic_flush_all() is used in other paths. It looks like
a "Whack a mole" game to me.

+ messages are never emitted by printk kthread either because
CPUs are stopped or the kthread is not allowed to get the lock[*]

I see only one positive of the explicit flush. The consoles would
not delay crash_kexec() and the crash dump might be closer to
the point where panic() was called.

Otherwise I see only negatives => IMHO, we want to flush atomic
consoles synchronously from printk() in panic().

Does anyone really want explicit flushes in panic()?

[*] Emitting messages is explicitly blocked on non-panic CPUs. It
increases the chance that the panic CPU would be able to take
the console lock in a safe way.

Best Regards,
Petr

2023-10-06 15:53:13

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On Thu 2023-10-05 14:57:47, John Ogness wrote:
> On 2023-09-25, Petr Mladek <[email protected]> wrote:
> > From my POV:
> >
> > + It did not mention/explain "atomic write" at all
>
> Agreed.
>
> > + It said that the urgent messages required immediate printing.
> > And later, it said that they would get flushed later. These
> > two statements contradict each other.
>
> I agree that the wording needs to be dramatically improved. The term
> "urgent message" was not meant to represent a single printk() call.
>
> > + The handling of priorities is not only about CPU nesting.
> > The same rules should also apply when another CPU is printing
> > messages in a higher priority context.
>
> Sorry, I do not understand what you mean here. Each CPU is responsible
> for its own state. If another CPU is in a higher priority state, that
> CPU will be responsible for ensuring its own WARN/OOPS is stored and
> flushed.

You are right that my comment was confusing. I had in mind that
flushing one emergency context would flush all pending messages
from other CPUs and even from other emergency contexts. Your
explanation is better.

> (From v2 I see that the CPU does not try hard enough. I would
> fix that for v3.)

Yes, this should be fixed.

> There are 3 different types of priority:
>
> 1. The printk record priority: KERN_ERR, KERN_WARNING, etc.
>
> 2. The priority of a console owner: used for controlled takeovers.
>
> 3. The priority state of a CPU: only elevated for urgent messages, used
> to store all urgent messages and then afterwards print directly by
> taking ownership of consoles.

I know that you want to distinguish 2 and 3. But I think that
they should be the same. I mean that nbcon_context_try_acquire()
should use the PRIO according to what context it is in.

Is there any situation where it should be different, please?

IMHO, it might simplify some logic when they are the same.

> I need to choose terminology carefully to make it easy to distinguish
> between these 3 types. v2 failed to name, describe, and document this
> correctly.

> > + We must somewhere explain the "atomic context" and the "atomic_write"
> > callback. One important question is why it is atomic. Is it because it
> >
> > + _can_ be called in atomic context?
> > + _must_ be called in atomic context?
>
> Its main reason for existing is because it can be called from atomic
> (including NMI) contexts. But like atomic_t, it can be used from any
> context.
>
> > It is also called from console_unlock() for boot messages,
> > so it need not be called in atomic context.
> >
> > What about renaming it to "nbcon_write" to avoid this confusion?
>
> When we introduce the threads, there will be a 2nd callback (currently
> planned write_thread()). This callback is guaranteed to be called from a
> printing kthread, which for console drivers like fbcon will prove to be
> helpful in cleaning up its code.

I see.

> I will reserve the word "atomic" _only_ when talking about which
> printing callback is used. That should help to avoid associating the
> callback with a certain context or priority. But I think the name
> "write_atomic" is appropriate.

Sounds good.

> >> The "nesting" array is needed in order to support a feature that is not
> >> explained in the commit message: If nested OOPS/WARN/PANIC occur, only
> >> the outermost OOPS/WARN/PANIC will do the flushing. I will add this
> >> information to the commit message.
> >
> > What is the motivation for the feature, please?
>
> During the demo at LPC2022 we had the situation that there was a large
> backlog when a WARN was hit. With current mainline the first line of the
> WARN is put into the ringbuffer and then the entire backlog is flushed
> before storing the rest of the WARN into the ringbuffer. At the time it
> was obvious that we should finish storing the WARN message and then
> start flushing the backlog.

This talks about a context using NBCON_PRIO_EMERGENCY. I am pretty sure
that we want to flush the messages synchronously from printk() in
panic(). Let's discuss this in the other thread with Linus in Cc [1].

> > My proposal:
> >
> > There are only two higher priority contexts:
> >
> > + NBCON_PRIO_PANIC should be used when panic_cpu == raw_smp_processor_id()
>
> This is the case with v2.

Ah, I see, nbcon_atomic_enter(NBCON_PRIO_PANIC) is called right after
setting panic_cpu. But this is actually another reason to get rid
of the nbcon_atomic_enter() call. Why?

+ The information about entering a context with NBCON_PRIO_PANIC is
already provided by the panic_cpu variable.

+ Nesting is not possible because only one context is allowed
to acquire panic_cpu.

+ nbcon_atomic_exit() is almost useless because it is called only
in one code path. Explicit nbcon_atomic_flush_all() calls
are needed in other code paths.

IMHO, getting rid of nbcon_atomic_enter(NBCON_PRIO_PANIC) would have
several advantages:

+ enter()/exit() API would be needed only for NBCON_PRIO_EMERGENCY.
We could call it nbcon_emergency_enter()/exit() and avoid
the confusing atomic_enter name.

+ Nesting is important only for NBCON_PRIO_EMERGENCY context =>
the per-CPU variable might be a simple reference counter =>
easier code and no need to remember the previous value
and pass it to _exit() as a parameter.

> > + NBCON_PRIO_EMERGENCY context would require some enter/exit wrappers
> > and tracking. But it does not necessarily need to be a per-CPU
> > variable.
> >
> > I think about adding "int printk_state" into struct task_struct.
> > It might be useful also for other things, e.g. for storing the last
> > log level of non-finished message. Entering section with enforced
> > minimal loglevel or so.
>
> printk() calls are quite rare. And warning states are even more
> rare. IMHO adding such a field to every task is a huge price to
> pay. Also, printk operates in non-task contexts (hardirq, nmi). Although
> @current probably points to something existing, there could be some
> tricky corner cases.

Fair enough. Per-CPU variables are fine after all.

> My proposal:
>
> You have provided excellent feedback regarding naming and
> documentation. Allow me to fix these things to clarify the various
> functions and their roles.

We should first agree on the approach in panic() in the other thread [1].

IMHO, the context using NBCON_PRIO_EMERGENCY should be the only one
where we need enter/exit and the deferred flush. In this case,
the per-CPU variable would be a simple nesting counter. And maybe
the explicit flush would be needed only in emergency context.
=> easier code, logic, naming.

[1] https://lore.kernel.org/r/ZSADUKp8oJ2Ws2vC@alley

Best Regards,
Petr

2023-10-08 10:13:36

by John Ogness

[permalink] [raw]
Subject: Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

Hi Petr,

On 2023-10-06, Petr Mladek <[email protected]> wrote:
>> During the demo at LPC2022 we had the situation that there was a large
>> backlog when a WARN was hit. With current mainline the first line of the
>> WARN is put into the ringbuffer and then the entire backlog is flushed
>> before storing the rest of the WARN into the ringbuffer. At the time it
>> was obvious that we should finish storing the WARN message and then
>> start flushing the backlog.
>
> This talks about the "emergency" context (WARN/OOPS/watchdog).
> The system might be in big troubles but it would still try to continue.
>
> Do we really want to defer the flush also for panic() context?

We can start flushing right after the backtrace is in the
ringbuffer. But flushing the backlog _before_ putting the backtrace into
the ringbuffer was not desired because if there is a large backlog, the
machine may not survive to get the backtrace out. And in that case it
won't even be in the ringbuffer to be used by other debugging
tools.

> I ask because I was not on LPC 2022 in person and I do not remember
> all details.

The LPC2022 demo/talk was recorded:

https://www.youtube.com/watch?v=TVhNcKQvzxI

At 55:55 is where the situation occurred and triggered the conversation,
ultimately leading to this new feature.

You may also want to reread my summary:

https://lore.kernel.org/lkml/[email protected]

as well as Linus' follow-up message:

https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com

> But it is tricky in panic(), see 8th patch at
> https://lore.kernel.org/r/[email protected]
>
> + nbcon_atomic_exit() is called only in one code path.

Correct. When panic() is complete and the machine goes into its infinite
loop. This is also the point where it will attempt an unsafe flush, if
it could not get the messages out yet.

> + nbcon_atomic_flush_all() is used in other paths. It looks like
> a "Whack a mole" game to me.

Several different outputs occur during panic(). The flush is everywhere
where something significant has been put into the ringbuffer and now it
would be OK to flush it.

> + messages are never emitted by printk kthread either because
> CPUs are stopped or the kthread is not allowed to get the lock

Correct.

> I see only one positive of the explicit flush. The consoles would
> not delay crash_kexec() and the crash dump might be closer to
> the point where panic() was called.

It's only about getting the critical messages into the ringbuffer before
flushing. And since various things can go wrong during the many actions
within panic(), it makes sense to flush in between those actions.

> Otherwise I see only negatives => IMHO, we want to flush atomic
> consoles synchronously from printk() in panic().
>
> Does anyone really want explicit flushes in panic()?

So far you are the only one speaking against it. I expect as time goes
on it will get even more complex as it becomes tunable (also something
we talked about during the demo).

John

2023-10-09 15:32:24

by Petr Mladek

[permalink] [raw]
Subject: Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On Sun 2023-10-08 12:19:21, John Ogness wrote:
> Hi Petr,
>
> On 2023-10-06, Petr Mladek <[email protected]> wrote:
> >> During the demo at LPC2022 we had the situation that there was a large
> >> backlog when a WARN was hit. With current mainline the first line of the
> >> WARN is put into the ringbuffer and then the entire backlog is flushed
> >> before storing the rest of the WARN into the ringbuffer. At the time it
> >> was obvious that we should finish storing the WARN message and then
> >> start flushing the backlog.
> >
> > This talks about the "emergency" context (WARN/OOPS/watchdog).
> > The system might be in big troubles but it would still try to continue.
> >
> > Do we really want to defer the flush also for panic() context?
>
> We can start flushing right after the backtrace is in the
> ringbuffer. But flushing the backlog _before_ putting the backtrace into
> the ringbuffer was not desired because if there is a large backlog, the
> machine may not survive to get the backtrace out. And in that case it
> won't even be in the ringbuffer to be used by other debugging
> tools.

We really have to distinguish emergency and panic context!

> > I ask because I was not on LPC 2022 in person and I do not remember
> > all details.
>
> The LPC2022 demo/talk was recorded:
>
> https://www.youtube.com/watch?v=TVhNcKQvzxI
>
> At 55:55 is where the situation occurred and triggered the conversation,
> ultimately leading to this new feature.

Thanks for the link. My understanding is that the following scenario
has happened:

1. System was booting and messages were being flushed using the kthread.

2. WARN() happened. It printed the 1st line, took over the per-console
console lock and started flushing the backlog. There were still
many pending messages from the boot.

3. NMI came and caused panic(). The panic context printed its first line,
took over the console from the WARN context, flushed the rest
of the backlog and continued printing/flushing more messages from
the panic() context.


Problem:

People complained that they saw only the first line from WARN().
The related detailed info, including backtrace, was missing.

It was because panic() interrupted WARN() before it was able
to flush the backlog and print/flush all WARN() messages.


Proposed solution:

WARN()/emergency context should first store the messages and
flush them at the end.

It would increase the chance that all WARN() messages will
be stored in the ring buffer before NMI/panic() is called.

panic() would then flush all pending messages including
the stored WARN() messages.
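
In pseudo-code, the WARN() flow would then be roughly (using the
names from this series only for illustration):

	/* emergency context */
	prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);

	/* first store the complete WARN() report into the ringbuffer */
	...

	/* dropping back to the previous priority flushes the backlog */
	nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, prev_prio);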


Important:

The problem is that panic() interrupted WARN().

There is _no_ problem with interrupting panic().
Let me repeat, nested panic() is not possible!


> You may also want to reread my summary:
>
> https://lore.kernel.org/lkml/[email protected]

Again, thanks for the pointer. Let me paste 2 paragraphs here:

<paste>
- Printing the backlog is important! If some emergency situation occurs,
make sure the backlog gets printed.

- When an emergency occurs, put the full backtrace into the ringbuffer
before flushing any backlog. This ensures that the backtrace makes it
into the ringbuffer in case a panic occurs while flushing the backlog.
</paste>

My understanding is:

The 1st paragraph is the reason why:

+ we have three priorities: normal, emergency, panic

+ messages in normal context might be offloaded to kthread

+ emergency and panic context should try to flush the messages
from this context.


The 2nd paragraph says that the emergency context should first store
the messages and flush them later. And the important part:

"in case a panic occurs while flushing the backlog".

=> panic might interrupt emergency

It clearly distinguishes emergency and panic context.


> as well as Linus' follow-up message:
>
> https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com

IMHO, the important part is:

<paste>
Yeah, I really liked the notion of doing the oops with just filling
the back buffer but still getting it printed out if something goes
wrong in the middle.
</paste>

He was talking about oops => emergency context

Also he wanted to get it out when something goes wrong in the middle
=> panic in the middle ?


And another paragraph:

<paste>
I doubt it ends up being an issue in practice, but basically I wanted
to just pipe up and say that the exact details of how much of the back
buffer needs to be flushed first _could_ be tweaked if it ever does
come up as an issue.
</paste>

Linus had doubts that there might be problems with too long backlog
in practice. And I have the doubts as well.

And this is my view. The deferred flush is trying to solve a corner
case and we are forgetting what blocked the printk kthreads for >10 years.

> > But it is tricky in panic(), see 8th patch at
> > https://lore.kernel.org/r/[email protected]
> >
> > + nbcon_atomic_exit() is called only in one code path.
>
> Correct. When panic() is complete and the machine goes into its infinite
> loop. This is also the point where it will attempt an unsafe flush, if
> it could not get the messages out yet.
>
> > + nbcon_atomic_flush_all() is used in other paths. It looks like
> > a "Whack a mole" game to me.
>
> Several different outputs occur during panic(). The flush is everywhere
> where something significant has been put into the ringbuffer and now it
> would be OK to flush it.

No, we could _never_ say that the flush is everywhere where something
important happens.

panic() might fail anywhere. The last printed message might be
an important clue. And it can be any message.


> > + messages are never emitted by printk kthread either because
> > CPUs are stopped or the kthread is not allowed to get the lock
>
> Correct.
>
> > I see only one positive of the explicit flush. The consoles would
> > not delay crash_exec() and the crash dump might be closer to
> > the point where panic() was called.
>
> It's only about getting the critical messages into the ringbuffer before
> flushing. And since various things can go wrong during the many actions
> within panic(), it makes sense to flush in between those actions.

I am glad that we agree that "various things can go wrong during panic".

Are we sure that the patchset added the explicit flush right after
each possible problematic place? What about crash_exec(),
various kmsg dumpers, notifiers?


> > Otherwise I see only negatives => IMHO, we want to flush atomic
> > consoles synchronously from printk() in panic().
> >
> > Does anyone really want explicit flushes in panic()?
>
> So far you are the only one speaking against it. I expect as time goes
> on it will get even more complex as it becomes tunable (also something
> we talked about during the demo).

From my POV:

1. We are just two people discussing it at the moment
=> word against word.

2. Please, try to read my reply again. I agreed with delaying the
flush in emergency context.

But I am strongly against explicit flushes during panic at
least in the initial implementation.


3. IMHO, there might be a lot of misunderstanding caused by
the word "emergency" context. Some people might see
it as OOPs/WARN/panic and other might want to handle
panic special way.

I see it:

+ emergency - huge danger, medical check needed, patient might
survive

+ panic - patient is about to die, try to get some secrets from
him before he dies.

Any sentence of the secret might be important. It would be a pity
to ignore his last A4 page just because it was not complete.


Sigh, I did my best to explain why the nesting problem is only in
the emergency context and why panic is different.

Feel free to ask if anything is still unclear.


Anyway, I am _not_ going to accept the explicit flushes in panic()
unless you show me a problem with interrupted panic() or
Linus explicitly says that the explicit flushes make sense.

Best Regards,
Petr

2023-10-10 16:03:25

by John Ogness

[permalink] [raw]
Subject: Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On 2023-10-09, Petr Mladek <[email protected]> wrote:
> We really have to distinguish emergency and panic context!

OK.

>> The LPC2022 demo/talk was recorded:
>>
>> https://www.youtube.com/watch?v=TVhNcKQvzxI
>>
>> At 55:55 is where the situation occurred and triggered the conversation,
>> ultimately leading to this new feature.
>
> Thanks for the link. My understanding is that the following scenario
> has happened:
>
> 1. System was booting and messages were being flushed using the kthread.
>
> 2. WARN() happened. It printed the 1st line, took over the per-console
> console lock and started flushing the backlog. There were still
> many pending messages from the boot.
>
> 3. NMI came and caused panic(). The panic context printed its first line,
> took over the console from the WARN context, flushed the rest
> of the backlog and continued printing/flushing more messages from
> the panic() context.
>
>
> Problem:
>
> People complained that they saw only the first line from WARN().
> The related detailed info, including backtrace, was missing.
>
> It was because panic() interrupted WARN() before it was able
> to flush the backlog and print/flush all WARN() messages.

Thanks for taking the time to review it in detail.

> Proposed solution:
>
> WARN()/emergency context should first store the messages and
> flush them at the end.
>
> It would increase the chance that all WARN() messages will
> be stored in the ring buffer before NMI/panic() is called.
>
> panic() would then flush all pending messages including
> the stored WARN() messages.

OK.

> Important:
>
> The problem is that panic() interrupted WARN().

Ack.

>> You may also want to reread my summary:
>>
>> https://lore.kernel.org/lkml/[email protected]
>
> Again, thanks for the pointer. Let me paste 2 paragraphs here:
>
> <paste>
> - Printing the backlog is important! If some emergency situation occurs,
> make sure the backlog gets printed.
>
> - When an emergency occurs, put the full backtrace into the ringbuffer
> before flushing any backlog. This ensures that the backtrace makes it
> into the ringbuffer in case a panic occurs while flushing the backlog.
> </paste>
>
> My understanding is:
>
> 1st paragraph is the reason why:
>
> + we have three priorities: normal, emergency, panic
>
> + messages in normal context might be offloaded to kthread
>
> + emergency and panic context should try to flush the messages
> from this context.
>
>
> 2nd paragraph talks about that emergency context should first store
> the messages and flush them later. And the important part:
>
> "in case a panic occurs while flushing the backlog.
>
> => panic might interrupt emergency
>
> It clearly distinguishes emergency and panic context.
>
>
>> as well as Linus' follow-up message:
>>
>> https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com
>
> IMHO, the important part is:
>
> <paste>
> Yeah, I really liked the notion of doing the oops with just filling
> the back buffer but still getting it printed out if something goes
> wrong in the middle.
> </paste>
>
> He was talking about oops => emergency context
>
> Also he wanted to get it out when something goes wrong in the middle
> => panic in the middle ?
>
>
> And another paragraph:
>
> <paste>
> I doubt it ends up being an issue in practice, but basically I wanted
> to just pipe up and say that the exact details of how much of the back
> buffer needs to be flushed first _could_ be tweaked if it ever does
> come up as an issue.
> </paste>
>
> Linus had doubts that there might be problems with too long backlog
> in practice. And I have the doubts as well.
>
> And this is my view. The deferred flush is trying to solve a corner
> case and we are forgetting what blocked the printk kthreads for >10 years.

OK. Thank you for the detailed analysis.

For v3 I will do something similar to what you proposed [0], except that
I will use a per-cpu variable (to track printk emergency nesting)
instead of adding a new field to the task struct.
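
A rough sketch of that direction (the names and details here are
placeholders, not the final API):

	static DEFINE_PER_CPU(unsigned int, printk_emergency_nesting);

	void nbcon_emergency_enter(void)
	{
		/* The counter is per-CPU, so migration must be disabled. */
		migrate_disable();
		this_cpu_inc(printk_emergency_nesting);
	}

	void nbcon_emergency_exit(void)
	{
		/* Only the outermost exit triggers the deferred flush. */
		if (this_cpu_dec_return(printk_emergency_nesting) == 0)
			nbcon_atomic_flush_all();
		migrate_enable();
	}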

John Ogness

[0] https://lore.kernel.org/lkml/ZRLBxsXPCym2NC5Q@alley/

2023-10-16 09:00:44

by Dave Young

[permalink] [raw]
Subject: Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

[Added more people in cc]

On 10/08/23 at 12:19pm, John Ogness wrote:
> Hi Petr,
>
> On 2023-10-06, Petr Mladek <[email protected]> wrote:
> >> During the demo at LPC2022 we had the situation that there was a large
> >> backlog when a WARN was hit. With current mainline the first line of the
> >> WARN is put into the ringbuffer and then the entire backlog is flushed
> >> before storing the rest of the WARN into the ringbuffer. At the time it
> >> was obvious that we should finish storing the WARN message and then
> >> start flushing the backlog.
> >
> > This talks about the "emergency" context (WARN/OOPS/watchdog).
> > The system might be in big troubles but it would still try to continue.
> >
> > Do we really want to defer the flush also for panic() context?
>
> We can start flushing right after the backtrace is in the
> ringbuffer. But flushing the backlog _before_ putting the backtrace into
> the ringbuffer was not desired because if there is a large backlog, the
> machine may not survive to get the backtrace out. And in that case it
> won't even be in the ringbuffer to be used by other debugging
> tools.
>
> > I ask because I was not on LPC 2022 in person and I do not remember
> > all details.
>
> The LPC2022 demo/talk was recorded:
>
> https://www.youtube.com/watch?v=TVhNcKQvzxI
>
> At 55:55 is where the situation occurred and triggered the conversation,
> ultimately leading to this new feature.
>
> You may also want to reread my summary:
>
> https://lore.kernel.org/lkml/[email protected]
>
> as well as Linus' follow-up message:
>
> https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com
>
> > But it is tricky in panic(), see 8th patch at
> > https://lore.kernel.org/r/[email protected]
> >
> > + nbcon_atomic_exit() is called only in one code path.
>
> Correct. When panic() is complete and the machine goes into its infinite
> loop. This is also the point where it will attempt an unsafe flush, if
> it could not get the messages out yet.
>
> > + nbcon_atomic_flush_all() is used in other paths. It looks like
> > a "Whack a mole" game to me.
>
> Several different outputs occur during panic(). The flush is everywhere
> where something significant has been put into the ringbuffer and now it
> would be OK to flush it.
>
> > + messages are never emitted by printk kthread either because
> > CPUs are stopped or the kthread is not allowed to get the lock
>
> Correct.
>
> > I see only one positive of the explicit flush. The consoles would
> > not delay crash_kexec() and the crash dump might be closer to
> > the point where panic() was called.
>
> It's only about getting the critical messages into the ringbuffer before
> flushing. And since various things can go wrong during the many actions
> within panic(), it makes sense to flush in between those actions.
>
> > Otherwise I see only negatives => IMHO, we want to flush atomic
> > consoles synchronously from printk() in panic().
> >
> > Does anyone really want explicit flushes in panic()?
>
> So far you are the only one speaking against it. I expect as time goes
> on it will get even more complex as it becomes tunable (also something
> we talked about during the demo).

Flushing consoles in the panic kexec case does not sound good, but I have no
deep understanding of the atomic printk series; I have added the kexec list and
reviewers in cc.

>
> John
>

Thanks
Dave

2023-10-16 10:10:23

by John Ogness

[permalink] [raw]
Subject: Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

Hi Dave,

On 2023-10-16, Dave Young <[email protected]> wrote:
>> > Does anyone really want explicit flushes in panic()?
>>
>> So far you are the only one speaking against it. I expect as time
>> goes on it will get even more complex as it becomes tunable (also
>> something we talked about during the demo).
>
> Flush consoles in panic kexec case sounds not good, but I have no deep
> understanding about the atomic printk series, added kexec list and
> reviewers in cc.

Currently every printk() message tries to flush immediately.

This series introduced a new method of first allowing a set of printk()
messages to be stored to the ringbuffer and then flushing the full
set. That is what this discussion was about.

The issue with allowing a set of printk() messages to be stored is that
you need to explicitly mark in code where the actual flushing should
occur. Petr's argument is that we do not want to insert "flush points"
into the panic() function and instead we should do as we do now: flush
each printk() message immediately.

In the end (for my upcoming v3 series) I agreed with Petr. We will
continue to keep things as they are now: flush each printk() message
immediately.

Currently consoles try to flush unsafely before kexec. With the atomic
printk series our goal is to only perform _safe_ flushing until all
other panic operations are complete. Only at the very end of panic()
would unsafe flushing be attempted.

John