2024-04-02 22:12:02

by John Ogness

Subject: [PATCH printk v4 00/27] wire up write_atomic() printing

Hi,

This is v4 of a series to wire up the nbcon consoles so that
they actually perform printing using their write_atomic()
callback. v3 is here [0]. For information about the motivation
of the atomic consoles, please read the cover letter of v1 [1].

The main focus of this series:

- For nbcon consoles, always call write_atomic() directly from
printk() caller context for the panic CPU.

- For nbcon consoles, call write_atomic() when unlocking the
console lock.

- Only perform the console lock/unlock dance if legacy or boot
consoles are registered.

- For legacy consoles, if nbcon consoles are registered, do not
attempt to print from printk() caller context for the panic
CPU until nbcon consoles have had a chance to print the most
significant messages.

- Mark emergency sections. In these sections printk() calls
will only store the messages. Upon exiting the emergency
section, nbcon consoles are flushed directly and legacy
console flushing is triggered via irq_work.

This series does _not_ include threaded printing or nbcon
drivers. Those features will be added in separate follow-up
series.

Note: With this series, a system with _only_ nbcon consoles
registered will not perform console printing unless the
console lock is used (for synchronization), or when
exiting emergency sections, or on panic. This is on
purpose. When nbcon kthreads are introduced, they will
fill the gaps.

The changes since v3:

- Modify the documentation of console_srcu_read_flags() to
clarify that it is needed anytime a console _might_ be
registered and the caller is not holding the
console_list_lock. Hopefully this makes it clear when this
helper function is needed.

- Create a function uart_port_set_cons() for setting @cons of
struct uart_port. It modifies @cons under the port lock to
avoid possible races within the port lock wrapper. All (5)
code sites are modified to use the new function.

- Introduce 2 new required nbcon console callbacks
device_lock()/device_unlock() to implement any internal
locking required by the driver. (For example, for uart serial
consoles it is locking/unlocking the port lock.) This is used
during console registration to ensure that the hardware is
not in use while the console transitions to registered. This
avoids the risk that the port lock wrappers fail to take the
nbcon console lock while the console is being registered on
another CPU. These callbacks will also be used later by the
printing kthreads.

- Introduce struct nbcon_drvdata to track ownership state when
using the port lock wrappers. This provides a race-free
alternative to the @nbcon_locked_port flag used in v3.

- Split the functionality of uart_nbcon_acquire() and
uart_nbcon_release() into driver-specific and generic parts.
The generic functions are named nbcon_driver_acquire() and
nbcon_driver_release(). The driver-specific part is moved
into serial_core.h into the new helper functions
__uart_port_nbcon_acquire() and __uart_port_nbcon_release().

- Rename nbcon_atomic_flush_all() to
nbcon_atomic_flush_pending() to emphasize that it only prints
up to the latest record at the time of the call. Also, flush
all the pending records of a console (without releasing
ownership in between) before flushing the next nbcon console.
This allows the full emergency block to be printed on at
least one atomic console before trying the next.

- Flush nbcon consoles directly in the caller context when
exiting an emergency section.

- If a CPU is in EMERGENCY context, do not trigger printing
of legacy consoles via irq_work.

- In panic, allow synchronous legacy printing before calling
the panic handlers. Attempt to flush there in the panic
context as well.

- Remove the return value for the nbcon console write_atomic()
callback. If ownership has not been lost, it is assumed the
printing was successful.

- Add a WARN_ON_ONCE if nbcon_emit_next_record() is called for
a console that has not provided a write_atomic() callback.

- Change the meaning of the return value of
nbcon_atomic_emit_one() to allow
nbcon_legacy_emit_next_record() to have the same return value
meaning as console_emit_next_record().

- Remove all legacy @seq handling from nbcon.c. For nbcon
consoles, printk.c handles the transfer and resetting of the
legacy @seq value to @nbcon_seq.

- Add a compiler barrier in __pr_flush() to ensure the compiler
does not optimize out a local variable by replacing it with
a racy read of multiple global variables.

- Let __wake_up_klogd() remove unnecessary flags before
possibly queuing irq_work.

- Eliminate header proxying in nbcon.c.

- Mark _all_ lockdep output blocks as emergency sections.

- Mark _all_ rcu stall blocks as emergency sections.

- Remove "(Optional)" in the documentation of the
write_atomic() callback. Once threads are available, it will
be optional. But at this point in the rework it is not.

John Ogness

[0] https://lore.kernel.org/lkml/[email protected]
[1] https://lore.kernel.org/lkml/[email protected]

John Ogness (23):
printk: Add notation to console_srcu locking
printk: Properly deal with nbcon consoles on seq init
printk: nbcon: Remove return value for write_atomic()
printk: nbcon: Add detailed doc for write_atomic()
printk: nbcon: Add callbacks to synchronize with driver
printk: nbcon: Use driver synchronization while registering
serial: core: Provide low-level functions to lock port
printk: nbcon: Implement processing in port->lock wrapper
printk: nbcon: Do not rely on proxy headers
printk: nbcon: Fix kerneldoc for enums
printk: Make console_is_usable() available to nbcon
printk: Let console_is_usable() handle nbcon
printk: Add @flags argument for console_is_usable()
printk: Track registered boot consoles
printk: nbcon: Use nbcon consoles in console_flush_all()
printk: nbcon: Assign priority based on CPU state
printk: nbcon: Add unsafe flushing on panic
printk: Avoid console_lock dance if no legacy or boot consoles
printk: Track nbcon consoles
printk: Coordinate direct printing in panic
panic: Mark emergency section in oops
rcu: Mark emergency section in rcu stalls
lockdep: Mark emergency sections in lockdep splats

Sebastian Andrzej Siewior (1):
printk: Check printk_deferred_enter()/_exit() usage

Thomas Gleixner (3):
printk: nbcon: Provide function to flush using write_atomic()
printk: nbcon: Implement emergency sections
panic: Mark emergency section in warn

drivers/tty/serial/8250/8250_core.c | 6 +-
drivers/tty/serial/amba-pl011.c | 2 +-
drivers/tty/serial/serial_core.c | 2 +-
include/linux/console.h | 138 ++++++++--
include/linux/printk.h | 32 ++-
include/linux/serial_core.h | 116 ++++++++-
kernel/locking/lockdep.c | 91 ++++++-
kernel/panic.c | 9 +
kernel/printk/internal.h | 56 +++-
kernel/printk/nbcon.c | 382 ++++++++++++++++++++++++++--
kernel/printk/printk.c | 287 ++++++++++++++++-----
kernel/printk/printk_ringbuffer.h | 2 +
kernel/printk/printk_safe.c | 12 +
kernel/rcu/tree_exp.h | 7 +
kernel/rcu/tree_stall.h | 9 +
15 files changed, 1038 insertions(+), 113 deletions(-)


base-commit: a2b4cab9da7746c42f87c13721d305baf0085a20
--
2.39.2



2024-04-02 22:12:41

by John Ogness

Subject: [PATCH printk v4 02/27] printk: Properly deal with nbcon consoles on seq init

If a non-boot console is registering and boot consoles exist, the
consoles are flushed before being unregistered. This allows the
non-boot console to continue where the boot console left off.

If for whatever reason flushing fails, the lowest seq found from
any of the enabled boot consoles is used. Until now con->seq was
checked. However, if it is an nbcon boot console, the function
nbcon_seq_read() must be used to read seq because con->seq is
not updated for nbcon consoles.

Check if it is an nbcon boot console and if so call
nbcon_seq_read() to read seq.

Also setup the nbcon sequence number and reset the legacy
sequence number from register_console() (rather than in
nbcon_init() and nbcon_seq_force()). This removes all legacy
sequence handling from nbcon.c so the code is easier to follow
and maintain.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/nbcon.c | 7 +------
kernel/printk/printk.c | 29 ++++++++++++++++++++++++-----
2 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index c8093bcc01fe..d741659d26ec 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -172,9 +172,6 @@ void nbcon_seq_force(struct console *con, u64 seq)
u64 valid_seq = max_t(u64, seq, prb_first_valid_seq(prb));

atomic_long_set(&ACCESS_PRIVATE(con, nbcon_seq), __u64seq_to_ulseq(valid_seq));
-
- /* Clear con->seq since nbcon consoles use con->nbcon_seq instead. */
- con->seq = 0;
}

/**
@@ -964,8 +961,6 @@ bool nbcon_alloc(struct console *con)
*
* nbcon_alloc() *must* be called and succeed before this function
* is called.
- *
- * This function expects that the legacy @con->seq has been set.
*/
void nbcon_init(struct console *con)
{
@@ -974,7 +969,7 @@ void nbcon_init(struct console *con)
/* nbcon_alloc() must have been called and successful! */
BUG_ON(!con->pbufs);

- nbcon_seq_force(con, con->seq);
+ nbcon_seq_force(con, 0);
nbcon_state_set(con, &state);
}

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index c7c0ee2b47eb..b7e52b3f3e96 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3348,6 +3348,7 @@ static void try_enable_default_console(struct console *newcon)
newcon->flags |= CON_CONSDEV;
}

+/* Set @newcon->seq to the first record this console should print. */
static void console_init_seq(struct console *newcon, bool bootcon_registered)
{
struct console *con;
@@ -3396,11 +3397,20 @@ static void console_init_seq(struct console *newcon, bool bootcon_registered)

newcon->seq = prb_next_seq(prb);
for_each_console(con) {
- if ((con->flags & CON_BOOT) &&
- (con->flags & CON_ENABLED) &&
- con->seq < newcon->seq) {
- newcon->seq = con->seq;
+ u64 seq;
+
+ if (!((con->flags & CON_BOOT) &&
+ (con->flags & CON_ENABLED))) {
+ continue;
}
+
+ if (con->flags & CON_NBCON)
+ seq = nbcon_seq_read(con);
+ else
+ seq = con->seq;
+
+ if (seq < newcon->seq)
+ newcon->seq = seq;
}
}

@@ -3517,9 +3527,18 @@ void register_console(struct console *newcon)
newcon->dropped = 0;
console_init_seq(newcon, bootcon_registered);

- if (newcon->flags & CON_NBCON)
+ if (newcon->flags & CON_NBCON) {
nbcon_init(newcon);

+ /*
+ * nbcon consoles have their own sequence counter. The legacy
+ * sequence counter is reset so that it is clear it is not
+ * being used.
+ */
+ nbcon_seq_force(newcon, newcon->seq);
+ newcon->seq = 0;
+ }
+
/*
* Put this console in the list - keep the
* preferred driver at the head of the list.
--
2.39.2


2024-04-02 22:12:45

by John Ogness

Subject: [PATCH printk v4 06/27] printk: nbcon: Add callbacks to synchronize with driver

Console drivers typically must deal with access to the hardware
via user input/output (such as an interactive login shell) and
output of kernel messages via printk() calls.

Follow-up commits require that the printk subsystem is able to
synchronize with the driver. Require nbcon consoles to implement
two new callbacks (device_lock(), device_unlock()) that will
use whatever synchronization mechanism the driver is using for
itself (for example, the port lock for uart serial consoles).

Signed-off-by: John Ogness <[email protected]>
---
include/linux/console.h | 42 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)

diff --git a/include/linux/console.h b/include/linux/console.h
index e4028d4079e1..ad85594e070e 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -352,6 +352,48 @@ struct console {
*/
void (*write_atomic)(struct console *con, struct nbcon_write_context *wctxt);

+ /**
+ * @device_lock:
+ *
+ * NBCON callback to begin synchronization with driver code.
+ *
+ * Console drivers typically must deal with access to the hardware
+ * via user input/output (such as an interactive login shell) and
+ * output of kernel messages via printk() calls. This callback is
+ * called by the printk-subsystem whenever it needs to synchronize
+ * with hardware access by the driver. It should be implemented to
+ * use whatever synchronization mechanism the driver is using for
+ * itself (for example, the port lock for uart serial consoles).
+ *
+ * This callback is always called from task context. It may use any
+ * synchronization method required by the driver. BUT this callback
+ * MUST also disable migration. The console driver may be using a
+ * synchronization mechanism that already takes care of this (such as
+ * spinlocks). Otherwise this function must explicitly call
+ * migrate_disable().
+ *
+ * The flags argument is provided as a convenience to the driver. It
+ * will be passed again to device_unlock(). It can be ignored if the
+ * driver does not need it.
+ */
+ void (*device_lock)(struct console *con, unsigned long *flags);
+
+ /**
+ * @device_unlock:
+ *
+ * NBCON callback to finish synchronization with driver code.
+ *
+ * It is the counterpart to device_lock().
+ *
+ * This callback is always called from task context. It must
+ * appropriately re-enable migration (depending on how device_lock()
+ * disabled migration).
+ *
+ * The flags argument is the value of the same variable that was
+ * passed to device_lock().
+ */
+ void (*device_unlock)(struct console *con, unsigned long flags);
+
atomic_t __private nbcon_state;
atomic_long_t __private nbcon_seq;
struct printk_buffers *pbufs;
--
2.39.2


2024-04-02 22:12:47

by John Ogness

Subject: [PATCH printk v4 04/27] printk: Check printk_deferred_enter()/_exit() usage

From: Sebastian Andrzej Siewior <[email protected]>

Add validation that printk_deferred_enter()/_exit() are called in
non-migration contexts.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
include/linux/printk.h | 9 +++++----
kernel/printk/internal.h | 3 +++
kernel/printk/printk_safe.c | 12 ++++++++++++
3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index 2fde40cc9677..d8b3f51d9e98 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -157,15 +157,16 @@ int _printk(const char *fmt, ...);
*/
__printf(1, 2) __cold int _printk_deferred(const char *fmt, ...);

-extern void __printk_safe_enter(void);
-extern void __printk_safe_exit(void);
+extern void __printk_deferred_enter(void);
+extern void __printk_deferred_exit(void);
+
/*
* The printk_deferred_enter/exit macros are available only as a hack for
* some code paths that need to defer all printk console printing. Interrupts
* must be disabled for the deferred duration.
*/
-#define printk_deferred_enter __printk_safe_enter
-#define printk_deferred_exit __printk_safe_exit
+#define printk_deferred_enter() __printk_deferred_enter()
+#define printk_deferred_exit() __printk_deferred_exit()

/*
* Please don't use printk_ratelimit(), because it shares ratelimiting state
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 6c2afee5ef62..4e0edcb3c311 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -53,6 +53,9 @@ int vprintk_store(int facility, int level,
__printf(1, 0) int vprintk_default(const char *fmt, va_list args);
__printf(1, 0) int vprintk_deferred(const char *fmt, va_list args);

+void __printk_safe_enter(void);
+void __printk_safe_exit(void);
+
bool printk_percpu_data_ready(void);

#define printk_safe_enter_irqsave(flags) \
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index 6d10927a07d8..4421ccac3113 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -26,6 +26,18 @@ void __printk_safe_exit(void)
this_cpu_dec(printk_context);
}

+void __printk_deferred_enter(void)
+{
+ cant_migrate();
+ __printk_safe_enter();
+}
+
+void __printk_deferred_exit(void)
+{
+ cant_migrate();
+ __printk_safe_exit();
+}
+
asmlinkage int vprintk(const char *fmt, va_list args)
{
#ifdef CONFIG_KGDB_KDB
--
2.39.2


2024-04-02 22:13:06

by John Ogness

Subject: [PATCH printk v4 07/27] printk: nbcon: Use driver synchronization while registering

Depending on whether an nbcon console is registered, a driver may
handle its internal locking differently. If a driver is holding
its internal lock while the nbcon console is registered, there
may be a risk that two different contexts access the hardware
simultaneously without synchronization. (For example, if the
printk subsystem invokes atomic printing while another driver
context acquired the internal lock without considering the
atomic console.)

Use the driver synchronization while an nbcon console
transitions to being registered. This guarantees that if the
driver acquires its internal lock when the nbcon console was not
registered, it will remain unregistered until that context
releases the lock.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index b7e52b3f3e96..cd32648372a0 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3448,6 +3448,7 @@ void register_console(struct console *newcon)
struct console *con;
bool bootcon_registered = false;
bool realcon_registered = false;
+ unsigned long flags;
int err;

console_list_lock();
@@ -3539,6 +3540,19 @@ void register_console(struct console *newcon)
newcon->seq = 0;
}

+ /*
+ * If another context is actively using the hardware of this new
+ * console, it will not be aware of the nbcon synchronization. There
+ * is a risk that two contexts could access the hardware
+ * simultaneously if this new console is used for atomic printing
+ * and the other context is still using the hardware.
+ *
+ * Use the driver synchronization to ensure that the hardware is not
+ * in use while this new console transitions to being registered.
+ */
+ if ((newcon->flags & CON_NBCON) && newcon->write_atomic)
+ newcon->device_lock(newcon, &flags);
+
/*
* Put this console in the list - keep the
* preferred driver at the head of the list.
@@ -3563,6 +3577,10 @@ void register_console(struct console *newcon)
* register_console() completes.
*/

+ /* This new console is now registered. */
+ if ((newcon->flags & CON_NBCON) && newcon->write_atomic)
+ newcon->device_unlock(newcon, flags);
+
console_sysfs_notify();

/*
--
2.39.2


2024-04-02 22:13:11

by John Ogness

Subject: [PATCH printk v4 08/27] serial: core: Provide low-level functions to lock port

It will be necessary at times for the uart nbcon console
drivers to acquire the port lock directly (without the
additional nbcon functionality of the port lock wrappers).
These are special cases such as the implementation of the
device_lock()/device_unlock() callbacks or for internal
port lock wrapper synchronization.

Provide low-level variants __uart_port_lock_irqsave() and
__uart_port_unlock_irqrestore() for this purpose.

Signed-off-by: John Ogness <[email protected]>
---
include/linux/serial_core.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
index 55b1f3ba48ac..bb3324d49453 100644
--- a/include/linux/serial_core.h
+++ b/include/linux/serial_core.h
@@ -588,6 +588,24 @@ struct uart_port {
void *private_data; /* generic platform data pointer */
};

+/*
+ * Only for console->device_lock()/_unlock() callbacks and internal
+ * port lock wrapper synchronization.
+ */
+static inline void __uart_port_lock_irqsave(struct uart_port *up, unsigned long *flags)
+{
+ spin_lock_irqsave(&up->lock, *flags);
+}
+
+/*
+ * Only for console->device_lock()/_unlock() callbacks and internal
+ * port lock wrapper synchronization.
+ */
+static inline void __uart_port_unlock_irqrestore(struct uart_port *up, unsigned long flags)
+{
+ spin_unlock_irqrestore(&up->lock, flags);
+}
+
/**
* uart_port_lock - Lock the UART port
* @up: Pointer to UART port structure
--
2.39.2


2024-04-02 22:13:39

by John Ogness

Subject: [PATCH printk v4 10/27] printk: nbcon: Do not rely on proxy headers

The headers kernel.h, serial_core.h, and console.h transitively
provide many type and function definitions from other headers.
Rather than relying on these as proxy headers, explicitly
include all headers providing needed definitions. Also sort the
list alphabetically to be able to easily detect duplicates.

Suggested-by: Andy Shevchenko <[email protected]>
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/internal.h | 8 ++++++--
kernel/printk/nbcon.c | 12 +++++++++++-
kernel/printk/printk_ringbuffer.h | 2 ++
3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 4e0edcb3c311..c040fc8f1fd9 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -2,11 +2,12 @@
/*
* internal.h - printk internal definitions
*/
-#include <linux/percpu.h>
#include <linux/console.h>
-#include "printk_ringbuffer.h"
+#include <linux/percpu.h>
+#include <linux/types.h>

#if defined(CONFIG_PRINTK) && defined(CONFIG_SYSCTL)
+struct ctl_table;
void __init printk_sysctl_init(void);
int devkmsg_sysctl_set_loglvl(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos);
@@ -43,6 +44,9 @@ enum printk_info_flags {
LOG_CONT = 8, /* text is a fragment of a continuation line */
};

+struct printk_ringbuffer;
+struct dev_printk_info;
+
extern struct printk_ringbuffer *prb;

__printf(4, 0)
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index 38328cf0fd5c..1de6062d4ce3 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -2,14 +2,24 @@
// Copyright (C) 2022 Linutronix GmbH, John Ogness
// Copyright (C) 2022 Intel, Thomas Gleixner

-#include <linux/kernel.h>
+#include <linux/atomic.h>
#include <linux/bug.h>
#include <linux/console.h>
#include <linux/delay.h>
+#include <linux/errno.h>
#include <linux/export.h>
+#include <linux/init.h>
+#include <linux/irqflags.h>
+#include <linux/minmax.h>
+#include <linux/percpu.h>
+#include <linux/preempt.h>
#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/stddef.h>
#include <linux/string.h>
+#include <linux/types.h>
#include "internal.h"
+#include "printk_ringbuffer.h"
/*
* Printk console printing implementation for consoles which does not depend
* on the legacy style console_lock mechanism.
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index 52626d0f1fa3..bd2a892deac1 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -5,6 +5,8 @@

#include <linux/atomic.h>
#include <linux/dev_printk.h>
+#include <linux/stddef.h>
+#include <linux/types.h>

/*
* Meta information about each stored message.
--
2.39.2


2024-04-02 22:13:48

by John Ogness

Subject: [PATCH printk v4 11/27] printk: nbcon: Fix kerneldoc for enums

Kerneldoc requires enums to be specified as such. Otherwise the
comment is interpreted as function documentation.

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Randy Dunlap <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
include/linux/console.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index e7c35c686720..5f1758891aec 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -137,7 +137,7 @@ static inline int con_debug_leave(void)
*/

/**
- * cons_flags - General console flags
+ * enum cons_flags - General console flags
* @CON_PRINTBUFFER: Used by newly registered consoles to avoid duplicate
* output of messages that were already shown by boot
* consoles or read by userspace via syslog() syscall.
@@ -218,7 +218,7 @@ struct nbcon_state {
static_assert(sizeof(struct nbcon_state) <= sizeof(int));

/**
- * nbcon_prio - console owner priority for nbcon consoles
+ * enum nbcon_prio - console owner priority for nbcon consoles
* @NBCON_PRIO_NONE: Unused
* @NBCON_PRIO_NORMAL: Normal (non-emergency) usage
* @NBCON_PRIO_EMERGENCY: Emergency output (WARN/OOPS...)
--
2.39.2


2024-04-02 22:14:00

by John Ogness

Subject: [PATCH printk v4 12/27] printk: Make console_is_usable() available to nbcon

Move console_is_usable() as-is into internal.h so that it can
be used by nbcon printing functions as well.

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/internal.h | 32 ++++++++++++++++++++++++++++++++
kernel/printk/printk.c | 30 ------------------------------
2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index c040fc8f1fd9..bad22092cd5e 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -85,6 +85,36 @@ bool nbcon_alloc(struct console *con);
void nbcon_init(struct console *con);
void nbcon_free(struct console *con);

+/*
+ * Check if the given console is currently capable and allowed to print
+ * records.
+ *
+ * Requires the console_srcu_read_lock.
+ */
+static inline bool console_is_usable(struct console *con)
+{
+ short flags = console_srcu_read_flags(con);
+
+ if (!(flags & CON_ENABLED))
+ return false;
+
+ if ((flags & CON_SUSPENDED))
+ return false;
+
+ if (!con->write)
+ return false;
+
+ /*
+ * Console drivers may assume that per-cpu resources have been
+ * allocated. So unless they're explicitly marked as being able to
+ * cope (CON_ANYTIME) don't call them until this CPU is officially up.
+ */
+ if (!cpu_online(raw_smp_processor_id()) && !(flags & CON_ANYTIME))
+ return false;
+
+ return true;
+}
+
#else

#define PRINTK_PREFIX_MAX 0
@@ -106,6 +136,8 @@ static inline bool nbcon_alloc(struct console *con) { return false; }
static inline void nbcon_init(struct console *con) { }
static inline void nbcon_free(struct console *con) { }

+static inline bool console_is_usable(struct console *con) { return false; }
+
#endif /* CONFIG_PRINTK */

extern struct printk_buffers printk_shared_pbufs;
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index cd32648372a0..9ea51cb2aca9 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2697,36 +2697,6 @@ int is_console_locked(void)
}
EXPORT_SYMBOL(is_console_locked);

-/*
- * Check if the given console is currently capable and allowed to print
- * records.
- *
- * Requires the console_srcu_read_lock.
- */
-static inline bool console_is_usable(struct console *con)
-{
- short flags = console_srcu_read_flags(con);
-
- if (!(flags & CON_ENABLED))
- return false;
-
- if ((flags & CON_SUSPENDED))
- return false;
-
- if (!con->write)
- return false;
-
- /*
- * Console drivers may assume that per-cpu resources have been
- * allocated. So unless they're explicitly marked as being able to
- * cope (CON_ANYTIME) don't call them until this CPU is officially up.
- */
- if (!cpu_online(raw_smp_processor_id()) && !(flags & CON_ANYTIME))
- return false;
-
- return true;
-}
-
static void __console_unlock(void)
{
console_locked = 0;
--
2.39.2


2024-04-02 22:14:00

by John Ogness

Subject: [PATCH printk v4 09/27] printk: nbcon: Implement processing in port->lock wrapper

Currently the port->lock wrappers uart_port_lock(),
uart_port_unlock() (and their variants) only lock/unlock
the spin_lock.

If the port is an nbcon console, the wrappers must also
acquire/release the console and mark the region as unsafe. This
allows general port->lock usage to be synchronized with the
nbcon console ownership.

Introduce a new struct nbcon_drvdata within struct console that
provides the necessary components for the port lock wrappers to
acquire the nbcon console and track its ownership.

Also introduce uart_port_set_cons() as a wrapper to set @cons
of a uart_port. The wrapper sets @cons under the port lock in
order to prevent @cons from disappearing while another context
owns the port lock via the port lock wrappers.

Also cleanup the description of the console_srcu_read_flags()
function. It is used by the port lock wrappers to ensure a
console cannot be fully unregistered while another context
owns the port lock via the port lock wrappers.

Signed-off-by: John Ogness <[email protected]>
---
drivers/tty/serial/8250/8250_core.c | 6 +-
drivers/tty/serial/amba-pl011.c | 2 +-
drivers/tty/serial/serial_core.c | 2 +-
include/linux/console.h | 57 +++++++++++++----
include/linux/printk.h | 13 ++++
include/linux/serial_core.h | 98 ++++++++++++++++++++++++++++-
kernel/printk/nbcon.c | 52 +++++++++++++++
7 files changed, 212 insertions(+), 18 deletions(-)

diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_core.c
index b62ad9006780..41d74ee3d95a 100644
--- a/drivers/tty/serial/8250/8250_core.c
+++ b/drivers/tty/serial/8250/8250_core.c
@@ -627,11 +627,11 @@ static int univ8250_console_setup(struct console *co, char *options)

port = &serial8250_ports[co->index].port;
/* link port to console */
- port->cons = co;
+ uart_port_set_cons(port, co);

retval = serial8250_console_setup(port, options, false);
if (retval != 0)
- port->cons = NULL;
+ uart_port_set_cons(port, NULL);
return retval;
}

@@ -689,7 +689,7 @@ static int univ8250_console_match(struct console *co, char *name, int idx,
continue;

co->index = i;
- port->cons = co;
+ uart_port_set_cons(port, co);
return serial8250_console_setup(port, options, true);
}

diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index cf2c890a560f..347aacf8400f 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -2496,7 +2496,7 @@ static int pl011_console_match(struct console *co, char *name, int idx,
continue;

co->index = i;
- port->cons = co;
+ uart_port_set_cons(port, co);
return pl011_console_setup(co, options);
}

diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index d6a58a9e072a..2652b4d5c944 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -3146,7 +3146,7 @@ static int serial_core_add_one_port(struct uart_driver *drv, struct uart_port *u
uport->state = state;

state->pm_state = UART_PM_STATE_UNDEFINED;
- uport->cons = drv->cons;
+ uart_port_set_cons(uport, drv->cons);
uport->minor = drv->tty_driver->minor_start + uport->line;
uport->name = kasprintf(GFP_KERNEL, "%s%d", drv->dev_name,
drv->tty_driver->name_base + uport->line);
diff --git a/include/linux/console.h b/include/linux/console.h
index ad85594e070e..e7c35c686720 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -282,6 +282,25 @@ struct nbcon_write_context {
bool unsafe_takeover;
};

+/**
+ * struct nbcon_drvdata - Data to allow nbcon acquire in non-print context
+ * @ctxt: The core console context
+ * @srcu_cookie: Storage for a console_srcu_lock cookie, if needed
+ * @owner_index: Storage for the owning console index, if needed
+ * @locked: Storage for the locked state, if needed
+ *
+ * All fields (except for @ctxt) are available exclusively to the driver to
+ * use as needed. They are not used by the printk subsystem.
+ */
+struct nbcon_drvdata {
+ struct nbcon_context __private ctxt;
+
+ /* reserved for driver use */
+ int srcu_cookie;
+ short owner_index;
+ bool locked;
+};
+
/**
* struct console - The console descriptor structure
* @name: The name of the console driver
@@ -396,6 +415,21 @@ struct console {

atomic_t __private nbcon_state;
atomic_long_t __private nbcon_seq;
+
+ /**
+ * @nbcon_drvdata:
+ *
+ * Data for nbcon ownership tracking to allow acquiring nbcon consoles
+ * in non-printing contexts.
+ *
+ * Drivers may need to acquire nbcon consoles in non-printing
+ * contexts. This is achieved by providing a struct nbcon_drvdata.
+ * Then the driver can call nbcon_driver_acquire() and
+ * nbcon_driver_release(). The struct does not require any special
+ * initialization.
+ */
+ struct nbcon_drvdata *nbcon_drvdata;
+
struct printk_buffers *pbufs;
};

@@ -425,28 +459,29 @@ extern void console_list_unlock(void) __releases(console_mutex);
extern struct hlist_head console_list;

/**
- * console_srcu_read_flags - Locklessly read the console flags
+ * console_srcu_read_flags - Locklessly read flags of a possibly registered
+ * console
* @con: struct console pointer of console to read flags from
*
- * This function provides the necessary READ_ONCE() and data_race()
- * notation for locklessly reading the console flags. The READ_ONCE()
- * in this function matches the WRITE_ONCE() when @flags are modified
- * for registered consoles with console_srcu_write_flags().
+ * Locklessly reading @con->flags provides a consistent read value because
+ * there is at most one CPU modifying @con->flags and that CPU is using only
+ * read-modify-write operations to do so.
*
- * Only use this function to read console flags when locklessly
- * iterating the console list via srcu.
+ * Requires console_srcu_read_lock to be held, which implies that @con might
+ * be a registered console. If the caller is holding the console_list_lock or
+ * it is certain that the console is not registered, the caller may read
+ * @con->flags directly instead.
*
* Context: Any context.
+ * Return: The current value of the @con->flags field.
*/
static inline short console_srcu_read_flags(const struct console *con)
{
WARN_ON_ONCE(!console_srcu_read_lock_is_held());

/*
- * Locklessly reading console->flags provides a consistent
- * read value because there is at most one CPU modifying
- * console->flags and that CPU is using only read-modify-write
- * operations to do so.
+ * The READ_ONCE() matches the WRITE_ONCE() when @flags are modified
+ * for registered consoles with console_srcu_write_flags().
*/
return data_race(READ_ONCE(con->flags));
}
diff --git a/include/linux/printk.h b/include/linux/printk.h
index d8b3f51d9e98..0ad3ee752635 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -9,6 +9,8 @@
#include <linux/ratelimit_types.h>
#include <linux/once_lite.h>

+struct console;
+
extern const char linux_banner[];
extern const char linux_proc_banner[];

@@ -193,6 +195,8 @@ void show_regs_print_info(const char *log_lvl);
extern asmlinkage void dump_stack_lvl(const char *log_lvl) __cold;
extern asmlinkage void dump_stack(void) __cold;
void printk_trigger_flush(void);
+extern void nbcon_driver_acquire(struct console *con);
+extern void nbcon_driver_release(struct console *con);
#else
static inline __printf(1, 0)
int vprintk(const char *s, va_list args)
@@ -272,6 +276,15 @@ static inline void dump_stack(void)
static inline void printk_trigger_flush(void)
{
}
+
+static inline void nbcon_driver_acquire(struct console *con)
+{
+}
+
+static inline void nbcon_driver_release(struct console *con)
+{
+}
+
#endif

bool this_cpu_in_panic(void);
diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
index bb3324d49453..9a73dee32ad9 100644
--- a/include/linux/serial_core.h
+++ b/include/linux/serial_core.h
@@ -8,10 +8,13 @@
#define LINUX_SERIAL_CORE_H

#include <linux/bitops.h>
+#include <linux/bug.h>
#include <linux/compiler.h>
#include <linux/console.h>
#include <linux/interrupt.h>
#include <linux/circ_buf.h>
+#include <linux/lockdep.h>
+#include <linux/printk.h>
#include <linux/spinlock.h>
#include <linux/sched.h>
#include <linux/tty.h>
@@ -606,6 +609,83 @@ static inline void __uart_port_unlock_irqrestore(struct uart_port *up, unsigned
spin_unlock_irqrestore(&up->lock, flags);
}

+/**
+ * uart_port_set_cons - Safely set the @cons field for a uart
+ * @up: The uart port to set
+ * @con: The new console to set to
+ *
+ * This function must be used to set @up->cons. It uses the port lock to
+ * synchronize with the port lock wrappers in order to ensure that the console
+ * cannot change or disappear while another context is holding the port lock.
+ */
+static inline void uart_port_set_cons(struct uart_port *up, struct console *con)
+{
+ unsigned long flags;
+
+ __uart_port_lock_irqsave(up, &flags);
+ up->cons = con;
+ __uart_port_unlock_irqrestore(up, flags);
+}
+
+/* Only for internal port lock wrapper usage. */
+static inline void __uart_port_nbcon_acquire(struct uart_port *up)
+{
+ lockdep_assert_held_once(&up->lock);
+
+ if (likely(!uart_console(up)))
+ return;
+
+ if (up->cons->nbcon_drvdata) {
+ /*
+ * If @up->cons is registered, prevent it from fully
+ * unregistering until this context releases the nbcon.
+ */
+ int cookie = console_srcu_read_lock();
+
+ /* Ensure console is registered and is an nbcon console. */
+ if (!hlist_unhashed_lockless(&up->cons->node) &&
+ (console_srcu_read_flags(up->cons) & CON_NBCON)) {
+ WARN_ON_ONCE(up->cons->nbcon_drvdata->locked);
+
+ nbcon_driver_acquire(up->cons);
+
+ /*
+ * Record @up->line to be used during release because
+ * @up->cons->index can change while the port and
+ * nbcon are locked.
+ */
+ up->cons->nbcon_drvdata->owner_index = up->line;
+ up->cons->nbcon_drvdata->srcu_cookie = cookie;
+ up->cons->nbcon_drvdata->locked = true;
+ } else {
+ console_srcu_read_unlock(cookie);
+ }
+ }
+}
+
+/* Only for internal port lock wrapper usage. */
+static inline void __uart_port_nbcon_release(struct uart_port *up)
+{
+ lockdep_assert_held_once(&up->lock);
+
+ /*
+ * uart_console() cannot be used here because @up->cons->index might
+ * have changed. Check against @up->cons->nbcon_drvdata->owner_index
+ * instead.
+ */
+
+ if (unlikely(up->cons &&
+ up->cons->nbcon_drvdata &&
+ up->cons->nbcon_drvdata->locked &&
+ up->cons->nbcon_drvdata->owner_index == up->line)) {
+ WARN_ON_ONCE(!up->cons->nbcon_drvdata->locked);
+
+ up->cons->nbcon_drvdata->locked = false;
+ nbcon_driver_release(up->cons);
+ console_srcu_read_unlock(up->cons->nbcon_drvdata->srcu_cookie);
+ }
+}
+
/**
* uart_port_lock - Lock the UART port
* @up: Pointer to UART port structure
@@ -613,6 +693,7 @@ static inline void __uart_port_unlock_irqrestore(struct uart_port *up, unsigned
static inline void uart_port_lock(struct uart_port *up)
{
spin_lock(&up->lock);
+ __uart_port_nbcon_acquire(up);
}

/**
@@ -622,6 +703,7 @@ static inline void uart_port_lock(struct uart_port *up)
static inline void uart_port_lock_irq(struct uart_port *up)
{
spin_lock_irq(&up->lock);
+ __uart_port_nbcon_acquire(up);
}

/**
@@ -632,6 +714,7 @@ static inline void uart_port_lock_irq(struct uart_port *up)
static inline void uart_port_lock_irqsave(struct uart_port *up, unsigned long *flags)
{
spin_lock_irqsave(&up->lock, *flags);
+ __uart_port_nbcon_acquire(up);
}

/**
@@ -642,7 +725,11 @@ static inline void uart_port_lock_irqsave(struct uart_port *up, unsigned long *f
*/
static inline bool uart_port_trylock(struct uart_port *up)
{
- return spin_trylock(&up->lock);
+ if (!spin_trylock(&up->lock))
+ return false;
+
+ __uart_port_nbcon_acquire(up);
+ return true;
}

/**
@@ -654,7 +741,11 @@ static inline bool uart_port_trylock(struct uart_port *up)
*/
static inline bool uart_port_trylock_irqsave(struct uart_port *up, unsigned long *flags)
{
- return spin_trylock_irqsave(&up->lock, *flags);
+ if (!spin_trylock_irqsave(&up->lock, *flags))
+ return false;
+
+ __uart_port_nbcon_acquire(up);
+ return true;
}

/**
@@ -663,6 +754,7 @@ static inline bool uart_port_trylock_irqsave(struct uart_port *up, unsigned long
*/
static inline void uart_port_unlock(struct uart_port *up)
{
+ __uart_port_nbcon_release(up);
spin_unlock(&up->lock);
}

@@ -672,6 +764,7 @@ static inline void uart_port_unlock(struct uart_port *up)
*/
static inline void uart_port_unlock_irq(struct uart_port *up)
{
+ __uart_port_nbcon_release(up);
spin_unlock_irq(&up->lock);
}

@@ -682,6 +775,7 @@ static inline void uart_port_unlock_irq(struct uart_port *up)
*/
static inline void uart_port_unlock_irqrestore(struct uart_port *up, unsigned long flags)
{
+ __uart_port_nbcon_release(up);
spin_unlock_irqrestore(&up->lock, flags);
}

diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index 2516449f921d..38328cf0fd5c 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -3,9 +3,12 @@
// Copyright (C) 2022 Intel, Thomas Gleixner

#include <linux/kernel.h>
+#include <linux/bug.h>
#include <linux/console.h>
#include <linux/delay.h>
+#include <linux/export.h>
#include <linux/slab.h>
+#include <linux/string.h>
#include "internal.h"
/*
* Printk console printing implementation for consoles which does not depend
@@ -988,3 +991,52 @@ void nbcon_free(struct console *con)

con->pbufs = NULL;
}
+
+/**
+ * nbcon_driver_acquire - Acquire nbcon console and enter unsafe section
+ * @con: The nbcon console to acquire
+ *
+ * Context: Any context which could not be migrated to another CPU.
+ *
+ * Console drivers will usually use their own internal synchronization
+ * mechanism to synchronize between console printing and non-printing
+ * activities (such as setting baud rates). However, nbcon console drivers
+ * supporting atomic consoles may also want to mark unsafe sections when
+ * performing non-printing activities.
+ *
+ * This function acquires the nbcon console using priority NBCON_PRIO_NORMAL
+ * and marks it unsafe for handover/takeover.
+ *
+ * Console drivers using this function must have provided @nbcon_drvdata in
+ * their struct console, which is used to track ownership and state
+ * information.
+ */
+void nbcon_driver_acquire(struct console *con)
+{
+ struct nbcon_context *ctxt = &ACCESS_PRIVATE(con->nbcon_drvdata, ctxt);
+
+ cant_migrate();
+
+ do {
+ do {
+ memset(ctxt, 0, sizeof(*ctxt));
+ ctxt->console = con;
+ ctxt->prio = NBCON_PRIO_NORMAL;
+ } while (!nbcon_context_try_acquire(ctxt));
+
+ } while (!nbcon_context_enter_unsafe(ctxt));
+}
+EXPORT_SYMBOL_GPL(nbcon_driver_acquire);
+
+/**
+ * nbcon_driver_release - Exit unsafe section and release the nbcon console
+ * @con: The nbcon console acquired in nbcon_driver_acquire()
+ */
+void nbcon_driver_release(struct console *con)
+{
+ struct nbcon_context *ctxt = &ACCESS_PRIVATE(con->nbcon_drvdata, ctxt);
+
+ if (nbcon_context_exit_unsafe(ctxt))
+ nbcon_context_release(ctxt);
+}
+EXPORT_SYMBOL_GPL(nbcon_driver_release);
--
2.39.2


2024-04-02 22:14:23

by John Ogness

Subject: [PATCH printk v4 14/27] printk: Add @flags argument for console_is_usable()

The caller of console_is_usable() usually needs @console->flags
for its own checks. Rather than having console_is_usable() read
its own copy, make the caller pass in the @flags. This also
ensures that the caller and console_is_usable() see the same
@flags value.

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/internal.h | 8 ++------
kernel/printk/printk.c | 5 +++--
2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 398e99a30427..b7a0072eb2a4 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -90,13 +90,9 @@ void nbcon_free(struct console *con);
* records. Note that this function does not consider the current context,
* which can also play a role in deciding if @con can be used to print
* records.
- *
- * Requires the console_srcu_read_lock.
*/
-static inline bool console_is_usable(struct console *con)
+static inline bool console_is_usable(struct console *con, short flags)
{
- short flags = console_srcu_read_flags(con);
-
if (!(flags & CON_ENABLED))
return false;

@@ -143,7 +139,7 @@ static inline bool nbcon_alloc(struct console *con) { return false; }
static inline void nbcon_init(struct console *con) { }
static inline void nbcon_free(struct console *con) { }

-static inline bool console_is_usable(struct console *con) { return false; }
+static inline bool console_is_usable(struct console *con, short flags) { return false; }

#endif /* CONFIG_PRINTK */

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 9ea51cb2aca9..fe06856f7653 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2942,9 +2942,10 @@ static bool console_flush_all(bool do_cond_resched, u64 *next_seq, bool *handove

cookie = console_srcu_read_lock();
for_each_console_srcu(con) {
+ short flags = console_srcu_read_flags(con);
bool progress;

- if (!console_is_usable(con))
+ if (!console_is_usable(con, flags))
continue;
any_usable = true;

@@ -3814,7 +3815,7 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
* that they make forward progress, so only increment
* @diff for usable consoles.
*/
- if (!console_is_usable(c))
+ if (!console_is_usable(c, flags))
continue;

if (flags & CON_NBCON) {
--
2.39.2


2024-04-02 22:15:05

by John Ogness

Subject: [PATCH printk v4 18/27] printk: nbcon: Assign priority based on CPU state

Use the current state of the CPU to determine which priority to
assign to the printing context.

The EMERGENCY priority handling is added in a follow-up commit.
It will use a per-CPU variable.

Note: nbcon_driver_acquire(), which is used by console drivers
to acquire the nbcon console for non-printing activities,
will always use NORMAL priority.

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/internal.h | 2 ++
kernel/printk/nbcon.c | 20 ++++++++++++++++++--
2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index acf53c35b7a0..bcf2105a5c5c 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -86,6 +86,7 @@ void nbcon_seq_force(struct console *con, u64 seq);
bool nbcon_alloc(struct console *con);
void nbcon_init(struct console *con);
void nbcon_free(struct console *con);
+enum nbcon_prio nbcon_get_default_prio(void);
void nbcon_atomic_flush_pending(void);
bool nbcon_legacy_emit_next_record(struct console *con, bool *handover,
int cookie);
@@ -143,6 +144,7 @@ static inline void nbcon_seq_force(struct console *con, u64 seq) { }
static inline bool nbcon_alloc(struct console *con) { return false; }
static inline void nbcon_init(struct console *con) { }
static inline void nbcon_free(struct console *con) { }
+static inline enum nbcon_prio nbcon_get_default_prio(void) { return NBCON_PRIO_NONE; }
static inline void nbcon_atomic_flush_pending(void) { }
static inline bool nbcon_legacy_emit_next_record(struct console *con, bool *handover,
int cookie) { return false; }
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index 599fff3c0ab3..fe5a96ab1f40 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -967,6 +967,22 @@ static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
return ctxt->backlog;
}

+/**
+ * nbcon_get_default_prio - The appropriate nbcon priority to use for nbcon
+ * printing on the current CPU
+ *
+ * Context: Any context which could not be migrated to another CPU.
+ * Return: The nbcon_prio to use for acquiring an nbcon console in this
+ * context for printing.
+ */
+enum nbcon_prio nbcon_get_default_prio(void)
+{
+ if (this_cpu_in_panic())
+ return NBCON_PRIO_PANIC;
+
+ return NBCON_PRIO_NORMAL;
+}
+
/**
* nbcon_legacy_emit_next_record - Print one record for an nbcon console
* in legacy contexts
@@ -1001,7 +1017,7 @@ bool nbcon_legacy_emit_next_record(struct console *con, bool *handover,
stop_critical_timings();

ctxt->console = con;
- ctxt->prio = NBCON_PRIO_NORMAL;
+ ctxt->prio = nbcon_get_default_prio();

progress = nbcon_atomic_emit_one(&wctxt);

@@ -1032,7 +1048,7 @@ static bool __nbcon_atomic_flush_pending_con(struct console *con, u64 stop_seq)

ctxt->console = con;
ctxt->spinwait_max_us = 2000;
- ctxt->prio = NBCON_PRIO_NORMAL;
+ ctxt->prio = nbcon_get_default_prio();

if (!nbcon_context_try_acquire(ctxt))
return false;
--
2.39.2


2024-04-02 22:15:04

by John Ogness

Subject: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in console_flush_all()

Allow nbcon consoles to print messages in the legacy printk()
caller context (printing via unlock) by integrating them into
console_flush_all(). The write_atomic() callback is used for
printing.

Provide nbcon_legacy_emit_next_record(), which acts as the
nbcon variant of console_emit_next_record(). Call this variant
within console_flush_all() for nbcon consoles. Since nbcon
consoles use their own @nbcon_seq variable to track the next
record to print, this also must be appropriately handled.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/internal.h | 6 ++++
kernel/printk/nbcon.c | 77 ++++++++++++++++++++++++++++++++++++++++
kernel/printk/printk.c | 17 ++++++---
3 files changed, 95 insertions(+), 5 deletions(-)

diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index a8df764fd0c5..acf53c35b7a0 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -78,6 +78,8 @@ void defer_console_output(void);

u16 printk_parse_prefix(const char *text, int *level,
enum printk_info_flags *flags);
+void console_lock_spinning_enable(void);
+int console_lock_spinning_disable_and_check(int cookie);

u64 nbcon_seq_read(struct console *con);
void nbcon_seq_force(struct console *con, u64 seq);
@@ -85,6 +87,8 @@ bool nbcon_alloc(struct console *con);
void nbcon_init(struct console *con);
void nbcon_free(struct console *con);
void nbcon_atomic_flush_pending(void);
+bool nbcon_legacy_emit_next_record(struct console *con, bool *handover,
+ int cookie);

/*
* Check if the given console is currently capable and allowed to print
@@ -140,6 +144,8 @@ static inline bool nbcon_alloc(struct console *con) { return false; }
static inline void nbcon_init(struct console *con) { }
static inline void nbcon_free(struct console *con) { }
static inline void nbcon_atomic_flush_pending(void) { }
+static inline bool nbcon_legacy_emit_next_record(struct console *con, bool *handover,
+ int cookie) { return false; }

static inline bool console_is_usable(struct console *con, short flags) { return false; }

diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index fcdab2eaaedb..599fff3c0ab3 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -541,6 +541,7 @@ static struct printk_buffers panic_nbcon_pbufs;
* nbcon_context_try_acquire - Try to acquire nbcon console
* @ctxt: The context of the caller
*
+ * Context: Any context which could not be migrated to another CPU.
* Return: True if the console was acquired. False otherwise.
*
* If the caller allowed an unsafe hostile takeover, on success the
@@ -935,6 +936,82 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
return nbcon_context_exit_unsafe(ctxt);
}

+/**
+ * nbcon_atomic_emit_one - Print one record for an nbcon console using the
+ * write_atomic() callback
+ * @wctxt: An initialized write context struct to use for this context
+ *
+ * Return: False if it is known there are no more records to print,
+ * otherwise true.
+ *
+ * This is an internal helper to handle the locking of the console before
+ * calling nbcon_emit_next_record().
+ */
+static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
+{
+ struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
+
+ if (!nbcon_context_try_acquire(ctxt))
+ return true;
+
+ /*
+ * nbcon_emit_next_record() returns false when the console was
+ * handed over or taken over. In both cases the context is no
+ * longer valid.
+ */
+ if (!nbcon_emit_next_record(wctxt))
+ return true;
+
+ nbcon_context_release(ctxt);
+
+ return ctxt->backlog;
+}
+
+/**
+ * nbcon_legacy_emit_next_record - Print one record for an nbcon console
+ * in legacy contexts
+ * @con: The console to print on
+ * @handover: Will be set to true if a printk waiter has taken over the
+ * console_lock, in which case the caller is no longer holding
+ * both the console_lock and the SRCU read lock. Otherwise it
+ * is set to false.
+ * @cookie: The cookie from the SRCU read lock.
+ *
+ * Context: Any context except NMI.
+ * Return: False if the given console has no next record to print,
+ * otherwise true.
+ *
+ * This function is meant to be called by console_flush_all() to print records
+ * on nbcon consoles from legacy context (printing via console unlocking).
+ * Essentially it is the nbcon version of console_emit_next_record().
+ */
+bool nbcon_legacy_emit_next_record(struct console *con, bool *handover,
+ int cookie)
+{
+ struct nbcon_write_context wctxt = { };
+ struct nbcon_context *ctxt = &ACCESS_PRIVATE(&wctxt, ctxt);
+ unsigned long flags;
+ bool progress;
+
+ *handover = false;
+
+ /* Use the same procedure as console_emit_next_record(). */
+ printk_safe_enter_irqsave(flags);
+ console_lock_spinning_enable();
+ stop_critical_timings();
+
+ ctxt->console = con;
+ ctxt->prio = NBCON_PRIO_NORMAL;
+
+ progress = nbcon_atomic_emit_one(&wctxt);
+
+ start_critical_timings();
+ *handover = console_lock_spinning_disable_and_check(cookie);
+ printk_safe_exit_irqrestore(flags);
+
+ return progress;
+}
+
/**
* __nbcon_atomic_flush_pending_con - Flush specified nbcon console using its
* write_atomic() callback
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index a1b3309e12c1..df84c6bfbb2d 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1860,7 +1860,7 @@ static bool console_waiter;
* there may be a waiter spinning (like a spinlock). Also it must be
* ready to hand over the lock at the end of the section.
*/
-static void console_lock_spinning_enable(void)
+void console_lock_spinning_enable(void)
{
/*
* Do not use spinning in panic(). The panic CPU wants to keep the lock.
@@ -1899,7 +1899,7 @@ static void console_lock_spinning_enable(void)
*
* Return: 1 if the lock rights were passed, 0 otherwise.
*/
-static int console_lock_spinning_disable_and_check(int cookie)
+int console_lock_spinning_disable_and_check(int cookie)
{
int waiter;

@@ -2951,13 +2951,20 @@ static bool console_flush_all(bool do_cond_resched, u64 *next_seq, bool *handove
cookie = console_srcu_read_lock();
for_each_console_srcu(con) {
short flags = console_srcu_read_flags(con);
+ u64 printk_seq;
bool progress;

if (!console_is_usable(con, flags))
continue;
any_usable = true;

- progress = console_emit_next_record(con, handover, cookie);
+ if (flags & CON_NBCON) {
+ progress = nbcon_legacy_emit_next_record(con, handover, cookie);
+ printk_seq = nbcon_seq_read(con);
+ } else {
+ progress = console_emit_next_record(con, handover, cookie);
+ printk_seq = con->seq;
+ }

/*
* If a handover has occurred, the SRCU read lock
@@ -2967,8 +2974,8 @@ static bool console_flush_all(bool do_cond_resched, u64 *next_seq, bool *handove
return false;

/* Track the next of the highest seq flushed. */
- if (con->seq > *next_seq)
- *next_seq = con->seq;
+ if (printk_seq > *next_seq)
+ *next_seq = printk_seq;

if (!progress)
continue;
--
2.39.2


2024-04-02 22:15:09

by John Ogness

Subject: [PATCH printk v4 16/27] printk: Track registered boot consoles

Unfortunately it is not known if a boot console and a regular
(legacy or nbcon) console use the same hardware. For this reason
they must not be allowed to print simultaneously.

For legacy consoles this is not an issue because they are
already synchronized with the boot consoles using the console
lock. However, nbcon consoles can be triggered separately.

Add a global flag @have_boot_console to identify if any boot
consoles are registered. This will be used in follow-up commits
to ensure that boot consoles and nbcon consoles cannot print
simultaneously.

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/printk.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 6404f2044ceb..a1b3309e12c1 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -463,6 +463,14 @@ static int console_msg_format = MSG_FORMAT_DEFAULT;
/* syslog_lock protects syslog_* variables and write access to clear_seq. */
static DEFINE_MUTEX(syslog_lock);

+/*
+ * Specifies if a boot console is registered. If boot consoles are present,
+ * nbcon consoles cannot print simultaneously and must be synchronized by
+ * the console lock. This is because boot consoles and nbcon consoles may
+ * have mapped the same hardware.
+ */
+static bool have_boot_console;
+
#ifdef CONFIG_PRINTK
DECLARE_WAIT_QUEUE_HEAD(log_wait);
/* All 3 protected by @syslog_lock. */
@@ -3513,6 +3521,9 @@ void register_console(struct console *newcon)
newcon->seq = 0;
}

+ if (newcon->flags & CON_BOOT)
+ have_boot_console = true;
+
/*
* If another context is actively using the hardware of this new
* console, it will not be aware of the nbcon synchronization. This
@@ -3582,6 +3593,8 @@ EXPORT_SYMBOL(register_console);
/* Must be called under console_list_lock(). */
static int unregister_console_locked(struct console *console)
{
+ bool found_boot_con = false;
+ struct console *c;
int res;

lockdep_assert_console_list_lock_held();
@@ -3629,6 +3642,17 @@ static int unregister_console_locked(struct console *console)
if (console->exit)
res = console->exit(console);

+ /*
+ * With this console gone, the global flags tracking registered
+ * console types may have changed. Update them.
+ */
+ for_each_console(c) {
+ if (c->flags & CON_BOOT)
+ found_boot_con = true;
+ }
+ if (!found_boot_con)
+ have_boot_console = found_boot_con;
+
return res;
}

--
2.39.2


2024-04-02 22:15:24

by John Ogness

Subject: [PATCH printk v4 19/27] printk: nbcon: Add unsafe flushing on panic

Add nbcon_atomic_flush_unsafe() to flush all nbcon consoles
using the write_atomic() callback and allowing unsafe hostile
takeovers. Call this at the end of panic() as a final attempt
to flush any pending messages.

Note that legacy consoles use unsafe methods for flushing
from the beginning of panic (see bust_spinlocks()). Therefore,
systems using both legacy and nbcon consoles may still fail to
see panic messages due to unsafe legacy console usage.

Signed-off-by: John Ogness <[email protected]>
---
include/linux/printk.h | 5 +++++
kernel/panic.c | 1 +
kernel/printk/nbcon.c | 26 +++++++++++++++++++++-----
3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index 0ad3ee752635..866683a293af 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -197,6 +197,7 @@ extern asmlinkage void dump_stack(void) __cold;
void printk_trigger_flush(void);
extern void nbcon_driver_acquire(struct console *con);
extern void nbcon_driver_release(struct console *con);
+void nbcon_atomic_flush_unsafe(void);
#else
static inline __printf(1, 0)
int vprintk(const char *s, va_list args)
@@ -285,6 +286,10 @@ static inline void nbcon_driver_release(struct console *con)
{
}

+static inline void nbcon_atomic_flush_unsafe(void)
+{
+}
+
#endif

bool this_cpu_in_panic(void);
diff --git a/kernel/panic.c b/kernel/panic.c
index f22d8f33ea14..c039f8e1ddae 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -453,6 +453,7 @@ void panic(const char *fmt, ...)
* Explicitly flush the kernel log buffer one last time.
*/
console_flush_on_panic(CONSOLE_FLUSH_PENDING);
+ nbcon_atomic_flush_unsafe();

local_irq_enable();
for (i = 0; ; i += PANIC_TIMER_STEP) {
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index fe5a96ab1f40..47f39402a22b 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -1033,6 +1033,7 @@ bool nbcon_legacy_emit_next_record(struct console *con, bool *handover,
* write_atomic() callback
* @con: The nbcon console to flush
* @stop_seq: Flush up until this record
+ * @allow_unsafe_takeover: True, to allow unsafe hostile takeovers
*
* Return: True if taken over while printing. Otherwise false.
*
@@ -1041,7 +1042,8 @@ bool nbcon_legacy_emit_next_record(struct console *con, bool *handover,
* there are no more records available to read or this context is not allowed
* to acquire the console.
*/
-static bool __nbcon_atomic_flush_pending_con(struct console *con, u64 stop_seq)
+static bool __nbcon_atomic_flush_pending_con(struct console *con, u64 stop_seq,
+ bool allow_unsafe_takeover)
{
struct nbcon_write_context wctxt = { };
struct nbcon_context *ctxt = &ACCESS_PRIVATE(&wctxt, ctxt);
@@ -1049,6 +1051,7 @@ static bool __nbcon_atomic_flush_pending_con(struct console *con, u64 stop_seq)
ctxt->console = con;
ctxt->spinwait_max_us = 2000;
ctxt->prio = nbcon_get_default_prio();
+ ctxt->allow_unsafe_takeover = allow_unsafe_takeover;

if (!nbcon_context_try_acquire(ctxt))
return false;
@@ -1075,8 +1078,9 @@ static bool __nbcon_atomic_flush_pending_con(struct console *con, u64 stop_seq)
* __nbcon_atomic_flush_pending - Flush all nbcon consoles using their
* write_atomic() callback
* @stop_seq: Flush up until this record
+ * @allow_unsafe_takeover: True, to allow unsafe hostile takeovers
*/
-static void __nbcon_atomic_flush_pending(u64 stop_seq)
+static void __nbcon_atomic_flush_pending(u64 stop_seq, bool allow_unsafe_takeover)
{
struct console *con;
bool should_retry;
@@ -1109,8 +1113,8 @@ static void __nbcon_atomic_flush_pending(u64 stop_seq)
*/
local_irq_save(irq_flags);

- should_retry |= __nbcon_atomic_flush_pending_con(con, stop_seq);
-
+ should_retry |= __nbcon_atomic_flush_pending_con(con, stop_seq,
+ allow_unsafe_takeover);
local_irq_restore(irq_flags);
}
console_srcu_read_unlock(cookie);
@@ -1127,7 +1131,19 @@ static void __nbcon_atomic_flush_pending(u64 stop_seq)
*/
void nbcon_atomic_flush_pending(void)
{
- __nbcon_atomic_flush_pending(prb_next_reserve_seq(prb));
+ __nbcon_atomic_flush_pending(prb_next_reserve_seq(prb), false);
+}
+
+/**
+ * nbcon_atomic_flush_unsafe - Flush all nbcon consoles using their
+ * write_atomic() callback and allowing unsafe hostile takeovers
+ *
+ * Flush the backlog up through the currently newest record. Unsafe hostile
+ * takeovers will be performed, if necessary.
+ */
+void nbcon_atomic_flush_unsafe(void)
+{
+ __nbcon_atomic_flush_pending(prb_next_reserve_seq(prb), true);
}

/**
--
2.39.2


2024-04-02 22:15:42

by John Ogness

[permalink] [raw]
Subject: [PATCH printk v4 21/27] printk: Track nbcon consoles

Add a global flag @have_nbcon_console to identify if any nbcon
consoles are registered. This will be used in follow-up commits
to preserve legacy behavior when no nbcon consoles are registered.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 4ff3800e8e8e..329b8507f8a0 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -470,6 +470,11 @@ static DEFINE_MUTEX(syslog_lock);
*/
static bool have_legacy_console;

+/*
+ * Specifies if an nbcon console is registered.
+ */
+static bool have_nbcon_console;
+
/*
* Specifies if a boot console is registered. If boot consoles are present,
* nbcon consoles cannot print simultaneously and must be synchronized by
@@ -3533,6 +3538,7 @@ void register_console(struct console *newcon)
console_init_seq(newcon, bootcon_registered);

if (newcon->flags & CON_NBCON) {
+ have_nbcon_console = true;
nbcon_init(newcon);

/*
@@ -3619,6 +3625,7 @@ EXPORT_SYMBOL(register_console);
static int unregister_console_locked(struct console *console)
{
bool found_legacy_con = false;
+ bool found_nbcon_con = false;
bool found_boot_con = false;
struct console *c;
int res;
@@ -3675,13 +3682,18 @@ static int unregister_console_locked(struct console *console)
for_each_console(c) {
if (c->flags & CON_BOOT)
found_boot_con = true;
- if (!(c->flags & CON_NBCON))
+
+ if (c->flags & CON_NBCON)
+ found_nbcon_con = true;
+ else
found_legacy_con = true;
}
if (!found_boot_con)
have_boot_console = found_boot_con;
if (!found_legacy_con)
have_legacy_console = found_legacy_con;
+ if (!found_nbcon_con)
+ have_nbcon_console = found_nbcon_con;

return res;
}
--
2.39.2


2024-04-02 22:15:43

by John Ogness

Subject: [PATCH printk v4 20/27] printk: Avoid console_lock dance if no legacy or boot consoles

Currently the console lock is used to attempt legacy-type
printing even if there are no legacy or boot consoles registered.
If no such consoles are registered, the console lock does not
need to be taken.

Add tracking of legacy console registration and combine it
with the existing boot console tracking to avoid unnecessary
code paths, i.e. do not take the console lock if there are
no boot consoles and no legacy consoles.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 77 +++++++++++++++++++++++++++++++++++-------
1 file changed, 64 insertions(+), 13 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index df84c6bfbb2d..4ff3800e8e8e 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -463,6 +463,13 @@ static int console_msg_format = MSG_FORMAT_DEFAULT;
/* syslog_lock protects syslog_* variables and write access to clear_seq. */
static DEFINE_MUTEX(syslog_lock);

+/*
+ * Specifies if a legacy console is registered. If legacy consoles are
+ * present, it is necessary to perform the console lock/unlock dance
+ * whenever console flushing should occur.
+ */
+static bool have_legacy_console;
+
/*
* Specifies if a boot console is registered. If boot consoles are present,
* nbcon consoles cannot print simultaneously and must be synchronized by
@@ -471,6 +478,14 @@ static DEFINE_MUTEX(syslog_lock);
*/
static bool have_boot_console;

+/*
+ * Specifies if the console lock/unlock dance is needed for console
+ * printing. If @have_boot_console is true, the nbcon consoles will
+ * be printed serially along with the legacy consoles because nbcon
+ * consoles cannot print simultaneously with boot consoles.
+ */
+#define printing_via_unlock (have_legacy_console || have_boot_console)
+
#ifdef CONFIG_PRINTK
DECLARE_WAIT_QUEUE_HEAD(log_wait);
/* All 3 protected by @syslog_lock. */
@@ -2339,7 +2354,7 @@ asmlinkage int vprintk_emit(int facility, int level,
printed_len = vprintk_store(facility, level, dev_info, fmt, args);

/* If called from the scheduler, we can not call up(). */
- if (!in_sched) {
+ if (!in_sched && printing_via_unlock) {
/*
* The caller may be holding system-critical or
* timing-sensitive locks. Disable preemption during
@@ -2648,7 +2663,7 @@ void resume_console(void)
*/
static int console_cpu_notify(unsigned int cpu)
{
- if (!cpuhp_tasks_frozen) {
+ if (!cpuhp_tasks_frozen && printing_via_unlock) {
/* If trylock fails, someone else is doing the printing */
if (console_trylock())
console_unlock();
@@ -3189,7 +3204,8 @@ void console_flush_on_panic(enum con_flush_mode mode)

nbcon_atomic_flush_pending();

- console_flush_all(false, &next_seq, &handover);
+ if (printing_via_unlock)
+ console_flush_all(false, &next_seq, &handover);
}

/*
@@ -3526,6 +3542,8 @@ void register_console(struct console *newcon)
*/
nbcon_seq_force(newcon, newcon->seq);
newcon->seq = 0;
+ } else {
+ have_legacy_console = true;
}

if (newcon->flags & CON_BOOT)
@@ -3600,6 +3618,7 @@ EXPORT_SYMBOL(register_console);
/* Must be called under console_list_lock(). */
static int unregister_console_locked(struct console *console)
{
+ bool found_legacy_con = false;
bool found_boot_con = false;
struct console *c;
int res;
@@ -3656,9 +3675,13 @@ static int unregister_console_locked(struct console *console)
for_each_console(c) {
if (c->flags & CON_BOOT)
found_boot_con = true;
+ if (!(c->flags & CON_NBCON))
+ found_legacy_con = true;
}
if (!found_boot_con)
have_boot_console = found_boot_con;
+ if (!found_legacy_con)
+ have_legacy_console = found_legacy_con;

return res;
}
@@ -3810,6 +3833,7 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
u64 last_diff = 0;
u64 printk_seq;
short flags;
+ bool locked;
int cookie;
u64 diff;
u64 seq;
@@ -3819,22 +3843,35 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
seq = prb_next_reserve_seq(prb);

/* Flush the consoles so that records up to @seq are printed. */
- console_lock();
- console_unlock();
+ if (printing_via_unlock) {
+ console_lock();
+ console_unlock();
+ }

for (;;) {
unsigned long begin_jiffies;
unsigned long slept_jiffies;

+ locked = false;
diff = 0;

+ if (printing_via_unlock) {
+ /*
+ * Hold the console_lock to guarantee safe access to
+ * console->seq. Releasing console_lock flushes more
+ * records in case @seq is still not printed on all
+ * usable consoles.
+ */
+ console_lock();
+ locked = true;
+ }
+
/*
- * Hold the console_lock to guarantee safe access to
- * console->seq. Releasing console_lock flushes more
- * records in case @seq is still not printed on all
- * usable consoles.
+ * Ensure the compiler does not optimize @locked to be
+ * @printing_via_unlock since the latter can change at any
+ * time.
*/
- console_lock();
+ barrier();

cookie = console_srcu_read_lock();
for_each_console_srcu(c) {
@@ -3854,6 +3891,7 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
if (flags & CON_NBCON) {
printk_seq = nbcon_seq_read(c);
} else {
+ WARN_ON_ONCE(!locked);
printk_seq = c->seq;
}

@@ -3865,7 +3903,8 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
if (diff != last_diff && reset_on_progress)
remaining_jiffies = timeout_jiffies;

- console_unlock();
+ if (locked)
+ console_unlock();

/* Note: @diff is 0 if there are no usable consoles. */
if (diff == 0 || remaining_jiffies == 0)
@@ -3935,6 +3974,7 @@ static void __wake_up_klogd(int val)
return;

preempt_disable();
+
/*
* Guarantee any new records can be seen by tasks preparing to wait
* before this context checks if the wait queue is empty.
@@ -3946,11 +3986,22 @@ static void __wake_up_klogd(int val)
*
* This pairs with devkmsg_read:A and syslog_print:A.
*/
- if (wq_has_sleeper(&log_wait) || /* LMM(__wake_up_klogd:A) */
- (val & PRINTK_PENDING_OUTPUT)) {
+ if (!wq_has_sleeper(&log_wait)) /* LMM(__wake_up_klogd:A) */
+ val &= ~PRINTK_PENDING_WAKEUP;
+
+ /*
+ * Simple read is safe. register_console() would flush a newly
+ * registered legacy console when writing the message about it
+ * being enabled.
+ */
+ if (!printing_via_unlock)
+ val &= ~PRINTK_PENDING_OUTPUT;
+
+ if (val) {
this_cpu_or(printk_pending, val);
irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));
}
+
preempt_enable();
}

--
2.39.2


2024-04-02 22:16:08

by John Ogness

Subject: [PATCH printk v4 22/27] printk: Coordinate direct printing in panic

Perform printing by nbcon consoles on the panic CPU from the
printk() caller context in order to get panic messages printed
as soon as possible.

If legacy and nbcon consoles are registered, the legacy consoles
will no longer perform direct printing on the panic CPU until
after the backtrace has been stored. This will give the safe
nbcon consoles a chance to print the panic messages before
allowing the unsafe legacy consoles to print.

If no nbcon consoles are registered, there is no change in
behavior (i.e. legacy consoles will always attempt to print
from the printk() caller context).

Signed-off-by: John Ogness <[email protected]>
---
include/linux/printk.h | 5 ++++
kernel/panic.c | 2 ++
kernel/printk/printk.c | 62 ++++++++++++++++++++++++++++++++++++------
3 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index 866683a293af..2aa7c302d616 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -195,6 +195,7 @@ void show_regs_print_info(const char *log_lvl);
extern asmlinkage void dump_stack_lvl(const char *log_lvl) __cold;
extern asmlinkage void dump_stack(void) __cold;
void printk_trigger_flush(void);
+void printk_legacy_allow_panic_sync(void);
extern void nbcon_driver_acquire(struct console *con);
extern void nbcon_driver_release(struct console *con);
void nbcon_atomic_flush_unsafe(void);
@@ -278,6 +279,10 @@ static inline void printk_trigger_flush(void)
{
}

+static inline void printk_legacy_allow_panic_sync(void)
+{
+}
+
static inline void nbcon_driver_acquire(struct console *con)
{
}
diff --git a/kernel/panic.c b/kernel/panic.c
index c039f8e1ddae..de8115c829cf 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -364,6 +364,8 @@ void panic(const char *fmt, ...)

panic_other_cpus_shutdown(_crash_kexec_post_notifiers);

+ printk_legacy_allow_panic_sync();
+
/*
* Run any panic handlers, including those that might need to
* add information to the kmsg dump output.
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 329b8507f8a0..e3fd157d7ba3 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -471,7 +471,9 @@ static DEFINE_MUTEX(syslog_lock);
static bool have_legacy_console;

/*
- * Specifies if an nbcon console is registered.
+ * Specifies if an nbcon console is registered. If nbcon consoles are present,
+ * synchronous printing of legacy consoles will not occur during panic until
+ * the backtrace has been stored to the ringbuffer.
*/
static bool have_nbcon_console;

@@ -2330,12 +2332,29 @@ int vprintk_store(int facility, int level,
return ret;
}

+static bool legacy_allow_panic_sync;
+
+/*
+ * This acts as a one-way switch to allow legacy consoles to print from
+ * the printk() caller context on a panic CPU. It also attempts to flush
+ * the legacy consoles in this context.
+ */
+void printk_legacy_allow_panic_sync(void)
+{
+ legacy_allow_panic_sync = true;
+
+ if (printing_via_unlock && !in_nmi()) {
+ if (console_trylock())
+ console_unlock();
+ }
+}
+
asmlinkage int vprintk_emit(int facility, int level,
const struct dev_printk_info *dev_info,
const char *fmt, va_list args)
{
+ bool do_trylock_unlock = printing_via_unlock;
int printed_len;
- bool in_sched = false;

/* Suppress unimportant messages after panic happens */
if (unlikely(suppress_printk))
@@ -2351,15 +2370,42 @@ asmlinkage int vprintk_emit(int facility, int level,

if (level == LOGLEVEL_SCHED) {
level = LOGLEVEL_DEFAULT;
- in_sched = true;
+ /* If called from the scheduler, we can not call up(). */
+ do_trylock_unlock = false;
}

printk_delay(level);

printed_len = vprintk_store(facility, level, dev_info, fmt, args);

- /* If called from the scheduler, we can not call up(). */
- if (!in_sched && printing_via_unlock) {
+ if (have_nbcon_console && !have_boot_console) {
+ bool is_panic_context = this_cpu_in_panic();
+
+ /*
+ * In panic, the legacy consoles are not allowed to print from
+ * the printk calling context unless explicitly allowed. This
+ * gives the safe nbcon consoles a chance to print out all the
+ * panic messages first. This restriction only applies if
+ * there are nbcon consoles registered.
+ */
+ if (is_panic_context)
+ do_trylock_unlock &= legacy_allow_panic_sync;
+
+ /*
+ * There are situations where nbcon atomic printing should
+ * happen in the printk() caller context:
+ *
+ * - When this CPU is in panic.
+ *
+ * Note that if boot consoles are registered, the console
+ * lock/unlock dance must be relied upon instead because nbcon
+ * consoles cannot print simultaneously with boot consoles.
+ */
+ if (is_panic_context)
+ nbcon_atomic_flush_pending();
+ }
+
+ if (do_trylock_unlock) {
/*
* The caller may be holding system-critical or
* timing-sensitive locks. Disable preemption during
@@ -2379,10 +2425,10 @@ asmlinkage int vprintk_emit(int facility, int level,
preempt_enable();
}

- if (in_sched)
- defer_console_output();
- else
+ if (do_trylock_unlock)
wake_up_klogd();
+ else
+ defer_console_output();

return printed_len;
}
--
2.39.2


2024-04-02 22:16:19

by John Ogness

Subject: [PATCH printk v4 24/27] panic: Mark emergency section in warn

From: Thomas Gleixner <[email protected]>

Mark the full contents of __warn() as an emergency section. In
this section, the CPU will not perform console output for the
printk() calls. Instead, a flushing of the console output is
triggered when exiting the emergency section.

Co-developed-by: John Ogness <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Signed-off-by: Thomas Gleixner (Intel) <[email protected]>
---
kernel/panic.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/panic.c b/kernel/panic.c
index de8115c829cf..ee03193f9495 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -667,6 +667,8 @@ struct warn_args {
void __warn(const char *file, int line, void *caller, unsigned taint,
struct pt_regs *regs, struct warn_args *args)
{
+ nbcon_cpu_emergency_enter();
+
disable_trace_on_warning();

if (file)
@@ -697,6 +699,8 @@ void __warn(const char *file, int line, void *caller, unsigned taint,

/* Just a warning, don't kill lockdep. */
add_taint(taint, LOCKDEP_STILL_OK);
+
+ nbcon_cpu_emergency_exit();
}

#ifdef CONFIG_BUG
--
2.39.2


2024-04-02 22:16:27

by John Ogness

Subject: [PATCH printk v4 23/27] printk: nbcon: Implement emergency sections

From: Thomas Gleixner <[email protected]>

In emergency situations (something has gone wrong but the
system continues to operate), important information (such
as a backtrace) is usually generated via printk(). Each
individual printk record has little meaning on its own. It
is the collection of printk messages that is most often
needed by developers and users.

To help ensure that all printk() messages of an emergency
situation are stored to the ringbuffer as quickly as
possible, disable console output for that CPU while it is
in the emergency situation. The consoles then need to be
flushed when exiting the emergency situation.

Add per-CPU emergency nesting tracking because an emergency
can arise while in an emergency situation.

Add functions to mark the beginning and end of emergency
sections where the urgent messages are generated.

Do not print if the current CPU is in an emergency state.

When exiting all emergency nesting, flush nbcon consoles
directly using their atomic callback. Legacy console flushing
is triggered via irq_work because it is not known whether the
current context is safe for a trylock on the console lock.

Note that the emergency state is not system-wide. While one CPU
is in an emergency state, another CPU may continue to print
console messages.

Co-developed-by: John Ogness <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Signed-off-by: Thomas Gleixner (Intel) <[email protected]>
---
include/linux/console.h | 4 ++
kernel/printk/nbcon.c | 83 +++++++++++++++++++++++++++++++++++++++++
kernel/printk/printk.c | 13 ++++++-
3 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index 5f1758891aec..7712e4145164 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -559,10 +559,14 @@ static inline bool console_is_registered(const struct console *con)
hlist_for_each_entry(con, &console_list, node)

#ifdef CONFIG_PRINTK
+extern void nbcon_cpu_emergency_enter(void);
+extern void nbcon_cpu_emergency_exit(void);
extern bool nbcon_can_proceed(struct nbcon_write_context *wctxt);
extern bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt);
extern bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt);
#else
+static inline void nbcon_cpu_emergency_enter(void) { }
+static inline void nbcon_cpu_emergency_exit(void) { }
static inline bool nbcon_can_proceed(struct nbcon_write_context *wctxt) { return false; }
static inline bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt) { return false; }
static inline bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt) { return false; }
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index 47f39402a22b..4c852c2e8d89 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -936,6 +936,29 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
return nbcon_context_exit_unsafe(ctxt);
}

+/* Track the nbcon emergency nesting per CPU. */
+static DEFINE_PER_CPU(unsigned int, nbcon_pcpu_emergency_nesting);
+static unsigned int early_nbcon_pcpu_emergency_nesting __initdata;
+
+/**
+ * nbcon_get_cpu_emergency_nesting - Get the per CPU emergency nesting pointer
+ *
+ * Return: Either a pointer to the per CPU emergency nesting counter of
+ * the current CPU or to the init data during early boot.
+ */
+static __ref unsigned int *nbcon_get_cpu_emergency_nesting(void)
+{
+ /*
+ * The value of __printk_percpu_data_ready gets set in normal
+ * context and before SMP initialization. As a result it could
+ * never change while inside an nbcon emergency section.
+ */
+ if (!printk_percpu_data_ready())
+ return &early_nbcon_pcpu_emergency_nesting;
+
+ return this_cpu_ptr(&nbcon_pcpu_emergency_nesting);
+}
+
/**
* nbcon_atomic_emit_one - Print one record for an nbcon console using the
* write_atomic() callback
@@ -977,9 +1000,15 @@ static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
*/
enum nbcon_prio nbcon_get_default_prio(void)
{
+ unsigned int *cpu_emergency_nesting;
+
if (this_cpu_in_panic())
return NBCON_PRIO_PANIC;

+ cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
+ if (*cpu_emergency_nesting)
+ return NBCON_PRIO_EMERGENCY;
+
return NBCON_PRIO_NORMAL;
}

@@ -1146,6 +1175,60 @@ void nbcon_atomic_flush_unsafe(void)
__nbcon_atomic_flush_pending(prb_next_reserve_seq(prb), true);
}

+/**
+ * nbcon_cpu_emergency_enter - Enter an emergency section where printk()
+ * messages for that CPU are only stored
+ *
+ * Upon exiting the emergency section, all stored messages are flushed.
+ *
+ * Context: Any context. Disables preemption.
+ *
+ * When within an emergency section, no printing occurs on that CPU. This
+ * is to allow all emergency messages to be dumped into the ringbuffer before
+ * flushing the ringbuffer. The actual printing occurs when exiting the
+ * outermost emergency section.
+ */
+void nbcon_cpu_emergency_enter(void)
+{
+ unsigned int *cpu_emergency_nesting;
+
+ preempt_disable();
+
+ cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
+ (*cpu_emergency_nesting)++;
+}
+
+/**
+ * nbcon_cpu_emergency_exit - Exit an emergency section and flush the
+ * stored messages
+ *
+ * Flushing only occurs when exiting all nesting for the CPU.
+ *
+ * Context: Any context. Enables preemption.
+ */
+void nbcon_cpu_emergency_exit(void)
+{
+ unsigned int *cpu_emergency_nesting;
+ bool do_trigger_flush = false;
+
+ cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
+
+ WARN_ON_ONCE(*cpu_emergency_nesting == 0);
+
+ if (*cpu_emergency_nesting == 1) {
+ nbcon_atomic_flush_pending();
+ do_trigger_flush = true;
+ }
+
+ /* Undo the nesting count of nbcon_cpu_emergency_enter(). */
+ (*cpu_emergency_nesting)--;
+
+ preempt_enable();
+
+ if (do_trigger_flush)
+ printk_trigger_flush();
+}
+
/**
* nbcon_alloc - Allocate buffers needed by the nbcon console
* @con: Console to allocate buffers for
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index e3fd157d7ba3..ab5dade1352d 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2412,16 +2412,25 @@ asmlinkage int vprintk_emit(int facility, int level,
* printing of all remaining records to all consoles so that
* this context can return as soon as possible. Hopefully
* another printk() caller will take over the printing.
+ *
+ * Also, nbcon_get_default_prio() requires migration disabled.
*/
preempt_disable();
+
/*
* Try to acquire and then immediately release the console
* semaphore. The release will print out buffers. With the
* spinning variant, this context tries to take over the
* printing from another printing context.
+ *
+ * Skip it in EMERGENCY priority. The console will be
+ * explicitly flushed when exiting the emergency section.
*/
- if (console_trylock_spinning())
- console_unlock();
+ if (nbcon_get_default_prio() != NBCON_PRIO_EMERGENCY) {
+ if (console_trylock_spinning())
+ console_unlock();
+ }
+
preempt_enable();
}

--
2.39.2


2024-04-02 22:16:41

by John Ogness

Subject: [PATCH printk v4 25/27] panic: Mark emergency section in oops

Mark an emergency section spanning from oops_enter() until the
end of oops_exit(). In this section, the CPU will not perform
console output for the printk() calls. Instead, a flushing of the
console output is triggered when exiting the emergency section.

The very end of oops_exit() performs a kmsg_dump(). This is not
included in the emergency section because it is another
flushing mechanism that should occur after the consoles have
been triggered to flush.

Signed-off-by: John Ogness <[email protected]>
---
kernel/panic.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/panic.c b/kernel/panic.c
index ee03193f9495..3754a2471b4f 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -634,6 +634,7 @@ bool oops_may_print(void)
*/
void oops_enter(void)
{
+ nbcon_cpu_emergency_enter();
tracing_off();
/* can't trust the integrity of the kernel anymore: */
debug_locks_off();
@@ -656,6 +657,7 @@ void oops_exit(void)
{
do_oops_enter_exit();
print_oops_end_marker();
+ nbcon_cpu_emergency_exit();
kmsg_dump(KMSG_DUMP_OOPS);
}

--
2.39.2


2024-04-02 22:18:59

by John Ogness

Subject: [PATCH printk v4 27/27] lockdep: Mark emergency sections in lockdep splats

Mark emergency sections wherever multiple lines of
lock debugging output are generated. In an emergency
section the CPU will not perform console output for the
printk() calls. Instead, a flushing of the console
output is triggered when exiting the emergency section.
This allows the full message block to be stored as
quickly as possible in the ringbuffer.

Signed-off-by: John Ogness <[email protected]>
---
kernel/locking/lockdep.c | 91 ++++++++++++++++++++++++++++++++++++++--
1 file changed, 88 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 151bd3de5936..80cfbe7b340e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -56,6 +56,7 @@
#include <linux/kprobes.h>
#include <linux/lockdep.h>
#include <linux/context_tracking.h>
+#include <linux/console.h>

#include <asm/sections.h>

@@ -574,8 +575,10 @@ static struct lock_trace *save_trace(void)
if (!debug_locks_off_graph_unlock())
return NULL;

+ nbcon_cpu_emergency_enter();
print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
dump_stack();
+ nbcon_cpu_emergency_exit();

return NULL;
}
@@ -782,6 +785,8 @@ static void lockdep_print_held_locks(struct task_struct *p)
{
int i, depth = READ_ONCE(p->lockdep_depth);

+ nbcon_cpu_emergency_enter();
+
if (!depth)
printk("no locks held by %s/%d.\n", p->comm, task_pid_nr(p));
else
@@ -792,11 +797,13 @@ static void lockdep_print_held_locks(struct task_struct *p)
* and it's not the current task.
*/
if (p != current && task_is_running(p))
- return;
+ goto out;
for (i = 0; i < depth; i++) {
printk(" #%d: ", i);
print_lock(p->held_locks + i);
}
+out:
+ nbcon_cpu_emergency_exit();
}

static void print_kernel_ident(void)
@@ -888,11 +895,13 @@ look_up_lock_class(const struct lockdep_map *lock, unsigned int subclass)
if (unlikely(subclass >= MAX_LOCKDEP_SUBCLASSES)) {
instrumentation_begin();
debug_locks_off();
+ nbcon_cpu_emergency_enter();
printk(KERN_ERR
"BUG: looking up invalid subclass: %u\n", subclass);
printk(KERN_ERR
"turning off the locking correctness validator.\n");
dump_stack();
+ nbcon_cpu_emergency_exit();
instrumentation_end();
return NULL;
}
@@ -969,11 +978,13 @@ static bool assign_lock_key(struct lockdep_map *lock)
else {
/* Debug-check: all keys must be persistent! */
debug_locks_off();
+ nbcon_cpu_emergency_enter();
pr_err("INFO: trying to register non-static key.\n");
pr_err("The code is fine but needs lockdep annotation, or maybe\n");
pr_err("you didn't initialize this object before use?\n");
pr_err("turning off the locking correctness validator.\n");
dump_stack();
+ nbcon_cpu_emergency_exit();
return false;
}

@@ -1317,8 +1328,10 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
return NULL;
}

+ nbcon_cpu_emergency_enter();
print_lockdep_off("BUG: MAX_LOCKDEP_KEYS too low!");
dump_stack();
+ nbcon_cpu_emergency_exit();
return NULL;
}
nr_lock_classes++;
@@ -1350,11 +1363,13 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
if (verbose(class)) {
graph_unlock();

+ nbcon_cpu_emergency_enter();
printk("\nnew class %px: %s", class->key, class->name);
if (class->name_version > 1)
printk(KERN_CONT "#%d", class->name_version);
printk(KERN_CONT "\n");
dump_stack();
+ nbcon_cpu_emergency_exit();

if (!graph_lock()) {
return NULL;
@@ -1393,8 +1408,10 @@ static struct lock_list *alloc_list_entry(void)
if (!debug_locks_off_graph_unlock())
return NULL;

+ nbcon_cpu_emergency_enter();
print_lockdep_off("BUG: MAX_LOCKDEP_ENTRIES too low!");
dump_stack();
+ nbcon_cpu_emergency_exit();
return NULL;
}
nr_list_entries++;
@@ -2040,6 +2057,8 @@ static noinline void print_circular_bug(struct lock_list *this,

depth = get_lock_depth(target);

+ nbcon_cpu_emergency_enter();
+
print_circular_bug_header(target, depth, check_src, check_tgt);

parent = get_lock_parent(target);
@@ -2058,6 +2077,8 @@ static noinline void print_circular_bug(struct lock_list *this,

printk("\nstack backtrace:\n");
dump_stack();
+
+ nbcon_cpu_emergency_exit();
}

static noinline void print_bfs_bug(int ret)
@@ -2570,6 +2591,8 @@ print_bad_irq_dependency(struct task_struct *curr,
if (!debug_locks_off_graph_unlock() || debug_locks_silent)
return;

+ nbcon_cpu_emergency_enter();
+
pr_warn("\n");
pr_warn("=====================================================\n");
pr_warn("WARNING: %s-safe -> %s-unsafe lock order detected\n",
@@ -2619,11 +2642,13 @@ print_bad_irq_dependency(struct task_struct *curr,
pr_warn(" and %s-irq-unsafe lock:\n", irqclass);
next_root->trace = save_trace();
if (!next_root->trace)
- return;
+ goto out;
print_shortest_lock_dependencies(forwards_entry, next_root);

pr_warn("\nstack backtrace:\n");
dump_stack();
+out:
+ nbcon_cpu_emergency_exit();
}

static const char *state_names[] = {
@@ -2988,6 +3013,8 @@ print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
if (!debug_locks_off_graph_unlock() || debug_locks_silent)
return;

+ nbcon_cpu_emergency_enter();
+
pr_warn("\n");
pr_warn("============================================\n");
pr_warn("WARNING: possible recursive locking detected\n");
@@ -3010,6 +3037,8 @@ print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,

pr_warn("\nstack backtrace:\n");
dump_stack();
+
+ nbcon_cpu_emergency_exit();
}

/*
@@ -3607,6 +3636,8 @@ static void print_collision(struct task_struct *curr,
struct held_lock *hlock_next,
struct lock_chain *chain)
{
+ nbcon_cpu_emergency_enter();
+
pr_warn("\n");
pr_warn("============================\n");
pr_warn("WARNING: chain_key collision\n");
@@ -3623,6 +3654,8 @@ static void print_collision(struct task_struct *curr,

pr_warn("\nstack backtrace:\n");
dump_stack();
+
+ nbcon_cpu_emergency_exit();
}
#endif

@@ -3713,8 +3746,10 @@ static inline int add_chain_cache(struct task_struct *curr,
if (!debug_locks_off_graph_unlock())
return 0;

+ nbcon_cpu_emergency_enter();
print_lockdep_off("BUG: MAX_LOCKDEP_CHAINS too low!");
dump_stack();
+ nbcon_cpu_emergency_exit();
return 0;
}
chain->chain_key = chain_key;
@@ -3731,8 +3766,10 @@ static inline int add_chain_cache(struct task_struct *curr,
if (!debug_locks_off_graph_unlock())
return 0;

+ nbcon_cpu_emergency_enter();
print_lockdep_off("BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!");
dump_stack();
+ nbcon_cpu_emergency_exit();
return 0;
}

@@ -3971,6 +4008,8 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this,
if (!debug_locks_off() || debug_locks_silent)
return;

+ nbcon_cpu_emergency_enter();
+
pr_warn("\n");
pr_warn("================================\n");
pr_warn("WARNING: inconsistent lock state\n");
@@ -3999,6 +4038,8 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this,

pr_warn("\nstack backtrace:\n");
dump_stack();
+
+ nbcon_cpu_emergency_exit();
}

/*
@@ -4033,6 +4074,8 @@ print_irq_inversion_bug(struct task_struct *curr,
if (!debug_locks_off_graph_unlock() || debug_locks_silent)
return;

+ nbcon_cpu_emergency_enter();
+
pr_warn("\n");
pr_warn("========================================================\n");
pr_warn("WARNING: possible irq lock inversion dependency detected\n");
@@ -4073,11 +4116,13 @@ print_irq_inversion_bug(struct task_struct *curr,
pr_warn("\nthe shortest dependencies between 2nd lock and 1st lock:\n");
root->trace = save_trace();
if (!root->trace)
- return;
+ goto out;
print_shortest_lock_dependencies(other, root);

pr_warn("\nstack backtrace:\n");
dump_stack();
+out:
+ nbcon_cpu_emergency_exit();
}

/*
@@ -4154,6 +4199,8 @@ void print_irqtrace_events(struct task_struct *curr)
{
const struct irqtrace_events *trace = &curr->irqtrace;

+ nbcon_cpu_emergency_enter();
+
printk("irq event stamp: %u\n", trace->irq_events);
printk("hardirqs last enabled at (%u): [<%px>] %pS\n",
trace->hardirq_enable_event, (void *)trace->hardirq_enable_ip,
@@ -4167,6 +4214,8 @@ void print_irqtrace_events(struct task_struct *curr)
printk("softirqs last disabled at (%u): [<%px>] %pS\n",
trace->softirq_disable_event, (void *)trace->softirq_disable_ip,
(void *)trace->softirq_disable_ip);
+
+ nbcon_cpu_emergency_exit();
}

static int HARDIRQ_verbose(struct lock_class *class)
@@ -4687,10 +4736,12 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
* We must printk outside of the graph_lock:
*/
if (ret == 2) {
+ nbcon_cpu_emergency_enter();
printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
print_lock(this);
print_irqtrace_events(curr);
dump_stack();
+ nbcon_cpu_emergency_exit();
}

return ret;
@@ -4731,6 +4782,8 @@ print_lock_invalid_wait_context(struct task_struct *curr,
if (debug_locks_silent)
return 0;

+ nbcon_cpu_emergency_enter();
+
pr_warn("\n");
pr_warn("=============================\n");
pr_warn("[ BUG: Invalid wait context ]\n");
@@ -4750,6 +4803,8 @@ print_lock_invalid_wait_context(struct task_struct *curr,
pr_warn("stack backtrace:\n");
dump_stack();

+ nbcon_cpu_emergency_exit();
+
return 0;
}

@@ -4954,6 +5009,8 @@ print_lock_nested_lock_not_held(struct task_struct *curr,
if (debug_locks_silent)
return;

+ nbcon_cpu_emergency_enter();
+
pr_warn("\n");
pr_warn("==================================\n");
pr_warn("WARNING: Nested lock was not taken\n");
@@ -4974,6 +5031,8 @@ print_lock_nested_lock_not_held(struct task_struct *curr,

pr_warn("\nstack backtrace:\n");
dump_stack();
+
+ nbcon_cpu_emergency_exit();
}

static int __lock_is_held(const struct lockdep_map *lock, int read);
@@ -5019,11 +5078,13 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
debug_class_ops_inc(class);

if (very_verbose(class)) {
+ nbcon_cpu_emergency_enter();
printk("\nacquire class [%px] %s", class->key, class->name);
if (class->name_version > 1)
printk(KERN_CONT "#%d", class->name_version);
printk(KERN_CONT "\n");
dump_stack();
+ nbcon_cpu_emergency_exit();
}

/*
@@ -5150,6 +5211,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
#endif
if (unlikely(curr->lockdep_depth >= MAX_LOCK_DEPTH)) {
debug_locks_off();
+ nbcon_cpu_emergency_enter();
print_lockdep_off("BUG: MAX_LOCK_DEPTH too low!");
printk(KERN_DEBUG "depth: %i max: %lu!\n",
curr->lockdep_depth, MAX_LOCK_DEPTH);
@@ -5157,6 +5219,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
lockdep_print_held_locks(current);
debug_show_all_locks();
dump_stack();
+ nbcon_cpu_emergency_exit();

return 0;
}
@@ -5176,6 +5239,8 @@ static void print_unlock_imbalance_bug(struct task_struct *curr,
if (debug_locks_silent)
return;

+ nbcon_cpu_emergency_enter();
+
pr_warn("\n");
pr_warn("=====================================\n");
pr_warn("WARNING: bad unlock balance detected!\n");
@@ -5192,6 +5257,8 @@ static void print_unlock_imbalance_bug(struct task_struct *curr,

pr_warn("\nstack backtrace:\n");
dump_stack();
+
+ nbcon_cpu_emergency_exit();
}

static noinstr int match_held_lock(const struct held_lock *hlock,
@@ -5895,6 +5962,8 @@ static void print_lock_contention_bug(struct task_struct *curr,
if (debug_locks_silent)
return;

+ nbcon_cpu_emergency_enter();
+
pr_warn("\n");
pr_warn("=================================\n");
pr_warn("WARNING: bad contention detected!\n");
@@ -5911,6 +5980,8 @@ static void print_lock_contention_bug(struct task_struct *curr,

pr_warn("\nstack backtrace:\n");
dump_stack();
+
+ nbcon_cpu_emergency_exit();
}

static void
@@ -6524,6 +6595,8 @@ print_freed_lock_bug(struct task_struct *curr, const void *mem_from,
if (debug_locks_silent)
return;

+ nbcon_cpu_emergency_enter();
+
pr_warn("\n");
pr_warn("=========================\n");
pr_warn("WARNING: held lock freed!\n");
@@ -6536,6 +6609,8 @@ print_freed_lock_bug(struct task_struct *curr, const void *mem_from,

pr_warn("\nstack backtrace:\n");
dump_stack();
+
+ nbcon_cpu_emergency_exit();
}

static inline int not_in_range(const void* mem_from, unsigned long mem_len,
@@ -6582,6 +6657,8 @@ static void print_held_locks_bug(void)
if (debug_locks_silent)
return;

+ nbcon_cpu_emergency_enter();
+
pr_warn("\n");
pr_warn("====================================\n");
pr_warn("WARNING: %s/%d still has locks held!\n",
@@ -6591,6 +6668,8 @@ static void print_held_locks_bug(void)
lockdep_print_held_locks(current);
pr_warn("\nstack backtrace:\n");
dump_stack();
+
+ nbcon_cpu_emergency_exit();
}

void debug_check_no_locks_held(void)
@@ -6609,6 +6688,7 @@ void debug_show_all_locks(void)
pr_warn("INFO: lockdep is turned off.\n");
return;
}
+ nbcon_cpu_emergency_enter();
pr_warn("\nShowing all locks held in the system:\n");

rcu_read_lock();
@@ -6623,6 +6703,7 @@ void debug_show_all_locks(void)

pr_warn("\n");
pr_warn("=============================================\n\n");
+ nbcon_cpu_emergency_exit();
}
EXPORT_SYMBOL_GPL(debug_show_all_locks);
#endif
@@ -6648,6 +6729,7 @@ asmlinkage __visible void lockdep_sys_exit(void)
if (unlikely(curr->lockdep_depth)) {
if (!debug_locks_off())
return;
+ nbcon_cpu_emergency_enter();
pr_warn("\n");
pr_warn("================================================\n");
pr_warn("WARNING: lock held when returning to user space!\n");
@@ -6656,6 +6738,7 @@ asmlinkage __visible void lockdep_sys_exit(void)
pr_warn("%s/%d is leaving the kernel with locks still held!\n",
curr->comm, curr->pid);
lockdep_print_held_locks(curr);
+ nbcon_cpu_emergency_exit();
}

/*
@@ -6672,6 +6755,7 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
bool rcu = warn_rcu_enter();

/* Note: the following can be executed concurrently, so be careful. */
+ nbcon_cpu_emergency_enter();
pr_warn("\n");
pr_warn("=============================\n");
pr_warn("WARNING: suspicious RCU usage\n");
@@ -6710,6 +6794,7 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
lockdep_print_held_locks(curr);
pr_warn("\nstack backtrace:\n");
dump_stack();
+ nbcon_cpu_emergency_exit();
warn_rcu_exit(rcu);
}
EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
--
2.39.2


2024-04-02 22:22:55

by John Ogness

[permalink] [raw]
Subject: [PATCH printk v4 26/27] rcu: Mark emergency section in rcu stalls

Mark emergency sections wherever multiple lines of
RCU stall information are generated. In an emergency
section the CPU will not perform console output for the
printk() calls. Instead, flushing of the console output
is triggered when exiting the emergency section.
This allows the full message block to be stored as
quickly as possible in the ringbuffer.

Signed-off-by: John Ogness <[email protected]>
---
kernel/rcu/tree_exp.h | 7 +++++++
kernel/rcu/tree_stall.h | 9 +++++++++
2 files changed, 16 insertions(+)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6b83537480b1..0a135e94da08 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -7,6 +7,7 @@
* Authors: Paul E. McKenney <[email protected]>
*/

+#include <linux/console.h>
#include <linux/lockdep.h>

static void rcu_exp_handler(void *unused);
@@ -571,6 +572,9 @@ static void synchronize_rcu_expedited_wait(void)
return;
if (rcu_stall_is_suppressed())
continue;
+
+ nbcon_cpu_emergency_enter();
+
j = jiffies;
rcu_stall_notifier_call_chain(RCU_STALL_NOTIFY_EXP, (void *)(j - jiffies_start));
trace_rcu_stall_warning(rcu_state.name, TPS("ExpeditedStall"));
@@ -624,6 +628,9 @@ static void synchronize_rcu_expedited_wait(void)
rcu_exp_print_detail_task_stall_rnp(rnp);
}
jiffies_stall = 3 * rcu_exp_jiffies_till_stall_check() + 3;
+
+ nbcon_cpu_emergency_exit();
+
panic_on_rcu_stall();
}
}
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 5d666428546b..f4d73ca20c76 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -7,6 +7,7 @@
* Author: Paul E. McKenney <[email protected]>
*/

+#include <linux/console.h>
#include <linux/kvm_para.h>
#include <linux/rcu_notifier.h>

@@ -604,6 +605,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps)
if (rcu_stall_is_suppressed())
return;

+ nbcon_cpu_emergency_enter();
+
/*
* OK, time to rat on our buddy...
* See Documentation/RCU/stallwarn.rst for info on how to debug
@@ -655,6 +658,8 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps)
rcu_check_gp_kthread_expired_fqs_timer();
rcu_check_gp_kthread_starvation();

+ nbcon_cpu_emergency_exit();
+
panic_on_rcu_stall();

rcu_force_quiescent_state(); /* Kick them all. */
@@ -675,6 +680,8 @@ static void print_cpu_stall(unsigned long gps)
if (rcu_stall_is_suppressed())
return;

+ nbcon_cpu_emergency_enter();
+
/*
* OK, time to rat on ourselves...
* See Documentation/RCU/stallwarn.rst for info on how to debug
@@ -703,6 +710,8 @@ static void print_cpu_stall(unsigned long gps)
jiffies + 3 * rcu_jiffies_till_stall_check() + 3);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

+ nbcon_cpu_emergency_exit();
+
panic_on_rcu_stall();

/*
--
2.39.2


2024-04-02 22:24:37

by John Ogness

[permalink] [raw]
Subject: [PATCH printk v4 01/27] printk: Add notation to console_srcu locking

kernel/printk/printk.c:284:5: sparse: sparse: context imbalance in
'console_srcu_read_lock' - wrong count at exit
include/linux/srcu.h:301:9: sparse: sparse: context imbalance in
'console_srcu_read_unlock' - unexpected unlock

Fixes: 6c4afa79147e ("printk: Prepare for SRCU console list protection")
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 612c73333848..c7c0ee2b47eb 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -282,6 +282,7 @@ EXPORT_SYMBOL(console_list_unlock);
* Return: A cookie to pass to console_srcu_read_unlock().
*/
int console_srcu_read_lock(void)
+ __acquires(&console_srcu)
{
return srcu_read_lock_nmisafe(&console_srcu);
}
@@ -295,6 +296,7 @@ EXPORT_SYMBOL(console_srcu_read_lock);
* Counterpart to console_srcu_read_lock()
*/
void console_srcu_read_unlock(int cookie)
+ __releases(&console_srcu)
{
srcu_read_unlock_nmisafe(&console_srcu, cookie);
}
--
2.39.2


2024-04-02 22:25:00

by John Ogness

[permalink] [raw]
Subject: [PATCH printk v4 03/27] printk: nbcon: Remove return value for write_atomic()

The return value of write_atomic() does not provide any useful
information. On the contrary, it only makes it more complicated
for the caller to handle the result appropriately.

Change write_atomic() to not have a return value. If the
message did not get printed due to loss of ownership, the
caller will notice this on its own. If ownership was not lost,
it will be assumed that the driver successfully printed the
message and the sequence number for that console will be
incremented.

Signed-off-by: John Ogness <[email protected]>
---
include/linux/console.h | 2 +-
kernel/printk/nbcon.c | 15 +++++++--------
2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index 779d388af8a0..54b98e4f0544 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -327,7 +327,7 @@ struct console {
struct hlist_node node;

/* nbcon console specific members */
- bool (*write_atomic)(struct console *con,
+ void (*write_atomic)(struct console *con,
struct nbcon_write_context *wctxt);
atomic_t __private nbcon_state;
atomic_long_t __private nbcon_seq;
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index d741659d26ec..2516449f921d 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -849,7 +849,6 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
unsigned long con_dropped;
struct nbcon_state cur;
unsigned long dropped;
- bool done;

/*
* The printk buffers are filled within an unsafe section. This
@@ -889,16 +888,16 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
wctxt->unsafe_takeover = cur.unsafe_takeover;

if (con->write_atomic) {
- done = con->write_atomic(con, wctxt);
+ con->write_atomic(con, wctxt);
} else {
- nbcon_context_release(ctxt);
+ /*
+ * This function should never be called for legacy consoles.
+ * Handle it as if ownership was lost and try to continue.
+ */
WARN_ON_ONCE(1);
- done = false;
- }
-
- /* If not done, the emit was aborted. */
- if (!done)
+ nbcon_context_release(ctxt);
return false;
+ }

/*
* Since any dropped message was successfully output, reset the
--
2.39.2


2024-04-02 22:26:37

by John Ogness

[permalink] [raw]
Subject: [PATCH printk v4 05/27] printk: nbcon: Add detailed doc for write_atomic()

The write_atomic() callback has special requirements and is
allowed to use special helper functions. Provide detailed
documentation of the callback so that a developer has a
chance of implementing it correctly.

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
include/linux/console.h | 31 +++++++++++++++++++++++++++----
1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index 54b98e4f0544..e4028d4079e1 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -285,7 +285,7 @@ struct nbcon_write_context {
/**
* struct console - The console descriptor structure
* @name: The name of the console driver
- * @write: Write callback to output messages (Optional)
+ * @write: Legacy write callback to output messages (Optional)
* @read: Read callback for console input (Optional)
* @device: The underlying TTY device driver (Optional)
* @unblank: Callback to unblank the console (Optional)
@@ -302,7 +302,6 @@ struct nbcon_write_context {
* @data: Driver private data
* @node: hlist node for the console list
*
- * @write_atomic: Write callback for atomic context
* @nbcon_state: State for nbcon consoles
* @nbcon_seq: Sequence number of the next record for nbcon to print
* @pbufs: Pointer to nbcon private buffer
@@ -327,8 +326,32 @@ struct console {
struct hlist_node node;

/* nbcon console specific members */
- void (*write_atomic)(struct console *con,
- struct nbcon_write_context *wctxt);
+
+ /**
+ * @write_atomic:
+ *
+ * NBCON callback to write out text in any context.
+ *
+ * This callback is called with the console already acquired. The
+ * callback can use nbcon_can_proceed() at any time to verify that
+ * it is still the owner of the console. In the case that it has
+ * lost ownership, it is no longer allowed to go forward. In this
+ * case it must back out immediately and carefully. The buffer
+ * content is also no longer trusted since it no longer belongs to
+ * the context.
+ *
+ * If the callback needs to perform actions where ownership is not
+ * allowed to be taken over, nbcon_enter_unsafe() and
+ * nbcon_exit_unsafe() can be used to mark such sections. These
+ * functions are also points of possible ownership transfer. If
+ * either function returns false, ownership has been lost.
+ *
+ * This callback can be called from any context (including NMI).
+ * Therefore it must avoid usage of any locking and instead rely
+ * on the console ownership for synchronization.
+ */
+ void (*write_atomic)(struct console *con, struct nbcon_write_context *wctxt);
+
atomic_t __private nbcon_state;
atomic_long_t __private nbcon_seq;
struct printk_buffers *pbufs;
--
2.39.2


2024-04-02 22:28:22

by John Ogness

[permalink] [raw]
Subject: [PATCH printk v4 13/27] printk: Let console_is_usable() handle nbcon

The nbcon consoles use a different printing callback. For nbcon
consoles, check for the write_atomic() callback instead of
write().

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/internal.h | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index bad22092cd5e..398e99a30427 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -87,6 +87,8 @@ void nbcon_free(struct console *con);

/*
* Check if the given console is currently capable and allowed to print
+ * records. Note that this function does not consider the current context,
+ * which can also play a role in deciding if @con can be used to print
* records.
*
* Requires the console_srcu_read_lock.
@@ -101,8 +103,13 @@ static inline bool console_is_usable(struct console *con)
if ((flags & CON_SUSPENDED))
return false;

- if (!con->write)
- return false;
+ if (flags & CON_NBCON) {
+ if (!con->write_atomic)
+ return false;
+ } else {
+ if (!con->write)
+ return false;
+ }

/*
* Console drivers may assume that per-cpu resources have been
--
2.39.2


2024-04-02 22:29:05

by John Ogness

[permalink] [raw]
Subject: [PATCH printk v4 15/27] printk: nbcon: Provide function to flush using write_atomic()

From: Thomas Gleixner <[email protected]>

Provide nbcon_atomic_flush_pending() to perform flushing of all
registered nbcon consoles using their write_atomic() callback.

Unlike console_flush_all(), nbcon_atomic_flush_pending() will
only flush up through the newest record at the time of the
call. This prevents a CPU from printing unbounded when other
CPUs are adding records.

Also unlike console_flush_all(), nbcon_atomic_flush_pending()
will fully flush one console before flushing the next. This
helps to guarantee that a block of pending records (such as
a stack trace in an emergency situation) can be printed
atomically at once before releasing console ownership.

nbcon_atomic_flush_pending() is safe in any context because it
uses write_atomic() and acquires with unsafe_takeover disabled.

Use it in console_flush_on_panic() before flushing legacy
consoles. The legacy write() callbacks are not fully safe when
oops_in_progress is set.

Co-developed-by: John Ogness <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Signed-off-by: Thomas Gleixner (Intel) <[email protected]>
---
kernel/printk/internal.h | 2 +
kernel/printk/nbcon.c | 104 ++++++++++++++++++++++++++++++++++++++-
kernel/printk/printk.c | 2 +
3 files changed, 106 insertions(+), 2 deletions(-)

diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index b7a0072eb2a4..a8df764fd0c5 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -84,6 +84,7 @@ void nbcon_seq_force(struct console *con, u64 seq);
bool nbcon_alloc(struct console *con);
void nbcon_init(struct console *con);
void nbcon_free(struct console *con);
+void nbcon_atomic_flush_pending(void);

/*
* Check if the given console is currently capable and allowed to print
@@ -138,6 +139,7 @@ static inline void nbcon_seq_force(struct console *con, u64 seq) { }
static inline bool nbcon_alloc(struct console *con) { return false; }
static inline void nbcon_init(struct console *con) { }
static inline void nbcon_free(struct console *con) { }
+static inline void nbcon_atomic_flush_pending(void) { }

static inline bool console_is_usable(struct console *con, short flags) { return false; }

diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index 1de6062d4ce3..fcdab2eaaedb 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -548,7 +548,6 @@ static struct printk_buffers panic_nbcon_pbufs;
* in an unsafe state. Otherwise, on success the caller may assume
* the console is not in an unsafe state.
*/
-__maybe_unused
static bool nbcon_context_try_acquire(struct nbcon_context *ctxt)
{
unsigned int cpu = smp_processor_id();
@@ -850,7 +849,6 @@ EXPORT_SYMBOL_GPL(nbcon_exit_unsafe);
* When true is returned, @wctxt->ctxt.backlog indicates whether there are
* still records pending in the ringbuffer,
*/
-__maybe_unused
static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
{
struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
@@ -937,6 +935,108 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
return nbcon_context_exit_unsafe(ctxt);
}

+/**
+ * __nbcon_atomic_flush_pending_con - Flush specified nbcon console using its
+ * write_atomic() callback
+ * @con: The nbcon console to flush
+ * @stop_seq: Flush up until this record
+ *
+ * Return: True if taken over while printing. Otherwise false.
+ *
+ * If flushing up to @stop_seq was not successful, it only makes sense for the
+ * caller to try again when true was returned. When false is returned, either
+ * there are no more records available to read or this context is not allowed
+ * to acquire the console.
+ */
+static bool __nbcon_atomic_flush_pending_con(struct console *con, u64 stop_seq)
+{
+ struct nbcon_write_context wctxt = { };
+ struct nbcon_context *ctxt = &ACCESS_PRIVATE(&wctxt, ctxt);
+
+ ctxt->console = con;
+ ctxt->spinwait_max_us = 2000;
+ ctxt->prio = NBCON_PRIO_NORMAL;
+
+ if (!nbcon_context_try_acquire(ctxt))
+ return false;
+
+ while (nbcon_seq_read(con) < stop_seq) {
+ /*
+ * nbcon_emit_next_record() returns false when the console was
+ * handed over or taken over. In both cases the context is no
+ * longer valid.
+ */
+ if (!nbcon_emit_next_record(&wctxt))
+ return true;
+
+ if (!ctxt->backlog)
+ break;
+ }
+
+ nbcon_context_release(ctxt);
+
+ return false;
+}
+
+/**
+ * __nbcon_atomic_flush_pending - Flush all nbcon consoles using their
+ * write_atomic() callback
+ * @stop_seq: Flush up until this record
+ */
+static void __nbcon_atomic_flush_pending(u64 stop_seq)
+{
+ struct console *con;
+ bool should_retry;
+ int cookie;
+
+ do {
+ should_retry = false;
+
+ cookie = console_srcu_read_lock();
+ for_each_console_srcu(con) {
+ short flags = console_srcu_read_flags(con);
+ unsigned long irq_flags;
+
+ if (!(flags & CON_NBCON))
+ continue;
+
+ if (!console_is_usable(con, flags))
+ continue;
+
+ if (nbcon_seq_read(con) >= stop_seq)
+ continue;
+
+ /*
+ * Atomic flushing does not use console driver
+ * synchronization (i.e. it does not hold the port
+ * lock for uart consoles). Therefore IRQs must be
+ * disabled to avoid being interrupted and then
+ * calling into a driver that will deadlock trying
+ * to acquire console ownership.
+ */
+ local_irq_save(irq_flags);
+
+ should_retry |= __nbcon_atomic_flush_pending_con(con, stop_seq);
+
+ local_irq_restore(irq_flags);
+ }
+ console_srcu_read_unlock(cookie);
+ } while (should_retry);
+}
+
+/**
+ * nbcon_atomic_flush_pending - Flush all nbcon consoles using their
+ * write_atomic() callback
+ *
+ * Flush the backlog up through the currently newest record. Any new
+ * records added while flushing will not be flushed. This is to avoid
+ * one CPU printing unbounded because other CPUs continue to add records.
+ */
+void nbcon_atomic_flush_pending(void)
+{
+ __nbcon_atomic_flush_pending(prb_next_reserve_seq(prb));
+}
+
/**
* nbcon_alloc - Allocate buffers needed by the nbcon console
* @con: Console to allocate buffers for
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index fe06856f7653..6404f2044ceb 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3172,6 +3172,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
console_srcu_read_unlock(cookie);
}

+ nbcon_atomic_flush_pending();
+
console_flush_all(false, &next_seq, &handover);
}

--
2.39.2


2024-04-03 10:03:38

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v4 20/27] printk: Avoid console_lock dance if no legacy or boot consoles

On 2024-04-03, John Ogness <[email protected]> wrote:
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index df84c6bfbb2d..4ff3800e8e8e 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3810,6 +3833,7 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
> u64 last_diff = 0;
> u64 printk_seq;
> short flags;
> + bool locked;
> int cookie;
> u64 diff;
> u64 seq;
> @@ -3819,22 +3843,35 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
> seq = prb_next_reserve_seq(prb);
>
> /* Flush the consoles so that records up to @seq are printed. */
> - console_lock();
> - console_unlock();
> + if (printing_via_unlock) {
> + console_lock();
> + console_unlock();
> + }
>
> for (;;) {
> unsigned long begin_jiffies;
> unsigned long slept_jiffies;
>
> + locked = false;
> diff = 0;
>
> + if (printing_via_unlock) {
> + /*
> + * Hold the console_lock to guarantee safe access to
> + * console->seq. Releasing console_lock flushes more
> + * records in case @seq is still not printed on all
> + * usable consoles.
> + */
> + console_lock();
> + locked = true;
> + }
> +
> /*
> - * Hold the console_lock to guarantee safe access to
> - * console->seq. Releasing console_lock flushes more
> - * records in case @seq is still not printed on all
> - * usable consoles.
> + * Ensure the compiler does not optimize @locked to be
> + * @printing_via_unlock since the latter can change at any
> + * time.
> */
> - console_lock();
> + barrier();

When I originally implemented this, my objective was to force the
compiler to use a local variable. But to be fully paranoid (as we
should be), we must also prevent the compiler from transforming
the code into this nonsense:

if (printing_via_unlock) {
console_lock();
locked = printing_via_unlock;
}

I suggest replacing the __pr_flush() hunks to be as follows instead
(i.e. ensure all conditional console lock usage within the loop is using
the local variable):

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index df84c6bfbb2d..1dbd2a837b67 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3819,22 +3842,34 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
seq = prb_next_reserve_seq(prb);

/* Flush the consoles so that records up to @seq are printed. */
- console_lock();
- console_unlock();
+ if (printing_via_unlock) {
+ console_lock();
+ console_unlock();
+ }

for (;;) {
unsigned long begin_jiffies;
unsigned long slept_jiffies;
-
- diff = 0;
+ bool use_console_lock = printing_via_unlock;

/*
- * Hold the console_lock to guarantee safe access to
- * console->seq. Releasing console_lock flushes more
- * records in case @seq is still not printed on all
- * usable consoles.
+ * Ensure the compiler does not optimize @use_console_lock to
+ * be @printing_via_unlock since the latter can change at any
+ * time.
*/
- console_lock();
+ barrier();
+
+ diff = 0;
+
+ if (use_console_lock) {
+ /*
+ * Hold the console_lock to guarantee safe access to
+ * console->seq. Releasing console_lock flushes more
+ * records in case @seq is still not printed on all
+ * usable consoles.
+ */
+ console_lock();
+ }

cookie = console_srcu_read_lock();
for_each_console_srcu(c) {
@@ -3854,6 +3889,7 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
if (flags & CON_NBCON) {
printk_seq = nbcon_seq_read(c);
} else {
+ WARN_ON_ONCE(!use_console_lock);
printk_seq = c->seq;
}

@@ -3865,7 +3901,8 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
if (diff != last_diff && reset_on_progress)
remaining_jiffies = timeout_jiffies;

- console_unlock();
+ if (use_console_lock)
+ console_unlock();

/* Note: @diff is 0 if there are no usable consoles. */
if (diff == 0 || remaining_jiffies == 0)

2024-04-03 10:35:17

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH printk v4 10/27] printk: nbcon: Do not rely on proxy headers

On Wed, Apr 03, 2024 at 12:17:12AM +0206, John Ogness wrote:
> The headers kernel.h, serial_core.h, and console.h allow for the
> definitions of many types and functions from other headers.
> Rather than relying on these as proxy headers, explicitly
> include all headers providing needed definitions. Also sort the
> list alphabetically to be able to easily detect duplicates.

Thank you! LGTM,

Reviewed-by: Andy Shevchenko <[email protected]>

--
With Best Regards,
Andy Shevchenko



2024-04-03 11:36:21

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v4 09/27] printk: nbcon: Implement processing in port->lock wrapper

On 2024-04-03, John Ogness <[email protected]> wrote:
> diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
> index d6a58a9e072a..2652b4d5c944 100644
> --- a/drivers/tty/serial/serial_core.c
> +++ b/drivers/tty/serial/serial_core.c
> @@ -3146,7 +3146,7 @@ static int serial_core_add_one_port(struct uart_driver *drv, struct uart_port *u
> uport->state = state;
>
> state->pm_state = UART_PM_STATE_UNDEFINED;
> - uport->cons = drv->cons;
> + uart_port_set_cons(uport, drv->cons);
> uport->minor = drv->tty_driver->minor_start + uport->line;
> uport->name = kasprintf(GFP_KERNEL, "%s%d", drv->dev_name,
> drv->tty_driver->name_base + uport->line);

Sebastian Siewior pointed out that the port lock is initialized shortly
after this code. Since uart_port_set_cons() uses the port lock, the
spinlock initialization must come first. The changes for serial_core.c
should be:

diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index d6a58a9e072a..0c13ea6a3afa 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -3145,8 +3145,15 @@ static int serial_core_add_one_port(struct uart_driver *drv, struct uart_port *u
state->uart_port = uport;
uport->state = state;

+ /*
+ * If this port is in use as a console then the spinlock is already
+ * initialised.
+ */
+ if (!uart_console_registered(uport))
+ uart_port_spin_lock_init(uport);
+
state->pm_state = UART_PM_STATE_UNDEFINED;
- uport->cons = drv->cons;
+ uart_port_set_cons(uport, drv->cons);
uport->minor = drv->tty_driver->minor_start + uport->line;
uport->name = kasprintf(GFP_KERNEL, "%s%d", drv->dev_name,
drv->tty_driver->name_base + uport->line);
@@ -3155,13 +3162,6 @@ static int serial_core_add_one_port(struct uart_driver *drv, struct uart_port *u
goto out;
}

- /*
- * If this port is in use as a console then the spinlock is already
- * initialised.
- */
- if (!uart_console_registered(uport))
- uart_port_spin_lock_init(uport);
-
if (uport->cons && uport->dev)
of_console_check(uport->dev->of_node, uport->cons->name, uport->line);


2024-04-05 15:38:06

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v4 02/27] printk: Properly deal with nbcon consoles on seq init

On Wed 2024-04-03 00:17:04, John Ogness wrote:
> If a non-boot console is registering and boot consoles exist, the
> consoles are flushed before being unregistered. This allows the
> non-boot console to continue where the boot console left off.
>
> If for whatever reason flushing fails, the lowest seq found from
> any of the enabled boot consoles is used. Until now con->seq was
> checked. However, if it is an nbcon boot console, the function
> nbcon_seq_read() must be used to read seq because con->seq is
> not updated for nbcon consoles.
>
> Check if it is an nbcon boot console and if so call
> nbcon_seq_read() to read seq.
>
> Also setup the nbcon sequence number and reset the legacy
> sequence number from register_console() (rather than in
> nbcon_init() and nbcon_seq_force()). This removes all legacy
> sequence handling from nbcon.c so the code is easier to follow
> and maintain.
>
> Signed-off-by: John Ogness <[email protected]>

Sigh, I wanted to wave this through. But then I ended up with
some doubts, see below.


> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -974,7 +969,7 @@ void nbcon_init(struct console *con)
> /* nbcon_alloc() must have been called and successful! */
> BUG_ON(!con->pbufs);
>
> - nbcon_seq_force(con, con->seq);
> + nbcon_seq_force(con, 0);
> nbcon_state_set(con, &state);
> }
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index c7c0ee2b47eb..b7e52b3f3e96 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3348,6 +3348,7 @@ static void try_enable_default_console(struct console *newcon)
> newcon->flags |= CON_CONSDEV;
> }
>
> +/* Set @newcon->seq to the first record this console should print. */
> static void console_init_seq(struct console *newcon, bool bootcon_registered)
> {
> struct console *con;
> @@ -3396,11 +3397,20 @@ static void console_init_seq(struct console *newcon, bool bootcon_registered)
>
> newcon->seq = prb_next_seq(prb);
> for_each_console(con) {
> - if ((con->flags & CON_BOOT) &&
> - (con->flags & CON_ENABLED) &&
> - con->seq < newcon->seq) {
> - newcon->seq = con->seq;
> + u64 seq;
> +
> + if (!((con->flags & CON_BOOT) &&
> + (con->flags & CON_ENABLED))) {
> + continue;
> }
> +
> + if (con->flags & CON_NBCON)
> + seq = nbcon_seq_read(con);
> + else
> + seq = con->seq;
> +
> + if (seq < newcon->seq)
> + newcon->seq = seq;
> }
> }
>
> @@ -3517,9 +3527,18 @@ void register_console(struct console *newcon)
> newcon->dropped = 0;
> console_init_seq(newcon, bootcon_registered);
>
> - if (newcon->flags & CON_NBCON)
> + if (newcon->flags & CON_NBCON) {
> nbcon_init(newcon);
>
> + /*
> + * nbcon consoles have their own sequence counter. The legacy
> + * sequence counter is reset so that it is clear it is not
> + * being used.
> + */
> + nbcon_seq_force(newcon, newcon->seq);
> + newcon->seq = 0;

I have tried to get rid of this hack by assigning the value
correctly already in console_init_seq().

But I ended up with a chicken-and-egg problem over whether to call
nbcon_init() before console_init_seq() or vice versa.
No solution was ideal.

Then I realized that the full solution in RT tree starts the kthread
in nbcon_init() => con->nbcon_seq must be initialized earlier
=> the current code is buggy.

BTW: I wonder if it is sane to start the kthread in the middle of
struct console initialization. Especially when the full
initialization is needed for a correct serialization against
uart_port_lock().

It might be better to create the kthread in a separate function
called later. But this will be another patchset...


For now, I suggest to make the seq initialization cleaner. Here is
my proposal. The patch can be applied on top of this patchset.
It is only compile tested.


From 2fcb73c98dde2c099fd5d5e4de9c0ffb449c7cc6 Mon Sep 17 00:00:00 2001
From: Petr Mladek <[email protected]>
Date: Fri, 5 Apr 2024 15:44:45 +0200
Subject: [PATCH] printk: Clean up code initializing the per-console sequence
number

console_init_seq() reads either con->seq or con->nbcon_seq
depending on the value of CON_NBCON flag. But it always sets
newcon->seq even for nbcon consoles. The value is later
moved in register_console().

Rename console_init_seq() to get_init_console_seq() and just
return the value. Pass the value to nbcon_init() so that
the nbcon sequence counter can be initialized there.

The cleaned-up design should make sure that the value is set
before the kthread is created and stays in place afterwards.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/printk/internal.h | 2 +-
kernel/printk/nbcon.c | 5 +++--
kernel/printk/printk.c | 37 +++++++++++++++++--------------------
3 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index bcf2105a5c5c..2366303f31ae 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -84,7 +84,7 @@ int console_lock_spinning_disable_and_check(int cookie);
u64 nbcon_seq_read(struct console *con);
void nbcon_seq_force(struct console *con, u64 seq);
bool nbcon_alloc(struct console *con);
-void nbcon_init(struct console *con);
+void nbcon_init(struct console *con, u64 init_seq);
void nbcon_free(struct console *con);
enum nbcon_prio nbcon_get_default_prio(void);
void nbcon_atomic_flush_pending(void);
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index 4c852c2e8d89..c57ad34a8d56 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -1262,18 +1262,19 @@ bool nbcon_alloc(struct console *con)
/**
* nbcon_init - Initialize the nbcon console specific data
* @con: Console to initialize
+ * @init_seq:	Sequence number of the first to-be-emitted record
*
* nbcon_alloc() *must* be called and succeed before this function
* is called.
*/
-void nbcon_init(struct console *con)
+void nbcon_init(struct console *con, u64 init_seq)
{
struct nbcon_state state = { };

/* nbcon_alloc() must have been called and successful! */
BUG_ON(!con->pbufs);

- nbcon_seq_force(con, 0);
+ nbcon_seq_force(con, init_seq);
nbcon_state_set(con, &state);
}

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 9da70d7dda3d..9164b7f72e17 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3427,20 +3427,21 @@ static void try_enable_default_console(struct console *newcon)
newcon->flags |= CON_CONSDEV;
}

-/* Set @newcon->seq to the first record this console should print. */
-static void console_init_seq(struct console *newcon, bool bootcon_registered)
+/* Return starting sequence number to be used by a newly registered console. */
+static u64 get_init_console_seq(struct console *newcon, bool bootcon_registered)
{
struct console *con;
+ u64 init_seq;
bool handover;

if (newcon->flags & (CON_PRINTBUFFER | CON_BOOT)) {
/* Get a consistent copy of @syslog_seq. */
mutex_lock(&syslog_lock);
- newcon->seq = syslog_seq;
+ init_seq = syslog_seq;
mutex_unlock(&syslog_lock);
} else {
/* Begin with next message added to ringbuffer. */
- newcon->seq = prb_next_seq(prb);
+ init_seq = prb_next_seq(prb);

/*
* If any enabled boot consoles are due to be unregistered
@@ -3461,7 +3462,7 @@ static void console_init_seq(struct console *newcon, bool bootcon_registered)
* Flush all consoles and set the console to start at
* the next unprinted sequence number.
*/
- if (!console_flush_all(true, &newcon->seq, &handover)) {
+ if (!console_flush_all(true, &init_seq, &handover)) {
/*
* Flushing failed. Just choose the lowest
* sequence of the enabled boot consoles.
@@ -3474,12 +3475,12 @@ static void console_init_seq(struct console *newcon, bool bootcon_registered)
if (handover)
console_lock();

- newcon->seq = prb_next_seq(prb);
+ init_seq = prb_next_seq(prb);
for_each_console(con) {
u64 seq;

- if (!((con->flags & CON_BOOT) &&
- (con->flags & CON_ENABLED))) {
+ if (!(con->flags & CON_BOOT) ||
+ !(con->flags & CON_ENABLED)) {
continue;
}

@@ -3488,14 +3489,16 @@ static void console_init_seq(struct console *newcon, bool bootcon_registered)
else
seq = con->seq;

- if (seq < newcon->seq)
- newcon->seq = seq;
+ if (seq < init_seq)
+ init_seq = seq;
}
}

console_unlock();
}
}
+
+ return init_seq;
}

#define console_first() \
@@ -3528,6 +3531,7 @@ void register_console(struct console *newcon)
bool bootcon_registered = false;
bool realcon_registered = false;
unsigned long flags;
+ u64 init_seq;
int err;

console_list_lock();
@@ -3605,21 +3609,14 @@ void register_console(struct console *newcon)
}

newcon->dropped = 0;
- console_init_seq(newcon, bootcon_registered);
+ init_seq = get_init_console_seq(newcon, have_boot_console);

if (newcon->flags & CON_NBCON) {
have_nbcon_console = true;
- nbcon_init(newcon);
-
- /*
- * nbcon consoles have their own sequence counter. The legacy
- * sequence counter is reset so that it is clear it is not
- * being used.
- */
- nbcon_seq_force(newcon, newcon->seq);
- newcon->seq = 0;
+ nbcon_init(newcon, init_seq);
} else {
have_legacy_console = true;
+ newcon->seq = init_seq;
}

if (newcon->flags & CON_BOOT)
--
2.44.0


2024-04-08 13:22:09

by Petr Mladek

Subject: Re: [PATCH printk v4 03/27] printk: nbcon: Remove return value for write_atomic()

On Wed 2024-04-03 00:17:05, John Ogness wrote:
> The return value of write_atomic() does not provide any useful
> information. On the contrary, it makes things more complicated
> for the caller to appropriately deal with the information.
>
> Change write_atomic() to not have a return value. If the
> message did not get printed due to loss of ownership, the
> caller will notice this on its own. If ownership was not lost,
> it will be assumed that the driver successfully printed the
> message and the sequence number for that console will be
> incremented.
>
> Signed-off-by: John Ogness <[email protected]>

Reviewed-by: Petr Mladek <[email protected]>

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -889,16 +888,16 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
> wctxt->unsafe_takeover = cur.unsafe_takeover;
>
> if (con->write_atomic) {
> - done = con->write_atomic(con, wctxt);
> + con->write_atomic(con, wctxt);
> } else {
> - nbcon_context_release(ctxt);
> + /*
> + * This function should never be called for legacy consoles.
> + * Handle it as if ownership was lost and try to continue.
> + */
> WARN_ON_ONCE(1);
> - done = false;
> - }
> -
> - /* If not done, the emit was aborted. */
> - if (!done)
> + nbcon_context_release(ctxt);

I thought a bit about whether it is better to release the context before
or after the WARN(). My conclusion is that it does not really matter.

Anyway, we must make sure that it is safe to call WARN_ON_ONCE()
when the nbcon context is acquired. People will use it. And I believe
that it _is_ safe.

Best Regards,
Petr

2024-04-08 15:20:55

by Petr Mladek

Subject: Re: [PATCH printk v4 05/27] printk: nbcon: Add detailed doc for write_atomic()

On Wed 2024-04-03 00:17:07, John Ogness wrote:
> The write_atomic() callback has special requirements and is
> allowed to use special helper functions. Provide detailed
> documentation of the callback so that a developer has a
> chance of implementing it correctly.
>
> Signed-off-by: John Ogness <[email protected]>
> Reviewed-by: Petr Mladek <[email protected]>

I have re-read the text and found a mistake.

---
> include/linux/console.h | 31 +++++++++++++++++++++++++++----
> 1 file changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/console.h b/include/linux/console.h
> index 54b98e4f0544..e4028d4079e1 100644
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -285,7 +285,7 @@ struct nbcon_write_context {
> /**
> * struct console - The console descriptor structure
> * @name: The name of the console driver
> - * @write: Write callback to output messages (Optional)
> + * @write: Legacy write callback to output messages (Optional)
> * @read: Read callback for console input (Optional)
> * @device: The underlying TTY device driver (Optional)
> * @unblank: Callback to unblank the console (Optional)
> @@ -302,7 +302,6 @@ struct nbcon_write_context {
> * @data: Driver private data
> * @node: hlist node for the console list
> *
> - * @write_atomic: Write callback for atomic context
> * @nbcon_state: State for nbcon consoles
> * @nbcon_seq: Sequence number of the next record for nbcon to print
> * @pbufs: Pointer to nbcon private buffer
> @@ -327,8 +326,32 @@ struct console {
> struct hlist_node node;
>
> /* nbcon console specific members */
> - void (*write_atomic)(struct console *con,
> - struct nbcon_write_context *wctxt);
> +
> + /**
> + * @write_atomic:
> + *
> + * NBCON callback to write out text in any context.
> + *
> + * This callback is called with the console already acquired. The
> + * callback can use nbcon_can_proceed() at any time to verify that
> + * it is still the owner of the console. In the case that it has
> + * lost ownership, it is no longer allowed to go forward. In this
> + * case it must back out immediately and carefully. The buffer
> + * content is also no longer trusted since it no longer belongs to
> + * the context.
> + *
> + * If the callback needs to perform actions where ownership is not
> + * allowed to be taken over, nbcon_enter_unsafe() and
> + * nbcon_exit_unsafe() can be used to mark such sections.

IMHO, the word 'can' is wrong. The callback has to enter the unsafe
state whenever ownership is not allowed to be taken over.

We should probably be more clear about what exactly this means.

Thinking more about this. We should also be more clear about when
to use nbcon_can_proceed() and what it guarantees. Is it actually
useful to call it directly in practice?


> + * functions are also points of possible ownership transfer. If
> + * either function returns false, ownership has been lost.
> + *
> + * This callback can be called from any context (including NMI).
> + * Therefore it must avoid usage of any locking and instead rely
> + * on the console ownership for synchronization.
> + */

My proposal:

/**
* @write_atomic:
*
* NBCON callback to write out text in any context.
*
* This callback is called with the nbcon context acquired. But
* a higher priority context is allowed to take over it by default.
*
* The callback has to call nbcon_enter_unsafe() and nbcon_exit_unsafe()
* around any code where the takeover is not safe, for example, when
* manipulating the serial port registers.
*
* nbcon_enter_unsafe() might fail when the context has lost
* the console ownership in the meantime. In this case, the callback
* is no longer allowed to go forward. It must back out immediately and
* carefully. The buffer content is also no longer trusted since it
* no longer belongs to the context.
*
* The callback should allow the takeover whenever it is safe. It
* increases the chance of seeing messages when the system is in
* trouble.
*
* The callback is called from any context (including NMI).
* Therefore it must avoid usage of any locking and instead rely
* on the console ownership for synchronization.
*/

> + void (*write_atomic)(struct console *con, struct nbcon_write_context *wctxt);
> +
> atomic_t __private nbcon_state;
> atomic_long_t __private nbcon_seq;
> struct printk_buffers *pbufs;


Best Regards,
Petr

2024-04-09 09:20:57

by Petr Mladek

Subject: Re: [PATCH printk v4 06/27] printk: nbcon: Add callbacks to synchronize with driver

On Wed 2024-04-03 00:17:08, John Ogness wrote:
> Console drivers typically must deal with access to the hardware
> via user input/output (such as an interactive login shell) and
> output of kernel messages via printk() calls.
>
> Follow-up commits require that the printk subsystem is able to
> synchronize with the driver. Require nbcon consoles to implement
> two new callbacks (device_lock(), device_unlock()) that will
> use whatever synchronization mechanism the driver is using for
> itself (for example, the port lock for uart serial consoles).

We should explain here the bigger picture, see my comment
of the word "whenever" below.

<proposal>
Console drivers typically have to deal with access to the hardware
via user input/output (such as an interactive login shell) and
output of kernel messages via printk() calls.

They use some classic locking mechanism in most situations. But
console->write_atomic() callbacks, used by nbcon consoles,
must be synchronized only by acquiring the console context.

The synchronization via the console context ownership is possible
only when the console driver is registered. That is when a particular
device driver is connected with a particular console driver.

The two synchronization mechanisms must be coordinated with each
other. This is tricky because the console context ownership is quite
special: it might be taken over by a higher-priority context, and
CPU migration has to be disabled.

The trickiest part is to (dis)connect these two mechanisms during
console (un)registration. Let's make it easier by adding two new
callbacks: device_lock() and device_unlock(). They allow taking
the device-specific lock while the device is being (un)registered
as the related console driver.

For example, these callbacks would lock/unlock the port lock
for serial port drivers.
</proposal>

>
> Signed-off-by: John Ogness <[email protected]>
> ---
> include/linux/console.h | 42 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 42 insertions(+)
>
> diff --git a/include/linux/console.h b/include/linux/console.h
> index e4028d4079e1..ad85594e070e 100644
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -352,6 +352,48 @@ struct console {
> */
> void (*write_atomic)(struct console *con, struct nbcon_write_context *wctxt);
>
> + /**
> + * @device_lock:
> + *
> + * NBCON callback to begin synchronization with driver code.
> + *
> + * Console drivers typically must deal with access to the hardware
> + * via user input/output (such as an interactive login shell) and
> + * output of kernel messages via printk() calls. This callback is
> + * called by the printk-subsystem whenever it needs to synchronize
> + * with hardware access by the driver.

The word "whenever" can be translated as "always" (by dictionary
and by my head ;-) Then the description sounds like this would be
the primary synchronization mechanism between printk and the driver.

In fact, this API has only one purpose: to synchronize
console registration/unregistration.

IMHO, we should explain here the relation between the driver-specific
locking and nbcon console context locking. It would describe the big
picture and hopefully reduce confusion and eventual misuse.


> + It should be implemented to
> + * use whatever synchronization mechanism the driver is using for
> + * itself (for example, the port lock for uart serial consoles).
> + *
> + * This callback is always called from task context. It may use any
> + * synchronization method required by the driver. BUT this callback
> + * MUST also disable migration. The console driver may be using a
> + * synchronization mechanism that already takes care of this (such as
> + * spinlocks). Otherwise this function must explicitly call
> + * migrate_disable().
> + *
> + * The flags argument is provided as a convenience to the driver. It
> + * will be passed again to device_unlock(). It can be ignored if the
> + * driver does not need it.
> + */

<proposal>
/**
* @device_lock:
*
* NBCON callback to serialize registration of a device driver
* for a console driver.
*
* Console drivers typically have to deal with access to the hardware
* via user input/output (such as an interactive login shell) and
* output of kernel messages via printk() calls.
*
* They use some classic locking mechanism in most situations. But
* console->write_atomic() callbacks, used by nbcon consoles,
* must be synchronized only by acquiring the console context.
*
* The synchronization via the console context ownership is possible
* only when the console driver is registered. That is when a particular
* device driver is connected with a particular console driver.
*
* The device_lock() callback must block operations on the device
* while it is being (un)registered as a console driver. It will
* make sure that the classic device locking is aware of the console
* context locking when it might be acquired by the related nbcon
* console driver.
*
* This callback is always called from task context. It may use any
* synchronization method required by the driver, for example
* port lock for serial ports.
*
* IMPORTANT: This callback MUST also disable migration. The console
* driver may be using a synchronization mechanism that already
* takes care of this (such as spinlocks). Otherwise this function
* must explicitly call migrate_disable().
*
* The flags argument is provided as a convenience to the driver.
* It will be passed again to device_unlock(). It can be ignored
* if the driver does not need it.
*/
</proposal>


> + void (*device_lock)(struct console *con, unsigned long *flags);
> +
> + /**
> + * @device_unlock:
> + *
> + * NBCON callback to finish synchronization with driver code.
> + *
> + * It is the counterpart to device_lock().
> + *
> + * This callback is always called from task context. It must
> + * appropriately re-enable migration (depending on how device_lock()
> + * disabled migration).
> + *
> + * The flags argument is the value of the same variable that was
> + * passed to device_lock().
> + */
> + void (*device_unlock)(struct console *con, unsigned long flags);
> +
> atomic_t __private nbcon_state;
> atomic_long_t __private nbcon_seq;
> struct printk_buffers *pbufs;

With the updated commit message and comment:

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr


2024-04-09 09:46:52

by Petr Mladek

Subject: Re: [PATCH printk v4 07/27] printk: nbcon: Use driver synchronization while registering

On Wed 2024-04-03 00:17:09, John Ogness wrote:
> Depending on if an nbcon console is registered, a driver may
> handle its internal locking differently. If a driver is holding
> its internal lock while the nbcon console is registered, there
> may be a risk that two different contexts access the hardware
> simultaneously without synchronization. (For example, if the
> printk subsystem invokes atomic printing while another driver
> context acquired the internal lock without considering the
> atomic console.)
>
> Use the driver synchronization while a registering nbcon console
> transitions to being registered. This guarantees that if the
> driver acquires its internal lock when the nbcon console was not
> registered, it will remain unregistered until that context
> releases the lock.
>
> Signed-off-by: John Ogness <[email protected]>

Looks reasonable:

Reviewed-by: Petr Mladek <[email protected]>

Note:

The printk kthread integration is not part of this patchset.
I see in linux-rt-devel that nbcon_kthread_func() emits
a pending record under con->driver_lock(). This is
a solution to prevent a race with the driver lock.

IMHO, it is not strictly necessary to take the driver_lock()
in the kthread. Instead, it would be enough to make sure that
the kthread is running only when the device driver is properly
registered as the console driver.

Well, we should probably discuss this in the patchset introducing
the kthread.

Best Regards,
Petr

2024-04-09 12:00:36

by Petr Mladek

Subject: Re: [PATCH printk v4 08/27] serial: core: Provide low-level functions to lock port

On Wed 2024-04-03 00:17:10, John Ogness wrote:
> It will be necessary at times for the uart nbcon console
> drivers to acquire the port lock directly (without the
> additional nbcon functionality of the port lock wrappers).
> These are special cases such as the implementation of the
> device_lock()/device_unlock() callbacks or for internal
> port lock wrapper synchronization.
>
> Provide low-level variants __uart_port_lock_irqsave() and
> __uart_port_unlock_irqrestore() for this purpose.
>
> Signed-off-by: John Ogness <[email protected]>

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-04-09 13:23:39

by Greg Kroah-Hartman

Subject: Re: [PATCH printk v4 08/27] serial: core: Provide low-level functions to lock port

On Wed, Apr 03, 2024 at 12:17:10AM +0206, John Ogness wrote:
> It will be necessary at times for the uart nbcon console
> drivers to acquire the port lock directly (without the
> additional nbcon functionality of the port lock wrappers).
> These are special cases such as the implementation of the
> device_lock()/device_unlock() callbacks or for internal
> port lock wrapper synchronization.
>
> Provide low-level variants __uart_port_lock_irqsave() and
> __uart_port_unlock_irqrestore() for this purpose.
>
> Signed-off-by: John Ogness <[email protected]>
> ---
> include/linux/serial_core.h | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)

Acked-by: Greg Kroah-Hartman <[email protected]>

2024-04-10 12:35:39

by Petr Mladek

Subject: Re: [PATCH printk v4 09/27] printk: nbcon: Implement processing in port->lock wrapper

On Wed 2024-04-03 00:17:11, John Ogness wrote:
> Currently the port->lock wrappers uart_port_lock(),
> uart_port_unlock() (and their variants) only lock/unlock
> the spin_lock.
>
> If the port is an nbcon console, the wrappers must also
> acquire/release the console and mark the region as unsafe. This
> allows general port->lock synchronization to be synchronized
> with the nbcon console ownership.
>
> Introduce a new struct nbcon_drvdata within struct console that
> provides the necessary components for the port lock wrappers to
> acquire the nbcon console and track its ownership.
>
> Also introduce uart_port_set_cons() as a wrapper to set @cons
> of a uart_port. The wrapper sets @cons under the port lock in
> order to prevent @cons from disappearing while another context
> owns the port lock via the port lock wrappers.
>
> Also cleanup the description of the console_srcu_read_flags()
> function. It is used by the port lock wrappers to ensure a
> console cannot be fully unregistered while another context
> owns the port lock via the port lock wrappers.
>
> Signed-off-by: John Ogness <[email protected]>

> --- a/drivers/tty/serial/8250/8250_core.c
> +++ b/drivers/tty/serial/8250/8250_core.c
> @@ -689,7 +689,7 @@ static int univ8250_console_match(struct console *co, char *name, int idx,
> continue;
>
> co->index = i;
> - port->cons = co;
> + uart_port_set_cons(port, co);
> return serial8250_console_setup(port, options, true);

I just noticed that this is a newcon->match() callback. It does:

+ univ8250_console_match()
+ serial8250_console_setup(port, options, true) // @probe == true
+ probe_baud(port)

which manipulates the serial port.

We should call also the con->match() callback under console_lock()
in try_enable_preferred_console() like we do with con->setup,
see the commit 801410b26a0e8 ("serial: Lock console when calling into
driver before registration").

Maybe we should just take console_lock() in register_console()
around these try_enable_*() calls.

Well, this is for a separate patch or separate patchset.

> }
>
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -282,6 +282,25 @@ struct nbcon_write_context {
> bool unsafe_takeover;
> };
>
> +/**
> + * struct nbcon_drvdata - Data to allow nbcon acquire in non-print context
> + * @ctxt: The core console context
> + * @srcu_cookie: Storage for a console_srcu_lock cookie, if needed
> + * @owner_index: Storage for the owning console index, if needed
> + * @locked: Storage for the locked state, if needed
> + *
> + * All fields (except for @ctxt) are available exclusively to the driver to
> + * use as needed. They are not used by the printk subsystem.
> + */
> +struct nbcon_drvdata {
> + struct nbcon_context __private ctxt;
> +
> + /* reserved for driver use */
> + int srcu_cookie;
> + short owner_index;
> + bool locked;
> +};
> +
> /**
> * struct console - The console descriptor structure
> * @name: The name of the console driver
> --- a/include/linux/serial_core.h
> +++ b/include/linux/serial_core.h
> @@ -606,6 +609,83 @@ static inline void __uart_port_unlock_irqrestore(struct uart_port *up, unsigned
> spin_unlock_irqrestore(&up->lock, flags);
> }
>
> +/**
> + * uart_port_set_cons - Safely set the @cons field for a uart
> + * @up: The uart port to set
> + * @con: The new console to set to
> + *
> + * This function must be used to set @up->cons. It uses the port lock to
> + * synchronize with the port lock wrappers in order to ensure that the console
> + * cannot change or disappear while another context is holding the port lock.
> + */
> +static inline void uart_port_set_cons(struct uart_port *up, struct console *con)
> +{
> + unsigned long flags;
> +
> + __uart_port_lock_irqsave(up, &flags);
> + up->cons = con;
> + __uart_port_unlock_irqrestore(up, flags);
> +}
> +
> +/* Only for internal port lock wrapper usage. */
> +static inline void __uart_port_nbcon_acquire(struct uart_port *up)
> +{
> + lockdep_assert_held_once(&up->lock);
> +
> + if (likely(!uart_console(up)))
> + return;
> +
> + if (up->cons->nbcon_drvdata) {
> + /*
> + * If @up->cons is registered, prevent it from fully
> + * unregistering until this context releases the nbcon.
> + */
> + int cookie = console_srcu_read_lock();

[ later update: maybe skip 30 lines and read the "Hum, ho" part first]
[ even later update: or skip 60 lines and read "Win win" part first.]

OK, this makes sense. But I feel like we are in a lock cycle.
This code is called under "up->lock()". It will be taken also by
the newcon->device_lock() in register_console() so we would have:

+ register_console()
+ console_list_lock() // serializes SRCU access to console list.
+ con->device_lock()
+ spin_lock(&up->lock)

=> console_list_lock -> up->lock

and here

+ uart_port_lock()
+ spin_lock(&up->lock)
+ __uart_port_nbcon_acquire()
+ console_srcu_read_lock() // SRCU read lock serialized by console_list_lock

=> up->lock -> SRCU read lock serialized by console_list_lock

and for completeness:

+ unregister_console()
+ console_list_lock()
+ unregister_console_locked()
+ synchronize_srcu(&console_srcu);


OK, it works because srcu_read_lock() is not blocking.
The synchronize_srcu() is called under console_list_lock(), so
the only important thing is not to take console_list_lock() under
console_srcu_read_lock().


Hum, ho, it took me some time to create a mental model around this.
It is not that complicated after all:

+ console_list_lock(): serializes the entire console (un)registration
operation. Well, it primarily serializes
the console_list manipulation, including up->cons->node,
which is tested below.

+ console_lock(): serializes emitting messages on legacy and
boot consoles

+ con->device_lock aka port->lock: serializes more actions:

1. any non-printk related access to the device (HW) like
a generic read/write.

2. Access to the device by con->write() for legacy consoles.

3. console registration, in particular console_list
manipulation, including up->cons->node. It is needed
to avoid races when the non-printk code has to decide
whether it needs to get serialized against nbcon
consoles or not.

For example, it should prevent races in
__uart_port_nbcon_acquire(up) and
__uart_port_nbcon_release(up) which are added in this patch.

But wait, we take con->device_lock() only in register_console().

Is this correct?

IMHO, it is a bug. We should (must) take con->device_lock()
also in unregister_console() when manipulating the
console_list and up->cons->node. Otherwise, uart_console(up)
would provide racy results.


Win win situation:

If we take con->device_lock() in unregister_console() around
console_list manipulation then the console could never
disappear or be assigned to another port when both:

uart_console(up) && hlist_unhashed_lockless(&up->cons->node)

are "true"

=> we would not need to take console_srcu_read_lock() here
=> we would not need to store/check up->line

Heh, we would not even need "bool locked" because

uart_console(up) && hlist_unhashed_lockless(&up->cons->node)

would always return the same result, even in __uart_port_nbcon_release()

=> easier code, straight serialization rules, no races.


> +
> + /* Ensure console is registered and is an nbcon console. */
> + if (!hlist_unhashed_lockless(&up->cons->node) &&
> + (console_srcu_read_flags(up->cons) & CON_NBCON)) {
> + WARN_ON_ONCE(up->cons->nbcon_drvdata->locked);
> +
> + nbcon_driver_acquire(up->cons);
> +
> + /*
> + * Record @up->line to be used during release because
> + * @up->cons->index can change while the port and
> + * nbcon are locked.
> + */
> + up->cons->nbcon_drvdata->owner_index = up->line;
> + up->cons->nbcon_drvdata->srcu_cookie = cookie;
> + up->cons->nbcon_drvdata->locked = true;
> + } else {
> + console_srcu_read_unlock(cookie);
> + }
> + }
> +}
> +
> +/* Only for internal port lock wrapper usage. */
> +static inline void __uart_port_nbcon_release(struct uart_port *up)
> +{
> + lockdep_assert_held_once(&up->lock);
> +
> + /*
> + * uart_console() cannot be used here because @up->cons->index might
> + * have changed. Check against @up->cons->nbcon_drvdata->owner_index
> + * instead.
> + */
> +
> + if (unlikely(up->cons &&
> + up->cons->nbcon_drvdata &&
> + up->cons->nbcon_drvdata->locked &&
> + up->cons->nbcon_drvdata->owner_index == up->line)) {
> + WARN_ON_ONCE(!up->cons->nbcon_drvdata->locked);

The WARN_ON_ONCE() would never trigger because
"up->cons->nbcon_drvdata->locked" is checked by the above
if-condition.

I hope that we can replace this with the same checks as in the
acquire() part, as proposed above.


> +
> + up->cons->nbcon_drvdata->locked = false;
> + nbcon_driver_release(up->cons);
> + console_srcu_read_unlock(up->cons->nbcon_drvdata->srcu_cookie);
> + }
> +}
> +
> /**
> * uart_port_lock - Lock the UART port
> * @up: Pointer to UART port structure
> @@ -654,7 +741,11 @@ static inline bool uart_port_trylock(struct uart_port *up)
> */
> static inline bool uart_port_trylock_irqsave(struct uart_port *up, unsigned long *flags)
> {
> - return spin_trylock_irqsave(&up->lock, *flags);
> + if (!spin_trylock_irqsave(&up->lock, *flags))
> + return false;
> +
> + __uart_port_nbcon_acquire(up);

I would feel more comfortable if we created
__uart_port_nbcon_try_acquire(up). It would give up when it could
not acquire the context within the given timeout.

It would be similar to acquire(). The only difference would be that
it would return false on failure. And it would call:

/**
* nbcon_driver_try_acquire - Try acquire nbcon console and enter unsafe section
* @con: The nbcon console to acquire
*
* Context: Any context which could not be migrated to another CPU.
*
* Console drivers will usually use their own internal synchronization
* mechanism to synchronize between console printing and non-printing
* activities (such as setting baud rates). However, nbcon console drivers
* supporting atomic consoles may also want to mark unsafe sections when
* performing non-printing activities.
*
* This function tries to acquire the nbcon console using priority
* NBCON_PRIO_NORMAL and marks it unsafe for handover/takeover.
*
* Return: true on success, false when it was not able to acquire the
* console and set it "unsafe" for a takeover.
*/
bool nbcon_driver_try_acquire(struct console *con)
{
struct nbcon_context *ctxt = &ACCESS_PRIVATE(con->nbcon_drvdata, ctxt);

cant_migrate();

memset(ctxt, 0, sizeof(*ctxt));
ctxt->console = con;
ctxt->prio = NBCON_PRIO_NORMAL;

if (!nbcon_context_try_acquire(ctxt))
return false;

if (!nbcon_context_enter_unsafe(ctxt))
return false;

return true;
}

It is probably not that important because it should not block emitting
the emergency or panic messages. They would use NBCON_PRIO_EMERGENCY
or NBCON_PRIO_PANIC in the important code paths.

But it looks semantically wrong to use a potentially blocking function
in a try_lock() API. IMHO, it would be a call for trouble.

> + return true;
> }
>
> /**
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index 2516449f921d..38328cf0fd5c 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -988,3 +991,52 @@ void nbcon_free(struct console *con)
>
> con->pbufs = NULL;
> }
> +
> +/**
> + * nbcon_driver_acquire - Acquire nbcon console and enter unsafe section
> + * @con: The nbcon console to acquire
> + *
> + * Context: Any context which could not be migrated to another CPU.
> + *
> + * Console drivers will usually use their own internal synchronization
> * mechanism to synchronize between console printing and non-printing
> + * activities (such as setting baud rates). However, nbcon console drivers
> + * supporting atomic consoles may also want to mark unsafe sections when
> + * performing non-printing activities.
> + *
> + * This function acquires the nbcon console using priority NBCON_PRIO_NORMAL
> + * and marks it unsafe for handover/takeover.
> + *
> + * Console drivers using this function must have provided @nbcon_drvdata in
> + * their struct console, which is used to track ownership and state
> + * information.
> + */
> +void nbcon_driver_acquire(struct console *con)
> +{
> + struct nbcon_context *ctxt = &ACCESS_PRIVATE(con->nbcon_drvdata, ctxt);

Hmm, we need to store somewhere the "struct nbcon_context" for this
generic purpose. If we agreed to remove struct nbcon_drvdata then
I would store it in struct console as

struct console {
[...]
/**
* @driver_nbcon_context:
*
* nbcon_context used to serialize non-printing operations on
* the same device.
*
* The device drivers synchronize these operations with a driver-specific
* lock, such as port->lock in the serial consoles. When the
* device is registered as a console, they additionally have to acquire
* this nbcon context to get serialized against the atomic_write()
* callback using the same device.
*
* The struct does not require any special initialization.
*/
struct nbcon_context driver_nbcon_context;
[...]
};

It will be unused for legacy consoles. But the plan is to convert all
console drivers anyway.

IMHO, passing it via an optional pointer is not worth the complexity.

> +
> + cant_migrate();
> +
> + do {
> + do {
> + memset(ctxt, 0, sizeof(*ctxt));
> + ctxt->console = con;
> + ctxt->prio = NBCON_PRIO_NORMAL;
> + } while (!nbcon_context_try_acquire(ctxt));
> +
> + } while (!nbcon_context_enter_unsafe(ctxt));
> +}
> +EXPORT_SYMBOL_GPL(nbcon_driver_acquire);

Best Regards,
Petr

2024-04-10 13:10:02

by Petr Mladek

Subject: Re: [PATCH printk v4 10/27] printk: nbcon: Do not rely on proxy headers

On Wed 2024-04-03 00:17:12, John Ogness wrote:
> The headers kernel.h, serial_core.h, and console.h allow for the
> definitions of many types and functions from other headers.
> Rather than relying on these as proxy headers, explicitly
> include all headers providing needed definitions. Also sort the
> list alphabetically to be able to easily detect duplicates.
>
> Suggested-by: Andy Shevchenko <[email protected]>
> Signed-off-by: John Ogness <[email protected]>

Looks good. Well, I haven't really checked the list of includes one by
one, so I only provide:

Acked-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-04-10 15:01:54

by Petr Mladek

Subject: Re: [PATCH printk v4 15/27] printk: nbcon: Provide function to flush using write_atomic()

On Wed 2024-04-03 00:17:17, John Ogness wrote:
> From: Thomas Gleixner <[email protected]>
>
> Provide nbcon_atomic_flush_pending() to perform flushing of all
> registered nbcon consoles using their write_atomic() callback.
>
> Unlike console_flush_all(), nbcon_atomic_flush_pending() will
> only flush up through the newest record at the time of the
> call. This prevents a CPU from printing unbounded when other
> CPUs are adding records.
>
> Also unlike console_flush_all(), nbcon_atomic_flush_pending()
> will fully flush one console before flushing the next. This
> helps to guarantee that a block of pending records (such as
> a stack trace in an emergency situation) can be printed
> atomically at once before releasing console ownership.
>
> nbcon_atomic_flush_pending() is safe in any context because it
> uses write_atomic() and acquires with unsafe_takeover disabled.
>
> Use it in console_flush_on_panic() before flushing legacy
> consoles. The legacy write() callbacks are not fully safe when
> oops_in_progress is set.
>
> Co-developed-by: John Ogness <[email protected]>
> Signed-off-by: John Ogness <[email protected]>
> Signed-off-by: Thomas Gleixner (Intel) <[email protected]>

Reviewed-by: Petr Mladek <[email protected]>

See few nits below.

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -937,6 +935,108 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
> return nbcon_context_exit_unsafe(ctxt);
> }
>
> +/**
> + * __nbcon_atomic_flush_pending_con - Flush specified nbcon console using its
> + * write_atomic() callback
> + * @con: The nbcon console to flush
> + * @stop_seq: Flush up until this record
> + *
> + * Return: True if taken over while printing. Otherwise false.
> + *
> + * If flushing up to @stop_seq was not successful, it only makes sense for the
> + * caller to try again when true was returned. When false is returned, either
> + * there are no more records available to read or this context is not allowed
> + * to acquire the console.
> + */
> +static bool __nbcon_atomic_flush_pending_con(struct console *con, u64 stop_seq)
> +{
> + struct nbcon_write_context wctxt = { };
> + struct nbcon_context *ctxt = &ACCESS_PRIVATE(&wctxt, ctxt);
> +
> + ctxt->console = con;
> + ctxt->spinwait_max_us = 2000;
> + ctxt->prio = NBCON_PRIO_NORMAL;

Nit: It looks strange to hardcode NBCON_PRIO_NORMAL and call it from
console_flush_on_panic() in the same patch.

I see. It will get replaced by nbcon_get_default_prio() later.
I guess that it is just a relic from several reworks and
shuffling. I know that it is hard.

It might have been better to either add the call in
console_flush_on_panic() later, or introduce nbcon_get_default_prio()
earlier so that we could use it here.


> +
> + if (!nbcon_context_try_acquire(ctxt))
> + return false;
> +
> + while (nbcon_seq_read(con) < stop_seq) {
> + /*
> + * nbcon_emit_next_record() returns false when the console was
> + * handed over or taken over. In both cases the context is no
> + * longer valid.
> + */
> + if (!nbcon_emit_next_record(&wctxt))
> + return true;
> +
> + if (!ctxt->backlog)
> + break;
> + }
> +
> + nbcon_context_release(ctxt);
> +
> + return false;
> +}
> +
> +/**
> + * __nbcon_atomic_flush_pending - Flush all nbcon consoles using their
> + * write_atomic() callback
> + * @stop_seq: Flush up until this record
> + */
> +static void __nbcon_atomic_flush_pending(u64 stop_seq)
> +{
> + struct console *con;
> + bool should_retry;
> + int cookie;
> +
> + do {
> + should_retry = false;
> +
> + cookie = console_srcu_read_lock();
> + for_each_console_srcu(con) {
> + short flags = console_srcu_read_flags(con);
> + unsigned long irq_flags;
> +
> + if (!(flags & CON_NBCON))
> + continue;
> +
> + if (!console_is_usable(con, flags))
> + continue;
> +
> + if (nbcon_seq_read(con) >= stop_seq)
> + continue;
> +
> + /*
> + * Atomic flushing does not use console driver
> + * synchronization (i.e. it does not hold the port
> + * lock for uart consoles). Therefore IRQs must be
> + * disabled to avoid being interrupted and then
> + * calling into a driver that will deadlock trying
> + * to acquire console ownership.
> + */
> + local_irq_save(irq_flags);
> +
> + should_retry |= __nbcon_atomic_flush_pending_con(con, stop_seq);

Nit: I have to say that this is quite cryptic. The "true" return value
usually means success. But it sets "should_retry" here.

It would mean slightly more code but it would be much clearer
if __nbcon_atomic_flush_pending_con() returned:

+ 0 on success
+ -EAGAIN when it lost the ownership
+ -EPERM when it can't get the ownership, like nbcon_context_try_acquire_direct()

and we made the decision here.

> +
> + local_irq_restore(irq_flags);
> + }
> + console_srcu_read_unlock(cookie);
> + } while (should_retry);
> +}

Neither of the nits is a blocker. They are basically just about potential
complications for the future code archaeologists.

Best Regards,
Petr

2024-04-11 14:15:53

by Petr Mladek

Subject: Re: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in console_flush_all()

On Wed 2024-04-03 00:17:19, John Ogness wrote:
> Allow nbcon consoles to print messages in the legacy printk()
> caller context (printing via unlock) by integrating them into
> console_flush_all(). The write_atomic() callback is used for
> printing.

Hmm, this patch tries to flush nbcon console even in context
with NBCON_PRIO_NORMAL. Do we really want this, please?

I would expect that it would do so only when the kthread
is not working.

> Provide nbcon_legacy_emit_next_record(), which acts as the
> nbcon variant of console_emit_next_record(). Call this variant
> within console_flush_all() for nbcon consoles. Since nbcon
> consoles use their own @nbcon_seq variable to track the next
> record to print, this also must be appropriately handled.

I have been a bit confused by all the boolean return values
and what _exactly_ they mean. IMHO, we should make it more
clear how it works when it can't acquire the context.

IMHO, it is important because console_flush_all() interprets
nbcon_legacy_emit_next_record() return value as @progress even when
there is no guaranteed progress. We just expect that
the other context is doing something.

It feels like it might get stuck forever in some situation.
It would be good to understand if it is OK or not.


Later update:

Hmm, console_flush_all() is called from console_unlock().
It might be called in atomic context. But the current
owner might be theoretically scheduled out.

This is from documentation of nbcon_context_try_acquire()

/**
* nbcon_context_try_acquire - Try to acquire nbcon console
* @ctxt: The context of the caller
*
* Context: Any context which could not be migrated to another CPU.


I can't find any situation where nbcon_context_try_acquire() is
currently called in normal (schedulable) context. This is probably
why you did not see any problems with testing.

I see 3 possible solutions:

1. Enforce that nbcon context can be acquired only with preemption
disabled.

2. Enforce that nbcon context can be acquired only with interrupts
disabled. It would prevent a deadlock when some future
code interrupts a flush in NBCON_PRIO_EMERGENCY context.
And then a potential nested console_flush_all() won't be
able to take over the interrupted NBCON_PRIO_CONTEXT
and there will be no progress.

3. console_flush_all() should ignore nbcon console when
it is not able to get the context, aka no progress.


I personally prefer the 3rd solution because I have spent
the last 12 years on attempts to move printk into preemptible
context. And it looks wrong to move it into atomic context.

Warning: console_flush_all() suddenly won't guarantee flushing
all messages.

I am not completely sure about all the consequences until
I see the rest of the patchset and the kthread integration.
We will somehow need to guarantee that all messages
are flushed.


I propose the following changes to make console_flush_all()
safe. The patch is made on top of this patchset. Feel free
to squash them into your patch and remove my SoB.


From 8dd9c0be5ab693c6976a0f7f8b99de48ecebcbf6 Mon Sep 17 00:00:00 2001
From: Petr Mladek <[email protected]>
Date: Thu, 11 Apr 2024 15:45:53 +0200
Subject: [PATCH] printk: nbcon: Prevent deadlock when flushing nbcon consoles
in the legacy loop

Ignore pending records in nbcon consoles in the legacy console_flush_all()
loop when the nbcon console context can't be acquired. They will be
flushed either by the current nbcon context owner or to-be-introduced
printk kthread.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/printk/nbcon.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index c57ad34a8d56..88c40f165be4 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -964,8 +964,12 @@ static __ref unsigned int *nbcon_get_cpu_emergency_nesting(void)
* write_atomic() callback
* @wctxt: An initialized write context struct to use for this context
*
- * Return: False if it is known there are no more records to print,
- * otherwise true.
+ * Return: True, when a record has been printed and there are still
+ * pending records. The caller might want to continue flushing.
+ *
+ * False, when there is no pending record, or when the console
+ * context can't be acquired, or the ownership has been lost.
> + * The caller should give up. Either the job is done or it can't be done.
*
* This is an internal helper to handle the locking of the console before
* calling nbcon_emit_next_record().
@@ -975,7 +979,7 @@ static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);

if (!nbcon_context_try_acquire(ctxt))
- return true;
+ return false;

/*
* nbcon_emit_next_record() returns false when the console was
@@ -983,7 +987,7 @@ static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
* longer valid.
*/
if (!nbcon_emit_next_record(wctxt))
- return true;
+ return false;

nbcon_context_release(ctxt);

@@ -1023,8 +1027,13 @@ enum nbcon_prio nbcon_get_default_prio(void)
* @cookie: The cookie from the SRCU read lock.
*
* Context: Any context except NMI.
- * Return: False if the given console has no next record to print,
- * otherwise true.
+ *
+ * Return: True, when a record has been printed and there are still
+ * pending records. The caller might want to continue flushing.
+ *
+ * False, when there is no pending record, or when the console
+ * context can't be acquired, or the ownership has been lost.
> + * The caller should give up. Either the job is done or it can't be done.
*
* This function is meant to be called by console_flush_all() to print records
* on nbcon consoles from legacy context (printing via console unlocking).
--
2.44.0


Best Regards,
Petr

2024-04-11 14:19:23

by Petr Mladek

Subject: Re: [PATCH printk v4 19/27] printk: nbcon: Add unsafe flushing on panic

On Wed 2024-04-03 00:17:21, John Ogness wrote:
> Add nbcon_atomic_flush_unsafe() to flush all nbcon consoles
> using the write_atomic() callback and allowing unsafe hostile
> takeovers. Call this at the end of panic() as a final attempt
> to flush any pending messages.
>
> Note that legacy consoles use unsafe methods for flushing
> from the beginning of panic (see bust_spinlocks()). Therefore,
> systems using both legacy and nbcon consoles may still fail to
> see panic messages due to unsafe legacy console usage.
>
> Signed-off-by: John Ogness <[email protected]>

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-04-11 15:46:09

by Petr Mladek

Subject: Re: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in console_flush_all()

On Thu 2024-04-11 16:14:58, Petr Mladek wrote:
> On Wed 2024-04-03 00:17:19, John Ogness wrote:
> > Allow nbcon consoles to print messages in the legacy printk()
> > caller context (printing via unlock) by integrating them into
> > console_flush_all(). The write_atomic() callback is used for
> > printing.
>
> Hmm, this patch tries to flush nbcon console even in context
> with NBCON_PRIO_NORMAL. Do we really want this, please?
>
> I would expect that it would do so only when the kthread
> is not working.
>
> > Provide nbcon_legacy_emit_next_record(), which acts as the
> > nbcon variant of console_emit_next_record(). Call this variant
> > within console_flush_all() for nbcon consoles. Since nbcon
> > consoles use their own @nbcon_seq variable to track the next
> > record to print, this also must be appropriately handled.
>
> I have been a bit confused by all the boolean return values
> and what _exactly_ they mean. IMHO, we should make it more
> clear how it works when it can't acquire the context.
>
> IMHO, it is important because console_flush_all() interprets
> nbcon_legacy_emit_next_record() return value as @progress even when
> there is no guaranteed progress. We just expect that
> the other context is doing something.
>
> It feels like it might get stuck forever in some situation.
> It would be good to understand if it is OK or not.
>
>
> Later update:
>
> Hmm, console_flush_all() is called from console_unlock().
> It might be called in atomic context. But the current
> owner might be theoretically scheduled out.
>
> This is from documentation of nbcon_context_try_acquire()
>
> /**
> * nbcon_context_try_acquire - Try to acquire nbcon console
> * @ctxt: The context of the caller
> *
> * Context: Any context which could not be migrated to another CPU.
>
>
> I can't find any situation where nbcon_context_try_acquire() is
> currently called in normal (schedulable) context. This is probably
> why you did not see any problems with testing.
>
> I see 3 possible solutions:
>
> 1. Enforce that nbcon context can be acquired only with preemption
> disabled.
>
> 2. Enforce that nbcon context can be acquired only with interrupts
> disabled. It would prevent a deadlock when some future
> code interrupts a flush in NBCON_PRIO_EMERGENCY context.
> And then a potential nested console_flush_all() won't be
> able to take over the interrupted NBCON_PRIO_CONTEXT
> and there will be no progress.
>
> 3. console_flush_all() should ignore nbcon console when
> it is not able to get the context, aka no progress.
>
>
> I personally prefer the 3rd solution because I have spent
> the last 12 years on attempts to move printk into preemptible
> context. And it looks wrong to move it into atomic context.
>
> Warning: console_flush_all() suddenly won't guarantee flushing
> all messages.
>
> I am not completely sure about all the consequences until
> I see the rest of the patchset and the kthread integration.
> We will somehow need to guarantee that all messages
> are flushed.

I am trying to form a full picture of when and how the nbcon consoles
will get flushed. My current understanding and view is the following,
starting from the easiest priority:


1. NBCON_PRIO_PANIC messages will be flushed by calling
nbcon_atomic_flush_pending() directly in vprintk_emit()

This will take care of any previously added messages.

Non-panic CPUs are not allowed to add messages anymore
when there is a panic in progress.

[ALL OK]


2. NBCON_PRIO_EMERGENCY messages will be flushed by calling
nbcon_atomic_flush_pending() directly in nbcon_cpu_emergency_exit().

This would cover all previously added messages, including
the ones printed by the code between
nbcon_cpu_emergency_enter()/exit().

This won't cover later added messages which might be
a problem. Let's look at this closer. Later added
messages with:

+ NBCON_PRIO_PANIC will be handled in vprintk_emit()
as explained above [OK]

+ NBCON_PRIO_EMERGENCY will be handled in the
related nbcon_cpu_emergency_exit() as described here.
[OK]

+ NBCON_PRIO_NORMAL will be handled, see below. [?]

[ PROBLEM: later added NBCON_PRIO_NORMAL messages, see below. ]


3. NBCON_PRIO_NORMAL messages will be flushed by:

+ the printk kthread when it is available

+ the legacy loop via

+ console_unlock()
+ console_flush_all()
+ nbcon_legacy_emit_next_record() [PROBLEM]


PROBLEM: console_flush_all() does not guarantee progress with
nbcon consoles as explained above (previous mail).


My proposal:

1. console_flush_all() will flush nbcon consoles only
in NBCON_PRIO_NORMAL and when the kthreads are not
available.

It will make it clear that this is the flusher in
this situation.


2. Allow to skip nbcon consoles in console_flush_all() when
it can't take the context (as suggested in my previous
reply).

This won't guarantee flushing NORMAL messages added
while nbcon_cpu_emergency_exit() calls
nbcon_atomic_flush_pending().

Solve this problem by introducing[*] nbcon_atomic_flush_all()
which would flush even newly added messages and
call this in nbcon_cpu_emergency_exit() when the printk
kthread does not work. It should bail out when there
is a panic in progress.

Motivation: It does not matter which "atomic" context
flushes NORMAL/EMERGENCY messages when
the printk kthread is not available.

[*] Alternatively we could modify nbcon_atomic_flush_pending()
to flush even newly added messages when the kthread is
not working. But it might create another mess.

How does it sound, please?
Or do I miss anything?

Best Regards,
Petr

2024-04-12 09:07:50

by Petr Mladek

Subject: Re: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in console_flush_all()

On Thu 2024-04-11 16:14:58, Petr Mladek wrote:
> On Wed 2024-04-03 00:17:19, John Ogness wrote:
> > Allow nbcon consoles to print messages in the legacy printk()
> > caller context (printing via unlock) by integrating them into
> > console_flush_all(). The write_atomic() callback is used for
> > printing.
>
> Hmm, this patch tries to flush nbcon console even in context
> with NBCON_PRIO_NORMAL. Do we really want this, please?
>
> I would expect that it would do so only when the kthread
> is not working.
>
> > Provide nbcon_legacy_emit_next_record(), which acts as the
> > nbcon variant of console_emit_next_record(). Call this variant
> > within console_flush_all() for nbcon consoles. Since nbcon
> > consoles use their own @nbcon_seq variable to track the next
> > record to print, this also must be appropriately handled.
>
> I have been a bit confused by all the boolean return values
> and what _exactly_ they mean. IMHO, we should make it more
> clear how it works when it can't acquire the context.
>
> IMHO, it is important because console_flush_all() interprets
> nbcon_legacy_emit_next_record() return value as @progress even when
> there is no guaranteed progress. We just expect that
> the other context is doing something.
>
> It feels like it might get stuck forever in some situation.
> It would be good to understand if it is OK or not.
>
>
> Later update:
>
> Hmm, console_flush_all() is called from console_unlock().
> It might be called in atomic context. But the current
> owner might be theoretically scheduled out.
>
> This is from documentation of nbcon_context_try_acquire()
>
> /**
> * nbcon_context_try_acquire - Try to acquire nbcon console
> * @ctxt: The context of the caller
> *
> * Context: Any context which could not be migrated to another CPU.
>
>
> I can't find any situation where nbcon_context_try_acquire() is
> currently called in normal (schedulable) context. This is probably
> why you did not see any problems with testing.

> I see 3 possible solutions:
>
> 1. Enforce that nbcon context can be acquired only with preemption
> disabled.

We actually have to make sure that preemption is disabled because
nbcon_owner_matches() is not reliable after a wakeup.

The context might be taken by a higher-priority context, then
released, and then taken by another task on the same CPU as
the original sleeping owner. I mean this:


CPU0 CPU1

[ task A ]

nbcon_context_try_acquire()
# success with NORMAL prio
# .unsafe == false; // safe for takeover

[ schedule: task A -> B ]


WARN_ON()
nbcon_atomic_flush_pending()
nbcon_context_try_acquire()
# success with EMERGENCY prio
# .unsafe == false; // safe for takeover

# flushing
nbcon_context_release()


nbcon_context_try_acquire()
# success with NORMAL prio [ task B ]
# .unsafe == false; // safe for takeover

[ schedule: task B -> A ]

nbcon_enter_unsafe()
nbcon_context_can_proceed()

BUG: nbcon_context_can_proceed() returns "true" because
the console is owned by a context on CPU0 with
NBCON_PRIO_NORMAL.

But it should return "false". The console is owned
by a context from task B and we do the check
in a context from task A.


I guess that most of the current code is safe because, for example:

+ __nbcon_atomic_flush_pending() disables interrupts before
acquiring the context

+ nbcon_driver_acquire() is called under spin_lock in
the uart_port_*lock() API.

+ Even the nbcon_kthread_func() in the current RT tree
acquires the context under con->device_lock(). Where
the device_lock() is a spin_lock in the only supported
uart serial console.


To be done:

1. We should make this clear:

+ Add either preempt_disable() or cant_sleep() into
nbcon_context_try_acquire().

+ Replace cant_migrate() with cant_sleep() everywhere

+ Fix/update the documentation


2. We should make sure that the context is acquired for each
emitted record separately at least when using the normal
priority.

For example, __nbcon_atomic_flush_pending() is wrong from
this POV. It is also used from console_unlock(). It should
allow scheduling in between the records in this case.


Best Regards,
Petr

PS: I am still shaking my head over this. Sigh, I hadn't expected
such a big "aha moment" at this stage.

2024-04-12 14:29:39

by Petr Mladek

Subject: Re: [PATCH printk v4 20/27] printk: Avoid console_lock dance if no legacy or boot consoles

On Wed 2024-04-03 12:07:32, John Ogness wrote:
> On 2024-04-03, John Ogness <[email protected]> wrote:
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index df84c6bfbb2d..4ff3800e8e8e 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -3810,6 +3833,7 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
> > u64 last_diff = 0;
> > u64 printk_seq;
> > short flags;
> > + bool locked;
> > int cookie;
> > u64 diff;
> > u64 seq;
> > @@ -3819,22 +3843,35 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
> > seq = prb_next_reserve_seq(prb);
> >
> > /* Flush the consoles so that records up to @seq are printed. */
> > - console_lock();
> > - console_unlock();
> > + if (printing_via_unlock) {
> > + console_lock();
> > + console_unlock();
> > + }
> >
> > for (;;) {
> > unsigned long begin_jiffies;
> > unsigned long slept_jiffies;
> >
> > + locked = false;
> > diff = 0;
> >
> > + if (printing_via_unlock) {
> > + /*
> > + * Hold the console_lock to guarantee safe access to
> > + * console->seq. Releasing console_lock flushes more
> > + * records in case @seq is still not printed on all
> > + * usable consoles.
> > + */
> > + console_lock();
> > + locked = true;
> > + }
> > +
> > /*
> > - * Hold the console_lock to guarantee safe access to
> > - * console->seq. Releasing console_lock flushes more
> > - * records in case @seq is still not printed on all
> > - * usable consoles.
> > + * Ensure the compiler does not optimize @locked to be
> > + * @printing_via_unlock since the latter can change at any
> > + * time.
> > */
> > - console_lock();
> > + barrier();
>
> When I originally implemented this, my objective was to force the
> compiler to use a local variable. But to be fully paranoid (which we
> should be) we must also forbid the compiler from being able to do this
> nonsense:
>
> if (printing_via_unlock) {
> console_lock();
> locked = printing_via_unlock;
> }
>
> I suggest replacing the __pr_flush() hunks to be as follows instead
> (i.e. ensure all conditional console lock usage within the loop is using
> the local variable):
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index df84c6bfbb2d..1dbd2a837b67 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3819,22 +3842,34 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
> seq = prb_next_reserve_seq(prb);
>
> /* Flush the consoles so that records up to @seq are printed. */
> - console_lock();
> - console_unlock();
> + if (printing_via_unlock) {
> + console_lock();
> + console_unlock();
> + }
>
> for (;;) {
> unsigned long begin_jiffies;
> unsigned long slept_jiffies;
> -
> - diff = 0;
> + bool use_console_lock = printing_via_unlock;
>
> /*
> - * Hold the console_lock to guarantee safe access to
> - * console->seq. Releasing console_lock flushes more
> - * records in case @seq is still not printed on all
> - * usable consoles.
> + * Ensure the compiler does not optimize @use_console_lock to
> + * be @printing_via_unlock since the latter can change at any
> + * time.
> */
> - console_lock();
> + barrier();

Indeed, this looks better. I have missed this as well.

It seems that using barrier() is more tricky than I expected.
I wonder if it would be better to use WRITE_ONCE()/READ_ONCE().
It would be more clear what we want to achieve. Also it would let
the compiler optimize other things. Not that performance is
important here but...

Otherwise, the patch looks fine. With this change, feel free to use:

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-04-12 14:34:27

by Petr Mladek

Subject: Re: [PATCH printk v4 21/27] printk: Track nbcon consoles

On Wed 2024-04-03 00:17:23, John Ogness wrote:
> Add a global flag @have_nbcon_console to identify if any nbcon
> consoles are registered. This will be used in follow-up commits
> to preserve legacy behavior when no nbcon consoles are registered.
>
> Signed-off-by: John Ogness <[email protected]>

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-04-12 14:39:54

by Petr Mladek

Subject: Re: [PATCH printk v4 22/27] printk: Coordinate direct printing in panic

On Wed 2024-04-03 00:17:24, John Ogness wrote:
> Perform printing by nbcon consoles on the panic CPU from the
> printk() caller context in order to get panic messages printed
> as soon as possible.
>
> If legacy and nbcon consoles are registered, the legacy consoles
> will no longer perform direct printing on the panic CPU until
> after the backtrace has been stored. This will give the safe
> nbcon consoles a chance to print the panic messages before
> allowing the unsafe legacy consoles to print.
>
> If no nbcon consoles are registered, there is no change in
> behavior (i.e. legacy consoles will always attempt to print
> from the printk() caller context).
>
> Signed-off-by: John Ogness <[email protected]>

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-04-12 15:28:07

by Petr Mladek

Subject: Re: [PATCH printk v4 23/27] printk: nbcon: Implement emergency sections

On Wed 2024-04-03 00:17:25, John Ogness wrote:
> From: Thomas Gleixner <[email protected]>
>
> In emergency situations (something has gone wrong but the
> system continues to operate), usually important information
> (such as a backtrace) is generated via printk(). Each
> individual printk record has little meaning. It is the
> collection of printk messages that is most often needed by
> developers and users.
>
> In order to help ensure that the collection of printk messages
> in an emergency situation are all stored to the ringbuffer as
> quickly as possible, disable console output for that CPU while
> it is in the emergency situation. The consoles need to be
> flushed when exiting the emergency situation.
>
> Add per-CPU emergency nesting tracking because an emergency
> can arise while in an emergency situation.
>
> Add functions to mark the beginning and end of emergency
> sections where the urgent messages are generated.
>
> Do not print if the current CPU is in an emergency state.
>
> When exiting all emergency nesting, flush nbcon consoles
> directly using their atomic callback. Legacy consoles are
> triggered for flushing via irq_work because it is not known
> if the context was safe for a trylock on the console lock.
>
> Note that the emergency state is not system-wide. While one CPU
> is in an emergency state, another CPU may continue to print
> console messages.
>
> Co-developed-by: John Ogness <[email protected]>
> Signed-off-by: John Ogness <[email protected]>
> Signed-off-by: Thomas Gleixner (Intel) <[email protected]>
> ---
> include/linux/console.h | 4 ++
> kernel/printk/nbcon.c | 83 +++++++++++++++++++++++++++++++++++++++++
> kernel/printk/printk.c | 13 ++++++-
> 3 files changed, 98 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/console.h b/include/linux/console.h
> index 5f1758891aec..7712e4145164 100644
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -559,10 +559,14 @@ static inline bool console_is_registered(const struct console *con)
> hlist_for_each_entry(con, &console_list, node)
>
> #ifdef CONFIG_PRINTK
> +extern void nbcon_cpu_emergency_enter(void);
> +extern void nbcon_cpu_emergency_exit(void);
> extern bool nbcon_can_proceed(struct nbcon_write_context *wctxt);
> extern bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt);
> extern bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt);
> #else
> +static inline void nbcon_cpu_emergency_enter(void) { }
> +static inline void nbcon_cpu_emergency_exit(void) { }
> static inline bool nbcon_can_proceed(struct nbcon_write_context *wctxt) { return false; }
> static inline bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt) { return false; }
> static inline bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt) { return false; }
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index 47f39402a22b..4c852c2e8d89 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -936,6 +936,29 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
> return nbcon_context_exit_unsafe(ctxt);
> }
>
> +/* Track the nbcon emergency nesting per CPU. */
> +static DEFINE_PER_CPU(unsigned int, nbcon_pcpu_emergency_nesting);
> +static unsigned int early_nbcon_pcpu_emergency_nesting __initdata;
> +
> +/**
> + * nbcon_get_cpu_emergency_nesting - Get the per CPU emergency nesting pointer
> + *
> + * Return: Either a pointer to the per CPU emergency nesting counter of
> + * the current CPU or to the init data during early boot.
> + */
> +static __ref unsigned int *nbcon_get_cpu_emergency_nesting(void)
> +{
> + /*
> + * The value of __printk_percpu_data_ready gets set in normal
> + * context and before SMP initialization. As a result it could
> + * never change while inside an nbcon emergency section.
> + */
> + if (!printk_percpu_data_ready())
> + return &early_nbcon_pcpu_emergency_nesting;
> +
> + return this_cpu_ptr(&nbcon_pcpu_emergency_nesting);
> +}
> +
> /**
> * nbcon_atomic_emit_one - Print one record for an nbcon console using the
> * write_atomic() callback
> @@ -977,9 +1000,15 @@ static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
> */
> enum nbcon_prio nbcon_get_default_prio(void)
> {
> + unsigned int *cpu_emergency_nesting;
> +
> if (this_cpu_in_panic())
> return NBCON_PRIO_PANIC;
>
> + cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
> + if (*cpu_emergency_nesting)
> + return NBCON_PRIO_EMERGENCY;
> +
> return NBCON_PRIO_NORMAL;
> }
>
> @@ -1146,6 +1175,60 @@ void nbcon_atomic_flush_unsafe(void)
> __nbcon_atomic_flush_pending(prb_next_reserve_seq(prb), true);
> }
>
> +/**
> + * nbcon_cpu_emergency_enter - Enter an emergency section where printk()
> + * messages for that CPU are only stored
> + *
> + * Upon exiting the emergency section, all stored messages are flushed.
> + *
> + * Context: Any context. Disables preemption.
> + *
> + * When within an emergency section, no printing occurs on that CPU. This
> + * is to allow all emergency messages to be dumped into the ringbuffer before
> + * flushing the ringbuffer. The actual printing occurs when exiting the
> + * outermost emergency section.
> + */
> +void nbcon_cpu_emergency_enter(void)
> +{
> + unsigned int *cpu_emergency_nesting;
> +
> + preempt_disable();
> +
> + cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
> + (*cpu_emergency_nesting)++;
> +}
> +
> +/**
> + * nbcon_cpu_emergency_exit - Exit an emergency section and flush the
> + * stored messages
> + *
> + * Flushing only occurs when exiting all nesting for the CPU.
> + *
> + * Context: Any context. Enables preemption.
> + */
> +void nbcon_cpu_emergency_exit(void)
> +{
> + unsigned int *cpu_emergency_nesting;
> + bool do_trigger_flush = false;
> +
> + cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
> +
> + WARN_ON_ONCE(*cpu_emergency_nesting == 0);

We should handle the situation safely and make sure that the counter
does not go below zero.

Also, the WARN_ON_ONCE() might be moved after the flush. IMHO, it is
not that important.


> + if (*cpu_emergency_nesting == 1) {
> + nbcon_atomic_flush_pending();
> + do_trigger_flush = true;
> + }

I wanted to reshuffle the code. Then I realized that the messages
are flushed before decrementing the counter, probably on purpose.

> +
> + /* Undo the nesting count of nbcon_cpu_emergency_enter(). */
> + (*cpu_emergency_nesting)--;

I suggest replacing the above code with something like this:

	/*
	 * Flush the messages before enabling preemption to see them ASAP.
	 *
	 * Reduce the risk of a potential softlockup by using the
	 * flush_pending() variant which ignores messages added later.
	 * It is called before decrementing the counter to also avoid
	 * flushing the messages of another nested emergency section.
	 */
	if (*cpu_emergency_nesting == 1) {
		nbcon_atomic_flush_pending();
		do_trigger_flush = true;
	}

	/* The counter is unsigned, so guard the decrement against underflow. */
	if (!WARN_ON_ONCE(*cpu_emergency_nesting == 0))
		(*cpu_emergency_nesting)--;

> +
> + preempt_enable();
>
> + if (do_trigger_flush)
> + printk_trigger_flush();
> +}
> +
> /**
> * nbcon_alloc - Allocate buffers needed by the nbcon console
> * @con: Console to allocate buffers for

With the above change, feel free to use:

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-04-15 13:39:40

by Petr Mladek

Subject: Re: [PATCH printk v4 24/27] panic: Mark emergency section in warn

On Wed 2024-04-03 00:17:26, John Ogness wrote:
> From: Thomas Gleixner <[email protected]>
>
> Mark the full contents of __warn() as an emergency section. In
> this section, the CPU will not perform console output for the
> printk() calls. Instead, a flushing of the console output is
> triggered when exiting the emergency section.
>
> Co-developed-by: John Ogness <[email protected]>
> Signed-off-by: John Ogness <[email protected]>
> Signed-off-by: Thomas Gleixner (Intel) <[email protected]>

Ok, let's try it:

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-04-15 13:40:10

by Petr Mladek

Subject: Re: [PATCH printk v4 26/27] rcu: Mark emergency section in rcu stalls

On Wed 2024-04-03 00:17:28, John Ogness wrote:
> Mark emergency sections wherever multiple lines of
> rcu stall information are generated. In an emergency
> section the CPU will not perform console output for the
> printk() calls. Instead, a flushing of the console
> output is triggered when exiting the emergency section.
> This allows the full message block to be stored as
> quickly as possible in the ringbuffer.
>
> Signed-off-by: John Ogness <[email protected]>

Seems to be on the right location:

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-04-15 13:53:16

by Petr Mladek

Subject: Re: [PATCH printk v4 25/27] panic: Mark emergency section in oops

On Wed 2024-04-03 00:17:27, John Ogness wrote:
> Mark an emergency section beginning with oops_enter() until the
> end of oops_exit(). In this section, the CPU will not perform
> console output for the printk() calls. Instead, a flushing of the
> console output is triggered when exiting the emergency section.
>
> The very end of oops_exit() performs a kmsg_dump(). This is not
> included in the emergency section because it is another
> flushing mechanism that should occur after the consoles have
> been triggered to flush.
>
> Signed-off-by: John Ogness <[email protected]>

Let's try it:

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-04-16 09:51:54

by Petr Mladek

Subject: Re: [PATCH printk v4 27/27] lockdep: Mark emergency sections in lockdep splats

On Wed 2024-04-03 00:17:29, John Ogness wrote:
> Mark emergency sections wherever multiple lines of
> lock debugging output are generated. In an emergency
> section the CPU will not perform console output for the
> printk() calls. Instead, a flushing of the console
> output is triggered when exiting the emergency section.
> This allows the full message block to be stored as
> quickly as possible in the ringbuffer.
>
> Signed-off-by: John Ogness <[email protected]>
> ---
> kernel/locking/lockdep.c | 91 ++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 88 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 151bd3de5936..80cfbe7b340e 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -5150,6 +5211,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
> #endif
> if (unlikely(curr->lockdep_depth >= MAX_LOCK_DEPTH)) {
> debug_locks_off();
> + nbcon_cpu_emergency_enter();
> print_lockdep_off("BUG: MAX_LOCK_DEPTH too low!");
> printk(KERN_DEBUG "depth: %i max: %lu!\n",
> curr->lockdep_depth, MAX_LOCK_DEPTH);
> @@ -5157,6 +5219,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
> lockdep_print_held_locks(current);
> debug_show_all_locks();

My concern is that the dump of all locks would not fit into the
buffer. And it might even trigger softlockup. See below.


> dump_stack();
> + nbcon_cpu_emergency_exit();
>
> return 0;
> }
> @@ -6609,6 +6688,7 @@ void debug_show_all_locks(void)
> pr_warn("INFO: lockdep is turned off.\n");
> return;
> }
> + nbcon_cpu_emergency_enter();
> pr_warn("\nShowing all locks held in the system:\n");
>
> rcu_read_lock();

The code dumping the locks looks like:

for_each_process_thread(g, p) {
if (!p->lockdep_depth)
continue;
lockdep_print_held_locks(p);
touch_nmi_watchdog();
touch_all_softlockup_watchdogs();
}

I see two problems here:

1. I believe that the watchdogs are touched for a reason. And they will
be useless when we print everything in a single emergency context
and emit all messages at the end.

2. The default config LOG_BUF_SHIFT is 17. So the default kernel
buffer size is 128kB. The number of messages is bound by
the number of processes. I am afraid that all messages might
not fit into the buffer.


I see two solutions:

1. Take the emergency context only around single dump:

nbcon_cpu_emergency_enter();
lockdep_print_held_locks(p);
nbcon_cpu_emergency_exit();


2. Explicitly flush the printk buffer here. Something like:

lockdep_print_held_locks(p);
nbcon_cpu_emergency_flush();
touch_nmi_watchdog();
touch_all_softlockup_watchdogs();


, where nbcon_cpu_emergency_flush() would do something like:

/**
 * nbcon_cpu_emergency_flush - Explicitly flush consoles in
 *	the middle of an emergency section
 *
 * Both nbcon and legacy consoles are flushed.
 *
 * It should be used only when there are too many messages printed
 * in emergency context, for example, when printing backtraces of all
 * CPUs or processes. It is typically needed when the watchdogs
 * have to be touched as well.
 */
void nbcon_cpu_emergency_flush(void)
{
	/* The explicit flush is needed only in the emergency context. */
	if (!*nbcon_get_cpu_emergency_nesting())
		return;

	nbcon_atomic_flush_pending();

	if (printing_via_unlock && !in_nmi()) {
		if (console_trylock())
			console_unlock();
	}
}


I like the 2nd solution. The rule is more or less clear: the explicit
flush is needed when the code needs to touch the watchdogs.

Maybe, we should go even further and call the flush directly
from the touch_*_*watchdog() API. It has effect only in emergency
context.

Best Regards,
Petr

2024-04-16 11:18:09

by Peter Zijlstra

Subject: Re: [PATCH printk v4 27/27] lockdep: Mark emergency sections in lockdep splats

On Wed, Apr 03, 2024 at 12:17:29AM +0206, John Ogness wrote:
> Mark emergency sections wherever multiple lines of
> lock debugging output are generated. In an emergency
> section the CPU will not perform console output for the
> printk() calls. Instead, a flushing of the console
> output is triggered when exiting the emergency section.
> This allows the full message block to be stored as
> quickly as possible in the ringbuffer.

I am confused, when in emergency I want the thing to dump everything to
the atomic thing asap.

Storing it all up runs the risk of never getting to the 'complete' point
because we're dead.

2024-04-16 12:53:48

by John Ogness

Subject: Re: [PATCH printk v4 27/27] lockdep: Mark emergency sections in lockdep splats

On 2024-04-16, Peter Zijlstra <[email protected]> wrote:
>> Mark emergency sections wherever multiple lines of
>> lock debugging output are generated. In an emergency
>> section the CPU will not perform console output for the
>> printk() calls. Instead, a flushing of the console
>> output is triggered when exiting the emergency section.
>> This allows the full message block to be stored as
>> quickly as possible in the ringbuffer.
>
> I am confused, when in emergency I want the thing to dump everything to
> the atomic thing asap.

At LPC 2022 we discussed this and agreed that it is more appropriate to
get the full set of emergency printk messages into the ringbuffer
first. This is fast and lockless. Then we can begin the slow process of
printing.

Perhaps I am making these emergency sections too large. Maybe only
certain backtraces should use them. We will need to gain some experience
here.

> Storing it all up runs the risk of never getting to the 'complete' point
> because we're dead.

If no other debugging mechanisms are available to read the ringbuffer
and no other CPUs are alive to handle the printing: yes, we are
dead. But if the machine is dying so quickly, it would probably die
before getting the first line out anyway.

Note that by "dead" we mean an irresponsive system that does not panic.

John

2024-04-16 15:02:20

by John Ogness

Subject: Re: [PATCH printk v4 06/27] printk: nbcon: Add callbacks to synchronize with driver

On 2024-04-09, Petr Mladek <[email protected]> wrote:
>> Console drivers typically must deal with access to the hardware
>> via user input/output (such as an interactive login shell) and
>> output of kernel messages via printk() calls.
>>
>> Follow-up commits require that the printk subsystem is able to
>> synchronize with the driver. Require nbcon consoles to implement
>> two new callbacks (device_lock(), device_unlock()) that will
>> use whatever synchronization mechanism the driver is using for
>> itself (for example, the port lock for uart serial consoles).
>
> We should explain here the bigger picture, see my comment
> of the word "whenever" below.
>
> <proposal>
> Console drivers typically have to deal with access to the hardware
> via user input/output (such as an interactive login shell) and
> output of kernel messages via printk() calls.
>
> They use some classic locking mechanism in most situations. But
> console->write_atomic() callbacks, used by nbcon consoles,
> must be synchronized only by acquiring the console context.
>
> The synchronization via the console context ownership is possible
> only when the console driver is registered. It is when a particular
> device driver is connected with a particular console driver.
>
> The two synchronization mechanisms must be synchronized between
> each other. It is tricky because the console context ownership
> is quite special. It might be taken over by a higher priority
> context. Also CPU migration has to be disabled.
>
> The most tricky part is to (dis)connect these two mechanism during
> the console (un)registration. Let's make it easier by adding two new
> callbacks: device_lock(), device_unlock(). They would allow
> to take the device-specific lock while the device is being
> (un)registered by the related console driver.
>
> For example, these callbacks would lock/unlock the port lock
> for serial port drivers.
> </proposal>

I do not like this proposal. IMHO it is focussing only on nbcon
registration details (a corner case) rather than presenting the big
picture.

Yes, we will use these callbacks during registration/unregistration to
avoid some races, but that is just one usage (and not the primary
one). The main reason these callbacks exist is to provide driver
synchronization when printing.

We want to avoid nbcon console ownership contention whenever
possible. In fact, there should _never_ be nbcon console ownership
contention except in emergency or panic situations.

In the normal case, printk will use the driver-specific locking for
synchronization. Previously this was achieved by implementing the
lock/unlock within the write() callback. But with nbcon consoles that is
not possible because the nbcon ownership must be taken outside of the
write callback:

con->device_lock()
nbcon_acquire()
con->write_atomic() or con->write_thread()
nbcon_release()
con->device_unlock()

How about this for the commit message:

printk: nbcon: Add callbacks to synchronize with driver

Console drivers typically must deal with access to the hardware
via user input/output (such as an interactive login shell) and
output of kernel messages via printk() calls. To provide the
necessary synchronization, usually some driver-specific locking
mechanism is used (for example, the port spinlock for uart
serial consoles).

Until now, usage of this driver-specific locking has been hidden
from the printk-subsystem and implemented within the various
console callbacks. However, for nbcon consoles, it is necessary
that the printk-subsystem uses the driver-specific locking so
that nbcon console ownership can be acquired _after_ the
driver-specific locking has succeeded. This allows for lock
contention to exist on the more context-friendly driver-specific
locking rather than nbcon console ownership (for non-emergency
and non-panic cases).

Require nbcon consoles to implement two new callbacks
(device_lock(), device_unlock()) that will use whatever
synchronization mechanism the driver is using for itself.

John

P.S. I think it makes more sense to use your proposal for the commit
message of the follow-up patch "printk: nbcon: Use driver
synchronization while registering".

2024-04-17 13:03:58

by Petr Mladek

Subject: Re: [PATCH printk v4 06/27] printk: nbcon: Add callbacks to synchronize with driver

On Tue 2024-04-16 17:07:39, John Ogness wrote:
> On 2024-04-09, Petr Mladek <[email protected]> wrote:
> >> Console drivers typically must deal with access to the hardware
> >> via user input/output (such as an interactive login shell) and
> >> output of kernel messages via printk() calls.
> >>
> >> Follow-up commits require that the printk subsystem is able to
> >> synchronize with the driver. Require nbcon consoles to implement
> >> two new callbacks (device_lock(), device_unlock()) that will
> >> use whatever synchronization mechanism the driver is using for
> >> itself (for example, the port lock for uart serial consoles).
> >
> > We should explain here the bigger picture, see my comment
> > of the word "whenever" below.
> >
> > <proposal>
> > Console drivers typically have to deal with access to the hardware
> > via user input/output (such as an interactive login shell) and
> > output of kernel messages via printk() calls.
> >
> > They use some classic locking mechanism in most situations. But
> > console->write_atomic() callbacks, used by nbcon consoles,
> > must be synchronized only by acquiring the console context.
> >
> > The synchronization via the console context ownership is possible
> > only when the console driver is registered. It is when a particular
> > device driver is connected with a particular console driver.
> >
> > The two synchronization mechanisms must be synchronized between
> > each other. It is tricky because the console context ownership
> > is quite special. It might be taken over by a higher priority
> > context. Also CPU migration has to be disabled.
> >
> > The most tricky part is to (dis)connect these two mechanism during
> > the console (un)registration. Let's make it easier by adding two new
> > callbacks: device_lock(), device_unlock(). They would allow
> > to take the device-specific lock while the device is being
> > (un)registered by the related console driver.
> >
> > For example, these callbacks would lock/unlock the port lock
> > for serial port drivers.
> > </proposal>
>
> I do not like this proposal. IMHO it is focussing only on nbcon
> registration details (a corner case) rather than presenting the big
> picture.

Fair enough.

> Yes, we will use these callbacks during registration/unregistration to
> avoid some races, but that is just one usage (and not the primary
> one). The main reason these callbacks exist is to provide driver
> synchronization when printing.
>
> We want to avoid using nbcon console ownership contention whenever
> possible. In fact, there should _never_ be nbcon console owership
> contention except for in emergency or panic situations.
>
> In the normal case, printk will use the driver-specific locking for
> synchronization. Previously this was achieved by implementing the
> lock/unlock within the write() callback. But with nbcon consoles that is
> not possible because the nbcon ownership must be taken outside of the
> write callback:
>
> con->device_lock()
> nbcon_acquire()
> con->write_atomic() or con->write_thread()
> nbcon_release()
> con->device_unlock()

This sounds like a strong requirement. So there should be a strong
reason ;-) I would like to better understand it [*] and
document it in a clear way.

In principle, we do not need to take the full con->device_lock()
around nbcon_acquire() in the normal context. In other words,
when nbcon_acquire() is safe enough in emergency context
then it should be safe enough in the normal context as well.
Otherwise, we would have a problem.

My understanding is that we want to take con->device_lock()
in the normal context from two reasons:

1. This is historical, kind of speculation, and probably
not the real reason.

In the initial RFC, nbcon_try_acquire() allowed to take over
the context with the same priority.

The con->device() taken in the kthread might have prevented
stealing the context by another "random" uart_port_lock() call from
a code which would not continue emitting the messages.

But the current nbcon_try_acquire() implementation does not take
over the context with the same priority. So, this reason is no
longer valid. And it probably never was the real reason anyway.


2. The con->device() defines the context in which nbcon_acquire()
will be taken and con->write_atomic() called to make it
safe against other operations with the device driver.

For example, con->device() from uart serial consoles would
disable interrupts to prevent deadlocks with the serial
port IRQ handlers.

Some other drivers might just need to disable preemption.
And some (future) drivers might even allow to keep
the preemption enabled.


I believe that the 2nd reason is the right one. But it is far
from obvious from the proposed comments.


[*] I am pretty sure that we had the same discussion during
Plumbers 2023. I am sorry I do not recall all the details now.
In any case, this should be explained in public.


> How about this for the commit message:
>
> printk: nbcon: Add callbacks to synchronize with driver
>
> Console drivers typically must deal with access to the hardware
> via user input/output (such as an interactive login shell) and
> output of kernel messages via printk() calls. To provide the
> necessary synchronization, usually some driver-specific locking
> mechanism is used (for example, the port spinlock for uart
> serial consoles).
>
> Until now, usage of this driver-specific locking has been hidden
> from the printk-subsystem and implemented within the various
> console callbacks.

So far, so good.

> However, for nbcon consoles, it is necessary
> that the printk-subsystem uses the driver-specific locking so
> that nbcon console ownership can be acquired _after_ the
> driver-specific locking has succeeded. This allows for lock
> contention to exist on the more context-friendly driver-specific
> locking rather than nbcon console ownership (for non-emergency
> and non-panic cases).

I guess that the part:

This allows for lock contention to exist on the more
context-friendly driver-specific locking

tries to explain the 2nd reason that I have described above.
But it looks too cryptic to me. I have got it only because
I knew what I was looking for.

> Require nbcon consoles to implement two new callbacks
> (device_lock(), device_unlock()) that will use whatever
> synchronization mechanism the driver is using for itself.


Honestly, I still think that we need con->device_lock() primarily
to synchronize the console registration and unregistration.

In all other cases, we only need to know whether we have
to call nbcon_try_acquire() with interrupts disabled or not[**].
And we need to know this for any nbcon_try_acquire() access,
including the emergency context.

Maybe, we should pass this information another way because
we do not want to call con->device_lock() in
nbcon_cpu_emergency_exit().

[**] According to my last findings, we always need to disable
preemption, see
https://lore.kernel.org/r/[email protected]


I still have to shake my head around this. But I would first like
to know whether:

+ You agree that nbcon_try_acquire() always has to be called with
preemption disabled.

+ What do you think about explicitly disabling preemption
in nbcon_try_acquire().

+ If it is acceptable for the big picture. It should be fine for
serial consoles. But I think that graphics consoles wanted to
be preemptive when called in the printk kthread.


I am sure that it will be possible to make nbcon_try_acquire()
preemption-safe but it will need some more magic. Maybe, we could
do it later when really needed. The primary target are the slow
serial consoles anyway.


> P.S. I think it makes more sense to use your proposal for the commit
> message of the follow-up patch "printk: nbcon: Use driver
> synchronization while registering".

Yup, feel free to use it.

Best Regards,
Petr

2024-04-17 14:54:56

by John Ogness

Subject: Re: [PATCH printk v4 06/27] printk: nbcon: Add callbacks to synchronize with driver

On 2024-04-17, Petr Mladek <[email protected]> wrote:
>> We want to avoid nbcon console ownership contention whenever
>> possible. In fact, there should _never_ be nbcon console ownership
>> contention except in emergency or panic situations.
>>
>> In the normal case, printk will use the driver-specific locking for
>> synchronization. Previously this was achieved by implementing the
>> lock/unlock within the write() callback. But with nbcon consoles that
>> is not possible because the nbcon ownership must be taken outside of
>> the write callback:
>>
>> con->device_lock()
>> nbcon_acquire()
>> con->write_atomic() or con->write_thread()
>> nbcon_release()
>> con->device_unlock()
>
> This sounds like a strong requirement. So there should be a strong
> reason

There is: PREEMPT_RT

> when nbcon_acquire() is safe enough in emergency context
> then it should be safe enough in the normal context either.
> Otherwise, we would have a problem.

Of course. That is not the issue.

> My understanding is that we want to take con->device_lock()
> in the normal context from two reasons:
>
> 1. This is historical, kind of speculation, and probably
> not the real reason.

Correct. Not a reason.

> 2. The con->device() defines the context in which nbcon_acquire()
> will be taken and con->write_atomic() called to make it
> safe against other operations with the device driver.
>
> For example, con->device() from uart serial consoles would
> disable interrupts to prevent deadlocks with the serial
> port IRQ handlers.
>
> Some other drivers might just need to disable preemption.
> And some (future) drivers might even allow to keep
> the preemption enabled.

(Side note: In PREEMPT_RT, all drivers keep preemption enabled.)

This 2nd reason is almost correct.

Drivers are implemented using their own locking mechanisms. For UART it
will be spinlocks. For VT it will be mutexes. Whatever these mechanisms
are, that is what printk also wants to use. And since (for the normal
case) printk will always print console messages from task context,
drivers are free to use whatever locking mechanism fits them best. By
using the locking choice of the driver, printk will always do the right
thing and the author of that driver will always be in control of the
context.

Unfortunately printk also needs to deal with the non-normal case
(emergencies, panic, shutdown) when no printing threads are
available. For this (and only for this case) the nbcon console ownership
was introduced. It functions as a special[*] inner lock. This inner lock
will never be contended in the normal case. It exists only so that the
non-normal case can takeover the console for printing.

[*] Special = NMI-safe with priority and handover/takeover semantics.

Generally speaking, driver authors should not be concerned about this
special inner lock. It should be hidden (such as in the port lock
wrapper).

The special lock is interesting _only_ for drivers implementing
write_atomic(). And even then, it is only interesting for the
write_thread() and write_atomic() callback implementations. These
require some special handling to make sure they will print sane output
during handover/takeovers. But no other functions need to be concerned
about it.


> I still have to shake my head around this. But I would first like
> to know whether:
>
> + You agree that nbcon_try_acquire() always have to be called with
> preemption disabled.

No, it must not. PREEMPT_RT requires preemption enabled. That has always
been the core of this whole rework.

> + What do you think about explicitly disabling preemption
> in nbcon_try_acquire().

We cannot do it.

> + If it is acceptable for the big picture. It should be fine for
> serial consoles. But I think that graphics consoles wanted to
> be preemptive when called in the printk kthread.

In PREEMPT_RT, all are preemptive.

> I am sure that it will be possible to make nbcon_try_acquire()
> preemption-safe but it will need some more magic.

I am still investigating why you think it is not safe (as an inner lock
for the normal case). Note that for emergency and panic situations,
preemption _is_ disabled.

John

2024-04-17 21:59:44

by John Ogness

Subject: Re: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in console_flush_all()

On 2024-04-11, Petr Mladek <[email protected]> wrote:
> console_flush_all() is called from console_unlock().
> It might be called in atomic context. But the current
> owner might be theoretically scheduled out.

Nice catch. This problem was introduced in this series after you pointed
out that return value of nbcon_legacy_emit_next_record() did not match
the semantics of console_emit_next_record().

> I see 3 possible solutions:
>
> 1. Enforce that nbcon context can be acquired only with preemtion
> disabled.

Not acceptable.

> 2. Enforce that nbcon context can be acquired only with
> interrupts disabled. It would prevent a deadlock when some
> future code interrupts a flush in NBCON_PRIO_EMERGENCY
> context. And then a potential nested console_flush_all()
> won't be able to take over the interrupted NBCON_PRIO_CONTEXT
> and there will be no progress.

Not acceptable.

> 3. console_flush_all() should ignore nbcon console when
> it is not able to get the context, aka no progress.

This was the previous implementation, but as you point out, it is an
issue that console_flush_all() is no longer reliable.

I will continue this topic by responding to your follow-up message.

John

2024-04-17 23:07:39

by John Ogness

Subject: Re: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in console_flush_all()

On 2024-04-11, Petr Mladek <[email protected]> wrote:
> I am trying to make a full picture when and how the nbcon consoles
> will get flushed. My current understanding and view is the following,
> starting from the easiest priority:
>
>
> 1. NBCON_PRIO_PANIC messages will be flushed by calling
> nbcon_atomic_flush_pending() directly in vprintk_emit()
>
> This will take care of any previously added messages.
>
> Non-panic CPUs are not allowed to add messages anymore
> when there is a panic in progress.
>
> [ALL OK]

OK, because the end of panic will perform unsafe takeovers if necessary.

> 2. NBCON_PRIO_EMERGENCY messages will be flushed by calling
> nbcon_atomic_flush_pending() directly in nbcon_cpu_emergency_exit().
>
> This would cover all previously added messages, including
> the ones printed by the code between
> nbcon_cpu_emergency_enter()/exit().

This is best effort. If the console is owned by another context and is
marked unsafe, nbcon_atomic_flush_pending() will do nothing.

[ PROBLEM: In this case, who will flush the emergency messages? ]

> This won't cover later added messages which might be
> a problem. Let's look at this closer. Later added
> messages with:
>
> + NBCON_PRIO_PANIC will be handled in vprintk_emit()
> as explained above [OK]
>
> + NBCON_PRIO_EMERGENCY() will be handled in the
> related nbcon_cpu_emergency_exit() as described here.
> [OK]
>
> + NBCON_PRIO_NORMAL will be handled, see below. [?]
>
> [ PROBLEM: later added NBCON_PRIO_NORMAL messages, see below. ]

Yes, this is also an issue, although the solution may be the same for
this and the above problem.

> 3. NBCON_PRIO_NORMAL messages will be flushed by:
>
> + the printk kthread when it is available
>
> + the legacy loop via
>
> + console_unlock()
> + console_flush_all()
> + console nbcon_legacy_emit_next_record() [PROBLEM]
>
> PROBLEM: console_flush_all() does not guarantee progress with
> nbcon consoles as explained above (previous mail).

Not only this. If there is no kthread available, no printing will occur
until the _next_ printk(), whenever that is.


Above we have listed 3 problems:

- emergency messages will not flush if owned by another context and
unsafe

- normal messages will not flush if owned by another context

- for the above 2 problems, if there is no kthread, nobody will flush
the messages


My question: Is this really a problem?

The main idea behind the rework is that printing is deferred. The
kthreads exist for this. If the kthreads are not available (early boot
or shutdown) or the kthreads are not reliable enough (emergency
messages), a best-safe-effort is made to print in the caller
context. Only the panic situation is designed to force output (unsafely,
if necessary). Is that not enough?

> My proposal:
>
> 1. console_flush_all() will flush nbcon consoles only
> in NBCON_PRIO_NORMAL and when the kthreads are not
> available.
>
> It will make it clear that this is the flusher in
> this situation.

This is the current PREEMPT_RT implementation.

> 2. Allow to skip nbcon consoles in console_flush_all() when
> it can't take the context (as suggested in my previous
> reply).
>
> This won't guarantee flushing NORMAL messages added
> while nbcon_cpu_emergency_exit() calls
> nbcon_atomic_flush_pending().

This was the previous version. And I agree that we need to go back to
that.

> Solve this problem by introducing[*] nbcon_atomic_flush_all()
> which would flush even newly added messages and
> call this in nbcon_cpu_emergency_exit() when the printk
> kthread does not work. It should bail out when there
> is a panic in progress.
>
> Motivation: It does not matter which "atomic" context
> flushes NORMAL/EMERGENCY messages when
> the printk kthread is not available.

I do not think that solves the problem. If the console is in an unsafe
section, nothing can be printed.

> [*] Alternatively we could modify nbcon_atomic_flush_pending()
> to flush even newly added messages when the kthread is
> not working. But it might create another mess.

This discussion is about when kthreads are not available. If this is a
concern, I wonder if maybe in this situation an irq_work should be
triggered upon release of the console.

For example, something like:

static bool flush_pending(struct console *con)
{
        /* If there is a kthread, let it do the work. */
        if (con->kthread)
                return false;

        /* Make sure a record is pending. */
        if (!prb_read_valid(prb, nbcon_seq_read(con), NULL))
                return false;

        return true;
}

static void nbcon_context_release(struct nbcon_context *ctxt)
{
        struct console *con = ctxt->console;

        ...

        /* Trigger irq_work to flush if necessary. */
        if (flush_pending(con))
                defer_console_output();
}

John

2024-04-18 10:34:11

by Petr Mladek

Subject: Re: [PATCH printk v4 06/27] printk: nbcon: Add callbacks to synchronize with driver

On Wed 2024-04-17 17:00:42, John Ogness wrote:
> On 2024-04-17, Petr Mladek <[email protected]> wrote:
> >> We want to avoid using nbcon console ownership contention whenever
> >> possible. In fact, there should _never_ be nbcon console owership
> >> contention except for in emergency or panic situations.
> >>
> >> In the normal case, printk will use the driver-specific locking for
> >> synchronization. Previously this was achieved by implementing the
> >> lock/unlock within the write() callback. But with nbcon consoles that
> >> is not possible because the nbcon ownership must be taken outside of
> >> the write callback:
> >>
> >> con->device_lock()
> >> nbcon_acquire()
> >> con->write_atomic() or con->write_thread()
> >> nbcon_release()
> >> con->device_unlock()
> >
> > This sounds like a strong requirement. So there should be a strong
> > reason
>
> There is: PREEMPT_RT

This explains it!

I think that a lot of the misunderstanding here is caused because
your brain is trained primarily in "RT mode" ;-) while I am not
that familiar with the RT tricks and my brain thinks
in classic preemption mode :-)

I am not sure how it is done in other parts of kernel code where
RT needed to introduce some tricks. But I think that we should
really start mentioning RT behavior in the commit messages
and comments where the RT mode makes huge changes.


> > when nbcon_acquire() is safe enough in emergency context
> > then it should be safe enough in the normal context as well.
> > Otherwise, we would have a problem.
>
> Of course. That is not the issue.
>
> > My understanding is that we want to take con->device_lock()
> > in the normal context from two reasons:
> >
> > 1. This is historical, kind of speculation, and probably
> > not the real reason.
>
> Correct. Not a reason.
>
> > 2. The con->device() defines the context in which nbcon_acquire()
> > will be taken and con->write_atomic() called to make it
> > safe against other operations with the device driver.
> >
> > For example, con->device() from uart serial consoles would
> > disable interrupts to prevent deadlocks with the serial
> > port IRQ handlers.
> >
> > Some other drivers might just need to disable preemption.
> > And some (future) drivers might even allow to keep
> > the preemption enabled.
>
> (Side note: In PREEMPT_RT, all drivers keep preemption enabled.)

This explains everything. It is a huge game changer.

Sigh, I remember that you told me this at Plumbers. But my
non-RT-trained brain forgot this "detail". Well, I hope that
I am not the only one; we should mention this in the comments.

> > I still have to shake my head around this. But I would first like
> > to know whether:
> >
> > + You agree that nbcon_try_acquire() always has to be called with
> > preemption disabled.
>
> No, it must not. PREEMPT_RT requires preemption enabled. That has always
> been the core of this whole rework.

Got it! I had completely forgotten that spin_lock() is a mutex in RT.

> > + What do you think about explicitly disabling preemption
> > in nbcon_try_acquire().
>
> We cannot do it.
>
> > + If it is acceptable for the big picture. It should be fine for
> > serial consoles. But I think that graphics consoles wanted to
> > be preemptive when called in the printk kthread.
>
> In PREEMPT_RT, all are preemptive.
>
> > I am sure that it will be possible to make nbcon_try_acquire()
> > preemption-safe but it will need some more magic.
>
> I am still investigating why you think it is not safe (as an inner lock
> for the normal case). Note that for emergency and panic situations,
> preemption _is_ disabled.

The race scenario has been mentioned in
https://lore.kernel.org/r/[email protected]

CPU0                                    CPU1

[ task A ]

nbcon_context_try_acquire()
  # success with NORMAL prio
  # .unsafe == false;  // safe for takeover

[ schedule: task A -> B ]

                                        WARN_ON()
                                        nbcon_atomic_flush_pending()
                                        nbcon_context_try_acquire()
                                        # success with EMERGENCY prio
                                        # .unsafe == false;  // safe for takeover

                                        # flushing
                                        nbcon_context_release()

                                        # HERE: con->nbcon_state is free
                                        #       to take by anyone !!!

[ task B ]

nbcon_context_try_acquire()
  # success with NORMAL prio
  # .unsafe == false;  // safe for takeover

[ schedule: task B -> A ]

[ task A ]

nbcon_enter_unsafe()
  nbcon_context_can_proceed()

BUG: nbcon_context_can_proceed() returns "true" because
the console is owned by a context on CPU0 with
NBCON_PRIO_NORMAL.

But it should return "false". The console is owned
by a context from task B and we do the check
in a context from task A.


OK, let's look at it with the new RT perspective. Here, the
con->device_lock() plays an important role.

The race could NOT happen in:

+ NBCON_PRIO_PANIC context because it does not schedule

+ NBCON_PRIO_EMERGENCY context because we explicitly disable preemption there

+ NBCON_PRIO_NORMAL context when we ALWAYS do nbcon_try_acquire() under
the con->device() lock. Here the con->device_lock() serializes
nbcon_try_acquire() calls even between running tasks.


Everything makes sense now. And we are probably safe.

I have to double check that we really ALWAYS call nbcon_try_acquire()
under con->device() lock. And I have to think how to describe this
in the commit messages and comments.

Best Regards,
Petr

2024-04-18 12:10:35

by John Ogness

Subject: Re: [PATCH printk v4 06/27] printk: nbcon: Add callbacks to synchronize with driver

On 2024-04-18, Petr Mladek <[email protected]> wrote:
> I am not sure how it is done in other parts of kernel code where
> RT needed to introduce some tricks. But I think that we should
> really start mentioning RT behavior in the commit messages
> and comments where the RT mode makes huge changes.

Yes, our motivation is RT. But these semantics are not RT-specific. They
apply to the general kernel locking model. For example, even for a !RT
system, it is semantically incorrect to take a spin_lock while holding a
raw_spin_lock.

In the full PREEMPT_RT series I have tried to be careful about only
mentioning PREEMPT_RT when it is really PREEMPT_RT-specific. For example
[0][1][2].

[0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?h=linux-6.9.y-rt-rebase&id=1564af55a92c32fe215af35cf55cb9359c5fff30

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?h=linux-6.9.y-rt-rebase&id=033b416ad25b17dc60d5f71c1a0b33a5fbc17639

[2] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?h=linux-6.9.y-rt-rebase&id=7929ba9e5c110148a1fcd8bd93d6a4eff37aa265

> The race could NOT happen in:
>
> + NBCON_PRIO_PANIC context because it does not schedule

Yes.

> + NBCON_PRIO_EMERGENCY context because we explicitly disable
> preemption there

Yes.

> + NBCON_PRIO_NORMAL context when we ALWAYS do nbcon_try_acquire()
> under con->device() lock. Here the con->device_lock() serializes
> nbcon_try_acquire() calls even between running tasks.

The nbcon_legacy_emit_next_record() printing as NBCON_PRIO_NORMAL is a
special situation where write_atomic() is used. It is safe because it
disables hard interrupts and is never called from NMI context.

nbcon_atomic_flush_pending() as NBCON_PRIO_NORMAL is safe in !NMI
because it also disables hard interrupts. However,
nbcon_atomic_flush_pending() could be called in NMI with
NBCON_PRIO_NORMAL. I need to think about this case.

John

2024-04-18 12:49:42

by Petr Mladek

Subject: Re: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in console_flush_all()

On Thu 2024-04-18 01:11:59, John Ogness wrote:
> On 2024-04-11, Petr Mladek <[email protected]> wrote:
> > I am trying to make a full picture when and how the nbcon consoles
> > will get flushed. My current understanding and view is the following,
> > starting from the easiest priority:
> >
> >
> > 1. NBCON_PRIO_PANIC messages will be flushed by calling
> > nbcon_atomic_flush_pending() directly in vprintk_emit()
> >
> > This will take care of any previously added messages.
> >
> > Non-panic CPUs are not allowed to add messages anymore
> > when there is a panic in progress.
> >
> > [ALL OK]
>
> OK, because the end of panic will perform unsafe takeovers if necessary.
>
> > 2. NBCON_PRIO_EMERGENCY messages will be flushed by calling
> > nbcon_atomic_flush_pending() directly in nbcon_cpu_emergency_exit().
> >
> > This would cover all previously added messages, including
> > the ones printed by the code between
> > nbcon_cpu_emergency_enter()/exit().
>
> This is best effort. If the console is owned by another context and is
> marked unsafe, nbcon_atomic_flush_pending() will do nothing.
>
> [ PROBLEM: In this case, who will flush the emergency messages? ]

They should get flushed by the current owner while the system is still
living. Or the system is heading into panic() and the messages would
be emitted by the final unsafe flush.

IMPORTANT: The optimistic scenario would work only when the current
owner really flushes everything. More on this below.


> > This won't cover later added messages which might be
> > a problem. Let's look at this closer. Later added
> > messages with:
> >
> > + NBCON_PRIO_PANIC will be handled in vprintk_emit()
> > as explained above [OK]
> >
> > + NBCON_PRIO_EMERGENCY will be handled in the
> > related nbcon_cpu_emergency_exit() as described here.
> > [OK]
> >
> > + NBCON_PRIO_NORMAL will be handled, see below. [?]
> >
> > [ PROBLEM: later added NBCON_PRIO_NORMAL messages, see below. ]
>
> Yes, this is also an issue, although the solution may be the same for
> this and the above problem.

This is a bit different. There was an existing console owner in the
above scenario. In this case, the code relies on the printk kthread.
But we need a solution for situations when the kthread is not working,
e.g. early boot.


> > 3. NBCON_PRIO_NORMAL messages will be flushed by:
> >
> > + the printk kthread when it is available
> >
> > + the legacy loop via
> >
> > + console_unlock()
> > + console_flush_all()
> > + console nbcon_legacy_emit_next_record() [PROBLEM]
> >
> > PROBLEM: console_flush_all() does not guarantee progress with
> > nbcon consoles as explained above (previous mail).
>
> Not only this. If there is no kthread available, no printing will occur
> until the _next_ printk(), whenever that is.

> Above we have listed 3 problems:
>
> - emergency messages will not flush if owned by another context and
> unsafe
>
> - normal messages will not flush if owned by another context
>
> - for the above 2 problems, if there is no kthread, nobody will flush
> the messages

All this comes down to the question of who would flush the "ignored"
messages when the system continues working in "normal" mode.


> My question: Is this really a problem?

IMHO, it is. For example, early boot consoles exist for a reason.
People want to debug early boot problems using printk.
We should not reduce the reliability too much by introducing kthreads.

Later update: It is basically only about early boot debugging.

The kthreads should always be created later. And
we assume that they work, except for the emergency
and panic context.

So, this is not a problem as long as the boot consoles
are using the legacy code paths.

Well, I guess that they might use the "write_atomic()"
callback in the future. And then this "bug" might hurt.


> The main idea behind the rework is that printing is deferred. The
> kthreads exist for this. If the kthreads are not available (early boot
> or shutdown) or the kthreads are not reliable enough (emergency
> messages), a best-safe-effort is made to print in the caller
> context. Only the panic situation is designed to force output (unsafely,
> if necessary). Is that not enough?

Simple answer: No, primarily because of the early boot behavior.

Longer answer: I tried to separate all the variants and point out
each particular problem. The above paragraph mixes everything
into a "wave this away" proposal.


> > My proposal:
> >
> > 1. console_flush_all() will flush nbcon consoles only
> > in NBCON_PRIO_NORMAL and when the kthreads are not
> > available.
> >
> > It will make it clear that this is the flusher in
> > this situation.
>
> This is the current PREEMPT_RT implementation.
>
> > 2. Allow to skip nbcon consoles in console_flush_all() when
> > it can't take the context (as suggested in my previous
> > reply).
> >
> > This won't guarantee flushing NORMAL messages added
> > while nbcon_cpu_emergency_exit() calls
> > nbcon_atomic_flush_pending().
>
> This was the previous version. And I agree that we need to go back to
> that.
>
> > Solve this problem by introducing[*] nbcon_atomic_flush_all()
> > which would flush even newly added messages and
> > call this in nbcon_cpu_emergency_exit() when the printk
> > kthread does not work. It should bail out when there
> > is a panic in progress.
> >
> > Motivation: It does not matter which "atomic" context
> > flushes NORMAL/EMERGENCY messages when
> > the printk kthread is not available.
>
> I do not think that solves the problem. If the console is in an unsafe
> section, nothing can be printed.

IMHO, it solves the problem.

The idea is simple:

"The current nbcon console owner will be responsible for flushing
all messages when the printk kthread does not exist."

The proof is more complicated:

1. Let's put aside panic. We already do the best effort there.

2. Emergency mode currently violates the rule because
nbcon_atomic_flush_pending() ignores it.

=> FIX: improve nbcon_cpu_emergency_exit() to flush
all messages when kthreads are not ready.


3. Normal mode flushes nbcon consoles via
nbcon_legacy_emit_next_record() from console_unlock()
before the kthreads are started.

It is not reliable when nbcon_try_acquire() fails.
But it would fail only when there is another user.
The other owner might be:

+ panic: will handle everything

+ emergency: should flush everything [*]

+ normal: can't happen because of con->device() lock.

=> The only remaining problem is to fix nbcon_atomic_flush_pending()
to flush everything when the kthreads are not started yet.


> > [*] Alternatively we could modify nbcon_atomic_flush_pending()
> > to flush even newly added messages when the kthread is
> > not working. But it might create another mess.
>
> This discussion is about when kthreads are not available. If this is a
> concern, I wonder if maybe in this situation an irq_work should be
> triggered upon release of the console.

Triggering an irq_work would solve the problem as well. It would move
the flushing to a context with "normal" priority which is serialized
by con->device_lock(). It works for me.

Does this make any sense?

It is possible that you already knew all this. And it is possible
that you did not see it as a problem because there was no plan
to convert boot consoles to the nbcon variant. Maybe it does
not even make sense because boot consoles could not use
kthreads. The only motivation would be code sharing and
simplification of the legacy loop, but that is a faraway dream.

Sigh, all this is so complicated. I wonder how to document
this so that other people do not have to discover these
dependencies again and again. Is it even possible?

Well, I still think that it makes sense to improve
nbcon_cpu_emergency_exit() to fill the potential hole.
And ideally mention all these details somewhere
(commit message, comments, Documentation/...)

Best Regards,
Petr

2024-04-18 15:04:05

by Petr Mladek

Subject: Re: [PATCH printk v4 06/27] printk: nbcon: Add callbacks to synchronize with driver

On Thu 2024-04-18 14:16:16, John Ogness wrote:
> On 2024-04-18, Petr Mladek <[email protected]> wrote:
> > I am not sure how it is done in other parts of kernel code where
> > RT needed to introduce some tricks. But I think that we should
> really start mentioning RT behavior in the commit messages
> and comments where the RT mode makes huge changes.
>
> Yes, our motivation is RT. But these semantics are not RT-specific. They
> apply to the general kernel locking model.

Yes, but RT is a nice example where it is clear what we want to achieve.
IMHO, a clear example is always better than a scientific formulation
where every word might be important. Especially when different people
might understand some words in different ways.


> For example, even for a !RT system, it is semantically incorrect to
> take a spin_lock while holding a raw_spin_lock.

Really? I am not aware of it. I know that lockdep complains even
in a non-RT configuration. But I had expected that it only helps
to catch potential problems for when the same code is used with
RT enabled.

Is there any difference between spin_lock() and raw_spin_lock()
when RT is disabled? I do not see any. This is from
include/linux/spinlock.h:

/* Non PREEMPT_RT kernel, map to raw spinlocks: */
#ifndef CONFIG_PREEMPT_RT
[...]
static __always_inline void spin_lock(spinlock_t *lock)
{
        raw_spin_lock(&lock->rlock);
}

Would the raw_spin_lock() API even exist without CONFIG_PREEMPT_RT?

Maybe you do not understand what I am suggesting. Let's talk about
particular comments in the code.


> In the full PREEMPT_RT series I have tried to be careful about only
> mentioning PREEMPT_RT when it is really PREEMPT_RT-specific. For example
> [0][1][2].
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?h=linux-6.9.y-rt-rebase&id=1564af55a92c32fe215af35cf55cb9359c5fff30
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?h=linux-6.9.y-rt-rebase&id=033b416ad25b17dc60d5f71c1a0b33a5fbc17639
>
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?h=linux-6.9.y-rt-rebase&id=7929ba9e5c110148a1fcd8bd93d6a4eff37aa265
>
> > The race could NOT happen in:
> >
> > + NBCON_PRIO_PANIC context because it does not schedule
>
> Yes.
>
> > + NBCON_PRIO_EMERGENCY context because we explicitly disable
> > preemption there
>
> Yes.
>
> > + NBCON_PRIO_NORMAL context when we ALWAYS do nbcon_try_acquire()
> > under con->device() lock. Here the con->device_lock() serializes
> > nbcon_try_acquire() calls even between running tasks.
>
> The nbcon_legacy_emit_next_record() printing as NBCON_PRIO_NORMAL is a
> special situation where write_atomic() is used. It is safe because it
> disables hard interrupts and is never called from NMI context.
>
> nbcon_atomic_flush_pending() as NBCON_PRIO_NORMAL is safe in !NMI
> because it also disables hard interrupts. However,
> nbcon_atomic_flush_pending() could be called in NMI with
> NBCON_PRIO_NORMAL. I need to think about this case.

It is safe. The race scenario requires _double_ scheduling (A->B->A):

1. [CPU 0] Process A acquires the context and is scheduled out.

2. [CPU 1] The nbcon context is taken over and released in emergency.

3. [CPU 0] Process B acquires the context and is scheduled out.

4. [CPU 0] Process A thinks that it still owns the context
   and continues where it left off.


This could not happen with the current code when:

+ nbcon_try_acquire() is serialized by con->device_lock()
because process B would get blocked on this lock.

+ nbcon_try_acquire() is called in atomic context
because the context is always released before scheduling.


I would say that this is far from obvious and we really need
to document this somehow. I would mention these details above
nbcon_context_try_acquire().

Best Regards,
Petr

2024-04-18 21:45:13

by John Ogness

Subject: Re: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in console_flush_all()

On 2024-04-18, Petr Mladek <[email protected]> wrote:
>> > Solve this problem by introducing[*] nbcon_atomic_flush_all()
>> > which would flush even newly added messages and
>> > call this in nbcon_cpu_emergency_exit() when the printk
>> > kthread does not work. It should bail out when there
>> > is a panic in progress.
>> >
>> > Motivation: It does not matter which "atomic" context
>> > flushes NORMAL/EMERGENCY messages when
>> > the printk kthread is not available.
>>
>> I do not think that solves the problem. If the console is in an unsafe
>> section, nothing can be printed.
>
> IMHO, it solves the problem.
>
> The idea is simple:
>
> "The current nbcon console owner will be responsible for flushing
> all messages when the printk kthread does not exist."

Currently this does not hold. It assumes owners are printers. That is
not always true. The owner might be some task modifying the baud rate,
which has nothing to do with printing.

> The proof is more complicated:
>
> 1. Let's put aside panic. We already do the best effort there.
>
> 2. Emergency mode currently violates the rule because
> nbcon_atomic_flush_pending() ignores the simple rule.
>
> => FIX: improve nbcon_cpu_emergency_exit() to flush
> all messages when kthreads are not ready.

Emergency mode cannot flush _anything_ if there is an owner in an unsafe
region. (And that owner may not be a printer.)

> 3. Normal mode flushes nbcon consoles via
> nbcon_legacy_emit_next_record() from console_unlock()
> before the kthreads are started.
>
> It is not reliable when nbcon_try_acquire() fails.
> But it would fail only when there is another user.
> The other owner might be:
>
> + panic: will handle everything
>
> + emergency: should flush everything [*]
>
> + normal: can't happen because of con->device() lock.

As the code is now, "normal" does not imply con->device() lock. When
using con->write_atomic(), we do not (and can not) use the con->device()
lock. So normal _can_ fail in nbcon_legacy_emit_next_record() if another
CPU is adjusting the baud rate. This is why I said the problem with
"emergency" is basically the same problem as "normal" (WRT using
write_atomic()).

> => The only remaining problem is to fix nbcon_atomic_flush_pending()
> to flush everything when the kthreads are not started yet.

I think my proposed change handles it better. I have been doing various
tests and also adjusted it a bit.

The reason the flushing fails is because another context owns the
console. So I think it makes sense for that owner to handle flushing
responsibility when releasing ownership (even if that context was just
changing the baud rate).

[ Keep in mind we are only talking about printing via write_atomic()
when the kthread is not available. ]

If the current owner is a normal printing context, it will print to
completion anyway (via console_flush_all()).

If the current owner is an emergency printing context, it will only
print the emergency messages (as PRIO_EMERGENCY). However, when it
releases ownership, it could flush the remaining records (as
PRIO_NORMAL) in the same fashion as console_flush_all() does it.

If the current owner is a non-printing context, when it releases
ownership, it could flush the remaining records (as PRIO_NORMAL) in the
same fashion as console_flush_all() does it.

So what I am proposing is that we add two new normal-flushing sites that
are only used when the kthread is not available:

1. after exiting emergency mode: in nbcon_cpu_emergency_exit()

2. after releasing ownership for non-printing: in nbcon_driver_release()

I think this will close the gap and it does not require irq_work.

> Sigh, all this is so complicated. I wonder how to document
> this so that other people do not have to discover these
> dependencies again and again. Is it even possible?

In the end we will have the following 5 scenarios (assuming my
proposal):

1. PRIO_NORMAL non-printing activity, always under con->device() lock,
upon release flushes[*] full backlog

2. PRIO_NORMAL printing using write_thread(), always called from task
context and under con->device() lock, always flushes full backlog

3. PRIO_NORMAL printing using write_atomic(), called with hardware
interrupts disabled, always flushes full backlog, (only used when the
kthread is not available)

4. PRIO_EMERGENCY printing using write_atomic(), called with hardware
interrupts disabled, flushes through emergency, upon exit flushes[*]
full backlog

5. PRIO_PANIC printing using write_atomic(), called with hardware
interrupts disabled, flushes full backlog

[*] Only when the kthread is not available. And in that case #3 is the
scenario used for the trailing full flushing by #1 and #4.


Full flushing is attempted in all 5 scenarios. A failed attempt means
there is a new owner, but that owner is also one of the 5 scenarios.

Am I missing something?

IMHO #3 is the only bizarre scenario. It exists only to cover when the
kthread is not available.

John

2024-04-19 09:59:22

by Petr Mladek

Subject: Re: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in console_flush_all()

On Thu 2024-04-18 23:51:01, John Ogness wrote:
> On 2024-04-18, Petr Mladek <[email protected]> wrote:
> >> > Solve this problem by introducing[*] nbcon_atomic_flush_all()
> >> > which would flush even newly added messages and
> >> > call this in nbcon_cpu_emergency_exit() when the printk
> >> > kthread does not work. It should bail out when there
> >> > is a panic in progress.
> >> >
> >> > Motivation: It does not matter which "atomic" context
> >> > flushes NORMAL/EMERGENCY messages when
> >> > the printk kthread is not available.
> >>
> >> I do not think that solves the problem. If the console is in an unsafe
> >> section, nothing can be printed.
> >
> > IMHO, it solves the problem.
> >
> > The idea is simple:
> >
> > "The current nbcon console owner will be responsible for flushing
> > all messages when the printk kthread does not exist."
>
> Currently this is not valid. It assumes owners are printers. That is not
> always true. The owner might be some task modifying the baud rate and
> has nothing to do with printing.

Ah, I have missed this scenario.

> So what I am proposing is that we add two new normal-flushing sites that
> are only used when the kthread is not available:
>
> 1. after exiting emergency mode: in nbcon_cpu_emergency_exit()
>
> 2. after releasing ownership for non-printing: in nbcon_driver_release()
>
> I think this will close the gap and it does not require irq_work.
>
> > Sigh, all this is so complicated. I wonder how to document
> > this so that other people do not have to discover these
> > dependencies again and again. Is it even possible?
>
> In the end we will have the following 5 scenarios (assuming my
> proposal):
>
> 1. PRIO_NORMAL non-printing activity, always under con->device() lock,
> upon release flushes[*] full backlog
>
> 2. PRIO_NORMAL printing using write_thread(), always called from task
> context and under con->device() lock, always flushes full backlog
>
> 3. PRIO_NORMAL printing using write_atomic(), called with hardware
> interrupts disabled, always flushes full backlog, (only used when the
> kthread is not available)
>
> 4. PRIO_EMERGENCY printing using write_atomic(), called with hardware
> interrupts disabled, flushes through emergency, upon exit flushes[*]
> full backlog
>
> 5. PRIO_PANIC printing using write_atomic(), called with hardware
> interrupts disabled, flushes full backlog
>
> [*] Only when the kthread is not available. And in that case #3 is the
> scenario used for the trailing full flushing by #1 and #4.
>
>
> Full flushing is attempted in all 5 scenarios. A failed attempt means
> there is a new owner, but that owner is also one of the 5 scenarios.
>
> Am I missing something?
>
> IMHO #3 is the only bizarre scenario. It exists only to cover when the
> kthread is not available.

Great summary! I like it.

Let me try another summary:

We basically repeat the same trick that was used in the legacy
console_unlock(): when the kthread is not available then
the current owner is responsible for flushing everything.

The game changer is the kthread. When it is available then
it takes care of "everything" as long as the system is
working normally.

The system is working normally until suspend, shutdown, or panic().
It might also be sick, in which case we try to flush the doctor's
report immediately (emergency, when safe). But we wait for
the entire doctor's report before flushing (publishing).


Or another look. We have the following rules:

1. kthread is not available => the owner flushes everything

2. kthread is available:

a) Normal messages are offloaded to the kthread (store + kick)

b) Emergency messages (the doctor's examination) are first stored and
then we try to flush them immediately (when possible/safe),
including all previous messages (other up-to-date notes).

The emergency messages might also be split into more
parts when the report is too long, for example,
backtraces or lock dependencies of all tasks
(reports from more doctor specialists). In this case,
each part (report) is flushed immediately (when possible/safe),
including all previous messages (other up-to-date notes).

When the system tries to continue normally (not dying)
then any later messages (post-exam notes) are
left to the kthread (secretary) to flush.

c) Panic messages are flushed immediately (when possible/safe),
including all previous messages (other up-to-date notes).

d) The final flush in panic() will be done even when not safe
(the patient will try to read the diary even when feeling
dizzy and it might be a complete fiasco, but it is
the very last chance).


In short, go ahead with your proposal.

Best Regards,
Petr