2023-09-20 05:02:14

by John Ogness

Subject: [PATCH printk v2 00/11] wire up nbcon atomic printing

Hi,

This is v2 of a series to wire up the nbcon consoles so that
they actually perform atomic printing. This series is only a
subset of the original v1 [0]. In particular, this series
represents patches 14 and 16-18 of the v1 series. For
information about the motivation of the atomic consoles,
please read the cover letter of v1.

This series focuses on providing the functionality for marking
atomic printing sections, as well as on wiring up nbcon
atomic printing to the general console_flush_all() function.
This series does _not_ include threaded printing or nbcon
drivers. Those features will be added in separate follow-up
series.

Note that the behavior of atomic printing in priority-elevated
atomic printing sections differs from the legacy consoles. With
atomic printing, the full set of urgent messages
(WARN/OOPS/PANIC) is first stored into the ringbuffer and only
flushed afterwards. This makes sure the full backtraces are
available in the ringbuffer, even if they do not make it out to
the console(s). This is in accordance with what was discussed
at LPC 2022 [1].
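
As a usage sketch (the marking API is added in patch 4; the
actual flush on exit is wired up in patch 6):

   enum nbcon_prio prev_prio;

   /* Begin an urgent section: elevate this CPU's printing priority. */
   prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);

   /* printk() calls here only store records into the ringbuffer. */
   printk(KERN_WARNING "emergency information...\n");

   /*
    * Exiting the outermost section flushes the accumulated
    * records to the nbcon consoles.
    */
   nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, prev_prio);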

A lot has changed since v1 and the patches no longer correlate
1:1. Here is an attempt to list the changes:

- Rather than flushing the full ringbuffer with one nbcon
console before moving to the next nbcon console, adopt the
same flushing strategy as the legacy consoles: rotate
through the consoles, one record at a time.

- Introduce nbcon_atomic_emit_one() to perform the "lock, emit
one record, unlock" pattern. (This is only a helper
function.)

- Introduce nbcon_console_emit_next_record() to act as the
nbcon variant of console_emit_next_record(). This allows
straightforward integration of nbcon consoles into
console_flush_all(). (See the sketch after this list.)

- nbcon_atomic_flush() no longer takes any arguments. These
were awkward for the caller. The context object is now
hidden from the caller.

- Introduce __nbcon_atomic_flush_all() as an internal helper
function in order to hide the ability to attempt unsafe
hostile takeovers from outside nbcon.c.

- For printk_trigger_flush(), migration is disabled instead of
preemption.

- Add atomic write enforcement to oops and lockdep.

- Comments and kerneldoc updated.
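
To illustrate the console_flush_all() integration mentioned
above (that patch is not reproduced in this thread, so the
exact signatures are assumptions), the per-record rotation
roughly takes this shape:

   for_each_console_srcu(con) {
           short flags = console_srcu_read_flags(con);
           bool progress;

           if (!console_is_usable(con, flags))
                   continue;

           /* Emit at most one record per console per rotation. */
           if (flags & CON_NBCON)
                   progress = nbcon_console_emit_next_record(con);
           else
                   progress = console_emit_next_record(con, handover, cookie);

           if (progress)
                   any_progress = true;
   }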

John Ogness

[0] https://lore.kernel.org/lkml/[email protected]

[1] https://lore.kernel.org/lkml/[email protected]

John Ogness (7):
printk: Make console_is_usable() available to nbcon
printk: Let console_is_usable() handle nbcon
printk: Add @flags argument for console_is_usable()
printk: nbcon: Wire up nbcon into console_flush_all()
panic: Add atomic write enforcement to oops
rcu: Add atomic write enforcement for rcu stalls
lockdep: Add atomic write enforcement for lockdep splats

Thomas Gleixner (4):
printk: nbcon: Provide functions to mark atomic write sections
printk: nbcon: Provide function for atomic flushing
printk: nbcon: Wire up nbcon console atomic flushing
panic: Add atomic write enforcement to warn/panic

include/linux/console.h | 4 +
include/linux/printk.h | 6 +
kernel/locking/lockdep.c | 7 ++
kernel/panic.c | 66 +++++++++++
kernel/printk/internal.h | 37 ++++++
kernel/printk/nbcon.c | 246 ++++++++++++++++++++++++++++++++++++++-
kernel/printk/printk.c | 67 +++++------
kernel/rcu/tree_stall.h | 6 +
8 files changed, 396 insertions(+), 43 deletions(-)


base-commit: 9757acd0a700ba4a0d16dde4ba820eb052aba1a7
--
2.39.2


2023-09-20 06:07:32

by John Ogness

Subject: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

Invoke the atomic write enforcement functions for oops to
ensure that the information gets out to the consoles.

Since there is no single general function that calls both
oops_enter() and oops_exit(), the nesting feature of atomic
write sections is taken advantage of in order to guarantee
full coverage between the first oops_enter() and the last
oops_exit().

It is important to note that if there are any legacy consoles
registered, they will be attempting to directly print from the
printk-caller context, which may jeopardize the reliability of
the atomic consoles. Optimally there should be no legacy
consoles registered.

Signed-off-by: John Ogness <[email protected]>
---
kernel/panic.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

diff --git a/kernel/panic.c b/kernel/panic.c
index 86ed71ba8c4d..e2879098645d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -614,6 +614,10 @@ bool oops_may_print(void)
return pause_on_oops_flag == 0;
}

+static atomic_t oops_cpu = ATOMIC_INIT(-1);
+static int oops_nesting;
+static enum nbcon_prio oops_prev_prio;
+
/*
* Called when the architecture enters its oops handler, before it prints
* anything. If this is the first CPU to oops, and it's oopsing the first
@@ -630,6 +634,36 @@ bool oops_may_print(void)
*/
void oops_enter(void)
{
+ enum nbcon_prio prev_prio;
+ int cpu = -1;
+
+ /*
+ * If this turns out to be the first CPU in oops, this is the
+ * beginning of the outermost atomic section. Otherwise it is
+ * the beginning of an inner atomic section.
+ */
+ prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
+
+ if (atomic_try_cmpxchg_relaxed(&oops_cpu, &cpu, smp_processor_id())) {
+ /*
+ * This is the first CPU in oops. Save the outermost
+ * @prev_prio in order to restore it on the outermost
+ * matching oops_exit(), when @oops_nesting == 0.
+ */
+ oops_prev_prio = prev_prio;
+
+ /*
+ * Enter an inner atomic section that ends at the end of this
+ * function. In this case, the nbcon_atomic_enter() above
+ * began the outermost atomic section.
+ */
+ prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
+ }
+
+ /* Track nesting when this CPU is the owner. */
+ if (cpu == -1 || cpu == smp_processor_id())
+ oops_nesting++;
+
tracing_off();
/* can't trust the integrity of the kernel anymore: */
debug_locks_off();
@@ -637,6 +671,9 @@ void oops_enter(void)

if (sysctl_oops_all_cpu_backtrace)
trigger_all_cpu_backtrace();
+
+ /* Exit inner atomic section. */
+ nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, prev_prio);
}

static void print_oops_end_marker(void)
@@ -652,6 +689,18 @@ void oops_exit(void)
{
do_oops_enter_exit();
print_oops_end_marker();
+
+ if (atomic_read(&oops_cpu) == smp_processor_id()) {
+ oops_nesting--;
+ if (oops_nesting == 0) {
+ atomic_set(&oops_cpu, -1);
+
+ /* Exit outermost atomic section. */
+ nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, oops_prev_prio);
+ }
+ }
+ put_cpu();
+
kmsg_dump(KMSG_DUMP_OOPS);
}

--
2.39.2

2023-09-20 07:42:42

by John Ogness

Subject: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

From: Thomas Gleixner <[email protected]>

WARN/OOPS/PANIC require printing out immediately since the
regular printing method (and in the future, the printing
threads) might not be able to run.

Add per-CPU state to denote the priority/urgency of the output
and provide functions to mark the beginning and end of sections
where the urgent messages are generated.

Note that when a CPU is in a priority elevated state, flushing
only occurs when dropping back to a lower priority. This allows
the full set of printk records (WARN/OOPS/PANIC output) to be
stored in the ringbuffer before beginning to flush the backlog.

Co-developed-by: John Ogness <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Signed-off-by: Thomas Gleixner (Intel) <[email protected]>
---
include/linux/console.h | 4 ++
kernel/printk/nbcon.c | 89 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 93 insertions(+)

diff --git a/include/linux/console.h b/include/linux/console.h
index e4fc6f7c1496..25a3ddd39083 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -452,10 +452,14 @@ static inline bool console_is_registered(const struct console *con)
hlist_for_each_entry(con, &console_list, node)

#ifdef CONFIG_PRINTK
+extern enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio);
+extern void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio);
extern bool nbcon_can_proceed(struct nbcon_write_context *wctxt);
extern bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt);
extern bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt);
#else
+static inline enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio) { return NBCON_PRIO_NONE; }
+static inline void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio) { }
static inline bool nbcon_can_proceed(struct nbcon_write_context *wctxt) { return false; }
static inline bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt) { return false; }
static inline bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt) { return false; }
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index b96077152f49..9359906b575b 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -961,6 +961,95 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
return nbcon_context_exit_unsafe(ctxt);
}

+/**
+ * struct nbcon_cpu_state - Per CPU printk context state
+ * @prio: The current context priority level
+ * @nesting: Per priority nest counter
+ */
+struct nbcon_cpu_state {
+ enum nbcon_prio prio;
+ int nesting[NBCON_PRIO_MAX];
+};
+
+static DEFINE_PER_CPU(struct nbcon_cpu_state, nbcon_pcpu_state);
+static struct nbcon_cpu_state early_nbcon_pcpu_state __initdata;
+
+/**
+ * nbcon_get_cpu_state - Get the per CPU console state pointer
+ *
+ * Returns either a pointer to the per CPU state of the current CPU or to
+ * the init data state during early boot.
+ */
+static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
+{
+ if (!printk_percpu_data_ready())
+ return &early_nbcon_pcpu_state;
+
+ return this_cpu_ptr(&nbcon_pcpu_state);
+}
+
+/**
+ * nbcon_atomic_enter - Enter a context that enforces atomic printing
+ * @prio: Priority of the context
+ *
+ * Return: The previous priority that needs to be fed into
+ * the corresponding nbcon_atomic_exit()
+ * Context: Any context. Disables migration.
+ */
+enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio)
+{
+ struct nbcon_cpu_state *cpu_state;
+ enum nbcon_prio prev_prio;
+
+ migrate_disable();
+
+ cpu_state = nbcon_get_cpu_state();
+
+ prev_prio = cpu_state->prio;
+ if (prio > prev_prio)
+ cpu_state->prio = prio;
+
+ /*
+ * Increment the nesting on @cpu_state->prio (instead of
+ * @prio) so that a WARN() nested within a panic printout
+ * does not attempt to scribble state.
+ */
+ cpu_state->nesting[cpu_state->prio]++;
+
+ return prev_prio;
+}
+
+/**
+ * nbcon_atomic_exit - Exit a context that enforces atomic printing
+ * @prio: Priority of the context to leave
+ * @prev_prio: Priority of the previous context for restore
+ *
+ * Context: Any context. Enables migration.
+ *
+ * @prev_prio is the priority returned by the corresponding
+ * nbcon_atomic_enter().
+ */
+void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
+{
+ struct nbcon_cpu_state *cpu_state;
+
+ cpu_state = nbcon_get_cpu_state();
+
+ /*
+ * Undo the nesting of nbcon_atomic_enter() at the CPU state
+ * priority.
+ */
+ cpu_state->nesting[cpu_state->prio]--;
+
+ /*
+ * Restore the previous priority, which was returned by
+ * nbcon_atomic_enter().
+ */
+ cpu_state->prio = prev_prio;
+
+ migrate_enable();
+}
+
/**
* nbcon_alloc - Allocate buffers needed by the nbcon console
* @con: Console to allocate buffers for
--
2.39.2

2023-09-20 08:46:55

by John Ogness

Subject: [PATCH printk v2 05/11] printk: nbcon: Provide function for atomic flushing

From: Thomas Gleixner <[email protected]>

Provide nbcon_atomic_flush() to perform atomic write flushing
of all registered nbcon consoles. Like with legacy consoles,
the nbcon consoles are flushed one record per console. This
allows all nbcon consoles to make progress pseudo-simultaneously,
rather than one console waiting for the full ringbuffer to be
dumped to another console before printing anything.

Note that if the current CPU is in a nested elevated priority
state (EMERGENCY/PANIC), nbcon_atomic_flush() does nothing.
This ensures that if the printing itself generates urgent
messages (OOPS/WARN/PANIC), those messages are fully stored into
the ringbuffer before any printing resumes.

Co-developed-by: John Ogness <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Signed-off-by: Thomas Gleixner (Intel) <[email protected]>
---
include/linux/printk.h | 6 +++
kernel/printk/nbcon.c | 101 ++++++++++++++++++++++++++++++++++++++++-
2 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index 8ef499ab3c1e..58e5f34d6df1 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -192,6 +192,7 @@ void show_regs_print_info(const char *log_lvl);
extern asmlinkage void dump_stack_lvl(const char *log_lvl) __cold;
extern asmlinkage void dump_stack(void) __cold;
void printk_trigger_flush(void);
+extern void nbcon_atomic_flush_all(void);
#else
static inline __printf(1, 0)
int vprintk(const char *s, va_list args)
@@ -271,6 +272,11 @@ static inline void dump_stack(void)
static inline void printk_trigger_flush(void)
{
}
+
+static inline void nbcon_atomic_flush_all(void)
+{
+}
+
#endif

#ifdef CONFIG_SMP
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index 9359906b575b..2a9ff18fc78c 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -571,7 +571,6 @@ static struct printk_buffers panic_nbcon_pbufs;
* in an unsafe state. Otherwise, on success the caller may assume
* the console is not in an unsafe state.
*/
-__maybe_unused
static bool nbcon_context_try_acquire(struct nbcon_context *ctxt)
{
unsigned int cpu = smp_processor_id();
@@ -873,7 +872,6 @@ EXPORT_SYMBOL_GPL(nbcon_exit_unsafe);
* When true is returned, @wctxt->ctxt.backlog indicates whether there are
* still records pending in the ringbuffer,
*/
-__maybe_unused
static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
{
struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
@@ -988,6 +986,105 @@ static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
return this_cpu_ptr(&nbcon_pcpu_state);
}

+/**
+ * nbcon_atomic_emit_one - Print one record for a console in atomic mode
+ * @wctxt: An initialized write context struct to use
+ * for this context
+ *
+ * Returns false if the given console could not print a record or there are
+ * no more records to print, otherwise true.
+ *
+ * This is an internal helper to handle the locking of the console before
+ * calling nbcon_emit_next_record().
+ */
+static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
+{
+ struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
+
+ if (!nbcon_context_try_acquire(ctxt))
+ return false;
+
+ /*
+ * nbcon_emit_next_record() returns false when the console was
+ * handed over or taken over. In both cases the context is no
+ * longer valid.
+ */
+ if (!nbcon_emit_next_record(wctxt))
+ return false;
+
+ nbcon_context_release(ctxt);
+
+ return prb_read_valid(prb, ctxt->seq, NULL);
+}
+
+/**
+ * __nbcon_atomic_flush_all - Flush all nbcon consoles in atomic mode
+ * @allow_unsafe_takeover: True, to allow unsafe hostile takeovers
+ */
+static void __nbcon_atomic_flush_all(bool allow_unsafe_takeover)
+{
+ struct nbcon_write_context wctxt = { };
+ struct nbcon_context *ctxt = &ACCESS_PRIVATE(&wctxt, ctxt);
+ struct nbcon_cpu_state *cpu_state;
+ struct console *con;
+ bool any_progress;
+ int cookie;
+
+ cpu_state = nbcon_get_cpu_state();
+
+ /*
+ * Let the outermost flush of this priority print. This avoids
+ * nasty hackery for nested WARN() where the printing itself
+ * generates one and ensures such nested messages are stored to
+ * the ringbuffer before any printing resumes.
+ *
+ * cpu_state->prio <= NBCON_PRIO_NORMAL is not subject to nesting
+ * and can proceed in order to allow any atomic printing for
+ * regular kernel messages.
+ */
+ if (cpu_state->prio > NBCON_PRIO_NORMAL &&
+ cpu_state->nesting[cpu_state->prio] != 1)
+ return;
+
+ do {
+ any_progress = false;
+
+ cookie = console_srcu_read_lock();
+ for_each_console_srcu(con) {
+ short flags = console_srcu_read_flags(con);
+ bool progress;
+
+ if (!(flags & CON_NBCON))
+ continue;
+
+ if (!console_is_usable(con, flags))
+ continue;
+
+ memset(ctxt, 0, sizeof(*ctxt));
+ ctxt->console = con;
+ ctxt->spinwait_max_us = 2000;
+ ctxt->prio = cpu_state->prio;
+ ctxt->allow_unsafe_takeover = allow_unsafe_takeover;
+
+ progress = nbcon_atomic_emit_one(&wctxt);
+ if (!progress)
+ continue;
+ any_progress = true;
+ }
+ console_srcu_read_unlock(cookie);
+ } while (any_progress);
+}
+
+/**
+ * nbcon_atomic_flush_all - Flush all nbcon consoles in atomic mode
+ *
+ * Context: Any context where migration is disabled.
+ */
+void nbcon_atomic_flush_all(void)
+{
+ __nbcon_atomic_flush_all(false);
+}
+
/**
* nbcon_atomic_enter - Enter a context that enforces atomic printing
* @prio: Priority of the context
--
2.39.2

2023-09-20 09:24:05

by John Ogness

Subject: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

From: Thomas Gleixner <[email protected]>

Perform nbcon console atomic flushing at key call sites:

nbcon_atomic_exit() - When exiting from the outermost atomic
printing section. If the priority was NBCON_PRIO_PANIC,
attempt a second flush allowing unsafe hostile
takeovers.

console_flush_on_panic() - Called from several call sites to
trigger ringbuffer dumping in an urgent situation.

console_flush_on_panic() - Typically the panic() function will
take care of atomic flushing the nbcon consoles on
panic. However, there are several users of
console_flush_on_panic() outside of panic().

printk_trigger_flush() - Used in urgent situations to trigger a
dump in an irq_work context. However, the atomic
flushing part happens in the calling context since an
alternative context is not required.

Co-developed-by: John Ogness <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Signed-off-by: Thomas Gleixner (Intel) <[email protected]>
---
kernel/printk/nbcon.c | 10 ++++++++++
kernel/printk/printk.c | 6 ++++++
2 files changed, 16 insertions(+)

diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index 2a9ff18fc78c..82e6a1678363 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -1092,6 +1092,11 @@ void nbcon_atomic_flush_all(void)
* Return: The previous priority that needs to be fed into
* the corresponding nbcon_atomic_exit()
* Context: Any context. Disables migration.
+ *
+ * When within an atomic printing section, no atomic printing occurs. This
+ * is to allow all emergency messages to be dumped into the ringbuffer before
+ * flushing the ringbuffer. The actual atomic printing occurs when exiting
+ * the outermost atomic printing section.
*/
enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio)
{
@@ -1130,8 +1135,13 @@ void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
{
struct nbcon_cpu_state *cpu_state;

+ __nbcon_atomic_flush_all(false);
+
cpu_state = nbcon_get_cpu_state();

+ if (cpu_state->prio == NBCON_PRIO_PANIC)
+ __nbcon_atomic_flush_all(true);
+
/*
* Undo the nesting of nbcon_atomic_enter() at the CPU state
* priority.
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 6ef33cefa4d0..419c0fabc481 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3159,6 +3159,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
console_srcu_read_unlock(cookie);
}

+ nbcon_atomic_flush_all();
+
console_flush_all(false, &next_seq, &handover);
}

@@ -3903,6 +3905,10 @@ void defer_console_output(void)

void printk_trigger_flush(void)
{
+ migrate_disable();
+ nbcon_atomic_flush_all();
+ migrate_enable();
+
defer_console_output();
}

--
2.39.2

2023-09-20 14:34:33

by John Ogness

Subject: Re: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

On 2023-09-20, Andy Shevchenko <[email protected]> wrote:
> On Wed, Sep 20, 2023 at 01:14:54AM +0206, John Ogness wrote:
>> Invoke the atomic write enforcement functions for oops to
>> ensure that the information gets out to the consoles.
>>
>> Since there is no single general function that calls both
>> oops_enter() and oops_exit(), the nesting feature of atomic
>> write sections is taken advantage of in order to guarantee
>> full coverage between the first oops_enter() and the last
>> oops_exit().
>>
>> It is important to note that if there are any legacy consoles
>> registered, they will be attempting to directly print from the
>> printk-caller context, which may jeopardize the reliability of
>> the atomic consoles. Optimally there should be no legacy
>> consoles registered.
>
> ...
>
>> + if (atomic_read(&oops_cpu) == smp_processor_id()) {
>> + oops_nesting--;
>> + if (oops_nesting == 0) {
>> + atomic_set(&oops_cpu, -1);
>
> Between read and set the variable can change, can't it?

CPU migration is disabled. @oops_cpu contains the CPU ID of the only CPU
that is printing the oops. (Perhaps the variable should be called
"oops_printing_cpu"?)

If this matches smp_processor_id(), then the current CPU is the only one
that is allowed to change it back to -1. So no, if the first condition
is true, it cannot change before atomic_set(). And if the second
condition is true, this is the only CPU+context that is allowed to
change it back to -1.

> If not, why this variable is atomic then? Or, why it's not a problem?
> If the latter is the case, perhaps a comment to explain this?

If not atomic, it would be a data race, since one CPU might be
changing @oops_cpu while another is reading it. For type "int" such a
data race would be fine because it doesn't matter which side of the
race the reader is on: neither value will match the current CPU ID.

The reason that I didn't implement it using cmpxchg(),
data_race(READ_ONCE()), and WRITE_ONCE() is because I once learned that
you should never mix cmpxchg() with READ_ONCE()/WRITE_ONCE() because
there are architectures that do not support cmpxchg() as an atomic
instruction. The answer was always: "use atomic_t instead... that is
what it is for".

But AFAICT for this case it would be fine because obviously cmpxchg()
will not race with itself. And successfully reading a matching CPU ID
means there cannot be any cmpxchg() in progress. And writing only occurs
after seeing a matching CPU ID.

So I can change it from atomic_t to int. Although I do feel like that
might require explanation about why the data race is safe.
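
A sketch of the plain-int variant described above (not the posted
patch):

   static int oops_cpu = -1;

   /* oops_enter(): cmpxchg() cannot race with itself. */
   if (cmpxchg(&oops_cpu, -1, cur_cpu) == -1) {
           /* this CPU is the first in oops */
   }

   /* oops_exit(): */
   /*
    * A plain racy read is fine: a reader on either side of a
    * concurrent update sees a value that does not match its own
    * CPU ID.
    */
   if (data_race(READ_ONCE(oops_cpu)) == smp_processor_id()) {
           oops_nesting--;
           if (oops_nesting == 0)
                   WRITE_ONCE(oops_cpu, -1);
   }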

Or perhaps it is enough just to have something like this:

/**
* oops_printing_cpu - The ID of the CPU responsible for printing the
* OOPS message(s) to the consoles.
*
* This is atomic_t because multiple CPUs can read this variable
* simultaneously when exiting OOPS while another CPU can be
* modifying this variable to begin or end its printing duties.
*/
static atomic_t oops_printing_cpu = ATOMIC_INIT(-1);

John Ogness

2023-09-20 16:20:45

by John Ogness

Subject: Re: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

On 2023-09-20, John Ogness <[email protected]> wrote:
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 86ed71ba8c4d..e2879098645d 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -630,6 +634,36 @@ bool oops_may_print(void)
> */
> void oops_enter(void)
> {
> + enum nbcon_prio prev_prio;
> + int cpu = -1;

oops_exit() calls put_cpu(). Here used to be the corresponding
get_cpu(). Somehow I managed to drop it. Here are the inline fixups.

int cur_cpu = get_cpu();
int old_cpu = -1;

> +
> + /*
> + * If this turns out to be the first CPU in oops, this is the
> + * beginning of the outermost atomic section. Otherwise it is
> + * the beginning of an inner atomic section.
> + */
> + prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
> +
> + if (atomic_try_cmpxchg_relaxed(&oops_cpu, &cpu, smp_processor_id())) {

if (atomic_try_cmpxchg_relaxed(&oops_cpu, &old_cpu, cur_cpu)) {

> + /*
> + * This is the first CPU in oops. Save the outermost
> + * @prev_prio in order to restore it on the outermost
> + * matching oops_exit(), when @oops_nesting == 0.
> + */
> + oops_prev_prio = prev_prio;
> +
> + /*
> + * Enter an inner atomic section that ends at the end of this
> + * function. In this case, the nbcon_atomic_enter() above
> + * began the outermost atomic section.
> + */
> + prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
> + }
> +
> + /* Track nesting when this CPU is the owner. */
> + if (cpu == -1 || cpu == smp_processor_id())

if (old_cpu == -1 || old_cpu == cur_cpu)

> + oops_nesting++;
> +
> tracing_off();
> /* can't trust the integrity of the kernel anymore: */
> debug_locks_off();
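
For readability, this is roughly how the start of oops_enter() looks
with the fixups folded in:

   void oops_enter(void)
   {
           enum nbcon_prio prev_prio;
           int cur_cpu = get_cpu();
           int old_cpu = -1;

           prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);

           if (atomic_try_cmpxchg_relaxed(&oops_cpu, &old_cpu, cur_cpu)) {
                   oops_prev_prio = prev_prio;
                   prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
           }

           /* Track nesting when this CPU is the owner. */
           if (old_cpu == -1 || old_cpu == cur_cpu)
                   oops_nesting++;

           tracing_off();
           /* ... remainder unchanged ... */
   }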

2023-09-20 17:34:49

by Andy Shevchenko

Subject: Re: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

On Wed, Sep 20, 2023 at 01:14:54AM +0206, John Ogness wrote:
> Invoke the atomic write enforcement functions for oops to
> ensure that the information gets out to the consoles.
>
> Since there is no single general function that calls both
> oops_enter() and oops_exit(), the nesting feature of atomic
> write sections is taken advantage of in order to guarantee
> full coverage between the first oops_enter() and the last
> oops_exit().
>
> It is important to note that if there are any legacy consoles
> registered, they will be attempting to directly print from the
> printk-caller context, which may jeopardize the reliability of
> the atomic consoles. Optimally there should be no legacy
> consoles registered.

...

> + if (atomic_read(&oops_cpu) == smp_processor_id()) {
> + oops_nesting--;
> + if (oops_nesting == 0) {
> + atomic_set(&oops_cpu, -1);

Between read and set the variable can change, can't it?
If not, why this variable is atomic then? Or, why it's not a problem?
If the latter is the case, perhaps a comment to explain this?

> + /* Exit outmost atomic section. */
> + nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, oops_prev_prio);
> + }
> + }
> + put_cpu();

--
With Best Regards,
Andy Shevchenko


2023-09-21 04:43:23

by Andy Shevchenko

Subject: Re: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

On Wed, Sep 20, 2023 at 04:26:12PM +0206, John Ogness wrote:
> On 2023-09-20, Andy Shevchenko <[email protected]> wrote:
> > On Wed, Sep 20, 2023 at 01:14:54AM +0206, John Ogness wrote:

...

> >> + if (atomic_read(&oops_cpu) == smp_processor_id()) {
> >> + oops_nesting--;
> >> + if (oops_nesting == 0) {
> >> + atomic_set(&oops_cpu, -1);
> >
> > Between read and set the variable can change, can't it?
>
> CPU migration is disabled. @oops_cpu contains the CPU ID of the only CPU
> that is printing the oops. (Perhaps the variable should be called
> "oops_printing_cpu"?)
>
> If this matches smp_processor_id(), then the current CPU is the only one
> that is allowed to change it back to -1. So no, if the first condition
> is true, it cannot change before atomic_set(). And if the second
> condition is true, this is the only CPU+context that is allowed to
> change it back to -1;
>
> > If not, why this variable is atomic then? Or, why it's not a problem?
> > If the latter is the case, perhaps a comment to explain this?
>
> If not atomic, it will be a data race since one CPU might be changing
> @oops_cpu and another is reading it. For type "int" such a data race
> would be fine because it doesn't matter which side of the race the
> reader was on, both values will not match the current CPU ID.
>
> The reason that I didn't implement it using cmpxchg(),
> data_race(READ_ONCE()), and WRITE_ONCE() is because I once learned that
> you should never mix cmpxchg() with READ_ONCE()/WRITE_ONCE() because
> there are architectures that do not support cmpxchg() as an atomic
> instruction. The answer was always: "use atomic_t instead... that is
> what it is for".
>
> But AFAICT for this case it would be fine because obviously cmpxchg()
> will not race with itself. And successfully reading a matching CPU ID
> means there cannot be any cmpxchg() in progress. And writing only occurs
> after seeing a matching CPU ID.
>
> So I can change it from atomic_t to int. Although I do feel like that
> might require explanation about why the data race is safe.

Either way a comment is needed, but I think the usage of atomic above
is a bit confusing as you see I immediately rose the concern.

> Or perhaps it is enough just to have something like this:
>
> /**
> * oops_printing_cpu - The ID of the CPU responsible for printing the
> * OOPS message(s) to the consoles.
> *
> * This is atomic_t because multiple CPUs can read this variable
> * simultaneously when exiting OOPS while another CPU can be
> * modifying this variable to begin or end its printing duties.
> */
> static atomic_t oops_printing_cpu = ATOMIC_INIT(-1);

--
With Best Regards,
Andy Shevchenko


2023-09-22 09:43:57

by Petr Mladek

Subject: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On Wed 2023-09-20 01:14:49, John Ogness wrote:
> From: Thomas Gleixner <[email protected]>
>
> WARN/OOPS/PANIC require printing out immediately since the
> regular printing method (and in the future, the printing
> threads) might not be able to run.
>
> Add per-CPU state to denote the priority/urgency of the output
> and provide functions to mark the beginning and end of sections
> where the urgent messages are generated.
>
> Note that when a CPU is in a priority elevated state, flushing
> only occurs when dropping back to a lower priority. This allows
> the full set of printk records (WARN/OOPS/PANIC output) to be
> stored in the ringbuffer before beginning to flush the backlog.

The above paragraph is a bit confusing. The code added by this patch
does not do any flushing. I guess that this last paragraph is supposed
to explain why the "nesting" array is needed. I would write
something like:

"The state also counts nesting of printing contexts per-priority.
It will be later used to prevent flushing in nested contexts."

That said, I am not sure if the special handling of nested contexts
is needed. But let's discuss it in the patch introducing the flush
functions.

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -961,6 +961,95 @@ static bool nbcon_emit_next_record(struct nbcon_write_context *wctxt)
> return nbcon_context_exit_unsafe(ctxt);
> }
>
> +/**
> + * struct nbcon_cpu_state - Per CPU printk context state
> + * @prio: The current context priority level
> + * @nesting: Per priority nest counter
> + */
> +struct nbcon_cpu_state {
> + enum nbcon_prio prio;
> + int nesting[NBCON_PRIO_MAX];
> +};
> +
> +static DEFINE_PER_CPU(struct nbcon_cpu_state, nbcon_pcpu_state);
> +static struct nbcon_cpu_state early_nbcon_pcpu_state __initdata;
> +
> +/**
> + * nbcon_get_cpu_state - Get the per CPU console state pointer
> + *
> + * Returns either a pointer to the per CPU state of the current CPU or to
> + * the init data state during early boot.
> + */
> +static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
> +{
> + if (!printk_percpu_data_ready())
> + return &early_nbcon_pcpu_state;

My first thought was that this was racy. I was afraid that
printk_percpu_data_ready() could change its value inside the
atomic_enter()/exit() area. But that actually could not happen.
Anyway, it might be worth a comment. Something like:

/*
* The value of __printk_percpu_data_ready is modified in normal
* context. As a result it could never change inside a nbcon
* atomic context.
*/
if (!printk_percpu_data_ready())
return &early_nbcon_pcpu_state;

> +
> + return this_cpu_ptr(&nbcon_pcpu_state);
> +}
> +
> +/**
> + * nbcon_atomic_exit - Exit a context that enforces atomic printing
> + * @prio: Priority of the context to leave
> + * @prev_prio: Priority of the previous context for restore
> + *
> + * Context: Any context. Enables migration.
> + *
> + * @prev_prio is the priority returned by the corresponding
> + * nbcon_atomic_enter().
> + */
> +void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
> +{
> + struct nbcon_cpu_state *cpu_state;
> +
> + cpu_state = nbcon_get_cpu_state();

I would add a consistency check:

WARN_ON_ONCE(cpu_state->nesting[cpu_state->prio] <= 0)

> + /*
> + * Undo the nesting of nbcon_atomic_enter() at the CPU state
> + * priority.
> + */
> + cpu_state->nesting[cpu_state->prio]--;
> +
> + /*
> + * Restore the previous priority, which was returned by
> + * nbcon_atomic_enter().
> + */
> + cpu_state->prio = prev_prio;
> +
> + migrate_enable();
> +}

Best Regards,
Petr

2023-09-22 18:23:47

by Petr Mladek

Subject: Re: [PATCH printk v2 05/11] printk: nbcon: Provide function for atomic flushing

On Wed 2023-09-20 01:14:50, John Ogness wrote:
> From: Thomas Gleixner <[email protected]>
>
> Provide nbcon_atomic_flush() to perform atomic write flushing
> of all registered nbcon consoles. Like with legacy consoles,
> the nbcon consoles are flushed one record per console. This
> allows all nbcon consoles to generate pseudo-simultaneously,
> rather than one console waiting for the full ringbuffer to
> dump to another console before printing anything.
>
> Note that if the current CPU is in a nested elevated priority
> state (EMERGENCY/PANIC), nbcon_atomic_flush() does nothing.

This confused me a bit. It was not clear to me if it was
"nested and elevated" or "the elevated priority was nested".
Well, it is probably because I am not a native speaker.

I would describe it another way, see below.

> This is in case the printing itself generates urgent messages
> (OOPS/WARN/PANIC), that those messages are fully stored into
> the ringbuffer before any printing resumes.

This feels like it was an advantage. But I would say that it is
a limitation. IMHO, it simply works this way and we should describe
it as a limitation. See below.

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -988,6 +986,105 @@ static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
> return this_cpu_ptr(&nbcon_pcpu_state);
> }
>
> +/**
> + * nbcon_atomic_emit_one - Print one record for a console in atomic mode
> + * @wctxt: An initialized write context struct to use
> + * for this context
> + *
> + * Returns false if the given console could not print a record or there are
> + * no more records to print, otherwise true.
> + *
> + * This is an internal helper to handle the locking of the console before
> + * calling nbcon_emit_next_record().
> + */
> +static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
> +{
> + struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
> +
> + if (!nbcon_context_try_acquire(ctxt))
> + return false;
> +
> + /*
> + * nbcon_emit_next_record() returns false when the console was
> + * handed over or taken over. In both cases the context is no
> + * longer valid.
> + */
> + if (!nbcon_emit_next_record(wctxt))
> + return false;
> +
> + nbcon_context_release(ctxt);
> +
> + return prb_read_valid(prb, ctxt->seq, NULL);

IMHO, it should be enough to check ctxt->backlog. I mean to do:

return !!ctxt->backlog;

We are here only when nbcon_emit_next_record() owned the context and
was able to call printk_get_next_message(). nbcon_emit_next_record()
and nbcon_atomic_emit_one() would work consistently especially
when prb_read_valid() is not called under the console lock here.


> +}
> +
> +/**
> + * __nbcon_atomic_flush_all - Flush all nbcon consoles in atomic mode
> + * @allow_unsafe_takeover: True, to allow unsafe hostile takeovers
> + */
> +static void __nbcon_atomic_flush_all(bool allow_unsafe_takeover)
> +{
> + struct nbcon_write_context wctxt = { };
> + struct nbcon_context *ctxt = &ACCESS_PRIVATE(&wctxt, ctxt);
> + struct nbcon_cpu_state *cpu_state;
> + struct console *con;
> + bool any_progress;
> + int cookie;
> +
> + cpu_state = nbcon_get_cpu_state();
> +
> + /*
> + * Let the outermost flush of this priority print. This avoids
> + * nasty hackery for nested WARN() where the printing itself
> + * generates one and ensures such nested messages are stored to
> + * the ringbuffer before any printing resumes.

It is not clear to me what hackery was meant. The fact is that
only printk_once() or WARN_ONCE() should be used in the console
emit/flush code paths. Any non-once printk might block consoles
and even these nesting checks probably would not help much.

Anyway, I believe that we do not need this nesting counter.
The nesting is already prevented by nbcon_context_try_acquire().
It would not allow to take the nested lock with the same priority.

I guess that this extra nesting counter made some sense only in the earlier
variants of the per-cpu console lock.

I would personally just describe the behavior in the commit message
and in the comment above __nbcon_atomic_flush_all():

* The messages are flushed only when this context is able to
* get the per-console lock. Namely, it works only when the
* lock is free or when this context has a higher priority
* than the current owner.

> + *
> + * cpu_state->prio <= NBCON_PRIO_NORMAL is not subject to nesting
> + * and can proceed in order to allow any atomic printing for
> + * regular kernel messages.
> + */
> + if (cpu_state->prio > NBCON_PRIO_NORMAL &&
> + cpu_state->nesting[cpu_state->prio] != 1)
> + return;
> +
> + do {
> + any_progress = false;
> +
> + cookie = console_srcu_read_lock();
> + for_each_console_srcu(con) {
> + short flags = console_srcu_read_flags(con);
> + bool progress;
> +
> + if (!(flags & CON_NBCON))
> + continue;
> +
> + if (!console_is_usable(con, flags))
> + continue;
> +
> + memset(ctxt, 0, sizeof(*ctxt));
> + ctxt->console = con;
> + ctxt->spinwait_max_us = 2000;
> + ctxt->prio = cpu_state->prio;
> + ctxt->allow_unsafe_takeover = allow_unsafe_takeover;
> +
> + progress = nbcon_atomic_emit_one(&wctxt);
> + if (!progress)
> + continue;
> + any_progress = true;
> + }
> + console_srcu_read_unlock(cookie);
> + } while (any_progress);
> +}
> +
> +/**
> + * nbcon_atomic_flush_all - Flush all nbcon consoles in atomic mode
> + *
> + * Context: Any context where migration is disabled.

We should make it more clear what migration is meant here. For
example:

* Context: Any context which could not be migrated to another CPU.

> + */
> +void nbcon_atomic_flush_all(void)
> +{
> + __nbcon_atomic_flush_all(false);
> +}
> +

Best Regards,
Petr

2023-09-23 03:31:08

by Petr Mladek

Subject: Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

On Wed 2023-09-20 01:14:51, John Ogness wrote:
> From: Thomas Gleixner <[email protected]>
>
> Perform nbcon console atomic flushing at key call sites:
>
> nbcon_atomic_exit() - When exiting from the outermost atomic
> printing section.

IMHO, it would not help because it all depends on
whether nbcon_context_try_acquire() succeeds or not, see below.

> If the priority was NBCON_PRIO_PANIC,
> attempt a second flush allowing unsafe hostile
> takeovers.

I was first scared that this would be called by any printk()
in panic(). But it seems that nbcon_atomic_exit() is called
only once after disabling printk(). It might deserve a comment
where it is supposed to be used. See a proposal below.


> console_flush_on_panic() - Called from several call sites to
> trigger ringbuffer dumping in an urgent situation.
>
> console_flush_on_panic() - Typically the panic() function will

This is a second description of console_flush_on_panic(), which
looks like a mistake.

> take care of atomic flushing the nbcon consoles on
> panic. However, there are several users of
> console_flush_on_panic() outside of panic().

The generic panic() seems to use console_flush_on_panic() correctly
at the very end.

Hmm, I see that console_flush_on_panic() is called also in
arch/loongarch/kernel/reset.c and arch/powerpc/kernel/traps.c.

The loongarch code uses it in halt(). IMHO, it would be
better to use the normal flush. But maybe they accept the risk.
I know nothing about this architecture and who uses it.

The function defined on powerpc is then used in:

+ arch/powerpc/platforms/powernv/opal.c:

IMHO, it should be replaced there by normal flush. It seems
to call the generic panic() later.

+ arch/powerpc/platforms/ps3/setup.c:

This seems to be used as a panic notifier ppc_panic_platform_handler().
I am not completely sure but it does not look like the final flush.

+ arch/powerpc/platforms/pseries/setup.c:

Same as ps3/setup.c. Also "just" a panic notifier.

Anyway, we should make clear that console_flush_on_panic() might break
the system and should be called as the last attempt to flush consoles.
The above arch-specific users are worth review.

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -1092,6 +1092,11 @@ void nbcon_atomic_flush_all(void)
> * Return: The previous priority that needs to be fed into
> * the corresponding nbcon_atomic_exit()
> * Context: Any context. Disables migration.
> + *
> + * When within an atomic printing section, no atomic printing occurs. This
> + * is to allow all emergency messages to be dumped into the ringbuffer before
> + * flushing the ringbuffer.

The comment sounds like it is an advantage. But I think that it would be
a disadvantage.

Fortunately, this is not true. Even the atomic context tries
to flush the messages immediately when it is able to get
the per-cpu lock. It happes via console_flush_all() called
from console_unlock().

> + * The actual atomic printing occurs when exiting
> + * the outermost atomic printing section.
> */
> enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio)
> {
> @@ -1130,8 +1135,13 @@ void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
> {
> struct nbcon_cpu_state *cpu_state;
>
> + __nbcon_atomic_flush_all(false);

IMHO, this would not help. Either this context was able to acquire the
lock and flush each message directly. Or it would fail to get the lock
even here.

> +
> cpu_state = nbcon_get_cpu_state();
>
> + if (cpu_state->prio == NBCON_PRIO_PANIC)
> + __nbcon_atomic_flush_all(true);

It would deserve a comment that nbcon_atomic_exit() is used
in panic() to calm down the consoles completely, similar effect
as setting the variable "suppress_printk".

> +
> /*
> * Undo the nesting of nbcon_atomic_enter() at the CPU state
> * priority.
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 6ef33cefa4d0..419c0fabc481 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3159,6 +3159,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
> console_srcu_read_unlock(cookie);
> }
>
> + nbcon_atomic_flush_all();
> +
> console_flush_all(false, &next_seq, &handover);

It seems that console_flush_all() tries to flush nbcon consoles
as well. And nbcon_atomic_flush_all() is explicitly called
just above this flush_on_panic(). This is from panic()
after this patchset is applied.


void panic(const char *fmt, ...)
{
[...]
nbcon_atomic_flush_all();

console_unblank();

/*
* We may have ended up stopping the CPU holding the lock (in
* smp_send_stop()) while still having some valuable data in the console
* buffer. Try to acquire the lock then release it regardless of the
* result. The release will also print the buffers out. Locks debug
* should be disabled to avoid reporting bad unlock balance when
* panic() is not being callled from OOPS.
*/
debug_locks_off();
console_flush_on_panic(CONSOLE_FLUSH_PENDING);


In other words, we try to flush the nbcon consoles 3 times almost
immediately after each other. I believe that there is a motivation
to do so. Anyway, I want to make sure that it was on purpose.

It would deserve some comment about what the purpose is. Otherwise,
people would tend to remove the "redundant" calls.

> }
>
> @@ -3903,6 +3905,10 @@ void defer_console_output(void)
>
> void printk_trigger_flush(void)
> {
> + migrate_disable();
> + nbcon_atomic_flush_all();
> + migrate_enable();

Can this be actually called in NMI?

> +
> defer_console_output();
> }

This function would deserve some description because it looks strange.
There are three names which contradict each other:

+ trigger_flush: it sounds like it tells someone to do the flush

+ nbcon_atomic_flush_all: does the flush immediately

+ defer_console_output: sounds like it prevents the current context
from flushing the consoles immediately

We should either choose better names and/or add comments:

/**
* console_flush_or_trigger() - Make sure that the consoles will get flushed.
*
* Try to flush consoles when possible or queue flushing consoles like
* in the deferred printk.
*
* Context: Can be used in any context (including NMI?)
*/
void printk_flush_or_trigger(void)
{
/*
* Try to flush consoles which do not depend on console_lock()
* and support safe atomic_write()
*/
migrate_disable();
nbcon_atomic_flush_all();
migrate_enable();

/* Try flushing legacy consoles or queue the deferred handling. */
if (!in_nmi() && console_trylock())
console_unlock();
else
defer_console_output();
}


All in all. I feel a bit overwhelmed by the many *flush*() functions at
the moment. Some flush only nbcon consoles. Some flush all. Some
use only the safe takeover/handover and some also allow a
harsh takeover. Some ignore port->lock because of bust_spinlocks().
Some are ignoring console_lock. They are called on different
locations. The nbcon variants are called explicitly and also
inside the generic *flush*() functions.

It is partly because it is Friday evening. Anyway, I need to think
more about it.

Best Regards,
Petr

2023-09-25 14:31:27

by John Ogness

Subject: Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

On 2023-09-22, Petr Mladek <[email protected]> wrote:
>> Perform nbcon console atomic flushing at key call sites:
>>
>> nbcon_atomic_exit() - When exiting from the outermost atomic
>> printing section.
>
> IMHO, it would not help because it all depends
> whether nbcon_context_try_acquire() succeeds or now, see below.

Sure, but it first attempts a _safe_ flush on all nbcon consoles that
allow it, _then_ it falls back to an _unsafe_ flush (which cannot fail,
but might explode).

>> If the priority was NBCON_PRIO_PANIC,
>> attempt a second flush allowing unsafe hostile
>> takeovers.
>
> I was first scared that this would be called by any printk()
> in panic(). But it seems that nbcon_atomic_exit() is called
> only one after disabling printk(). It might deserve a comment
> where it is supposed to be used. See a proposal below.

OK.

>> console_flush_on_panic() - Called from several call sites to
>> trigger ringbuffer dumping in an urgent situation.
>>
>> console_flush_on_panic() - Typically the panic() function will
>
> This is a second description of console_flush_of_panic() which
> looks like a mistake.

Oops. The first one should not be there.

>> take care of atomic flushing the nbcon consoles on
>> panic. However, there are several users of
>> console_flush_on_panic() outside of panic().
>
> The generic panic() seems to use console_flush_on_panic() correctly
> at the very end.
>
> Hmm, I see that console_flush_on_panic() is called also in

[...]

> Anyway, we should make clear that console_flush_on_panic() might break
> the system and should be called as the last attempt to flush consoles.
> The above arch-specific users are worth review.

In an upcoming series you will see that console_flush_on_panic() only
takes the console lock if there are legacy consoles. Ideally, eventually
there will only be nbcon consoles, so your concern would disappear.

And if those users continue to use legacy consoles, then the risks will
be the same as now.

>> * Return: The previous priority that needs to be fed into
>> * the corresponding nbcon_atomic_exit()
>> * Context: Any context. Disables migration.
>> + *
>> + * When within an atomic printing section, no atomic printing occurs. This
>> + * is to allow all emergency messages to be dumped into the ringbuffer before
>> + * flushing the ringbuffer.
>
> The comment sounds like it is an advantage. But I think that it would be
> a disadvantage.

Please explain. At LPC2022 we agreed it was an advantage and it is
implemented on purpose using the atomic printing sections.

> Fortunately, this is not true. Even the atomic context tries
> to flush the messages immediately when it is able to get
> the per-cpu lock. It happes via console_flush_all() called
> from console_unlock().

You are talking about legacy consoles.

The goal of the nbcon consoles is to introduce a _new_ console type to
support controlled decisions for emergency printing. Legacy consoles
will continue to work (or not work) as before.

>> @@ -1130,8 +1135,13 @@ void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
>> {
>> struct nbcon_cpu_state *cpu_state;
>>
>> + __nbcon_atomic_flush_all(false);
>
> IMHO, this would not help. Either this context was able to acquire the
> lock and flush each message directly.

What do you mean by "help"? We are exiting an atomic printing
section. This is where the accumulated emergency messages are to be
printed. If this is a nested atomic printing section, it does nothing
because the outermost atomic printing section will flush.

> Or it would fail to get the lock even here.

If it fails to get the lock, we are talking about the worst case
scenario. That scenario demands unsafe printing, which is only allowed
in panic. If we are not in panic, it is assumed the current owner will
take care of it.

>> +
>> cpu_state = nbcon_get_cpu_state();
>>
>> + if (cpu_state->prio == NBCON_PRIO_PANIC)
>> + __nbcon_atomic_flush_all(true);
>
> It would deserve a comment that nbcon_atomic_exit() is used
> in panic() to calm down the consoles completely, similar effect
> as setting the variable "suppress_printk".

Actually, it is nbcon_atomic_enter() that does the calming down, but
only for the emergency CPU. Other CPUs are allowed to print as wildly as
they want. And all CPUs are allowed to continue to fill the ringbuffer.

@suppress_printk is quite different as it globally blocks any new
messages from being stored into the ringbuffer. This is not calming down
the consoles, but rather putting a finite limit on what they produce.
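
(For reference, @suppress_printk is checked at store time. Roughly,
vprintk_emit() bails out before anything reaches the ringbuffer:

   if (unlikely(suppress_printk))
           return 0;

so it limits message production globally rather than calming any
particular console.)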

>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index 6ef33cefa4d0..419c0fabc481 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -3159,6 +3159,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
>> console_srcu_read_unlock(cookie);
>> }
>>
>> + nbcon_atomic_flush_all();
>> +
>> console_flush_all(false, &next_seq, &handover);
>
> It seems that console_flush_all() tries to flush nbcon conosoles
> as well. And nbcon_atomic_flush_all() is explicitely called
> close above this flush_on_panic(). This is from panic()
> after this patchset is applied.
>
>
> void panic(const char *fmt, ...)
> {
> [...]
> nbcon_atomic_flush_all();
>
> console_unblank();
>
> /*
> * We may have ended up stopping the CPU holding the lock (in
> * smp_send_stop()) while still having some valuable data in the console
> * buffer. Try to acquire the lock then release it regardless of the
> * result. The release will also print the buffers out. Locks debug
> * should be disabled to avoid reporting bad unlock balance when
> * panic() is not being callled from OOPS.
> */
> debug_locks_off();
> console_flush_on_panic(CONSOLE_FLUSH_PENDING);
>
>
> By other words, we try to flush nbcon consoles 3 times almost
> immediately after each other.

"Almost immediately" != "immediately". Keep in mind that this is in
atomic printing context. Generally speaking, no messages will be seen on
the consoles yet. Any time there is a significant piece (such as
console_unblank()) it would probably be wise to flush.

We will probably want even more flush points. But I wanted to at least
start with the risky points (such as console_unblank() when legacy
consoles still exist).

> It would deserve some comment what is the purpose. Otherwise, people
> would tend to remove the "redundant" calls.

OK.

>> @@ -3903,6 +3905,10 @@ void defer_console_output(void)
>>
>> void printk_trigger_flush(void)
>> {
>> + migrate_disable();
>> + nbcon_atomic_flush_all();
>> + migrate_enable();
>
> Can this be actually called in NMI?

No. I need to add in_nmi() checks.

>> defer_console_output();
>> }
>
> This function would deserve some description because it looks strange.
> There are three names which are contradicting each other:
>
> + trigger_flush: it sounds like it tells someone to do the flush
>
> + nbcon_atomic_flush_all: does the flush immediately
>
> + defer_console_output: sounds like it prevents the current context
> to flush the consoles immediately

Agreed.

> We should either choose better names and/or add comments:
>
> /**
> * console_flush_or_trigger() - Make sure that the consoles will get flushed.
> *
> * Try to flush consoles when possible or queue flushing consoles like
> * in the deferred printk.
> *
> * Context: Can be used in any context (including NMI?)
> */
> void printk_flush_or_trigger(void)
> {
> /*
> * Try to flush consoles which do not depend on console_lock()
> * and support safe atomic_write()
> */
if (!in_nmi()) migrate_disable();
> nbcon_atomic_flush_all();
if (!in_nmi()) migrate_enable();
>
> /* Try flushing legacy consoles or queue the deferred handling. */
> if (!in_nmi() && console_trylock())
> console_unlock();
> else
> defer_console_output();
> }

I would be OK with this name and (fixed up for NMI) implementation.

> All in all. I feel a bit overwhelmed by the many *flush*() functions at
> the moment. Some flush only nbcon consoles. Some flush all. Some
> are using only the safe takeover/handover and some allow also
> harsh takeover. Some ignore port->lock because of bust_spinlock().
> Some are ignoring console_lock. They are called on different
> locations. The nbcon variants are called explicitly and also
> inside the generic *flush*() functions.

Agreed. This madness exists in part because we are continuing to support
legacy consoles. I tried to keep the two separate as much as
possible. But when it comes to flushing, there will be some overlap.

When working on the code, I tend to either look at the legacy consoles
_or_ the nbcon consoles. If you try to visualize the whole picture on a
system with legacy and nbcon consoles, it is a bit overwhelming. I
recommend focussing on them separately. I.e. when talking about nbcon
consoles, there is no reason to be concerned about the state of the
console lock or the port lock. When talking about legacy consoles, there
is no reason to be concerned about CPU atomic printing states.

John

2023-09-25 15:28:24

by John Ogness

Subject: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On 2023-09-22, Petr Mladek <[email protected]> wrote:
>> Note that when a CPU is in a priority elevated state, flushing
>> only occurs when dropping back to a lower priority. This allows
>> the full set of printk records (WARN/OOPS/PANIC output) to be
>> stored in the ringbuffer before beginning to flush the backlog.
>
> The above paragraph is a bit confusing. The code added by this patch
> does not do any flushing.

You are right. I should put this patch after patch 5 "printk: nbcon:
Provide function for atomic flushing" to simplify the introduction.

> I guess that this last paragraph is supposed to explain why the
> "nesting" array is needed.

No, it is explaining how this feature works in general. The term
"priority elevated state" means the CPU is in an atomic write section.

The "nesting" array is needed in order to support a feature that is not
explained in the commit message: If nested OOPS/WARN/PANIC occur, only
the outermost OOPS/WARN/PANIC will do the flushing. I will add this
information to the commit message.
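
To trace the per-CPU state through a hypothetical nested case on one
CPU (with the flush-on-exit behavior from patch 6):

   p0 = nbcon_atomic_enter(NBCON_PRIO_PANIC);
           /* prio = PANIC, nesting[PANIC] = 1 */

   /* the panic printing itself triggers a WARN() */
   p1 = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
           /* prio stays PANIC (EMERGENCY < PANIC), nesting[PANIC] = 2 */

   nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, p1);
           /* nesting[PANIC] is 2 at the flush check: not the
              outermost exit, so no flush; then nesting[PANIC] = 1 */

   nbcon_atomic_exit(NBCON_PRIO_PANIC, p0);
           /* nesting[PANIC] is 1: outermost exit, flush happens */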

>> +static __ref struct nbcon_cpu_state *nbcon_get_cpu_state(void)
>> +{
>> + if (!printk_percpu_data_ready())
>> + return &early_nbcon_pcpu_state;
>
> it might worth a comment. Something like:
>
> /*
> * The value of __printk_percpu_data_ready is modified in normal
> * context. As a result it could never change inside a nbcon
> * atomic context.
> */
> if (!printk_percpu_data_ready())
> return &early_nbcon_pcpu_state;

OK.

>> +void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
>> +{
>> + struct nbcon_cpu_state *cpu_state;
>> +
>> + cpu_state = nbcon_get_cpu_state();
>
> I would add a consistency check:
>
> WARN_ON_ONCE(cpu_state->nesting[cpu_state->prio] <= 0)

OK.

John

2023-09-25 17:24:07

by John Ogness

Subject: Re: [PATCH printk v2 05/11] printk: nbcon: Provide function for atomic flushing

On 2023-09-22, Petr Mladek <[email protected]> wrote:
>> Note that if the current CPU is in a nested elevated priority
>> state (EMERGENCY/PANIC), nbcon_atomic_flush() does nothing.
>
> This confused me a bit. It was not clear to me if it was
> "nested and elevated" or "the elevated priority was nested".

Elevated priority within an elevated priority. Or put another way: an
atomic printing section within an atomic printing section. Maybe the
"elevated priority" terminology is confusing. I can just use "atomic
printing section" instead if that helps.

>> This is in case the printing itself generates urgent messages
>> (OOPS/WARN/PANIC), so that those messages are fully stored into
>> the ringbuffer before any printing resumes.
>
> This feels like it was an advantage. But I would say that it is
> a limitation. IMHO, it simply works this way and we should describe
> it as a limitation.

The "atomic printing section" feature was the result of designing this
advantage. It "simply works this way" because that is how it was
designed.

Actually, this is explaining the nesting variable that you asked about
in the previous patch commit message. When I reverse the patch order,
this paragraph will be moved into that patch commit message.

>> +/**
>> + * nbcon_atomic_emit_one - Print one record for a console in atomic mode
>> + * @wctxt: An initialized write context struct to use
>> + * for this context
>> + *
>> + * Returns false if the given console could not print a record or there are
>> + * no more records to print, otherwise true.
>> + *
>> + * This is an internal helper to handle the locking of the console before
>> + * calling nbcon_emit_next_record().
>> + */
>> +static bool nbcon_atomic_emit_one(struct nbcon_write_context *wctxt)
>> +{
>> +	struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
>> +
>> +	if (!nbcon_context_try_acquire(ctxt))
>> +		return false;
>> +
>> +	/*
>> +	 * nbcon_emit_next_record() returns false when the console was
>> +	 * handed over or taken over. In both cases the context is no
>> +	 * longer valid.
>> +	 */
>> +	if (!nbcon_emit_next_record(wctxt))
>> +		return false;
>> +
>> +	nbcon_context_release(ctxt);
>> +
>> +	return prb_read_valid(prb, ctxt->seq, NULL);
>
> IMHO, it should be enough to check ctxt->backlog. I mean to do:
>
> return !!ctxt->backlog;
>
> We are here only when nbcon_emit_next_record() owned the context and
> was able to call printk_get_next_message().

Yes, but ctxt->backlog is set before the printing begins. If any nested
atomic printing occurs (i.e. just adding records to the ringbuffer),
these also need to be atomically printed.

For example, console_unlock() deals with that situation with:

	/*
	 * Some context may have added new records after
	 * console_flush_all() but before unlocking the console.
	 * Re-check if there is a new record to flush. If the trylock
	 * fails, another context is already handling the printing.
	 */
} while (prb_read_valid(prb, next_seq, NULL) && console_trylock());

The prb_read_valid() here corresponds to the prb_read_valid() in
console_unlock(). I can add a similar comment here for that.
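
For example (the comment wording is only an illustration of that
suggestion):

	/*
	 * Nested contexts may have stored new records while this
	 * context was printing. This corresponds to the
	 * prb_read_valid() check in console_unlock(), so re-check
	 * the ringbuffer rather than relying on ctxt->backlog.
	 */
	return prb_read_valid(prb, ctxt->seq, NULL);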

>> +static void __nbcon_atomic_flush_all(bool allow_unsafe_takeover)
>> +{
>> +	struct nbcon_write_context wctxt = { };
>> +	struct nbcon_context *ctxt = &ACCESS_PRIVATE(&wctxt, ctxt);
>> +	struct nbcon_cpu_state *cpu_state;
>> +	struct console *con;
>> +	bool any_progress;
>> +	int cookie;
>> +
>> +	cpu_state = nbcon_get_cpu_state();
>> +
>> +	/*
>> +	 * Let the outermost flush of this priority print. This avoids
>> +	 * nasty hackery for nested WARN() where the printing itself
>> +	 * generates one and ensures such nested messages are stored to
>> +	 * the ringbuffer before any printing resumes.
>
> It is not clear to me what hackery was meant.

Hackery = Trying to implement this feature without tracking CPU state
priorities.

> The fact is that only printk_once() or WARN_ONCE() should be used in
> the console emit/flush code paths. Any non-once printk might block
> consoles and even these nesting checks probably would not help much.

I am not sure what that has to do with it. This is a flush function,
which (for example) will be called when a warning is hit. We do _not_
want to flush the console if something more important (a panic) is
already in the process of being added to the ringbuffer.

> Anyway, I believe that we do not need this nesting counter.
> The nesting is already prevented by nbcon_context_try_acquire().
> It would not allow taking the nested lock with the same priority.

You are mixing 2 different things:

The acquire is related to ownership of a console.

The nesting is related to the urgency state of a CPU.

> I would personally just describe the behavior in the commit message
> and in the comment above __nbcon_atomic_flush_all():
>
> * The messages are flushed only when this context is able to
> * get the per-console lock. Namely, it works only when the
> * lock is free or when this context has a higher priority
> * than the current owner.

Your comment is stating the obvious. All messages are only written by a
context when that context can acquire ownership.

What the check here is doing is refusing to write messages even if it
_could_ acquire ownership. It isn't about console ownership. It is about
not _wanting_ to print in nested atomic printing sections.

>> +	if (cpu_state->prio > NBCON_PRIO_NORMAL &&
>> +	    cpu_state->nesting[cpu_state->prio] != 1)
>> +		return;
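
Annotated, the intent of that check is:

	/*
	 * This CPU is in an elevated priority state. Only the
	 * outermost atomic printing section of that priority may
	 * flush; nested sections only store their messages.
	 */
	if (cpu_state->prio > NBCON_PRIO_NORMAL &&
	    cpu_state->nesting[cpu_state->prio] != 1)
		return;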

[...]

>> +/**
>> + * nbcon_atomic_flush_all - Flush all nbcon consoles in atomic mode
>> + *
>> + * Context: Any context where migration is disabled.
>
> We should make it more clear what migration is meant here. For
> example:
>
> * Context: Any context which could not be migrated to another CPU.

OK.

John

2023-09-26 00:52:17

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On Mon 2023-09-25 11:31:54, John Ogness wrote:
> On 2023-09-22, Petr Mladek <[email protected]> wrote:
> >> Note that when a CPU is in a priority elevated state, flushing
> >> only occurs when dropping back to a lower priority. This allows
> >> the full set of printk records (WARN/OOPS/PANIC output) to be
> >> stored in the ringbuffer before beginning to flush the backlog.
> >
> > The above paragraph is a bit confusing. The code added by this patch
> > does not do any flushing.
>
> You are right. I should put this patch after patch 5 "printk: nbcon:
> Provide function for atomic flushing" to simplify the introduction.
>
> > I guess that this last paragraph is supposed to explain why the
> > "nesting" array is needed.
>
> No, it is explaining how this feature works in general. The term
> "priority elevated state" means the CPU is in an atomic write section.

This did not help me much at first. But I got it later after
connecting more dots ;-)

IMHO, the problem was that the commit message introduced the terms
in the following order:

+ WARN/OOPS/PANIC require printing out immediately

+ per-CPU state to denote the priority/urgency

+ functions to mark the beginning/end where the urgent messages
are generated

+ when CPU is in priority elevated state, flushing only occurs
when dropping back to a lower priority

From my POV:

+ It did not mention/explain "atomic write" at all

+ It said that the urgent messages required immediate printing.
And later, it said that they would get flushed later. These
two statements contradict each other.

+ The handling of priorities is not only about CPU nesting.
The same rules should also apply when another CPU is printing
messages in a higher priority context.

My proposal:

+ Use the term "higher priority context" for all WARN/OOPS/PANIC
messages.

+ Use "emergency priority context" or "nbcon emergency context"
when talking about NBCON_PRIO_EMERGENCY context to avoid
confusion with the printk log levels.

+ use "panic priority context or "nbcon panic context" for the panic
one if we really want to add enter/exit for the panic context.

+ We must somewhere explain the "atomic context" and the "atomic_write"
callback. One important question is why it is atomic. Is it because it

+ _can_ be called in atomic context?
+ _must_ be called in atomic context?

It is also called from console_unlock() for boot messages,
so it need not be called in atomic context.

What about renaming it to "nbcon_write" to avoid this confusion?


> The "nesting" array is needed in order to support a feature that is not
> explained in the commit message: If nested OOPS/WARN/PANIC occur, only
> the outermost OOPS/WARN/PANIC will do the flushing. I will add this
> information to the commit message.

What is the motivation for the feature, please?

1. Either we want to flush the messages in the higher priority context
ASAP.

The per-CPU lock has been designed exactly for this purpose. It
would not need any extra nesting counter. And it would work
consistently also when the lock is acquired on another CPU.

It is simple, the context will either get the per-console lock or
not. The (nested) context might get the lock only when it has higher
priority. The flush would be called directly from vprintk_emit().

I always thought that this was the planned behavior.

IMHO, we want it. A nested panic() should be able to takeover
the console and flush messages ASAP. It will never return.


2. Or we want to wait until all messages in the higher priority context
are stored into the ring buffer and have the caller flush them.

Who would actually flush the higher messages?
WARN() caller?
The timer callback which detected softlockup?
Or a completely different context?
Who would flush panic() messages when panic() interrupted
locked CPU?


My proposal:

There are only two higher priority contexts:

+ NBCON_PRIO_PANIC should be used when panic_cpu == raw_smp_processor_id()

+ NBCON_PRIO_EMERGENCY context would require some enter/exit wrappers
and tracking. But it does not necessarily need to be a per-CPU
variable.

I think about adding "int printk_state" into struct task_struct.
It might be useful also for other things, e.g. for storing the last
log level of non-finished message. Entering section with enforced
minimal loglevel or so.


Then we could have:

#define PRINTK_STATE_EMERGENCY_MASK 0x000000ff
#define PRINTK_STATE_EMERGENCY_OFFSET 0x00000001

void nbcon_emergency_enter(int *printk_state)
{
	*printk_state = current->printk_state;

	WARN_ON_ONCE((*printk_state & PRINTK_STATE_EMERGENCY_MASK) == PRINTK_STATE_EMERGENCY_MASK);

	current->printk_state = *printk_state + PRINTK_STATE_EMERGENCY_OFFSET;
}

void nbcon_emergency_exit(int printk_state)
{
	WARN_ON_ONCE(!(current->printk_state & PRINTK_STATE_EMERGENCY_MASK));

	current->printk_state = printk_state;
}

enum nbcon_prio nbcon_get_default_prio(void)
{
	if (panic_cpu == raw_smp_processor_id())
		return NBCON_PRIO_PANIC;

	/* Does current always exist? What about fork? */
	if (current && (current->printk_state & PRINTK_STATE_EMERGENCY_MASK))
		return NBCON_PRIO_EMERGENCY;

	return NBCON_PRIO_NORMAL;
}
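
Usage would then be as simple as (illustration only):

	int printk_state;

	nbcon_emergency_enter(&printk_state);
	/* ... generate the emergency (WARN/OOPS) messages ... */
	nbcon_emergency_exit(printk_state);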

IMHO, it should be race free. And we could use it to initialize
struct nbcon_context. The final decision will be left for
the nbcon_context_try_acquire().

Best Regards,
Petr

2023-09-26 22:46:38

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

On Mon 2023-09-25 15:43:03, John Ogness wrote:
> On 2023-09-22, Petr Mladek <[email protected]> wrote:
> >> console_flush_on_panic() - Called from several call sites to
> >> trigger ringbuffer dumping in an urgent situation.
> >>
> >> console_flush_on_panic() - Typically the panic() function will
> >
> > This is a second description of console_flush_on_panic() which
> > looks like a mistake.
>
> Oops. The first one should not be there.
>
> >> take care of atomic flushing the nbcon consoles on
> >> panic. However, there are several users of
> >> console_flush_on_panic() outside of panic().
> >
> > The generic panic() seems to use console_flush_on_panic() correctly
> > at the very end.
> >
> > Hmm, I see that console_flush_on_panic() is called also in
>
> [...]
>
> > Anyway, we should make clear that console_flush_on_panic() might break
> > the system and should be called as the last attempt to flush consoles.
> > The above arch-specific users are worth review.
>
> In an upcoming series you will see that console_flush_on_panic() only
> takes the console lock if there are legacy consoles. Ideally, eventually
> there will only be nbcon consoles, so your concern would disappear.

The legacy consoles have two risk levels:

1. port->lock is ignored after bust_spinlocks()
2. even console_lock is ignored in console_flush_on_panic()

The nbcon consoles have only one risk level:

1. unsafe takeover is allowed

First, I thought that we wanted to allow the unsafe takeover in
console_flush_on_panic(). In that case, this function would
be dangerous even for nbcon consoles.

Now, I remember that we wanted to allow it only before entering
the infinite loop (blinking diodes). In this case,
console_flush_on_panic() would be really safe for nbcon consoles.


> And if those users continue to use legacy consoles, then the risks will
> be the same as now.
>
> >> * Return: The previous priority that needs to be fed into
> >> * the corresponding nbcon_atomic_exit()
> >> * Context: Any context. Disables migration.
> >> + *
> >> + * When within an atomic printing section, no atomic printing occurs. This
> >> + * is to allow all emergency messages to be dumped into the ringbuffer before
> >> + * flushing the ringbuffer.
> >
> > The comment sounds like it is an advantage. But I think that it would be
> > a disadvantage.
>
> Please explain. At LPC2022 we agreed it was an advantage and it is
> implemented on purpose using the atomic printing sections.

I am sorry but I do not remember the details. Do you remember
the motivation, please?

In any case, we can't just say that this works by design
because someone somewhere agreed on it. We must explain
why this is better and I do not see it at the moment.

I am terribly sorry if I agreed with this and I disagree now.
I have never been good at live discussion because there is
not enough time to think about all the consequences.

Anyway, the proposed behavior (agreed on LPC2022) seems
to contradict the original plan from LPC 2019, see
https://lore.kernel.org/all/[email protected]/
Namely:

--- cut ---
3. Rather than defining emergency _messages_, we define an emergency
_state_ where the kernel wants to flush the messages immediately before
dying. Unlike oops_in_progress, this state will not be visible to
anything outside of the printk infrastructure.

4. When in emergency state, the kernel will use a new console callback
write_atomic() to flush the messages in whatever context the CPU is in
at that moment. Only consoles that implement the NMI-safe write_atomic()
will be able to flush in this state.
--- cut ---

We wanted to flush it ASAP.

I wonder if we discussed some limitations where the messages
could not be flushed immediately. Maybe, we discussed a scenario
when there are many pending messages which would delay
flushing the emergency ones. But we need to flush them anyway.

Now, I do not see any real advantage to first store all messages
and flush them later in the same context.

OK, flushing them immediately might cause a delay when flushing the
first emergency one. But storing them all might cause overwriting of
the first emergency messages.

I hope that my proposed change would actually make things easier
and will not affect the upcoming patchsets that much.

Best Regards,
Petr

2023-09-27 16:01:14

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v2 09/11] panic: Add atomic write enforcement to oops

On Wed 2023-09-20 01:14:54, John Ogness wrote:
> Invoke the atomic write enforcement functions for oops to
> ensure that the information gets out to the consoles.
>
> Since there is no single general function that calls both
> oops_enter() and oops_exit(), the nesting feature of atomic
> write sections is taken advantage of in order to guarantee
> full coverage between the first oops_enter() and the last
> oops_exit().
>
> It is important to note that if there are any legacy consoles
> registered, they will be attempting to directly print from the
> printk-caller context, which may jeopardize the reliability of
> the atomic consoles. Optimally there should be no legacy
> consoles registered.
>
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -630,6 +634,36 @@ bool oops_may_print(void)
> */
> void oops_enter(void)
> {
> +	enum nbcon_prio prev_prio;
> +	int cpu = -1;
> +
> +	/*
> +	 * If this turns out to be the first CPU in oops, this is the
> +	 * beginning of the outermost atomic section. Otherwise it is
> +	 * the beginning of an inner atomic section.
> +	 */

This sounds strange. What is the advantage of having the inner
atomic context, please? It covers only messages printed inside
oops_enter() and not the whole oops_enter()/exit(). Also see below.

> +	prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
> +
> +	if (atomic_try_cmpxchg_relaxed(&oops_cpu, &cpu, smp_processor_id())) {
> +		/*
> +		 * This is the first CPU in oops. Save the outermost
> +		 * @prev_prio in order to restore it on the outermost
> +		 * matching oops_exit(), when @oops_nesting == 0.
> +		 */
> +		oops_prev_prio = prev_prio;
> +
> +		/*
> +		 * Enter an inner atomic section that ends at the end of this
> +		 * function. In this case, the nbcon_atomic_enter() above
> +		 * began the outermost atomic section.
> +		 */
> +		prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);
> +	}
> +
> +	/* Track nesting when this CPU is the owner. */
> +	if (cpu == -1 || cpu == smp_processor_id())
> +		oops_nesting++;
> +
> 	tracing_off();
> 	/* can't trust the integrity of the kernel anymore: */
> 	debug_locks_off();
> @@ -637,6 +671,9 @@ void oops_enter(void)
>
> 	if (sysctl_oops_all_cpu_backtrace)
> 		trigger_all_cpu_backtrace();
> +
> +	/* Exit inner atomic section. */
> +	nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, prev_prio);

This will not flush the messages when:

+ This CPU owns oops_cpu. The flush will have to wait for exiting
the outer loop.

In this case, the inner atomic context is not needed.


+ oops_cpu is owned by another CPU, and the other CPU is
just flushing the messages and blocking the per-console
lock.

The good thing is that the messages printed by this oops_enter()
would likely get flushed by the other CPU.

The bad thing is that oops_exit() on this CPU won't call
nbcon_atomic_exit() so that the following OOPS messages
from this CPU might need to wait for the printk kthread.
IMHO, this is not what we want.


One solution would be to store prev_prio in a per-CPU array
so that each CPU could call its own nbcon_atomic_exit().

But I start liking more and more the idea with storing
and counting nested emergency contexts in struct task_struct.
It is the alternative implementation in reply to the 7th patch,
https://lore.kernel.org/r/ZRLBxsXPCym2NC5Q@alley

Then it will be enough to simply call:

+ nbcon_emergency_enter() in oops_enter()
+ nbcon_emergency_exit() in oops_exit()

Best Regards,
Petr

PS: I just hope that you didn't add all this complexity just because
we preferred this behavior at LPC 2022. Especially I hope
that it was not me who proposed and preferred this.

2023-10-05 14:37:21

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On 2023-09-25, Petr Mladek <[email protected]> wrote:
> From my POV:
>
> + It did not mention/explain "atomic write" at all

Agreed.

> + It said that the urgent messages required immediate printing.
> And later, it said that they would get flushed later. These
> two statements contradict each other.

I agree that the wording needs to be dramatically improved. The term
"urgent message" was not meant to represent a single printk() call.

> + The handling of priorities is not only about CPU nesting.
> The same rules should also apply when another CPU is printing
> messages in a higher priority context.

Sorry, I do not understand what you mean here. Each CPU is responsible
for its own state. If another CPU is in a higher priority state, that
CPU will be responsible for ensuring its own WARN/OOPS is stored and
flushed. (From v2 I see that the CPU does not try hard enough. I would
fix that for v3.)

> My proposal:
>
> + Use the term "higher priority context" for all WARN/OOPS/PANIC
> messages.
>
> + Use "emergency priority context" or "nbcon emergency context"
> when talking about NBCON_PRIO_EMERGENCY context to avoid
> confusion with the printk log levels.
>
> + use "panic priority context or "nbcon panic context" for the panic
> one if we really want to add enter/exit for the panic context.

There are 3 different types of priority:

1. The printk record priority: KERN_ERR, KERN_WARNING, etc.

2. The priority of a console owner: used for controlled takeovers.

3. The priority state of a CPU: only elevated for urgent messages, used
to store all urgent messages and then afterwards print directly by
taking ownership of consoles.

I need to choose terminology carefully to make it easy to distinguish
between these 3 types. v2 failed to name, describe, and document this
correctly.

> + We must somewhere explain the "atomic context" and the "atomic_write"
> callback. One important question is why it is atomic. Is it because it
>
> + _can_ be called in atomic context?
> + _must_ be called in atomic context?

Its main reason for existing is because it can be called from atomic
(including NMI) contexts. But like atomic_t, it can be used from any
context.

> It is also called from console_unlock() for boot messages,
> so it need not be called in atomic context.
>
> What about renaming it to "nbcon_write" to avoid this confusion?

When we introduce the threads, there will be a 2nd callback (currently
planned write_thread()). This callback is guaranteed to be called from a
printing kthread, which for console drivers like fbcon will prove to be
helpful in cleaning up its code.

I will reserve the word "atomic" _only_ when talking about which
printing callback is used. That should help to avoid associating the
callback with a certain context or priority. But I think the name
"write_atomic" is appropriate.

>> The "nesting" array is needed in order to support a feature that is not
>> explained in the commit message: If nested OOPS/WARN/PANIC occur, only
>> the outermost OOPS/WARN/PANIC will do the flushing. I will add this
>> information to the commit message.
>
> What is the motivation for the feature, please?

During the demo at LPC2022 we had the situation that there was a large
backlog when a WARN was hit. With current mainline the first line of the
WARN is put into the ringbuffer and then the entire backlog is flushed
before storing the rest of the WARN into the ringbuffer. At the time it
was obvious that we should finish storing the WARN message and then
start flushing the backlog.

> 1. Either we want to flush the messages in the higher priority context
> ASAP.
>
> The per-CPU lock has been designed exactly for this purpose. It
> would not need any extra nesting counter. And it would work
> consistently also when the lock is acquired on another CPU.
>
> It is simple, the context will either get the per-console lock or
> not. The (nested) context might get the lock only when it has higher
> priority. The flush would be called directly from vprintk_emit().
>
> I always thought that this was the planned behavior.

It was. But then it was suggested by Thomas and acked by Linus that we
were taking the wrong approach. Rather than a global state for all the
consoles, each console should have separate owners with states, allowing
handovers/takeovers. And CPUs should have urgency states to control how
the ringbuffer is stored and when direct printing should take place for
that CPU. This idea is similar to the cpu_sync approach, but with
timeouts, handovers/takeovers, and is per-console.
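
As a minimal sketch of the ownership rule being described here (the
names are illustrative, not the actual nbcon code):

	/*
	 * An owned console may only be acquired by a context with a
	 * strictly higher priority. The real code additionally handles
	 * handover requests, timeouts and unsafe takeovers.
	 */
	if (ctxt->prio <= console_owner_prio)
		return false;	/* keep the current owner */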

This approach gives us a lot of flexibility to enforce logical and safe
policies for storing and printing in different situations. And if
named/documented correctly, I think it will be easy to understand.

> 2. Or we want to wait until all messages in the higher priority context
> are stored into the ring buffer and have the caller flush them.

Yes.

> Who would actually flush the higher messages?
> WARN() caller?

Yes.

> The timer callback which detected softlockup?

Yes.

> Or a completely different context?

Another CPU could do some helpful backlog printing if it sees new
messages and is able to become an owner. But it is not the CPU that is
responsible for making certain that the messages have been printed.

> Who would flush panic() messages when panic() interrupted
> locked CPU?

The panic CPU can takeover (as a last resort, unsafely if
necessary). The panic CPU is responsible for getting those messages out.

> My proposal:
>
> There are only two higher priority contexts:
>
> + NBCON_PRIO_PANIC should be used when panic_cpu == raw_smp_processor_id()

This is the case with v2.

> + NBCON_PRIO_EMERGENCY context would require some enter/exit wrappers
> and tracking. But it does not necessarily need to be a per-CPU
> variable.
>
> I think about adding "int printk_state" into struct task_struct.
> It might be useful also for other things, e.g. for storing the last
> log level of non-finished message. Entering section with enforced
> minimal loglevel or so.

printk() calls are quite rare. And warning states are even more
rare. IMHO adding such a field to every task is a huge price to
pay. Also, printk operates in non-task contexts (hardirq, nmi). Although
@current probably points to something existing, there could be some
tricky corner cases.

IMHO per-cpu urgency state variables and per-console owner state
variables allow a much simpler picture in all contexts.

My proposal:

You have provided excellent feedback regarding naming and
documentation. Allow me to fix these things to clarify the various
functions and their roles.

John

2023-10-05 14:37:37

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

On 2023-09-26, Petr Mladek <[email protected]> wrote:
> Anyway, the proposed behavior (agreed on LPC2022) seems
> to contradict the original plan from LPC 2019, see
> https://lore.kernel.org/all/[email protected]/
> Namely:
>
> --- cut ---
> 3. Rather than defining emergency _messages_, we define an emergency
> _state_ where the kernel wants to flush the messages immediately before
> dying. Unlike oops_in_progress, this state will not be visible to
> anything outside of the printk infrastructure.
>
> 4. When in emergency state, the kernel will use a new console callback
> write_atomic() to flush the messages in whatever context the CPU is in
> at that moment. Only consoles that implement the NMI-safe write_atomic()
> will be able to flush in this state.
> --- cut ---
>
> We wanted to flush it ASAP.

In 2019 we didn't have all the code yet. Yes, we assumed that flushing
each individual printk() call would be how to do it. But in 2022, during
a live demo with real code, we saw that it was a bad idea.

I disagree that the 2022 decisions are contradicting the 2019
decisions. We have added more details because now we have a real
implementation.

v2 establishes an emergency state before dying. v2 flushes the messages
safely before dying (and unsafely _immediately_ before dying). v2
flushes the messages in whatever context the CPU is in. v2 uses
write_atomic() to perform the flush.

John

2023-10-06 12:53:51

by Petr Mladek

[permalink] [raw]
Subject: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

(2nd attempt, with Linus really in Cc).

Adding Linus into Cc. I would like to be sure about the flushing
of atomic consoles in panic context.

> During the demo at LPC2022 we had the situation that there was a large
> backlog when a WARN was hit. With current mainline the first line of the
> WARN is put into the ringbuffer and then the entire backlog is flushed
> before storing the rest of the WARN into the ringbuffer. At the time it
> was obvious that we should finish storing the WARN message and then
> start flushing the backlog.

This talks about the "emergency" context (WARN/OOPS/watchdog).
The system might be in big troubles but it would still try to continue.

Do we really want to defer the flush also for panic() context?

I ask because I was not on LPC 2022 in person and I do not remember
all details.

Anyway, the deferred flush works relatively well for the "emergency" context:

+ flushed from nbcon_atomic_exit()
+ printk kthread might emit the messages while they are being added

But it is tricky in panic(), see 8th patch at
https://lore.kernel.org/r/[email protected]

+ nbcon_atomic_exit() is called only in one code path.

+ nbcon_atomic_flush_all() is used in other paths. It looks like
a "Whack a mole" game to me.

+ messages are never emitted by printk kthread either because
CPUs are stopped or the kthread is not allowed to get the lock[*]

I see only one positive of the explicit flush. The consoles would
not delay crash_kexec() and the crash dump might be closer to
the point where panic() was called.

Otherwise I see only negatives => IMHO, we want to flush atomic
consoles synchronously from printk() in panic().

Does anyone really want explicit flushes in panic()?

[*] Emitting messages is explicitly blocked on non-panic CPUs. It
increases the chance that the panic CPU would be able to take
the console lock in a safe way.

Best Regards,
Petr

2023-10-06 15:53:13

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On Thu 2023-10-05 14:57:47, John Ogness wrote:
> On 2023-09-25, Petr Mladek <[email protected]> wrote:
> > From my POV:
> >
> > + It did not mention/explain "atomic write" at all
>
> Agreed.
>
> > + It said that the urgent messages required immediate printing.
> > And later, it said that they would get flushed later. These
> > two statements contradict each other.
>
> I agree that the wording needs to be dramatically improved. The term
> "urgent message" was not meant to represent a single printk() call.
>
> > + The handling of priorities is not only about CPU nesting.
> > The same rules should also apply when another CPU is printing
> > messages in a higher priority context.
>
> Sorry, I do not understand what you mean here. Each CPU is responsible
> for its own state. If another CPU is in a higher priority state, that
> CPU will be responsible for ensuring its own WARN/OOPS is stored and
> flushed.

You are right that my comment was confusing. I had in mind that
flushing one emergency context would flush all pending messages
from other CPUs and even from other emergency contexts. Your
explanation is better.

> (From v2 I see that the CPU does not try hard enough. I would
> fix that for v3.)

Yes, this should be fixed.

> There are 3 different types of priority:
>
> 1. The printk record priority: KERN_ERR, KERN_WARNING, etc.
>
> 2. The priority of a console owner: used for controlled takeovers.
>
> 3. The priority state of a CPU: only elevated for urgent messages, used
> to store all urgent messages and then afterwards print directly by
> taking ownership of consoles.

I know that you want to distinguish 2 and 3. But I think that
they should be the same. I mean that nbcon_context_try_acquire()
should use the PRIO according to what context it is in.

Is there any situation where it should be different, please?

IMHO, it might simplify some logic when they are the same.

> I need to choose terminology carefully to make it easy to distinguish
> between these 3 types. v2 failed to name, describe, and document this
> correctly.

> > + We must somewhere explain the "atomic context" and the "atomic_write"
> > callback. One important question is why it is atomic. Is it because it
> >
> > + _can_ be called in atomic context?
> > + _must_ be called in atomic context?
>
> Its main reason for existing is because it can be called from atomic
> (including NMI) contexts. But like atomic_t, it can be used from any
> context.
>
> > It is also called from console_unlock() for boot messages,
> > so it need not be called in atomic context.
> >
> > What about renaming it to "nbcon_write" to avoid this confusion?
>
> When we introduce the threads, there will be a 2nd callback (currently
> planned write_thread()). This callback is guaranteed to be called from a
> printing kthread, which for console drivers like fbcon will prove to be
> helpful in cleaning up its code.

I see.

> I will reserve the word "atomic" _only_ when talking about which
> printing callback is used. That should help to avoid associating the
> callback with a certain context or priority. But I think the name
> "write_atomic" is appropriate.

Sounds good.

> >> The "nesting" array is needed in order to support a feature that is not
> >> explained in the commit message: If nested OOPS/WARN/PANIC occur, only
> >> the outermost OOPS/WARN/PANIC will do the flushing. I will add this
> >> information to the commit message.
> >
> > What is the motivation for the feature, please?
>
> During the demo at LPC2022 we had the situation that there was a large
> backlog when a WARN was hit. With current mainline the first line of the
> WARN is put into the ringbuffer and then the entire backlog is flushed
> before storing the rest of the WARN into the ringbuffer. At the time it
> was obvious that we should finish storing the WARN message and then
> start flushing the backlog.

This talks about a context using NBCON_PRIO_EMERGENCY. I am pretty sure
that we want to flush the messages synchronously from printk() in
panic(). Let's discuss this in the other thread with Linus in Cc [1].

> > My proposal:
> >
> > There are only two higher priority contexts:
> >
> > + NBCON_PRIO_PANIC should be used when panic_cpu == raw_smp_processor_id()
>
> This is the case with v2.

Ah, I see, nbcon_atomic_enter(NBCON_PRIO_PANIC) is called right after
setting panic_cpu. But this is actually another reason to get rid
of the nbcon_atomic_enter() call. Why?

+ The information about entering a context with NBCON_PRIO_PANIC is
already provided by the panic_cpu variable.

+ Nesting is not possible because only one context is allowed
to acquire panic_cpu.

+ nbcon_atomic_exit() is almost useless because it is called only
in one code path. Explicit nbcon_atomic_flush_all() calls
are needed in other code paths.

IMHO, getting rid of nbcon_atomic_enter(NBCON_PRIO_PANIC) would have
several advantages:

+ enter()/exit() API would be needed only for NBCON_PRIO_EMERGENCY.
We could call it nbcon_emergency_enter()/exit() and avoid
the confusing atomic_enter name.

+ Nesting is important only for NBCON_PRIO_EMERGENCY context =>
the per-CPU variable might be a simple reference counter =>
easier code and no need to remember the previous value
and pass it to _exit() as a parameter.

> > + NBCON_PRIO_EMERGENCY context would require some enter/exit wrappers
> > and tracking. But it does not necessarily need to be a per-CPU
> > variable.
> >
> > I think about adding "int printk_state" into struct task_struct.
> > It might be useful also for other things, e.g. for storing the last
> > log level of non-finished message. Entering section with enforced
> > minimal loglevel or so.
>
> printk() calls are quite rare. And warning states are even more
> rare. IMHO adding such a field to every task is a huge price to
> pay. Also, printk operates in non-task contexts (hardirq, nmi). Although
> @current probably points to something existing, there could be some
> tricky corner cases.

Fair enough. Per-CPU variables are fine after all.

> My proposal:
>
> You have provided excellent feedback regarding naming and
> documentation. Allow me to fix these things to clarify the various
> functions and their roles.

We should first agree on the approach in panic() in the other thread [1].

IMHO, the context using NBCON_PRIO_EMERGENCY should be the only one
where we need enter/exit and the deferred flush. In this case,
the per-CPU variable would be a simple nesting counter. And maybe
the explicit flush would be needed only in emergency context.
=> easier code, logic, naming.

[1] https://lore.kernel.org/r/ZSADUKp8oJ2Ws2vC@alley

Best Regards,
Petr

2023-10-08 10:13:36

by John Ogness

[permalink] [raw]
Subject: Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

Hi Petr,

On 2023-10-06, Petr Mladek <[email protected]> wrote:
>> During the demo at LPC2022 we had the situation that there was a large
>> backlog when a WARN was hit. With current mainline the first line of the
>> WARN is put into the ringbuffer and then the entire backlog is flushed
>> before storing the rest of the WARN into the ringbuffer. At the time it
>> was obvious that we should finish storing the WARN message and then
>> start flushing the backlog.
>
> This talks about the "emergency" context (WARN/OOPS/watchdog).
> The system might be in big troubles but it would still try to continue.
>
> Do we really want to defer the flush also for panic() context?

We can start flushing right after the backtrace is in the
ringbuffer. But flushing the backlog _before_ putting the backtrace into
the ringbuffer was not desired because if there is a large backlog, the
machine may not survive to get the backtrace out. And in that case it
won't even be in the ringbuffer to be used by other debugging
tools.

> I ask because I was not on LPC 2022 in person and I do not remember
> all details.

The LPC2022 demo/talk was recorded:

https://www.youtube.com/watch?v=TVhNcKQvzxI

At 55:55 is where the situation occurred and triggered the conversation,
ultimately leading to this new feature.

You may also want to reread my summary:

https://lore.kernel.org/lkml/[email protected]

as well as Linus' follow-up message:

https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com

> But it is tricky in panic(), see 8th patch at
> https://lore.kernel.org/r/[email protected]
>
> + nbcon_atomic_exit() is called only in one code path.

Correct. When panic() is complete and the machine goes into its infinite
loop. This is also the point where it will attempt an unsafe flush, if
it could not get the messages out yet.

> + nbcon_atomic_flush_all() is used in other paths. It looks like
> a "Whack a mole" game to me.

Several different outputs occur during panic(). The flush is everywhere
where something significant has been put into the ringbuffer and now it
would be OK to flush it.

> + messages are never emitted by printk kthread either because
> CPUs are stopped or the kthread is not allowed to get the lock

Correct.

> I see only one positive of the explicit flush. The consoles would
> not delay crash_kexec() and the crash dump might be closer to
> the point where panic() was called.

It's only about getting the critical messages into the ringbuffer before
flushing. And since various things can go wrong during the many actions
within panic(), it makes sense to flush in between those actions.

> Otherwise I see only negatives => IMHO, we want to flush atomic
> consoles synchronously from printk() in panic().
>
> Does anyone really want explicit flushes in panic()?

So far you are the only one speaking against it. I expect as time goes
on it will get even more complex as it becomes tunable (also something
we talked about during the demo).

John

2023-10-09 15:32:24

by Petr Mladek

[permalink] [raw]
Subject: Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On Sun 2023-10-08 12:19:21, John Ogness wrote:
> Hi Petr,
>
> On 2023-10-06, Petr Mladek <[email protected]> wrote:
> >> During the demo at LPC2022 we had the situation that there was a large
> >> backlog when a WARN was hit. With current mainline the first line of the
> >> WARN is put into the ringbuffer and then the entire backlog is flushed
> >> before storing the rest of the WARN into the ringbuffer. At the time it
> >> was obvious that we should finish storing the WARN message and then
> >> start flushing the backlog.
> >
> > This talks about the "emergency" context (WARN/OOPS/watchdog).
> > The system might be in big troubles but it would still try to continue.
> >
> > Do we really want to defer the flush also for panic() context?
>
> We can start flushing right after the backtrace is in the
> ringbuffer. But flushing the backlog _before_ putting the backtrace into
> the ringbuffer was not desired because if there is a large backlog, the
> machine may not survive to get the backtrace out. And in that case it
> won't even be in the ringbuffer to be used by other debugging
> tools.

We really have to distinguish emergency and panic context!

> > I ask because I was not on LPC 2022 in person and I do not remember
> > all details.
>
> The LPC2022 demo/talk was recorded:
>
> https://www.youtube.com/watch?v=TVhNcKQvzxI
>
> At 55:55 is where the situation occurred and triggered the conversation,
> ultimately leading to this new feature.

Thanks for the link. My understanding is that the following scenario
has happened:

1. System was booting and messages were being flushed using the kthread.

2. WARN() happened. It printed the 1st line, took over the per-console
console lock and started flushing the backlog. There were still
many pending messages from the boot.

3. NMI came and caused panic(). The panic context printed its first line,
took over the console from the WARN context, flushed the rest
of the backlog and continued printing/flushing more messages from
the panic() context.


Problem:

People complained that they saw only the first line from WARN().
The related detailed info, including backtrace, was missing.

It was because panic() interrupted WARN() before it was able
to flush the backlog and print/flush all WARN() messages.


Proposed solution:

WARN()/emergency context should first store the messages and
flush them at the end.

It would increase the chance that all WARN() messages will
be stored in the ring buffer before NMI/panic() is called.

panic() would then flush all pending messages including
the stored WARN() messages.
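
In pseudo-code, the WARN() flow would then be roughly (using the
names from this series only for illustration):

	/* emergency context */
	prev_prio = nbcon_atomic_enter(NBCON_PRIO_EMERGENCY);

	/* first store the complete WARN() report into the ringbuffer */
	...

	/* dropping back to the previous priority flushes the backlog */
	nbcon_atomic_exit(NBCON_PRIO_EMERGENCY, prev_prio);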


Important:

The problem is that panic() interrupted WARN().

There is _no_ problem with interrupting panic().
Let me repeat, nested panic() is not possible!


> You may also want to reread my summary:
>
> https://lore.kernel.org/lkml/[email protected]

Again, thanks for the pointer. Let me paste 2 paragraphs here:

<paste>
- Printing the backlog is important! If some emergency situation occurs,
make sure the backlog gets printed.

- When an emergency occurs, put the full backtrace into the ringbuffer
before flushing any backlog. This ensures that the backtrace makes it
into the ringbuffer in case a panic occurs while flushing the backlog.
</paste>

My understanding is:

The 1st paragraph is the reason why:

+ we have three priorities: normal, emergency, panic

+ messages in normal context might be offloaded to kthread

+ emergency and panic context should try to flush the messages
from this context.


The 2nd paragraph says that the emergency context should first store
the messages and flush them later. And the important part:

"in case a panic occurs while flushing the backlog".

=> panic might interrupt emergency

It clearly distinguishes emergency and panic context.


> as well as Linus' follow-up message:
>
> https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com

IMHO, the important part is:

<paste>
Yeah, I really liked the notion of doing the oops with just filling
the back buffer but still getting it printed out if something goes
wrong in the middle.
</paste>

He was talking about oops => emergency context

Also he wanted to get it out when something goes wrong in the middle
=> panic in the middle ?


And another paragraph:

<paste>
I doubt it ends up being an issue in practice, but basically I wanted
to just pipe up and say that the exact details of how much of the back
buffer needs to be flushed first _could_ be tweaked if it ever does
come up as an issue.
</paste>

Linus had doubts that there might be problems with too long backlog
in practice. And I have the doubts as well.

And this is my view. The deferred flush is trying to solve a corner
case and we are forgetting what blocked the printk kthreads for >10 years.

> > But it is tricky in panic(), see 8th patch at
> > https://lore.kernel.org/r/[email protected]
> >
> > + nbcon_atomic_exit() is called only in one code path.
>
> Correct. When panic() is complete and the machine goes into its infinite
> loop. This is also the point where it will attempt an unsafe flush, if
> it could not get the messages out yet.
>
> > + nbcon_atomic_flush_all() is used in other paths. It looks like
> > a "Whack a mole" game to me.
>
> Several different outputs occur during panic(). The flush is everywhere
> where something significant has been put into the ringbuffer and now it
> would be OK to flush it.

No, we could _never_ say that the flush is everywhere where something
important happens.

panic() might fail anywhere. The last printed message might be
an important clue. And it can be any message.


> > + messages are never emitted by printk kthread either because
> > CPUs are stopped or the kthread is not allowed to get the lock
>
> Correct.
>
> > I see only one positive of the explicit flush. The consoles would
> > not delay crash_exec() and the crash dump might be closer to
> > the point where panic() was called.
>
> It's only about getting the critical messages into the ringbuffer before
> flushing. And since various things can go wrong during the many actions
> within panic(), it makes sense to flush in between those actions.

I am glad that we agree that "various things can go wrong during panic".

Are we sure that the patchset added the explicit flush right after
each possible problematic place? What about crash_exec(),
various kmsg dumpers, notifiers?


> > Otherwise I see only negatives => IMHO, we want to flush atomic
> > consoles synchronously from printk() in panic().
> >
> > Does anyone really want explicit flushes in panic()?
>
> So far you are the only one speaking against it. I expect as time goes
> on it will get even more complex as it becomes tunable (also something
> we talked about during the demo).

From my POV:

1. We are just two people discussing it at the moment
=> word against word.

2. Please, try to read my reply again. I agreed with delaying the
flush in emergency context.

But I am strongly against explicit flushes during panic at
least in the initial implementation.


3. IMHO, there might be a lot of misunderstanding caused by
the word "emergency" context. Some people might see
it as OOPs/WARN/panic and other might want to handle
panic special way.

I see it:

+ emergency - huge danger, medical check needed, patient might
survive

+ panic - patient is about to die, try to get some secrets from
him before he dies.

Any sentence of the secret might be important. It would be a pity
to ignore his last A4 page just because it was not complete.


Sigh, I did my best to explain why the nesting problem is only in
the emergency context and why panic is different.

Feel free to ask if anything is still unclear.


Anyway, I am _not_ going to accept the explicit flushes in panic()
unless you show me a problem with interrupted panic() or
Linus explicitly says that the explicit flushes make sense.

Best Regards,
Petr

2023-10-10 16:03:25

by John Ogness

[permalink] [raw]
Subject: Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

On 2023-10-09, Petr Mladek <[email protected]> wrote:
> We really have to distinguish emergency and panic context!

OK.

>> The LPC2022 demo/talk was recorded:
>>
>> https://www.youtube.com/watch?v=TVhNcKQvzxI
>>
>> At 55:55 is where the situation occurred and triggered the conversation,
>> ultimately leading to this new feature.
>
> Thanks for the link. My understanding is that the following scenario
> has happened:
>
> 1. System was booting and messages were being flushed using the kthread.
>
> 2. WARN() happened. It printed the 1st line, took over the per-console
> console lock and started flushing the backlog. There were still
> many pending messages from the boot.
>
> 3. NMI came and caused panic(). The panic context printed its first line,
> took over the console from the WARN context, flushed the rest
> of the backlog and continued printing/flushing more messages from
> the panic() context.
>
>
> Problem:
>
> People complained that they saw only the first line from WARN().
> The related detailed info, including backtrace, was missing.
>
> It was because panic() interrupted WARN() before it was able
> to flush the backlog and print/flush all WARN() messages.

Thanks for taking the time to review it in detail.

> Proposed solution:
>
> WARN()/emergency context should first store the messages and
> flush them at the end.
>
> It would increase the chance that all WARN() messages will
> be stored in the ring buffer before NMI/panic() is called.
>
> panic() would then flush all pending messages including
> the stored WARN() messages.

OK.

> Important:
>
> The problem is that panic() interrupted WARN().

Ack.

>> You may also want to reread my summary:
>>
>> https://lore.kernel.org/lkml/[email protected]
>
> Again, thanks for the pointer. Let me paste 2 paragraphs here:
>
> <paste>
> - Printing the backlog is important! If some emergency situation occurs,
> make sure the backlog gets printed.
>
> - When an emergency occurs, put the full backtrace into the ringbuffer
> before flushing any backlog. This ensures that the backtrace makes it
> into the ringbuffer in case a panic occurs while flushing the backlog.
> </paste>
>
> My understanding is:
>
> 1st paragraph is the reason why:
>
> + we have three priorities: normal, emergency, panic
>
> + messages in normal context might be offloaded to kthread
>
> + emergency and panic context should try to flush the messages
> from this context.
>
>
> 2nd paragraph talks about that emergency context should first store
> the messages and flush them later. And the important part:
>
> "in case a panic occurs while flushing the backlog.
>
> => panic might interrupt emergency
>
> It clearly distinguishes emergency and panic context.
>
>
>> as well as Linus' follow-up message:
>>
>> https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com
>
> IMHO, the important part is:
>
> <paste>
> Yeah, I really liked the notion of doing the oops with just filling
> the back buffer but still getting it printed out if something goes
> wrong in the middle.
> </paste>
>
> He was talking about oops => emergency context
>
> Also he wanted to get it out when something goes wrong in the middle
> => panic in the middle ?
>
>
> And another paragraph:
>
> <paste>
> I doubt it ends up being an issue in practice, but basically I wanted
> to just pipe up and say that the exact details of how much of the back
> buffer needs to be flushed first _could_ be tweaked if it ever does
> come up as an issue.
> </paste>
>
> Linus had doubts that there might be problems with too long backlog
> in practice. And I have the doubts as well.
>
> And this is my view. The deferred flush is trying to solve a corner
> case and we are forgetting what blocked the printk kthreads for >10 years.

OK. Thank you for the detailed analysis.

For v3 I will do something similar to what you proposed [0], except that
I will use a per-cpu variable (to track printk emergency nesting)
instead of adding a new field to the task struct.
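
A rough sketch of that direction (the names and details here are
placeholders, not the final API):

	static DEFINE_PER_CPU(unsigned int, printk_emergency_nesting);

	void nbcon_emergency_enter(void)
	{
		/* The counter is per-CPU, so migration must be disabled. */
		migrate_disable();
		this_cpu_inc(printk_emergency_nesting);
	}

	void nbcon_emergency_exit(void)
	{
		/* Only the outermost exit triggers the deferred flush. */
		if (this_cpu_dec_return(printk_emergency_nesting) == 0)
			nbcon_atomic_flush_all();
		migrate_enable();
	}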

John Ogness

[0] https://lore.kernel.org/lkml/ZRLBxsXPCym2NC5Q@alley/

2023-10-16 09:00:44

by Dave Young

[permalink] [raw]
Subject: Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

[Added more people in cc]

On 10/08/23 at 12:19pm, John Ogness wrote:
> Hi Petr,
>
> On 2023-10-06, Petr Mladek <[email protected]> wrote:
> >> During the demo at LPC2022 we had the situation that there was a large
> >> backlog when a WARN was hit. With current mainline the first line of the
> >> WARN is put into the ringbuffer and then the entire backlog is flushed
> >> before storing the rest of the WARN into the ringbuffer. At the time it
> >> was obvious that we should finish storing the WARN message and then
> >> start flushing the backlog.
> >
> > This talks about the "emergency" context (WARN/OOPS/watchdog).
> > The system might be in big troubles but it would still try to continue.
> >
> > Do we really want to defer the flush also for panic() context?
>
> We can start flushing right after the backtrace is in the
> ringbuffer. But flushing the backlog _before_ putting the backtrace into
> the ringbuffer was not desired because if there is a large backlog, the
> machine may not survive to get the backtrace out. And in that case it
> won't even be in the ringbuffer to be used by other debugging
> tools.
>
> > I ask because I was not on LPC 2022 in person and I do not remember
> > all details.
>
> The LPC2022 demo/talk was recorded:
>
> https://www.youtube.com/watch?v=TVhNcKQvzxI
>
> At 55:55 is where the situation occurred and triggered the conversation,
> ultimately leading to this new feature.
>
> You may also want to reread my summary:
>
> https://lore.kernel.org/lkml/[email protected]
>
> as well as Linus' follow-up message:
>
> https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@mail.gmail.com
>
> > But it is tricky in panic(), see 8th patch at
> > https://lore.kernel.org/r/[email protected]
> >
> > + nbcon_atomic_exit() is called only in one code path.
>
> Correct. When panic() is complete and the machine goes into its infinite
> loop. This is also the point where it will attempt an unsafe flush, if
> it could not get the messages out yet.
>
> > + nbcon_atomic_flush_all() is used in other paths. It looks like
> > a "Whack a mole" game to me.
>
> Several different outputs occur during panic(). The flush is everywhere
> where something significant has been put into the ringbuffer and now it
> would be OK to flush it.
>
> > + messages are never emitted by printk kthread either because
> > CPUs are stopped or the kthread is not allowed to get the lock
>
> Correct.
>
> > I see only one positive of the explicit flush. The consoles would
> > not delay crash_kexec() and the crash dump might be closer to
> > the point where panic() was called.
>
> It's only about getting the critical messages into the ringbuffer before
> flushing. And since various things can go wrong during the many actions
> within panic(), it makes sense to flush in between those actions.
>
> > Otherwise I see only negatives => IMHO, we want to flush atomic
> > consoles synchronously from printk() in panic().
> >
> > Does anyone really want explicit flushes in panic()?
>
> So far you are the only one speaking against it. I expect as time goes
> on it will get even more complex as it becomes tunable (also something
> we talked about during the demo).

Flushing consoles in the panic kexec case does not sound good, but I have no
deep understanding of the atomic printk series; I have added the kexec list and
reviewers in cc.

>
> John
>

Thanks
Dave

2023-10-16 10:10:23

by John Ogness

[permalink] [raw]
Subject: Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

Hi Dave,

On 2023-10-16, Dave Young <[email protected]> wrote:
>> > Does anyone really want explicit flushes in panic()?
>>
>> So far you are the only one speaking against it. I expect as time
>> goes on it will get even more complex as it becomes tunable (also
>> something we talked about during the demo).
>
> Flush consoles in panic kexec case sounds not good, but I have no deep
> understanding about the atomic printk series, added kexec list and
> reviewers in cc.

Currently every printk() message tries to flush immediately.

This series introduced a new method of first allowing a set of printk()
messages to be stored to the ringbuffer and then flushing the full
set. That is what this discussion was about.

The issue with allowing a set of printk() messages to be stored is that
you need to explicitly mark in code where the actual flushing should
occur. Petr's argument is that we do not want to insert "flush points"
into the panic() function and instead we should do as we do now: flush
each printk() message immediately.

In the end (for my upcoming v3 series) I agreed with Petr. We will
continue to keep things as they are now: flush each printk() message
immediately.

Currently consoles try to flush unsafely before kexec. With the atomic
printk series our goal is to only perform _safe_ flushing until all
other panic operations are complete. Only at the very end of panic()
would unsafe flushing be attempted.

John