2023-12-14 21:42:23

by John Ogness

Subject: [PATCH printk v3 00/14] fix console flushing

Hi,

While testing various flushing scenarios, I stumbled on a few
issues that cause console flushing to fail. At LPC2023 in
Richmond, I sat down with Petr Mladek and we reviewed the
v2 [0] series. This v3 series is the result of that offline
discussion.

This series addresses the following issues:

1. The prb_next_seq() optimization caused inconsistent return
values. Restore the originally intended behavior of
prb_next_seq() while keeping an optimization.

2. pr_flush() might not wait for the most recently stored
printk() message if non-finalized records precede it. Fix
pr_flush() to wait until all records that are at least
reserved at the time of the call have been printed.

3. In panic, the panic messages will not print if non-finalized
records precede them. Add a special condition so that
readers on the panic CPU will drop records that are not in
a consistent state.

4. It is possible (and easy to reproduce) to hit a scenario
where the console owner on the panic CPU hands over to a
waiter on a stopped CPU. Do not use the handover feature
in panic.

5. If messages are being dropped during panic, non-panic CPUs
are silenced. But by then it is already too late and most
likely the panic messages have been dropped. Change the
non-panic CPU silencing logic to _immediately_ silence
non-panic CPUs during panic. This also leads to clean panic
output when many CPUs are blasting the kernel log.

6. If a panic occurs in a context where printk() calls defer
printing (NMI or printk_safe section), the printing of the
final panic messages relies on irq_work. If that mechanism is
not available, the final panic messages are not seen (even
though they are finalized in the ringbuffer). Add one last
explicit flush after all printk() calls are finished to
ensure all available messages in the kernel log are printed.

This series also performs some minor cleanups to remove
open-coded checks for the panic context and to improve the
documentation language regarding data-less records.

Because of the multiple refactorings in recent history, it
would be helpful to provide the LTS maintainers with properly
backported patches. I am happy to do this.

The changes since v2:

- When on the panic CPU, allow desc_read_finalized_seq() to
read records in the committed state.

- Fix the off-by-one return value of the new function
prb_next_reserve_seq(). [1]

- Remove the suppress_panic_printk feature and instead simply
silently drop all printk() messages from non-panic CPUs.

- Remove the __seq_to_nbcon_seq()/__nbcon_seq_to_seq() macros
and use the new __u64seq_to_ulseq()/__ulseq_to_u64seq()
macros for nbcon as well.

- Adjust the new __u64seq_to_ulseq()/__ulseq_to_u64seq() macros
to support seq values above the base value. This is necessary
for the new last_finalized_seq feature. [2]

- Relocate the new __u64seq_to_ulseq()/__ulseq_to_u64seq()
macros to printk_ringbuffer.h so they can be shared by the
printk_ringbuffer and nbcon code.

- Call console_flush_on_panic() one more time at the end of
panic().

- Add and cleanup many comments and commit messages to clarify
reasoning and variable meanings.

John Ogness

[0] https://lore.kernel.org/lkml/[email protected]
[1] https://lore.kernel.org/lkml/[email protected]
[2] https://lore.kernel.org/lkml/[email protected]

John Ogness (13):
printk: nbcon: Relocate 32bit seq macros
printk: Adjust mapping for 32bit seq macros
printk: Use prb_first_seq() as base for 32bit seq macros
printk: ringbuffer: Do not skip non-finalized records with
prb_next_seq()
printk: ringbuffer: Clarify special lpos values
printk: For @suppress_panic_printk check for other CPU in panic
printk: Add this_cpu_in_panic()
printk: ringbuffer: Cleanup reader terminology
printk: Wait for all reserved records with pr_flush()
printk: ringbuffer: Skip non-finalized records in panic
printk: ringbuffer: Consider committed as finalized in panic
printk: Avoid non-panic CPUs writing to ringbuffer
panic: Flush kernel log buffer at the end

Petr Mladek (1):
printk: Disable passing console lock owner completely during panic()

kernel/panic.c | 8 +
kernel/printk/internal.h | 1 +
kernel/printk/nbcon.c | 41 +---
kernel/printk/printk.c | 101 +++++----
kernel/printk/printk_ringbuffer.c | 360 +++++++++++++++++++++++++-----
kernel/printk/printk_ringbuffer.h | 54 ++++-
6 files changed, 427 insertions(+), 138 deletions(-)


base-commit: 6c3a34e38436a2a3f7a1fa764c108ee19b05b893
--
2.39.2



2023-12-14 21:42:26

by John Ogness

Subject: [PATCH printk v3 01/14] printk: nbcon: Relocate 32bit seq macros

The macros __seq_to_nbcon_seq() and __nbcon_seq_to_seq() are
used to provide support for atomic handling of sequence numbers
on 32bit systems. Until now this was only used by nbcon.c,
which is why they were located in nbcon.c and include nbcon in
the name.

In a follow-up commit this functionality is also needed by
printk_ringbuffer. Rather than duplicating the functionality,
relocate the macros to printk_ringbuffer.h.

Also, since the macros will no longer be nbcon-specific, rename
them to __u64seq_to_ulseq() and __ulseq_to_u64seq().

This does not result in any functional change.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/nbcon.c | 41 +++----------------------------
kernel/printk/printk_ringbuffer.h | 33 +++++++++++++++++++++++++
2 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index b96077152f49..c8093bcc01fe 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -140,39 +140,6 @@ static inline bool nbcon_state_try_cmpxchg(struct console *con, struct nbcon_sta
return atomic_try_cmpxchg(&ACCESS_PRIVATE(con, nbcon_state), &cur->atom, new->atom);
}

-#ifdef CONFIG_64BIT
-
-#define __seq_to_nbcon_seq(seq) (seq)
-#define __nbcon_seq_to_seq(seq) (seq)
-
-#else /* CONFIG_64BIT */
-
-#define __seq_to_nbcon_seq(seq) ((u32)seq)
-
-static inline u64 __nbcon_seq_to_seq(u32 nbcon_seq)
-{
- u64 seq;
- u64 rb_next_seq;
-
- /*
- * The provided sequence is only the lower 32 bits of the ringbuffer
- * sequence. It needs to be expanded to 64bit. Get the next sequence
- * number from the ringbuffer and fold it.
- *
- * Having a 32bit representation in the console is sufficient.
- * If a console ever gets more than 2^31 records behind
- * the ringbuffer then this is the least of the problems.
- *
- * Also the access to the ring buffer is always safe.
- */
- rb_next_seq = prb_next_seq(prb);
- seq = rb_next_seq - ((u32)rb_next_seq - nbcon_seq);
-
- return seq;
-}
-
-#endif /* CONFIG_64BIT */
-
/**
* nbcon_seq_read - Read the current console sequence
* @con: Console to read the sequence of
@@ -183,7 +150,7 @@ u64 nbcon_seq_read(struct console *con)
{
unsigned long nbcon_seq = atomic_long_read(&ACCESS_PRIVATE(con, nbcon_seq));

- return __nbcon_seq_to_seq(nbcon_seq);
+ return __ulseq_to_u64seq(prb, nbcon_seq);
}

/**
@@ -204,7 +171,7 @@ void nbcon_seq_force(struct console *con, u64 seq)
*/
u64 valid_seq = max_t(u64, seq, prb_first_valid_seq(prb));

- atomic_long_set(&ACCESS_PRIVATE(con, nbcon_seq), __seq_to_nbcon_seq(valid_seq));
+ atomic_long_set(&ACCESS_PRIVATE(con, nbcon_seq), __u64seq_to_ulseq(valid_seq));

/* Clear con->seq since nbcon consoles use con->nbcon_seq instead. */
con->seq = 0;
@@ -223,11 +190,11 @@ void nbcon_seq_force(struct console *con, u64 seq)
*/
static void nbcon_seq_try_update(struct nbcon_context *ctxt, u64 new_seq)
{
- unsigned long nbcon_seq = __seq_to_nbcon_seq(ctxt->seq);
+ unsigned long nbcon_seq = __u64seq_to_ulseq(ctxt->seq);
struct console *con = ctxt->console;

if (atomic_long_try_cmpxchg(&ACCESS_PRIVATE(con, nbcon_seq), &nbcon_seq,
- __seq_to_nbcon_seq(new_seq))) {
+ __u64seq_to_ulseq(new_seq))) {
ctxt->seq = new_seq;
} else {
ctxt->seq = nbcon_seq_read(con);
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index 18cd25e489b8..b82a96dc2ea2 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -381,4 +381,37 @@ bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
u64 prb_first_valid_seq(struct printk_ringbuffer *rb);
u64 prb_next_seq(struct printk_ringbuffer *rb);

+#ifdef CONFIG_64BIT
+
+#define __u64seq_to_ulseq(u64seq) (u64seq)
+#define __ulseq_to_u64seq(rb, ulseq) (ulseq)
+
+#else /* CONFIG_64BIT */
+
+#define __u64seq_to_ulseq(u64seq) ((u32)u64seq)
+
+static inline u64 __ulseq_to_u64seq(struct printk_ringbuffer *rb, u32 ulseq)
+{
+ u64 seq;
+ u64 rb_next_seq;
+
+ /*
+ * The provided sequence is only the lower 32 bits of the ringbuffer
+ * sequence. It needs to be expanded to 64bit. Get the next sequence
+ * number from the ringbuffer and fold it.
+ *
+ * Having a 32bit representation in the console is sufficient.
+ * If a console ever gets more than 2^31 records behind
+ * the ringbuffer then this is the least of the problems.
+ *
+ * Also the access to the ring buffer is always safe.
+ */
+ rb_next_seq = prb_next_seq(rb);
+ seq = rb_next_seq - ((u32)rb_next_seq - ulseq);
+
+ return seq;
+}
+
+#endif /* CONFIG_64BIT */
+
#endif /* _KERNEL_PRINTK_RINGBUFFER_H */
--
2.39.2


2023-12-14 21:42:31

by John Ogness

Subject: [PATCH printk v3 03/14] printk: Use prb_first_seq() as base for 32bit seq macros

Note: This change only applies to 32bit architectures. On 64bit
architectures the macros are NOPs.

Currently prb_next_seq() is used as the base for the 32bit seq
macros __u64seq_to_ulseq() and __ulseq_to_u64seq(). However, in
a follow-up commit, prb_next_seq() will need to make use of the
32bit seq macros.

Use prb_first_seq() as the base for the 32bit seq macros instead
because it is guaranteed to return 64bit sequence numbers without
relying on any 32bit seq macros.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 2 +-
kernel/printk/printk_ringbuffer.h | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index fde338606ce8..49a82ccce8e9 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1832,7 +1832,7 @@ static int prb_read(struct printk_ringbuffer *rb, u64 seq,
}

/* Get the sequence number of the tail descriptor. */
-static u64 prb_first_seq(struct printk_ringbuffer *rb)
+u64 prb_first_seq(struct printk_ringbuffer *rb)
{
struct prb_desc_ring *desc_ring = &rb->desc_ring;
enum desc_state d_state;
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index 12f60c782e46..ee294aaf4aeb 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -378,6 +378,7 @@ bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
struct printk_info *info, unsigned int *line_count);

+u64 prb_first_seq(struct printk_ringbuffer *rb);
u64 prb_first_valid_seq(struct printk_ringbuffer *rb);
u64 prb_next_seq(struct printk_ringbuffer *rb);

@@ -392,12 +393,12 @@ u64 prb_next_seq(struct printk_ringbuffer *rb);

static inline u64 __ulseq_to_u64seq(struct printk_ringbuffer *rb, u32 ulseq)
{
+ u64 rb_first_seq = prb_first_seq(rb);
u64 seq;
- u64 rb_next_seq;

/*
* The provided sequence is only the lower 32 bits of the ringbuffer
- * sequence. It needs to be expanded to 64bit. Get the next sequence
+ * sequence. It needs to be expanded to 64bit. Get the first sequence
* number from the ringbuffer and fold it.
*
* Having a 32bit representation in the console is sufficient.
@@ -406,8 +407,7 @@ static inline u64 __ulseq_to_u64seq(struct printk_ringbuffer *rb, u32 ulseq)
*
* Also the access to the ring buffer is always safe.
*/
- rb_next_seq = prb_next_seq(rb);
- seq = rb_next_seq - (s32)((u32)rb_next_seq - ulseq);
+ seq = rb_first_seq - (s32)((u32)rb_first_seq - ulseq);

return seq;
}
--
2.39.2


2023-12-14 21:42:31

by John Ogness

Subject: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros

Note: This change only applies to 32bit architectures. On 64bit
architectures the macros are NOPs.

__ulseq_to_u64seq() computes the upper 32 bits of the passed
argument value (@ulseq). The upper bits are derived from a base
value (@rb_next_seq) in a way that assumes @ulseq represents a
64bit number that is less than or equal to @rb_next_seq.

Until now this mapping has been correct for all call sites. However,
in a follow-up commit, values of @ulseq will be passed in that are
higher than the base value. This requires a change to how the 32bit
value is mapped to a 64bit sequence number.

Rather than mapping @ulseq such that the base value is the end of a
32bit block, map @ulseq such that the base value is in the middle of
a 32bit block. This allows supporting 31 bits before and after the
base value, which is deemed acceptable for the console sequence
number during runtime.

Here is an example to illustrate the previous and new mappings.

For a base value (@rb_next_seq) of 2 2000 0000...

Before this change the range of possible return values was:

1 2000 0001 to 2 2000 0000

__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 1 2000 0001
__ulseq_to_u64seq(9fff ffff) => 1 9fff ffff
__ulseq_to_u64seq(a000 0000) => 1 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001

After this change the range of possible return values is:

1 a000 0001 to 2 a000 0000

__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 2 2000 0001
__ulseq_to_u64seq(9fff ffff) => 2 9fff ffff
__ulseq_to_u64seq(a000 0000) => 2 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001
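
For reference, the new mapping can be reproduced with a minimal
standalone userspace sketch (illustrative only, not kernel code;
the ulseq_to_u64seq() helper name is made up for the example):

#include <stdio.h>
#include <stdint.h>

/*
 * Sketch of the new mapping. The s32 cast places the base value in
 * the middle of the 32bit block; dropping the int32_t cast yields
 * the old mapping (base value at the end of the block).
 */
static uint64_t ulseq_to_u64seq(uint64_t base, uint32_t ulseq)
{
	return base - (int32_t)((uint32_t)base - ulseq);
}

int main(void)
{
	uint64_t base = 0x220000000ULL;	/* the example base: 2 2000 0000 */
	uint32_t tests[] = { 0x1fffffff, 0x20000000, 0x20000001,
			     0x9fffffff, 0xa0000000, 0xa0000001 };
	size_t i;

	for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
		printf("__ulseq_to_u64seq(%08x) => %llx\n", tests[i],
		       (unsigned long long)ulseq_to_u64seq(base, tests[i]));
	}

	return 0;
}

Running it prints the six lines of the "After" table above (without
the digit grouping).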

[ john.ogness: Rewrite commit message. ]

Reported-by: Francesco Dolcini <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk_ringbuffer.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index b82a96dc2ea2..12f60c782e46 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -407,7 +407,7 @@ static inline u64 __ulseq_to_u64seq(struct printk_ringbuffer *rb, u32 ulseq)
* Also the access to the ring buffer is always safe.
*/
rb_next_seq = prb_next_seq(rb);
- seq = rb_next_seq - ((u32)rb_next_seq - ulseq);
+ seq = rb_next_seq - (s32)((u32)rb_next_seq - ulseq);

return seq;
}
--
2.39.2


2023-12-14 21:42:43

by John Ogness

Subject: [PATCH printk v3 04/14] printk: ringbuffer: Do not skip non-finalized records with prb_next_seq()

Commit f244b4dc53e5 ("printk: ringbuffer: Improve
prb_next_seq() performance") introduced an optimization for
prb_next_seq() using best-effort tracking of recently finalized
records. However, the order of finalization does not
necessarily match the order of the records. The optimization
changed prb_next_seq() to return inconsistent results, possibly
yielding sequence numbers that are not available to readers
because they are preceded by non-finalized records or they are
not yet visible to the reader CPU.

Rather than just best-effort tracking of recently finalized
records, force the committing writer to read records and
increment the last "contiguous block" of finalized records. In
order to do this, the sequence number instead of the ID must be
stored because IDs cannot be directly compared.

A new memory barrier pair is introduced to guarantee that a
reader can always read the records up until the sequence number
returned by prb_next_seq() (unless the records have since
been overwritten in the ringbuffer).

This restores the original functionality of prb_next_seq()
while also keeping the optimization.

For 32bit systems, only the lower 32 bits of the sequence
number are stored. When reading the value, it is expanded to
the full 64bit sequence number using the 32bit seq macros,
which fold in the value returned by prb_first_seq().

Fixes: f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq() performance")
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 164 +++++++++++++++++++++++-------
kernel/printk/printk_ringbuffer.h | 4 +-
2 files changed, 127 insertions(+), 41 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 49a82ccce8e9..04c26cca546f 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -6,6 +6,7 @@
#include <linux/errno.h>
#include <linux/bug.h>
#include "printk_ringbuffer.h"
+#include "internal.h"

/**
* DOC: printk_ringbuffer overview
@@ -303,6 +304,9 @@
*
* desc_push_tail:B / desc_reserve:D
* set descriptor reusable (state), then push descriptor tail (id)
+ *
+ * desc_update_last_finalized:A / desc_last_finalized_seq:A
+ * store finalized record, then set new highest finalized sequence number
*/

#define DATA_SIZE(data_ring) _DATA_SIZE((data_ring)->size_bits)
@@ -1441,20 +1445,118 @@ bool prb_reserve_in_last(struct prb_reserved_entry *e, struct printk_ringbuffer
return false;
}

+/*
+ * @last_finalized_seq value guarantees that all records up to and including
+ * this sequence number are finalized and can be read. The only exception is
+ * records that are too old and have already been overwritten.
+ *
+ * It is also guaranteed that @last_finalized_seq only increases.
+ *
+ * Be aware that finalized records following non-finalized records are not
+ * reported because they are not yet available to the reader. For example,
+ * a new record stored via printk() will not be available to a printer if
+ * it follows a record that has not been finalized yet. However, once that
+ * non-finalized record becomes finalized, @last_finalized_seq will be
+ * appropriately updated and the full set of finalized records will be
+ * available to the printer. And since each printk() caller will either
+ * directly print or trigger deferred printing of all available unprinted
+ * records, all printk() messages will get printed.
+ */
+static u64 desc_last_finalized_seq(struct printk_ringbuffer *rb)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ unsigned long ulseq;
+
+ /*
+ * Guarantee the sequence number is loaded before loading the
+ * associated record in order to guarantee that the record can be
+ * seen by this CPU. This pairs with desc_update_last_finalized:A.
+ */
+ ulseq = atomic_long_read_acquire(&desc_ring->last_finalized_seq
+ ); /* LMM(desc_last_finalized_seq:A) */
+
+ return __ulseq_to_u64seq(rb, ulseq);
+}
+
+static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
+ struct printk_record *r, unsigned int *line_count);
+
+/*
+ * Check if there are records directly following @last_finalized_seq that are
+ * finalized. If so, update @last_finalized_seq to the latest of these
+ * records. It is not allowed to skip over records that are not yet finalized.
+ */
+static void desc_update_last_finalized(struct printk_ringbuffer *rb)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ u64 old_seq = desc_last_finalized_seq(rb);
+ unsigned long oldval;
+ unsigned long newval;
+ u64 finalized_seq;
+ u64 try_seq;
+
+try_again:
+ finalized_seq = old_seq;
+ try_seq = finalized_seq + 1;
+
+ /* Try to find later finalized records. */
+ while (_prb_read_valid(rb, &try_seq, NULL, NULL)) {
+ finalized_seq = try_seq;
+ try_seq++;
+ }
+
+ /* No update needed if no later finalized record was found. */
+ if (finalized_seq == old_seq)
+ return;
+
+ oldval = __u64seq_to_ulseq(old_seq);
+ newval = __u64seq_to_ulseq(finalized_seq);
+
+ /*
+ * Set the sequence number of a later finalized record that has been
+ * seen.
+ *
+ * Guarantee the record data is visible to other CPUs before storing
+ * its sequence number. This pairs with desc_last_finalized_seq:A.
+ *
+ * Memory barrier involvement:
+ *
+ * If desc_last_finalized_seq:A reads from
+ * desc_update_last_finalized:A, then desc_read:A reads from
+ * _prb_commit:B.
+ *
+ * Relies on:
+ *
+ * RELEASE from _prb_commit:B to desc_update_last_finalized:A
+ * matching
+ * ACQUIRE from desc_last_finalized_seq:A to desc_read:A
+ *
+ * Note: _prb_commit:B and desc_update_last_finalized:A can be
+ * different CPUs. However, the desc_update_last_finalized:A
+ * CPU (which performs the release) must have previously seen
+ * _prb_commit:B.
+ */
+ if (!atomic_long_try_cmpxchg_release(&desc_ring->last_finalized_seq,
+ &oldval, newval)) { /* LMM(desc_update_last_finalized:A) */
+ old_seq = __ulseq_to_u64seq(rb, oldval);
+ goto try_again;
+ }
+}
+
/*
* Attempt to finalize a specified descriptor. If this fails, the descriptor
* is either already final or it will finalize itself when the writer commits.
*/
-static void desc_make_final(struct prb_desc_ring *desc_ring, unsigned long id)
+static void desc_make_final(struct printk_ringbuffer *rb, unsigned long id)
{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
unsigned long prev_state_val = DESC_SV(id, desc_committed);
struct prb_desc *d = to_desc(desc_ring, id);

- atomic_long_cmpxchg_relaxed(&d->state_var, prev_state_val,
- DESC_SV(id, desc_finalized)); /* LMM(desc_make_final:A) */
-
- /* Best effort to remember the last finalized @id. */
- atomic_long_set(&desc_ring->last_finalized_id, id);
+ if (atomic_long_try_cmpxchg_relaxed(&d->state_var, &prev_state_val,
+ DESC_SV(id, desc_finalized))) { /* LMM(desc_make_final:A) */
+ desc_update_last_finalized(rb);
+ }
}

/**
@@ -1550,7 +1652,7 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
* readers. (For seq==0 there is no previous descriptor.)
*/
if (info->seq > 0)
- desc_make_final(desc_ring, DESC_ID(id - 1));
+ desc_make_final(rb, DESC_ID(id - 1));

r->text_buf = data_alloc(rb, r->text_buf_size, &d->text_blk_lpos, id);
/* If text data allocation fails, a data-less record is committed. */
@@ -1643,7 +1745,7 @@ void prb_commit(struct prb_reserved_entry *e)
*/
head_id = atomic_long_read(&desc_ring->head_id); /* LMM(prb_commit:A) */
if (head_id != e->id)
- desc_make_final(desc_ring, e->id);
+ desc_make_final(e->rb, e->id);
}

/**
@@ -1663,12 +1765,9 @@ void prb_commit(struct prb_reserved_entry *e)
*/
void prb_final_commit(struct prb_reserved_entry *e)
{
- struct prb_desc_ring *desc_ring = &e->rb->desc_ring;
-
_prb_commit(e, desc_finalized);

- /* Best effort to remember the last finalized @id. */
- atomic_long_set(&desc_ring->last_finalized_id, e->id);
+ desc_update_last_finalized(e->rb);
}

/*
@@ -2008,7 +2107,9 @@ u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
* newest sequence number available to readers will be.
*
* This provides readers a sequence number to jump to if all currently
- * available records should be skipped.
+ * available records should be skipped. It is guaranteed that all records
+ * previous to the returned value have been finalized and are (or were)
+ * available to the reader.
*
* Context: Any context.
* Return: The sequence number of the next newest (not yet available) record
@@ -2016,34 +2117,19 @@ u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
*/
u64 prb_next_seq(struct printk_ringbuffer *rb)
{
- struct prb_desc_ring *desc_ring = &rb->desc_ring;
- enum desc_state d_state;
- unsigned long id;
u64 seq;

- /* Check if the cached @id still points to a valid @seq. */
- id = atomic_long_read(&desc_ring->last_finalized_id);
- d_state = desc_read(desc_ring, id, NULL, &seq, NULL);
+ seq = desc_last_finalized_seq(rb);

- if (d_state == desc_finalized || d_state == desc_reusable) {
- /*
- * Begin searching after the last finalized record.
- *
- * On 0, the search must begin at 0 because of hack#2
- * of the bootstrapping phase it is not known if a
- * record at index 0 exists.
- */
- if (seq != 0)
- seq++;
- } else {
- /*
- * The information about the last finalized sequence number
- * has gone. It should happen only when there is a flood of
- * new messages and the ringbuffer is rapidly recycled.
- * Give up and start from the beginning.
- */
- seq = 0;
- }
+ /*
+ * Begin searching after the last finalized record.
+ *
+ * On 0, the search must begin at 0 because, due to hack#2
+ * of the bootstrapping phase, it is not known if a
+ * record at index 0 exists.
+ */
+ if (seq != 0)
+ seq++;

/*
* The information about the last finalized @seq might be inaccurate.
@@ -2085,7 +2171,7 @@ void prb_init(struct printk_ringbuffer *rb,
rb->desc_ring.infos = infos;
atomic_long_set(&rb->desc_ring.head_id, DESC0_ID(descbits));
atomic_long_set(&rb->desc_ring.tail_id, DESC0_ID(descbits));
- atomic_long_set(&rb->desc_ring.last_finalized_id, DESC0_ID(descbits));
+ atomic_long_set(&rb->desc_ring.last_finalized_seq, 0);

rb->text_data_ring.size_bits = textbits;
rb->text_data_ring.data = text_buf;
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index ee294aaf4aeb..2d948cc82b5b 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -75,7 +75,7 @@ struct prb_desc_ring {
struct printk_info *infos;
atomic_long_t head_id;
atomic_long_t tail_id;
- atomic_long_t last_finalized_id;
+ atomic_long_t last_finalized_seq;
};

/*
@@ -259,7 +259,7 @@ static struct printk_ringbuffer name = { \
.infos = &_##name##_infos[0], \
.head_id = ATOMIC_INIT(DESC0_ID(descbits)), \
.tail_id = ATOMIC_INIT(DESC0_ID(descbits)), \
- .last_finalized_id = ATOMIC_INIT(DESC0_ID(descbits)), \
+ .last_finalized_seq = ATOMIC_INIT(0), \
}, \
.text_data_ring = { \
.size_bits = (avgtextbits) + (descbits), \
--
2.39.2


2023-12-14 21:43:22

by John Ogness

Subject: [PATCH printk v3 05/14] printk: ringbuffer: Clarify special lpos values

For empty line records, no data blocks are created. Instead,
these valid records are identified by special logical position
values (in fields of @prb_desc.text_blk_lpos).

Currently the macro NO_LPOS is used for empty line records.
This name is confusing because it does not imply _why_ there is
no data block.

Rename NO_LPOS to EMPTY_LINE_LPOS so that it is clear why there
is no data block.

Also add comments explaining the use of EMPTY_LINE_LPOS as well
as clarification to the values used to represent data-less
blocks.
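
For illustration, here is a minimal sketch (not the kernel code
itself) of how a reader distinguishes the special lpos values,
assuming LPOS_DATALESS() tests bit0 of an lpos value:

#define FAILED_LPOS		0x1
#define EMPTY_LINE_LPOS		0x3
#define LPOS_DATALESS(lpos)	((lpos) & 1UL)

static const char *read_dataless(unsigned long begin, unsigned long next,
				 unsigned int *data_size)
{
	/* Bit0 clear: a regular data block, not handled here. */
	if (!LPOS_DATALESS(begin) || !LPOS_DATALESS(next))
		return NULL;

	/* Valid empty line record: success with empty string data. */
	if (begin == EMPTY_LINE_LPOS && next == EMPTY_LINE_LPOS) {
		*data_size = 0;
		return "";
	}

	/* FAILED_LPOS: data lost, invalid, or otherwise unavailable. */
	return NULL;
}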

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 20 ++++++++++++++++----
kernel/printk/printk_ringbuffer.h | 16 +++++++++++++++-
2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 04c26cca546f..244d991ffd73 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1034,9 +1034,13 @@ static char *data_alloc(struct printk_ringbuffer *rb, unsigned int size,
unsigned long next_lpos;

if (size == 0) {
- /* Specify a data-less block. */
- blk_lpos->begin = NO_LPOS;
- blk_lpos->next = NO_LPOS;
+ /*
+ * Data blocks are not created for empty lines. Instead, the
+ * reader will recognize these special lpos values and handle
+ * them appropriately.
+ */
+ blk_lpos->begin = EMPTY_LINE_LPOS;
+ blk_lpos->next = EMPTY_LINE_LPOS;
return NULL;
}

@@ -1214,10 +1218,18 @@ static const char *get_data(struct prb_data_ring *data_ring,

/* Data-less data block description. */
if (BLK_DATALESS(blk_lpos)) {
- if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) {
+ /*
+ * Records that are just empty lines are also valid, even
+ * though they do not have a data block. For such records
+ * explicitly return empty string data to signify success.
+ */
+ if (blk_lpos->begin == EMPTY_LINE_LPOS &&
+ blk_lpos->next == EMPTY_LINE_LPOS) {
*data_size = 0;
return "";
}
+
+ /* Data lost, invalid, or otherwise unavailable. */
return NULL;
}

diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index 2d948cc82b5b..d49460f7578e 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -127,8 +127,22 @@ enum desc_state {
#define DESC_SV(id, state) (((unsigned long)state << DESC_FLAGS_SHIFT) | id)
#define DESC_ID_MASK (~DESC_FLAGS_MASK)
#define DESC_ID(sv) ((sv) & DESC_ID_MASK)
+
+/*
+ * Special data block logical position values (for fields of
+ * @prb_desc.text_blk_lpos).
+ *
+ * - Bit0 is used to identify if the record has no data block. (Implemented in
+ * the LPOS_DATALESS() macro.)
+ *
+ * - Bit1 specifies the reason for not having a data block.
+ *
+ * These special values could never be real lpos values because of the
+ * meta data and alignment padding of data blocks. (See to_blk_size() for
+ * details.)
+ */
#define FAILED_LPOS 0x1
-#define NO_LPOS 0x3
+#define EMPTY_LINE_LPOS 0x3

#define FAILED_BLK_LPOS \
{ \
--
2.39.2


2023-12-14 21:43:24

by John Ogness

Subject: [PATCH printk v3 06/14] printk: For @suppress_panic_printk check for other CPU in panic

Currently @suppress_panic_printk is checked along with
non-matching @panic_cpu and current CPU. This works
because @suppress_panic_printk is only set when
panic_in_progress() is true.

Rather than relying on the @suppress_panic_printk semantics,
use the concise helper function other_cpu_in_progress(). The
helper function exists to avoid open coding such tests.

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/printk.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index be95a6851164..22bb45d8f2f2 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2270,8 +2270,7 @@ asmlinkage int vprintk_emit(int facility, int level,
if (unlikely(suppress_printk))
return 0;

- if (unlikely(suppress_panic_printk) &&
- atomic_read(&panic_cpu) != raw_smp_processor_id())
+ if (unlikely(suppress_panic_printk) && other_cpu_in_panic())
return 0;

if (level == LOGLEVEL_SCHED) {
--
2.39.2


2023-12-14 21:43:57

by John Ogness

Subject: [PATCH printk v3 11/14] printk: ringbuffer: Consider committed as finalized in panic

A descriptor in the committed state means the record does not yet
exist for the reader. However, for the panic CPU, committed
records should be handled as finalized records since they contain
message data in a consistent state and may contain additional
hints as to the cause of the panic.

Add an exception so that records in the committed state are not
considered non-existent when reading from the panic CPU.
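
For background: a committed record is one where the writer has
called prb_commit() but the record may still be reopened and
extended (e.g. by prb_reserve_in_last() to append a LOG_CONT
continuation). Only finalization makes a record immutable. Reading
a committed record on the panic CPU is acceptable because its data
is already in a consistent state.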

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index d6ed33683b8b..e7b808b829a0 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1857,6 +1857,8 @@ static bool copy_data(struct prb_data_ring *data_ring,
* descriptor. However, it also verifies that the record is finalized and has
* the sequence number @seq. On success, 0 is returned.
*
+ * For the panic CPU, committed descriptors are also considered finalized.
+ *
* Error return values:
* -EINVAL: A finalized record with sequence number @seq does not exist.
* -ENOENT: A finalized record with sequence number @seq exists, but its data
@@ -1875,16 +1877,25 @@ static int desc_read_finalized_seq(struct prb_desc_ring *desc_ring,

/*
* An unexpected @id (desc_miss) or @seq mismatch means the record
- * does not exist. A descriptor in the reserved or committed state
- * means the record does not yet exist for the reader.
+ * does not exist. A descriptor in the reserved state means the
+ * record does not yet exist for the reader.
*/
if (d_state == desc_miss ||
d_state == desc_reserved ||
- d_state == desc_committed ||
s != seq) {
return -EINVAL;
}

+ /*
+ * A descriptor in the committed state means the record does not yet
+ * exist for the reader. However, for the panic CPU, committed
+ * records are also handled as finalized records since they contain
+ * message data in a consistent state and may contain additional
+ * hints as to the cause of the panic.
+ */
+ if (d_state == desc_committed && !this_cpu_in_panic())
+ return -EINVAL;
+
/*
* A descriptor in the reusable state may no longer have its data
* available; report it as existing but with lost data. Or the record
--
2.39.2


2023-12-14 21:44:05

by John Ogness

Subject: [PATCH printk v3 08/14] printk: ringbuffer: Cleanup reader terminology

With the lockless ringbuffer, multiple CPUs/contexts are allowed
to write simultaneously into the buffer. This creates an
ambiguity as some writers will finalize sooner than others.

The documentation for the prb_read functions is not clear as it
refers to "not yet written" and "no data available". Clarify the
return values and language to be in terms of the reader: records
available for reading.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 244d991ffd73..67ee1c62fcd6 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1987,11 +1987,13 @@ u64 prb_first_seq(struct printk_ringbuffer *rb)
}

/*
- * Non-blocking read of a record. Updates @seq to the last finalized record
- * (which may have no data available).
+ * Non-blocking read of a record.
*
- * See the description of prb_read_valid() and prb_read_valid_info()
- * for details.
+ * On success @seq is updated to the record that was read and (if provided)
+ * @r and @line_count will contain the read/calculated data.
+ *
+ * On failure @seq is updated to a record that is not yet available to the
+ * reader, but it will be the next record available to the reader.
*/
static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
struct printk_record *r, unsigned int *line_count)
@@ -2010,7 +2012,7 @@ static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
*seq = tail_seq;

} else if (err == -ENOENT) {
- /* Record exists, but no data available. Skip. */
+ /* Record exists, but the data was lost. Skip. */
(*seq)++;

} else {
@@ -2043,7 +2045,7 @@ static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
* On success, the reader must check r->info.seq to see which record was
* actually read. This allows the reader to detect dropped records.
*
- * Failure means @seq refers to a not yet written record.
+ * Failure means @seq refers to a record not yet available to the reader.
*/
bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
struct printk_record *r)
@@ -2073,7 +2075,7 @@ bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
* On success, the reader must check info->seq to see which record meta data
* was actually read. This allows the reader to detect dropped records.
*
- * Failure means @seq refers to a not yet written record.
+ * Failure means @seq refers to a record not yet available to the reader.
*/
bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
struct printk_info *info, unsigned int *line_count)
--
2.39.2


2023-12-14 21:44:07

by John Ogness

Subject: [PATCH printk v3 10/14] printk: ringbuffer: Skip non-finalized records in panic

Normally a reader will stop upon reaching a non-finalized
record. However, when a panic happens, writers from other CPUs
(or an interrupted context on the panic CPU) may have been
writing a record and were unable to finalize it. The panic CPU
will reserve/commit/finalize its panic records, but these will
be located after the non-finalized records. This results in
panic() not flushing the panic messages.

Extend _prb_read_valid() to skip over non-finalized records if
on the panic CPU.
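
As a concrete example (hypothetical sequence numbers): if records
5-7 are finalized, record 8 was reserved by a CPU that was stopped
before finalizing it, and the panic CPU then stores and finalizes
records 9-10, a normal reader would stop at record 8 and the panic
messages in records 9-10 would never print. With this change, a
reader on the panic CPU skips record 8 and continues with 9-10.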

Fixes: 896fbe20b4e2 ("printk: use the lockless ringbuffer")
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index b7748d7c44c1..d6ed33683b8b 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -2107,6 +2107,10 @@ u64 prb_next_reserve_seq(struct printk_ringbuffer *rb)
*
* On failure @seq is updated to a record that is not yet available to the
* reader, but it will be the next record available to the reader.
+ *
+ * Note: When the current CPU is in panic, this function will skip over any
+ * non-existent/non-finalized records in order to allow the panic CPU
+ * to print any and all records that have been finalized.
*/
static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
struct printk_record *r, unsigned int *line_count)
@@ -2129,8 +2133,28 @@ static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
(*seq)++;

} else {
- /* Non-existent/non-finalized record. Must stop. */
- return false;
+ /*
+ * Non-existent/non-finalized record. Must stop.
+ *
+ * For panic situations it cannot be expected that
+ * non-finalized records will become finalized. But
+ * there may be other finalized records beyond that
+ * need to be printed for a panic situation. If this
+ * is the panic CPU, skip this
+ * non-existent/non-finalized record unless it is
+ * at or beyond the head, in which case it is not
+ * possible to continue.
+ *
+ * Note that new messages printed on panic CPU are
+ * finalized when we are here. The only exception
+ * might be the last message without trailing newline.
+ * But it would have the sequence number returned
+ * by "prb_next_reserve_seq() - 1".
+ */
+ if (this_cpu_in_panic() && ((*seq + 1) < prb_next_reserve_seq(rb)))
+ (*seq)++;
+ else
+ return false;
}
}

--
2.39.2


2023-12-14 21:44:08

by John Ogness

Subject: [PATCH printk v3 12/14] printk: Disable passing console lock owner completely during panic()

From: Petr Mladek <[email protected]>

The commit d51507098ff91 ("printk: disable optimistic spin
during panic") added checks to avoid becoming a console waiter
if a panic is in progress.

However, the transition to panic can occur while there is
already a waiter. The current owner should not pass the lock to
the waiter because it might get stopped or blocked anytime.

Also the panic context might pass the console lock owner to an
already stopped waiter by mistake. It might happen when
console_flush_on_panic() ignores the current lock owner, for
example:

CPU0 CPU1
---- ----
console_lock_spinning_enable()
console_trylock_spinning()
[CPU1 now console waiter]
NMI: panic()
panic_other_cpus_shutdown()
[stopped as console waiter]
console_flush_on_panic()
console_lock_spinning_enable()
[print 1 record]
console_lock_spinning_disable_and_check()
[handover to stopped CPU1]

This results in panic() not flushing the panic messages.

Fix these problems by disabling all spinning operations
completely during panic().

Another advantage is that it prevents possible deadlocks caused
by "console_owner_lock". The panic() context does not need to
take it any longer. The lockless checks are safe because the
functions become NOPs when they see the panic in progress. All
operations manipulating the state are still synchronized by the
lock even when non-panic CPUs would notice the panic
synchronously.

The current owner might stay spinning. But non-panic() CPUs
would get stopped anyway and the panic context will never start
spinning.

Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes")
Signed-off-by: Petr Mladek <[email protected]>
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index f3a7f5a6f6f8..cb99c854a648 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1869,10 +1869,23 @@ static bool console_waiter;
*/
static void console_lock_spinning_enable(void)
{
+ /*
+ * Do not use spinning in panic(). The panic CPU wants to keep the lock.
+ * Non-panic CPUs abandon the flush anyway.
+ *
+ * Just keep the lockdep annotation. The panic-CPU should avoid
+ * taking console_owner_lock because it might cause a deadlock.
+ * This looks like the easiest way to prevent false lockdep
+ * reports without handling races a lockless way.
+ */
+ if (panic_in_progress())
+ goto lockdep;
+
raw_spin_lock(&console_owner_lock);
console_owner = current;
raw_spin_unlock(&console_owner_lock);

+lockdep:
/* The waiter may spin on us after setting console_owner */
spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
}
@@ -1897,6 +1910,22 @@ static int console_lock_spinning_disable_and_check(int cookie)
{
int waiter;

+ /*
+ * Ignore spinning waiters during panic() because they might get stopped
+ * or blocked at any time.
+ *
+ * It is safe because nobody is allowed to start spinning during panic
+ * in the first place. If there has been a waiter then non-panic CPUs
+ * might stay spinning. They would get stopped anyway. The panic context
+ * will never start spinning and an interrupted spin on panic CPU will
+ * never continue.
+ */
+ if (panic_in_progress()) {
+ /* Keep lockdep happy. */
+ spin_release(&console_owner_dep_map, _THIS_IP_);
+ return 0;
+ }
+
raw_spin_lock(&console_owner_lock);
waiter = READ_ONCE(console_waiter);
console_owner = NULL;
--
2.39.2


2023-12-14 21:44:27

by John Ogness

Subject: [PATCH printk v3 13/14] printk: Avoid non-panic CPUs writing to ringbuffer

Commit 13fb0f74d702 ("printk: Avoid livelock with heavy printk
during panic") introduced a mechanism to silence non-panic CPUs
if too many messages are being dropped. Aside from trying to
work around the livelock bugs of legacy consoles, it was also
intended to avoid losing panic messages. However, if non-panic
CPUs are writing to the ringbuffer, then reacting to dropped
messages is too late.

To avoid losing panic CPU messages, silence non-panic CPUs
immediately on panic.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 26 ++++++--------------------
1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index cb99c854a648..1685a71f3f71 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -462,12 +462,6 @@ static int console_msg_format = MSG_FORMAT_DEFAULT;
static DEFINE_MUTEX(syslog_lock);

#ifdef CONFIG_PRINTK
-/*
- * During panic, heavy printk by other CPUs can delay the
- * panic and risk deadlock on console resources.
- */
-static int __read_mostly suppress_panic_printk;
-
DECLARE_WAIT_QUEUE_HEAD(log_wait);
/* All 3 protected by @syslog_lock. */
/* the next printk record to read by syslog(READ) or /proc/kmsg */
@@ -2322,7 +2316,12 @@ asmlinkage int vprintk_emit(int facility, int level,
if (unlikely(suppress_printk))
return 0;

- if (unlikely(suppress_panic_printk) && other_cpu_in_panic())
+ /*
+ * The messages on the panic CPU are the most important. If
+ * non-panic CPUs are generating any messages, they will be
+ * silently dropped.
+ */
+ if (other_cpu_in_panic())
return 0;

if (level == LOGLEVEL_SCHED) {
@@ -2799,8 +2798,6 @@ void console_prepend_dropped(struct printk_message *pmsg, unsigned long dropped)
bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
bool is_extended, bool may_suppress)
{
- static int panic_console_dropped;
-
struct printk_buffers *pbufs = pmsg->pbufs;
const size_t scratchbuf_sz = sizeof(pbufs->scratchbuf);
const size_t outbuf_sz = sizeof(pbufs->outbuf);
@@ -2828,17 +2825,6 @@ bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
pmsg->seq = r.info->seq;
pmsg->dropped = r.info->seq - seq;

- /*
- * Check for dropped messages in panic here so that printk
- * suppression can occur as early as possible if necessary.
- */
- if (pmsg->dropped &&
- panic_in_progress() &&
- panic_console_dropped++ > 10) {
- suppress_panic_printk = 1;
- pr_warn_once("Too many dropped messages. Suppress messages on non-panic CPUs to prevent livelock.\n");
- }
-
/* Skip record that has level above the console loglevel. */
if (may_suppress && suppress_message_printing(r.info->level))
goto out;
--
2.39.2


2023-12-14 21:44:28

by John Ogness

Subject: [PATCH printk v3 07/14] printk: Add this_cpu_in_panic()

There is already panic_in_progress() and other_cpu_in_panic(),
but checking if the current CPU is the panic CPU must still be
open-coded.

Add this_cpu_in_panic() to complete the set.
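
The resulting set of helpers is then:

  panic_in_progress()  - some CPU has entered panic()
  this_cpu_in_panic()  - the current CPU is the one executing panic()
  other_cpu_in_panic() - a panic is in progress, but on another CPU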

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/internal.h | 1 +
kernel/printk/printk.c | 43 +++++++++++++++++++++-------------------
2 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 6c2afee5ef62..ac2d9750e5f8 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -130,6 +130,7 @@ struct printk_message {
};

bool other_cpu_in_panic(void);
+bool this_cpu_in_panic(void);
bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
bool is_extended, bool may_supress);

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 22bb45d8f2f2..82dc2c7949b7 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -347,6 +347,29 @@ static bool panic_in_progress(void)
return unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID);
}

+/* Return true if a panic is in progress on the current CPU. */
+bool this_cpu_in_panic(void)
+{
+ /*
+ * We can use raw_smp_processor_id() here because it is impossible for
+ * the task to be migrated to the panic_cpu, or away from it. If
+ * panic_cpu has already been set, and we're not currently executing on
+ * that CPU, then we never will be.
+ */
+ return unlikely(atomic_read(&panic_cpu) == raw_smp_processor_id());
+}
+
+/*
+ * Return true if a panic is in progress on a remote CPU.
+ *
+ * On true, the local CPU should immediately release any printing resources
+ * that may be needed by the panic CPU.
+ */
+bool other_cpu_in_panic(void)
+{
+ return (panic_in_progress() && !this_cpu_in_panic());
+}
+
/*
* This is used for debugging the mess that is the VT code by
* keeping track if we have the console semaphore held. It's
@@ -2592,26 +2615,6 @@ static int console_cpu_notify(unsigned int cpu)
return 0;
}

-/*
- * Return true if a panic is in progress on a remote CPU.
- *
- * On true, the local CPU should immediately release any printing resources
- * that may be needed by the panic CPU.
- */
-bool other_cpu_in_panic(void)
-{
- if (!panic_in_progress())
- return false;
-
- /*
- * We can use raw_smp_processor_id() here because it is impossible for
- * the task to be migrated to the panic_cpu, or away from it. If
- * panic_cpu has already been set, and we're not currently executing on
- * that CPU, then we never will be.
- */
- return atomic_read(&panic_cpu) != raw_smp_processor_id();
-}
-
/**
* console_lock - block the console subsystem from printing
*
--
2.39.2


2023-12-14 21:44:36

by John Ogness

Subject: [PATCH printk v3 09/14] printk: Wait for all reserved records with pr_flush()

Currently pr_flush() will only wait for records that were
available to readers at the time of the call (using
prb_next_seq()). But there may be more records (non-finalized)
that are followed by finalized records. pr_flush() should wait
for these to print as well, particularly because any trailing
finalized records may be the messages that the calling context
wants to ensure are printed.

Add a new ringbuffer function prb_next_reserve_seq() to return
the sequence number following the most recently reserved record.
This guarantees that pr_flush() will wait until all current
printk() messages (completed or in progress) have been printed.
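
As a worked example of the arithmetic in the new function
(hypothetical values): if the last finalized record has seq=5 and
descriptor id X, and @head_id is X+3, then records 6-8 have been
reserved (though possibly not finalized) and
prb_next_reserve_seq() returns 5 + 3 + 1 = 9, the sequence number
that the next reservation will receive.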

Fixes: 3b604ca81202 ("printk: add pr_flush()")
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 2 +-
kernel/printk/printk_ringbuffer.c | 113 ++++++++++++++++++++++++++++++
kernel/printk/printk_ringbuffer.h | 1 +
3 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 82dc2c7949b7..f3a7f5a6f6f8 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3755,7 +3755,7 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre

might_sleep();

- seq = prb_next_seq(prb);
+ seq = prb_next_reserve_seq(prb);

/* Flush the consoles so that records up to @seq are printed. */
console_lock();
diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 67ee1c62fcd6..b7748d7c44c1 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1986,6 +1986,119 @@ u64 prb_first_seq(struct printk_ringbuffer *rb)
return seq;
}

+/**
+ * prb_next_reserve_seq() - Get the sequence number after the most recently
+ * reserved record.
+ *
+ * @rb: The ringbuffer to get the sequence number from.
+ *
+ * This is the public function available to readers to see what sequence
+ * number will be assigned to the next reserved record.
+ *
+ * Note that depending on the situation, this value can be equal to or
+ * higher than the sequence number returned by prb_next_seq().
+ *
+ * Context: Any context.
+ * Return: The sequence number that will be assigned to the next record
+ * reserved.
+ */
+u64 prb_next_reserve_seq(struct printk_ringbuffer *rb)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ unsigned long last_finalized_id;
+ atomic_long_t *state_var;
+ u64 last_finalized_seq;
+ unsigned long head_id;
+ struct prb_desc desc;
+ unsigned long diff;
+ struct prb_desc *d;
+ int err;
+
+ /*
+ * It may not be possible to read a sequence number for @head_id.
+ * So the ID of @last_finalized_seq is used to calculate what the
+ * sequence number of @head_id will be.
+ */
+
+try_again:
+ last_finalized_seq = desc_last_finalized_seq(rb);
+
+ /*
+ * @head_id is loaded after @last_finalized_seq to ensure that it is
+ * at or beyond @last_finalized_seq.
+ *
+ * Memory barrier involvement:
+ *
+ * If desc_last_finalized_seq:A reads from
+ * desc_update_last_finalized:A, then
+ * prb_next_reserve_seq:A reads from desc_reserve:D.
+ *
+ * Relies on:
+ *
+ * RELEASE from desc_reserve:D to desc_update_last_finalized:A
+ * matching
+ * ACQUIRE from desc_last_finalized_seq:A to prb_next_reserve_seq:A
+ *
+ * Note: desc_reserve:D and desc_update_last_finalized:A can be
+ * different CPUs. However, the desc_update_last_finalized:A CPU
+ * (which performs the release) must have previously seen
+ * desc_read:C, which implies desc_reserve:D can be seen.
+ */
+ head_id = atomic_long_read(&desc_ring->head_id); /* LMM(prb_next_reserve_seq:A) */
+
+ d = to_desc(desc_ring, last_finalized_seq);
+ state_var = &d->state_var;
+
+ /* Extract the ID, used to specify the descriptor to read. */
+ last_finalized_id = DESC_ID(atomic_long_read(state_var));
+
+ /* Ensure @last_finalized_id is correct. */
+ err = desc_read_finalized_seq(desc_ring, last_finalized_id, last_finalized_seq, &desc);
+
+ if (err == -EINVAL) {
+ if (last_finalized_seq == 0) {
+ /*
+ * @last_finalized_seq still contains its initial
+ * value. Probably no record has been finalized yet.
+ * This means the ringbuffer is not yet full and the
+ * @head_id value can be used directly (subtracting
+ * off the id value corresponding to seq=0).
+ */
+
+ /*
+ * Because of hack#2 of the bootstrapping phase, the
+ * @head_id initial value must be handled separately.
+ */
+ if (head_id == DESC0_ID(desc_ring->count_bits))
+ return 0;
+
+ /*
+ * The @head_id is initialized such that the first
+ * increment will yield the first record (seq=0).
+ * Therefore use the initial value +1 as the base to
+ * subtract from @head_id.
+ */
+ last_finalized_id = DESC0_ID(desc_ring->count_bits) + 1;
+ } else {
+ /* Record must have been overwritten. Try again. */
+ goto try_again;
+ }
+ }
+
+ /*
+ * @diff is the number of records beyond the last record available
+ * to readers.
+ */
+ diff = head_id - last_finalized_id;
+
+ /*
+ * @head_id points to the most recently reserved record, but this
+ * function returns the sequence number that will be assigned to the
+ * next (not yet reserved) record. Thus +1 is needed.
+ */
+ return (last_finalized_seq + diff + 1);
+}
+
/*
* Non-blocking read of a record.
*
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index d49460f7578e..52626d0f1fa3 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -395,6 +395,7 @@ bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
u64 prb_first_seq(struct printk_ringbuffer *rb);
u64 prb_first_valid_seq(struct printk_ringbuffer *rb);
u64 prb_next_seq(struct printk_ringbuffer *rb);
+u64 prb_next_reserve_seq(struct printk_ringbuffer *rb);

#ifdef CONFIG_64BIT

--
2.39.2


2023-12-14 21:46:07

by John Ogness

Subject: [PATCH printk v3 14/14] panic: Flush kernel log buffer at the end

If the kernel crashes in a context where printk() calls always
defer printing (such as in NMI or inside a printk_safe section)
then the final panic messages will be deferred to irq_work. But
if irq_work is not available, the messages will not get printed
unless explicitly flushed. The result is that the final
"end Kernel panic" banner does not get printed.

Add one final flush after the last printk() call to make sure
the final panic messages make it out as well.

Signed-off-by: John Ogness <[email protected]>
---
kernel/panic.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/panic.c b/kernel/panic.c
index 2807639aab51..f22d8f33ea14 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -446,6 +446,14 @@ void panic(const char *fmt, ...)

/* Do not scroll important messages printed above */
suppress_printk = 1;
+
+ /*
+ * The final messages may not have been printed if in a context that
+ * defers printing (such as NMI) and irq_work is not available.
+ * Explicitly flush the kernel log buffer one last time.
+ */
+ console_flush_on_panic(CONSOLE_FLUSH_PENDING);
+
local_irq_enable();
for (i = 0; ; i += PANIC_TIMER_STEP) {
touch_softlockup_watchdog();
--
2.39.2


Subject: Re: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros

On 2023-12-14 22:47:49 [+0106], John Ogness wrote:
> Note: This change only applies to 32bit architectures. On 64bit
> architectures the macros are NOPs.

You lost my authorship.
But a more serious matter: is this bug already present in the current printk
tree, or is this code just not used and enabled later? Patch #1 from this
series moves the sequence macros around, so is this problem introduced in the
previous patch and fixed here?

Sebastian

2023-12-15 10:11:07

by John Ogness

Subject: Re: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros

On 2023-12-15, Sebastian Andrzej Siewior <[email protected]> wrote:
> On 2023-12-14 22:47:49 [+0106], John Ogness wrote:
>> Note: This change only applies to 32bit architectures. On 64bit
>> architectures the macros are NOPs.
>
> You lost my authorship.

OK. Thanks for speaking up. (And thanks for doing the heavy lifting
researching the needed change.)

> But serious matter: Is this bug already present in the current printk
> tree or is this code just not used and enabled later? Patch #1 from
> this series moves the sequence around so is this problem introduced in
> the previous patch and fixed here?

No, it is not a bug at this point. As the commit message says:

"Until now this mapping has been correct for all call sites."

But if the mapping is not changed here, it will become a bug after
applying patch 3 of this series.

John

Subject: Re: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros

On 2023-12-15 11:16:44 [+0106], John Ogness wrote:
> No, it is not a bug at this point. As the commit message writes:
>
> "Until now this mapping has been correct for all call sites."
>
> But if the mapping is not changed here, it will become a bug after
> applying patch 3 of this series.

Okay. Thank you for clarifying this.

> John

Sebastian

2024-01-12 10:16:59

by Petr Mladek

Subject: Re: [PATCH printk v3 01/14] printk: nbcon: Relocate 32bit seq macros

On Thu 2023-12-14 22:47:48, John Ogness wrote:
> The macros __seq_to_nbcon_seq() and __nbcon_seq_to_seq() are
> used to provide support for atomic handling of sequence numbers
> on 32bit systems. Until now this was only used by nbcon.c,
> which is why they were located in nbcon.c and include nbcon in
> the name.
>
> In a follow-up commit this functionality is also needed by
> printk_ringbuffer. Rather than duplicating the functionality,
> relocate the macros to printk_ringbuffer.h.
>
> Also, since the macros will be no longer nbcon-specific, rename
> them to __u64seq_to_ulseq() and __ulseq_to_u64seq().
>
> This does not result in any functional change.
>
> Signed-off-by: John Ogness <[email protected]>

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-01-12 10:28:47

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros

On Thu 2023-12-14 22:47:49, John Ogness wrote:
> Note: This change only applies to 32bit architectures. On 64bit
> architectures the macros are NOPs.
>
> __ulseq_to_u64seq() computes the upper 32 bits of the passed
> argument value (@ulseq). The upper bits are derived from a base
> value (@rb_next_seq) in a way that assumes @ulseq represents a
> 64bit number that is less than or equal to @rb_next_seq.
>
> Until now this mapping has been correct for all call sites. However,
> in a follow-up commit, values of @ulseq will be passed in that are
> higher than the base value. This requires a change to how the 32bit
> value is mapped to a 64bit sequence number.
>
> Rather than mapping @ulseq such that the base value is the end of a
> 32bit block, map @ulseq such that the base value is in the middle of
> a 32bit block. This allows supporting 31 bits before and after the
> base value, which is deemed acceptable for the console sequence
> number during runtime.
>
> Here is an example to illustrate the previous and new mappings.
>
> For a base value (@rb_next_seq) of 2 2000 0000...
>
> Before this change the range of possible return values was:
>
> 1 2000 0001 to 2 2000 0000
>
> __ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
> __ulseq_to_u64seq(2000 0000) => 2 2000 0000
> __ulseq_to_u64seq(2000 0001) => 1 2000 0001
> __ulseq_to_u64seq(9fff ffff) => 1 9fff ffff
> __ulseq_to_u64seq(a000 0000) => 1 a000 0000
> __ulseq_to_u64seq(a000 0001) => 1 a000 0001
>
> After this change the range of possible return values are:
> 1 a000 0001 to 2 a000 0000
>
> __ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
> __ulseq_to_u64seq(2000 0000) => 2 2000 0000
> __ulseq_to_u64seq(2000 0001) => 2 2000 0001
> __ulseq_to_u64seq(9fff ffff) => 2 9fff ffff
> __ulseq_to_u64seq(a000 0000) => 2 a000 0000
> __ulseq_to_u64seq(a000 0001) => 1 a000 0001
>
> [ john.ogness: Rewrite commit message. ]
>
> Reported-by: Francesco Dolcini <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> Closes: https://lore.kernel.org/oe-lkp/[email protected]
> Reported-by: kernel test robot <[email protected]>
> Closes: https://lore.kernel.org/oe-lkp/[email protected]
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> Signed-off-by: John Ogness <[email protected]>

Great catch! It must have been complicated to debug this.

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr
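
For readers who want to experiment with the mapping discussed above,
here is a minimal stand-alone sketch in plain C. The function name and
the userspace framing are invented for illustration; the kernel macro
operates on a struct printk_ringbuffer and derives the base value
internally:

#include <stdint.h>
#include <stdio.h>

/*
 * Stand-alone model of the new mapping (illustrative names, not the
 * kernel helpers): the base value sits in the middle of a 32bit
 * window, so up to 2^31 sequence numbers before and after the base
 * can be recovered from the lower 32 bits alone.
 */
static uint64_t ulseq_to_u64seq(uint64_t base, uint32_t ulseq)
{
	/* The signed 32bit difference folds @ulseq into the window
	 * (base - 2^31, base + 2^31]. */
	return base - (int32_t)((uint32_t)base - ulseq);
}

int main(void)
{
	uint64_t base = 0x220000000ULL; /* "2 2000 0000" from the example */

	/* Reproduces the table above: */
	printf("%llx\n", (unsigned long long)ulseq_to_u64seq(base, 0x20000001)); /* 2 2000 0001 */
	printf("%llx\n", (unsigned long long)ulseq_to_u64seq(base, 0xa0000000)); /* 2 a000 0000 */
	printf("%llx\n", (unsigned long long)ulseq_to_u64seq(base, 0xa0000001)); /* 1 a000 0001 */
	return 0;
}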


2024-01-12 16:21:03

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 03/14] printk: Use prb_first_seq() as base for 32bit seq macros

On Thu 2023-12-14 22:47:50, John Ogness wrote:
> Note: This change only applies to 32bit architectures. On 64bit
> architectures the macros are NOPs.
>
> Currently prb_next_seq() is used as the base for the 32bit seq
> macros __u64seq_to_ulseq() and __ulseq_to_u64seq(). However, in
> a follow-up commit, prb_next_seq() will need to make use of the
> 32bit seq macros.
>
> Use prb_first_seq() as the base for the 32bit seq macros instead
> because it is guaranteed to return 64bit sequence numbers without
> relying on any 32bit seq macros.
>
> Signed-off-by: John Ogness <[email protected]>

This increases the risk of an overflow which would result
in a wrong u64 sequence number. I was curious what numbers
we are talking about.

Here is my thinking:

The difference between prb_next_seq() and prb_first_seq() is
limited by the amount of messages stored in the buffer.

Every stored message must have its metadata stored in the desc_ring.
The current code expects the average message length to be 32 bytes.

Now, __ulseq_to_u64seq() would give a wrong result when
(s32)((u32)rb_first_seq - ulseq) overflows, i.e. after 2^31
sequence numbers.

If we created a descriptor buffer for this many messages, then the
data buffer would be 32x bigger, so:

2^31 * 32 = 68,719,476,736 => 68GB

=> we seem to be on the safe side with a good reserve.


Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-01-12 18:06:00

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 04/14] printk: ringbuffer: Do not skip non-finalized records with prb_next_seq()

On Thu 2023-12-14 22:47:51, John Ogness wrote:
> Commit f244b4dc53e5 ("printk: ringbuffer: Improve
> prb_next_seq() performance") introduced an optimization for
> prb_next_seq() by using best-effort to track recently finalized
> records. However, the order of finalization does not
> necessarily match the order of the records. The optimization
> changed prb_next_seq() to return inconsistent results, possibly
> yielding sequence numbers that are not available to readers
> because they are preceded by non-finalized records or they are
> not yet visible to the reader CPU.
>
> Rather than simply best-effort tracking recently finalized
> records, force the committing writer to read records and
> increment the last "contiguous block" of finalized records. In
> order to do this, the sequence number instead of ID must be
> stored because ID's cannot be directly compared.
>
> A new memory barrier pair is introduced to guarantee that a
> reader can always read the records up until the sequence number
> returned by prb_next_seq() (unless the records have since
> been overwritten in the ringbuffer).
>
> This restores the original functionality of prb_next_seq()
> while also keeping the optimization.
>
> For 32bit systems, only the lower 32 bits of the sequence
> number are stored. When reading the value, it is expanded to
> the full 64bit sequence number using the 32bit seq macros,
> which fold in the value returned by prb_first_seq().
>
> @@ -1441,20 +1445,118 @@ bool prb_reserve_in_last(struct prb_reserved_entry *e, struct printk_ringbuffer
> return false;
> }
>
> +/*
> + * @last_finalized_seq value guarantees that all records up to and including
> + * this sequence number are finalized and can be read. The only exception are
> + * too old records which have already been overwritten.
> + *
> + * It is also guaranteed that @last_finalized_seq only increases.
> + *
> + * Be aware that finalized records following non-finalized records are not
> + * reported because they are not yet available to the reader. For example,
> + * a new record stored via printk() will not be available to a printer if
> + * it follows a record that has not been finalized yet. However, once that
> + * non-finalized record becomes finalized, @last_finalized_seq will be
> + * appropriately updated and the full set of finalized records will be
> + * available to the printer. And since each printk() caller will either
> + * directly print or trigger deferred printing of all available unprinted
> + * records, all printk() messages will get printed.
> + */
> +static u64 desc_last_finalized_seq(struct printk_ringbuffer *rb)
> +{
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> + unsigned long ulseq;
> +
> + /*
> + * Guarantee the sequence number is loaded before loading the
> + * associated record in order to guarantee that the record can be
> + * seen by this CPU. This pairs with desc_update_last_finalized:A.
> + */
> + ulseq = atomic_long_read_acquire(&desc_ring->last_finalized_seq
> + ); /* LMM(desc_last_finalized_seq:A) */
> +
> + return __ulseq_to_u64seq(rb, ulseq);
> +}
> +
> +static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
> + struct printk_record *r, unsigned int *line_count);
> +
> +/*
> + * Check if there are records directly following @last_finalized_seq that are
> + * finalized. If so, update @last_finalized_seq to the latest of these
> + * records. It is not allowed to skip over records that are not yet finalized.
> + */
> +static void desc_update_last_finalized(struct printk_ringbuffer *rb)
> +{
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> + u64 old_seq = desc_last_finalized_seq(rb);
> + unsigned long oldval;
> + unsigned long newval;
> + u64 finalized_seq;
> + u64 try_seq;
> +
> +try_again:
> + finalized_seq = old_seq;
> + try_seq = finalized_seq + 1;
> +
> + /* Try to find later finalized records. */
> + while (_prb_read_valid(rb, &try_seq, NULL, NULL)) {
> + finalized_seq = try_seq;
> + try_seq++;
> + }
> +
> + /* No update needed if no later finalized record was found. */
> + if (finalized_seq == old_seq)
> + return;
> +
> + oldval = __u64seq_to_ulseq(old_seq);
> + newval = __u64seq_to_ulseq(finalized_seq);
> +
> + /*
> + * Set the sequence number of a later finalized record that has been
> + * seen.
> + *
> + * Guarantee the record data is visible to other CPUs before storing
> + * its sequence number. This pairs with desc_last_finalized_seq:A.
> + *
> + * Memory barrier involvement:
> + *
> + * If desc_last_finalized_seq:A reads from
> + * desc_update_last_finalized:A, then desc_read:A reads from
> + * _prb_commit:B.
> + *
> + * Relies on:
> + *
> + * RELEASE from _prb_commit:B to desc_update_last_finalized:A
> + * matching
> + * ACQUIRE from desc_last_finalized_seq:A to desc_read:A
> + *
> + * Note: _prb_commit:B and desc_update_last_finalized:A can be
> + * different CPUs. However, the desc_update_last_finalized:A
> + * CPU (which performs the release) must have previously seen
> + * _prb_commit:B.
> + */
> + if (!atomic_long_try_cmpxchg_release(&desc_ring->last_finalized_seq,
> + &oldval, newval)) { /* LMM(desc_update_last_finalized:A) */
> + old_seq = __ulseq_to_u64seq(rb, oldval);
> + goto try_again;
> + }
> +}
> +
> /*
> * Attempt to finalize a specified descriptor. If this fails, the descriptor
> * is either already final or it will finalize itself when the writer commits.
> */
> -static void desc_make_final(struct prb_desc_ring *desc_ring, unsigned long id)
> +static void desc_make_final(struct printk_ringbuffer *rb, unsigned long id)
> {
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> unsigned long prev_state_val = DESC_SV(id, desc_committed);
> struct prb_desc *d = to_desc(desc_ring, id);
>
> - atomic_long_cmpxchg_relaxed(&d->state_var, prev_state_val,
> - DESC_SV(id, desc_finalized)); /* LMM(desc_make_final:A) */
> -
> - /* Best effort to remember the last finalized @id. */
> - atomic_long_set(&desc_ring->last_finalized_id, id);
> + if (atomic_long_try_cmpxchg_relaxed(&d->state_var, &prev_state_val,
> + DESC_SV(id, desc_finalized))) { /* LMM(desc_make_final:A) */
> + desc_update_last_finalized(rb);
> + }
> }
> @@ -2008,7 +2107,9 @@ u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
> * newest sequence number available to readers will be.
> *
> * This provides readers a sequence number to jump to if all currently
> - * available records should be skipped.
> + * available records should be skipped. It is guaranteed that all records
> + * previous to the returned value have been finalized and are (or were)
> + * available to the reader.
> *
> * Context: Any context.
> * Return: The sequence number of the next newest (not yet available) record
> @@ -2016,34 +2117,19 @@ u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
> */
> u64 prb_next_seq(struct printk_ringbuffer *rb)
> {
> - struct prb_desc_ring *desc_ring = &rb->desc_ring;
> - enum desc_state d_state;
> - unsigned long id;
> u64 seq;
>
> - /* Check if the cached @id still points to a valid @seq. */
> - id = atomic_long_read(&desc_ring->last_finalized_id);
> - d_state = desc_read(desc_ring, id, NULL, &seq, NULL);
> + seq = desc_last_finalized_seq(rb);

desc_last_finalized_seq() does internally:

	ulseq = atomic_long_read_acquire(&desc_ring->last_finalized_seq
					); /* LMM(desc_last_finalized_seq:A) */


It guarantees that this CPU would see the data which were seen by the
CPU which updated desc_ring->last_finalized_seq.

So far so good.

The problem is that I somehow miss the counterpart. Maybe,
it is not needed. It just looks strange.

> - if (d_state == desc_finalized || d_state == desc_reusable) {
> - /*
> - * Begin searching after the last finalized record.
> - *
> - * On 0, the search must begin at 0 because of hack#2
> - * of the bootstrapping phase it is not known if a
> - * record at index 0 exists.
> - */
> - if (seq != 0)
> - seq++;
> - } else {
> - /*
> - * The information about the last finalized sequence number
> - * has gone. It should happen only when there is a flood of
> - * new messages and the ringbuffer is rapidly recycled.
> - * Give up and start from the beginning.
> - */
> - seq = 0;
> - }
> + /*
> + * Begin searching after the last finalized record.
> + *
> + * On 0, the search must begin at 0 because of hack#2
> + * of the bootstrapping phase it is not known if a
> + * record at index 0 exists.
> + */
> + if (seq != 0)
> + seq++;
>
> /*
> * The information about the last finalized @seq might be inaccurate.

Below is the code:

	while (_prb_read_valid(rb, &seq, NULL, NULL))
		seq++;

Maybe, the release() should be here to make sure that the CPU which
would see this "seq" would also see the data.

Would it make sense, please?

Otherwise, it looks good.

Best Regards,
Petr

2024-01-12 18:15:08

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros

On Fri 2024-01-12 11:28:35, Petr Mladek wrote:
> On Thu 2023-12-14 22:47:49, John Ogness wrote:
> > Note: This change only applies to 32bit architectures. On 64bit
> > architectures the macros are NOPs.
> >
> > __ulseq_to_u64seq() computes the upper 32 bits of the passed
> > argument value (@ulseq). The upper bits are derived from a base
> > value (@rb_next_seq) in a way that assumes @ulseq represents a
> > 64bit number that is less than or equal to @rb_next_seq.
> >
> > Until now this mapping has been correct for all call sites. However,
> > in a follow-up commit, values of @ulseq will be passed in that are
> > higher than the base value. This requires a change to how the 32bit
> > value is mapped to a 64bit sequence number.
> >
> > Rather than mapping @ulseq such that the base value is the end of a
> > 32bit block, map @ulseq such that the base value is in the middle of
> > a 32bit block. This allows supporting 31 bits before and after the
> > base value, which is deemed acceptable for the console sequence
> > number during runtime.
> >
> > Here is an example to illustrate the previous and new mappings.
> >
> > For a base value (@rb_next_seq) of 2 2000 0000...
> >
> > Before this change the range of possible return values was:
> >
> > 1 2000 0001 to 2 2000 0000
> >
> > __ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
> > __ulseq_to_u64seq(2000 0000) => 2 2000 0000
> > __ulseq_to_u64seq(2000 0001) => 1 2000 0001
> > __ulseq_to_u64seq(9fff ffff) => 1 9fff ffff
> > __ulseq_to_u64seq(a000 0000) => 1 a000 0000
> > __ulseq_to_u64seq(a000 0001) => 1 a000 0001
> >
> > After this change the range of possible return values are:
> > 1 a000 0001 to 2 a000 0000
> >
> > __ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
> > __ulseq_to_u64seq(2000 0000) => 2 2000 0000
> > __ulseq_to_u64seq(2000 0001) => 2 2000 0001
> > __ulseq_to_u64seq(9fff ffff) => 2 9fff ffff
> > __ulseq_to_u64seq(a000 0000) => 2 a000 0000
> > __ulseq_to_u64seq(a000 0001) => 1 a000 0001
> >
> > [ john.ogness: Rewrite commit message. ]
> >
> > Reported-by: Francesco Dolcini <[email protected]>
> > Reported-by: kernel test robot <[email protected]>
> > Closes: https://lore.kernel.org/oe-lkp/[email protected]
> > Reported-by: kernel test robot <[email protected]>
> > Closes: https://lore.kernel.org/oe-lkp/[email protected]
> > Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> > Signed-off-by: John Ogness <[email protected]>
>
> Great catch! It must have been complicated to debug this.
>
> Reviewed-by: Petr Mladek <[email protected]>

That said, I am a bit nervous that a bug like this might cause
workqueue stall and panic() the kernel.

At least, this is how I read
https://lore.kernel.org/oe-lkp/[email protected]/

[ 125.917650][ C0] workqueue events: flags=0x0
[ 125.917979][ C0] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 125.917984][ C0] pending: deferred_probe_timeout_work_func
[ 125.917992][ C0] workqueue events_power_efficient: flags=0x82
[ 125.919409][ C0] pwq 4: cpus=0-1 flags=0x4 nice=0 active=1/256 refcnt=2
[ 125.919412][ C0] pending: do_cache_clean
[ 125.919419][ C0] workqueue mm_percpu_wq: flags=0x8
[ 125.920619][ C0] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 125.920622][ C0] pending: vmstat_update
[ 125.920632][ C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[ 129.568547][ C1] rcu: INFO: rcu_sched self-detected stall on CPU
[ 129.568553][ C1] rcu: 1-....: (26250 ticks this GP) idle=3c24/1/0x40000000 softirq=5858/5934 fqs=13122
[ 129.568558][ C1] rcu: (t=26251 jiffies g=301 q=598 ncpus=2)
[ 129.568562][ C1] CPU: 1 PID: 1 Comm: swapper/0 Tainted: G W N 6.6.0-00134-g768c33033e41 #1
[ 129.568565][ C1] EIP: desc_read (kernel/printk/printk_ringbuffer.c:439)
[ 129.568573][ C1] Code: fb 83 c4 08 5b 5e 5f 5d c3 8d 74 26 00 0f 0b eb d2 3e 8d 74 26 00 55 89 e5 57 89 cf 56 53 89 c3 83 ec 08 8b 08 b8 ff ff ff ff <89> 55 f0 d3 e0 f7 d0 21 d0 6b f0 58 83 f9 1f 0f 87 a7 e6 83 00 03

[ 129.568575][ C1] EAX: ffffffff EBX: c1d5cd60 ECX: 0000000f EDX: 3fff8001
[ 129.568576][ C1] ESI: 00000001 EDI: c320fd78 EBP: c320fd28 ESP: c320fd14
[ 129.568578][ C1] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00000286
[ 129.568582][ C1] CR0: 80050033 CR2: b7d77b93 CR3: 022b4000 CR4: 00040690
[ 129.568583][ C1] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 129.568584][ C1] DR6: fffe0ff0 DR7: 00000400
[ 129.568585][ C1] Call Trace:
[ 129.568589][ C1] ? show_regs (arch/x86/kernel/dumpstack.c:479)
[ 129.568596][ C1] ? dump_cpu_task (kernel/sched/core.c:11572)
[ 129.568602][ C1] ? rcu_dump_cpu_stacks (include/linux/cpumask.h:143 include/linux/cpumask.h:150 include/linux/cpumask.h:212 kernel/rcu/tree_stall.h:367)
[ 129.568606][ C1] ? vprintk (kernel/printk/printk_safe.c:46)
[ 129.568611][ C1] ? print_cpu_stall (kernel/rcu/tree_stall.h:692)
[ 129.568616][ C1] ? check_cpu_stall (kernel/rcu/tree_stall.h:775)
[ 129.568620][ C1] ? rcu_sched_clock_irq (kernel/rcu/tree.c:3869 kernel/rcu/tree.c:2240)
[ 129.568623][ C1] ? update_process_times (arch/x86/include/asm/preempt.h:27 kernel/time/timer.c:2080)
[ 129.568628][ C1] ? tick_sched_timer (kernel/time/tick-sched.c:1497)
[ 129.568632][ C1] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1688 kernel/time/hrtimer.c:1752)
[ 129.568636][ C1] ? cpumask_weight (arch/x86/kernel/alternative.c:1766)
[ 129.568640][ C1] ? hrtimer_interrupt (kernel/time/hrtimer.c:1817)
[ 129.568645][ C1] ? sysvec_call_function_single (arch/x86/kernel/apic/apic.c:1074)
[ 129.568650][ C1] ? __sysvec_apic_timer_interrupt (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:444 include/linux/jump_label.h:260 include/linux/jump_label.h:270 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1081)
[ 129.568653][ C1] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074)
[ 129.568656][ C1] ? handle_exception (arch/x86/entry/entry_32.S:1049)
[ 129.568662][ C1] ? sysvec_call_function_single (arch/x86/kernel/apic/apic.c:1074)
[ 129.568664][ C1] ? desc_read (kernel/printk/printk_ringbuffer.c:439)
[ 129.568668][ C1] ? sysvec_call_function_single (arch/x86/kernel/apic/apic.c:1074)
[ 129.568670][ C1] ? desc_read (kernel/printk/printk_ringbuffer.c:439)
[ 129.568675][ C1] desc_read_finalized_seq (kernel/printk/printk_ringbuffer.c:1907)
[ 129.568680][ C1] prb_next_reserve_seq (kernel/printk/printk_ringbuffer.c:2084)
[ 129.568682][ C1] ? nbcon_get_cpu_emergency_nesting (kernel/printk/nbcon.c:984)
[ 129.568689][ C1] nbcon_atomic_flush_all (kernel/printk/nbcon.c:1133)
[ 129.568692][ C1] ? nbcon_atomic_flush_all (kernel/printk/nbcon.c:1133)
[ 129.568695][ C1] printk_trigger_flush (arch/x86/include/asm/preempt.h:85 kernel/printk/printk.c:3993)
[ 129.568698][ C1] nbcon_cpu_emergency_exit (kernel/printk/nbcon.c:1193)
[ 129.568701][ C1] ? refcount_warn_saturate (lib/refcount.c:28 (discriminator 3))
[ 129.568706][ C1] __warn (kernel/panic.c:692)
[ 129.568711][ C1] ? refcount_warn_saturate (lib/refcount.c:28 (discriminator 3))
[ 129.568714][ C1] report_bug (lib/bug.c:201 lib/bug.c:219)
[ 129.568720][ C1] ? exc_overflow (arch/x86/kernel/traps.c:250)
[ 129.568722][ C1] handle_bug (arch/x86/kernel/traps.c:216)
[ 129.568724][ C1] exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1))
[ 129.568727][ C1] handle_exception (arch/x86/entry/entry_32.S:1049)
[ 129.568729][ C1] EIP: refcount_warn_saturate (lib/refcount.c:28 (discriminator 3))
[ 129.568731][ C1] Code: 79 18 c2 01 68 a8 b9 c3 c1 e8 03 7b cf ff 0f 0b 58 c9 c3 8d b6 00 00 00 00 c6 05 86 79 18 c2 01 68 50 b9 c3 c1 e8 e7 7a cf ff <0f> 0b 59 c9 c3 66 90 89 c2 8b 00 3d 00 00 00 c0 74 12 83 f8 01 74

It looks like it caused some loop and refcount overflow or so.
But I might be wrong.

I would like to better understand this and check if we could prevent
it somehow.

Best Regards,
Petr

Subject: Re: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros

On 2024-01-12 19:14:44 [+0100], Petr Mladek wrote:
>
> That said, I am a bit nervous that a bug like this might cause
> workqueue stall and panic() the kernel.


> At least, this is how I read
> https://lore.kernel.org/oe-lkp/[email protected]/

well, the workqueue stalls, and RCU as well, because the CPU spins. That
is a natural consequence of the CPU making no progress (at boot). The
panic _might_ be due to panic_on_error or so.
There is no scheduler, nothing, so one CPU is blocked and the world ends…

> It looks like it caused some loop and refcount overflow or so.
> But I might be wrong.
>
> I would like to better understand this and check if we could prevent
> it somehow.

Based on memory: the problem is that the sign extension bug (the now
fixed bug) returned a wrong or too low sequence number. So the printk
code tried again to obtain a new sequence number, and got a wrong one
again. And this is what looped during boot.

I'm not sure if this sort of lockup can happen now after the bug is
fixed. I can issue an NMI backtrace on all CPUs (32) without the sync (so
they all can printk immediately and not one after the other) and it
prints and continues…

> Best Regards,
> Petr

Sebastian

2024-01-15 10:52:58

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros

On 2024-01-12, Petr Mladek <[email protected]> wrote:
>> > Reported-by: Francesco Dolcini <[email protected]>
>> > Reported-by: kernel test robot <[email protected]>
>> > Closes: https://lore.kernel.org/oe-lkp/[email protected]
>> > Reported-by: kernel test robot <[email protected]>
>> > Closes: https://lore.kernel.org/oe-lkp/[email protected]
>> > Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
>> > Signed-off-by: John Ogness <[email protected]>
>>
>> Great catch! It must have been complicated to debug this.
>>
>> Reviewed-by: Petr Mladek <[email protected]>
>
> That said, I am a bit nervous that a bug like this might cause
> workqueue stall and panic() the kernel.
>
> At least, this is how I read
> https://lore.kernel.org/oe-lkp/[email protected]/

Note that the bug is reported for code that mainline will never
have. This patch "fixes" the problem before it is a problem. From the
perspective of mainline, the problem never existed. Perhaps it is
inappropriate to list these with the Closes tag.

> It looks like it caused some loop and refcount overflow or so.
> But I might be wrong.
>
> I would like to better understand this and check if we could prevent
> it somehow.

The code in question can be found here:

https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/kernel/printk/printk_ringbuffer.c?id=768c33033e41ad195a9304ebb183afb730d5ae98#n2031

The URL above points to prb_next_reserve_seq(), which is in an infinite
loop.

desc_last_finalized_seq() is returning a huge garbage value due to the
32bit sequence value 1 being incorrectly mapped to 0xffffffff00000001
(which is 18446744069414584321). When desc_read_finalized_seq() is
called (line 2082), it obviously returns -EINVAL because there is no
record with this huge sequence number.

The code interprets -EINVAL to mean that the record has been
overwritten, and so it tries again with an updated last finalized
sequence number => infinite loop.

With this patch applied, the 32bit value 1 is correctly mapped to
sequence number 1 and all is OK.

The problem was that desc_last_finalized_seq() returned a sequence
number that was higher than the highest record. That was the bug. As
long as desc_last_finalized_seq() always returns a sequence number that
exists or used to exist, the retry loop is fine. And since the tail
record is always in the finalized state, there will always be a
finalized record available.

John
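
For illustration, the broken and the fixed mapping can be reproduced in
a few lines of stand-alone C (invented framing; base 0 stands for the
early-boot base value described above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t base = 0;  /* early boot: base sequence number is 0 */
	uint32_t ulseq = 1; /* lower 32 bits of the stored value */

	/* Old mapping: the unsigned difference assumes ulseq <= base, so
	 * a value above the base wraps to a huge garbage sequence number. */
	uint64_t old = base - (uint32_t)((uint32_t)base - ulseq);

	/* New mapping: the signed difference supports +/- 2^31 around the
	 * base and recovers the correct sequence number. */
	uint64_t fixed = base - (int32_t)((uint32_t)base - ulseq);

	printf("old:   %llx\n", (unsigned long long)old);   /* ffffffff00000001 */
	printf("fixed: %llx\n", (unsigned long long)fixed); /* 1 */
	return 0;
}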

2024-01-15 11:55:56

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v3 04/14] printk: ringbuffer: Do not skip non-finalized records with prb_next_seq()

On 2024-01-12, Petr Mladek <[email protected]> wrote:
>> u64 prb_next_seq(struct printk_ringbuffer *rb)
>> {
>> - struct prb_desc_ring *desc_ring = &rb->desc_ring;
>> - enum desc_state d_state;
>> - unsigned long id;
>> u64 seq;
>>
>> - /* Check if the cached @id still points to a valid @seq. */
>> - id = atomic_long_read(&desc_ring->last_finalized_id);
>> - d_state = desc_read(desc_ring, id, NULL, &seq, NULL);
>> + seq = desc_last_finalized_seq(rb);
>
> desc_last_finalized_seq() does internally:
>
> 	ulseq = atomic_long_read_acquire(&desc_ring->last_finalized_seq
> 					); /* LMM(desc_last_finalized_seq:A) */
>
>
> It guarantees that this CPU would see the data which were seen by the
> CPU which updated desc_ring->last_finalized_seq.
>
> So far so good.
>
> The problem is that I somehow miss the counterpart. Maybe,
> it is not needed. It just looks strange.

As the comments in desc_last_finalized_seq() state: "This pairs with
desc_update_last_finalized:A."

desc_update_last_finalized() successfully reads a record and then uses a
cmpxchg_release() to set the new @last_finalized_seq value (of the
record it just read). That is the counterpart.

>> - if (d_state == desc_finalized || d_state == desc_reusable) {
>> - /*
>> - * Begin searching after the last finalized record.
>> - *
>> - * On 0, the search must begin at 0 because of hack#2
>> - * of the bootstrapping phase it is not known if a
>> - * record at index 0 exists.
>> - */
>> - if (seq != 0)
>> - seq++;
>> - } else {
>> - /*
>> - * The information about the last finalized sequence number
>> - * has gone. It should happen only when there is a flood of
>> - * new messages and the ringbuffer is rapidly recycled.
>> - * Give up and start from the beginning.
>> - */
>> - seq = 0;
>> - }
>> + /*
>> + * Begin searching after the last finalized record.
>> + *
>> + * On 0, the search must begin at 0 because of hack#2
>> + * of the bootstrapping phase it is not known if a
>> + * record at index 0 exists.
>> + */
>> + if (seq != 0)
>> + seq++;
>>
>> /*
>> * The information about the last finalized @seq might be inaccurate.
>
> Below is the code:
>
> 	while (_prb_read_valid(rb, &seq, NULL, NULL))
> 		seq++;
>
> Maybe, the release() should be here to make sure that the CPU which
> would see this "seq" would also see the data.

The acquire is with @last_finalized_seq. So the release must also be
with @last_finalized_seq. The important thing is that the CPU that
updates @last_finalized_seq has actually read the corresponding record
beforehand. That is exactly what desc_update_last_finalized() does.

John

2024-01-15 16:21:49

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros

On Mon 2024-01-15 11:58:39, John Ogness wrote:
> On 2024-01-12, Petr Mladek <[email protected]> wrote:
> >> > Reported-by: Francesco Dolcini <[email protected]>
> >> > Reported-by: kernel test robot <[email protected]>
> >> > Closes: https://lore.kernel.org/oe-lkp/[email protected]
> >> > Reported-by: kernel test robot <[email protected]>
> >> > Closes: https://lore.kernel.org/oe-lkp/[email protected]
> >> > Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> >> > Signed-off-by: John Ogness <[email protected]>
> >>
> >> Great catch! It must have been complicated to debug this.
> >>
> >> Reviewed-by: Petr Mladek <[email protected]>
> >
> > That said, I am a bit nervous that a bug like this might cause
> > workqueue stall and panic() the kernel.
> >
> > At least, this is how I read
> > https://lore.kernel.org/oe-lkp/[email protected]/
>
> Note that the bug is reported for code that mainline will never
> have. This patch "fixes" the problem before it is a problem. From the
> perspective of mainline, the problem never existed. Perhaps it is
> inappropriate to list these with the Closes tag.

I think that it is better to keep the reference. It might help
archaeologists to understand the problem ;-)

> > It looks like it caused some loop and refcount overflow or so.
> > But I might be wrong.
> >
> > I would like to better understand this and check if we could prevent
> > it somehow.
>
> The code in question can be found here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/kernel/printk/printk_ringbuffer.c?id=768c33033e41ad195a9304ebb183afb730d5ae98#n2031
>
> The URL above points to prb_next_reserve_seq(), which is in an infinite
> loop.
>
> desc_last_finalized_seq() is returning a huge garbage value due to the
> 32bit sequence value 1 being incorrectly mapped to 0xffffffff00000001
> (which is 18446744069414584321). When desc_read_finalized_seq() is
> called (line 2082), it obviously returns -EINVAL because there is no
> record with this huge sequence number.
>
> The code interprets -EINVAL to mean that the record has been
> overwritten, and so it tries again with an updated last finalized
> sequence number => infinite loop.
>
> With this patch applied, the 32bit value 1 is correctly mapped to
> sequence number 1 and all is OK.
>
> The problem was that desc_last_finalized_seq() returned a sequence
> number that was higher than the highest record. That was the bug. As
> long as desc_last_finalized_seq() always returns a sequence number that
> exists or used to exist, the retry loop is fine. And since the tail
> record is always in the finalized state, there will always be a
> finalized record available.

Thank you both for the explanation. You are right that this was
kind of special.

You know, the code around reading the messages is getting more
and more complex. Also the u32 -> u64 transition is not 100% safe.
So some sanity checks might be useful.

That said, I do not see any obvious trivial one. Just the following
came to my mind. prb_first_seq() is reliable and we could do:

void next_seq_sanity_check(struct printk_ringbuffer *rb, u64 next_seq)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	u64 first_seq = prb_first_seq(rb);
	u64 desc_count = DESC_COUNT(desc_ring);

	WARN_ON_ONCE(next_seq > first_seq + desc_count);
}

Well, I am not sure if it is worth it. Also the WARN() won't
be printed on consoles when the reading is broken.

Best Regards,
Petr

2024-01-15 17:00:46

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 04/14] printk: ringbuffer: Do not skip non-finalized records with prb_next_seq()

On Mon 2024-01-15 13:01:36, John Ogness wrote:
> On 2024-01-12, Petr Mladek <[email protected]> wrote:
> >> u64 prb_next_seq(struct printk_ringbuffer *rb)
> >> {
> >> - struct prb_desc_ring *desc_ring = &rb->desc_ring;
> >> - enum desc_state d_state;
> >> - unsigned long id;
> >> u64 seq;
> >>
> >> - /* Check if the cached @id still points to a valid @seq. */
> >> - id = atomic_long_read(&desc_ring->last_finalized_id);
> >> - d_state = desc_read(desc_ring, id, NULL, &seq, NULL);
> >> + seq = desc_last_finalized_seq(rb);
> >
> > desc_last_finalized_seq() does internally:
> >
> > 	ulseq = atomic_long_read_acquire(&desc_ring->last_finalized_seq
> > 					); /* LMM(desc_last_finalized_seq:A) */
> >
> >
> > It guarantees that this CPU would see the data which were seen by the
> > CPU which updated desc_ring->last_finalized_seq.
> >
> > So far so good.
> >
> > The problem is that I somehow miss the counterpart. Maybe,
> > it is not needed. It just looks strange.
>
> As the comments in desc_last_finalized_seq() state: "This pairs with
> desc_update_last_finalized:A."
>
> desc_update_last_finalized() successfully reads a record and then uses a
> cmpxchg_release() to set the new @last_finalized_seq value (of the
> record it just read). That is the counterpart.
>
> >> - if (d_state == desc_finalized || d_state == desc_reusable) {
> >> - /*
> >> - * Begin searching after the last finalized record.
> >> - *
> >> - * On 0, the search must begin at 0 because of hack#2
> >> - * of the bootstrapping phase it is not known if a
> >> - * record at index 0 exists.
> >> - */
> >> - if (seq != 0)
> >> - seq++;
> >> - } else {
> >> - /*
> >> - * The information about the last finalized sequence number
> >> - * has gone. It should happen only when there is a flood of
> >> - * new messages and the ringbuffer is rapidly recycled.
> >> - * Give up and start from the beginning.
> >> - */
> >> - seq = 0;
> >> - }
> >> + /*
> >> + * Begin searching after the last finalized record.
> >> + *
> >> + * On 0, the search must begin at 0 because of hack#2
> >> + * of the bootstrapping phase it is not known if a
> >> + * record at index 0 exists.
> >> + */
> >> + if (seq != 0)
> >> + seq++;
> >>
> >> /*
> >> * The information about the last finalized @seq might be inaccurate.
> >
> > Below is the code:
> >
> > 	while (_prb_read_valid(rb, &seq, NULL, NULL))
> > 		seq++;
> >
> > Maybe, the release() should be here to make sure that the CPU which
> > would see this "seq" would also see the data.
>
> The acquire is with @last_finalized_seq. So the release must also be
> with @last_finalized_seq. The important thing is that the CPU that
> updates @last_finalized_seq has actually read the corresponding record
> beforehand. That is exactly what desc_update_last_finalized() does.

I probably did not describe it well. The CPU updating @last_finalized_seq
does the right thing. I was not sure about the CPU which reads
@last_finalized_seq via prb_next_seq().

To make it more clear:

u64 prb_next_seq(struct printk_ringbuffer *rb)
{
	u64 seq;

	seq = desc_last_finalized_seq(rb);
	      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
	      |
	      `-> This includes atomic_long_read_acquire(last_finalized_seq)

	if (seq != 0)
		seq++;

	while (_prb_read_valid(rb, &seq, NULL, NULL))
		seq++;

	return seq;
}

But where is the atomic_long_read_release(last_finalized_seq) in
this code path?

IMHO, the barrier provided by the acquire() is _important_ to make sure
that _prb_read_valid() would see the valid descriptor.

Now, I think that the related read_release(seq) is hidden in:

static int prb_read(struct printk_ringbuffer *rb, u64 seq,
		    struct printk_record *r, unsigned int *line_count)
{
	/* Get a local copy of the correct descriptor (if available). */
	err = desc_read_finalized_seq(desc_ring, id, seq, &desc);

	/* If requested, copy meta data. */
	if (r->info)
		memcpy(r->info, info, sizeof(*(r->info)));

	/* Copy text data. If it fails, this is a data-less record. */
	if (!copy_data(&rb->text_data_ring, &desc.text_blk_lpos, info->text_len,
		       r->text_buf, r->text_buf_size, line_count)) {
		return -ENOENT;
	}

	/* Ensure the record is still finalized and has the same @seq. */
	return desc_read_finalized_seq(desc_ring, id, seq, &desc);
	       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	       |
	       `-> This includes a memory barrier /* LMM(desc_read:A) */
	           which makes sure that the data are read before
	           the desc/data could be reused.
}

I consider this /* LMM(desc_read:A) */ as the counterpart of that
acquire() in prb_next_seq().


Summary:

I saw atomic_long_read_acquire(last_finalized_seq) called from
the prb_next_seq() code path. The barrier looked important to me.
But I saw neither the counterpart nor any comment. I wanted
to understand it because it might be important for reviewing
the following patches which depend on prb_next_seq().

Does it make sense now, please?

Best Regards,
Petr


2024-01-15 17:09:16

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros

On 2024-01-15, Petr Mladek <[email protected]> wrote:
> You know, the code around reading the messages is getting more
> and more complex. Also the u32 -> u64 transition is not 100% safe.
> So some sanity checks might be useful.

This complexity only exists for 32bit systems, but yeah, it is still
important.

> That said, I do not see any obvious trivial one. Just the following
> came to my mind. prb_first_seq() is reliable and we could do:
>
> void next_seq_sanity_check(struct printk_ringbuffer *rb, u64 next_seq)
> {
> 	struct prb_desc_ring *desc_ring = &rb->desc_ring;
> 	u64 first_seq = prb_first_seq(rb);
> 	u64 desc_count = DESC_COUNT(desc_ring);
>
> 	WARN_ON_ONCE(next_seq > first_seq + desc_count);
> }

@seq is allowed to be 2^31 before or after @first_seq. The check would
look more like this:

	WARN_ON_ONCE((rb_first_seq < 0xffffffff80000000 &&
		      seq > rb_first_seq + 0x80000000) ||
		     (rb_first_seq > 0x80000000 &&
		      seq < rb_first_seq - 0x80000000));

> Well, I am not sure if it is worth it. Also the WARN() won't be
> printed on consoles when the reading is broken.

Broken printing is irrelevant. There are plenty of debug methods to get
at the ringbuffer. I am OK with adding this sanity check (again, only
for 32bit, within __ulseq_to_u64seq()).

John

2024-01-30 13:18:35

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 05/14] printk: ringbuffer: Clarify special lpos values

On Thu 2023-12-14 22:47:52, John Ogness wrote:
> For empty line records, no data blocks are created. Instead,
> these valid records are identified by special logical position
> values (in fields of @prb_desc.text_blk_lpos).
>
> Currently the macro NO_LPOS is used for empty line records.
> This name is confusing because it does not imply _why_ there is
> no data block.
>
> Rename NO_LPOS to EMPTY_LINE_LPOS so that it is clear why there
> is no data block.
>
> Also add comments explaining the use of EMPTY_LINE_LPOS as well
> as clarification to the values used to represent data-less
> blocks.
>
> Signed-off-by: John Ogness <[email protected]>

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

PS: Just for the record: I had to pause the review
because I was on sick leave for the last week and a half.


2024-01-30 14:37:02

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 08/14] printk: ringbuffer: Cleanup reader terminology

On Thu 2023-12-14 22:47:55, John Ogness wrote:
> With the lockless ringbuffer, it is allowed that multiple
> CPUs/contexts write simultaneously into the buffer. This creates
> an ambiguity as some writers will finalize sooner.
>
> The documentation for the prb_read functions is not clear as it
> refers to "not yet written" and "no data available". Clarify the
> return values and language to be in terms of the reader: records
> available for reading.
>
> Signed-off-by: John Ogness <[email protected]>

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-01-31 12:07:22

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 09/14] printk: Wait for all reserved records with pr_flush()

On Thu 2023-12-14 22:47:56, John Ogness wrote:
> Currently pr_flush() will only wait for records that were
> available to readers at the time of the call (using
> prb_next_seq()). But there may be more records (non-finalized)
> that have following finalized records. pr_flush() should wait
> for these to print as well. Particularly because any trailing
> finalized records may be the messages that the calling context
> wants to ensure are printed.
>
> Add a new ringbuffer function prb_next_reserve_seq() to return
> the sequence number following the most recently reserved record.
> This guarantees that pr_flush() will wait until all current
> printk() messages (completed or in progress) have been printed.
>
> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -1986,6 +1986,119 @@ u64 prb_first_seq(struct printk_ringbuffer *rb)
> return seq;
> }
>
> +/**
> + * prb_next_reserve_seq() - Get the sequence number after the most recently
> + * reserved record.
> + *
> + * @rb: The ringbuffer to get the sequence number from.
> + *
> + * This is the public function available to readers to see what sequence
> + * number will be assigned to the next reserved record.
> + *
> + * Note that depending on the situation, this value can be equal to or
> + * higher than the sequence number returned by prb_next_seq().
> + *
> + * Context: Any context.
> + * Return: The sequence number that will be assigned to the next record
> + * reserved.
> + */
> +u64 prb_next_reserve_seq(struct printk_ringbuffer *rb)
> +{
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> + unsigned long last_finalized_id;
> + atomic_long_t *state_var;
> + u64 last_finalized_seq;
> + unsigned long head_id;
> + struct prb_desc desc;
> + unsigned long diff;
> + struct prb_desc *d;
> + int err;
> +
> + /*
> + * It may not be possible to read a sequence number for @head_id.
> + * So the ID of @last_finalized_seq is used to calculate what the
> + * sequence number of @head_id will be.
> + */
> +
> +try_again:
> + last_finalized_seq = desc_last_finalized_seq(rb);
> +
> + /*
> + * @head_id is loaded after @last_finalized_seq to ensure that it is
> + * at or beyond @last_finalized_seq.
> + *
> + * Memory barrier involvement:
> + *
> + * If desc_last_finalized_seq:A reads from
> + * desc_update_last_finalized:A, then
> + * prb_next_reserve_seq:A reads from desc_reserve:D.
> + *
> + * Relies on:
> + *
> + * RELEASE from desc_reserve:D to desc_update_last_finalized:A
> + * matching
> + * ACQUIRE from desc_last_finalized_seq:A to prb_next_reserve_seq:A
> + *
> + * Note: desc_reserve:D and desc_update_last_finalized:A can be
> + * different CPUs. However, the desc_update_last_finalized:A CPU
> + * (which performs the release) must have previously seen
> + * desc_read:C, which implies desc_reserve:D can be seen.

Huh, it must have been a huge effort to pull all the pieces together.
It took me a long time to check and understand it. And it looks right
to me.

The trickiest part is desc_reserve:D. It is a super complicated
barrier which guarantees many things. But I think that there are
many more write barriers before the allocated descriptor reaches
the finalized state. So it should be easier to prove the correctness
here using other, simpler barriers.

To make it clear: I am all for keeping the above precise description
as is. I am just thinking about adding one more human-friendly note
which might help future generations understand the correctness more
easily. Something like:

* Note: The above description might be hard to follow because
* desc_reserve:D is a multi-purpose barrier. But this is
* just the first barrier which makes sure that @head_id
* is updated before the reserved descriptor gets finalized
* and updates @last_finalized_seq.
*
* There are more release barriers in between, especially
* desc_reserve:F and desc_update_last_finalized:A. These also make
* sure that the desc_last_finalized_seq:A acquire barrier allows
* reading a @head_id related to @last_finalized_seq or newer.

In fact, the desc_update_last_finalized:A release and
desc_last_finalized_seq:A acquire barriers are enough to prove
that we read here a @head_id related to @last_finalized_seq or newer.

Or maybe it is exactly what you described and my "human friendly"
description is too naive. I am still not completely familiar
with the "Memory barrier involvement:" and "Relies on:"
format.

> + */
> + head_id = atomic_long_read(&desc_ring->head_id); /* LMM(prb_next_reserve_seq:A) */
> +
> + d = to_desc(desc_ring, last_finalized_seq);
> + state_var = &d->state_var;
> +
> + /* Extract the ID, used to specify the descriptor to read. */
> + last_finalized_id = DESC_ID(atomic_long_read(state_var));
> +
> + /* Ensure @last_finalized_id is correct. */
> + err = desc_read_finalized_seq(desc_ring, last_finalized_id, last_finalized_seq, &desc);
> +
> + if (err == -EINVAL) {
> + if (last_finalized_seq == 0) {
> + /*
> + * @last_finalized_seq still contains its initial
> + * value. Probably no record has been finalized yet.
> + * This means the ringbuffer is not yet full and the
> + * @head_id value can be used directly (subtracting
> + * off the id value corresponding to seq=0).
> + */
> +
> + /*
> + * Because of hack#2 of the bootstrapping phase, the
> + * @head_id initial value must be handled separately.
> + */
> + if (head_id == DESC0_ID(desc_ring->count_bits))
> + return 0;
> +
> + /*
> + * The @head_id is initialized such that the first
> + * increment will yield the first record (seq=0).
> + * Therefore use the initial value +1 as the base to
> + * subtract from @head_id.
> + */
> + last_finalized_id = DESC0_ID(desc_ring->count_bits) + 1;

It took me a long time to understand the above code and comments. I
wonder if the following looks easier, even to you:

	if (err == -EINVAL) {
		if (last_finalized_seq == 0) {
			/*
			 * No record has been finalized or even reserved yet.
			 *
			 * The @head_id is initialized such that the first
			 * increment will yield the first record (seq=0).
			 * Handle it separately to avoid a negative @diff below.
			 */
			if (head_id == DESC0_ID(desc_ring->count_bits))
				return 0;

			/*
			 * One or more descriptors are already reserved. Use
			 * the descriptor ID of the first one (@seq=0) for
			 * the @diff below.
			 */
			last_finalized_id = DESC0_ID(desc_ring->count_bits) + 1;


> + } else {
> + /* Record must have been overwritten. Try again. */
> + goto try_again;
> + }
> + }
> +
> + /*
> + * @diff is the number of records beyond the last record available
> + * to readers.
> + */

This is kind of obvious from the code. Also it is not true when the
first record has not been finalized yet. The following might
be more useful:

/* Diff of known descriptor IDs to compute related sequence numbers. */

> + diff = head_id - last_finalized_id;
> +
> + /*
> + * @head_id points to the most recently reserved record, but this
> + * function returns the sequence number that will be assigned to the
> + * next (not yet reserved) record. Thus +1 is needed.
> + */
> + return (last_finalized_seq + diff + 1);
> +}
> +
> /*
> * Non-blocking read of a record.
> *

BTW: It came to my mind whether the logic might be more straightforward
if we store desc_ring->next_finalized_seq instead of @last_finalized_seq.

Even the initial value would be valid. Also the value is
used only in prb_next_reserve_seq() and prb_next_seq() where
we need to start with this value anyway.

Note that prb_next_seq() actually does not need to try _prb_read_valid()
for @last_finalized_seq. It should always be valid unless
the record has already been overwritten, in which case
there should be a newer readable record.

Well, it is just an idea. I am not completely sure that it
would make the logic easier to follow. The current code is perfectly fine.


The patch is correct. Feel free to add:

Reviewed-by: Petr Mladek <[email protected]>

Well, it would be nice to update the comments if you liked the proposals.

Best Regards,
Petr
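
As a toy walk-through of the ID arithmetic in prb_next_reserve_seq()
discussed above (all values invented purely for illustration):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Invented example values. */
	uint64_t last_finalized_seq = 10;      /* newest finalized record */
	unsigned long last_finalized_id = 110; /* its descriptor ID */
	unsigned long head_id = 113;           /* 3 records reserved after it */

	/* Descriptor IDs increase by one per reservation, so the ID
	 * difference equals the sequence number difference. */
	unsigned long diff = head_id - last_finalized_id;

	/* Records 11, 12 and 13 are reserved (not necessarily finalized);
	 * the next reservation will be assigned sequence number 14. */
	printf("%llu\n", (unsigned long long)(last_finalized_seq + diff + 1));
	return 0;
}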

2024-02-01 17:35:55

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 10/14] printk: ringbuffer: Skip non-finalized records in panic

On Thu 2023-12-14 22:47:57, John Ogness wrote:
> Normally a reader will stop once reaching a non-finalized
> record. However, when a panic happens, writers from other CPUs
> (or an interrupted context on the panic CPU) may have been
> writing a record and were unable to finalize it. The panic CPU
> will reserve/commit/finalize its panic records, but these will
> be located after the non-finalized records. This results in
> panic() not flushing the panic messages.
>
> Extend _prb_read_valid() to skip over non-finalized records if
> on the panic CPU.
>
> Fixes: 896fbe20b4e2 ("printk: use the lockless ringbuffer")
> Signed-off-by: John Ogness <[email protected]>

Makes sense.

The most interesting information from other CPUs is backtraces. But
they should be finalized before reaching panic().
trigger_all_cpu_backtrace() waits until all CPUs print the backtrace.

In fact, it might even be useful, because other CPUs might just spam
the ring buffer with the mess caused by the panic() situation.
Messages which caused the panic() situation should be in the ring
buffer before panic() gets called.

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-02-01 18:00:55

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 11/14] printk: ringbuffer: Consider committed as finalized in panic

On Thu 2023-12-14 22:47:58, John Ogness wrote:
> A descriptor in the committed state means the record does not yet
> exist for the reader. However, for the panic CPU, committed
> records should be handled as finalized records since they contain
> message data in a consistent state and may contain additional
> hints as to the cause of the panic.
>
> Add an exception for records in the commit state to not be
> considered non-existing when reading from the panic CPU.

IMHO, it is important to describe the effects of this change in more
detail. And I think that it actually does not work as expected,
see below.

> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -1875,16 +1877,25 @@ static int desc_read_finalized_seq(struct prb_desc_ring *desc_ring,
>
> /*
> * An unexpected @id (desc_miss) or @seq mismatch means the record
> - * does not exist. A descriptor in the reserved or committed state
> - * means the record does not yet exist for the reader.
> + * does not exist. A descriptor in the reserved state means the
> + * record does not yet exist for the reader.
> */
> if (d_state == desc_miss ||
> d_state == desc_reserved ||
> - d_state == desc_committed ||
> s != seq) {
> return -EINVAL;
> }
>
> + /*
> + * A descriptor in the committed state means the record does not yet
> + * exist for the reader. However, for the panic CPU, committed
> + * records are also handled as finalized records since they contain
> + * message data in a consistent state and may contain additional
> + * hints as to the cause of the panic.
> + */
> + if (d_state == desc_committed && !this_cpu_in_panic())
> + return -EINVAL;

If I get it correctly, this causes the panic CPU to see a
non-finalized continuous line as finalized. And it would flush
the existing piece to the consoles.

The problem is that pr_cont() would append the message to
the same record. But the consoles would already be waiting
for the next record. They would miss the appended pieces.

It might be fixed by finalizing the record before we read
the non-finalized piece. It is doable but it would add
another lock-less scenario. I am not sure if it would work with
the existing barriers or how complicated it would be to
prove the correctness.

Now, in practice, it would allow flushing pieces of continuous lines
printed on the panic CPU immediately. It would not affect messages printed
by other CPUs because of a mix of reasons:

1. The current code tries hard to move the console owner to the panic
CPU. This allows the panic CPU to flush the messages in a safe way
even when other CPUs are stopped or somehow locked.

It means that the consoles are ideally flushed when the panic CPU
adds a message.

2. Only the last record might be in a committed state. Any older
record is automatically finalized when it reaches the committed
state.

3. The previous patch causes that messages from non-panic CPUs
are skipped when they are not committed or finalized, see
https://lore.kernel.org/all/[email protected]


Now, this patch would really help only when the panic CPU dies in
the middle of a continuous message or when the final message does
not have a newline character.

Honestly, I think that it is not worth the effort. It would add another
complexity to the memory barriers. The real effect is not easy
to understand. And the benefit is minimal from my POV.


Alternative solutions:

1. Flush the non-finalized continuous line only before reaching
the infinite loop in panic(). printk() gets disabled
at this point anyway.

It would solve only this one scenario, though.


2. Always finalize messages in vprintk_store() when in_panic().

It would have the desired effect in all panic() situations.
And it would not add another complexity to the lock-less
ringbuffer code.


Best Regards,
Petr

2024-02-02 09:35:41

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 13/14] printk: Avoid non-panic CPUs writing to ringbuffer

On Thu 2023-12-14 22:48:00, John Ogness wrote:
> Commit 13fb0f74d702 ("printk: Avoid livelock with heavy printk
> during panic") introduced a mechanism to silence non-panic CPUs
> if too many messages are being dropped. Aside from trying to
> workaround the livelock bugs of legacy consoles, it was also
> intended to avoid losing panic messages. However, if non-panic
> CPUs are writing to the ringbuffer, then reacting to dropped
> messages is too late.
>
> To avoid losing panic CPU messages, silence non-panic CPUs
> immediately on panic.
>
> Signed-off-by: John Ogness <[email protected]>

I am slightly nervous about this change because it looks
too aggressive ;-)

But it makes perfect sense. And it nicely complements the 10th
patch, see
https://lore.kernel.org/r/[email protected]
The 10th patch allows skipping messages in the reserved state. It might
cause skipping random messages from other CPUs. So it really
looks better to skip/ignore them completely.

It would be nice to mention the relation to the 10th patch
in the commit message. Something like:

<proposal>
Another motivation is that non-finalized messages might already be
skipped in panic(). In other words, random messages from non-panic
CPUs might already get lost. It is better to ignore them all to avoid
confusion.
</proposal>

With the updated commit message:

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-02-02 09:39:02

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 00/14] fix console flushing

On Thu 2023-12-14 22:47:47, John Ogness wrote:
> Hi,
>
> While testing various flushing scenarios, I stumbled on a few
> issues that cause console flushing to fail. While at LPC2023 in
> Richmond, I sat down with Petr Mladek and we reviewed the
> v2 [0] series. This v3 series is the result of that offline
> discussion.
>
> This series addresses the following issues:
>
> 1. The prb_next_seq() optimization caused inconsistent return
> values. Fix prb_next_seq() to the originally intended
> behavior but keep an optimization.
>
> 2. pr_flush() might not wait until the most recently stored
> printk() message if non-finalized records precede it. Fix
> pr_flush() to wait for all records to print that are at
> least reserved at the time of the call.
>
> 3. In panic, the panic messages will not print if non-finalized
> records precede them. Add a special condition so that
> readers on the panic CPU will drop records that are not in
> a consistent state.
>
> 4. It is possible (and easy to reproduce) a scenario where the
> console on the panic CPU hands over to a waiter of a stopped
> CPU. Do not use the handover feature in panic.
>
> 5. If messages are being dropped during panic, non-panic CPUs
> are silenced. But by then it is already too late and most
> likely the panic messages have been dropped. Change the
> non-panic CPU silencing logic to _immediately_ silence
> non-panic CPUs during panic. This also leads to clean panic
> output when many CPUs are blasting the kernel log.
>
> 6. If a panic occurs in a context where printk() calls defer
> printing (NMI or printk_safe section), the printing of the
> final panic messages rely on irq_work. If that mechanism is
> not available, the final panic messages are not seen (even
> though they are finalized in the ringbuffer). Add one last
> explicit flush after all printk() calls are finished to
> ensure all available messages in the kernel log are printed.

I hope that I have reviewed all the patches. Some already had my tag.
Sigh, it took much longer than I had hoped because I was
overloaded and on sick leave.

Anyway, please let me know if I missed a patch or a reply.

Best Regards,
Petr

2024-02-02 10:05:21

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 14/14] panic: Flush kernel log buffer at the end

On Thu 2023-12-14 22:48:01, John Ogness wrote:
> If the kernel crashes in a context where printk() calls always
> defer printing (such as in NMI or inside a printk_safe section)
> then the final panic messages will be deferred to irq_work. But
> if irq_work is not available, the messages will not get printed
> unless explicitly flushed. The result is that the final
> "end Kernel panic" banner does not get printed.
>
> Add one final flush after the last printk() call to make sure
> the final panic messages make it out as well.
>
> Signed-off-by: John Ogness <[email protected]>

Makes sense.
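
For reference, the idea amounts to one extra flush at the very end
of panic(), roughly like this (a sketch, not the exact diff):

	/* kernel/panic.c, after the final messages are printed: */
	console_flush_on_panic(CONSOLE_FLUSH_PENDING);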

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2024-02-05 11:53:17

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v3 04/14] printk: ringbuffer: Do not skip non-finalized records with prb_next_seq()

On 2024-01-15, Petr Mladek <[email protected]> wrote:
>> The acquire is with @last_finalized_seq. So the release must also be
>> with @last_finalized_seq. The important thing is that the CPU that
>> updates @last_finalized_seq has actually read the corresponding
>> record beforehand. That is exactly what desc_update_last_finalized()
>> does.
>
> I probably did not describe it well. The CPU updating
> @last_finalized_seq does the right thing. I was not sure about the CPU
> which reads @last_finalized_seq via prb_next_seq().
>
> To make it more clear:
>
> u64 prb_next_seq(struct printk_ringbuffer *rb)
> {
>         u64 seq;
>
>         seq = desc_last_finalized_seq(rb);
>               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                |
>                `-> This includes atomic_long_read_acquire(last_finalized_seq)
>
>
>         if (seq != 0)
>                 seq++;
>
>         while (_prb_read_valid(rb, &seq, NULL, NULL))
>                 seq++;
>
>         return seq;
> }
>
> But where is the atomic_long_read_release(last_finalized_seq) in
> this code path?

read_release? The counterpart of this load_acquire is a
store_release. For example:

CPU0                   CPU1
====                   ====
load(varA)
store_release(varB)    load_acquire(varB)
                       load(varA)

If CPU1 reads the value in varB that CPU0 stored, then it is guaranteed
that CPU1 will read the value (or a later value) in varA that CPU0 read.
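
Written with kernel primitives, the same pattern looks like this
(an illustrative sketch; var_a/var_b stand in for varA/varB):

	static unsigned long var_a;
	static unsigned long var_b;

	void cpu0(void)
	{
		unsigned long a;

		a = READ_ONCE(var_a);		/* load(varA) */
		smp_store_release(&var_b, a);	/* store_release(varB) */
	}

	void cpu1(void)
	{
		unsigned long b;
		unsigned long a;

		b = smp_load_acquire(&var_b);	/* load_acquire(varB) */
		a = READ_ONCE(var_a);		/* sees CPU0's value or newer */
	}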

Translating the above example to this particular patch, we have:

CPU0: desc_update_last_finalized()        CPU1: prb_next_seq()
====                                      ====
_prb_read_valid(seq)
cmpxchg_release(last_finalized_seq,seq)   seq=read_acquire(last_finalized_seq)
                                          _prb_read_valid(seq)

> IMHO, the barrier provided by the acquire() is _important_ to make
> sure that _prb_read_valid() would see the valid descriptor.

Correct.

> Now, I think that the related read_release(seq) is hidden in:
>
> static int prb_read(struct printk_ringbuffer *rb, u64 seq,
>                     struct printk_record *r, unsigned int *line_count)
> {
>         /* Get a local copy of the correct descriptor (if available). */
>         err = desc_read_finalized_seq(desc_ring, id, seq, &desc);
>
>         /* If requested, copy meta data. */
>         if (r->info)
>                 memcpy(r->info, info, sizeof(*(r->info)));
>
>         /* Copy text data. If it fails, this is a data-less record. */
>         if (!copy_data(&rb->text_data_ring, &desc.text_blk_lpos, info->text_len,
>                        r->text_buf, r->text_buf_size, line_count)) {
>                 return -ENOENT;
>         }
>
>         /* Ensure the record is still finalized and has the same @seq. */
>         return desc_read_finalized_seq(desc_ring, id, seq, &desc);
>                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                 |
>                 `-> This includes a memory barrier /* LMM(desc_read:A) */
>                     which makes sure that the data are read before
>                     the desc/data could be reused.
> }
>
> I consider this /* LMM(desc_read:A) */ as a counter part for that
> acquire() in prb_next_seq().

desc_read:A is not a memory barrier. It only marks the load of the
descriptor state. This is a significant load because prb_next_seq() must
see at least the descriptor state that desc_update_last_finalized() saw.

The memory barrier comments in desc_update_last_finalized() state:

* If desc_last_finalized_seq:A reads from
* desc_update_last_finalized:A, then desc_read:A reads from
* _prb_commit:B.

This is referring to a slightly different situation than the example I
used above because it is referencing where the descriptor state was
stored (_prb_commit:B). The same general picture is valid:

CPU0                           CPU1
====                           ====
_prb_commit:B
desc_update_last_finalized:A   desc_last_finalized_seq:A
                               desc_read:A

desc_read:A is loading the descriptor state that _prb_commit:B stored.

The extra note in the comment clarifies that _prb_commit:B could also be
denoted as desc_read:A because desc_update_last_finalized() performs a
read of (i.e. must have seen) _prb_commit:B.

* Note: _prb_commit:B and desc_update_last_finalized:A can be
* different CPUs. However, the desc_update_last_finalized:A
* CPU (which performs the release) must have previously seen
* _prb_commit:B.

Normally the CPU committing the record will also update
last_finalized_seq. But it is possible that another CPU updates
last_finalized_seq before the committing CPU because it already sees the
finalized record. In that case the complete (maximally complex) picture
looks like this.

CPU0            CPU1                           CPU2
====            ====                           ====
_prb_commit:B   desc_read:A
                desc_update_last_finalized:A   desc_last_finalized_seq:A
                                               desc_read:A

Any memory barriers in _prb_commit() or desc_read() are irrelevant for
guaranteeing that a CPU reading a sequence value from
desc_last_finalized_seq() will always be able to read that record.
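
The overall guarantee can also be expressed as an LKMM litmus test
(a sketch; the mapping of x/y/z to record data, descriptor state and
@last_finalized_seq is mine), which herd7 should report as forbidden
(Never) for the "exists" clause:

	C printk-release-chain

	{}

	P0(int *x, int *y)
	{
		WRITE_ONCE(*x, 1);		/* write record data */
		smp_store_release(y, 1);	/* _prb_commit:B */
	}

	P1(int *y, int *z)
	{
		int r0;

		r0 = smp_load_acquire(y);	/* desc_read:A */
		smp_store_release(z, r0);	/* desc_update_last_finalized:A */
	}

	P2(int *z, int *x)
	{
		int r1;
		int r2;

		r1 = smp_load_acquire(z);	/* desc_last_finalized_seq:A */
		r2 = READ_ONCE(*x);		/* read record data */
	}

	exists (1:r0=1 /\ 2:r1=1 /\ 2:r2=0)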

> Summary:
>
> I saw atomic_long_read_acquire(last_finalized_seq) called from
> prb_next_seq() code path. The barrier looked important to me.
> But I saw neither the counter-part nor any comment. I wanted
> to understand it because it might be important for reviewing
> following patches which depend on prb_next_seq().

desc_update_last_finalized:A is the counterpart to
desc_last_finalized_seq:A. IMHO there are plenty of comments that are
formally documenting these memory barriers. Including the new entry in
the summary of all memory barriers:

* desc_update_last_finalized:A / desc_last_finalized_seq:A
* store finalized record, then set new highest finalized sequence number

John

2024-02-05 13:33:22

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v3 09/14] printk: Wait for all reserved records with pr_flush()

On 2024-01-31, Petr Mladek <[email protected]> wrote:
>> + last_finalized_seq = desc_last_finalized_seq(rb);
>> +
>> + /*
>> + * @head_id is loaded after @last_finalized_seq to ensure that it is
>> + * at or beyond @last_finalized_seq.
>> + *
>> + * Memory barrier involvement:
>> + *
>> + * If desc_last_finalized_seq:A reads from
>> + * desc_update_last_finalized:A, then
>> + * prb_next_reserve_seq:A reads from desc_reserve:D.
>> + *
>> + * Relies on:
>> + *
>> + * RELEASE from desc_reserve:D to desc_update_last_finalized:A
>> + * matching
>> + * ACQUIRE from desc_last_finalized_seq:A to prb_next_reserve_seq:A
>> + *
>> + * Note: desc_reserve:D and desc_update_last_finalized:A can be
>> + * different CPUs. However, the desc_update_last_finalized:A CPU
>> + * (which performs the release) must have previously seen
>> + * desc_read:C, which implies desc_reserve:D can be seen.
>
> The most tricky part is desc_reserve:D. It is a super complicated
> barrier which guarantees many things. But I think that there are
> many more write barriers before the allocated descriptor reaches
> finalized state. So that it should be easier to prove the correctness
> here by other easier barriers.

Yes, desc_reserve:D provides memory barriers for several orderings. But
it is _not_ providing a memory barrier for this ordering. It only marks
where @head_id is stored.

The only memory barriers involved here are desc_update_last_finalized:A
and its counterpart desc_last_finalized_seq:A.

CPU0                                CPU1
====                                ====
store(head_id)
store_release(last_finalized_seq)   load_acquire(last_finalized_seq)
                                    load(head_id)

> To make it clear. I am all for keeping the above precise description
> as is. I just think about adding one more human friendly note which
> might help future generations to understand the correctness in an easier
> way. Something like:
>
> * Note: The above description might be hard to follow because
> * desc_reserve:D is a multi-purpose barrier. But this is
> * just the first barrier which makes sure that @head_id
> * is updated before the reserved descriptor gets finalized
> * and updates @last_finalized_seq.
> *
> * There are more release barriers in between, especially,
> * desc_reserve:F and desc_update_last_finalized:A. Also these make
> * sure that the desc_last_finalized_seq:A acquire barrier allows
> * to read @head_id related to @last_finalized_seq or newer.

Non-global memory barriers must operate on the same memory. In this
case, the acquire/release memory barriers are operating on
@last_finalized_seq. The other ordered memory load in this situation is
for @head_id, so it makes sense to specify the store of @head_id as the
start of the release block.

> In fact, the desc_update_last_finalized:A release and
> desc_last_finalized_seq:A acquire barriers are enough to prove
> that we read here @head_id related to @last_finalized_seq or newer.

Yes, which is why they are listed here. ;-)

> Or maybe it is exactly what you described and my "human friendly"
> description is too naive. I am still not completely familiar
> with the "Memory barrier involvement:" and "Relies on:"
> format.

Yes, the format takes some getting used to. I thank Andrea Parri for
helping me to understand memory barriers and contributing ideas to
formalize the documentation.

>> +	if (err == -EINVAL) {
>> +		if (last_finalized_seq == 0) {
>> +			/*
>> +			 * @last_finalized_seq still contains its initial
>> +			 * value. Probably no record has been finalized yet.
>> +			 * This means the ringbuffer is not yet full and the
>> +			 * @head_id value can be used directly (subtracting
>> +			 * off the id value corresponding to seq=0).
>> +			 */
>> +
>> +			/*
>> +			 * Because of hack#2 of the bootstrapping phase, the
>> +			 * @head_id initial value must be handled separately.
>> +			 */
>> +			if (head_id == DESC0_ID(desc_ring->count_bits))
>> +				return 0;
>> +
>> +			/*
>> +			 * The @head_id is initialized such that the first
>> +			 * increment will yield the first record (seq=0).
>> +			 * Therefore use the initial value +1 as the base to
>> +			 * subtract from @head_id.
>> +			 */
>> +			last_finalized_id = DESC0_ID(desc_ring->count_bits) + 1;
>
> It took me a long time to understand the above code and comments. I
> wonder if the following looks easier even for you:
>
> 	if (err == -EINVAL) {
> 		if (last_finalized_seq == 0) {
> 			/*
> 			 * No record has been finalized or even reserved yet.
> 			 *
> 			 * The @head_id is initialized such that the first
> 			 * increment will yield the first record (seq=0).
> 			 * Handle it separately to avoid a negative @diff below.
> 			 */
> 			if (head_id == DESC0_ID(desc_ring->count_bits))
> 				return 0;
>
> 			/* One or more descriptors are already reserved. Use
> 			 * the descriptor ID of the first one (@seq=0) for
> 			 * the @diff below.
> 			 */
> 			last_finalized_id = DESC0_ID(desc_ring->count_bits) + 1;

I will use your comments for the next version.

>> + /*
>> + * @diff is the number of records beyond the last record available
>> + * to readers.
>> + */
>
> This is kind of obvious from the code. Also it is not true when the
> first record has not been finalized yet. The following might
> be more useful:
>
> /* Diff of known descriptor IDs to compute related sequence numbers. */
>
>> + diff = head_id - last_finalized_id;

I will use your comments for the next version.

> BTW: It came to my mind whether the logic might be more
> straightforward if we store desc_ring->next_finalized_seq instead
> of @last_finalized_seq.

I thought about this as well. But I felt like the meaning was a bit
confusing. Also @head_id points to the latest reserved descriptor, so it
makes the code a bit easier to follow when creating a diff based on the
latest finalized descriptor.

> Even the initial value would be valid. Also the value is
> used only in prb_next_reserve_seq() and prb_next_seq() where
> we need to start with this value anyway.

Agreed. I just didn't like how the code looked, even though there were
fewer +1 operations. It was harder to follow.

> Note that prb_next_seq() actually does not need to try _prb_read_valid()
> for @last_finalized_seq. It should always be valid unless
> the record has been already overwritten. In which case,
> there should be a newer readable record.

Correct. The current code increments before reading.

> Reviewed-by: Petr Mladek <[email protected]>
>
> Well, it would be nice to update the comments if you liked the
> proposals.

Thanks. I will use your comments.

John

2024-02-05 14:44:25

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH printk v3 11/14] printk: ringbuffer: Consider committed as finalized in panic

On 2024-02-01, Petr Mladek <[email protected]> wrote:
> On Thu 2023-12-14 22:47:58, John Ogness wrote:
>> A descriptor in the committed state means the record does not yet
>> exist for the reader. However, for the panic CPU, committed
>> records should be handled as finalized records since they contain
>> message data in a consistent state and may contain additional
>> hints as to the cause of the panic.
>>
>> Add an exception for records in the commit state to not be
>> considered non-existing when reading from the panic CPU.
>
> IMHO, it is important to describe the effects of this change in more
> detail. And I think that it actually does not work as expected,
> see below.

I reviewed my notes from our meeting in Richmond. We had agreed that
this feature should not apply to the latest message. That would change
the commit message to be as follows:

printk: ringbuffer: Consider committed as finalized in panic

A descriptor in the committed state means the record does not yet
exist for the reader. However, for the panic CPU, committed
records should be handled as finalized records since they contain
message data in a consistent state and may contain additional
hints as to the cause of the panic.

The only exception is the last record. The panic CPU may be
using LOG_CONT and the individual pieces should not be printed
separately.

Add a special-case check for records in the commit state to not
be considered non-existing when reading from the panic CPU,
provided it is not the last record.

>> --- a/kernel/printk/printk_ringbuffer.c
>> +++ b/kernel/printk/printk_ringbuffer.c
>> @@ -1875,16 +1877,25 @@ static int desc_read_finalized_seq(struct prb_desc_ring *desc_ring,
>>
>>  	/*
>>  	 * An unexpected @id (desc_miss) or @seq mismatch means the record
>> -	 * does not exist. A descriptor in the reserved or committed state
>> -	 * means the record does not yet exist for the reader.
>> +	 * does not exist. A descriptor in the reserved state means the
>> +	 * record does not yet exist for the reader.
>>  	 */
>>  	if (d_state == desc_miss ||
>>  	    d_state == desc_reserved ||
>> -	    d_state == desc_committed ||
>>  	    s != seq) {
>>  		return -EINVAL;
>>  	}
>>
>> +	/*
>> +	 * A descriptor in the committed state means the record does not yet
>> +	 * exist for the reader. However, for the panic CPU, committed
>> +	 * records are also handled as finalized records since they contain
>> +	 * message data in a consistent state and may contain additional
>> +	 * hints as to the cause of the panic.
>> +	 */
>> +	if (d_state == desc_committed && !this_cpu_in_panic())
>> +		return -EINVAL;

And this code would change to:

+	/*
+	 * A descriptor in the committed state means the record does not yet
+	 * exist for the reader. However, for the panic CPU, committed
+	 * records are also handled as finalized records since they contain
+	 * message data in a consistent state and may contain additional
+	 * hints as to the cause of the panic. The only exception is the
+	 * last record, which may still be appended by the panic CPU and so
+	 * is not available to the panic CPU for reading.
+	 */
+	if (d_state == desc_committed &&
+	    (!this_cpu_in_panic() || id == atomic_long_read(&desc_ring->head_id))) {
+		return -EINVAL;
+	}

> If I get it correctly, this causes that panic CPU would see a
> non-finalized continuous line as finalized. And it would flush
> the existing piece to consoles.
>
> The problem is that pr_cont() would append the message into
> the same record. But the consoles would already wait
> for the next record. They would miss the appended pieces.

Exactly. That is why we said that the last message would not be
available. Maybe this new version is acceptable.
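
For illustration, the LOG_CONT pattern at issue looks something like
this (a sketch; whether a piece actually reopens the previous record
depends on runtime conditions):

	pr_info("step 1 done");		/* committed, may still be appended */
	pr_cont(", step 2 done");	/* reopens and appends to the record */
	pr_cont("\n");			/* the newline ends the continuation */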

> Honestly, I think that it is not worth the effort. It would add
> another complexity to the memory barriers. The real effect is not easy
> to understand. And the benefit is minimal from my POV.

I am OK with dropping this patch from the series. It is questionable how
valuable a LOG_CONT piece from a non-panic CPU is anyway. And if the
non-panic CPU managed to reopen the record, it would be skipped anyway.

I will drop this patch unless you want to keep the new version.

John

2024-02-06 17:28:09

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 04/14] printk: ringbuffer: Do not skip non-finalized records with prb_next_seq()

On Mon 2024-02-05 12:39:30, John Ogness wrote:
> On 2024-01-15, Petr Mladek <[email protected]> wrote:
> >> The acquire is with @last_finalized_seq. So the release must also be
> >> with @last_finalized_seq. The important thing is that the CPU that
> >> updates @last_finalized_seq has actually read the corresponding
> >> record beforehand. That is exactly what desc_update_last_finalized()
> >> does.
> >
> > I probably did not describe it well. The CPU updating
> > @last_finalized_seq does the right thing. I was not sure about the CPU
> > which reads @last_finalized_seq via prb_next_seq().
> >
> > To make it more clear:
> >
> > u64 prb_next_seq(struct printk_ringbuffer *rb)
> > {
> >         u64 seq;
> >
> >         seq = desc_last_finalized_seq(rb);
> >               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >                |
> >                `-> This includes atomic_long_read_acquire(last_finalized_seq)
> >
> >
> >         if (seq != 0)
> >                 seq++;
> >
> >         while (_prb_read_valid(rb, &seq, NULL, NULL))
> >                 seq++;
> >
> >         return seq;
> > }
> >
> > But where is the atomic_long_read_release(last_finalized_seq) in
> > this code path?
>
> read_release? The counterpart of this load_acquire is a
> store_release. For example:
>
> CPU0                   CPU1
> ====                   ====
> load(varA)
> store_release(varB)    load_acquire(varB)
>                        load(varA)
>
> If CPU1 reads the value in varB that CPU0 stored, then it is guaranteed
> that CPU1 will read the value (or a later value) in varA that CPU0 read.
>
> Translating the above example to this particular patch, we have:
>
> CPU0: desc_update_last_finalized()        CPU1: prb_next_seq()
> ====                                      ====
> _prb_read_valid(seq)
> cmpxchg_release(last_finalized_seq,seq)   seq=read_acquire(last_finalized_seq)
>                                           _prb_read_valid(seq)

I think that I was confused because I had acquire/release mentally
connected with lock/unlock, where the lock/unlock surrounds some
critical code section. I think that it is actually the most common
use case.

Our usage is not typical from my POV. There are two aspects:

1. They do not surround a critical section, at least not
an obvious one.

2. In my mental model, this patch uses the release
before the acquire. When I think about this code,
I first imagine the write path (using release)
and then the reader (using acquire).

I think that this code (mis)uses the acquire/release
semantics just for optimized memory barriers.

It might be worth a note that this is not a typical acquire/release
scenario.
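
A short illustration of the two usages (sketch only):

	/* Typical: acquire/release bracket a critical section. */
	mutex_lock(&lock);			/* acquire */
	shared_counter++;
	mutex_unlock(&lock);			/* release */

	/* Here: message passing, publishing data via another variable. */
	WRITE_ONCE(data, val);			/* writer ... */
	smp_store_release(&seq, new_seq);	/* ... then publish */

	s = smp_load_acquire(&seq);		/* reader sees @seq ... */
	d = READ_ONCE(data);			/* ... and matching @data */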

> > IMHO, the barrier provided by the acquire() is _important_ to make
> > sure that _prb_read_valid() would see the valid descriptor.
>
> Correct.
>
> > Now, I think that the related read_release(seq) is hidden in:
> >
> > static int prb_read(struct printk_ringbuffer *rb, u64 seq,
> >                     struct printk_record *r, unsigned int *line_count)
> > {
> >         /* Get a local copy of the correct descriptor (if available). */
> >         err = desc_read_finalized_seq(desc_ring, id, seq, &desc);
> >
> >         /* If requested, copy meta data. */
> >         if (r->info)
> >                 memcpy(r->info, info, sizeof(*(r->info)));
> >
> >         /* Copy text data. If it fails, this is a data-less record. */
> >         if (!copy_data(&rb->text_data_ring, &desc.text_blk_lpos, info->text_len,
> >                        r->text_buf, r->text_buf_size, line_count)) {
> >                 return -ENOENT;
> >         }
> >
> >         /* Ensure the record is still finalized and has the same @seq. */
> >         return desc_read_finalized_seq(desc_ring, id, seq, &desc);
> >                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >                 |
> >                 `-> This includes a memory barrier /* LMM(desc_read:A) */
> >                     which makes sure that the data are read before
> >                     the desc/data could be reused.
> > }
> >
> > I consider this /* LMM(desc_read:A) */ as a counter part for that
> > acquire() in prb_next_seq().
>
> desc_read:A is not a memory barrier. It only marks the load of the
> descriptor state.

I see. The real read barriers are desc_read:B and desc_read:D.

> This is a significant load because prb_next_seq() must
> see at least the descriptor state that desc_update_last_finalized() saw.
>
> The memory barrier comments in desc_update_last_finalized() state:
>
> * If desc_last_finalized_seq:A reads from
> * desc_update_last_finalized:A, then desc_read:A reads from
> * _prb_commit:B.
>
> This is referring to a slightly different situation than the example I
> used above because it is referencing where the descriptor state was
> stored (_prb_commit:B). The same general picture is valid:
>
> CPU0                           CPU1
> ====                           ====
> _prb_commit:B
> desc_update_last_finalized:A   desc_last_finalized_seq:A
>                                desc_read:A
>
> desc_read:A is loading the descriptor state that _prb_commit:B stored.
>
> The extra note in the comment clarifies that _prb_commit:B could also be
> denoted as desc_read:A because desc_update_last_finalized() performs a
> read of (i.e. must have seen) _prb_commit:B.
>
> * Note: _prb_commit:B and desc_update_last_finalized:A can be
> * different CPUs. However, the desc_update_last_finalized:A
> * CPU (which performs the release) must have previously seen
> * _prb_commit:B.
>
> Normally the CPU committing the record will also update
> last_finalized_seq. But it is possible that another CPU updates
> last_finalized_seq before the committing CPU because it already sees the
> finalized record. In that case the complete (maximally complex) picture
> looks like this.
>
> CPU0            CPU1                           CPU2
> ====            ====                           ====
> _prb_commit:B   desc_read:A
>                 desc_update_last_finalized:A   desc_last_finalized_seq:A
>                                                desc_read:A
>
> Any memory barriers in _prb_commit() or desc_read() are irrelevant for
> guaranteeing that a CPU reading a sequence value from
> desc_last_finalized_seq() will always be able to read that record.

I believe that they are relevant exactly because 3 CPUs might be
involved here:

+ _prb_commit:B makes sure that CPU0 stores all data before
setting the state as finalized.

+ desc_read:B makes sure that CPU1 will see all data written
when it reads the finalized state.

+ desc_update_last_finalized:A makes sure that the updating CPU saw
the record with the given seq in the finalized state.

+ desc_last_finalized_seq:A makes sure that it sees the record
associated with "last_finalized_seq" in the finalized state.

This is why the "Note:" is very important. And maybe desc_read:B
or desc_read:D should be mentioned in the note as well.

Also all these dependencies are really hard to follow. This
is why I suggested adding another note in the 9th patch,
see https://lore.kernel.org/all/ZbowlkOVWiSgCxNX@alley/

Or maybe we should document that prb_read() and _prb_read_valid()
provide these guarantees and just reference them here.

In other words, IMHO, the current "Note:" tries to
prove that _prb_read_valid() guarantees that all data
can be read when it sees the finalized state. IMHO,
we should document this above these functions and
just reference it here.

The current comments for desc_update_last_finalized:A
and prb_next_reserve_seq:A are really hard to follow
because they both try to explain these transitional
guarantees between prb_commit() and prb_read() APIs.
The comments mention many barriers even though the guarantees
should be there by design and should be mentioned
in the prb_read() API.

My motivation is that it took me a long time to understand this.
And I am still not sure if I understand it correctly.
IMHO, it might be better to describe some guarantees of
the upper-level API so that we do not need to look at
the low-level barriers again and again.

Best Regards,
Petr

2024-02-07 09:20:32

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 09/14] printk: Wait for all reserved records with pr_flush()

On Mon 2024-02-05 14:39:03, John Ogness wrote:
> On 2024-01-31, Petr Mladek <[email protected]> wrote:
> >> + last_finalized_seq = desc_last_finalized_seq(rb);
> >> +
> >> + /*
> >> + * @head_id is loaded after @last_finalized_seq to ensure that it is
> >> + * at or beyond @last_finalized_seq.

Maybe we could do it the following way. I would slightly update the
above comment:

* @head_id is loaded after @last_finalized_seq to ensure that
* it points to the record with @last_finalized_seq or newer.

I got it even from the previous version, but I had to think about
it a bit. You always wanted to distinguish between the IDs
and sequence numbers.

Anyway, this comment makes the safety kind of obvious.

@head_id always has to be updated when the related record is reserved.
And all other CPUs have to see that update so that they bump
@head_id further for any newer record.

And @last_finalized_seq points to an already finalized record.
If a record is finalized then other CPUs must be able to
read valid data.

Maybe we should add the two above paragraphs into the
human-readable part of the comment as well. They describe
some basics of the design. Everything would blow up if
either of them did not hold.

My motivation is that a good human-friendly description
helps to understand what the code does and that it is
"obviously" correct.

I agree that the below precise description is useful.
But people should need it only when they want to prove
or revisit the human-friendly claim.

> >> + *
> >> + * Memory barrier involvement:
> >> + *
> >> + * If desc_last_finalized_seq:A reads from
> >> + * desc_update_last_finalized:A, then
> >> + * prb_next_reserve_seq:A reads from desc_reserve:D.
> >> + *
> >> + * Relies on:
> >> + *
> >> + * RELEASE from desc_reserve:D to desc_update_last_finalized:A
> >> + * matching
> >> + * ACQUIRE from desc_last_finalized_seq:A to prb_next_reserve_seq:A
> >> + *
> >> + * Note: desc_reserve:D and desc_update_last_finalized:A can be
> >> + * different CPUs. However, the desc_update_last_finalized:A CPU
> >> + * (which performs the release) must have previously seen
> >> + * desc_read:C, which implies desc_reserve:D can be seen.
> >
> > The most tricky part is desc_reserve:D. It is a super complicated
> > barrier which guarantees many things. But I think that there are
> > many more write barriers before the allocated descriptor reaches
> > finalized state. So that it should be easier to prove the correctness
> > here by other easier barriers.
>
> Yes, desc_reserve:D provides memory barriers for several orderings. But
> it is _not_ providing a memory barrier for this ordering. It only marks
> where @head_id is stored.
>
> The only memory barriers involved here are desc_update_last_finalized:A
> and its counterpart desc_last_finalized_seq:A.
>
> CPU0                                CPU1
> ====                                ====
> store(head_id)
> store_release(last_finalized_seq)   load_acquire(last_finalized_seq)
>                                     load(head_id)
>
> > To make it clear. I am all for keeping the above precise description
> > as is. I just think about adding one more human friendly note which
> > might help future generations to understand the correctness in an easier
> > way. Something like:
> >
> > * Note: The above description might be hard to follow because
> > * desc_reserve:D is a multi-purpose barrier. But this is
> > * just the first barrier which makes sure that @head_id
> > * is updated before the reserved descriptor gets finalized
> > * and updates @last_finalized_seq.
> > *
> > * There are more release barriers in between, especially,
> > * desc_reserve:F and desc_update_last_finalized:A. Also these make
> > * sure that the desc_last_finalized_seq:A acquire barrier allows
> > * to read @head_id related to @last_finalized_seq or newer.
>
> Non-global memory barriers must operate on the same memory. In this
> case, the acquire/release memory barriers are operating on
> @last_finalized_seq. The other ordered memory load in this situation is
> for @head_id, so it makes sense to specify the store of @head_id as the
> start of the release block.

I am a bit confused by the "non-global memory barrier" here. I would
expect that acquire()/release() variants provide global barriers.
They are used to implement locking.

Anyway, I am fine with keeping this precise description as is
if we make the correctness more "obvious" from the human-friendly
part of the comment.

I am sorry for all the nitpicking. I want to make sure that
the entire comment is correct including the precise part.
This was not easy so I thought about ways to make it easier.
Sigh.

Anyway, I do not want to block the patchset much longer
because of these formal things.

> > BTW: It came to my mind whether the logic might be more
> > straightforward if we store desc_ring->next_finalized_seq instead
> > of @last_finalized_seq.
>
> I thought about this as well. But I felt like the meaning was a bit
> confusing. Also @head_id points to the latest reserved descriptor, so it
> makes the code a bit easier to follow when creating a diff based on the
> latest finalized descriptor.

Fair enough.

Best Regards,
Petr

2024-02-07 10:11:38

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH printk v3 11/14] printk: ringbuffer: Consider committed as finalized in panic

On Mon 2024-02-05 15:14:14, John Ogness wrote:
> On 2024-02-01, Petr Mladek <[email protected]> wrote:
> > On Thu 2023-12-14 22:47:58, John Ogness wrote:
> >> A descriptor in the committed state means the record does not yet
> >> exist for the reader. However, for the panic CPU, committed
> >> records should be handled as finalized records since they contain
> >> message data in a consistent state and may contain additional
> >> hints as to the cause of the panic.
> >>
> >> Add an exception for records in the commit state to not be
> >> considered non-existing when reading from the panic CPU.
> >
> > IMHO, it is important to describe the effects of this change in more
> > detail. And I think that it actually does not work as expected,
> > see below.
>
> I reviewed my notes from our meeting in Richmond. We had agreed that
> this feature should not apply to the latest message. That would change
> the commit message to be as follows:
>
> printk: ringbuffer: Consider committed as finalized in panic
>
> A descriptor in the committed state means the record does not yet
> exist for the reader. However, for the panic CPU, committed
> records should be handled as finalized records since they contain
> message data in a consistent state and may contain additional
> hints as to the cause of the panic.
>
> The only exception is the last record. The panic CPU may be
> > using LOG_CONT and the individual pieces should not be printed
> separately.

This was my first understanding. But then I realized that appended
pieces would not be printed at all when a committed part
was printed.

And it might be even worse. I realized yesterday that an
attempt to print the last record in the committed state might
end up missing the trailing '\0' in the copied data. Just imagine
the following race:

CPU0                                    CPU1

prb_read()
  // read last record in committed state
  copy_data(.., info->text_len)

                                        prb_reserve_in_last()
                                        printk_sprint(&r.text_buf[0], reserve_size, ...
                                        prb_commit()

  // copy data_size before appending
  // from already appended buffer
  // The trailing '\0' is not in the copied part
  memcpy(&buf[0], data, data_size);

BANG: CPU0 would try to use a string without the trailing '\0'.

We should probably add a sanity check and fixup for this potential
problem.
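
A possible fixup (just a sketch of the idea, not a tested patch)
would be to re-terminate the copied text in prb_read() after
copy_data(), based on the snapshotted length:

	if (r->text_buf && r->text_buf_size) {
		unsigned int len;

		/* A racing writer may have grown the record meanwhile. */
		len = min_t(unsigned int, info->text_len, r->text_buf_size - 1);
		r->text_buf[len] = '\0';
	}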


> Add a special-case check for records in the commit state to not
> be considered non-existing when reading from the panic CPU,
> provided it is not the last record.
>
> >> --- a/kernel/printk/printk_ringbuffer.c
> >> +++ b/kernel/printk/printk_ringbuffer.c
> >> @@ -1875,16 +1877,25 @@ static int desc_read_finalized_seq(struct prb_desc_ring *desc_ring,
> >>
> >>  	/*
> >>  	 * An unexpected @id (desc_miss) or @seq mismatch means the record
> >> -	 * does not exist. A descriptor in the reserved or committed state
> >> -	 * means the record does not yet exist for the reader.
> >> +	 * does not exist. A descriptor in the reserved state means the
> >> +	 * record does not yet exist for the reader.
> >>  	 */
> >>  	if (d_state == desc_miss ||
> >>  	    d_state == desc_reserved ||
> >> -	    d_state == desc_committed ||
> >>  	    s != seq) {
> >>  		return -EINVAL;
> >>  	}
> >>
> >> +	/*
> >> +	 * A descriptor in the committed state means the record does not yet
> >> +	 * exist for the reader. However, for the panic CPU, committed
> >> +	 * records are also handled as finalized records since they contain
> >> +	 * message data in a consistent state and may contain additional
> >> +	 * hints as to the cause of the panic.
> >> +	 */
> >> +	if (d_state == desc_committed && !this_cpu_in_panic())
> >> +		return -EINVAL;
>
> And this code would change to:
>
> +	/*
> +	 * A descriptor in the committed state means the record does not yet
> +	 * exist for the reader. However, for the panic CPU, committed
> +	 * records are also handled as finalized records since they contain
> +	 * message data in a consistent state and may contain additional
> +	 * hints as to the cause of the panic. The only exception is the
> +	 * last record, which may still be appended by the panic CPU and so
> +	 * is not available to the panic CPU for reading.
> +	 */
> +	if (d_state == desc_committed &&
> +	    (!this_cpu_in_panic() || id == atomic_long_read(&desc_ring->head_id))) {
> +		return -EINVAL;
> +	}

This is clever.

Well, it would help only with a very small race window. Older
records are automatically finalized when committed. That is because
they can no longer be reopened. See prb_commit().
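
For reference, the auto-finalize logic there is roughly the
following (paraphrased; details may differ):

	void prb_commit(struct prb_reserved_entry *e)
	{
		struct prb_desc_ring *desc_ring = &e->rb->desc_ring;
		unsigned long head_id;

		_prb_commit(e, 0);

		/*
		 * If this descriptor is no longer the head (a newer record
		 * was reserved meanwhile), it can never be reopened and
		 * must be finalized immediately.
		 */
		head_id = atomic_long_read(&desc_ring->head_id);
		if (head_id != e->id)
			desc_make_final(desc_ring, e->id);
	}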

> > If I get it correctly, this causes that panic CPU would see a
> > non-finalized continuous line as finalized. And it would flush
> > the existing piece to consoles.
> >
> > The problem is that pr_cont() would append the message into
> > the same record. But the consoles would already wait
> > for the next record. They would miss the appended pieces.
>
> Exactly. That is why we said that the last message would not be
> available. Maybe this new version is acceptable.
>
> > Honestly, I think that it is not worth the effort. It would add
> > another complexity to the memory barriers. The real effect is not easy
> > to understand. And the benefit is minimal from my POV.
>
> I am OK with dropping this patch from the series. It is questionable how
> valuable a LOG_CONT piece from a non-panic CPU is anyway. And if the
> non-panic CPU managed to reopen the record, it would be skipped anyway.
>
> I will drop this patch unless you want to keep the new version.

Honestly, I would drop it. It is kind of tricky code. And it would
help only with a very small race window for messages from
non-panic CPUs, especially with the 14th patch, which
blocks messages from other CPUs during panic() anyway.

Best Regards,
Petr