Hi,
While testing various flushing scenarios, I stumbled on a few
issues that cause console flushing to fail. While at LPC2023 in
Richmond, I sat down with Petr Mladek and we reviewed the
v2 [0] series. This series is the result of that offline
discussion. v3 is here [1].
This series addresses the following issues:
1. The prb_next_seq() optimization caused inconsistent return
values. Fix prb_next_seq() to the originally intended
behavior but keep an optimization.
2. pr_flush() might not wait until the most recently stored
printk() message if non-finalized records precede it. Fix
pr_flush() to wait for all records to print that are at
least reserved at the time of the call.
3. In panic, the panic messages will not print if non-finalized
records precede them. Add a special condition so that
readers on the panic CPU will drop records that are not in
a consistent state.
4. It is possible (and easy to reproduce) a scenario where the
console on the panic CPU hands over to a waiter of a stopped
CPU. Do not use the handover feature in panic.
5. If messages are being dropped during panic, non-panic CPUs
are silenced. But by then it is already too late and most
likely the panic messages have been dropped. Change the
non-panic CPU silencing logic to _immediately_ silence
non-panic CPUs during panic. This also leads to clean panic
output when many CPUs are blasting the kernel log.
6. If a panic occurs in a context where printk() calls defer
printing (NMI or printk_safe section), the printing of the
final panic messages rely on irq_work. If that mechanism is
not available, the final panic messages are not seen (even
though they are finalized in the ringbuffer). Add one last
explicit flush after all printk() calls are finished to
ensure all available messages in the kernel log are printed.
7. When dumping the stacktrace from panic(), do not use the
printk_cpu_sync because it can deadlock if another CPU holds
and is unable to release the printk_cpu_sync.
This series also performs some minor cleanups to remove open
coded checks about the panic context and improve documentation
language regarding data-less records.
Because of multiple refactoring done in recent history, it
would be helpful to provide the LTS maintainers with the proper
backported patches. I am happy to do this.
The changes since v3:
- Drop patch 11 of v3 ("printk: ringbuffer: Consider committed
as finalized in panic"). It adds complexity, may not perform
as expected, and it is questionable as to whether it would
help provide useful messages on panic.
- Changed several comments according to the suggestions by
Petr. Note that I did not include all the suggestions
because IMO they were too vague in describing the related
memory barriers.
- Add a patch to avoid using printk_cpu_sync on panic(). This
was recently discussed [2]. Feel free to drop the patch
14/14 ("dump_stack: Do not get cpu_sync for panic CPU") if
you think it is not appropriate or will significantly delay
this series.
John Ogness
[0] https://lore.kernel.org/lkml/[email protected]
[1] https://lore.kernel.org/lkml/[email protected]
[2] https://lore.kernel.org/lkml/ZcIGKU8sxti38Kok@alley
John Ogness (12):
printk: nbcon: Relocate 32bit seq macros
printk: Use prb_first_seq() as base for 32bit seq macros
printk: ringbuffer: Do not skip non-finalized records with
prb_next_seq()
printk: ringbuffer: Clarify special lpos values
printk: For @suppress_panic_printk check for other CPU in panic
printk: Add this_cpu_in_panic()
printk: ringbuffer: Cleanup reader terminology
printk: Wait for all reserved records with pr_flush()
printk: ringbuffer: Skip non-finalized records in panic
printk: Avoid non-panic CPUs writing to ringbuffer
panic: Flush kernel log buffer at the end
dump_stack: Do not get cpu_sync for panic CPU
Petr Mladek (1):
printk: Disable passing console lock owner completely during panic()
Sebastian Andrzej Siewior (1):
printk: Adjust mapping for 32bit seq macros
include/linux/printk.h | 2 +
kernel/panic.c | 8 +
kernel/printk/nbcon.c | 41 +---
kernel/printk/printk.c | 101 +++++----
kernel/printk/printk_ringbuffer.c | 335 +++++++++++++++++++++++++-----
kernel/printk/printk_ringbuffer.h | 54 ++++-
lib/dump_stack.c | 16 +-
7 files changed, 419 insertions(+), 138 deletions(-)
base-commit: 6c3a34e38436a2a3f7a1fa764c108ee19b05b893
--
2.39.2
The macros __seq_to_nbcon_seq() and __nbcon_seq_to_seq() are
used to provide support for atomic handling of sequence numbers
on 32bit systems. Until now this was only used by nbcon.c,
which is why they were located in nbcon.c and include nbcon in
the name.
In a follow-up commit this functionality is also needed by
printk_ringbuffer. Rather than duplicating the functionality,
relocate the macros to printk_ringbuffer.h.
Also, since the macros will be no longer nbcon-specific, rename
them to __u64seq_to_ulseq() and __ulseq_to_u64seq().
This does not result in any functional change.
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/nbcon.c | 41 +++----------------------------
kernel/printk/printk_ringbuffer.h | 33 +++++++++++++++++++++++++
2 files changed, 37 insertions(+), 37 deletions(-)
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index b96077152f49..c8093bcc01fe 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -140,39 +140,6 @@ static inline bool nbcon_state_try_cmpxchg(struct console *con, struct nbcon_sta
return atomic_try_cmpxchg(&ACCESS_PRIVATE(con, nbcon_state), &cur->atom, new->atom);
}
-#ifdef CONFIG_64BIT
-
-#define __seq_to_nbcon_seq(seq) (seq)
-#define __nbcon_seq_to_seq(seq) (seq)
-
-#else /* CONFIG_64BIT */
-
-#define __seq_to_nbcon_seq(seq) ((u32)seq)
-
-static inline u64 __nbcon_seq_to_seq(u32 nbcon_seq)
-{
- u64 seq;
- u64 rb_next_seq;
-
- /*
- * The provided sequence is only the lower 32 bits of the ringbuffer
- * sequence. It needs to be expanded to 64bit. Get the next sequence
- * number from the ringbuffer and fold it.
- *
- * Having a 32bit representation in the console is sufficient.
- * If a console ever gets more than 2^31 records behind
- * the ringbuffer then this is the least of the problems.
- *
- * Also the access to the ring buffer is always safe.
- */
- rb_next_seq = prb_next_seq(prb);
- seq = rb_next_seq - ((u32)rb_next_seq - nbcon_seq);
-
- return seq;
-}
-
-#endif /* CONFIG_64BIT */
-
/**
* nbcon_seq_read - Read the current console sequence
* @con: Console to read the sequence of
@@ -183,7 +150,7 @@ u64 nbcon_seq_read(struct console *con)
{
unsigned long nbcon_seq = atomic_long_read(&ACCESS_PRIVATE(con, nbcon_seq));
- return __nbcon_seq_to_seq(nbcon_seq);
+ return __ulseq_to_u64seq(prb, nbcon_seq);
}
/**
@@ -204,7 +171,7 @@ void nbcon_seq_force(struct console *con, u64 seq)
*/
u64 valid_seq = max_t(u64, seq, prb_first_valid_seq(prb));
- atomic_long_set(&ACCESS_PRIVATE(con, nbcon_seq), __seq_to_nbcon_seq(valid_seq));
+ atomic_long_set(&ACCESS_PRIVATE(con, nbcon_seq), __u64seq_to_ulseq(valid_seq));
/* Clear con->seq since nbcon consoles use con->nbcon_seq instead. */
con->seq = 0;
@@ -223,11 +190,11 @@ void nbcon_seq_force(struct console *con, u64 seq)
*/
static void nbcon_seq_try_update(struct nbcon_context *ctxt, u64 new_seq)
{
- unsigned long nbcon_seq = __seq_to_nbcon_seq(ctxt->seq);
+ unsigned long nbcon_seq = __u64seq_to_ulseq(ctxt->seq);
struct console *con = ctxt->console;
if (atomic_long_try_cmpxchg(&ACCESS_PRIVATE(con, nbcon_seq), &nbcon_seq,
- __seq_to_nbcon_seq(new_seq))) {
+ __u64seq_to_ulseq(new_seq))) {
ctxt->seq = new_seq;
} else {
ctxt->seq = nbcon_seq_read(con);
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index 18cd25e489b8..b82a96dc2ea2 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -381,4 +381,37 @@ bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
u64 prb_first_valid_seq(struct printk_ringbuffer *rb);
u64 prb_next_seq(struct printk_ringbuffer *rb);
+#ifdef CONFIG_64BIT
+
+#define __u64seq_to_ulseq(u64seq) (u64seq)
+#define __ulseq_to_u64seq(rb, ulseq) (ulseq)
+
+#else /* CONFIG_64BIT */
+
+#define __u64seq_to_ulseq(u64seq) ((u32)u64seq)
+
+static inline u64 __ulseq_to_u64seq(struct printk_ringbuffer *rb, u32 ulseq)
+{
+ u64 seq;
+ u64 rb_next_seq;
+
+ /*
+ * The provided sequence is only the lower 32 bits of the ringbuffer
+ * sequence. It needs to be expanded to 64bit. Get the next sequence
+ * number from the ringbuffer and fold it.
+ *
+ * Having a 32bit representation in the console is sufficient.
+ * If a console ever gets more than 2^31 records behind
+ * the ringbuffer then this is the least of the problems.
+ *
+ * Also the access to the ring buffer is always safe.
+ */
+ rb_next_seq = prb_next_seq(rb);
+ seq = rb_next_seq - ((u32)rb_next_seq - ulseq);
+
+ return seq;
+}
+
+#endif /* CONFIG_64BIT */
+
#endif /* _KERNEL_PRINTK_RINGBUFFER_H */
--
2.39.2
Commit f244b4dc53e5 ("printk: ringbuffer: Improve
prb_next_seq() performance") introduced an optimization for
prb_next_seq() by using best-effort to track recently finalized
records. However, the order of finalization does not
necessarily match the order of the records. The optimization
changed prb_next_seq() to return inconsistent results, possibly
yielding sequence numbers that are not available to readers
because they are preceded by non-finalized records or they are
not yet visible to the reader CPU.
Rather than simply best-effort tracking recently finalized
records, force the committing writer to read records and
increment the last "contiguous block" of finalized records. In
order to do this, the sequence number instead of ID must be
stored because ID's cannot be directly compared.
A new memory barrier pair is introduced to guarantee that a
reader can always read the records up until the sequence number
returned by prb_next_seq() (unless the records have since
been overwritten in the ringbuffer).
This restores the original functionality of prb_next_seq()
while also keeping the optimization.
For 32bit systems, only the lower 32 bits of the sequence
number are stored. When reading the value, it is expanded to
the full 64bit sequence number using the 32bit seq macros,
which fold in the value returned by prb_first_seq().
Fixes: f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq() performance")
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 164 +++++++++++++++++++++++-------
kernel/printk/printk_ringbuffer.h | 4 +-
2 files changed, 127 insertions(+), 41 deletions(-)
diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 49a82ccce8e9..04c26cca546f 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -6,6 +6,7 @@
#include <linux/errno.h>
#include <linux/bug.h>
#include "printk_ringbuffer.h"
+#include "internal.h"
/**
* DOC: printk_ringbuffer overview
@@ -303,6 +304,9 @@
*
* desc_push_tail:B / desc_reserve:D
* set descriptor reusable (state), then push descriptor tail (id)
+ *
+ * desc_update_last_finalized:A / desc_last_finalized_seq:A
+ * store finalized record, then set new highest finalized sequence number
*/
#define DATA_SIZE(data_ring) _DATA_SIZE((data_ring)->size_bits)
@@ -1441,20 +1445,118 @@ bool prb_reserve_in_last(struct prb_reserved_entry *e, struct printk_ringbuffer
return false;
}
+/*
+ * @last_finalized_seq value guarantees that all records up to and including
+ * this sequence number are finalized and can be read. The only exception are
+ * too old records which have already been overwritten.
+ *
+ * It is also guaranteed that @last_finalized_seq only increases.
+ *
+ * Be aware that finalized records following non-finalized records are not
+ * reported because they are not yet available to the reader. For example,
+ * a new record stored via printk() will not be available to a printer if
+ * it follows a record that has not been finalized yet. However, once that
+ * non-finalized record becomes finalized, @last_finalized_seq will be
+ * appropriately updated and the full set of finalized records will be
+ * available to the printer. And since each printk() caller will either
+ * directly print or trigger deferred printing of all available unprinted
+ * records, all printk() messages will get printed.
+ */
+static u64 desc_last_finalized_seq(struct printk_ringbuffer *rb)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ unsigned long ulseq;
+
+ /*
+ * Guarantee the sequence number is loaded before loading the
+ * associated record in order to guarantee that the record can be
+ * seen by this CPU. This pairs with desc_update_last_finalized:A.
+ */
+ ulseq = atomic_long_read_acquire(&desc_ring->last_finalized_seq
+ ); /* LMM(desc_last_finalized_seq:A) */
+
+ return __ulseq_to_u64seq(rb, ulseq);
+}
+
+static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
+ struct printk_record *r, unsigned int *line_count);
+
+/*
+ * Check if there are records directly following @last_finalized_seq that are
+ * finalized. If so, update @last_finalized_seq to the latest of these
+ * records. It is not allowed to skip over records that are not yet finalized.
+ */
+static void desc_update_last_finalized(struct printk_ringbuffer *rb)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ u64 old_seq = desc_last_finalized_seq(rb);
+ unsigned long oldval;
+ unsigned long newval;
+ u64 finalized_seq;
+ u64 try_seq;
+
+try_again:
+ finalized_seq = old_seq;
+ try_seq = finalized_seq + 1;
+
+ /* Try to find later finalized records. */
+ while (_prb_read_valid(rb, &try_seq, NULL, NULL)) {
+ finalized_seq = try_seq;
+ try_seq++;
+ }
+
+ /* No update needed if no later finalized record was found. */
+ if (finalized_seq == old_seq)
+ return;
+
+ oldval = __u64seq_to_ulseq(old_seq);
+ newval = __u64seq_to_ulseq(finalized_seq);
+
+ /*
+ * Set the sequence number of a later finalized record that has been
+ * seen.
+ *
+ * Guarantee the record data is visible to other CPUs before storing
+ * its sequence number. This pairs with desc_last_finalized_seq:A.
+ *
+ * Memory barrier involvement:
+ *
+ * If desc_last_finalized_seq:A reads from
+ * desc_update_last_finalized:A, then desc_read:A reads from
+ * _prb_commit:B.
+ *
+ * Relies on:
+ *
+ * RELEASE from _prb_commit:B to desc_update_last_finalized:A
+ * matching
+ * ACQUIRE from desc_last_finalized_seq:A to desc_read:A
+ *
+ * Note: _prb_commit:B and desc_update_last_finalized:A can be
+ * different CPUs. However, the desc_update_last_finalized:A
+ * CPU (which performs the release) must have previously seen
+ * _prb_commit:B.
+ */
+ if (!atomic_long_try_cmpxchg_release(&desc_ring->last_finalized_seq,
+ &oldval, newval)) { /* LMM(desc_update_last_finalized:A) */
+ old_seq = __ulseq_to_u64seq(rb, oldval);
+ goto try_again;
+ }
+}
+
/*
* Attempt to finalize a specified descriptor. If this fails, the descriptor
* is either already final or it will finalize itself when the writer commits.
*/
-static void desc_make_final(struct prb_desc_ring *desc_ring, unsigned long id)
+static void desc_make_final(struct printk_ringbuffer *rb, unsigned long id)
{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
unsigned long prev_state_val = DESC_SV(id, desc_committed);
struct prb_desc *d = to_desc(desc_ring, id);
- atomic_long_cmpxchg_relaxed(&d->state_var, prev_state_val,
- DESC_SV(id, desc_finalized)); /* LMM(desc_make_final:A) */
-
- /* Best effort to remember the last finalized @id. */
- atomic_long_set(&desc_ring->last_finalized_id, id);
+ if (atomic_long_try_cmpxchg_relaxed(&d->state_var, &prev_state_val,
+ DESC_SV(id, desc_finalized))) { /* LMM(desc_make_final:A) */
+ desc_update_last_finalized(rb);
+ }
}
/**
@@ -1550,7 +1652,7 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
* readers. (For seq==0 there is no previous descriptor.)
*/
if (info->seq > 0)
- desc_make_final(desc_ring, DESC_ID(id - 1));
+ desc_make_final(rb, DESC_ID(id - 1));
r->text_buf = data_alloc(rb, r->text_buf_size, &d->text_blk_lpos, id);
/* If text data allocation fails, a data-less record is committed. */
@@ -1643,7 +1745,7 @@ void prb_commit(struct prb_reserved_entry *e)
*/
head_id = atomic_long_read(&desc_ring->head_id); /* LMM(prb_commit:A) */
if (head_id != e->id)
- desc_make_final(desc_ring, e->id);
+ desc_make_final(e->rb, e->id);
}
/**
@@ -1663,12 +1765,9 @@ void prb_commit(struct prb_reserved_entry *e)
*/
void prb_final_commit(struct prb_reserved_entry *e)
{
- struct prb_desc_ring *desc_ring = &e->rb->desc_ring;
-
_prb_commit(e, desc_finalized);
- /* Best effort to remember the last finalized @id. */
- atomic_long_set(&desc_ring->last_finalized_id, e->id);
+ desc_update_last_finalized(e->rb);
}
/*
@@ -2008,7 +2107,9 @@ u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
* newest sequence number available to readers will be.
*
* This provides readers a sequence number to jump to if all currently
- * available records should be skipped.
+ * available records should be skipped. It is guaranteed that all records
+ * previous to the returned value have been finalized and are (or were)
+ * available to the reader.
*
* Context: Any context.
* Return: The sequence number of the next newest (not yet available) record
@@ -2016,34 +2117,19 @@ u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
*/
u64 prb_next_seq(struct printk_ringbuffer *rb)
{
- struct prb_desc_ring *desc_ring = &rb->desc_ring;
- enum desc_state d_state;
- unsigned long id;
u64 seq;
- /* Check if the cached @id still points to a valid @seq. */
- id = atomic_long_read(&desc_ring->last_finalized_id);
- d_state = desc_read(desc_ring, id, NULL, &seq, NULL);
+ seq = desc_last_finalized_seq(rb);
- if (d_state == desc_finalized || d_state == desc_reusable) {
- /*
- * Begin searching after the last finalized record.
- *
- * On 0, the search must begin at 0 because of hack#2
- * of the bootstrapping phase it is not known if a
- * record at index 0 exists.
- */
- if (seq != 0)
- seq++;
- } else {
- /*
- * The information about the last finalized sequence number
- * has gone. It should happen only when there is a flood of
- * new messages and the ringbuffer is rapidly recycled.
- * Give up and start from the beginning.
- */
- seq = 0;
- }
+ /*
+ * Begin searching after the last finalized record.
+ *
+ * On 0, the search must begin at 0 because of hack#2
+ * of the bootstrapping phase it is not known if a
+ * record at index 0 exists.
+ */
+ if (seq != 0)
+ seq++;
/*
* The information about the last finalized @seq might be inaccurate.
@@ -2085,7 +2171,7 @@ void prb_init(struct printk_ringbuffer *rb,
rb->desc_ring.infos = infos;
atomic_long_set(&rb->desc_ring.head_id, DESC0_ID(descbits));
atomic_long_set(&rb->desc_ring.tail_id, DESC0_ID(descbits));
- atomic_long_set(&rb->desc_ring.last_finalized_id, DESC0_ID(descbits));
+ atomic_long_set(&rb->desc_ring.last_finalized_seq, 0);
rb->text_data_ring.size_bits = textbits;
rb->text_data_ring.data = text_buf;
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index ee294aaf4aeb..2d948cc82b5b 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -75,7 +75,7 @@ struct prb_desc_ring {
struct printk_info *infos;
atomic_long_t head_id;
atomic_long_t tail_id;
- atomic_long_t last_finalized_id;
+ atomic_long_t last_finalized_seq;
};
/*
@@ -259,7 +259,7 @@ static struct printk_ringbuffer name = { \
.infos = &_##name##_infos[0], \
.head_id = ATOMIC_INIT(DESC0_ID(descbits)), \
.tail_id = ATOMIC_INIT(DESC0_ID(descbits)), \
- .last_finalized_id = ATOMIC_INIT(DESC0_ID(descbits)), \
+ .last_finalized_seq = ATOMIC_INIT(0), \
}, \
.text_data_ring = { \
.size_bits = (avgtextbits) + (descbits), \
--
2.39.2
Note: This change only applies to 32bit architectures. On 64bit
architectures the macros are NOPs.
Currently prb_next_seq() is used as the base for the 32bit seq
macros __u64seq_to_ulseq() and __ulseq_to_u64seq(). However, in
a follow-up commit, prb_next_seq() will need to make use of the
32bit seq macros.
Use prb_first_seq() as the base for the 32bit seq macros instead
because it is guaranteed to return 64bit sequence numbers without
relying on any 32bit seq macros.
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 2 +-
kernel/printk/printk_ringbuffer.h | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index fde338606ce8..49a82ccce8e9 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1832,7 +1832,7 @@ static int prb_read(struct printk_ringbuffer *rb, u64 seq,
}
/* Get the sequence number of the tail descriptor. */
-static u64 prb_first_seq(struct printk_ringbuffer *rb)
+u64 prb_first_seq(struct printk_ringbuffer *rb)
{
struct prb_desc_ring *desc_ring = &rb->desc_ring;
enum desc_state d_state;
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index 12f60c782e46..ee294aaf4aeb 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -378,6 +378,7 @@ bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
struct printk_info *info, unsigned int *line_count);
+u64 prb_first_seq(struct printk_ringbuffer *rb);
u64 prb_first_valid_seq(struct printk_ringbuffer *rb);
u64 prb_next_seq(struct printk_ringbuffer *rb);
@@ -392,12 +393,12 @@ u64 prb_next_seq(struct printk_ringbuffer *rb);
static inline u64 __ulseq_to_u64seq(struct printk_ringbuffer *rb, u32 ulseq)
{
+ u64 rb_first_seq = prb_first_seq(rb);
u64 seq;
- u64 rb_next_seq;
/*
* The provided sequence is only the lower 32 bits of the ringbuffer
- * sequence. It needs to be expanded to 64bit. Get the next sequence
+ * sequence. It needs to be expanded to 64bit. Get the first sequence
* number from the ringbuffer and fold it.
*
* Having a 32bit representation in the console is sufficient.
@@ -406,8 +407,7 @@ static inline u64 __ulseq_to_u64seq(struct printk_ringbuffer *rb, u32 ulseq)
*
* Also the access to the ring buffer is always safe.
*/
- rb_next_seq = prb_next_seq(rb);
- seq = rb_next_seq - (s32)((u32)rb_next_seq - ulseq);
+ seq = rb_first_seq - (s32)((u32)rb_first_seq - ulseq);
return seq;
}
--
2.39.2
For empty line records, no data blocks are created. Instead,
these valid records are identified by special logical position
values (in fields of @prb_desc.text_blk_lpos).
Currently the macro NO_LPOS is used for empty line records.
This name is confusing because it does not imply _why_ there is
no data block.
Rename NO_LPOS to EMPTY_LINE_LPOS so that it is clear why there
is no data block.
Also add comments explaining the use of EMPTY_LINE_LPOS as well
as clarification to the values used to represent data-less
blocks.
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 20 ++++++++++++++++----
kernel/printk/printk_ringbuffer.h | 16 +++++++++++++++-
2 files changed, 31 insertions(+), 5 deletions(-)
diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 04c26cca546f..244d991ffd73 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1034,9 +1034,13 @@ static char *data_alloc(struct printk_ringbuffer *rb, unsigned int size,
unsigned long next_lpos;
if (size == 0) {
- /* Specify a data-less block. */
- blk_lpos->begin = NO_LPOS;
- blk_lpos->next = NO_LPOS;
+ /*
+ * Data blocks are not created for empty lines. Instead, the
+ * reader will recognize these special lpos values and handle
+ * it appropriately.
+ */
+ blk_lpos->begin = EMPTY_LINE_LPOS;
+ blk_lpos->next = EMPTY_LINE_LPOS;
return NULL;
}
@@ -1214,10 +1218,18 @@ static const char *get_data(struct prb_data_ring *data_ring,
/* Data-less data block description. */
if (BLK_DATALESS(blk_lpos)) {
- if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) {
+ /*
+ * Records that are just empty lines are also valid, even
+ * though they do not have a data block. For such records
+ * explicitly return empty string data to signify success.
+ */
+ if (blk_lpos->begin == EMPTY_LINE_LPOS &&
+ blk_lpos->next == EMPTY_LINE_LPOS) {
*data_size = 0;
return "";
}
+
+ /* Data lost, invalid, or otherwise unavailable. */
return NULL;
}
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index 2d948cc82b5b..d49460f7578e 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -127,8 +127,22 @@ enum desc_state {
#define DESC_SV(id, state) (((unsigned long)state << DESC_FLAGS_SHIFT) | id)
#define DESC_ID_MASK (~DESC_FLAGS_MASK)
#define DESC_ID(sv) ((sv) & DESC_ID_MASK)
+
+/*
+ * Special data block logical position values (for fields of
+ * @prb_desc.text_blk_lpos).
+ *
+ * - Bit0 is used to identify if the record has no data block. (Implemented in
+ * the LPOS_DATALESS() macro.)
+ *
+ * - Bit1 specifies the reason for not having a data block.
+ *
+ * These special values could never be real lpos values because of the
+ * meta data and alignment padding of data blocks. (See to_blk_size() for
+ * details.)
+ */
#define FAILED_LPOS 0x1
-#define NO_LPOS 0x3
+#define EMPTY_LINE_LPOS 0x3
#define FAILED_BLK_LPOS \
{ \
--
2.39.2
Currently @suppress_panic_printk is checked along with
non-matching @panic_cpu and current CPU. This works
because @suppress_panic_printk is only set when
panic_in_progress() is true.
Rather than relying on the @suppress_panic_printk semantics,
use the concise helper function other_cpu_in_progress(). The
helper function exists to avoid open coding such tests.
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/printk.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index be95a6851164..22bb45d8f2f2 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2270,8 +2270,7 @@ asmlinkage int vprintk_emit(int facility, int level,
if (unlikely(suppress_printk))
return 0;
- if (unlikely(suppress_panic_printk) &&
- atomic_read(&panic_cpu) != raw_smp_processor_id())
+ if (unlikely(suppress_panic_printk) && other_cpu_in_panic())
return 0;
if (level == LOGLEVEL_SCHED) {
--
2.39.2
With the lockless ringbuffer, it is allowed that multiple
CPUs/contexts write simultaneously into the buffer. This creates
an ambiguity as some writers will finalize sooner.
The documentation for the prb_read functions is not clear as it
refers to "not yet written" and "no data available". Clarify the
return values and language to be in terms of the reader: records
available for reading.
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 244d991ffd73..67ee1c62fcd6 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1987,11 +1987,13 @@ u64 prb_first_seq(struct printk_ringbuffer *rb)
}
/*
- * Non-blocking read of a record. Updates @seq to the last finalized record
- * (which may have no data available).
+ * Non-blocking read of a record.
*
- * See the description of prb_read_valid() and prb_read_valid_info()
- * for details.
+ * On success @seq is updated to the record that was read and (if provided)
+ * @r and @line_count will contain the read/calculated data.
+ *
+ * On failure @seq is updated to a record that is not yet available to the
+ * reader, but it will be the next record available to the reader.
*/
static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
struct printk_record *r, unsigned int *line_count)
@@ -2010,7 +2012,7 @@ static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
*seq = tail_seq;
} else if (err == -ENOENT) {
- /* Record exists, but no data available. Skip. */
+ /* Record exists, but the data was lost. Skip. */
(*seq)++;
} else {
@@ -2043,7 +2045,7 @@ static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
* On success, the reader must check r->info.seq to see which record was
* actually read. This allows the reader to detect dropped records.
*
- * Failure means @seq refers to a not yet written record.
+ * Failure means @seq refers to a record not yet available to the reader.
*/
bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
struct printk_record *r)
@@ -2073,7 +2075,7 @@ bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
* On success, the reader must check info->seq to see which record meta data
* was actually read. This allows the reader to detect dropped records.
*
- * Failure means @seq refers to a not yet written record.
+ * Failure means @seq refers to a record not yet available to the reader.
*/
bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
struct printk_info *info, unsigned int *line_count)
--
2.39.2
Currently pr_flush() will only wait for records that were
available to readers at the time of the call (using
prb_next_seq()). But there may be more records (non-finalized)
that have following finalized records. pr_flush() should wait
for these to print as well. Particularly because any trailing
finalized records may be the messages that the calling context
wants to ensure are printed.
Add a new ringbuffer function prb_next_reserve_seq() to return
the sequence number following the most recently reserved record.
This guarantees that pr_flush() will wait until all current
printk() messages (completed or in progress) have been printed.
Fixes: 3b604ca81202 ("printk: add pr_flush()")
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/printk.c | 2 +-
kernel/printk/printk_ringbuffer.c | 105 ++++++++++++++++++++++++++++++
kernel/printk/printk_ringbuffer.h | 1 +
3 files changed, 107 insertions(+), 1 deletion(-)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 82dc2c7949b7..f3a7f5a6f6f8 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3755,7 +3755,7 @@ static bool __pr_flush(struct console *con, int timeout_ms, bool reset_on_progre
might_sleep();
- seq = prb_next_seq(prb);
+ seq = prb_next_reserve_seq(prb);
/* Flush the consoles so that records up to @seq are printed. */
console_lock();
diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 67ee1c62fcd6..24b484c221e8 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1986,6 +1986,111 @@ u64 prb_first_seq(struct printk_ringbuffer *rb)
return seq;
}
+/**
+ * prb_next_reserve_seq() - Get the sequence number after the most recently
+ * reserved record.
+ *
+ * @rb: The ringbuffer to get the sequence number from.
+ *
+ * This is the public function available to readers to see what sequence
+ * number will be assigned to the next reserved record.
+ *
+ * Note that depending on the situation, this value can be equal to or
+ * higher than the sequence number returned by prb_next_seq().
+ *
+ * Context: Any context.
+ * Return: The sequence number that will be assigned to the next record
+ * reserved.
+ */
+u64 prb_next_reserve_seq(struct printk_ringbuffer *rb)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ unsigned long last_finalized_id;
+ atomic_long_t *state_var;
+ u64 last_finalized_seq;
+ unsigned long head_id;
+ struct prb_desc desc;
+ unsigned long diff;
+ struct prb_desc *d;
+ int err;
+
+ /*
+ * It may not be possible to read a sequence number for @head_id.
+ * So the ID of @last_finailzed_seq is used to calculate what the
+ * sequence number of @head_id will be.
+ */
+
+try_again:
+ last_finalized_seq = desc_last_finalized_seq(rb);
+
+ /*
+ * @head_id is loaded after @last_finalized_seq to ensure that
+ * it points to the record with @last_finalized_seq or newer.
+ *
+ * Memory barrier involvement:
+ *
+ * If desc_last_finalized_seq:A reads from
+ * desc_update_last_finalized:A, then
+ * prb_next_reserve_seq:A reads from desc_reserve:D.
+ *
+ * Relies on:
+ *
+ * RELEASE from desc_reserve:D to desc_update_last_finalized:A
+ * matching
+ * ACQUIRE from desc_last_finalized_seq:A to prb_next_reserve_seq:A
+ *
+ * Note: desc_reserve:D and desc_update_last_finalized:A can be
+ * different CPUs. However, the desc_update_last_finalized:A CPU
+ * (which performs the release) must have previously seen
+ * desc_read:C, which implies desc_reserve:D can be seen.
+ */
+ head_id = atomic_long_read(&desc_ring->head_id); /* LMM(prb_next_reserve_seq:A) */
+
+ d = to_desc(desc_ring, last_finalized_seq);
+ state_var = &d->state_var;
+
+ /* Extract the ID, used to specify the descriptor to read. */
+ last_finalized_id = DESC_ID(atomic_long_read(state_var));
+
+ /* Ensure @last_finalized_id is correct. */
+ err = desc_read_finalized_seq(desc_ring, last_finalized_id, last_finalized_seq, &desc);
+
+ if (err == -EINVAL) {
+ if (last_finalized_seq == 0) {
+ /*
+ * No record has been finalized or even reserved yet.
+ *
+ * The @head_id is initialized such that the first
+ * increment will yield the first record (seq=0).
+ * Handle it separately to avoid a negative @diff
+ * below.
+ */
+ if (head_id == DESC0_ID(desc_ring->count_bits))
+ return 0;
+
+ /*
+ * One or more descriptors are already reserved. Use
+ * the descriptor ID of the first one (@seq=0) for
+ * the @diff below.
+ */
+ last_finalized_id = DESC0_ID(desc_ring->count_bits) + 1;
+ } else {
+ /* Record must have been overwritten. Try again. */
+ goto try_again;
+ }
+ }
+
+ /* Diff of known descriptor IDs to compute related sequence numbers. */
+ diff = head_id - last_finalized_id;
+
+ /*
+ * @head_id points to the most recently reserved record, but this
+ * function returns the sequence number that will be assigned to the
+ * next (not yet reserved) record. Thus +1 is needed.
+ */
+ return (last_finalized_seq + diff + 1);
+}
+
/*
* Non-blocking read of a record.
*
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index d49460f7578e..52626d0f1fa3 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -395,6 +395,7 @@ bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
u64 prb_first_seq(struct printk_ringbuffer *rb);
u64 prb_first_valid_seq(struct printk_ringbuffer *rb);
u64 prb_next_seq(struct printk_ringbuffer *rb);
+u64 prb_next_reserve_seq(struct printk_ringbuffer *rb);
#ifdef CONFIG_64BIT
--
2.39.2
From: Petr Mladek <[email protected]>
The commit d51507098ff91 ("printk: disable optimistic spin
during panic") added checks to avoid becoming a console waiter
if a panic is in progress.
However, the transition to panic can occur while there is
already a waiter. The current owner should not pass the lock to
the waiter because it might get stopped or blocked anytime.
Also the panic context might pass the console lock owner to an
already stopped waiter by mistake. It might happen when
console_flush_on_panic() ignores the current lock owner, for
example:
CPU0 CPU1
---- ----
console_lock_spinning_enable()
console_trylock_spinning()
[CPU1 now console waiter]
NMI: panic()
panic_other_cpus_shutdown()
[stopped as console waiter]
console_flush_on_panic()
console_lock_spinning_enable()
[print 1 record]
console_lock_spinning_disable_and_check()
[handover to stopped CPU1]
This results in panic() not flushing the panic messages.
Fix these problems by disabling all spinning operations
completely during panic().
Another advantage is that it prevents possible deadlocks caused
by "console_owner_lock". The panic() context does not need to
take it any longer. The lockless checks are safe because the
functions become NOPs when they see the panic in progress. All
operations manipulating the state are still synchronized by the
lock even when non-panic CPUs would notice the panic
synchronously.
The current owner might stay spinning. But non-panic() CPUs
would get stopped anyway and the panic context will never start
spinning.
Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes")
Signed-off-by: Petr Mladek <[email protected]>
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index f3a7f5a6f6f8..cb99c854a648 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1869,10 +1869,23 @@ static bool console_waiter;
*/
static void console_lock_spinning_enable(void)
{
+ /*
+ * Do not use spinning in panic(). The panic CPU wants to keep the lock.
+ * Non-panic CPUs abandon the flush anyway.
+ *
+ * Just keep the lockdep annotation. The panic-CPU should avoid
+ * taking console_owner_lock because it might cause a deadlock.
+ * This looks like the easiest way how to prevent false lockdep
+ * reports without handling races a lockless way.
+ */
+ if (panic_in_progress())
+ goto lockdep;
+
raw_spin_lock(&console_owner_lock);
console_owner = current;
raw_spin_unlock(&console_owner_lock);
+lockdep:
/* The waiter may spin on us after setting console_owner */
spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
}
@@ -1897,6 +1910,22 @@ static int console_lock_spinning_disable_and_check(int cookie)
{
int waiter;
+ /*
+ * Ignore spinning waiters during panic() because they might get stopped
+ * or blocked at any time,
+ *
+ * It is safe because nobody is allowed to start spinning during panic
+ * in the first place. If there has been a waiter then non panic CPUs
+ * might stay spinning. They would get stopped anyway. The panic context
+ * will never start spinning and an interrupted spin on panic CPU will
+ * never continue.
+ */
+ if (panic_in_progress()) {
+ /* Keep lockdep happy. */
+ spin_release(&console_owner_dep_map, _THIS_IP_);
+ return 0;
+ }
+
raw_spin_lock(&console_owner_lock);
waiter = READ_ONCE(console_waiter);
console_owner = NULL;
--
2.39.2
Normally a reader will stop once reaching a non-finalized
record. However, when a panic happens, writers from other CPUs
(or an interrupted context on the panic CPU) may have been
writing a record and were unable to finalize it. The panic CPU
will reserve/commit/finalize its panic records, but these will
be located after the non-finalized records. This results in
panic() not flushing the panic messages.
Extend _prb_read_valid() to skip over non-finalized records if
on the panic CPU.
Fixes: 896fbe20b4e2 ("printk: use the lockless ringbuffer")
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 24b484c221e8..88e8f3a61922 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -2099,6 +2099,10 @@ u64 prb_next_reserve_seq(struct printk_ringbuffer *rb)
*
* On failure @seq is updated to a record that is not yet available to the
* reader, but it will be the next record available to the reader.
+ *
+ * Note: When the current CPU is in panic, this function will skip over any
+ * non-existent/non-finalized records in order to allow the panic CPU
+ * to print any and all records that have been finalized.
*/
static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
struct printk_record *r, unsigned int *line_count)
@@ -2121,8 +2125,28 @@ static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
(*seq)++;
} else {
- /* Non-existent/non-finalized record. Must stop. */
- return false;
+ /*
+ * Non-existent/non-finalized record. Must stop.
+ *
+ * For panic situations it cannot be expected that
+ * non-finalized records will become finalized. But
+ * there may be other finalized records beyond that
+ * need to be printed for a panic situation. If this
+ * is the panic CPU, skip this
+ * non-existent/non-finalized record unless it is
+ * at or beyond the head, in which case it is not
+ * possible to continue.
+ *
+ * Note that new messages printed on panic CPU are
+ * finalized when we are here. The only exception
+ * might be the last message without trailing newline.
+ * But it would have the sequence number returned
+ * by "prb_next_reserve_seq() - 1".
+ */
+ if (this_cpu_in_panic() && ((*seq + 1) < prb_next_reserve_seq(rb)))
+ (*seq)++;
+ else
+ return false;
}
}
--
2.39.2
dump_stack() is called in panic(). If for some reason another CPU
is holding the printk_cpu_sync and is unable to release it, the
panic CPU will be unable to continue and print the stacktrace.
Since non-panic CPUs are not allowed to store new printk messages
anyway, there is no need to synchronize the stacktrace output in
a panic situation.
For the panic CPU, do not get the printk_cpu_sync because it is
not needed and avoids a potential deadlock scenario in panic().
Link: https://lore.kernel.org/lkml/ZcIGKU8sxti38Kok@alley
Signed-off-by: John Ogness <[email protected]>
---
include/linux/printk.h | 2 ++
kernel/printk/internal.h | 1 -
lib/dump_stack.c | 16 +++++++++++++---
3 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 8ef499ab3c1e..955e31860095 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -273,6 +273,8 @@ static inline void printk_trigger_flush(void)
}
#endif
+bool this_cpu_in_panic(void);
+
#ifdef CONFIG_SMP
extern int __printk_cpu_sync_try_get(void);
extern void __printk_cpu_sync_wait(void);
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index ac2d9750e5f8..6c2afee5ef62 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -130,7 +130,6 @@ struct printk_message {
};
bool other_cpu_in_panic(void);
-bool this_cpu_in_panic(void);
bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
bool is_extended, bool may_supress);
diff --git a/lib/dump_stack.c b/lib/dump_stack.c
index 83471e81501a..222c6d6c8281 100644
--- a/lib/dump_stack.c
+++ b/lib/dump_stack.c
@@ -96,15 +96,25 @@ static void __dump_stack(const char *log_lvl)
*/
asmlinkage __visible void dump_stack_lvl(const char *log_lvl)
{
+ bool in_panic = this_cpu_in_panic();
unsigned long flags;
/*
* Permit this cpu to perform nested stack dumps while serialising
- * against other CPUs
+ * against other CPUs, unless this CPU is in panic.
+ *
+ * When in panic, non-panic CPUs are not permitted to store new
+ * printk messages so there is no need to synchronize the output.
+ * This avoids potential deadlock in panic() if another CPU is
+ * holding and unable to release the printk_cpu_sync.
*/
- printk_cpu_sync_get_irqsave(flags);
+ if (!in_panic)
+ printk_cpu_sync_get_irqsave(flags);
+
__dump_stack(log_lvl);
- printk_cpu_sync_put_irqrestore(flags);
+
+ if (!in_panic)
+ printk_cpu_sync_put_irqrestore(flags);
}
EXPORT_SYMBOL(dump_stack_lvl);
--
2.39.2
If the kernel crashes in a context where printk() calls always
defer printing (such as in NMI or inside a printk_safe section)
then the final panic messages will be deferred to irq_work. But
if irq_work is not available, the messages will not get printed
unless explicitly flushed. The result is that the final
"end Kernel panic" banner does not get printed.
Add one final flush after the last printk() call to make sure
the final panic messages make it out as well.
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/panic.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/panic.c b/kernel/panic.c
index 2807639aab51..f22d8f33ea14 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -446,6 +446,14 @@ void panic(const char *fmt, ...)
/* Do not scroll important messages printed above */
suppress_printk = 1;
+
+ /*
+ * The final messages may not have been printed if in a context that
+ * defers printing (such as NMI) and irq_work is not available.
+ * Explicitly flush the kernel log buffer one last time.
+ */
+ console_flush_on_panic(CONSOLE_FLUSH_PENDING);
+
local_irq_enable();
for (i = 0; ; i += PANIC_TIMER_STEP) {
touch_softlockup_watchdog();
--
2.39.2
From: Sebastian Andrzej Siewior <[email protected]>
Note: This change only applies to 32bit architectures. On 64bit
architectures the macros are NOPs.
__ulseq_to_u64seq() computes the upper 32 bits of the passed
argument value (@ulseq). The upper bits are derived from a base
value (@rb_next_seq) in a way that assumes @ulseq represents a
64bit number that is less than or equal to @rb_next_seq.
Until now this mapping has been correct for all call sites. However,
in a follow-up commit, values of @ulseq will be passed in that are
higher than the base value. This requires a change to how the 32bit
value is mapped to a 64bit sequence number.
Rather than mapping @ulseq such that the base value is the end of a
32bit block, map @ulseq such that the base value is in the middle of
a 32bit block. This allows supporting 31 bits before and after the
base value, which is deemed acceptable for the console sequence
number during runtime.
Here is an example to illustrate the previous and new mappings.
For a base value (@rb_next_seq) of 2 2000 0000...
Before this change the range of possible return values was:
1 2000 0001 to 2 2000 0000
__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 1 2000 0001
__ulseq_to_u64seq(9fff ffff) => 1 9fff ffff
__ulseq_to_u64seq(a000 0000) => 1 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001
After this change the range of possible return values are:
1 a000 0001 to 2 a000 0000
__ulseq_to_u64seq(1fff ffff) => 2 1fff ffff
__ulseq_to_u64seq(2000 0000) => 2 2000 0000
__ulseq_to_u64seq(2000 0001) => 2 2000 0001
__ulseq_to_u64seq(9fff ffff) => 2 9fff ffff
__ulseq_to_u64seq(a000 0000) => 2 a000 0000
__ulseq_to_u64seq(a000 0001) => 1 a000 0001
[ john.ogness: Rewrite commit message. ]
Reported-by: Francesco Dolcini <[email protected]>
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/printk_ringbuffer.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index b82a96dc2ea2..12f60c782e46 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -407,7 +407,7 @@ static inline u64 __ulseq_to_u64seq(struct printk_ringbuffer *rb, u32 ulseq)
* Also the access to the ring buffer is always safe.
*/
rb_next_seq = prb_next_seq(rb);
- seq = rb_next_seq - ((u32)rb_next_seq - ulseq);
+ seq = rb_next_seq - (s32)((u32)rb_next_seq - ulseq);
return seq;
}
--
2.39.2
There is already panic_in_progress() and other_cpu_in_panic(),
but checking if the current CPU is the panic CPU must still be
open coded.
Add this_cpu_in_panic() to complete the set.
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/internal.h | 1 +
kernel/printk/printk.c | 43 +++++++++++++++++++++-------------------
2 files changed, 24 insertions(+), 20 deletions(-)
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 6c2afee5ef62..ac2d9750e5f8 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -130,6 +130,7 @@ struct printk_message {
};
bool other_cpu_in_panic(void);
+bool this_cpu_in_panic(void);
bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
bool is_extended, bool may_supress);
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 22bb45d8f2f2..82dc2c7949b7 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -347,6 +347,29 @@ static bool panic_in_progress(void)
return unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID);
}
+/* Return true if a panic is in progress on the current CPU. */
+bool this_cpu_in_panic(void)
+{
+ /*
+ * We can use raw_smp_processor_id() here because it is impossible for
+ * the task to be migrated to the panic_cpu, or away from it. If
+ * panic_cpu has already been set, and we're not currently executing on
+ * that CPU, then we never will be.
+ */
+ return unlikely(atomic_read(&panic_cpu) == raw_smp_processor_id());
+}
+
+/*
+ * Return true if a panic is in progress on a remote CPU.
+ *
+ * On true, the local CPU should immediately release any printing resources
+ * that may be needed by the panic CPU.
+ */
+bool other_cpu_in_panic(void)
+{
+ return (panic_in_progress() && !this_cpu_in_panic());
+}
+
/*
* This is used for debugging the mess that is the VT code by
* keeping track if we have the console semaphore held. It's
@@ -2592,26 +2615,6 @@ static int console_cpu_notify(unsigned int cpu)
return 0;
}
-/*
- * Return true if a panic is in progress on a remote CPU.
- *
- * On true, the local CPU should immediately release any printing resources
- * that may be needed by the panic CPU.
- */
-bool other_cpu_in_panic(void)
-{
- if (!panic_in_progress())
- return false;
-
- /*
- * We can use raw_smp_processor_id() here because it is impossible for
- * the task to be migrated to the panic_cpu, or away from it. If
- * panic_cpu has already been set, and we're not currently executing on
- * that CPU, then we never will be.
- */
- return atomic_read(&panic_cpu) != raw_smp_processor_id();
-}
-
/**
* console_lock - block the console subsystem from printing
*
--
2.39.2
Commit 13fb0f74d702 ("printk: Avoid livelock with heavy printk
during panic") introduced a mechanism to silence non-panic CPUs
if too many messages are being dropped. Aside from trying to
workaround the livelock bugs of legacy consoles, it was also
intended to avoid losing panic messages. However, if non-panic
CPUs are writing to the ringbuffer, then reacting to dropped
messages is too late.
Another motivation is that non-finalized messages already might
be skipped in panic(). In other words, random messages from
non-panic CPUs might already get lost. It is better to ignore
all to avoid confusion.
To avoid losing panic CPU messages, silence non-panic CPUs
immediately on panic.
Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
---
kernel/printk/printk.c | 26 ++++++--------------------
1 file changed, 6 insertions(+), 20 deletions(-)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index cb99c854a648..1685a71f3f71 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -462,12 +462,6 @@ static int console_msg_format = MSG_FORMAT_DEFAULT;
static DEFINE_MUTEX(syslog_lock);
#ifdef CONFIG_PRINTK
-/*
- * During panic, heavy printk by other CPUs can delay the
- * panic and risk deadlock on console resources.
- */
-static int __read_mostly suppress_panic_printk;
-
DECLARE_WAIT_QUEUE_HEAD(log_wait);
/* All 3 protected by @syslog_lock. */
/* the next printk record to read by syslog(READ) or /proc/kmsg */
@@ -2322,7 +2316,12 @@ asmlinkage int vprintk_emit(int facility, int level,
if (unlikely(suppress_printk))
return 0;
- if (unlikely(suppress_panic_printk) && other_cpu_in_panic())
+ /*
+ * The messages on the panic CPU are the most important. If
+ * non-panic CPUs are generating any messages, they will be
+ * silently dropped.
+ */
+ if (other_cpu_in_panic())
return 0;
if (level == LOGLEVEL_SCHED) {
@@ -2799,8 +2798,6 @@ void console_prepend_dropped(struct printk_message *pmsg, unsigned long dropped)
bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
bool is_extended, bool may_suppress)
{
- static int panic_console_dropped;
-
struct printk_buffers *pbufs = pmsg->pbufs;
const size_t scratchbuf_sz = sizeof(pbufs->scratchbuf);
const size_t outbuf_sz = sizeof(pbufs->outbuf);
@@ -2828,17 +2825,6 @@ bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
pmsg->seq = r.info->seq;
pmsg->dropped = r.info->seq - seq;
- /*
- * Check for dropped messages in panic here so that printk
- * suppression can occur as early as possible if necessary.
- */
- if (pmsg->dropped &&
- panic_in_progress() &&
- panic_console_dropped++ > 10) {
- suppress_panic_printk = 1;
- pr_warn_once("Too many dropped messages. Suppress messages on non-panic CPUs to prevent livelock.\n");
- }
-
/* Skip record that has level above the console loglevel. */
if (may_suppress && suppress_message_printing(r.info->level))
goto out;
--
2.39.2
On Wed 2024-02-07 14:46:53, John Ogness wrote:
> Commit f244b4dc53e5 ("printk: ringbuffer: Improve
> prb_next_seq() performance") introduced an optimization for
> prb_next_seq() by using best-effort to track recently finalized
> records. However, the order of finalization does not
> necessarily match the order of the records. The optimization
> changed prb_next_seq() to return inconsistent results, possibly
> yielding sequence numbers that are not available to readers
> because they are preceded by non-finalized records or they are
> not yet visible to the reader CPU.
>
> Rather than simply best-effort tracking recently finalized
> records, force the committing writer to read records and
> increment the last "contiguous block" of finalized records. In
> order to do this, the sequence number instead of ID must be
> stored because ID's cannot be directly compared.
>
> A new memory barrier pair is introduced to guarantee that a
> reader can always read the records up until the sequence number
> returned by prb_next_seq() (unless the records have since
> been overwritten in the ringbuffer).
>
> This restores the original functionality of prb_next_seq()
> while also keeping the optimization.
>
> For 32bit systems, only the lower 32 bits of the sequence
> number are stored. When reading the value, it is expanded to
> the full 64bit sequence number using the 32bit seq macros,
> which fold in the value returned by prb_first_seq().
>
> Fixes: f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq() performance")
> Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Best Regards,
Petr
Hi, I'm on vacation. I'll get back to you when I'm done with this vacation.
I'll be back on 2.18.
Thanks.
On Wed 2024-02-07 14:47:03, John Ogness wrote:
> dump_stack() is called in panic(). If for some reason another CPU
> is holding the printk_cpu_sync and is unable to release it, the
> panic CPU will be unable to continue and print the stacktrace.
>
> Since non-panic CPUs are not allowed to store new printk messages
> anyway, there is no need to synchronize the stacktrace output in
> a panic situation.
>
> For the panic CPU, do not get the printk_cpu_sync because it is
> not needed and avoids a potential deadlock scenario in panic().
>
> Link: https://lore.kernel.org/lkml/ZcIGKU8sxti38Kok@alley
> Signed-off-by: John Ogness <[email protected]>
Makes sense.
Reviewed-by: Petr Mladek <[email protected]>
Best Regards,
Petr
On Wed 2024-02-07 14:46:49, John Ogness wrote:
> Hi,
>
> While testing various flushing scenarios, I stumbled on a few
> issues that cause console flushing to fail. While at LPC2023 in
> Richmond, I sat down with Petr Mladek and we reviewed the
> v2 [0] series. This series is the result of that offline
> discussion. v3 is here [1].
>
> This series addresses the following issues:
>
> 1. The prb_next_seq() optimization caused inconsistent return
> values. Fix prb_next_seq() to the originally intended
> behavior but keep an optimization.
>
> 2. pr_flush() might not wait until the most recently stored
> printk() message if non-finalized records precede it. Fix
> pr_flush() to wait for all records to print that are at
> least reserved at the time of the call.
>
> 3. In panic, the panic messages will not print if non-finalized
> records precede them. Add a special condition so that
> readers on the panic CPU will drop records that are not in
> a consistent state.
>
> 4. It is possible (and easy to reproduce) a scenario where the
> console on the panic CPU hands over to a waiter of a stopped
> CPU. Do not use the handover feature in panic.
>
> 5. If messages are being dropped during panic, non-panic CPUs
> are silenced. But by then it is already too late and most
> likely the panic messages have been dropped. Change the
> non-panic CPU silencing logic to _immediately_ silence
> non-panic CPUs during panic. This also leads to clean panic
> output when many CPUs are blasting the kernel log.
>
> 6. If a panic occurs in a context where printk() calls defer
> printing (NMI or printk_safe section), the printing of the
> final panic messages rely on irq_work. If that mechanism is
> not available, the final panic messages are not seen (even
> though they are finalized in the ringbuffer). Add one last
> explicit flush after all printk() calls are finished to
> ensure all available messages in the kernel log are printed.
>
> 7. When dumping the stacktrace from panic(), do not use the
> printk_cpu_sync because it can deadlock if another CPU holds
> and is unable to release the printk_cpu_sync.
>
> This series also performs some minor cleanups to remove open
> coded checks about the panic context and improve documentation
> language regarding data-less records.
>
> Because of multiple refactoring done in recent history, it
> would be helpful to provide the LTS maintainers with the proper
> backported patches. I am happy to do this.
The series seems to be ready linux-next from my POV. I am going
to push it there so that it gets as much testing before
the merge window as possible.
Best Regards,
Petr
On Wed 2024-02-07 17:18:38, Petr Mladek wrote:
> On Wed 2024-02-07 14:46:49, John Ogness wrote:
> > Hi,
> >
> > While testing various flushing scenarios, I stumbled on a few
> > issues that cause console flushing to fail. While at LPC2023 in
> > Richmond, I sat down with Petr Mladek and we reviewed the
> > v2 [0] series. This series is the result of that offline
> > discussion. v3 is here [1].
> >
> > This series addresses the following issues:
> >
> > 1. The prb_next_seq() optimization caused inconsistent return
> > values. Fix prb_next_seq() to the originally intended
> > behavior but keep an optimization.
> >
> > 2. pr_flush() might not wait until the most recently stored
> > printk() message if non-finalized records precede it. Fix
> > pr_flush() to wait for all records to print that are at
> > least reserved at the time of the call.
> >
> > 3. In panic, the panic messages will not print if non-finalized
> > records precede them. Add a special condition so that
> > readers on the panic CPU will drop records that are not in
> > a consistent state.
> >
> > 4. It is possible (and easy to reproduce) a scenario where the
> > console on the panic CPU hands over to a waiter of a stopped
> > CPU. Do not use the handover feature in panic.
> >
> > 5. If messages are being dropped during panic, non-panic CPUs
> > are silenced. But by then it is already too late and most
> > likely the panic messages have been dropped. Change the
> > non-panic CPU silencing logic to _immediately_ silence
> > non-panic CPUs during panic. This also leads to clean panic
> > output when many CPUs are blasting the kernel log.
> >
> > 6. If a panic occurs in a context where printk() calls defer
> > printing (NMI or printk_safe section), the printing of the
> > final panic messages rely on irq_work. If that mechanism is
> > not available, the final panic messages are not seen (even
> > though they are finalized in the ringbuffer). Add one last
> > explicit flush after all printk() calls are finished to
> > ensure all available messages in the kernel log are printed.
> >
> > 7. When dumping the stacktrace from panic(), do not use the
> > printk_cpu_sync because it can deadlock if another CPU holds
> > and is unable to release the printk_cpu_sync.
> >
> > This series also performs some minor cleanups to remove open
> > coded checks about the panic context and improve documentation
> > language regarding data-less records.
> >
> > Because of multiple refactoring done in recent history, it
> > would be helpful to provide the LTS maintainers with the proper
> > backported patches. I am happy to do this.
>
> The series seems to be ready linux-next from my POV. I am going
> to push it there so that it gets as much testing before
> the merge window as possible.
JFYI, it has been committed into printk/linux.git,
branch rework/console-flushing-fixes.
Let's keep our fingers crossed.
Best Regards,
Petr