Hi,
While testing various panic scenarios using legacy consoles, I
stumbled on a couple of issues that cause the panic backtrace not
to be printed:
1. In panic, an unfinalized record blocks printing of finalized
records that may follow.
2. It is possible (and easy to reproduce) to hit a scenario where
the console owner on the panic CPU hands the lock over to a waiter
on a stopped CPU.
This series addresses these 2 issues while also performing some
minor cleanups to remove open-coded checks of the panic
context.
Because of the multiple refactorings done in recent history, it
would be helpful to provide the LTS maintainers with properly
backported patches. I am happy to do this.
John Ogness
John Ogness (4):
printk: For @suppress_panic_printk check other panic
printk: Add this_cpu_in_panic()
printk: Skip unfinalized records in panic
printk: Ignore waiter on panic
kernel/printk/internal.h | 1 +
kernel/printk/printk.c | 64 +++++++++++++++++++------------
kernel/printk/printk_ringbuffer.c | 9 +++--
3 files changed, 45 insertions(+), 29 deletions(-)
base-commit: 29fda1ad2a64a62e1c51d61207396e2de1c67362
--
2.39.2
Currently @suppress_panic_printk is checked along with
non-matching @panic_cpu and current CPU. This works
because @suppress_panic_printk is only set when
panic_in_progress() is true.
Rather than relying on the @suppress_panic_printk semantics,
use the concise helper function other_cpu_in_panic(). The
helper function exists to avoid open coding such tests.
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 80eb79177bab..abb6a50ee26f 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2272,8 +2272,7 @@ asmlinkage int vprintk_emit(int facility, int level,
if (unlikely(suppress_printk))
return 0;
- if (unlikely(suppress_panic_printk) &&
- atomic_read(&panic_cpu) != raw_smp_processor_id())
+ if (unlikely(suppress_panic_printk) && other_cpu_in_panic())
return 0;
if (level == LOGLEVEL_SCHED) {
--
2.39.2
There is already panic_in_progress() and other_cpu_in_panic(),
but checking if the current CPU is the panic CPU must still be
open coded.
Add this_cpu_in_panic() to complete the set.
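For illustration, the relationship between the three predicates can be
modeled in a stand-alone user-space sketch. The names mirror the kernel
helpers, but the atomics and CPU id are stubbed out here, so this is an
assumption-laden toy, not kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PANIC_CPU_INVALID (-1)

static atomic_int panic_cpu = PANIC_CPU_INVALID;

/* Stand-in for raw_smp_processor_id(); fixed to CPU 0 for the sketch. */
static int raw_smp_processor_id(void)
{
	return 0;
}

static bool panic_in_progress(void)
{
	return atomic_load(&panic_cpu) != PANIC_CPU_INVALID;
}

static bool this_cpu_in_panic(void)
{
	return atomic_load(&panic_cpu) == raw_smp_processor_id();
}

static bool other_cpu_in_panic(void)
{
	return panic_in_progress() && !this_cpu_in_panic();
}

/* Exercise the helpers across the three possible states. */
static bool model_ok(void)
{
	atomic_store(&panic_cpu, PANIC_CPU_INVALID);
	if (panic_in_progress() || this_cpu_in_panic() || other_cpu_in_panic())
		return false;

	atomic_store(&panic_cpu, 0);	/* this CPU (id 0) panics */
	if (!panic_in_progress() || !this_cpu_in_panic() || other_cpu_in_panic())
		return false;

	atomic_store(&panic_cpu, 1);	/* some other CPU panics */
	return panic_in_progress() && !this_cpu_in_panic() && other_cpu_in_panic();
}
```

The point of the new helper is exactly this completeness: every
combination of "who is panicking" can now be expressed without open
coding the @panic_cpu comparison.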
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/internal.h | 1 +
kernel/printk/printk.c | 43 +++++++++++++++++++++-------------------
2 files changed, 24 insertions(+), 20 deletions(-)
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 7d4979d5c3ce..2397734c0fd2 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -105,3 +105,4 @@ struct printk_message {
};
bool other_cpu_in_panic(void);
+bool this_cpu_in_panic(void);
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index abb6a50ee26f..19b752880879 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -351,6 +351,29 @@ static bool panic_in_progress(void)
return unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID);
}
+/* Return true if a panic is in progress on the current CPU. */
+bool this_cpu_in_panic(void)
+{
+ /*
+ * We can use raw_smp_processor_id() here because it is impossible for
+ * the task to be migrated to the panic_cpu, or away from it. If
+ * panic_cpu has already been set, and we're not currently executing on
+ * that CPU, then we never will be.
+ */
+ return unlikely(atomic_read(&panic_cpu) == raw_smp_processor_id());
+}
+
+/*
+ * Return true if a panic is in progress on a remote CPU.
+ *
+ * On true, the local CPU should immediately release any printing resources
+ * that may be needed by the panic CPU.
+ */
+bool other_cpu_in_panic(void)
+{
+ return (panic_in_progress() && !this_cpu_in_panic());
+}
+
/*
* This is used for debugging the mess that is the VT code by
* keeping track if we have the console semaphore held. It's
@@ -2610,26 +2633,6 @@ static int console_cpu_notify(unsigned int cpu)
return 0;
}
-/*
- * Return true if a panic is in progress on a remote CPU.
- *
- * On true, the local CPU should immediately release any printing resources
- * that may be needed by the panic CPU.
- */
-bool other_cpu_in_panic(void)
-{
- if (!panic_in_progress())
- return false;
-
- /*
- * We can use raw_smp_processor_id() here because it is impossible for
- * the task to be migrated to the panic_cpu, or away from it. If
- * panic_cpu has already been set, and we're not currently executing on
- * that CPU, then we never will be.
- */
- return atomic_read(&panic_cpu) != raw_smp_processor_id();
-}
-
/**
* console_lock - block the console subsystem from printing
*
--
2.39.2
Commit d51507098ff91 ("printk: disable optimistic spin during
panic") added checks to avoid becoming a console waiter if a
panic is in progress. However, the transition to panic can occur
while there is already a waiter. If the panic occurred in a
context that does not support printing from the printk() caller
context, the waiter CPU may become stopped while still stored as
@console_waiter. Then when console_flush_on_panic() is called,
the panic CPU will see @console_waiter and hand over to the
stopped CPU.
Here is a simple example:
CPU0 CPU1
---- ----
console_lock_spinning_enable()
console_trylock_spinning()
[set as console waiter]
NMI: panic()
panic_other_cpus_shutdown()
[stopped as console waiter]
console_flush_on_panic()
console_lock_spinning_enable()
[print 1 record]
console_lock_spinning_disable_and_check()
[handover to stopped CPU1]
This results in panic() not flushing the panic messages.
Fix this by ignoring any console waiter if the panic CPU is
printing.
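The fixed decision can be sketched as a stand-alone toy model. The
function names echo the kernel ones, but the return convention and the
globals are simplifications invented for this sketch, not the real
console_lock_spinning_disable_and_check():

```c
#include <assert.h>
#include <stdbool.h>

static bool console_waiter;
static int panic_cpu = -1;	/* -1 models PANIC_CPU_INVALID */
static int current_cpu;

static bool this_cpu_in_panic(void)
{
	return panic_cpu == current_cpu;
}

/* Returns 1 on handover to the waiter, 0 when the owner keeps the
 * lock. The panic CPU ignores any waiter, because that waiter may be
 * a CPU that has already been stopped. */
static int spinning_disable_and_check(void)
{
	if (!console_waiter || this_cpu_in_panic())
		return 0;	/* keep printing on this CPU */
	return 1;		/* hand the lock over to the waiter */
}

/* Helper to run one scenario for the assertions below. */
static int scenario(bool waiter, int pcpu, int cpu)
{
	console_waiter = waiter;
	panic_cpu = pcpu;
	current_cpu = cpu;
	return spinning_disable_and_check();
}
```

With no panic in progress the handover behaves as before; once the
current CPU is the panic CPU, a stale waiter can no longer steal the
lock.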
Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes")
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 56d9b4acbbf2..cd6493f12970 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1904,7 +1904,8 @@ static int console_lock_spinning_disable_and_check(int cookie)
console_owner = NULL;
raw_spin_unlock(&console_owner_lock);
- if (!waiter) {
+ /* Waiters are ignored by the panic CPU. */
+ if (!waiter || this_cpu_in_panic()) {
spin_release(&console_owner_dep_map, _THIS_IP_);
return 0;
}
--
2.39.2
Normally a reader will stop once reaching an unfinalized record.
However, when a panic happens, writers from other CPUs (or an
interrupted context on the panic CPU) may have been writing a
record and were unable to finalize it. The panic CPU will
reserve/commit/finalize its panic records, but these will be
located after the unfinalized records. This results in panic()
not flushing the panic messages.
Add a special case to printk_get_next_message() to allow
skipping over unfinalized records if on the panic CPU.
Also refine the documentation of the ringbuffer reading
functions to clarify that failure may also be due to an
unfinalized record.
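The read loop can be sketched as a stand-alone user-space model. The
record states and the prb_next_seq() behavior here are simplified
assumptions invented for the sketch, not the real ringbuffer code:

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_RECORDS 8

struct record {
	bool exists;
	bool finalized;
};

/* seq 0: finalized, seq 1: interrupted writer (never finalized),
 * seq 2: the finalized panic message. */
static struct record ring[NUM_RECORDS] = {
	{ true, true },
	{ true, false },
	{ true, true },
};

static bool in_panic;

/* Model of prb_read_valid(): only finalized records are readable. */
static bool read_valid(unsigned long long seq)
{
	return seq < NUM_RECORDS && ring[seq].exists && ring[seq].finalized;
}

/* Model of prb_next_seq(): one past the newest finalized record. It
 * can point beyond unfinalized holes, which is what makes the skip
 * in the loop below possible. */
static unsigned long long next_seq(void)
{
	unsigned long long n = 0;

	for (unsigned long long seq = 0; seq < NUM_RECORDS; seq++) {
		if (ring[seq].finalized)
			n = seq + 1;
	}
	return n;
}

/* Model of the fixed read loop in printk_get_next_message(): returns
 * the sequence number read, or -1 when nothing readable remains. */
static long long get_next(unsigned long long seq)
{
	while (!read_valid(seq)) {
		if (in_panic && seq < next_seq())
			seq++;	/* panic CPU: skip the unfinalized record */
		else
			return -1;
	}
	return (long long)seq;
}

static bool normal_reader_stops(void)
{
	in_panic = false;
	return get_next(1) == -1;
}

static bool panic_reader_skips(void)
{
	in_panic = true;
	return get_next(1) == 2;
}
```

A normal reader still stops at the hole left by the interrupted writer;
only the panic CPU steps over it to reach the panic message.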
Fixes: 896fbe20b4e2 ("printk: use the lockless ringbuffer")
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 15 +++++++++++++--
kernel/printk/printk_ringbuffer.c | 9 +++++----
2 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 19b752880879..56d9b4acbbf2 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2813,8 +2813,19 @@ static bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
else
prb_rec_init_rd(&r, &info, outbuf, outbuf_sz);
- if (!prb_read_valid(prb, seq, &r))
- return false;
+ while (!prb_read_valid(prb, seq, &r)) {
+ if (this_cpu_in_panic() && seq < prb_next_seq(prb)) {
+ /*
+ * The record @seq is not finalized and there may be
+ * more records in the ringbuffer. Since this is the
+ * panic CPU, skip over the unfinalized record and
+ * try to read a finalized record that may follow.
+ */
+ seq++;
+ } else {
+ return false;
+ }
+ }
pmsg->seq = r.info->seq;
pmsg->dropped = r.info->seq - seq;
diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 2dc4d5a1f1ff..1bbc008109ef 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1876,8 +1876,9 @@ static u64 prb_first_seq(struct printk_ringbuffer *rb)
}
/*
- * Non-blocking read of a record. Updates @seq to the last finalized record
- * (which may have no data available).
+ * Non-blocking read of a record. Updates @seq to the record that was read
+ * (which may have no data available) or was attempted to be read (in case
+ * it was unfinalized or non-existent).
*
* See the description of prb_read_valid() and prb_read_valid_info()
* for details.
@@ -1932,7 +1933,7 @@ static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
* On success, the reader must check r->info.seq to see which record was
* actually read. This allows the reader to detect dropped records.
*
- * Failure means @seq refers to a not yet written record.
+ * Failure means @seq refers to a not yet finalized or non-existing record.
*/
bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
struct printk_record *r)
@@ -1962,7 +1963,7 @@ bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
* On success, the reader must check info->seq to see which record meta data
* was actually read. This allows the reader to detect dropped records.
*
- * Failure means @seq refers to a not yet written record.
+ * Failure means @seq refers to a not yet finalized or non-existing record.
*/
bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
struct printk_info *info, unsigned int *line_count)
--
2.39.2
Oops. This series is not a v2. It is v1.
John
On Fri 2023-10-13 22:49:37, John Ogness wrote:
> Currently @suppress_panic_printk is checked along with
> non-matching @panic_cpu and current CPU. This works
> because @suppress_panic_printk is only set when
> panic_in_progress() is true.
>
> Rather than relying on the @suppress_panic_printk semantics,
> use the concise helper function other_cpu_in_panic(). The
> helper function exists to avoid open coding such tests.
>
> Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Best Regards,
Petr
On Fri 2023-10-13 22:49:38, John Ogness wrote:
> There is already panic_in_progress() and other_cpu_in_panic(),
> but checking if the current CPU is the panic CPU must still be
> open coded.
>
> Add this_cpu_in_panic() to complete the set.
>
> Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Best Regards,
Petr
On Fri 2023-10-13 22:49:39, John Ogness wrote:
> Normally a reader will stop once reaching an unfinalized record.
> However, when a panic happens, writers from other CPUs (or an
> interrupted context on the panic CPU) may have been writing a
> record and were unable to finalize it. The panic CPU will
> reserve/commit/finalize its panic records, but these will be
> located after the unfinalized records. This results in panic()
> not flushing the panic messages.
>
> Add a special case to printk_get_next_message() to allow
> skipping over unfinalized records if on the panic CPU.
>
> Also refine the documentation of the ringbuffer reading
> functions to clarify that failure may also be due to an
> unfinalized record.
Great catch!
It took me some time to think about various possible scenarios.
As a result, I suggest to update some comments, see below.
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2813,8 +2813,19 @@ static bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
> else
> prb_rec_init_rd(&r, &info, outbuf, outbuf_sz);
>
> - if (!prb_read_valid(prb, seq, &r))
> - return false;
> + while (!prb_read_valid(prb, seq, &r)) {
> + if (this_cpu_in_panic() && seq < prb_next_seq(prb)) {
> + /*
> + * The record @seq is not finalized and there may be
"may be" is a bit misleading. If I count it correctly then there
"are" more records when seq < prb_next_seq().
> + * more records in the ringbuffer. Since this is the
> + * panic CPU, skip over the unfinalized record and
> + * try to read a finalized record that may follow.
> + */
It took me quite some time to understand whether this is right or not.
It is partly because the comment describes what the code does but it
does not describe the motivation and why it is safe. I wonder
if the following is better:
/*
* Skip non-finalized records when emitting messages
* from panic CPU. They might never get finalized.
* Note that new messages printed on panic CPU
* are finalized when we are here.
*/
But wait. Are the messages printed in panic context always finalized?
What about messages without the trailing newline printed?
Aha, they actually are finalized because prb_next_seq() would return sequence
number of the record in "desc_committed" state when there is
a message without newline and we skip only seq < prb_next_seq().
So I would update the comment, something like:
/*
* Skip non-finalized records when emitting messages
* from panic CPU. They might never get finalized.
*
* Note that new messages printed on panic CPU are
* finalized when we are here. The only exception
* might be the last message without trailing newline.
* But it would have the sequence number returned
* by prb_next_seq().
*/
Sigh, it is long. But this is a quite tricky situation.
> + seq++;
> + } else {
> + return false;
> + }
> + }
>
> pmsg->seq = r.info->seq;
> pmsg->dropped = r.info->seq - seq;
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> index 2dc4d5a1f1ff..1bbc008109ef 100644
> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -1876,8 +1876,9 @@ static u64 prb_first_seq(struct printk_ringbuffer *rb)
> }
>
> /*
> - * Non-blocking read of a record. Updates @seq to the last finalized record
> - * (which may have no data available).
> + * Non-blocking read of a record. Updates @seq to the record that was read
> + * (which may have no data available) or was attempted to be read (in case
> + * it was unfinalized or non-existent).
I am confused. Well, even the original message was confusing.
I think about describing it the following way.
* On input, @seq defines the record which should be read. It might
* be updated to a higher value when the requested record has already
* been overwritten or when the record had empty data.
*
* On return, @seq value depends on the situation. It is:
*
* - sequence number of the read record on success.
* - sequence number of the first found to-be-finalized record
* when the input seq number was lower or equal to prb_next_seq().
* - the original value when @seq was invalid, bigger than prb_next_seq().
Sigh, the comment is hairy. Maybe you would find a more elegant way
to describe the variants.
BTW: The support for data-less records was added by the commit
("printk: ringbuffer: support dataless records"). It was
needed to handle empty lines: printk("\n"). It is strange
that we skip them instead of printing the empty line.
It looks like a small bug. I might look at it if you agree
that it looks like a bug.
> *
> * See the description of prb_read_valid() and prb_read_valid_info()
> * for details.
Also I would update the above prb_next_seq():
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -2012,8 +2012,8 @@ u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
* available records should be skipped.
*
* Context: Any context.
- * Return: The sequence number of the next newest (not yet available) record
- * for readers.
+ * Return: The sequence number of the next newest (not yet finalized or
+ * non-existing) record for readers.
*/
u64 prb_next_seq(struct printk_ringbuffer *rb)
{
On 2023-10-17, Petr Mladek <[email protected]> wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -2813,8 +2813,19 @@ static bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
>> else
>> prb_rec_init_rd(&r, &info, outbuf, outbuf_sz);
>>
>> - if (!prb_read_valid(prb, seq, &r))
>> - return false;
>> + while (!prb_read_valid(prb, seq, &r)) {
>> + if (this_cpu_in_panic() && seq < prb_next_seq(prb)) {
>> + /*
>> + * The record @seq is not finalized and there may be
>
> "may be" is a bit misleading. If I count it correctly then there
> "are" more records when seq < prb_next_seq().
Ack.
> But wait. Are the messages printed in panic context always finalized?
> What about messages without the trailing newline printed?
>
> Aha, they actually are finalized because prb_next_seq() would return sequence
> number of the record in "desc_committed" state when there is
> a message without newline and we skip only seq < prb_next_seq().
> So I would update the comment, something like:
>
> /*
> * Skip non-finalized records when emitting messages
> * from panic CPU. They might never get finalized.
> *
> * Note that new messages printed on panic CPU are
> * finalized when we are here. The only exception
> * might be the last message without trailing newline.
> * But it would have the sequence number returned
> * by prb_next_seq().
> */
>
> Sigh, it is long. But this is a quite tricky situation.
OK.
>> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
>> index 2dc4d5a1f1ff..1bbc008109ef 100644
>> --- a/kernel/printk/printk_ringbuffer.c
>> +++ b/kernel/printk/printk_ringbuffer.c
>> @@ -1876,8 +1876,9 @@ static u64 prb_first_seq(struct printk_ringbuffer *rb)
>> }
>>
>> /*
>> - * Non-blocking read of a record. Updates @seq to the last finalized record
>> - * (which may have no data available).
>> + * Non-blocking read of a record. Updates @seq to the record that was read
>> + * (which may have no data available) or was attempted to be read (in case
>> + * it was unfinalized or non-existent).
>
> I am confused. Well, even the original message was confusing.
> I think about describing it the following way.
>
> * On input, @seq defines the record which should be read. It might
> * be updated to a higher value when the requested record has already
> * been overwritten or when the record had empty data.
> *
> * On return, @seq value depends on the situation. It is:
> *
> * - sequence number of the read record on success.
> * - sequence number of the first found to-be-finalized record
> * when the input seq number was lower or equal to prb_next_seq().
> * - the original value when @seq was invalid, bigger then prb_next_seq().
>
> Sigh, the comment is hairy. Maybe you would find a more elegant way
> to describe the variants.
Be careful. prb_next_seq() is only loosely related to
_prb_read_valid(). I would not mention prb_next_seq() when describing
_prb_read_valid(). There are situations where _prb_read_valid() could
successfully read a record with a higher sequence number than
prb_next_seq() would return. This is because prb_next_seq() is only best
effort.
For panic it is sufficient because panic() will commit finalized records
after having stopped all other CPUs, so it will definitely update
@prb->desc_ring.last_finalized_id and allow prb_next_seq() to point to
the end of the panic messages. But for non-panic that is not the case.
I do not have a problem understanding my version of the comments. Note
that it is just a brief internal comment. It also says:
* See the description of prb_read_valid() and prb_read_valid_info()
* for details.
And if you look at the kerneldoc of either of those functions you see:
* On success, the reader must check r->info.seq to see which record was
* actually read. This allows the reader to detect dropped records.
*
* Failure means @seq refers to a not yet finalized or non-existing record.
Also note that @seq is never passed by reference from the external
caller. It is only passed by reference to the helper function
_prb_read_valid().
> BTW: The support for data-less records were added by the commit
> ("printk: ringbuffer: support dataless records"). It was
> needed to handle empty lines: printk("\n"). It is strange
> that we skip them instead of printing the empty line.
We do not skip them. That was the whole point of adding support for
data-less records. ;-)
get_data() returns "" and @data_size=0
copy_data() returns true (but does not copy any data)
prb_read() returns true (we are assuming it is finalized)
_prb_read_valid() returns true
prb_read_valid() return true
record_print_text() creates a string with prefix and adds "\n"
printk_get_next_message() returns something to print
> Also I would update the above prb_next_seq():
>
> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -2012,8 +2012,8 @@ u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
> * available records should be skipped.
> *
> * Context: Any context.
> - * Return: The sequence number of the next newest (not yet available) record
> - * for readers.
> + * Return: The sequence number of the next newest (not yet finalized or
> + * non-existing) record for readers.
Ack.
John Ogness
On 2023-10-17, John Ogness <[email protected]> wrote:
>>> - if (!prb_read_valid(prb, seq, &r))
>>> - return false;
>>> + while (!prb_read_valid(prb, seq, &r)) {
>>> + if (this_cpu_in_panic() && seq < prb_next_seq(prb)) {
>>> + /*
>>> + * The record @seq is not finalized and there may be
I am wondering if we should put the this_cpu_in_panic() check into
_prb_read_valid() instead. This has the advantage that it could more
intelligently skip non-finalized records and not need to rely on
prb_next_seq(). Also, it would allow any other readers (kmsg_dump) to
skip non-finalized records during panic.
Looking over the code, it would be easy to extend
desc_read_finalized_seq() to return a separate error code for exists but
is non-finalized. This would make it trivial for _prb_read_valid() to
skip over it when this_cpu_in_panic().
John
On Fri 2023-10-13 22:49:40, John Ogness wrote:
> Commit d51507098ff91 ("printk: disable optimistic spin during
> panic") added checks to avoid becoming a console waiter if a
> panic is in progress. However, the transition to panic can occur
> while there is already a waiter. If the panic occurred in a
> context that does not support printing from the printk() caller
> context, the waiter CPU may become stopped while still stored as
> @console_waiter.
I guess that "context that does not support printing" means NMI,
or printk_safe when the console handling is deferred.
Another scenario is when the current owner gets stopped
or blocked so that it actually can't pass the lock to
the waiter.
> the panic CPU will see @console_waiter and handover to the
> stopped CPU.
>
> Here a simple example:
>
> CPU0 CPU1
> ---- ----
> console_lock_spinning_enable()
> console_trylock_spinning()
> [set as console waiter]
> NMI: panic()
> panic_other_cpus_shutdown()
> [stopped as console waiter]
> console_flush_on_panic()
> console_lock_spinning_enable()
> [print 1 record]
> console_lock_spinning_disable_and_check()
> [handover to stopped CPU1]
>
> This results in panic() not flushing the panic messages.
Great catch!
> Fix this by ignoring any console waiter if the panic CPU is
> printing.
>
> Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes")
> Signed-off-by: John Ogness <[email protected]>
> ---
> kernel/printk/printk.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 56d9b4acbbf2..cd6493f12970 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1904,7 +1904,8 @@ static int console_lock_spinning_disable_and_check(int cookie)
> console_owner = NULL;
> raw_spin_unlock(&console_owner_lock);
>
> - if (!waiter) {
> + /* Waiters are ignored by the panic CPU. */
> + if (!waiter || this_cpu_in_panic()) {
> spin_release(&console_owner_dep_map, _THIS_IP_);
> return 0;
> }
This seems to work.
Well, I have spent some time thinking about possible scenarios and I would
go even further. I would also block console_lock_spinning_enable().
Also I would do the checks before taking console_owner_lock.
It would make the behavior symmetric. And more importantly, it would prevent
possible deadlocks caused by console_owner_lock.
See the proposed patch below. I created it to see the changes in the
code. Also I added info about possible scenarios and effects.
Here we go:
From 2b6e2b58df737e3a3e0f86134a53f560539be813 Mon Sep 17 00:00:00 2001
From: Petr Mladek <[email protected]>
Date: Wed, 18 Oct 2023 10:04:17 +0200
Subject: [PATCH] printk: Disable passing console lock owner completely during
panic()
The commit d51507098ff91 ("printk: disable optimistic spin during panic")
added checks to avoid becoming a console waiter if a panic is in progress.
However, the transition to panic can occur while there is already a
waiter. The current owner should not pass the lock to the waiter
because it might get stopped or blocked at any time.
Also the panic context might pass the console lock owner to an already
stopped waiter by mistake. It might happen when console_flush_on_panic()
ignores the current lock owner, for example:
CPU0 CPU1
---- ----
console_lock_spinning_enable()
console_trylock_spinning()
[set as console waiter]
NMI: panic()
panic_other_cpus_shutdown()
[stopped as console waiter]
console_flush_on_panic()
console_lock_spinning_enable()
[print 1 record]
console_lock_spinning_disable_and_check()
[handover to stopped CPU1]
This results in panic() not flushing the panic messages.
Fix these problems by disabling all spinning operations completely
during panic().
Another advantage is that it prevents possible deadlocks caused by
"console_owner_lock". The panic() context does not need to take it
any longer. The lockless checks are safe because the functions become
NOPs when they see the panic in progress. All operations manipulating
the state are still synchronized by the lock even when non-panic CPUs
would notice the panic synchronously.
The current owner might stay spinning. But non-panic() CPUs would get
stopped anyway and the panic context will never start spinning.
Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes")
Signed-off-by: John Ogness <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
---
kernel/printk/printk.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 26a76b167ea6..fbe0dc1ca416 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1847,10 +1847,23 @@ static bool console_waiter;
*/
static void console_lock_spinning_enable(void)
{
+ /*
+ * Do not use spinning in panic(). The panic CPU wants to keep the lock.
+ * Non-panic CPUs abandon the flush anyway.
+ *
+ * Just keep the lockdep annotation. The panic-CPU should avoid
+ * taking console_owner_lock because it might cause a deadlock.
+ * This looks like the easiest way to prevent false lockdep
+ * reports without handling races a lockless way.
+ */
+ if (panic_in_progress())
+ goto lockdep;
+
raw_spin_lock(&console_owner_lock);
console_owner = current;
raw_spin_unlock(&console_owner_lock);
+lockdep:
/* The waiter may spin on us after setting console_owner */
spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
}
@@ -1875,6 +1888,22 @@ static int console_lock_spinning_disable_and_check(int cookie)
{
int waiter;
+ /*
+ * Ignore spinning waiters during panic() because they might get stopped
+ * or blocked at any time.
+ *
+ * It is safe because nobody is allowed to start spinning during panic
+ * in the first place. If there has been a waiter then non panic CPUs
+ * might stay spinning. They would get stopped anyway. The panic context
+ * will never start spinning and an interrupted spin on panic CPU will
+ * never continue.
+ */
+ if (panic_in_progress()) {
+ /* Keep lockdep happy. */
+ spin_release(&console_owner_dep_map, _THIS_IP_);
+ return 0;
+ }
+
raw_spin_lock(&console_owner_lock);
waiter = READ_ONCE(console_waiter);
console_owner = NULL;
--
2.35.3
On Tue 2023-10-17 23:31:25, John Ogness wrote:
> On 2023-10-17, Petr Mladek <[email protected]> wrote:
> >> --- a/kernel/printk/printk_ringbuffer.c
> >> +++ b/kernel/printk/printk_ringbuffer.c
> >> @@ -1876,8 +1876,9 @@ static u64 prb_first_seq(struct printk_ringbuffer *rb)
> >> }
> >>
> >> /*
> >> - * Non-blocking read of a record. Updates @seq to the last finalized record
> >> - * (which may have no data available).
> >> + * Non-blocking read of a record. Updates @seq to the record that was read
> >> + * (which may have no data available) or was attempted to be read (in case
> >> + * it was unfinalized or non-existent).
> >
> > I am confused. Well, even the original message was confusing.
> > I think about describing it the following way.
> >
> > * On input, @seq defines the record which should be read. It might
> > * be updated to a higher value when the requested record has already
> > * been overwritten or when the record had empty data.
> > *
> > * On return, @seq value depends on the situation. It is:
> > *
> > * - sequence number of the read record on success.
> > * - sequence number of the first found to-be-finalized record
> > * when the input seq number was lower or equal to prb_next_seq().
> > * - the original value when @seq was invalid, bigger then prb_next_seq().
> >
> > Sigh, the comment is hairy. Maybe you would find a more elegant way
> > to describe the variants.
>
> Be careful. prb_next_seq() is only loosely related to
> _prb_read_valid(). I would not mention prb_next_seq() when describing
> _prb_read_valid(). There are situations where _prb_read_valid() could
> successfully read a record with a higher sequence number than
> prb_next_seq() would return. This is because prb_next_seq() is only best
> effort.
I see. I misunderstood the meaning of prb_next_seq().
Well, I would not say that "prb_next_seq() is only best effort".
It has a well-defined meaning. It returns the sequence number
of the first not yet finalized record.
You are right, there might be many reserved records which are newer.
Some of these reserved records might get finalized before older ones.
This is why _prb_read_valid() might actually return a higher number
when anyone attempts to read records above prb_next_seq().
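To make this loose coupling concrete, here is a toy model (the ring,
its states, and the lagging bookkeeping are invented for the sketch,
not the real ringbuffer) where a newer record is finalized before an
older one:

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_RECORDS 8

/* seq 0..2 finalized, seq 3 still reserved, seq 4 finalized out of
 * order by a faster writer. */
static const bool finalized[NUM_RECORDS] = {
	true, true, true, false, true,
};

/* Models the "last finalized" bookkeeping lagging behind: it has not
 * been advanced past the hole at seq 3. */
static const unsigned long long last_finalized_seq = 2;

/* Model of prb_next_seq(): walk forward from the last recorded
 * finalization while records stay readable (best effort). */
static unsigned long long next_seq(void)
{
	unsigned long long seq = last_finalized_seq + 1;

	while (seq < NUM_RECORDS && finalized[seq])
		seq++;
	return seq;
}

/* Model of a direct read: any finalized record is readable, even one
 * with a higher sequence number than next_seq() reports. */
static bool read_valid(unsigned long long seq)
{
	return seq < NUM_RECORDS && finalized[seq];
}
```

In this model next_seq() stops at the hole (seq 3), yet the record at
seq 4 is already readable: a reader can succeed above prb_next_seq().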
Wait! This means that your patch actually does not work. Here is the diff:
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2815,8 +2815,19 @@ static bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
else
prb_rec_init_rd(&r, &info, outbuf, outbuf_sz);
- if (!prb_read_valid(prb, seq, &r))
- return false;
+ while (!prb_read_valid(prb, seq, &r)) {
+ if (this_cpu_in_panic() && seq < prb_next_seq(prb)) {
+ /*
+ * The record @seq is not finalized and there may be
+ * more records in the ringbuffer. Since this is the
+ * panic CPU, skip over the unfinalized record and
+ * try to read a finalized record that may follow.
+ */
+ seq++;
+ } else {
+ return false;
+ }
+ }
pmsg->seq = r.info->seq;
pmsg->dropped = r.info->seq - seq;
It skips the invalid reads only when seq < prb_next_seq(). But
prb_next_seq(prb) points to the 1st non-finalized record. And
all records with seq < prb_next_seq() must be finalized!
> For panic it is sufficient because panic() will commit finalized records
> after having stopped all other CPUs, so it will definitely update
> @prb->desc_ring.last_finalized_id and allow prb_next_seq() to point to
> the end of the panic messages. But for non-panic that is not the case.
>
> I do not have a problem understanding my version of the comments.
Maybe, "hairy" is not the right word. Well, from my POV, the comment
is not clear enough. It says that it updates "seq" but it does not say
when and how.
And it is even wrong. I still believe that dataless records are
quietly skipped. Here is the code:
static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
struct printk_record *r, unsigned int *line_count)
{
while ((err = prb_read(rb, *seq, r, line_count))) {
[...]
} else if (err == -ENOENT) {
/* Record exists, but no data available. Skip. */
(*seq)++;
} else {
[...]
}
[...]
}
Note that "seq" is incremented and the cycle continues. It means
that it tries to read the next record. It definitely does _not_
return "true" when it sees a dataless record.
=> it might modify "seq" and still return "false" when
the original "seq" pointed to a finalized dataless record.
> that it is just a brief internal comment. It also says:
>
> * See the description of prb_read_valid() and prb_read_valid_info()
> * for details.
This does not help because these two functions do not describe
how the seq is modified. And there are clearly three possibilities
when it returns false:
1. The record for the given seq number has not been finalized
yet. In this case, it keeps "seq" as it was.
2. The record for the given seq has been dataless. In this case
it would try to advance seq until it finds a record with
the data or not-yet-finalized one.
3. The given seq number has been invalid, above the head.
In this case, it keeps "seq" intact. So same as 1st case.
Note that "seq" was not updated in the 1st and 3rd case. But the
current comment suggests that it is updated.
Well, it might be updated in all situations when the given
seq pointed to an already overwritten record (below tail).
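For illustration, the behavior described above can be modeled with a
toy sketch (the record array, states, and helper name are all invented
here; this is not the ringbuffer code):

```c
#include <assert.h>
#include <stdbool.h>

enum rec_state { REC_OK, REC_LOST, REC_UNFINALIZED };

/* Toy ringbuffer: record 1 lost its data, record 3 is not finalized. */
static enum rec_state recs[] = { REC_OK, REC_LOST, REC_OK, REC_UNFINALIZED };
#define NRECS 4

/*
 * Simplified model of _prb_read_valid(): quietly skip records whose
 * data was lost (advancing *seq), but stop with "false" on a
 * not-yet-finalized or non-existing record, leaving *seq where it is.
 */
static bool model_read_valid(unsigned long *seq)
{
	while (*seq < NRECS) {
		if (recs[*seq] == REC_LOST) {
			(*seq)++;	/* -ENOENT: skip quietly */
			continue;
		}
		if (recs[*seq] == REC_UNFINALIZED)
			return false;	/* *seq stays as it is */
		return true;		/* finalized record with data */
	}
	return false;			/* above head: *seq unchanged */
}
```

In the model, only the lost-data case advances "seq"; a non-finalized
or out-of-range "seq" is returned untouched, matching cases 1 and 3.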
> And if you look at the kerneldoc of either of those functions you see:
>
> * On success, the reader must check r->info.seq to see which record was
> * actually read. This allows the reader to detect dropped records.
> *
> * Failure means @seq refers to a not yet finalized or non-existing record.
>
> Also note that @seq is never passed by reference from the external
> caller. It is only passed by reference to the helper function
> _prb_read_valid().
>
> > BTW: The support for data-less records were added by the commit
> > ("printk: ringbuffer: support dataless records"). It was
> > needed to handle empty lines: printk("\n"). It is strange
> > that we skip them instead of printing the empty line.
>
> We do not skip them. That was the whole point of adding support for
> data-less records. ;-)
>
> get_data() returns "" and @data_size=0
>
> copy_data() returns true (but does not copy any data)
>
> prb_read() returns true (we are assuming it is finalized)
>
> _prb_read_valid() returns true
This is not true if I read the code correctly.
> prb_read_valid() return true
>
> record_print_text() creates a string with prefix and adds "\n"
>
> printk_get_next_message() returns something to print
Best Regards,
Petr
PS: Sigh, I know that my comments looked like nitpicking. I had a big
inner fight over whether it was worth it or I was just burning my
and others' time. But it seems that it helped to find many subtle
problems in the end. Or am I still missing something?
On Wed 2023-10-18 10:30:46, John Ogness wrote:
> On 2023-10-17, John Ogness <[email protected]> wrote:
> >>> - if (!prb_read_valid(prb, seq, &r))
> >>> - return false;
> >>> + while (!prb_read_valid(prb, seq, &r)) {
> >>> + if (this_cpu_in_panic() && seq < prb_next_seq(prb)) {
> >>> + /*
> >>> + * The record @seq is not finalized and there may be
>
> I am wondering if we should put the this_cpu_in_panic() check into
> _prb_read_valid() instead. This has the advantage that it could more
> intelligently skip non-finalized records and not need to rely on
> prb_next_seq(). Also, it would allow any other readers (kmsg_dump) to
> skip non-finalized records during panic.
Yeah, this looks like a better solution. Anyway, we need to try reading
the messages behind prb_next_seq() up to the head of the ring buffer.
And @head_id is not available outside prb API.
> Looking over the code, it would be easy to extend
> desc_read_finalized_seq() to return a separate error code for exists but
> is non-finalized. This would make it trivial for _prb_read_valid() to
> skip over it when this_cpu_in_panic().
Sounds reasonable.
Best Regards,
Petr
On 2023-10-18, Petr Mladek <[email protected]> wrote:
> Wait! This means that your patch actually does not work. Here is the diff:
>
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2815,8 +2815,19 @@ static bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
> else
> prb_rec_init_rd(&r, &info, outbuf, outbuf_sz);
>
> - if (!prb_read_valid(prb, seq, &r))
> - return false;
> + while (!prb_read_valid(prb, seq, &r)) {
> + if (this_cpu_in_panic() && seq < prb_next_seq(prb)) {
> + /*
> + * The record @seq is not finalized and there may be
> + * more records in the ringbuffer. Since this is the
> + * panic CPU, skip over the unfinalized record and
> + * try to read a finalized record that may follow.
> + */
> + seq++;
> + } else {
> + return false;
> + }
> + }
>
> pmsg->seq = r.info->seq;
> pmsg->dropped = r.info->seq - seq;
>
> It skips the invalid reads only when seq < prb_next_seq(). But
> prb_next_seq(prb) points to the 1st non-finalized record. And
> all records with seq < prb_next_seq() must be finalized!
Please take a look at prb_next_seq(). It _starts_ its search from:
id = atomic_long_read(&desc_ring->last_finalized_id);
For console_flush_on_panic(), @last_finalized_id will _always_ be set to
the last finalized message of the panic messages, being higher than any
non-finalized records that may exist. There are no other CPUs running
except the panic CPU.
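To make the dependency on the cached value concrete, here is a toy
model of the search (the array, bounds, and names are invented for
illustration; this is not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: finalized[i] says whether record i is finalized. */
static bool finalized[] = { true, true, false, true };
#define NRECS 4

/*
 * Simplified model of prb_next_seq(): start at the cached
 * "last finalized" position and walk forward while records
 * read as finalized.
 */
static int model_next_seq(int last_finalized_cached)
{
	int seq = last_finalized_cached;

	while (seq < NRECS && finalized[seq])
		seq++;
	return seq;
}
```

With the cache pointing at record 3 the walk reaches the head (4), but
with a stale cache at record 0 it stops at the non-finalized record 2,
even though record 3 is finalized. Hence "best effort".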
> And it is even wrong. I still believe that dataless records are
> quietly skipped. Here is the code:
>
> static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
> struct printk_record *r, unsigned int *line_count)
> {
> while ((err = prb_read(rb, *seq, r, line_count))) {
> [...]
> } else if (err == -ENOENT) {
> /* Record exists, but no data available. Skip. */
> (*seq)++;
>
> } else {
> }
> [...]
> }
Yes, this is the code. But prb_read() returns _true_ for data-less
records. You are confusing data-less records (which are valid records
containing the empty string) with "no data available" (which are records
that have lost their data).
Your confusion shows that we should improve the language. I am working
on a patch that does this.
> Note that "seq" is incremented and the cycle continues. It means
> that it tries to read the next record. It does _not_ return
> "true" when seeing dataless record, definitely.
I showed you where to look: get_data() has an extra check for data-less
records. It returns a valid string "", not NULL.
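A sketch of that distinction (a simplified stand-in for get_data(),
not the actual kernel function; only the two lpos constants mirror the
real NO_LPOS/FAILED_LPOS values):

```c
#include <assert.h>
#include <stddef.h>

#define FAILED_LPOS	0x1UL	/* data lost (failed alloc or overwritten) */
#define NO_LPOS		0x3UL	/* valid record with zero-size data */

/*
 * Simplified stand-in for get_data(): a zero-size (data-less) record
 * is returned as a valid empty string, while lost data yields NULL
 * so that the caller skips the record.
 */
static const char *model_get_data(unsigned long lpos, size_t *data_size)
{
	if (lpos == NO_LPOS) {
		*data_size = 0;
		return "";		/* printk("\n"): valid empty line */
	}
	if (lpos == FAILED_LPOS)
		return NULL;		/* reader must skip this record */
	/* a real lpos would be looked up in the data ring here */
	return NULL;
}
```

The empty-line case returns a readable (empty) string, so the record
is printed; only the lost-data case makes the reader skip.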
>> that it is just a brief internal comment. It also says:
>>
>> * See the description of prb_read_valid() and prb_read_valid_info()
>> * for details.
>
> This does not help because these two functions do not describe
> how the seq is modified. And there are clearly three possibilities
> when it returns false:
>
> 1. The record for the given seq number has not been finalized
> yet. In this case, it keeps "seq" as it was.
>
> 2. The record for the given seq has been dataless. In this case
> it would try to advance seq until it finds a record with
> the data or not-yet-finalized one.
"dataless" is the wrong word. It should be "lost data". It is skipped
because there is no data available (because it was lost) for the reader
to read. This is _not_ the case where it is a finalized, fully intact
record containing an empty string.
>
> 3. The given seq number has been invalid, above the head.
> In this case, it keeps "seq" intact. So same as 1st case.
>
> Note that "seq" was not updated in the 1st and 3rd case. But the
> current comment suggests that it is updated.
>
> Well, it might be updated in all situations when then given
> seq pointed to an already overwritten record (below tail).
You are making this all sound much more complicated than it
is. Really. I suppose this is testimony to the horrible documentation.
>> _prb_read_valid() returns true
>
> This is not true if I read the code correctly.
Unfortunately you are not reading the code correctly. (Or rather, you
are being misled by the comments, because you incorrectly take "not
available to reader" to mean "valid record with an empty string".)
> PS: Sigh, I know that my comments looked like nitpicking. I had big inner
> fight whether it was worth it or I was just burning my and others
> time. But it seems that it helped for find many subtle problems
> in the end.
Indeed. The subtle problem is: unclear comments/documentation.
John
On Wed 2023-10-18 14:54:29, Petr Mladek wrote:
> On Tue 2023-10-17 23:31:25, John Ogness wrote:
> > On 2023-10-17, Petr Mladek <[email protected]> wrote:
> > > BTW: The support for data-less records were added by the commit
> > > ("printk: ringbuffer: support dataless records"). It was
> > > needed to handle empty lines: printk("\n"). It is strange
> > > that we skip them instead of printing the empty line.
> >
> > We do not skip them. That was the whole point of adding support for
> > data-less records. ;-)
> >
> > get_data() returns "" and @data_size=0
Ah, I see it now. The code really handles printk("\n") correctly.
There are two types of dataless records:
+ #define NO_LPOS 0x3. It is used for storing zero size data.
+ #define FAILED_LPOS 0x1. It is used when the code failed
to allocate enough space on the ring buffer or when
the data was overwritten.
BLK_DATALESS() returns true for both of them.
Sigh, we should really distinguish in the comments when functions
handle the two situations in different ways. Especially we should not
use the ambiguous term "dataless record".
I am thinking of "empty line record" and "records with missing data".
And I would rename NO_LPOS to EMPTY_LINE_LPOS to make the meaning
more obvious.
Also it would make sense to use 0x2 for EMPTY_LINE_LPOS and
#define FAILED_LPOS 0x1
#define EMPTY_LINE_LPOS 0x2
#define DATALESS_LPOS_MASK (FAILED_LPOS | EMPTY_LINE_LPOS)
#define LPOS_DATALESS(lpos) ((lpos) & DATALESS_LPOS_MASK)
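Assuming this proposed encoding, the macros would classify values like
this (a sketch of the proposal, not current kernel code; the 0x40 test
value stands in for a normal block-aligned lpos):

```c
#include <assert.h>

/* Proposed encoding (a suggestion, not the current kernel code): */
#define FAILED_LPOS		0x1UL
#define EMPTY_LINE_LPOS		0x2UL
#define DATALESS_LPOS_MASK	(FAILED_LPOS | EMPTY_LINE_LPOS)

/* Both data-less variants match; a block-aligned lpos does not. */
#define LPOS_DATALESS(lpos)	((lpos) & DATALESS_LPOS_MASK)
```

This relies on real data blocks being aligned so that the two low bits
of a genuine lpos are always clear.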
> > copy_data() returns true (but does not copy any data)
> >
> > prb_read() returns true (we are assuming it is finalized)
> >
> > _prb_read_valid() returns true
>
> This is not true if I read the code correctly.
>
> > prb_read_valid() return true
> >
> > record_print_text() creates a string with prefix and adds "\n"
> >
> > printk_get_next_message() returns something to print
If we used the right terms in the comments then they might stay short
and still be clear.
Best Regards,
Petr
On Wed 2023-10-18 15:51:57, John Ogness wrote:
> On 2023-10-18, Petr Mladek <[email protected]> wrote:
> > Wait! This means that your patch actually does not work. Here is the diff:
> >
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -2815,8 +2815,19 @@ static bool printk_get_next_message(struct printk_message *pmsg, u64 seq,
> > else
> > prb_rec_init_rd(&r, &info, outbuf, outbuf_sz);
> >
> > - if (!prb_read_valid(prb, seq, &r))
> > - return false;
> > + while (!prb_read_valid(prb, seq, &r)) {
> > + if (this_cpu_in_panic() && seq < prb_next_seq(prb)) {
> > + /*
> > + * The record @seq is not finalized and there may be
> > + * more records in the ringbuffer. Since this is the
> > + * panic CPU, skip over the unfinalized record and
> > + * try to read a finalized record that may follow.
> > + */
> > + seq++;
> > + } else {
> > + return false;
> > + }
> > + }
> >
> > pmsg->seq = r.info->seq;
> > pmsg->dropped = r.info->seq - seq;
> >
> > It skips the invalid reads only when seq < prb_next_seq(). But
> > prb_next_seq(prb) points to the 1st non-finalized record. And
> > all records with seq < prb_next_seq() must be finalized!
>
> Please take a look at prb_next_seq(). It _starts_ its search from:
>
> id = atomic_long_read(&desc_ring->last_finalized_id);
I see. And this is set in
static void desc_make_final(struct prb_desc_ring *desc_ring, unsigned long id)
{
[...]
/* Best effort to remember the last finalized @id. */
atomic_long_set(&desc_ring->last_finalized_id, id);
}
So it is the _last_ finalized id from the timing POV. If there are
more CPUs storing and finalizing messages in parallel then
it might change back and forth. There might be earlier non-finalized
records and newer finalized ones.
It means that prb_next_seq() really is best effort and
its description is not valid:
/**
* prb_next_seq() - Get the sequence number after the last available record.
*
* @rb: The ringbuffer to get the sequence number from.
*
* This is the public function available to readers to see what the next
* newest sequence number available to readers will be.
*
* This provides readers a sequence number to jump to if all currently
* available records should be skipped.
It is not guaranteed that it will be the last available record
because there might be newer already finalized records with
some non-finalized records in between.
Also it is not guaranteed that it will be the next record available
to readers because readers should stop on the 1st non-yet-finalized
record and prb_next_seq() might be behind.
It would be great to document these subtle details especially when
we are going to depend on them.
> For console_flush_on_panic(), @last_finalized_id will _always_ be set to
> the last finalized message of the panic messages, being higher than any
> non-finalized records that may exist. There are no other CPUs running
> except the panic CPU.
It is not guaranteed. It might be lower when some still-running CPU
manages to finalize an earlier record while later records
remain non-finalized.
But you are right, there is a very high chance that it will point
behind the last message from panic() context sooner or later,
especially after CPUs get stopped.
Well, note that only NMI guarantees that CPUs get stopped. There
are still architectures which stop CPUs using normal interrupts.
> Unfortunately you are not reading the code correctly. (Or rather, you
> are being misled by comments because you incorrectly associate "not
> available to reader" to mean "valid record with an empty string".
You are right. Well, it is really hard to put all the pieces together
just by reading the code. And the unclear comments make it even worse.
Best Regards,
Petr
On 2023-10-18, Petr Mladek <[email protected]> wrote:
> So it is the _last_ finalized id from the timing POV. If there are
> more CPUs storing and finalizing the messages in parallel then
> it might change forth and back. There might be earlier non-finalized
> records and newer finalized ones.
>
> It means that prb_next_seq() really is the best effort and
> the description is not valid:
Well, the description was valid until prb_next_seq() was optimized and
converted to best-effort with:
commit f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq() performance")
> It would be great to document these subtle details especially when
> we are going to depend on them.
Going through the various call sites of prb_next_seq(), I would argue
that the above optimization introduced some bugs. I will investigate if
prb_next_seq() can be fixed to match its description because the current
users already depend on that.
WRT this series, I have put together an alternative implementation
that does not use prb_next_seq().
John
On Wed 2023-10-18 17:56:52, John Ogness wrote:
> On 2023-10-18, Petr Mladek <[email protected]> wrote:
> > So it is the _last_ finalized id from the timing POV. If there are
> > more CPUs storing and finalizing the messages in parallel then
> > it might change forth and back. There might be earlier non-finalized
> > records and newer finalized ones.
> >
> > It means that prb_next_seq() really is the best effort and
> > the description is not valid:
>
> Well, the description was valid until prb_next_seq() was optimized and
> converted to best-effort with:
>
> commit f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq() performance")
>
> > It would be great to document these subtle details especially when
> > we are going to depend on them.
>
> Going through the various call sites of prb_next_seq(), I would argue
> that the above optimization introduced some bugs. I will investigate if
> prb_next_seq() can be fixed to match its description because the current
> users already depend on that.
I thought about caching the last seq returned by
prb_next_seq() instead of caching the last finalized record.
Also I thought about the highest sequence number accessed
by _prb_read_valid(). But it can be done in a lockless way only
on 64-bit systems. Well, it might be good enough. I doubt that
there are big 32-bit systems.
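For the 64-bit case, one lockless way to cache a monotonically growing
sequence number is a compare-and-exchange loop that only ever moves the
cached value forward (a generic C11 sketch, not kernel code; the names
are invented):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t cached_seq;

/*
 * Advance the cache to @seq unless another CPU already stored a
 * newer value. On a failed compare-exchange, @old is refreshed
 * with the current value and the loop decides again.
 */
static void cache_seq_max(uint64_t seq)
{
	uint64_t old = atomic_load(&cached_seq);

	while (old < seq &&
	       !atomic_compare_exchange_weak(&cached_seq, &old, seq))
		;	/* retry with the refreshed @old */
}
```

The cached value can only grow, so it never moves "back" the way a
plain store of the last finalized id can.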
> WRT to this series, I have put together an alternative implementation
> that does not use prb_next_seq().
Great.
Best Regards,
Petr
On 2023-10-18, Petr Mladek <[email protected]> wrote:
> Well, I have spent some time thinking about possible scenarios and I would
> go even further. I would block also console_lock_spinning_enable()
> Also I would do the checks before taking console_owner_lock.
>
> It would make the behavior symmetric. And more importantly, it would prevent
> possible deadlocks caused by console_owner_lock.
>
> See the proposed patch below. I created it to see the changes in the
> code. Also I added info about possible scenarios and effects.
Reviewed-by: John Ogness <[email protected]>
I will use your patch for the next version of this series.
John
On 2023-10-18, Petr Mladek <[email protected]> wrote:
> I think about "empty line record" and "records with missing data".
> And I would rename NO_LPOS to EMPTY_LINE_LPOS to make the meaning
> more obvious.
OK.
> Also it would make sense to use 0x2 for EMPTY_LINE_LPOS and
>
> #define FAILED_LPOS 0x1
> #define EMPTY_LINE_LPOS 0x2
> #define DATALESS_LPOS_MASK (FAILED_LPOS | EMPTY_LINE_LPOS)
>
> #define LPOS_DATALESS(lpos) ((lpos) & DATALESS_LPOS_MASK)
The existing debugging tools use bit0 to identify if there is data. Bit1
is really the _reason_ for the missing data. This can be seen in the
definition of LPOS_DATALESS(), but it needs to be documented
better. (Ideally where FAILED_LPOS and EMPTY_LINE_LPOS are defined.)
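The current encoding can be spelled out to show that invariant (the
constants are today's values; the checks are just illustration):

```c
#include <assert.h>

/*
 * Current encoding: bit0 set means "no data", bit1 encodes the
 * reason (zero-size record vs. failed/overwritten data block).
 */
#define FAILED_LPOS	0x1UL	/* no data: allocation failed/overwritten */
#define NO_LPOS		0x3UL	/* no data: zero-size record (empty line) */
```

A tool that only tests bit0 still catches both data-less variants,
which is what would break if the empty-line value became 0x2.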
John