2023-10-19 13:26:59

by John Ogness

Subject: [RFC PATCH printk v1] printk: ringbuffer: Do not skip non-finalized with prb_next_seq()

Commit f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq()
performance") introduced an optimization for prb_next_seq() by
using best-effort to track recently finalized records. However,
the order of finalization does not necessarily match the order
of the records. This can lead to prb_next_seq() returning
higher than desired sequence numbers, which results in the
reader skipping over records that are not yet finalized. From
the reader's perspective it results in messages never being
seen.

Rather than simply tracking recently finalized records, force
the committing writer to read records and increment the last
"contiguous block" of finalized records. In order to do this,
the sequence number instead of ID must be stored because ID's
cannot be directly compared.

For 32bit systems, only the lower 32 bits of the sequence
number are stored. When reading the value, it is expanded to
the full 64bit sequence number by folding in the value
returned by prb_first_seq().

Fixes: f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq() performance")
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk_ringbuffer.c | 114 +++++++++++++++++++++---------
kernel/printk/printk_ringbuffer.h | 4 +-
2 files changed, 82 insertions(+), 36 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index fde338606ce8..f9fbfa21c5b2 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1441,20 +1441,83 @@ bool prb_reserve_in_last(struct prb_reserved_entry *e, struct printk_ringbuffer
return false;
}

+#ifdef CONFIG_64BIT
+
+#define __u64seq_to_ulseq(u64seq) (u64seq)
+#define __ulseq_to_u64seq(ulseq) (ulseq)
+
+#else /* CONFIG_64BIT */
+
+static u64 prb_first_seq(struct printk_ringbuffer *rb);
+
+#define __u64seq_to_ulseq(u64seq) ((u32)u64seq)
+static inline u64 __ulseq_to_u64seq(u32 ulseq)
+{
+ u64 rb_first_seq = prb_first_seq(prb);
+ u64 seq;
+
+ /*
+ * The provided sequence is only the lower 32 bits of the ringbuffer
+ * sequence. It needs to be expanded to 64bit. Get the first sequence
+ * number from the ringbuffer and fold it.
+ */
+ seq = rb_first_seq - ((u32)rb_first_seq - ulseq);
+
+ return seq;
+}
+
+#endif /* CONFIG_64BIT */
+
+static u64 desc_last_finalized_seq(struct prb_desc_ring *desc_ring)
+{
+ unsigned long ulseq = atomic_long_read(&desc_ring->last_finalized_seq);
+
+ return __ulseq_to_u64seq(ulseq);
+}
+
+static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
+ struct printk_record *r, unsigned int *line_count);
+
+/*
+ * Check if there are records directly following @last_finalized_seq that are
+ * finalized. If so, update @last_finalized_seq to the latest of these
+ * records. It is not allowed to skip over records that are not yet finalized.
+ */
+static void desc_update_last_finalized(struct printk_ringbuffer *rb)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ u64 prev_seq = desc_last_finalized_seq(desc_ring);
+ u64 seq = prev_seq;
+
+ while (_prb_read_valid(rb, &seq, NULL, NULL)) {
+ unsigned long oldval = __u64seq_to_ulseq(prev_seq);
+ unsigned long newval = __u64seq_to_ulseq(seq);
+
+ if (atomic_long_try_cmpxchg_relaxed(&desc_ring->last_finalized_seq,
+ &oldval, newval)) {
+ prev_seq = seq;
+ } else {
+ prev_seq = __ulseq_to_u64seq(oldval);
+ }
+
+ seq = prev_seq + 1;
+ }
+}
+
/*
* Attempt to finalize a specified descriptor. If this fails, the descriptor
* is either already final or it will finalize itself when the writer commits.
*/
-static void desc_make_final(struct prb_desc_ring *desc_ring, unsigned long id)
+static void desc_make_final(struct printk_ringbuffer *rb, unsigned long id)
{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
unsigned long prev_state_val = DESC_SV(id, desc_committed);
struct prb_desc *d = to_desc(desc_ring, id);

atomic_long_cmpxchg_relaxed(&d->state_var, prev_state_val,
DESC_SV(id, desc_finalized)); /* LMM(desc_make_final:A) */

- /* Best effort to remember the last finalized @id. */
- atomic_long_set(&desc_ring->last_finalized_id, id);
+ desc_update_last_finalized(rb);
}

/**
@@ -1550,7 +1613,7 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
* readers. (For seq==0 there is no previous descriptor.)
*/
if (info->seq > 0)
- desc_make_final(desc_ring, DESC_ID(id - 1));
+ desc_make_final(rb, DESC_ID(id - 1));

r->text_buf = data_alloc(rb, r->text_buf_size, &d->text_blk_lpos, id);
/* If text data allocation fails, a data-less record is committed. */
@@ -1643,7 +1706,7 @@ void prb_commit(struct prb_reserved_entry *e)
*/
head_id = atomic_long_read(&desc_ring->head_id); /* LMM(prb_commit:A) */
if (head_id != e->id)
- desc_make_final(desc_ring, e->id);
+ desc_make_final(e->rb, e->id);
}

/**
@@ -1663,12 +1726,9 @@ void prb_commit(struct prb_reserved_entry *e)
*/
void prb_final_commit(struct prb_reserved_entry *e)
{
- struct prb_desc_ring *desc_ring = &e->rb->desc_ring;
-
_prb_commit(e, desc_finalized);

- /* Best effort to remember the last finalized @id. */
- atomic_long_set(&desc_ring->last_finalized_id, e->id);
+ desc_update_last_finalized(e->rb);
}

/*
@@ -2017,33 +2077,19 @@ u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
u64 prb_next_seq(struct printk_ringbuffer *rb)
{
struct prb_desc_ring *desc_ring = &rb->desc_ring;
- enum desc_state d_state;
- unsigned long id;
u64 seq;

- /* Check if the cached @id still points to a valid @seq. */
- id = atomic_long_read(&desc_ring->last_finalized_id);
- d_state = desc_read(desc_ring, id, NULL, &seq, NULL);
+ seq = __ulseq_to_u64seq(atomic_long_read(&desc_ring->last_finalized_seq));

- if (d_state == desc_finalized || d_state == desc_reusable) {
- /*
- * Begin searching after the last finalized record.
- *
- * On 0, the search must begin at 0 because of hack#2
- * of the bootstrapping phase it is not known if a
- * record at index 0 exists.
- */
- if (seq != 0)
- seq++;
- } else {
- /*
- * The information about the last finalized sequence number
- * has gone. It should happen only when there is a flood of
- * new messages and the ringbuffer is rapidly recycled.
- * Give up and start from the beginning.
- */
- seq = 0;
- }
+ /*
+ * Begin searching after the last finalized record.
+ *
+ * On 0, the search must begin at 0 because of hack#2
+ * of the bootstrapping phase it is not known if a
+ * record at index 0 exists.
+ */
+ if (seq != 0)
+ seq++;

/*
* The information about the last finalized @seq might be inaccurate.
@@ -2085,7 +2131,7 @@ void prb_init(struct printk_ringbuffer *rb,
rb->desc_ring.infos = infos;
atomic_long_set(&rb->desc_ring.head_id, DESC0_ID(descbits));
atomic_long_set(&rb->desc_ring.tail_id, DESC0_ID(descbits));
- atomic_long_set(&rb->desc_ring.last_finalized_id, DESC0_ID(descbits));
+ atomic_long_set(&rb->desc_ring.last_finalized_seq, 0);

rb->text_data_ring.size_bits = textbits;
rb->text_data_ring.data = text_buf;
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index 18cd25e489b8..3374a5a3303e 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -75,7 +75,7 @@ struct prb_desc_ring {
struct printk_info *infos;
atomic_long_t head_id;
atomic_long_t tail_id;
- atomic_long_t last_finalized_id;
+ atomic_long_t last_finalized_seq;
};

/*
@@ -259,7 +259,7 @@ static struct printk_ringbuffer name = { \
.infos = &_##name##_infos[0], \
.head_id = ATOMIC_INIT(DESC0_ID(descbits)), \
.tail_id = ATOMIC_INIT(DESC0_ID(descbits)), \
- .last_finalized_id = ATOMIC_INIT(DESC0_ID(descbits)), \
+ .last_finalized_seq = ATOMIC_INIT(0), \
}, \
.text_data_ring = { \
.size_bits = (avgtextbits) + (descbits), \

base-commit: a26f18f291f6c2eac6dac9faa12cc68fd1da5865
--
2.39.2


2023-10-19 13:39:17

by John Ogness

Subject: Re: [RFC PATCH printk v1] printk: ringbuffer: Do not skip non-finalized with prb_next_seq()

Some suggested changes.

On 2023-10-19, John Ogness <[email protected]> wrote:
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> index fde338606ce8..f9fbfa21c5b2 100644
> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -1441,20 +1441,83 @@ bool prb_reserve_in_last(struct prb_reserved_entry *e, struct printk_ringbuffer
> return false;
> }
>
> +#ifdef CONFIG_64BIT
> +
> +#define __u64seq_to_ulseq(u64seq) (u64seq)
> +#define __ulseq_to_u64seq(ulseq) (ulseq)
> +
> +#else /* CONFIG_64BIT */
> +
> +static u64 prb_first_seq(struct printk_ringbuffer *rb);
> +
> +#define __u64seq_to_ulseq(u64seq) ((u32)u64seq)
> +static inline u64 __ulseq_to_u64seq(u32 ulseq)
> +{
> + u64 rb_first_seq = prb_first_seq(prb);
> + u64 seq;
> +
> + /*
> + * The provided sequence is only the lower 32 bits of the ringbuffer
> + * sequence. It needs to be expanded to 64bit. Get the first sequence
> + * number from the ringbuffer and fold it.
> + */
> + seq = rb_first_seq - ((u32)rb_first_seq - ulseq);
> +
> + return seq;
> +}
> +
> +#endif /* CONFIG_64BIT */
> +
> +static u64 desc_last_finalized_seq(struct prb_desc_ring *desc_ring)
> +{
> + unsigned long ulseq = atomic_long_read(&desc_ring->last_finalized_seq);
> +
> + return __ulseq_to_u64seq(ulseq);
> +}
> +
> +static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
> + struct printk_record *r, unsigned int *line_count);
> +
> +/*
> + * Check if there are records directly following @last_finalized_seq that are
> + * finalized. If so, update @last_finalized_seq to the latest of these
> + * records. It is not allowed to skip over records that are not yet finalized.
> + */
> +static void desc_update_last_finalized(struct printk_ringbuffer *rb)
> +{
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> + u64 prev_seq = desc_last_finalized_seq(desc_ring);
> + u64 seq = prev_seq;
> +
> + while (_prb_read_valid(rb, &seq, NULL, NULL)) {
> + unsigned long oldval = __u64seq_to_ulseq(prev_seq);
> + unsigned long newval = __u64seq_to_ulseq(seq);
> +

Generally there are not multiple CPUs racing to finalize records. So I
would add an extra check for the first loop iteration (where seq == prev_seq) to
avoid the extra cmpxchg.

+ if (prev_seq == seq) {
+ ; /* NOP */
+ } else
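
Assembled into the loop quoted below, the suggestion would look roughly
like this (only an illustration of the idea, not a tested change):

	while (_prb_read_valid(rb, &seq, NULL, NULL)) {
		unsigned long oldval = __u64seq_to_ulseq(prev_seq);
		unsigned long newval = __u64seq_to_ulseq(seq);

		if (prev_seq == seq) {
			; /* NOP: nothing new was finalized, skip the cmpxchg. */
		} else if (atomic_long_try_cmpxchg_relaxed(&desc_ring->last_finalized_seq,
							   &oldval, newval)) {
			prev_seq = seq;
		} else {
			prev_seq = __ulseq_to_u64seq(oldval);
		}

		seq = prev_seq + 1;
	}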

> + if (atomic_long_try_cmpxchg_relaxed(&desc_ring->last_finalized_seq,
> + &oldval, newval)) {
> + prev_seq = seq;
> + } else {
> + prev_seq = __ulseq_to_u64seq(oldval);
> + }
> +
> + seq = prev_seq + 1;
> + }
> +}
> +
> /*
> * Attempt to finalize a specified descriptor. If this fails, the descriptor
> * is either already final or it will finalize itself when the writer commits.
> */
> -static void desc_make_final(struct prb_desc_ring *desc_ring, unsigned long id)
> +static void desc_make_final(struct printk_ringbuffer *rb, unsigned long id)
> {
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> unsigned long prev_state_val = DESC_SV(id, desc_committed);
> struct prb_desc *d = to_desc(desc_ring, id);
>
> atomic_long_cmpxchg_relaxed(&d->state_var, prev_state_val,
> DESC_SV(id, desc_finalized)); /* LMM(desc_make_final:A) */
>
> - /* Best effort to remember the last finalized @id. */
> - atomic_long_set(&desc_ring->last_finalized_id, id);
> + desc_update_last_finalized(rb);
> }
>
> /**
> @@ -1550,7 +1613,7 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
> * readers. (For seq==0 there is no previous descriptor.)
> */
> if (info->seq > 0)
> - desc_make_final(desc_ring, DESC_ID(id - 1));
> + desc_make_final(rb, DESC_ID(id - 1));
>
> r->text_buf = data_alloc(rb, r->text_buf_size, &d->text_blk_lpos, id);
> /* If text data allocation fails, a data-less record is committed. */
> @@ -1643,7 +1706,7 @@ void prb_commit(struct prb_reserved_entry *e)
> */
> head_id = atomic_long_read(&desc_ring->head_id); /* LMM(prb_commit:A) */
> if (head_id != e->id)
> - desc_make_final(desc_ring, e->id);
> + desc_make_final(e->rb, e->id);
> }
>
> /**
> @@ -1663,12 +1726,9 @@ void prb_commit(struct prb_reserved_entry *e)
> */
> void prb_final_commit(struct prb_reserved_entry *e)
> {
> - struct prb_desc_ring *desc_ring = &e->rb->desc_ring;
> -
> _prb_commit(e, desc_finalized);
>
> - /* Best effort to remember the last finalized @id. */
> - atomic_long_set(&desc_ring->last_finalized_id, e->id);
> + desc_update_last_finalized(e->rb);
> }
>
> /*
> @@ -2017,33 +2077,19 @@ u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
> u64 prb_next_seq(struct printk_ringbuffer *rb)
> {
> struct prb_desc_ring *desc_ring = &rb->desc_ring;
> - enum desc_state d_state;
> - unsigned long id;
> u64 seq;
>
> - /* Check if the cached @id still points to a valid @seq. */
> - id = atomic_long_read(&desc_ring->last_finalized_id);
> - d_state = desc_read(desc_ring, id, NULL, &seq, NULL);
> + seq = __ulseq_to_u64seq(atomic_long_read(&desc_ring->last_finalized_seq));

There is now a function for this. No need to open code it.

+ seq = desc_last_finalized_seq(desc_ring);

>
> - if (d_state == desc_finalized || d_state == desc_reusable) {
> - /*
> - * Begin searching after the last finalized record.
> - *
> - * On 0, the search must begin at 0 because of hack#2
> - * of the bootstrapping phase it is not known if a
> - * record at index 0 exists.
> - */
> - if (seq != 0)
> - seq++;
> - } else {
> - /*
> - * The information about the last finalized sequence number
> - * has gone. It should happen only when there is a flood of
> - * new messages and the ringbuffer is rapidly recycled.
> - * Give up and start from the beginning.
> - */
> - seq = 0;
> - }
> + /*
> + * Begin searching after the last finalized record.
> + *
> + * On 0, the search must begin at 0 because of hack#2
> + * of the bootstrapping phase it is not known if a
> + * record at index 0 exists.
> + */
> + if (seq != 0)
> + seq++;
>
> /*
> * The information about the last finalized @seq might be inaccurate.
> @@ -2085,7 +2131,7 @@ void prb_init(struct printk_ringbuffer *rb,
> rb->desc_ring.infos = infos;
> atomic_long_set(&rb->desc_ring.head_id, DESC0_ID(descbits));
> atomic_long_set(&rb->desc_ring.tail_id, DESC0_ID(descbits));
> - atomic_long_set(&rb->desc_ring.last_finalized_id, DESC0_ID(descbits));
> + atomic_long_set(&rb->desc_ring.last_finalized_seq, 0);
>
> rb->text_data_ring.size_bits = textbits;
> rb->text_data_ring.data = text_buf;
> diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
> index 18cd25e489b8..3374a5a3303e 100644
> --- a/kernel/printk/printk_ringbuffer.h
> +++ b/kernel/printk/printk_ringbuffer.h
> @@ -75,7 +75,7 @@ struct prb_desc_ring {
> struct printk_info *infos;
> atomic_long_t head_id;
> atomic_long_t tail_id;
> - atomic_long_t last_finalized_id;
> + atomic_long_t last_finalized_seq;
> };
>
> /*
> @@ -259,7 +259,7 @@ static struct printk_ringbuffer name = { \
> .infos = &_##name##_infos[0], \
> .head_id = ATOMIC_INIT(DESC0_ID(descbits)), \
> .tail_id = ATOMIC_INIT(DESC0_ID(descbits)), \
> - .last_finalized_id = ATOMIC_INIT(DESC0_ID(descbits)), \
> + .last_finalized_seq = ATOMIC_INIT(0), \
> }, \
> .text_data_ring = { \
> .size_bits = (avgtextbits) + (descbits), \
>
> base-commit: a26f18f291f6c2eac6dac9faa12cc68fd1da5865
> --
> 2.39.2

John

2023-10-25 15:16:41

by Petr Mladek

Subject: Re: [RFC PATCH printk v1] printk: ringbuffer: Do not skip non-finalized with prb_next_seq()

Hi,

there seems to be a missing word in the subject:

s/non-finalized/non-finalized records/

On Thu 2023-10-19 15:31:45, John Ogness wrote:
> Commit f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq()
> performance") introduced an optimization for prb_next_seq() by
> using best-effort to track recently finalized records. However,
> the order of finalization does not necessarily match the order
> of the records. This can lead to prb_next_seq() returning
> higher than desired sequence numbers, which results in the
> reader skipping over records that are not yet finalized. From
> the reader's perspective it results in messages never being
> seen.

IMHO, "messages never being seen" is too strong. Also, it seems
to be trickier than it might look.

I have found only one (or two) scenarios where the messages might
really get lost.

1. It might happen when real console is replacing a boot console.
The real console is initialized with the value returned
by prb_next_seq(). And the boot console might not be able
to flush earlier non-finalized records.

But it is more complicated because console_unlock() tries
hard to flush the early console before it is removed.

So it might happen only when console_init_seq() sees higher
prb_next_seq() and console_unlock() sees lower prb_next_seq().


2. The other scenario is based on the fact that console_unlock()
or pr_flush() might see lower prb_next_seq() than the last
reserved record. It means that they might not flush all
pending records.

In this case, the messages will be lost forever only when
the system dies before doing another flush later.

But wait! This is actually the opposite case. pr_flush()
and console_unlock() might miss the messages when
they see too low prb_next_seq().

Important: This problem existed since introducing
the lockless ring buffer!

The question is. Should pr_flush() and console_unlock()
wait until all registered messages get finalized?

It would need to ignore only the last record when it
is not finalized, because it might be a continuation line.


My opinion:

I would do the change proposed by this patch for now, with some
changes suggested below.

The commit message should say that it restores the prb_next_seq() behavior
before the commit f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq()
performance").

In particular, prb_next_seq() will guarantee that all records
with a lower sequence number have been finalized. They can be
read except for the already overwritten ones.

Regarding the flushing behavior: for now, I would rather keep
the historic behavior where only messages with seq < prb_next_seq()
are handled, and not wait on the reserved ones.

Waiting for reserved but not-yet-finalized records might be tricky
and might even cause livelocks or deadlocks. We can always
improve the situation when there are problems in real life.

> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -1441,20 +1441,83 @@ bool prb_reserve_in_last(struct prb_reserved_entry *e, struct printk_ringbuffer
> return false;
> }
>
> +#ifdef CONFIG_64BIT
> +
> +#define __u64seq_to_ulseq(u64seq) (u64seq)
> +#define __ulseq_to_u64seq(ulseq) (ulseq)
> +
> +#else /* CONFIG_64BIT */
> +
> +static u64 prb_first_seq(struct printk_ringbuffer *rb);
> +
> +#define __u64seq_to_ulseq(u64seq) ((u32)u64seq)
> +static inline u64 __ulseq_to_u64seq(u32 ulseq)
> +{
> + u64 rb_first_seq = prb_first_seq(prb);
> + u64 seq;
> +
> + /*
> + * The provided sequence is only the lower 32 bits of the ringbuffer
> + * sequence. It needs to be expanded to 64bit. Get the first sequence
> + * number from the ringbuffer and fold it.
> + */
> + seq = rb_first_seq - ((u32)rb_first_seq - ulseq);
> +
> + return seq;
> +}
> +
> +#endif /* CONFIG_64BIT */
> +
> +static u64 desc_last_finalized_seq(struct prb_desc_ring *desc_ring)
> +{
> + unsigned long ulseq = atomic_long_read(&desc_ring->last_finalized_seq);
> +
> + return __ulseq_to_u64seq(ulseq);
> +}
> +
> +static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
> + struct printk_record *r, unsigned int *line_count);
> +
> +/*
> + * Check if there are records directly following @last_finalized_seq that are
> + * finalized. If so, update @last_finalized_seq to the latest of these
> + * records. It is not allowed to skip over records that are not yet finalized.

I would add some details about what to expect from this value. Something
like:

* @last_finalized_seq value guarantees that all records up to this
* sequence number are finalized and might be read. The only exception
* are too old records which have already been overwritten.
*
* Also it is guaranteed that the value can only grow.
*
* But it is just a best effort. There is _no_ synchronization between
* writers finalizing records and @last_finalized_seq updates. The result
* might be a lower value because updaters might not see finalized
* records from other CPUs.
*
* There is _no_ easy way to serialize these two operations. It would
* require remembering two values, which might be handled in:
*
* @last_finalized_id in desc_make_final()
* @last_readable_seq in desc_update_last_readable_seq()
*
* and these two values would need to be updated atomically together
* so that only the last updater of the @id part would be allowed
* to update the @seq part. Also it would require a barrier so
* that each writer would see the finalized state from other
* writers who updated the @id part earlier.
*
* Summary:
*
* This value might be used by readers only to check the last
* readable seq number at the given time. They must take into account
* the fact that new records might appear at any time.
*
* Beware that this value is not increased with every finalized
* record. It stays the same when there are not-yet-finalized
* records with lower sequence number.
*
* In particular, it does not guarantee that readers woken
* via wake_up_klogd would be able to flush all messages
* up to the one just stored by vprintk_store(). For example,
* it does not guarantee that console_unlock() would flush
* all pending messages.
*/

> + */
> +static void desc_update_last_finalized(struct printk_ringbuffer *rb)
> +{
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> + u64 prev_seq = desc_last_finalized_seq(desc_ring);
> + u64 seq = prev_seq;
> +
> + while (_prb_read_valid(rb, &seq, NULL, NULL)) {
> + unsigned long oldval = __u64seq_to_ulseq(prev_seq);
> + unsigned long newval = __u64seq_to_ulseq(seq);
> +
> + if (atomic_long_try_cmpxchg_relaxed(&desc_ring->last_finalized_seq,
> + &oldval, newval)) {
> + prev_seq = seq;
> + } else {
> + prev_seq = __ulseq_to_u64seq(oldval);
> + }
> +
> + seq = prev_seq + 1;
> + }

I agree that this might be optimized. I think about reducing the
number of cmpxchg even more, something like:

static void desc_update_last_finalized(struct prb_desc_ring *desc_ring)
{
	struct prb_desc_ring *desc_ring = &rb->desc_ring;
	u64 prev_seq = desc_last_finalized_seq(desc_ring);
	u64 seq = prev_seq;

try_again:
	while (_prb_read_valid(rb, &seq, NULL, NULL))
		seq++;

	if (seq == prev_seq)
		return;

	oldval = __u64seq_to_ulseq(prev_seq);
	newval = __u64seq_to_ulseq(seq);

	if (!atomic_long_try_cmpxchg_relaxed(&desc_ring->last_finalized_seq,
					     &oldval, newval)) {
		prev_seq = seq;
		goto try_again;
	}
}


> +}
> +
> /*
> * Attempt to finalize a specified descriptor. If this fails, the descriptor
> * is either already final or it will finalize itself when the writer commits.
> */
> -static void desc_make_final(struct prb_desc_ring *desc_ring, unsigned long id)
> +static void desc_make_final(struct printk_ringbuffer *rb, unsigned long id)

It looks to me that we could keep passing desc_ring as the parameter.
I guess that passing "rb" was needed only by some iteration of the patch.

> {
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> unsigned long prev_state_val = DESC_SV(id, desc_committed);
> struct prb_desc *d = to_desc(desc_ring, id);
>
> atomic_long_cmpxchg_relaxed(&d->state_var, prev_state_val,
> DESC_SV(id, desc_finalized)); /* LMM(desc_make_final:A) */
>

> - /* Best effort to remember the last finalized @id. */
> - atomic_long_set(&desc_ring->last_finalized_id, id);
> + desc_update_last_finalized(rb);
> }
>
> /**
> @@ -1550,7 +1613,7 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
> * readers. (For seq==0 there is no previous descriptor.)
> */
> if (info->seq > 0)
> - desc_make_final(desc_ring, DESC_ID(id - 1));
> + desc_make_final(rb, DESC_ID(id - 1));
>
> r->text_buf = data_alloc(rb, r->text_buf_size, &d->text_blk_lpos, id);
> /* If text data allocation fails, a data-less record is committed. */
> @@ -1643,7 +1706,7 @@ void prb_commit(struct prb_reserved_entry *e)
> */
> head_id = atomic_long_read(&desc_ring->head_id); /* LMM(prb_commit:A) */
> if (head_id != e->id)
> - desc_make_final(desc_ring, e->id);
> + desc_make_final(e->rb, e->id);
> }
>
> /**
> @@ -1663,12 +1726,9 @@ void prb_commit(struct prb_reserved_entry *e)
> */
> void prb_final_commit(struct prb_reserved_entry *e)
> {
> - struct prb_desc_ring *desc_ring = &e->rb->desc_ring;
> -
> _prb_commit(e, desc_finalized);
>
> - /* Best effort to remember the last finalized @id. */
> - atomic_long_set(&desc_ring->last_finalized_id, e->id);
> + desc_update_last_finalized(e->rb);
> }
>
> /*
> @@ -2017,33 +2077,19 @@ u64 prb_first_valid_seq(struct printk_ringbuffer *rb)
> u64 prb_next_seq(struct printk_ringbuffer *rb)
> {
> struct prb_desc_ring *desc_ring = &rb->desc_ring;
> - enum desc_state d_state;
> - unsigned long id;
> u64 seq;
>
> - /* Check if the cached @id still points to a valid @seq. */
> - id = atomic_long_read(&desc_ring->last_finalized_id);
> - d_state = desc_read(desc_ring, id, NULL, &seq, NULL);
> + seq = __ulseq_to_u64seq(atomic_long_read(&desc_ring->last_finalized_seq));

I feel that we need a read barrier here. It should be between the above

atomic_long_read(&desc_ring->last_finalized_seq))

and the below

while (_prb_read_valid(rb, &seq, NULL, NULL))
seq++;

It should make sure that _prb_read_valid() will see as finalized
all messages which were seen as finalized by the CPU updating
desc_ring->last_finalized_seq.

Hmm, I am not sure about the counterpart. I tried many variants when I
was discovering the races described above, and I got a bit lost.
And I feel too tired to be more confident about any proposal.

My best guess is that we should replace:

atomic_long_try_cmpxchg_relaxed(&desc_ring->last_finalized_seq,
&oldval, newval))


with

atomic_long_try_cmpxchg(&desc_ring->last_finalized_seq,
&oldval, newval))

so that it would work as the full barrier.
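
A rough sketch of where the two pieces above would sit (this only
restates the suggestion in context, it is not a tested change):

	/* Reader side (prb_next_seq() / desc_update_last_finalized()): */
	seq = desc_last_finalized_seq(desc_ring);

	/* Pairs with the full barrier of the non-relaxed cmpxchg below. */
	smp_rmb();

	while (_prb_read_valid(rb, &seq, NULL, NULL))
		seq++;

	/* Writer side (desc_update_last_finalized()): full-barrier variant. */
	atomic_long_try_cmpxchg(&desc_ring->last_finalized_seq, &oldval, newval);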


>
> - if (d_state == desc_finalized || d_state == desc_reusable) {
> - /*
> - * Begin searching after the last finalized record.
> - *
> - * On 0, the search must begin at 0 because of hack#2
> - * of the bootstrapping phase it is not known if a
> - * record at index 0 exists.
> - */
> - if (seq != 0)
> - seq++;
> - } else {
> - /*
> - * The information about the last finalized sequence number
> - * has gone. It should happen only when there is a flood of
> - * new messages and the ringbuffer is rapidly recycled.
> - * Give up and start from the beginning.
> - */
> - seq = 0;
> - }
> + /*
> + * Begin searching after the last finalized record.
> + *
> + * On 0, the search must begin at 0 because of hack#2
> + * of the bootstrapping phase it is not known if a
> + * record at index 0 exists.
> + */
> + if (seq != 0)
> + seq++;
>
> /*
> * The information about the last finalized @seq might be inaccurate.

Best Regards,
Petr

2023-10-26 08:07:32

by Petr Mladek

Subject: Re: [RFC PATCH printk v1] printk: ringbuffer: Do not skip non-finalized with prb_next_seq()

On Wed 2023-10-25 17:16:09, Petr Mladek wrote:
> On Thu 2023-10-19 15:31:45, John Ogness wrote:
> > --- a/kernel/printk/printk_ringbuffer.c
> > +++ b/kernel/printk/printk_ringbuffer.c
> > +static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
> > + struct printk_record *r, unsigned int *line_count);
> > +
> > +/*
> > + * Check if there are records directly following @last_finalized_seq that are
> > + * finalized. If so, update @last_finalized_seq to the latest of these
> > + * records. It is not allowed to skip over records that are not yet finalized.
>
> I would add some details about what expect from this value. Something
> like:
>
> * @last_finalized_seq value guarantees that all records up to this
> * sequence number are finalized and might be read. The only exception
> * are too old records which have already been overwritten.
> *
> * Also it is guaranteed that the value can only grow.
> *
> * But it is just a best effort. There is _no_ synchronization between
> * writers finalizing records and @last_finalized_seq updates. The result
> * might be a lower value because updaters might not see finalized
> * records from other CPUs.
> *
> * There is _no_ easy way to serialize these two operations. It would
> * require remembering two values, which might be handled in:
> *
> * @last_finalized_id in desc_make_final()
> * @last_readable_seq in desc_update_last_readable_seq()
> *
> * and these two values would need to be updated atomically together
> * so that only the last updater of the @id part would be allowed
> * to update the @seq part. Also it would require a barrier so
> * that each writer would see the finalized state from other
> * writers whom updated the @id part earlier.
> *
> * Summary:
> *
> * This value might be used by readers only to check the last
> * readable seq number at the given time. They must count with
> * the fact that new records might appear at any time.
> *
> * Beware that this value is not increased with every finalized
> * record. It stays the same when there are not-yet-finalized
> * records with lower sequence number.
> *
> * In particular, it does not guarantee that readers woken
> * via wake_up_klogd would be able to flush all messages
> * up to the one just stored by vprintk_store(). For example,
> * it does not guarantee that console_unlock() would flush
> * all pending messages.
> */

The last paragraph sounds scary. But it is a pretty old problem.
I was curious why nobody noticed it.

IMHO, we are mostly safe in practice because:

1. Every printk() either tries to flush the consoles or
queues irq_work to do a deferred flush.

2. Every printk() calls wake_up_klogd() to wake user space
log daemons.

3. console_unlock() checks prb_next_seq() after releasing
the semaphore. It goes back and flushes any message
finalized in parallel or the parallel writer is able
to take the console semaphore.

In other words, even when one flush is not able to flush
everything, there should always be someone scheduled
who will flush the rest in the near future.

The only problem might be a missing barrier when some CPU
sees a finalized record and others do not see it. But it is
probably very hard to hit in practice.

Anyway, I haven't been aware of this prb_next_seq() limitation
until last week. We should keep it in mind or improve it.
But it seems that it is not that critical in the end.

Best Regards,
Petr

2023-11-02 13:49:09

by John Ogness

Subject: Re: [RFC PATCH printk v1] printk: ringbuffer: Do not skip non-finalized with prb_next_seq()

On 2023-10-25, Petr Mladek <[email protected]> wrote:
> there seems to be missing word in the subject:
>
> s/non-finalized/non-finalized records/

Ack.

> On Thu 2023-10-19 15:31:45, John Ogness wrote:
>> Commit f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq()
>> performance") introduced an optimization for prb_next_seq() by
>> using best-effort to track recently finalized records. However,
>> the order of finalization does not necessarily match the order
>> of the records. This can lead to prb_next_seq() returning
>> higher than desired sequence numbers, which results in the
>> reader skipping over records that are not yet finalized. From
>> the reader's perspective it results in messages never being
>> seen.
>
> IMHO, "messages never being seen" is too strong.

Agreed. A reader does not use prb_next_seq() to decide what to print
next. Worst case it thinks records are available that are not (available
for that reader).

> I have found only one (or two) scenarios where the messages might
> really get lost.
>
> 1. It might happen when real console is replacing a boot console.
> The real console is initialized with the value returned
> by prb_next_seq(). And the boot console might not be able
> to flush earlier non-finalized records.

This cannot happen because in this situation console_init_seq() sets
@seq to the lowest boot console counter.

> 2. The other scenario is based on the fact that console_unlock()
> or pr_flush() might see lower prb_next_seq() than the last
> reserved record. It means that they might not flush all
> pending records.
>
> But wait! This is actually the opposite case. pr_flush()
> and console_unlock() might miss the messages when
> they see too low prb_next_seq().
>
> Important: This problem existed since introducing
> the lockless ring buffer!
>
> The question is. Should pr_flush() and console_unlock()
> wait until all registered messages get finalized?
>
> It would need to ignore only the last record when it
> is not finalized because it might be a continuous line.

Yes, this is the question to answer.

With the lockless ringbuffer we allow multiple CPUs/contexts to write
simultaneously into the buffer. This creates an ambiguity as some
writers will finalize sooner.

IMHO we need 2 different functions:

1. A function that reports the last contiguous finalized record for a
reader. This is useful for syslog and kmsg_dump to know what is
available for them to read. We can use @last_finalized_seq for this,
optimizing it correctly this time.

2. A function that reports the last reserved sequence number of a
writer. This is useful for pr_flush and console_unlock to know when they
are finished. This function can begin with @last_finalized_seq, looking
for the last finalized record (skipping over any non-finalized).
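
For the sake of discussion, the two interfaces could look something like
this (the names below are only placeholders, not part of the patch):

	/*
	 * 1) For readers (syslog, kmsg_dump): everything below the returned
	 *    sequence number is finalized and readable (unless it has
	 *    already been overwritten). Based on @last_finalized_seq.
	 */
	u64 prb_next_seq(struct printk_ringbuffer *rb);

	/*
	 * 2) For writers (pr_flush(), console_unlock()): the sequence number
	 *    following the newest reserved record, found by starting at
	 *    @last_finalized_seq and skipping over non-finalized records.
	 */
	u64 prb_next_reserve_seq(struct printk_ringbuffer *rb);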

> I agree that this might be optimized. I think about reducing the
> number of cmpxchg even more, something like:
>
> static void desc_update_last_finalized(struct prb_desc_ring *desc_ring)
> {
> struct prb_desc_ring *desc_ring = &rb->desc_ring;
> u64 prev_seq = desc_last_finalized_seq(desc_ring);
> u64 seq = prev_seq;
>
> try_again:
> while (_prb_read_valid(rb, &seq, NULL, NULL))
> seq++;
>
> if (seq == prev_seq)
> return;
>
> oldval = __u64seq_to_ulseq(prev_seq);
> newval = __u64seq_to_ulseq(seq);
>
> if (!atomic_long_try_cmpxchg_relaxed(&desc_ring->last_finalized_seq,
> &oldval, newval)) {
> prev_seq = seq;
> goto try_again;
> }
> }

I am fine with this implementation.

> It looks to me that we could keep passing desc_ring as the parameter.

No, _prb_read_valid() needs it.

> I feel that we need a read barrier here. It should be between the
> above
>
> atomic_long_read(&desc_ring->last_finalized_seq))
>
> and the below
>
> while (_prb_read_valid(rb, &seq, NULL, NULL))
> seq++;
>
> It should make sure that the _prb_read_valid() will see all messages
> finalized which were seen finalized by the CPU updating
> desc_ring->last_finalized_seq.

Generally we have not concerned ourselves with readers. But I agree we
should make the optimization coherent with what a reader can actually
read. It might save some CPU cycles for polling tasks.

Writing and reading of @last_finalized_seq will provide the necessary
boundaries to guarantee this:

...finalize record...
atomic_long_try_cmpxchg_release(&desc_ring->last_finalized_seq, ...);

and

atomic_long_read_acquire(&desc_ring->last_finalized_seq);
...read record...

This guarantees that if a reader sees a certain @last_finalized_seq
value, that they will also see the record that was finalized.

This will be the 13th memory barrier pair to be added to the
documentation.
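
For reference, the pairing above is the classic message-passing pattern.
A litmus test along these lines (simplified: the release cmpxchg is
modeled as smp_store_release() and a single int stands in for the
finalized record) should report that the "exists" clause can never be
satisfied:

	C prb-last_finalized_seq-release-acquire

	{}

	P0(int *record, int *last_finalized_seq)
	{
		WRITE_ONCE(*record, 1);
		smp_store_release(last_finalized_seq, 1);
	}

	P1(int *record, int *last_finalized_seq)
	{
		int seq;
		int r;

		seq = smp_load_acquire(last_finalized_seq);
		r = READ_ONCE(*record);
	}

	exists (1:seq=1 /\ 1:r=0)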

I would like to submit a new patch implementing things as described
here.

John

2023-11-03 12:33:09

by Petr Mladek

Subject: Re: [RFC PATCH printk v1] printk: ringbuffer: Do not skip non-finalized with prb_next_seq()

On Thu 2023-11-02 14:54:23, John Ogness wrote:
> On 2023-10-25, Petr Mladek <[email protected]> wrote:
> > there seems to be missing word in the subject:
> >
> > s/non-finalized/non-finalized records/
>
> Ack.
>
> > On Thu 2023-10-19 15:31:45, John Ogness wrote:
> >> Commit f244b4dc53e5 ("printk: ringbuffer: Improve prb_next_seq()
> >> performance") introduced an optimization for prb_next_seq() by
> >> using best-effort to track recently finalized records. However,
> >> the order of finalization does not necessarily match the order
> >> of the records. This can lead to prb_next_seq() returning
> >> higher than desired sequence numbers, which results in the
> >> reader skipping over records that are not yet finalized. From
> >> the reader's perspective it results in messages never being
> >> seen.
> >
> > IMHO, "messages never being seen" is too strong.
>
> Agreed. A reader does not use prb_next_seq() to decide what to print
> next. Worst case it thinks records are available that are not (available
> for that reader).
>
> > I have found only one (or two) scenarios where the messages might
> > really get lost.
> >
> > 1. It might happen when real console is replacing a boot console.
> > The real console is initialized with the value returned
> > by prb_next_seq(). And the boot console might not be able
> > to flush earlier non-finalized records.
>
> This cannot happen because in this situation console_init_seq() sets
> @seq to the lowest boot console counter.

You are right.

> > 2. The other scenario is based on the fact that console_unlock()
> > or pr_flush() might see lower prb_next_seq() than the last
> > reserved record. It means that they might not flush all
> > pending records.
> >
> > But wait! This is actually the opposite case. pr_flush()
> > and console_unlock() might miss the messages when
> > they see too low prb_next_seq().
> >
> > Important: This problem existed since introducing
> > the lockless ring buffer!
> >
> > The question is. Should pr_flush() and console_unlock()
> > wait until all registered messages get finalized?
> >
> > It would need to ignore only the last record when it
> > is not finalized because it might be a continuous line.
>
> Yes, this is the question to answer.
>
> With the lockless ringbuffer we allow multiple CPUs/contexts to write
> simultaneously into the buffer. This creates an ambiguity as some
> writers will finalize sooner.
>
> IMHO we need 2 different functions:

I thought about this as well.

> 1. A function that reports the last contiguous finalized record for a
> reader. This is useful for syslog and kmsg_dump to know what is
> available for them to read. We can use @last_finalized_seq for this,
> optimizing it correctly this time.

I would use this for console_unlock() as well; see below.

> 2. A function that reports the last reserved sequence number of a
> writer. This is useful for pr_flush and console_unlock to know when they
> are finished.

This looks like the right solution for pr_flush() when it is called
with a timeout.

I am a bit afraid that console_unlock() or pr_flush() without a timeout
might get blocked when some CPU is not able to finalize a message.
It is realistic, especially during panic() when other CPUs might
get stopped by NMI.

Anyway, it would require waiting for the pending writers.
And only pr_flush() uses the sleep between checks.

So, I would personally use the @last_finalized_seq for
console_unlock() and pr_flush() without a timeout. We could
always call defer_console_output() when it is lower than
the last reserved seq.

Well, we actually do not even need to do this because
the reserved records must be added by some printk().
And this printk() will either flush the pending messages
or it will call defer_console_output().

The above paragraph describes a scenario which is not obvious.
We should probably document it somewhere, probably in the description
of prb_last_finalized_seq() or how it will be called.

> This function can begin with @last_finalized_seq, looking
> for the last finalized record (skipping over any non-finalized).

I thought about using desc_ring->head_id or looking for the
last reserved sequence number.

Well, it is just a question where to cut the waiting. And
the very last finalized sequence number might be good
as well.

> > I agree that this might be optimized. I think about reducing the
> > number of cmpxchg even more, something like:
> >
> > static void desc_update_last_finalized(struct prb_desc_ring *desc_ring)
> > {
> > struct prb_desc_ring *desc_ring = &rb->desc_ring;
> > u64 prev_seq = desc_last_finalized_seq(desc_ring);
> > u64 seq = prev_seq;
> >
> > try_again:
> > while (_prb_read_valid(rb, &seq, NULL, NULL))
> > seq++;
> >
> > if (seq == prev_seq)
> > return;
> >
> > oldval = __u64seq_to_ulseq(prev_seq);
> > newval = __u64seq_to_ulseq(seq);
> >
> > if (!atomic_long_try_cmpxchg_relaxed(&desc_ring->last_finalized_seq,
> > &oldval, newval)) {
> > prev_seq = seq;
> > goto try_again;
> > }
> > }
>
> I am fine with this implementation.
>
> > It looks to me that we could keep passing desc_ring as the parameter.
>
> No, _prb_read_valid() needs it.

Ah, I see.

> > I feel that we need a read barrier here. It should be between the
> > above
> >
> > atomic_long_read(&desc_ring->last_finalized_seq))
> >
> > and the below
> >
> > while (_prb_read_valid(rb, &seq, NULL, NULL))
> > seq++;
> >
> > It should make sure that the _prb_read_valid() will see all messages
> > finalized which were seen finalized by the CPU updating
> > desc_ring->last_finalized_seq.
>
> Generally we have not concerned ourselves with readers. But I agree we
> should make the optimization coherent with what a reader can actually
> read. It might save some CPU cycles for polling tasks.

I wanted to agree. But then I found this scenario:

CPU0                                    CPU1

console_unlock()
console_flush_all()

                                        printk()
                                        vprintk_store()
                                        return;
                                        prb_final_commit;

                                        console_trylock(); # failed

while (prb_read_valid());


Now, the race:

+ console_flush_all() did not flush the message from CPU1 because
it was not finalized in time.

+ CPU1 failed to get console_lock() => CPU0 is responsible for
flushing

+ prb_read_valid() failed on CPU0 because it did not see
the prb_desc finalized (missing barrier).


Result: The message does not reach the console until another printk()
tries to flush messages.

Sigh: Even the barrier in prb_next_seq() would not help in this case
because the while loop checks prb_read_valid().

Reason: The race is on printk/console_unlock() level. We would need
something like:

vprintk_emit()
{

	vprintk_store()

	if (!in_sched) {
		/*
		 * Make sure that either the current console_lock() owner
		 * will see the finalized messages or we will see
		 * the free console_lock().
		 */
		smp_mb();

		if (console_trylock_spinning())
			console_unlock();
	} ...

}

and

console_unlock()
{
	do {
		console_flush_all();

		__console_unlock();

		/*
		 * Make sure that we either see a newly stored message
		 * or the writer will see the free console_lock().
		 */
		smp_mb();
	} while (prb_read_valid(prb, next_seq, NULL) && console_trylock());
}

I believe that it is not needed for defer_console_output()
and wake_up_klogd() because there the barrier is hidden
in the irq_work API.

We need it only for console_unlock() because there is a race
when console_lock() is temporarily released.


> Writing and reading of @last_finalized_seq will provide the necessary
> boundaries to guarantee this:
>
> ...finalize record...
> atomic_long_try_cmpxchg_release(&desc_ring->last_finalized_seq, ...);
>
> and
>
> atomic_long_read_acquire(&desc_ring->last_finalized_seq);
> ...read record...

Yup. something like this.

Well, it is suspicious that there is no _release() counterpart.
It suggests that this might not be the right solution. Maybe
the barriers really need to be on the reader side.

> This guarantees that if a reader sees a certain @last_finalized_seq
> value, that they will also see the record that was finalized.
>
> This will be the 13th memory barrier pair to be added to the
> documentation.

I know. We need to be careful with the design so that we do not
make it over-complicated.

Well, I still feel that the barrier in desc_update_last_finalized()
and prb_next_seq() might be needed when the prb_next_seq()
caller gives up when there is no progress (no new message
can be processed).

And I think that there will be some users in the nbcon code.

But again, this need not be the right location. Usually,
the prb_next_seq() caller will be responsible for flushing
new messages. And there has to be a barrier between the
writer and the flusher like in the vprintk_emit()/console_unlock()
case above.

I mean that the flusher either has to see the bigger prb_next_seq()
or it has to be scheduled to check it again later. And this
needs to be handled by a barrier in vprintk_emit().


Sigh, sigh, sigh, my brain is cycling now ;-)

My observation that atomic_long_read_acquire() in prb_next_seq()
would not have a _release() counterpart suggests that
the prb API for readers does not help with synchronization.

I have to think more about it. Maybe some acquire/release
or synchronized for_each_finalized_seq() API
might be useful.

Or maybe the synchronization has to be on the printk() level.


Anyway, I feel that this might fall into the "perfection is
the enemy of good" category. We only need to make sure that
the flush will not get stuck, and that the final flush will try
hard.

The races are probably hard to hit in practice.


Best Regards,
Petr

2023-11-03 13:32:10

by John Ogness

Subject: Re: [RFC PATCH printk v1] printk: ringbuffer: Do not skip non-finalized with prb_next_seq()

On 2023-11-03, Petr Mladek <[email protected]> wrote:
>> IMHO we need 2 different functions:
>>
>> 1. A function that reports the last contiguous finalized record for a
>> reader. This is useful for syslog and kmsg_dump to know what is
>> available for them to read. We can use @last_finalized_seq for this,
>> optimizing it correctly this time.
>
> I would use this also for console_unlock() as well, see below.

OK.

>> 2. A function that reports the last reserved sequence number of a
>> writer. This is useful for pr_flush and console_unlock to know when they
>> are finished.
>
> I would personally use the @last_finalized_seq for
> console_unlock() and pr_flush() without a timeout. We could
> always call defer_console_output() when it is lower then
> the last reserved seq.
>
> Well, we actually do not even need to do this because
> the reserved records must be added by some printk().
> And this printk() will either flush the pending messages
> or it will call defer_console_output().

OK.

> The above paragraph describes a scenario which is not obvious.
> We should probably document it somewhere, probably in the description
> of prb_last_finalized_seq() or how it will be called.

OK.

>> This function can begin with @last_finalized_seq, looking
>> for the last finalized record (skipping over any non-finalized).
>
> I thought about using desc_ring->head_id or looking for the
> last reserved sequence number.

The problem with @head_id is that the sequence number may not be
assigned yet. Really @last_finalized_seq is the newest sequence number
we have to search from.

>> Generally we have not concerned ourselves with readers. But I agree we
>> should make the optimization coherent with what a reader can actually
>> read. It might save some CPU cycles for polling tasks.
>
> I wanted to agree. But then I found this scenario:
>
> CPU0 CPU1
>
> console_unlock()
> console_flush_all()
>
> printk()
> vprintk_store()
> return;
> prb_final_commit;
>
> console_trylock(); # failed
>
> while (prb_read_valid());
>
> Now, the race:
>
> + console_flush_all() did not flush the message from CPU1 because
> it was not finalized in time.
>
> + CPU1 failed to get console_lock() => CPU0 is responsible for
> flushing
>
> + prb_read_valid() failed on CPU0 because it did not see
> the prb_desc finalized (missing barrier).

For semaphores, up() and down_trylock() successfully take and release a
raw spin lock. That provides the necessary barriers so that CPU0 sees
the record that CPU1 finalized.
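
For reference, kernel/locking/semaphore.c currently implements it
roughly like this (quoted from memory, so details may differ); note
that even the failing case goes through a lock/unlock pair on
sem->lock:

	int __sched down_trylock(struct semaphore *sem)
	{
		unsigned long flags;
		int count;

		raw_spin_lock_irqsave(&sem->lock, flags);
		count = sem->count - 1;
		if (count >= 0)
			sem->count = count;
		raw_spin_unlock_irqrestore(&sem->lock, flags);

		return (count < 0);
	}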

>> Writing and reading of @last_finalized_seq will provide the necessary
>> boundaries to guarantee this:
>>
>> ...finalize record...
>> atomic_long_try_cmpxchg_release(&desc_ring->last_finalized_seq, ...);
>>
>> and
>>
>> atomic_long_read_acquire(&desc_ring->last_finalized_seq);
>> ...read record...
>
> Yup. something like this.
>
> Well, it is suspicious that there is no _release() counter part.

Take a closer look above. The cmpxchg (on the writer side) does the
release. I have the litmus tests to verify that is correct and
sufficient for what we want: to guarantee that for any read
@last_finalized_seq value, the CPU can also read the associated record.

I am finalizing a new version of the "fix console flushing on panic"
series [0] that will also include the prb_next_seq() fix. If needed, we
can continue this discussion based on the new code.

John

[0] https://lore.kernel.org/lkml/[email protected]

2023-11-03 15:58:19

by Petr Mladek

Subject: Re: [RFC PATCH printk v1] printk: ringbuffer: Do not skip non-finalized with prb_next_seq()

On Fri 2023-11-03 14:37:23, John Ogness wrote:
> On 2023-11-03, Petr Mladek <[email protected]> wrote:
> >> Generally we have not concerned ourselves with readers. But I agree we
> >> should make the optimization coherent with what a reader can actually
> >> read. It might save some CPU cycles for polling tasks.
> >
> > I wanted to agree. But then I found this scenario:
> >
> > CPU0 CPU1
> >
> > console_unlock()
> > console_flush_all()
> >
> > printk()
> > vprintk_store()
> > return;
> > prb_final_commit;
> >
> > console_trylock(); # failed

__console_unlock();

while (prb_read_valid() || console_trylock());

I added the __console_unlock() and console_trylock()
to have the full picture.

> >
> > Now, the race:
> >
> > + console_flush_all() did not flush the message from CPU1 because
> > it was not finalized in time.
> >
> > + CPU1 failed to get console_lock() => CPU0 is responsible for
> > flushing
> >
> > + prb_read_valid() failed on CPU0 because it did not see
> > the prb_desc finalized (missing barrier).
>
> For semaphores, up() and down_trylock() successfully take and release a
> raw spin lock. That provides the necessary barriers so that CPU0 sees
> the record that CPU1 finalized.

I see. Hmm, we should document this. The spinlock is an implementation
detail.

IMHO, up()/down()/down_trylock() are quite special. I think
that the theory is that lock() and unlock() are one-way barriers.
And trylock() is a one-way barrier only when it succeeds.

In other words, _trylock() on CPU1 normally would not guarantee
that CPU0 would see the finalized record after unlock(). [*]


Maybe, we could rely on the existing behavior but we should at least
document it.

Note that I have thought many times about using spin_trylock() in
down_trylock(). It would make more sense because trylock()
should not be a blocking operation. But I see now
that it might break users depending on the full barrier.


Note: It is possible that I got it wrong. It is so easy to get lost
in barriers.


> >> Writing and reading of @last_finalized_seq will provide the necessary
> >> boundaries to guarantee this:
> >>
> >> ...finalize record...
> >> atomic_long_try_cmpxchg_release(&desc_ring->last_finalized_seq, ...);
> >>
> >> and
> >>
> >> atomic_long_read_acquire(&desc_ring->last_finalized_seq);
> >> ...read record...
> >
> > Yup. something like this.
> >
> > Well, it is suspicious that there is no _release() counter part.
>
> Take a closer look above. The cmpxchg (on the writer side) does the
> release. I have the litmus tests to verify that is correct and
> sufficient for what we want: to guarantee that for any read
> @last_finalized_seq value, the CPU can also read the associated record.

The barrier is only in _prb_commit() which sets the committed state.

desc_make_final() uses cmpxchg_relaxed(), so it is not guaranteed
that other CPUs will see the "finalized" state.

> I am finalizing a new version of the "fix console flushing on panic"
> series [0] that will also include the prb_next_seq() fix. If needed, we
> can continue this discussion based on the new code.

OK, maybe it would look different in the new version. And it might
be better to continue the discussion on the new code.

Best Regards,
Petr