Hello,
This is a follow-up RFC on the work to re-implement much of
the core of printk. The threads for the previous RFC versions
are here: v1[0], v2[1], v3[2].
This series only builds upon v3 (i.e. the first part of this
series is exactly v3). The main purpose of this series is to
replace the current printk ringbuffer with the new
ringbuffer. As was discussed[3], this is a conservative
first step to rework printk. For example, all logbuf_lock
usage is kept even though the new ringbuffer does not
require it. This avoids any side-effect bugs in case the
logbuf_lock is (unintentionally) synchronizing more than
just the ringbuffer. However, this also means that the
series does not bring any improvements, just swapping out
implementations. A future patch will remove the logbuf_lock.
Except for the test module (patches 2 and 6), the rest may
already be interesting for mainline as is. I have tested
the various interfaces (console, /dev/kmsg, syslog,
kmsg_dump) and their features and all looks good AFAICT.
The patches can be broken down as follows:
1-2: the previously posted RFCv3
3-7: addresses minor issues from RFCv3
8: adds new high-level ringbuffer functions to support
printk (nothing involving new memory barriers)
9: replace the ringbuffer usage in printk.c
One important thing to know (as is mentioned in the commit
message of patch 9), there are 2 externally visible
changes:
- vmcore info changes
- powerpc powernv/opal memdump of log discontinued
I have no idea how acceptable these changes are.
I will not be posting any further printk patches until I
have received some feedback on this. I appreciate all the
help so far. I realize that this is a lot of code to go
through.
The series is based on 5.3-rc3. I would encourage people to
apply the series and give it a run. I expect that you
will not notice any difference with your printk behaviour.
John Ogness
[0] https://lkml.kernel.org/r/[email protected]
[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
[3] https://lkml.kernel.org/r/[email protected]
John Ogness (9):
printk-rb: add a new printk ringbuffer implementation
printk-rb: add test module
printk-rb: fix missing includes/exports
printk-rb: initialize new descriptors as invalid
printk-rb: remove extra data buffer size allocation
printk-rb: adjust test module ringbuffer sizes
printk-rb: increase size of seq and size variables
printk-rb: new functionality to support printk
printk: use a new ringbuffer implementation
arch/powerpc/platforms/powernv/opal.c | 22 +-
include/linux/kmsg_dump.h | 6 +-
include/linux/printk.h | 12 -
kernel/printk/Makefile | 5 +
kernel/printk/dataring.c | 809 ++++++++++++++++++
kernel/printk/dataring.h | 108 +++
kernel/printk/numlist.c | 376 +++++++++
kernel/printk/numlist.h | 72 ++
kernel/printk/printk.c | 745 +++++++++--------
kernel/printk/ringbuffer.c | 1079 +++++++++++++++++++++++++
kernel/printk/ringbuffer.h | 354 ++++++++
kernel/printk/test_prb.c | 256 ++++++
12 files changed, 3450 insertions(+), 394 deletions(-)
create mode 100644 kernel/printk/dataring.c
create mode 100644 kernel/printk/dataring.h
create mode 100644 kernel/printk/numlist.c
create mode 100644 kernel/printk/numlist.h
create mode 100644 kernel/printk/ringbuffer.c
create mode 100644 kernel/printk/ringbuffer.h
create mode 100644 kernel/printk/test_prb.c
--
2.20.1
Add missing includes and exports.
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/dataring.c | 1 +
kernel/printk/numlist.c | 1 +
kernel/printk/ringbuffer.c | 4 ++++
3 files changed, 6 insertions(+)
diff --git a/kernel/printk/dataring.c b/kernel/printk/dataring.c
index 911bac593ec1..6642e085a05d 100644
--- a/kernel/printk/dataring.c
+++ b/kernel/printk/dataring.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
#include "dataring.h"
/**
diff --git a/kernel/printk/numlist.c b/kernel/printk/numlist.c
index df3f89e7f7fd..d5e224dafc0c 100644
--- a/kernel/printk/numlist.c
+++ b/kernel/printk/numlist.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
#include <linux/sched.h>
#include "numlist.h"
diff --git a/kernel/printk/ringbuffer.c b/kernel/printk/ringbuffer.c
index 59bf59aba3de..b9fc13597b10 100644
--- a/kernel/printk/ringbuffer.c
+++ b/kernel/printk/ringbuffer.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
#include <linux/irqflags.h>
#include <linux/string.h>
#include <linux/err.h>
@@ -222,6 +223,7 @@ struct nl_node *prb_desc_node(unsigned long id, void *arg)
return &d->list;
}
+EXPORT_SYMBOL(prb_desc_node);
/**
* prb_desc_busy() - Numbered list callback to report if a node is busy.
@@ -262,6 +264,7 @@ bool prb_desc_busy(unsigned long id, void *arg)
/* hC: */
return (id == atomic_long_read(&d->id));
}
+EXPORT_SYMBOL(prb_desc_busy);
/**
* prb_getdesc() - Data ringbuffer callback to lookup a descriptor from an ID.
@@ -296,6 +299,7 @@ struct dr_desc *prb_getdesc(unsigned long id, void *arg)
/* iB: */
return &d->desc;
}
+EXPORT_SYMBOL(prb_getdesc);
/**
* assign_desc() - Assign a descriptor to the caller.
--
2.20.1
Initialize never-used descriptors as permanently invalid so there
is no risk of the descriptor unexpectedly being determined as
valid due to dataring head overflowing/wrapping.
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/dataring.c | 42 +++++++++++++++++++++++++++-----------
kernel/printk/dataring.h | 12 +++++++++++
kernel/printk/ringbuffer.c | 5 +++++
kernel/printk/ringbuffer.h | 7 ++++++-
4 files changed, 53 insertions(+), 13 deletions(-)
diff --git a/kernel/printk/dataring.c b/kernel/printk/dataring.c
index 6642e085a05d..345c46dba5bb 100644
--- a/kernel/printk/dataring.c
+++ b/kernel/printk/dataring.c
@@ -316,8 +316,8 @@ static unsigned long _dataring_pop(struct dataring *dr,
* If dB reads from gA, then dD reads from fH.
* If dB reads from gA, then dE reads from fE.
*
- * Note that if dB reads from gA, then dC cannot read from fC.
- * Note that if dB reads from gA, then dD cannot read from fD.
+ * Note that if dB reads from gA, then dC cannot read from fC->nA.
+ * Note that if dB reads from gA, then dD cannot read from fC->nB.
*
* Relies on:
*
@@ -489,6 +489,30 @@ static bool get_new_lpos(struct dataring *dr, unsigned int size,
}
}
+/**
+ * dataring_desc_init() - Initialize a descriptor to be permanently invalid.
+ *
+ * @desc: The descriptor to initialize.
+ *
+ * Setting a descriptor to be permanently invalid means that there is no risk
+ * of the descriptor later unexpectedly being determined as valid due to
+ * overflowing/wrapping of @head_lpos.
+ */
+void dataring_desc_init(struct dr_desc *desc)
+{
+ /*
+ * An unaligned @begin_lpos can never point to a data block and
+ * having the same value for @begin_lpos and @next_lpos is also
+ * invalid.
+ */
+
+ /* nA: */
+ WRITE_ONCE(desc->begin_lpos, 1);
+
+ /* nB: */
+ WRITE_ONCE(desc->next_lpos, 1);
+}
+
/**
* dataring_push() - Reserve a data block in the data array.
*
@@ -568,20 +592,14 @@ char *dataring_push(struct dataring *dr, unsigned int size,
if (!ret) {
/*
+ * fC:
+ *
* Force @desc permanently invalid to minimize risk
* of the descriptor later unexpectedly being
* determined as valid due to overflowing/wrapping of
- * @head_lpos. An unaligned @begin_lpos can never
- * point to a data block and having the same value
- * for @begin_lpos and @next_lpos is also invalid.
+ * @head_lpos.
*/
-
- /* fC: */
- WRITE_ONCE(desc->begin_lpos, 1);
-
- /* fD: */
- WRITE_ONCE(desc->next_lpos, 1);
-
+ dataring_desc_init(desc);
return NULL;
}
/* fE: */
diff --git a/kernel/printk/dataring.h b/kernel/printk/dataring.h
index 346a455a335a..413ee95f4dd6 100644
--- a/kernel/printk/dataring.h
+++ b/kernel/printk/dataring.h
@@ -43,6 +43,17 @@ struct dr_desc {
unsigned long next_lpos;
};
+/*
+ * Initialize a descriptor to be permanently invalid so there is no risk
+ * of the descriptor later unexpectedly being determined as valid due to
+ * overflowing/wrapping of @head_lpos.
+ */
+#define __DR_DESC_INITIALIZER \
+ { \
+ .begin_lpos = 1, \
+ .next_lpos = 1, \
+ }
+
/**
* struct dataring - A data ringbuffer with support for entry IDs.
*
@@ -90,6 +101,7 @@ void dataring_datablock_setid(struct dataring *dr, struct dr_desc *desc,
struct dr_datablock *dataring_getdatablock(struct dataring *dr,
struct dr_desc *desc, int *size);
bool dataring_datablock_isvalid(struct dataring *dr, struct dr_desc *desc);
+void dataring_desc_init(struct dr_desc *desc);
void dataring_desc_copy(struct dr_desc *dst, struct dr_desc *src);
#endif /* _KERNEL_PRINTK_DATARING_H */
diff --git a/kernel/printk/ringbuffer.c b/kernel/printk/ringbuffer.c
index b9fc13597b10..9be841761ea2 100644
--- a/kernel/printk/ringbuffer.c
+++ b/kernel/printk/ringbuffer.c
@@ -345,6 +345,11 @@ static bool assign_desc(struct prb_reserved_entry *e)
if (i < DESCS_COUNT(rb)) {
d = &rb->descs[i];
atomic_long_set(&d->id, i);
+ /*
+ * Initialize the new descriptor such that
+ * it is permanently invalid.
+ */
+ dataring_desc_init(&d->desc);
break;
}
}
diff --git a/kernel/printk/ringbuffer.h b/kernel/printk/ringbuffer.h
index 9fe54a09fbc2..462b4d3a3ee2 100644
--- a/kernel/printk/ringbuffer.h
+++ b/kernel/printk/ringbuffer.h
@@ -178,7 +178,12 @@ struct dr_desc *prb_getdesc(unsigned long id, void *arg);
char _##name##_data[(1 << ((avgdatabits) + (descbits))) + \
sizeof(long)] \
__aligned(__alignof__(long)); \
-struct prb_desc _##name##_descs[1 << (descbits)]; \
+struct prb_desc _##name##_descs[1 << (descbits)] = { \
+ { \
+ .id = ATOMIC_LONG_INIT(0), \
+ .desc = __DR_DESC_INITIALIZER, \
+ }, \
+ }; \
struct printk_ringbuffer name = { \
.desc_count_bits = descbits, \
.descs = &_##name##_descs[0], \
--
2.20.1
The buffer for the raw data storage included extra space at the
end for a long. This was meant to guarantee space for the ID of a
wrapping datablock. However, since datablocks are padded and the
dataring is implemented such that no datablock can end at exactly
the end of the data buffer:
DATA_WRAPS(begin) != DATA_WRAPS(next)
there will always be space available for the ID of a wrapping
datablock.
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/ringbuffer.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/printk/ringbuffer.h b/kernel/printk/ringbuffer.h
index 462b4d3a3ee2..698d2328ea9e 100644
--- a/kernel/printk/ringbuffer.h
+++ b/kernel/printk/ringbuffer.h
@@ -175,8 +175,7 @@ struct dr_desc *prb_getdesc(unsigned long id, void *arg);
* * descriptor 1 will be the next descriptor
*/
#define DECLARE_PRINTKRB(name, avgdatabits, descbits) \
-char _##name##_data[(1 << ((avgdatabits) + (descbits))) + \
- sizeof(long)] \
+char _##name##_data[1 << ((avgdatabits) + (descbits))] \
__aligned(__alignof__(long)); \
struct prb_desc _##name##_descs[1 << (descbits)] = { \
{ \
--
2.20.1
The ringbuffer documents that the expected average size value
should be lower than the actual average. For the test module the
average should be 64, so set the expected average to 5 bits (32).
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/test_prb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/printk/test_prb.c b/kernel/printk/test_prb.c
index 1ecb4fcbf823..77d298b6990a 100644
--- a/kernel/printk/test_prb.c
+++ b/kernel/printk/test_prb.c
@@ -63,7 +63,7 @@ static void dump_rb(struct printk_ringbuffer *rb)
trace_printk("END full dump\n");
}
-DECLARE_PRINTKRB(test_rb, 7, 5);
+DECLARE_PRINTKRB(test_rb, 5, 7);
static int prbtest_writer(void *data)
{
--
2.20.1
The printk implementation will rely on sequence numbers never
wrapping. For 32-bit systems, an unsigned long for sequence
numbers is not acceptable. Change the sequence number to u64.
Size variables are currently unsigned int, which may not be
acceptable for 64-bit systems. Change size variables to
unsigned long. (32-bit sizes on 32-bit systems should be fine.)
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/dataring.c | 8 ++++----
kernel/printk/dataring.h | 4 ++--
kernel/printk/numlist.c | 6 +++---
kernel/printk/numlist.h | 6 +++---
kernel/printk/ringbuffer.c | 6 +++---
kernel/printk/ringbuffer.h | 10 +++++-----
kernel/printk/test_prb.c | 14 +++++++-------
7 files changed, 27 insertions(+), 27 deletions(-)
diff --git a/kernel/printk/dataring.c b/kernel/printk/dataring.c
index 345c46dba5bb..712842f2dc04 100644
--- a/kernel/printk/dataring.c
+++ b/kernel/printk/dataring.c
@@ -152,7 +152,7 @@ static struct dr_datablock *to_datablock(struct dataring *dr,
* than or equal to the size of @dr_datablock.id. This ensures that there is
* always space in the data array for the @id of a "wrapping" data block.
*/
-static void to_db_size(unsigned int *size)
+static void to_db_size(unsigned long *size)
{
*size += sizeof(struct dr_datablock);
/* Alignment padding must be >= sizeof(dr_datablock.id). */
@@ -238,7 +238,7 @@ bool dataring_datablock_isvalid(struct dataring *dr, struct dr_desc *desc)
*
* Return: true if the size is legal for the data ringbuffer, otherwise false.
*/
-bool dataring_checksize(struct dataring *dr, unsigned int size)
+bool dataring_checksize(struct dataring *dr, unsigned long size)
{
if (size == 0)
return false;
@@ -445,7 +445,7 @@ static unsigned long _dataring_pop(struct dataring *dr,
* This will only fail if it was not possible to invalidate the tail data
* block (i.e. the @id of the tail data block was not yet set by its writer).
*/
-static bool get_new_lpos(struct dataring *dr, unsigned int size,
+static bool get_new_lpos(struct dataring *dr, unsigned long size,
unsigned long *begin_lpos_out,
unsigned long *next_lpos_out)
{
@@ -537,7 +537,7 @@ void dataring_desc_init(struct dr_desc *desc)
* This will only fail if it was not possible to invalidate the tail data
* block.
*/
-char *dataring_push(struct dataring *dr, unsigned int size,
+char *dataring_push(struct dataring *dr, unsigned long size,
struct dr_desc *desc, unsigned long id)
{
unsigned long begin_lpos;
diff --git a/kernel/printk/dataring.h b/kernel/printk/dataring.h
index 413ee95f4dd6..c566ce228abe 100644
--- a/kernel/printk/dataring.h
+++ b/kernel/printk/dataring.h
@@ -90,10 +90,10 @@ struct dataring {
void *getdesc_arg;
};
-bool dataring_checksize(struct dataring *dr, unsigned int size);
+bool dataring_checksize(struct dataring *dr, unsigned long size);
bool dataring_pop(struct dataring *dr);
-char *dataring_push(struct dataring *dr, unsigned int size,
+char *dataring_push(struct dataring *dr, unsigned long size,
struct dr_desc *desc, unsigned long id);
void dataring_datablock_setid(struct dataring *dr, struct dr_desc *desc,
diff --git a/kernel/printk/numlist.c b/kernel/printk/numlist.c
index d5e224dafc0c..16c6ffa74b01 100644
--- a/kernel/printk/numlist.c
+++ b/kernel/printk/numlist.c
@@ -108,7 +108,7 @@
*
* This function will fail if @id is not valid anytime during this function.
*/
-bool numlist_read(struct numlist *nl, unsigned long id, unsigned long *seq,
+bool numlist_read(struct numlist *nl, unsigned long id, u64 *seq,
unsigned long *next_id)
{
struct nl_node *n;
@@ -165,7 +165,7 @@ bool numlist_read(struct numlist *nl, unsigned long id, unsigned long *seq,
*
* Return: The ID of the tail node.
*/
-unsigned long numlist_read_tail(struct numlist *nl, unsigned long *seq,
+unsigned long numlist_read_tail(struct numlist *nl, u64 *seq,
unsigned long *next_id)
{
unsigned long tail_id;
@@ -201,8 +201,8 @@ unsigned long numlist_read_tail(struct numlist *nl, unsigned long *seq,
void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
{
unsigned long head_id;
- unsigned long seq;
unsigned long r;
+ u64 seq;
/*
* bA:
diff --git a/kernel/printk/numlist.h b/kernel/printk/numlist.h
index cdc3b21e6597..d4595fb9a3e9 100644
--- a/kernel/printk/numlist.h
+++ b/kernel/printk/numlist.h
@@ -20,7 +20,7 @@
*/
struct nl_node {
/* private */
- unsigned long seq;
+ u64 seq;
unsigned long next_id;
};
@@ -64,9 +64,9 @@ struct numlist {
void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id);
struct nl_node *numlist_pop(struct numlist *nl);
-unsigned long numlist_read_tail(struct numlist *nl, unsigned long *seq,
+unsigned long numlist_read_tail(struct numlist *nl, u64 *seq,
unsigned long *next_id);
-bool numlist_read(struct numlist *nl, unsigned long id, unsigned long *seq,
+bool numlist_read(struct numlist *nl, unsigned long id, u64 *seq,
unsigned long *next_id);
#endif /* _KERNEL_PRINTK_NUMLIST_H */
diff --git a/kernel/printk/ringbuffer.c b/kernel/printk/ringbuffer.c
index 9be841761ea2..053622151447 100644
--- a/kernel/printk/ringbuffer.c
+++ b/kernel/printk/ringbuffer.c
@@ -416,7 +416,7 @@ static bool assign_desc(struct prb_reserved_entry *e)
* * -ENOMEM: failed to reserve data (invalid descriptor committed)
*/
char *prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
- unsigned int size)
+ unsigned long size)
{
struct prb_desc *d;
unsigned long id;
@@ -567,7 +567,7 @@ EXPORT_SYMBOL(prb_iter_init);
*/
static void reset_iter(struct prb_iterator *iter)
{
- unsigned long last_seq;
+ u64 last_seq;
iter->next_id = numlist_read_tail(&iter->rb->nl, &last_seq, NULL);
@@ -650,9 +650,9 @@ int prb_iter_next_valid_entry(struct prb_iterator *iter)
struct dr_desc desc;
struct prb_desc *d;
struct nl_node *n;
- unsigned long seq;
unsigned long id;
int size;
+ u64 seq;
if (!setup_next(nl, iter))
return 0;
diff --git a/kernel/printk/ringbuffer.h b/kernel/printk/ringbuffer.h
index 698d2328ea9e..ec7bb21abac2 100644
--- a/kernel/printk/ringbuffer.h
+++ b/kernel/printk/ringbuffer.h
@@ -97,9 +97,9 @@ struct prb_reserved_entry {
*/
struct prb_entry {
/* public */
- unsigned long seq;
- char *buffer;
- int buffer_size;
+ u64 seq;
+ char *buffer;
+ int buffer_size;
};
/**
@@ -125,13 +125,13 @@ struct prb_iterator {
struct prb_entry *e;
unsigned long last_id;
- unsigned long last_seq;
+ u64 last_seq;
unsigned long next_id;
};
/* writer interface */
char *prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
- unsigned int size);
+ unsigned long size);
void prb_commit(struct prb_reserved_entry *e);
/* reader interface */
diff --git a/kernel/printk/test_prb.c b/kernel/printk/test_prb.c
index 77d298b6990a..0157bbdf051f 100644
--- a/kernel/printk/test_prb.c
+++ b/kernel/printk/test_prb.c
@@ -38,8 +38,8 @@ static void dump_rb(struct printk_ringbuffer *rb)
{
DECLARE_PRINTKRB_ENTRY(entry, 160);
DECLARE_PRINTKRB_ITER(iter, rb, &entry);
- unsigned long last_seq = 0;
struct rbdata *dat;
+ u64 last_seq = 0;
char buf[160];
int len;
@@ -47,7 +47,7 @@ static void dump_rb(struct printk_ringbuffer *rb)
prb_for_each_entry_continue(&iter, len) {
if (entry.seq - last_seq != 1) {
- trace_printk("LOST %lu\n",
+ trace_printk("LOST %llu\n",
entry.seq - (last_seq + 1));
}
last_seq = entry.seq;
@@ -56,7 +56,7 @@ static void dump_rb(struct printk_ringbuffer *rb)
snprintf(buf, sizeof(buf), "%s", dat->text);
buf[sizeof(buf) - 1] = 0;
- trace_printk("seq=%lu len=%d textlen=%d dataval=%s\n",
+ trace_printk("seq=%llu len=%d textlen=%d dataval=%s\n",
entry.seq, len, dat->len, buf);
}
@@ -112,11 +112,11 @@ static int prbtest_reader(void *data)
DECLARE_PRINTKRB_ENTRY(entry, 160);
DECLARE_PRINTKRB_ITER(iter, &test_rb, &entry);
unsigned long total_lost = 0;
- unsigned long last_seq = 0;
unsigned long max_lost = 0;
unsigned long count = 0;
struct rbdata *dat;
int did_sched = 1;
+ u64 last_seq = 0;
int len;
pr_err("prbtest: start thread %lu (reader)\n", num);
@@ -126,7 +126,7 @@ static int prbtest_reader(void *data)
if (entry.seq < last_seq) {
WRITE_ONCE(halt_test, 1);
trace_printk(
- "reader%lu invalid seq %lu -> %lu\n",
+ "reader%lu invalid seq %llu -> %llu\n",
num, last_seq, entry.seq);
goto out;
}
@@ -145,7 +145,7 @@ static int prbtest_reader(void *data)
if (len != dat->len || len >= 160) {
WRITE_ONCE(halt_test, 1);
trace_printk(
- "reader%lu invalid len for %lu (%d<->%d)\n",
+ "reader%lu invalid len for %llu (%d<->%d)\n",
num, entry.seq, len, dat->len);
goto out;
}
@@ -172,7 +172,7 @@ static int prbtest_reader(void *data)
}
out:
pr_err(
- "reader%lu: total_lost=%lu max_lost=%lu total_read=%lu seq=%lu\n",
+ "reader%lu: total_lost=%lu max_lost=%lu total_read=%lu seq=%llu\n",
num, total_lost, max_lost, count, entry.seq);
pr_err("prbtest: end thread %lu (reader)\n", num);
--
2.20.1
Add the following functions needed to support printk features.
dataring:
dataring_unused() - return free bytes
ringbuffer:
prb_init() - dynamically initialize a ringbuffer
prb_iter_seek() - seek to an entry in the committed list
prb_iter_wait_next_valid_entry() - blocking reader function
prb_iter_copy() - duplicate an iterator
prb_iter_entry() - get the entry of an iterator
prb_unused() - wrapper for dataring_unused()
prb_wait_queue() - get the ringbuffer wait queue
DECLARE_PRINTKRB_SEQENTRY() - declare entry for only seq reading
Also modify prb_iter_peek_next_entry() to optionally return the
sequence number previous to the next entry.
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/dataring.c | 29 ++++
kernel/printk/dataring.h | 3 +-
kernel/printk/ringbuffer.c | 292 +++++++++++++++++++++++++++++++++++--
kernel/printk/ringbuffer.h | 46 +++++-
4 files changed, 354 insertions(+), 16 deletions(-)
diff --git a/kernel/printk/dataring.c b/kernel/printk/dataring.c
index 712842f2dc04..e48069dc27bc 100644
--- a/kernel/printk/dataring.c
+++ b/kernel/printk/dataring.c
@@ -489,6 +489,35 @@ static bool get_new_lpos(struct dataring *dr, unsigned long size,
}
}
+/**
+ * dataring_unused() - Determine the unused bytes available for pushing.
+ *
+ * @dr: The data ringbuffer to check.
+ *
+ * Determine the largest possible push size that can be performed without
+ * invalidating existing data.
+ *
+ * Return: The number of unused bytes available for pushing.
+ */
+unsigned long dataring_unused(struct dataring *dr)
+{
+ unsigned long head_lpos;
+ unsigned long tail_lpos;
+ unsigned long size = 0;
+ unsigned long diff;
+
+ to_db_size(&size);
+
+ tail_lpos = atomic_long_read(&dr->tail_lpos);
+ head_lpos = atomic_long_read(&dr->head_lpos);
+
+ diff = DATA_SIZE(dr) + tail_lpos - head_lpos;
+ if (diff <= size)
+ return 0;
+
+ return (diff - size);
+}
+
/**
* dataring_desc_init() - Initialize a descriptor to be permanently invalid.
*
diff --git a/kernel/printk/dataring.h b/kernel/printk/dataring.h
index c566ce228abe..896a7f855d9f 100644
--- a/kernel/printk/dataring.h
+++ b/kernel/printk/dataring.h
@@ -91,13 +91,14 @@ struct dataring {
};
bool dataring_checksize(struct dataring *dr, unsigned long size);
+unsigned long dataring_unused(struct dataring *dr);
bool dataring_pop(struct dataring *dr);
char *dataring_push(struct dataring *dr, unsigned long size,
struct dr_desc *desc, unsigned long id);
-
void dataring_datablock_setid(struct dataring *dr, struct dr_desc *desc,
unsigned long id);
+
struct dr_datablock *dataring_getdatablock(struct dataring *dr,
struct dr_desc *desc, int *size);
bool dataring_datablock_isvalid(struct dataring *dr, struct dr_desc *desc);
diff --git a/kernel/printk/ringbuffer.c b/kernel/printk/ringbuffer.c
index 053622151447..e727d9d72f65 100644
--- a/kernel/printk/ringbuffer.c
+++ b/kernel/printk/ringbuffer.c
@@ -3,6 +3,7 @@
#include <linux/kernel.h>
#include <linux/irqflags.h>
#include <linux/string.h>
+#include <linux/sched.h>
#include <linux/err.h>
#include "ringbuffer.h"
@@ -530,6 +531,24 @@ void prb_commit(struct prb_reserved_entry *e)
}
EXPORT_SYMBOL(prb_commit);
+/**
+ * prb_unused() - Determine the unused bytes available for reserving.
+ *
+ * @rb: The ringbuffer to check.
+ *
+ * This is the public function available to writer to determine the largest
+ * possible reserve size that can be performed without invalidating old
+ * entries.
+ *
+ * Context: Any context.
+ * Return: The number of unused bytes available for reserving.
+ */
+unsigned long prb_unused(struct printk_ringbuffer *rb)
+{
+ return dataring_unused(&rb->dr);
+}
+EXPORT_SYMBOL(prb_unused);
+
/**
* prb_iter_init() - Initialize an iterator.
*
@@ -543,7 +562,7 @@ EXPORT_SYMBOL(prb_commit);
*
* As an alternative, DECLARE_PRINTKRB_ITER() can be used.
*
- * The interator is initialized to the beginning of the committed list (the
+ * The iterator is initialized to the beginning of the committed list (the
* oldest committed entry).
*
* Context: Any context.
@@ -575,10 +594,10 @@ static void reset_iter(struct prb_iterator *iter)
iter->last_seq = last_seq - 1;
/*
- * @last_id is only significant in EOL situations, when it is equal to
- * @next_id and the iterator wants to read the entry after @last_id as
- * the next entry. Set @last_id to something other than @next_id. So
- * that the iterator will read @next_id as the next entry.
+ * @last_id is only significant in EOL situations, when it is equal
+ * to @next_id, in which case it reads the entry after @last_id. Set
+ * @last_id to something other than @next_id so that the iterator
+ * will read @next_id as the next entry.
*/
iter->last_id = iter->next_id - 1;
}
@@ -696,8 +715,12 @@ int prb_iter_next_valid_entry(struct prb_iterator *iter)
e->seq = seq;
db = dataring_getdatablock(dr, &desc, &size);
- memcpy(&e->buffer[0], &db->data[0],
- size > e->buffer_size ? e->buffer_size : size);
+
+ if (e->buffer && e->buffer_size) {
+ memcpy(&e->buffer[0], &db->data[0],
+ size > e->buffer_size ?
+ e->buffer_size : size);
+ }
/*
* mD:
@@ -726,6 +749,39 @@ int prb_iter_next_valid_entry(struct prb_iterator *iter)
}
EXPORT_SYMBOL(prb_iter_next_valid_entry);
+
+/**
+ * prb_iter_wait_next_valid_entry() - Blocking traverse and read.
+ *
+ * @iter: The iterator used for list traversal.
+ *
+ * This is the public function available to readers to traverse the committed
+ * entry list. It is the same as prb_iter_next_valid_entry() except that it
+ * blocks (interruptible) if the end of the commit list is reached. See
+ * prb_iter_next_valid_entry() for traversal/read/size details.
+ *
+ * Context: Process context. Sleeps if the end of the commit list reached.
+ * Return: The size of the entry data or -ERESTARTSYS if interrupted.
+ */
+int prb_iter_wait_next_valid_entry(struct prb_iterator *iter)
+{
+ int ret;
+
+ for (;;) {
+ ret = wait_event_interruptible(*(iter->rb->wq),
+ prb_iter_peek_next_entry(iter, NULL));
+ if (ret < 0)
+ break;
+
+ ret = prb_iter_next_valid_entry(iter);
+ if (ret > 0)
+ break;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(prb_iter_wait_next_valid_entry);
+
/**
* prb_iter_sync() - Position an iterator to that of another iterator.
*
@@ -764,10 +820,41 @@ void prb_iter_sync(struct prb_iterator *dst, struct prb_iterator *src)
}
EXPORT_SYMBOL(prb_iter_sync);
+/**
+ * prb_iter_copy() - Copy all iterator information to another iterator.
+ *
+ * @dst: The iterator to modify.
+ *
+ * @src: The iterator to copy from.
+ *
+ * This is the public function available to readers to copy all iterator
+ * information of another iterator. After calling this function, @dst will
+ * be using the same entry and traverse the same ringbuffer, from the
+ * same committed entry as @src.
+ *
+ * It is not necessary for @dst to be previously initialized.
+ *
+ * Context: Any context.
+ *
+ * It is safe to call this function from any context and state. But note
+ * that this function is not atomic. Callers must not sync iterators that
+ * can be accessed by other tasks/contexts unless proper synchronization is
+ * used.
+ */
+void prb_iter_copy(struct prb_iterator *dst, struct prb_iterator *src)
+{
+ dst->e = src->e;
+ prb_iter_sync(dst, src);
+}
+EXPORT_SYMBOL(prb_iter_copy);
+
/**
* prb_iter_peek_next_entry() - Check if there is a next (newer) entry.
*
- * @iter: The iterator used for list traversal.
+ * @iter: The iterator used for list traversal.
+ *
+ * @last_seq: A pointer to a variable to store the last seen sequence number.
+ * This may be NULL if the caller is not interested in this value.
*
* This is the public function available to readers to check if a newer
* entry is available.
@@ -775,14 +862,23 @@ EXPORT_SYMBOL(prb_iter_sync);
* Context: Any context.
* Return: true if there is a next entry, otherwise false.
*/
-bool prb_iter_peek_next_entry(struct prb_iterator *iter)
+bool prb_iter_peek_next_entry(struct prb_iterator *iter, u64 *last_seq)
{
- DECLARE_PRINTKRB_ENTRY(e, 1);
+ DECLARE_PRINTKRB_SEQENTRY(e);
DECLARE_PRINTKRB_ITER(iter_copy, iter->rb, &e);
prb_iter_sync(&iter_copy, iter);
- return (prb_iter_next_valid_entry(&iter_copy) != 0);
+ if (prb_iter_next_valid_entry(&iter_copy) == 0) {
+ if (last_seq)
+ *last_seq = iter_copy.last_seq;
+ return false;
+ }
+
+ /* Pretend to have seen the previous entry. */
+ if (last_seq)
+ *last_seq = iter_copy.last_seq - 1;
+ return true;
}
EXPORT_SYMBOL(prb_iter_peek_next_entry);
@@ -807,3 +903,177 @@ unsigned long prb_getfail(struct printk_ringbuffer *rb)
return atomic_long_read(&rb->fail);
}
EXPORT_SYMBOL(prb_getfail);
+
+/**
+ * prb_iter_seek() - Reposition an iterator based on the sequence number.
+ *
+ * @iter: The iterator used for list traversal.
+ *
+ * @last_seq: The sequence number that the iterator will have seen last.
+ * Use 0 to position for reading the oldest commit list entry and
+ * -1 to position beyond the newest commit list entry.
+ *
+ * This is the public function available to readers to reposition an iterator
+ * based on the commit list entry sequence number.
+ *
+ * If @last_seq exists, the iterator is positioned such that a following read
+ * will read the entry with the next higher sequence number.
+ *
+ * If @last_seq does not exist but a higher (newer) sequence number exists,
+ * the iterator is positioned such that a following read will read that
+ * higher entry.
+ *
+ * If @last_seq does not exist and no higher (newer) sequence number exists,
+ * the iterator is positioned at the end of the commit list such that a
+ * following read will read the next (not yet existent) entry.
+ *
+ * Context: Any context.
+ * Return: The last seen sequence number.
+ *
+ * From the return value (and the value of @last_seq) the caller can identify
+ * which of the above described scenarios occurred.
+ */
+u64 prb_iter_seek(struct prb_iterator *iter, u64 last_seq)
+{
+ DECLARE_PRINTKRB_SEQENTRY(e);
+ DECLARE_PRINTKRB_ITER(i, iter->rb, &e);
+ int l;
+
+ /* Seek to the beginning? */
+ if (last_seq == 0) {
+ reset_iter(iter);
+ goto out;
+ }
+
+ /* Iterator already where it should be? */
+ if (iter->last_seq == last_seq)
+ goto out;
+
+ /*
+ * Backward seeking is not possible. Reset the iterator to the
+ * beginning and seek forwards.
+ */
+ if (last_seq < iter->last_seq)
+ reset_iter(iter);
+
+ /*
+ * Seek using a local copy and only sync with the iterator when it
+ * is known that the seek has not gone too far, for example when
+ * the desired last_seq was an invalid entry or does not exist.
+ */
+ prb_iter_sync(&i, iter);
+
+ prb_for_each_entry_continue(&i, l) {
+ if (e.seq > last_seq)
+ break;
+
+ prb_iter_sync(iter, &i);
+ if (e.seq == last_seq)
+ break;
+ }
+out:
+ return iter->last_seq;
+}
+EXPORT_SYMBOL(prb_iter_seek);
+
+/**
+ * prb_init() - Initialize a ringbuffer.
+ *
+ * @rb: The ringbuffer to initialize.
+ *
+ * @data: A pointer to a byte array for raw entry storage.
+ *
+ * @data_size_bits: The power-of-2 size fo @data.
+ *
+ * @descs: A pointer to a prb_desc array for descriptor storage.
+ *
+ * @desc_count_bits: The power-of-2 count of descriptors in @descs.
+ *
+ * @waitq: A wait queue to use for blocking readers.
+ *
+ * This is the public function available to initialize a ringbuffer. It
+ * allows the caller to provide the internal buffers, thus allowing the
+ * buffers to be allocated dynamically.
+ *
+ * As per numlist requirement of always having at least one node in the list,
+ * the ringbuffer structures are initialized such that:
+ *
+ * * the numlist head and tail point to descriptor 0
+ * * descriptor 0 has an invalid data block and is the terminating node
+ * * descriptor 1 will be the next descriptor
+ *
+ * As an alternative, DECLARE_PRINTKRB() can be used.
+ *
+ * Context: Any context.
+ */
+void prb_init(struct printk_ringbuffer *rb, char *data, int data_size_bits,
+ struct prb_desc *descs, int desc_count_bits,
+ struct wait_queue_head *waitq)
+{
+ struct dataring *dr = &rb->dr;
+ struct numlist *nl = &rb->nl;
+
+ rb->desc_count_bits = desc_count_bits;
+ rb->descs = descs;
+ atomic_long_set(&descs[0].id, 0);
+ descs[0].desc.begin_lpos = 1;
+ descs[0].desc.next_lpos = 1;
+ atomic_set(&rb->desc_next_unused, 1);
+
+ atomic_long_set(&nl->head_id, 0);
+ atomic_long_set(&nl->tail_id, 0);
+ nl->node = prb_desc_node;
+ nl->node_arg = rb;
+ nl->busy = prb_desc_busy;
+ nl->busy_arg = rb;
+
+ dr->size_bits = data_size_bits;
+ dr->data = data;
+ atomic_long_set(&dr->head_lpos, -111 * sizeof(long));
+ atomic_long_set(&dr->tail_lpos, -111 * sizeof(long));
+ dr->getdesc = prb_getdesc;
+ dr->getdesc_arg = rb;
+
+ atomic_long_set(&rb->fail, 0);
+
+ rb->wq = waitq;
+}
+EXPORT_SYMBOL(prb_init);
+
+/**
+ * prb_wait_queue() - Get the wait queue of blocking readers.
+ *
+ * @rb: The ringbuffer containing the wait queue.
+ *
+ * This is the public function available to readers to get the wait queue
+ * associated with a ringbuffer. All waiters on this wait queue are woken
+ * each time a new entry is committed. This allows a reader to implement
+ * their own blocking read/poll function.
+ *
+ * Context: Any context.
+ * Return: The ringbuffer wait queue.
+ */
+struct wait_queue_head *prb_wait_queue(struct printk_ringbuffer *rb)
+{
+ return rb->wq;
+}
+EXPORT_SYMBOL(prb_wait_queue);
+
+/**
+ * prb_iter_entry() - Get the prb_entry associated with an iterator.
+ *
+ * @iter: The iterator to get the entry from.
+ *
+ * This is the public function to allow readers to get the prb_entry
+ * structure associated with an iterator. Readers need an iterator's
+ * prb_entry in order to process the read data. This function is useful in
+ * case a caller only has an iterator, but not the associated prb_entry.
+ *
+ * Context: Any context.
+ * Return: The prb_entry used by @iter.
+ */
+struct prb_entry *prb_iter_entry(struct prb_iterator *iter)
+{
+ return iter->e;
+}
+EXPORT_SYMBOL(prb_iter_entry);
diff --git a/kernel/printk/ringbuffer.h b/kernel/printk/ringbuffer.h
index ec7bb21abac2..70cb9ad284d4 100644
--- a/kernel/printk/ringbuffer.h
+++ b/kernel/printk/ringbuffer.h
@@ -4,6 +4,7 @@
#define _LINUX_PRINTK_RINGBUFFER_H
#include <linux/atomic.h>
+#include <linux/wait.h>
#include "numlist.h"
#include "dataring.h"
@@ -48,6 +49,8 @@ struct prb_desc {
* descriptor. Failure due to not being able to reserve
* space in the dataring is not counted because readers
* will notice a lost sequence number in that case.
+ *
+ * @wq: The wait queue used by blocking readers.
*/
struct printk_ringbuffer {
/* private */
@@ -60,6 +63,8 @@ struct printk_ringbuffer {
struct dataring dr;
atomic_long_t fail;
+
+ struct wait_queue_head *wq;
};
/**
@@ -138,13 +143,22 @@ void prb_commit(struct prb_reserved_entry *e);
void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb,
struct prb_entry *e);
int prb_iter_next_valid_entry(struct prb_iterator *iter);
-void prb_iter_sync(struct prb_iterator *dest, struct prb_iterator *src);
-bool prb_iter_peek_next_entry(struct prb_iterator *iter);
+int prb_iter_wait_next_valid_entry(struct prb_iterator *iter);
+void prb_iter_sync(struct prb_iterator *dst, struct prb_iterator *src);
+void prb_iter_copy(struct prb_iterator *dst, struct prb_iterator *src);
+bool prb_iter_peek_next_entry(struct prb_iterator *iter, u64 *last_seq);
+u64 prb_iter_seek(struct prb_iterator *iter, u64 last_seq);
+struct wait_queue_head *prb_wait_queue(struct printk_ringbuffer *rb);
+struct prb_entry *prb_iter_entry(struct prb_iterator *iter);
/* utility functions */
unsigned long prb_getfail(struct printk_ringbuffer *rb);
+void prb_init(struct printk_ringbuffer *rb, char *data, int data_size_bits,
+ struct prb_desc *descs, int desc_count_bits,
+ struct wait_queue_head *waitq);
+unsigned long prb_unused(struct printk_ringbuffer *rb);
-/* prototypes for callbacks used by numlist and dataring, respectively */
+/* callbacks used by numlist and dataring, respectively */
struct nl_node *prb_desc_node(unsigned long id, void *arg);
bool prb_desc_busy(unsigned long id, void *arg);
struct dr_desc *prb_getdesc(unsigned long id, void *arg);
@@ -164,6 +178,8 @@ struct dr_desc *prb_getdesc(unsigned long id, void *arg);
*
* @descbits: The power-of-2 maximum amount of descriptors allowed.
*
+ * @waitq: A wait queue to use for blocking readers.
+ *
* The size of the data array will be the average data size multiplied by the
* maximum amount of descriptors.
*
@@ -173,8 +189,12 @@ struct dr_desc *prb_getdesc(unsigned long id, void *arg);
* * the numlist head and tail point to descriptor 0
* * descriptor 0 has an invalid data block and is the terminating node
* * descriptor 1 will be the next descriptor
+ *
+ * This macro is particularly useful for static ringbuffers that should be
+ * immediately available and initialized. It is an alternative to
+ * manually initializing a ringbuffer with prb_init().
*/
-#define DECLARE_PRINTKRB(name, avgdatabits, descbits) \
+#define DECLARE_PRINTKRB(name, avgdatabits, descbits, waitq) \
char _##name##_data[1 << ((avgdatabits) + (descbits))] \
__aligned(__alignof__(long)); \
struct prb_desc _##name##_descs[1 << (descbits)] = { \
@@ -206,6 +226,7 @@ struct printk_ringbuffer name = { \
.getdesc_arg = &name, \
}, \
.fail = ATOMIC_LONG_INIT(0), \
+ .wq = waitq, \
}
/**
@@ -231,6 +252,23 @@ struct prb_entry name = { \
.buffer_size = size, \
}
+/**
+ * DECLARE_PRINTKRB_SEQENTRY() - Declare an entry structure for sequences.
+ *
+ * @name: The name for the entry structure variable.
+ *
+ * This macro is declares and initializes an entry structure without any
+ * buffer. This is useful if an iterator is only interested in sequence
+ * numbers and so does not need to read the entry data. Also, because of
+ * its small size, it is safe to put on the stack.
+ */
+#define DECLARE_PRINTKRB_SEQENTRY(name) \
+struct prb_entry name = { \
+ .seq = 0, \
+ .buffer = NULL, \
+ .buffer_size = 0, \
+}
+
/**
* DECLARE_PRINTKRB_ITER() - Declare an iterator for readers.
*
--
2.20.1
See documentation for details.
For the real patch the "prb overview" documentation section in
kernel/printk/ringbuffer.c will be included in the commit message.
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/Makefile | 3 +
kernel/printk/dataring.c | 761 +++++++++++++++++++++++++++++++++++
kernel/printk/dataring.h | 95 +++++
kernel/printk/numlist.c | 375 +++++++++++++++++
kernel/printk/numlist.h | 72 ++++
kernel/printk/ringbuffer.c | 800 +++++++++++++++++++++++++++++++++++++
kernel/printk/ringbuffer.h | 288 +++++++++++++
7 files changed, 2394 insertions(+)
create mode 100644 kernel/printk/dataring.c
create mode 100644 kernel/printk/dataring.h
create mode 100644 kernel/printk/numlist.c
create mode 100644 kernel/printk/numlist.h
create mode 100644 kernel/printk/ringbuffer.c
create mode 100644 kernel/printk/ringbuffer.h
diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
index 4d052fc6bcde..567999aa93af 100644
--- a/kernel/printk/Makefile
+++ b/kernel/printk/Makefile
@@ -1,4 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-y = printk.o
obj-$(CONFIG_PRINTK) += printk_safe.o
+obj-$(CONFIG_PRINTK) += ringbuffer.o
+obj-$(CONFIG_PRINTK) += numlist.o
+obj-$(CONFIG_PRINTK) += dataring.o
obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o
diff --git a/kernel/printk/dataring.c b/kernel/printk/dataring.c
new file mode 100644
index 000000000000..911bac593ec1
--- /dev/null
+++ b/kernel/printk/dataring.c
@@ -0,0 +1,761 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "dataring.h"
+
+/**
+ * DOC: dataring overview
+ *
+ * A dataring is a lockless ringbuffer consisting of variable length data
+ * blocks, each of which are assigned an ID. The IDs map to descriptors, which
+ * contain metadata about the data block. The lookup function mapping IDs to
+ * descriptors is implemented by the user.
+ *
+ * Data Blocks
+ * -----------
+ * All ringbuffer data is stored within a single static byte array. This is
+ * to ensure that any pointers to the data (past and present) will always
+ * point to valid memory. This is important because the lockless readers
+ * and writers may preempt for long periods of time and when they resume may
+ * be working with expired pointers.
+ *
+ * Data blocks are specified by begin and end logical positions (lpos) that
+ * map directly to byte array offsets. Using logical positions indirectly
+ * provides tagged state references for the data blocks to avoid ABA issues
+ * when the ringbuffer wraps. The number of tagged states per index is::
+ *
+ * sizeof(long) / size of byte array
+ *
+ * If a data block starts near the end of the byte array but would extend
+ * beyond it, that data block is handled differently: a special "wrapping data
+ * block" is inserted in the space available at the end of the byte array and
+ * a "content data block" is placed at the beginning of the byte array. This
+ * can waste space at the end of the byte array, but simplifies the
+ * implementation by allowing writers to always work with contiguous buffers.
+ * For example, for a 1000 byte array, a descriptor may show a start lpos of
+ * 1950 and an end lpos of 2100. The data block associated with this
+ * descriptor is 100 bytes in size. Its ID is located in the "wrapping" data
+ * block (located at offset 950 of the byte array) and its data is found in
+ * the "content" data block (located at offset 0 of the byte array).
+ *
+ * Descriptors
+ * -----------
+ * A descriptor is a handle to a data block. How descriptors are structured
+ * and mapped to IDs is implemented by the user.
+ *
+ * Descriptors contain the begin (begin_lpos) and end (next_lpos) logical
+ * positions of the data block they represent. The end logical position
+ * matches the begin logical position of the adjacent data block.
+ *
+ * Why Descriptors?
+ * ----------------
+ * The data ringbuffer supports variable length entities, which means that
+ * data blocks will not always begin at a predictable offset of the byte
+ * array. This is a major problem for lockless writers that, for example, will
+ * compete to expire and reuse old data blocks when the ringbuffer is full.
+ * Without a predictable begin for the data blocks, a writer has no reliable
+ * information about the status of the "free" area. Are any flags or state
+ * variables already set or is it just garbage left over from previous usage?
+ *
+ * Descriptors allow safe and controlled access to data block metadata by
+ * providing predictable offsets for such metadata. This is key to supporting
+ * multiple concurrent lockless writers.
+ *
+ * Behavior
+ * --------
+ * The data ringbuffer allows writers to commit data without regard for
+ * readers. Readers must pre- and post-validate the data blocks they are
+ * processing to be sure the processed data is consistent. A function
+ * dataring_datablock_isvalid() is available for that. Readers can only
+ * iterate data blocks by utilizing an external implementation using
+ * descriptor lookups based on IDs.
+ *
+ * Writers commit data in two steps:
+ *
+ * (1) Reserve a new data block (dataring_push()).
+ * (2) Commit the data block (dataring_datablock_setid()).
+ *
+ * Once a data block is committed, it is available for recycling by another
+ * writer. Therefore, once committed, a writer must no longer access the data
+ * block.
+ *
+ * If data block reservation fails, it means the oldest reserved data block
+ * has not yet been committed by its writer. This acts as a blocker for any
+ * future data block reservation.
+ */
+
+#define DATA_SIZE(dr) (1 << (dr)->size_bits)
+#define DATA_SIZE_MASK(dr) (DATA_SIZE(dr) - 1)
+
+/**
+ * DATA_INDEX() - Determine the data array index from a logical position.
+ *
+ * @dr: The associated data ringbuffer.
+ *
+ * @lpos: A logical position used (or to be used) for a data block.
+ */
+#define DATA_INDEX(dr, lpos) ((lpos) & DATA_SIZE_MASK(dr))
+
+/**
+ * DATA_WRAPS() - Determine how many times the data array has wrapped.
+ *
+ * @dr: The associated data ringbuffer.
+ *
+ * @lpos: A logical position used (or to be used) for a data block.
+ *
+ * The number of wraps is useful when determining if one logical position
+ * is overtaking the data array index of another logical position.
+ */
+#define DATA_WRAPS(dr, lpos) ((lpos) >> (dr)->size_bits)
+
+/**
+ * DATA_THIS_WRAP_START_LPOS() - Get the position at the start of the wrap.
+ *
+ * @dr: The associated data ringbuffer.
+ *
+ * @lpos: A logical position used (or to be used) for a data block.
+ *
+ * Given a logical position, return the logical position if backed up to the
+ * beginning of the current wrap (data array index 0). This is useful when a
+ * data block extends beyond the end of the data array (i.e. wraps) and
+ * therefore needs to store its data at the beginning of the data array.
+ */
+#define DATA_THIS_WRAP_START_LPOS(dr, lpos) ((lpos) & ~DATA_SIZE_MASK(dr))
+
+/**
+ * to_datablock() - Get a data block pointer from a logical position.
+ *
+ * @dr: The associated data ringbuffer.
+ *
+ * @begin_lpos: The logical position of the beginning of the data block.
+ *
+ * Return: A pointer to a data block.
+ *
+ * Since multiple logical positions can map to the same array index, the
+ * returned data block may not be the data the caller wants. The caller is
+ * responsible for validating the data.
+ *
+ * The returned pointer has an address dependency to @begin_lpos.
+ */
+static struct dr_datablock *to_datablock(struct dataring *dr,
+ unsigned long begin_lpos)
+{
+ return (struct dr_datablock *)&dr->data[DATA_INDEX(dr, begin_lpos)];
+}
+
+/**
+ * to_db_size() - Increase a data size to account for ID and padding.
+ *
+ * @size: A pointer to a size value to modify.
+ *
+ * IMPORTANT: The alignment padding specified within ALIGN() must be greater
+ * than or equal to the size of @dr_datablock.id. This ensures that there is
+ * always space in the data array for the @id of a "wrapping" data block.
+ */
+static void to_db_size(unsigned int *size)
+{
+ *size += sizeof(struct dr_datablock);
+ /* Alignment padding must be >= sizeof(dr_datablock.id). */
+ *size = ALIGN(*size, sizeof(long));
+}
+
+/**
+ * _datablock_valid() - Check if given positions yield a valid data block.
+ *
+ * @dr: The associated data ringbuffer.
+ *
+ * @head_lpos: The newest data logical position.
+ *
+ * @tail_lpos: The oldest data logical position.
+ *
+ * @begin_lpos: The beginning logical position of the data block to check.
+ *
+ * @next_lpos: The logical position of the next adjacent data block.
+ * This value is used to identify the end of the data block.
+ *
+ * A data block is considered valid if it satisfies the two conditions:
+ *
+ * * tail_lpos <= begin_lpos < next_lpos <= head_lpos
+ * * tail_lpos is at most exactly 1 wrap behind head_lpos
+ *
+ * Return: true if the specified data block is valid.
+ */
+static bool _datablock_valid(struct dataring *dr,
+ unsigned long head_lpos, unsigned long tail_lpos,
+ unsigned long begin_lpos, unsigned long next_lpos)
+
+{
+ unsigned long size = DATA_SIZE(dr);
+
+ /* tail_lpos <= begin_lpos */
+ if (begin_lpos - tail_lpos >= size)
+ return false;
+
+ /* begin_lpos < next_lpos */
+ if (begin_lpos == next_lpos)
+ return false;
+ if (next_lpos - begin_lpos >= size)
+ return false;
+
+ /* next_lpos <= head_lpos */
+ if (head_lpos - next_lpos >= size)
+ return false;
+
+ /* at most exactly 1 wrap */
+ if (head_lpos - tail_lpos > size)
+ return false;
+
+ return true;
+}
+
+/**
+ * dataring_datablock_isvalid() - Check if a specified data block is valid.
+ *
+ * @dr: The associated data ringbuffer.
+ *
+ * @desc: A descriptor describing a data block to check.
+ *
+ * Return: true if the data block is valid, otherwise false.
+ */
+bool dataring_datablock_isvalid(struct dataring *dr, struct dr_desc *desc)
+{
+ return _datablock_valid(dr,
+ atomic_long_read(&dr->head_lpos),
+ atomic_long_read(&dr->tail_lpos),
+ READ_ONCE(desc->begin_lpos),
+ READ_ONCE(desc->next_lpos));
+}
+
+/**
+ * dataring_checksize() - Check if a data size is legal.
+ *
+ * @dr: The data ringbuffer to check against.
+ *
+ * @size: The size value requested by a writer.
+ *
+ * The size value is increased to include the data block structure size and
+ * any needed alignment padding.
+ *
+ * Return: true if the size is legal for the data ringbuffer, otherwise false.
+ */
+bool dataring_checksize(struct dataring *dr, unsigned int size)
+{
+ if (size == 0)
+ return false;
+
+ /* Ensure the alignment padded size fits in the data array. */
+ to_db_size(&size);
+ if (size >= DATA_SIZE(dr))
+ return false;
+
+ return true;
+}
+
+/**
+ * _dataring_pop() - Move tail forward, invalidating the oldest data block.
+ *
+ * @dr: The data ringbuffer containing the data block.
+ *
+ * @tail_lpos: The logical position of the oldest data block.
+ *
+ * This function expects to move the pointer to the oldest data block forward,
+ * thus invalidating the oldest data block. Before attempting to move the
+ * tail, it is verified that the data block is valid. An invalid data block
+ * means that another task has already moved the tail pointer forward.
+ *
+ * Return: The new/current value (logical position) of the tail.
+ *
+ * From the return value the caller can identify if the tail was moved
+ * forward. However, the caller does not know if it was the task that
+ * performed the move.
+ *
+ * If, after seeing a moved tail, the caller will be modifying @begin_lpos or
+ * @next_lpos of a descriptor or will be modifying the head, a full memory
+ * barrier is required before doing so. This ensures that if any update to a
+ * descriptor's @begin_lpos or @next_lpos or the data ringbuffer's head is
+ * visible, that the previous update to the tail is also visible. This avoids
+ * the possibility of failure to notice when another task has moved the tail.
+ *
+ * If the tail has not moved forward it means the @id for the data block was
+ * not set yet. In this case the tail cannot move forward.
+ */
+static unsigned long _dataring_pop(struct dataring *dr,
+ unsigned long tail_lpos)
+{
+ unsigned long new_tail_lpos;
+ unsigned long begin_lpos;
+ unsigned long next_lpos;
+ struct dr_datablock *db;
+ struct dr_desc *desc;
+
+ /*
+ * dA:
+ *
+ * @db has an address dependency on @tail_pos. Therefore @tail_lpos
+ * must be loaded before dB, which accesses @db.
+ */
+ db = to_datablock(dr, tail_lpos);
+
+ /*
+ * dB:
+ *
+ * When a writer has completed accessing its data block, it sets the
+ * @id thus making the data block available for invalidation. This
+ * _acquire() ensures that this task sees all data ringbuffer and
+ * descriptor values seen by the writer as @id was set. This is
+ * necessary to ensure that the data block can be correctly identified
+ * as valid (i.e. @begin_lpos, @next_lpos, @head_lpos are at least the
+ * values seen by that writer, which yielded a valid data block at
+ * that time). It is not enough to rely on the address dependency of
+ * @desc to @id because @head_lpos is not depedent on @id. This pairs
+ * with the _release() in dataring_datablock_setid().
+ *
+ * Memory barrier involvement:
+ *
+ * If dB reads from gA, then dC reads from fG.
+ * If dB reads from gA, then dD reads from fH.
+ * If dB reads from gA, then dE reads from fE.
+ *
+ * Note that if dB reads from gA, then dC cannot read from fC.
+ * Note that if dB reads from gA, then dD cannot read from fD.
+ *
+ * Relies on:
+ *
+ * RELEASE from fG to gA
+ * matching
+ * ADDRESS DEP. from dB to dC
+ *
+ * RELEASE from fH to gA
+ * matching
+ * ADDRESS DEP. from dB to dD
+ *
+ * RELEASE from fE to gA
+ * matching
+ * ACQUIRE from dB to dE
+ */
+ desc = dr->getdesc(smp_load_acquire(&db->id), dr->getdesc_arg);
+ if (!desc) {
+ /*
+ * The data block @id is invalid. The data block is either in
+ * use by the writer (@id not yet set) or has already been
+ * invalidated by another task and the data array area or
+ * descriptor have already been recycled. The latter case
+ * (descriptor already recycled) relies on the implementation
+ * of getdesc(), which, when using an smp_rmb(), must allow
+ * this task to see @tail_lpos as it was visible to the task
+ * that changed the ID-to-descriptor mapping. See the
+ * implementation of getdesc() for details.
+ */
+ goto out;
+ }
+
+ /*
+ * dC:
+ *
+ * Even though the data block @id was determined to be valid, it is
+ * possible that it is a data block recently made available and @id
+ * has not yet been initialized. The @id needs to be re-validated (dF)
+ * after checking if the descriptor points to the data block. Use
+ * _acquire() to ensure that the re-loading of @id occurs after
+ * loading @begin_lpos. This pairs with the _release() in
+ * dataring_push(). See fG for details.
+ */
+ begin_lpos = smp_load_acquire(&desc->begin_lpos);
+
+ if (begin_lpos != tail_lpos) {
+ /*
+ * @desc is not describing the data block at @tail_lpos. Since
+ * a data block and its descriptor always become valid before
+ * @id is set (see dB for details) the data block at
+ * @tail_lpos has already been invalidated.
+ */
+ goto out;
+ }
+
+ /* dD: */
+ next_lpos = READ_ONCE(desc->next_lpos);
+
+ if (!_datablock_valid(dr,
+ /* dE: */
+ atomic_long_read(&dr->head_lpos),
+ tail_lpos, begin_lpos, next_lpos)) {
+ /* Another task has already invalidated the data block. */
+ goto out;
+ }
+
+ /* dF: */
+ if (dr->getdesc(READ_ONCE(db->id), dr->getdesc_arg) != desc) {
+ /*
+ * The data block ID has changed. The rare case of an
+ * uninitialized @db->id matching the descriptor ID was hit.
+ * This is a special case and it applies to the failure of the
+ * previous @id check (dB).
+ */
+ goto out;
+ }
+
+ /* dG: */
+ new_tail_lpos = atomic_long_cmpxchg_relaxed(&dr->tail_lpos,
+ begin_lpos, next_lpos);
+ if (new_tail_lpos == begin_lpos)
+ return next_lpos;
+ return new_tail_lpos;
+out:
+ /*
+ * dH:
+ *
+ * Ensure that the updated @tail_lpos is visible if the data block has
+ * been invalidated. This pairs with the smp_mb() in dataring_push()
+ * (see fB for details) as well as with the ID synchronization used in
+ * the getdesc() implementation, which must guarantee that an
+ * smp_rmb() is sufficient for seeing an updated @tail_lpos (see the
+ * implementation of getdesc() for details).
+ */
+ smp_rmb();
+
+ /* dI: */
+ return atomic_long_read(&dr->tail_lpos);
+}
+
+/**
+ * get_new_lpos() - Determine the logical positions of a new data block.
+ *
+ * @dr: The data ringbuffer to contain the data.
+ *
+ * @size: The (alignment padded) size of the new data block.
+ *
+ * @begin_lpos_out: A pointer where to set the begin logical position value.
+ * This will be the beginning of the data block.
+ *
+ * @next_lpos_out: A pointer where to set the next logical position value.
+ * This value is used to identify the end of the data block.
+ *
+ * Based on the logical position @head_lpos, determine @begin_lpos and
+ * @next_lpos values for a new data block. If the data block would overwrite
+ * the tail data block, this function will invalidate the tail data block,
+ * thus providing itself space for the new data block.
+ *
+ * IMPORTANT: This function can push or see a pushed data ringbuffer tail.
+ * If the caller will be modifying @begin_lpos or @next_lpos of a descriptor
+ * or will be modifying the head, a full memory barrier is required before
+ * doing so.
+ *
+ * Return: true if logical positions were determined, otherwise false.
+ *
+ * This will only fail if it was not possible to invalidate the tail data
+ * block (i.e. the @id of the tail data block was not yet set by its writer).
+ */
+static bool get_new_lpos(struct dataring *dr, unsigned int size,
+ unsigned long *begin_lpos_out,
+ unsigned long *next_lpos_out)
+{
+ unsigned long data_begin_lpos;
+ unsigned long new_tail_lpos;
+ unsigned long begin_lpos;
+ unsigned long next_lpos;
+ unsigned long tail_lpos;
+
+ /* eA: #1 */
+ tail_lpos = atomic_long_read(&dr->tail_lpos);
+
+ for (;;) {
+ begin_lpos = atomic_long_read(&dr->head_lpos);
+ data_begin_lpos = begin_lpos;
+
+ for (;;) {
+ next_lpos = data_begin_lpos + size;
+
+ if (next_lpos - tail_lpos > DATA_SIZE(dr)) {
+ /* would overwrite oldest */
+
+ new_tail_lpos = _dataring_pop(dr, tail_lpos);
+ if (new_tail_lpos == tail_lpos)
+ return false;
+ /* eA: #2 */
+ tail_lpos = new_tail_lpos;
+ break;
+ }
+
+ if (DATA_WRAPS(dr, data_begin_lpos) ==
+ DATA_WRAPS(dr, next_lpos)) {
+ *begin_lpos_out = begin_lpos;
+ *next_lpos_out = next_lpos;
+ return true;
+ }
+
+ data_begin_lpos =
+ DATA_THIS_WRAP_START_LPOS(dr, next_lpos);
+ }
+ }
+}
+
+/**
+ * dataring_push() - Reserve a data block in the data array.
+ *
+ * @dr: The data ringbuffer to reserve data in.
+ *
+ * @size: The size to reserve.
+ *
+ * @desc: A pointer to a descriptor to store the data block information.
+ *
+ * @id: The ID of the descriptor to be associated.
+ * The data block will not be set with @id, but rather initialized with
+ * a value that is explicitly different than @id. This is to handle the
+ * case when newly available garbage by chance matches the descriptor
+ * ID.
+ *
+ * This function expects to move the head pointer forward. If this would
+ * result in overtaking the data array index of the tail, the tail data block
+ * will be invalidated.
+ *
+ * Return: A pointer to the reserved writer data, otherwise NULL.
+ *
+ * This will only fail if it was not possible to invalidate the tail data
+ * block.
+ */
+char *dataring_push(struct dataring *dr, unsigned int size,
+ struct dr_desc *desc, unsigned long id)
+{
+ unsigned long begin_lpos;
+ unsigned long next_lpos;
+ struct dr_datablock *db;
+ bool ret;
+
+ to_db_size(&size);
+
+ do {
+ /* fA: */
+ ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
+
+ /*
+ * fB:
+ *
+ * The data ringbuffer tail may have been pushed (by this or
+ * any other task). The updated @tail_lpos must be visible to
+ * all observers before changes to @begin_lpos, @next_lpos, or
+ * @head_lpos by this task are visible in order to allow other
+ * tasks to recognize the invalidation of the data blocks.
+ * This pairs with the smp_rmb() in _dataring_pop() as well as
+ * any reader task using smp_rmb() to post-validate data that
+ * has been read from a data block.
+ *
+ * Memory barrier involvement:
+ *
+ * If dE reads from fE, then dI reads from fA->eA.
+ * If dC reads from fG, then dI reads from fA->eA.
+ * If dD reads from fH, then dI reads from fA->eA.
+ * If mC reads from fH, then mF reads from fA->eA.
+ *
+ * Relies on:
+ *
+ * FULL MB between fA->eA and fE
+ * matching
+ * RMB between dE and dI
+ *
+ * FULL MB between fA->eA and fG
+ * matching
+ * RMB between dC and dI
+ *
+ * FULL MB between fA->eA and fH
+ * matching
+ * RMB between dD and dI
+ *
+ * FULL MB between fA->eA and fH
+ * matching
+ * RMB between mC and mF
+ */
+ smp_mb();
+
+ if (!ret) {
+ /*
+ * Force @desc permanently invalid to minimize risk
+ * of the descriptor later unexpectedly being
+ * determined as valid due to overflowing/wrapping of
+ * @head_lpos. An unaligned @begin_lpos can never
+ * point to a data block and having the same value
+ * for @begin_lpos and @next_lpos is also invalid.
+ */
+
+ /* fC: */
+ WRITE_ONCE(desc->begin_lpos, 1);
+
+ /* fD: */
+ WRITE_ONCE(desc->next_lpos, 1);
+
+ return NULL;
+ }
+ /* fE: */
+ } while (atomic_long_cmpxchg_relaxed(&dr->head_lpos, begin_lpos,
+ next_lpos) != begin_lpos);
+
+ db = to_datablock(dr, begin_lpos);
+
+ /*
+ * fF:
+ *
+ * @db->id is a garbage value and could possibly match the @id. This
+ * would be a problem because the data block would be considered
+ * valid before the writer has finished with it (i.e. before the
+ * writer has set @id). Force some other ID value.
+ */
+ WRITE_ONCE(db->id, id - 1);
+
+ /*
+ * fG:
+ *
+ * Ensure that @db->id is initialized to a wrong ID value before
+ * setting @begin_lpos so that there is no risk of accidentally
+ * matching a data block to a descriptor before the writer is finished
+ * with it (i.e. before the writer has set the correct @id). This
+ * pairs with the _acquire() in _dataring_pop().
+ *
+ * Memory barrier involvement:
+ *
+ * If dC reads from fG, then dF reads from fF.
+ *
+ * Relies on:
+ *
+ * RELEASE from fF to fG
+ * matching
+ * ACQUIRE from dC to dF
+ */
+ smp_store_release(&desc->begin_lpos, begin_lpos);
+
+ /* fH: */
+ WRITE_ONCE(desc->next_lpos, next_lpos);
+
+ /* If this data block wraps, use @data from the content data block. */
+ if (DATA_WRAPS(dr, begin_lpos) != DATA_WRAPS(dr, next_lpos))
+ db = to_datablock(dr, 0);
+
+ return &db->data[0];
+}
+
+/**
+ * dataring_datablock_setid() - Set the @id for a data block.
+ *
+ * @dr: The data ringbuffer containing the data block.
+ *
+ * @desc: A descriptor of the data block to set the iD of.
+ *
+ * @id: The ID value to set.
+ *
+ * The data block ID is not a field within @desc, but rather is part of the
+ * data block within the data array (struct dr_datablock). Once this value is
+ * set (and only when it is set), it is possible for the data block to be
+ * invalidated. Upon calling this function, the writer data will be fully
+ * stored in the data ringbuffer. The writer must not perform any operations
+ * on the data block after calling this function.
+ */
+void dataring_datablock_setid(struct dataring *dr, struct dr_desc *desc,
+ unsigned long id)
+{
+ struct dr_datablock *db;
+
+ db = to_datablock(dr, READ_ONCE(desc->begin_lpos));
+
+ /*
+ * gA:
+ *
+ * Ensure that all information (within the data ringbuffer and the
+ * descriptor) for the valid data block are visible before allowing
+ * the data block to be invalidated. This includes all storage of
+ * writer data for this data block. After this _release() the writer
+ * is no longer allowed to write to the data block. It pairs with the
+ * load_acquire() in _dataring_pop(). See dB for details.
+ */
+ smp_store_release(&db->id, id);
+}
+
+/**
+ * dataring_pop() - Invalidate the tail data block.
+ *
+ * @dr: The data ringbuffer to invalidatd the tail data block in.
+ *
+ * This is a public wrapper for _dataring_pop(). See _dataring_pop() for
+ * more details about this function.
+ *
+ * Return: true if the tail was invalidated, otherwise false.
+ *
+ * This function returns true if it notices the tail was invalidated even if
+ * it was not the task the did the invalidation.
+ */
+bool dataring_pop(struct dataring *dr)
+{
+ unsigned long tail_lpos;
+
+ tail_lpos = atomic_long_read(&dr->tail_lpos);
+
+ return (_dataring_pop(dr, tail_lpos) != tail_lpos);
+}
+
+/**
+ * dataring_getdatablock() - Return the data block of a descriptor.
+ *
+ * @dr: The data ringbuffer containing the data block.
+ *
+ * @desc: The descriptor to retrieve the data block of.
+ *
+ * @size: A pointer to a variable to set the size of the data block.
+ * @size will include any alignment padding that may have been added.
+ *
+ * Since datablocks always contain contiguous data, the situation can occur
+ * where there is not enough space at the end of the array for a new data
+ * block. In this situation, two data blocks are created:
+ *
+ * * A "wrapping" data block at the end of the data array.
+ * * A "content" data block at the beginning of the data array.
+ *
+ * In this situation, @desc will contain the beginning logical position of the
+ * wrapping data block and the end logical position of the content data block
+ * Note that the ID is stored in the wrapping data block.
+ *
+ * This function transparently handles wrapping/content data blocks to return
+ * the data block with the data content. Note that @desc->id is a private
+ * field and thus must not be directly accessed by callers. For
+ * wrapping/content data blocks, @desc->id of the content data block is
+ * garbage.
+ *
+ * Return: A pointer to the data block. Also, the variable pointed to by @size
+ * is set to the size of the data block.
+ */
+struct dr_datablock *dataring_getdatablock(struct dataring *dr,
+ struct dr_desc *desc, int *size)
+{
+ unsigned long begin_lpos;
+ unsigned long next_lpos;
+
+ begin_lpos = READ_ONCE(desc->begin_lpos);
+ next_lpos = READ_ONCE(desc->next_lpos);
+
+ if (DATA_WRAPS(dr, begin_lpos) == DATA_WRAPS(dr, next_lpos)) {
+ *size = next_lpos - begin_lpos;
+ } else {
+ /* Use the content data block. */
+ *size = DATA_INDEX(dr, next_lpos);
+ begin_lpos = 0;
+ }
+ *size -= sizeof(struct dr_datablock);
+
+ return to_datablock(dr, begin_lpos);
+}
+
+/**
+ * dataring_block_copy() - Copy information referring to a data block.
+ *
+ * @dst: The data block to copy the information to.
+ *
+ * @src: The data block to copy the information from.
+ *
+ * This function only copies the data block information (i.e. begin and
+ * end logical positions for a data block). After calling this function
+ * both descriptors are referring to the same data block.
+ *
+ * This can be useful for tasks that want a local copy of a descriptor
+ * so that validation can be performed without concern of the descriptor
+ * values changing during validation.
+ */
+void dataring_desc_copy(struct dr_desc *dst, struct dr_desc *src)
+{
+ dst->begin_lpos = READ_ONCE(src->begin_lpos);
+ dst->next_lpos = READ_ONCE(src->next_lpos);
+}
diff --git a/kernel/printk/dataring.h b/kernel/printk/dataring.h
new file mode 100644
index 000000000000..346a455a335a
--- /dev/null
+++ b/kernel/printk/dataring.h
@@ -0,0 +1,95 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _KERNEL_PRINTK_DATARING_H
+#define _KERNEL_PRINTK_DATARING_H
+
+#include <linux/atomic.h>
+
+/**
+ * struct dr_datablock - A contiguous block of data reserved for a writer.
+ *
+ * @id: An ID assigned by the writer.
+ * This has no pre-initialized value. IMPORTANT: The size of this
+ * field MUST be less than or equal to the alignment padding used by
+ * ALIGN() in to_db_size() and @id must be located at the beginning of
+ * the structure.
+ *
+ * @data: The writer data.
+ *
+ * A pointer to the beginning of a data block can be directly typecasted to
+ * this structure to access the data block fields and their content.
+ */
+struct dr_datablock {
+ /* private */
+ unsigned long id;
+
+ /* public */
+ char data[0];
+};
+
+/**
+ * struct dr_desc - Meta data describing a data block.
+ *
+ * @begin_lpos: The logical position of the beginning of the data block.
+ * The array index can be derived from the logical position.
+ *
+ * @next_lpos: The logical position of the next data block.
+ * This is the @begin_lpos of the adjacent data block and is used
+ * to determine the size of the data block being described.
+ */
+struct dr_desc {
+ /* private */
+ unsigned long begin_lpos;
+ unsigned long next_lpos;
+};
+
+/**
+ * struct dataring - A data ringbuffer with support for entry IDs.
+ *
+ * @size_bits: The power-of-2 size of the ringbuffer's data array.
+ *
+ * @data: A pointer to the ringbuffer's data array.
+ *
+ * @head_lpos: The @next_lpos value of the most recently pushed data block.
+ * The array index can be derived from the logical position.
+ *
+ * @tail_lpos: The @begin_lpos value of the oldest data block.
+ * The array index can be derived from the logical position.
+ *
+ * @getdesc: A callback function to get a descriptor for a data block.
+ * IMPORTANT: The lookup must be implemented in such a way to
+ * ensure that if the ID of a descriptor has been updated, any
+ * changed data ringbuffer and descriptor fields can be
+ * synchronized using a pairing smp_rmb().
+ *
+ * @getdesc_arg: An argument that will be passed to the getdesc() callback.
+ *
+ * IDs are used in order to get a descriptor given an ID. It is the
+ * responsibility of the user to implement this functionality via
+ * dataring_datablock_setid() and dataring.getdesc().
+ */
+struct dataring {
+ /* private */
+ unsigned int size_bits;
+ char *data;
+ atomic_long_t head_lpos;
+ atomic_long_t tail_lpos;
+
+ struct dr_desc *(*getdesc)(unsigned long id, void *arg);
+ void *getdesc_arg;
+};
+
+bool dataring_checksize(struct dataring *dr, unsigned int size);
+
+bool dataring_pop(struct dataring *dr);
+char *dataring_push(struct dataring *dr, unsigned int size,
+ struct dr_desc *desc, unsigned long id);
+
+void dataring_datablock_setid(struct dataring *dr, struct dr_desc *desc,
+ unsigned long id);
+struct dr_datablock *dataring_getdatablock(struct dataring *dr,
+ struct dr_desc *desc, int *size);
+bool dataring_datablock_isvalid(struct dataring *dr, struct dr_desc *desc);
+void dataring_desc_copy(struct dr_desc *dst, struct dr_desc *src);
+
+#endif /* _KERNEL_PRINTK_DATARING_H */
diff --git a/kernel/printk/numlist.c b/kernel/printk/numlist.c
new file mode 100644
index 000000000000..df3f89e7f7fd
--- /dev/null
+++ b/kernel/printk/numlist.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/sched.h>
+#include "numlist.h"
+
+/**
+ * DOC: numlist overview
+ *
+ * A numlist is a lockless FIFO queue list, where the nodes of the list are
+ * sequentially numbered. A numlist can be iterated over by non-consuming
+ * readers. The purpose of the sequence numbers is so that a non-consuming
+ * reader can identify if it has missed nodes (for example, if they were
+ * popped off the list before the non-consuming reader could finish reading
+ * them).
+ *
+ * IDs vs. Pointers
+ * ----------------
+ * Rather than nodes being linked by pointers, each node is given an ID and
+ * has the ID of the next node in the list. IDs are used in order to avoid an
+ * ABA problem between writers when assigning sequence numbers.
+ *
+ * The ABA problem using pointers can be seen if the list has one node and
+ * while CPU0 is adding a second node another CPU adds and removes nodes
+ * (recycling the node that CPU0 saw as the head)::
+ *
+ * CPU0 CPU1
+ * ---- ----
+ * (enter numlist_push())
+ * head = READ_ONCE(h->head);
+ * seq = READ_ONCE(head->seq);
+ * WRITE_ONCE(n->seq, seq + 1);
+ * WRITE_ONCE(n->next, NULL);
+ * numlist_push(); // add some node
+ * numlist_pop(); // remove "head"
+ * numlist_push(); // re-add "head"
+ * cmpxchg(&h->head, head, n);
+ * WRITE_ONCE(head->next, n);
+ * (exit numlist_push())
+ *
+ * Let the original head have a sequence number of 1. Then after the above
+ * scenario the list nodes could have the following sequence numbers::
+ *
+ * 1 -> 2 -> 3 -> 2
+ *
+ * The problem is that the cmpxchg() is succeeding even though the head that
+ * CPU0 wants to replace is not the head it expects. This problem is avoided
+ * by using IDs, which indirectly provide tagged state references for the
+ * nodes. The number of tagged states per node is::
+ *
+ * sizeof(long) / max number of nodes in the list
+ *
+ * Using IDs instead of pointers, the cmpxchg() will fail as it should.
+ *
+ * Numlist users are required to implement a function to map ID values to
+ * numlist nodes and are responsible for issuing IDs to numlist nodes.
+ *
+ * Behavior
+ * --------
+ * A numbered list has two conditions in order to pop the tail node from the
+ * list:
+ *
+ * * The tail node is not the only node on the list.
+ * * The tail node is not busy.
+ *
+ * The first condition is necessary to support writers adding new nodes to the
+ * list. Such writers must modify the "next pointer" of the previous head,
+ * which becomes complex/racey if that previous head could be removed from the
+ * list and recycled during the update. One consequence of this condition is
+ * that numbered lists must be initialized such that the tail and head are
+ * already pointing to a node.
+ *
+ * For the second condition, the term "busy" is defined by the user, who must
+ * implement a function to report if a node is busy. This gives the user
+ * control to decide if a node can be removed from the list.
+ *
+ * For non-consuming readers, the function numlist_read() needs to be called a
+ * second time after reading node data to ensure the node is still valid.
+ * Nodes can become invalid while being read by non-consuming readers.
+ */
+
+/**
+ * numlist_read() - Read the information stored within a node.
+ *
+ * @nl: The numbered list to use.
+ *
+ * @id: The ID of the node to read from.
+ *
+ * @seq: A pointer to a variable to store the sequence number of the node.
+ * This may be NULL if the caller is not interested in this value.
+ *
+ * @next_id: A pointer to a variable to store the next node ID in the list.
+ * This may be NULL if the caller is not interested in this value.
+ * If this function stores the node's own ID to @next_id, this is
+ * the last node on the list. Note that a node does not know its own
+ * ID. It is up to the caller to track the node IDs during
+ * traversal.
+ *
+ * This function can be used to read node contents for both a node that is
+ * part of a list or a node that is off-list. (Although, from this function it
+ * is not possible to identify if the node is part of a list.)
+ *
+ * Readers should call this function a second time after reading node data to
+ * ensure the node is still valid because it may have become invalid while the
+ * reader was reading the data.
+ *
+ * Return: true if the node was read, otherwise false.
+ *
+ * This function will fail if @id is not valid anytime during this function.
+ */
+bool numlist_read(struct numlist *nl, unsigned long id, unsigned long *seq,
+ unsigned long *next_id)
+{
+ struct nl_node *n;
+
+ n = nl->node(id, nl->node_arg);
+ if (!n)
+ return false;
+
+ if (seq) {
+ /*
+ * aA:
+ *
+ * Adresss dependency on @id.
+ */
+ *seq = READ_ONCE(n->seq);
+ }
+
+ if (next_id) {
+ /*
+ * aB:
+ *
+ * Adresss dependency on @id.
+ */
+ *next_id = READ_ONCE(n->next_id);
+ }
+
+ /*
+ * aC:
+ *
+ * Validate @seq and @next_id by making sure the node has not been
+ * recycled. This pairs with the smp_wmb() of the function that
+ * updates the ID, which is required to issue an smp_wmb() after
+ * updating the ID and before modifying fields of the node. See the
+ * smp_wmb() in prb_reserve() for details.
+ */
+ smp_rmb();
+
+ return (nl->node(id, nl->node_arg) != NULL);
+}
+
+/**
+ * numlist_read_tail() - Read the oldest node.
+ *
+ * @nl: The numbered list to use.
+ *
+ * @seq: A pointer to a variable to store the sequence number of the node.
+ * This may be NULL if the caller is not interested in this value.
+ *
+ * @next_id: A pointer to a variable to store the next node ID in the list.
+ * This may be NULL if the caller is not interested in this value.
+ *
+ * This function will not return until it has successfully read the tail
+ * node.
+ *
+ * Return: The ID of the tail node.
+ */
+unsigned long numlist_read_tail(struct numlist *nl, unsigned long *seq,
+ unsigned long *next_id)
+{
+ unsigned long tail_id;
+
+ tail_id = atomic_long_read(&nl->tail_id);
+
+ while (!numlist_read(nl, tail_id, seq, next_id)) {
+ /* @tail_id is invalid. Try again with an updated value. */
+
+ cpu_relax();
+
+ tail_id = atomic_long_read(&nl->tail_id);
+ }
+
+ return tail_id;
+}
+
+/**
+ * numlist_push() - Add a node to the list and assign it a sequence number.
+ *
+ * @nl: The numbered list to push to.
+ *
+ * @n: A node to push to the numbered list.
+ * The node must not already be part of a list.
+ *
+ * @id: The ID of the node.
+ *
+ * A node is added in two steps: The first step is to make this node the
+ * head, which causes a following push to add to this node. The second step is
+ * to update @next_id of the former head node to point to this one, which
+ * makes this node visible to any task that sees the former head node.
+ */
+void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
+{
+ unsigned long head_id;
+ unsigned long seq;
+ unsigned long r;
+
+ /*
+ * bA:
+ *
+ * Setup the node to be a list terminator: next_id == id.
+ */
+ WRITE_ONCE(n->next_id, id);
+
+ /* bB: #1 */
+ head_id = atomic_long_read(&nl->head_id);
+
+ for (;;) {
+ /* bC: */
+ while (!numlist_read(nl, head_id, &seq, NULL)) {
+ /*
+ * @head_id is invalid. Try again with an
+ * updated value.
+ */
+
+ cpu_relax();
+
+ /* bB: #2 */
+ head_id = atomic_long_read(&nl->head_id);
+ }
+
+ /*
+ * bD:
+ *
+ * Set @seq to +1 of @seq from the previous head.
+ *
+ * Memory barrier involvement:
+ *
+ * If bB reads from bE, then bC->aA reads from bD.
+ *
+ * Relies on:
+ *
+ * RELEASE from bD to bE
+ * matching
+ * ADDRESS DEP. from bB to bC->aA
+ */
+ WRITE_ONCE(n->seq, seq + 1);
+
+ /*
+ * bE:
+ *
+ * This store_release() guarantees that @seq and @next are
+ * stored before the node with @id is visible to any popping
+ * writers. It pairs with the address dependency between @id
+ * and @seq/@next provided by numlist_read(). See bD and bF
+ * for details.
+ */
+ r = atomic_long_cmpxchg_release(&nl->head_id, head_id, id);
+ if (r == head_id)
+ break;
+
+ /* bB: #3 */
+ head_id = r;
+ }
+
+ n = nl->node(head_id, nl->node_arg);
+
+ /*
+ * The old head (which is still the list terminator), cannot be
+ * removed because the list will always have at least one node.
+ * Therefore @n must be non-NULL.
+ */
+
+ /*
+ * bF: the STORE part for @next_id
+ *
+ * Set @next_id of the previous head to @id.
+ *
+ * Memory barrier involvement:
+ *
+ * If bB reads from bE, then bF overwrites bA.
+ *
+ * Relies on:
+ *
+ * RELEASE from bA to bE
+ * matching
+ * ADDRESS DEP. from bB to bF
+ */
+ /*
+ * bG: the RELEASE part for @next_id
+ *
+ * This _release() guarantees that a reader will see the updates to
+ * this node's @seq/@next_id if the reader saw the @next_id of the
+ * previous node in the list. It pairs with the address dependency
+ * between @id and @seq/@next provided by numlist_read().
+ *
+ * Memory barrier involvement:
+ *
+ * If aB reads from bG, then aA' reads from bD, where aA' is in
+ * numlist_read() to read the node ID from bG.
+ * If aB reads from bG, then aB' reads from bA, where aB' is in
+ * numlist_read() to read the node ID from bG.
+ *
+ * Relies on:
+ *
+ * RELEASE from bG to bD
+ * matching
+ * ADDRESS DEP. from aB to aA'
+ *
+ * RELEASE from bG to bA
+ * matching
+ * ADDRESS DEP. from aB to aB'
+ */
+ smp_store_release(&n->next_id, id);
+}
+
+/**
+ * numlist_pop() - Remove the oldest node from the list.
+ *
+ * @nl: The numbered list from which to remove the tail node.
+ *
+ * The tail node can only be removed if two conditions are satisfied:
+ *
+ * * The node is not the only node on the list.
+ * * The node is not busy.
+ *
+ * If, during this function, another task removes the tail, this function
+ * will try again with the new tail.
+ *
+ * Return: The removed node or NULL if the tail node cannot be removed.
+ */
+struct nl_node *numlist_pop(struct numlist *nl)
+{
+ unsigned long tail_id;
+ unsigned long next_id;
+ unsigned long r;
+
+ /* cA: #1 */
+ tail_id = atomic_long_read(&nl->tail_id);
+
+ for (;;) {
+ /* cB */
+ while (!numlist_read(nl, tail_id, NULL, &next_id)) {
+ /*
+ * @tail_id is invalid. Try again with an
+ * updated value.
+ */
+
+ cpu_relax();
+
+ /* cA: #2 */
+ tail_id = atomic_long_read(&nl->tail_id);
+ }
+
+ /* Make sure the node is not the only node on the list. */
+ if (next_id == tail_id)
+ return NULL;
+
+ /*
+ * cC:
+ *
+ * Make sure the node is not busy.
+ */
+ if (nl->busy(tail_id, nl->busy_arg))
+ return NULL;
+
+ r = atomic_long_cmpxchg_relaxed(&nl->tail_id,
+ tail_id, next_id);
+ if (r == tail_id)
+ break;
+
+ /* cA: #3 */
+ tail_id = r;
+ }
+
+ return nl->node(tail_id, nl->node_arg);
+}
diff --git a/kernel/printk/numlist.h b/kernel/printk/numlist.h
new file mode 100644
index 000000000000..cdc3b21e6597
--- /dev/null
+++ b/kernel/printk/numlist.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _KERNEL_PRINTK_NUMLIST_H
+#define _KERNEL_PRINTK_NUMLIST_H
+
+#include <linux/atomic.h>
+
+/**
+ * struct nl_node - A node of a numbered list.
+ *
+ * @seq: The sequence number assigned to this node.
+ *
+ * @next_id: The ID of the next node of the numbered list.
+ * If this is the node's own ID, this is the last node on the list.
+ * Note that a node does not know its own ID. It is up to the caller
+ * to track the node IDs during traversal.
+ *
+ * The fields are private and must not be directly accessed by users.
+ * numlist_read() is available for users to access these fields.
+ */
+struct nl_node {
+ /* private */
+ unsigned long seq;
+ unsigned long next_id;
+};
+
+/**
+ * struct numlist - A numbered list of nodes.
+ *
+ * @head_id: The ID of the most recently pushed node.
+ * Disregarding overflow, this node will have the highest sequence
+ * number.
+ *
+ * @tail_id: The ID of the oldest node.
+ * Disregarding overflow, this node will have the lowest sequence
+ * number.
+ *
+ * @node: A callback function to get a node for an ID.
+ * IDs must be implemented such that an smp_wmb() is issued after
+ * the ID of a node has been modified and before any further
+ * changes to the node are performed.
+ *
+ * @node_arg: An argument that will be passed to the node() callback.
+ *
+ * @busy: A callback function to determine if a node can be removed.
+ * Nodes that are "busy" will not be removed.
+ *
+ * @busy_arg: An argument that will be passed to the busy() callback.
+ *
+ * List nodes are sorted by their sequence numbers.
+ */
+struct numlist {
+ /* private */
+ atomic_long_t head_id;
+ atomic_long_t tail_id;
+
+ struct nl_node *(*node)(unsigned long id, void *arg);
+ void *node_arg;
+
+ bool (*busy)(unsigned long id, void *arg);
+ void *busy_arg;
+};
+
+void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id);
+struct nl_node *numlist_pop(struct numlist *nl);
+
+unsigned long numlist_read_tail(struct numlist *nl, unsigned long *seq,
+ unsigned long *next_id);
+bool numlist_read(struct numlist *nl, unsigned long id, unsigned long *seq,
+ unsigned long *next_id);
+
+#endif /* _KERNEL_PRINTK_NUMLIST_H */
diff --git a/kernel/printk/ringbuffer.c b/kernel/printk/ringbuffer.c
new file mode 100644
index 000000000000..59bf59aba3de
--- /dev/null
+++ b/kernel/printk/ringbuffer.c
@@ -0,0 +1,800 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/irqflags.h>
+#include <linux/string.h>
+#include <linux/err.h>
+#include "ringbuffer.h"
+
+/**
+ * DOC: prb overview
+ *
+ * As the name suggests, this ringbuffer was implemented specifically to
+ * serve the needs of the printk() infrastructure. The ringbuffer itself is
+ * not specific to printk and could be used for other purposes. However, the
+ * requirements and semantics of printk are rather unique. If you intend to
+ * use this ringbuffer for anything other than printk, you need to be very
+ * clear on its features, behavior, and limitations.
+ *
+ * Features
+ * --------
+ * * single global buffer
+ * * resides in initialized data section (available at early boot)
+ * * supports multiple concurrent lockless readers and writers
+ * * safe from any context (including NMI)
+ * * groups bytes into variable length data blocks (referenced by entries)
+ * * entries tagged with sequence numbers
+ *
+ * Terminology
+ * -----------
+ * * data block: A contiguous block of data containing an ID of an associated
+ * descriptor and the raw data from the writer.
+ *
+ * * descriptor: Meta data for a data block containing an ID, the logical
+ * positions of the associated data block, a unique sequence
+ * number, and a pointer to the next (newer) descriptor.
+ *
+ * * entry: A high level object used by the readers/writers that contains
+ * a descriptor as well as state information during the
+ * reserve/commit window.
+ *
+ * Data Structure
+ * --------------
+ * The ringbuffer implementation makes use of two internal data structures:
+ *
+ * * a data ringbuffer (dataring) to manage the raw data storage of entries
+ *
+ * * a numbered list (numlist) to sort and sequentially label committed
+ * entries
+ *
+ * Both dataring and numlist implementations make use of ID numbers to
+ * identify their entities. The ringbuffer implementation creates and manages
+ * those IDs using the same value for all data structures (i.e. ID=22 for a
+ * dataring data block and descriptor corresponds to ID=22 for the numlist
+ * node.)
+ *
+ * The ringbuffer implementation provides a high-level descriptor (prb_desc).
+ * This descriptor contains the ID, a dataring descriptor, and a numlist node,
+ * thus acting as the unifying entity for the various data structures.
+ *
+ * The descriptors are stored within a static array and IDs map directly to
+ * array offsets. The full range of IDs (unsigned long) is used in order to
+ * provide tagged state references for the descriptors to avoid ABA issues
+ * when the descriptors are recycled. The number of tagged states per
+ * descriptor is::
+ *
+ * sizeof(long) / number of descriptors in the array
+ *
+ * Behavior
+ * --------
+ * Since the printk ringbuffer is lockless, there exists no synchronization
+ * between readers and writers. Basically writers are the tasks in control and
+ * may overwrite any and all committed data at any time and from any context.
+ * For this reason readers can miss entries if they are overwritten before the
+ * reader was able to access the data. The reader API implementation is such
+ * that reader access to data is atomic, so there is no risk of readers having
+ * to deal with partial or corrupt data blocks. Also, entries include the
+ * sequence number of the associated descriptor so that readers can recognize
+ * if entries were missed.
+ *
+ * Writing to the ringbuffer consists of two steps:
+ *
+ * (1) Reserve data (prb_reserve()).
+ * (2) Commit the reserved data (prb_commit()).
+ *
+ * The sequence number is assigned on commit.
+ *
+ * Once committed, a writer must no longer access the data directly. This is
+ * because the data may have been overwritten and no longer exists. If a
+ * writer must access the data, it should either keep a private copy before
+ * committing or use the reader API to gain access to the data.
+ *
+ * Because of how the data backend (dataring) is implemented, data blocks that
+ * have been reserved but not yet committed act as blockers, preventing future
+ * writers from filling the ringbuffer beyond the location of the reserved but
+ * not yet committed data block region. For this reason it is important that
+ * writers perform both reserve and commit as quickly as possible. To assist
+ * with this, local interrupts are disabled during the reserve/commit window.
+ * Writers in NMI contexts can still preempt any other writers, but as long
+ * as these writers do not write a large amount of data with respect to the
+ * ringbuffer size, this should not become an issue.
+ *
+ * The reader API provides an iterator to traverse the list of committed
+ * entries. The iterator utilizes a user-provided entry buffer to copy into
+ * and validate the entry data upon traversal.
+ *
+ * Entry Lifecycle
+ * ---------------
+ * This is an overview of the lifecycle for an entry. It is meant to assist in
+ * understanding how the various data structures and their procedures are
+ * coordinated. Following is what happens within the reserve/commit window of
+ * a writer:
+ *
+ * (1) Reserve a descriptor (part 1 of 3): Invalidate the oldest data block by
+ * moving the dataring tail forward. (See the next step for why this was
+ * necessary.)
+ *
+ * (2) Reserve a descriptor (part 2 of 3): Remove the oldest descriptor from
+ * the committed list by removing the numlist tail. (Note that
+ * descriptors can only be removed if their associated data block is
+ * invalid. That invalidation was performed in the previous step.)
+ *
+ * (3) Reserve a descriptor (part 3 of 3): Invalidate the removed descriptor
+ * by modifying its ID.
+ *
+ * (4) Reserve data: Reserve a new data block by moving the dataring head
+ * forward.
+ *
+ * (5) Write data: Write the user data into the reserved data block.
+ *
+ * (6) Commit the data: Set the descriptor ID in the reserved data block.
+ *
+ * (7) Commit the descriptor (part 1 of 2): Add the descriptor to the
+ * committed list by setting it as the numlist head.
+ *
+ * (8) Commit the descriptor (part 2 of 2): Link the descriptor to the older
+ * committed entries by setting it as the "next pointer" of the former
+ * numlist head.
+ *
+ * Usage
+ * -----
+ * Here are some simple examples demonstrating writers and readers. For the
+ * examples it is assumed that a global ringbuffer is available::
+ *
+ * DECLARE_PRINTKRB(rb, 5, 7);
+ *
+ * This ringbuffer has a size of 4096 bytes, expects an average data size of
+ * 32 bytes, and allows up to 128 descriptors.
+ *
+ * Sample writer code::
+ *
+ * struct prb_reserved_entry e;
+ * char *s;
+ *
+ * s = prb_reserve(&e, &rb, 16);
+ * if (s) {
+ * snprintf(s, 16, "Hello, world!");
+ * prb_commit(&e);
+ * }
+ *
+ * Sample reader code::
+ *
+ * DECLARE_PRINTKRB_ENTRY(e, 64);
+ * struct prb_iterator iter;
+ * u64 last_seq = 0;
+ * int len;
+ * char *s;
+ *
+ * prb_for_each_entry(&iter, &rb, &e, len) {
+ * if (e.seq - last_seq != 1) {
+ * pr_warn("LOST %llu ENTRIES\n",
+ * e.seq - (last_seq + 1));
+ * }
+ * last_seq = e.seq;
+ *
+ * s = (char *)&e.buffer[0];
+ * if (len >= 64) {
+ * pr_warn("ENTRY %llu TRUNCATED\n", e.seq);
+ * s[64 - 1] = 0;
+ * }
+ * pr_info("%llu: %s\n", e.seq, s);
+ * }
+ */
+
+#define DESCS_COUNT(rb) (1 << (rb)->desc_count_bits)
+#define DESCS_COUNT_MASK(rb) (DESCS_COUNT(rb) - 1)
+
+/**
+ * to_desc() - Translate an ID to a descriptor.
+ *
+ * @rb: The ringbuffer containing the descriptor.
+ *
+ * @id: An ID value to translate.
+ *
+ * This is a low-level function to provide a safe mapping that always maps
+ * to a descriptor. Callers are responsible for checking the ID of the
+ * returned descriptor to see if the mapping was correct.
+ *
+ * Return: A pointer to a descriptor.
+ */
+static struct prb_desc *to_desc(struct printk_ringbuffer *rb,
+ unsigned long id)
+{
+ return &rb->descs[id & DESCS_COUNT_MASK(rb)];
+}
+
+/**
+ * prb_desc_node() - Numbered list callback to lookup a node from an ID.
+ *
+ * @id: The ID to lookup.
+ *
+ * @arg: The ringbuffer containing the numlist the node belongs to.
+ *
+ * Return: A pointer to the node or NULL if the ID is unknown.
+ *
+ * The returned pointer has an address dependency to @id.
+ */
+struct nl_node *prb_desc_node(unsigned long id, void *arg)
+{
+ struct prb_desc *d = to_desc(arg, id);
+
+ if (id != atomic_long_read(&d->id))
+ return NULL;
+
+ return &d->list;
+}
+
+/**
+ * prb_desc_busy() - Numbered list callback to report if a node is busy.
+ *
+ * @id: The ID of the node to check.
+ *
+ * @arg: The ringbuffer containing the numlist the node belongs to.
+ *
+ * This callback is used by numbered lists to determine if a node can be
+ * removed from the committed list. A descriptor is considered busy if the
+ * data block it references is valid. Data blocks must be invalidated before
+ * descriptors can be recycled.
+ *
+ * Return: true if the descriptor's data block is valid, otherwise false.
+ *
+ * If the specified ID is unknown, the (unknown) descriptor is reported as
+ * not busy.
+ */
+bool prb_desc_busy(unsigned long id, void *arg)
+{
+ struct printk_ringbuffer *rb = arg;
+ struct prb_desc *d = to_desc(rb, id);
+
+ /* hA: */
+ if (!dataring_datablock_isvalid(&rb->dr, &d->desc))
+ return false;
+
+ /*
+ * hB:
+ *
+ * Ensure that the data block that was just checked belongs to the
+ * expected descriptor. Writers modify the descriptor ID before making
+ * any other changes to the data block or descriptor. This pairs with
+ * smp_wmb() in prb_reserve(). See kB for details.
+ */
+ smp_rmb();
+
+ /* hC: */
+ return (id == atomic_long_read(&d->id));
+}
+
+/**
+ * prb_getdesc() - Data ringbuffer callback to lookup a descriptor from an ID.
+ *
+ * @id: The ID to lookup.
+ *
+ * @arg: The ringbuffer containing the dataring the descriptor belongs to.
+ *
+ * The data ringbuffer requires that the caller, by issuing an smp_rmb()
+ * after this function, will see data ringbuffer updates that were visible to
+ * the task that last updated the ID.
+ *
+ * Return: A pointer to the dataring descriptor or NULL if the ID is unknown.
+ *
+ * The returned pointer has an address dependency to @id.
+ */
+struct dr_desc *prb_getdesc(unsigned long id, void *arg)
+{
+ struct prb_desc *d = to_desc(arg, id);
+
+ /*
+ * iA:
+ *
+ * Since the values of @id correspond to the values of @d->id, it is
+ * enough that the ID updating task performs a _release() in
+ * assign_desc(). The smp_rmb() issued by the caller after calling
+ * this function pairs with that _release(). See jB for details.
+ */
+ if (id != atomic_long_read(&d->id))
+ return NULL;
+
+ /* iB: */
+ return &d->desc;
+}
+
+/**
+ * assign_desc() - Assign a descriptor to the caller.
+ *
+ * @e: The entry structure to store the assigned descriptor to.
+ *
+ * Find an available descriptor to assign to the caller. First it is checked
+ * if the tail descriptor from the committed list can be recycled. If not,
+ * perhaps a never-used descriptor is available. Otherwise, data blocks will
+ * be invalidated until the tail descriptor from the committed list can be
+ * recycled.
+ *
+ * Assigned descriptors are invalid until data has been reserved for them.
+ *
+ * Return: true if a descriptor was assigned, otherwise false.
+ *
+ * This will only fail if it was not possible to invalidate data blocks in
+ * order to recycle a descriptor. This can happen if a writer has reserved but
+ * not yet committed data and that reserved data is currently the oldest data.
+ */
+static bool assign_desc(struct prb_reserved_entry *e)
+{
+ struct printk_ringbuffer *rb = e->rb;
+ struct prb_desc *d;
+ struct nl_node *n;
+ unsigned long i;
+
+ for (;;) {
+ /*
+ * jA:
+ *
+ * Try to recycle a descriptor on the committed list.
+ */
+ n = numlist_pop(&rb->nl);
+ if (n) {
+ d = container_of(n, struct prb_desc, list);
+ break;
+ }
+
+ /* Fallback to static never-used descriptors. */
+ if (atomic_read(&rb->desc_next_unused) < DESCS_COUNT(rb)) {
+ i = atomic_fetch_inc(&rb->desc_next_unused);
+ if (i < DESCS_COUNT(rb)) {
+ d = &rb->descs[i];
+ atomic_long_set(&d->id, i);
+ break;
+ }
+ }
+
+ /*
+ * No descriptor available. Make one available for recycling
+ * by invalidating data (which some descriptor will be
+ * referencing).
+ */
+ if (!dataring_pop(&rb->dr))
+ return false;
+ }
+
+ /*
+ * jB:
+ *
+ * Modify the descriptor ID so that users of the descriptor see that
+ * it has been recycled. A _release() is used so that prb_getdesc()
+ * callers can see all data ringbuffer updates after issuing a
+ * pairing smb_rmb(). See iA for details.
+ *
+ * Memory barrier involvement:
+ *
+ * If dB->iA reads from jB, then dI reads the same value as
+ * jA->cD->hA.
+ *
+ * Relies on:
+ *
+ * RELEASE from jA->cD->hA to jB
+ * matching
+ * RMB between dB->iA and dI
+ */
+ atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
+ DESCS_COUNT(rb));
+
+ e->desc = d;
+ return true;
+}
+
+/**
+ * prb_reserve() - Reserve data in the ringbuffer.
+ *
+ * @e: The entry structure to setup.
+ *
+ * @rb: The ringbuffer to reserve data in.
+ *
+ * @size: The size of the data to reserve.
+ *
+ * This is the public function available to writers to reserve data.
+ *
+ * Context: Any context. Disables local interrupts on success.
+ * Return: A pointer to the reserved data or an ERR_PTR if data could not be
+ * reserved.
+ *
+ * If the provided size is legal, this will only fail if it was not possible
+ * to invalidate the oldest data block. This can happen if a writer has
+ * reserved but not yet committed data and that reserved data is currently
+ * the oldest data.
+ *
+ * The ERR_PTR values and their meaning:
+ *
+ * * -EINVAL: illegal @size value
+ * * -EBUSY: failed to reserve a descriptor (@fail count incremented)
+ * * -ENOMEM: failed to reserve data (invalid descriptor committed)
+ */
+char *prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
+ unsigned int size)
+{
+ struct prb_desc *d;
+ unsigned long id;
+ char *buf;
+
+ if (!dataring_checksize(&rb->dr, size))
+ return ERR_PTR(-EINVAL);
+
+ e->rb = rb;
+
+ /*
+ * Disable interrupts during the reserve/commit window in order to
+ * minimize the number of reserved but not yet committed data blocks
+ * in the data ringbuffer. Although such data blocks are not bad per
+ * se, they act as blockers for writers once the data ringbuffer has
+ * wrapped back to them.
+ */
+ local_irq_save(e->irqflags);
+
+ /* kA: */
+ if (!assign_desc(e)) {
+ /* Failures to reserve descriptors are counted. */
+ atomic_long_inc(&rb->fail);
+ buf = ERR_PTR(-EBUSY);
+ goto err_out;
+ }
+
+ d = e->desc;
+
+ /*
+ * kB:
+ *
+ * The descriptor ID has been updated so that its users can see that
+ * it is now invalid. Issue an smp_wmb() so that upcoming changes to
+ * the descriptor will not be associated with the old descriptor ID.
+ * This pairs with the smp_rmb() of prb_desc_busy() (see hB for
+ * details) and the smp_rmb() within numlist_read() and the smp_rmb()
+ * of prb_iter_next_valid_entry() (see mD for details).
+ *
+ * Memory barrier involvement:
+ *
+ * If hA reads from kC, then hC reads from jB.
+ * If mC reads from kC, then mE reads from jB.
+ *
+ * Relies on:
+ *
+ * WMB between jB and kC
+ * matching
+ * RMB between hA and hC
+ *
+ * WMB between jB and kC
+ * matching
+ * RMB between mC and mE
+ */
+ smp_wmb();
+
+ id = atomic_long_read(&d->id);
+
+ /* kC: */
+ buf = dataring_push(&rb->dr, size, &d->desc, id);
+ if (!buf) {
+ /* Put the invalid descriptor on the committed list. */
+ numlist_push(&rb->nl, &d->list, id);
+ buf = ERR_PTR(-ENOMEM);
+ goto err_out;
+ }
+
+ return buf;
+err_out:
+ local_irq_restore(e->irqflags);
+ return buf;
+}
+EXPORT_SYMBOL(prb_reserve);
+
+/**
+ * prb_commit() - Commit (previously reserved) data to the ringbuffer.
+ *
+ * @e: The entry containing the reserved data information.
+ *
+ * This is the public function available to writers to commit data.
+ *
+ * Context: Any context. Enables local interrupts.
+ */
+void prb_commit(struct prb_reserved_entry *e)
+{
+ struct printk_ringbuffer *rb = e->rb;
+ struct prb_desc *d = e->desc;
+ unsigned long id;
+
+ id = atomic_long_read(&d->id);
+
+ /*
+ * lA:
+ *
+ * Commit all writer data to the data ringbuffer. By setting the ID,
+ * the data will be available for invalidation.
+ */
+ dataring_datablock_setid(&rb->dr, &d->desc, id);
+
+ /*
+ * lB:
+ *
+ * Add the descriptor to the committed list. Then it will be visible
+ * to readers and popping writers (as long as all preceding
+ * descriptors have also completed the numlist_push() call).
+ */
+ numlist_push(&rb->nl, &d->list, id);
+
+ /* The reserve/commit window is closed. Re-enable interrupts. */
+ local_irq_restore(e->irqflags);
+}
+EXPORT_SYMBOL(prb_commit);
+
+/**
+ * prb_iter_init() - Initialize an iterator.
+ *
+ * @iter: The iterator to initialize.
+ *
+ * @rb: The ringbuffer to associate with the iterator.
+ *
+ * @e: An entry structure to use during iteration.
+ *
+ * This is the public function available to readers to initialize an iterator.
+ *
+ * As an alternative, DECLARE_PRINTKRB_ITER() can be used.
+ *
+ * The interator is initialized to the beginning of the committed list (the
+ * oldest committed entry).
+ *
+ * Context: Any context.
+ */
+void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb,
+ struct prb_entry *e)
+{
+ iter->rb = rb;
+ iter->e = e;
+
+ iter->last_id = 0;
+ iter->last_seq = 0;
+ iter->next_id = 0;
+}
+EXPORT_SYMBOL(prb_iter_init);
+
+/**
+ * reset_iter() - Reset an iterator to the beginning of the committed list.
+ *
+ * @iter: The iterator to reset.
+ */
+static void reset_iter(struct prb_iterator *iter)
+{
+ unsigned long last_seq;
+
+ iter->next_id = numlist_read_tail(&iter->rb->nl, &last_seq, NULL);
+
+ /* Pretend the entry preceding the oldest entry was last seen. */
+ iter->last_seq = last_seq - 1;
+
+ /*
+ * @last_id is only significant in EOL situations, when it is equal to
+ * @next_id and the iterator wants to read the entry after @last_id as
+ * the next entry. Set @last_id to something other than @next_id. So
+ * that the iterator will read @next_id as the next entry.
+ */
+ iter->last_id = iter->next_id - 1;
+}
+
+/**
+ * setup_next() - Prepare an iterator to read the next entry.
+ *
+ * @nl: The numbered list used for the committed list.
+ *
+ * @iter: The iterator to prepare.
+ *
+ * Evaluate the current state of the iterator and committed list and determine
+ * if and how the iterator can proceed. For example, if the iterator
+ * previously hit the end of the list, there may now be new entries available
+ * for traversing.
+ *
+ * If the last seen entry no longer exists, the iterator is reset to the
+ * beginning of the committed list. The oldest available entry will be newer
+ * than the last seen entry.
+ *
+ * Return: true if a next entry is available for reading, otherwise false.
+ */
+static bool setup_next(struct numlist *nl, struct prb_iterator *iter)
+{
+ unsigned long next;
+
+ if (iter->last_id == iter->next_id) {
+ /* previously hit EOL, check for updated next */
+
+ if (!numlist_read(nl, iter->last_id, NULL, &next))
+ reset_iter(iter);
+ else if (next != iter->next_id)
+ iter->next_id = next;
+ else
+ return false;
+ }
+
+ return true;
+}
+
+/**
+ * prb_iter_next_valid_entry() - Traverse to and read the next (newer) entry.
+ *
+ * @iter: The iterator used for list traversal.
+ *
+ * This is the public function available to readers to traverse the committed
+ * entry list.
+ *
+ * If the iterator has not yet been used or has fallen behind and no longer
+ * has a reference to a valid entry, the next read entry will be the oldest
+ * in the committed list (which will be newer than the previously read entry).
+ *
+ * Context: Any context.
+ * Return: The size of the entry data or 0 if there is no next entry.
+ *
+ * The entry data is padded (if necessary) to allow alignment for following
+ * data blocks. Therefore the returned size value can be larger than the size
+ * reserved. If users want the exact size to be tracked, they should include
+ * this information within their data.
+ */
+int prb_iter_next_valid_entry(struct prb_iterator *iter)
+{
+ struct printk_ringbuffer *rb = iter->rb;
+ struct prb_entry *e = iter->e;
+ struct dataring *dr = &rb->dr;
+ struct numlist *nl = &rb->nl;
+ struct dr_datablock *db;
+ unsigned long next_id;
+ struct dr_desc desc;
+ struct prb_desc *d;
+ struct nl_node *n;
+ unsigned long seq;
+ unsigned long id;
+ int size;
+
+ if (!setup_next(nl, iter))
+ return 0;
+
+ for (;;) {
+ id = iter->next_id;
+
+ /* mA: */
+ if (!numlist_read(nl, id, &seq, &next_id)) {
+ /* @id not available */
+ reset_iter(iter);
+ continue;
+ }
+
+ if (seq != iter->last_seq + 1) {
+ /* @seq has an unexpected value */
+ reset_iter(iter);
+ continue;
+ }
+
+ /* mB: */
+ n = prb_desc_node(id, rb);
+ if (!n) {
+ /* @id has become invalid */
+ reset_iter(iter);
+ continue;
+ }
+
+ /* advance iter */
+ iter->last_id = id;
+ iter->last_seq = seq;
+ iter->next_id = next_id;
+
+ d = container_of(n, struct prb_desc, list);
+
+ /* get a local copy to allow non-racey validation */
+ dataring_desc_copy(&desc, &d->desc);
+
+ /* mC: */
+ if (dataring_datablock_isvalid(dr, &desc)) {
+ e->seq = seq;
+
+ db = dataring_getdatablock(dr, &desc, &size);
+ memcpy(&e->buffer[0], &db->data[0],
+ size > e->buffer_size ? e->buffer_size : size);
+
+ /*
+ * mD:
+ *
+ * Now that the data is copied, re-validate that @id
+ * and @desc are still valid. For @id validation, this
+ * pairs with the smp_wmb() in prb_reserve() (see kB
+ * for details). For @desc validation, this pairs with
+ * the smp_mb() in dataring_push() (see fB for
+ * details).
+ */
+ smp_rmb();
+
+ /* mE: */
+ if (prb_desc_node(id, rb) &&
+ /* mF: */
+ dataring_datablock_isvalid(dr, &desc)) {
+ return size;
+ }
+ }
+
+ /* hit EOL? */
+ if (next_id == id)
+ return 0;
+ }
+}
+EXPORT_SYMBOL(prb_iter_next_valid_entry);
+
+/**
+ * prb_iter_sync() - Position an iterator to that of another iterator.
+ *
+ * @dst: The iterator to modify.
+ *
+ * @src: The iterator to sync from.
+ *
+ * This is the public function available to readers to set an iterator to
+ * the position of another iterator. This is particularly useful for making
+ * backup copies of an iterator in case a form of rewinding is needed or if
+ * one iterator should continue where another left off.
+ *
+ * Note that the destination iterator must be previously initialized. The
+ * prb_entry provided during initialization will continue to be used. The
+ * iterator being sync'd from is allowed to be using a different prb_entry.
+ *
+ * Also note that if the iterator being sync'd from is traversing a
+ * different ringbuffer, the modified iterator will now also traverse that
+ * ringbuffer.
+ *
+ * Context: Any context.
+ *
+ * It is safe to call this function from any context and state. But note
+ * that this function is not atomic. Callers must not sync iterators that
+ * can be accessed by other tasks/contexts unless proper synchronization is
+ * used.
+ */
+void prb_iter_sync(struct prb_iterator *dst, struct prb_iterator *src)
+{
+ /* copy everything except the entry buffer */
+
+ dst->rb = src->rb;
+ dst->last_id = src->last_id;
+ dst->last_seq = src->last_seq;
+ dst->next_id = src->next_id;
+}
+EXPORT_SYMBOL(prb_iter_sync);
+
+/**
+ * prb_iter_peek_next_entry() - Check if there is a next (newer) entry.
+ *
+ * @iter: The iterator used for list traversal.
+ *
+ * This is the public function available to readers to check if a newer
+ * entry is available.
+ *
+ * Context: Any context.
+ * Return: true if there is a next entry, otherwise false.
+ */
+bool prb_iter_peek_next_entry(struct prb_iterator *iter)
+{
+ DECLARE_PRINTKRB_ENTRY(e, 1);
+ DECLARE_PRINTKRB_ITER(iter_copy, iter->rb, &e);
+
+ prb_iter_sync(&iter_copy, iter);
+
+ return (prb_iter_next_valid_entry(&iter_copy) != 0);
+}
+EXPORT_SYMBOL(prb_iter_peek_next_entry);
+
+/**
+ * prb_getfail() - Read the descriptor reservation failure counter.
+ *
+ * @rb: The ringbuffer whose counter to read.
+ *
+ * This is the public function available to readers to see how many descriptor
+ * reservation failures exist.
+ *
+ * The counter only counts failures to assign a descriptor. Failures due to
+ * failing to reserve data are not counted because the associated (invalid)
+ * descriptor with its sequence number will exist and readers will be able to
+ * identify that as a lost entry by seeing a jump in the sequence number.
+ *
+ * Context: Any context.
+ * Return: The number of descriptor reservation failures for this ringbuffer.
+ */
+unsigned long prb_getfail(struct printk_ringbuffer *rb)
+{
+ return atomic_long_read(&rb->fail);
+}
+EXPORT_SYMBOL(prb_getfail);
diff --git a/kernel/printk/ringbuffer.h b/kernel/printk/ringbuffer.h
new file mode 100644
index 000000000000..9fe54a09fbc2
--- /dev/null
+++ b/kernel/printk/ringbuffer.h
@@ -0,0 +1,288 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_PRINTK_RINGBUFFER_H
+#define _LINUX_PRINTK_RINGBUFFER_H
+
+#include <linux/atomic.h>
+#include "numlist.h"
+#include "dataring.h"
+
+/**
+ * struct prb_desc - A descriptor representing an entry in the ringbuffer.
+ *
+ * @id: The ID of the descriptor.
+ * The descriptor array index can be derived from the ID.
+ *
+ * @desc: The dataring descriptor where the entry data is stored.
+ *
+ * @list: The numbered list node within the committed list.
+ *
+ * Descriptors include information about where (and if) data for this entry is
+ * stored within the ringbuffer's dataring. A descriptor may or may not be
+ * within a numbered list of committed descriptors.
+ */
+struct prb_desc {
+ /* private */
+ atomic_long_t id;
+ struct dr_desc desc;
+ struct nl_node list;
+};
+
+/**
+ * struct printk_ringbuffer - The ringbuffer structure.
+ *
+ * @desc_count_bits: The power-of-2 maximum amount of descriptors allowed.
+ *
+ * @descs: An array of all descriptors to use.
+ *
+ * @desc_next_unused: An index of the next available (never before used)
+ * descriptor. This value only increases until the
+ * maximum is reached.
+ *
+ * @nl: A numbered list of committed entries.
+ *
+ * @dr: The dataring used to manage the entry data.
+ *
+ * @fail: A counter tracking how often writers fail to reserve.
+ * This only tracks failure due to not being able to get a
+ * descriptor. Failure due to not being able to reserve
+ * space in the dataring is not counted because readers
+ * will notice a lost sequence number in that case.
+ */
+struct printk_ringbuffer {
+ /* private */
+ unsigned int desc_count_bits;
+ struct prb_desc *descs;
+ atomic_t desc_next_unused;
+
+ struct numlist nl;
+
+ struct dataring dr;
+
+ atomic_long_t fail;
+};
+
+/**
+ * struct prb_reserved_entry - Used by writers to reserve/commit an entry.
+ *
+ * @rb: The printk ringbuffer used for reserve/commit.
+ *
+ * @desc: A pointer to the descriptor of the reserved entry.
+ *
+ * @irqflags: Local IRQs are disabled during the reserve/commit window.
+ *
+ * A writer provides this structure when reserving and committing data. The
+ * values of all the members are set on reserve and are only valid until
+ * commit has been called.
+ */
+struct prb_reserved_entry {
+ /* private */
+ struct printk_ringbuffer *rb;
+ struct prb_desc *desc;
+ unsigned long irqflags;
+};
+
+/**
+ * struct prb_entry - Used by readers to read a ringbuffer entry.
+ *
+ * @seq: The sequence number of the entry.
+ *
+ * @buffer: A pointer to a reader-provided buffer.
+ * When reading an entry, the data is copied to this buffer.
+ *
+ * @buffer_size: The size of the reader-provided buffer.
+ *
+ * A reader initializes and provides this structure when traversing/reading
+ * the entries of the ringbuffer.
+ */
+struct prb_entry {
+ /* public */
+ unsigned long seq;
+ char *buffer;
+ int buffer_size;
+};
+
+/**
+ * struct prb_iterator - Used by readers to traverse the committed list.
+ *
+ * @rb: The printk ringbuffer being traversed.
+ *
+ * @e: A pointer to a reader-provided entry structure.
+ *
+ * @last_id: The ID of the last read entry.
+ *
+ * @last_seq: The sequence number of the last read entry.
+ *
+ * @next_id: The ID of the next (unread) entry.
+ *
+ * An iterator tracks the current position of a reader within the ringbuffer.
+ * Readers can notice if they have missed an entry by a jump in the sequence
+ * number.
+ */
+struct prb_iterator {
+ /* private */
+ struct printk_ringbuffer *rb;
+ struct prb_entry *e;
+
+ unsigned long last_id;
+ unsigned long last_seq;
+ unsigned long next_id;
+};
+
+/* writer interface */
+char *prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
+ unsigned int size);
+void prb_commit(struct prb_reserved_entry *e);
+
+/* reader interface */
+void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb,
+ struct prb_entry *e);
+int prb_iter_next_valid_entry(struct prb_iterator *iter);
+void prb_iter_sync(struct prb_iterator *dest, struct prb_iterator *src);
+bool prb_iter_peek_next_entry(struct prb_iterator *iter);
+
+/* utility functions */
+unsigned long prb_getfail(struct printk_ringbuffer *rb);
+
+/* prototypes for callbacks used by numlist and dataring, respectively */
+struct nl_node *prb_desc_node(unsigned long id, void *arg);
+bool prb_desc_busy(unsigned long id, void *arg);
+struct dr_desc *prb_getdesc(unsigned long id, void *arg);
+
+/**
+ * DECLARE_PRINTKRB() - Declare a printk ringbuffer.
+ *
+ * @name: The name for the ringbuffer structure variable.
+ *
+ * @avgdatabits: The average size of data as a power-of-2.
+ * If this value is greater than the actual average data size,
+ * there will not be enough descriptors available. This will
+ * lead to situations where data is invalidated in order to free
+ * up descriptors, rather than because the data ringbuffer is
+ * full. Generally, values that are less than the actual average
+ * data size are preferred.
+ *
+ * @descbits: The power-of-2 maximum amount of descriptors allowed.
+ *
+ * The size of the data array will be the average data size multiplied by the
+ * maximum amount of descriptors.
+ *
+ * As per numlist requirement of always having at least one node in the list,
+ * the ringbuffer structures are initialized such that:
+ *
+ * * the numlist head and tail point to descriptor 0
+ * * descriptor 0 has an invalid data block and is the terminating node
+ * * descriptor 1 will be the next descriptor
+ */
+#define DECLARE_PRINTKRB(name, avgdatabits, descbits) \
+char _##name##_data[(1 << ((avgdatabits) + (descbits))) + \
+ sizeof(long)] \
+ __aligned(__alignof__(long)); \
+struct prb_desc _##name##_descs[1 << (descbits)]; \
+struct printk_ringbuffer name = { \
+ .desc_count_bits = descbits, \
+ .descs = &_##name##_descs[0], \
+ .desc_next_unused = ATOMIC_INIT(1), \
+ .nl = { \
+ .head_id = ATOMIC_LONG_INIT(0), \
+ .tail_id = ATOMIC_LONG_INIT(0), \
+ .node = prb_desc_node, \
+ .node_arg = &name, \
+ .busy = prb_desc_busy, \
+ .busy_arg = &name, \
+ }, \
+ .dr = { \
+ .size_bits = (avgdatabits) + (descbits), \
+ .data = &_##name##_data[0], \
+ .head_lpos = ATOMIC_LONG_INIT(-111 * \
+ sizeof(long)), \
+ .tail_lpos = ATOMIC_LONG_INIT(-111 * \
+ sizeof(long)), \
+ .getdesc = prb_getdesc, \
+ .getdesc_arg = &name, \
+ }, \
+ .fail = ATOMIC_LONG_INIT(0), \
+}
+
+/**
+ * DECLARE_PRINTKRB_ENTRY() - Declare an entry structure.
+ *
+ * @name: The name for the entry structure variable.
+ *
+ * @size: The size of the associated reader buffer (also declared).
+ *
+ * This macro is particularly useful for static entry structures that should
+ * be immediately available and initialized. It is an alternative to the
+ * reader manually creating a buffer and setting the buffer and
+ * buffer_size fields of the structure.
+ *
+ * Note that this macro will declare the buffer as well. This could be a
+ * problem if this is used with a large buffer size within a stack frame.
+ */
+#define DECLARE_PRINTKRB_ENTRY(name, size) \
+char _##name##_entry_buf[size]; \
+struct prb_entry name = { \
+ .seq = 0, \
+ .buffer = &_##name##_entry_buf[0], \
+ .buffer_size = size, \
+}
+
+/**
+ * DECLARE_PRINTKRB_ITER() - Declare an iterator for readers.
+ *
+ * @name: The name for the iterator structure variable.
+ *
+ * @rbaddr: A pointer to a printk ringbuffer.
+ *
+ * @entryaddr: A pointer to an entry structure.
+ *
+ * This macro is particularly useful for static iterators that should be
+ * immediately available and initialized. It is an alternative to
+ * manually initializing an iterator with prb_iter_init().
+ */
+#define DECLARE_PRINTKRB_ITER(name, rbaddr, entryaddr) \
+struct prb_iterator name = { \
+ .rb = rbaddr, \
+ .e = entryaddr, \
+ .last_id = 0, \
+ .last_seq = 0, \
+ .next_id = 0, \
+}
+
+/**
+ * prb_for_each_entry() - Iterate all entries of a ringbuffer.
+ *
+ * @i: A pointer to an iterator.
+ *
+ * @r: The printk ringbuffer to iterate.
+ *
+ * @e: An entry structure to use during iteration.
+ *
+ * @l: An integer used to identify when the final entry is traversed.
+ *
+ * This macro initializes the iterator and traverses through all available
+ * ringbuffer entries.
+ *
+ * See prb_for_each_entry_continue() if you want to continue traversing using
+ * an iterator that has already begun traversal.
+ */
+#define prb_for_each_entry(i, r, e, l) \
+ for (prb_iter_init(i, r, e); (l = prb_iter_next_valid_entry(i)) != 0;)
+
+/**
+ * prb_for_each_entry_continue() - Continue iterating entries of a ringbuffer.
+ *
+ * @i: A pointer to an iterator.
+ *
+ * @l: An integer used to identify when the final entry is traversed.
+ *
+ * This macro expects the iterator to be initialized. It does not reset the
+ * iterator. If the iterator has already been used for some traversal, this
+ * macro will continue where the iterator left off.
+ *
+ * See prb_for_each_entry() if you want to iterate from the beginning.
+ */
+#define prb_for_each_entry_continue(i, l) \
+ for (; (l = prb_iter_next_valid_entry(i)) != 0;)
+
+#endif /*_LINUX_PRINTK_RINGBUFFER_H */
--
2.20.1
This module does some heavy write stress testing on the ringbuffer
with a reader that is checking for integrity.
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/Makefile | 2 +
kernel/printk/test_prb.c | 256 +++++++++++++++++++++++++++++++++++++++
2 files changed, 258 insertions(+)
create mode 100644 kernel/printk/test_prb.c
diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
index 567999aa93af..24365ecee348 100644
--- a/kernel/printk/Makefile
+++ b/kernel/printk/Makefile
@@ -5,3 +5,5 @@ obj-$(CONFIG_PRINTK) += ringbuffer.o
obj-$(CONFIG_PRINTK) += numlist.o
obj-$(CONFIG_PRINTK) += dataring.o
obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o
+
+obj-m += test_prb.o
diff --git a/kernel/printk/test_prb.c b/kernel/printk/test_prb.c
new file mode 100644
index 000000000000..1ecb4fcbf823
--- /dev/null
+++ b/kernel/printk/test_prb.c
@@ -0,0 +1,256 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/delay.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include "ringbuffer.h"
+
+/*
+ * This is a test module that starts "num_online_cpus() - 1" writer threads
+ * and 1 reader thread. The writer threads each write strings of varying
+ * length. They do this as fast as they can.
+ *
+ * The reader thread reads as fast as it can and performs sanity checks on
+ * the data.
+ *
+ * Because the threads are running in such tight loops, they will call
+ * schedule() from time to time so the system stays alive.
+ *
+ * If either the writers or the reader encounter an error, the test is
+ * aborted. Test results are recorded to the ftrace buffers, with some
+ * additional information also provided via printk. The test can be aborted
+ * manually by removing the module. (Ideally the test should never abort on
+ * its own.)
+ */
+
+struct rbdata {
+ int len;
+ char text[0];
+};
+
+static char *test_running;
+static int halt_test;
+
+static void dump_rb(struct printk_ringbuffer *rb)
+{
+ DECLARE_PRINTKRB_ENTRY(entry, 160);
+ DECLARE_PRINTKRB_ITER(iter, rb, &entry);
+ unsigned long last_seq = 0;
+ struct rbdata *dat;
+ char buf[160];
+ int len;
+
+ trace_printk("BEGIN full dump\n");
+
+ prb_for_each_entry_continue(&iter, len) {
+ if (entry.seq - last_seq != 1) {
+ trace_printk("LOST %lu\n",
+ entry.seq - (last_seq + 1));
+ }
+ last_seq = entry.seq;
+
+ dat = (struct rbdata *)&entry.buffer[0];
+
+ snprintf(buf, sizeof(buf), "%s", dat->text);
+ buf[sizeof(buf) - 1] = 0;
+ trace_printk("seq=%lu len=%d textlen=%d dataval=%s\n",
+ entry.seq, len, dat->len, buf);
+ }
+
+ trace_printk("END full dump\n");
+}
+
+DECLARE_PRINTKRB(test_rb, 7, 5);
+
+static int prbtest_writer(void *data)
+{
+ unsigned long num = (unsigned long)data;
+ struct prb_reserved_entry e;
+ char id = 'A' + num;
+ struct rbdata *dat;
+ int count = 0;
+ int len;
+
+ pr_err("prbtest: start thread %lu (writer)\n", num);
+
+ for (;;) {
+ len = sizeof(struct rbdata) + (prandom_u32() & 0x7f) + 2;
+
+ dat = (struct rbdata *)prb_reserve(&e, &test_rb, len);
+ if (!IS_ERR(dat)) {
+ len -= sizeof(struct rbdata) + 1;
+ memset(&dat->text[0], id, len);
+ dat->text[len] = 0;
+ dat->len = len;
+ prb_commit(&e);
+ } else {
+ WRITE_ONCE(halt_test, 1);
+ trace_printk("writer%lu (%c) reserve failed (%ld)\n",
+ num, id, PTR_ERR(dat));
+ }
+
+ if ((count++ & 0x3fff) == 0)
+ schedule();
+
+ if (READ_ONCE(halt_test) == 1)
+ break;
+ }
+
+ pr_err("prbtest: end thread %lu (writer)\n", num);
+
+ test_running[num] = 0;
+
+ return 0;
+}
+
+static int prbtest_reader(void *data)
+{
+ unsigned long num = (unsigned long)data;
+ DECLARE_PRINTKRB_ENTRY(entry, 160);
+ DECLARE_PRINTKRB_ITER(iter, &test_rb, &entry);
+ unsigned long total_lost = 0;
+ unsigned long last_seq = 0;
+ unsigned long max_lost = 0;
+ unsigned long count = 0;
+ struct rbdata *dat;
+ int did_sched = 1;
+ int len;
+
+ pr_err("prbtest: start thread %lu (reader)\n", num);
+
+ for (;;) {
+ prb_for_each_entry_continue(&iter, len) {
+ if (entry.seq < last_seq) {
+ WRITE_ONCE(halt_test, 1);
+ trace_printk(
+ "reader%lu invalid seq %lu -> %lu\n",
+ num, last_seq, entry.seq);
+ goto out;
+ }
+
+ if (entry.seq - last_seq != 1 && !did_sched) {
+ total_lost += entry.seq - (last_seq + 1);
+ if (max_lost < entry.seq - (last_seq + 1))
+ max_lost = entry.seq - (last_seq + 1);
+ }
+ last_seq = entry.seq;
+ did_sched = 0;
+
+ dat = (struct rbdata *)&entry.buffer[0];
+
+ len = strnlen(dat->text, 160);
+ if (len != dat->len || len >= 160) {
+ WRITE_ONCE(halt_test, 1);
+ trace_printk(
+ "reader%lu invalid len for %lu (%d<->%d)\n",
+ num, entry.seq, len, dat->len);
+ goto out;
+ }
+ while (len) {
+ len--;
+ if (dat->text[len] != dat->text[0]) {
+ WRITE_ONCE(halt_test, 1);
+ trace_printk("reader%lu bad data\n",
+ num);
+ goto out;
+ }
+ }
+
+ if ((count++ & 0x3fff) == 0) {
+ did_sched = 1;
+ schedule();
+ }
+
+ if (READ_ONCE(halt_test) == 1)
+ goto out;
+ }
+ if (READ_ONCE(halt_test) == 1)
+ goto out;
+ }
+out:
+ pr_err(
+ "reader%lu: total_lost=%lu max_lost=%lu total_read=%lu seq=%lu\n",
+ num, total_lost, max_lost, count, entry.seq);
+ pr_err("prbtest: end thread %lu (reader)\n", num);
+
+ test_running[num] = 0;
+
+ return 0;
+}
+
+static int module_test_running;
+
+static int start_test(void *arg)
+{
+ struct task_struct *thread;
+ unsigned long i;
+ int num_cpus;
+
+ num_cpus = num_online_cpus();
+ test_running = kzalloc(num_cpus, GFP_KERNEL);
+ if (!test_running)
+ return -ENOMEM;
+
+ module_test_running = 1;
+
+ pr_err("prbtest: starting test\n");
+
+ for (i = 0; i < num_cpus; i++) {
+ test_running[i] = 1;
+ if (i < num_cpus - 1) {
+ thread = kthread_run(prbtest_writer, (void *)i,
+ "prbtest writer");
+ } else {
+ thread = kthread_run(prbtest_reader, (void *)i,
+ "prbtest reader");
+ }
+ if (IS_ERR(thread)) {
+ pr_err("prbtest: unable to create thread %lu\n", i);
+ test_running[i] = 0;
+ }
+ }
+
+ for (;;) {
+ for (i = 0; i < num_cpus; i++) {
+ if (test_running[i] == 1)
+ break;
+ }
+ if (i == num_cpus)
+ break;
+ msleep(1000);
+ }
+
+ pr_err("prbtest: completed test\n");
+
+ dump_rb(&test_rb);
+
+ module_test_running = 0;
+
+ kfree(test_running);
+
+ return 0;
+}
+
+static int prbtest_init(void)
+{
+ kthread_run(start_test, NULL, "prbtest");
+ return 0;
+}
+
+static void prbtest_exit(void)
+{
+ WRITE_ONCE(halt_test, 1);
+
+ while (module_test_running)
+ msleep(1000);
+}
+
+module_init(prbtest_init);
+module_exit(prbtest_exit);
+
+MODULE_AUTHOR("John Ogness <[email protected]>");
+MODULE_DESCRIPTION("printk ringbuffer test");
+MODULE_LICENSE("GPL v2");
--
2.20.1
This is a major change because the API (and underlying workings)
of the new ringbuffer are completely different than the previous
ringbuffer. Since there are several components of the printk
infrastructure that use the ringbuffer API (console, /dev/kmsg,
syslog, kmsg_dump), there are quite a few changes throughout the
printk implementation.
This is also a conservative change because it continues to use the
logbuf_lock raw spinlock even though the new ringbuffer is lockless.
The externally visible changes are:
1. The exported vmcore info has changed:
- VMCOREINFO_SYMBOL(log_buf);
- VMCOREINFO_SYMBOL(log_buf_len);
- VMCOREINFO_SYMBOL(log_first_idx);
- VMCOREINFO_SYMBOL(clear_idx);
- VMCOREINFO_SYMBOL(log_next_idx);
+ VMCOREINFO_SYMBOL(printk_rb_static);
+ VMCOREINFO_SYMBOL(printk_rb_dynamic);
2. For the CONFIG_PPC_POWERNV powerpc platform, kernel log buffer
registration is no longer available because there is no longer
a single contigous block of memory to represent all of the
ringbuffer.
Signed-off-by: John Ogness <[email protected]>
---
arch/powerpc/platforms/powernv/opal.c | 22 +-
include/linux/kmsg_dump.h | 6 +-
include/linux/printk.h | 12 -
kernel/printk/printk.c | 745 ++++++++++++++------------
kernel/printk/ringbuffer.h | 24 +
5 files changed, 415 insertions(+), 394 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index aba443be7daa..8c4b894b6663 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -806,30 +806,10 @@ static void opal_export_attrs(void)
static void __init opal_dump_region_init(void)
{
- void *addr;
- uint64_t size;
- int rc;
-
if (!opal_check_token(OPAL_REGISTER_DUMP_REGION))
return;
- /* Register kernel log buffer */
- addr = log_buf_addr_get();
- if (addr == NULL)
- return;
-
- size = log_buf_len_get();
- if (size == 0)
- return;
-
- rc = opal_register_dump_region(OPAL_DUMP_REGION_LOG_BUF,
- __pa(addr), size);
- /* Don't warn if this is just an older OPAL that doesn't
- * know about that call
- */
- if (rc && rc != OPAL_UNSUPPORTED)
- pr_warn("DUMP: Failed to register kernel log buffer. "
- "rc = %d\n", rc);
+ pr_warn("DUMP: This kernel does not support kernel log buffer registration.\n");
}
static void opal_pdev_init(const char *compatible)
diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
index 2e7a1e032c71..d9b721347742 100644
--- a/include/linux/kmsg_dump.h
+++ b/include/linux/kmsg_dump.h
@@ -46,10 +46,8 @@ struct kmsg_dumper {
bool registered;
/* private state of the kmsg iterator */
- u32 cur_idx;
- u32 next_idx;
- u64 cur_seq;
- u64 next_seq;
+ u64 last_seq;
+ u64 until_seq;
};
#ifdef CONFIG_PRINTK
diff --git a/include/linux/printk.h b/include/linux/printk.h
index cefd374c47b1..fd3007659cfb 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -194,8 +194,6 @@ devkmsg_sysctl_set_loglvl(struct ctl_table *table, int write, void __user *buf,
extern void wake_up_klogd(void);
-char *log_buf_addr_get(void);
-u32 log_buf_len_get(void);
void log_buf_vmcoreinfo_setup(void);
void __init setup_log_buf(int early);
__printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...);
@@ -235,16 +233,6 @@ static inline void wake_up_klogd(void)
{
}
-static inline char *log_buf_addr_get(void)
-{
- return NULL;
-}
-
-static inline u32 log_buf_len_get(void)
-{
- return 0;
-}
-
static inline void log_buf_vmcoreinfo_setup(void)
{
}
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 1888f6a3b694..1a50e0c43775 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -56,6 +56,7 @@
#define CREATE_TRACE_POINTS
#include <trace/events/printk.h>
+#include "ringbuffer.h"
#include "console_cmdline.h"
#include "braille.h"
#include "internal.h"
@@ -409,28 +410,38 @@ DEFINE_RAW_SPINLOCK(logbuf_lock);
#ifdef CONFIG_PRINTK
DECLARE_WAIT_QUEUE_HEAD(log_wait);
-/* the next printk record to read by syslog(READ) or /proc/kmsg */
-static u64 syslog_seq;
-static u32 syslog_idx;
-static size_t syslog_partial;
-static bool syslog_time;
-/* index and sequence number of the first record stored in the buffer */
-static u64 log_first_seq;
-static u32 log_first_idx;
+/*
+ * Define the average message size. This only affects the number of
+ * descriptors that will be available. Underestimating is better than
+ * overestimating (too many available descriptors is better than not enough).
+ */
+#define PRB_AVGBITS 6
+
+DECLARE_PRINTKRB(printk_rb_static, PRB_AVGBITS,
+ CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS, &log_wait);
+
+static struct printk_ringbuffer printk_rb_dynamic;
-/* index and sequence number of the next record to store in the buffer */
-static u64 log_next_seq;
-static u32 log_next_idx;
+static struct printk_ringbuffer *prb = &printk_rb_static;
-/* the next printk record to write to the console */
-static u64 console_seq;
-static u32 console_idx;
+/* the last printk record read by syslog(READ) or /proc/kmsg */
+static u64 syslog_last_seq;
+DECLARE_PRINTKRB_ENTRY(syslog_entry,
+ sizeof(struct printk_log) + CONSOLE_EXT_LOG_MAX);
+DECLARE_PRINTKRB_ITER(syslog_iter, &printk_rb_static, &syslog_entry);
+static size_t syslog_partial;
+static bool syslog_time;
+
+/* the last printk record written to the console */
+static u64 console_last_seq;
+DECLARE_PRINTKRB_ENTRY(console_entry,
+ sizeof(struct printk_log) + CONSOLE_EXT_LOG_MAX);
+DECLARE_PRINTKRB_ITER(console_iter, &printk_rb_static, &console_entry);
static u64 exclusive_console_stop_seq;
-/* the next printk record to read after the last 'clear' command */
-static u64 clear_seq;
-static u32 clear_idx;
+/* the last printk record read before the last 'clear' command */
+static u64 clear_last_seq;
#ifdef CONFIG_PRINTK_CALLER
#define PREFIX_MAX 48
@@ -446,22 +457,8 @@ static u32 clear_idx;
#define LOG_ALIGN __alignof__(struct printk_log)
#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
#define LOG_BUF_LEN_MAX (u32)(1 << 31)
-static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
-static char *log_buf = __log_buf;
static u32 log_buf_len = __LOG_BUF_LEN;
-/* Return log buffer address */
-char *log_buf_addr_get(void)
-{
- return log_buf;
-}
-
-/* Return log buffer size */
-u32 log_buf_len_get(void)
-{
- return log_buf_len;
-}
-
/* human readable text of the record */
static char *log_text(const struct printk_log *msg)
{
@@ -474,92 +471,12 @@ static char *log_dict(const struct printk_log *msg)
return (char *)msg + sizeof(struct printk_log) + msg->text_len;
}
-/* get record by index; idx must point to valid msg */
-static struct printk_log *log_from_idx(u32 idx)
-{
- struct printk_log *msg = (struct printk_log *)(log_buf + idx);
-
- /*
- * A length == 0 record is the end of buffer marker. Wrap around and
- * read the message at the start of the buffer.
- */
- if (!msg->len)
- return (struct printk_log *)log_buf;
- return msg;
-}
-
-/* get next record; idx must point to valid msg */
-static u32 log_next(u32 idx)
-{
- struct printk_log *msg = (struct printk_log *)(log_buf + idx);
-
- /* length == 0 indicates the end of the buffer; wrap */
- /*
- * A length == 0 record is the end of buffer marker. Wrap around and
- * read the message at the start of the buffer as *this* one, and
- * return the one after that.
- */
- if (!msg->len) {
- msg = (struct printk_log *)log_buf;
- return msg->len;
- }
- return idx + msg->len;
-}
-
-/*
- * Check whether there is enough free space for the given message.
- *
- * The same values of first_idx and next_idx mean that the buffer
- * is either empty or full.
- *
- * If the buffer is empty, we must respect the position of the indexes.
- * They cannot be reset to the beginning of the buffer.
- */
-static int logbuf_has_space(u32 msg_size, bool empty)
-{
- u32 free;
-
- if (log_next_idx > log_first_idx || empty)
- free = max(log_buf_len - log_next_idx, log_first_idx);
- else
- free = log_first_idx - log_next_idx;
-
- /*
- * We need space also for an empty header that signalizes wrapping
- * of the buffer.
- */
- return free >= msg_size + sizeof(struct printk_log);
-}
-
-static int log_make_free_space(u32 msg_size)
-{
- while (log_first_seq < log_next_seq &&
- !logbuf_has_space(msg_size, false)) {
- /* drop old messages until we have enough contiguous space */
- log_first_idx = log_next(log_first_idx);
- log_first_seq++;
- }
-
- if (clear_seq < log_first_seq) {
- clear_seq = log_first_seq;
- clear_idx = log_first_idx;
- }
-
- /* sequence numbers are equal, so the log buffer is empty */
- if (logbuf_has_space(msg_size, log_first_seq == log_next_seq))
- return 0;
-
- return -ENOMEM;
-}
-
-/* compute the message size including the padding bytes */
-static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
+/* compute the message size */
+static u32 msg_used_size(u16 text_len, u16 dict_len)
{
u32 size;
size = sizeof(struct printk_log) + text_len + dict_len;
- *pad_len = (-size) & (LOG_ALIGN - 1);
- size += *pad_len;
return size;
}
@@ -573,21 +490,39 @@ static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
static const char trunc_msg[] = "<truncated>";
static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len,
- u16 *dict_len, u32 *pad_len)
+ u16 *dict_len)
{
/*
* The message should not take the whole buffer. Otherwise, it might
* get removed too soon.
*/
u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART;
+ unsigned long max_available;
+
+ /* determine available text space in the ringbuffer */
+ max_available = prb_unused(prb);
+ if (max_available <= sizeof(struct printk_log))
+ return 0;
+ max_available -= sizeof(struct printk_log);
+
+ if (max_available < max_text_len)
+ max_text_len = max_available;
+
if (*text_len > max_text_len)
*text_len = max_text_len;
- /* enable the warning message */
+
+ /* enable the warning message (if there is room) */
*trunc_msg_len = strlen(trunc_msg);
+ if (*text_len >= *trunc_msg_len)
+ *text_len -= *trunc_msg_len;
+ else
+ *trunc_msg_len = 0;
+
/* disable the "dict" completely */
*dict_len = 0;
+
/* compute the size again, count also the warning message */
- return msg_used_size(*text_len + *trunc_msg_len, 0, pad_len);
+ return msg_used_size(*text_len + *trunc_msg_len, 0);
}
/* insert record into the buffer, discard old ones, update heads */
@@ -596,34 +531,26 @@ static int log_store(u32 caller_id, int facility, int level,
const char *dict, u16 dict_len,
const char *text, u16 text_len)
{
+ struct prb_reserved_entry res_entry;
struct printk_log *msg;
- u32 size, pad_len;
u16 trunc_msg_len = 0;
+ char *rbuf;
+ u32 size;
- /* number of '\0' padding bytes to next message */
- size = msg_used_size(text_len, dict_len, &pad_len);
+ size = msg_used_size(text_len, dict_len);
- if (log_make_free_space(size)) {
+ rbuf = prb_reserve(&res_entry, prb, size);
+ if (IS_ERR(rbuf)) {
/* truncate the message if it is too long for empty buffer */
- size = truncate_msg(&text_len, &trunc_msg_len,
- &dict_len, &pad_len);
+ size = truncate_msg(&text_len, &trunc_msg_len, &dict_len);
/* survive when the log buffer is too small for trunc_msg */
- if (log_make_free_space(size))
+ rbuf = prb_reserve(&res_entry, prb, size);
+ if (IS_ERR(rbuf))
return 0;
}
- if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
- /*
- * This message + an additional empty header does not fit
- * at the end of the buffer. Add an empty header with len == 0
- * to signify a wrap around.
- */
- memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
- log_next_idx = 0;
- }
-
/* fill message */
- msg = (struct printk_log *)(log_buf + log_next_idx);
+ msg = (struct printk_log *)rbuf;
memcpy(log_text(msg), text, text_len);
msg->text_len = text_len;
if (trunc_msg_len) {
@@ -642,14 +569,13 @@ static int log_store(u32 caller_id, int facility, int level,
#ifdef CONFIG_PRINTK_CALLER
msg->caller_id = caller_id;
#endif
- memset(log_dict(msg) + dict_len, 0, pad_len);
msg->len = size;
/* insert message */
- log_next_idx += msg->len;
- log_next_seq++;
+ prb_commit(&res_entry);
- return msg->text_len;
+ /* msg is no longer valid, return the local copy */
+ return text_len;
}
int dmesg_restrict = IS_ENABLED(CONFIG_SECURITY_DMESG_RESTRICT);
@@ -770,13 +696,18 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
return p - buf;
}
+#define PRINTK_RECORD_MAX (sizeof(struct printk_log) + \
+ CONSOLE_EXT_LOG_MAX + LOG_LINE_MAX + PREFIX_MAX)
+
/* /dev/kmsg - userspace message inject/listen interface */
struct devkmsg_user {
- u64 seq;
- u32 idx;
+ u64 last_seq;
+ struct prb_iterator iter;
struct ratelimit_state rs;
struct mutex lock;
char buf[CONSOLE_EXT_LOG_MAX];
+ struct prb_entry entry;
+ char msgbuf[PRINTK_RECORD_MAX];
};
static __printf(3, 4) __cold
@@ -859,6 +790,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
struct devkmsg_user *user = file->private_data;
+ struct prb_iterator backup_iter;
struct printk_log *msg;
size_t len;
ssize_t ret;
@@ -871,7 +803,11 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
return ret;
logbuf_lock_irq();
- while (user->seq == log_next_seq) {
+
+ /* make a backup copy in case there is a problem */
+ prb_iter_copy(&backup_iter, &user->iter);
+
+ if (prb_iter_next_valid_entry(&user->iter) == 0) {
if (file->f_flags & O_NONBLOCK) {
ret = -EAGAIN;
logbuf_unlock_irq();
@@ -879,43 +815,53 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
}
logbuf_unlock_irq();
- ret = wait_event_interruptible(log_wait,
- user->seq != log_next_seq);
- if (ret)
+ ret = prb_iter_wait_next_valid_entry(&user->iter);
+ if (ret < 0)
goto out;
logbuf_lock_irq();
}
- if (user->seq < log_first_seq) {
- /* our last seen message is gone, return error and reset */
- user->idx = log_first_idx;
- user->seq = log_first_seq;
- ret = -EPIPE;
- logbuf_unlock_irq();
- goto out;
+ if (user->entry.seq - user->last_seq != 1) {
+ DECLARE_PRINTKRB_SEQENTRY(e);
+ DECLARE_PRINTKRB_ITER(i, prb, &e);
+ u64 last_seq;
+
+ prb_iter_peek_next_entry(&i, &last_seq);
+ if (last_seq > user->last_seq) {
+ /* a record was missed, return error and reset */
+ prb_iter_sync(&user->iter, &i);
+ user->last_seq = last_seq;
+ ret = -EPIPE;
+ logbuf_unlock_irq();
+ goto out;
+ }
}
- msg = log_from_idx(user->idx);
+ user->last_seq = user->entry.seq;
+
+ msg = (struct printk_log *)&user->entry.buffer[0];
len = msg_print_ext_header(user->buf, sizeof(user->buf),
- msg, user->seq);
+ msg, user->last_seq);
len += msg_print_ext_body(user->buf + len, sizeof(user->buf) - len,
log_dict(msg), msg->dict_len,
log_text(msg), msg->text_len);
- user->idx = log_next(user->idx);
- user->seq++;
logbuf_unlock_irq();
if (len > count) {
ret = -EINVAL;
- goto out;
+ goto restore_out;
}
if (copy_to_user(buf, user->buf, len)) {
ret = -EFAULT;
- goto out;
+ goto restore_out;
}
ret = len;
+ goto out;
+restore_out:
+ prb_iter_copy(&user->iter, &backup_iter);
+ user->last_seq = user->entry.seq - 1;
out:
mutex_unlock(&user->lock);
return ret;
@@ -935,8 +881,7 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
switch (whence) {
case SEEK_SET:
/* the first record */
- user->idx = log_first_idx;
- user->seq = log_first_seq;
+ user->last_seq = prb_iter_seek(&user->iter, 0);
break;
case SEEK_DATA:
/*
@@ -944,13 +889,11 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
* like issued by 'dmesg -c'. Reading /dev/kmsg itself
* changes no global state, and does not clear anything.
*/
- user->idx = clear_idx;
- user->seq = clear_seq;
+ user->last_seq = prb_iter_seek(&user->iter, clear_last_seq);
break;
case SEEK_END:
/* after the last record */
- user->idx = log_next_idx;
- user->seq = log_next_seq;
+ user->last_seq = prb_iter_seek(&user->iter, -1);
break;
default:
ret = -EINVAL;
@@ -963,19 +906,39 @@ static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
{
struct devkmsg_user *user = file->private_data;
__poll_t ret = 0;
+ u64 last_seq;
if (!user)
return EPOLLERR|EPOLLNVAL;
- poll_wait(file, &log_wait, wait);
+ poll_wait(file, prb_wait_queue(prb), wait);
logbuf_lock_irq();
- if (user->seq < log_next_seq) {
- /* return error when data has vanished underneath us */
- if (user->seq < log_first_seq)
- ret = EPOLLIN|EPOLLRDNORM|EPOLLERR|EPOLLPRI;
- else
- ret = EPOLLIN|EPOLLRDNORM;
+ if (prb_iter_peek_next_entry(&user->iter, &last_seq)) {
+ ret = EPOLLIN|EPOLLRDNORM;
+ if (last_seq - user->last_seq != 1) {
+ DECLARE_PRINTKRB_SEQENTRY(e);
+ DECLARE_PRINTKRB_ITER(i, prb, &e);
+ u64 last_seq;
+
+ /*
+ * The sequence number has jumped. This might mean
+ * that the ringbuffer has overtaken the reader,
+ * which would mean that the sequence number previous
+ * the first entry will now be later than the last
+ * entry the reader has seen.
+ *
+ * If instead the sequence number jump is due to
+ * iterating over invalid entries, there is no error.
+ */
+
+ /* get the sequence number previous the first entry */
+ prb_iter_peek_next_entry(&i, &last_seq);
+
+ /* return error when data has vanished underneath us */
+ if (last_seq > user->last_seq)
+ ret |= EPOLLERR|EPOLLPRI;
+ }
}
logbuf_unlock_irq();
@@ -1008,8 +971,10 @@ static int devkmsg_open(struct inode *inode, struct file *file)
mutex_init(&user->lock);
logbuf_lock_irq();
- user->idx = log_first_idx;
- user->seq = log_first_seq;
+ user->entry.buffer = &user->msgbuf[0];
+ user->entry.buffer_size = sizeof(user->msgbuf);
+ prb_iter_init(&user->iter, prb, &user->entry);
+ prb_iter_peek_next_entry(&user->iter, &user->last_seq);
logbuf_unlock_irq();
file->private_data = user;
@@ -1050,11 +1015,8 @@ const struct file_operations kmsg_fops = {
*/
void log_buf_vmcoreinfo_setup(void)
{
- VMCOREINFO_SYMBOL(log_buf);
- VMCOREINFO_SYMBOL(log_buf_len);
- VMCOREINFO_SYMBOL(log_first_idx);
- VMCOREINFO_SYMBOL(clear_idx);
- VMCOREINFO_SYMBOL(log_next_idx);
+ VMCOREINFO_SYMBOL(printk_rb_static);
+ VMCOREINFO_SYMBOL(printk_rb_dynamic);
/*
* Export struct printk_log size and field offsets. User space tools can
* parse it and detect any changes to structure down the line.
@@ -1136,13 +1098,36 @@ static void __init log_buf_add_cpu(void)
static inline void log_buf_add_cpu(void) {}
#endif /* CONFIG_SMP */
+static void __init add_to_rb(struct printk_ringbuffer *rb,
+ struct prb_entry *e)
+{
+ struct printk_log *msg = (struct printk_log *)&e->buffer[0];
+ struct prb_reserved_entry re;
+ int size;
+ char *b;
+
+ size = sizeof(*msg) + msg->text_len + msg->dict_len;
+
+ b = prb_reserve(&re, rb, size);
+ if (!IS_ERR(b)) {
+ memcpy(b, msg, size);
+ prb_commit(&re);
+ }
+}
+
+static char setup_buf[PRINTK_RECORD_MAX] __initdata;
+
void __init setup_log_buf(int early)
{
+ struct prb_desc *new_descs;
+ struct prb_iterator i;
unsigned long flags;
+ struct prb_entry e;
char *new_log_buf;
unsigned int free;
+ int l;
- if (log_buf != __log_buf)
+ if (prb != &printk_rb_static)
return;
if (!early && !new_log_buf_len)
@@ -1151,19 +1136,47 @@ void __init setup_log_buf(int early)
if (!new_log_buf_len)
return;
+ if (!is_power_of_2(new_log_buf_len))
+ return;
+
new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
if (unlikely(!new_log_buf)) {
- pr_err("log_buf_len: %lu bytes not available\n",
- new_log_buf_len);
+ pr_err("log_buf_len: %lu data bytes not available\n",
+ new_log_buf_len);
+ return;
+ }
+
+ new_descs = memblock_alloc((new_log_buf_len >> PRB_AVGBITS) *
+ sizeof(struct prb_desc), LOG_ALIGN);
+ if (unlikely(!new_descs)) {
+ pr_err("log_buf_len: %lu desc bytes not available\n",
+ new_log_buf_len >> PRB_AVGBITS);
+ memblock_free(__pa(new_log_buf), new_log_buf_len);
return;
}
+ e.buffer = &setup_buf[0];
+ e.buffer_size = sizeof(setup_buf);
+
logbuf_lock_irqsave(flags);
- log_buf_len = new_log_buf_len;
- log_buf = new_log_buf;
- new_log_buf_len = 0;
- free = __LOG_BUF_LEN - log_next_idx;
- memcpy(log_buf, __log_buf, __LOG_BUF_LEN);
+
+ prb_init(&printk_rb_dynamic, new_log_buf,
+ bits_per(new_log_buf_len) - 1, new_descs,
+ (bits_per(new_log_buf_len) - 1) - PRB_AVGBITS, &log_wait);
+
+ free = __LOG_BUF_LEN;
+ prb_for_each_entry(&i, &printk_rb_static, &e, l) {
+ add_to_rb(&printk_rb_dynamic, &e);
+ free -= l;
+ }
+
+ prb_iter_init(&syslog_iter, &printk_rb_dynamic, &syslog_entry);
+ prb_iter_init(&console_iter, &printk_rb_dynamic, &console_entry);
+
+ prb_iter_seek(&console_iter, e.seq);
+
+ prb = &printk_rb_dynamic;
+
logbuf_unlock_irqrestore(flags);
pr_info("log_buf_len: %u bytes\n", log_buf_len);
@@ -1340,29 +1353,43 @@ static size_t msg_print_text(const struct printk_log *msg, bool syslog,
static int syslog_print(char __user *buf, int size)
{
char *text;
+ struct prb_iterator iter;
struct printk_log *msg;
+ struct prb_entry e;
+ char *msgbuf;
int len = 0;
text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
if (!text)
return -ENOMEM;
+ msgbuf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
+ if (!msgbuf) {
+ kfree(text);
+ return -ENOMEM;
+ }
+
+ e.buffer = msgbuf;
+ e.buffer_size = PRINTK_RECORD_MAX;
+ prb_iter_init(&iter, prb, &e);
+ msg = (struct printk_log *)msgbuf;
while (size > 0) {
size_t n;
size_t skip;
logbuf_lock_irq();
- if (syslog_seq < log_first_seq) {
- /* messages are gone, move to first one */
- syslog_seq = log_first_seq;
- syslog_idx = log_first_idx;
- syslog_partial = 0;
- }
- if (syslog_seq == log_next_seq) {
+ prb_iter_sync(&iter, &syslog_iter);
+ if (prb_iter_next_valid_entry(&iter) == 0) {
logbuf_unlock_irq();
break;
}
+ if (e.seq - syslog_last_seq != 1) {
+ /* messages are gone, move to first one */
+ syslog_last_seq = e.seq - 1;
+ syslog_partial = 0;
+ }
+
/*
* To keep reading/counting partial line consistent,
* use printk_time value as of the beginning of a line.
@@ -1371,16 +1398,15 @@ static int syslog_print(char __user *buf, int size)
syslog_time = printk_time;
skip = syslog_partial;
- msg = log_from_idx(syslog_idx);
n = msg_print_text(msg, true, syslog_time, text,
LOG_LINE_MAX + PREFIX_MAX);
if (n - syslog_partial <= size) {
/* message fits into buffer, move forward */
- syslog_idx = log_next(syslog_idx);
- syslog_seq++;
+ prb_iter_sync(&syslog_iter, &iter);
+ syslog_last_seq++;
n -= syslog_partial;
syslog_partial = 0;
- } else if (!len){
+ } else if (!len) {
/* partial read(), remember position */
n = size;
syslog_partial += n;
@@ -1402,22 +1428,73 @@ static int syslog_print(char __user *buf, int size)
buf += n;
}
+ kfree(msgbuf);
kfree(text);
return len;
}
+/**
+ * count_remaining() - Count the text bytes in following entries.
+ *
+ * @iter: The iterator to use for counting.
+ *
+ * @until_seq: A sequence number to stop counting at.
+ * The entry with this sequence number is not counted.
+ *
+ * Note that although this function will not modify @iter, it does make
+ * use of the prb_entry of @iter.
+ *
+ * Return: The number of bytes of text counted.
+ */
+static int count_remaining(struct prb_iterator *iter, const u64 until_seq)
+{
+ bool time = syslog_partial ? syslog_time : printk_time;
+ struct printk_log *msg;
+ struct prb_iterator i;
+ struct prb_entry *e;
+ int len = 0;
+
+ prb_iter_copy(&i, iter);
+ e = prb_iter_entry(&i);
+ msg = (struct printk_log *)&e->buffer[0];
+
+ for (;;) {
+ if (prb_iter_next_valid_entry(&i) == 0)
+ break;
+
+ if (e->seq >= until_seq)
+ break;
+
+ len += msg_print_text(msg, true, time, NULL, 0);
+ time = printk_time;
+ }
+
+ return len;
+}
+
static int syslog_print_all(char __user *buf, int size, bool clear)
{
+ struct prb_iterator iter;
+ struct printk_log *msg;
+ struct prb_entry e;
+ char *msgbuf;
+ int textlen;
char *text;
int len = 0;
- u64 next_seq;
- u64 seq;
- u32 idx;
bool time;
text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
if (!text)
return -ENOMEM;
+ msgbuf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
+ if (!msgbuf) {
+ kfree(text);
+ return -ENOMEM;
+ }
+
+ e.buffer = msgbuf;
+ e.buffer_size = PRINTK_RECORD_MAX;
+ msg = (struct printk_log *)msgbuf;
time = printk_time;
logbuf_lock_irq();
@@ -1425,73 +1502,65 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
* Find first record that fits, including all following records,
* into the user-provided buffer for this dump.
*/
- seq = clear_seq;
- idx = clear_idx;
- while (seq < log_next_seq) {
- struct printk_log *msg = log_from_idx(idx);
-
- len += msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
- }
+ prb_iter_init(&iter, prb, &e);
+ prb_iter_seek(&iter, clear_last_seq);
+ len = count_remaining(&iter, -1);
- /* move first record forward until length fits into the buffer */
- seq = clear_seq;
- idx = clear_idx;
- while (len > size && seq < log_next_seq) {
- struct printk_log *msg = log_from_idx(idx);
+ /* move iterator forward until text fits into the buffer */
+ while (len > size) {
+ if (prb_iter_next_valid_entry(&iter) == 0)
+ break;
len -= msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
}
- /* last message fitting into this dump */
- next_seq = log_next_seq;
-
+ /* copy the rest of the messages into the buffer */
len = 0;
- while (len >= 0 && seq < next_seq) {
- struct printk_log *msg = log_from_idx(idx);
- int textlen = msg_print_text(msg, true, time, text,
- LOG_LINE_MAX + PREFIX_MAX);
+ for (;;) {
+ if (prb_iter_next_valid_entry(&iter) == 0)
+ break;
+
+ textlen = msg_print_text(msg, true, time, text,
+ LOG_LINE_MAX + PREFIX_MAX);
- idx = log_next(idx);
- seq++;
+ if (len + textlen > size)
+ break;
logbuf_unlock_irq();
- if (copy_to_user(buf + len, text, textlen))
+ if (copy_to_user(buf + len, text, textlen)) {
len = -EFAULT;
- else
- len += textlen;
+ logbuf_lock_irq();
+ break;
+ }
logbuf_lock_irq();
- if (seq < log_first_seq) {
- /* messages are gone, move to next one */
- seq = log_first_seq;
- idx = log_first_idx;
- }
- }
+ len += textlen;
- if (clear) {
- clear_seq = log_next_seq;
- clear_idx = log_next_idx;
+ if (clear)
+ clear_last_seq = e.seq;
}
logbuf_unlock_irq();
+ kfree(msgbuf);
kfree(text);
return len;
}
static void syslog_clear(void)
{
+ DECLARE_PRINTKRB_SEQENTRY(e);
+ DECLARE_PRINTKRB_ITER(i, prb, &e);
+
logbuf_lock_irq();
- clear_seq = log_next_seq;
- clear_idx = log_next_idx;
+ prb_iter_sync(&i, &syslog_iter);
+ clear_last_seq = prb_iter_seek(&i, -1);
logbuf_unlock_irq();
}
int do_syslog(int type, char __user *buf, int len, int source)
{
+ DECLARE_PRINTKRB_SEQENTRY(e);
+ DECLARE_PRINTKRB_ITER(iter, prb, &e);
bool clear = false;
static int saved_console_loglevel = LOGLEVEL_DEFAULT;
int error;
@@ -1512,10 +1581,15 @@ int do_syslog(int type, char __user *buf, int len, int source)
return 0;
if (!access_ok(buf, len))
return -EFAULT;
- error = wait_event_interruptible(log_wait,
- syslog_seq != log_next_seq);
- if (error)
+
+ logbuf_lock_irq();
+ prb_iter_sync(&iter, &syslog_iter);
+ logbuf_unlock_irq();
+
+ error = prb_iter_wait_next_valid_entry(&iter);
+ if (error < 0)
return error;
+
error = syslog_print(buf, len);
break;
/* Read/clear last kernel messages */
@@ -1562,33 +1636,15 @@ int do_syslog(int type, char __user *buf, int len, int source)
/* Number of chars in the log buffer */
case SYSLOG_ACTION_SIZE_UNREAD:
logbuf_lock_irq();
- if (syslog_seq < log_first_seq) {
- /* messages are gone, move to first one */
- syslog_seq = log_first_seq;
- syslog_idx = log_first_idx;
- syslog_partial = 0;
- }
if (source == SYSLOG_FROM_PROC) {
/*
* Short-cut for poll(/"proc/kmsg") which simply checks
- * for pending data, not the size; return the count of
- * records, not the length.
+ * for pending data, not the size; return true if there
+ * is a pending record
*/
- error = log_next_seq - syslog_seq;
+ error = prb_iter_peek_next_entry(&syslog_iter, NULL);
} else {
- u64 seq = syslog_seq;
- u32 idx = syslog_idx;
- bool time = syslog_partial ? syslog_time : printk_time;
-
- while (seq < log_next_seq) {
- struct printk_log *msg = log_from_idx(idx);
-
- error += msg_print_text(msg, true, time, NULL,
- 0);
- time = printk_time;
- idx = log_next(idx);
- seq++;
- }
+ error = count_remaining(&syslog_iter, -1);
error -= syslog_partial;
}
logbuf_unlock_irq();
@@ -1948,7 +2004,6 @@ asmlinkage int vprintk_emit(int facility, int level,
int printed_len;
bool in_sched = false, pending_output;
unsigned long flags;
- u64 curr_log_seq;
/* Suppress unimportant messages after panic happens */
if (unlikely(suppress_printk))
@@ -1964,9 +2019,8 @@ asmlinkage int vprintk_emit(int facility, int level,
/* This stops the holder of console_sem just where we want him */
logbuf_lock_irqsave(flags);
- curr_log_seq = log_next_seq;
printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
- pending_output = (curr_log_seq != log_next_seq);
+ pending_output = prb_iter_peek_next_entry(&console_iter, NULL);
logbuf_unlock_irqrestore(flags);
/* If called from the scheduler, we can not call up(). */
@@ -2056,18 +2110,15 @@ EXPORT_SYMBOL(printk);
#define PREFIX_MAX 0
#define printk_time false
-static u64 syslog_seq;
-static u32 syslog_idx;
-static u64 console_seq;
-static u32 console_idx;
+DECLARE_PRINTKRB_SEQENTRY(syslog_entry);
+DECLARE_PRINTKRB_ITER(syslog_iter, NULL, NULL);
+DECLARE_PRINTKRB_SEQENTRY(console_entry);
+DECLARE_PRINTKRB_ITER(console_iter, NULL, NULL);
+
+static u64 console_last_seq;
static u64 exclusive_console_stop_seq;
-static u64 log_first_seq;
-static u32 log_first_idx;
-static u64 log_next_seq;
static char *log_text(const struct printk_log *msg) { return NULL; }
static char *log_dict(const struct printk_log *msg) { return NULL; }
-static struct printk_log *log_from_idx(u32 idx) { return NULL; }
-static u32 log_next(u32 idx) { return 0; }
static ssize_t msg_print_ext_header(char *buf, size_t size,
struct printk_log *msg,
u64 seq) { return 0; }
@@ -2402,36 +2453,32 @@ void console_unlock(void)
printk_safe_enter_irqsave(flags);
raw_spin_lock(&logbuf_lock);
- if (console_seq < log_first_seq) {
- len = sprintf(text,
- "** %llu printk messages dropped **\n",
- log_first_seq - console_seq);
+skip:
+ if (prb_iter_next_valid_entry(&console_iter) == 0)
+ break;
- /* messages are gone, move to first one */
- console_seq = log_first_seq;
- console_idx = log_first_idx;
+ if (console_entry.seq - console_last_seq != 1) {
+ len = sprintf(text,
+ "** %llu printk messages dropped **\n",
+ console_entry.seq - (console_last_seq + 1));
} else {
len = 0;
}
-skip:
- if (console_seq == log_next_seq)
- break;
+ console_last_seq = console_entry.seq;
- msg = log_from_idx(console_idx);
+ msg = (struct printk_log *)&console_entry.buffer[0];
if (suppress_message_printing(msg->level)) {
/*
* Skip record we have buffered and already printed
* directly to the console when we received it, and
* record that has level above the console loglevel.
*/
- console_idx = log_next(console_idx);
- console_seq++;
goto skip;
}
/* Output to all consoles once old messages replayed. */
if (unlikely(exclusive_console &&
- console_seq >= exclusive_console_stop_seq)) {
+ console_last_seq > exclusive_console_stop_seq)) {
exclusive_console = NULL;
}
@@ -2441,14 +2488,12 @@ void console_unlock(void)
if (nr_ext_console_drivers) {
ext_len = msg_print_ext_header(ext_text,
sizeof(ext_text),
- msg, console_seq);
+ msg, console_last_seq);
ext_len += msg_print_ext_body(ext_text + ext_len,
sizeof(ext_text) - ext_len,
log_dict(msg), msg->dict_len,
log_text(msg), msg->text_len);
}
- console_idx = log_next(console_idx);
- console_seq++;
raw_spin_unlock(&logbuf_lock);
/*
@@ -2487,7 +2532,7 @@ void console_unlock(void)
* flush, no worries.
*/
raw_spin_lock(&logbuf_lock);
- retry = console_seq != log_next_seq;
+ retry = prb_iter_peek_next_entry(&console_iter, NULL);
raw_spin_unlock(&logbuf_lock);
printk_safe_exit_irqrestore(flags);
@@ -2556,8 +2601,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
unsigned long flags;
logbuf_lock_irqsave(flags);
- console_seq = log_first_seq;
- console_idx = log_first_idx;
+ console_last_seq = 0;
+ prb_iter_seek(&console_iter, 0);
logbuf_unlock_irqrestore(flags);
}
console_unlock();
@@ -2760,8 +2805,7 @@ void register_console(struct console *newcon)
* for us.
*/
logbuf_lock_irqsave(flags);
- console_seq = syslog_seq;
- console_idx = syslog_idx;
+ prb_iter_sync(&console_iter, &syslog_iter);
/*
* We're about to replay the log buffer. Only do this to the
* just-registered console to avoid excessive message spam to
@@ -2772,7 +2816,8 @@ void register_console(struct console *newcon)
* ignores console_lock.
*/
exclusive_console = newcon;
- exclusive_console_stop_seq = console_seq;
+ exclusive_console_stop_seq = console_last_seq;
+ console_last_seq = 0;
logbuf_unlock_irqrestore(flags);
}
console_unlock();
@@ -3033,6 +3078,8 @@ EXPORT_SYMBOL(printk_timed_ratelimit);
static DEFINE_SPINLOCK(dump_list_lock);
static LIST_HEAD(dump_list);
+static char kmsg_dump_msgbuf[PRINTK_RECORD_MAX];
+
/**
* kmsg_dump_register - register a kernel log dumper.
* @dumper: pointer to the kmsg_dumper structure
@@ -3102,7 +3149,6 @@ module_param_named(always_kmsg_dump, always_kmsg_dump, bool, S_IRUGO | S_IWUSR);
void kmsg_dump(enum kmsg_dump_reason reason)
{
struct kmsg_dumper *dumper;
- unsigned long flags;
if ((reason > KMSG_DUMP_OOPS) && !always_kmsg_dump)
return;
@@ -3115,12 +3161,7 @@ void kmsg_dump(enum kmsg_dump_reason reason)
/* initialize iterator with data about the stored records */
dumper->active = true;
- logbuf_lock_irqsave(flags);
- dumper->cur_seq = clear_seq;
- dumper->cur_idx = clear_idx;
- dumper->next_seq = log_next_seq;
- dumper->next_idx = log_next_idx;
- logbuf_unlock_irqrestore(flags);
+ kmsg_dump_rewind(dumper);
/* invoke dumper which will iterate over records */
dumper->dump(dumper, reason);
@@ -3153,28 +3194,27 @@ void kmsg_dump(enum kmsg_dump_reason reason)
bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
char *line, size_t size, size_t *len)
{
- struct printk_log *msg;
+ struct prb_entry e = {
+ .buffer = &kmsg_dump_msgbuf[0],
+ .buffer_size = sizeof(kmsg_dump_msgbuf),
+ };
+ DECLARE_PRINTKRB_ITER(i, prb, &e);
+ struct printk_log *msg = (struct printk_log *)&e.buffer[0];
size_t l = 0;
bool ret = false;
if (!dumper->active)
goto out;
- if (dumper->cur_seq < log_first_seq) {
- /* messages are gone, move to first available one */
- dumper->cur_seq = log_first_seq;
- dumper->cur_idx = log_first_idx;
- }
+ dumper->last_seq = prb_iter_seek(&i, dumper->last_seq);
/* last entry */
- if (dumper->cur_seq >= log_next_seq)
+ if (prb_iter_next_valid_entry(&i) == 0)
goto out;
- msg = log_from_idx(dumper->cur_idx);
l = msg_print_text(msg, syslog, printk_time, line, size);
- dumper->cur_idx = log_next(dumper->cur_idx);
- dumper->cur_seq++;
+ dumper->last_seq = e.seq;
ret = true;
out:
if (len)
@@ -3235,11 +3275,14 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_line);
bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
char *buf, size_t size, size_t *len)
{
+ struct prb_entry e = {
+ .buffer = &kmsg_dump_msgbuf[0],
+ .buffer_size = sizeof(kmsg_dump_msgbuf),
+ };
+ DECLARE_PRINTKRB_ITER(i, prb, &e);
+ struct printk_log *msg = (struct printk_log *)&e.buffer[0];
unsigned long flags;
- u64 seq;
- u32 idx;
- u64 next_seq;
- u32 next_idx;
+ u64 next_until_seq;
size_t l = 0;
bool ret = false;
bool time = printk_time;
@@ -3248,55 +3291,45 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
goto out;
logbuf_lock_irqsave(flags);
- if (dumper->cur_seq < log_first_seq) {
- /* messages are gone, move to first available one */
- dumper->cur_seq = log_first_seq;
- dumper->cur_idx = log_first_idx;
- }
+
+ if (!dumper->until_seq)
+ dumper->until_seq = -1;
+
+ dumper->last_seq = prb_iter_seek(&i, dumper->last_seq);
/* last entry */
- if (dumper->cur_seq >= dumper->next_seq) {
+ if (!prb_iter_peek_next_entry(&i, NULL)) {
logbuf_unlock_irqrestore(flags);
goto out;
}
/* calculate length of entire buffer */
- seq = dumper->cur_seq;
- idx = dumper->cur_idx;
- while (seq < dumper->next_seq) {
- struct printk_log *msg = log_from_idx(idx);
+ l = count_remaining(&i, dumper->until_seq);
- l += msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
- }
-
- /* move first record forward until length fits into the buffer */
- seq = dumper->cur_seq;
- idx = dumper->cur_idx;
- while (l > size && seq < dumper->next_seq) {
- struct printk_log *msg = log_from_idx(idx);
+ if (l <= size) {
+ /* last message in next iteration */
+ next_until_seq = dumper->last_seq;
+ } else {
+ /* move iterator forward until text fits into the buffer */
+ while (l > size) {
+ prb_iter_next_valid_entry(&i);
+ l -= msg_print_text(msg, true, time, NULL, 0);
+ }
- l -= msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
+ /* last message in next iteration */
+ next_until_seq = e.seq + 1;
}
- /* last message in next interation */
- next_seq = seq;
- next_idx = idx;
-
+ /* copy messages to buffer */
l = 0;
- while (seq < dumper->next_seq) {
- struct printk_log *msg = log_from_idx(idx);
-
+ for (;;) {
+ prb_iter_next_valid_entry(&i);
+ if (e.seq >= dumper->until_seq)
+ break;
l += msg_print_text(msg, syslog, time, buf + l, size - l);
- idx = log_next(idx);
- seq++;
}
- dumper->next_seq = next_seq;
- dumper->next_idx = next_idx;
+ dumper->until_seq = next_until_seq;
ret = true;
logbuf_unlock_irqrestore(flags);
out:
@@ -3318,10 +3351,8 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);
*/
void kmsg_dump_rewind_nolock(struct kmsg_dumper *dumper)
{
- dumper->cur_seq = clear_seq;
- dumper->cur_idx = clear_idx;
- dumper->next_seq = log_next_seq;
- dumper->next_idx = log_next_idx;
+ dumper->last_seq = clear_last_seq;
+ dumper->until_seq = 0;
}
/**
diff --git a/kernel/printk/ringbuffer.h b/kernel/printk/ringbuffer.h
index 70cb9ad284d4..02b4c53e287e 100644
--- a/kernel/printk/ringbuffer.h
+++ b/kernel/printk/ringbuffer.h
@@ -134,6 +134,8 @@ struct prb_iterator {
unsigned long next_id;
};
+#ifdef CONFIG_PRINTK
+
/* writer interface */
char *prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
unsigned long size);
@@ -163,6 +165,28 @@ struct nl_node *prb_desc_node(unsigned long id, void *arg);
bool prb_desc_busy(unsigned long id, void *arg);
struct dr_desc *prb_getdesc(unsigned long id, void *arg);
+#else /* CONFIG_PRINTK */
+
+#define prb_reserve(e, rb, size) NULL
+#define prb_commit(e)
+#define prb_iter_init(iter, rb, e)
+#define prb_iter_next_valid_entry(iter) 0
+#define prb_iter_wait_next_valid_entry(iter) -ERESTARTSYS
+#define prb_iter_sync(dst, src)
+#define prb_iter_copy(dst, src)
+#define prb_iter_peek_next_entry(iter, last_seq) false
+#define prb_iter_seek(iter, last_seq) 0
+#define prb_wait_queue(rb) NULL
+#define prb_iter_entry(iter) NULL
+#define prb_getfail(rb) 0
+#define prb_init(rb, data, data_size_bits, descs, desc_count_bits, waitq)
+#define prb_unused(rb) 0
+#define prb_desc_node NULL
+#define prb_desc_busy NULL
+#define prb_getdesc NULL
+
+#endif /* CONFIG_PRINTK */
+
/**
* DECLARE_PRINTKRB() - Declare a printk ringbuffer.
*
--
2.20.1
On Wed, Aug 7, 2019 at 3:27 PM John Ogness <[email protected]> wrote:
>
> 2. For the CONFIG_PPC_POWERNV powerpc platform, kernel log buffer
> registration is no longer available because there is no longer
> a single contigous block of memory to represent all of the
> ringbuffer.
So this is tangential, but I've actually been wishing for a special
"raw dump" format that has absolutely *no* structure to it at all, and
is as a result not necessarily strictly reliable, but is a lot more
robust.
The background for that is that we have a class of bugs that are
really hard to debug "in the wild", because people don't have access
to serial consoles or any kind of special hardware at all (ie forget
things like nvram etc), and when the machine locks up you're happy to
just have a reset button (but more likely you have to turn power off
and on).
End result: a DRAM buffer can work, but is not "reliable".
Particularly if you turn power on and off, data retention of DRAM is
iffy. But it's possible, at least in theory.
So I have a patch that implements a "stupid ring buffer" for thisa
case, with absolutely zero data structures (because in the presense of
DRAM corruption, all you can get is "hopefully only slightly garbled
ASCII".
It actually does work. It's a complete hack, but I have used this on
real hardware to see dumps that happened after the machine could no
longer send them to any device.
I actually suspect that this kind of "stupid non-structured secondary
log" can often be much more useful than the existing nvram special
cases - yes the output can be garbled for multi-cpu cases because it
not only is lockless, it's lockess without even any data structures -
but it also works somewhat reliably when the machine is _really_
borked. Which is exactly when you want a log that isn't just the
normal "working machine syslog".
NOTE! This is *not* a replacement for a lockless printk. This is very
much an _additional_ "low overhead buffer in RAM" for post-mortem
analysis when anything fancier doesn't work.
So I'm throwing this patch out there in case people have interest in
looking at that very special case. Also note how right now the example
code just steals a random physical memory area at roughly physical
location 12GB - this is a hack and would need to be configurable
obviously in real life, but it worked for the machines I tested (which
both happened to have 16GB of RAM).
Those parts are marked with "// HACK HACK HACK" and just a hardcoded
physical address (0x320000000).
Linus
On 2019-08-08, Linus Torvalds <[email protected]> wrote:
>> 2. For the CONFIG_PPC_POWERNV powerpc platform, kernel log buffer
>> registration is no longer available because there is no longer
>> a single contigous block of memory to represent all of the
>> ringbuffer.
>
> So this is tangential, but I've actually been wishing for a special
> "raw dump" format that has absolutely *no* structure to it at all, and
> is as a result not necessarily strictly reliable, but is a lot more
> robust.
>
> The background for that is that we have a class of bugs that are
> really hard to debug "in the wild", because people don't have access
> to serial consoles or any kind of special hardware at all (ie forget
> things like nvram etc), and when the machine locks up you're happy to
> just have a reset button (but more likely you have to turn power off
> and on).
>
> End result: a DRAM buffer can work, but is not "reliable".
> Particularly if you turn power on and off, data retention of DRAM is
> iffy. But it's possible, at least in theory.
>
> So I have a patch that implements a "stupid ring buffer" for thisa
> case, with absolutely zero data structures (because in the presense of
> DRAM corruption, all you can get is "hopefully only slightly garbled
> ASCII".
You can read the current printk ringbuffer this way also because the
ASCII strings are sorted and separated by struct binary data. The binary
parts of the structs can even prove useful in this case to act as record
separators.
dump_area(log_buf, log_buf_len);
Note: To test this, I modified your dump_area() to call trace_printk()
instead of printk().
The _raw_ contents of the ringbuffer I am proposing (dataring.data) is
nearly identical to that of the current ringbuffer. Its raw data is also
sorted and separated by non-ascii data. So using:
dump_area(prb->dr.data, 1 << prb->dr.size_bits);
produces essentially the same results. Both ringbuffers are structuring
the data similar to yours, but they have the advantage of writer
synchronization.
What is missing is a way for the raw data buffers to be fixed to a
specified address so that they can be recovered after a reset. I will
look into such a feature for my next version.
On a side note, I'm not sure sure if we want kernel code to do the ASCII
dump of the raw buffer. A userspace application grabbing from /dev/mem
might be more desirable. I can imagine that all kinds of "intelligence"
could be added to such an application to try to recover/sanitize as much
meta-data as possible (such as timestamps, CPU ID, task ID, etc). Maybe
we should add CRC or ECC to struct prink_log. :-)
John Ogness
On Thu, Aug 8, 2019 at 3:56 PM John Ogness <[email protected]> wrote:
>
> On a side note, I'm not sure sure if we want kernel code to do the ASCII
> dump of the raw buffer. A userspace application grabbing from /dev/mem
> might be more desirable.
So there are two major issues here.
One obvious one is that the *current* boot must not overwrite the previous data.
With the separate stupid ring buffer, that isn't an issue. We just
won't start logging into it at all until after it's been registered,
so the solution is "read it out before registering it again" - and
that can indeed be done at almost any point (including by a user space
thing).
But with your suggestion of just using the native ring buffer you
already have, that's not an option. The current boot will have started
using it already and is overwriting the data we wanted from the
previous boot.
To which the obvious solution is "just use a different buffer for the
next boot". But that brings up the *second* big issue with a
reboot-safe buffer: it can't just be anywhere. Not only do you have to
have some way to find the old one, the actual location may end up
being basically fixed by hardware or firmware.
The reason for that is that while the patch I sent out actually works
fine on lots of machines in practice, it has a serious issue: it only
works for a nice warm reset, and only when the BIOS doesn't overwrite
memory at reset. Both of those are problems.
Most BIOSes don't overwrite all memory for a warm reset, simply
because booting fast is an issue. But some BIOSes _do_ re-initialize
everything. And in particular, they tend to do so after cold boots,
particularly on server machines where you may have "oh, ECC needs to
be re-initialized".
End result: in reality, my hacky patch is a "look, if we had firmware
support for not re-initializing a small piece of RAM, we could use
this". I've asked Intel for a fast logging facility in hardware for
basically two decades by now, and it's not happening. But I'm hoping
that I _can_ ask them for "hey, can you make your firmware not reinit
a small 64kB buffer at boot, so that it will survive even a cold boot
as long as the power-off was very short".
(And the "64kB" above is just a random number. Maybe it's 4kB.
Something fairly small, and something that the BIOS would not reinit,
and in the presense of ECC it would possibly just scrub - read old
contents without ECC, write them back with ECC).
See? That basically fixes the reboot-safe buffer at a particular
memory area, and means that you can't share the same buffer for the
basic logging as for the reboot-safe case..
That's why I also reacted to your POWER NVDIMM thing - it actually has
some of the same issues. It's fixed by external forces, and not a
generic random piece of memory.
Linus
On Thu, 8 Aug 2019 16:33:20 -0700
Linus Torvalds <[email protected]> wrote:
> To which the obvious solution is "just use a different buffer for the
> next boot". But that brings up the *second* big issue with a
> reboot-safe buffer: it can't just be anywhere. Not only do you have to
> have some way to find the old one, the actual location may end up
> being basically fixed by hardware or firmware.
Could we possibly have a magic value in some location that if it is
set, we know right away that the buffer here has data from the last
reboot, and we read it out into a safe location before we start using
it again?
-- Steve
On Thu, Aug 8, 2019 at 4:45 PM Steven Rostedt <[email protected]> wrote:
>
> Could we possibly have a magic value in some location that if it is
> set, we know right away that the buffer here has data from the last
> reboot, and we read it out into a safe location before we start using
> it again?
Right now I don't know how reliable RAM ends up being.
But with a small enough buffer I guess we could just do it
unconditionally and then let some debug tool in user space try to make
sense of it later.
More background for what I'm looking for: my hope for this is that we
can finally get the case of "undebuggable laptop hangs" logged with
something like this.
But laptops don't have reset buttons. They have "press the power
button for ten seconds, power turns off. Press it again, and power
comes on" reset sequences.
So DRAM power off for maybe 5 seconds? I've tried to find papers on
how well DRAM retention works (not very common: usually they happen
because you have some security researcher that wants to move a DIMM
and read it on another system, and some of them talk about using
freezing techniques to increase retention), and from what I've seen,
retention *should* be possible even for that kind of timeframe,
despite the usual "DRAM wants 60ms refresh". As in "maybe 90% of bits
might still be legible". And newer DRAM with smaller capacitors isn't
apparently a downside, because they have much less leakage too.
But some of those papers were for old DRAM. Maybe somebody knows
better. I don't have any real data myself, because my cold-boot tests
all seemed to show the BIOS reinitializing it to garbage. For all I
know, the DRAM training will guarantee garbage and it's all a pipe
dream.
Anyway, from some wild handwaving of "maybe we can get 90% bit
retention" means that a human can read garbled data and guess
(particularly if you can see "ok, it's an oops, I know what the
overall pattern is, I can ignore a lot of bits that don't matter").
But I wouldn't want to necessarily automte it all that much.
But the retention pattern might be very random too, and honestly, I'm
mostly guessing to begin with (if that wasn't clear already ;).
But the "random user didn't have any other choice but to just
powercycle the machine" is one of the nastiest debug problems we have
right now, and if we were to get "next boot automatically sends a
report to the distro or whatever after a non-clean shutdown" that
might be *very* useful.
Or it might not be. Right now we simply don't have that kind of data
at all. Sure, we have a ton of virtual machines and servers that have
"reliable IO" (either thanks to the VM or thanks to serial lines etc),
but it's literally the "normal random consumer who runs
Fedora/Ubuntu/Suse workstation" that currently basically has no data
at all if it's the kind of crash that doesn't get you a saved log.
And the people running VM's and servers with serial lines are simply
not doing the same things as real people on real hardware are, so I
don't think it's an argument that "hey, we get reports from those nice
datacenter guys".
We likely don't even have any idea of how common it is, because while
I know "hangs on resume with no logs" used to be a fairly common
problem case, it by definition never gets _logged_. Maybe people
complain, but more likely they just curse and reboot.
And no, I don't think this is actually common at all. But the problem
with those unloggable problems is that _if_ they happen - even if it's
very very rare indeed - they are really nasty to debug.
They are nasty to debug when they happen on a developer machine (I
should know, I've definitely had them), but when they happen in the
wild they are basically "user just rebooted the machine". End of
story, and no stats or anything like that.
Linus
On Thu, 8 Aug 2019 17:21:09 -0700
Linus Torvalds <[email protected]> wrote:
> But laptops don't have reset buttons. They have "press the power
> button for ten seconds, power turns off. Press it again, and power
> comes on" reset sequences.
I've never tried, but are you saying that even with the "10 second
hold" the laptop's DRAM may still have old data that is accessible?
>
> They are nasty to debug when they happen on a developer machine (I
> should know, I've definitely had them), but when they happen in the
> wild they are basically "user just rebooted the machine". End of
> story, and no stats or anything like that.
Would a best effort 1 page buffer work? Really, with a hard hang we
usually only care about the last thing that was printed (we need to add
one of those: stop printing after the first WARN_ON is hit, to not
lose the initial bug).
That way you could have a buffer that is written to constantly but only
is the size of one or two pages. It can have a variable in it that gets
reset on shutdown. If the system hangs, the next boot could look to see
if that page was shutdown cleanly (or never initialized) otherwise, it
can read the page or pages into a buffer that can be read from debugfs.
A user space tool could read this page and if it detects that it
contains data from a crash, notify the user and say "Can you send this
to [email protected]"? Even better if it tells the user the
subject and content of the email that should be sent.
-- Steve
On Thu, Aug 8, 2019 at 5:48 PM Steven Rostedt <[email protected]> wrote:
>
> I've never tried, but are you saying that even with the "10 second
> hold" the laptop's DRAM may still have old data that is accessible?
The power doesn't go off when you *start* the 10s hold. It goes off
after ten seconds.
So power is off only for the time it then takes you to press the power
button again to turn it on again. So just a second or two if you react
quickly to the "ok, power light finally went off". Longer if you
don't.
But yes, DRAM has retention time in the seconds. See for example
https://www.pdl.cmu.edu/PDL-FTP/NVM/dram-retention_isca13.pdf
and look at the kinds of times they are looking at - their graphs
aren't in milliseconds, they are in 1-5 seconds (and the retention is
pretty high for that time).
But I don't know what a power-off-in-laptop scenario really looks like..
Linus
On Thu, Aug 08, 2019 at 12:07:28PM -0700, Linus Torvalds wrote:
> End result: a DRAM buffer can work, but is not "reliable".
> Particularly if you turn power on and off, data retention of DRAM is
> iffy. But it's possible, at least in theory.
>
> So I have a patch that implements a "stupid ring buffer" for thisa
> case, with absolutely zero data structures (because in the presense of
> DRAM corruption, all you can get is "hopefully only slightly garbled
> ASCII".
Note that you can hook this into printk as a fake early serial device;
just have the serial device write to the DRAM buffer.
On 2019-08-09, Peter Zijlstra <[email protected]> wrote:
>> End result: a DRAM buffer can work, but is not "reliable".
>> Particularly if you turn power on and off, data retention of DRAM is
>> iffy. But it's possible, at least in theory.
>>
>> So I have a patch that implements a "stupid ring buffer" for thisa
>> case, with absolutely zero data structures (because in the presense of
>> DRAM corruption, all you can get is "hopefully only slightly garbled
>> ASCII".
>
> Note that you can hook this into printk as a fake early serial device;
> just have the serial device write to the DRAM buffer.
Or the other way around, implement a fake console to handle writing the
messages (as they are being emitted from printk) to some special
area. Then the messages would even be pre-processed so that all
meta-data is already in ASCII form.
John Ogness
On Thu, 8 Aug 2019, Linus Torvalds wrote:
> On Thu, Aug 8, 2019 at 5:48 PM Steven Rostedt <[email protected]> wrote:
> >
> > I've never tried, but are you saying that even with the "10 second
> > hold" the laptop's DRAM may still have old data that is accessible?
>
> The power doesn't go off when you *start* the 10s hold. It goes off
> after ten seconds.
>
> So power is off only for the time it then takes you to press the power
> button again to turn it on again. So just a second or two if you react
> quickly to the "ok, power light finally went off". Longer if you
> don't.
>
> But yes, DRAM has retention time in the seconds. See for example
>
> https://www.pdl.cmu.edu/PDL-FTP/NVM/dram-retention_isca13.pdf
>
> and look at the kinds of times they are looking at - their graphs
> aren't in milliseconds, they are in 1-5 seconds (and the retention is
> pretty high for that time).
>
> But I don't know what a power-off-in-laptop scenario really looks like..
That's random behaviour. It's hardware & BIOS & value add. What do you
expect?
I tried on a few machines. My laptop does not retain any useful information
and on some server box (which takes ages to boot) the memory is squeaky
clean, i.e. the BIOS wiped it already. Some others worked with a two second
delay between turning the remote power switch on and off.
Thanks,
tglx
On Thu, Aug 8, 2019 at 11:14 PM Peter Zijlstra <[email protected]> wrote:
>
> Note that you can hook this into printk as a fake early serial device;
> just have the serial device write to the DRAM buffer.
No, you really really can't.
Look, the whole point of that reboot buffer is that it works WHEN
NOTHING ELSE DOES.
Very much including things like "oh, we're doing a suspend/resume, so
the console lock is held to make sure we don't touch any devices that
are likely dead right now".
The poweroff buffer is not a console. Don't even try to think of it as
one. It's for when consoles don't work. Trying to make it be an
early-console would completely defeat the whole point.
Even the "early console" stuff tries to honor serialization by
console_lock and console_suspended etc. Or things like the "I'm in the
middle of the scheduler, so I won't be doing any real logging".
If the system works, and you get console output or you have a working
syslogd that saves messages to disk, all of this is entirely
irrelevant. Really.
Don't think of it as a console. If you do, you're doing it wrong.
Linus
On Fri, Aug 9, 2019 at 4:15 AM Thomas Gleixner <[email protected]> wrote:
>
> >
> > But I don't know what a power-off-in-laptop scenario really looks like..
>
> That's random behaviour. It's hardware & BIOS & value add. What do you
> expect?
>
> I tried on a few machines. My laptop does not retain any useful information
> and on some server box (which takes ages to boot) the memory is squeaky
> clean, i.e. the BIOS wiped it already. Some others worked with a two second
> delay between turning the remote power switch on and off.
You were there at the Intel meeting, weren't you?
This is all about the fact that "we're not getting sane and reliable
debug facilities for remote debugging". We haven't gotten them over
two decades, we're not seeing it in the future either.
So what if we _can_ get an ACPI update and in the next decade your
laptop _will_ have a memory area that doesn't get scribbled over?
Does it work today? Yes it does, but only for very special cases
(basically warm reboot with "fast boot" enabled).
But they are special cases that may be things that can be extended
upon without any actual hardware changes.
Linus
On Fri, 9 Aug 2019, Linus Torvalds wrote:
> On Fri, Aug 9, 2019 at 4:15 AM Thomas Gleixner <[email protected]> wrote:
> >
> > >
> > > But I don't know what a power-off-in-laptop scenario really looks like..
> >
> > That's random behaviour. It's hardware & BIOS & value add. What do you
> > expect?
> >
> > I tried on a few machines. My laptop does not retain any useful information
> > and on some server box (which takes ages to boot) the memory is squeaky
> > clean, i.e. the BIOS wiped it already. Some others worked with a two second
> > delay between turning the remote power switch on and off.
>
> You were there at the Intel meeting, weren't you?
Yup.
> This is all about the fact that "we're not getting sane and reliable
> debug facilities for remote debugging". We haven't gotten them over
> two decades, we're not seeing it in the future either.
I know. It sucks.
> So what if we _can_ get an ACPI update and in the next decade your
> laptop _will_ have a memory area that doesn't get scribbled over?
No argument here.
> Does it work today? Yes it does, but only for very special cases
> (basically warm reboot with "fast boot" enabled).
>
> But they are special cases that may be things that can be extended
> upon without any actual hardware changes.
I'm all for it. I just tried it out and the ratio was 3 out of 5 retained
the data +/- a few bitflips with ~2 seconds power off. The other two were
the laptop and that server machine which wipes everything.
If that can be avoided with some ACPI tweak especially on the laptop, that
would be great. I'm not so worried about the server case.
Debugging laptops and random client machines is the real interesting use
case. They usually lack serial and even if they have serial then the
reporter has not necessarily a second machine to capture the stuff.
Thanks,
tglx
On Fri, Aug 9, 2019 at 1:07 PM Thomas Gleixner <[email protected]> wrote:
>
> I'm all for it. I just tried it out and the ratio was 3 out of 5 retained
> the data +/- a few bitflips with ~2 seconds power off. The other two were
> the laptop and that server machine which wipes everything.
Perfect. That actually says "the theory works". My desktop worked only
on warm reboot - which isn't really the interesting case (it does
cover things like triple boots etc and "press reset button when it
hangs, so it *can* be helpful, but even on desktops reset buttons seem
to be getting less common).
But yes, the whole thing where BIOSes wipe everything is problematic,
but that's where I just need to ping the right people inside Intel
again.
I did send the patch to inside Intel earlier, but I think the timing
for that might have been bad (people were on vacation), so I should
just reach out to more Intel people.
It would be better to have a more polished patch (the whole "fixed
address at around 12GB physical" really is such a horrible hack), but
I dreaded actually parsing the e280 memory map to do some "static for
one particular configuration" thing.
I should just do that and have something that Intel HW and FW people
can test on any hardware.
> If that can be avoided with some ACPI tweak especially on the laptop, that
> would be great. I'm not so worried about the server case.
Yeah, the server case I think we have covered other ways. Plus people
running them tend to have serious developer resources anyway.
They might still use something like this for some convenient
first-order debugging if we end up having generally available, of
course, but the target really is "random laptop or home user that uses
a distro and can't be expected to even try to sanely report - much
less debug - a hung machine condition".
Linus
On Fri, 9 Aug 2019, Linus Torvalds wrote:
> On Thu, Aug 8, 2019 at 11:14 PM Peter Zijlstra <[email protected]> wrote:
> > Note that you can hook this into printk as a fake early serial device;
> > just have the serial device write to the DRAM buffer.
>
> No, you really really can't.
...
> Even the "early console" stuff tries to honor serialization by
> console_lock and console_suspended etc. Or things like the "I'm in the
> middle of the scheduler, so I won't be doing any real logging".
If you think of it as the classic console you are right. What Peter has in
mind is the extra stuff on top of this buffer patchset, which implements
emergency write to consoles. That's an extra callback in the console
struct, which can be invoked in such situations igoring context and console
lock completely.
Right now we have an implementation for serial only, but that already is
useful. I nicely got (minimaly garbled) crash dumps out of an NMI
handler. With the current mainline console code the machine just hung.
So with this scheme we actually could hook your smart buffer into the
console stuff and still achieve what you want.
Thanks,
tglx
On Fri, Aug 9, 2019 at 8:16 AM Peter Zijlstra <[email protected]> wrote:
> On Thu, Aug 08, 2019 at 12:07:28PM -0700, Linus Torvalds wrote:
> > End result: a DRAM buffer can work, but is not "reliable".
> > Particularly if you turn power on and off, data retention of DRAM is
> > iffy. But it's possible, at least in theory.
> >
> > So I have a patch that implements a "stupid ring buffer" for thisa
> > case, with absolutely zero data structures (because in the presense of
> > DRAM corruption, all you can get is "hopefully only slightly garbled
> > ASCII".
>
> Note that you can hook this into printk as a fake early serial device;
> just have the serial device write to the DRAM buffer.
Yep. Amiga had debug=mem for years, to write kernel messages to Chip
RAM, and retrieve them after a reboot. Cfr. amiga_console_driver and
arch/m68k/tools/amiga/dmesg.c.
BTW, with those old machines, it was not uncommon for DRAM retention
time to be 30s or so.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
John, can you cc kexec list for your later series?
On 08/08/19 at 12:32am, John Ogness wrote:
> This is a major change because the API (and underlying workings)
> of the new ringbuffer are completely different than the previous
> ringbuffer. Since there are several components of the printk
> infrastructure that use the ringbuffer API (console, /dev/kmsg,
> syslog, kmsg_dump), there are quite a few changes throughout the
> printk implementation.
>
> This is also a conservative change because it continues to use the
> logbuf_lock raw spinlock even though the new ringbuffer is lockless.
>
> The externally visible changes are:
>
> 1. The exported vmcore info has changed:
>
> - VMCOREINFO_SYMBOL(log_buf);
> - VMCOREINFO_SYMBOL(log_buf_len);
> - VMCOREINFO_SYMBOL(log_first_idx);
> - VMCOREINFO_SYMBOL(clear_idx);
> - VMCOREINFO_SYMBOL(log_next_idx);
> + VMCOREINFO_SYMBOL(printk_rb_static);
> + VMCOREINFO_SYMBOL(printk_rb_dynamic);
I assumed this needs some userspace work in kexec, how did you test
them?
makedumpfile should need changes to dump the kernel log.
Also kexec-tools includes a vmcore-dmesg.c to extrace dmesg from
/proc/vmcore.
>
> 2. For the CONFIG_PPC_POWERNV powerpc platform, kernel log buffer
> registration is no longer available because there is no longer
> a single contigous block of memory to represent all of the
> ringbuffer.
>
> Signed-off-by: John Ogness <[email protected]>
> ---
> arch/powerpc/platforms/powernv/opal.c | 22 +-
> include/linux/kmsg_dump.h | 6 +-
> include/linux/printk.h | 12 -
> kernel/printk/printk.c | 745 ++++++++++++++------------
> kernel/printk/ringbuffer.h | 24 +
> 5 files changed, 415 insertions(+), 394 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index aba443be7daa..8c4b894b6663 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -806,30 +806,10 @@ static void opal_export_attrs(void)
>
> static void __init opal_dump_region_init(void)
> {
> - void *addr;
> - uint64_t size;
> - int rc;
> -
> if (!opal_check_token(OPAL_REGISTER_DUMP_REGION))
> return;
>
> - /* Register kernel log buffer */
> - addr = log_buf_addr_get();
> - if (addr == NULL)
> - return;
> -
> - size = log_buf_len_get();
> - if (size == 0)
> - return;
> -
> - rc = opal_register_dump_region(OPAL_DUMP_REGION_LOG_BUF,
> - __pa(addr), size);
> - /* Don't warn if this is just an older OPAL that doesn't
> - * know about that call
> - */
> - if (rc && rc != OPAL_UNSUPPORTED)
> - pr_warn("DUMP: Failed to register kernel log buffer. "
> - "rc = %d\n", rc);
> + pr_warn("DUMP: This kernel does not support kernel log buffer registration.\n");
> }
>
> static void opal_pdev_init(const char *compatible)
> diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
> index 2e7a1e032c71..d9b721347742 100644
> --- a/include/linux/kmsg_dump.h
> +++ b/include/linux/kmsg_dump.h
> @@ -46,10 +46,8 @@ struct kmsg_dumper {
> bool registered;
>
> /* private state of the kmsg iterator */
> - u32 cur_idx;
> - u32 next_idx;
> - u64 cur_seq;
> - u64 next_seq;
> + u64 last_seq;
> + u64 until_seq;
> };
>
> #ifdef CONFIG_PRINTK
> diff --git a/include/linux/printk.h b/include/linux/printk.h
> index cefd374c47b1..fd3007659cfb 100644
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> @@ -194,8 +194,6 @@ devkmsg_sysctl_set_loglvl(struct ctl_table *table, int write, void __user *buf,
>
> extern void wake_up_klogd(void);
>
> -char *log_buf_addr_get(void);
> -u32 log_buf_len_get(void);
> void log_buf_vmcoreinfo_setup(void);
> void __init setup_log_buf(int early);
> __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...);
> @@ -235,16 +233,6 @@ static inline void wake_up_klogd(void)
> {
> }
>
> -static inline char *log_buf_addr_get(void)
> -{
> - return NULL;
> -}
> -
> -static inline u32 log_buf_len_get(void)
> -{
> - return 0;
> -}
> -
> static inline void log_buf_vmcoreinfo_setup(void)
> {
> }
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 1888f6a3b694..1a50e0c43775 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -56,6 +56,7 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/printk.h>
>
> +#include "ringbuffer.h"
> #include "console_cmdline.h"
> #include "braille.h"
> #include "internal.h"
> @@ -409,28 +410,38 @@ DEFINE_RAW_SPINLOCK(logbuf_lock);
>
> #ifdef CONFIG_PRINTK
> DECLARE_WAIT_QUEUE_HEAD(log_wait);
> -/* the next printk record to read by syslog(READ) or /proc/kmsg */
> -static u64 syslog_seq;
> -static u32 syslog_idx;
> -static size_t syslog_partial;
> -static bool syslog_time;
>
> -/* index and sequence number of the first record stored in the buffer */
> -static u64 log_first_seq;
> -static u32 log_first_idx;
> +/*
> + * Define the average message size. This only affects the number of
> + * descriptors that will be available. Underestimating is better than
> + * overestimating (too many available descriptors is better than not enough).
> + */
> +#define PRB_AVGBITS 6
> +
> +DECLARE_PRINTKRB(printk_rb_static, PRB_AVGBITS,
> + CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS, &log_wait);
> +
> +static struct printk_ringbuffer printk_rb_dynamic;
>
> -/* index and sequence number of the next record to store in the buffer */
> -static u64 log_next_seq;
> -static u32 log_next_idx;
> +static struct printk_ringbuffer *prb = &printk_rb_static;
>
> -/* the next printk record to write to the console */
> -static u64 console_seq;
> -static u32 console_idx;
> +/* the last printk record read by syslog(READ) or /proc/kmsg */
> +static u64 syslog_last_seq;
> +DECLARE_PRINTKRB_ENTRY(syslog_entry,
> + sizeof(struct printk_log) + CONSOLE_EXT_LOG_MAX);
> +DECLARE_PRINTKRB_ITER(syslog_iter, &printk_rb_static, &syslog_entry);
> +static size_t syslog_partial;
> +static bool syslog_time;
> +
> +/* the last printk record written to the console */
> +static u64 console_last_seq;
> +DECLARE_PRINTKRB_ENTRY(console_entry,
> + sizeof(struct printk_log) + CONSOLE_EXT_LOG_MAX);
> +DECLARE_PRINTKRB_ITER(console_iter, &printk_rb_static, &console_entry);
> static u64 exclusive_console_stop_seq;
>
> -/* the next printk record to read after the last 'clear' command */
> -static u64 clear_seq;
> -static u32 clear_idx;
> +/* the last printk record read before the last 'clear' command */
> +static u64 clear_last_seq;
>
> #ifdef CONFIG_PRINTK_CALLER
> #define PREFIX_MAX 48
> @@ -446,22 +457,8 @@ static u32 clear_idx;
> #define LOG_ALIGN __alignof__(struct printk_log)
> #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
> #define LOG_BUF_LEN_MAX (u32)(1 << 31)
> -static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
> -static char *log_buf = __log_buf;
> static u32 log_buf_len = __LOG_BUF_LEN;
>
> -/* Return log buffer address */
> -char *log_buf_addr_get(void)
> -{
> - return log_buf;
> -}
> -
> -/* Return log buffer size */
> -u32 log_buf_len_get(void)
> -{
> - return log_buf_len;
> -}
> -
> /* human readable text of the record */
> static char *log_text(const struct printk_log *msg)
> {
> @@ -474,92 +471,12 @@ static char *log_dict(const struct printk_log *msg)
> return (char *)msg + sizeof(struct printk_log) + msg->text_len;
> }
>
> -/* get record by index; idx must point to valid msg */
> -static struct printk_log *log_from_idx(u32 idx)
> -{
> - struct printk_log *msg = (struct printk_log *)(log_buf + idx);
> -
> - /*
> - * A length == 0 record is the end of buffer marker. Wrap around and
> - * read the message at the start of the buffer.
> - */
> - if (!msg->len)
> - return (struct printk_log *)log_buf;
> - return msg;
> -}
> -
> -/* get next record; idx must point to valid msg */
> -static u32 log_next(u32 idx)
> -{
> - struct printk_log *msg = (struct printk_log *)(log_buf + idx);
> -
> - /* length == 0 indicates the end of the buffer; wrap */
> - /*
> - * A length == 0 record is the end of buffer marker. Wrap around and
> - * read the message at the start of the buffer as *this* one, and
> - * return the one after that.
> - */
> - if (!msg->len) {
> - msg = (struct printk_log *)log_buf;
> - return msg->len;
> - }
> - return idx + msg->len;
> -}
> -
> -/*
> - * Check whether there is enough free space for the given message.
> - *
> - * The same values of first_idx and next_idx mean that the buffer
> - * is either empty or full.
> - *
> - * If the buffer is empty, we must respect the position of the indexes.
> - * They cannot be reset to the beginning of the buffer.
> - */
> -static int logbuf_has_space(u32 msg_size, bool empty)
> -{
> - u32 free;
> -
> - if (log_next_idx > log_first_idx || empty)
> - free = max(log_buf_len - log_next_idx, log_first_idx);
> - else
> - free = log_first_idx - log_next_idx;
> -
> - /*
> - * We need space also for an empty header that signalizes wrapping
> - * of the buffer.
> - */
> - return free >= msg_size + sizeof(struct printk_log);
> -}
> -
> -static int log_make_free_space(u32 msg_size)
> -{
> - while (log_first_seq < log_next_seq &&
> - !logbuf_has_space(msg_size, false)) {
> - /* drop old messages until we have enough contiguous space */
> - log_first_idx = log_next(log_first_idx);
> - log_first_seq++;
> - }
> -
> - if (clear_seq < log_first_seq) {
> - clear_seq = log_first_seq;
> - clear_idx = log_first_idx;
> - }
> -
> - /* sequence numbers are equal, so the log buffer is empty */
> - if (logbuf_has_space(msg_size, log_first_seq == log_next_seq))
> - return 0;
> -
> - return -ENOMEM;
> -}
> -
> -/* compute the message size including the padding bytes */
> -static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
> +/* compute the message size */
> +static u32 msg_used_size(u16 text_len, u16 dict_len)
> {
> u32 size;
>
> size = sizeof(struct printk_log) + text_len + dict_len;
> - *pad_len = (-size) & (LOG_ALIGN - 1);
> - size += *pad_len;
>
> return size;
> }
> @@ -573,21 +490,39 @@ static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
> static const char trunc_msg[] = "<truncated>";
>
> static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len,
> - u16 *dict_len, u32 *pad_len)
> + u16 *dict_len)
> {
> /*
> * The message should not take the whole buffer. Otherwise, it might
> * get removed too soon.
> */
> u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART;
> + unsigned long max_available;
> +
> + /* determine available text space in the ringbuffer */
> + max_available = prb_unused(prb);
> + if (max_available <= sizeof(struct printk_log))
> + return 0;
> + max_available -= sizeof(struct printk_log);
> +
> + if (max_available < max_text_len)
> + max_text_len = max_available;
> +
> if (*text_len > max_text_len)
> *text_len = max_text_len;
> - /* enable the warning message */
> +
> + /* enable the warning message (if there is room) */
> *trunc_msg_len = strlen(trunc_msg);
> + if (*text_len >= *trunc_msg_len)
> + *text_len -= *trunc_msg_len;
> + else
> + *trunc_msg_len = 0;
> +
> /* disable the "dict" completely */
> *dict_len = 0;
> +
> /* compute the size again, count also the warning message */
> - return msg_used_size(*text_len + *trunc_msg_len, 0, pad_len);
> + return msg_used_size(*text_len + *trunc_msg_len, 0);
> }
>
> /* insert record into the buffer, discard old ones, update heads */
> @@ -596,34 +531,26 @@ static int log_store(u32 caller_id, int facility, int level,
> const char *dict, u16 dict_len,
> const char *text, u16 text_len)
> {
> + struct prb_reserved_entry res_entry;
> struct printk_log *msg;
> - u32 size, pad_len;
> u16 trunc_msg_len = 0;
> + char *rbuf;
> + u32 size;
>
> - /* number of '\0' padding bytes to next message */
> - size = msg_used_size(text_len, dict_len, &pad_len);
> + size = msg_used_size(text_len, dict_len);
>
> - if (log_make_free_space(size)) {
> + rbuf = prb_reserve(&res_entry, prb, size);
> + if (IS_ERR(rbuf)) {
> /* truncate the message if it is too long for empty buffer */
> - size = truncate_msg(&text_len, &trunc_msg_len,
> - &dict_len, &pad_len);
> + size = truncate_msg(&text_len, &trunc_msg_len, &dict_len);
> /* survive when the log buffer is too small for trunc_msg */
> - if (log_make_free_space(size))
> + rbuf = prb_reserve(&res_entry, prb, size);
> + if (IS_ERR(rbuf))
> return 0;
> }
>
> - if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
> - /*
> - * This message + an additional empty header does not fit
> - * at the end of the buffer. Add an empty header with len == 0
> - * to signify a wrap around.
> - */
> - memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
> - log_next_idx = 0;
> - }
> -
> /* fill message */
> - msg = (struct printk_log *)(log_buf + log_next_idx);
> + msg = (struct printk_log *)rbuf;
> memcpy(log_text(msg), text, text_len);
> msg->text_len = text_len;
> if (trunc_msg_len) {
> @@ -642,14 +569,13 @@ static int log_store(u32 caller_id, int facility, int level,
> #ifdef CONFIG_PRINTK_CALLER
> msg->caller_id = caller_id;
> #endif
> - memset(log_dict(msg) + dict_len, 0, pad_len);
> msg->len = size;
>
> /* insert message */
> - log_next_idx += msg->len;
> - log_next_seq++;
> + prb_commit(&res_entry);
>
> - return msg->text_len;
> + /* msg is no longer valid, return the local copy */
> + return text_len;
> }
>
> int dmesg_restrict = IS_ENABLED(CONFIG_SECURITY_DMESG_RESTRICT);
> @@ -770,13 +696,18 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
> return p - buf;
> }
>
> +#define PRINTK_RECORD_MAX (sizeof(struct printk_log) + \
> + CONSOLE_EXT_LOG_MAX + LOG_LINE_MAX + PREFIX_MAX)
> +
> /* /dev/kmsg - userspace message inject/listen interface */
> struct devkmsg_user {
> - u64 seq;
> - u32 idx;
> + u64 last_seq;
> + struct prb_iterator iter;
> struct ratelimit_state rs;
> struct mutex lock;
> char buf[CONSOLE_EXT_LOG_MAX];
> + struct prb_entry entry;
> + char msgbuf[PRINTK_RECORD_MAX];
> };
>
> static __printf(3, 4) __cold
> @@ -859,6 +790,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
> size_t count, loff_t *ppos)
> {
> struct devkmsg_user *user = file->private_data;
> + struct prb_iterator backup_iter;
> struct printk_log *msg;
> size_t len;
> ssize_t ret;
> @@ -871,7 +803,11 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
> return ret;
>
> logbuf_lock_irq();
> - while (user->seq == log_next_seq) {
> +
> + /* make a backup copy in case there is a problem */
> + prb_iter_copy(&backup_iter, &user->iter);
> +
> + if (prb_iter_next_valid_entry(&user->iter) == 0) {
> if (file->f_flags & O_NONBLOCK) {
> ret = -EAGAIN;
> logbuf_unlock_irq();
> @@ -879,43 +815,53 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
> }
>
> logbuf_unlock_irq();
> - ret = wait_event_interruptible(log_wait,
> - user->seq != log_next_seq);
> - if (ret)
> + ret = prb_iter_wait_next_valid_entry(&user->iter);
> + if (ret < 0)
> goto out;
> logbuf_lock_irq();
> }
>
> - if (user->seq < log_first_seq) {
> - /* our last seen message is gone, return error and reset */
> - user->idx = log_first_idx;
> - user->seq = log_first_seq;
> - ret = -EPIPE;
> - logbuf_unlock_irq();
> - goto out;
> + if (user->entry.seq - user->last_seq != 1) {
> + DECLARE_PRINTKRB_SEQENTRY(e);
> + DECLARE_PRINTKRB_ITER(i, prb, &e);
> + u64 last_seq;
> +
> + prb_iter_peek_next_entry(&i, &last_seq);
> + if (last_seq > user->last_seq) {
> + /* a record was missed, return error and reset */
> + prb_iter_sync(&user->iter, &i);
> + user->last_seq = last_seq;
> + ret = -EPIPE;
> + logbuf_unlock_irq();
> + goto out;
> + }
> }
>
> - msg = log_from_idx(user->idx);
> + user->last_seq = user->entry.seq;
> +
> + msg = (struct printk_log *)&user->entry.buffer[0];
> len = msg_print_ext_header(user->buf, sizeof(user->buf),
> - msg, user->seq);
> + msg, user->last_seq);
> len += msg_print_ext_body(user->buf + len, sizeof(user->buf) - len,
> log_dict(msg), msg->dict_len,
> log_text(msg), msg->text_len);
>
> - user->idx = log_next(user->idx);
> - user->seq++;
> logbuf_unlock_irq();
>
> if (len > count) {
> ret = -EINVAL;
> - goto out;
> + goto restore_out;
> }
>
> if (copy_to_user(buf, user->buf, len)) {
> ret = -EFAULT;
> - goto out;
> + goto restore_out;
> }
> ret = len;
> + goto out;
> +restore_out:
> + prb_iter_copy(&user->iter, &backup_iter);
> + user->last_seq = user->entry.seq - 1;
> out:
> mutex_unlock(&user->lock);
> return ret;
> @@ -935,8 +881,7 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
> switch (whence) {
> case SEEK_SET:
> /* the first record */
> - user->idx = log_first_idx;
> - user->seq = log_first_seq;
> + user->last_seq = prb_iter_seek(&user->iter, 0);
> break;
> case SEEK_DATA:
> /*
> @@ -944,13 +889,11 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
> * like issued by 'dmesg -c'. Reading /dev/kmsg itself
> * changes no global state, and does not clear anything.
> */
> - user->idx = clear_idx;
> - user->seq = clear_seq;
> + user->last_seq = prb_iter_seek(&user->iter, clear_last_seq);
> break;
> case SEEK_END:
> /* after the last record */
> - user->idx = log_next_idx;
> - user->seq = log_next_seq;
> + user->last_seq = prb_iter_seek(&user->iter, -1);
> break;
> default:
> ret = -EINVAL;
> @@ -963,19 +906,39 @@ static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
> {
> struct devkmsg_user *user = file->private_data;
> __poll_t ret = 0;
> + u64 last_seq;
>
> if (!user)
> return EPOLLERR|EPOLLNVAL;
>
> - poll_wait(file, &log_wait, wait);
> + poll_wait(file, prb_wait_queue(prb), wait);
>
> logbuf_lock_irq();
> - if (user->seq < log_next_seq) {
> - /* return error when data has vanished underneath us */
> - if (user->seq < log_first_seq)
> - ret = EPOLLIN|EPOLLRDNORM|EPOLLERR|EPOLLPRI;
> - else
> - ret = EPOLLIN|EPOLLRDNORM;
> + if (prb_iter_peek_next_entry(&user->iter, &last_seq)) {
> + ret = EPOLLIN|EPOLLRDNORM;
> + if (last_seq - user->last_seq != 1) {
> + DECLARE_PRINTKRB_SEQENTRY(e);
> + DECLARE_PRINTKRB_ITER(i, prb, &e);
> + u64 last_seq;
> +
> + /*
> + * The sequence number has jumped. This might mean
> + * that the ringbuffer has overtaken the reader,
> + * which would mean that the sequence number previous
> + * the first entry will now be later than the last
> + * entry the reader has seen.
> + *
> + * If instead the sequence number jump is due to
> + * iterating over invalid entries, there is no error.
> + */
> +
> + /* get the sequence number previous the first entry */
> + prb_iter_peek_next_entry(&i, &last_seq);
> +
> + /* return error when data has vanished underneath us */
> + if (last_seq > user->last_seq)
> + ret |= EPOLLERR|EPOLLPRI;
> + }
> }
> logbuf_unlock_irq();
>
> @@ -1008,8 +971,10 @@ static int devkmsg_open(struct inode *inode, struct file *file)
> mutex_init(&user->lock);
>
> logbuf_lock_irq();
> - user->idx = log_first_idx;
> - user->seq = log_first_seq;
> + user->entry.buffer = &user->msgbuf[0];
> + user->entry.buffer_size = sizeof(user->msgbuf);
> + prb_iter_init(&user->iter, prb, &user->entry);
> + prb_iter_peek_next_entry(&user->iter, &user->last_seq);
> logbuf_unlock_irq();
>
> file->private_data = user;
> @@ -1050,11 +1015,8 @@ const struct file_operations kmsg_fops = {
> */
> void log_buf_vmcoreinfo_setup(void)
> {
> - VMCOREINFO_SYMBOL(log_buf);
> - VMCOREINFO_SYMBOL(log_buf_len);
> - VMCOREINFO_SYMBOL(log_first_idx);
> - VMCOREINFO_SYMBOL(clear_idx);
> - VMCOREINFO_SYMBOL(log_next_idx);
> + VMCOREINFO_SYMBOL(printk_rb_static);
> + VMCOREINFO_SYMBOL(printk_rb_dynamic);
> /*
> * Export struct printk_log size and field offsets. User space tools can
> * parse it and detect any changes to structure down the line.
> @@ -1136,13 +1098,36 @@ static void __init log_buf_add_cpu(void)
> static inline void log_buf_add_cpu(void) {}
> #endif /* CONFIG_SMP */
>
> +static void __init add_to_rb(struct printk_ringbuffer *rb,
> + struct prb_entry *e)
> +{
> + struct printk_log *msg = (struct printk_log *)&e->buffer[0];
> + struct prb_reserved_entry re;
> + int size;
> + char *b;
> +
> + size = sizeof(*msg) + msg->text_len + msg->dict_len;
> +
> + b = prb_reserve(&re, rb, size);
> + if (!IS_ERR(b)) {
> + memcpy(b, msg, size);
> + prb_commit(&re);
> + }
> +}
> +
> +static char setup_buf[PRINTK_RECORD_MAX] __initdata;
> +
> void __init setup_log_buf(int early)
> {
> + struct prb_desc *new_descs;
> + struct prb_iterator i;
> unsigned long flags;
> + struct prb_entry e;
> char *new_log_buf;
> unsigned int free;
> + int l;
>
> - if (log_buf != __log_buf)
> + if (prb != &printk_rb_static)
> return;
>
> if (!early && !new_log_buf_len)
> @@ -1151,19 +1136,47 @@ void __init setup_log_buf(int early)
> if (!new_log_buf_len)
> return;
>
> + if (!is_power_of_2(new_log_buf_len))
> + return;
> +
> new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
> if (unlikely(!new_log_buf)) {
> - pr_err("log_buf_len: %lu bytes not available\n",
> - new_log_buf_len);
> + pr_err("log_buf_len: %lu data bytes not available\n",
> + new_log_buf_len);
> + return;
> + }
> +
> + new_descs = memblock_alloc((new_log_buf_len >> PRB_AVGBITS) *
> + sizeof(struct prb_desc), LOG_ALIGN);
> + if (unlikely(!new_descs)) {
> + pr_err("log_buf_len: %lu desc bytes not available\n",
> + new_log_buf_len >> PRB_AVGBITS);
> + memblock_free(__pa(new_log_buf), new_log_buf_len);
> return;
> }
>
> + e.buffer = &setup_buf[0];
> + e.buffer_size = sizeof(setup_buf);
> +
> logbuf_lock_irqsave(flags);
> - log_buf_len = new_log_buf_len;
> - log_buf = new_log_buf;
> - new_log_buf_len = 0;
> - free = __LOG_BUF_LEN - log_next_idx;
> - memcpy(log_buf, __log_buf, __LOG_BUF_LEN);
> +
> + prb_init(&printk_rb_dynamic, new_log_buf,
> + bits_per(new_log_buf_len) - 1, new_descs,
> + (bits_per(new_log_buf_len) - 1) - PRB_AVGBITS, &log_wait);
> +
> + free = __LOG_BUF_LEN;
> + prb_for_each_entry(&i, &printk_rb_static, &e, l) {
> + add_to_rb(&printk_rb_dynamic, &e);
> + free -= l;
> + }
> +
> + prb_iter_init(&syslog_iter, &printk_rb_dynamic, &syslog_entry);
> + prb_iter_init(&console_iter, &printk_rb_dynamic, &console_entry);
> +
> + prb_iter_seek(&console_iter, e.seq);
> +
> + prb = &printk_rb_dynamic;
> +
> logbuf_unlock_irqrestore(flags);
>
> pr_info("log_buf_len: %u bytes\n", log_buf_len);
> @@ -1340,29 +1353,43 @@ static size_t msg_print_text(const struct printk_log *msg, bool syslog,
> static int syslog_print(char __user *buf, int size)
> {
> char *text;
> + struct prb_iterator iter;
> struct printk_log *msg;
> + struct prb_entry e;
> + char *msgbuf;
> int len = 0;
>
> text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
> if (!text)
> return -ENOMEM;
> + msgbuf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
> + if (!msgbuf) {
> + kfree(text);
> + return -ENOMEM;
> + }
> +
> + e.buffer = msgbuf;
> + e.buffer_size = PRINTK_RECORD_MAX;
> + prb_iter_init(&iter, prb, &e);
> + msg = (struct printk_log *)msgbuf;
>
> while (size > 0) {
> size_t n;
> size_t skip;
>
> logbuf_lock_irq();
> - if (syslog_seq < log_first_seq) {
> - /* messages are gone, move to first one */
> - syslog_seq = log_first_seq;
> - syslog_idx = log_first_idx;
> - syslog_partial = 0;
> - }
> - if (syslog_seq == log_next_seq) {
> + prb_iter_sync(&iter, &syslog_iter);
> + if (prb_iter_next_valid_entry(&iter) == 0) {
> logbuf_unlock_irq();
> break;
> }
>
> + if (e.seq - syslog_last_seq != 1) {
> + /* messages are gone, move to first one */
> + syslog_last_seq = e.seq - 1;
> + syslog_partial = 0;
> + }
> +
> /*
> * To keep reading/counting partial line consistent,
> * use printk_time value as of the beginning of a line.
> @@ -1371,16 +1398,15 @@ static int syslog_print(char __user *buf, int size)
> syslog_time = printk_time;
>
> skip = syslog_partial;
> - msg = log_from_idx(syslog_idx);
> n = msg_print_text(msg, true, syslog_time, text,
> LOG_LINE_MAX + PREFIX_MAX);
> if (n - syslog_partial <= size) {
> /* message fits into buffer, move forward */
> - syslog_idx = log_next(syslog_idx);
> - syslog_seq++;
> + prb_iter_sync(&syslog_iter, &iter);
> + syslog_last_seq++;
> n -= syslog_partial;
> syslog_partial = 0;
> - } else if (!len){
> + } else if (!len) {
> /* partial read(), remember position */
> n = size;
> syslog_partial += n;
> @@ -1402,22 +1428,73 @@ static int syslog_print(char __user *buf, int size)
> buf += n;
> }
>
> + kfree(msgbuf);
> kfree(text);
> return len;
> }
>
> +/**
> + * count_remaining() - Count the text bytes in following entries.
> + *
> + * @iter: The iterator to use for counting.
> + *
> + * @until_seq: A sequence number to stop counting at.
> + * The entry with this sequence number is not counted.
> + *
> + * Note that although this function will not modify @iter, it does make
> + * use of the prb_entry of @iter.
> + *
> + * Return: The number of bytes of text counted.
> + */
> +static int count_remaining(struct prb_iterator *iter, const u64 until_seq)
> +{
> + bool time = syslog_partial ? syslog_time : printk_time;
> + struct printk_log *msg;
> + struct prb_iterator i;
> + struct prb_entry *e;
> + int len = 0;
> +
> + prb_iter_copy(&i, iter);
> + e = prb_iter_entry(&i);
> + msg = (struct printk_log *)&e->buffer[0];
> +
> + for (;;) {
> + if (prb_iter_next_valid_entry(&i) == 0)
> + break;
> +
> + if (e->seq >= until_seq)
> + break;
> +
> + len += msg_print_text(msg, true, time, NULL, 0);
> + time = printk_time;
> + }
> +
> + return len;
> +}
> +
> static int syslog_print_all(char __user *buf, int size, bool clear)
> {
> + struct prb_iterator iter;
> + struct printk_log *msg;
> + struct prb_entry e;
> + char *msgbuf;
> + int textlen;
> char *text;
> int len = 0;
> - u64 next_seq;
> - u64 seq;
> - u32 idx;
> bool time;
>
> text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
> if (!text)
> return -ENOMEM;
> + msgbuf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
> + if (!msgbuf) {
> + kfree(text);
> + return -ENOMEM;
> + }
> +
> + e.buffer = msgbuf;
> + e.buffer_size = PRINTK_RECORD_MAX;
> + msg = (struct printk_log *)msgbuf;
>
> time = printk_time;
> logbuf_lock_irq();
> @@ -1425,73 +1502,65 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
> * Find first record that fits, including all following records,
> * into the user-provided buffer for this dump.
> */
> - seq = clear_seq;
> - idx = clear_idx;
> - while (seq < log_next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> -
> - len += msg_print_text(msg, true, time, NULL, 0);
> - idx = log_next(idx);
> - seq++;
> - }
> + prb_iter_init(&iter, prb, &e);
> + prb_iter_seek(&iter, clear_last_seq);
> + len = count_remaining(&iter, -1);
>
> - /* move first record forward until length fits into the buffer */
> - seq = clear_seq;
> - idx = clear_idx;
> - while (len > size && seq < log_next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> + /* move iterator forward until text fits into the buffer */
> + while (len > size) {
> + if (prb_iter_next_valid_entry(&iter) == 0)
> + break;
>
> len -= msg_print_text(msg, true, time, NULL, 0);
> - idx = log_next(idx);
> - seq++;
> }
>
> - /* last message fitting into this dump */
> - next_seq = log_next_seq;
> -
> + /* copy the rest of the messages into the buffer */
> len = 0;
> - while (len >= 0 && seq < next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> - int textlen = msg_print_text(msg, true, time, text,
> - LOG_LINE_MAX + PREFIX_MAX);
> + for (;;) {
> + if (prb_iter_next_valid_entry(&iter) == 0)
> + break;
> +
> + textlen = msg_print_text(msg, true, time, text,
> + LOG_LINE_MAX + PREFIX_MAX);
>
> - idx = log_next(idx);
> - seq++;
> + if (len + textlen > size)
> + break;
>
> logbuf_unlock_irq();
> - if (copy_to_user(buf + len, text, textlen))
> + if (copy_to_user(buf + len, text, textlen)) {
> len = -EFAULT;
> - else
> - len += textlen;
> + logbuf_lock_irq();
> + break;
> + }
> logbuf_lock_irq();
>
> - if (seq < log_first_seq) {
> - /* messages are gone, move to next one */
> - seq = log_first_seq;
> - idx = log_first_idx;
> - }
> - }
> + len += textlen;
>
> - if (clear) {
> - clear_seq = log_next_seq;
> - clear_idx = log_next_idx;
> + if (clear)
> + clear_last_seq = e.seq;
> }
> logbuf_unlock_irq();
>
> + kfree(msgbuf);
> kfree(text);
> return len;
> }
>
> static void syslog_clear(void)
> {
> + DECLARE_PRINTKRB_SEQENTRY(e);
> + DECLARE_PRINTKRB_ITER(i, prb, &e);
> +
> logbuf_lock_irq();
> - clear_seq = log_next_seq;
> - clear_idx = log_next_idx;
> + prb_iter_sync(&i, &syslog_iter);
> + clear_last_seq = prb_iter_seek(&i, -1);
> logbuf_unlock_irq();
> }
>
> int do_syslog(int type, char __user *buf, int len, int source)
> {
> + DECLARE_PRINTKRB_SEQENTRY(e);
> + DECLARE_PRINTKRB_ITER(iter, prb, &e);
> bool clear = false;
> static int saved_console_loglevel = LOGLEVEL_DEFAULT;
> int error;
> @@ -1512,10 +1581,15 @@ int do_syslog(int type, char __user *buf, int len, int source)
> return 0;
> if (!access_ok(buf, len))
> return -EFAULT;
> - error = wait_event_interruptible(log_wait,
> - syslog_seq != log_next_seq);
> - if (error)
> +
> + logbuf_lock_irq();
> + prb_iter_sync(&iter, &syslog_iter);
> + logbuf_unlock_irq();
> +
> + error = prb_iter_wait_next_valid_entry(&iter);
> + if (error < 0)
> return error;
> +
> error = syslog_print(buf, len);
> break;
> /* Read/clear last kernel messages */
> @@ -1562,33 +1636,15 @@ int do_syslog(int type, char __user *buf, int len, int source)
> /* Number of chars in the log buffer */
> case SYSLOG_ACTION_SIZE_UNREAD:
> logbuf_lock_irq();
> - if (syslog_seq < log_first_seq) {
> - /* messages are gone, move to first one */
> - syslog_seq = log_first_seq;
> - syslog_idx = log_first_idx;
> - syslog_partial = 0;
> - }
> if (source == SYSLOG_FROM_PROC) {
> /*
> * Short-cut for poll(/"proc/kmsg") which simply checks
> - * for pending data, not the size; return the count of
> - * records, not the length.
> + * for pending data, not the size; return true if there
> + * is a pending record
> */
> - error = log_next_seq - syslog_seq;
> + error = prb_iter_peek_next_entry(&syslog_iter, NULL);
> } else {
> - u64 seq = syslog_seq;
> - u32 idx = syslog_idx;
> - bool time = syslog_partial ? syslog_time : printk_time;
> -
> - while (seq < log_next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> -
> - error += msg_print_text(msg, true, time, NULL,
> - 0);
> - time = printk_time;
> - idx = log_next(idx);
> - seq++;
> - }
> + error = count_remaining(&syslog_iter, -1);
> error -= syslog_partial;
> }
> logbuf_unlock_irq();
> @@ -1948,7 +2004,6 @@ asmlinkage int vprintk_emit(int facility, int level,
> int printed_len;
> bool in_sched = false, pending_output;
> unsigned long flags;
> - u64 curr_log_seq;
>
> /* Suppress unimportant messages after panic happens */
> if (unlikely(suppress_printk))
> @@ -1964,9 +2019,8 @@ asmlinkage int vprintk_emit(int facility, int level,
>
> /* This stops the holder of console_sem just where we want him */
> logbuf_lock_irqsave(flags);
> - curr_log_seq = log_next_seq;
> printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
> - pending_output = (curr_log_seq != log_next_seq);
> + pending_output = prb_iter_peek_next_entry(&console_iter, NULL);
> logbuf_unlock_irqrestore(flags);
>
> /* If called from the scheduler, we can not call up(). */
> @@ -2056,18 +2110,15 @@ EXPORT_SYMBOL(printk);
> #define PREFIX_MAX 0
> #define printk_time false
>
> -static u64 syslog_seq;
> -static u32 syslog_idx;
> -static u64 console_seq;
> -static u32 console_idx;
> +DECLARE_PRINTKRB_SEQENTRY(syslog_entry);
> +DECLARE_PRINTKRB_ITER(syslog_iter, NULL, NULL);
> +DECLARE_PRINTKRB_SEQENTRY(console_entry);
> +DECLARE_PRINTKRB_ITER(console_iter, NULL, NULL);
> +
> +static u64 console_last_seq;
> static u64 exclusive_console_stop_seq;
> -static u64 log_first_seq;
> -static u32 log_first_idx;
> -static u64 log_next_seq;
> static char *log_text(const struct printk_log *msg) { return NULL; }
> static char *log_dict(const struct printk_log *msg) { return NULL; }
> -static struct printk_log *log_from_idx(u32 idx) { return NULL; }
> -static u32 log_next(u32 idx) { return 0; }
> static ssize_t msg_print_ext_header(char *buf, size_t size,
> struct printk_log *msg,
> u64 seq) { return 0; }
> @@ -2402,36 +2453,32 @@ void console_unlock(void)
>
> printk_safe_enter_irqsave(flags);
> raw_spin_lock(&logbuf_lock);
> - if (console_seq < log_first_seq) {
> - len = sprintf(text,
> - "** %llu printk messages dropped **\n",
> - log_first_seq - console_seq);
> +skip:
> + if (prb_iter_next_valid_entry(&console_iter) == 0)
> + break;
>
> - /* messages are gone, move to first one */
> - console_seq = log_first_seq;
> - console_idx = log_first_idx;
> + if (console_entry.seq - console_last_seq != 1) {
> + len = sprintf(text,
> + "** %llu printk messages dropped **\n",
> + console_entry.seq - (console_last_seq + 1));
> } else {
> len = 0;
> }
> -skip:
> - if (console_seq == log_next_seq)
> - break;
> + console_last_seq = console_entry.seq;
>
> - msg = log_from_idx(console_idx);
> + msg = (struct printk_log *)&console_entry.buffer[0];
> if (suppress_message_printing(msg->level)) {
> /*
> * Skip record we have buffered and already printed
> * directly to the console when we received it, and
> * record that has level above the console loglevel.
> */
> - console_idx = log_next(console_idx);
> - console_seq++;
> goto skip;
> }
>
> /* Output to all consoles once old messages replayed. */
> if (unlikely(exclusive_console &&
> - console_seq >= exclusive_console_stop_seq)) {
> + console_last_seq > exclusive_console_stop_seq)) {
> exclusive_console = NULL;
> }
>
> @@ -2441,14 +2488,12 @@ void console_unlock(void)
> if (nr_ext_console_drivers) {
> ext_len = msg_print_ext_header(ext_text,
> sizeof(ext_text),
> - msg, console_seq);
> + msg, console_last_seq);
> ext_len += msg_print_ext_body(ext_text + ext_len,
> sizeof(ext_text) - ext_len,
> log_dict(msg), msg->dict_len,
> log_text(msg), msg->text_len);
> }
> - console_idx = log_next(console_idx);
> - console_seq++;
> raw_spin_unlock(&logbuf_lock);
>
> /*
> @@ -2487,7 +2532,7 @@ void console_unlock(void)
> * flush, no worries.
> */
> raw_spin_lock(&logbuf_lock);
> - retry = console_seq != log_next_seq;
> + retry = prb_iter_peek_next_entry(&console_iter, NULL);
> raw_spin_unlock(&logbuf_lock);
> printk_safe_exit_irqrestore(flags);
>
> @@ -2556,8 +2601,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
> unsigned long flags;
>
> logbuf_lock_irqsave(flags);
> - console_seq = log_first_seq;
> - console_idx = log_first_idx;
> + console_last_seq = 0;
> + prb_iter_seek(&console_iter, 0);
> logbuf_unlock_irqrestore(flags);
> }
> console_unlock();
> @@ -2760,8 +2805,7 @@ void register_console(struct console *newcon)
> * for us.
> */
> logbuf_lock_irqsave(flags);
> - console_seq = syslog_seq;
> - console_idx = syslog_idx;
> + prb_iter_sync(&console_iter, &syslog_iter);
> /*
> * We're about to replay the log buffer. Only do this to the
> * just-registered console to avoid excessive message spam to
> @@ -2772,7 +2816,8 @@ void register_console(struct console *newcon)
> * ignores console_lock.
> */
> exclusive_console = newcon;
> - exclusive_console_stop_seq = console_seq;
> + exclusive_console_stop_seq = console_last_seq;
> + console_last_seq = 0;
> logbuf_unlock_irqrestore(flags);
> }
> console_unlock();
> @@ -3033,6 +3078,8 @@ EXPORT_SYMBOL(printk_timed_ratelimit);
> static DEFINE_SPINLOCK(dump_list_lock);
> static LIST_HEAD(dump_list);
>
> +static char kmsg_dump_msgbuf[PRINTK_RECORD_MAX];
> +
> /**
> * kmsg_dump_register - register a kernel log dumper.
> * @dumper: pointer to the kmsg_dumper structure
> @@ -3102,7 +3149,6 @@ module_param_named(always_kmsg_dump, always_kmsg_dump, bool, S_IRUGO | S_IWUSR);
> void kmsg_dump(enum kmsg_dump_reason reason)
> {
> struct kmsg_dumper *dumper;
> - unsigned long flags;
>
> if ((reason > KMSG_DUMP_OOPS) && !always_kmsg_dump)
> return;
> @@ -3115,12 +3161,7 @@ void kmsg_dump(enum kmsg_dump_reason reason)
> /* initialize iterator with data about the stored records */
> dumper->active = true;
>
> - logbuf_lock_irqsave(flags);
> - dumper->cur_seq = clear_seq;
> - dumper->cur_idx = clear_idx;
> - dumper->next_seq = log_next_seq;
> - dumper->next_idx = log_next_idx;
> - logbuf_unlock_irqrestore(flags);
> + kmsg_dump_rewind(dumper);
>
> /* invoke dumper which will iterate over records */
> dumper->dump(dumper, reason);
> @@ -3153,28 +3194,27 @@ void kmsg_dump(enum kmsg_dump_reason reason)
> bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
> char *line, size_t size, size_t *len)
> {
> - struct printk_log *msg;
> + struct prb_entry e = {
> + .buffer = &kmsg_dump_msgbuf[0],
> + .buffer_size = sizeof(kmsg_dump_msgbuf),
> + };
> + DECLARE_PRINTKRB_ITER(i, prb, &e);
> + struct printk_log *msg = (struct printk_log *)&e.buffer[0];
> size_t l = 0;
> bool ret = false;
>
> if (!dumper->active)
> goto out;
>
> - if (dumper->cur_seq < log_first_seq) {
> - /* messages are gone, move to first available one */
> - dumper->cur_seq = log_first_seq;
> - dumper->cur_idx = log_first_idx;
> - }
> + dumper->last_seq = prb_iter_seek(&i, dumper->last_seq);
>
> /* last entry */
> - if (dumper->cur_seq >= log_next_seq)
> + if (prb_iter_next_valid_entry(&i) == 0)
> goto out;
>
> - msg = log_from_idx(dumper->cur_idx);
> l = msg_print_text(msg, syslog, printk_time, line, size);
>
> - dumper->cur_idx = log_next(dumper->cur_idx);
> - dumper->cur_seq++;
> + dumper->last_seq = e.seq;
> ret = true;
> out:
> if (len)
> @@ -3235,11 +3275,14 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_line);
> bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
> char *buf, size_t size, size_t *len)
> {
> + struct prb_entry e = {
> + .buffer = &kmsg_dump_msgbuf[0],
> + .buffer_size = sizeof(kmsg_dump_msgbuf),
> + };
> + DECLARE_PRINTKRB_ITER(i, prb, &e);
> + struct printk_log *msg = (struct printk_log *)&e.buffer[0];
> unsigned long flags;
> - u64 seq;
> - u32 idx;
> - u64 next_seq;
> - u32 next_idx;
> + u64 next_until_seq;
> size_t l = 0;
> bool ret = false;
> bool time = printk_time;
> @@ -3248,55 +3291,45 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
> goto out;
>
> logbuf_lock_irqsave(flags);
> - if (dumper->cur_seq < log_first_seq) {
> - /* messages are gone, move to first available one */
> - dumper->cur_seq = log_first_seq;
> - dumper->cur_idx = log_first_idx;
> - }
> +
> + if (!dumper->until_seq)
> + dumper->until_seq = -1;
> +
> + dumper->last_seq = prb_iter_seek(&i, dumper->last_seq);
>
> /* last entry */
> - if (dumper->cur_seq >= dumper->next_seq) {
> + if (!prb_iter_peek_next_entry(&i, NULL)) {
> logbuf_unlock_irqrestore(flags);
> goto out;
> }
>
> /* calculate length of entire buffer */
> - seq = dumper->cur_seq;
> - idx = dumper->cur_idx;
> - while (seq < dumper->next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> + l = count_remaining(&i, dumper->until_seq);
>
> - l += msg_print_text(msg, true, time, NULL, 0);
> - idx = log_next(idx);
> - seq++;
> - }
> -
> - /* move first record forward until length fits into the buffer */
> - seq = dumper->cur_seq;
> - idx = dumper->cur_idx;
> - while (l > size && seq < dumper->next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> + if (l <= size) {
> + /* last message in next iteration */
> + next_until_seq = dumper->last_seq;
> + } else {
> + /* move iterator forward until text fits into the buffer */
> + while (l > size) {
> + prb_iter_next_valid_entry(&i);
> + l -= msg_print_text(msg, true, time, NULL, 0);
> + }
>
> - l -= msg_print_text(msg, true, time, NULL, 0);
> - idx = log_next(idx);
> - seq++;
> + /* last message in next iteration */
> + next_until_seq = e.seq + 1;
> }
>
> - /* last message in next interation */
> - next_seq = seq;
> - next_idx = idx;
> -
> + /* copy messages to buffer */
> l = 0;
> - while (seq < dumper->next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> -
> + for (;;) {
> + prb_iter_next_valid_entry(&i);
> + if (e.seq >= dumper->until_seq)
> + break;
> l += msg_print_text(msg, syslog, time, buf + l, size - l);
> - idx = log_next(idx);
> - seq++;
> }
>
> - dumper->next_seq = next_seq;
> - dumper->next_idx = next_idx;
> + dumper->until_seq = next_until_seq;
> ret = true;
> logbuf_unlock_irqrestore(flags);
> out:
> @@ -3318,10 +3351,8 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);
> */
> void kmsg_dump_rewind_nolock(struct kmsg_dumper *dumper)
> {
> - dumper->cur_seq = clear_seq;
> - dumper->cur_idx = clear_idx;
> - dumper->next_seq = log_next_seq;
> - dumper->next_idx = log_next_idx;
> + dumper->last_seq = clear_last_seq;
> + dumper->until_seq = 0;
> }
>
> /**
> diff --git a/kernel/printk/ringbuffer.h b/kernel/printk/ringbuffer.h
> index 70cb9ad284d4..02b4c53e287e 100644
> --- a/kernel/printk/ringbuffer.h
> +++ b/kernel/printk/ringbuffer.h
> @@ -134,6 +134,8 @@ struct prb_iterator {
> unsigned long next_id;
> };
>
> +#ifdef CONFIG_PRINTK
> +
> /* writer interface */
> char *prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
> unsigned long size);
> @@ -163,6 +165,28 @@ struct nl_node *prb_desc_node(unsigned long id, void *arg);
> bool prb_desc_busy(unsigned long id, void *arg);
> struct dr_desc *prb_getdesc(unsigned long id, void *arg);
>
> +#else /* CONFIG_PRINTK */
> +
> +#define prb_reserve(e, rb, size) NULL
> +#define prb_commit(e)
> +#define prb_iter_init(iter, rb, e)
> +#define prb_iter_next_valid_entry(iter) 0
> +#define prb_iter_wait_next_valid_entry(iter) -ERESTARTSYS
> +#define prb_iter_sync(dst, src)
> +#define prb_iter_copy(dst, src)
> +#define prb_iter_peek_next_entry(iter, last_seq) false
> +#define prb_iter_seek(iter, last_seq) 0
> +#define prb_wait_queue(rb) NULL
> +#define prb_iter_entry(iter) NULL
> +#define prb_getfail(rb) 0
> +#define prb_init(rb, data, data_size_bits, descs, desc_count_bits, waitq)
> +#define prb_unused(rb) 0
> +#define prb_desc_node NULL
> +#define prb_desc_busy NULL
> +#define prb_getdesc NULL
> +
> +#endif /* CONFIG_PRINTK */
> +
> /**
> * DECLARE_PRINTKRB() - Declare a printk ringbuffer.
> *
> --
> 2.20.1
>
Thanks
Dave
On 08/16/19 at 01:46pm, Dave Young wrote:
> John, can you cc kexec list for your later series?
>
> On 08/08/19 at 12:32am, John Ogness wrote:
> > This is a major change because the API (and underlying workings)
> > of the new ringbuffer are completely different than the previous
> > ringbuffer. Since there are several components of the printk
> > infrastructure that use the ringbuffer API (console, /dev/kmsg,
> > syslog, kmsg_dump), there are quite a few changes throughout the
> > printk implementation.
> >
> > This is also a conservative change because it continues to use the
> > logbuf_lock raw spinlock even though the new ringbuffer is lockless.
> >
> > The externally visible changes are:
> >
> > 1. The exported vmcore info has changed:
> >
> > - VMCOREINFO_SYMBOL(log_buf);
> > - VMCOREINFO_SYMBOL(log_buf_len);
> > - VMCOREINFO_SYMBOL(log_first_idx);
> > - VMCOREINFO_SYMBOL(clear_idx);
> > - VMCOREINFO_SYMBOL(log_next_idx);
> > + VMCOREINFO_SYMBOL(printk_rb_static);
> > + VMCOREINFO_SYMBOL(printk_rb_dynamic);
>
> I assumed this needs some userspace work in kexec, how did you test
> them?
>
> makedumpfile should need changes to dump the kernel log.
>
> Also kexec-tools includes a vmcore-dmesg.c to extrace dmesg from
> /proc/vmcore.
>
> >
> > 2. For the CONFIG_PPC_POWERNV powerpc platform, kernel log buffer
> > registration is no longer available because there is no longer
> > a single contigous block of memory to represent all of the
> > ringbuffer.
> >
> > Signed-off-by: John Ogness <[email protected]>
> > ---
> > arch/powerpc/platforms/powernv/opal.c | 22 +-
> > include/linux/kmsg_dump.h | 6 +-
> > include/linux/printk.h | 12 -
> > kernel/printk/printk.c | 745 ++++++++++++++------------
> > kernel/printk/ringbuffer.h | 24 +
> > 5 files changed, 415 insertions(+), 394 deletions(-)
> >
[snip]
Seems kexec list has 40k limitation for msg body. Simon and David, maybe it is
too small?
Thanks
Dave
On 2019-08-16, Dave Young <[email protected]> wrote:
> John, can you cc kexec list for your later series?
Sure.
> On 08/08/19 at 12:32am, John Ogness wrote:
>> This is a major change because the API (and underlying workings) of
>> the new ringbuffer are completely different than the previous
>> ringbuffer. Since there are several components of the printk
>> infrastructure that use the ringbuffer API (console, /dev/kmsg,
>> syslog, kmsg_dump), there are quite a few changes throughout the
>> printk implementation.
>>
>> This is also a conservative change because it continues to use the
>> logbuf_lock raw spinlock even though the new ringbuffer is lockless.
>>
>> The externally visible changes are:
>>
>> 1. The exported vmcore info has changed:
>>
>> - VMCOREINFO_SYMBOL(log_buf);
>> - VMCOREINFO_SYMBOL(log_buf_len);
>> - VMCOREINFO_SYMBOL(log_first_idx);
>> - VMCOREINFO_SYMBOL(clear_idx);
>> - VMCOREINFO_SYMBOL(log_next_idx);
>> + VMCOREINFO_SYMBOL(printk_rb_static);
>> + VMCOREINFO_SYMBOL(printk_rb_dynamic);
>
> I assumed this needs some userspace work in kexec, how did you test
> them?
I did not test any direct userspace access to the ringbuffer structures.
> makedumpfile should need changes to dump the kernel log.
>
> Also kexec-tools includes a vmcore-dmesg.c to extrace dmesg from
> /proc/vmcore.
Thanks for the heads up. I'll take a look at it. The code changes should
be straight forward. I expect there will need to be backwards
compatibility. Perhaps it would check first for "printk_rb_*" then
fallback to "log_*"?
John Ogness
DECLARE_PRINTKRB() now requires a wait queue argument, used
by the blocking reader interface.
Signed-off-by: John Ogness <[email protected]>
---
For RFCv4 the macro prototype changed. The fixup for the
test module didn't make it into the series.
kernel/printk/test_prb.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/printk/test_prb.c b/kernel/printk/test_prb.c
index 0157bbdf051f..49bcf831af7e 100644
--- a/kernel/printk/test_prb.c
+++ b/kernel/printk/test_prb.c
@@ -6,8 +6,11 @@
#include <linux/delay.h>
#include <linux/random.h>
#include <linux/slab.h>
+#include <linux/wait.h>
#include "ringbuffer.h"
+DECLARE_WAIT_QUEUE_HEAD(test_wait);
+
/*
* This is a test module that starts "num_online_cpus() - 1" writer threads
* and 1 reader thread. The writer threads each write strings of varying
@@ -63,7 +66,7 @@ static void dump_rb(struct printk_ringbuffer *rb)
trace_printk("END full dump\n");
}
-DECLARE_PRINTKRB(test_rb, 5, 7);
+DECLARE_PRINTKRB(test_rb, 5, 7, &test_wait);
static int prbtest_writer(void *data)
{
--
2.20.1
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> --- /dev/null
> +++ b/kernel/printk/numlist.c
> +/**
> + * numlist_pop() - Remove the oldest node from the list.
> + *
> + * @nl: The numbered list from which to remove the tail node.
> + *
> + * The tail node can only be removed if two conditions are satisfied:
> + *
> + * * The node is not the only node on the list.
> + * * The node is not busy.
> + *
> + * If, during this function, another task removes the tail, this function
> + * will try again with the new tail.
> + *
> + * Return: The removed node or NULL if the tail node cannot be removed.
> + */
> +struct nl_node *numlist_pop(struct numlist *nl)
> +{
> + unsigned long tail_id;
> + unsigned long next_id;
> + unsigned long r;
> +
> + /* cA: #1 */
> + tail_id = atomic_long_read(&nl->tail_id);
> +
> + for (;;) {
> + /* cB */
> + while (!numlist_read(nl, tail_id, NULL, &next_id)) {
> + /*
> + * @tail_id is invalid. Try again with an
> + * updated value.
> + */
> +
> + cpu_relax();
> +
> + /* cA: #2 */
> + tail_id = atomic_long_read(&nl->tail_id);
> + }
The above while-cycle basically does the same as the upper for-cycle.
It tries again with freshly loaded nl->tail_id. The following code
looks easier to follow:
do {
tail_id = atomic_long_read(&nl->tail_id);
/*
* Read might fail when the tail node has been removed
* and reused in parallel.
*/
if (!numlist_read(nl, tail_id, NULL, &next_id))
continue;
/* Make sure the node is not the only node on the list. */
if (next_id == tail_id)
return NULL;
/* cC: Make sure the node is not busy. */
if (nl->busy(tail_id, nl->busy_arg))
return NULL;
while (atomic_long_cmpxchg_relaxed(&nl->tail_id, tail_id, next_id) !=
tail_id);
/* This should never fail. The node is ours. */
return nl->node(tail_id, nl->node_arg);
> + /* Make sure the node is not the only node on the list. */
> + if (next_id == tail_id)
> + return NULL;
> +
> + /*
> + * cC:
> + *
> + * Make sure the node is not busy.
> + */
> + if (nl->busy(tail_id, nl->busy_arg))
> + return NULL;
> +
> + r = atomic_long_cmpxchg_relaxed(&nl->tail_id,
> + tail_id, next_id);
> + if (r == tail_id)
> + break;
> +
> + /* cA: #3 */
> + tail_id = r;
> + }
> +
> + return nl->node(tail_id, nl->node_arg);
If I get it correctly, the above nl->node() call should never fail.
The node has been removed from the list and nobody else could
touch it. It is pretty useful information and it might be worth
mention it in a comment.
Best Regards,
Petr
PS: I am scratching my head around the patchset. I'll try Peter's
approach and comment independent things is separate mails.
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> --- /dev/null
> +++ b/kernel/printk/ringbuffer.c
> +/**
> + * assign_desc() - Assign a descriptor to the caller.
> + *
> + * @e: The entry structure to store the assigned descriptor to.
> + *
> + * Find an available descriptor to assign to the caller. First it is checked
> + * if the tail descriptor from the committed list can be recycled. If not,
> + * perhaps a never-used descriptor is available. Otherwise, data blocks will
> + * be invalidated until the tail descriptor from the committed list can be
> + * recycled.
> + *
> + * Assigned descriptors are invalid until data has been reserved for them.
> + *
> + * Return: true if a descriptor was assigned, otherwise false.
> + *
> + * This will only fail if it was not possible to invalidate data blocks in
> + * order to recycle a descriptor. This can happen if a writer has reserved but
> + * not yet committed data and that reserved data is currently the oldest data.
> + */
> +static bool assign_desc(struct prb_reserved_entry *e)
> +{
> + struct printk_ringbuffer *rb = e->rb;
> + struct prb_desc *d;
> + struct nl_node *n;
> + unsigned long i;
> +
> + for (;;) {
> + /*
> + * jA:
> + *
> + * Try to recycle a descriptor on the committed list.
> + */
> + n = numlist_pop(&rb->nl);
> + if (n) {
> + d = container_of(n, struct prb_desc, list);
> + break;
> + }
> +
> + /* Fallback to static never-used descriptors. */
> + if (atomic_read(&rb->desc_next_unused) < DESCS_COUNT(rb)) {
> + i = atomic_fetch_inc(&rb->desc_next_unused);
> + if (i < DESCS_COUNT(rb)) {
> + d = &rb->descs[i];
> + atomic_long_set(&d->id, i);
> + break;
> + }
> + }
> +
> + /*
> + * No descriptor available. Make one available for recycling
> + * by invalidating data (which some descriptor will be
> + * referencing).
> + */
> + if (!dataring_pop(&rb->dr))
> + return false;
> + }
> +
> + /*
> + * jB:
> + *
> + * Modify the descriptor ID so that users of the descriptor see that
> + * it has been recycled. A _release() is used so that prb_getdesc()
> + * callers can see all data ringbuffer updates after issuing a
> + * pairing smb_rmb(). See iA for details.
> + *
> + * Memory barrier involvement:
> + *
> + * If dB->iA reads from jB, then dI reads the same value as
> + * jA->cD->hA.
> + *
> + * Relies on:
> + *
> + * RELEASE from jA->cD->hA to jB
> + * matching
> + * RMB between dB->iA and dI
> + */
> + atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
> + DESCS_COUNT(rb));
atomic_long_set_release() might be a bit confusing here.
There is no related acquire.
In fact, d->id manipulation has barriers from both sides:
+ smp_rmb() before so that all reads are finished before
the id is updated (release)
+ smp_wmb() after so that the new ID is written before other
related values are modified (acquire).
The smp_wmb() barrier is in prb_reserve(). I would move it here.
Best Regards,
Petr
> +
> + e->desc = d;
> + return true;
> +}
> +
> +/**
> + * prb_reserve() - Reserve data in the ringbuffer.
> + *
> + * @e: The entry structure to setup.
> + *
> + * @rb: The ringbuffer to reserve data in.
> + *
> + * @size: The size of the data to reserve.
> + *
> + * This is the public function available to writers to reserve data.
> + *
> + * Context: Any context. Disables local interrupts on success.
> + * Return: A pointer to the reserved data or an ERR_PTR if data could not be
> + * reserved.
> + *
> + * If the provided size is legal, this will only fail if it was not possible
> + * to invalidate the oldest data block. This can happen if a writer has
> + * reserved but not yet committed data and that reserved data is currently
> + * the oldest data.
> + *
> + * The ERR_PTR values and their meaning:
> + *
> + * * -EINVAL: illegal @size value
> + * * -EBUSY: failed to reserve a descriptor (@fail count incremented)
> + * * -ENOMEM: failed to reserve data (invalid descriptor committed)
> + */
> +char *prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
> + unsigned int size)
> +{
> + struct prb_desc *d;
> + unsigned long id;
> + char *buf;
> +
> + if (!dataring_checksize(&rb->dr, size))
> + return ERR_PTR(-EINVAL);
> +
> + e->rb = rb;
> +
> + /*
> + * Disable interrupts during the reserve/commit window in order to
> + * minimize the number of reserved but not yet committed data blocks
> + * in the data ringbuffer. Although such data blocks are not bad per
> + * se, they act as blockers for writers once the data ringbuffer has
> + * wrapped back to them.
> + */
> + local_irq_save(e->irqflags);
> +
> + /* kA: */
> + if (!assign_desc(e)) {
> + /* Failures to reserve descriptors are counted. */
> + atomic_long_inc(&rb->fail);
> + buf = ERR_PTR(-EBUSY);
> + goto err_out;
> + }
> +
> + d = e->desc;
> +
> + /*
> + * kB:
> + *
> + * The descriptor ID has been updated so that its users can see that
> + * it is now invalid. Issue an smp_wmb() so that upcoming changes to
> + * the descriptor will not be associated with the old descriptor ID.
> + * This pairs with the smp_rmb() of prb_desc_busy() (see hB for
> + * details) and the smp_rmb() within numlist_read() and the smp_rmb()
> + * of prb_iter_next_valid_entry() (see mD for details).
> + *
> + * Memory barrier involvement:
> + *
> + * If hA reads from kC, then hC reads from jB.
> + * If mC reads from kC, then mE reads from jB.
> + *
> + * Relies on:
> + *
> + * WMB between jB and kC
> + * matching
> + * RMB between hA and hC
> + *
> + * WMB between jB and kC
> + * matching
> + * RMB between mC and mE
> + */
> + smp_wmb();
> +
> + id = atomic_long_read(&d->id);
> +
> + /* kC: */
> + buf = dataring_push(&rb->dr, size, &d->desc, id);
> + if (!buf) {
> + /* Put the invalid descriptor on the committed list. */
> + numlist_push(&rb->nl, &d->list, id);
> + buf = ERR_PTR(-ENOMEM);
> + goto err_out;
> + }
> +
> + return buf;
> +err_out:
> + local_irq_restore(e->irqflags);
> + return buf;
> +}
> +EXPORT_SYMBOL(prb_reserve);
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> --- /dev/null
> +++ b/kernel/printk/dataring.c
> +/**
> + * _datablock_valid() - Check if given positions yield a valid data block.
> + *
> + * @dr: The associated data ringbuffer.
> + *
> + * @head_lpos: The newest data logical position.
> + *
> + * @tail_lpos: The oldest data logical position.
> + *
> + * @begin_lpos: The beginning logical position of the data block to check.
> + *
> + * @next_lpos: The logical position of the next adjacent data block.
> + * This value is used to identify the end of the data block.
> + *
Please remove the empty lines between arguments description. They make
the comments too scattered.
> + * A data block is considered valid if it satisfies the two conditions:
> + *
> + * * tail_lpos <= begin_lpos < next_lpos <= head_lpos
> + * * tail_lpos is at most exactly 1 wrap behind head_lpos
> + *
> + * Return: true if the specified data block is valid.
> + */
To be sure, the empty lines between paragraphs are useful.
The following is still well readable:
/**
* _datablock_valid() - Check if given positions yield a valid data block.
* @dr: The associated data ringbuffer.
* @head_lpos: The newest data logical position.
* @tail_lpos: The oldest data logical position.
* @begin_lpos: The beginning logical position of the data block to check.
* @next_lpos: The logical position of the next adjacent data block.
* This value is used to identify the end of the data block.
* A data block is considered valid if it satisfies the two conditions:
*
* * tail_lpos <= begin_lpos < next_lpos <= head_lpos
* * tail_lpos is at most exactly 1 wrap behind head_lpos
*
* Return: true if the specified data block is valid.
*/
> +static unsigned long _dataring_pop(struct dataring *dr,
> + unsigned long tail_lpos)
> +{
> + unsigned long new_tail_lpos;
> + unsigned long begin_lpos;
> + unsigned long next_lpos;
> + struct dr_datablock *db;
> + struct dr_desc *desc;
> +
> + /*
> + * dA:
> + *
> + * @db has an address dependency on @tail_pos. Therefore @tail_lpos
> + * must be loaded before dB, which accesses @db.
> + */
> + db = to_datablock(dr, tail_lpos);
> +
> + /*
> + * dB:
> + *
> + * When a writer has completed accessing its data block, it sets the
> + * @id thus making the data block available for invalidation. This
> + * _acquire() ensures that this task sees all data ringbuffer and
> + * descriptor values seen by the writer as @id was set. This is
> + * necessary to ensure that the data block can be correctly identified
> + * as valid (i.e. @begin_lpos, @next_lpos, @head_lpos are at least the
> + * values seen by that writer, which yielded a valid data block at
> + * that time). It is not enough to rely on the address dependency of
> + * @desc to @id because @head_lpos is not depedent on @id. This pairs
> + * with the _release() in dataring_datablock_setid().
This human readable description is really useful.
> + *
> + * Memory barrier involvement:
> + *
> + * If dB reads from gA, then dC reads from fG.
> + * If dB reads from gA, then dD reads from fH.
> + * If dB reads from gA, then dE reads from fE.
> + *
> + * Note that if dB reads from gA, then dC cannot read from fC.
> + * Note that if dB reads from gA, then dD cannot read from fD.
> + *
> + * Relies on:
> + *
> + * RELEASE from fG to gA
> + * matching
> + * ADDRESS DEP. from dB to dC
> + *
> + * RELEASE from fH to gA
> + * matching
> + * ADDRESS DEP. from dB to dD
> + *
> + * RELEASE from fE to gA
> + * matching
> + * ACQUIRE from dB to dE
> + */
But I am not sure how much this is useful. It would take ages to decrypt
all these shortcuts (signs) and translate them into something
human readable. Also it might get outdated easily.
That said, I haven't found yet if there was a system in all
the shortcuts. I mean if they can be descrypted easily
out of head. Also I am not familiar with the notation
of the dependencies.
If this is really needed then I am really scared of some barriers
that guard too many things. This one is a good example.
> + desc = dr->getdesc(smp_load_acquire(&db->id), dr->getdesc_arg);
> +
> + /* dD: */
It would be great if all these shortcuts (signs) are followed with
something human readable. Few words might be enough.
> + next_lpos = READ_ONCE(desc->next_lpos);
> +
> + if (!_datablock_valid(dr,
> + /* dE: */
> + atomic_long_read(&dr->head_lpos),
> + tail_lpos, begin_lpos, next_lpos)) {
> + /* Another task has already invalidated the data block. */
> + goto out;
> + }
> +
> +
> +++ b/kernel/printk/numlist.c
> +bool numlist_read(struct numlist *nl, unsigned long id, unsigned long *seq,
> + unsigned long *next_id)
> +{
> + struct nl_node *n;
> +
> + n = nl->node(id, nl->node_arg);
> + if (!n)
> + return false;
> +
> + if (seq) {
> + /*
> + * aA:
> + *
> + * Adresss dependency on @id.
> + */
This is too scattered. If we really need so many shortcuts (signs)
then we should find a better style. The following looks perfectly
fine to me:
/* aA: Adresss dependency on @id. */
> + *seq = READ_ONCE(n->seq);
> + }
> +
> + if (next_id) {
> + /*
> + * aB:
> + *
> + * Adresss dependency on @id.
> + */
> + *next_id = READ_ONCE(n->next_id);
> + }
> +
Best Regards,
Petr
On Thu 2019-08-08 00:32:29, John Ogness wrote:
> Initialize never-used descriptors as permanently invalid so there
The word "permanently" is confusing. It suggests that it will
never ever be valid again. I would just remove the word.
> is no risk of the descriptor unexpectedly being determined as
> valid due to dataring head overflowing/wrapping.
Please, provide more details about the solved race. Is it because
some reader could have reference to an invalid (reused) descriptor?
Can be these invalid descriptors be member of the list?
Also it might be worth to mention where is the check that might
detect such invalid descriptors and what will be the consequences.
Well, this might be clear from the race description.
Best Regards,
Petr
On (08/20/19 10:55), Petr Mladek wrote:
[..]
> > + *
> > + * Memory barrier involvement:
> > + *
> > + * If dB reads from gA, then dC reads from fG.
> > + * If dB reads from gA, then dD reads from fH.
> > + * If dB reads from gA, then dE reads from fE.
> > + *
> > + * Note that if dB reads from gA, then dC cannot read from fC.
> > + * Note that if dB reads from gA, then dD cannot read from fD.
> > + *
> > + * Relies on:
> > + *
> > + * RELEASE from fG to gA
> > + * matching
> > + * ADDRESS DEP. from dB to dC
> > + *
> > + * RELEASE from fH to gA
> > + * matching
> > + * ADDRESS DEP. from dB to dD
> > + *
> > + * RELEASE from fE to gA
> > + * matching
> > + * ACQUIRE from dB to dE
> > + */
>
> But I am not sure how much this is useful. It would take ages to decrypt
> all these shortcuts (signs) and translate them into something
> human readable. Also it might get outdated easily.
>
> That said, I haven't found yet if there was a system in all
> the shortcuts. I mean if they can be descrypted easily
> out of head. Also I am not familiar with the notation
> of the dependencies.
Does not appear to be systematic to me, but maybe I'm missing something
obvious. For chains like
jA->cD->hA to jB
I haven't found anything better than just git grep jA kernel/printk/
so far.
But once you'll grep for label cD, for instance, you'd see
that it's not defined. It's mentioned but not defined
kernel/printk/ringbuffer.c: * jA->cD->hA.
kernel/printk/ringbuffer.c: * RELEASE from jA->cD->hA to jB
I was thinking about renaming labels. E.g.
dataring_desc_init()
{
/* di1 */
WRITE_ONCE(desc->begin_lpos, 1);
/* di2 */
WRITE_ONCE(desc->next_lpos, 1);
}
Where di stands for descriptor init.
dataring_push()
{
/* dp1 */
ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
...
/* dp2 */
smp_mb();
...
}
Where dp stands for descriptor push. For dataring we can add a 'dr'
prefix, to avoid confusion with desc barriers, which have 'd' prefix.
And so on. Dunno.
-ss
On (08/08/19 00:32), John Ogness wrote:
[..]
> +void prb_init(struct printk_ringbuffer *rb, char *data, int data_size_bits,
> + struct prb_desc *descs, int desc_count_bits,
> + struct wait_queue_head *waitq)
> +{
> + struct dataring *dr = &rb->dr;
> + struct numlist *nl = &rb->nl;
> +
> + rb->desc_count_bits = desc_count_bits;
> + rb->descs = descs;
> + atomic_long_set(&descs[0].id, 0);
> + descs[0].desc.begin_lpos = 1;
> + descs[0].desc.next_lpos = 1;
dataring_desc_init(), perhaps?
> + atomic_set(&rb->desc_next_unused, 1);
> +
> + atomic_long_set(&nl->head_id, 0);
> + atomic_long_set(&nl->tail_id, 0);
> + nl->node = prb_desc_node;
> + nl->node_arg = rb;
> + nl->busy = prb_desc_busy;
> + nl->busy_arg = rb;
> +
> + dr->size_bits = data_size_bits;
> + dr->data = data;
> + atomic_long_set(&dr->head_lpos, -111 * sizeof(long));
> + atomic_long_set(&dr->tail_lpos, -111 * sizeof(long));
> + dr->getdesc = prb_getdesc;
> + dr->getdesc_arg = rb;
> +
> + atomic_long_set(&rb->fail, 0);
> +
> + rb->wq = waitq;
> +}
> +EXPORT_SYMBOL(prb_init);
-ss
On (08/20/19 11:23), Petr Mladek wrote:
> > is no risk of the descriptor unexpectedly being determined as
> > valid due to dataring head overflowing/wrapping.
>
> Please, provide more details about the solved race. Is it because
> some reader could have reference to an invalid (reused) descriptor?
> Can be these invalid descriptors be member of the list?
As far as I understand, such descriptors can be on the list:
prb_reserve()
assign_desc()
// pick a new never used descr
i = atomic_fetch_inc(&rb->desc_next_unused);
d = &rb->descs[i]
dataring_desc_init(&d->desc);
return d
buf = dataring_push()
// the oldest data is reserved, but not commited
ret = get_new_lpos()
if (ret)
dataring_desc_init()
return NULL
if (!buf)
numlist_push()
_datablock_valid() has a "desc->begin_lpos == desc->next_lpos" check.
-ss
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> +/**
> + * dataring_push() - Reserve a data block in the data array.
> + *
> + * @dr: The data ringbuffer to reserve data in.
> + *
> + * @size: The size to reserve.
> + *
> + * @desc: A pointer to a descriptor to store the data block information.
> + *
> + * @id: The ID of the descriptor to be associated.
> + * The data block will not be set with @id, but rather initialized with
> + * a value that is explicitly different than @id. This is to handle the
> + * case when newly available garbage by chance matches the descriptor
> + * ID.
> + *
> + * This function expects to move the head pointer forward. If this would
> + * result in overtaking the data array index of the tail, the tail data block
> + * will be invalidated.
> + *
> + * Return: A pointer to the reserved writer data, otherwise NULL.
> + *
> + * This will only fail if it was not possible to invalidate the tail data
> + * block.
> + */
> +char *dataring_push(struct dataring *dr, unsigned int size,
> + struct dr_desc *desc, unsigned long id)
> +{
> + unsigned long begin_lpos;
> + unsigned long next_lpos;
> + struct dr_datablock *db;
> + bool ret;
> +
> + to_db_size(&size);
> +
> + do {
> + /* fA: */
> + ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
> +
> + /*
> + * fB:
> + *
> + * The data ringbuffer tail may have been pushed (by this or
> + * any other task). The updated @tail_lpos must be visible to
> + * all observers before changes to @begin_lpos, @next_lpos, or
> + * @head_lpos by this task are visible in order to allow other
> + * tasks to recognize the invalidation of the data
> + * blocks.
This sounds strange. The write barrier should be done only on CPU
that really modified tail_lpos. I.e. it should be in _dataring_pop()
after successful dr->tail_lpos modification.
> + * This pairs with the smp_rmb() in _dataring_pop() as well as
> + * any reader task using smp_rmb() to post-validate data that
> + * has been read from a data block.
> +
> + * Memory barrier involvement:
> + *
> + * If dE reads from fE, then dI reads from fA->eA.
> + * If dC reads from fG, then dI reads from fA->eA.
> + * If dD reads from fH, then dI reads from fA->eA.
> + * If mC reads from fH, then mF reads from fA->eA.
> + *
> + * Relies on:
> + *
> + * FULL MB between fA->eA and fE
> + * matching
> + * RMB between dE and dI
> + *
> + * FULL MB between fA->eA and fG
> + * matching
> + * RMB between dC and dI
> + *
> + * FULL MB between fA->eA and fH
> + * matching
> + * RMB between dD and dI
> + *
> + * FULL MB between fA->eA and fH
> + * matching
> + * RMB between mC and mF
> + */
> + smp_mb();
All these comments talk about sychronization against read barriers.
It means that we would need a write barrier here. But it does
not make much sense to do write barrier before actually
writing dr->head_lpos.
After all I think that we do not need any barrier here.
The write barrier for dr->tail_lpos should be in
_dataring_pop(). The read barrier is not needed because
we are not reading anything here.
Instead we should put a barrier after modyfying dr->head_lpos,
see below.
> + if (!ret) {
> + /*
> + * Force @desc permanently invalid to minimize risk
> + * of the descriptor later unexpectedly being
> + * determined as valid due to overflowing/wrapping of
> + * @head_lpos. An unaligned @begin_lpos can never
> + * point to a data block and having the same value
> + * for @begin_lpos and @next_lpos is also invalid.
> + */
> +
> + /* fC: */
> + WRITE_ONCE(desc->begin_lpos, 1);
> +
> + /* fD: */
> + WRITE_ONCE(desc->next_lpos, 1);
> +
> + return NULL;
> + }
> + /* fE: */
> + } while (atomic_long_cmpxchg_relaxed(&dr->head_lpos, begin_lpos,
> + next_lpos) != begin_lpos);
> +
We need a write barrier here to make sure that dr->head_lpos
is updated before we start updating other values, e.g.
db->id below.
Best Regards,
Petr
> + db = to_datablock(dr, begin_lpos);
> +
> + /*
> + * fF:
> + *
> + * @db->id is a garbage value and could possibly match the @id. This
> + * would be a problem because the data block would be considered
> + * valid before the writer has finished with it (i.e. before the
> + * writer has set @id). Force some other ID value.
> + */
> + WRITE_ONCE(db->id, id - 1);
>
> + /*
> + * fG:
> + *
> + * Ensure that @db->id is initialized to a wrong ID value before
> + * setting @begin_lpos so that there is no risk of accidentally
> + * matching a data block to a descriptor before the writer is finished
> + * with it (i.e. before the writer has set the correct @id). This
> + * pairs with the _acquire() in _dataring_pop().
> + *
> + * Memory barrier involvement:
> + *
> + * If dC reads from fG, then dF reads from fF.
> + *
> + * Relies on:
> + *
> + * RELEASE from fF to fG
> + * matching
> + * ACQUIRE from dC to dF
> + */
> + smp_store_release(&desc->begin_lpos, begin_lpos);
> +
> + /* fH: */
> + WRITE_ONCE(desc->next_lpos, next_lpos);
> +
> + /* If this data block wraps, use @data from the content data block. */
> + if (DATA_WRAPS(dr, begin_lpos) != DATA_WRAPS(dr, next_lpos))
> + db = to_datablock(dr, 0);
> +
> + return &db->data[0];
> +}
On Tue 2019-08-20 10:22:53, Petr Mladek wrote:
> On Thu 2019-08-08 00:32:26, John Ogness wrote:
> > --- /dev/null
> > +++ b/kernel/printk/ringbuffer.c
> > +/**
> > + * assign_desc() - Assign a descriptor to the caller.
> > + *
> > + * @e: The entry structure to store the assigned descriptor to.
> > + *
> > + * Find an available descriptor to assign to the caller. First it is checked
> > + * if the tail descriptor from the committed list can be recycled. If not,
> > + * perhaps a never-used descriptor is available. Otherwise, data blocks will
> > + * be invalidated until the tail descriptor from the committed list can be
> > + * recycled.
> > + *
> > + * Assigned descriptors are invalid until data has been reserved for them.
> > + *
> > + * Return: true if a descriptor was assigned, otherwise false.
> > + *
> > + * This will only fail if it was not possible to invalidate data blocks in
> > + * order to recycle a descriptor. This can happen if a writer has reserved but
> > + * not yet committed data and that reserved data is currently the oldest data.
> > + */
> > +static bool assign_desc(struct prb_reserved_entry *e)
> > +{
> > + struct printk_ringbuffer *rb = e->rb;
> > + struct prb_desc *d;
> > + struct nl_node *n;
> > + unsigned long i;
> > +
> > + for (;;) {
> > + /*
> > + * jA:
> > + *
> > + * Try to recycle a descriptor on the committed list.
> > + */
> > + n = numlist_pop(&rb->nl);
> > + if (n) {
> > + d = container_of(n, struct prb_desc, list);
> > + break;
> > + }
> > +
> > + /* Fallback to static never-used descriptors. */
> > + if (atomic_read(&rb->desc_next_unused) < DESCS_COUNT(rb)) {
> > + i = atomic_fetch_inc(&rb->desc_next_unused);
> > + if (i < DESCS_COUNT(rb)) {
> > + d = &rb->descs[i];
> > + atomic_long_set(&d->id, i);
> > + break;
> > + }
> > + }
> > +
> > + /*
> > + * No descriptor available. Make one available for recycling
> > + * by invalidating data (which some descriptor will be
> > + * referencing).
> > + */
> > + if (!dataring_pop(&rb->dr))
> > + return false;
> > + }
> > +
> > + /*
> > + * jB:
> > + *
> > + * Modify the descriptor ID so that users of the descriptor see that
> > + * it has been recycled. A _release() is used so that prb_getdesc()
> > + * callers can see all data ringbuffer updates after issuing a
> > + * pairing smb_rmb(). See iA for details.
> > + *
> > + * Memory barrier involvement:
> > + *
> > + * If dB->iA reads from jB, then dI reads the same value as
> > + * jA->cD->hA.
> > + *
> > + * Relies on:
> > + *
> > + * RELEASE from jA->cD->hA to jB
> > + * matching
> > + * RMB between dB->iA and dI
> > + */
> > + atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
> > + DESCS_COUNT(rb));
>
> atomic_long_set_release() might be a bit confusing here.
> There is no related acquire.
>
> In fact, d->id manipulation has barriers from both sides:
>
> + smp_rmb() before so that all reads are finished before
> the id is updated (release)
Uh, this statement does not make sense. The read barrier is not
needed here. Instead the readers need it.
Well, we might need a write barrier before d->id manipulation.
It should be in numlist_pop() after successfully updating nl->tail_id.
It will allow readers to detect that the desriptor is being reused
(not in valid tail_id..head_id range) before we start manipulating it.
> + smp_wmb() after so that the new ID is written before other
> related values are modified (acquire).
>
> The smp_wmb() barrier is in prb_reserve(). I would move it here.
This still makes sense. I would move the write barrier from
prb_reserve() here.
Sigh, I have to admit that I am not familiar with the _acquire(),
_release(), and _relaxed() variants of the atomic operations.
They probably make it easier to implement some locking API.
I am not sure how to use it here. This code implements a complex
interlock between several variables. I mean that several variables
lock each other in a cycle, like a state machine? In each case,
it is not a simple locking where we check state of a single
variable.
Best Regards,
Petr
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> +/**
> + * _dataring_pop() - Move tail forward, invalidating the oldest data block.
> + *
> + * @dr: The data ringbuffer containing the data block.
> + *
> + * @tail_lpos: The logical position of the oldest data block.
> + *
> + * This function expects to move the pointer to the oldest data block forward,
> + * thus invalidating the oldest data block. Before attempting to move the
> + * tail, it is verified that the data block is valid. An invalid data block
> + * means that another task has already moved the tail pointer forward.
> + *
> + * Return: The new/current value (logical position) of the tail.
> + *
> + * From the return value the caller can identify if the tail was moved
> + * forward. However, the caller does not know if it was the task that
> + * performed the move.
> + *
> + * If, after seeing a moved tail, the caller will be modifying @begin_lpos or
> + * @next_lpos of a descriptor or will be modifying the head, a full memory
> + * barrier is required before doing so. This ensures that if any update to a
> + * descriptor's @begin_lpos or @next_lpos or the data ringbuffer's head is
> + * visible, that the previous update to the tail is also visible. This avoids
> + * the possibility of failure to notice when another task has moved the tail.
> + *
> + * If the tail has not moved forward it means the @id for the data block was
> + * not set yet. In this case the tail cannot move forward.
> + */
> +static unsigned long _dataring_pop(struct dataring *dr,
> + unsigned long tail_lpos)
> +{
> + unsigned long new_tail_lpos;
> + unsigned long begin_lpos;
> + unsigned long next_lpos;
> + struct dr_datablock *db;
> + struct dr_desc *desc;
> +
> + /*
> + * dA:
> + *
> + * @db has an address dependency on @tail_pos. Therefore @tail_lpos
> + * must be loaded before dB, which accesses @db.
> + */
> + db = to_datablock(dr, tail_lpos);
> +
> + /*
> + * dB:
> + *
> + * When a writer has completed accessing its data block, it sets the
> + * @id thus making the data block available for invalidation. This
> + * _acquire() ensures that this task sees all data ringbuffer and
> + * descriptor values seen by the writer as @id was set. This is
> + * necessary to ensure that the data block can be correctly identified
> + * as valid (i.e. @begin_lpos, @next_lpos, @head_lpos are at least the
> + * values seen by that writer, which yielded a valid data block at
> + * that time). It is not enough to rely on the address dependency of
> + * @desc to @id because @head_lpos is not depedent on @id. This pairs
> + * with the _release() in dataring_datablock_setid().
> + *
> + desc = dr->getdesc(smp_load_acquire(&db->id), dr->getdesc_arg);
I guess that we might read a garbage here before dataring_push()
writes an invalid ID after it shuffled dr->head_lpos.
I am not completely sure how we detect the garbage. prb_getdesc()
might reach a valid descriptor just by chance. My understanding
is below.
> + if (!desc) {
> + /*
> + * The data block @id is invalid. The data block is either in
> + * use by the writer (@id not yet set) or has already been
> + * invalidated by another task and the data array area or
> + * descriptor have already been recycled. The latter case
> + * (descriptor already recycled) relies on the implementation
> + * of getdesc(), which, when using an smp_rmb(), must allow
> + * this task to see @tail_lpos as it was visible to the task
> + * that changed the ID-to-descriptor mapping. See the
> + * implementation of getdesc() for details.
> + */
> + goto out;
> + }
> +
> + /*
> + * dC:
> + *
> + * Even though the data block @id was determined to be valid, it is
> + * possible that it is a data block recently made available and @id
> + * has not yet been initialized. The @id needs to be re-validated (dF)
> + * after checking if the descriptor points to the data block. Use
> + * _acquire() to ensure that the re-loading of @id occurs after
> + * loading @begin_lpos. This pairs with the _release() in
> + * dataring_push(). See fG for details.
> + */
> + begin_lpos = smp_load_acquire(&desc->begin_lpos);
> +
> + if (begin_lpos != tail_lpos) {
> + /*
> + * @desc is not describing the data block at @tail_lpos. Since
> + * a data block and its descriptor always become valid before
> + * @id is set (see dB for details) the data block at
> + * @tail_lpos has already been invalidated.
> + */
> + goto out;
I believe that this check should be enough to detect the garbage
or non-initialized db->id.
All descriptors are reused regularly. It means that all descriptors
should contain only valid or recently used lpos values. In each case,
there should not be any risk of overflow.
The only exception might be never used descriptors. Do we detect them?
Did I miss anything, please?
> + }
> +
> + /* dD: */
> + next_lpos = READ_ONCE(desc->next_lpos);
> +
> + if (!_datablock_valid(dr,
> + /* dE: */
> + atomic_long_read(&dr->head_lpos),
> + tail_lpos, begin_lpos, next_lpos)) {
> + /* Another task has already invalidated the data block. */
> + goto out;
> + }
> +
> + /* dF: */
> + if (dr->getdesc(READ_ONCE(db->id), dr->getdesc_arg) != desc) {
> + /*
> + * The data block ID has changed. The rare case of an
> + * uninitialized @db->id matching the descriptor ID was hit.
> + * This is a special case and it applies to the failure of the
> + * previous @id check (dB).
> + */
> + goto out;
> + }
I guess that this is related to WRITE_ONCE(db->id, id - 1) in
dataring_push(). It does not harm. But I would like to be sure
that I understand it correctly.
So, is there any chance that this check fails? IMHO, the previous
checks should catch all invalid descriptors.
Instead we might need to add a check to detect a never used
descriptor. Or is it already detected?
Might id == 0, begin_lpos == 0, next_lpos == 0 be valid values
by chance?
What if a never used descriptor has been just assigned on another CPU
and being modified?
> + /* dG: */
> + new_tail_lpos = atomic_long_cmpxchg_relaxed(&dr->tail_lpos,
> + begin_lpos, next_lpos);
> + if (new_tail_lpos == begin_lpos)
> + return next_lpos;
> + return new_tail_lpos;
> +out:
> + /*
> + * dH:
> + *
> + * Ensure that the updated @tail_lpos is visible if the data block has
> + * been invalidated. This pairs with the smp_mb() in dataring_push()
> + * (see fB for details) as well as with the ID synchronization used in
> + * the getdesc() implementation, which must guarantee that an
> + * smp_rmb() is sufficient for seeing an updated @tail_lpos (see the
> + * implementation of getdesc() for details).
> + */
> + smp_rmb();
> +
> + /* dI: */
> + return atomic_long_read(&dr->tail_lpos);
> +}
> +
> +/**
> + * dataring_push() - Reserve a data block in the data array.
> + *
> + * @dr: The data ringbuffer to reserve data in.
> + *
> + * @size: The size to reserve.
> + *
> + * @desc: A pointer to a descriptor to store the data block information.
> + *
> + * @id: The ID of the descriptor to be associated.
> + * The data block will not be set with @id, but rather initialized with
> + * a value that is explicitly different than @id. This is to handle the
> + * case when newly available garbage by chance matches the descriptor
> + * ID.
> + *
> + * This function expects to move the head pointer forward. If this would
> + * result in overtaking the data array index of the tail, the tail data block
> + * will be invalidated.
> + *
> + * Return: A pointer to the reserved writer data, otherwise NULL.
> + *
> + * This will only fail if it was not possible to invalidate the tail data
> + * block.
> + */
> +char *dataring_push(struct dataring *dr, unsigned int size,
> + struct dr_desc *desc, unsigned long id)
> +{
> + unsigned long begin_lpos;
> + unsigned long next_lpos;
> + struct dr_datablock *db;
> + bool ret;
> +
> + to_db_size(&size);
> +
> + do {
> + /* fA: */
> + ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
> +
> + smp_mb();
> +
> + if (!ret) {
> + /*
> + * Force @desc permanently invalid to minimize risk
> + * of the descriptor later unexpectedly being
> + * determined as valid due to overflowing/wrapping of
> + * @head_lpos. An unaligned @begin_lpos can never
> + * point to a data block and having the same value
> + * for @begin_lpos and @next_lpos is also invalid.
> + */
> +
> + /* fC: */
> + WRITE_ONCE(desc->begin_lpos, 1);
> +
> + /* fD: */
> + WRITE_ONCE(desc->next_lpos, 1);
> +
> + return NULL;
> + }
> + /* fE: */
> + } while (atomic_long_cmpxchg_relaxed(&dr->head_lpos, begin_lpos,
> + next_lpos) != begin_lpos);
> +
> + db = to_datablock(dr, begin_lpos);
> +
> + /*
> + * fF:
> + *
> + * @db->id is a garbage value and could possibly match the @id. This
> + * would be a problem because the data block would be considered
> + * valid before the writer has finished with it (i.e. before the
> + * writer has set @id). Force some other ID value.
> + */
> + WRITE_ONCE(db->id, id - 1);
This would deserve a more detailed comment. Where the garbage can be
seen and how it is detected. I guess that it is in _dataring_pop()
that is being discussed above. Is it really needed?
Best Regards,
Petr
> + /*
> + * fG:
> + *
> + * Ensure that @db->id is initialized to a wrong ID value before
> + * setting @begin_lpos so that there is no risk of accidentally
> + * matching a data block to a descriptor before the writer is finished
> + * with it (i.e. before the writer has set the correct @id). This
> + * pairs with the _acquire() in _dataring_pop().
> + *
> + * Memory barrier involvement:
> + *
> + * If dC reads from fG, then dF reads from fF.
> + *
> + * Relies on:
> + *
> + * RELEASE from fF to fG
> + * matching
> + * ACQUIRE from dC to dF
> + */
> + smp_store_release(&desc->begin_lpos, begin_lpos);
> +
> + /* fH: */
> + WRITE_ONCE(desc->next_lpos, next_lpos);
> +
> + /* If this data block wraps, use @data from the content data block. */
> + if (DATA_WRAPS(dr, begin_lpos) != DATA_WRAPS(dr, next_lpos))
> + db = to_datablock(dr, 0);
> +
> + return &db->data[0];
> +}
On 2019-08-20, Petr Mladek <[email protected]> wrote:
>> Initialize never-used descriptors as permanently invalid so there
>
> The word "permanently" is confusing. It suggests that it will
> never ever be valid again. I would just remove the word.
Agreed.
>> is no risk of the descriptor unexpectedly being determined as
>> valid due to dataring head overflowing/wrapping.
>
> Please, provide more details about the solved race.
OK.
> Is it because some reader could have reference to an invalid
> (reused) descriptor?
Yes, but not because it is reused. If a writer succeeded in reserving a
descriptor, but failed to reserve a datablock, that (invalid) descriptor
is put on the committed list (see fA). By setting the lpos values to
something that could _never_ be valid, there is no risk of the
descriptor suddenly becoming valid due to head overflowing.
My RFCv2 did not account for this and instead invalid descriptors just
held on to whatever lpos values they last had. Although they are invalid
at that moment, if not set to something "permanently" invalid, those
values could become valid again. We talked about that here[0].
> Can be these invalid descriptors be member of the list?
Yes (as Sergey shows in his followup post). Readers see them as invalid
and treat them as dropped records.
> Also it might be worth to mention where is the check that might
> detect such invalid descriptors and what will be the consequences.
> Well, this might be clear from the race description.
The check itself is not special. However, readers do have to be aware of
and correctly handle the case of invalid descriptors on the list. I will
find an appropriate place to document this.
John Ogness
[0] https://lkml.kernel.org/r/[email protected]
On 2019-08-20, Sergey Senozhatsky <[email protected]> wrote:
> [..]
>> +void prb_init(struct printk_ringbuffer *rb, char *data, int data_size_bits,
>> + struct prb_desc *descs, int desc_count_bits,
>> + struct wait_queue_head *waitq)
>> +{
>> + struct dataring *dr = &rb->dr;
>> + struct numlist *nl = &rb->nl;
>> +
>> + rb->desc_count_bits = desc_count_bits;
>> + rb->descs = descs;
>> + atomic_long_set(&descs[0].id, 0);
>> + descs[0].desc.begin_lpos = 1;
>> + descs[0].desc.next_lpos = 1;
>
> dataring_desc_init(), perhaps?
Agreed.
On 2019-08-20, Petr Mladek <[email protected]> wrote:
>> --- /dev/null
>> +++ b/kernel/printk/numlist.c
>> +/**
>> + * numlist_pop() - Remove the oldest node from the list.
>> + *
>> + * @nl: The numbered list from which to remove the tail node.
>> + *
>> + * The tail node can only be removed if two conditions are satisfied:
>> + *
>> + * * The node is not the only node on the list.
>> + * * The node is not busy.
>> + *
>> + * If, during this function, another task removes the tail, this function
>> + * will try again with the new tail.
>> + *
>> + * Return: The removed node or NULL if the tail node cannot be removed.
>> + */
>> +struct nl_node *numlist_pop(struct numlist *nl)
>> +{
>> + unsigned long tail_id;
>> + unsigned long next_id;
>> + unsigned long r;
>> +
>> + /* cA: #1 */
>> + tail_id = atomic_long_read(&nl->tail_id);
>> +
>> + for (;;) {
>> + /* cB */
>> + while (!numlist_read(nl, tail_id, NULL, &next_id)) {
>> + /*
>> + * @tail_id is invalid. Try again with an
>> + * updated value.
>> + */
>> +
>> + cpu_relax();
>> +
>> + /* cA: #2 */
>> + tail_id = atomic_long_read(&nl->tail_id);
>> + }
>
> The above while-cycle basically does the same as the upper for-cycle.
> It tries again with freshly loaded nl->tail_id. The following code
> looks easier to follow:
>
> do {
> tail_id = atomic_long_read(&nl->tail_id);
>
> /*
> * Read might fail when the tail node has been removed
> * and reused in parallel.
> */
> if (!numlist_read(nl, tail_id, NULL, &next_id))
> continue;
>
> /* Make sure the node is not the only node on the list. */
> if (next_id == tail_id)
> return NULL;
>
> /* cC: Make sure the node is not busy. */
> if (nl->busy(tail_id, nl->busy_arg))
> return NULL;
>
> while (atomic_long_cmpxchg_relaxed(&nl->tail_id, tail_id, next_id) !=
> tail_id);
>
> /* This should never fail. The node is ours. */
> return nl->node(tail_id, nl->node_arg);
You will see that pattern in several cmpxchg() loops. The reason I chose
to do it that way was so that I could make use of the return value of
the failed cmpcxhg(). This avoids an unnecessary LOAD and establishes a
data dependency between the failed cmpxchg() and the following
numlist_read(). I suppose none of that matters since we only care about
the case where cmpxchg() is successful.
I agree that your variation is easier to read.
>> + /* Make sure the node is not the only node on the list. */
>> + if (next_id == tail_id)
>> + return NULL;
>> +
>> + /*
>> + * cC:
>> + *
>> + * Make sure the node is not busy.
>> + */
>> + if (nl->busy(tail_id, nl->busy_arg))
>> + return NULL;
>> +
>> + r = atomic_long_cmpxchg_relaxed(&nl->tail_id,
>> + tail_id, next_id);
>> + if (r == tail_id)
>> + break;
>> +
>> + /* cA: #3 */
>> + tail_id = r;
>> + }
>> +
>> + return nl->node(tail_id, nl->node_arg);
>
> If I get it correctly, the above nl->node() call should never fail.
> The node has been removed from the list and nobody else could
> touch it. It is pretty useful information and it might be worth
> mention it in a comment.
You are correct and I will add a comment.
> PS: I am scratching my head around the patchset. I'll try Peter's
> approach and comment independent things is separate mails.
I think it is an excellent approach. Especially when discussing the
memory barriers.
John Ogness
On 2019-08-20, Petr Mladek <[email protected]> wrote:
>> --- /dev/null
>> +++ b/kernel/printk/dataring.c
>> +/**
>> + * _datablock_valid() - Check if given positions yield a valid data block.
>> + *
>> + * @dr: The associated data ringbuffer.
>> + *
>> + * @head_lpos: The newest data logical position.
>> + *
>> + * @tail_lpos: The oldest data logical position.
>> + *
>> + * @begin_lpos: The beginning logical position of the data block to check.
>> + *
>> + * @next_lpos: The logical position of the next adjacent data block.
>> + * This value is used to identify the end of the data block.
>> + *
>
> Please remove the empty lines between arguments description. They make
> the comments too scattered.
Your feedback is contradicting what PeterZ requested[0]. Particularly
when multiple lines are involved with a description, I find the spacing
helpful. I've grown to like the spacing, but I won't fight for it.
>> + /*
>> + * dB:
>> + *
>> + * When a writer has completed accessing its data block, it sets the
>> + * @id thus making the data block available for invalidation. This
>> + * _acquire() ensures that this task sees all data ringbuffer and
>> + * descriptor values seen by the writer as @id was set. This is
>> + * necessary to ensure that the data block can be correctly identified
>> + * as valid (i.e. @begin_lpos, @next_lpos, @head_lpos are at least the
>> + * values seen by that writer, which yielded a valid data block at
>> + * that time). It is not enough to rely on the address dependency of
>> + * @desc to @id because @head_lpos is not depedent on @id. This pairs
>> + * with the _release() in dataring_datablock_setid().
>
> This human readable description is really useful.
>
>> + *
>> + * Memory barrier involvement:
>> + *
>> + * If dB reads from gA, then dC reads from fG.
>> + * If dB reads from gA, then dD reads from fH.
>> + * If dB reads from gA, then dE reads from fE.
>> + *
>> + * Note that if dB reads from gA, then dC cannot read from fC.
>> + * Note that if dB reads from gA, then dD cannot read from fD.
>> + *
>> + * Relies on:
>> + *
>> + * RELEASE from fG to gA
>> + * matching
>> + * ADDRESS DEP. from dB to dC
>> + *
>> + * RELEASE from fH to gA
>> + * matching
>> + * ADDRESS DEP. from dB to dD
>> + *
>> + * RELEASE from fE to gA
>> + * matching
>> + * ACQUIRE from dB to dE
>> + */
>
> But I am not sure how much this is useful.
When I was first implementing RFCv3, the "human-readable" text version
was very useful for me. However, now it is the formal descriptions that
I find more useful. They provide the proof and a far more detailed
description.
> It would take ages to decrypt all these shortcuts (signs) and
> translate them into something human readable. Also it might get
> outdated easily.
>
> That said, I haven't found yet if there was a system in all
> the shortcuts. I mean if they can be descrypted easily
> out of head. Also I am not familiar with the notation
> of the dependencies.
I'll respond to this part in Sergey's followup post.
> If this is really needed then I am really scared of some barriers
> that guard too many things. This one is a good example.
>
>> + desc = dr->getdesc(smp_load_acquire(&db->id), dr->getdesc_arg);
The variable's value (in this case db->id) is doing the guarding. The
barriers ensure that db->id is read first (and set last).
>> +
>> + /* dD: */
>
> It would be great if all these shortcuts (signs) are followed with
> something human readable. Few words might be enough.
I'll respond to this part in Sergey's followup post.
>> + next_lpos = READ_ONCE(desc->next_lpos);
>> +
>> + if (!_datablock_valid(dr,
>> + /* dE: */
>> + atomic_long_read(&dr->head_lpos),
>> + tail_lpos, begin_lpos, next_lpos)) {
>> + /* Another task has already invalidated the data block. */
>> + goto out;
>> + }
>> +
>> +
>> +++ b/kernel/printk/numlist.c
>> +bool numlist_read(struct numlist *nl, unsigned long id, unsigned long *seq,
>> + unsigned long *next_id)
>> +
>> + struct nl_node *n;
>> +
>> + n = nl->node(id, nl->node_arg);
>> + if (!n)
>> + return false;
>> +
>> + if (seq) {
>> + /*
>> + * aA:
>> + *
>> + * Adresss dependency on @id.
>> + */
>
> This is too scattered. If we really need so many shortcuts (signs)
> then we should find a better style. The following looks perfectly
> fine to me:
>
> /* aA: Adresss dependency on @id. */
I'll respond to this part in Sergey's followup post.
>> + *seq = READ_ONCE(n->seq);
>> + }
>> +
>> + if (next_id) {
>> + /*
>> + * aB:
>> + *
>> + * Adresss dependency on @id.
>> + */
>> + *next_id = READ_ONCE(n->next_id);
>> + }
>> +
John Ogness
[0] https://lkml.kernel.org/r/[email protected]
On 2019-08-20, Petr Mladek <[email protected]> wrote:
>> > --- /dev/null
>> > +++ b/kernel/printk/ringbuffer.c
>> > +/**
>> > + * assign_desc() - Assign a descriptor to the caller.
>> > + *
>> > + * @e: The entry structure to store the assigned descriptor to.
>> > + *
>> > + * Find an available descriptor to assign to the caller. First it is checked
>> > + * if the tail descriptor from the committed list can be recycled. If not,
>> > + * perhaps a never-used descriptor is available. Otherwise, data blocks will
>> > + * be invalidated until the tail descriptor from the committed list can be
>> > + * recycled.
>> > + *
>> > + * Assigned descriptors are invalid until data has been reserved for them.
>> > + *
>> > + * Return: true if a descriptor was assigned, otherwise false.
>> > + *
>> > + * This will only fail if it was not possible to invalidate data blocks in
>> > + * order to recycle a descriptor. This can happen if a writer has reserved but
>> > + * not yet committed data and that reserved data is currently the oldest data.
>> > + */
>> > +static bool assign_desc(struct prb_reserved_entry *e)
>> > +{
>> > + struct printk_ringbuffer *rb = e->rb;
>> > + struct prb_desc *d;
>> > + struct nl_node *n;
>> > + unsigned long i;
>> > +
>> > + for (;;) {
>> > + /*
>> > + * jA:
>> > + *
>> > + * Try to recycle a descriptor on the committed list.
>> > + */
>> > + n = numlist_pop(&rb->nl);
>> > + if (n) {
>> > + d = container_of(n, struct prb_desc, list);
>> > + break;
>> > + }
>> > +
>> > + /* Fallback to static never-used descriptors. */
>> > + if (atomic_read(&rb->desc_next_unused) < DESCS_COUNT(rb)) {
>> > + i = atomic_fetch_inc(&rb->desc_next_unused);
>> > + if (i < DESCS_COUNT(rb)) {
>> > + d = &rb->descs[i];
>> > + atomic_long_set(&d->id, i);
>> > + break;
>> > + }
>> > + }
>> > +
>> > + /*
>> > + * No descriptor available. Make one available for recycling
>> > + * by invalidating data (which some descriptor will be
>> > + * referencing).
>> > + */
>> > + if (!dataring_pop(&rb->dr))
>> > + return false;
>> > + }
>> > +
>> > + /*
>> > + * jB:
>> > + *
>> > + * Modify the descriptor ID so that users of the descriptor see that
>> > + * it has been recycled. A _release() is used so that prb_getdesc()
>> > + * callers can see all data ringbuffer updates after issuing a
>> > + * pairing smb_rmb(). See iA for details.
>> > + *
>> > + * Memory barrier involvement:
>> > + *
>> > + * If dB->iA reads from jB, then dI reads the same value as
>> > + * jA->cD->hA.
>> > + *
>> > + * Relies on:
>> > + *
>> > + * RELEASE from jA->cD->hA to jB
>> > + * matching
>> > + * RMB between dB->iA and dI
>> > + */
>> > + atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
>> > + DESCS_COUNT(rb));
>>
>> atomic_long_set_release() might be a bit confusing here.
>> There is no related acquire.
As the comment states, this release is for prb_getdesc() users. The only
prb_getdesc() user is _dataring_pop(). If getdesc() returns NULL
(i.e. the descriptor's ID is not what _dataring_pop() was expecting),
then the tail must have moved and _dataring_pop() needs to see
that. Since there are no data dependencies between descriptor ID and
tail_pos, an explicit memory barrier is used. More on this below.
>> In fact, d->id manipulation has barriers from both sides:
>>
>> + smp_rmb() before so that all reads are finished before
>> the id is updated (release)
>
> Uh, this statement does not make sense. The read barrier is not
> needed here. Instead the readers need it.
>
> Well, we might need a write barrier before d->id manipulation.
> It should be in numlist_pop() after successfully updating nl->tail_id.
> It will allow readers to detect that the desriptor is being reused
> (not in valid tail_id..head_id range) before we start manipulating it.
>
>> + smp_wmb() after so that the new ID is written before other
>> related values are modified (acquire).
>>
>> The smp_wmb() barrier is in prb_reserve(). I would move it here.
>
> This still makes sense. I would move the write barrier from
> prb_reserve() here.
The issue is that _dataring_pop() needs to see a moved dataring tail if
prb_getdesc() fails. Just because numlist_pop() succeeded, doesn't mean
that this was the task that changed the dataring tail. I.e. another CPU
could observe that this task changed the ID but _not_ yet see that
another task changed the dataring tail.
Issuing an smp_mb() before setting the the new ID would also suffice,
but that is a pretty big hammer for something that a set_release can
take care of.
> Sigh, I have to admit that I am not familiar with the _acquire(),
> _release(), and _relaxed() variants of the atomic operations.
>
> They probably make it easier to implement some locking API.
> I am not sure how to use it here. This code implements a complex
> interlock between several variables. I mean that several variables
> lock each other in a cycle, like a state machine? In each case,
> it is not a simple locking where we check state of a single
> variable.
Keep in mind that dataring and numlist were written independent of the
ringbuffer. They are structures with very specific purposes and their
own set of variables (and memory barriers to order _those_
variables). The high-level ringbuffer also has its own variables and
memory barriers. Sometimes there is overlap, which is implemented in the
callbacks (as is here), which is why the dataring callback getdesc() has
the implementation requirement that a following smp_rmb() by the caller
will guarantee seeing an updated dataring tail. But these overlaps are
the exception, not the rule.
I think trying to see "everything at once" with a top-down view is going
to seem too complex and hurt your brain. I think it would be easier to
verify the internal consistency of the individual dataring and numlist
structures first. Once you have faith in the integrity of those
structures, moving to the high-level ringbuffer is a much smaller step.
John Ogness
On 2019-08-20, Sergey Senozhatsky <[email protected]> wrote:
> [..]
>> > + *
>> > + * Memory barrier involvement:
>> > + *
>> > + * If dB reads from gA, then dC reads from fG.
>> > + * If dB reads from gA, then dD reads from fH.
>> > + * If dB reads from gA, then dE reads from fE.
>> > + *
>> > + * Note that if dB reads from gA, then dC cannot read from fC.
>> > + * Note that if dB reads from gA, then dD cannot read from fD.
>> > + *
>> > + * Relies on:
>> > + *
>> > + * RELEASE from fG to gA
>> > + * matching
>> > + * ADDRESS DEP. from dB to dC
>> > + *
>> > + * RELEASE from fH to gA
>> > + * matching
>> > + * ADDRESS DEP. from dB to dD
>> > + *
>> > + * RELEASE from fE to gA
>> > + * matching
>> > + * ACQUIRE from dB to dE
>> > + */
>>
>> But I am not sure how much this is useful. It would take ages to decrypt
>> all these shortcuts (signs) and translate them into something
>> human readable. Also it might get outdated easily.
>>
>> That said, I haven't found yet if there was a system in all
>> the shortcuts. I mean if they can be descrypted easily
>> out of head. Also I am not familiar with the notation
>> of the dependencies.
>
> Does not appear to be systematic to me, but maybe I'm missing something
> obvious. For chains like
>
> jA->cD->hA to jB
>
> I haven't found anything better than just git grep jA kernel/printk/
> so far.
I really struggled to find a way to label the code in order to document
the memory barriers. By grepping on "jA:" you will land at the exact
location.
> But once you'll grep for label cD, for instance, you'd see
> that it's not defined. It's mentioned but not defined
>
> kernel/printk/ringbuffer.c: * jA->cD->hA.
> kernel/printk/ringbuffer.c: * RELEASE from jA->cD->hA to jB
I tried to be very careful about the labeling, but you just found an
error. cD is supposed to be cC. (I probably refactored the labels and
missed this one.) Particularly with referencing labels from other files
I was not happy (which is the case with cC). This is one area that I
think it would be really helpful if the kernel guidelines had some
format.
The labels are necessary for the technical documentation of the
barriers. And, after spending much time in this, I find them very
useful. But I agree that there needs to be a better way to assign label
names.
FWIW, I chose a lowercase letter for each function and an uppercase
letter for each label within that function. The camel case (followed by
the colon) created a pair that was unique for grepping.
Petr, in case you missed it, this comment language came from my
discussion[0] with AndreaP.
> I was thinking about renaming labels. E.g.
>
> dataring_desc_init()
> {
> /* di1 */
> WRITE_ONCE(desc->begin_lpos, 1);
> /* di2 */
> WRITE_ONCE(desc->next_lpos, 1);
> }
>
> Where di stands for descriptor init.
>
> dataring_push()
> {
> /* dp1 */
> ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
> ...
> /* dp2 */
> smp_mb();
> ...
> }
>
> Where dp stands for descriptor push. For dataring we can add a 'dr'
> prefix, to avoid confusion with desc barriers, which have 'd' prefix.
> And so on. Dunno.
Yeah, I spent a lot of time going in circles on this one.
I hope that we can agree that the labels are important. And that a
formal documentation of the barriers is also important. Yes, they are a
lot of work, but I find it makes it a lot easier to go back to the code
after I've been away for a while. Even now, as I go through your
feedback on code that I wrote over a month ago, I find the formal
comments critical to quickly understand _exactly_ why the memory
barriers exist.
Perhaps we should choose labels that are more clear, like:
dataring_push:A
dataring_push:B
Then we would see comments like:
Memory barrier involvement:
If _dataring_pop:B reads from dataring_datablock_setid:A, then
_dataring_pop:C reads from dataring_push:G.
If _dataring_pop:B reads from dataring_datablock_setid:A, then
_dataring_pop:D reads from dataring_push:H.
If _dataring_pop:B reads from dataring_datablock_setid:A, then
_dataring_pop:E reads from dataring_push:E.
Note that if _dataring_pop:B reads from dataring_datablock_setid:A, then
_dataring_pop:C cannot read from dataring_push:C->dataring_desc_init:A.
Note that if _dataring_pop:B reads from dataring_datablock_setid:A, then
_dataring_pop:D cannot read from dataring_push:C->dataring_desc_init:B.
Relies on:
RELEASE from dataring_push:G to dataring_datablock_setid:A
matching
ADDRESS DEP. from _dataring_pop:B to _dataring_pop:C
RELEASE from dataring_push:H to dataring_datablock_setid:A
matching
ADDRESS DEP. from _dataring_pop:B to _dataring_pop:D
RELEASE from dataring_push:E to dataring_datablock_setid:A
matching
ACQUIRE from _dataring_pop:B to _dataring_pop:E
But then how should the labels in the code look? Just the letter looks
simple in code, but cannot be grepped.
dataring_push()
{
...
/* E */
...
}
The full label can be grepped, but is redundant with the function name.
dataring_push()
{
...
/* dataring_push:E */
...
}
Andrea suggested that the documentation should be within the code, which
I think is a good idea. Even if it means we have more comments than
code.
I am open to suggestions.
John Ogness
[0] https://lkml.kernel.org/r/20190630140855.GA6005@andrea
On Wed 2019-08-21 07:42:57, John Ogness wrote:
> On 2019-08-20, Petr Mladek <[email protected]> wrote:
> >> --- /dev/null
> >> +++ b/kernel/printk/dataring.c
> >> +/**
> >> + * _datablock_valid() - Check if given positions yield a valid data block.
> >> + *
> >> + * @dr: The associated data ringbuffer.
> >> + *
> >> + * @head_lpos: The newest data logical position.
> >> + *
> >> + * @tail_lpos: The oldest data logical position.
> >> + *
> >> + * @begin_lpos: The beginning logical position of the data block to check.
> >> + *
> >> + * @next_lpos: The logical position of the next adjacent data block.
> >> + * This value is used to identify the end of the data block.
> >> + *
> >
> > Please remove the empty lines between arguments description. They make
> > the comments too scattered.
>
> Your feedback is contradicting what PeterZ requested[0]. Particularly
> when multiple lines are involved with a description, I find the spacing
> helpful. I've grown to like the spacing, but I won't fight for it.
I do not want to fight over it. Just note that >90% of argument
descriptors seem to be one liners.
Best Regards,
Petr
On Wed 2019-08-21 07:46:28, John Ogness wrote:
> On 2019-08-20, Sergey Senozhatsky <[email protected]> wrote:
> > [..]
> >> > + *
> >> > + * Memory barrier involvement:
> >> > + *
> >> > + * If dB reads from gA, then dC reads from fG.
> >> > + * If dB reads from gA, then dD reads from fH.
> >> > + * If dB reads from gA, then dE reads from fE.
> >> > + *
> >> > + * Note that if dB reads from gA, then dC cannot read from fC.
> >> > + * Note that if dB reads from gA, then dD cannot read from fD.
> >> > + *
> >> > + * Relies on:
> >> > + *
> >> > + * RELEASE from fG to gA
> >> > + * matching
> >> > + * ADDRESS DEP. from dB to dC
> >> > + *
> >> > + * RELEASE from fH to gA
> >> > + * matching
> >> > + * ADDRESS DEP. from dB to dD
> >> > + *
> >> > + * RELEASE from fE to gA
> >> > + * matching
> >> > + * ACQUIRE from dB to dE
> >> > + */
> >>
> >> But I am not sure how much this is useful. It would take ages to decrypt
> >> all these shortcuts (signs) and translate them into something
> >> human readable. Also it might get outdated easily.
> >>
> The labels are necessary for the technical documentation of the
> barriers. And, after spending much time in this, I find them very
> useful. But I agree that there needs to be a better way to assign label
> names.
I could understand that you spend a lot of time on creating the
labels and that they are somehow useful for you.
But I am not using them and I hope that I will not have to:
+ Grepping takes a lot of time, especially over several files.
+ Grepping is actually not enough. It is required to read
the following comment or code to realize what the label is for.
+ Several barriers have multiple dependencies. Grepping one
label helps to check that one connection makes sense.
But it is hard to keep all relations in head to confirm
that they are complete and make sense overall.
+ There are about 50 labels in the code. "Entry Lifecycle"
section in dataring.c talks about 8 step. One would
expect that it would require 8 read and 8 write barriers.
Even coordination of 16 barriers might be complicated to check.
Where 50 is just scary.
+ It seems to be a newly invented format and it is not documented.
I personally do not understand it completely, for example,
the meaning of "RELEASE from jA->cD->hA to jB".
I hope that we could do better. I believe that human readable
comments all less error prone because they describe the intention.
Pseudo code based on labels just describes the code but it
does not explain why it was done this way.
From my POV, the labels do more harm than good. The code gets
too scattered and is harder to follow.
> I hope that we can agree that the labels are important.
It would be great to hear from others.
> And that a formal documentation of the barriers is also important.
It might be helpful if it can be somehow feed to a tool that would
prove correctness. Is this the case?
In each case, it should follow some "widely" used format.
We should not invent a new one that nobody else would use
and understand.
> Perhaps we should choose labels that are more clear, like:
>
> dataring_push:A
> dataring_push:B
The dataring_push is clear. The A or B codes have no meaning
without searching.
It might look better if we replace A or B with variable names.
> Then we would see comments like:
>
> Memory barrier involvement:
>
> If _dataring_pop:B reads from dataring_datablock_setid:A, then
> _dataring_pop:C reads from dataring_push:G.
Is this some known syntax, please? I do not understand it.
>
> Andrea suggested that the documentation should be within the code, which
> I think is a good idea. Even if it means we have more comments than
> code.
It depends on the type of the information. I would describe:
+ The overall design on top of the source file or in
Documentation/...
+ The behavior of externally used API and non-obvious functions
above the function definition.
+ Implementation details, non-obvious effects, side effects,
relations, meaning of tricky calculation, meaning of
a block of code inside the code. But each function should
ideally fit on the screen.
I personally tend to write more documentation but it is sometimes
too much. I am trying to become more effective and to the point.
Best Regards,
Petr
On Wed 2019-08-21 07:52:26, John Ogness wrote:
> On 2019-08-20, Petr Mladek <[email protected]> wrote:
> >> > --- /dev/null
> >> > +++ b/kernel/printk/ringbuffer.c
> >> > +/**
> >> > + * assign_desc() - Assign a descriptor to the caller.
> >> > + *
> >> > + * @e: The entry structure to store the assigned descriptor to.
> >> > + *
> >> > + * Find an available descriptor to assign to the caller. First it is checked
> >> > + * if the tail descriptor from the committed list can be recycled. If not,
> >> > + * perhaps a never-used descriptor is available. Otherwise, data blocks will
> >> > + * be invalidated until the tail descriptor from the committed list can be
> >> > + * recycled.
> >> > + *
> >> > + * Assigned descriptors are invalid until data has been reserved for them.
> >> > + *
> >> > + * Return: true if a descriptor was assigned, otherwise false.
> >> > + *
> >> > + * This will only fail if it was not possible to invalidate data blocks in
> >> > + * order to recycle a descriptor. This can happen if a writer has reserved but
> >> > + * not yet committed data and that reserved data is currently the oldest data.
> >> > + */
> >> > +static bool assign_desc(struct prb_reserved_entry *e)
> >> > +{
> >> > + struct printk_ringbuffer *rb = e->rb;
> >> > + struct prb_desc *d;
> >> > + struct nl_node *n;
> >> > + unsigned long i;
> >> > +
> >> > + for (;;) {
> >> > + /*
> >> > + * jA:
> >> > + *
> >> > + * Try to recycle a descriptor on the committed list.
> >> > + */
> >> > + n = numlist_pop(&rb->nl);
> >> > + if (n) {
> >> > + d = container_of(n, struct prb_desc, list);
> >> > + break;
> >> > + }
> >> > +
> >> > + /* Fallback to static never-used descriptors. */
> >> > + if (atomic_read(&rb->desc_next_unused) < DESCS_COUNT(rb)) {
> >> > + i = atomic_fetch_inc(&rb->desc_next_unused);
> >> > + if (i < DESCS_COUNT(rb)) {
> >> > + d = &rb->descs[i];
> >> > + atomic_long_set(&d->id, i);
> >> > + break;
> >> > + }
> >> > + }
> >> > +
> >> > + /*
> >> > + * No descriptor available. Make one available for recycling
> >> > + * by invalidating data (which some descriptor will be
> >> > + * referencing).
> >> > + */
> >> > + if (!dataring_pop(&rb->dr))
> >> > + return false;
> >> > + }
> >> > +
> >> > + /*
> >> > + * jB:
> >> > + *
> >> > + * Modify the descriptor ID so that users of the descriptor see that
> >> > + * it has been recycled. A _release() is used so that prb_getdesc()
> >> > + * callers can see all data ringbuffer updates after issuing a
> >> > + * pairing smb_rmb(). See iA for details.
> >> > + *
> >> > + * Memory barrier involvement:
> >> > + *
> >> > + * If dB->iA reads from jB, then dI reads the same value as
> >> > + * jA->cD->hA.
> >> > + *
> >> > + * Relies on:
> >> > + *
> >> > + * RELEASE from jA->cD->hA to jB
> >> > + * matching
> >> > + * RMB between dB->iA and dI
> >> > + */
> >> > + atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
> >> > + DESCS_COUNT(rb));
> >>
> >> atomic_long_set_release() might be a bit confusing here.
> >> There is no related acquire.
>
> As the comment states, this release is for prb_getdesc() users. The only
> prb_getdesc() user is _dataring_pop().
> (i.e. the descriptor's ID is not what _dataring_pop() was expecting),
> then the tail must have moved and _dataring_pop() needs to see
> that. Since there are no data dependencies between descriptor ID and
> tail_pos, an explicit memory barrier is used. More on this below.
OK, let me show how complicated and confusing this looks for me:
+ The two related barriers are in different source files
and APIs:
+ assign_desc() in ringbuffer.c; ringbuffer API
+ _dataring_pop in dataring.c; dataring API
+ Both the related barriers are around "id" manipulation.
But one is in dataring, other is in descriptors array.
One is about an old released "id". One is about a newly
assigned "id".
+ The release() barrier is called once for each assigned
descriptor. The acquire() barrier is called more times
or not at all depending on the amount of free space
in dataring.
+ prb_getdesc() is mentioned in the comment but the barrier
is in _dataring_pop()
+ prb_getdesc() is called via dr->getdesc() callback and thus
not straightforward to check.
+ dr->getdesc() is called twice in _dataring_pop(); once
with _acquire() and once without.
+ _acquire() is hidden in
desc = dr->getdesc(smp_load_acquire(&db->id), dr->getdesc_arg)
+ The comment says that it is pairing with smb_rmb() but it
the code uses _acquire().
+ The comment says that the barrier is issued "so that callers can
see all data ringbuffer updates". It is not specific what
updates are meant (earlier or later).
It can be guessed by the the type of the barrier. But it does
help with review (barrier matches what author wanted).
What would have helped me to understand this barrier might
be something like:
/*
* Got descriptor and have exclusive write access.
* Use _release() barrier before first modification
* so that others could detect the new owner via
* previous numlist and dataring head/tail updates.
*
* The related barrier is in _dataring_pop() when
* acquiring db->id.
*/
It explains what the barrier is synchronizing and where is
the counter part.
But it still does not explain if the counter part is correct.
I simply do not know. Barriers are usually symmetric but the
only symmetric thing here is the name of the variable ("id").
It might be correct after all. But it looks so non-standard
and far from obvious at least for me. I hope that we could
either make it more symmetric and better explain it.
> > Sigh, I have to admit that I am not familiar with the _acquire(),
> > _release(), and _relaxed() variants of the atomic operations.
> >
> > They probably make it easier to implement some locking API.
> > I am not sure how to use it here. This code implements a complex
> > interlock between several variables. I mean that several variables
> > lock each other in a cycle, like a state machine? In each case,
> > it is not a simple locking where we check state of a single
> > variable.
>
> Keep in mind that dataring and numlist were written independent of the
> ringbuffer. They are structures with very specific purposes and their
> own set of variables (and memory barriers to order _those_
> variables). The high-level ringbuffer also has its own variables and
> memory barriers. Sometimes there is overlap, which is implemented in the
> callbacks (as is here), which is why the dataring callback getdesc() has
> the implementation requirement that a following smp_rmb() by the caller
> will guarantee seeing an updated dataring tail. But these overlaps are
> the exception, not the rule.
Sure. It is possible that this is the worst place. But there are definitely
more of them:
+ smp_rmb() in numlist_read() is related to smp_wmb() in prb_reserve()
+ "node" callback in struct nl_node must do smp_wmb()
+ full memory barrier is required before calling get_new_lpos()
I think that I understand the overall algorithm in principle. But it
is really hard to prove that all pieces play well together.
I would like to discuss the overall design in a separate thread.
I wanted to comment and understand some details first.
It is clear that you put a lot of effort into it. Also it is great
that you tried to formalize the barriers. But it is still very
complicated.
> I think trying to see "everything at once" with a top-down view is going
> to seem too complex and hurt your brain. I think it would be easier to
> verify the internal consistency of the individual dataring and numlist
> structures first. Once you have faith in the integrity of those
> structures, moving to the high-level ringbuffer is a much smaller
> step.
I try hard to understand the think from different angles. I started
with numlist.c and there was quite some dependency on the rest.
Then I started to check the overall algorithm and saw even more
points that synchronized against more locations or synchronized
against another structures.
I do not know. I am going to stare more into it. It is possible
that I will get on top of it in the end.
I'll also try to apply the patch adding my approach. I wonder
if it makes a difference.
Best Regards,
Petr
On Thu, Aug 22, 2019 at 03:50:52PM +0200, Petr Mladek wrote:
> On Wed 2019-08-21 07:46:28, John Ogness wrote:
> > On 2019-08-20, Sergey Senozhatsky <[email protected]> wrote:
> > > [..]
> > >> > + *
> > >> > + * Memory barrier involvement:
> > >> > + *
> > >> > + * If dB reads from gA, then dC reads from fG.
> > >> > + * If dB reads from gA, then dD reads from fH.
> > >> > + * If dB reads from gA, then dE reads from fE.
> > >> > + *
> > >> > + * Note that if dB reads from gA, then dC cannot read from fC.
> > >> > + * Note that if dB reads from gA, then dD cannot read from fD.
> > >> > + *
> > >> > + * Relies on:
> > >> > + *
> > >> > + * RELEASE from fG to gA
> > >> > + * matching
> > >> > + * ADDRESS DEP. from dB to dC
> > >> > + *
> > >> > + * RELEASE from fH to gA
> > >> > + * matching
> > >> > + * ADDRESS DEP. from dB to dD
> > >> > + *
> > >> > + * RELEASE from fE to gA
> > >> > + * matching
> > >> > + * ACQUIRE from dB to dE
> > >> > + */
> > >>
> > >> But I am not sure how much this is useful. It would take ages to decrypt
> > >> all these shortcuts (signs) and translate them into something
> > >> human readable. Also it might get outdated easily.
> > >>
> > The labels are necessary for the technical documentation of the
> > barriers. And, after spending much time in this, I find them very
> > useful. But I agree that there needs to be a better way to assign label
> > names.
>
> I could understand that you spend a lot of time on creating the
> labels and that they are somehow useful for you.
>
> But I am not using them and I hope that I will not have to:
>
> + Grepping takes a lot of time, especially over several files.
>
> + Grepping is actually not enough. It is required to read
> the following comment or code to realize what the label is for.
>
> + Several barriers have multiple dependencies. Grepping one
> label helps to check that one connection makes sense.
> But it is hard to keep all relations in head to confirm
> that they are complete and make sense overall.
>
> + There are about 50 labels in the code. "Entry Lifecycle"
> section in dataring.c talks about 8 step. One would
> expect that it would require 8 read and 8 write barriers.
>
> Even coordination of 16 barriers might be complicated to check.
> Where 50 is just scary.
>
>
> + It seems to be a newly invented format and it is not documented.
> I personally do not understand it completely, for example,
> the meaning of "RELEASE from jA->cD->hA to jB".
IIUC, something like "hA is the interested access, happening within
cD (should have been cC?), which in turn happens within jA". But I
should defer to John (FWIW, I found that notation quite helpful).
>
>
> I hope that we could do better. I believe that human readable
> comments all less error prone because they describe the intention.
> Pseudo code based on labels just describes the code but it
> does not explain why it was done this way.
>
> From my POV, the labels do more harm than good. The code gets
> too scattered and is harder to follow.
>
>
> > I hope that we can agree that the labels are important.
>
> It would be great to hear from others.
I agree with you that reviewing these comments might be "scary" and
not suitable for a bed-reading ;-) (I didn't have time to complete
such review yet). OTOH, from my POV, removing such comments/labels
could only make such (and future) reviews scarier, because then the
(memory-ordering) "intention" would then be _hidden in the code.
>
> > And that a formal documentation of the barriers is also important.
>
> It might be helpful if it can be somehow feed to a tool that would
> prove correctness. Is this the case?
From what I've read so far, it _should be relatively straighforward
to write down a litmus test from any such comment (and give this to
the LKMM simulator).
>
> In each case, it should follow some "widely" used format.
> We should not invent a new one that nobody else would use
> and understand.
Agreed. Well, litmus tests (or the comments here in question, that
are intended to convey the same information) have been successfully
adopted by memory model and concurrency people for as long as I can
remember, current architecture reference manuals use these tools to
describe the semantics of fence or atomic instructions, discussions
about memory barriers on LKML, gcc MLs often reduce to a discussion
around one or more litmus tests...
[trimming]
> > Andrea suggested that the documentation should be within the code, which
> > I think is a good idea. Even if it means we have more comments than
> > code.
>
> It depends on the type of the information. I would describe:
>
> + The overall design on top of the source file or in
> Documentation/...
>
> + The behavior of externally used API and non-obvious functions
> above the function definition.
>
> + Implementation details, non-obvious effects, side effects,
> relations, meaning of tricky calculation, meaning of
> a block of code inside the code. But each function should
> ideally fit on the screen.
>
> I personally tend to write more documentation but it is sometimes
> too much. I am trying to become more effective and to the point.
Unfortunately, I don't know of more concise ways to convey the same
information that these comments are intended to provide. Thoughts?
Please don't get me wrong: I'm all for overall design, external API,
etc., if some improvements can be achieved here.
Andrea
On (08/21/19 07:46), John Ogness wrote:
[..]
> The labels are necessary for the technical documentation of the
> barriers. And, after spending much time in this, I find them very
> useful. But I agree that there needs to be a better way to assign label
> names.
[..]
> > Where dp stands for descriptor push. For dataring we can add a 'dr'
> > prefix, to avoid confusion with desc barriers, which have 'd' prefix.
> > And so on. Dunno.
>
> Yeah, I spent a lot of time going in circles on this one.
[..]
> I hope that we can agree that the labels are important. And that a
> formal documentation of the barriers is also important. Yes, they are a
> lot of work, but I find it makes it a lot easier to go back to the code
> after I've been away for a while. Even now, as I go through your
> feedback on code that I wrote over a month ago, I find the formal
> comments critical to quickly understand _exactly_ why the memory
> barriers exist.
Yeah. I like those tagsi/labels, and appreciate your efforts.
Speaking about it in general, not necessarily related to printk patch set.
With or without labels/tags we still have to grep. But grep-ing is much
easier when we have labels/tags. Otherwise it's sometimes hard to understand
what to grep for - _acquire, _relaxed, smp barrier, write_once, or
anything else.
> Perhaps we should choose labels that are more clear, like:
>
> dataring_push:A
> dataring_push:B
>
> Then we would see comments like:
>
> Memory barrier involvement:
>
> If _dataring_pop:B reads from dataring_datablock_setid:A, then
> _dataring_pop:C reads from dataring_push:G.
[..]
> RELEASE from dataring_push:E to dataring_datablock_setid:A
> matching
> ACQUIRE from _dataring_pop:B to _dataring_pop:E
I thought about it. That's very informative, albeit pretty hard to maintain.
The same applies to drA or prA and any other context dependent prefix.
> But then how should the labels in the code look? Just the letter looks
> simple in code, but cannot be grepped.
Yes, good point.
> dataring_push()
> {
> ...
> /* E */
> ...
> }
If only there was something as cool as grep-ing, but cooler. Something
that "just sucks less". Something that even folks like myself could use.
Bare with me.
Apologies. This email is rather long; but it's pretty easy to read.
Let's see if this can fly.
So what I did.
I changed several LMM tags/labels definitions, so they have common format:
LMM_TAG(name)
I don't insist on this particular naming scheme, it can be improved.
======================================================================
diff --git a/kernel/printk/dataring.c b/kernel/printk/dataring.c
index e48069dc27bc..54eb28d47d30 100644
--- a/kernel/printk/dataring.c
+++ b/kernel/printk/dataring.c
@@ -577,11 +577,11 @@ char *dataring_push(struct dataring *dr, unsigned long size,
to_db_size(&size);
do {
- /* fA: */
+ /* LMM_TAG(fA) */
ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
/*
- * fB:
+ * LMM_TAG(fB)
*
* The data ringbuffer tail may have been pushed (by this or
* any other task). The updated @tail_lpos must be visible to
@@ -621,7 +621,7 @@ char *dataring_push(struct dataring *dr, unsigned long size,
if (!ret) {
/*
- * fC:
+ * LMM_TAG(fC)
*
* Force @desc permanently invalid to minimize risk
* of the descriptor later unexpectedly being
@@ -631,14 +631,14 @@ char *dataring_push(struct dataring *dr, unsigned long size,
dataring_desc_init(desc);
return NULL;
}
- /* fE: */
+ /* LMM_TAG(fE) */
} while (atomic_long_cmpxchg_relaxed(&dr->head_lpos, begin_lpos,
next_lpos) != begin_lpos);
db = to_datablock(dr, begin_lpos);
/*
- * fF:
+ * LMM_TAG(fF)
*
* @db->id is a garbage value and could possibly match the @id. This
* would be a problem because the data block would be considered
@@ -648,7 +648,7 @@ char *dataring_push(struct dataring *dr, unsigned long size,
WRITE_ONCE(db->id, id - 1);
/*
- * fG:
+ * LMM_TAG(fG)
*
* Ensure that @db->id is initialized to a wrong ID value before
* setting @begin_lpos so that there is no risk of accidentally
@@ -668,7 +668,7 @@ char *dataring_push(struct dataring *dr, unsigned long size,
*/
smp_store_release(&desc->begin_lpos, begin_lpos);
- /* fH: */
+ /* LMM_TAG(fH) */
WRITE_ONCE(desc->next_lpos, next_lpos);
/* If this data block wraps, use @data from the content data block. */
diff --git a/kernel/printk/numlist.c b/kernel/printk/numlist.c
index 16c6ffa74b01..285e0431dbf8 100644
--- a/kernel/printk/numlist.c
+++ b/kernel/printk/numlist.c
@@ -338,7 +338,7 @@ struct nl_node *numlist_pop(struct numlist *nl)
tail_id = atomic_long_read(&nl->tail_id);
for (;;) {
- /* cB */
+ /* LMM_TAG(cB) */
while (!numlist_read(nl, tail_id, NULL, &next_id)) {
/*
* @tail_id is invalid. Try again with an
@@ -357,6 +357,7 @@ struct nl_node *numlist_pop(struct numlist *nl)
/*
* cC:
+ * LMM_TAG(cD)
*
* Make sure the node is not busy.
*/
@@ -368,7 +369,7 @@ struct nl_node *numlist_pop(struct numlist *nl)
if (r == tail_id)
break;
- /* cA: #3 */
+ /* LMM_TAG(cA) #3 */
tail_id = r;
}
======================================================================
Okay.
Next, I added the following simple quick-n-dirty perl script:
======================================================================
Subject: [PATCH] add LMM_TAG parser
Signed-off-by: Sergey Senozhatsky <[email protected]>
---
scripts/ctags-parse-lmm.pl | 45 ++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
create mode 100755 scripts/ctags-parse-lmm.pl
diff --git a/scripts/ctags-parse-lmm.pl b/scripts/ctags-parse-lmm.pl
new file mode 100755
index 000000000000..785f6945c936
--- /dev/null
+++ b/scripts/ctags-parse-lmm.pl
@@ -0,0 +1,45 @@
+#!/usr/bin/perl
+#
+# SPDX-License-Identifier: GPL-2.0
+#
+# Parse Linux Memory Model tags and add corresponding entries to the ctags file
+#
+# LMM ?ber Alles!
+
+use strict;
+
+sub parse($$)
+{
+ my ($t, $f) = @_;
+ my $ctags;
+ my $file;
+
+ if (!open($ctags, '>>', $t)) {
+ print "Could not open $t: $!\n";
+ exit 1;
+ }
+
+ if (!open($file, '<', $f)) {
+ print "Could not open $f: $1\n";
+ exit 1;
+ }
+
+ while (my $row = <$file>) {
+ chomp $row;
+
+ if ($row =~ m/LMM_TAG\((.+)\)/) {
+ # yup...
+ print $ctags "$1\t$f\t/LMM_TAG($1)/;\"\td\n";
+ }
+ }
+ close($file);
+ close($ctags);
+}
+
+if ($#ARGV != 1) {
+ print "Usage:\n\tscripts/ctags-parse-lmm.pl tags C-file-to-parse\n";
+ exit 1;
+}
+
+parse($ARGV[0], $ARGV[1]);
+exit 0;
--
2.23.0
======================================================================
The next thing I did was
./scripts/ctags-parse-lmm.pl ./tags kernel/printk/dataring.c
./scripts/ctags-parse-lmm.pl ./tags kernel/printk/numlist.c
./scripts/ctags-parse-lmm.pl ./tags kernel/printk/ringbuffer.c
These 3 commands added the following entries to the tags file
(I'm using ctags and vim)
======================================================================
$ tail tags
fA kernel/printk/dataring.c /LMM_TAG(fA)/;" d
fB kernel/printk/dataring.c /LMM_TAG(fB)/;" d
fC kernel/printk/dataring.c /LMM_TAG(fC)/;" d
fE kernel/printk/dataring.c /LMM_TAG(fE)/;" d
fF kernel/printk/dataring.c /LMM_TAG(fF)/;" d
fG kernel/printk/dataring.c /LMM_TAG(fG)/;" d
fH kernel/printk/dataring.c /LMM_TAG(fH)/;" d
cB kernel/printk/numlist.c /LMM_TAG(cB)/;" d
cD kernel/printk/numlist.c /LMM_TAG(cD)/;" d
cA kernel/printk/numlist.c /LMM_TAG(cA)/;" d
======================================================================
So now when I perform LMM tag search or jump to a tag definition, vim
goes exactly to the line where the corresponding LMM_TAG was defined.
Example:
kernel/printk/ringbuffer.c
RELEASE from jA->cD->hA to jB
^
C-] // jump to tag under cursor
vim goes to kernel/printk/numlist.c
360 * LMM_TAG(cD)
^
Exactly where cD was defined.
Welcome to the future!
> Andrea suggested that the documentation should be within the code, which
> I think is a good idea. Even if it means we have more comments than
> code.
I agree that such documentation is handy. It, probably, would be even better
if we could use some tooling to make it easier to use.
-ss
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> --- /dev/null
> +++ b/kernel/printk/numlist.c
> +/**
> + * numlist_push() - Add a node to the list and assign it a sequence number.
> + *
> + * @nl: The numbered list to push to.
> + *
> + * @n: A node to push to the numbered list.
> + * The node must not already be part of a list.
> + *
> + * @id: The ID of the node.
> + *
> + * A node is added in two steps: The first step is to make this node the
> + * head, which causes a following push to add to this node. The second step is
> + * to update @next_id of the former head node to point to this one, which
> + * makes this node visible to any task that sees the former head node.
> + */
> +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
> +{
> + unsigned long head_id;
> + unsigned long seq;
> + unsigned long r;
> +
> + /*
> + * bA:
> + *
> + * Setup the node to be a list terminator: next_id == id.
> + */
> + WRITE_ONCE(n->next_id, id);
Do we need WRITE_ONCE() here?
Both "n" and "id" are given as parameters and do not change.
The assigment must be done before "id" is set as nl->head_id.
The ordering is enforced by cmpxchg_release().
> +
> + /* bB: #1 */
> + head_id = atomic_long_read(&nl->head_id);
> +
> + for (;;) {
> + /* bC: */
> + while (!numlist_read(nl, head_id, &seq, NULL)) {
> + /*
> + * @head_id is invalid. Try again with an
> + * updated value.
> + */
> +
> + cpu_relax();
I have got very confused by this. cpu_relax() suggests that this
cycle is busy waiting until a particular node becomes valid.
My first though was that it must cause deadlock in NMI when
the interrupted code is supposed to make the node valid.
But it is the other way. The head is always valid when it is
added to the list. It might become invalid when another CPU
moves the head and the old one gets reused.
Anyway, I do not see any reason for cpu_relax() here.
Also the entire cycle would deserve a comment to avoid this mistake.
For example:
/*
* bC: Read seq from current head. Repeat with new
* head when it has changed and the old one got reused.
*/
> +
> + /* bB: #2 */
> + head_id = atomic_long_read(&nl->head_id);
> + }
> +
> + /*
> + * bD:
> + *
> + * Set @seq to +1 of @seq from the previous head.
> + *
> + * Memory barrier involvement:
> + *
> + * If bB reads from bE, then bC->aA reads from bD.
> + *
> + * Relies on:
> + *
> + * RELEASE from bD to bE
> + * matching
> + * ADDRESS DEP. from bB to bC->aA
> + */
> + WRITE_ONCE(n->seq, seq + 1);
Do we really need WRITE_ONCE() here?
It is the same problem as with setting n->next_id above.
> +
> + /*
> + * bE:
> + *
> + * This store_release() guarantees that @seq and @next are
> + * stored before the node with @id is visible to any popping
> + * writers. It pairs with the address dependency between @id
> + * and @seq/@next provided by numlist_read(). See bD and bF
> + * for details.
> + */
> + r = atomic_long_cmpxchg_release(&nl->head_id, head_id, id);
> + if (r == head_id)
> + break;
> +
> + /* bB: #3 */
> + head_id = r;
> + }
> +
> + n = nl->node(head_id, nl->node_arg);
> +
> + /*
> + * The old head (which is still the list terminator), cannot be
> + * removed because the list will always have at least one node.
> + * Therefore @n must be non-NULL.
> + */
Please, move this comment above the nl->node() call. Both locations
makes sense. I just see it as an important note for the call and thus
is should be above. Also it will be better separated from the below
comments for the _release() barrier.
> + /*
> + * bF: the STORE part for @next_id
> + *
> + * Set @next_id of the previous head to @id.
> + *
> + * Memory barrier involvement:
> + *
> + * If bB reads from bE, then bF overwrites bA.
> + *
> + * Relies on:
> + *
> + * RELEASE from bA to bE
> + * matching
> + * ADDRESS DEP. from bB to bF
> + */
> + /*
> + * bG: the RELEASE part for @next_id
> + *
> + * This _release() guarantees that a reader will see the updates to
> + * this node's @seq/@next_id if the reader saw the @next_id of the
> + * previous node in the list. It pairs with the address dependency
> + * between @id and @seq/@next provided by numlist_read().
> + *
> + * Memory barrier involvement:
> + *
> + * If aB reads from bG, then aA' reads from bD, where aA' is in
> + * numlist_read() to read the node ID from bG.
> + * If aB reads from bG, then aB' reads from bA, where aB' is in
> + * numlist_read() to read the node ID from bG.
> + *
> + * Relies on:
> + *
> + * RELEASE from bG to bD
> + * matching
> + * ADDRESS DEP. from aB to aA'
> + *
> + * RELEASE from bG to bA
> + * matching
> + * ADDRESS DEP. from aB to aB'
> + */
> + smp_store_release(&n->next_id, id);
Sigh, I see this line one screen below the previous command thanks
to the extensive comments. Well, bF comment looks redundant.
> +}
Best Regards,
Petr
On (08/22/19 15:50), Petr Mladek wrote:
[..]
> I could understand that you spend a lot of time on creating the
> labels and that they are somehow useful for you.
>
> But I am not using them and I hope that I will not have to:
>
> + Grepping takes a lot of time, especially over several files.
But without labels one still has to grep. A label at least points
to one exact location.
> + Grepping is actually not enough. It is required to read
> the following comment or code to realize what the label is for.
>
> + Several barriers have multiple dependencies. Grepping one
> label helps to check that one connection makes sense.
> But it is hard to keep all relations in head to confirm
> that they are complete and make sense overall.
Hmm. Labels don't add dependencies per se. Those tricky and hard to
follow dependencies will still be there, even if we'd remove
labels from comments. Labels just attempt to document them and
to show the intent.
The most important label, which should be added, is John's cell
phone number. So people can call/text him when something is not
working ;)
> + There are about 50 labels in the code. "Entry Lifecycle"
> section in dataring.c talks about 8 step. One would
> expect that it would require 8 read and 8 write barriers.
>
> Even coordination of 16 barriers might be complicated to check.
> Where 50 is just scary.
>
> + It seems to be a newly invented format and it is not documented.
> I personally do not understand it completely, for example,
> the meaning of "RELEASE from jA->cD->hA to jB".
I was under impression that this is the lingo used by LMM, but
can't find it in Documentation.
I agree, things can be improved and, may be, standardized.
It feels that tooling is a big part of the problem here.
-ss
On Fri 2019-08-23 14:54:45, Sergey Senozhatsky wrote:
> On (08/21/19 07:46), John Ogness wrote:
> [..]
> > The labels are necessary for the technical documentation of the
> > barriers. And, after spending much time in this, I find them very
> > useful. But I agree that there needs to be a better way to assign label
> > names.
> [..]
> > > Where dp stands for descriptor push. For dataring we can add a 'dr'
> > > prefix, to avoid confusion with desc barriers, which have 'd' prefix.
> > > And so on. Dunno.
> >
> > Yeah, I spent a lot of time going in circles on this one.
> [..]
> > I hope that we can agree that the labels are important. And that a
> > formal documentation of the barriers is also important. Yes, they are a
> > lot of work, but I find it makes it a lot easier to go back to the code
> > after I've been away for a while. Even now, as I go through your
> > feedback on code that I wrote over a month ago, I find the formal
> > comments critical to quickly understand _exactly_ why the memory
> > barriers exist.
>
> Yeah. I like those tagsi/labels, and appreciate your efforts.
>
> Speaking about it in general, not necessarily related to printk patch set.
> With or without labels/tags we still have to grep. But grep-ing is much
> easier when we have labels/tags. Otherwise it's sometimes hard to understand
> what to grep for - _acquire, _relaxed, smp barrier, write_once, or
> anything else.
Grepping is not needed when function names are used in the comment
and cscope might be used. Each function should be short
and easy enough so that any nested label can be found by eyes.
A custom script is an alternative but it would be better to use
existing tools.
In each case, two letter labels would get redundant sooner or later
when the semantic gets used widely. And I hope that we use a semantic
that is going to be used widely.
> > Perhaps we should choose labels that are more clear, like:
> >
> > dataring_push:A
> > dataring_push:B
> >
> > Then we would see comments like:
> >
> > Memory barrier involvement:
> >
> > If _dataring_pop:B reads from dataring_datablock_setid:A, then
> > _dataring_pop:C reads from dataring_push:G.
> [..]
> > RELEASE from dataring_push:E to dataring_datablock_setid:A
> > matching
> > ACQUIRE from _dataring_pop:B to _dataring_pop:E
>
> I thought about it. That's very informative, albeit pretty hard to maintain.
> The same applies to drA or prA and any other context dependent prefix.
The maintenance is my concern as well. The labels should be primary
for an automatized consistency checker. They make sense only when
they are in sync with the code.
Best Regards,
Petr
On Thu 2019-08-22 19:38:01, Andrea Parri wrote:
> On Thu, Aug 22, 2019 at 03:50:52PM +0200, Petr Mladek wrote:
> > On Wed 2019-08-21 07:46:28, John Ogness wrote:
> > > On 2019-08-20, Sergey Senozhatsky <[email protected]> wrote:
> > > > [..]
> > > >> > + *
> > > >> > + * Memory barrier involvement:
> > > >> > + *
> > > >> > + * If dB reads from gA, then dC reads from fG.
> > > >> > + * If dB reads from gA, then dD reads from fH.
> > > >> > + * If dB reads from gA, then dE reads from fE.
> > > >> > + *
> > > >> > + * Note that if dB reads from gA, then dC cannot read from fC.
> > > >> > + * Note that if dB reads from gA, then dD cannot read from fD.
> > > >> > + *
> > > >> > + * Relies on:
> > > >> > + *
> > > >> > + * RELEASE from fG to gA
> > > >> > + * matching
> > > >> > + * ADDRESS DEP. from dB to dC
> > > >> > + *
> > > >> > + * RELEASE from fH to gA
> > > >> > + * matching
> > > >> > + * ADDRESS DEP. from dB to dD
> > > >> > + *
> > > >> > + * RELEASE from fE to gA
> > > >> > + * matching
> > > >> > + * ACQUIRE from dB to dE
> > > >> > + */
> > > >>
> > > >> But I am not sure how much this is useful. It would take ages to decrypt
> > > >> all these shortcuts (signs) and translate them into something
> > > >> human readable. Also it might get outdated easily.
> > > >>
> > > The labels are necessary for the technical documentation of the
> > > barriers. And, after spending much time in this, I find them very
> > > useful. But I agree that there needs to be a better way to assign label
> > > names.
> >
> > I could understand that you spend a lot of time on creating the
> > labels and that they are somehow useful for you.
> >
> > But I am not using them and I hope that I will not have to:
> >
> > + Grepping takes a lot of time, especially over several files.
> >
> > + Grepping is actually not enough. It is required to read
> > the following comment or code to realize what the label is for.
> >
> > + Several barriers have multiple dependencies. Grepping one
> > label helps to check that one connection makes sense.
> > But it is hard to keep all relations in head to confirm
> > that they are complete and make sense overall.
> >
> > + There are about 50 labels in the code. "Entry Lifecycle"
> > section in dataring.c talks about 8 step. One would
> > expect that it would require 8 read and 8 write barriers.
> >
> > Even coordination of 16 barriers might be complicated to check.
> > Where 50 is just scary.
> >
> >
> > + It seems to be a newly invented format and it is not documented.
> > I personally do not understand it completely, for example,
> > the meaning of "RELEASE from jA->cD->hA to jB".
>
> IIUC, something like "hA is the interested access, happening within
> cD (should have been cC?), which in turn happens within jA". But I
> should defer to John (FWIW, I found that notation quite helpful).
>
>
> >
> >
> > I hope that we could do better. I believe that human readable
> > comments all less error prone because they describe the intention.
> > Pseudo code based on labels just describes the code but it
> > does not explain why it was done this way.
> >
> > From my POV, the labels do more harm than good. The code gets
> > too scattered and is harder to follow.
> >
> >
> > > I hope that we can agree that the labels are important.
> >
> > It would be great to hear from others.
>
> I agree with you that reviewing these comments might be "scary" and
> not suitable for a bed-reading ;-) (I didn't have time to complete
> such review yet). OTOH, from my POV, removing such comments/labels
> could only make such (and future) reviews scarier, because then the
> (memory-ordering) "intention" would then be _hidden in the code.
I am not suggesting to remove all comments. Some human readable
explanation is important as long as the code is developed by humans.
I think that I'll have to accept also the extra comments if you are
really going to use them to check the consistency by a tool. Or
if they are really used for review by some people.
> > > And that a formal documentation of the barriers is also important.
> >
> > It might be helpful if it can be somehow feed to a tool that would
> > prove correctness. Is this the case?
>
> >From what I've read so far, it _should be relatively straighforward
> to write down a litmus test from any such comment (and give this to
> the LKMM simulator).
Sounds good.
> > In each case, it should follow some "widely" used format.
> > We should not invent a new one that nobody else would use
> > and understand.
>
> Agreed. Well, litmus tests (or the comments here in question, that
> are intended to convey the same information) have been successfully
> adopted by memory model and concurrency people for as long as I can
> remember, current architecture reference manuals use these tools to
> describe the semantics of fence or atomic instructions, discussions
> about memory barriers on LKML, gcc MLs often reduce to a discussion
> around one or more litmus tests...
Do all this manuals, tools, people use any common syntax, please?
Would it be usable in our case as well?
I would like to avoid reinventing the wheel. Also I do not want
to create a dialect for few people that other potentially interested
parties will not understand.
Best Regards,
Petr
> I am not suggesting to remove all comments. Some human readable
> explanation is important as long as the code is developed by humans.
>
> I think that I'll have to accept also the extra comments if you are
> really going to use them to check the consistency by a tool. Or
> if they are really used for review by some people.
Glad to hear this. Thank you, Petr.
> Do all this manuals, tools, people use any common syntax, please?
> Would it be usable in our case as well?
>
> I would like to avoid reinventing the wheel. Also I do not want
> to create a dialect for few people that other potentially interested
> parties will not understand.
Right; I think that terms such as "(barrier) matching", "reads-from"
and "overwrites" are commonly used to refer to litmus tests. (The
various primitives/instructions are of course specific to the given
context: the language, the memory model, etc. )
IOW, I'd say that that wheel _and a common denominator here can be
represented by the notion of "litmus test". I'm not suggesting to
reinvent this wheel of course; my point was more along the lines of
"let's use the wheel, it'll be helpful..." ;-)
Andrea
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> --- /dev/null
> +++ b/kernel/printk/numlist.c
> @@ -0,0 +1,375 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/sched.h>
> +#include "numlist.h"
struct numlist is really special variant of a list. Let me to
do a short summary:
+ FIFO queue interface
+ nodes sequentially numbered
+ nodes referenced by ID instead pointers to avoid ABA problems
+ requires custom node() callback to get pointer for given ID
+ lockless access:
+ pushed nodes must not longer get modified by push() caller
+ pop() caller gets exclusive write access, except that they
must modify ID first and do smp_wmb() later
+ pop() does not work:
+ tail node is "busy"
+ needs a custom callback that defines when a node is busy
+ tail is the last node
+ needed for lockless sequential numbering
I will start with one inevitable question ;-) Is it realistic to find
another user for this API, please?
I am not sure that all the indirections, caused by the generic API,
are worth the gain.
Well, the separate API makes sense anyway. I have some ideas that
might make it cleaner.
The barriers are because of validating the ID. Now we have:
struct nl_node {
unsigned long seq;
unsigned long next_id;
};
that is used in:
struct prb_desc {
/* private */
atomic_long_t id;
struct dr_desc desc;
struct nl_node list;
};
What will happen when we move id from struct prb_desc into struct nl_node?
struct nl_node {
unsigned long seq;
atomic_long_t id;
unsigned long next_id;
};
struct prb_desc {
struct dr_desc desc;
struct nl_node list;
};
Then the "node" callback might just return the structure. It makes
perfect sense. struct nl_node is always static for a given id.
For the printk ringbuffer it would look like:
struct nl_node *prb_nl_get_node(unsigned long id, void *nl_user)
{
struct printk_ringbuffer *rb = (struct printk_ringbuffer *)nl_user;
struct prb_desc *d = to_desc(rb, id);
return &d->list;
}
I would also hide the callback behind a generic wrapper:
struct nl_node *numlist_get_node(struct numlist *nl, unsigned long id)
{
return nl->get_node(id, nl->user_data);
}
Then we could have nicely symetric and self contained barriers
in numlist_read():
bool numlist_read(struct numlist *nl, unsigned long id, unsigned long *seq,
unsigned long *next_id)
{
struct nl_node *n;
unsigned long cur_id;
n = numlist_get_node(nl, id);
if (!n)
return false;
/*
* Make sure that seq and next_id values will be read
* for the expected id.
*/
cur_id = atomic_long_read_acquire(&n->id);
if (cur_id != id)
return false;
if (seq) {
*seq = n->seq;
if (next_id)
*next_id = n->next_id;
}
/*
* Make sure that seq and next_id values were read for
* the expected ID.
*/
cur_id = atomic_long_read_release(&n->id);
return cur_id == id;
}
numlist_push() might be the same, except the I would
remove several WRITE_ONCE as discussed in another mail:
void numlist_push(struct numlist *nl, struct nl_node *n)
{
unsigned long head_id;
unsigned long seq;
unsigned long r;
/* Setup the node to be a list terminator: next_id == id. */
n->next_id = n->id;
do {
do {
head_id = atomic_long_read(&nl->head_id);
} while (!numlist_read(nl, head_id, &seq, NULL));
n->seq = seq + 1;
/*
* This store_release() guarantees that @seq and @next are
* stored before the node with @id is visible to any popping
* writers.
*
* It pairs with the acquire() when tail_id gets updated
* in headlist_pop();
*/
} while (atomic_long_cmpxchg_release(&nl->head_id, head_id, id) !=
head_id);
n = nl->get_node(nl, head_id);
/*
* This barrier makes sure that nl->head_id already points to
* the newly pushed node.
*
* It pairs with acquire when new id is written in numlist_pop().
* It allows to pop() and reuse this node. It can not longer
* be the last one.
*/
smp_store_release(&n->next_id, id);
}
Then I would add a symetric callback that would generate ID for
a newly popped struct. It will allow to set new ID in the numlist
API and have the barriers symetric. Something like:
unsined long prb_new_node_id(unsigned long old_id, , void *nl_user)
{
struct printk_ringbuffer *rb = (struct printk_ringbuffer *)nl_user;
return id + DESCS_COUNT(rb);
}
Then we could hide it in
unsigned long numlist_get_new_id(struct numlist *nl, unsigned long id)
{
return nl->get_new_id(id, nl->user_data);
}
and do
struct nl_node *numlist_pop(struct numlist *nl)
{
struct nl_node *n;
unsigned long tail_id;
unsigned long next_id;
unsigned long r;
tail_id = atomic_long_read(&nl->tail_id);
do {
do {
tail_id = atomic_long_read(&nl->tail_id);
} while (!numlist_read(nl, tail_id, NULL, &next_id));
/* Make sure the node is not the only node on the list. */
if (next_id == tail_id)
return NULL;
/* Make sure the node is not busy. */
if (nl->busy(tail_id, nl->busy_arg))
return NULL;
/*
* Make sure that nl->tail_id is update before
* we start modyfying the popped node.
*
* It pairs with release() when head_id is
* pushed in numlist_push().
*/
} while (atomic_long_cmpxchg_acquire(&nl->tail_id,
tail_id, next_id) !=
tail_id);
/* Got exclusive write access to the node. */
n = numlist_get_node(nl, tail_id);
tail_id = numlist_get_new_id(tail_id, nl);
/*
* Make sure that we set new ID before we allow
* more changes in user structure handled by this node.
*
* It pairs with release() barrier when the node is
* pushed into the numlist again, gets linked to
* the previous node and can't be modified anymore.
* See numlist_push().
*/
atomic_long_set_acquire(&d->id, atomic_long_read(&d->id) +
DESCS_COUNT(rb));
return n;
}
I hope that it makes some sense. I feel exhausted. It is Friday
evening here. I just wanted to send it because it looked like the most
constructive idea that I had this week. And I wanted to send something
more positive ;-)
Best Regards,
Petr
On 2019-08-22, Petr Mladek <[email protected]> wrote:
>>>>> --- /dev/null
>>>>> +++ b/kernel/printk/ringbuffer.c
>>>>> +/**
>>>>> + * assign_desc() - Assign a descriptor to the caller.
>>>>> + *
>>>>> + * @e: The entry structure to store the assigned descriptor to.
>>>>> + *
>>>>> + * Find an available descriptor to assign to the caller. First it is checked
>>>>> + * if the tail descriptor from the committed list can be recycled. If not,
>>>>> + * perhaps a never-used descriptor is available. Otherwise, data blocks will
>>>>> + * be invalidated until the tail descriptor from the committed list can be
>>>>> + * recycled.
>>>>> + *
>>>>> + * Assigned descriptors are invalid until data has been reserved for them.
>>>>> + *
>>>>> + * Return: true if a descriptor was assigned, otherwise false.
>>>>> + *
>>>>> + * This will only fail if it was not possible to invalidate data blocks in
>>>>> + * order to recycle a descriptor. This can happen if a writer has reserved but
>>>>> + * not yet committed data and that reserved data is currently the oldest data.
>>>>> + */
>>>>> +static bool assign_desc(struct prb_reserved_entry *e)
>>>>> +{
>>>>> + struct printk_ringbuffer *rb = e->rb;
>>>>> + struct prb_desc *d;
>>>>> + struct nl_node *n;
>>>>> + unsigned long i;
>>>>> +
>>>>> + for (;;) {
>>>>> + /*
>>>>> + * jA:
>>>>> + *
>>>>> + * Try to recycle a descriptor on the committed list.
>>>>> + */
>>>>> + n = numlist_pop(&rb->nl);
>>>>> + if (n) {
>>>>> + d = container_of(n, struct prb_desc, list);
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + /* Fallback to static never-used descriptors. */
>>>>> + if (atomic_read(&rb->desc_next_unused) < DESCS_COUNT(rb)) {
>>>>> + i = atomic_fetch_inc(&rb->desc_next_unused);
>>>>> + if (i < DESCS_COUNT(rb)) {
>>>>> + d = &rb->descs[i];
>>>>> + atomic_long_set(&d->id, i);
>>>>> + break;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + /*
>>>>> + * No descriptor available. Make one available for recycling
>>>>> + * by invalidating data (which some descriptor will be
>>>>> + * referencing).
>>>>> + */
>>>>> + if (!dataring_pop(&rb->dr))
>>>>> + return false;
>>>>> + }
>>>>> +
>>>>> + /*
>>>>> + * jB:
>>>>> + *
>>>>> + * Modify the descriptor ID so that users of the descriptor see that
>>>>> + * it has been recycled. A _release() is used so that prb_getdesc()
>>>>> + * callers can see all data ringbuffer updates after issuing a
>>>>> + * pairing smb_rmb(). See iA for details.
>>>>> + *
>>>>> + * Memory barrier involvement:
>>>>> + *
>>>>> + * If dB->iA reads from jB, then dI reads the same value as
>>>>> + * jA->cD->hA.
>>>>> + *
>>>>> + * Relies on:
>>>>> + *
>>>>> + * RELEASE from jA->cD->hA to jB
>>>>> + * matching
>>>>> + * RMB between dB->iA and dI
>>>>> + */
>>>>> + atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
>>>>> + DESCS_COUNT(rb));
>>>>
>>>> atomic_long_set_release() might be a bit confusing here.
>>>> There is no related acquire.
>>
>> As the comment states, this release is for prb_getdesc() users. The
>> only prb_getdesc() user is _dataring_pop(). (i.e. the descriptor's
>> ID is not what _dataring_pop() was expecting), then the tail must
>> have moved and _dataring_pop() needs to see that. Since there are no
>> data dependencies between descriptor ID and tail_pos, an explicit
>> memory barrier is used. More on this below.
After reading through your post, I think you are pairing the wrong
barriers together. jB pairs with dH (i.e. the set_release() in
assign_desc() pairs with the smp_rmb() in _dataring_pop()).
(The comment for jB wrongly says dI instead of dH! Argh!)
> OK, let me show how complicated and confusing this looks for me:
I want to address all your points here. _Not_ because I want to justify
or defend my insanity, but because it may help to provide some clarity.
> + The two related barriers are in different source files
> and APIs:
>
> + assign_desc() in ringbuffer.c; ringbuffer API
> + _dataring_pop in dataring.c; dataring API
Agreed. This is a consequence of the ID management being within the
high-level ringbuffer code. I could have added an smp_rmb() to the NULL
case in prb_getdesc(). Then both barriers would be in the same
file. However, this would mean smp_rmb() is called many times
(particularly by readers) when it is not necessary.
> + Both the related barriers are around "id" manipulation.
> But one is in dataring, other is in descriptors array.
> One is about an old released "id". One is about a newly
> assigned "id".
dB is not the pairing barrier of jB. As dB's comment says, it pairs with
gA. (The load_acquire(id) in _dataring_pop() pairs with the
store_release(id) in dataring_datablock_setid().)
I should probably use a temporary variable so that the load_acquire(),
which needs a comment, can be separated from the dr->getdesc(), which
may or may not need a comment.
> + The release() barrier is called once for each assigned
> descriptor. The acquire() barrier is called more times
> or not at all depending on the amount of free space
> in dataring.
smp_rmb() is only called if dr->getdesc() fails. It matches exactly once
the latest set_release() for that descriptor.
> + prb_getdesc() is mentioned in the comment but the barrier
> is in _dataring_pop()
The comment also says, "See iA for details."
The comments for jB, iA, prb_getdesc(), and at the definition of struct
dataring all talk about how a matching smp_rmb() will be used. Yet
somehow you seem to think that the load_acquire() is the matching
barrier?
I agree that having a callback with memory barrier constraints is
complicated. I will look to see if I can simplify this. Or maybe instead
of saying "this matches the smp_rmb() in callers of prb_getdesc()" I
should say "this matches the smp_rmb() in _dataring_pop()".
> + prb_getdesc() is called via dr->getdesc() callback and thus
> not straightforward to check.
Agreed. Especially when memory barriers are involved. This needs to be
documented more clearly. What makes this particular case complicated is
that prb_getdesc() doesn't have any related memory barriers. The
synchronization is really just between assign_desc() and
_dataring_pop(). prb_getdesc() is only involved because it is the
"middle man" that translates the ID to the expected descriptor.
> + dr->getdesc() is called twice in _dataring_pop(); once
> with _acquire() and once without.
The _acquire() is not related to dr->getdesc(). This feedback should be
"db->id is loaded twice in _dataring_pop(); once with _acquire() and
once without."
The load with load_acquire() (dB) has comments that pretty throroughly
describe why the acquire is needed. The second load (dF) is just a check
if the ID hasn't unexpectedly changed. It doesn't need any memory
barriers. But if you grep for dF you will see that there is another
memory barrier pair that is used to ensure that this check is valid
(dC/fG).
> + _acquire() is hidden in
> desc = dr->getdesc(smp_load_acquire(&db->id), dr->getdesc_arg)
Agreed. Should be moved out.
> + The comment says that it is pairing with smb_rmb() but it
> the code uses _acquire().
The comment is correct. It is pairing with the smp_rmb() (later in the
function).
> + The comment says that the barrier is issued "so that callers can
> see all data ringbuffer updates". It is not specific what
> updates are meant (earlier or later).
The comment says:
A _release() is used so that prb_getdesc() callers can see all data
ringbuffer updates after issuing a pairing smb_rmb().
> It can be guessed by the the type of the barrier. But it does
> help with review (barrier matches what author wanted).
Perhaps something like:
A _release() is used so that the matching smp_rmb() in
_dataring_pop() can see (at least) the data ringbuffer values at the
time of the _release(). For _dataring_pop() it is critical that it
does not see a tail_lpos value older than the one at the time of the
_release().
> What would have helped me to understand this barrier might
> be something like:
>
> /*
> * Got descriptor and have exclusive write access.
> * Use _release() barrier before first modification
> * so that others could detect the new owner via
> * previous numlist and dataring head/tail updates.
> *
> * The related barrier is in _dataring_pop() when
> * acquiring db->id.
> */
>
> It explains what the barrier is synchronizing and where is
> the counter part.
Except that isn't the reason for the release. I hope the above
suggestion is acceptable?
> But it still does not explain if the counter part is correct.
> I simply do not know. Barriers are usually symmetric but the
> only symmetric thing here is the name of the variable ("id").
>
> It might be correct after all. But it looks so non-standard
> and far from obvious at least for me. I hope that we could
> either make it more symmetric and better explain it.
It is symmetric. But horribly documented? Or designed? It is not obvious
to me how I could refactor this to make it clean. The dataring knows
nothing about struct prb_desc and its ID (nor should it). The dataring
is not responsible for the descriptor IDs.
>>> Sigh, I have to admit that I am not familiar with the _acquire(),
>>> _release(), and _relaxed() variants of the atomic operations.
>>>
>>> They probably make it easier to implement some locking API.
>>> I am not sure how to use it here. This code implements a complex
>>> interlock between several variables. I mean that several variables
>>> lock each other in a cycle, like a state machine? In each case,
>>> it is not a simple locking where we check state of a single
>>> variable.
>>
>> Keep in mind that dataring and numlist were written independent of the
>> ringbuffer. They are structures with very specific purposes and their
>> own set of variables (and memory barriers to order _those_
>> variables). The high-level ringbuffer also has its own variables and
>> memory barriers. Sometimes there is overlap, which is implemented in the
>> callbacks (as is here), which is why the dataring callback getdesc() has
>> the implementation requirement that a following smp_rmb() by the caller
>> will guarantee seeing an updated dataring tail. But these overlaps are
>> the exception, not the rule.
>
> Sure. It is possible that this is the worst place. But there are definitely
> more of them:
>
> + smp_rmb() in numlist_read() is related to smp_wmb() in prb_reserve()
Correct.
> + "node" callback in struct nl_node must do smp_wmb()
I don't know what you mean here. The only smp_wmb() is in prb_reserve().
> + full memory barrier is required before calling get_new_lpos()
The full memory barrier is required _after_ calling get_new_lpos() if
the caller will later modify certain variables (listed in the function
comments). I think the comments at the full memory barrier (fB) explain
pretty well why that is needed. Since get_new_lpos() is a static helper
function, I'm not sure it is appropriate for what you are trying to list
here.
> I think that I understand the overall algorithm in principle. But it
> is really hard to prove that all pieces play well together.
>
> I would like to discuss the overall design in a separate thread.
> I wanted to comment and understand some details first.
I'll let you start that thread so you can decide the Cc list.
At Linux Plumbers in Lisbon I hope you can find some time so that we can
go through some of this together. I really appreciate your perspective.
> It is clear that you put a lot of effort into it.
The amount of effort does not matter. It definitely is not a reason to
accept code.
> Also it is great that you tried to formalize the barriers. But
> it is still very complicated.
I've been swimming in this code so long, I'm having a hard time
documenting the key points. I am sure over time we can improve
that. Right now I need to get it to the point where at least it is
reviewable.
John Ogness
On 2019-08-20, Petr Mladek <[email protected]> wrote:
>> +/**
>> + * dataring_push() - Reserve a data block in the data array.
>> + *
>> + * @dr: The data ringbuffer to reserve data in.
>> + *
>> + * @size: The size to reserve.
>> + *
>> + * @desc: A pointer to a descriptor to store the data block information.
>> + *
>> + * @id: The ID of the descriptor to be associated.
>> + * The data block will not be set with @id, but rather initialized with
>> + * a value that is explicitly different than @id. This is to handle the
>> + * case when newly available garbage by chance matches the descriptor
>> + * ID.
>> + *
>> + * This function expects to move the head pointer forward. If this would
>> + * result in overtaking the data array index of the tail, the tail data block
>> + * will be invalidated.
>> + *
>> + * Return: A pointer to the reserved writer data, otherwise NULL.
>> + *
>> + * This will only fail if it was not possible to invalidate the tail data
>> + * block.
>> + */
>> +char *dataring_push(struct dataring *dr, unsigned int size,
>> + struct dr_desc *desc, unsigned long id)
>> +{
>> + unsigned long begin_lpos;
>> + unsigned long next_lpos;
>> + struct dr_datablock *db;
>> + bool ret;
>> +
>> + to_db_size(&size);
>> +
>> + do {
>> + /* fA: */
>> + ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
>> +
>> + /*
>> + * fB:
>> + *
>> + * The data ringbuffer tail may have been pushed (by this or
>> + * any other task). The updated @tail_lpos must be visible to
>> + * all observers before changes to @begin_lpos, @next_lpos, or
>> + * @head_lpos by this task are visible in order to allow other
>> + * tasks to recognize the invalidation of the data
>> + * blocks.
>
> This sounds strange. The write barrier should be done only on CPU
> that really modified tail_lpos. I.e. it should be in _dataring_pop()
> after successful dr->tail_lpos modification.
The problem is that there are no data dependencies between the different
variables. When a new datablock is being reserved, it is critical that
all other observers see that the tail_lpos moved forward _before_ any
other changes. _dataring_pop() uses an smp_rmb() to synchronize for
tail_lpos movement. This CPU is about to make some changes and may have
seen an updated tail_lpos. An smp_wmb() is useless if this is not the
CPU that performed that update. The full memory barrier ensures that all
other observers will see what this CPU sees before any of its future
changes are seen.
You suggest an alternative implementation below. I will address that
there.
>> + * This pairs with the smp_rmb() in _dataring_pop() as well as
>> + * any reader task using smp_rmb() to post-validate data that
>> + * has been read from a data block.
>> +
>> + * Memory barrier involvement:
>> + *
>> + * If dE reads from fE, then dI reads from fA->eA.
>> + * If dC reads from fG, then dI reads from fA->eA.
>> + * If dD reads from fH, then dI reads from fA->eA.
>> + * If mC reads from fH, then mF reads from fA->eA.
>> + *
>> + * Relies on:
>> + *
>> + * FULL MB between fA->eA and fE
>> + * matching
>> + * RMB between dE and dI
>> + *
>> + * FULL MB between fA->eA and fG
>> + * matching
>> + * RMB between dC and dI
>> + *
>> + * FULL MB between fA->eA and fH
>> + * matching
>> + * RMB between dD and dI
>> + *
>> + * FULL MB between fA->eA and fH
>> + * matching
>> + * RMB between mC and mF
>> + */
>> + smp_mb();
>
> All these comments talk about sychronization against read barriers.
> It means that we would need a write barrier here. But it does
> not make much sense to do write barrier before actually
> writing dr->head_lpos.
I think my comments above address this.
> After all I think that we do not need any barrier here.
> The write barrier for dr->tail_lpos should be in
> _dataring_pop(). The read barrier is not needed because
> we are not reading anything here.
>
> Instead we should put a barrier after modyfying dr->head_lpos,
> see below.
Comments below.
>> + if (!ret) {
>> + /*
>> + * Force @desc permanently invalid to minimize risk
>> + * of the descriptor later unexpectedly being
>> + * determined as valid due to overflowing/wrapping of
>> + * @head_lpos. An unaligned @begin_lpos can never
>> + * point to a data block and having the same value
>> + * for @begin_lpos and @next_lpos is also invalid.
>> + */
>> +
>> + /* fC: */
>> + WRITE_ONCE(desc->begin_lpos, 1);
>> +
>> + /* fD: */
>> + WRITE_ONCE(desc->next_lpos, 1);
>> +
>> + return NULL;
>> + }
>> + /* fE: */
>> + } while (atomic_long_cmpxchg_relaxed(&dr->head_lpos, begin_lpos,
>> + next_lpos) != begin_lpos);
>> +
>
> We need a write barrier here to make sure that dr->head_lpos
> is updated before we start updating other values, e.g.
> db->id below.
My RFCv2 implemented it that way. The function was called data_reserve()
and it moved the head using cmpxchg_release(). For RFCv3 I changed to a
full memory barrier instead because using acquire/release here is a bit
messy. There are 2 different places where the acquire needed to be:
- In _dataring_pop() a load_acquire() of head_lpos would need to be
_before_ loading of begin_lpos and next_lpos.
- In prb_iter_next_valid_entry() a load_acquire() of head_lpos would
need to be at the beginning within the dataring_datablock_isvalid()
check (mC).
If smp_mb() is too heavy to call for every printk(), then we can move to
acquire/release. The comments of fB list exactly what is synchronized
(and where).
John Ogness
On 2019-08-25, John Ogness <[email protected]> wrote:
>>>>>> --- /dev/null
>>>>>> +++ b/kernel/printk/ringbuffer.c
>>>>>> +static bool assign_desc(struct prb_reserved_entry *e)
>>>>>> +{
[...]
>>>>>> + atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
>>>>>> + DESCS_COUNT(rb));
>>>>>
>>>>> atomic_long_set_release() might be a bit confusing here.
>>>>> There is no related acquire.
>>>
>>> As the comment states, this release is for prb_getdesc() users. The
>>> only prb_getdesc() user is _dataring_pop(). (i.e. the descriptor's
>>> ID is not what _dataring_pop() was expecting), then the tail must
>>> have moved and _dataring_pop() needs to see that. Since there are no
>>> data dependencies between descriptor ID and tail_pos, an explicit
>>> memory barrier is used. More on this below.
>
>> + The two related barriers are in different source files
>> and APIs:
>>
>> + assign_desc() in ringbuffer.c; ringbuffer API
>> + _dataring_pop in dataring.c; dataring API
>
> Agreed. This is a consequence of the ID management being within the
> high-level ringbuffer code. I could have added an smp_rmb() to the
> NULL case in prb_getdesc(). Then both barriers would be in the same
> file. However, this would mean smp_rmb() is called many times
> (particularly by readers) when it is not necessary.
What I wrote here is wrong. prb_getdesc() is not called "many times
(particularly by readers)". It is only called once within the writer
function _dataring_pop().
Looking at this again, I think it would be better to move the smp_rmb()
into the NULL case of prb_getdesc(). Then both barrier pairs are located
(and documented) in the same file. This also simplifies the
documentation by not saying "the caller's smp_rmb() everywhere".
I would also change _dataring_pop() so that the smp_rmb() is located
within the handling of the other two failed checks (begin_lpos !=
tail_lpos and !_datablock_valid()). Then the out: at the end is just
return atomic_long_read(&dr->tail_lpos).
After modifying the code in this way, I think it looks more straight
forward and would have avoided your confusion: The RMB in
dataring.c:_dataring_pop() matches the MB in dataring.c:dataring_push()
and the RMB in ringbuffer.c:prb_getdesc() matches the SET_RELEASE in
ringbuffer.c:assign_desc().
John Ogness
> > + /*
> > + * bA:
> > + *
> > + * Setup the node to be a list terminator: next_id == id.
> > + */
> > + WRITE_ONCE(n->next_id, id);
>
> Do we need WRITE_ONCE() here?
> Both "n" and "id" are given as parameters and do not change.
> The assigment must be done before "id" is set as nl->head_id.
> The ordering is enforced by cmpxchg_release().
(Disclaimer: this is still a very much debated issue...)
According to the LKMM, this question boils down to the question:
Is there "ordering"/synchronization between the above access and
the "matching accesses" bF and aA' to the same location?
Again according to the LKMM's analysis, such synchronization is provided
by the RELEASE -> "reads-from" -> ADDR relation. (Encoding address dep.
in litmus tests is kind of tricky but possible, e.g., for the pattern in
question, we could write/model as follows:
C S+ponarelease+addroncena
{
int *y = &a;
}
P0(int *x, int **y, int *a)
{
int *r0;
*x = 2;
r0 = cmpxchg_release(y, a, x);
}
P1(int *x, int **y)
{
int *r0;
r0 = READ_ONCE(*y);
*r0 = 1;
}
exists (1:r0=x /\ x=2)
Then
$ herd7 -conf linux-kernel.cfg S+ponarelease+addroncena
Test S+ponarelease+addroncena Allowed
States 2
1:r0=a; x=2;
1:r0=x; x=1;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (1:r0=x /\ x=2)
Observation S+ponarelease+addroncena Never 0 2
Time S+ponarelease+addroncena 0.01
Hash=7eaf7b5e95419a3c352d7fd50b9cd0d5
that is, the test is not racy and the "exists" clause is not satisfiable
in the LKMM. Notice that _if the READ_ONCE(*y) in P1 were replaced by a
plain read, then we would obtain:
Test S+ponarelease+addrnana Allowed
States 2
1:r0=x; x=1;
1:r0=x; x=2;
Ok
Witnesses
Positive: 1 Negative: 1
Flag data-race [ <-- the LKMM warns about a data-race ]
Condition exists (1:r0=x /\ x=2)
Observation S+ponarelease+addrnana Sometimes 1 1
Time S+ponarelease+addrnana 0.00
Hash=a61acf2e8e51c2129d33ddf5e4c76a49
N.B. This analysis generally depends on the assumption that every marked
access (e.g., the cmpxchg_release() called out above and the READ_ONCE()
heading the address dependencies) are _single-copy atomic, an assumption
which has been recently shown to _not be valid in such generality:
https://lkml.kernel.org/r/20190821103200.kpufwtviqhpbuv2n@willie-the-truck
(Bug in the LKMM? or in the Linux implementation of these primitives? or
in the compiler? your blame here...)
[...]
> > + /*
> > + * bD:
> > + *
> > + * Set @seq to +1 of @seq from the previous head.
> > + *
> > + * Memory barrier involvement:
> > + *
> > + * If bB reads from bE, then bC->aA reads from bD.
> > + *
> > + * Relies on:
> > + *
> > + * RELEASE from bD to bE
> > + * matching
> > + * ADDRESS DEP. from bB to bC->aA
> > + */
> > + WRITE_ONCE(n->seq, seq + 1);
>
> Do we really need WRITE_ONCE() here?
> It is the same problem as with setting n->next_id above.
Same considerations as above would apply here.
Andrea
Sorry for top posting, but I forgot to mention: as you might have
noticed, my @amarulasolutions address is not active anymore; FWIW,
you should still be able to reach me at this @gmail address.
Thanks,
Andrea
On Mon, Aug 26, 2019 at 10:34:36AM +0200, Andrea Parri wrote:
> > > + /*
> > > + * bA:
> > > + *
> > > + * Setup the node to be a list terminator: next_id == id.
> > > + */
> > > + WRITE_ONCE(n->next_id, id);
> >
> > Do we need WRITE_ONCE() here?
> > Both "n" and "id" are given as parameters and do not change.
> > The assigment must be done before "id" is set as nl->head_id.
> > The ordering is enforced by cmpxchg_release().
>
> (Disclaimer: this is still a very much debated issue...)
>
> According to the LKMM, this question boils down to the question:
>
> Is there "ordering"/synchronization between the above access and
> the "matching accesses" bF and aA' to the same location?
>
> Again according to the LKMM's analysis, such synchronization is provided
> by the RELEASE -> "reads-from" -> ADDR relation. (Encoding address dep.
> in litmus tests is kind of tricky but possible, e.g., for the pattern in
> question, we could write/model as follows:
>
> C S+ponarelease+addroncena
>
> {
> int *y = &a;
> }
>
> P0(int *x, int **y, int *a)
> {
> int *r0;
>
> *x = 2;
> r0 = cmpxchg_release(y, a, x);
> }
>
> P1(int *x, int **y)
> {
> int *r0;
>
> r0 = READ_ONCE(*y);
> *r0 = 1;
> }
>
> exists (1:r0=x /\ x=2)
>
> Then
>
> $ herd7 -conf linux-kernel.cfg S+ponarelease+addroncena
> Test S+ponarelease+addroncena Allowed
> States 2
> 1:r0=a; x=2;
> 1:r0=x; x=1;
> No
> Witnesses
> Positive: 0 Negative: 2
> Condition exists (1:r0=x /\ x=2)
> Observation S+ponarelease+addroncena Never 0 2
> Time S+ponarelease+addroncena 0.01
> Hash=7eaf7b5e95419a3c352d7fd50b9cd0d5
>
> that is, the test is not racy and the "exists" clause is not satisfiable
> in the LKMM. Notice that _if the READ_ONCE(*y) in P1 were replaced by a
> plain read, then we would obtain:
>
> Test S+ponarelease+addrnana Allowed
> States 2
> 1:r0=x; x=1;
> 1:r0=x; x=2;
> Ok
> Witnesses
> Positive: 1 Negative: 1
> Flag data-race [ <-- the LKMM warns about a data-race ]
> Condition exists (1:r0=x /\ x=2)
> Observation S+ponarelease+addrnana Sometimes 1 1
> Time S+ponarelease+addrnana 0.00
> Hash=a61acf2e8e51c2129d33ddf5e4c76a49
>
> N.B. This analysis generally depends on the assumption that every marked
> access (e.g., the cmpxchg_release() called out above and the READ_ONCE()
> heading the address dependencies) are _single-copy atomic, an assumption
> which has been recently shown to _not be valid in such generality:
>
> https://lkml.kernel.org/r/20190821103200.kpufwtviqhpbuv2n@willie-the-truck
>
> (Bug in the LKMM? or in the Linux implementation of these primitives? or
> in the compiler? your blame here...)
>
>
> [...]
>
> > > + /*
> > > + * bD:
> > > + *
> > > + * Set @seq to +1 of @seq from the previous head.
> > > + *
> > > + * Memory barrier involvement:
> > > + *
> > > + * If bB reads from bE, then bC->aA reads from bD.
> > > + *
> > > + * Relies on:
> > > + *
> > > + * RELEASE from bD to bE
> > > + * matching
> > > + * ADDRESS DEP. from bB to bC->aA
> > > + */
> > > + WRITE_ONCE(n->seq, seq + 1);
> >
> > Do we really need WRITE_ONCE() here?
> > It is the same problem as with setting n->next_id above.
>
> Same considerations as above would apply here.
>
> Andrea
On Mon 2019-08-26 10:34:36, Andrea Parri wrote:
> > > + /*
> > > + * bA:
> > > + *
> > > + * Setup the node to be a list terminator: next_id == id.
> > > + */
> > > + WRITE_ONCE(n->next_id, id);
> >
> > Do we need WRITE_ONCE() here?
> > Both "n" and "id" are given as parameters and do not change.
> > The assigment must be done before "id" is set as nl->head_id.
> > The ordering is enforced by cmpxchg_release().
>
> (Disclaimer: this is still a very much debated issue...)
>
> According to the LKMM, this question boils down to the question:
>
> Is there "ordering"/synchronization between the above access and
> the "matching accesses" bF and aA' to the same location?
>
> Again according to the LKMM's analysis, such synchronization is provided
> by the RELEASE -> "reads-from" -> ADDR relation. (Encoding address dep.
> in litmus tests is kind of tricky but possible, e.g., for the pattern in
> question, we could write/model as follows:
>
> C S+ponarelease+addroncena
>
> {
> int *y = &a;
> }
>
> P0(int *x, int **y, int *a)
> {
> int *r0;
>
> *x = 2;
> r0 = cmpxchg_release(y, a, x);
> }
>
> P1(int *x, int **y)
> {
> int *r0;
>
> r0 = READ_ONCE(*y);
> *r0 = 1;
> }
>
> exists (1:r0=x /\ x=2)
Which r0 the above exists rule refers to, please?
Do both P0 and P1 define r0 by purpose?
> Then
>
> $ herd7 -conf linux-kernel.cfg S+ponarelease+addroncena
> Test S+ponarelease+addroncena Allowed
> States 2
> 1:r0=a; x=2;
> 1:r0=x; x=1;
> No
> Witnesses
> Positive: 0 Negative: 2
> Condition exists (1:r0=x /\ x=2)
> Observation S+ponarelease+addroncena Never 0 2
> Time S+ponarelease+addroncena 0.01
> Hash=7eaf7b5e95419a3c352d7fd50b9cd0d5
>
> that is, the test is not racy and the "exists" clause is not satisfiable
> in the LKMM. Notice that _if the READ_ONCE(*y) in P1 were replaced by a
> plain read, then we would obtain:
>
> Test S+ponarelease+addrnana Allowed
> States 2
> 1:r0=x; x=1;
> 1:r0=x; x=2;
Do you have any explanation how r0=x; x=2; could happen, please?
Does the ommited READ_ONCE allows to do r0 = (*y) twice
before and after *r0 = 1?
Or the two operations P1 can be called in any order?
I am sorry if it obvious. Feel free to ask me to re-read Paul's
articles on LWN more times or point me to another resources.
> Ok
> Witnesses
> Positive: 1 Negative: 1
> Flag data-race [ <-- the LKMM warns about a data-race ]
> Condition exists (1:r0=x /\ x=2)
> Observation S+ponarelease+addrnana Sometimes 1 1
> Time S+ponarelease+addrnana 0.00
> Hash=a61acf2e8e51c2129d33ddf5e4c76a49
>
> N.B. This analysis generally depends on the assumption that every marked
> access (e.g., the cmpxchg_release() called out above and the READ_ONCE()
> heading the address dependencies) are _single-copy atomic, an assumption
> which has been recently shown to _not be valid in such generality:
>
> https://lkml.kernel.org/r/20190821103200.kpufwtviqhpbuv2n@willie-the-truck
So, it might be even worse. Do I get it correctly?
Best Regards,
Petr
> > C S+ponarelease+addroncena
> >
> > {
> > int *y = &a;
> > }
> >
> > P0(int *x, int **y, int *a)
> > {
> > int *r0;
> >
> > *x = 2;
> > r0 = cmpxchg_release(y, a, x);
> > }
> >
> > P1(int *x, int **y)
> > {
> > int *r0;
> >
> > r0 = READ_ONCE(*y);
> > *r0 = 1;
> > }
> >
> > exists (1:r0=x /\ x=2)
>
> Which r0 the above exists rule refers to, please?
> Do both P0 and P1 define r0 by purpose?
"1:r0" is the value returned by the above READ_ONCE(*y), following the
convention [thread number]:[local variable]; but yes, I could probably
have saved you this question by picking a different name, ;-) sorry.
>
> > Then
> >
> > $ herd7 -conf linux-kernel.cfg S+ponarelease+addroncena
> > Test S+ponarelease+addroncena Allowed
> > States 2
> > 1:r0=a; x=2;
> > 1:r0=x; x=1;
> > No
> > Witnesses
> > Positive: 0 Negative: 2
> > Condition exists (1:r0=x /\ x=2)
> > Observation S+ponarelease+addroncena Never 0 2
> > Time S+ponarelease+addroncena 0.01
> > Hash=7eaf7b5e95419a3c352d7fd50b9cd0d5
> >
> > that is, the test is not racy and the "exists" clause is not satisfiable
> > in the LKMM. Notice that _if the READ_ONCE(*y) in P1 were replaced by a
> > plain read, then we would obtain:
> >
> > Test S+ponarelease+addrnana Allowed
> > States 2
> > 1:r0=x; x=1;
> > 1:r0=x; x=2;
>
> Do you have any explanation how r0=x; x=2; could happen, please?
I should have remarked: the states listed here lose their significance
when there is a data race: "data race" is LKMM's way of saying "I give
up, I'm unable to list all the reachable states; your call...". ;-)
This example is "complicated", e.g., by the tearing of the plain read,
tearing which is envisaged/modelled by the LKMM: however, this tearing
doesn't explain the "1:r0=x; x=2;" state by itself, AFAICT.
Said this, I'm not sure how I copied this output... For completeness,
I report the full/intended test at the bottom of my email.
>
> Does the ommited READ_ONCE allows to do r0 = (*y) twice
> before and after *r0 = 1?
> Or the two operations P1 can be called in any order?
>
> I am sorry if it obvious. Feel free to ask me to re-read Paul's
> articles on LWN more times or point me to another resources.
>
>
>
> > Ok
> > Witnesses
> > Positive: 1 Negative: 1
> > Flag data-race [ <-- the LKMM warns about a data-race ]
> > Condition exists (1:r0=x /\ x=2)
> > Observation S+ponarelease+addrnana Sometimes 1 1
> > Time S+ponarelease+addrnana 0.00
> > Hash=a61acf2e8e51c2129d33ddf5e4c76a49
> >
> > N.B. This analysis generally depends on the assumption that every marked
> > access (e.g., the cmpxchg_release() called out above and the READ_ONCE()
> > heading the address dependencies) are _single-copy atomic, an assumption
> > which has been recently shown to _not be valid in such generality:
> >
> > https://lkml.kernel.org/r/20190821103200.kpufwtviqhpbuv2n@willie-the-truck
>
> So, it might be even worse. Do I get it correctly?
Worse than I was hoping..., definitely! ;-)
Andrea
---
C S+ponarelease+addrnana
{
int *y = &a;
}
P0(int *x, int **y, int *a)
{
int *r0;
*x = 2;
r0 = cmpxchg_release(y, a, x);
}
P1(int *x, int **y)
{
int *r0;
r0 = *y;
*r0 = 1;
}
exists (1:r0=x /\ x=2)
Hi Petr,
AndreaP responded with some explanation (and great links!) on the topic
of READ_ONCE. But I feel like your comments about the WRITE_ONCE were
not addressed. I address that (and your other comments) below...
On 2019-08-23, Petr Mladek <[email protected]> wrote:
>> --- /dev/null
>> +++ b/kernel/printk/numlist.c
>> +/**
>> + * numlist_push() - Add a node to the list and assign it a sequence number.
>> + *
>> + * @nl: The numbered list to push to.
>> + *
>> + * @n: A node to push to the numbered list.
>> + * The node must not already be part of a list.
>> + *
>> + * @id: The ID of the node.
>> + *
>> + * A node is added in two steps: The first step is to make this node the
>> + * head, which causes a following push to add to this node. The second step is
>> + * to update @next_id of the former head node to point to this one, which
>> + * makes this node visible to any task that sees the former head node.
>> + */
>> +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
>> +{
>> + unsigned long head_id;
>> + unsigned long seq;
>> + unsigned long r;
>> +
>> + /*
>> + * bA:
>> + *
>> + * Setup the node to be a list terminator: next_id == id.
>> + */
>> + WRITE_ONCE(n->next_id, id);
>
> Do we need WRITE_ONCE() here?
> Both "n" and "id" are given as parameters and do not change.
> The assigment must be done before "id" is set as nl->head_id.
> The ordering is enforced by cmpxchg_release().
The cmpxchg_release() ensures that if the node is visible to writers,
then the finalized assignment is also visible. And the store_release()
ensures that if the previous node is visible to any readers, then the
finalized assignment is also visible. In the reader case, if any readers
happen to be sitting on the node, numlist_read() will fail because the
ID was updated when the node was popped. So for all these cases any
compiler optimizations leading to that assigment (tearing, speculation,
etc) should be irrelevant. Therefore, IMO the WRITE_ONCE() is not
needed.
Since all of this is lockless, I used WRITE_ONCE() whenever touching
shared variables. I must admit the decision may be motivated primarily
by fear of compiler optimizations. Although "documenting lockless shared
variable access" did play a role as well.
I will replace the WRITE_ONCE with an assignment.
>> +
>> + /* bB: #1 */
>> + head_id = atomic_long_read(&nl->head_id);
>> +
>> + for (;;) {
>> + /* bC: */
>> + while (!numlist_read(nl, head_id, &seq, NULL)) {
>> + /*
>> + * @head_id is invalid. Try again with an
>> + * updated value.
>> + */
>> +
>> + cpu_relax();
>
> I have got very confused by this. cpu_relax() suggests that this
> cycle is busy waiting until a particular node becomes valid.
> My first though was that it must cause deadlock in NMI when
> the interrupted code is supposed to make the node valid.
>
> But it is the other way. The head is always valid when it is
> added to the list. It might become invalid when another CPU
> moves the head and the old one gets reused.
>
> Anyway, I do not see any reason for cpu_relax() here.
You are correct. The cpu_relax() should not be there. But there is still
an issue that this could spin hard if the head was recycled and this CPU
does not yet see the new head value.
To handle that, and in preparation for my next version, I'm now using a
read_acquire() to load the ID in the node() callback (matching the
set_release() in assign_desc()). This ensures that if numlist_read()
fails, the new head will be visible.
> Also the entire cycle would deserve a comment to avoid this mistake.
> For example:
>
> /*
> * bC: Read seq from current head. Repeat with new
> * head when it has changed and the old one got reused.
> */
Agreed.
>> +
>> + /* bB: #2 */
>> + head_id = atomic_long_read(&nl->head_id);
>> + }
>> +
>> + /*
>> + * bD:
>> + *
>> + * Set @seq to +1 of @seq from the previous head.
>> + *
>> + * Memory barrier involvement:
>> + *
>> + * If bB reads from bE, then bC->aA reads from bD.
>> + *
>> + * Relies on:
>> + *
>> + * RELEASE from bD to bE
>> + * matching
>> + * ADDRESS DEP. from bB to bC->aA
>> + */
>> + WRITE_ONCE(n->seq, seq + 1);
>
> Do we really need WRITE_ONCE() here?
> It is the same problem as with setting n->next_id above.
For the same reasons as the other WRITE_ONCE, I will replace the
WRITE_ONCE with an assignment.
>> +
>> + /*
>> + * bE:
>> + *
>> + * This store_release() guarantees that @seq and @next are
>> + * stored before the node with @id is visible to any popping
>> + * writers. It pairs with the address dependency between @id
>> + * and @seq/@next provided by numlist_read(). See bD and bF
>> + * for details.
>> + */
>> + r = atomic_long_cmpxchg_release(&nl->head_id, head_id, id);
>> + if (r == head_id)
>> + break;
>> +
>> + /* bB: #3 */
>> + head_id = r;
>> + }
>> +
>> + n = nl->node(head_id, nl->node_arg);
>> +
>> + /*
>> + * The old head (which is still the list terminator), cannot be
>> + * removed because the list will always have at least one node.
>> + * Therefore @n must be non-NULL.
>> + */
>
> Please, move this comment above the nl->node() call. Both locations
> makes sense. I just see it as an important note for the call and thus
> is should be above. Also it will be better separated from the below
> comments for the _release() barrier.
OK.
>> + /*
>> + * bF: the STORE part for @next_id
>> + *
>> + * Set @next_id of the previous head to @id.
>> + *
>> + * Memory barrier involvement:
>> + *
>> + * If bB reads from bE, then bF overwrites bA.
>> + *
>> + * Relies on:
>> + *
>> + * RELEASE from bA to bE
>> + * matching
>> + * ADDRESS DEP. from bB to bF
>> + */
>> + /*
>> + * bG: the RELEASE part for @next_id
>> + *
>> + * This _release() guarantees that a reader will see the updates to
>> + * this node's @seq/@next_id if the reader saw the @next_id of the
>> + * previous node in the list. It pairs with the address dependency
>> + * between @id and @seq/@next provided by numlist_read().
>> + *
>> + * Memory barrier involvement:
>> + *
>> + * If aB reads from bG, then aA' reads from bD, where aA' is in
>> + * numlist_read() to read the node ID from bG.
>> + * If aB reads from bG, then aB' reads from bA, where aB' is in
>> + * numlist_read() to read the node ID from bG.
>> + *
>> + * Relies on:
>> + *
>> + * RELEASE from bG to bD
>> + * matching
>> + * ADDRESS DEP. from aB to aA'
>> + *
>> + * RELEASE from bG to bA
>> + * matching
>> + * ADDRESS DEP. from aB to aB'
>> + */
>> + smp_store_release(&n->next_id, id);
>
> Sigh, I see this line one screen below the previous command thanks
> to the extensive comments. Well, bF comment looks redundant.
Yes, it is a lot, but bF and bG are commenting on different things. bF
is the explanation that the _writer_ will overwrite the previous node's
terminating value with the ID of the new node. bG is the explanation
that if the _reader_ reads a non-terminating @next_id value, then the
initialization of @seq and @next_id for that next node will be visible.
Both of these points are key to the numlist implementation because they
guarantee the complete and correct linking of the list for all nodes
pushed.
John Ogness
On 2019-08-23, Petr Mladek <[email protected]> wrote:
>> --- /dev/null
>> +++ b/kernel/printk/numlist.c
>> @@ -0,0 +1,375 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include <linux/sched.h>
>> +#include "numlist.h"
>
> struct numlist is really special variant of a list. Let me to
> do a short summary:
>
> + FIFO queue interface
>
> + nodes sequentially numbered
>
> + nodes referenced by ID instead pointers to avoid ABA problems
> + requires custom node() callback to get pointer for given ID
>
> + lockless access:
> + pushed nodes must not longer get modified by push() caller
> + pop() caller gets exclusive write access, except that they
> must modify ID first and do smp_wmb() later
Only if the "numlist user" decides to recycle descriptors (which the
printk_ringbuffer does) is ID modification of descriptors necessary. How
that is synchronized with readers is up to the user (for example,
whether a RELEASE or an smp_wmb() is used).
> + pop() does not work:
> + tail node is "busy"
> + needs a custom callback that defines when a node is busy
Note that busy() could always return false if the user has no concept of
nodes that should not be popped.
> + tail is the last node
> + needed for lockless sequential numbering
>
> I will start with one inevitable question ;-) Is it realistic to find
> another user for this API, please?
If someone needs a FIFO queue that supports:
1. multiple concurrent writers and multiple concurrent non-consuming
readers
2. where readers are allowed to miss nodes but are able to detect how
many were missed
3. from any context (including NMI)
then I know of no other data structure available. (Otherwise I would
have used it!)
> I am not sure that all the indirections, caused by the generic API,
> are worth the gain.
IMHO the API is sane. The only bizarre rule is that the numlist must
always have at least 1 node. But since the readers are non-consuming,
there is no real tragedy here.
My goal is not to create some fabulous abstract data structure that
everyone should use. But I did try to minimize numlist (and dataring) to
only be concerned with clearly defined and minimal responsibilities
without imposing unnecessary restrictions on the user.
> Well, the separate API makes sense anyway. I have some ideas that
> might make it cleaner.
[snipped the nice refactoring of the ID into the nl_node]
Your idea (along with previous discussions) convinced me of the
importance of moving the ID-related barriers into the same
file. However, rather than pushing the ID parts into the numlist, I will
be moving them all into the "numlist user"
(i.e. printk_ringbuffer). Your use of the ACQUIRE to load the ID made me
realize that I need to be doing that as well! (but in the node()
callback)
The reasons why I do not want the ID in nl_node is:
- The numlist would need to implement the ID-to-node mapping. For the
printk_ringbuffer that mapping is simply masking to an index within an
array. But why should a numlist user be forced to do it that way? I
see no advantage to restricting numlists to being arrays of nodes.
- The dataring structure also uses IDs and requires an ID-to-node
mapping. I do not want to bind the dataring and numlist data
structures together at this level because they really have nothing to
do with each other. Having the dataring and numlist ID-to-node
mappings (and their barriers) in the same place (in the
numlist/dataring _user_) simplifies the big picture.
- ID-related barriers are only needed if node recycling is involved. The
numlist user decides if recycling is used and if yes, then the numlist
user is responsible for correctly implementing that.
- By moving all the ID-related barriers to the callbacks, the numlist
code remains clean and (with the exception of the one smp_rmb()) does
not expect anything from the numlist user.
I believe your main concern was having easily visible symmetric
barriers. We can achieve that if the read-barriers are in the callbacks
(for both numlist and dataring). I think it makes more sense to put them
there. dataring and numlist should not care about the ID-to-node
mapping.
John Ogness
On Tue 2019-08-27 00:36:18, John Ogness wrote:
> Hi Petr,
>
> AndreaP responded with some explanation (and great links!) on the topic
> of READ_ONCE. But I feel like your comments about the WRITE_ONCE were
> not addressed. I address that (and your other comments) below...
>
> On 2019-08-23, Petr Mladek <[email protected]> wrote:
> >> --- /dev/null
> >> +++ b/kernel/printk/numlist.c
> >> +/**
> >> + * numlist_push() - Add a node to the list and assign it a sequence number.
> >> + *
> >> + * @nl: The numbered list to push to.
> >> + *
> >> + * @n: A node to push to the numbered list.
> >> + * The node must not already be part of a list.
> >> + *
> >> + * @id: The ID of the node.
> >> + *
> >> + * A node is added in two steps: The first step is to make this node the
> >> + * head, which causes a following push to add to this node. The second step is
> >> + * to update @next_id of the former head node to point to this one, which
> >> + * makes this node visible to any task that sees the former head node.
> >> + */
> >> +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
> >> +{
[...]
> >> +
> >> + /* bB: #1 */
> >> + head_id = atomic_long_read(&nl->head_id);
> >> +
> >> + for (;;) {
> >> + /* bC: */
> >> + while (!numlist_read(nl, head_id, &seq, NULL)) {
> >> + /*
> >> + * @head_id is invalid. Try again with an
> >> + * updated value.
> >> + */
> >> +
> >> + cpu_relax();
> >
> > I have got very confused by this. cpu_relax() suggests that this
> > cycle is busy waiting until a particular node becomes valid.
> > My first though was that it must cause deadlock in NMI when
> > the interrupted code is supposed to make the node valid.
> >
> > But it is the other way. The head is always valid when it is
> > added to the list. It might become invalid when another CPU
> > moves the head and the old one gets reused.
> >
> > Anyway, I do not see any reason for cpu_relax() here.
>
> You are correct. The cpu_relax() should not be there. But there is still
> an issue that this could spin hard if the head was recycled and this CPU
> does not yet see the new head value.
I do not understand this. The head could get reused only after
head_id was replaced with the following valid node.
The next cycle is done with a new id that should be valid.
Of course, the new ID might get reused as well. But then we just
repeat the cycle. We have to be able to find a valid head after
few cycles. The last valid ID could not get reused because nodes
can be removed only if was not the last valid node.
> To handle that, and in preparation for my next version, I'm now using a
> read_acquire() to load the ID in the node() callback (matching the
> set_release() in assign_desc()). This ensures that if numlist_read()
> fails, the new head will be visible.
I do not understand this either. The above paragraph seems to
describe a race. I do not see how it could cause an infinite loop.
Best Regards,
Petr
On Tue 2019-08-27 01:57:39, John Ogness wrote:
> On 2019-08-23, Petr Mladek <[email protected]> wrote:
> >> --- /dev/null
> >> +++ b/kernel/printk/numlist.c
> >> @@ -0,0 +1,375 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +
> >> +#include <linux/sched.h>
> >> +#include "numlist.h"
> >
> > struct numlist is really special variant of a list. Let me to
> > do a short summary:
> >
> > + FIFO queue interface
> >
> > + nodes sequentially numbered
> >
> > + nodes referenced by ID instead pointers to avoid ABA problems
> > + requires custom node() callback to get pointer for given ID
> >
> > + lockless access:
> > + pushed nodes must not longer get modified by push() caller
> > + pop() caller gets exclusive write access, except that they
> > must modify ID first and do smp_wmb() later
>
> Only if the "numlist user" decides to recycle descriptors (which the
> printk_ringbuffer does) is ID modification of descriptors necessary. How
> that is synchronized with readers is up to the user (for example,
> whether a RELEASE or an smp_wmb() is used).
IMHO, the most tricky part of numlist API is handling of IDs.
The IDs are there to avoid ABA races when reusing the nodes.
I want to say that this API is useful only when the nodes are reused.
All other users would want anything easier.
> > + pop() does not work:
> > + tail node is "busy"
> > + needs a custom callback that defines when a node is busy
>
> Note that busy() could always return false if the user has no concept of
> nodes that should not be popped.
This would be append only list. Again the is no need for IDs
and other complexities.
> > + tail is the last node
> > + needed for lockless sequential numbering
> >
> > I will start with one inevitable question ;-) Is it realistic to find
> > another user for this API, please?
>
> If someone needs a FIFO queue that supports:
>
> 1. multiple concurrent writers and multiple concurrent non-consuming
> readers
>
> 2. where readers are allowed to miss nodes but are able to detect how
> many were missed
>
> 3. from any context (including NMI)
>
> then I know of no other data structure available. (Otherwise I would
> have used it!)
It might be also because nobody else needed structure with
our numlist semantic. I guess that lockless read/write
structures are usually implemented using RCU.
> > I am not sure that all the indirections, caused by the generic API,
> > are worth the gain.
>
> IMHO the API is sane. The only bizarre rule is that the numlist must
> always have at least 1 node. But since the readers are non-consuming,
> there is no real tragedy here.
>
> My goal is not to create some fabulous abstract data structure that
> everyone should use. But I did try to minimize numlist (and dataring) to
> only be concerned with clearly defined and minimal responsibilities
> without imposing unnecessary restrictions on the user.
The API is complicated because of the callbacks. It depends on a logic
that is implemented externally. It makes it abstract to some extent.
My view is that the API would be much cleaner and easier to review
when the ID handling is "hardcoded" (helper functions). It could be
made abstract anytime later when there is another user.
There should always be a reason why to make a code more complicated
than necessary. It seems that the only reason is some theoretical
future user and its theoretical requirements.
> > Well, the separate API makes sense anyway. I have some ideas that
> > might make it cleaner.
>
> [snipped the nice refactoring of the ID into the nl_node]
>
> Your idea (along with previous discussions) convinced me of the
> importance of moving the ID-related barriers into the same
> file. However, rather than pushing the ID parts into the numlist, I will
> be moving them all into the "numlist user"
> (i.e. printk_ringbuffer). Your use of the ACQUIRE to load the ID made me
> realize that I need to be doing that as well! (but in the node()
> callback)
>
> The reasons why I do not want the ID in nl_node is:
>
> - The numlist would need to implement the ID-to-node mapping. For the
> printk_ringbuffer that mapping is simply masking to an index within an
> array. But why should a numlist user be forced to do it that way? I
> see no advantage to restricting numlists to being arrays of nodes.
It might be done a generic way when there is a user with another need.
Honestly, I have big troubles to imagine another reasonable mapping
between id and pointer than masking. We are talking about lockless
code. Anything more complicated might become a nightmare.
> - The dataring structure also uses IDs and requires an ID-to-node
> mapping. I do not want to bind the dataring and numlist data
> structures together at this level because they really have nothing to
> do with each other. Having the dataring and numlist ID-to-node
> mappings (and their barriers) in the same place (in the
> numlist/dataring _user_) simplifies the big picture.
The ID is used in all three APIs. Then it might be only a matter of
taste where it is stored.
I still feel that struct nl_node is better place because:
+ already includes next_id
+ includes seq that identifies the structure another way
+ id describes the node
I still need to think more about the other APIs. Well, id
substitutes a pointer here. It is like struct list_head
pointer. It is normal that it is passed as parameter
by a list API user.
> I believe your main concern was having easily visible symmetric
> barriers. We can achieve that if the read-barriers are in the callbacks
> (for both numlist and dataring). I think it makes more sense to put them
> there. dataring and numlist should not care about the ID-to-node
> mapping.
Symmetry is really important. It is often sign of a good design.
Simple and straightforward code is another important thing at
this stage. The code is complicated and we need to make sure
that it works. Any optimizations and generalization might
be done later when needed.
Best Regards,
Petr
On 2019-08-27, Petr Mladek <[email protected]> wrote:
>> On 2019-08-23, Petr Mladek <[email protected]> wrote:
>>>> --- /dev/null
>>>> +++ b/kernel/printk/numlist.c
>>>> +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
>>>> +{
> [...]
>>>> +
>>>> + /* bB: #1 */
>>>> + head_id = atomic_long_read(&nl->head_id);
>>>> +
>>>> + for (;;) {
>>>> + /* bC: */
>>>> + while (!numlist_read(nl, head_id, &seq, NULL)) {
>>>> + /*
>>>> + * @head_id is invalid. Try again with an
>>>> + * updated value.
>>>> + */
>>>> +
>>>> + cpu_relax();
>>>
>>> I have got very confused by this. cpu_relax() suggests that this
>>> cycle is busy waiting until a particular node becomes valid.
>>> My first though was that it must cause deadlock in NMI when
>>> the interrupted code is supposed to make the node valid.
>>>
>>> But it is the other way. The head is always valid when it is
>>> added to the list. It might become invalid when another CPU
>>> moves the head and the old one gets reused.
>>>
>>> Anyway, I do not see any reason for cpu_relax() here.
>>
>> You are correct. The cpu_relax() should not be there. But there is
>> still an issue that this could spin hard if the head was recycled and
>> this CPU does not yet see the new head value.
>
> I do not understand this. The head could get reused only after
> head_id was replaced with the following valid node.
> The next cycle is done with a new id that should be valid.
>
> Of course, the new ID might get reused as well. But then we just
> repeat the cycle. We have to be able to find a valid head after
> few cycles. The last valid ID could not get reused because nodes
> can be removed only if was not the last valid node.
Sorry, I was not very precise with my language. I will try again...
nl->head_id is read using a relaxed read. A second CPU may have added
new nodes and removed/recycled the node with the ID that the first CPU
read as the head.
As a result, the first CPU's numlist_read() will (correctly) fail. If
numlist_read() failed in the first node() callback within numlist_read()
(i.e. it sees that the node already has a new ID), there is no guarantee
that rereading the head ID will provide a new ID. At some point the
memory system would make the new head ID visible, but there could be
some heavy spinning until that happens.
Here is a litmus test showing the problem (using comments and verbose
variable names):
C numlist_push_loop
{
int node1 = 1;
int node2 = 2;
int *numlist_head = &node1;
}
P0(int **numlist_head)
{
int *head;
int id;
// read head ID
head = READ_ONCE(*numlist_head);
// read head node ID
id = READ_ONCE(*head);
// re-read head ID when node ID is unexpected
head = READ_ONCE(*numlist_head);
}
P1(int **numlist_head, int *node1, int *node2)
{
int *r0;
// push node2
r0 = cmpxchg_release(numlist_head, node1, node2);
// pop node1, reassigning a new ID
smp_store_release(node1, 3);
}
exists (0:head=node1 /\ 0:id=3)
$ herd7 -conf linux-kernel.cfg numlist_push_loop.litmus
Test numlist_push_loop Allowed
States 5
0:head=node1; 0:id=1;
0:head=node1; 0:id=3;
0:head=node2; 0:id=1;
0:head=node2; 0:id=2;
0:head=node2; 0:id=3;
Ok
Witnesses
Positive: 1 Negative: 4
Condition exists (0:head=node1 /\ 0:id=3)
Observation numlist_push_loop Sometimes 1 4
Time numlist_push_loop 0.01
Hash=27b10efb171ab4cf390bd612a9e79bf0
The results show that P0 sees the head is node1 but also sees that
node1's ID has changed. (And if node1's ID changed, it means P1 had
previously replaced the head.) If P0 ran in a while-loop, at some point
it _would_ see that node2 is now the head. But that is wasteful spinning
and may possibly have negative influence on the memory system.
>> To handle that, and in preparation for my next version, I'm now using
>> a read_acquire() to load the ID in the node() callback (matching the
>> set_release() in assign_desc()). This ensures that if numlist_read()
>> fails, the new head will be visible.
>
> I do not understand this either. The above paragraph seems to
> describe a race. I do not see how it could cause an infinite loop.
It isn't an infinite loop. It is burning some/many CPU cycles.
By changing P0's ID read to:
id = smp_load_acquire(head);
the results change to:
$ herd7 -conf linux-kernel.cfg numlist_push_loop.litmus
Test numlist_push_loop Allowed
States 4
0:head=node1; 0:id=1;
0:head=node2; 0:id=1;
0:head=node2; 0:id=2;
0:head=node2; 0:id=3;
No
Witnesses
Positive: 0 Negative: 4
Condition exists (0:head=node1 /\ 0:id=3)
Observation numlist_push_loop Never 0 4
Time numlist_push_loop 0.01
Hash=3eb63ea3bec59f8941f61faddb5499da
Meaning that if a new ID is seen, a new head ID is also visible.
Loading the ID is what the node() callback does, and the ACQUIRE pairs
with the set_release() in assign_desc(). Both in ringbuffer.c.
John Ogness
On Sun 2019-08-25 04:42:37, John Ogness wrote:
> On 2019-08-20, Petr Mladek <[email protected]> wrote:
> >> +/**
> >> + * dataring_push() - Reserve a data block in the data array.
> >> + *
> >> + * @dr: The data ringbuffer to reserve data in.
> >> + *
> >> + * @size: The size to reserve.
> >> + *
> >> + * @desc: A pointer to a descriptor to store the data block information.
> >> + *
> >> + * @id: The ID of the descriptor to be associated.
> >> + * The data block will not be set with @id, but rather initialized with
> >> + * a value that is explicitly different than @id. This is to handle the
> >> + * case when newly available garbage by chance matches the descriptor
> >> + * ID.
> >> + *
> >> + * This function expects to move the head pointer forward. If this would
> >> + * result in overtaking the data array index of the tail, the tail data block
> >> + * will be invalidated.
> >> + *
> >> + * Return: A pointer to the reserved writer data, otherwise NULL.
> >> + *
> >> + * This will only fail if it was not possible to invalidate the tail data
> >> + * block.
> >> + */
> >> +char *dataring_push(struct dataring *dr, unsigned int size,
> >> + struct dr_desc *desc, unsigned long id)
> >> +{
> >> + unsigned long begin_lpos;
> >> + unsigned long next_lpos;
> >> + struct dr_datablock *db;
> >> + bool ret;
> >> +
> >> + to_db_size(&size);
> >> +
> >> + do {
> >> + /* fA: */
> >> + ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
> >> +
> >> + /*
> >> + * fB:
> >> + *
> >> + * The data ringbuffer tail may have been pushed (by this or
> >> + * any other task). The updated @tail_lpos must be visible to
> >> + * all observers before changes to @begin_lpos, @next_lpos, or
> >> + * @head_lpos by this task are visible in order to allow other
> >> + * tasks to recognize the invalidation of the data
> >> + * blocks.
> >
> > This sounds strange. The write barrier should be done only on CPU
> > that really modified tail_lpos. I.e. it should be in _dataring_pop()
> > after successful dr->tail_lpos modification.
>
> The problem is that there are no data dependencies between the different
> variables. When a new datablock is being reserved, it is critical that
> all other observers see that the tail_lpos moved forward _before_ any
> other changes. _dataring_pop() uses an smp_rmb() to synchronize for
> tail_lpos movement.
It should be symmetric. It makes sense that _dataring_pop() uses an
smp_rmb(). Then there should be wmb() in dataring_push().
The wmb() should be done only by the CPU that actually did the write.
And it should be done after the write. This is why I suggested to
do it after cmpxchg(dr->head_lpos).
> This CPU is about to make some changes and may have
> seen an updated tail_lpos. An smp_wmb() is useless if this is not the
> CPU that performed that update. The full memory barrier ensures that all
> other observers will see what this CPU sees before any of its future
> changes are seen.
I do not understand it. Full memory barrier will not cause that all
CPUs will see the same.
My understanding of barriers is:
+ wmb() is needed after some value is modified and any following
modifications must be done later.
+ rmb() is needed when a value has to be read before the other
values are read.
These barriers need to be symmetric. The reader will see the values
in the right order only when both the writer and the reader use
the right barriers.
+ wmb() full barrier is needed around some critical section
to make sure that all operations happened inside the section
Back to our situation:
+ rmb() should not be needed here because get_new_lpos() provided
a valid lpos.
It is possible that get_new_lpos() used rmb() to make sure
that there was enough space. But such wmb() would be
between reading dr->tail_lpos and dr->head_lpos. No
other rmb() is needed once the check passed.
+ wmb() is not needed because we have not written anything yet
If there was a race with another CPU than cmpxchg(dr->head_lpos)
would fail and we will need to repeat everything again.
> >> + /* fE: */
> >> + } while (atomic_long_cmpxchg_relaxed(&dr->head_lpos, begin_lpos,
> >> + next_lpos) != begin_lpos);
> >> +
> >
> > We need a write barrier here to make sure that dr->head_lpos
> > is updated before we start updating other values, e.g.
> > db->id below.
>
> My RFCv2 implemented it that way. The function was called data_reserve()
> and it moved the head using cmpxchg_release(). For RFCv3 I changed to a
> full memory barrier instead because using acquire/release here is a bit
> messy. There are 2 different places where the acquire needed to be:
>
> - In _dataring_pop() a load_acquire() of head_lpos would need to be
> _before_ loading of begin_lpos and next_lpos.
>
> - In prb_iter_next_valid_entry() a load_acquire() of head_lpos would
> need to be at the beginning within the dataring_datablock_isvalid()
> check (mC).
>
> If smp_mb() is too heavy to call for every printk(), then we can move to
> acquire/release. The comments of fB list exactly what is synchronized
> (and where).
smp_mb() is not a problem. printk() is a slow path.
My problem is that I want to make sure that the code works as
expected. For this, I want to understand the used barriers.
And the discussed full barrier in dataring_push() does not make
sense to me.
Best Regards,
Petr
On Tue 2019-08-27 16:28:55, John Ogness wrote:
> On 2019-08-27, Petr Mladek <[email protected]> wrote:
> >> On 2019-08-23, Petr Mladek <[email protected]> wrote:
> >>>> --- /dev/null
> >>>> +++ b/kernel/printk/numlist.c
> >>>> +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
> >>>> +{
> > [...]
> >>>> +
> >>>> + /* bB: #1 */
> >>>> + head_id = atomic_long_read(&nl->head_id);
> >>>> +
> >>>> + for (;;) {
> >>>> + /* bC: */
> >>>> + while (!numlist_read(nl, head_id, &seq, NULL)) {
> >>>> + /*
> >>>> + * @head_id is invalid. Try again with an
> >>>> + * updated value.
> >>>> + */
> >>>> +
> >>>> + cpu_relax();
> >>>
> >>> I have got very confused by this. cpu_relax() suggests that this
> >>> cycle is busy waiting until a particular node becomes valid.
> >>> My first though was that it must cause deadlock in NMI when
> >>> the interrupted code is supposed to make the node valid.
> >>>
> >>> But it is the other way. The head is always valid when it is
> >>> added to the list. It might become invalid when another CPU
> >>> moves the head and the old one gets reused.
> >>>
> >>> Anyway, I do not see any reason for cpu_relax() here.
> >>
> >> You are correct. The cpu_relax() should not be there. But there is
> >> still an issue that this could spin hard if the head was recycled and
> >> this CPU does not yet see the new head value.
> >
> > I do not understand this. The head could get reused only after
> > head_id was replaced with the following valid node.
> > The next cycle is done with a new id that should be valid.
> >
> > Of course, the new ID might get reused as well. But then we just
> > repeat the cycle. We have to be able to find a valid head after
> > few cycles. The last valid ID could not get reused because nodes
> > can be removed only if was not the last valid node.
>
> Sorry, I was not very precise with my language. I will try again...
>
> nl->head_id is read using a relaxed read.
I wonder if the "relaxed read" causes the confusion. Could it read
the old id even when numlist_read() for this id failed?
If this is true then it should not be relaxed read.
> A second CPU may have added new nodes and removed/recycled
> the node with the ID that the first CPU read as the head.
This sounds like ABA problem. My understanding is that we
use ID to prevent these problems and could ignore them.
> As a result, the first CPU's numlist_read() will (correctly) fail. If
> numlist_read() failed in the first node() callback within numlist_read()
> (i.e. it sees that the node already has a new ID), there is no guarantee
> that rereading the head ID will provide a new ID. At some point the
> memory system would make the new head ID visible, but there could be
> some heavy spinning until that happens.
>
> Here is a litmus test showing the problem (using comments and verbose
> variable names):
>
> C numlist_push_loop
>
> {
> int node1 = 1;
> int node2 = 2;
> int *numlist_head = &node1;
> }
>
> P0(int **numlist_head)
> {
> int *head;
> int id;
>
> // read head ID
> head = READ_ONCE(*numlist_head);
>
> // read head node ID
> id = READ_ONCE(*head);
>
> // re-read head ID when node ID is unexpected
> head = READ_ONCE(*numlist_head);
> }
>
> P1(int **numlist_head, int *node1, int *node2)
> {
> int *r0;
>
> // push node2
> r0 = cmpxchg_release(numlist_head, node1, node2);
>
> // pop node1, reassigning a new ID
> smp_store_release(node1, 3);
> }
I think that the Litmus test does not describe the code.
If it does then we need to fix the algorithm or barriers.
> The results show that P0 sees the head is node1 but also sees that
> node1's ID has changed. (And if node1's ID changed, it means P1 had
> previously replaced the head.) If P0 ran in a while-loop, at some point
> it _would_ see that node2 is now the head. But that is wasteful spinning
> and may possibly have negative influence on the memory system.
My undestanding is that only valid nodes are added to the list.
If a node read via head_id is not valid then head_id already
points to another valid node. Am I wrong, please?
Best Regards,
Petr
On 2019-08-27, Petr Mladek <[email protected]> wrote:
> The API is complicated because of the callbacks. It depends on a logic
> that is implemented externally. It makes it abstract to some extent.
>
> My view is that the API would be much cleaner and easier to review
> when the ID handling is "hardcoded" (helper functions). It could be
> made abstract anytime later when there is another user.
>
> There should always be a reason why to make a code more complicated
> than necessary. It seems that the only reason is some theoretical
> future user and its theoretical requirements.
FWIW, I did _not_ create the numlist and dataring structures in order to
support some theoretical future user. PeterZ helped[0] me realize that
RFCv2 was actually using multiple internal data structures. Each of
these internal data structures has their own set of memory barriers and
semantics. By explicitly refactoring them behind strong APIs, the memory
barriers could be clearly visible and the semantics clearly defined.
For me this was a great help in _simplifying_ the design. For me it also
greatly simplified debugging, testing, and verifying because I could
write tests for numlist and datalist that explicitly targeted those data
structures. Once I believed they were bullet-proof, I could move on to
higher-level tests of the printk_ringbuffer. And once I believed the
printk_ringbuffer was bullet-proof, I could move on to the higher-level
printk tests. When a problem was found, I could effectively isolate
which component failed their job.
I understand that we disagree about the abstractions being a
simplification. And I'm not sure how to proceed in this regard. (Maybe
once we get everything bullet-proof, we can put everything back together
into a monolith like RFCv2.) Either way, please understand that the
abstractions were done for the benefit of printk_ringbuffer, not for any
theoretical future user.
John Ogness
[0] https://lkml.kernel.org/r/[email protected]
On Wed 2019-08-28 09:13:39, John Ogness wrote:
> On 2019-08-27, Petr Mladek <[email protected]> wrote:
> > The API is complicated because of the callbacks. It depends on a logic
> > that is implemented externally. It makes it abstract to some extent.
> >
> > My view is that the API would be much cleaner and easier to review
> > when the ID handling is "hardcoded" (helper functions). It could be
> > made abstract anytime later when there is another user.
> >
> > There should always be a reason why to make a code more complicated
> > than necessary. It seems that the only reason is some theoretical
> > future user and its theoretical requirements.
>
> FWIW, I did _not_ create the numlist and dataring structures in order to
> support some theoretical future user. PeterZ helped[0] me realize that
> RFCv2 was actually using multiple internal data structures. Each of
> these internal data structures has their own set of memory barriers and
> semantics. By explicitly refactoring them behind strong APIs, the memory
> barriers could be clearly visible and the semantics clearly defined.
>
> For me this was a great help in _simplifying_ the design. For me it also
> greatly simplified debugging, testing, and verifying because I could
> write tests for numlist and datalist that explicitly targeted those data
> structures. Once I believed they were bullet-proof, I could move on to
> higher-level tests of the printk_ringbuffer. And once I believed the
> printk_ringbuffer was bullet-proof, I could move on to the higher-level
> printk tests. When a problem was found, I could effectively isolate
> which component failed their job.
>
> I understand that we disagree about the abstractions being a
> simplification.
This is a misunderstanding. I probably was not clear enough. It makes
perfect sense to have separate APIs for numlist and dataring. I agree
that they allow to split the problem into smaller pieces.
I only think that, especially, numlist API is too generic in v4.
It is not selfcontained. The consistency depends on external barriers.
I believe that it might become fully self-contained and consistent
if we reduce possibilities of the generic usage. In particular,
the numlist should allow only linking of reusable structures
stored in an array.
I explained in the previous mail that other use cases are
questionable. If anyone really finds another usecase,
the API might be made more generic. But we should start
with something simple.
> And I'm not sure how to proceed in this regard. (Maybe
> once we get everything bullet-proof, we can put everything back together
> into a monolith like RFCv2.)
I would actually go the other way. It would be nice to add numlist
and dataring API in separate patches.
> Either way, please understand that the
> abstractions were done for the benefit of printk_ringbuffer, not for any
> theoretical future user.
I understand. I hope that it is more clear now.
Best Regards,
Petr
On 2019-08-27, Petr Mladek <[email protected]> wrote:
>>>>>> --- /dev/null
>>>>>> +++ b/kernel/printk/numlist.c
>>>>>> +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
>>>>>> +{
>> > [...]
>>>>>> +
>>>>>> + /* bB: #1 */
>>>>>> + head_id = atomic_long_read(&nl->head_id);
>>>>>> +
>>>>>> + for (;;) {
>>>>>> + /* bC: */
>>>>>> + while (!numlist_read(nl, head_id, &seq, NULL)) {
>>>>>> + /*
>>>>>> + * @head_id is invalid. Try again with an
>>>>>> + * updated value.
>>>>>> + */
>>>>
>>>> But there is still an issue that this could spin hard if the head
>>>> was recycled and this CPU does not yet see the new head value.
>>>
>>> I do not understand this. The head could get reused only after
>>> head_id was replaced with the following valid node.
>>> The next cycle is done with a new id that should be valid.
>>>
>>> Of course, the new ID might get reused as well. But then we just
>>> repeat the cycle. We have to be able to find a valid head after
>>> few cycles. The last valid ID could not get reused because nodes
>>> can be removed only if was not the last valid node.
>>
>> nl->head_id is read using a relaxed read.
>
> I wonder if the "relaxed read" causes the confusion. Could it read
> the old id even when numlist_read() for this id failed?
Yes. It is possible that the new head ID is not yet visible. That is
what the litmus test shows.
> If this is true then it should not be relaxed read.
That is essentially what the new change facilitates. By using an ACQUIRE
to load the descriptor ID, this CPU sees what the CPU that changed the
descriptor ID saw. And that CPU must have seen a different head ID
because a successful numlist_pop() means the popped node's @next_id was
verified that it isn't a terminator, and seeing @next_id means the new
head ID from the CPU that set @next_id is also visible.
Now this still isn't a guarantee that the head ID hasn't changed since
then. So the CPU still may loop. But the CPU is guaranteed to make
forward progress with each new head ID it sees.
>> A second CPU may have added new nodes and removed/recycled
>> the node with the ID that the first CPU read as the head.
>
> This sounds like ABA problem. My understanding is that we
> use ID to prevent these problems and could ignore them.
Yes, it is an ABA problem. And yes, because of IDs, it is prevented via
detection and numlist_read() failing. The ABA problem is not ignored, it
is handled.
[snipped vague litmus test]
> I think that the Litmus test does not describe the code.
OK, at the end is a new litmus test with extensive annotation so that
you see exactly how it is modelling the code. I have also increased the
concurrency, splitting the push/pop to 2 different CPUs. And to be fair,
I left in general memory barriers (smp_rmb/smp_wmb) that, although
unrelated, do exist in the code paths as well.
All annotation/code are based on RFCv4.
> If it does then we need to fix the algorithm or barriers.
Indeed, I've fixed the barriers for the next version.
> My undestanding is that only valid nodes are added to the list.
Correct.
> If a node read via head_id is not valid then head_id already
> points to another valid node. Am I wrong, please?
You are correct.
John Ogness
C numlist_push_loop
(*
* Result: Never
*
* Check numlist_push() loop to re-read the head,
* if it is possible that a new head ID is not visible.
*)
{
int node1 = 1;
int node1_next = 1;
int node2 = 2;
int *numlist_head = &node1;
int *numlist_tail = &node1;
}
/*
* This CPU wants to push a new node (not shown) to the numlist but another
* CPU pushed a node after this CPU had read the head ID, and yet another
* CPU pops/recycles the node that this CPU first saw as the head.
*/
P0(int **numlist_head)
{
int *head;
int id;
/*
* Read the head ID.
*
* numlist_push() numlist.c:215
*
* head_id = atomic_long_read(&nl->head_id);
*/
head = READ_ONCE(*numlist_head);
/*
* Read/validate the descriptor ID.
*
* NOTE: To guarantee seeing a new head ID, this should be:
* id = smp_load_acquire(head);
*
* numlist_push() numlist.c:219
* numlist_read() numlist.c:116
* prb_desc_node() ringbuffer.c:220-223
*
* struct prb_desc *d = to_desc(arg, id);
* if (id != atomic_long_read(&d->id))
* return NULL;
*/
id = READ_ONCE(*head);
/*
* Re-read the head ID when validation failed.
*
* numlist_push() numlist.c:228
*
* head_id = atomic_long_read(&nl->head_id);
*/
head = READ_ONCE(*numlist_head);
}
/* This CPU pushes a new node (node2) to the numlist. */
P1(int **numlist_head, int *node1, int *node2, int *node1_next)
{
int *r0;
/*
* Set a new head.
*
* numlist_push() numlist.c:257
*
* r = atomic_long_cmpxchg_release(&nl->head_id, head_id, id);
*/
r0 = cmpxchg_release(numlist_head, node1, node2);
/*
* Set the next of the previous head (node1) to node2's ID.
*
* numlist_push() numlist.c:313
*
* smp_store_release(&n->next_id, id);
*/
smp_store_release(node1_next, 2);
}
/* This CPU will pop/recycle a node (node) from the numlist. */
P2(int **numlist_head, int **numlist_tail, int *node1, int *node2,
int *node1_next)
{
int tail_id;
int *r0;
/*
* Read the tail ID. (Not used, but it touches the shared variables.)
*
* prb_reserve() ringbuffer.c:441
* assign_desc() ringbuffer.c:337
* numlist_pop() numlist.c:338
*
* tail_id = atomic_long_read(&nl->tail_id);
*/
tail_id = READ_ONCE(*numlist_tail);
/*
* Read the next value of the tail node.
*
* prb_reserve() ringbuffer.c:441
* assign_desc() ringbuffer.c:337
* numlist_pop() numlist.c:342
* numlist_read() numlist.c:116,135,147
*
* n = nl->node(id, nl->node_arg);
* *next_id = READ_ONCE(n->next_id);
* smp_rmb();
*/
r0 = READ_ONCE(*node1_next);
smp_rmb();
/*
* Verify that the node (node1) is not a terminator.
*
* prb_reserve() ringbuffer.c:441
* assign_desc() ringbuffer.c:337
* numlist_pop() numlist.c:355-356
*
* if (next_id == tail_id)
* return NULL;
*/
if (r0 != 1) {
/*
* Remove the node (node1) from the list.
*
* prb_reserve() ringbuffer.c:441
* assign_desc() ringbuffer.c:337
* numlist_pop() numlist.c:366-367
*
* r = atomic_long_cmpxchg_relaxed(&nl->tail_id,
* tail_id, next_id);
*/
r0 = cmpxchg_release(numlist_tail, node1, node2);
/*
* Assign the popped node (node1) a new ID.
*
* prb_reserve() ringbuffer.c:441
* assign_desc() ringbuffer.c:386-387
*
* atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
* DESCS_COUNT(rb));
*/
smp_store_release(node1, 3);
/*
* Prepare to make changes to node data.
*
* prb_reserve() ringbuffer.c:475
*
* smp_wmb();
*/
smp_wmb();
}
}
exists (0:head=node1 /\ 0:id=3)
On 2019-08-27, Petr Mladek <[email protected]> wrote:
>>>> +/**
>>>> + * dataring_push() - Reserve a data block in the data array.
>>>> + *
>>>> + * @dr: The data ringbuffer to reserve data in.
>>>> + *
>>>> + * @size: The size to reserve.
>>>> + *
>>>> + * @desc: A pointer to a descriptor to store the data block information.
>>>> + *
>>>> + * @id: The ID of the descriptor to be associated.
>>>> + * The data block will not be set with @id, but rather initialized with
>>>> + * a value that is explicitly different than @id. This is to handle the
>>>> + * case when newly available garbage by chance matches the descriptor
>>>> + * ID.
>>>> + *
>>>> + * This function expects to move the head pointer forward. If this would
>>>> + * result in overtaking the data array index of the tail, the tail data block
>>>> + * will be invalidated.
>>>> + *
>>>> + * Return: A pointer to the reserved writer data, otherwise NULL.
>>>> + *
>>>> + * This will only fail if it was not possible to invalidate the tail data
>>>> + * block.
>>>> + */
>>>> +char *dataring_push(struct dataring *dr, unsigned int size,
>>>> + struct dr_desc *desc, unsigned long id)
>>>> +{
>>>> + unsigned long begin_lpos;
>>>> + unsigned long next_lpos;
>>>> + struct dr_datablock *db;
>>>> + bool ret;
>>>> +
>>>> + to_db_size(&size);
>>>> +
>>>> + do {
>>>> + /* fA: */
>>>> + ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
>>>> +
>>>> + /*
>>>> + * fB:
>>>> + *
>>>> + * The data ringbuffer tail may have been pushed (by this or
>>>> + * any other task). The updated @tail_lpos must be visible to
>>>> + * all observers before changes to @begin_lpos, @next_lpos, or
>>>> + * @head_lpos by this task are visible in order to allow other
>>>> + * tasks to recognize the invalidation of the data
>>>> + * blocks.
>>>
>>> This sounds strange. The write barrier should be done only on CPU
>>> that really modified tail_lpos. I.e. it should be in _dataring_pop()
>>> after successful dr->tail_lpos modification.
>>
>> The problem is that there are no data dependencies between the different
>> variables. When a new datablock is being reserved, it is critical that
>> all other observers see that the tail_lpos moved forward _before_ any
>> other changes. _dataring_pop() uses an smp_rmb() to synchronize for
>> tail_lpos movement.
>
> It should be symmetric. It makes sense that _dataring_pop() uses an
> smp_rmb(). Then there should be wmb() in dataring_push().
dataring_pop() is adjusting the tail. dataring_push() is adjusting the
head. These operations are handled (ordered) separately. They do not
need be happening in lockstep. They don't need to be happening on the
same CPU.
> The wmb() should be done only by the CPU that actually did the write.
> And it should be done after the write. This is why I suggested to
> do it after cmpxchg(dr->head_lpos).
If CPU0 issues an smp_wmb() after moving the tail and (after seeing the
moved tail) CPU1 issues an smp_wmb() after updating the head, it is
still possible for CPU2 to see the head move (and possibly even overtake
the tail) before seeing the tail move.
If a CPU didn't move the tail but _will_ move the head, only a full
memory barrier will allow _all_ observers to see the tail move before
seeing the head move.
>> This CPU is about to make some changes and may have seen an updated
>> tail_lpos. An smp_wmb() is useless if this is not the CPU that
>> performed that update. The full memory barrier ensures that all other
>> observers will see what this CPU sees before any of its future
>> changes are seen.
>
> I do not understand it. Full memory barrier will not cause that all
> CPUs will see the same.
I did not write that. I wrote (emphasis added):
The full memory barrier ensures that all other observers will see
what _this_ CPU sees before any of _its_ future changes are seen.
> These barriers need to be symmetric.
They are. The comments for fB list the pairs (all being
smp_mb()/smp_rmb() pairings).
> Back to our situation:
>
> + rmb() should not be needed here because get_new_lpos() provided
> a valid lpos.
>
> + wmb() is not needed because we have not written anything yet
>
> If there was a race with another CPU than cmpxchg(dr->head_lpos)
> would fail and we will need to repeat everything again.
It's not about racing to update the head. It's about making sure that
_all_ CPUs observe that a datablock was invalidated _before_ observing
that _this_ CPU started modifying other shared variables. And again,
this CPU might _not_ be the one that invalidated the datablock
(i.e. moved the tail).
John Ogness
On 2019-08-28, Petr Mladek <[email protected]> wrote:
> I only think that, especially, numlist API is too generic in v4.
> It is not selfcontained. The consistency depends on external barriers.
>
> I believe that it might become fully self-contained and consistent
> if we reduce possibilities of the generic usage. In particular,
> the numlist should allow only linking of reusable structures
> stored in an array.
OK. I will make the numlist the master of the ID-to-node mapping. To
implement the getdesc() callback of the dataring, the printk_ringbuffer
can call a numlist mapping function. Also, numlist will need to provide
a function to bump the descriptor version (as your previous idea already
showed).
I plan to change the array to be numlist nodes. The ID would move into
the numlist node structure and a void-pointer private would be added so
that the numlist user can add private data (for printk_ringbuffer that
would just be a pointer to the dataring structure). When the
printk_ringbuffer gets a never-used numlist node, it can set the private
field.
This has the added benefit of making it easy to detect accidental
never-used descriptor usage when reading dataring garbage. This was
non-trivial and I'm still not sure I solved it correctly. (I've already
spent a week working on a definitive answer to your email[0] asking
about this.)
John Ogness
[0] https://lkml.kernel.org/r/[email protected]
On Wed 2019-08-28 16:03:38, John Ogness wrote:
> On 2019-08-28, Petr Mladek <[email protected]> wrote:
> > I only think that, especially, numlist API is too generic in v4.
> > It is not selfcontained. The consistency depends on external barriers.
> >
> > I believe that it might become fully self-contained and consistent
> > if we reduce possibilities of the generic usage. In particular,
> > the numlist should allow only linking of reusable structures
> > stored in an array.
>
> OK. I will make the numlist the master of the ID-to-node mapping. To
> implement the getdesc() callback of the dataring, the printk_ringbuffer
> can call a numlist mapping function. Also, numlist will need to provide
> a function to bump the descriptor version (as your previous idea already
> showed).
Sounds good.
> I plan to change the array to be numlist nodes. The ID would move into
> the numlist node structure and a void-pointer private would be added so
> that the numlist user can add private data (for printk_ringbuffer that
> would just be a pointer to the dataring structure). When the
> printk_ringbuffer gets a never-used numlist node, it can set the private
> field.
I am not sure that I get the full picture. It would help to see some
snippet of the code (struct declaration).
Anyway, adding void-pointer into struct numlist looks like classic
(userspace?) implementation of dynamically linked structures.
I do not have strong opinion. But I would prefer to stay with
the kernel-style. I mean that the numlist structure is part
of the linked structures. And container_of() is eventually
used to get pointer to the upper structure. Also passing values
via a pointer with generic name (data) slightly complicates the code.
I know that numlist is special because id is used to get the pointer
of the numlist and also the upper structure. Anyway, the kernel
style looks more familiar to me in the kernel context.
But I'll leave it up to you.
> This has the added benefit of making it easy to detect accidental
> never-used descriptor usage when reading dataring garbage. This was
> non-trivial and I'm still not sure I solved it correctly. (I've already
> spent a week working on a definitive answer to your email[0] asking
> about this.)
Would a check for NULL data pointer help here? Well, it might be
motivation for using the pointer.
I wonder if begin_lpos == end_lpos check might be used to detect
never used descriptors. It is already used in another situation
in dataring_desc_init(). IMHO, the values even do not need to be
unaligned. But I might miss something.
Best Regards,
Petr
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> diff --git a/kernel/printk/dataring.c b/kernel/printk/dataring.c
> new file mode 100644
> index 000000000000..911bac593ec1
> --- /dev/null
> +++ b/kernel/printk/dataring.c
> + * DOC: dataring overview
I have to spend a lot of time thinking about dataring API. I started
with some small comments. I ended with a description how I see
the API consistency.
I am still not sure that I see it from the right angle. Let's see.
> + * A dataring is a lockless ringbuffer consisting of variable length data
> + * blocks, each of which are assigned an ID. The IDs map to descriptors, which
> + * contain metadata about the data block. The lookup function mapping IDs to
> + * descriptors is implemented by the user.
> + *
> + * Descriptors
> + * -----------
> + * A descriptor is a handle to a data block. How descriptors are structured
> + * and mapped to IDs is implemented by the user.
We should add a note that logical IDs are used to avoid ABA problems.
> + * Descriptors contain the begin (begin_lpos) and end (next_lpos) logical
> + * positions of the data block they represent. The end logical position
> + * matches the begin logical position of the adjacent data block.
> + *
> + * Why Descriptors?
> + * ----------------
> + * The data ringbuffer supports variable length entities, which means that
> + * data blocks will not always begin at a predictable offset of the byte
> + * array. This is a major problem for lockless writers that, for example, will
> + * compete to expire and reuse old data blocks when the ringbuffer is full.
> + * Without a predictable begin for the data blocks, a writer has no reliable
> + * information about the status of the "free" area. Are any flags or state
> + * variables already set or is it just garbage left over from previous usage?
> + *
> + * Descriptors allow safe and controlled access to data block metadata by
> + * providing predictable offsets for such metadata. This is key to supporting
> + * multiple concurrent lockless writers.
> + *
> + * Behavior
> + * --------
> + * The data ringbuffer allows writers to commit data without regard for
> + * readers. Readers must pre- and post-validate the data blocks they are
> + * processing to be sure the processed data is consistent. A function
> + * dataring_datablock_isvalid() is available for that. Readers can only
> + * iterate data blocks by utilizing an external implementation using
> + * descriptor lookups based on IDs.
> + *
> + * Writers commit data in two steps:
> + *
> + * (1) Reserve a new data block (dataring_push()).
> + * (2) Commit the data block (dataring_datablock_setid()).
> + *
> + * Once a data block is committed, it is available for recycling by another
> + * writer. Therefore, once committed, a writer must no longer access the data
> + * block.
> + *
> + * If data block reservation fails, it means the oldest reserved data block
> + * has not yet been committed by its writer. This acts as a blocker for any
> + * future data block reservation.
Let's start with something easier. I am not sure with the FIFO interface:
> +bool dataring_pop(struct dataring *dr);
> +char *dataring_push(struct dataring *dr, unsigned int size,
> + struct dr_desc *desc, unsigned long id);
This is the same FIFO interface as in numlist. But the semantic
is different, especially the push() part:
+ numlist_push():
+ adds a new (external node)
+ node is valid when added
+ always succeeds
+ dataring_push()
+ just reserves space in its own data structure;
+ data are written later and not valid until anyone calls
setid() (commits)
+ might fail when it is unable to remove old (non-committed)
data with wrong id
The pop() part is similar but the wording is slightly different:
+ numlist_pop()
+ succeeds only when the oldest node is not blocked().
+ blocked state is defined by external callback
+ dataring_pop()
+ succeeds only when the oldest node has correct id (committed)
+ the id is validated using external callback
I am somehow confused by the same names and the differences.
Also dataring_push() does not push any data. It is only making
space.
I believe that the following interface will be much easier
to understand. It is inspired by ringbuffer API. Both APIs
actually have similar usage:
+ dataring_reserve() instead of push()
+ dataring_commit() instead of setid()
+ dataring_remove_oldest() instead of pop()
> +struct dr_datablock *dataring_getdatablock(struct dataring *dr,
> + struct dr_desc *desc, int *size);
please: getdatablock -> get_datablock
On the same note, please, rename getdesc() callback to get_desc().
> +bool dataring_datablock_isvalid(struct dataring *dr, struct dr_desc *desc);
I am always in doubts what it exactly means, especially whether
the data are valid. I suggest something like:
dataring_datablock_is_used() (means reserved or committed)
Or make it clear that it only compares the given ranges:
dataring_range_in_use()
OK, now, the complicated part: consistency:
Dataring API works with:
+ global: head_lpos, tail_lpos
+ per datablock:
+ locally stored: id
+ externally stored: begin_lpos, next_lpos
Let's look at "id" more closely from the dataring API side of view:
1. ID stored the in the external struct prb_desc.
Here the ID match is a bit fuzzy from dataring API point of view.
begin_lpos and next_lpos stored in the external prb_desc can point
to datablock for the current id or id from the previous
cycle because they were not updated yet.
Now, dataring API is allowed to rewrite them in
dataring_push()/reserve(). But it does not know
if they are consistent in another situation.
2. ID stored in the datablock
Here the ID match means that the data are committed
and could get reused. Except when they are already
overwritten and the ID matches just by chance.
Summary:
The dataring API could not decide about the validity on its own.
The API is a slave that does some actions when asked.
The action is that it computes all the lpos indexes.
But the external code decides when they are valid and
when they can be written.
More pitfalls of the internally stored id (in datablock):
+ invalid value (id - 1) is set in dataring_push/reserve()
+ valid value (id) is set in dataring_setid/commit()
+ id is used only in one check and indirectly!!!
In particular, the "id" is used to index the external dr_desc
in dataring_pop().
The external prb_getdesc() callback then checks that it
matches with id stored in the external prb_desc. It returns
back pointer to db_desc. dataring_pop() then checks that
lpos_begin and lpos_next are in bounds.
WARN: I do not see a check whether lpos_begin points back
to exactly the same location when the original
"id" was read.
WARN: Reader, prb_iter_next_valid_entry(), does not check
the ID stored in datablock. It should make sure
the data are consistent. It relies on the fact that
only valid datablocks are reachable via numlist.
Except that the related numlist entries are not
removed from the list when the related datablock
is poped. It relies on the fact that the stored
lpos values are not longer in valid range.
Sigh, these are a lot of assumptions that are
hard to describe and keep in mind.
BTW: The check at the end of the reading is really weak,
see prb_iter_next_valid_entry(). It does not check
that the ID is still the same and that lpos_begin
and lpos_end are still the same.
OK, the above looks a bit scary to me. Let's look at it
from another angle.
What is the purpose of the API?
+ provide the requested space for writer
+ assist with reading the data???
What are the responsibilities (current code):
+ Writer:
+ provides "id" pointing to external struct db_desc
where the ring buffer could store lpos for
the reserved space.
+ ensures exclusive write access to this structure until
the space is reserved.
+ tells when the data are committed and the space can get
reused
+ provides callback to find the external struct prb_desc
for a given ID. The callback does only basic consistency
check (by ID). The information is used to check if
the space is reusable.
+ Reader:
+ uses dataring_getdatablock() to get address and size
of the datablock; There are no barriers and no consistency
checks done by the dataring API.
+ Calls dataring_datablock_isvalid() to check if lpos
are in valid bounds. The dataring API does not provide
any barriers to assist with it.
+ dataring API:
+ Manipulates the global lpos_head, lpos_tail indexes
and "id" field.
+ Uses barriers to manipulate the above three variables
a safe way.
Power and weakness:
+ The dataring API helps:
+ separate lpos computation
+ handle lpos related barriers
+ The dataring API does not help much:
+ writer has many responsibilities and have to use
the API carefully.
+ readers have to get and validate information by more
API calls
+ more consistency checks are possible; it is hard to
say what is must-to-have or nice-to-have and
which API is responsible for what.
Result:
I am sure that this API might get improved. But I am not sure
that it is worth it.
I though that the splint into 3 layers (ringbuffer, dataring, numlist)
might help to split the problem and make it easier. It helped,
definitely. But the consistency is still too complicated. The number
of memory barriers is really high.
There might be also argument that the APIs are reusable. But I do not
believe it. They are too complicated and depends on each other.
I do not think that anyone else might need exactly this complexity
for their use case.
OK, I have spent big part of two weeks with this patchset. I think
that I did a big progress. But it seems that either me or someone
else would need to put quite some effort to make this approach
manageable for me.
This brings me back to my alternative solution, see
https://lore.kernel.org/lkml/[email protected]/
I believe that it might be easier to get into some
usable and manageable state than to fight with the numlist
based approach.
John did a port on top of this patchset, see
https://lore.kernel.org/lkml/[email protected]/
and it seems to work. It would need to get refactored
back to a cleaner state: remove numlist and maybe even
ringbuffer API. What do you think?
Best Regards,
Petr
On (08/27/19 15:03), Petr Mladek wrote:
[..]
> > IMHO the API is sane. The only bizarre rule is that the numlist must
> > always have at least 1 node. But since the readers are non-consuming,
> > there is no real tragedy here.
> >
> > My goal is not to create some fabulous abstract data structure that
> > everyone should use. But I did try to minimize numlist (and dataring) to
> > only be concerned with clearly defined and minimal responsibilities
> > without imposing unnecessary restrictions on the user.
>
> The API is complicated because of the callbacks. It depends on a logic
> that is implemented externally. It makes it abstract to some extent.
>
> My view is that the API would be much cleaner and easier to review
> when the ID handling is "hardcoded" (helper functions). It could be
> made abstract anytime later when there is another user.
Makes sense.
> There should always be a reason why to make a code more complicated
> than necessary. It seems that the only reason is some theoretical
> future user and its theoretical requirements.
Agreed.
> Symmetry is really important. It is often sign of a good design.
>
> Simple and straightforward code is another important thing at
> this stage. The code is complicated and we need to make sure
> that it works. Any optimizations and generalization might
> be done later when needed.
Agreed.
-ss
On Tue, Aug 20, 2019 at 10:15:18AM +0200, Petr Mladek wrote:
> do {
> tail_id = atomic_long_read(&nl->tail_id);
>
> /*
> * Read might fail when the tail node has been removed
> * and reused in parallel.
> */
> if (!numlist_read(nl, tail_id, NULL, &next_id))
> continue;
>
> /* Make sure the node is not the only node on the list. */
> if (next_id == tail_id)
> return NULL;
>
> /* cC: Make sure the node is not busy. */
> if (nl->busy(tail_id, nl->busy_arg))
> return NULL;
>
> while (atomic_long_cmpxchg_relaxed(&nl->tail_id, tail_id, next_id) !=
> tail_id);
Both you and John should have a look at atomic*_try_cmpxchg*(); with
that you can write the above as:
tail_id = atomic_long_read(&nl->tai_id);
do {
...
} while (!atomic_long_try_cmpxchg_relaxed(&nl->tail_id, &tail_id, next_id));
And get better code-gen to boot.
On Thu, Aug 08, 2019 at 12:32:25AM +0206, John Ogness wrote:
> Hello,
>
> This is a follow-up RFC on the work to re-implement much of
> the core of printk. The threads for the previous RFC versions
> are here: v1[0], v2[1], v3[2].
>
> This series only builds upon v3 (i.e. the first part of this
> series is exactly v3). The main purpose of this series is to
> replace the current printk ringbuffer with the new
> ringbuffer. As was discussed[3], this is a conservative
> first step to rework printk. For example, all logbuf_lock
> usage is kept even though the new ringbuffer does not
> require it. This avoids any side-effect bugs in case the
> logbuf_lock is (unintentionally) synchronizing more than
> just the ringbuffer. However, this also means that the
> series does not bring any improvements, just swapping out
> implementations. A future patch will remove the logbuf_lock.
So after reading most of the first patch (and it look _much_ better than
previous times), I'm left wondering *why* ?!
That is, why do we need this complexity, as compared to that
CPU serialized approach?
What do we hope to gain by doing a multi-writer buffer? Yes, it is
awesome, but from where I'm sitting it is also completely silly, because
we'll want to CPU serialize the serial console anyway (otherwise it gets
to be a completely unreadable mess).
By having the whole thing CPU serialized we looose multi-writer and
consequently the buffer gets to be significantly simpler (as you know;
because ISTR you've actually done this before -- but I cannot find here
why that didn't live).
In my book simpler is better here. printk() is an absolute utter slow
path anyway, nobody cares about the performance much, and I'm thinking
that it should be plenty fast enough as long as you don't run a
synchronous serial output (which is exactly what I do do/require
anyway).
So can we have a few words to explain why we need multi-writer and all
this complexity?
On Thu, Sep 05, 2019 at 03:05:13PM +0200, Petr Mladek wrote:
> The serialized approach used a lock. It was re-entrant and thus less
> error-prone but still a lock.
>
> The lock was planed to be used not only to access the buffer but also
> for eventual locking inside lockless consoles. It might allow to
> have some synchronization even in lockless consoles. But it
> would be big-kernel-lock-like style. It might create yet
> another maze of problems.
I really don't see your point. All it does is limit buffer writers to a
single CPU, and does the same for the atomic/early console output.
But it must very much be a leaf lock -- that is, there must not be any
locking inside it -- and that is fine, if a console cannot do lockless
output, it simply cannot be marked as having an atomic/early console.
You've seen the force_earlyprintk patches I use [*], that stuff works
and is infinitely better than the current printk trainwreck -- and it
uses exactly such serialization -- although I only added it to make the
output actually readable. And _that_ is exactly why I propose adding it,
you need it _anyway_.
So the argument goes like:
- synchronous output to lockless consoles (early serial) is mandatory
- such output needs to be CPU serialized, otherwise it becomes
unreadable garbage.
- since we need that serialization anyway, might as well lift it up one
layer an put it around the buffer.
Since a single-cpu buffer writer can be wait free (and relatively
simple), the only possible waiting is on the lockless console (polling
until the UART is ready for it's next byte). There is nothing else. It
will make progress.
> If we remove per-CPU buffers in NMI. We would need to synchronize
> again printing backtraces from all CPUs. Otherwise they would get
> mixed and hard to read. It might be solved by some prefix and
> sorting in userspace but...
It must have cpu prefixes anyway; the multi-writer thing will equally
mix them together. This is a complete non sequitur.
That current printk stuff is just pure and utter crap. Those NMI buffers
are a trainwreck and need to die a horrible death.
> I agree that this lockless variant is really complicated. I am not
> able to prove that it is race free as it is now. I understand
> the algorithm. But there are too many synchronization points.
>
> Peter, have you seen my alternative approach, please. See
> https://lore.kernel.org/lkml/[email protected]/
>
> It uses two tricks:
>
> 1. Two bits in the sequence number are used to track the state
> of the related data. It allows to implement the entire
> life cycle of each entry using atomic operation on a single
> variable.
>
> 2. There is a helper function to read valid data for each entry,
> see prb_read_desc(). It checks the state before and after
> reading the data to make sure that they are valid. And
> it includes the needed read barriers. As a result there
> are only three explicit barriers in the code. All other
> are implicitly done by cmpxchg() atomic operations.
>
> The alternative lockless approach is still more complicated than
> the serialized one. But I think that it is manageable thanks to
> the simplified state tracking. And I might safe use some pain
> in the long term.
I've not looked at it yet, sorry. But per the above argument of needing
the CPU serialization _anyway_, I don't see a compelling reason not to
use it.
It is simple, it works. Let's use it.
If you really fancy a multi-writer buffer, you can always switch to one
later, if you can convince someone it actually brings benefits and not
just head-aches.
So I have something roughly like the below; I'm suggesting you add the
line with + on:
int early_vprintk(const char *fmt, va_list args)
{
char buf[256]; // teh suck!
int old, n = vscnprintf(buf, sizeof(buf), fmt, args);
old = cpu_lock();
+ printk_buffer_store(buf, n);
early_console->write(early_console, buf, n);
cpu_unlock(old);
return n;
}
(yes, yes, we can get rid of the on-stack @buf thing with a
reserve+commit API, but who cares :-))
[*] git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git debug/experimental
On Wed 2019-09-04 14:35:31, Peter Zijlstra wrote:
> On Thu, Aug 08, 2019 at 12:32:25AM +0206, John Ogness wrote:
> > Hello,
> >
> > This is a follow-up RFC on the work to re-implement much of
> > the core of printk. The threads for the previous RFC versions
> > are here: v1[0], v2[1], v3[2].
> >
> > This series only builds upon v3 (i.e. the first part of this
> > series is exactly v3). The main purpose of this series is to
> > replace the current printk ringbuffer with the new
> > ringbuffer. As was discussed[3], this is a conservative
> > first step to rework printk. For example, all logbuf_lock
> > usage is kept even though the new ringbuffer does not
> > require it. This avoids any side-effect bugs in case the
> > logbuf_lock is (unintentionally) synchronizing more than
> > just the ringbuffer. However, this also means that the
> > series does not bring any improvements, just swapping out
> > implementations. A future patch will remove the logbuf_lock.
>
> So after reading most of the first patch (and it look _much_ better than
> previous times), I'm left wondering *why* ?!
>
> That is, why do we need this complexity, as compared to that
> CPU serialized approach?
The serialized approach used a lock. It was re-entrant and thus less
error-prone but still a lock.
The lock was planed to be used not only to access the buffer but also
for eventual locking inside lockless consoles. It might allow to
have some synchronization even in lockless consoles. But it
would be big-kernel-lock-like style. It might create yet
another maze of problems.
If we remove per-CPU buffers in NMI. We would need to synchronize
again printing backtraces from all CPUs. Otherwise they would get
mixed and hard to read. It might be solved by some prefix and
sorting in userspace but...
This why I asked to see a fully lockless code to see how
more complicated it was. John told me that he had an early
version of it around.
I agree that this lockless variant is really complicated. I am not
able to prove that it is race free as it is now. I understand
the algorithm. But there are too many synchronization points.
Peter, have you seen my alternative approach, please. See
https://lore.kernel.org/lkml/[email protected]/
It uses two tricks:
1. Two bits in the sequence number are used to track the state
of the related data. It allows to implement the entire
life cycle of each entry using atomic operation on a single
variable.
2. There is a helper function to read valid data for each entry,
see prb_read_desc(). It checks the state before and after
reading the data to make sure that they are valid. And
it includes the needed read barriers. As a result there
are only three explicit barriers in the code. All other
are implicitly done by cmpxchg() atomic operations.
The alternative lockless approach is still more complicated than
the serialized one. But I think that it is manageable thanks to
the simplified state tracking. And I might safe use some pain
in the long term.
> In my book simpler is better here. printk() is an absolute utter slow
> path anyway, nobody cares about the performance much, and I'm thinking
> that it should be plenty fast enough as long as you don't run a
> synchronous serial output (which is exactly what I do do/require
> anyway).
I fully agree.
Best Regards,
Petr
[ Added Ted and Linux Plumbers ]
On Thu, 5 Sep 2019 17:38:21 +0200 (CEST)
Thomas Gleixner <[email protected]> wrote:
> On Thu, 5 Sep 2019, Peter Zijlstra wrote:
> > On Thu, Sep 05, 2019 at 03:05:13PM +0200, Petr Mladek wrote:
> > > The alternative lockless approach is still more complicated than
> > > the serialized one. But I think that it is manageable thanks to
> > > the simplified state tracking. And I might safe use some pain
> > > in the long term.
> >
> > I've not looked at it yet, sorry. But per the above argument of needing
> > the CPU serialization _anyway_, I don't see a compelling reason not to
> > use it.
> >
> > It is simple, it works. Let's use it.
> >
> > If you really fancy a multi-writer buffer, you can always switch to one
> > later, if you can convince someone it actually brings benefits and not
> > just head-aches.
>
> Can we please grab one of the TBD slots at kernel summit next week, sit
> down in a room and hash that out?
>
We should definitely be able to find a room that will be available next
week.
-- Steve
On Thu, 5 Sep 2019, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 03:05:13PM +0200, Petr Mladek wrote:
> > The alternative lockless approach is still more complicated than
> > the serialized one. But I think that it is manageable thanks to
> > the simplified state tracking. And I might safe use some pain
> > in the long term.
>
> I've not looked at it yet, sorry. But per the above argument of needing
> the CPU serialization _anyway_, I don't see a compelling reason not to
> use it.
>
> It is simple, it works. Let's use it.
>
> If you really fancy a multi-writer buffer, you can always switch to one
> later, if you can convince someone it actually brings benefits and not
> just head-aches.
Can we please grab one of the TBD slots at kernel summit next week, sit
down in a room and hash that out?
Thanks,
tglx
On 2019-09-05, Steven Rostedt <[email protected]> wrote:
>>> But per the above argument of needing the CPU serialization
>>> _anyway_, I don't see a compelling reason not to use it.
>>>
>>> It is simple, it works. Let's use it.
>>>
>>> If you really fancy a multi-writer buffer, you can always switch to
>>> one later, if you can convince someone it actually brings benefits
>>> and not just head-aches.
>>
>> Can we please grab one of the TBD slots at kernel summit next week,
>> sit down in a room and hash that out?
>>
>
> We should definitely be able to find a room that will be available
> next week.
FWIW, on Monday at 12:45 I am giving a talk[0] on the printk
rework. I'll be dedicating a few slides to presenting the lockless
multi-writer design, but will also talk about the serialized CPU
approach from RFCv1.
John Ogness
[0] https://www.linuxplumbersconf.org/event/4/contributions/290/
On Thu, Sep 05, 2019 at 04:31:18PM +0200, Peter Zijlstra wrote:
> So I have something roughly like the below; I'm suggesting you add the
> line with + on:
>
> int early_vprintk(const char *fmt, va_list args)
> {
> char buf[256]; // teh suck!
> int old, n = vscnprintf(buf, sizeof(buf), fmt, args);
>
> old = cpu_lock();
> + printk_buffer_store(buf, n);
> early_console->write(early_console, buf, n);
> cpu_unlock(old);
>
> return n;
> }
>
> (yes, yes, we can get rid of the on-stack @buf thing with a
> reserve+commit API, but who cares :-))
Another approach is something like:
DEFINE_PER_CPU(int, printk_nest);
DEFINE_PER_CPU(char, printk_line[4][256]);
int vprintk(const char *fmt, va_list args)
{
int c, n, i;
char *buf;
preempt_disable();
i = min(3, this_cpu_inc_return(printk_nest) - 1);
buf = this_cpu_ptr(printk_line[i]);
n = vscnprintf(buf, 256, fmt, args);
c = cpu_lock();
printk_buffer_store(buf, n);
if (early_console)
early_console->write(early_console, buf, n);
cpu_unlock(c);
this_cpu_dec(printk_nest);
preempt_enable();
return n;
}
Again, simple and straight forward (and I'm sure it's been mentioned
before too).
We really should not be making this stuff harder than it needs to be
(and anybody whining about lines longer than 256 characters can just go
away, those are unreadable anyway).
On Thu 2019-09-05 12:11:01, Steven Rostedt wrote:
>
> [ Added Ted and Linux Plumbers ]
>
> On Thu, 5 Sep 2019 17:38:21 +0200 (CEST)
> Thomas Gleixner <[email protected]> wrote:
>
> > On Thu, 5 Sep 2019, Peter Zijlstra wrote:
> > > On Thu, Sep 05, 2019 at 03:05:13PM +0200, Petr Mladek wrote:
> > > > The alternative lockless approach is still more complicated than
> > > > the serialized one. But I think that it is manageable thanks to
> > > > the simplified state tracking. And I might safe use some pain
> > > > in the long term.
> > >
> > > I've not looked at it yet, sorry. But per the above argument of needing
> > > the CPU serialization _anyway_, I don't see a compelling reason not to
> > > use it.
> > >
> > > It is simple, it works. Let's use it.
> > >
> > > If you really fancy a multi-writer buffer, you can always switch to one
> > > later, if you can convince someone it actually brings benefits and not
> > > just head-aches.
> >
> > Can we please grab one of the TBD slots at kernel summit next week, sit
> > down in a room and hash that out?
> >
>
> We should definitely be able to find a room that will be available next
> week.
Sounds great. I am blocked only during Livepatching miniconference
that is scheduled on Wednesday, Sep 11 at 15:00
(basically the very last slot).
Best Regards,
Petr
On (09/06/19 11:06), Peter Zijlstra wrote:
> Another approach is something like:
>
> DEFINE_PER_CPU(int, printk_nest);
> DEFINE_PER_CPU(char, printk_line[4][256]);
>
> int vprintk(const char *fmt, va_list args)
> {
> int c, n, i;
> char *buf;
>
> preempt_disable();
> i = min(3, this_cpu_inc_return(printk_nest) - 1);
> buf = this_cpu_ptr(printk_line[i]);
> n = vscnprintf(buf, 256, fmt, args);
>
> c = cpu_lock();
> printk_buffer_store(buf, n);
> if (early_console)
> early_console->write(early_console, buf, n);
> cpu_unlock(c);
>
> this_cpu_dec(printk_nest);
> preempt_enable();
>
> return n;
> }
>
> Again, simple and straight forward (and I'm sure it's been mentioned
> before too).
:)
---
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index 139c310049b1..9c73eb6259ce 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -103,7 +103,10 @@ static __printf(2, 0) int printk_safe_log_store(struct printk_safe_seq_buf *s,
if (atomic_cmpxchg(&s->len, len, len + add) != len)
goto again;
- queue_flush_work(s);
+ if (early_console)
+ early_console->write(early_console, s->buffer + len, add);
+ else
+ queue_flush_work(s);
return add;
}
---
-ss
On Fri, Sep 06, 2019 at 07:09:43PM +0900, Sergey Senozhatsky wrote:
> ---
> diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
> index 139c310049b1..9c73eb6259ce 100644
> --- a/kernel/printk/printk_safe.c
> +++ b/kernel/printk/printk_safe.c
> @@ -103,7 +103,10 @@ static __printf(2, 0) int printk_safe_log_store(struct printk_safe_seq_buf *s,
> if (atomic_cmpxchg(&s->len, len, len + add) != len)
> goto again;
>
> - queue_flush_work(s);
> + if (early_console)
> + early_console->write(early_console, s->buffer + len, add);
> + else
> + queue_flush_work(s);
> return add;
> }
You've not been following along, that generates absolutely unreadable
garbage.
On Fri 2019-09-06 11:06:27, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 04:31:18PM +0200, Peter Zijlstra wrote:
> > So I have something roughly like the below; I'm suggesting you add the
> > line with + on:
> >
> > int early_vprintk(const char *fmt, va_list args)
> > {
> > char buf[256]; // teh suck!
> > int old, n = vscnprintf(buf, sizeof(buf), fmt, args);
> >
> > old = cpu_lock();
> > + printk_buffer_store(buf, n);
> > early_console->write(early_console, buf, n);
> > cpu_unlock(old);
> >
> > return n;
> > }
> >
> > (yes, yes, we can get rid of the on-stack @buf thing with a
> > reserve+commit API, but who cares :-))
>
> Another approach is something like:
>
> DEFINE_PER_CPU(int, printk_nest);
> DEFINE_PER_CPU(char, printk_line[4][256]);
>
> int vprintk(const char *fmt, va_list args)
> {
> int c, n, i;
> char *buf;
>
> preempt_disable();
> i = min(3, this_cpu_inc_return(printk_nest) - 1);
> buf = this_cpu_ptr(printk_line[i]);
> n = vscnprintf(buf, 256, fmt, args);
>
> c = cpu_lock();
> printk_buffer_store(buf, n);
> if (early_console)
> early_console->write(early_console, buf, n);
> cpu_unlock(c);
>
> this_cpu_dec(printk_nest);
> preempt_enable();
>
> return n;
> }
>
> Again, simple and straight forward (and I'm sure it's been mentioned
> before too).
>
> We really should not be making this stuff harder than it needs to be
> (and anybody whining about lines longer than 256 characters can just go
> away, those are unreadable anyway).
I wish it was that simple. It is possible that I see it too
complicated. But this comes to my mind:
1. The simple printk_buffer_store(buf, n) is not NMI safe. For this,
we might need the reserve-store approach.
2. The simple approach works only with lockless consoles. We need
something else for the rest at least for NMI. Simle offloading
to a kthread has been blocked for years. People wanted the
trylock-and-flush-immediately approach.
3. console_lock works in tty as a big kernel lock. I do not know
much details. But people familiar with the code said that
it was a disaster. I assume that tty is still rather
important console. I am not sure how it would fit into the
simple approach.
4. The console handling has got non-synchronous (console_trylock)
quite early (ver 2.4.10, year 2001). The reason was to do not
serialize CPUs by the speed of the console.
Serialized output could remove many troubles. The logic in
console_unlock() is really crazy. It might be acceptable
for debugging. But is it acceptable on production systems?
5. John planed to use the cpu_lock in the lockless consoles.
I wonder if it was only in the console->write() callback
or if it would spread the lock more widely.
6. One huge nightmare is panic() and code called from there.
It is a maze of hacks, including arch-specific code, to
prevent deadlocks and get the messages out.
Any lock might be blocked on any CPU at the moment. Or it
it might become blocked when CPUs are stopped by NMI.
Fully lock-less log buffer might save us some headache.
I am not sure whether a single lock shared between printk()
writers and console drivers will make the situation easier
or more complicated.
7. People would complain when continuous lines become less
reliable. It might be most visible when mixing backtraces
from all CPUs. Simple sorting by prefix will not make
it readable. The historic way was to synchronize CPUs
by a spin lock. But then the cpu_lock() could cause
deadlock.
I would be really happy when we could ignore some of the problems
or find an easy solution. I just want to make sure that we take
into account all the known aspects.
I am sure that we could do better than we do now. I do not want
to block any improvements. I am just a bit lost in the many
black corners.
Best Regards,
Petr
On (09/06/19 12:49), Peter Zijlstra wrote:
> On Fri, Sep 06, 2019 at 07:09:43PM +0900, Sergey Senozhatsky wrote:
>
> > ---
> > diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
> > index 139c310049b1..9c73eb6259ce 100644
> > --- a/kernel/printk/printk_safe.c
> > +++ b/kernel/printk/printk_safe.c
> > @@ -103,7 +103,10 @@ static __printf(2, 0) int printk_safe_log_store(struct printk_safe_seq_buf *s,
> > if (atomic_cmpxchg(&s->len, len, len + add) != len)
> > goto again;
> >
> > - queue_flush_work(s);
> > + if (early_console)
> > + early_console->write(early_console, s->buffer + len, add);
> > + else
> > + queue_flush_work(s);
> > return add;
> > }
>
> You've not been following along, that generates absolutely unreadable
> garbage.
This was more of a joke/reference to "Those NMI buffers are a trainwreck
and need to die a horrible death". Of course this needs a re-entrant cpu
lock to serialize access to atomic/early consoles. But here is one more
missing thing - we need atomic/early consoles on a separate, sort of
immutable, list. And probably forbid any modifications of such console
drivers, (PM, etc.) If we can do this then we don't need to take console_sem
while we iterate that list, which removes sched/timekeeping locks out
of the fast printk() path.
We, at the same time, don't have that many options on systems without
atomic/early consoles. Move printing to NMI (e.g. up to X pending logbug
lines per NMI)? Move printing to IPI (again, up to X pending logbuf lines
per IPI)? printk() softirqs?
-ss
On Fri, Sep 06, 2019 at 02:42:11PM +0200, Petr Mladek wrote:
> I wish it was that simple. It is possible that I see it too
> complicated. But this comes to my mind:
>
> 1. The simple printk_buffer_store(buf, n) is not NMI safe. For this,
> we might need the reserve-store approach.
Of course it is, and sure it has a reserve+commit internally. I'm sure I
posted an implenentation of something like this at some point.
It is lockless (wait-free in fact, which is stronger) and supports
multi-readers. I'm sure I posted something like that before, and ISTR
John has something like that around somewhere too.
The only thing I'm omitting is doing vscnprintf() twice, first to
determine the length, and then into the reservation. Partly because I
think that is silly and 256 chars should be plenty for everyone, partly
because that avoids having vscnprintf() inside the cpu_lock() and partly
because it is simpler to not do that.
> 2. The simple approach works only with lockless consoles. We need
> something else for the rest at least for NMI. Simle offloading
> to a kthread has been blocked for years. People wanted the
> trylock-and-flush-immediately approach.
Have an irq_work to wake up a kthread that will print to shit consoles.
Seriously.. the trylock and flush stuff is horrific crap. You guys been
piling on the hack for years now, surely you're tired of that gunk?
(and if you _reallllly_ care, build a flush function that 'works'
mostly and waits for the kthread of choice to finish printing to the
'imporant' shit console).
> 3. console_lock works in tty as a big kernel lock. I do not know
> much details. But people familiar with the code said that
> it was a disaster. I assume that tty is still rather
> important console. I am not sure how it would fit into the
> simple approach.
The kernel thread in charge of printing doesn't care.
> 4. The console handling has got non-synchronous (console_trylock)
> quite early (ver 2.4.10, year 2001). The reason was to do not
> serialize CPUs by the speed of the console.
>
> Serialized output could remove many troubles. The logic in
> console_unlock() is really crazy. It might be acceptable
> for debugging. But is it acceptable on production systems?
The kernel thread doesn't care. If you care about independent consoles,
have a kernel thread per console. That way a fast console can print fast
while a slow console will print slow and everybody is happy.
> 5. John planed to use the cpu_lock in the lockless consoles.
> I wonder if it was only in the console->write() callback
> or if it would spread the lock more widely.
Right, I'm saying that since you need it anyway, lift it up one layer.
It makes everything simpler. More simpler is more better.
> 6. One huge nightmare is panic() and code called from there.
> It is a maze of hacks, including arch-specific code, to
> prevent deadlocks and get the messages out.
>
> Any lock might be blocked on any CPU at the moment. Or it
> it might become blocked when CPUs are stopped by NMI.
>
> Fully lock-less log buffer might save us some headache.
> I am not sure whether a single lock shared between printk()
> writers and console drivers will make the situation easier
> or more complicated.
So panic is a non issue for the lockless console.
It only matters if you care to get something out of the crap consoles.
So print everything to the lockless buffer and lockless consoles, then
try and force as much as you can out of the crap consoles.
If you die, tought luck, at least the lockless consoles and kdump image
have the whole message.
> 7. People would complain when continuous lines become less
> reliable. It might be most visible when mixing backtraces
> from all CPUs. Simple sorting by prefix will not make
> it readable. The historic way was to synchronize CPUs
> by a spin lock. But then the cpu_lock() could cause
> deadlock.
Why? I'm running with that thing on, I've never seen a deadlock ever
because of it. In fact, i've gotten output that is plain impossible with
the current junk.
The cpu-lock is inside the all-backtrace spinlock, not outside. And as I
said yesterday, only the lockless console has any wait-loops while
holding the cpu-lock. It _will_ make progress.
> I would be really happy when we could ignore some of the problems
> or find an easy solution. I just want to make sure that we take
> into account all the known aspects.
>
> I am sure that we could do better than we do now. I do not want
> to block any improvements. I am just a bit lost in the many
> black corners.
I hope the above helps. Also note that Linus' memory buffer is a
lockless console.
On Fri, Sep 06, 2019 at 04:01:26PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 06, 2019 at 02:42:11PM +0200, Petr Mladek wrote:
> > 7. People would complain when continuous lines become less
> > reliable. It might be most visible when mixing backtraces
> > from all CPUs. Simple sorting by prefix will not make
> > it readable. The historic way was to synchronize CPUs
> > by a spin lock. But then the cpu_lock() could cause
> > deadlock.
>
> Why? I'm running with that thing on, I've never seen a deadlock ever
> because of it. In fact, i've gotten output that is plain impossible with
> the current junk.
>
> The cpu-lock is inside the all-backtrace spinlock, not outside. And as I
> said yesterday, only the lockless console has any wait-loops while
> holding the cpu-lock. It _will_ make progress.
Oooh, I think I see. So one solution would be to pass the NMI along in
chain like. Send it to a single CPU at a time, when finished, send it
to the next.
On (09/06/19 16:01), Peter Zijlstra wrote:
> > 2. The simple approach works only with lockless consoles. We need
> > something else for the rest at least for NMI. Simle offloading
> > to a kthread has been blocked for years. People wanted the
> > trylock-and-flush-immediately approach.
>
> Have an irq_work to wake up a kthread that will print to shit consoles.
Do we need sched dependency? We can print a batch of pending
logbuf messages and queue another irw_work if there are more
pending messages, right?
-ss
On 2019-09-06, Peter Zijlstra <[email protected]> wrote:
>> I wish it was that simple. It is possible that I see it too
>> complicated. But this comes to my mind:
>>
>> 1. The simple printk_buffer_store(buf, n) is not NMI safe. For this,
>> we might need the reserve-store approach.
>
> Of course it is, and sure it has a reserve+commit internally. I'm sure
> I posted an implenentation of something like this at some point.
>
> It is lockless (wait-free in fact, which is stronger) and supports
> multi-readers. I'm sure I posted something like that before, and ISTR
> John has something like that around somewhere too.
Yes. It was called RFCv1[0].
> The only thing I'm omitting is doing vscnprintf() twice, first to
> determine the length, and then into the reservation. Partly because I
> think that is silly and 256 chars should be plenty for everyone,
> partly because that avoids having vscnprintf() inside the cpu_lock()
> and partly because it is simpler to not do that.
Yes, this approach is more straight forward and was suggested in the
feedback to RFCv1. Although I think the current limit (1024) should
still be OK. Then we have 1 dedicated page per CPU for vscnprintf().
>> 2. The simple approach works only with lockless consoles. We need
>> something else for the rest at least for NMI. Simle offloading
>> to a kthread has been blocked for years. People wanted the
>> trylock-and-flush-immediately approach.
>
> Have an irq_work to wake up a kthread that will print to shit
> consoles.
This is the approach in all the RFC versions.
>> 5. John planed to use the cpu_lock in the lockless consoles.
>> I wonder if it was only in the console->write() callback
>> or if it would spread the lock more widely.
The 8250 driver in RFCv1 uses the cpu-lock in console->write() on a
per-character basis and in console->write_atomic() on a per-line
basis. This is necessary because the 8250 driver cannot run lockless. It
requires synchronization for its UART_IER clearing/setting before/after
transmit.
IMO the existing early console implementations are _not_ safe for
preemption. This was the reason for the new write_atomic() callback in
RFCv1.
> Right, I'm saying that since you need it anyway, lift it up one layer.
> It makes everything simpler. More simpler is more better.
This was my reasoning for using the cpu-lock in RFCv1. Moving to a
lockless ringbuffer for RFCv2 was because there was too much
resistance/concern surrounding the cpu-lock. But yes, if we want to
support atomic consoles, the cpu-lock will still be needed.
The cpu-lock (and the related concerns) were discussed here[1].
>> 7. People would complain when continuous lines become less
>> reliable. It might be most visible when mixing backtraces
>> from all CPUs. Simple sorting by prefix will not make
>> it readable. The historic way was to synchronize CPUs
>> by a spin lock. But then the cpu_lock() could cause
>> deadlock.
>
> Why? I'm running with that thing on, I've never seen a deadlock ever
> because of it.
As was discussed in the thread I just mentioned, introducing the
cpu-lock means that _all_ NMI functions taking spinlocks need to use the
cpu-lock. Even though Peter has never seen a deadlock, a deadlock is
possible if a BUG is triggered while one such spinlock is held. Also
note that it is not allowed to have 2 cpu-locks in the system. This is
where the BKL references started showing up.
Spinlocks in NMI context are rare, but they have existed in the past and
could exist again in the future. My suggestion was to create the policy
that any needed locking in NMI context must be done using the one
cpu-lock.
John Ogness
[0] https://lkml.kernel.org/r/[email protected]
[1] https://lkml.kernel.org/r/[email protected]
On Fri, Sep 06, 2019 at 04:01:26PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 06, 2019 at 02:42:11PM +0200, Petr Mladek wrote:
> > 7. People would complain when continuous lines become less
> > reliable. It might be most visible when mixing backtraces
> > from all CPUs. Simple sorting by prefix will not make
> > it readable. The historic way was to synchronize CPUs
> > by a spin lock. But then the cpu_lock() could cause
> > deadlock.
>
> Why? I'm running with that thing on, I've never seen a deadlock ever
> because of it. In fact, i've gotten output that is plain impossible with
> the current junk.
>
> The cpu-lock is inside the all-backtrace spinlock, not outside. And as I
> said yesterday, only the lockless console has any wait-loops while
> holding the cpu-lock. It _will_ make progress.
So I've been a huge flaming idiot.. so while I'm not particularly
sympathetic to NMIs that block, there are a number of really trivial
deadlocks possible -- and it is a minor miracle I've not actually hit
them (I suppose because printk() isn't really all that common).
The whole cpu-lock thing I had needs to go. But not having it makes
lockless console output unreadable and unsable garbage.
I've got some ideas on a replacement, but I need to further consider it.
:-/
Folks,
printk meeting at LPC Meeting Room - SAFIRA on Tuesday Sept 10. from 2PM to 3PM.
Thanks
tglx
On (09/06/19 16:01), Peter Zijlstra wrote:
> In fact, i've gotten output that is plain impossible with
> the current junk.
Peter, can you post any of those backtraces? Very curious.
-ss
On (08/10/19 07:53), Thomas Gleixner wrote:
>
> Right now we have an implementation for serial only, but that already is
> useful. I nicely got (minimaly garbled) crash dumps out of an NMI
> handler. With the current mainline console code the machine just hung.
>
Thomas, any chance you can post backtraces? Just curious where exactly
current printk() and console_drivers() hung.
-ss
On 2019-09-09, Thomas Gleixner <[email protected]> wrote:
> printk meeting at LPC Meeting Room - SAFIRA on Tuesday Sept 10. from
> 2PM to 3PM.
The meeting was very effective in letting us come to decisions on the
direction to take. Thanks for the outstanding attendance! It certainly
saved hundreds of hours of reading/writing emails!
The slides[0] from my printk talk served as a _rough_ basis for the
discussion. Here is a summary of the decisions:
1. As a new ringbuffer, the lockless state-based proof of concept
posted[1] by Petr Mladek will be used. Since it has far fewer memory
barriers in the code, it will be simpler to review. I posted[2] a patch
to hack my RFCv4 into a fully functional version of Petr's PoC. So we
know it will work. With this, printk() can be called from any context
and the message will be put directly into the ringbuffer.
2. A kernel thread will be created for each registered console, each
responsible for being the sole printers to their respective
consoles. With this, console printing is _fully_ decoupled from printk()
callers.
3. Rather than defining emergency _messages_, we define an emergency
_state_ where the kernel wants to flush the messages immediately before
dying. Unlike oops_in_progress, this state will not be visible to
anything outside of the printk infrastructure.
4. When in emergency state, the kernel will use a new console callback
write_atomic() to flush the messages in whatever context the CPU is in
at that moment. Only consoles that implement the NMI-safe write_atomic()
will be able to flush in this state.
5. LOG_CONT message pieces will be stored as individual records in the
ringbuffer. They will be "assembled" by the ringbuffer reader (in
kernel) before being copied to userspace or printed on the
console. Since each record in the ringbuffer has its own sequence
number, this has the effect for userspace that sequence numbers will
appear to be skipped. (i.e. if there were LOG_CONT pieces with sequence
numbers 4, 5, 6, the fully assembled message will appear only as
sequence number 6 (and will have the timestamp from the first piece)).
6. A new may-sleep function pr_flush() will be made available to wait
for all previously printk'd messages to be output on all consoles before
proceeding. For example:
pr_cont("Running test ABC... ");
pr_flush();
do_test();
pr_cont("PASSED\n");
pr_flush();
7. The ringbuffer raw data (log_buf) will be simplified to only consist
of alignment-padded strings separated by a single unsigned long. All
record meta-data (timestamp, loglevel, caller_id, etc.) will move into
the record descriptors, which are located in an extra array. The
appropriate crash tools will need to be adjusted for this. (FYI: The
unsigned long in the string data is the descriptor ID.)
8. A CPU-reentrant spinlock (the so-called cpu-lock) will be used to
synchronize/stop the kthreads during emergency state.
9. Support for printk dictionaries will be discontinued. I will look
into who is using this and why. If printk dictionaries are important for
you, speak up now!
(There was also some talk about possibly discontinuing kdb, but that is
not directly related to printk. I'm mentioning it here in case anyone
wants to pursue that.)
If I missed (or misunderstood) anything, please let me know!
John Ogness
[0] https://www.linuxplumbersconf.org/event/4/contributions/290/attachments/276/463/lpc2019_jogness_printk.pdf
[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
On Fri, Sep 13, 2019 at 3:26 PM John Ogness <[email protected]> wrote:
>
> On 2019-09-09, Thomas Gleixner <[email protected]> wrote:
> > printk meeting at LPC Meeting Room - SAFIRA on Tuesday Sept 10. from
> > 2PM to 3PM.
>
> The meeting was very effective in letting us come to decisions on the
> direction to take. Thanks for the outstanding attendance! It certainly
> saved hundreds of hours of reading/writing emails!
>
> The slides[0] from my printk talk served as a _rough_ basis for the
> discussion. Here is a summary of the decisions:
>
> 1. As a new ringbuffer, the lockless state-based proof of concept
> posted[1] by Petr Mladek will be used. Since it has far fewer memory
> barriers in the code, it will be simpler to review. I posted[2] a patch
> to hack my RFCv4 into a fully functional version of Petr's PoC. So we
> know it will work. With this, printk() can be called from any context
> and the message will be put directly into the ringbuffer.
>
> 2. A kernel thread will be created for each registered console, each
> responsible for being the sole printers to their respective
> consoles. With this, console printing is _fully_ decoupled from printk()
> callers.
Is the plan to split the console_lock up into a per-console thing? Or
postponed for later on?
> 3. Rather than defining emergency _messages_, we define an emergency
> _state_ where the kernel wants to flush the messages immediately before
> dying. Unlike oops_in_progress, this state will not be visible to
> anything outside of the printk infrastructure.
>
> 4. When in emergency state, the kernel will use a new console callback
> write_atomic() to flush the messages in whatever context the CPU is in
> at that moment. Only consoles that implement the NMI-safe write_atomic()
> will be able to flush in this state.
>
> 5. LOG_CONT message pieces will be stored as individual records in the
> ringbuffer. They will be "assembled" by the ringbuffer reader (in
> kernel) before being copied to userspace or printed on the
> console. Since each record in the ringbuffer has its own sequence
> number, this has the effect for userspace that sequence numbers will
> appear to be skipped. (i.e. if there were LOG_CONT pieces with sequence
> numbers 4, 5, 6, the fully assembled message will appear only as
> sequence number 6 (and will have the timestamp from the first piece)).
>
> 6. A new may-sleep function pr_flush() will be made available to wait
> for all previously printk'd messages to be output on all consoles before
> proceeding. For example:
>
> pr_cont("Running test ABC... ");
> pr_flush();
>
> do_test();
>
> pr_cont("PASSED\n");
> pr_flush();
Just crossed my mind: Could/should we lockdep-annotate pr_flush (take
a lockdep map in there that we also take around the calls down into
console drivers in each of the console printing kthreads or something
like that)? Just to avoid too many surprises when people call pr_flush
from within gpu drivers and wonder why it doesn't work so well.
Although with this nice plan we'll take the modeset paths fully out of
the printk paths (even for normal outputs) I hope, so should be a lot
more reasonable.
> 7. The ringbuffer raw data (log_buf) will be simplified to only consist
> of alignment-padded strings separated by a single unsigned long. All
> record meta-data (timestamp, loglevel, caller_id, etc.) will move into
> the record descriptors, which are located in an extra array. The
> appropriate crash tools will need to be adjusted for this. (FYI: The
> unsigned long in the string data is the descriptor ID.)
>
> 8. A CPU-reentrant spinlock (the so-called cpu-lock) will be used to
> synchronize/stop the kthreads during emergency state.
>
> 9. Support for printk dictionaries will be discontinued. I will look
> into who is using this and why. If printk dictionaries are important for
> you, speak up now!
>
> (There was also some talk about possibly discontinuing kdb, but that is
> not directly related to printk. I'm mentioning it here in case anyone
> wants to pursue that.)
>
> If I missed (or misunderstood) anything, please let me know!
From gpu perspective this all sounds extremely good and first
realistic plan that might lead us to an actually working bsod on
linux. But we'll make it pink w/ yellow text or something like that
ofc :-)
Thanks, Daniel
>
> John Ogness
>
> [0] https://www.linuxplumbersconf.org/event/4/contributions/290/attachments/276/463/lpc2019_jogness_printk.pdf
> [1] https://lkml.kernel.org/r/[email protected]
> [2] https://lkml.kernel.org/r/[email protected]
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
On 2019-09-13, Daniel Vetter <[email protected]> wrote:
>> 2. A kernel thread will be created for each registered console, each
>> responsible for being the sole printers to their respective
>> consoles. With this, console printing is _fully_ decoupled from
>> printk() callers.
>
> Is the plan to split the console_lock up into a per-console thing? Or
> postponed for later on?
AFAICT, the only purpose for a console_lock would be to synchronize
between the console printing kthread and some other component that wants
to write to that same device. So a per-console console_lock should be
the proper solution. However, I will look into the details. My main
concerns about this are the suspend/resume logic and the code sitting
behind /dev/console. I will share details once I've sorted it all out.
>> 6. A new may-sleep function pr_flush() will be made available to wait
>> for all previously printk'd messages to be output on all consoles
>> before proceeding. For example:
>>
>> pr_cont("Running test ABC... ");
>> pr_flush();
>>
>> do_test();
>>
>> pr_cont("PASSED\n");
>> pr_flush();
>
> Just crossed my mind: Could/should we lockdep-annotate pr_flush (take
> a lockdep map in there that we also take around the calls down into
> console drivers in each of the console printing kthreads or something
> like that)? Just to avoid too many surprises when people call pr_flush
> from within gpu drivers and wonder why it doesn't work so well.
Why would it not work so well? Basically the task calling pr_flush()
will monitor the lockless iterators of the various consoles until _all_
have hit/passed the latest sequence number from the time of the call.
> Although with this nice plan we'll take the modeset paths fully out of
> the printk paths (even for normal outputs) I hope, so should be a lot
> more reasonable.
You will be running in your own preemptible kthread, so any paths you
take should be safe.
> From gpu perspective this all sounds extremely good and first
> realistic plan that might lead us to an actually working bsod on
> linux.
Are you planning on basing the bsod stuff on write_atomic() (which is
used after entering an emergency state) or on the kmsg_dump facility? I
would expect kmsg_dump might be more appropriate, unless there are
concerns that the machine will die before getting that far (i.e. there
is a lot that happens between when an OOPS begins and when kmsg_dumpers
are invoked).
John Ogness
On 2019/09/13 22:26, John Ogness wrote:
> 6. A new may-sleep function pr_flush() will be made available to wait
> for all previously printk'd messages to be output on all consoles before
> proceeding. For example:
>
> pr_cont("Running test ABC... ");
> pr_flush();
>
> do_test();
>
> pr_cont("PASSED\n");
> pr_flush();
Don't we need to allow printk() callers to know the sequence number which
the printk() has queued? Something like
u64 seq;
pr_info(...);
pr_info(...);
pr_info(...);
seq = pr_current_seq();
pr_wait_seq(seq);
in case concurrently executed printk() flooding keeps adding a lot of
pending output?
By the way, do we need to keep printk() return bytes like printf() ?
Maybe we can make printk() return "void", for almost nobody can do
meaningful things with the return value.
> 9. Support for printk dictionaries will be discontinued. I will look
> into who is using this and why. If printk dictionaries are important for
> you, speak up now!
I think that dev_printk() is using "const char *dict, size_t dictlen," part
via create_syslog_header(). Some userspace programs might depend on
availability of such information.
On Mon 2019-09-16 13:30:17, Tetsuo Handa wrote:
> On 2019/09/13 22:26, John Ogness wrote:
> > 6. A new may-sleep function pr_flush() will be made available to wait
> > for all previously printk'd messages to be output on all consoles before
> > proceeding. For example:
> >
> > pr_cont("Running test ABC... ");
> > pr_flush();
> >
> > do_test();
> >
> > pr_cont("PASSED\n");
> > pr_flush();
>
> Don't we need to allow printk() callers to know the sequence number which
> the printk() has queued? Something like
>
> u64 seq;
> pr_info(...);
> pr_info(...);
> pr_info(...);
> seq = pr_current_seq();
> pr_wait_seq(seq);
>
> in case concurrently executed printk() flooding keeps adding a lot of
> pending output?
My expectation is that pr_flush() would wait only until the current
message appears on all consoles. It will not wait for messages that
would get added later.
> By the way, do we need to keep printk() return bytes like printf() ?
> Maybe we can make printk() return "void", for almost nobody can do
> meaningful things with the return value.
It is true that I have never seen anyone checking the return value.
On the other hand, it is a minor detail. And I would prefer to stay
compatible with the userland printf() as much as possible.
> > 9. Support for printk dictionaries will be discontinued. I will look
> > into who is using this and why. If printk dictionaries are important for
> > you, speak up now!
>
> I think that dev_printk() is using "const char *dict, size_t dictlen," part
> via create_syslog_header(). Some userspace programs might depend on
> availability of such information.
Yeah, but it seems to be the only dictionary writer. There were doubts
(during the meeting) whether anyone was actually using the information.
Hmm, it seems that journalctl is able to filer device specific
information, for example, I get:
$> journalctl _KERNEL_DEVICE=+usb:2-1
-- Logs begin at Tue 2019-08-13 09:00:03 CEST, end at Mon 2019-09-16 12:32:58 CEST. --
Aug 13 09:00:04 linux-qszd kernel: usb 2-1: new high-speed USB device number 2 using ehci-pci
One question is if anyone is using this filtering. Simple grep is
enough. Another question is whether it really needs to get passed
this way.
Best Regards,
Petr
On Sun, Sep 15, 2019 at 3:48 PM John Ogness <[email protected]> wrote:
>
> On 2019-09-13, Daniel Vetter <[email protected]> wrote:
> >> 2. A kernel thread will be created for each registered console, each
> >> responsible for being the sole printers to their respective
> >> consoles. With this, console printing is _fully_ decoupled from
> >> printk() callers.
> >
> > Is the plan to split the console_lock up into a per-console thing? Or
> > postponed for later on?
>
> AFAICT, the only purpose for a console_lock would be to synchronize
> between the console printing kthread and some other component that wants
> to write to that same device. So a per-console console_lock should be
> the proper solution. However, I will look into the details. My main
> concerns about this are the suspend/resume logic and the code sitting
> behind /dev/console. I will share details once I've sorted it all out.
>
> >> 6. A new may-sleep function pr_flush() will be made available to wait
> >> for all previously printk'd messages to be output on all consoles
> >> before proceeding. For example:
> >>
> >> pr_cont("Running test ABC... ");
> >> pr_flush();
> >>
> >> do_test();
> >>
> >> pr_cont("PASSED\n");
> >> pr_flush();
> >
> > Just crossed my mind: Could/should we lockdep-annotate pr_flush (take
> > a lockdep map in there that we also take around the calls down into
> > console drivers in each of the console printing kthreads or something
> > like that)? Just to avoid too many surprises when people call pr_flush
> > from within gpu drivers and wonder why it doesn't work so well.
>
> Why would it not work so well? Basically the task calling pr_flush()
> will monitor the lockless iterators of the various consoles until _all_
> have hit/passed the latest sequence number from the time of the call.
Classic deadlock like the below: Some thread:
mutex_lock(A);
pr_flush();
mutex_unlock(A);
And in the normal console write code also needs do to mutex_lock(A);
mutex_unlock(A); somewhere.
> > Although with this nice plan we'll take the modeset paths fully out of
> > the printk paths (even for normal outputs) I hope, so should be a lot
> > more reasonable.
>
> You will be running in your own preemptible kthread, so any paths you
> take should be safe.
>
> > From gpu perspective this all sounds extremely good and first
> > realistic plan that might lead us to an actually working bsod on
> > linux.
>
> Are you planning on basing the bsod stuff on write_atomic() (which is
> used after entering an emergency state) or on the kmsg_dump facility? I
> would expect kmsg_dump might be more appropriate, unless there are
> concerns that the machine will die before getting that far (i.e. there
> is a lot that happens between when an OOPS begins and when kmsg_dumpers
> are invoked).
Yeah I think kms_dump is what the current patches use. From the fbcon
pov the important bit here is the clearly split out write_atomic, so
that we can make sure we never try to do anything stupid from special
contexts. Aside from the printing itself we also have all kinds of fun
stuff like unblank_screen() and console_unblank() currently in the
panic path still.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Mon, 16 Sep 2019 12:46:24 +0200
Petr Mladek <[email protected]> wrote:
> On Mon 2019-09-16 13:30:17, Tetsuo Handa wrote:
> > On 2019/09/13 22:26, John Ogness wrote:
> > > 6. A new may-sleep function pr_flush() will be made available to wait
> > > for all previously printk'd messages to be output on all consoles before
> > > proceeding. For example:
> > >
> > > pr_cont("Running test ABC... ");
> > > pr_flush();
> > >
> > > do_test();
> > >
> > > pr_cont("PASSED\n");
> > > pr_flush();
> >
> > Don't we need to allow printk() callers to know the sequence number which
> > the printk() has queued? Something like
> >
> > u64 seq;
> > pr_info(...);
> > pr_info(...);
> > pr_info(...);
> > seq = pr_current_seq();
> > pr_wait_seq(seq);
> >
> > in case concurrently executed printk() flooding keeps adding a lot of
> > pending output?
>
> My expectation is that pr_flush() would wait only until the current
> message appears on all consoles. It will not wait for messages that
> would get added later.
Right, I believe we agreed that pr_flush() would take care of all this.
>
>
> > By the way, do we need to keep printk() return bytes like printf() ?
> > Maybe we can make printk() return "void", for almost nobody can do
> > meaningful things with the return value.
>
> It is true that I have never seen anyone checking the return value.
> On the other hand, it is a minor detail. And I would prefer to stay
> compatible with the userland printf() as much as possible.
I understand your wanting to keep compatibility with printf(), but I
would suggest that we only do so if it doesn't complicate any of the
design. I'm actually leaning on recommending that we remove the return
value, to prevent there becoming a dependency on it. I don't see any
reason to have the "number of bytes processed" as the return value
being useful within the kernel.
>
>
> > > 9. Support for printk dictionaries will be discontinued. I will look
> > > into who is using this and why. If printk dictionaries are important for
> > > you, speak up now!
> >
> > I think that dev_printk() is using "const char *dict, size_t dictlen," part
> > via create_syslog_header(). Some userspace programs might depend on
> > availability of such information.
>
> Yeah, but it seems to be the only dictionary writer. There were doubts
> (during the meeting) whether anyone was actually using the information.
>
> Hmm, it seems that journalctl is able to filer device specific
> information, for example, I get:
>
> $> journalctl _KERNEL_DEVICE=+usb:2-1
> -- Logs begin at Tue 2019-08-13 09:00:03 CEST, end at Mon 2019-09-16 12:32:58 CEST. --
> Aug 13 09:00:04 linux-qszd kernel: usb 2-1: new high-speed USB device number 2 using ehci-pci
>
> One question is if anyone is using this filtering. Simple grep is
> enough. Another question is whether it really needs to get passed
> this way.
>
If worse comes to worse, perhaps we let the console decide what to do
with it. Where all consoles but the "kmsg" one ignores it?
Then journalctl should work as normal.
Or will this break one of our other changes?
-- Steve
On 2019-09-16, Steven Rostedt <[email protected]> wrote:
>>>> 6. A new may-sleep function pr_flush() will be made available to
>>>> wait for all previously printk'd messages to be output on all
>>>> consoles before proceeding. For example:
>>>>
>>>> pr_cont("Running test ABC... ");
>>>> pr_flush();
>>>>
>>>> do_test();
>>>>
>>>> pr_cont("PASSED\n");
>>>> pr_flush();
>>>
>>> Don't we need to allow printk() callers to know the sequence number
>>> which the printk() has queued? Something like
>>>
>>> u64 seq;
>>> pr_info(...);
>>> pr_info(...);
>>> pr_info(...);
>>> seq = pr_current_seq();
>>> pr_wait_seq(seq);
>>>
>>> in case concurrently executed printk() flooding keeps adding a lot
>>> of pending output?
>>
>> My expectation is that pr_flush() would wait only until the current
>> message appears on all consoles. It will not wait for messages that
>> would get added later.
>
> Right, I believe we agreed that pr_flush() would take care of all this.
Yes, this is what we agreed on.
>>>> 9. Support for printk dictionaries will be discontinued. I will
>>>> look into who is using this and why. If printk dictionaries are
>>>> important for you, speak up now!
>>>
>>> I think that dev_printk() is using "const char *dict, size_t
>>> dictlen," part via create_syslog_header(). Some userspace programs
>>> might depend on availability of such information.
>>
>> Yeah, but it seems to be the only dictionary writer. There were
>> doubts (during the meeting) whether anyone was actually using the
>> information.
>>
>> Hmm, it seems that journalctl is able to filer device specific
>> information, for example, I get:
>>
>> $> journalctl _KERNEL_DEVICE=+usb:2-1
>> -- Logs begin at Tue 2019-08-13 09:00:03 CEST, end at Mon 2019-09-16 12:32:58 CEST. --
>> Aug 13 09:00:04 linux-qszd kernel: usb 2-1: new high-speed USB device number 2 using ehci-pci
>>
>> One question is if anyone is using this filtering. Simple grep is
>> enough. Another question is whether it really needs to get passed
>> this way.
>>
>
> If worse comes to worse, perhaps we let the console decide what to do
> with it. Where all consoles but the "kmsg" one ignores it?
>
> Then journalctl should work as normal.
>
> Or will this break one of our other changes?
The consoles will just iterate the ringbuffer. So if any console needs
dictionary information, that information needs to be stored in the
ringbuffer as well.
The dictionary text and message text could be stored as concatenated
strings. The descriptor would point separately to the beginning of
dictionary and message. So the data-buffer would still be a clean
collection of text. But AFAIK Linus didn't want to see that "extra" text
at all.
If we want to keep dictionary text out of the data-buffer, we could have
a 2nd data-buffer dedicated for dictionary text. I expect it would not
really complicate things. Especially if the dictionary part was "best
effort" (i.e. if the dictionary text does not fit in the free part of
its data-buffer, it is dropped).
John Ogness
On (09/16/19 12:46), Petr Mladek wrote:
> Hmm, it seems that journalctl is able to filer device specific
> information, for example, I get:
>
> $> journalctl _KERNEL_DEVICE=+usb:2-1
> -- Logs begin at Tue 2019-08-13 09:00:03 CEST, end at Mon 2019-09-16 12:32:58 CEST. --
> Aug 13 09:00:04 linux-qszd kernel: usb 2-1: new high-speed USB device number 2 using ehci-pci
>
> One question is if anyone is using this filtering. Simple grep is
> enough. Another question is whether it really needs to get passed
> this way.
Hmm. If I recall correctly...
There was some sort of discussion (and a patch, I believe) a long time
ago. If I'm not mistaken, guys at facebook somehow add "machine ID"
(e.g. CONFIG_DEFAULT_HOSTNAME?) to kernel messages (via dicts). This
has one interesting use case: net consoles print extended headers.
So they have monitoring systems, which capture and track net consoles
output from many servers, and should one of them warn/oom/etc. they
immediately know which one of the machines is "under the weather"
(ext_text directly points at the right server).
Well, once again, if I recall this correctly.
-ss
On Mon 2019-09-16 16:28:54, John Ogness wrote:
> On 2019-09-16, Steven Rostedt <[email protected]> wrote:
> >>>> 9. Support for printk dictionaries will be discontinued. I will
> >>>> look into who is using this and why. If printk dictionaries are
> >>>> important for you, speak up now!
> >>>
> >>> I think that dev_printk() is using "const char *dict, size_t
> >>> dictlen," part via create_syslog_header(). Some userspace programs
> >>> might depend on availability of such information.
> >>
> >> Yeah, but it seems to be the only dictionary writer. There were
> >> doubts (during the meeting) whether anyone was actually using the
> >> information.
> >>
> >> Hmm, it seems that journalctl is able to filer device specific
> >> information, for example, I get:
> >>
> >> $> journalctl _KERNEL_DEVICE=+usb:2-1
> >> -- Logs begin at Tue 2019-08-13 09:00:03 CEST, end at Mon 2019-09-16 12:32:58 CEST. --
> >> Aug 13 09:00:04 linux-qszd kernel: usb 2-1: new high-speed USB device number 2 using ehci-pci
> >>
> >> One question is if anyone is using this filtering. Simple grep is
> >> enough. Another question is whether it really needs to get passed
> >> this way.
> >>
> >
> > If worse comes to worse, perhaps we let the console decide what to do
> > with it. Where all consoles but the "kmsg" one ignores it?
/dev/kmsg is one interface passing dictionary. The other is
netconsole. It is the only console with CON_EXTENDED flag set.
> > Then journalctl should work as normal.
> >
> > Or will this break one of our other changes?
It just complicates the code because we need to store and read
the information separately.
> The consoles will just iterate the ringbuffer. So if any console needs
> dictionary information, that information needs to be stored in the
> ringbuffer as well.
>
> The dictionary text and message text could be stored as concatenated
> strings. The descriptor would point separately to the beginning of
> dictionary and message. So the data-buffer would still be a clean
> collection of text. But AFAIK Linus didn't want to see that "extra" text
> at all.
I would double check with Linus that he would consider this as
breaking userspace.
IMHO, it is perfectly fine to add this support later in separate patch(set) if
really necessary. I can't imagine that anyone would depend on this
feature when bisecting kernel. We could discuss and handle this later.
At least after the merge window.
> If we want to keep dictionary text out of the data-buffer, we could have
> a 2nd data-buffer dedicated for dictionary text. I expect it would not
> really complicate things. Especially if the dictionary part was "best
> effort" (i.e. if the dictionary text does not fit in the free part of
> its data-buffer, it is dropped).
Interesting idea. I like it if it does not complicate the code too much.
Best Regards,
Petr
On Tue, 17 Sep 2019 09:52:16 +0200
Petr Mladek <[email protected]> wrote:
> Heh, I did some grepping and the return value is actually used on
> three locations:
>
> $> git grep "= printk("
> drivers/scsi/aic7xxx/aic79xx_core.c: printed = printk("%s[0x%x]", name, value);
> drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk(" ");
> drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk("%s%s",
> drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk(") ");
> drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk(" ");
> drivers/scsi/aic7xxx/aic79xx_core.c: cur_col = printk("\n%3d FIFO_USE[0x%x] ", SCB_GET_TAG(scb),
> drivers/scsi/aic7xxx/aic79xx_core.c: cur_col += printk("SHADDR = 0x%x%x, SHCNT = 0x%x ",
> drivers/scsi/aic7xxx/aic79xx_core.c: cur_col += printk("HADDR = 0x%x%x, HCNT = 0x%x ",
> drivers/scsi/aic7xxx/aic7xxx_core.c: printed = printk("%s[0x%x]", name, value);
> drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk(" ");
> drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk("%s%s",
> drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk(") ");
> drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk(" ");
> drivers/scsi/aic7xxx/aic7xxx_core.c: cur_col = printk("\n%3d ", i);
> drivers/scsi/aic7xxx/aic7xxx_core.c: cur_col = printk("\n%3d ", scb->hscb->tag);
> drivers/scsi/libsas/sas_ata.c: r = printk("%s" SAS_FMT "ata%u: %s: %pV",
> kernel/locking/lockdep.c: len += printk("%*s %s", depth, "", usage_str[bit]);
> kernel/locking/lockdep.c: len += printk(KERN_CONT " at:\n");
>
> It is probably not a big deal. For example, lockdep uses the value
> just for formatting (extra spaces) when printing backtrace.
>
> I agree that it does not make sense to return the value if it
> complicates the code too much. Well, we will need to count
> the string length also from another reason (reservation).
Well, it's being used. I was thinking of dropping it if it was not.
Let's keep it then.
Thanks for looking into this.
-- Steve
On Tue, 17 Sep 2019 15:12:04 +0200
Greg Kroah-Hartman <[email protected]> wrote:
> > Well, it's being used. I was thinking of dropping it if it was not.
> > Let's keep it then.
>
> I think it should be dropped, only one user of the kernel is using it in
> a legitimate way, which kind of implies it isn't needed.
I'm thinking if it isn't hard to support then we can keep it (meaning
that we already have to calculate the length anyway). But if it starts
to complicate the code, then we should drop it.
-- Steve
On Mon 2019-09-16 09:43:14, Steven Rostedt wrote:
> On Mon, 16 Sep 2019 12:46:24 +0200
> Petr Mladek <[email protected]> wrote:
> > > By the way, do we need to keep printk() return bytes like printf() ?
> > > Maybe we can make printk() return "void", for almost nobody can do
> > > meaningful things with the return value.
> >
> > It is true that I have never seen anyone checking the return value.
> > On the other hand, it is a minor detail. And I would prefer to stay
> > compatible with the userland printf() as much as possible.
>
> I understand your wanting to keep compatibility with printf(), but I
> would suggest that we only do so if it doesn't complicate any of the
> design. I'm actually leaning on recommending that we remove the return
> value, to prevent there becoming a dependency on it. I don't see any
> reason to have the "number of bytes processed" as the return value
> being useful within the kernel.
Heh, I did some grepping and the return value is actually used on
three locations:
$> git grep "= printk("
drivers/scsi/aic7xxx/aic79xx_core.c: printed = printk("%s[0x%x]", name, value);
drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk(" ");
drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk("%s%s",
drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk(") ");
drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk(" ");
drivers/scsi/aic7xxx/aic79xx_core.c: cur_col = printk("\n%3d FIFO_USE[0x%x] ", SCB_GET_TAG(scb),
drivers/scsi/aic7xxx/aic79xx_core.c: cur_col += printk("SHADDR = 0x%x%x, SHCNT = 0x%x ",
drivers/scsi/aic7xxx/aic79xx_core.c: cur_col += printk("HADDR = 0x%x%x, HCNT = 0x%x ",
drivers/scsi/aic7xxx/aic7xxx_core.c: printed = printk("%s[0x%x]", name, value);
drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk(" ");
drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk("%s%s",
drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk(") ");
drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk(" ");
drivers/scsi/aic7xxx/aic7xxx_core.c: cur_col = printk("\n%3d ", i);
drivers/scsi/aic7xxx/aic7xxx_core.c: cur_col = printk("\n%3d ", scb->hscb->tag);
drivers/scsi/libsas/sas_ata.c: r = printk("%s" SAS_FMT "ata%u: %s: %pV",
kernel/locking/lockdep.c: len += printk("%*s %s", depth, "", usage_str[bit]);
kernel/locking/lockdep.c: len += printk(KERN_CONT " at:\n");
It is probably not a big deal. For example, lockdep uses the value
just for formatting (extra spaces) when printing backtrace.
I agree that it does not make sense to return the value if it
complicates the code too much. Well, we will need to count
the string length also from another reason (reservation).
Best Regards,
Petr
On Tue, Sep 17, 2019 at 09:02:54AM -0400, Steven Rostedt wrote:
> On Tue, 17 Sep 2019 09:52:16 +0200
> Petr Mladek <[email protected]> wrote:
>
> > Heh, I did some grepping and the return value is actually used on
> > three locations:
> >
> > $> git grep "= printk("
> > drivers/scsi/aic7xxx/aic79xx_core.c: printed = printk("%s[0x%x]", name, value);
> > drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk(" ");
> > drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk("%s%s",
> > drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk(") ");
> > drivers/scsi/aic7xxx/aic79xx_core.c: printed += printk(" ");
> > drivers/scsi/aic7xxx/aic79xx_core.c: cur_col = printk("\n%3d FIFO_USE[0x%x] ", SCB_GET_TAG(scb),
> > drivers/scsi/aic7xxx/aic79xx_core.c: cur_col += printk("SHADDR = 0x%x%x, SHCNT = 0x%x ",
> > drivers/scsi/aic7xxx/aic79xx_core.c: cur_col += printk("HADDR = 0x%x%x, HCNT = 0x%x ",
> > drivers/scsi/aic7xxx/aic7xxx_core.c: printed = printk("%s[0x%x]", name, value);
> > drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk(" ");
> > drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk("%s%s",
> > drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk(") ");
> > drivers/scsi/aic7xxx/aic7xxx_core.c: printed += printk(" ");
> > drivers/scsi/aic7xxx/aic7xxx_core.c: cur_col = printk("\n%3d ", i);
> > drivers/scsi/aic7xxx/aic7xxx_core.c: cur_col = printk("\n%3d ", scb->hscb->tag);
All of those can be removed as none of the returned values are used.
What a mess.
> > drivers/scsi/libsas/sas_ata.c: r = printk("%s" SAS_FMT "ata%u: %s: %pV",
return value is also ignored, this can be fixed.
> > kernel/locking/lockdep.c: len += printk("%*s %s", depth, "", usage_str[bit]);
> > kernel/locking/lockdep.c: len += printk(KERN_CONT " at:\n");
That seems to be the only semi-legitimate use, but if it _really_ needs
it, it can just create the string ahead of time, and use the length then
after it printks it out.
> > It is probably not a big deal. For example, lockdep uses the value
> > just for formatting (extra spaces) when printing backtrace.
> >
> > I agree that it does not make sense to return the value if it
> > complicates the code too much. Well, we will need to count
> > the string length also from another reason (reservation).
>
> Well, it's being used. I was thinking of dropping it if it was not.
> Let's keep it then.
I think it should be dropped, only one user of the kernel is using it in
a legitimate way, which kind of implies it isn't needed.
thanks,
greg k-h
On 2019/09/17 22:37, Steven Rostedt wrote:
> On Tue, 17 Sep 2019 15:12:04 +0200
> Greg Kroah-Hartman <[email protected]> wrote:
>
>>> Well, it's being used. I was thinking of dropping it if it was not.
>>> Let's keep it then.
>>
>> I think it should be dropped, only one user of the kernel is using it in
>> a legitimate way, which kind of implies it isn't needed.
>
> I'm thinking if it isn't hard to support then we can keep it (meaning
> that we already have to calculate the length anyway). But if it starts
> to complicate the code, then we should drop it.
>
Due to console_loglevel (some are printed and others are not printed) and
possibility of concurrent printk() callers (one line can be printed as
multiple lines), using printk()'s return value for calculating column offset
will not always work as expected. I guess that users should not count on
printk()'s return value. They might want to try printing one line at a time
using their local buffers...
On (09/13/19 15:26), John Ogness wrote:
> 2. A kernel thread will be created for each registered console, each
> responsible for being the sole printers to their respective
> consoles. With this, console printing is _fully_ decoupled from printk()
> callers.
sysrq over serial?
What we currently have is hacky, but, as usual, is a "best effort":
>> serial driver IRQ
serial_handle_irq() [console driver]
uart_handle_sysrq_char()
handle_sysrq()
printk()
call_console_drivers()
serial_write() [re-enter console driver]
offloading this to kthread may be unreliable.
-ss
On Wed, 18 Sep 2019 10:25:46 +0900
Sergey Senozhatsky <[email protected]> wrote:
> On (09/13/19 15:26), John Ogness wrote:
> > 2. A kernel thread will be created for each registered console, each
> > responsible for being the sole printers to their respective
> > consoles. With this, console printing is _fully_ decoupled from printk()
> > callers.
>
> sysrq over serial?
>
> What we currently have is hacky, but, as usual, is a "best effort":
>
> >> serial driver IRQ
>
> serial_handle_irq() [console driver]
> uart_handle_sysrq_char()
> handle_sysrq()
> printk()
> call_console_drivers()
> serial_write() [re-enter console driver]
>
> offloading this to kthread may be unreliable.
But we also talked about an "emergency flush" which will not wait for
the kthreads to finish and just output everything it can find in the
printk buffers (expecting that the consoles have an "emergency"
handler. We can add a sysrq to do an emergency flush.
-- Steve
On (09/17/19 22:08), Steven Rostedt wrote:
> > On (09/13/19 15:26), John Ogness wrote:
> > > 2. A kernel thread will be created for each registered console, each
> > > responsible for being the sole printers to their respective
> > > consoles. With this, console printing is _fully_ decoupled from printk()
> > > callers.
> >
> > sysrq over serial?
> >
> > What we currently have is hacky, but, as usual, is a "best effort":
> >
> > >> serial driver IRQ
> >
> > serial_handle_irq() [console driver]
> > uart_handle_sysrq_char()
> > handle_sysrq()
> > printk()
> > call_console_drivers()
> > serial_write() [re-enter console driver]
> >
> > offloading this to kthread may be unreliable.
>
> But we also talked about an "emergency flush" which will not wait for
> the kthreads to finish and just output everything it can find in the
> printk buffers (expecting that the consoles have an "emergency"
> handler. We can add a sysrq to do an emergency flush.
I'm sorry, I wasn't there, so I'm surely is missing on some details.
I agree that when consoles have ->atomic_write then it surely makes sense
to switch to emergency mode. I like the emergency state approach, but I'm
not sure how it can be completely invisible to the rest of the system.
Quoting John:
: Unlike oops_in_progress, this state will not be visible to
: anything outside of the printk infrastructure.
For instance, tty/sysrq must be able to switch printk emergency on/off.
That already means that printk emergency knob should be visible to the
rest of the kernel. A long time ago, we had printk_emergency_begin_sync()
and printk_emergency_end_sync(), which would define reentrable
printk_emergency blocks [1]:
printk_emergency_begin_sync();
handle_sysrq();
printk_emergency_end_sync();
We also figured out that some PM (hibernation/suspend/etc.) stages (very
early and/or very late ones) [2] also should have printk in emergency mode,
plus some other parts of the kernel [3].
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/
-ss
On (09/18/19 11:36), Sergey Senozhatsky wrote:
[..]
> For instance, tty/sysrq must be able to switch printk emergency on/off.
> That already means that printk emergency knob should be visible to the
> rest of the kernel. A long time ago, we had printk_emergency_begin_sync()
> and printk_emergency_end_sync(), which would define reentrable
> printk_emergency blocks [1]:
>
> printk_emergency_begin_sync();
> handle_sysrq();
> printk_emergency_end_sync();
Some explanations.
How did we come up to that _sync() printk() emergency mode (when we
make sure that there is no active printing kthread)? We had a number
of cases (complaints) of lost kernel messages. There are scenarios
in which we cannot offload to async preemptible printing kthread,
because current control path is, for instance, going to reboot the
kernel. In sync printk() mode we have some sort (!) of guarantees
that when we do
pr_emerg("Restarting system\n");
kmsg_dump(KMSG_DUMP_RESTART);
machine_restart(cmd);
pr_emerg("Restarting system\n") is going to flush logbuf before the
system will machine_restart().
I can also recall a regression report from 0day bot. 0day uses sysrq over
serial to reboot running qemu instances. The way things currently work
is that we have printk() in sysrq handler, which flushes logbuf before
it reboots the system. With printk_kthread offloading this "flush logbuf
before reboot()" was not there, because printing was offloaded to kthread,
so the system used to immediately reboot with pending (and thus lost)
logbuf messages.
I suspect that emergency flush from sysrq is easier to handle once we
have one global printing kthread.
Suppose:
logbuf
id 100
id 101
id 102
...
id 198 <- printing kthread
id 199
id 200
<+> sysrq, a bunch of printk()-s in emergency flush mode
id 201 -> atomic_write()
id 202 -> atomic_write()
...
id 300 -> atomic_write()
<-> sysrq iret
When we park printing kthread, we make sure that the first sysrq->printk()
will also print pending messages 198,199,200 before it prints message 201.
When we unpark printing kthread it knows that there are no pending messages
(last printed message id is in the logbuf head).
It's going to be a bit harder when we have per-console kthread. If
per-console kthread is simply gogin to continue from the last message id
it printed (e.g. 198) then it will re-print messages which we already
printed via ->atomic_write() path. If all per-console printing kthread
are going to jump to id 300, because this is the last printed id on
consoles, then we can lose some messages on consoles (possibly a different
number of messages on different consoles, depending on console's kthread
position).
Once again, I'm sorry I was not on LPC/KS and maybe you have already
discussed all of those cases and got everything covered.
-ss
On 2019-09-18, Sergey Senozhatsky <[email protected]> wrote:
>> For instance, tty/sysrq must be able to switch printk emergency
>> on/off.
>
> How did we come up to that _sync() printk() emergency mode (when we
> make sure that there is no active printing kthread)? We had a number
> of cases (complaints) of lost kernel messages. There are scenarios in
> which we cannot offload to async preemptible printing kthread, because
> current control path is, for instance, going to reboot the kernel. In
> sync printk() mode we have some sort (!) of guarantees that when we do
>
> pr_emerg("Restarting system\n");
> kmsg_dump(KMSG_DUMP_RESTART);
> machine_restart(cmd);
>
> pr_emerg("Restarting system\n") is going to flush logbuf before the
> system will machine_restart().
Yes, this was why I asked Daniel how the bsod stuff will be
implemented. We don't want a bsod just because we are
restarting. Perhaps write_atomic() should also have a "reason" argument
like kmsg_dump does. I will keep in touch with Daniel to make sure we
are sync on this.
> It's going to be a bit harder when we have per-console kthread.
Each console has its own iterator. This iterators will need to advance,
regardless if the message was printed via write() or write_atomic().
John Ogness
On 2019-09-18, Sergey Senozhatsky <[email protected]> wrote:
>>>> 2. A kernel thread will be created for each registered console,
>>>> each responsible for being the sole printers to their respective
>>>> consoles. With this, console printing is _fully_ decoupled from
>>>> printk() callers.
>>>
>>> sysrq over serial?
>>>
>>> offloading this to kthread may be unreliable.
>>
>> But we also talked about an "emergency flush" which will not wait for
>> the kthreads to finish and just output everything it can find in the
>> printk buffers (expecting that the consoles have an "emergency"
>> handler. We can add a sysrq to do an emergency flush.
The problem with only a flush here is that the sysrq output may not fit
in the ringbuffer (ftrace, for example). It probably makes more sense to
have a switch to enter/exit "synchronous state", where all atomic
consoles are flushed upon enter and all future printk's are synchronous
on atomic consoles until exit.
I expect sysrq to be the only valid use of "synchronous state" other
than oops/panic. Although I suppose PeterZ would like a boot argument to
always run the consoles in this state.
> I agree that when consoles have ->atomic_write then it surely makes
> sense to switch to emergency mode. I like the emergency state
> approach, but I'm not sure how it can be completely invisible to the
> rest of the system. Quoting John:
>
> : Unlike oops_in_progress, this state will not be visible to
> : anything outside of the printk infrastructure.
>
> For instance, tty/sysrq must be able to switch printk emergency
> on/off.
The switch/flush _will_ be visible. But not the state. So, for example,
it won't be possible for some random driver to determine if we are in an
emergency state. (Well, I don't know if oops_in_progress will really
disappear. But at least the printk/console stuff will no longer rely on
it.)
> We also figured out that some PM (hibernation/suspend/etc.) stages
> (very early and/or very late ones) [2] also should have printk in
> emergency mode, plus some other parts of the kernel [3].
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
> [3] https://lore.kernel.org/lkml/[email protected]/
Thanks for bringing up that RFC thread again. I haven't looked at it in
over a year. I will go through it again to see if there is anything I've
overlooked. Particularly the suspend stuff.
John Ogness
On (09/18/19 09:42), John Ogness wrote:
> > It's going to be a bit harder when we have per-console kthread.
>
> Each console has its own iterator. This iterators will need to advance,
> regardless if the message was printed via write() or write_atomic().
Great.
->atomic_write() path will make sure that kthread is parked or will
those compete for uart port?
-ss
On (09/18/19 09:33), John Ogness wrote:
>
> I expect sysrq to be the only valid use of "synchronous state" other
> than oops/panic. Although I suppose PeterZ would like a boot argument to
> always run the consoles in this state.
Yes, there might be more cases when we need sync printk(). Like lockdep
splats, KASAN warnings, PM debugging, etc. Those things sometimes come
right before "truly bad stuff".
> > For instance, tty/sysrq must be able to switch printk emergency
> > on/off.
>
> The switch/flush _will_ be visible. But not the state. So, for example,
> it won't be possible for some random driver to determine if we are in an
> emergency state. (Well, I don't know if oops_in_progress will really
> disappear. But at least the printk/console stuff will no longer rely on
> it.)
[..]
> Thanks for bringing up that RFC thread again. I haven't looked at it in
> over a year. I will go through it again to see if there is anything I've
> overlooked. Particularly the suspend stuff.
That thread most likely is incomplet and incorrekt in some parts;
shouldn't be taken too seriously, I guess.
-ss
On (09/18/19 11:05), John Ogness wrote:
> On 2019-09-18, Sergey Senozhatsky <[email protected]> wrote:
> >> Each console has its own iterator. This iterators will need to
> >> advance, regardless if the message was printed via write() or
> >> write_atomic().
> >
> > Great.
> >
> > ->atomic_write() path will make sure that kthread is parked or will
> > those compete for uart port?
>
> A cpu-lock (probably per-console) will be used to synchronize the
> two. Unlike my RFCv1, we want to keep the cpu-lock out of the console
> drivers and we want it to be less aggressive (using trylock's instead of
> spinning).
That's my expectation as well. cpu-lock and per-console kthread can
live just fine in printk.c file.
-ss
On 2019-09-18, Sergey Senozhatsky <[email protected]> wrote:
>> Each console has its own iterator. This iterators will need to
>> advance, regardless if the message was printed via write() or
>> write_atomic().
>
> Great.
>
> ->atomic_write() path will make sure that kthread is parked or will
> those compete for uart port?
A cpu-lock (probably per-console) will be used to synchronize the
two. Unlike my RFCv1, we want to keep the cpu-lock out of the console
drivers and we want it to be less aggressive (using trylock's instead of
spinning). This should make the cpu-lock less "dangerous". I talked with
PeterZ, Thomas, and PetrM about how this can be implemented, but there
may still be some corner cases.
I would like to put everything together now so that we can run and test
if the decisions made in that meeting hold up for all the cases. I think
it will be easier to identify/add the missing pieces, once we have it
coded.
John Ogness
On Wed 2019-09-18 11:05:28, John Ogness wrote:
> On 2019-09-18, Sergey Senozhatsky <[email protected]> wrote:
> >> Each console has its own iterator. This iterators will need to
> >> advance, regardless if the message was printed via write() or
> >> write_atomic().
> >
> > Great.
> >
> > ->atomic_write() path will make sure that kthread is parked or will
> > those compete for uart port?
>
> A cpu-lock (probably per-console) will be used to synchronize the
> two. Unlike my RFCv1, we want to keep the cpu-lock out of the console
> drivers and we want it to be less aggressive (using trylock's instead of
> spinning). This should make the cpu-lock less "dangerous". I talked with
> PeterZ, Thomas, and PetrM about how this can be implemented, but there
> may still be some corner cases.
If we take cpu_lock() only in non-preemptive context and the system is
normally working then try_lock() should be pretty reliable. I mean
that try_lock() would either succeed or the other CPU would be able
to flush the messages.
We might need to be more aggressive in panic(). But then it should be
easier because only one CPU can be running panic. This CPU would try
to stop the other CPUs and flush the consoles.
I though also about reusing the console-waiter logic in panic()
We could try to steel the cpu_lock() a more safe way. We would only
need to limit the busy waiting to 1 sec or so.
Regarding SysRq. I could imagine introducing another SysRq that
would just call panic(). I mean that it would try to flush the
logs and reboot in the most safe way.
I am not completely sure what to do with suspend, halt, and other
operations where we could not rely on the kthread. I would prefer to
allow only atomic consoles there in the beginning.
These are just some ideas. I do not think that everything needs to be
done immediately. I am sure that we will break some scenarios. We
should not complicate the code too much proactively because of
scenarios that are not much reliable even now.
> I would like to put everything together now so that we can run and test
> if the decisions made in that meeting hold up for all the cases. I think
> it will be easier to identify/add the missing pieces, once we have it
> coded.
Make sense. Just please, do not hold the entire series until all
details are solved.
It is always easier to review small pieces. Also it is a big pain
to rework/rebase huge series. IMHO, we need to reasonably handle
normal state and panic() at the beginning. All the other special
situations can be solved by follow up patches.
Best Regards,
Petr
On Wed, 18 Sep 2019 18:41:55 +0200
Petr Mladek <[email protected]> wrote:
> Regarding SysRq. I could imagine introducing another SysRq that
> would just call panic(). I mean that it would try to flush the
> logs and reboot in the most safe way.
You mean sysrq-c ?
-- Steve
On Wed, Sep 18, 2019 at 9:42 AM John Ogness <[email protected]> wrote:
>
> On 2019-09-18, Sergey Senozhatsky <[email protected]> wrote:
> >> For instance, tty/sysrq must be able to switch printk emergency
> >> on/off.
> >
> > How did we come up to that _sync() printk() emergency mode (when we
> > make sure that there is no active printing kthread)? We had a number
> > of cases (complaints) of lost kernel messages. There are scenarios in
> > which we cannot offload to async preemptible printing kthread, because
> > current control path is, for instance, going to reboot the kernel. In
> > sync printk() mode we have some sort (!) of guarantees that when we do
> >
> > pr_emerg("Restarting system\n");
> > kmsg_dump(KMSG_DUMP_RESTART);
> > machine_restart(cmd);
> >
> > pr_emerg("Restarting system\n") is going to flush logbuf before the
> > system will machine_restart().
>
> Yes, this was why I asked Daniel how the bsod stuff will be
> implemented. We don't want a bsod just because we are
> restarting. Perhaps write_atomic() should also have a "reason" argument
> like kmsg_dump does. I will keep in touch with Daniel to make sure we
> are sync on this.
I thought that's why there'll be the oops_in_progress parameter for
write_atomic?
For the fbcon/graphics side I think we maybe need three levels:
- normal console writes with the kthread
- write_atomic, but non-destructive: Just directly write into the
framebuffer. Might need a serious locking rework in fbcon to make this
possible, plus won't work on drivers where the framebuffer is either
not statically pinned, or where you need to take additional work to
flush the updates out to the display.
- bsod, where we attempt an unfriendly takeover of the display with
trylocks and just overwrite what's there to display the oops. that one
is probably best suited for kmsg_dump.
Cheers, Daniel
> > It's going to be a bit harder when we have per-console kthread.
>
> Each console has its own iterator. This iterators will need to advance,
> regardless if the message was printed via write() or write_atomic().
>
> John Ogness
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Wed 2019-09-18 12:48:01, Steven Rostedt wrote:
> On Wed, 18 Sep 2019 18:41:55 +0200
> Petr Mladek <[email protected]> wrote:
>
> > Regarding SysRq. I could imagine introducing another SysRq that
> > would just call panic(). I mean that it would try to flush the
> > logs and reboot in the most safe way.
>
> You mean sysrq-c ?
Sysrq-c is confusing because the NULL pointer dereference is reported.
I meant a completely new sysrq that would just call panic() without
an artificial noise.
Hmm, sysrq is already using most of the keys. sysrq-c might be good enough
after all.
Best Regards,
Petr
On 9/13/19 8:26 AM, John Ogness wrote:
> 9. Support for printk dictionaries will be discontinued. I will look
> into who is using this and why. If printk dictionaries are important for
> you, speak up now!
I think this functionality is important.
I've been experimenting with a change which adds dictionary data to
storage related printk messages so that a persistent durable id is
associated with them for filtering, eg.
$ journalctl -r _KERNEL_DURABLE_NAME=naa.0000000000bc614e
This has the advantage that when the device attachment changes across
reboots or detach/reattach cycles you can easily find its messages
throughout it's recorded history.
Other reasons were outlined when introduced, ref.
https://lwn.net/Articles/490690/
I believe this functionality hasn't been leveraged to its full potential
yet.
-Tony
On Fri 2019-10-04 09:48:24, Tony Asleson wrote:
> On 9/13/19 8:26 AM, John Ogness wrote:
> > 9. Support for printk dictionaries will be discontinued. I will look
> > into who is using this and why. If printk dictionaries are important for
> > you, speak up now!
>
> I think this functionality is important.
>
> I've been experimenting with a change which adds dictionary data to
> storage related printk messages so that a persistent durable id is
> associated with them for filtering, eg.
>
> $ journalctl -r _KERNEL_DURABLE_NAME=naa.0000000000bc614e
>
> This has the advantage that when the device attachment changes across
> reboots or detach/reattach cycles you can easily find its messages
> throughout it's recorded history.
Thanks for the pointers. I think that we will need to keep the
dictionaries then.
Just for explanation. We were not aware of this functionality when
it was discussed. The expectation was that this feature has never
been used in userspace. We were too optimistic ;-)
Best Regards,
Petr