2020-06-18 14:51:58

by John Ogness

Subject: [PATCH v3 0/3] printk: replace ringbuffer

Hello,

Here is v3 of the first series reworking the printk
subsystem. The v2 and history are here [0]. This first series
only replaces the existing ringbuffer implementation. No locking
is removed. No semantics/behavior of printk are changed.

Reviews of the ringbuffer are still ongoing, but I was asked to
post this new version since several changes from v2 have already
been agreed upon.

The series is based on v5.8-rc1.

The list of changes since v2:

printk.c
========

- console_unlock(): fixed extended console printing [1]

printk_ringbuffer
=================

- data_push_tail(): fixed handling when another CPU already
pushed the tail [2]

- desc_push_tail(): added a full memory barrier before the
descriptor tail push [3]

- data_make_reusable()/data_alloc(): changed block ID
reading/writing from READ_ONCE()/WRITE_ONCE() to regular
assignments [4]

- data_make_reusable(): removed unnecessary data tail
re-check [5]

- general: folded all smp_mb() and smp_wmb() calls into their
neighboring cmpxchg_relaxed(), changing them to full cmpxchg()
calls [6]

- desc_read(): changed descriptor content reading from
READ_ONCE() to memcpy() since READ_ONCE() served no purpose
for this legal data race

- general: cleaned up memory barrier comments; in particular
made sure that the reader and writer sides of the memory
barrier pairs match in their descriptions

- added a new section in the memory barrier documentation (near
the top of printk_ringbuffer.c) that lists all 10 memory
barrier pairs and briefly describes what they are ordering

- _prb_read_valid(): changed the helper function to static

- general: changed block size argument type from "unsigned long"
to "unsigned int" since a record's *_buf_size fields are of
type "unsigned int"

- general: allowed some lines to go beyond 80 characters
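The smp_mb()/cmpxchg_relaxed() folding noted above can be illustrated
with a small userspace sketch (not the kernel code; these are C11
atomics standing in for the kernel primitives, and the function names
are hypothetical):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A stand-in for a ring tail position. */
static _Atomic unsigned long tail;

/* Before: relaxed CAS preceded by a separate full barrier (smp_mb()). */
static bool push_tail_separate_mb(unsigned long old, unsigned long desired)
{
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	return atomic_compare_exchange_strong_explicit(&tail, &old, desired,
			memory_order_relaxed, memory_order_relaxed);
}

/* After: a fully ordered CAS provides the same ordering itself. */
static bool push_tail_full_cmpxchg(unsigned long old, unsigned long desired)
{
	return atomic_compare_exchange_strong(&tail, &old, desired);
}
```

Both variants succeed only when @tail still holds the expected value;
the second simply lets the CAS carry the full ordering, which is what
folding smp_mb()/smp_wmb() into a full cmpxchg() achieves.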

John Ogness

[0] https://lkml.kernel.org/r/[email protected]
[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
[3] https://lkml.kernel.org/r/[email protected]
[4] https://lkml.kernel.org/r/[email protected]
[5] https://lkml.kernel.org/r/[email protected]

John Ogness (3):
crash: add VMCOREINFO macro to define offset in a struct declared by
typedef
printk: add lockless ringbuffer
printk: use the lockless ringbuffer

include/linux/crash_core.h | 3 +
include/linux/kmsg_dump.h | 2 -
kernel/printk/Makefile | 1 +
kernel/printk/printk.c | 944 ++++++++--------
kernel/printk/printk_ringbuffer.c | 1674 +++++++++++++++++++++++++++++
kernel/printk/printk_ringbuffer.h | 352 ++++++
6 files changed, 2527 insertions(+), 449 deletions(-)
create mode 100644 kernel/printk/printk_ringbuffer.c
create mode 100644 kernel/printk/printk_ringbuffer.h

--
2.20.1


2020-06-18 14:52:19

by John Ogness

Subject: [PATCH v3 3/3] printk: use the lockless ringbuffer

Replace the existing ringbuffer usage and implementation with
lockless ringbuffer usage. Even though the new ringbuffer does not
require locking, all existing locking is left in place. Therefore,
this change purely replaces the underlying ringbuffer.

Changes that exist due to the ringbuffer replacement:

- The VMCOREINFO has been updated for the new structures.

- Dictionary data is now stored in a separate data buffer from the
human-readable messages. The dictionary data buffer is set to the
same size as the message buffer. Therefore, the total required
memory for both dictionary and message data is
2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the initial static buffers and
2 * log_buf_len (the kernel parameter) for the dynamic buffers.

- Record meta-data is now stored in a separate array of descriptors.
This is an additional 72 * (2 ^ (CONFIG_LOG_BUF_SHIFT - 5)) bytes
for the static array and 72 * (log_buf_len >> 5) bytes for the
dynamic array.
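As a rough illustration of the sizing above, the total static
footprint can be computed from CONFIG_LOG_BUF_SHIFT (this is not
kernel code; DESC_SIZE and PRB_AVGBITS are just the figures quoted in
the text):

```c
#define PRB_AVGBITS	5	/* 32-byte average record, as in the patch */
#define DESC_SIZE	72UL	/* per-descriptor bytes stated above */

/* Total static bytes for a given CONFIG_LOG_BUF_SHIFT. */
static unsigned long prb_footprint(int log_buf_shift)
{
	unsigned long text = 1UL << log_buf_shift;	/* message data */
	unsigned long dict = text;			/* same size as text */
	unsigned long descs = DESC_SIZE << (log_buf_shift - PRB_AVGBITS);

	return text + dict + descs;
}
```

For example, a shift of 17 (128 KiB log buffer) gives 128 KiB of text,
128 KiB of dictionary data and 4096 descriptors of 72 bytes each.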

Signed-off-by: John Ogness <[email protected]>
---
include/linux/kmsg_dump.h | 2 -
kernel/printk/printk.c | 944 ++++++++++++++++++++------------------
2 files changed, 497 insertions(+), 449 deletions(-)

diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
index 3378bcbe585e..c9b0abe5ca91 100644
--- a/include/linux/kmsg_dump.h
+++ b/include/linux/kmsg_dump.h
@@ -45,8 +45,6 @@ struct kmsg_dumper {
bool registered;

/* private state of the kmsg iterator */
- u32 cur_idx;
- u32 next_idx;
u64 cur_seq;
u64 next_seq;
};
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 8c14835be46c..7642ef634956 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -55,6 +55,7 @@
#define CREATE_TRACE_POINTS
#include <trace/events/printk.h>

+#include "printk_ringbuffer.h"
#include "console_cmdline.h"
#include "braille.h"
#include "internal.h"
@@ -294,30 +295,24 @@ enum con_msg_format_flags {
static int console_msg_format = MSG_FORMAT_DEFAULT;

/*
- * The printk log buffer consists of a chain of concatenated variable
- * length records. Every record starts with a record header, containing
- * the overall length of the record.
+ * The printk log buffer consists of a sequenced collection of records, each
+ * containing variable length message and dictionary text. Every record
+ * also contains its own meta-data (@info).
*
- * The heads to the first and last entry in the buffer, as well as the
- * sequence numbers of these entries are maintained when messages are
- * stored.
+ * Every record meta-data carries the timestamp in microseconds, as well as
+ * the standard userspace syslog level and syslog facility. The usual kernel
+ * messages use LOG_KERN; userspace-injected messages always carry a matching
+ * syslog facility, by default LOG_USER. The origin of every message can be
+ * reliably determined that way.
*
- * If the heads indicate available messages, the length in the header
- * tells the start next message. A length == 0 for the next message
- * indicates a wrap-around to the beginning of the buffer.
+ * The human readable log message of a record is available in @text, the
+ * length of the message text in @text_len. The stored message is not
+ * terminated.
*
- * Every record carries the monotonic timestamp in microseconds, as well as
- * the standard userspace syslog level and syslog facility. The usual
- * kernel messages use LOG_KERN; userspace-injected messages always carry
- * a matching syslog facility, by default LOG_USER. The origin of every
- * message can be reliably determined that way.
- *
- * The human readable log message directly follows the message header. The
- * length of the message text is stored in the header, the stored message
- * is not terminated.
- *
- * Optionally, a message can carry a dictionary of properties (key/value pairs),
- * to provide userspace with a machine-readable message context.
+ * Optionally, a record can carry a dictionary of properties (key/value
+ * pairs), to provide userspace with a machine-readable message context. The
+ * length of the dictionary is available in @dict_len. The dictionary is not
+ * terminated.
*
* Examples for well-defined, commonly used property names are:
* DEVICE=b12:8 device identifier
@@ -331,21 +326,19 @@ static int console_msg_format = MSG_FORMAT_DEFAULT;
* follows directly after a '=' character. Every property is terminated by
* a '\0' character. The last property is not terminated.
*
- * Example of a message structure:
- * 0000 ff 8f 00 00 00 00 00 00 monotonic time in nsec
- * 0008 34 00 record is 52 bytes long
- * 000a 0b 00 text is 11 bytes long
- * 000c 1f 00 dictionary is 23 bytes long
- * 000e 03 00 LOG_KERN (facility) LOG_ERR (level)
- * 0010 69 74 27 73 20 61 20 6c "it's a l"
- * 69 6e 65 "ine"
- * 001b 44 45 56 49 43 "DEVIC"
- * 45 3d 62 38 3a 32 00 44 "E=b8:2\0D"
- * 52 49 56 45 52 3d 62 75 "RIVER=bu"
- * 67 "g"
- * 0032 00 00 00 padding to next message header
- *
- * The 'struct printk_log' buffer header must never be directly exported to
+ * Example of record values:
+ * record.text_buf = "it's a line" (unterminated)
+ * record.dict_buf = "DEVICE=b8:2\0DRIVER=bug" (unterminated)
+ * record.info.seq = 56
+ * record.info.ts_nsec = 36863
+ * record.info.text_len = 11
+ * record.info.dict_len = 22
+ * record.info.facility = 0 (LOG_KERN)
+ * record.info.flags = 0
+ * record.info.level = 3 (LOG_ERR)
+ * record.info.caller_id = 299 (task 299)
+ *
+ * The 'struct printk_info' buffer must never be directly exported to
* userspace, it is a kernel-private implementation detail that might
* need to be changed in the future, when the requirements change.
*
@@ -365,23 +358,6 @@ enum log_flags {
LOG_CONT = 8, /* text is a fragment of a continuation line */
};

-struct printk_log {
- u64 ts_nsec; /* timestamp in nanoseconds */
- u16 len; /* length of entire record */
- u16 text_len; /* length of text buffer */
- u16 dict_len; /* length of dictionary buffer */
- u8 facility; /* syslog facility */
- u8 flags:5; /* internal record flags */
- u8 level:3; /* syslog level */
-#ifdef CONFIG_PRINTK_CALLER
- u32 caller_id; /* thread id or processor id */
-#endif
-}
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-__packed __aligned(4)
-#endif
-;
-
/*
* The logbuf_lock protects kmsg buffer, indices, counters. This can be taken
* within the scheduler's rq lock. It must be released before calling
@@ -421,26 +397,16 @@ DEFINE_RAW_SPINLOCK(logbuf_lock);
DECLARE_WAIT_QUEUE_HEAD(log_wait);
/* the next printk record to read by syslog(READ) or /proc/kmsg */
static u64 syslog_seq;
-static u32 syslog_idx;
static size_t syslog_partial;
static bool syslog_time;

-/* index and sequence number of the first record stored in the buffer */
-static u64 log_first_seq;
-static u32 log_first_idx;
-
-/* index and sequence number of the next record to store in the buffer */
-static u64 log_next_seq;
-static u32 log_next_idx;
-
/* the next printk record to write to the console */
static u64 console_seq;
-static u32 console_idx;
static u64 exclusive_console_stop_seq;
+static unsigned long console_dropped;

/* the next printk record to read after the last 'clear' command */
static u64 clear_seq;
-static u32 clear_idx;

#ifdef CONFIG_PRINTK_CALLER
#define PREFIX_MAX 48
@@ -453,13 +419,28 @@ static u32 clear_idx;
#define LOG_FACILITY(v) ((v) >> 3 & 0xff)

/* record buffer */
-#define LOG_ALIGN __alignof__(struct printk_log)
+#define LOG_ALIGN __alignof__(unsigned long)
#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
#define LOG_BUF_LEN_MAX (u32)(1 << 31)
static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
static char *log_buf = __log_buf;
static u32 log_buf_len = __LOG_BUF_LEN;

+/*
+ * Define the average message size. This only affects the number of
+ * descriptors that will be available. Underestimating is better than
+ * overestimating (too many available descriptors is better than not enough).
+ * The dictionary buffer will be the same size as the text buffer.
+ */
+#define PRB_AVGBITS 5 /* 32 character average length */
+
+_DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
+ PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]);
+
+static struct printk_ringbuffer printk_rb_dynamic;
+
+static struct printk_ringbuffer *prb = &printk_rb_static;
+
/*
* We cannot access per-CPU data (e.g. per-CPU flush irq_work) before
* per_cpu_areas are initialised. This variable is set to true when
@@ -484,108 +465,6 @@ u32 log_buf_len_get(void)
return log_buf_len;
}

-/* human readable text of the record */
-static char *log_text(const struct printk_log *msg)
-{
- return (char *)msg + sizeof(struct printk_log);
-}
-
-/* optional key/value pair dictionary attached to the record */
-static char *log_dict(const struct printk_log *msg)
-{
- return (char *)msg + sizeof(struct printk_log) + msg->text_len;
-}
-
-/* get record by index; idx must point to valid msg */
-static struct printk_log *log_from_idx(u32 idx)
-{
- struct printk_log *msg = (struct printk_log *)(log_buf + idx);
-
- /*
- * A length == 0 record is the end of buffer marker. Wrap around and
- * read the message at the start of the buffer.
- */
- if (!msg->len)
- return (struct printk_log *)log_buf;
- return msg;
-}
-
-/* get next record; idx must point to valid msg */
-static u32 log_next(u32 idx)
-{
- struct printk_log *msg = (struct printk_log *)(log_buf + idx);
-
- /* length == 0 indicates the end of the buffer; wrap */
- /*
- * A length == 0 record is the end of buffer marker. Wrap around and
- * read the message at the start of the buffer as *this* one, and
- * return the one after that.
- */
- if (!msg->len) {
- msg = (struct printk_log *)log_buf;
- return msg->len;
- }
- return idx + msg->len;
-}
-
-/*
- * Check whether there is enough free space for the given message.
- *
- * The same values of first_idx and next_idx mean that the buffer
- * is either empty or full.
- *
- * If the buffer is empty, we must respect the position of the indexes.
- * They cannot be reset to the beginning of the buffer.
- */
-static int logbuf_has_space(u32 msg_size, bool empty)
-{
- u32 free;
-
- if (log_next_idx > log_first_idx || empty)
- free = max(log_buf_len - log_next_idx, log_first_idx);
- else
- free = log_first_idx - log_next_idx;
-
- /*
- * We need space also for an empty header that signalizes wrapping
- * of the buffer.
- */
- return free >= msg_size + sizeof(struct printk_log);
-}
-
-static int log_make_free_space(u32 msg_size)
-{
- while (log_first_seq < log_next_seq &&
- !logbuf_has_space(msg_size, false)) {
- /* drop old messages until we have enough contiguous space */
- log_first_idx = log_next(log_first_idx);
- log_first_seq++;
- }
-
- if (clear_seq < log_first_seq) {
- clear_seq = log_first_seq;
- clear_idx = log_first_idx;
- }
-
- /* sequence numbers are equal, so the log buffer is empty */
- if (logbuf_has_space(msg_size, log_first_seq == log_next_seq))
- return 0;
-
- return -ENOMEM;
-}
-
-/* compute the message size including the padding bytes */
-static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
-{
- u32 size;
-
- size = sizeof(struct printk_log) + text_len + dict_len;
- *pad_len = (-size) & (LOG_ALIGN - 1);
- size += *pad_len;
-
- return size;
-}
-
/*
* Define how much of the log buffer we could take at maximum. The value
* must be greater than two. Note that only half of the buffer is available
@@ -594,22 +473,26 @@ static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
#define MAX_LOG_TAKE_PART 4
static const char trunc_msg[] = "<truncated>";

-static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len,
- u16 *dict_len, u32 *pad_len)
+static void truncate_msg(u16 *text_len, u16 *trunc_msg_len, u16 *dict_len)
{
/*
* The message should not take the whole buffer. Otherwise, it might
* get removed too soon.
*/
u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART;
+
if (*text_len > max_text_len)
*text_len = max_text_len;
- /* enable the warning message */
+
+ /* enable the warning message (if there is room) */
*trunc_msg_len = strlen(trunc_msg);
+ if (*text_len >= *trunc_msg_len)
+ *text_len -= *trunc_msg_len;
+ else
+ *trunc_msg_len = 0;
+
/* disable the "dict" completely */
*dict_len = 0;
- /* compute the size again, count also the warning message */
- return msg_used_size(*text_len + *trunc_msg_len, 0, pad_len);
}

/* insert record into the buffer, discard old ones, update heads */
@@ -618,60 +501,40 @@ static int log_store(u32 caller_id, int facility, int level,
const char *dict, u16 dict_len,
const char *text, u16 text_len)
{
- struct printk_log *msg;
- u32 size, pad_len;
+ struct prb_reserved_entry e;
+ struct printk_record r;
u16 trunc_msg_len = 0;

- /* number of '\0' padding bytes to next message */
- size = msg_used_size(text_len, dict_len, &pad_len);
+ prb_rec_init_wr(&r, text_len, dict_len);

- if (log_make_free_space(size)) {
+ if (!prb_reserve(&e, prb, &r)) {
/* truncate the message if it is too long for empty buffer */
- size = truncate_msg(&text_len, &trunc_msg_len,
- &dict_len, &pad_len);
+ truncate_msg(&text_len, &trunc_msg_len, &dict_len);
+ prb_rec_init_wr(&r, text_len + trunc_msg_len, dict_len);
/* survive when the log buffer is too small for trunc_msg */
- if (log_make_free_space(size))
+ if (!prb_reserve(&e, prb, &r))
return 0;
}

- if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
- /*
- * This message + an additional empty header does not fit
- * at the end of the buffer. Add an empty header with len == 0
- * to signify a wrap around.
- */
- memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
- log_next_idx = 0;
- }
-
/* fill message */
- msg = (struct printk_log *)(log_buf + log_next_idx);
- memcpy(log_text(msg), text, text_len);
- msg->text_len = text_len;
- if (trunc_msg_len) {
- memcpy(log_text(msg) + text_len, trunc_msg, trunc_msg_len);
- msg->text_len += trunc_msg_len;
- }
- memcpy(log_dict(msg), dict, dict_len);
- msg->dict_len = dict_len;
- msg->facility = facility;
- msg->level = level & 7;
- msg->flags = flags & 0x1f;
+ memcpy(&r.text_buf[0], text, text_len);
+ if (trunc_msg_len)
+ memcpy(&r.text_buf[text_len], trunc_msg, trunc_msg_len);
+ if (r.dict_buf)
+ memcpy(&r.dict_buf[0], dict, dict_len);
+ r.info->facility = facility;
+ r.info->level = level & 7;
+ r.info->flags = flags & 0x1f;
if (ts_nsec > 0)
- msg->ts_nsec = ts_nsec;
+ r.info->ts_nsec = ts_nsec;
else
- msg->ts_nsec = local_clock();
-#ifdef CONFIG_PRINTK_CALLER
- msg->caller_id = caller_id;
-#endif
- memset(log_dict(msg) + dict_len, 0, pad_len);
- msg->len = size;
+ r.info->ts_nsec = local_clock();
+ r.info->caller_id = caller_id;

/* insert message */
- log_next_idx += msg->len;
- log_next_seq++;
+ prb_commit(&e);

- return msg->text_len;
+ return (text_len + trunc_msg_len);
}

int dmesg_restrict = IS_ENABLED(CONFIG_SECURITY_DMESG_RESTRICT);
@@ -723,13 +586,13 @@ static void append_char(char **pp, char *e, char c)
*(*pp)++ = c;
}

-static ssize_t msg_print_ext_header(char *buf, size_t size,
- struct printk_log *msg, u64 seq)
+static ssize_t info_print_ext_header(char *buf, size_t size,
+ struct printk_info *info)
{
- u64 ts_usec = msg->ts_nsec;
+ u64 ts_usec = info->ts_nsec;
char caller[20];
#ifdef CONFIG_PRINTK_CALLER
- u32 id = msg->caller_id;
+ u32 id = info->caller_id;

snprintf(caller, sizeof(caller), ",caller=%c%u",
id & 0x80000000 ? 'C' : 'T', id & ~0x80000000);
@@ -740,8 +603,8 @@ static ssize_t msg_print_ext_header(char *buf, size_t size,
do_div(ts_usec, 1000);

return scnprintf(buf, size, "%u,%llu,%llu,%c%s;",
- (msg->facility << 3) | msg->level, seq, ts_usec,
- msg->flags & LOG_CONT ? 'c' : '-', caller);
+ (info->facility << 3) | info->level, info->seq,
+ ts_usec, info->flags & LOG_CONT ? 'c' : '-', caller);
}

static ssize_t msg_print_ext_body(char *buf, size_t size,
@@ -795,10 +658,14 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
/* /dev/kmsg - userspace message inject/listen interface */
struct devkmsg_user {
u64 seq;
- u32 idx;
struct ratelimit_state rs;
struct mutex lock;
char buf[CONSOLE_EXT_LOG_MAX];
+
+ struct printk_info info;
+ char text_buf[CONSOLE_EXT_LOG_MAX];
+ char dict_buf[CONSOLE_EXT_LOG_MAX];
+ struct printk_record record;
};

static __printf(3, 4) __cold
@@ -881,7 +748,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
struct devkmsg_user *user = file->private_data;
- struct printk_log *msg;
+ struct printk_record *r = &user->record;
size_t len;
ssize_t ret;

@@ -893,7 +760,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
return ret;

logbuf_lock_irq();
- while (user->seq == log_next_seq) {
+ if (!prb_read_valid(prb, user->seq, r)) {
if (file->f_flags & O_NONBLOCK) {
ret = -EAGAIN;
logbuf_unlock_irq();
@@ -902,30 +769,26 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,

logbuf_unlock_irq();
ret = wait_event_interruptible(log_wait,
- user->seq != log_next_seq);
+ prb_read_valid(prb, user->seq, r));
if (ret)
goto out;
logbuf_lock_irq();
}

- if (user->seq < log_first_seq) {
+ if (user->seq < prb_first_seq(prb)) {
/* our last seen message is gone, return error and reset */
- user->idx = log_first_idx;
- user->seq = log_first_seq;
+ user->seq = prb_first_seq(prb);
ret = -EPIPE;
logbuf_unlock_irq();
goto out;
}

- msg = log_from_idx(user->idx);
- len = msg_print_ext_header(user->buf, sizeof(user->buf),
- msg, user->seq);
+ len = info_print_ext_header(user->buf, sizeof(user->buf), r->info);
len += msg_print_ext_body(user->buf + len, sizeof(user->buf) - len,
- log_dict(msg), msg->dict_len,
- log_text(msg), msg->text_len);
+ &r->dict_buf[0], r->info->dict_len,
+ &r->text_buf[0], r->info->text_len);

- user->idx = log_next(user->idx);
- user->seq++;
+ user->seq = r->info->seq + 1;
logbuf_unlock_irq();

if (len > count) {
@@ -957,8 +820,7 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
switch (whence) {
case SEEK_SET:
/* the first record */
- user->idx = log_first_idx;
- user->seq = log_first_seq;
+ user->seq = prb_first_seq(prb);
break;
case SEEK_DATA:
/*
@@ -966,13 +828,11 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
* like issued by 'dmesg -c'. Reading /dev/kmsg itself
* changes no global state, and does not clear anything.
*/
- user->idx = clear_idx;
user->seq = clear_seq;
break;
case SEEK_END:
/* after the last record */
- user->idx = log_next_idx;
- user->seq = log_next_seq;
+ user->seq = prb_next_seq(prb);
break;
case SEEK_CUR:
/*
@@ -1002,9 +862,9 @@ static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
poll_wait(file, &log_wait, wait);

logbuf_lock_irq();
- if (user->seq < log_next_seq) {
+ if (prb_read_valid(prb, user->seq, NULL)) {
/* return error when data has vanished underneath us */
- if (user->seq < log_first_seq)
+ if (user->seq < prb_first_seq(prb))
ret = EPOLLIN|EPOLLRDNORM|EPOLLERR|EPOLLPRI;
else
ret = EPOLLIN|EPOLLRDNORM;
@@ -1039,9 +899,12 @@ static int devkmsg_open(struct inode *inode, struct file *file)

mutex_init(&user->lock);

+ prb_rec_init_rd(&user->record, &user->info,
+ &user->text_buf[0], sizeof(user->text_buf),
+ &user->dict_buf[0], sizeof(user->dict_buf));
+
logbuf_lock_irq();
- user->idx = log_first_idx;
- user->seq = log_first_seq;
+ user->seq = prb_first_seq(prb);
logbuf_unlock_irq();

file->private_data = user;
@@ -1082,23 +945,52 @@ const struct file_operations kmsg_fops = {
*/
void log_buf_vmcoreinfo_setup(void)
{
- VMCOREINFO_SYMBOL(log_buf);
- VMCOREINFO_SYMBOL(log_buf_len);
- VMCOREINFO_SYMBOL(log_first_idx);
- VMCOREINFO_SYMBOL(clear_idx);
- VMCOREINFO_SYMBOL(log_next_idx);
+ VMCOREINFO_SYMBOL(prb);
+ VMCOREINFO_SYMBOL(printk_rb_static);
+ VMCOREINFO_SYMBOL(clear_seq);
+
/*
- * Export struct printk_log size and field offsets. User space tools can
+ * Export struct size and field offsets. User space tools can
* parse it and detect any changes to structure down the line.
*/
- VMCOREINFO_STRUCT_SIZE(printk_log);
- VMCOREINFO_OFFSET(printk_log, ts_nsec);
- VMCOREINFO_OFFSET(printk_log, len);
- VMCOREINFO_OFFSET(printk_log, text_len);
- VMCOREINFO_OFFSET(printk_log, dict_len);
-#ifdef CONFIG_PRINTK_CALLER
- VMCOREINFO_OFFSET(printk_log, caller_id);
-#endif
+
+ VMCOREINFO_STRUCT_SIZE(printk_ringbuffer);
+ VMCOREINFO_OFFSET(printk_ringbuffer, desc_ring);
+ VMCOREINFO_OFFSET(printk_ringbuffer, text_data_ring);
+ VMCOREINFO_OFFSET(printk_ringbuffer, dict_data_ring);
+ VMCOREINFO_OFFSET(printk_ringbuffer, fail);
+
+ VMCOREINFO_STRUCT_SIZE(prb_desc_ring);
+ VMCOREINFO_OFFSET(prb_desc_ring, count_bits);
+ VMCOREINFO_OFFSET(prb_desc_ring, descs);
+ VMCOREINFO_OFFSET(prb_desc_ring, head_id);
+ VMCOREINFO_OFFSET(prb_desc_ring, tail_id);
+
+ VMCOREINFO_STRUCT_SIZE(prb_desc);
+ VMCOREINFO_OFFSET(prb_desc, info);
+ VMCOREINFO_OFFSET(prb_desc, state_var);
+ VMCOREINFO_OFFSET(prb_desc, text_blk_lpos);
+ VMCOREINFO_OFFSET(prb_desc, dict_blk_lpos);
+
+ VMCOREINFO_STRUCT_SIZE(prb_data_blk_lpos);
+ VMCOREINFO_OFFSET(prb_data_blk_lpos, begin);
+ VMCOREINFO_OFFSET(prb_data_blk_lpos, next);
+
+ VMCOREINFO_STRUCT_SIZE(printk_info);
+ VMCOREINFO_OFFSET(printk_info, seq);
+ VMCOREINFO_OFFSET(printk_info, ts_nsec);
+ VMCOREINFO_OFFSET(printk_info, text_len);
+ VMCOREINFO_OFFSET(printk_info, dict_len);
+ VMCOREINFO_OFFSET(printk_info, caller_id);
+
+ VMCOREINFO_STRUCT_SIZE(prb_data_ring);
+ VMCOREINFO_OFFSET(prb_data_ring, size_bits);
+ VMCOREINFO_OFFSET(prb_data_ring, data);
+ VMCOREINFO_OFFSET(prb_data_ring, head_lpos);
+ VMCOREINFO_OFFSET(prb_data_ring, tail_lpos);
+
+ VMCOREINFO_SIZE(atomic_long_t);
+ VMCOREINFO_TYPE_OFFSET(atomic_long_t, counter);
}
#endif

@@ -1176,11 +1068,46 @@ static void __init set_percpu_data_ready(void)
__printk_percpu_data_ready = true;
}

+static unsigned int __init add_to_rb(struct printk_ringbuffer *rb,
+ struct printk_record *r)
+{
+ struct prb_reserved_entry e;
+ struct printk_record dest_r;
+
+ prb_rec_init_wr(&dest_r, r->info->text_len, r->info->dict_len);
+
+ if (!prb_reserve(&e, rb, &dest_r))
+ return 0;
+
+ memcpy(&dest_r.text_buf[0], &r->text_buf[0], dest_r.text_buf_size);
+ if (dest_r.dict_buf) {
+ memcpy(&dest_r.dict_buf[0], &r->dict_buf[0],
+ dest_r.dict_buf_size);
+ }
+ dest_r.info->facility = r->info->facility;
+ dest_r.info->level = r->info->level;
+ dest_r.info->flags = r->info->flags;
+ dest_r.info->ts_nsec = r->info->ts_nsec;
+ dest_r.info->caller_id = r->info->caller_id;
+
+ prb_commit(&e);
+
+ return prb_record_text_space(&e);
+}
+
+static char setup_text_buf[CONSOLE_EXT_LOG_MAX] __initdata;
+static char setup_dict_buf[CONSOLE_EXT_LOG_MAX] __initdata;
+
void __init setup_log_buf(int early)
{
+ struct prb_desc *new_descs;
+ struct printk_info info;
+ struct printk_record r;
unsigned long flags;
+ char *new_dict_buf;
char *new_log_buf;
unsigned int free;
+ u64 seq;

/*
* Some archs call setup_log_buf() multiple times - first is very
@@ -1201,17 +1128,50 @@ void __init setup_log_buf(int early)

new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
if (unlikely(!new_log_buf)) {
- pr_err("log_buf_len: %lu bytes not available\n",
+ pr_err("log_buf_len: %lu text bytes not available\n",
new_log_buf_len);
return;
}

+ new_dict_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
+ if (unlikely(!new_dict_buf)) {
+ /* dictionary failure is allowed */
+ pr_err("log_buf_len: %lu dict bytes not available\n",
+ new_log_buf_len);
+ }
+
+ new_descs = memblock_alloc((new_log_buf_len >> PRB_AVGBITS) *
+ sizeof(struct prb_desc), LOG_ALIGN);
+ if (unlikely(!new_descs)) {
+ pr_err("log_buf_len: %lu desc bytes not available\n",
+ new_log_buf_len >> PRB_AVGBITS);
+ if (new_dict_buf)
+ memblock_free(__pa(new_dict_buf), new_log_buf_len);
+ memblock_free(__pa(new_log_buf), new_log_buf_len);
+ return;
+ }
+
+ prb_rec_init_rd(&r, &info,
+ &setup_text_buf[0], sizeof(setup_text_buf),
+ &setup_dict_buf[0], sizeof(setup_dict_buf));
+
logbuf_lock_irqsave(flags);
+
+ prb_init(&printk_rb_dynamic,
+ new_log_buf, bits_per(new_log_buf_len) - 1,
+ new_dict_buf, bits_per(new_log_buf_len) - 1,
+ new_descs, (bits_per(new_log_buf_len) - 1) - PRB_AVGBITS);
+
log_buf_len = new_log_buf_len;
log_buf = new_log_buf;
new_log_buf_len = 0;
- free = __LOG_BUF_LEN - log_next_idx;
- memcpy(log_buf, __log_buf, __LOG_BUF_LEN);
+
+ free = __LOG_BUF_LEN;
+ prb_for_each_record(0, &printk_rb_static, seq, &r)
+ free -= add_to_rb(&printk_rb_dynamic, &r);
+
+ prb = &printk_rb_dynamic;
+
logbuf_unlock_irqrestore(flags);

pr_info("log_buf_len: %u bytes\n", log_buf_len);
@@ -1323,18 +1283,18 @@ static size_t print_caller(u32 id, char *buf)
#define print_caller(id, buf) 0
#endif

-static size_t print_prefix(const struct printk_log *msg, bool syslog,
- bool time, char *buf)
+static size_t info_print_prefix(const struct printk_info *info, bool syslog,
+ bool time, char *buf)
{
size_t len = 0;

if (syslog)
- len = print_syslog((msg->facility << 3) | msg->level, buf);
+ len = print_syslog((info->facility << 3) | info->level, buf);

if (time)
- len += print_time(msg->ts_nsec, buf + len);
+ len += print_time(info->ts_nsec, buf + len);

- len += print_caller(msg->caller_id, buf + len);
+ len += print_caller(info->caller_id, buf + len);

if (IS_ENABLED(CONFIG_PRINTK_CALLER) || time) {
buf[len++] = ' ';
@@ -1344,72 +1304,150 @@ static size_t print_prefix(const struct printk_log *msg, bool syslog,
return len;
}

-static size_t msg_print_text(const struct printk_log *msg, bool syslog,
- bool time, char *buf, size_t size)
+static size_t record_print_text(struct printk_record *r, bool syslog,
+ bool time)
{
- const char *text = log_text(msg);
- size_t text_size = msg->text_len;
- size_t len = 0;
+ size_t text_len = r->info->text_len;
+ size_t buf_size = r->text_buf_size;
+ char *text = r->text_buf;
char prefix[PREFIX_MAX];
- const size_t prefix_len = print_prefix(msg, syslog, time, prefix);
+ bool truncated = false;
+ size_t prefix_len;
+ size_t line_len;
+ size_t len = 0;
+ char *next;

- do {
- const char *next = memchr(text, '\n', text_size);
- size_t text_len;
+ prefix_len = info_print_prefix(r->info, syslog, time, prefix);

+ /*
+ * Add the prefix for each line by shifting the rest of the text to
+ * make room for each prefix. If the buffer is not large enough for
+ * all the prefixes, then drop the trailing text and report the
+ * largest length that includes full lines with their prefixes.
+ *
+ * @text_len: bytes of unprocessed text
+ * @line_len: bytes of current line (newline _not_ included)
+ * @text: pointer to beginning of current line
+ * @len: number of bytes processed (size of r->text_buf done)
+ */
+ for (;;) {
+ next = memchr(text, '\n', text_len);
if (next) {
- text_len = next - text;
- next++;
- text_size -= next - text;
+ line_len = next - text;
} else {
- text_len = text_size;
+ /*
+ * No newline. If the text was previously truncated,
+ * assume this line was truncated and do not include
+ * it.
+ */
+ if (truncated)
+ break;
+ line_len = text_len;
}

- if (buf) {
- if (prefix_len + text_len + 1 >= size - len)
+ /*
+ * Ensure there is enough buffer available to shift this line
+ * (and add a newline at the end).
+ */
+ if (len + prefix_len + line_len + 1 > buf_size)
+ break;
+
+ /*
+ * Ensure there is enough buffer available to shift all
+ * remaining text (and add a newline at the end). If this
+ * test fails, then there must be a newline (i.e.
+ * text_len > line_len because the previous test succeeded).
+ */
+ if (len + prefix_len + text_len + 1 > buf_size) {
+ /*
+ * Truncate @text_len so that there is enough space
+ * for a prefix. A newline will not be added because
+ * the last line of the text is now truncated and
+ * will not be included.
+ */
+ text_len = (buf_size - len) - prefix_len;
+
+ /*
+ * Ensure there is still a newline. Otherwise this
+ * line may have been truncated and will not be
+ * included.
+ */
+ if (memchr(text, '\n', text_len) == NULL)
break;

- memcpy(buf + len, prefix, prefix_len);
- len += prefix_len;
- memcpy(buf + len, text, text_len);
- len += text_len;
- buf[len++] = '\n';
- } else {
- /* SYSLOG_ACTION_* buffer size only calculation */
- len += prefix_len + text_len + 1;
+ /* Note that the last line will not be included. */
+ truncated = true;
}

- text = next;
- } while (text);
+ memmove(text + prefix_len, text, text_len);
+ memcpy(text, prefix, prefix_len);
+
+ /* Advance beyond the newly added prefix and existing line. */
+ text += prefix_len + line_len;
+
+ /* The remaining text has only decreased by the line. */
+ text_len -= line_len;
+
+ len += prefix_len + line_len + 1;
+
+ if (text_len) {
+ /* Advance past the newline. */
+ text++;
+ text_len--;
+ } else {
+ /* The final line, add a newline. */
+ *text = '\n';
+ break;
+ }
+ }

return len;
}

+static size_t get_record_text_size(struct printk_info *info,
+ unsigned int line_count,
+ bool syslog, bool time)
+{
+ char prefix[PREFIX_MAX];
+ size_t prefix_len;
+
+ prefix_len = info_print_prefix(info, syslog, time, prefix);
+
+ /*
+ * Each line will be preceded with a prefix. The intermediate
+ * newlines are already within the text, but a final trailing
+ * newline will be added.
+ */
+ return ((prefix_len * line_count) + info->text_len + 1);
+}
+
static int syslog_print(char __user *buf, int size)
{
+ struct printk_info info;
+ struct printk_record r;
char *text;
- struct printk_log *msg;
int len = 0;

text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
if (!text)
return -ENOMEM;

+ prb_rec_init_rd(&r, &info, text, LOG_LINE_MAX + PREFIX_MAX, NULL, 0);
+
while (size > 0) {
size_t n;
size_t skip;

logbuf_lock_irq();
- if (syslog_seq < log_first_seq) {
- /* messages are gone, move to first one */
- syslog_seq = log_first_seq;
- syslog_idx = log_first_idx;
- syslog_partial = 0;
- }
- if (syslog_seq == log_next_seq) {
+ if (!prb_read_valid(prb, syslog_seq, &r)) {
logbuf_unlock_irq();
break;
}
+ if (r.info->seq != syslog_seq) {
+ /* message is gone, move to next valid one */
+ syslog_seq = r.info->seq;
+ syslog_partial = 0;
+ }

/*
* To keep reading/counting partial line consistent,
@@ -1419,13 +1457,10 @@ static int syslog_print(char __user *buf, int size)
syslog_time = printk_time;

skip = syslog_partial;
- msg = log_from_idx(syslog_idx);
- n = msg_print_text(msg, true, syslog_time, text,
- LOG_LINE_MAX + PREFIX_MAX);
+ n = record_print_text(&r, true, syslog_time);
if (n - syslog_partial <= size) {
/* message fits into buffer, move forward */
- syslog_idx = log_next(syslog_idx);
- syslog_seq++;
+ syslog_seq = r.info->seq + 1;
n -= syslog_partial;
syslog_partial = 0;
} else if (!len){
@@ -1456,55 +1491,49 @@ static int syslog_print(char __user *buf, int size)

static int syslog_print_all(char __user *buf, int size, bool clear)
{
+ struct printk_info info;
+ unsigned int line_count;
+ struct printk_record r;
char *text;
int len = 0;
- u64 next_seq;
u64 seq;
- u32 idx;
bool time;

text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
if (!text)
return -ENOMEM;

+ prb_rec_init_rd(&r, &info, text, LOG_LINE_MAX + PREFIX_MAX, NULL, 0);
+
time = printk_time;
logbuf_lock_irq();
/*
* Find first record that fits, including all following records,
* into the user-provided buffer for this dump.
*/
- seq = clear_seq;
- idx = clear_idx;
- while (seq < log_next_seq) {
- struct printk_log *msg = log_from_idx(idx);
-
- len += msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
+ prb_for_each_record(clear_seq, prb, seq, &r) {
+ line_count = prb_count_lines(text, r.info->text_len);
+ len += get_record_text_size(r.info, line_count, true, time);
}

/* move first record forward until length fits into the buffer */
- seq = clear_seq;
- idx = clear_idx;
- while (len > size && seq < log_next_seq) {
- struct printk_log *msg = log_from_idx(idx);
-
- len -= msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
+ prb_for_each_record(clear_seq, prb, seq, &r) {
+ if (len <= size)
+ break;
+ line_count = prb_count_lines(text, r.info->text_len);
+ len -= get_record_text_size(r.info, line_count, true, time);
}

- /* last message fitting into this dump */
- next_seq = log_next_seq;
-
len = 0;
- while (len >= 0 && seq < next_seq) {
- struct printk_log *msg = log_from_idx(idx);
- int textlen = msg_print_text(msg, true, time, text,
- LOG_LINE_MAX + PREFIX_MAX);
+ prb_for_each_record(seq, prb, seq, &r) {
+ int textlen;

- idx = log_next(idx);
- seq++;
+ textlen = record_print_text(&r, true, time);
+
+ if (len + textlen > size) {
+ seq--;
+ break;
+ }

logbuf_unlock_irq();
if (copy_to_user(buf + len, text, textlen))
@@ -1513,17 +1542,12 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
len += textlen;
logbuf_lock_irq();

- if (seq < log_first_seq) {
- /* messages are gone, move to next one */
- seq = log_first_seq;
- idx = log_first_idx;
- }
+ if (len < 0)
+ break;
}

- if (clear) {
- clear_seq = log_next_seq;
- clear_idx = log_next_idx;
- }
+ if (clear)
+ clear_seq = seq;
logbuf_unlock_irq();

kfree(text);
@@ -1533,8 +1557,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
static void syslog_clear(void)
{
logbuf_lock_irq();
- clear_seq = log_next_seq;
- clear_idx = log_next_idx;
+ clear_seq = prb_next_seq(prb);
logbuf_unlock_irq();
}

@@ -1542,6 +1565,7 @@ int do_syslog(int type, char __user *buf, int len, int source)
{
bool clear = false;
static int saved_console_loglevel = LOGLEVEL_DEFAULT;
+ char *text = NULL;
int error;

error = check_syslog_permissions(type, source);
@@ -1561,7 +1585,7 @@ int do_syslog(int type, char __user *buf, int len, int source)
if (!access_ok(buf, len))
return -EFAULT;
error = wait_event_interruptible(log_wait,
- syslog_seq != log_next_seq);
+ prb_read_valid(prb, syslog_seq, NULL));
if (error)
return error;
error = syslog_print(buf, len);
@@ -1609,11 +1633,15 @@ int do_syslog(int type, char __user *buf, int len, int source)
break;
/* Number of chars in the log buffer */
case SYSLOG_ACTION_SIZE_UNREAD:
+ if (source != SYSLOG_FROM_PROC) {
+ text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
+ if (!text)
+ return -ENOMEM;
+ }
logbuf_lock_irq();
- if (syslog_seq < log_first_seq) {
+ if (syslog_seq < prb_first_seq(prb)) {
/* messages are gone, move to first one */
- syslog_seq = log_first_seq;
- syslog_idx = log_first_idx;
+ syslog_seq = prb_first_seq(prb);
syslog_partial = 0;
}
if (source == SYSLOG_FROM_PROC) {
@@ -1622,24 +1650,29 @@ int do_syslog(int type, char __user *buf, int len, int source)
* for pending data, not the size; return the count of
* records, not the length.
*/
- error = log_next_seq - syslog_seq;
+ error = prb_next_seq(prb) - syslog_seq;
} else {
- u64 seq = syslog_seq;
- u32 idx = syslog_idx;
bool time = syslog_partial ? syslog_time : printk_time;
-
- while (seq < log_next_seq) {
- struct printk_log *msg = log_from_idx(idx);
-
- error += msg_print_text(msg, true, time, NULL,
- 0);
+ struct printk_info info;
+ unsigned int line_count;
+ struct printk_record r;
+ u64 seq;
+
+ prb_rec_init_rd(&r, &info, text,
+ LOG_LINE_MAX + PREFIX_MAX, NULL, 0);
+
+ prb_for_each_record(syslog_seq, prb, seq, &r) {
+ line_count = prb_count_lines(text,
+ r.info->text_len);
+ error += get_record_text_size(r.info,
+ line_count,
+ true, time);
time = printk_time;
- idx = log_next(idx);
- seq++;
}
error -= syslog_partial;
}
logbuf_unlock_irq();
+ kfree(text);
break;
/* Size of the log buffer */
case SYSLOG_ACTION_SIZE_BUFFER:
@@ -1806,10 +1839,22 @@ static int console_trylock_spinning(void)
static void call_console_drivers(const char *ext_text, size_t ext_len,
const char *text, size_t len)
{
+ static char dropped_text[64];
+ size_t dropped_len = 0;
struct console *con;

trace_console_rcuidle(text, len);

+ if (!console_drivers)
+ return;
+
+ if (console_dropped) {
+ dropped_len = snprintf(dropped_text, sizeof(dropped_text),
+ "** %lu printk messages dropped **\n",
+ console_dropped);
+ console_dropped = 0;
+ }
+
for_each_console(con) {
if (exclusive_console && con != exclusive_console)
continue;
@@ -1822,8 +1867,11 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
continue;
if (con->flags & CON_EXTENDED)
con->write(con, ext_text, ext_len);
- else
+ else {
+ if (dropped_len)
+ con->write(con, dropped_text, dropped_len);
con->write(con, text, len);
+ }
}
}

@@ -1993,7 +2041,6 @@ asmlinkage int vprintk_emit(int facility, int level,
int printed_len;
bool in_sched = false, pending_output;
unsigned long flags;
- u64 curr_log_seq;

/* Suppress unimportant messages after panic happens */
if (unlikely(suppress_printk))
@@ -2009,9 +2056,9 @@ asmlinkage int vprintk_emit(int facility, int level,

/* This stops the holder of console_sem just where we want him */
logbuf_lock_irqsave(flags);
- curr_log_seq = log_next_seq;
+ pending_output = !prb_read_valid(prb, console_seq, NULL);
printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
- pending_output = (curr_log_seq != log_next_seq);
+ pending_output &= prb_read_valid(prb, console_seq, NULL);
logbuf_unlock_irqrestore(flags);

/* If called from the scheduler, we can not call up(). */
@@ -2090,21 +2137,24 @@ EXPORT_SYMBOL(printk);
#define PREFIX_MAX 0
#define printk_time false

+#define prb_read_valid(rb, seq, r) false
+#define prb_first_seq(rb) 0
+
static u64 syslog_seq;
-static u32 syslog_idx;
static u64 console_seq;
-static u32 console_idx;
static u64 exclusive_console_stop_seq;
-static u64 log_first_seq;
-static u32 log_first_idx;
-static u64 log_next_seq;
-static char *log_text(const struct printk_log *msg) { return NULL; }
-static char *log_dict(const struct printk_log *msg) { return NULL; }
-static struct printk_log *log_from_idx(u32 idx) { return NULL; }
-static u32 log_next(u32 idx) { return 0; }
-static ssize_t msg_print_ext_header(char *buf, size_t size,
- struct printk_log *msg,
- u64 seq) { return 0; }
+static unsigned long console_dropped;
+
+static size_t record_print_text(const struct printk_record *r,
+ bool syslog, bool time)
+{
+ return 0;
+}
+static ssize_t info_print_ext_header(char *buf, size_t size,
+ struct printk_info *info)
+{
+ return 0;
+}
static ssize_t msg_print_ext_body(char *buf, size_t size,
char *dict, size_t dict_len,
char *text, size_t text_len) { return 0; }
@@ -2112,8 +2162,6 @@ static void console_lock_spinning_enable(void) { }
static int console_lock_spinning_disable_and_check(void) { return 0; }
static void call_console_drivers(const char *ext_text, size_t ext_len,
const char *text, size_t len) {}
-static size_t msg_print_text(const struct printk_log *msg, bool syslog,
- bool time, char *buf, size_t size) { return 0; }
static bool suppress_message_printing(int level) { return false; }

#endif /* CONFIG_PRINTK */
@@ -2398,14 +2446,19 @@ void console_unlock(void)
{
static char ext_text[CONSOLE_EXT_LOG_MAX];
static char text[LOG_LINE_MAX + PREFIX_MAX];
+ static char dict[LOG_LINE_MAX];
unsigned long flags;
bool do_cond_resched, retry;
+ struct printk_info info;
+ struct printk_record r;

if (console_suspended) {
up_console_sem();
return;
}

+ prb_rec_init_rd(&r, &info, text, sizeof(text), dict, sizeof(dict));
+
/*
* Console drivers are called with interrupts disabled, so
* @console_may_schedule should be cleared before; however, we may
@@ -2436,35 +2489,26 @@ void console_unlock(void)
}

for (;;) {
- struct printk_log *msg;
size_t ext_len = 0;
size_t len;

printk_safe_enter_irqsave(flags);
raw_spin_lock(&logbuf_lock);
- if (console_seq < log_first_seq) {
- len = snprintf(text, sizeof(text),
- "** %llu printk messages dropped **\n",
- log_first_seq - console_seq);
-
- /* messages are gone, move to first one */
- console_seq = log_first_seq;
- console_idx = log_first_idx;
- } else {
- len = 0;
- }
skip:
- if (console_seq == log_next_seq)
+ if (!prb_read_valid(prb, console_seq, &r))
break;

- msg = log_from_idx(console_idx);
- if (suppress_message_printing(msg->level)) {
+ if (console_seq != r.info->seq) {
+ console_dropped += r.info->seq - console_seq;
+ console_seq = r.info->seq;
+ }
+
+ if (suppress_message_printing(r.info->level)) {
/*
* Skip record we have buffered and already printed
* directly to the console when we received it, and
* record that has level above the console loglevel.
*/
- console_idx = log_next(console_idx);
console_seq++;
goto skip;
}
@@ -2475,19 +2519,24 @@ void console_unlock(void)
exclusive_console = NULL;
}

- len += msg_print_text(msg,
- console_msg_format & MSG_FORMAT_SYSLOG,
- printk_time, text + len, sizeof(text) - len);
+ /*
+ * Handle extended console text first because later
+ * record_print_text() will modify the record buffer in-place.
+ */
if (nr_ext_console_drivers) {
- ext_len = msg_print_ext_header(ext_text,
+ ext_len = info_print_ext_header(ext_text,
sizeof(ext_text),
- msg, console_seq);
+ r.info);
ext_len += msg_print_ext_body(ext_text + ext_len,
sizeof(ext_text) - ext_len,
- log_dict(msg), msg->dict_len,
- log_text(msg), msg->text_len);
+ &r.dict_buf[0],
+ r.info->dict_len,
+ &r.text_buf[0],
+ r.info->text_len);
}
- console_idx = log_next(console_idx);
+ len = record_print_text(&r,
+ console_msg_format & MSG_FORMAT_SYSLOG,
+ printk_time);
console_seq++;
raw_spin_unlock(&logbuf_lock);

@@ -2527,7 +2576,7 @@ void console_unlock(void)
* flush, no worries.
*/
raw_spin_lock(&logbuf_lock);
- retry = console_seq != log_next_seq;
+ retry = prb_read_valid(prb, console_seq, NULL);
raw_spin_unlock(&logbuf_lock);
printk_safe_exit_irqrestore(flags);

@@ -2596,8 +2645,7 @@ void console_flush_on_panic(enum con_flush_mode mode)
unsigned long flags;

logbuf_lock_irqsave(flags);
- console_seq = log_first_seq;
- console_idx = log_first_idx;
+ console_seq = prb_first_seq(prb);
logbuf_unlock_irqrestore(flags);
}
console_unlock();
@@ -2840,7 +2888,6 @@ void register_console(struct console *newcon)
exclusive_console = newcon;
exclusive_console_stop_seq = console_seq;
console_seq = syslog_seq;
- console_idx = syslog_idx;
logbuf_unlock_irqrestore(flags);
}
console_unlock();
@@ -3229,9 +3276,7 @@ void kmsg_dump(enum kmsg_dump_reason reason)

logbuf_lock_irqsave(flags);
dumper->cur_seq = clear_seq;
- dumper->cur_idx = clear_idx;
- dumper->next_seq = log_next_seq;
- dumper->next_idx = log_next_idx;
+ dumper->next_seq = prb_next_seq(prb);
logbuf_unlock_irqrestore(flags);

/* invoke dumper which will iterate over records */
@@ -3265,28 +3310,33 @@ void kmsg_dump(enum kmsg_dump_reason reason)
bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
char *line, size_t size, size_t *len)
{
- struct printk_log *msg;
+ struct printk_info info;
+ unsigned int line_count;
+ struct printk_record r;
size_t l = 0;
bool ret = false;

+ prb_rec_init_rd(&r, &info, line, size, NULL, 0);
+
if (!dumper->active)
goto out;

- if (dumper->cur_seq < log_first_seq) {
- /* messages are gone, move to first available one */
- dumper->cur_seq = log_first_seq;
- dumper->cur_idx = log_first_idx;
- }
-
- /* last entry */
- if (dumper->cur_seq >= log_next_seq)
- goto out;
+ /* Read text or count text lines? */
+ if (line) {
+ if (!prb_read_valid(prb, dumper->cur_seq, &r))
+ goto out;
+ l = record_print_text(&r, syslog, printk_time);
+ } else {
+ if (!prb_read_valid_info(prb, dumper->cur_seq,
+ &info, &line_count)) {
+ goto out;
+ }
+ l = get_record_text_size(&info, line_count, syslog,
+ printk_time);

- msg = log_from_idx(dumper->cur_idx);
- l = msg_print_text(msg, syslog, printk_time, line, size);
+ }

- dumper->cur_idx = log_next(dumper->cur_idx);
- dumper->cur_seq++;
+ dumper->cur_seq = r.info->seq + 1;
ret = true;
out:
if (len)
@@ -3347,23 +3397,25 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_line);
bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
char *buf, size_t size, size_t *len)
{
+ struct printk_info info;
+ unsigned int line_count;
+ struct printk_record r;
unsigned long flags;
u64 seq;
- u32 idx;
u64 next_seq;
- u32 next_idx;
size_t l = 0;
bool ret = false;
bool time = printk_time;

- if (!dumper->active)
+ prb_rec_init_rd(&r, &info, buf, size, NULL, 0);
+
+ if (!dumper->active || !buf || !size)
goto out;

logbuf_lock_irqsave(flags);
- if (dumper->cur_seq < log_first_seq) {
+ if (dumper->cur_seq < prb_first_seq(prb)) {
/* messages are gone, move to first available one */
- dumper->cur_seq = log_first_seq;
- dumper->cur_idx = log_first_idx;
+ dumper->cur_seq = prb_first_seq(prb);
}

/* last entry */
@@ -3374,41 +3426,41 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,

/* calculate length of entire buffer */
seq = dumper->cur_seq;
- idx = dumper->cur_idx;
- while (seq < dumper->next_seq) {
- struct printk_log *msg = log_from_idx(idx);
-
- l += msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
+ while (prb_read_valid_info(prb, seq, &info, &line_count)) {
+ if (r.info->seq >= dumper->next_seq)
+ break;
+ l += get_record_text_size(&info, line_count, true, time);
+ seq = r.info->seq + 1;
}

/* move first record forward until length fits into the buffer */
seq = dumper->cur_seq;
- idx = dumper->cur_idx;
- while (l >= size && seq < dumper->next_seq) {
- struct printk_log *msg = log_from_idx(idx);
-
- l -= msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
+ while (l >= size && prb_read_valid_info(prb, seq,
+ &info, &line_count)) {
+ if (r.info->seq >= dumper->next_seq)
+ break;
+ l -= get_record_text_size(&info, line_count, true, time);
+ seq = r.info->seq + 1;
}

	/* last message in next iteration */
next_seq = seq;
- next_idx = idx;

+ /* actually read text into the buffer now */
l = 0;
- while (seq < dumper->next_seq) {
- struct printk_log *msg = log_from_idx(idx);
+ while (prb_read_valid(prb, seq, &r)) {
+ if (r.info->seq >= dumper->next_seq)
+ break;
+
+ l += record_print_text(&r, syslog, time);
+
+ /* adjust record to store to remaining buffer space */
+ prb_rec_init_rd(&r, &info, buf + l, size - l, NULL, 0);

- l += msg_print_text(msg, syslog, time, buf + l, size - l);
- idx = log_next(idx);
- seq++;
+ seq = r.info->seq + 1;
}

dumper->next_seq = next_seq;
- dumper->next_idx = next_idx;
ret = true;
logbuf_unlock_irqrestore(flags);
out:
@@ -3431,9 +3483,7 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);
void kmsg_dump_rewind_nolock(struct kmsg_dumper *dumper)
{
dumper->cur_seq = clear_seq;
- dumper->cur_idx = clear_idx;
- dumper->next_seq = log_next_seq;
- dumper->next_idx = log_next_idx;
+ dumper->next_seq = prb_next_seq(prb);
}

/**
--
2.20.1

2020-06-18 14:54:58

by John Ogness

Subject: [PATCH v3 2/3] printk: add lockless ringbuffer

Introduce a multi-reader multi-writer lockless ringbuffer for storing
the kernel log messages. Readers and writers may use their API from
any context (including scheduler and NMI). This ringbuffer will make
it possible to decouple printk() callers from any context, locking,
or console constraints. It also makes it possible for readers to have
full access to the ringbuffer contents at any time and context (for
example from any panic situation).

The printk_ringbuffer is made up of 3 internal ringbuffers:

desc_ring:
A ring of descriptors. A descriptor contains all record meta data
(sequence number, timestamp, loglevel, etc.) as well as internal state
information about the record and logical positions specifying where in
the other ringbuffers the text and dictionary strings are located.

text_data_ring:
A ring of data blocks. A data block consists of an unsigned long
integer (ID) that maps to a desc_ring index followed by the text
string of the record.

dict_data_ring:
A ring of data blocks. A data block consists of an unsigned long
integer (ID) that maps to a desc_ring index followed by the dictionary
string of the record.

The internal state information of a descriptor is the key element to
allow readers and writers to locklessly synchronize access to the data.

Co-developed-by: Petr Mladek <[email protected]>
Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/Makefile | 1 +
kernel/printk/printk_ringbuffer.c | 1674 +++++++++++++++++++++++++++++
kernel/printk/printk_ringbuffer.h | 352 ++++++
3 files changed, 2027 insertions(+)
create mode 100644 kernel/printk/printk_ringbuffer.c
create mode 100644 kernel/printk/printk_ringbuffer.h

diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
index 4d052fc6bcde..eee3dc9b60a9 100644
--- a/kernel/printk/Makefile
+++ b/kernel/printk/Makefile
@@ -2,3 +2,4 @@
obj-y = printk.o
obj-$(CONFIG_PRINTK) += printk_safe.o
obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o
+obj-$(CONFIG_PRINTK) += printk_ringbuffer.o
diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
new file mode 100644
index 000000000000..75d056436cc5
--- /dev/null
+++ b/kernel/printk/printk_ringbuffer.c
@@ -0,0 +1,1674 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kernel.h>
+#include <linux/irqflags.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/bug.h>
+#include "printk_ringbuffer.h"
+
+/**
+ * DOC: printk_ringbuffer overview
+ *
+ * Data Structure
+ * --------------
+ * The printk_ringbuffer is made up of 3 internal ringbuffers:
+ *
+ * desc_ring
+ * A ring of descriptors. A descriptor contains all record meta data
+ * (sequence number, timestamp, loglevel, etc.) as well as internal state
+ * information about the record and logical positions specifying where in
+ * the other ringbuffers the text and dictionary strings are located.
+ *
+ * text_data_ring
+ * A ring of data blocks. A data block consists of an unsigned long
+ * integer (ID) that maps to a desc_ring index followed by the text
+ * string of the record.
+ *
+ * dict_data_ring
+ * A ring of data blocks. A data block consists of an unsigned long
+ * integer (ID) that maps to a desc_ring index followed by the dictionary
+ * string of the record.
+ *
+ * The internal state information of a descriptor is the key element to allow
+ * readers and writers to locklessly synchronize access to the data.
+ *
+ * Implementation
+ * --------------
+ *
+ * Descriptor Ring
+ * ~~~~~~~~~~~~~~~
+ * The descriptor ring is an array of descriptors. A descriptor contains all
+ * the meta data of a printk record as well as blk_lpos structs pointing to
+ * associated text and dictionary data blocks (see "Data Rings" below). Each
+ * descriptor is assigned an ID that maps directly to index values of the
+ * descriptor array and has a state. The ID and the state are bitwise combined
+ * into a single descriptor field named @state_var, allowing ID and state to
+ * be synchronously and atomically updated.
+ *
+ * Descriptors have three states:
+ *
+ * reserved
+ * A writer is modifying the record.
+ *
+ * committed
+ * The record and all its data are complete and available for reading.
+ *
+ * reusable
+ * The record exists, but its text and/or dictionary data may no longer
+ * be available.
+ *
+ * Querying the @state_var of a record requires providing the ID of the
+ * descriptor to query. This can yield a possible fourth (pseudo) state:
+ *
+ * miss
+ * The descriptor being queried has an unexpected ID.
+ *
+ * The descriptor ring has a @tail_id that contains the ID of the oldest
+ * descriptor and @head_id that contains the ID of the newest descriptor.
+ *
+ * When a new descriptor should be created (and the ring is full), the tail
+ * descriptor is invalidated by first transitioning to the reusable state and
+ * then invalidating all tail data blocks up to and including the data blocks
+ * associated with the tail descriptor (for text and dictionary rings). Then
+ * @tail_id is advanced, followed by advancing @head_id. And finally the
+ * @state_var of the new descriptor is initialized to the new ID and reserved
+ * state.
+ *
+ * The @tail_id can only be advanced if the new @tail_id would be in the
+ * committed or reusable queried state. This ensures that a valid
+ * sequence number for the tail is always available.
+ *
+ * Data Rings
+ * ~~~~~~~~~~
+ * The two data rings (text and dictionary) function identically. They exist
+ * separately so that their buffer sizes can be individually set and they do
+ * not affect one another.
+ *
+ * Data rings are byte arrays composed of data blocks. Data blocks are
+ * referenced by blk_lpos structs that point to the logical position of the
+ * beginning of a data block and the beginning of the next adjacent data
+ * block. Logical positions are mapped directly to index values of the byte
+ * array ringbuffer.
+ *
+ * Each data block consists of an ID followed by the raw data. The ID is the
+ * identifier of a descriptor that is associated with the data block. A data
+ * block is considered valid if all of the following conditions are met:
+ *
+ * 1) The descriptor associated with the data block is in the committed
+ * or reusable queried state.
+ *
+ * 2) The blk_lpos struct within the descriptor associated with the data
+ * block references back to the same data block.
+ *
+ * 3) The data block is within the head/tail logical position range.
+ *
+ * If the raw data of a data block would extend beyond the end of the byte
+ * array, only the ID of the data block is stored at the logical position
+ * and the full data block (ID and raw data) is stored at the beginning of
+ * the byte array. The referencing blk_lpos will point to the ID before the
+ * wrap and the next data block will be at the logical position adjacent to
+ * the full data block after the wrap.
+ *
+ * Data rings have a @tail_lpos that points to the beginning of the oldest
+ * data block and a @head_lpos that points to the logical position of the
+ * next (not yet existing) data block.
+ *
+ * When a new data block should be created (and the ring is full), tail data
+ * blocks will first be invalidated by putting their associated descriptors
+ * into the reusable state and then pushing the @tail_lpos forward beyond
+ * them. Then the @head_lpos is pushed forward and is associated with a new
+ * descriptor. If a data block is not valid, the @tail_lpos cannot be
+ * advanced beyond it.
+ *
+ * Usage
+ * -----
+ * Here are some simple examples demonstrating writers and readers. For the
+ * examples a global ringbuffer (test_rb) is available (which is not the
+ * actual ringbuffer used by printk)::
+ *
+ * DECLARE_PRINTKRB(test_rb, 15, 5, 3);
+ *
+ * This ringbuffer allows up to 32768 records (2 ^ 15) and has a size of
+ * 1 MiB (2 ^ (15 + 5)) for text data and 256 KiB (2 ^ (15 + 3)) for
+ * dictionary data.
+ *
+ * Sample writer code::
+ *
+ * const char *dictstr = "dictionary text";
+ * const char *textstr = "message text";
+ * struct prb_reserved_entry e;
+ * struct printk_record r;
+ *
+ * // specify how much to allocate
+ * prb_rec_init_wr(&r, strlen(textstr) + 1, strlen(dictstr) + 1);
+ *
+ * if (prb_reserve(&e, &test_rb, &r)) {
+ * snprintf(r.text_buf, r.text_buf_size, "%s", textstr);
+ *
+ * // dictionary allocation may have failed
+ * if (r.dict_buf)
+ * snprintf(r.dict_buf, r.dict_buf_size, "%s", dictstr);
+ *
+ * r.info->ts_nsec = local_clock();
+ *
+ * prb_commit(&e);
+ * }
+ *
+ * Sample reader code::
+ *
+ * struct printk_info info;
+ * struct printk_record r;
+ * char text_buf[32];
+ * char dict_buf[32];
+ * u64 seq;
+ *
+ * prb_rec_init_rd(&r, &info, &text_buf[0], sizeof(text_buf),
+ * &dict_buf[0], sizeof(dict_buf));
+ *
+ * prb_for_each_record(0, &test_rb, &seq, &r) {
+ * if (info.seq != seq)
+ * pr_warn("lost %llu records\n", info.seq - seq);
+ *
+ * if (info.text_len > r.text_buf_size) {
+ * pr_warn("record %llu text truncated\n", info.seq);
+ * text_buf[sizeof(text_buf) - 1] = 0;
+ * }
+ *
+ * if (info.dict_len > r.dict_buf_size) {
+ * pr_warn("record %llu dict truncated\n", info.seq);
+ * dict_buf[sizeof(dict_buf) - 1] = 0;
+ * }
+ *
+ * pr_info("%llu: %llu: %s;%s\n", info.seq, info.ts_nsec,
+ * &text_buf[0], info.dict_len ? &dict_buf[0] : "");
+ * }
+ *
+ * Note that additional less convenient reader functions are available to
+ * allow complex record access.
+ *
+ * ABA Issues
+ * ~~~~~~~~~~
+ * To help avoid ABA issues, descriptors are referenced by IDs (array index
+ * values with tagged states) and data blocks are referenced by logical
+ * positions (array index values with tagged states). However, on 32-bit
+ * systems the number of tagged states is relatively small such that an ABA
+ * incident is (at least theoretically) possible. For example, if 4 million
+ * maximally sized (1KiB) printk messages were to occur in NMI context on a
+ * 32-bit system, the interrupted context would not be able to recognize that
+ * the 32-bit integer completely wrapped and thus represents a different
+ * data block than the one the interrupted context expects.
+ *
+ * To help combat this possibility, additional state checking is performed
+ * (such as using cmpxchg() even though set() would suffice). These extra
+ * checks are commented as such and will hopefully catch any ABA issue that
+ * a 32-bit system might experience.
+ *
+ * Memory Barriers
+ * ~~~~~~~~~~~~~~~
+ * Multiple memory barriers are used. To simplify proving correctness and
+ * generating litmus tests, lines of code related to memory barriers
+ * (loads, stores, and the associated memory barriers) are labeled::
+ *
+ * LMM(function:letter)
+ *
+ * Comments reference the labels using only the "function:letter" part.
+ *
+ * The memory barrier pairs and their ordering are:
+ *
+ * desc_reserve:D / desc_reserve:B
+ * push descriptor tail (id), then push descriptor head (id)
+ *
+ * desc_reserve:D / data_push_tail:B
+ * push data tail (lpos), then set new descriptor reserved (state)
+ *
+ * desc_reserve:D / desc_push_tail:C
+ * push descriptor tail (id), then set new descriptor reserved (state)
+ *
+ * desc_reserve:D / prb_first_seq:C
+ * push descriptor tail (id), then set new descriptor reserved (state)
+ *
+ * desc_reserve:F / desc_read:D
+ * set new descriptor id and reserved (state), then allow writer changes
+ *
+ * data_alloc:A / desc_read:D
+ * set old descriptor reusable (state), then modify new data block area
+ *
+ * data_alloc:A / data_push_tail:B
+ * push data tail (lpos), then modify new data block area
+ *
+ * prb_commit:B / desc_read:B
+ * store writer changes, then set new descriptor committed (state)
+ *
+ * data_push_tail:D / data_push_tail:A
+ * set descriptor reusable (state), then push data tail (lpos)
+ *
+ * desc_push_tail:B / desc_reserve:D
+ * set descriptor reusable (state), then push descriptor tail (id)
+ */
+
+#define DATA_SIZE(data_ring) _DATA_SIZE((data_ring)->size_bits)
+#define DATA_SIZE_MASK(data_ring) (DATA_SIZE(data_ring) - 1)
+
+#define DESCS_COUNT(desc_ring) _DESCS_COUNT((desc_ring)->count_bits)
+#define DESCS_COUNT_MASK(desc_ring) (DESCS_COUNT(desc_ring) - 1)
+
+/* Determine the data array index from a logical position. */
+#define DATA_INDEX(data_ring, lpos) ((lpos) & DATA_SIZE_MASK(data_ring))
+
+/* Determine the desc array index from an ID or sequence number. */
+#define DESC_INDEX(desc_ring, n) ((n) & DESCS_COUNT_MASK(desc_ring))
+
+/* Determine how many times the data array has wrapped. */
+#define DATA_WRAPS(data_ring, lpos) ((lpos) >> (data_ring)->size_bits)
+
+/* Get the logical position at index 0 of the current wrap. */
+#define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \
+((lpos) & ~DATA_SIZE_MASK(data_ring))
+
+/* Get the ID for the same index of the previous wrap as the given ID. */
+#define DESC_ID_PREV_WRAP(desc_ring, id) \
+DESC_ID((id) - DESCS_COUNT(desc_ring))
+
+/* A data block: maps to the raw data within the data ring. */
+struct prb_data_block {
+ unsigned long id;
+ char data[0];
+};
+
+static struct prb_desc *to_desc(struct prb_desc_ring *desc_ring, u64 n)
+{
+ return &desc_ring->descs[DESC_INDEX(desc_ring, n)];
+}
+
+static struct prb_data_block *to_block(struct prb_data_ring *data_ring,
+ unsigned long begin_lpos)
+{
+ return (void *)&data_ring->data[DATA_INDEX(data_ring, begin_lpos)];
+}
+
+/*
+ * Increase the data size to account for data block meta data plus any
+ * padding so that the adjacent data block is aligned on the ID size.
+ */
+static unsigned int to_blk_size(unsigned int size)
+{
+ struct prb_data_block *db = NULL;
+
+ size += sizeof(*db);
+ size = ALIGN(size, sizeof(db->id));
+ return size;
+}
+
+/*
+ * Sanity checker for reserve size. The ringbuffer code assumes that a data
+ * block does not exceed the maximum possible size that could fit within the
+ * ringbuffer. This function provides that basic size check so that the
+ * assumption is safe.
+ *
+ * Writers are also not allowed to write 0-sized (data-less) records. Such
+ * records are used only internally by the ringbuffer.
+ */
+static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size)
+{
+ struct prb_data_block *db = NULL;
+
+ /*
+ * Writers are not allowed to write data-less records. Such records
+ * are used only internally by the ringbuffer to denote records where
+ * their data failed to allocate or have been lost.
+ */
+ if (size == 0)
+ return false;
+
+ /*
+ * Ensure the alignment padded size could possibly fit in the data
+ * array. The largest possible data block must still leave room for
+ * at least the ID of the next block.
+ */
+ size = to_blk_size(size);
+ if (size > DATA_SIZE(data_ring) - sizeof(db->id))
+ return false;
+
+ return true;
+}
+
+/* The possible responses of a descriptor state-query. */
+enum desc_state {
+ desc_miss, /* ID mismatch */
+ desc_reserved, /* reserved, in use by writer */
+ desc_committed, /* committed, writer is done */
+ desc_reusable, /* free, not yet used by any writer */
+};
+
+/* Query the state of a descriptor. */
+static enum desc_state get_desc_state(unsigned long id,
+ unsigned long state_val)
+{
+ if (id != DESC_ID(state_val))
+ return desc_miss;
+
+ if (state_val & DESC_REUSE_MASK)
+ return desc_reusable;
+
+ if (state_val & DESC_COMMITTED_MASK)
+ return desc_committed;
+
+ return desc_reserved;
+}
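[ Not part of the patch: the state encoding that get_desc_state() decodes
packs the ID and two flag bits into one unsigned long. The mask values
below are modeled on the patch's printk_ringbuffer.h; treat them as
assumptions here, since the header is outside this hunk: ]

```c
/* Flag layout modeled on the patch's printk_ringbuffer.h (assumed). */
#define SV_BITS         (sizeof(unsigned long) * 8)
#define COMMITTED_MASK  (1UL << (SV_BITS - 1))
#define REUSE_MASK      (1UL << (SV_BITS - 2))
#define FLAGS_MASK      (COMMITTED_MASK | REUSE_MASK)
#define ID_OF(sv)       ((sv) & ~FLAGS_MASK)

enum desc_state { desc_miss, desc_reserved, desc_committed, desc_reusable };

/* Same decision ladder as get_desc_state() in the patch. */
static enum desc_state state_of(unsigned long id, unsigned long sv)
{
	if (id != ID_OF(sv))
		return desc_miss;	/* descriptor was recycled */
	if (sv & REUSE_MASK)
		return desc_reusable;	/* tested first: see note below */
	if (sv & COMMITTED_MASK)
		return desc_committed;
	return desc_reserved;		/* ID matches, no flags set */
}
```

[ The reuse bit is tested before the committed bit because
desc_make_reusable() sets DESC_REUSE_MASK on top of an already-set
DESC_COMMITTED_MASK; a reusable descriptor carries both flags. ]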
+
+/*
+ * Get a copy of a specified descriptor and its queried state. A descriptor
+ * that is not in the committed or reusable state must be considered garbage
+ * by the reader.
+ */
+static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
+ unsigned long id, struct prb_desc *desc_out)
+{
+ struct prb_desc *desc = to_desc(desc_ring, id);
+ atomic_long_t *state_var = &desc->state_var;
+ enum desc_state d_state;
+ unsigned long state_val;
+
+ /* Check the descriptor state. */
+ state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
+ d_state = get_desc_state(id, state_val);
+ if (d_state != desc_committed && d_state != desc_reusable)
+ return d_state;
+
+ /*
+ * Guarantee the state is loaded before copying the descriptor
+ * content. This avoids copying obsolete descriptor content that might
+ * not apply to the descriptor state. This pairs with prb_commit:B.
+ *
+ * Memory barrier involvement:
+ *
+ * If desc_read:A reads from prb_commit:B, then desc_read:C reads
+ * from prb_commit:A.
+ *
+ * Relies on:
+ *
+ * WMB from prb_commit:A to prb_commit:B
+ * matching
+ * RMB from desc_read:A to desc_read:C
+ */
+ smp_rmb(); /* LMM(desc_read:B) */
+
+ /*
+ * Copy the descriptor data. The data is not valid until the
+ * state has been re-checked.
+ */
+ memcpy(desc_out, desc, sizeof(*desc_out)); /* LMM(desc_read:C) */
+
+ /*
+ * 1. Guarantee the descriptor content is loaded before re-checking
+ * the state. This avoids reading an obsolete descriptor state
+ * that may not apply to the copied content. This pairs with
+ * desc_reserve:F.
+ *
+ * Memory barrier involvement:
+ *
+ * If desc_read:C reads from desc_reserve:G, then desc_read:E
+ * reads from desc_reserve:F.
+ *
+ * Relies on:
+ *
+ * WMB from desc_reserve:F to desc_reserve:G
+ * matching
+ * RMB from desc_read:C to desc_read:E
+ *
+ * 2. Guarantee the record data is loaded before re-checking the
+ * state. This avoids reading an obsolete descriptor state that may
+ * not apply to the copied data. This pairs with data_alloc:A.
+ *
+ * Memory barrier involvement:
+ *
+ * If copy_data:A reads from data_alloc:B, then desc_read:E
+ * reads from desc_make_reusable:A.
+ *
+ * Relies on:
+ *
+ * MB from desc_make_reusable:A to data_alloc:B
+ * matching
+ * RMB from desc_read:C to desc_read:E
+ *
+ * Note: desc_make_reusable:A and data_alloc:B can be different
+ * CPUs. However, the data_alloc:B CPU (which performs the
+ * full memory barrier) must have previously seen
+ * desc_make_reusable:A.
+ */
+ smp_rmb(); /* LMM(desc_read:D) */
+
+ /* Re-check the descriptor state. */
+ state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
+ return get_desc_state(id, state_val);
+}
+
+/*
+ * Take a specified descriptor out of the committed state by attempting
+ * the transition from committed to reusable. If the transition fails, it
+ * is because some other context has already performed it.
+ */
+static void desc_make_reusable(struct prb_desc_ring *desc_ring,
+ unsigned long id)
+{
+ unsigned long val_committed = id | DESC_COMMITTED_MASK;
+ unsigned long val_reusable = val_committed | DESC_REUSE_MASK;
+ struct prb_desc *desc = to_desc(desc_ring, id);
+ atomic_long_t *state_var = &desc->state_var;
+
+ atomic_long_cmpxchg_relaxed(state_var, val_committed,
+ val_reusable); /* LMM(desc_make_reusable:A) */
+}
+
+/*
+ * Given a data ring (text or dict), put the associated descriptor of each
+ * data block from @lpos_begin until @lpos_end into the reusable state.
+ *
+ * If there is any problem making the associated descriptor reusable, either
+ * the descriptor has not yet been committed or another writer context has
+ * already pushed the tail lpos past the problematic data block. Regardless,
+ * on error the caller can re-load the tail lpos to determine the situation.
+ */
+static bool data_make_reusable(struct printk_ringbuffer *rb,
+ struct prb_data_ring *data_ring,
+ unsigned long lpos_begin,
+ unsigned long lpos_end,
+ unsigned long *lpos_out)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ struct prb_data_blk_lpos *blk_lpos;
+ struct prb_data_block *blk;
+ enum desc_state d_state;
+ struct prb_desc desc;
+ unsigned long id;
+
+ /*
+ * Using the provided @data_ring, point @blk_lpos to the correct
+ * blk_lpos within the local copy of the descriptor.
+ */
+ if (data_ring == &rb->text_data_ring)
+ blk_lpos = &desc.text_blk_lpos;
+ else
+ blk_lpos = &desc.dict_blk_lpos;
+
+ /* Loop until @lpos_begin has advanced to or beyond @lpos_end. */
+ while ((lpos_end - lpos_begin) - 1 < DATA_SIZE(data_ring)) {
+ blk = to_block(data_ring, lpos_begin);
+
+ /*
+ * Load the block ID from the data block. This is a data race
+ * against a writer that may have newly reserved this data
+ * area. If the loaded value matches a valid descriptor ID,
+ * the blk_lpos of that descriptor will be checked to make
+ * sure it points back to this data block. If the check fails,
+ * the data area has been recycled by another writer.
+ */
+ id = blk->id; /* LMM(data_make_reusable:A) */
+
+ d_state = desc_read(desc_ring, id, &desc); /* LMM(data_make_reusable:B) */
+
+ switch (d_state) {
+ case desc_miss:
+ return false;
+ case desc_reserved:
+ return false;
+ case desc_committed:
+ /*
+ * This data block is invalid if the descriptor
+ * does not point back to it.
+ */
+ if (blk_lpos->begin != lpos_begin)
+ return false;
+ desc_make_reusable(desc_ring, id);
+ break;
+ case desc_reusable:
+ /*
+ * This data block is invalid if the descriptor
+ * does not point back to it.
+ */
+ if (blk_lpos->begin != lpos_begin)
+ return false;
+ break;
+ }
+
+ /* Advance @lpos_begin to the next data block. */
+ lpos_begin = blk_lpos->next;
+ }
+
+ *lpos_out = lpos_begin;
+ return true;
+}
+
+/*
+ * Advance the data ring tail to at least @lpos. This function puts
+ * descriptors into the reusable state if the tail is pushed beyond
+ * their associated data block.
+ */
+static bool data_push_tail(struct printk_ringbuffer *rb,
+ struct prb_data_ring *data_ring,
+ unsigned long lpos)
+{
+ unsigned long tail_lpos_new;
+ unsigned long tail_lpos;
+ unsigned long next_lpos;
+
+ /* If @lpos is not valid, there is nothing to do. */
+ if (lpos == INVALID_LPOS)
+ return true;
+
+ /*
+ * Any descriptor states that have transitioned to reusable due to the
+ * data tail being pushed to this loaded value will be visible to this
+ * CPU. This pairs with data_push_tail:D.
+ *
+ * Memory barrier involvement:
+ *
+ * If data_push_tail:A reads from data_push_tail:D, then this CPU can
+ * see desc_make_reusable:A.
+ *
+ * Relies on:
+ *
+ * MB from desc_make_reusable:A to data_push_tail:D
+ * matches
+ * READFROM from data_push_tail:D to data_push_tail:A
+ * thus
+ * READFROM from desc_make_reusable:A to this CPU
+ */
+ tail_lpos = atomic_long_read(&data_ring->tail_lpos); /* LMM(data_push_tail:A) */
+
+ /*
+ * Loop until the tail lpos is at or beyond @lpos. This condition
+ * may already be satisfied, resulting in no full memory barrier
+ * from data_push_tail:D being performed. However, since this CPU
+ * sees the new tail lpos, any descriptor states that transitioned to
+ * the reusable state must already be visible.
+ */
+ while ((lpos - tail_lpos) - 1 < DATA_SIZE(data_ring)) {
+ /*
+ * Make all descriptors reusable that are associated with
+ * data blocks before @lpos.
+ */
+ if (!data_make_reusable(rb, data_ring, tail_lpos, lpos,
+ &next_lpos)) {
+ /*
+ * 1. Guarantee the load of the block ID in
+ * data_make_reusable() is performed before
+ * reloading the tail lpos. The failed
+ * data_make_reusable() may be due to a newly
+ * recycled data area causing the tail lpos to
+ * have been previously pushed. This pairs with
+ * data_alloc:A.
+ *
+ * Memory barrier involvement:
+ *
+ * If data_make_reusable:A reads from data_alloc:B,
+ * then data_push_tail:C reads from
+ * data_push_tail:D.
+ *
+ * Relies on:
+ *
+ * MB from data_push_tail:D to data_alloc:B
+ * matching
+ * RMB from data_make_reusable:A to
+ * data_push_tail:C
+ *
+ * Note: data_push_tail:D and data_alloc:B can be
+ * different CPUs. However, the data_alloc:B
+ * CPU (which performs the full memory
+ * barrier) must have previously seen
+ * data_push_tail:D.
+ *
+ * 2. Guarantee the load of the descriptor state in
+ * data_make_reusable() is performed before
+ * reloading the tail lpos. The failed
+ * data_make_reusable() may be due to a newly
+ * recycled descriptor causing the tail lpos to
+ * have been previously pushed. This pairs with
+ * desc_reserve:D.
+ *
+ * Memory barrier involvement:
+ *
+ * If data_make_reusable:B reads from
+ * desc_reserve:F, then data_push_tail:C reads
+ * from data_push_tail:D.
+ *
+ * Relies on:
+ *
+ * MB from data_push_tail:D to desc_reserve:F
+ * matching
+ * RMB from data_make_reusable:B to
+ * data_push_tail:C
+ *
+ * Note: data_push_tail:D and desc_reserve:F can
+ * be different CPUs. However, the
+ * desc_reserve:F CPU (which performs the
+ * full memory barrier) must have previously
+ * seen data_push_tail:D.
+ */
+ smp_rmb(); /* LMM(data_push_tail:B) */
+
+ tail_lpos_new = atomic_long_read(&data_ring->tail_lpos); /* LMM(data_push_tail:C) */
+ if (tail_lpos_new == tail_lpos)
+ return false;
+
+ /* Another CPU pushed the tail. Try again. */
+ tail_lpos = tail_lpos_new;
+ continue;
+ }
+
+ /*
+ * Guarantee any descriptor states that have transitioned to
+ * reusable are stored before pushing the tail lpos. A full
+ * memory barrier is needed since other CPUs may have made
+ * the descriptor states reusable. This pairs with
+ * data_push_tail:A.
+ */
+ if (atomic_long_try_cmpxchg(&data_ring->tail_lpos, &tail_lpos,
+ next_lpos)) { /* LMM(data_push_tail:D) */
+ break;
+ }
+ }
+
+ return true;
+}
+
+/*
+ * Advance the desc ring tail. This function advances the tail by one
+ * descriptor, thus invalidating the oldest descriptor. Before advancing
+ * the tail, the tail descriptor is made reusable and all data blocks up to
+ * and including the descriptor's data block are invalidated (i.e. the data
+ * ring tail is pushed past the data block of the descriptor being made
+ * reusable).
+ */
+static bool desc_push_tail(struct printk_ringbuffer *rb,
+ unsigned long tail_id)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ enum desc_state d_state;
+ struct prb_desc desc;
+
+ d_state = desc_read(desc_ring, tail_id, &desc);
+
+ switch (d_state) {
+ case desc_miss:
+ /*
+ * If the ID is exactly 1 wrap behind the expected, it is
+ * in the process of being reserved by another writer and
+ * must be considered reserved.
+ */
+ if (DESC_ID(atomic_long_read(&desc.state_var)) ==
+ DESC_ID_PREV_WRAP(desc_ring, tail_id)) {
+ return false;
+ }
+
+ /*
+ * The ID has changed. Another writer must have pushed the
+ * tail and recycled the descriptor already. Success is
+ * returned because the caller is only interested in the
+ * specified tail being pushed, which it was.
+ */
+ return true;
+ case desc_reserved:
+ return false;
+ case desc_committed:
+ desc_make_reusable(desc_ring, tail_id);
+ break;
+ case desc_reusable:
+ break;
+ }
+
+ /*
+ * Data blocks must be invalidated before their associated
+ * descriptor can be made available for recycling. Invalidating
+ * them later is not possible because there is no way to trust
+ * data blocks once their associated descriptor is gone.
+ */
+
+ if (!data_push_tail(rb, &rb->text_data_ring, desc.text_blk_lpos.next))
+ return false;
+ if (!data_push_tail(rb, &rb->dict_data_ring, desc.dict_blk_lpos.next))
+ return false;
+
+ /*
+ * Check the next descriptor after @tail_id before pushing the tail
+ * to it because the tail must always be in a committed or reusable
+ * state. The implementation of prb_first_seq() relies on this.
+ *
+ * A successful read implies that the ID of the next descriptor is less
+ * than or equal to @head_id so there is no risk of pushing the tail past
+ * the head.
+ */
+ d_state = desc_read(desc_ring, DESC_ID(tail_id + 1), &desc); /* LMM(desc_push_tail:A) */
+
+ if (d_state == desc_committed || d_state == desc_reusable) {
+ /*
+ * Guarantee any descriptor states that have transitioned to
+ * reusable are stored before pushing the tail ID. This allows
+ * verifying the recycled descriptor state. A full memory
+ * barrier is needed since other CPUs may have made the
+ * descriptor states reusable. This pairs with desc_reserve:D.
+ */
+ atomic_long_cmpxchg(&desc_ring->tail_id, tail_id,
+ DESC_ID(tail_id + 1)); /* LMM(desc_push_tail:B) */
+ } else {
+ /*
+ * Guarantee the last state load from desc_read() is before
+ * reloading @tail_id in order to see a new tail ID in the
+ * case that the descriptor has been recycled. This pairs
+ * with desc_reserve:D.
+ *
+ * Memory barrier involvement:
+ *
+ * If desc_push_tail:A reads from desc_reserve:F, then
+ * desc_push_tail:D reads from desc_push_tail:B.
+ *
+ * Relies on:
+ *
+ * MB from desc_push_tail:B to desc_reserve:F
+ * matching
+ * RMB from desc_push_tail:A to desc_push_tail:D
+ *
+ * Note: desc_push_tail:B and desc_reserve:F can be different
+ * CPUs. However, the desc_reserve:F CPU (which performs
+ * the full memory barrier) must have previously seen
+ * desc_push_tail:B.
+ */
+ smp_rmb(); /* LMM(desc_push_tail:C) */
+
+ /*
+ * Re-check the tail ID. The descriptor following @tail_id is
+ * not in an allowed tail state. But if the tail has since
+ * been moved by another CPU, then it does not matter.
+ */
+ if (atomic_long_read(&desc_ring->tail_id) == tail_id) /* LMM(desc_push_tail:D) */
+ return false;
+ }
+
+ return true;
+}
+
+/* Reserve a new descriptor, invalidating the oldest if necessary. */
+static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ unsigned long prev_state_val;
+ unsigned long id_prev_wrap;
+ struct prb_desc *desc;
+ unsigned long head_id;
+ unsigned long id;
+
+ head_id = atomic_long_read(&desc_ring->head_id); /* LMM(desc_reserve:A) */
+
+ do {
+ desc = to_desc(desc_ring, head_id);
+
+ id = DESC_ID(head_id + 1);
+ id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
+
+ /*
+ * Guarantee the head ID is read before reading the tail ID.
+ * Since the tail ID is updated before the head ID, this
+ * guarantees that @id_prev_wrap is never ahead of the tail
+ * ID. This pairs with desc_reserve:D.
+ *
+ * Memory barrier involvement:
+ *
+ * If desc_reserve:A reads from desc_reserve:D, then
+ * desc_reserve:C reads from desc_push_tail:B.
+ *
+ * Relies on:
+ *
+ * MB from desc_push_tail:B to desc_reserve:D
+ * matching
+ * RMB from desc_reserve:A to desc_reserve:C
+ *
+ * Note: desc_push_tail:B and desc_reserve:D can be different
+ * CPUs. However, the desc_reserve:D CPU (which performs
+ * the full memory barrier) must have previously seen
+ * desc_push_tail:B.
+ */
+ smp_rmb(); /* LMM(desc_reserve:B) */
+
+ if (id_prev_wrap == atomic_long_read(&desc_ring->tail_id)) { /* LMM(desc_reserve:C) */
+ /*
+ * Make space for the new descriptor by
+ * advancing the tail.
+ */
+ if (!desc_push_tail(rb, id_prev_wrap))
+ return false;
+ }
+
+ /*
+ * 1. Guarantee the tail ID is read before validating the
+ * recycled descriptor state. A read memory barrier is
+ * sufficient for this. This pairs with desc_push_tail:B.
+ *
+ * Memory barrier involvement:
+ *
+ * If desc_reserve:C reads from desc_push_tail:B, then
+ * desc_reserve:E reads from desc_make_reusable:A.
+ *
+ * Relies on:
+ *
+ * MB from desc_make_reusable:A to desc_push_tail:B
+ * matching
+ * RMB from desc_reserve:C to desc_reserve:E
+ *
+ * Note: desc_make_reusable:A and desc_push_tail:B can be
+ * different CPUs. However, the desc_push_tail:B CPU
+ * (which performs the full memory barrier) must have
+ * previously seen desc_make_reusable:A.
+ *
+ * 2. Guarantee the tail ID is stored before storing the head
+ * ID. This pairs with desc_reserve:B.
+ *
+ * 3. Guarantee any data ring tail changes are stored before
+ * recycling the descriptor. Data ring tail changes can
+ * happen via desc_push_tail()->data_push_tail(). A full
+ * memory barrier is needed since another CPU may have
+ * pushed the data ring tails. This pairs with
+ * data_push_tail:B.
+ *
+ * 4. Guarantee a new tail ID is stored before recycling the
+ * descriptor. A full memory barrier is needed since
+ * another CPU may have pushed the tail ID. This pairs
+ * with desc_push_tail:C and this also pairs with
+ * prb_first_seq:C.
+ */
+ } while (!atomic_long_try_cmpxchg(&desc_ring->head_id, &head_id,
+ id)); /* LMM(desc_reserve:D) */
+
+ desc = to_desc(desc_ring, id);
+
+ /*
+ * If the descriptor has been recycled, verify the old state val.
+ * See "ABA Issues" about why this verification is performed.
+ */
+ prev_state_val = atomic_long_read(&desc->state_var); /* LMM(desc_reserve:E) */
+ if (prev_state_val &&
+ prev_state_val != (id_prev_wrap | DESC_COMMITTED_MASK | DESC_REUSE_MASK)) {
+ WARN_ON_ONCE(1);
+ return false;
+ }
+
+ /*
+ * Assign the descriptor a new ID and set its state to reserved.
+ * See "ABA Issues" about why cmpxchg() instead of set() is used.
+ *
+ * Guarantee the new descriptor ID and state is stored before making
+ * any other changes. A write memory barrier is sufficient for this.
+ * This pairs with desc_read:D.
+ */
+ if (!atomic_long_try_cmpxchg(&desc->state_var, &prev_state_val,
+ id | 0)) { /* LMM(desc_reserve:F) */
+ WARN_ON_ONCE(1);
+ return false;
+ }
+
+ /* Now data in @desc can be modified: LMM(desc_reserve:G) */
+
+ *id_out = id;
+ return true;
+}
+
+/* Determine the end of a data block. */
+static unsigned long get_next_lpos(struct prb_data_ring *data_ring,
+ unsigned long lpos, unsigned int size)
+{
+ unsigned long begin_lpos;
+ unsigned long next_lpos;
+
+ begin_lpos = lpos;
+ next_lpos = lpos + size;
+
+ /* First check if the data block does not wrap. */
+ if (DATA_WRAPS(data_ring, begin_lpos) == DATA_WRAPS(data_ring, next_lpos))
+ return next_lpos;
+
+ /* Wrapping data blocks store their data at the beginning. */
+ return (DATA_THIS_WRAP_START_LPOS(data_ring, next_lpos) + size);
+}
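[ Not part of the patch: the wrap handling in get_next_lpos() is easier
to see with a tiny power-of-2 ring. The macros below are simplified
stand-ins (assumptions) for the DATA_* macros defined earlier in the
file: ]

```c
#define RING_SIZE     32UL			/* hypothetical power-of-2 ring size */
#define INDEX_OF(l)   ((l) & (RING_SIZE - 1))	/* models DATA_INDEX() */
#define WRAPS_OF(l)   ((l) / RING_SIZE)		/* models DATA_WRAPS() */
/* Models DATA_THIS_WRAP_START_LPOS(): first lpos of @l's wrap. */
#define WRAP_START(l) ((l) & ~(RING_SIZE - 1))

/* Model of get_next_lpos(). */
static unsigned long next_lpos_of(unsigned long lpos, unsigned int size)
{
	unsigned long next = lpos + size;

	/* The block does not cross a wrap boundary: done. */
	if (WRAPS_OF(lpos) == WRAPS_OF(next))
		return next;

	/* Wrapping blocks store their data at the start of the new wrap. */
	return WRAP_START(next) + size;
}
```

[ Starting at lpos 28, an 8-byte block cannot fit before the boundary at
32, so its data lands at lpos [32, 40) and bytes 28..31 become padding. ]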
+
+/*
+ * Allocate a new data block, invalidating the oldest data block(s)
+ * if necessary. This function also associates the data block with
+ * a specified descriptor.
+ */
+static char *data_alloc(struct printk_ringbuffer *rb,
+ struct prb_data_ring *data_ring, unsigned int size,
+ struct prb_data_blk_lpos *blk_lpos, unsigned long id)
+{
+ struct prb_data_block *blk;
+ unsigned long begin_lpos;
+ unsigned long next_lpos;
+
+ if (!data_ring->data || size == 0) {
+ /* Specify a data-less block. */
+ blk_lpos->begin = INVALID_LPOS;
+ blk_lpos->next = INVALID_LPOS;
+ return NULL;
+ }
+
+ size = to_blk_size(size);
+
+ begin_lpos = atomic_long_read(&data_ring->head_lpos);
+
+ do {
+ next_lpos = get_next_lpos(data_ring, begin_lpos, size);
+
+ if (!data_push_tail(rb, data_ring, next_lpos - DATA_SIZE(data_ring))) {
+ /* Failed to allocate, specify a data-less block. */
+ blk_lpos->begin = INVALID_LPOS;
+ blk_lpos->next = INVALID_LPOS;
+ return NULL;
+ }
+
+ /*
+ * 1. Guarantee any descriptor states that have transitioned
+ * to reusable are stored before modifying the newly
+ * allocated data area. A full memory barrier is needed
+ * since other CPUs may have made the descriptor states
+ * reusable. See data_push_tail:A about why the reusable
+ * states are visible. This pairs with desc_read:D.
+ *
+ * 2. Guarantee any updated tail lpos is stored before
+ * modifying the newly allocated data area. Another CPU may
+ * be in data_make_reusable() and is reading a block ID
+ * from this area. data_make_reusable() can handle reading
+ * a garbage block ID value, but then it must be able to
+ * load a new tail lpos. A full memory barrier is needed
+ * since other CPUs may have updated the tail lpos. This
+ * pairs with data_push_tail:B.
+ */
+ } while (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &begin_lpos,
+ next_lpos)); /* LMM(data_alloc:A) */
+
+ blk = to_block(data_ring, begin_lpos);
+ blk->id = id; /* LMM(data_alloc:B) */
+
+ if (DATA_WRAPS(data_ring, begin_lpos) != DATA_WRAPS(data_ring, next_lpos)) {
+ /* Wrapping data blocks store their data at the beginning. */
+ blk = to_block(data_ring, 0);
+
+ /*
+ * Store the ID on the wrapped block for consistency.
+ * The printk_ringbuffer does not actually use it.
+ */
+ blk->id = id;
+ }
+
+ blk_lpos->begin = begin_lpos;
+ blk_lpos->next = next_lpos;
+
+ return &blk->data[0];
+}
+
+/* Return the number of bytes used by a data block. */
+static unsigned int space_used(struct prb_data_ring *data_ring,
+ struct prb_data_blk_lpos *blk_lpos)
+{
+ if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next)) {
+ /* Data block does not wrap. */
+ return (DATA_INDEX(data_ring, blk_lpos->next) -
+ DATA_INDEX(data_ring, blk_lpos->begin));
+ }
+
+ /*
+ * For wrapping data blocks, the trailing (wasted) space is
+ * also counted.
+ */
+ return (DATA_INDEX(data_ring, blk_lpos->next) +
+ DATA_SIZE(data_ring) - DATA_INDEX(data_ring, blk_lpos->begin));
+}
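[ Not part of the patch: for the wrapping case, space_used() charges the
wasted tail of the old wrap to the block. A userspace model, with the
same hypothetical 32-byte ring standing in for the DATA_* macros: ]

```c
#define RING_SIZE   32UL			/* hypothetical power-of-2 ring size */
#define INDEX_OF(l) ((l) & (RING_SIZE - 1))	/* models DATA_INDEX() */
#define WRAPS_OF(l) ((l) / RING_SIZE)		/* models DATA_WRAPS() */

/* Model of space_used() for a block spanning [begin, next). */
static unsigned int used_bytes(unsigned long begin, unsigned long next)
{
	/* Non-wrapping block: plain difference of the indexes. */
	if (WRAPS_OF(begin) == WRAPS_OF(next))
		return INDEX_OF(next) - INDEX_OF(begin);

	/* Wrapping block: data bytes plus the wasted tail of the old wrap. */
	return INDEX_OF(next) + RING_SIZE - INDEX_OF(begin);
}
```

[ A block allocated at lpos 28 that wrapped to [32, 40) reports 12 bytes
used: 8 bytes of data plus the 4 padding bytes at 28..31. ]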
+
+/**
+ * prb_reserve() - Reserve space in the ringbuffer.
+ *
+ * @e: The entry structure to setup.
+ * @rb: The ringbuffer to reserve data in.
+ * @r: The record structure to allocate buffers for.
+ *
+ * This is the public function available to writers to reserve data.
+ *
+ * The writer specifies the text and dict sizes to reserve by setting the
+ * @text_buf_size and @dict_buf_size fields of @r, respectively. Dictionaries
+ * are optional, so @dict_buf_size is allowed to be 0. To ensure proper
+ * initialization of @r, prb_rec_init_wr() should be used.
+ *
+ * Context: Any context. Disables local interrupts on success.
+ * Return: true if at least text data could be allocated, otherwise false.
+ *
+ * On success, the fields @info, @text_buf, @dict_buf of @r will be set by
+ * this function and should be filled in by the writer before committing. Also
+ * on success, prb_record_text_space() can be used on @e to query the actual
+ * space used for the text data block.
+ *
+ * If the function fails to reserve dictionary space (but all else succeeded),
+ * it will still report success. In that case @dict_buf is set to NULL and
+ * @dict_buf_size is set to 0. Writers must check this before writing to
+ * dictionary space.
+ *
+ * @info->text_len and @info->dict_len will already be set to @text_buf_size
+ * and @dict_buf_size, respectively. If dictionary space reservation fails,
+ * @info->dict_len is set to 0.
+ */
+bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
+ struct printk_record *r)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ struct prb_desc *d;
+ unsigned long id;
+
+ if (!data_check_size(&rb->text_data_ring, r->text_buf_size))
+ goto fail;
+
+ /* Records are allowed to not have dictionaries. */
+ if (r->dict_buf_size) {
+ if (!data_check_size(&rb->dict_data_ring, r->dict_buf_size))
+ goto fail;
+ }
+
+ /*
+ * Descriptors in the reserved state act as blockers to all further
+ * reservations once the desc_ring has fully wrapped. Disable
+ * interrupts during the reserve/commit window in order to minimize
+ * the likelihood of this happening.
+ */
+ local_irq_save(e->irqflags);
+
+ if (!desc_reserve(rb, &id)) {
+ /* Descriptor reservation failures are tracked. */
+ atomic_long_inc(&rb->fail);
+ local_irq_restore(e->irqflags);
+ goto fail;
+ }
+
+ d = to_desc(desc_ring, id);
+
+ /*
+ * Set the @e fields here so that prb_commit() can be used if
+ * text data allocation fails.
+ */
+ e->rb = rb;
+ e->id = id;
+
+ /*
+ * Initialize the sequence number if it has "never been set".
+ * Otherwise just increment it by a full wrap.
+ *
+ * @seq is considered "never been set" if it has a value of 0,
+ * _except_ for @descs[0], which was specially setup by the ringbuffer
+ * initializer and therefore is always considered as set.
+ *
+ * See the "Bootstrap" comment block in printk_ringbuffer.h for
+ * details about how the initializer bootstraps the descriptors.
+ */
+ if (d->info.seq == 0 && DESC_INDEX(desc_ring, id) != 0)
+ d->info.seq = DESC_INDEX(desc_ring, id);
+ else
+ d->info.seq += DESCS_COUNT(desc_ring);
+
+ r->text_buf = data_alloc(rb, &rb->text_data_ring, r->text_buf_size,
+ &d->text_blk_lpos, id);
+ /* If text data allocation fails, a data-less record is committed. */
+ if (r->text_buf_size && !r->text_buf) {
+ d->info.text_len = 0;
+ d->info.dict_len = 0;
+ prb_commit(e);
+ /* prb_commit() re-enabled interrupts. */
+ goto fail;
+ }
+
+ r->dict_buf = data_alloc(rb, &rb->dict_data_ring, r->dict_buf_size,
+ &d->dict_blk_lpos, id);
+ /*
+ * If dict data allocation fails, the caller can still commit
+ * text. But dictionary information will not be available.
+ */
+ if (r->dict_buf_size && !r->dict_buf)
+ r->dict_buf_size = 0;
+
+ r->info = &d->info;
+
+ /* Set default values for the sizes. */
+ d->info.text_len = r->text_buf_size;
+ d->info.dict_len = r->dict_buf_size;
+
+ /* Record full text space used by record. */
+ e->text_space = space_used(&rb->text_data_ring, &d->text_blk_lpos);
+
+ return true;
+fail:
+ /* Make it clear to the caller that the reserve failed. */
+ memset(r, 0, sizeof(*r));
+ return false;
+}
+
+/**
+ * prb_commit() - Commit (previously reserved) data to the ringbuffer.
+ *
+ * @e: The entry containing the reserved data information.
+ *
+ * This is the public function available to writers to commit data.
+ *
+ * Context: Any context. Enables local interrupts.
+ */
+void prb_commit(struct prb_reserved_entry *e)
+{
+ struct prb_desc_ring *desc_ring = &e->rb->desc_ring;
+ struct prb_desc *d = to_desc(desc_ring, e->id);
+ unsigned long prev_state_val = e->id | 0;
+
+ /* Now the writer has finished all writing: LMM(prb_commit:A) */
+
+ /*
+ * Set the descriptor as committed. See "ABA Issues" about why
+ * cmpxchg() instead of set() is used.
+ *
+ * Guarantee all record data is stored before the descriptor state
+ * is stored as committed. A write memory barrier is sufficient for
+ * this. This pairs with desc_read:B.
+ */
+ if (!atomic_long_try_cmpxchg(&d->state_var, &prev_state_val,
+ e->id | DESC_COMMITTED_MASK)) { /* LMM(prb_commit:B) */
+ WARN_ON_ONCE(1);
+ }
+
+ /* Restore interrupts, the reserve/commit window is finished. */
+ local_irq_restore(e->irqflags);
+}
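[ Not part of the patch: the reserve/commit/reuse cycle walks one
state_var through its values, always via a compare-and-exchange on the
full old value. A single-threaded userspace model (mask layout assumed
from printk_ringbuffer.h; try_cmpxchg() below is a plain stand-in for
atomic_long_try_cmpxchg()): ]

```c
#define SV_BITS         (sizeof(unsigned long) * 8)
#define COMMITTED_MASK  (1UL << (SV_BITS - 1))	/* assumed flag layout */
#define REUSE_MASK      (1UL << (SV_BITS - 2))

/* Single-threaded stand-in for atomic_long_try_cmpxchg(). */
static int try_cmpxchg(unsigned long *sv, unsigned long old,
		       unsigned long new_val)
{
	if (*sv != old)
		return 0;
	*sv = new_val;
	return 1;
}

/* Walk one never-used descriptor through the full cycle. */
static unsigned long demo_cycle(unsigned long id)
{
	unsigned long sv = 0;	/* "never been used" state */
	int ok = 1;

	ok &= try_cmpxchg(&sv, 0, id);			/* desc_reserve:F */
	ok &= try_cmpxchg(&sv, id,
			  id | COMMITTED_MASK);		/* prb_commit:B */
	ok &= try_cmpxchg(&sv, id | COMMITTED_MASK,
			  id | COMMITTED_MASK |
			  REUSE_MASK);			/* desc_make_reusable:A */
	return ok ? sv : 0;
}
```

[ Because every transition names the exact expected old value, a context
that loses a race simply fails its cmpxchg; this is the "ABA Issues"
verification the code comments refer to. ]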
+
+/*
+ * Given @blk_lpos, return a pointer to the raw data from the data block
+ * and calculate the size of the data part. A NULL pointer is returned
+ * if @blk_lpos specifies values that could never be legal.
+ *
+ * This function (used by readers) performs strict validation on the lpos
+ * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is
+ * triggered if an internal error is detected.
+ */
+static char *get_data(struct prb_data_ring *data_ring,
+ struct prb_data_blk_lpos *blk_lpos,
+ unsigned int *data_size)
+{
+ struct prb_data_block *db;
+
+ /* Data-less data block description. */
+ if (blk_lpos->begin == INVALID_LPOS &&
+ blk_lpos->next == INVALID_LPOS) {
+ return NULL;
+ }
+
+ /* Regular data block: @begin less than @next and in same wrap. */
+ if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next) &&
+ blk_lpos->begin < blk_lpos->next) {
+ db = to_block(data_ring, blk_lpos->begin);
+ *data_size = blk_lpos->next - blk_lpos->begin;
+
+ /* Wrapping data block: @begin is one wrap behind @next. */
+ } else if (DATA_WRAPS(data_ring, blk_lpos->begin + DATA_SIZE(data_ring)) ==
+ DATA_WRAPS(data_ring, blk_lpos->next)) {
+ db = to_block(data_ring, 0);
+ *data_size = DATA_INDEX(data_ring, blk_lpos->next);
+
+ /* Illegal block description. */
+ } else {
+ WARN_ON_ONCE(1);
+ return NULL;
+ }
+
+ /* A valid data block will always be aligned to the ID size. */
+ if (WARN_ON_ONCE(blk_lpos->begin != ALIGN(blk_lpos->begin, sizeof(db->id))) ||
+ WARN_ON_ONCE(blk_lpos->next != ALIGN(blk_lpos->next, sizeof(db->id)))) {
+ return NULL;
+ }
+
+ /* A valid data block will always have at least an ID. */
+ if (WARN_ON_ONCE(*data_size < sizeof(db->id)))
+ return NULL;
+
+ /* Subtract block ID space from size to reflect data size. */
+ *data_size -= sizeof(db->id);
+
+ return &db->data[0];
+}
+
+/**
+ * prb_count_lines() - Count the number of lines in provided text.
+ *
+ * @text: The text to count the lines of.
+ * @text_size: The size of the text to process.
+ *
+ * This is the public function available to readers to count the number of
+ * lines in a text string.
+ *
+ * Context: Any context.
+ * Return: The number of lines in the text.
+ *
+ * All text has at least 1 line (even if @text_size is 0). Each '\n'
+ * processed is counted as an additional line.
+ */
+unsigned int prb_count_lines(char *text, unsigned int text_size)
+{
+ unsigned int next_size = text_size;
+ unsigned int line_count = 1;
+ char *next = text;
+
+ while (next_size) {
+ next = memchr(next, '\n', next_size);
+ if (!next)
+ break;
+ line_count++;
+ next++;
+ next_size = text_size - (next - text);
+ }
+
+ return line_count;
+}
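[ Not part of the patch: prb_count_lines() ports to userspace almost
verbatim, which makes its edge cases easy to check (const added,
otherwise the same memchr() loop): ]

```c
#include <string.h>

/* Userspace copy of prb_count_lines(): all text has at least 1 line. */
static unsigned int count_lines(const char *text, unsigned int text_size)
{
	unsigned int next_size = text_size;
	unsigned int line_count = 1;
	const char *next = text;

	while (next_size) {
		next = memchr(next, '\n', next_size);
		if (!next)
			break;
		line_count++;
		next++;
		next_size = text_size - (unsigned int)(next - text);
	}

	return line_count;
}
```

[ Note that a trailing '\n' counts as starting one more (empty) line,
matching the kernel-doc: each '\n' processed adds a line. ]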
+
+/*
+ * Given @blk_lpos, copy an expected @len of data into the provided buffer.
+ * If @line_count is provided, count the number of lines in the data.
+ *
+ * This function (used by readers) performs strict validation on the data
+ * size to possibly detect bugs in the writer code. A WARN_ON_ONCE() is
+ * triggered if an internal error is detected.
+ */
+static bool copy_data(struct prb_data_ring *data_ring,
+ struct prb_data_blk_lpos *blk_lpos, u16 len, char *buf,
+ unsigned int buf_size, unsigned int *line_count)
+{
+ unsigned int data_size;
+ char *data;
+
+ /* Caller might not want any data. */
+ if ((!buf || !buf_size) && !line_count)
+ return true;
+
+ data = get_data(data_ring, blk_lpos, &data_size);
+ if (!data)
+ return false;
+
+ /*
+ * The actual size cannot be less than the expected size. It can be
+ * larger because of the trailing alignment padding.
+ */
+ if (WARN_ON_ONCE(data_size < (unsigned int)len)) {
+ pr_warn_once("wrong data size (%u, expecting %hu) for data: %.*s\n",
+ data_size, len, data_size, data);
+ return false;
+ }
+
+ /* Caller interested in the line count? */
+ if (line_count)
+ *line_count = prb_count_lines(data, data_size);
+
+ /* Caller interested in the data content? */
+ if (!buf || !buf_size)
+ return true;
+
+ data_size = min_t(u16, buf_size, len);
+
+ if (!WARN_ON_ONCE(!data_size))
+ memcpy(&buf[0], data, data_size); /* LMM(copy_data:A) */
+ return true;
+}
+
+/*
+ * This is an extended version of desc_read(). It gets a copy of a specified
+ * descriptor. However, it also verifies that the record is committed and has
+ * the sequence number @seq. On success, 0 is returned.
+ *
+ * Error return values:
+ * -EINVAL: A committed record with sequence number @seq does not exist.
+ * -ENOENT: A committed record with sequence number @seq exists, but its data
+ * is not available. This is a valid record, so readers should
+ * continue with the next record.
+ */
+static int desc_read_committed_seq(struct prb_desc_ring *desc_ring,
+ unsigned long id, u64 seq,
+ struct prb_desc *desc_out)
+{
+ struct prb_data_blk_lpos *blk_lpos = &desc_out->text_blk_lpos;
+ enum desc_state d_state;
+
+ d_state = desc_read(desc_ring, id, desc_out);
+
+ /*
+ * An unexpected @id (desc_miss) or @seq mismatch means the record
+ * does not exist. A descriptor in the reserved state means the
+ * record does not yet exist for the reader.
+ */
+ if (d_state == desc_miss ||
+ d_state == desc_reserved ||
+ desc_out->info.seq != seq) {
+ return -EINVAL;
+ }
+
+ /*
+ * A descriptor in the reusable state may no longer have its data
+ * available; report it as a data-less record. Or the record may
+ * actually be a data-less record.
+ */
+ if (d_state == desc_reusable ||
+ (blk_lpos->begin == INVALID_LPOS && blk_lpos->next == INVALID_LPOS)) {
+ return -ENOENT;
+ }
+
+ return 0;
+}
+
+/*
+ * Copy the ringbuffer data from the record with @seq to the provided
+ * @r buffer. On success, 0 is returned.
+ *
+ * See desc_read_committed_seq() for error return values.
+ */
+static int prb_read(struct printk_ringbuffer *rb, u64 seq,
+ struct printk_record *r, unsigned int *line_count)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ struct prb_desc *rdesc = to_desc(desc_ring, seq);
+ atomic_long_t *state_var = &rdesc->state_var;
+ struct prb_desc desc;
+ unsigned long id;
+ int err;
+
+ /* Extract the ID, used to specify the descriptor to read. */
+ id = DESC_ID(atomic_long_read(state_var));
+
+ /* Get a local copy of the correct descriptor (if available). */
+ err = desc_read_committed_seq(desc_ring, id, seq, &desc);
+
+ /*
+ * If @r is NULL, the caller is only interested in the availability
+ * of the record.
+ */
+ if (err || !r)
+ return err;
+
+ /* If requested, copy meta data. */
+ if (r->info)
+ memcpy(r->info, &desc.info, sizeof(*(r->info)));
+
+ /* Copy text data. If it fails, this is a data-less descriptor. */
+ if (!copy_data(&rb->text_data_ring, &desc.text_blk_lpos, desc.info.text_len,
+ r->text_buf, r->text_buf_size, line_count)) {
+ return -ENOENT;
+ }
+
+ /*
+ * Copy dict data. Although this should not fail, dict data is not
+ * important. So if it fails, modify the copied meta data to report
+ * that there is no dict data, thus silently dropping the dict data.
+ */
+ if (!copy_data(&rb->dict_data_ring, &desc.dict_blk_lpos, desc.info.dict_len,
+ r->dict_buf, r->dict_buf_size, NULL)) {
+ if (r->info)
+ r->info->dict_len = 0;
+ }
+
+ /* Ensure the record is still committed and has the same @seq. */
+ return desc_read_committed_seq(desc_ring, id, seq, &desc);
+}
+
+/**
+ * prb_first_seq() - Get the sequence number of the tail descriptor.
+ *
+ * @rb: The ringbuffer to get the sequence number from.
+ *
+ * This is the public function available to readers to see what the
+ * first/oldest sequence number is.
+ *
+ * This provides readers a starting point to begin iterating the ringbuffer.
+ * Note that the returned sequence number might not belong to a valid record.
+ *
+ * Context: Any context.
+ * Return: The sequence number of the first/oldest record or, if the
+ * ringbuffer is empty, 0 is returned.
+ */
+u64 prb_first_seq(struct printk_ringbuffer *rb)
+{
+ struct prb_desc_ring *desc_ring = &rb->desc_ring;
+ enum desc_state d_state;
+ struct prb_desc desc;
+ unsigned long id;
+
+ for (;;) {
+ id = atomic_long_read(&rb->desc_ring.tail_id); /* LMM(prb_first_seq:A) */
+
+ d_state = desc_read(desc_ring, id, &desc); /* LMM(prb_first_seq:B) */
+
+ /*
+ * This loop will not be infinite because the tail is
+ * _always_ in the committed or reusable state.
+ */
+ if (d_state == desc_committed || d_state == desc_reusable)
+ break;
+
+ /*
+ * Guarantee the last state load from desc_read() is before
+ * reloading @tail_id in order to see a new tail in the case
+ * that the descriptor has been recycled. This pairs with
+ * desc_reserve:D.
+ *
+ * Memory barrier involvement:
+ *
+ * If prb_first_seq:B reads from desc_reserve:F, then
+ * prb_first_seq:A reads from desc_push_tail:B.
+ *
+ * Relies on:
+ *
+ * MB from desc_push_tail:B to desc_reserve:F
+ * matching
+ * RMB prb_first_seq:B to prb_first_seq:A
+ */
+ smp_rmb(); /* LMM(prb_first_seq:C) */
+ }
+
+ return desc.info.seq;
+}
+
+/*
+ * Non-blocking read of a record. Updates @seq to the last committed record
+ * (which may have no data).
+ *
+ * See the description of prb_read_valid() and prb_read_valid_info()
+ * for details.
+ */
+static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
+ struct printk_record *r, unsigned int *line_count)
+{
+ u64 tail_seq;
+ int err;
+
+ while ((err = prb_read(rb, *seq, r, line_count))) {
+ tail_seq = prb_first_seq(rb);
+
+ if (*seq < tail_seq) {
+ /*
+ * Behind the tail. Catch up and try again. This
+ * can happen for -ENOENT and -EINVAL cases.
+ */
+ *seq = tail_seq;
+
+ } else if (err == -ENOENT) {
+ /* Record exists, but no data available. Skip. */
+ (*seq)++;
+
+ } else {
+ /* Non-existent/non-committed record. Must stop. */
+ return false;
+ }
+ }
+
+ return true;
+}
+
+/**
+ * prb_read_valid() - Non-blocking read of a requested record or (if gone)
+ * the next available record.
+ *
+ * @rb: The ringbuffer to read from.
+ * @seq: The sequence number of the record to read.
+ * @r: A record data buffer to store the read record to.
+ *
+ * This is the public function available to readers to read a record.
+ *
+ * The reader provides the @info, @text_buf, @dict_buf buffers of @r to be
+ * filled in. Any of the buffer pointers can be set to NULL if the reader
+ * is not interested in that data. To ensure proper initialization of @r,
+ * prb_rec_init_rd() should be used.
+ *
+ * Context: Any context.
+ * Return: true if a record was read, otherwise false.
+ *
+ * On success, the reader must check r->info.seq to see which record was
+ * actually read. This allows the reader to detect dropped records.
+ *
+ * Failure means @seq refers to a not yet written record.
+ */
+bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
+ struct printk_record *r)
+{
+ return _prb_read_valid(rb, &seq, r, NULL);
+}
+
+/**
+ * prb_read_valid_info() - Non-blocking read of meta data for a requested
+ * record or (if gone) the next available record.
+ *
+ * @rb: The ringbuffer to read from.
+ * @seq: The sequence number of the record to read.
+ * @info: A buffer to store the read record meta data to.
+ * @line_count: A buffer to store the number of lines in the record text.
+ *
+ * This is the public function available to readers to read only the
+ * meta data of a record.
+ *
+ * The reader provides the @info, @line_count buffers to be filled in.
+ * Either of the buffer pointers can be set to NULL if the reader is not
+ * interested in that data.
+ *
+ * Context: Any context.
+ * Return: true if a record's meta data was read, otherwise false.
+ *
+ * On success, the reader must check info->seq to see which record meta data
+ * was actually read. This allows the reader to detect dropped records.
+ *
+ * Failure means @seq refers to a not yet written record.
+ */
+bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
+ struct printk_info *info, unsigned int *line_count)
+{
+ struct printk_record r;
+
+ prb_rec_init_rd(&r, info, NULL, 0, NULL, 0);
+
+ return _prb_read_valid(rb, &seq, &r, line_count);
+}
+
+/**
+ * prb_next_seq() - Get the sequence number after the last available record.
+ *
+ * @rb: The ringbuffer to get the sequence number from.
+ *
+ * This is the public function available to readers to see what the next
+ * newest sequence number available to readers will be.
+ *
+ * This provides readers a sequence number to jump to if all currently
+ * available records should be skipped.
+ *
+ * Context: Any context.
+ * Return: The sequence number of the next newest (not yet available) record
+ * for readers.
+ */
+u64 prb_next_seq(struct printk_ringbuffer *rb)
+{
+ u64 seq = 0;
+
+ do {
+ /* Search forward from the oldest descriptor. */
+ if (!_prb_read_valid(rb, &seq, NULL, NULL))
+ return seq;
+ seq++;
+ } while (seq);
+
+ return 0;
+}
+
+/**
+ * prb_init() - Initialize a ringbuffer to use provided external buffers.
+ *
+ * @rb: The ringbuffer to initialize.
+ * @text_buf: The data buffer for text data.
+ * @textbits: The size of @text_buf as a power-of-2 value.
+ * @dict_buf: The data buffer for dictionary data.
+ * @dictbits: The size of @dict_buf as a power-of-2 value.
+ * @descs: The descriptor buffer for ringbuffer records.
+ * @descbits: The count of @descs items as a power-of-2 value.
+ *
+ * This is the public function available to writers to setup a ringbuffer
+ * during runtime using provided buffers.
+ *
+ * This must match the initialization of DECLARE_PRINTKRB().
+ *
+ * Context: Any context.
+ */
+void prb_init(struct printk_ringbuffer *rb,
+ char *text_buf, unsigned int textbits,
+ char *dict_buf, unsigned int dictbits,
+ struct prb_desc *descs, unsigned int descbits)
+{
+ memset(descs, 0, _DESCS_COUNT(descbits) * sizeof(descs[0]));
+
+ rb->desc_ring.count_bits = descbits;
+ rb->desc_ring.descs = descs;
+ atomic_long_set(&rb->desc_ring.head_id, DESC0_ID(descbits));
+ atomic_long_set(&rb->desc_ring.tail_id, DESC0_ID(descbits));
+
+ rb->text_data_ring.size_bits = textbits;
+ rb->text_data_ring.data = text_buf;
+ atomic_long_set(&rb->text_data_ring.head_lpos, BLK0_LPOS(textbits));
+ atomic_long_set(&rb->text_data_ring.tail_lpos, BLK0_LPOS(textbits));
+
+ rb->dict_data_ring.size_bits = dictbits;
+ rb->dict_data_ring.data = dict_buf;
+ atomic_long_set(&rb->dict_data_ring.head_lpos, BLK0_LPOS(dictbits));
+ atomic_long_set(&rb->dict_data_ring.tail_lpos, BLK0_LPOS(dictbits));
+
+ atomic_long_set(&rb->fail, 0);
+
+ descs[0].info.seq = -(u64)_DESCS_COUNT(descbits);
+
+ descs[_DESCS_COUNT(descbits) - 1].info.seq = 0;
+ atomic_long_set(&(descs[_DESCS_COUNT(descbits) - 1].state_var), DESC0_SV(descbits));
+ descs[_DESCS_COUNT(descbits) - 1].text_blk_lpos.begin = INVALID_LPOS;
+ descs[_DESCS_COUNT(descbits) - 1].text_blk_lpos.next = INVALID_LPOS;
+ descs[_DESCS_COUNT(descbits) - 1].dict_blk_lpos.begin = INVALID_LPOS;
+ descs[_DESCS_COUNT(descbits) - 1].dict_blk_lpos.next = INVALID_LPOS;
+}
+
+/**
+ * prb_record_text_space() - Query the full actual used ringbuffer space for
+ * the text data of a reserved entry.
+ *
+ * @e: The successfully reserved entry to query.
+ *
+ * This is the public function available to writers to see how much actual
+ * space is used in the ringbuffer to store the text data of the specified
+ * entry.
+ *
+ * This function is only valid if @e has been successfully reserved using
+ * prb_reserve().
+ *
+ * Context: Any context.
+ * Return: The size in bytes used by the text data of the associated record.
+ */
+unsigned int prb_record_text_space(struct prb_reserved_entry *e)
+{
+ return e->text_space;
+}
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
new file mode 100644
index 000000000000..df03039dca7e
--- /dev/null
+++ b/kernel/printk/printk_ringbuffer.h
@@ -0,0 +1,352 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _KERNEL_PRINTK_RINGBUFFER_H
+#define _KERNEL_PRINTK_RINGBUFFER_H
+
+#include <linux/atomic.h>
+
+struct printk_info {
+ u64 seq; /* sequence number */
+ u64 ts_nsec; /* timestamp in nanoseconds */
+ u16 text_len; /* length of text message */
+ u16 dict_len; /* length of dictionary message */
+ u8 facility; /* syslog facility */
+ u8 flags:5; /* internal record flags */
+ u8 level:3; /* syslog level */
+ u32 caller_id; /* thread id or processor id */
+};
+
+/*
+ * A structure providing the buffers, used by writers and readers.
+ *
+ * Writers:
+ * Using prb_rec_init_wr(), a writer sets @text_buf_size and @dict_buf_size
+ * before calling prb_reserve(). On success, prb_reserve() sets @info,
+ * @text_buf, @dict_buf to buffers reserved for that writer.
+ *
+ * Readers:
+ * Using prb_rec_init_rd(), a reader sets all fields before calling
+ * prb_read_valid(). Note that the reader provides the @info, @text_buf,
+ * @dict_buf buffers. On success, the struct pointed to by @info will be
+ * filled and the char arrays pointed to by @text_buf and @dict_buf will
+ * be filled with text and dict data.
+ */
+struct printk_record {
+ struct printk_info *info;
+ char *text_buf;
+ char *dict_buf;
+ unsigned int text_buf_size;
+ unsigned int dict_buf_size;
+};
+
+/* Specifies the position/span of a data block. */
+struct prb_data_blk_lpos {
+ unsigned long begin;
+ unsigned long next;
+};
+
+/* A descriptor: the complete meta-data for a record. */
+struct prb_desc {
+ struct printk_info info;
+ atomic_long_t state_var;
+ struct prb_data_blk_lpos text_blk_lpos;
+ struct prb_data_blk_lpos dict_blk_lpos;
+};
+
+/* A ringbuffer of "ID + data" elements. */
+struct prb_data_ring {
+ unsigned int size_bits;
+ char *data;
+ atomic_long_t head_lpos;
+ atomic_long_t tail_lpos;
+};
+
+/* A ringbuffer of "struct prb_desc" elements. */
+struct prb_desc_ring {
+ unsigned int count_bits;
+ struct prb_desc *descs;
+ atomic_long_t head_id;
+ atomic_long_t tail_id;
+};
+
+/* The high level structure representing the printk ringbuffer. */
+struct printk_ringbuffer {
+ struct prb_desc_ring desc_ring;
+ struct prb_data_ring text_data_ring;
+ struct prb_data_ring dict_data_ring;
+ atomic_long_t fail;
+};
+
+/* Used by writers as a reserve/commit handle. */
+struct prb_reserved_entry {
+ struct printk_ringbuffer *rb;
+ unsigned long irqflags;
+ unsigned long id;
+ unsigned int text_space;
+};
+
+#define _DATA_SIZE(sz_bits) (1UL << (sz_bits))
+#define _DESCS_COUNT(ct_bits) (1U << (ct_bits))
+#define DESC_SV_BITS (sizeof(unsigned long) * 8)
+#define DESC_COMMITTED_MASK (1UL << (DESC_SV_BITS - 1))
+#define DESC_REUSE_MASK (1UL << (DESC_SV_BITS - 2))
+#define DESC_FLAGS_MASK (DESC_COMMITTED_MASK | DESC_REUSE_MASK)
+#define DESC_ID_MASK (~DESC_FLAGS_MASK)
+#define DESC_ID(sv) ((sv) & DESC_ID_MASK)
+#define INVALID_LPOS 1
+
+#define INVALID_BLK_LPOS \
+{ \
+ .begin = INVALID_LPOS, \
+ .next = INVALID_LPOS, \
+}
+
+/*
+ * Descriptor Bootstrap
+ *
+ * The descriptor array is minimally initialized to allow immediate usage
+ * by readers and writers. The requirements that the descriptor array
+ * initialization must satisfy:
+ *
+ * Req1
+ * The tail must point to an existing (committed or reusable) descriptor.
+ * This is required by the implementation of prb_first_seq().
+ *
+ * Req2
+ * Readers must see that the ringbuffer is initially empty.
+ *
+ * Req3
+ * The first record reserved by a writer is assigned sequence number 0.
+ *
+ * To satisfy Req1, the tail initially points to a descriptor that is
+ * minimally initialized (having no data block, i.e. data-less with the
+ * data block's lpos @begin and @next values set to INVALID_LPOS).
+ *
+ * To satisfy Req2, the initial tail descriptor is initialized to the
+ * reusable state. Readers recognize reusable descriptors as existing
+ * records, but skip over them.
+ *
+ * To satisfy Req3, the last descriptor in the array is used as the initial
+ * head (and tail) descriptor. This allows the first record reserved by a
+ * writer (head + 1) to be the first descriptor in the array. (Only the first
+ * descriptor in the array could have a valid sequence number of 0.)
+ *
+ * The first time a descriptor is reserved, it is assigned a sequence number
+ * with the value of the array index. A "first time reserved" descriptor can
+ * be recognized because it has a sequence number of 0 but does not have an
+ * index of 0. (Only the first descriptor in the array could have a valid
+ * sequence number of 0.) After the first reservation, all future reservations
+ * (recycling) simply involve incrementing the sequence number by the array
+ * count.
+ *
+ * Hack #1
+ * Only the first descriptor in the array is allowed to have the sequence
+ * number 0. In this case it is not possible to recognize if it is being
+ * reserved the first time (set to index value) or has been reserved
+ * previously (increment by the array count). This is handled by _always_
+ * incrementing the sequence number by the array count when reserving the
+ * first descriptor in the array. In order to satisfy Req3, the sequence
+ * number of the first descriptor in the array is initialized to minus
+ * the array count. Then, upon the first reservation, it is incremented
+ * to 0, thus satisfying Req3.
+ *
+ * Hack #2
+ * prb_first_seq() can be called at any time by readers to retrieve the
+ * sequence number of the tail descriptor. However, due to Req2 and Req3,
+ * initially there are no records to report the sequence number of
+ * (sequence numbers are u64 and there is nothing less than 0). To handle
+ * this, the sequence number of the initial tail descriptor is initialized
+ * to 0. Technically this is incorrect, because there is no record with
+ * sequence number 0 (yet) and the tail descriptor is not the first
+ * descriptor in the array. But it allows prb_read_valid() to correctly
+ * report the existence of a record for _any_ given sequence number at all
+ * times. Bootstrapping is complete when the tail is pushed the first
+ * time, thus finally pointing to the first descriptor reserved by a
+ * writer, which has the assigned sequence number 0.
+ */
+
+/*
+ * Initiating Logical Value Overflows
+ *
+ * Both logical position (lpos) and ID values can be mapped to array indexes
+ * but may experience overflows during the lifetime of the system. To ensure
+ * that printk_ringbuffer can handle the overflows for these types, initial
+ * values are chosen that map to the correct initial array indexes, but will
+ * result in overflows soon.
+ *
+ * BLK0_LPOS
+ * The initial @head_lpos and @tail_lpos for data rings. It is at index
+ * 0 and the lpos value is such that it will overflow on the first wrap.
+ *
+ * DESC0_ID
+ * The initial @head_id and @tail_id for the desc ring. It is at the last
+ * index of the descriptor array (see Req3 above) and the ID value is such
+ * that it will overflow on the second wrap.
+ */
+#define BLK0_LPOS(sz_bits) (-(_DATA_SIZE(sz_bits)))
+#define DESC0_ID(ct_bits) DESC_ID(-(_DESCS_COUNT(ct_bits) + 1))
+#define DESC0_SV(ct_bits) (DESC_COMMITTED_MASK | DESC_REUSE_MASK | DESC0_ID(ct_bits))
+
+/*
+ * Declare a ringbuffer with an external text data buffer. The same as
+ * DECLARE_PRINTKRB() but requires specifying an external buffer for the
+ * text data.
+ *
+ * Note: The specified external buffer must be of the size:
+ * 2 ^ (descbits + avgtextbits)
+ */
+#define _DECLARE_PRINTKRB(name, descbits, avgtextbits, avgdictbits, text_buf) \
+char _##name##_dict[1U << ((avgdictbits) + (descbits))] __aligned(__alignof__(unsigned long)); \
+struct prb_desc _##name##_descs[_DESCS_COUNT(descbits)] = { \
+ /* this will be the first record reserved by a writer */ \
+ [0] = { \
+ .info = { \
+ /* will be incremented to 0 on the first reservation */ \
+ .seq = -(u64)_DESCS_COUNT(descbits), \
+ }, \
+ }, \
+ /* the initial head and tail */ \
+ [_DESCS_COUNT(descbits) - 1] = { \
+ .info = { \
+ /* reports the first seq value during the bootstrap phase */ \
+ .seq = 0, \
+ }, \
+ /* reusable */ \
+ .state_var = ATOMIC_INIT(DESC0_SV(descbits)), \
+ /* no associated data block */ \
+ .text_blk_lpos = INVALID_BLK_LPOS, \
+ .dict_blk_lpos = INVALID_BLK_LPOS, \
+ }, \
+}; \
+struct printk_ringbuffer name = { \
+ .desc_ring = { \
+ .count_bits = descbits, \
+ .descs = &_##name##_descs[0], \
+ .head_id = ATOMIC_INIT(DESC0_ID(descbits)), \
+ .tail_id = ATOMIC_INIT(DESC0_ID(descbits)), \
+ }, \
+ .text_data_ring = { \
+ .size_bits = (avgtextbits) + (descbits), \
+ .data = text_buf, \
+ .head_lpos = ATOMIC_LONG_INIT(BLK0_LPOS((avgtextbits) + (descbits))), \
+ .tail_lpos = ATOMIC_LONG_INIT(BLK0_LPOS((avgtextbits) + (descbits))), \
+ }, \
+ .dict_data_ring = { \
+ .size_bits = (avgdictbits) + (descbits), \
+ .data = &_##name##_dict[0], \
+ .head_lpos = ATOMIC_LONG_INIT(BLK0_LPOS((avgtextbits) + (descbits))), \
+ .tail_lpos = ATOMIC_LONG_INIT(BLK0_LPOS((avgtextbits) + (descbits))), \
+ }, \
+ .fail = ATOMIC_LONG_INIT(0), \
+}
+
+/**
+ * DECLARE_PRINTKRB() - Declare a ringbuffer.
+ *
+ * @name: The name of the ringbuffer variable.
+ * @descbits: The number of descriptors as a power-of-2 value.
+ * @avgtextbits: The average text data size per record as a power-of-2 value.
+ * @avgdictbits: The average dictionary data size per record as a
+ * power-of-2 value.
+ *
+ * This is a macro for declaring a ringbuffer and all internal structures
+ * such that it is ready for immediate use. See _DECLARE_PRINTKRB() for a
+ * variant where the text data buffer can be specified externally.
+ */
+#define DECLARE_PRINTKRB(name, descbits, avgtextbits, avgdictbits) \
+char _##name##_text[1U << ((avgtextbits) + (descbits))] __aligned(__alignof__(unsigned long)); \
+_DECLARE_PRINTKRB(name, descbits, avgtextbits, avgdictbits, &_##name##_text[0])
+
+/* Writer Interface */
+
+/**
+ * prb_rec_init_wr() - Initialize a buffer for writing records.
+ *
+ * @r: The record to initialize.
+ * @text_buf_size: The needed text buffer size.
+ * @dict_buf_size: The needed dictionary buffer size.
+ *
+ * Initialize all the fields that a writer is interested in. If
+ * @dict_buf_size is 0, a dictionary buffer will not be reserved.
+ * @text_buf_size must be greater than 0.
+ *
+ * Note that although @dict_buf_size may be initialized to non-zero,
+ * its value must be rechecked after a successful call to prb_reserve()
+ * to verify a dictionary buffer was actually reserved. Dictionary buffer
+ * reservation is allowed to fail.
+ */
+static inline void prb_rec_init_wr(struct printk_record *r,
+ unsigned int text_buf_size,
+ unsigned int dict_buf_size)
+{
+ r->info = NULL;
+ r->text_buf = NULL;
+ r->dict_buf = NULL;
+ r->text_buf_size = text_buf_size;
+ r->dict_buf_size = dict_buf_size;
+}
+
+bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
+ struct printk_record *r);
+void prb_commit(struct prb_reserved_entry *e);
+
+void prb_init(struct printk_ringbuffer *rb,
+ char *text_buf, unsigned int text_buf_size,
+ char *dict_buf, unsigned int dict_buf_size,
+ struct prb_desc *descs, unsigned int descs_count_bits);
+unsigned int prb_record_text_space(struct prb_reserved_entry *e);
+
+/* Reader Interface */
+
+/**
+ * prb_rec_init_rd() - Initialize a buffer for reading records.
+ *
+ * @r: The record to initialize.
+ * @info: A buffer to store record meta-data.
+ * @text_buf: A buffer to store text data.
+ * @text_buf_size: The size of @text_buf.
+ * @dict_buf: A buffer to store dictionary data.
+ * @dict_buf_size: The size of @dict_buf.
+ *
+ * Initialize all the fields that a reader is interested in. All arguments
+ * (except @r) are optional. Only record data for arguments that are
+ * non-NULL or non-zero will be read.
+ */
+static inline void prb_rec_init_rd(struct printk_record *r,
+ struct printk_info *info,
+ char *text_buf, unsigned int text_buf_size,
+ char *dict_buf, unsigned int dict_buf_size)
+{
+ r->info = info;
+ r->text_buf = text_buf;
+ r->dict_buf = dict_buf;
+ r->text_buf_size = text_buf_size;
+ r->dict_buf_size = dict_buf_size;
+}
+
+/**
+ * prb_for_each_record() - Iterate over a ringbuffer.
+ *
+ * @from: The sequence number to begin with.
+ * @rb: The ringbuffer to iterate over.
+ * @s: A u64 to store the sequence number on each iteration.
+ * @r: A printk_record to store the record on each iteration.
+ *
+ * This is a macro for conveniently iterating over a ringbuffer.
+ *
+ * Context: Any context.
+ */
+#define prb_for_each_record(from, rb, s, r) \
+for ((s) = from; prb_read_valid(rb, s, r); (s) = (r)->info->seq + 1)
+
+bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
+ struct printk_record *r);
+bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
+ struct printk_info *info, unsigned int *line_count);
+
+u64 prb_first_seq(struct printk_ringbuffer *rb);
+u64 prb_next_seq(struct printk_ringbuffer *rb);
+
+unsigned int prb_count_lines(char *text, unsigned int text_size);
+
+#endif /* _KERNEL_PRINTK_RINGBUFFER_H */
--
2.20.1

2020-06-18 19:07:45

by kernel test robot

Subject: Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

Hi John,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master v5.8-rc1 next-20200618]
[cannot apply to pmladek/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/John-Ogness/printk-replace-ringbuffer/20200618-225239
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 1b5044021070efa3259f3e9548dc35d1eb6aa844
config: x86_64-randconfig-s021-20200618 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.2-rc1-10-gc17b1b06-dirty
# save the attached .config to linux build tree
make W=1 C=1 ARCH=x86_64 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>


sparse warnings: (new ones prefixed by >>)

kernel/printk/printk.c:397:1: sparse: sparse: symbol 'log_wait' was not declared. Should it be static?
>> kernel/printk/printk.c:437:1: sparse: sparse: symbol '_printk_rb_static_dict' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (1.49 kB)
.config.gz (32.56 kB)

2020-06-18 21:26:19

by John Ogness

Subject: [PATCH v3 1/3] crash: add VMCOREINFO macro to define offset in a struct declared by typedef

The existing macro VMCOREINFO_OFFSET() can't be used for structures
declared via typedef because "struct" is not part of type definition.

Create another macro for this purpose.

Signed-off-by: John Ogness <[email protected]>
---
include/linux/crash_core.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index 525510a9f965..43b51c9df571 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -53,6 +53,9 @@ phys_addr_t paddr_vmcoreinfo_note(void);
#define VMCOREINFO_OFFSET(name, field) \
vmcoreinfo_append_str("OFFSET(%s.%s)=%lu\n", #name, #field, \
(unsigned long)offsetof(struct name, field))
+#define VMCOREINFO_TYPE_OFFSET(name, field) \
+ vmcoreinfo_append_str("OFFSET(%s.%s)=%lu\n", #name, #field, \
+ (unsigned long)offsetof(name, field))
#define VMCOREINFO_LENGTH(name, value) \
vmcoreinfo_append_str("LENGTH(%s)=%lu\n", #name, (unsigned long)value)
#define VMCOREINFO_NUMBER(name) \
--
2.20.1

2020-06-18 21:53:20

by kernel test robot

Subject: [RFC PATCH] printk: _printk_rb_static_dict can be static


Signed-off-by: kernel test robot <[email protected]>
---
printk.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 7642ef634956f..d812ada06735f 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -434,7 +434,7 @@ static u32 log_buf_len = __LOG_BUF_LEN;
*/
#define PRB_AVGBITS 5 /* 32 character average length */

-_DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
+static _DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]);

static struct printk_ringbuffer printk_rb_dynamic;

2020-06-19 08:52:40

by John Ogness

Subject: Re: [RFC PATCH] printk: _printk_rb_static_dict can be static

On 2020-06-18, kernel test robot <[email protected]> wrote:
> Signed-off-by: kernel test robot <[email protected]>
> ---
> printk.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 7642ef634956f..d812ada06735f 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -434,7 +434,7 @@ static u32 log_buf_len = __LOG_BUF_LEN;
> */
> #define PRB_AVGBITS 5 /* 32 character average length */
>
> -_DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
> +static _DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
> PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]);

_DECLARE_PRINTKRB declares multiple variables, so this patch will not
work as intended. I would like to declare the variables static but am
not sure how best to go about it.

In the Linux source I see examples of macros just declaring the
variables static. And I see examples of the macros providing a parameter
where the "static" keyword can be specified.

Since the ringbuffer was created exclusively to serve printk, I would
prefer to just have _DECLARE_PRINTKRB (and DECLARE_PRINTKRB) declare all
the variables as static.

John Ogness

2020-06-19 13:03:04

by Steven Rostedt

Subject: Re: [RFC PATCH] printk: _printk_rb_static_dict can be static

On Fri, 19 Jun 2020 08:49:21 +0200
John Ogness <[email protected]> wrote:

> > +++ b/kernel/printk/printk.c
> > @@ -434,7 +434,7 @@ static u32 log_buf_len = __LOG_BUF_LEN;
> > */
> > #define PRB_AVGBITS 5 /* 32 character average length */
> >
> > -_DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
> > +static _DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
> > PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]);
>
> _DECLARE_PRINTKRB declares multiple variables, so this patch will not
> work as intended. I would like to declare the variables static but am
> not sure how best to go about it.
>
> In the Linux source I see examples of macros just desclaring the
> variables static. And I see examples of the macros providing a parameter
> where the "static" keyword can be specified.
>
> Since the ringbuffer was created exclusively to serve printk, I would
> prefer to just have _DECLARE_PRINTKRB (and DECLARE_PRINTKRB) declare all
> the variables as static.

Having written macros that do such things, I agree with your last
statement. Just have the macro declare all the variables static.

-- Steve

2020-06-24 08:53:07

by Petr Mladek

Subject: Re: [PATCH v3 1/3] crash: add VMCOREINFO macro to define offset in a struct declared by typedef

On Thu 2020-06-18 16:55:17, John Ogness wrote:
> The existing macro VMCOREINFO_OFFSET() can't be used for structures
> declared via typedef because "struct" is not part of type definition.
>
> Create another macro for this purpose.
>
> Signed-off-by: John Ogness <[email protected]>

Reviewed-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2020-06-25 07:23:00

by Dave Young

Subject: Re: [PATCH v3 0/3] printk: replace ringbuffer

Hi John,
On 06/18/20 at 04:55pm, John Ogness wrote:
> Hello,
>
> Here is a v3 for the first series to rework the printk
> subsystem. The v2 and history are here [0]. This first series
> only replaces the existing ringbuffer implementation. No locking
> is removed. No semantics/behavior of printk are changed.
>
> Reviews on the ringbuffer are still ongoing, but I was asked to
> post this new version since several changes from v2 have been
> already agreed upon.
>
> The series is based on v5.8-rc1.

Do you have a link to the kdump userspace part so that people can try
it and do some testing?

E.g. a makedumpfile/crash tool git branch, etc.

Thanks
Dave

2020-06-25 08:20:27

by Petr Mladek

Subject: truncate dict: was: Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

On Thu 2020-06-18 16:55:19, John Ogness wrote:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlying ringbuffer.

> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -594,22 +473,26 @@ static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
> #define MAX_LOG_TAKE_PART 4
> static const char trunc_msg[] = "<truncated>";
>
> -static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len,
> - u16 *dict_len, u32 *pad_len)
> +static void truncate_msg(u16 *text_len, u16 *trunc_msg_len, u16 *dict_len)
> {
> /*
> * The message should not take the whole buffer. Otherwise, it might
> * get removed too soon.
> */
> u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART;
> +
> if (*text_len > max_text_len)
> *text_len = max_text_len;
> - /* enable the warning message */
> +
> + /* enable the warning message (if there is room) */
> *trunc_msg_len = strlen(trunc_msg);
> + if (*text_len >= *trunc_msg_len)
> + *text_len -= *trunc_msg_len;
> + else
> + *trunc_msg_len = 0;
> +
> /* disable the "dict" completely */
> *dict_len = 0;

The dictionary no longer needs to be removed at this point.
It is stored in a separate buffer and is ignored by prb_reserve()
when there is not enough space for it.

> - /* compute the size again, count also the warning message */
> - return msg_used_size(*text_len + *trunc_msg_len, 0, pad_len);
> }
>
> /* insert record into the buffer, discard old ones, update heads */
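(For reference, the truncation arithmetic in the quoted hunk can be checked in isolation with a small userspace sketch. This is hypothetical code, not the kernel function: max_text_len is passed in as a parameter instead of being derived from log_buf_len / MAX_LOG_TAKE_PART.)

```c
#include <string.h>

/* Userspace sketch of the truncation arithmetic from the quoted hunk. */
static const char trunc_msg[] = "<truncated>";

static void truncate_msg(unsigned short *text_len,
			 unsigned short *trunc_msg_len,
			 unsigned int max_text_len)
{
	/* The message should not take the whole buffer. */
	if (*text_len > max_text_len)
		*text_len = max_text_len;

	/* Enable the warning message (if there is room). */
	*trunc_msg_len = strlen(trunc_msg);
	if (*text_len >= *trunc_msg_len)
		*text_len -= *trunc_msg_len;
	else
		*trunc_msg_len = 0;
}
```

With max_text_len = 50, a 100-byte message is cut to 39 bytes of text plus the 11-byte "<truncated>" marker; a 5-byte message keeps its text and gets no marker.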

Best Regards,
Petr

PS: I am still in the middle of reviewing this patch. I decided to
send the three comments that I already have separately.

2020-06-25 09:41:57

by Petr Mladek

Subject: buffer allocation: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On Thu 2020-06-18 16:55:19, John Ogness wrote:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlining ringbuffer.
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1176,11 +1068,46 @@ static void __init set_percpu_data_ready(void)
> __printk_percpu_data_ready = true;
> }
>
> +static unsigned int __init add_to_rb(struct printk_ringbuffer *rb,
> + struct printk_record *r)
> +{
> + struct prb_reserved_entry e;
> + struct printk_record dest_r;
> +
> + prb_rec_init_wr(&dest_r, r->info->text_len, r->info->dict_len);
> +
> + if (!prb_reserve(&e, rb, &dest_r))
> + return 0;
> +
> + memcpy(&dest_r.text_buf[0], &r->text_buf[0], dest_r.text_buf_size);
> + if (dest_r.dict_buf) {
> + memcpy(&dest_r.dict_buf[0], &r->dict_buf[0],
> + dest_r.dict_buf_size);
> + }
> + dest_r.info->facility = r->info->facility;
> + dest_r.info->level = r->info->level;
> + dest_r.info->flags = r->info->flags;
> + dest_r.info->ts_nsec = r->info->ts_nsec;
> + dest_r.info->caller_id = r->info->caller_id;
> +
> + prb_commit(&e);
> +
> + return prb_record_text_space(&e);
> +}
> +
> +static char setup_text_buf[CONSOLE_EXT_LOG_MAX] __initdata;
> +static char setup_dict_buf[CONSOLE_EXT_LOG_MAX] __initdata;
> +
> void __init setup_log_buf(int early)
> {
> + struct prb_desc *new_descs;
> + struct printk_info info;
> + struct printk_record r;
> unsigned long flags;
> + char *new_dict_buf;
> char *new_log_buf;
> unsigned int free;
> + u64 seq;
>
> /*
> * Some archs call setup_log_buf() multiple times - first is very
> @@ -1201,17 +1128,50 @@ void __init setup_log_buf(int early)
>
> new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
> if (unlikely(!new_log_buf)) {
> - pr_err("log_buf_len: %lu bytes not available\n",
> + pr_err("log_buf_len: %lu text bytes not available\n",
> new_log_buf_len);
> return;
> }
>
> + new_dict_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
> + if (unlikely(!new_dict_buf)) {
> + /* dictionary failure is allowed */
> + pr_err("log_buf_len: %lu dict bytes not available\n",
> + new_log_buf_len);
> + }
> +
> + new_descs = memblock_alloc((new_log_buf_len >> PRB_AVGBITS) *
> + sizeof(struct prb_desc), LOG_ALIGN);
> + if (unlikely(!new_descs)) {
> + pr_err("log_buf_len: %lu desc bytes not available\n",
> + new_log_buf_len >> PRB_AVGBITS);
> + if (new_dict_buf)
> + memblock_free(__pa(new_dict_buf), new_log_buf_len);
> + memblock_free(__pa(new_log_buf), new_log_buf_len);
> + return;
> + }
> +
> + prb_rec_init_rd(&r, &info,
> + &setup_text_buf[0], sizeof(setup_text_buf),
> + &setup_dict_buf[0], sizeof(setup_dict_buf));
> +
> logbuf_lock_irqsave(flags);
> +
> + prb_init(&printk_rb_dynamic,
> + new_log_buf, bits_per(new_log_buf_len) - 1,
> + new_dict_buf, bits_per(new_log_buf_len) - 1,

This does not check whether new_dict_buf was really allocated.

We should reuse the static one when it was not allocated. But it might
become tricky. We would need to preserve the sequence numbers of each
copied message and not copy the dictionary.

I suggest making it easy and switching to the new buffers only when
all three could be allocated.

Well, we should still try to preserve the sequence numbers. For this,
we might need to allow storing empty messages instead of the lost ones.


> + new_descs, (bits_per(new_log_buf_len) - 1) - PRB_AVGBITS);

This is likely safe because new_log_buf_len is assigned only when it is
greater than the previous one. As a result, bits_per(new_log_buf_len) - 1
should be greater than PRB_AVGBITS.

Well, I still feel a bit uneasy about these PRB_AVGBITS operations,
including the new_log_buf_len >> PRB_AVGBITS used above.

A safer design would be to add some sanity checks at the beginning of
the function, and maybe convert new_log_buf_len to a number of bits.
Then operate with the number of bits in the rest of the function.
It might be easier to make sure that we are on the safe side.
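(The size relationship being discussed can be sketched as a small userspace check. This is hypothetical code: bits_per() is reimplemented here to mean "bits needed to represent n", matching the kernel helper of the same name.)

```c
#define PRB_AVGBITS 5	/* 32-byte average record, as in the series */

/* Bits needed to represent n (userspace sketch of the kernel's bits_per()). */
static unsigned int bits_per(unsigned long n)
{
	unsigned int bits = 0;

	while (n) {
		bits++;
		n >>= 1;
	}
	return bits;
}
```

For a 1 MiB data buffer, the data-ring size in bits is bits_per(len) - 1 = 20 and the descriptor-ring size is 20 - PRB_AVGBITS = 15, which matches len >> PRB_AVGBITS = 32768 descriptors. The arithmetic is only meaningful while bits_per(len) - 1 > PRB_AVGBITS, which is exactly the sanity check being suggested.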


> log_buf_len = new_log_buf_len;
> log_buf = new_log_buf;
> new_log_buf_len = 0;
> - free = __LOG_BUF_LEN - log_next_idx;
> - memcpy(log_buf, __log_buf, __LOG_BUF_LEN);
> +
> + free = __LOG_BUF_LEN;
> + prb_for_each_record(0, &printk_rb_static, seq, &r)
> + free -= add_to_rb(&printk_rb_dynamic, &r);
> +
> + prb = &printk_rb_dynamic;

This might deserve a comment that this is safe (no lost messages)
because it is called early enough that everything is still running
on the boot CPU.

> +
> logbuf_unlock_irqrestore(flags);
>
> pr_info("log_buf_len: %u bytes\n", log_buf_len);

Best Regards,
Petr

2020-06-25 12:10:48

by Petr Mladek

Subject: record_printk_text tricks: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On Thu 2020-06-18 16:55:19, John Ogness wrote:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlying ringbuffer.

> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1344,72 +1304,150 @@ static size_t print_prefix(const struct printk_log *msg, bool syslog,
> return len;
> }
>
> -static size_t msg_print_text(const struct printk_log *msg, bool syslog,
> - bool time, char *buf, size_t size)
> +static size_t record_print_text(struct printk_record *r, bool syslog,
> + bool time)
> {
> - const char *text = log_text(msg);
> - size_t text_size = msg->text_len;
> - size_t len = 0;
> + size_t text_len = r->info->text_len;
> + size_t buf_size = r->text_buf_size;
> + char *text = r->text_buf;
> char prefix[PREFIX_MAX];
> - const size_t prefix_len = print_prefix(msg, syslog, time, prefix);
> + bool truncated = false;
> + size_t prefix_len;
> + size_t line_len;
> + size_t len = 0;
> + char *next;
>
> - do {
> - const char *next = memchr(text, '\n', text_size);
> - size_t text_len;
> + prefix_len = info_print_prefix(r->info, syslog, time, prefix);
>
> + /*
> + * Add the prefix for each line by shifting the rest of the text to
> + * make room for each prefix. If the buffer is not large enough for
> + * all the prefixes, then drop the trailing text and report the
> + * largest length that includes full lines with their prefixes.

This should go up as the function description.

Sigh, this function makes (and already made) many subtle decisions. We
should mention them in the description. My current understanding is
the following:

/*
* Prepare the record for printing. The text is shifted in the given
* buffer to avoid a need for another one. The following operations
* are done:
*
* - Add prefix for each line.
* - Add the trailing newline that has been removed in vprintk_store().
* - Drop truncated lines that no longer fit into the buffer.
*
* Return: The length of the updated text, including the added
* prefixes and the newline. The dropped line(s) are not counted.
*/

> + *
> + * @text_len: bytes of unprocessed text
> + * @line_len: bytes of current line (newline _not_ included)
> + * @text: pointer to beginning of current line
> + * @len: number of bytes processed (size of r->text_buf done)

Please, move this to the variable declaration.

> + */
> + for (;;) {
> + next = memchr(text, '\n', text_len);
> if (next) {
> - text_len = next - text;
> - next++;
> - text_size -= next - text;
> + line_len = next - text;
> } else {
> - text_len = text_size;
> + /*
> + * No newline. If the text was previously truncated,
> + * assume this line was truncated and do not include
> + * it.
> + */

The word "assume" is confusing. The last line _must_ have been truncated when
"truncated" is set. I would write:

/* Drop incomplete truncated lines. */

> + if (truncated)
> + break;
> + line_len = text_len;
> }
>
> - if (buf) {
> - if (prefix_len + text_len + 1 >= size - len)
> + /*
> + * Ensure there is enough buffer available to shift this line
> + * (and add a newline at the end).
> + */
> + if (len + prefix_len + line_len + 1 > buf_size)
> + break;
> +
> + /*
> + * Ensure there is enough buffer available to shift all
> + * remaining text (and add a newline at the end). If this
> + * test fails, then there must be a newline (i.e.
> + * text_len > line_len because the previous test succeeded).
> + */
> + if (len + prefix_len + text_len + 1 > buf_size) {
> + /*
> + * Truncate @text_len so that there is enough space
> + * for a prefix. A newline will not be added because
> + * the last line of the text is now truncated and
> + * will not be included.
> + */
> + text_len = (buf_size - len) - prefix_len;

This is confusing. The check requires that one more character gets truncated.
And it should be perfectly fine to truncate '\n' from the previous
line. The final '\n' is always added.

There are more problems with this branch. See below for alternative code.

> +
> + /*
> + * Ensure there is still a newline. Otherwise this
> + * line may have been truncated and will not be
> + * included.
> + */
> + if (memchr(text, '\n', text_len) == NULL)
> break;

This looks superfluous. We do this check only when the first check of "line_len"
passed. It means that at least one line is printable.

I would personally replace this part of the code by:

/*
* Truncate the remaining text when it no longer
* fits into the buffer.
*
* Note that this check could fail only when
* text_len > line_len because the previous check
* passed.
*/
if (len + prefix_len + text_len + 1 > buf_size) {
text_len = buf_size - len - prefix_len - 1;
truncated = 1;
}

Or we could actually switch the two checks and do:

/*
* Truncate the text if it would no longer fit into
* the given buffer.
*/
if (len + prefix_len + text_len + 1 > buf_size) {
text_len = buf_size - len - prefix_len - 1;
truncated = 1;
}

/* Drop even the current line when truncated */
if (line_len > text_len)
break;


> - memcpy(buf + len, prefix, prefix_len);
> - len += prefix_len;
> - memcpy(buf + len, text, text_len);
> - len += text_len;
> - buf[len++] = '\n';
> - } else {
> - /* SYSLOG_ACTION_* buffer size only calculation */
> - len += prefix_len + text_len + 1;
> + /* Note that the last line will not be included. */
> + truncated = true;
> }
>
> - text = next;
> - } while (text);
> + memmove(text + prefix_len, text, text_len);
> + memcpy(text, prefix, prefix_len);
> +
> + /* Advance beyond the newly added prefix and existing line. */
> + text += prefix_len + line_len;
> +
> + /* The remaining text has only decreased by the line. */
> + text_len -= line_len;
> +
> + len += prefix_len + line_len + 1;
> +
> + if (text_len) {
> + /* Advance past the newline. */
> + text++;
> + text_len--;

"text_len" might be zero at this point when the stored string ended
with newline.

It might happen already now when the original text ended by "\n\n".
Only the very last '\n' gets removed in vprintk_store().

It actually looks reasonable to do the extra cycle and printk
the (requested) empty line.

Except that we might end here also when the text was truncated right
before the newline marker. Well, this is a corner case, the string
is truncated anyway, so we could do whatever is easier in this case.


Well, it will need another update after we remove logbuf_lock. I think that
we will need to store the final '\n'. Or at least we would need to
reserve a space for it.

Anyway, it might deserve some comment somewhere. See below for
alternative code.


> + } else {
> + /* The final line, add a newline. */
> + *text = '\n';
> + break;
> + }

I do not have a strong opinion. The code does things in small steps that
can be checked easily. But it has many steps and it might hide some
subtle catches like the one commented on right above.

I wonder if the following would be easier:

len += prefix_len + line_len + 1;

/* Add the trailing newline removed in vprintk_store(). */
if (text_len == line_len) {
text[prefix_len + line_len] = '\n';
break;
}

/* Advance beyond the added prefix and the related line */
text += prefix_len + line_len + 1;

/*
* The remaining text has only decreased by the line.
*
* A special case is when text_len becomes zero. It
* happens when the original string ended with "\n\n".
* The cycle is correctly repeated and the empty line
* with a prefix is printed.
*/
text_len -= (line_len + 1);
> + }
>
> return len;
> }
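(The in-place prefixing that record_print_text() performs can be illustrated with a simplified userspace sketch. This is hypothetical code: it shifts each line right with memmove() and copies the prefix in front of it, but it replaces the truncation handling under discussion with a plain stop.)

```c
#include <string.h>

/* Userspace sketch of record_print_text()'s in-place prefixing:
 * shift each line right and copy the prefix in front of it. */
static size_t prefix_lines(char *buf, size_t text_len, size_t buf_size,
			   const char *prefix, size_t prefix_len)
{
	char *text = buf;
	size_t len = 0;

	for (;;) {
		char *next = memchr(text, '\n', text_len);
		size_t line_len = next ? (size_t)(next - text) : text_len;

		/* Room to shift all remaining text plus a prefix and a
		 * newline? (The real code truncates here; the sketch
		 * just stops.) */
		if (len + prefix_len + text_len + 1 > buf_size)
			break;

		memmove(text + prefix_len, text, text_len);
		memcpy(text, prefix, prefix_len);

		len += prefix_len + line_len + 1;

		/* Last line: add back the newline removed on store. */
		if (text_len == line_len) {
			text[prefix_len + line_len] = '\n';
			break;
		}

		/* Advance past the prefix, the line, and its newline. */
		text += prefix_len + line_len + 1;
		text_len -= line_len + 1;
	}
	return len;
}
```

For example, a two-line record "aa\nbb" with prefix "P: " becomes "P: aa\nP: bb\n" (12 bytes).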
>
> +static size_t get_record_text_size(struct printk_info *info,
> + unsigned int line_count,
> + bool syslog, bool time)
> +{
> + char prefix[PREFIX_MAX];
> + size_t prefix_len;
> +
> + prefix_len = info_print_prefix(info, syslog, time, prefix);
> +
> + /*
> + * Each line will be preceded with a prefix. The intermediate
> + * newlines are already within the text, but a final trailing
> + * newline will be added.
> + */
> + return ((prefix_len * line_count) + info->text_len + 1);

Note: This might need an update when logbuf_lock gets removed and the
trailing newline is no longer removed. We will need to explicitly
add it when it is missing.

We might need to check the LOG_NEWLINE flag here. Or come up with an
even better solution.

> +}
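(The estimate in the quoted hunk is simple enough to state as a userspace sketch. Hypothetical code: prefix_len and line_count are passed in directly instead of being derived from the record's info.)

```c
/* Sketch of get_record_text_size(): each line gets a prefix, the
 * intermediate newlines are already in the text, and one trailing
 * newline is added back at print time. */
static unsigned long record_print_size(unsigned long prefix_len,
				       unsigned int line_count,
				       unsigned long text_len)
{
	return prefix_len * line_count + text_len + 1;
}
```

A three-line, 40-byte record with a 16-byte prefix thus prints as 16 * 3 + 40 + 1 = 89 bytes.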

Best Regards,
Petr

2020-06-25 14:14:41

by John Ogness

Subject: Re: [PATCH v3 0/3] printk: replace ringbuffer

Hi Dave,

On 2020-06-25, Dave Young <[email protected]> wrote:
> On 06/18/20 at 04:55pm, John Ogness wrote:
>> Here is a v3 for the first series to rework the printk
>> subsystem. The v2 and history are here [0]. This first series
>> only replaces the existing ringbuffer implementation. No locking
>> is removed. No semantics/behavior of printk are changed.
>>
>> Reviews on the ringbuffer are still ongoing, but I was asked to
>> post this new version since several changes from v2 have been
>> already agreed upon.
>>
>> The series is based on v5.8-rc1.
>
> Do you have a link to the kdump userspace part so that people can try
> it and do some testing?
>
> E.g. a makedumpfile/crash tool git branch, etc.

I have set up github forks, each with the RFC patch on top in their
respective "printk" branch:

https://github.com/Linutronix/crash.git

https://github.com/Linutronix/makedumpfile.git

Note that for crash, only the "log" command (using debug symbols) is
implemented. The --log argument (using VMCOREINFO) is not implemented.

For makedumpfile both symbol and VMCOREINFO variants are implemented.

The VMCOREINFO implementation in makedumpfile shows that there is enough
VMCOREINFO exported. So it will not be a problem to implement that for
crash as well.

John Ogness

2020-06-25 15:20:57

by Petr Mladek

Subject: pending output optimization: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On Thu 2020-06-18 16:55:19, John Ogness wrote:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlying ringbuffer.
>
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2009,9 +2056,9 @@ asmlinkage int vprintk_emit(int facility, int level,
>
> /* This stops the holder of console_sem just where we want him */
> logbuf_lock_irqsave(flags);
> - curr_log_seq = log_next_seq;
> + pending_output = !prb_read_valid(prb, console_seq, NULL);
> printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
> - pending_output = (curr_log_seq != log_next_seq);
> + pending_output &= prb_read_valid(prb, console_seq, NULL);

This will stop working after we remove the locks. Consoles will be
able to handle messages while a new one is being added. There will be
no guarantee that someone is still handling the previously pending output.

Please, always handle consoles when printed_len is not zero!!!

The pending output was just an optimization added recently. Nobody
requested it. It was just an idea that made sense.

This new code is a ticking bomb. It is far from obvious that it _must_
get removed after we remove the lock. And it will be hard to debug
why the consoles are sometimes not handled.


> logbuf_unlock_irqrestore(flags);
>
> /* If called from the scheduler, we can not call up(). */

Best Regards,
Petr

2020-06-25 15:22:08

by Petr Mladek

Subject: syslog size unread: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On Thu 2020-06-18 16:55:19, John Ogness wrote:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlying ringbuffer.
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1609,11 +1633,15 @@ int do_syslog(int type, char __user *buf, int len, int source)
> break;
> /* Number of chars in the log buffer */
> case SYSLOG_ACTION_SIZE_UNREAD:
> + if (source != SYSLOG_FROM_PROC) {
> + text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
> + if (!text)
> + return -ENOMEM;

The buffer is needed only to count lines in order to compute the size
of the added prefixes. Could we use the new prb_read_valid_info(), which
allows getting the number of lines without actually reading the buffer?

> +
> + }
> logbuf_lock_irq();
> - if (syslog_seq < log_first_seq) {
> + if (syslog_seq < prb_first_seq(prb)) {
> /* messages are gone, move to first one */
> - syslog_seq = log_first_seq;
> - syslog_idx = log_first_idx;
> + syslog_seq = prb_first_seq(prb);
> syslog_partial = 0;
> }
> if (source == SYSLOG_FROM_PROC) {
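(The line count in question can be computed from the stored text without copying it out. A hypothetical userspace sketch of what a prb_read_valid_info()-style helper could count:)

```c
#include <string.h>

/* Count the lines in a record's text. The stored text carries no
 * trailing newline, so even an empty record counts as one line. */
static unsigned int count_lines(const char *text, size_t text_len)
{
	const char *end = text + text_len;
	const char *p = text;
	unsigned int lines = 1;

	while ((p = memchr(p, '\n', end - p)) != NULL) {
		lines++;
		p++;
	}
	return lines;
}
```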

Best Regards,
Petr

2020-06-25 15:29:28

by Petr Mladek

Subject: Re: record_printk_text tricks: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On Thu 2020-06-25 14:09:46, Petr Mladek wrote:
> On Thu 2020-06-18 16:55:19, John Ogness wrote:
> > Replace the existing ringbuffer usage and implementation with
> > lockless ringbuffer usage. Even though the new ringbuffer does not
> > require locking, all existing locking is left in place. Therefore,
> > this change is purely replacing the underlying ringbuffer.
>
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > +static size_t record_print_text(struct printk_record *r, bool syslog,
> > + bool time)

[...]

> > +static size_t get_record_text_size(struct printk_info *info,
> > + unsigned int line_count,
> > + bool syslog, bool time)
> > +{

Nit: This should get called get_record_print_text_size(). It will make
it clearer that it counts the prefixes added for printing.

Best Regards,
Petr

PS: I have finished reviewing this 3rd patch. I am going to look at
the 2nd patch next week if I find the courage ;-)

2020-06-26 13:51:57

by John Ogness

Subject: Re: truncate dict: was: Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

On 2020-06-25, Petr Mladek <[email protected]> wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -594,22 +473,26 @@ static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
>> #define MAX_LOG_TAKE_PART 4
>> static const char trunc_msg[] = "<truncated>";
>>
>> -static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len,
>> - u16 *dict_len, u32 *pad_len)
>> +static void truncate_msg(u16 *text_len, u16 *trunc_msg_len, u16 *dict_len)
>> {
>> /*
>> * The message should not take the whole buffer. Otherwise, it might
>> * get removed too soon.
>> */
>> u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART;
>> +
>> if (*text_len > max_text_len)
>> *text_len = max_text_len;
>> - /* enable the warning message */
>> +
>> + /* enable the warning message (if there is room) */
>> *trunc_msg_len = strlen(trunc_msg);
>> + if (*text_len >= *trunc_msg_len)
>> + *text_len -= *trunc_msg_len;
>> + else
>> + *trunc_msg_len = 0;
>> +
>> /* disable the "dict" completely */
>> *dict_len = 0;
>
> The dictionary no longer needs to be removed at this point.
> It is stored in a separate buffer and is ignored by prb_reserve()
> when there is not enough space for it.

Agreed. The @dict_len argument will also be dropped from the function.

John Ogness

2020-06-26 16:38:26

by John Ogness

Subject: Re: buffer allocation: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On 2020-06-25, Petr Mladek <[email protected]> wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1176,11 +1068,46 @@ static void __init set_percpu_data_ready(void)
>> __printk_percpu_data_ready = true;
>> }
>>
>> +static unsigned int __init add_to_rb(struct printk_ringbuffer *rb,
>> + struct printk_record *r)
>> +{
>> + struct prb_reserved_entry e;
>> + struct printk_record dest_r;
>> +
>> + prb_rec_init_wr(&dest_r, r->info->text_len, r->info->dict_len);
>> +
>> + if (!prb_reserve(&e, rb, &dest_r))
>> + return 0;
>> +
>> + memcpy(&dest_r.text_buf[0], &r->text_buf[0], dest_r.text_buf_size);
>> + if (dest_r.dict_buf) {
>> + memcpy(&dest_r.dict_buf[0], &r->dict_buf[0],
>> + dest_r.dict_buf_size);
>> + }
>> + dest_r.info->facility = r->info->facility;
>> + dest_r.info->level = r->info->level;
>> + dest_r.info->flags = r->info->flags;
>> + dest_r.info->ts_nsec = r->info->ts_nsec;
>> + dest_r.info->caller_id = r->info->caller_id;
>> +
>> + prb_commit(&e);
>> +
>> + return prb_record_text_space(&e);
>> +}
>> +
>> +static char setup_text_buf[CONSOLE_EXT_LOG_MAX] __initdata;
>> +static char setup_dict_buf[CONSOLE_EXT_LOG_MAX] __initdata;
>> +
>> void __init setup_log_buf(int early)
>> {
>> + struct prb_desc *new_descs;
>> + struct printk_info info;
>> + struct printk_record r;
>> unsigned long flags;
>> + char *new_dict_buf;
>> char *new_log_buf;
>> unsigned int free;
>> + u64 seq;
>>
>> /*
>> * Some archs call setup_log_buf() multiple times - first is very
>> @@ -1201,17 +1128,50 @@ void __init setup_log_buf(int early)
>>
>> new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
>> if (unlikely(!new_log_buf)) {
>> - pr_err("log_buf_len: %lu bytes not available\n",
>> + pr_err("log_buf_len: %lu text bytes not available\n",
>> new_log_buf_len);
>> return;
>> }
>>
>> + new_dict_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
>> + if (unlikely(!new_dict_buf)) {
>> + /* dictionary failure is allowed */
>> + pr_err("log_buf_len: %lu dict bytes not available\n",
>> + new_log_buf_len);
>> + }
>> +
>> + new_descs = memblock_alloc((new_log_buf_len >> PRB_AVGBITS) *
>> + sizeof(struct prb_desc), LOG_ALIGN);
>> + if (unlikely(!new_descs)) {
>> + pr_err("log_buf_len: %lu desc bytes not available\n",
>> + new_log_buf_len >> PRB_AVGBITS);
>> + if (new_dict_buf)
>> + memblock_free(__pa(new_dict_buf), new_log_buf_len);
>> + memblock_free(__pa(new_log_buf), new_log_buf_len);
>> + return;
>> + }
>> +
>> + prb_rec_init_rd(&r, &info,
>> + &setup_text_buf[0], sizeof(setup_text_buf),
>> + &setup_dict_buf[0], sizeof(setup_dict_buf));
>> +
>> logbuf_lock_irqsave(flags);
>> +
>> + prb_init(&printk_rb_dynamic,
>> + new_log_buf, bits_per(new_log_buf_len) - 1,
>> + new_dict_buf, bits_per(new_log_buf_len) - 1,
>
> This does not check whether new_dict_buf was really allocated.

Thank you.

> I suggest making it easy and switching to the new buffers only when
> all three could be allocated.

Agreed.

> Well, I still feel a bit uneasy about these PRB_AVGBITS operations,
> including the new_log_buf_len >> PRB_AVGBITS used above.
>
> A safer design would be to add some sanity checks at the beginning
> of the function. And maybe convert new_log_buf_len to a number of bits.
> Then operate with the number of bits in the rest of the function. It
> might be easier to make sure that we are on the safe side.

I will clean it up with additional checks and temporary variables.

>> log_buf_len = new_log_buf_len;
>> log_buf = new_log_buf;
>> new_log_buf_len = 0;
>> - free = __LOG_BUF_LEN - log_next_idx;
>> - memcpy(log_buf, __log_buf, __LOG_BUF_LEN);
>> +
>> + free = __LOG_BUF_LEN;
>> + prb_for_each_record(0, &printk_rb_static, seq, &r)
>> + free -= add_to_rb(&printk_rb_dynamic, &r);
>> +
>> + prb = &printk_rb_dynamic;
>
> This might deserve a comment that this is safe (no lost message)
> because it is called early enough when everything is still running
> on the boot CPU.

I will add a comment and an extra check to make sure.

Once everything is lockless, new messages could appear (for example, due
to NMI messages). The simple check should probably change to a loop. But
let us not worry about that at this point.

Below is a new version of the relevant patch snippets against mainline
just to show you how I plan to make it look for the next version.

John Ogness

--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ @@
+#define PRB_AVGBITS 5 /* 32 character average length */
+
+#if CONFIG_LOG_BUF_SHIFT <= PRB_AVGBITS
+#error CONFIG_LOG_BUF_SHIFT value too small.
+#endif
+_DEFINE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
+ PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]);
+
@@ @@
void __init setup_log_buf(int early)
{
+ unsigned int new_descs_count;
+ struct prb_desc *new_descs;
+ struct printk_info info;
+ struct printk_record r;
+ size_t new_descs_size;
unsigned long flags;
+ char *new_dict_buf;
char *new_log_buf;
unsigned int free;
+ u64 seq;

/*
* Some archs call setup_log_buf() multiple times - first is very
@@ @@ void __init setup_log_buf(int early)
if (!new_log_buf_len)
return;

+ new_descs_count = new_log_buf_len >> PRB_AVGBITS;
+ if (new_descs_count == 0) {
+ pr_err("new_log_buf_len: %lu too small\n", new_log_buf_len);
+ return;
+ }
+
new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
if (unlikely(!new_log_buf)) {
- pr_err("log_buf_len: %lu bytes not available\n",
- new_log_buf_len);
+ pr_err("log_buf_len: %lu text bytes not available\n",
+ new_log_buf_len);
+ return;
+ }
+
+ new_dict_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
+ if (unlikely(!new_dict_buf)) {
+ pr_err("log_buf_len: %lu dict bytes not available\n",
+ new_log_buf_len);
+ memblock_free(__pa(new_log_buf), new_log_buf_len);
return;
}

+ new_descs_size = new_descs_count * sizeof(struct prb_desc);
+ new_descs = memblock_alloc(new_descs_size, LOG_ALIGN);
+ if (unlikely(!new_descs)) {
+ pr_err("log_buf_len: %lu desc bytes not available\n",
+ new_descs_size);
+ memblock_free(__pa(new_dict_buf), new_log_buf_len);
+ memblock_free(__pa(new_log_buf), new_log_buf_len);
+ return;
+ }
+
+ prb_rec_init_rd(&r, &info,
+ &setup_text_buf[0], sizeof(setup_text_buf),
+ &setup_dict_buf[0], sizeof(setup_dict_buf));
+
+ prb_init(&printk_rb_dynamic,
+ new_log_buf, order_base_2(new_log_buf_len),
+ new_dict_buf, order_base_2(new_log_buf_len),
+ new_descs, order_base_2(new_descs_count));
+
logbuf_lock_irqsave(flags);
+
log_buf_len = new_log_buf_len;
log_buf = new_log_buf;
new_log_buf_len = 0;
- free = __LOG_BUF_LEN - log_next_idx;
- memcpy(log_buf, __log_buf, __LOG_BUF_LEN);
+
+ free = __LOG_BUF_LEN;
+ prb_for_each_record(0, &printk_rb_static, seq, &r)
+ free -= add_to_rb(&printk_rb_dynamic, &r);
+
+ /*
+ * This is early enough that everything is still running on the
+ * boot CPU and interrupts are disabled. So no new messages will
+ * appear during this switch.
+ */
+ prb = &printk_rb_dynamic;
+
logbuf_unlock_irqrestore(flags);

+ /* Verify no messages were dropped. */
+ if (seq != prb_next_seq(&printk_rb_static)) {
+ pr_err("dropped %llu messages\n",
+ prb_next_seq(&printk_rb_static) - seq);
+ }
+
pr_info("log_buf_len: %u bytes\n", log_buf_len);
pr_info("early log buf free: %u(%u%%)\n",
free, (free * 100) / __LOG_BUF_LEN);

2020-06-26 23:26:41

by John Ogness

Subject: Re: record_printk_text tricks: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On 2020-06-25, Petr Mladek <[email protected]> wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1344,72 +1304,150 @@ static size_t print_prefix(const struct printk_log *msg, bool syslog,
>> return len;
>> }
>>
>> -static size_t msg_print_text(const struct printk_log *msg, bool syslog,
>> - bool time, char *buf, size_t size)
>> +static size_t record_print_text(struct printk_record *r, bool syslog,
>> + bool time)
>> {
>> - const char *text = log_text(msg);
>> - size_t text_size = msg->text_len;
>> - size_t len = 0;
>> + size_t text_len = r->info->text_len;
>> + size_t buf_size = r->text_buf_size;
>> + char *text = r->text_buf;
>> char prefix[PREFIX_MAX];
>> - const size_t prefix_len = print_prefix(msg, syslog, time, prefix);
>> + bool truncated = false;
>> + size_t prefix_len;
>> + size_t line_len;
>> + size_t len = 0;
>> + char *next;
>>
>> - do {
>> - const char *next = memchr(text, '\n', text_size);
>> - size_t text_len;
>> + prefix_len = info_print_prefix(r->info, syslog, time, prefix);
>>
>> + /*
>> + * Add the prefix for each line by shifting the rest of the text to
>> + * make room for each prefix. If the buffer is not large enough for
>> + * all the prefixes, then drop the trailing text and report the
>> + * largest length that includes full lines with their prefixes.
>
> This should go up as the function description.
>
> Sigh, this function makes (and already made) many subtle decisions. We
> should mention them in the description. My current understanding is
> the following:
>
> /*
> * Prepare the record for printing. The text is shifted in the given
> * buffer to avoid a need for another one. The following operations
> * are done:
> *
> * - Add prefix for each line.
> * - Add the trailing newline that has been removed in vprintk_store().
> * - Drop truncated lines that no longer fit into the buffer.
> *
> * Return: The length of the updated text, including the added
> * prefixes and the newline. The dropped line(s) are not counted.
> */

OK.

>> + *
>> + * @text_len: bytes of unprocessed text
>> + * @line_len: bytes of current line (newline _not_ included)
>> + * @text: pointer to beginning of current line
>> + * @len: number of bytes processed (size of r->text_buf done)
>
> Please, move this to the variable declaration.

I find it more helpful to see these critical definitions directly above
the loop. When I add them to the variable declarations, I find it really
ugly and hard to reference. Maybe you could show me an example of what
you want it to look like?

>> + */
>> + for (;;) {
>> + next = memchr(text, '\n', text_len);
>> if (next) {
>> - text_len = next - text;
>> - next++;
>> - text_size -= next - text;
>> + line_len = next - text;
>> } else {
>> - text_len = text_size;
>> + /*
>> + * No newline. If the text was previously truncated,
>> + * assume this line was truncated and do not include
>> + * it.
>> + */
>
> The word "assume" is confusing. The last line _must_ have been truncated when
> "truncated" is set. I would write:
>
> /* Drop incomplete truncated lines. */

OK.

>> + if (truncated)
>> + break;
>> + line_len = text_len;
>> }
>>
>> - if (buf) {
>> - if (prefix_len + text_len + 1 >= size - len)
>> + /*
>> + * Ensure there is enough buffer available to shift this line
>> + * (and add a newline at the end).
>> + */
>> + if (len + prefix_len + line_len + 1 > buf_size)
>> + break;
>> +
>> + /*
>> + * Ensure there is enough buffer available to shift all
>> + * remaining text (and add a newline at the end). If this
>> + * test fails, then there must be a newline (i.e.
>> + * text_len > line_len because the previous test succeeded).
>> + */
>> + if (len + prefix_len + text_len + 1 > buf_size) {
>> + /*
>> + * Truncate @text_len so that there is enough space
>> + * for a prefix. A newline will not be added because
>> + * the last line of the text is now truncated and
>> + * will not be included.
>> + */
>> + text_len = (buf_size - len) - prefix_len;
>
> This is confusing. The check requires that one more character gets truncated.
> And it should be perfectly fine to truncate '\n' from the previous
> line. The final '\n' is always added.

Agreed.

>> +
>> + /*
>> + * Ensure there is still a newline. Otherwise this
>> + * line may have been truncated and will not be
>> + * included.
>> + */
>> + if (memchr(text, '\n', text_len) == NULL)
>> + break;
>
> This looks superfluous. We do this check only when the first check of "line_len"
> passed. It means that at least one line is printable.

You are correct.

> I would personally replace this part of the code by:
>
> /*
> * Truncate the remaining text when it no longer
> * fits into the buffer.
> *
> * Note that this check could fail only when
> * text_len > line_len because the previous check
> * passed.
> */
> if (len + prefix_len + text_len + 1 > buf_size) {
> text_len = buf_size - len - prefix_len - 1;
> truncated = 1;
> }
>
> Or we could actually switch the two checks and do:
>
> /*
> * Truncate the text if it would no longer fit into
> * the given buffer.
> */
> if (len + prefix_len + text_len + 1 > buf_size) {
> text_len = buf_size - len - prefix_len - 1;
> truncated = 1;
> }
>
> /* Drop even the current line when truncated */
> if (line_len > text_len)
> break;

Yes, I like how you switched the checks. Also, the second check can be
nested inside the first one.

However, the "line_len > text_len" check will not work here because
text_len is unsigned and could underflow with this logic. I think doing
the math twice (as I did) is the simplest.

>> - memcpy(buf + len, prefix, prefix_len);
>> - len += prefix_len;
>> - memcpy(buf + len, text, text_len);
>> - len += text_len;
>> - buf[len++] = '\n';
>> - } else {
>> - /* SYSLOG_ACTION_* buffer size only calculation */
>> - len += prefix_len + text_len + 1;
>> + /* Note that the last line will not be included. */
>> + truncated = true;
>> }
>>
>> - text = next;
>> - } while (text);
>> + memmove(text + prefix_len, text, text_len);
>> + memcpy(text, prefix, prefix_len);
>> +
>> + /* Advance beyond the newly added prefix and existing line. */
>> + text += prefix_len + line_len;
>> +
>> + /* The remaining text has only decreased by the line. */
>> + text_len -= line_len;
>> +
>> + len += prefix_len + line_len + 1;
>> +
>> + if (text_len) {
>> + /* Advance past the newline. */
>> + text++;
>> + text_len--;
>
> "text_len" might be zero at this point when the stored string ended
> with newline.
>
> It can already happen now when the original text ends with "\n\n".
> Only the very last '\n' gets removed in vprintk_store().
>
> It actually looks reasonable to do the extra cycle and printk
> the (requested) empty line.
>
> Except that we might end here also when the text was truncated right
> before the newline marker. Well, this is a corner case, the string
> is truncated anyway, so we could do whatever is easier in this case.

I agree that the code is quite tricky here. I will refactor as you have
suggested.

> Well, it will need another update after we remove logbuf_lock. I think that
> we will need to store the final '\n'. Or at least we would need to
> reserve a space for it.

What does the logbuf_lock have to do with this? What does this have to
do with reserving extra space? We are modifying the reader's buffer
in-place, not the datablock data.

> Anyway, it might deserve some comment somewhere. See below for
> alternative code.
>
>> + } else {
>> + /* The final line, add a newline. */
>> + *text = '\n';
>> + break;
>> + }
>
> I do not have strong opinion. The code does things by small steps that
> might be checked easily. But it has many steps and it might hide some
> subtle catches like the one commented right above.

Your suggestions to simplify are good.

>> +static size_t get_record_text_size(struct printk_info *info,
>> + unsigned int line_count,
>> + bool syslog, bool time)
>> +{
>> + char prefix[PREFIX_MAX];
>> + size_t prefix_len;
>> +
>> + prefix_len = info_print_prefix(info, syslog, time, prefix);
>> +
>> + /*
>> + * Each line will be preceded with a prefix. The intermediate
>> + * newlines are already within the text, but a final trailing
>> + * newline will be added.
>> + */
>> + return ((prefix_len * line_count) + info->text_len + 1);
>
> Note: This might need an update when logbuf_lock gets removed and the
> trailing newline is no longer removed. We will need to explicitly
> add it when it is missing.
>
> We might need to check LOG_NEWLINE flag here. Or come with even better
> solution.

Again, I do not understand what this has to do with the
logbuf_lock. Please explain.

Below is how I would like to implement record_print_text() after
incorporating your suggestions. Please show me how you would like to see
the important variable role comments formatted. Thanks.

John Ogness

/*
* Prepare the record for printing. The text is shifted within the given
* buffer to avoid a need for another one. The following operations are
* done:
*
* - Add prefix for each line.
* - Add the trailing newline that has been removed in vprintk_store().
* - Drop truncated lines that no longer fit into the buffer.
*
* Return: The length of the updated/prepared text, including the added
* prefixes and the newline. The dropped line(s) are not counted.
*/
static size_t record_print_text(struct printk_record *r, bool syslog,
bool time)
{
size_t text_len = r->info->text_len;
size_t buf_size = r->text_buf_size;
char *text = r->text_buf;
char prefix[PREFIX_MAX];
bool truncated = false;
size_t prefix_len;
size_t line_len;
size_t len = 0;
char *next;

prefix_len = info_print_prefix(r->info, syslog, time, prefix);

/*
* @text_len: bytes of unprocessed text
* @line_len: bytes of current line _without_ newline
* @text: pointer to beginning of current line
* @len: number of bytes prepared in r->text_buf
*/
for (;;) {
next = memchr(text, '\n', text_len);
if (next) {
line_len = next - text;
} else {
/* Drop truncated line(s). */
if (truncated)
break;
line_len = text_len;
}

/*
* Truncate the text if there is not enough space to add the
* prefix and a trailing newline.
*/
if (len + prefix_len + text_len + 1 > buf_size) {
/* Drop even the current line if no space. */
if (len + prefix_len + line_len + 1 > buf_size)
break;

text_len = buf_size - len - prefix_len - 1;
truncated = true;
}

memmove(text + prefix_len, text, text_len);
memcpy(text, prefix, prefix_len);

len += prefix_len + line_len + 1;

if (text_len == line_len) {
/*
* Add the trailing newline removed in
* vprintk_store().
*/
text[prefix_len + line_len] = '\n';
break;
}

/*
* Advance beyond the added prefix and the related line with
* its newline.
*/
text += prefix_len + line_len + 1;

/*
* The remaining text has only decreased by the line with its
* newline.
*
* Note that @text_len can become zero. It happens when @text
* ended with a newline (either due to truncation or the
* original string ending with "\n\n"). The loop is correctly
* repeated and (if not truncated) an empty line with a prefix
* will be prepared.
*/
text_len -= line_len + 1;
}

return len;
}

2020-06-29 19:35:52

by Paul E. McKenney

Subject: Re: [PATCH v3 2/3] printk: add lockless ringbuffer

On Thu, Jun 18, 2020 at 04:55:18PM +0206, John Ogness wrote:
> Introduce a multi-reader multi-writer lockless ringbuffer for storing
> the kernel log messages. Readers and writers may use their API from
> any context (including scheduler and NMI). This ringbuffer will make
> it possible to decouple printk() callers from any context, locking,
> or console constraints. It also makes it possible for readers to have
> full access to the ringbuffer contents at any time and context (for
> example from any panic situation).
>
> The printk_ringbuffer is made up of 3 internal ringbuffers:
>
> desc_ring:
> A ring of descriptors. A descriptor contains all record meta data
> (sequence number, timestamp, loglevel, etc.) as well as internal state
> information about the record and logical positions specifying where in
> the other ringbuffers the text and dictionary strings are located.
>
> text_data_ring:
> A ring of data blocks. A data block consists of an unsigned long
> integer (ID) that maps to a desc_ring index followed by the text
> string of the record.
>
> dict_data_ring:
> A ring of data blocks. A data block consists of an unsigned long
> integer (ID) that maps to a desc_ring index followed by the dictionary
> string of the record.
>
> The internal state information of a descriptor is the key element to
> allow readers and writers to locklessly synchronize access to the data.
>
> Co-developed-by: Petr Mladek <[email protected]>
> Signed-off-by: John Ogness <[email protected]>

The orderings match the comments, although a number could (later!)
be weakened to the easier-to-read smp_load_acquire() and/or
smp_store_release(). So, from a memory-ordering perspective:

Reviewed-by: Paul E. McKenney <[email protected]>

> ---
> kernel/printk/Makefile | 1 +
> kernel/printk/printk_ringbuffer.c | 1674 +++++++++++++++++++++++++++++
> kernel/printk/printk_ringbuffer.h | 352 ++++++
> 3 files changed, 2027 insertions(+)
> create mode 100644 kernel/printk/printk_ringbuffer.c
> create mode 100644 kernel/printk/printk_ringbuffer.h
>
> diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
> index 4d052fc6bcde..eee3dc9b60a9 100644
> --- a/kernel/printk/Makefile
> +++ b/kernel/printk/Makefile
> @@ -2,3 +2,4 @@
> obj-y = printk.o
> obj-$(CONFIG_PRINTK) += printk_safe.o
> obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o
> +obj-$(CONFIG_PRINTK) += printk_ringbuffer.o
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> new file mode 100644
> index 000000000000..75d056436cc5
> --- /dev/null
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -0,0 +1,1674 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/kernel.h>
> +#include <linux/irqflags.h>
> +#include <linux/string.h>
> +#include <linux/errno.h>
> +#include <linux/bug.h>
> +#include "printk_ringbuffer.h"
> +
> +/**
> + * DOC: printk_ringbuffer overview
> + *
> + * Data Structure
> + * --------------
> + * The printk_ringbuffer is made up of 3 internal ringbuffers:
> + *
> + * desc_ring
> + * A ring of descriptors. A descriptor contains all record meta data
> + * (sequence number, timestamp, loglevel, etc.) as well as internal state
> + * information about the record and logical positions specifying where in
> + * the other ringbuffers the text and dictionary strings are located.
> + *
> + * text_data_ring
> + * A ring of data blocks. A data block consists of an unsigned long
> + * integer (ID) that maps to a desc_ring index followed by the text
> + * string of the record.
> + *
> + * dict_data_ring
> + * A ring of data blocks. A data block consists of an unsigned long
> + * integer (ID) that maps to a desc_ring index followed by the dictionary
> + * string of the record.
> + *
> + * The internal state information of a descriptor is the key element to allow
> + * readers and writers to locklessly synchronize access to the data.
> + *
> + * Implementation
> + * --------------
> + *
> + * Descriptor Ring
> + * ~~~~~~~~~~~~~~~
> + * The descriptor ring is an array of descriptors. A descriptor contains all
> + * the meta data of a printk record as well as blk_lpos structs pointing to
> + * associated text and dictionary data blocks (see "Data Rings" below). Each
> + * descriptor is assigned an ID that maps directly to index values of the
> + * descriptor array and has a state. The ID and the state are bitwise combined
> + * into a single descriptor field named @state_var, allowing ID and state to
> + * be synchronously and atomically updated.
> + *
> + * Descriptors have three states:
> + *
> + * reserved
> + * A writer is modifying the record.
> + *
> + * committed
> + * The record and all its data are complete and available for reading.
> + *
> + * reusable
> + * The record exists, but its text and/or dictionary data may no longer
> + * be available.
> + *
> + * Querying the @state_var of a record requires providing the ID of the
> + * descriptor to query. This can yield a possible fourth (pseudo) state:
> + *
> + * miss
> + * The descriptor being queried has an unexpected ID.
> + *
> + * The descriptor ring has a @tail_id that contains the ID of the oldest
> + * descriptor and @head_id that contains the ID of the newest descriptor.
> + *
> + * When a new descriptor should be created (and the ring is full), the tail
> + * descriptor is invalidated by first transitioning to the reusable state and
> + * then invalidating all tail data blocks up to and including the data blocks
> + * associated with the tail descriptor (for text and dictionary rings). Then
> + * @tail_id is advanced, followed by advancing @head_id. And finally the
> + * @state_var of the new descriptor is initialized to the new ID and reserved
> + * state.
> + *
> + * The @tail_id can only be advanced if the new @tail_id would be in the
> + * committed or reusable queried state. This makes it possible that a valid
> + * sequence number of the tail is always available.
> + *
> + * Data Rings
> + * ~~~~~~~~~~
> + * The two data rings (text and dictionary) function identically. They exist
> + * separately so that their buffer sizes can be individually set and they do
> + * not affect one another.
> + *
> + * Data rings are byte arrays composed of data blocks. Data blocks are
> + * referenced by blk_lpos structs that point to the logical position of the
> + * beginning of a data block and the beginning of the next adjacent data
> + * block. Logical positions are mapped directly to index values of the byte
> + * array ringbuffer.
> + *
> + * Each data block consists of an ID followed by the raw data. The ID is the
> + * identifier of a descriptor that is associated with the data block. A data
> + * block is considered valid if all of the following conditions are met:
> + *
> + * 1) The descriptor associated with the data block is in the committed
> + * or reusable queried state.
> + *
> + * 2) The blk_lpos struct within the descriptor associated with the data
> + * block references back to the same data block.
> + *
> + * 3) The data block is within the head/tail logical position range.
> + *
> + * If the raw data of a data block would extend beyond the end of the byte
> + * array, only the ID of the data block is stored at the logical position
> + * and the full data block (ID and raw data) is stored at the beginning of
> + * the byte array. The referencing blk_lpos will point to the ID before the
> + * wrap and the next data block will be at the logical position adjacent the
> + * full data block after the wrap.
> + *
> + * Data rings have a @tail_lpos that points to the beginning of the oldest
> + * data block and a @head_lpos that points to the logical position of the
> + * next (not yet existing) data block.
> + *
> + * When a new data block should be created (and the ring is full), tail data
> + * blocks will first be invalidated by putting their associated descriptors
> + * into the reusable state and then pushing the @tail_lpos forward beyond
> + * them. Then the @head_lpos is pushed forward and is associated with a new
> + * descriptor. If a data block is not valid, the @tail_lpos cannot be
> + * advanced beyond it.
> + *
> + * Usage
> + * -----
> + * Here are some simple examples demonstrating writers and readers. For the
> + * examples a global ringbuffer (test_rb) is available (which is not the
> + * actual ringbuffer used by printk)::
> + *
> + * DECLARE_PRINTKRB(test_rb, 15, 5, 3);
> + *
> + * This ringbuffer allows up to 32768 records (2 ^ 15) and has a size of
> + * 1 MiB (2 ^ (15 + 5)) for text data and 256 KiB (2 ^ (15 + 3)) for
> + * dictionary data.
> + *
> + * Sample writer code::
> + *
> + * const char *dictstr = "dictionary text";
> + * const char *textstr = "message text";
> + * struct prb_reserved_entry e;
> + * struct printk_record r;
> + *
> + * // specify how much to allocate
> + * prb_rec_init_wr(&r, strlen(textstr) + 1, strlen(dictstr) + 1);
> + *
> + * if (prb_reserve(&e, &test_rb, &r)) {
> + * snprintf(r.text_buf, r.text_buf_size, "%s", textstr);
> + *
> + * // dictionary allocation may have failed
> + * if (r.dict_buf)
> + * snprintf(r.dict_buf, r.dict_buf_size, "%s", dictstr);
> + *
> + * r.info->ts_nsec = local_clock();
> + *
> + * prb_commit(&e);
> + * }
> + *
> + * Sample reader code::
> + *
> + * struct printk_info info;
> + * struct printk_record r;
> + * char text_buf[32];
> + * char dict_buf[32];
> + * u64 seq;
> + *
> + * prb_rec_init_rd(&r, &info, &text_buf[0], sizeof(text_buf),
> + * &dict_buf[0], sizeof(dict_buf));
> + *
> + * prb_for_each_record(0, &test_rb, &seq, &r) {
> + * if (info.seq != seq)
> + * pr_warn("lost %llu records\n", info.seq - seq);
> + *
> + * if (info.text_len > r.text_buf_size) {
> + * pr_warn("record %llu text truncated\n", info.seq);
> + * text_buf[sizeof(text_buf) - 1] = 0;
> + * }
> + *
> + * if (info.dict_len > r.dict_buf_size) {
> + * pr_warn("record %llu dict truncated\n", info.seq);
> + * dict_buf[sizeof(dict_buf) - 1] = 0;
> + * }
> + *
> + * pr_info("%llu: %llu: %s;%s\n", info.seq, info.ts_nsec,
> + * &text_buf[0], info.dict_len ? &dict_buf[0] : "");
> + * }
> + *
> + * Note that additional less convenient reader functions are available to
> + * allow complex record access.
> + *
> + * ABA Issues
> + * ~~~~~~~~~~
> + * To help avoid ABA issues, descriptors are referenced by IDs (array index
> + * values with tagged states) and data blocks are referenced by logical
> + * positions (array index values with tagged states). However, on 32-bit
> + * systems the number of tagged states is relatively small such that an ABA
> + * incident is (at least theoretically) possible. For example, if 4 million
> + * maximally sized (1KiB) printk messages were to occur in NMI context on a
> + * 32-bit system, the interrupted context would not be able to recognize that
> + * the 32-bit integer completely wrapped and thus represents a different
> + * data block than the one the interrupted context expects.
> + *
> + * To help combat this possibility, additional state checking is performed
> + * (such as using cmpxchg() even though set() would suffice). These extra
> + * checks are commented as such and will hopefully catch any ABA issue that
> + * a 32-bit system might experience.
> + *
> + * Memory Barriers
> + * ~~~~~~~~~~~~~~~
> + * Multiple memory barriers are used. To simplify proving correctness and
> + * generating litmus tests, lines of code related to memory barriers
> + * (loads, stores, and the associated memory barriers) are labeled::
> + *
> + * LMM(function:letter)
> + *
> + * Comments reference the labels using only the "function:letter" part.
> + *
> + * The memory barrier pairs and their ordering are:
> + *
> + * desc_reserve:D / desc_reserve:B
> + * push descriptor tail (id), then push descriptor head (id)
> + *
> + * desc_reserve:D / data_push_tail:B
> + * push data tail (lpos), then set new descriptor reserved (state)
> + *
> + * desc_reserve:D / desc_push_tail:C
> + * push descriptor tail (id), then set new descriptor reserved (state)
> + *
> + * desc_reserve:D / prb_first_seq:C
> + * push descriptor tail (id), then set new descriptor reserved (state)
> + *
> + * desc_reserve:F / desc_read:D
> + * set new descriptor id and reserved (state), then allow writer changes
> + *
> + * data_alloc:A / desc_read:D
> + * set old descriptor reusable (state), then modify new data block area
> + *
> + * data_alloc:A / data_push_tail:B
> + * push data tail (lpos), then modify new data block area
> + *
> + * prb_commit:B / desc_read:B
> + * store writer changes, then set new descriptor committed (state)
> + *
> + * data_push_tail:D / data_push_tail:A
> + * set descriptor reusable (state), then push data tail (lpos)
> + *
> + * desc_push_tail:B / desc_reserve:D
> + * set descriptor reusable (state), then push descriptor tail (id)
> + */
> +
> +#define DATA_SIZE(data_ring) _DATA_SIZE((data_ring)->size_bits)
> +#define DATA_SIZE_MASK(data_ring) (DATA_SIZE(data_ring) - 1)
> +
> +#define DESCS_COUNT(desc_ring) _DESCS_COUNT((desc_ring)->count_bits)
> +#define DESCS_COUNT_MASK(desc_ring) (DESCS_COUNT(desc_ring) - 1)
> +
> +/* Determine the data array index from a logical position. */
> +#define DATA_INDEX(data_ring, lpos) ((lpos) & DATA_SIZE_MASK(data_ring))
> +
> +/* Determine the desc array index from an ID or sequence number. */
> +#define DESC_INDEX(desc_ring, n) ((n) & DESCS_COUNT_MASK(desc_ring))
> +
> +/* Determine how many times the data array has wrapped. */
> +#define DATA_WRAPS(data_ring, lpos) ((lpos) >> (data_ring)->size_bits)
> +
> +/* Get the logical position at index 0 of the current wrap. */
> +#define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \
> +((lpos) & ~DATA_SIZE_MASK(data_ring))
> +
> +/* Get the ID for the same index of the previous wrap as the given ID. */
> +#define DESC_ID_PREV_WRAP(desc_ring, id) \
> +DESC_ID((id) - DESCS_COUNT(desc_ring))
> +
> +/* A data block: maps to the raw data within the data ring. */
> +struct prb_data_block {
> + unsigned long id;
> + char data[0];
> +};
> +
> +static struct prb_desc *to_desc(struct prb_desc_ring *desc_ring, u64 n)
> +{
> + return &desc_ring->descs[DESC_INDEX(desc_ring, n)];
> +}
> +
> +static struct prb_data_block *to_block(struct prb_data_ring *data_ring,
> + unsigned long begin_lpos)
> +{
> + return (void *)&data_ring->data[DATA_INDEX(data_ring, begin_lpos)];
> +}
> +
> +/*
> + * Increase the data size to account for data block meta data plus any
> + * padding so that the adjacent data block is aligned on the ID size.
> + */
> +static unsigned int to_blk_size(unsigned int size)
> +{
> + struct prb_data_block *db = NULL;
> +
> + size += sizeof(*db);
> + size = ALIGN(size, sizeof(db->id));
> + return size;
> +}
> +
> +/*
> + * Sanity checker for reserve size. The ringbuffer code assumes that a data
> + * block does not exceed the maximum possible size that could fit within the
> + * ringbuffer. This function provides that basic size check so that the
> + * assumption is safe.
> + *
> + * Writers are also not allowed to write 0-sized (data-less) records. Such
> + * records are used only internally by the ringbuffer.
> + */
> +static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size)
> +{
> + struct prb_data_block *db = NULL;
> +
> + /*
> + * Writers are not allowed to write data-less records. Such records
> + * are used only internally by the ringbuffer to denote records where
> + * their data failed to allocate or have been lost.
> + */
> + if (size == 0)
> + return false;
> +
> + /*
> + * Ensure the alignment padded size could possibly fit in the data
> + * array. The largest possible data block must still leave room for
> + * at least the ID of the next block.
> + */
> + size = to_blk_size(size);
> + if (size > DATA_SIZE(data_ring) - sizeof(db->id))
> + return false;
> +
> + return true;
> +}
> +
> +/* The possible responses of a descriptor state-query. */
> +enum desc_state {
> + desc_miss, /* ID mismatch */
> + desc_reserved, /* reserved, in use by writer */
> + desc_committed, /* committed, writer is done */
> + desc_reusable, /* free, not yet used by any writer */
> +};
> +
> +/* Query the state of a descriptor. */
> +static enum desc_state get_desc_state(unsigned long id,
> + unsigned long state_val)
> +{
> + if (id != DESC_ID(state_val))
> + return desc_miss;
> +
> + if (state_val & DESC_REUSE_MASK)
> + return desc_reusable;
> +
> + if (state_val & DESC_COMMITTED_MASK)
> + return desc_committed;
> +
> + return desc_reserved;
> +}
> +
> +/*
> + * Get a copy of a specified descriptor and its queried state. A descriptor
> + * that is not in the committed or reusable state must be considered garbage
> + * by the reader.
> + */
> +static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
> + unsigned long id, struct prb_desc *desc_out)
> +{
> + struct prb_desc *desc = to_desc(desc_ring, id);
> + atomic_long_t *state_var = &desc->state_var;
> + enum desc_state d_state;
> + unsigned long state_val;
> +
> + /* Check the descriptor state. */
> + state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
> + d_state = get_desc_state(id, state_val);
> + if (d_state != desc_committed && d_state != desc_reusable)
> + return d_state;
> +
> + /*
> + * Guarantee the state is loaded before copying the descriptor
> + * content. This avoids copying obsolete descriptor content that might
> + * not apply to the descriptor state. This pairs with prb_commit:B.
> + *
> + * Memory barrier involvement:
> + *
> + * If desc_read:A reads from prb_commit:B, then desc_read:C reads
> + * from prb_commit:A.
> + *
> + * Relies on:
> + *
> + * WMB from prb_commit:A to prb_commit:B
> + * matching
> + * RMB from desc_read:A to desc_read:C
> + */
> + smp_rmb(); /* LMM(desc_read:B) */
> +
> + /*
> + * Copy the descriptor data. The data is not valid until the
> + * state has been re-checked.
> + */
> + memcpy(desc_out, desc, sizeof(*desc_out)); /* LMM(desc_read:C) */
> +
> + /*
> + * 1. Guarantee the descriptor content is loaded before re-checking
> + * the state. This avoids reading an obsolete descriptor state
> + * that may not apply to the copied content. This pairs with
> + * desc_reserve:F.
> + *
> + * Memory barrier involvement:
> + *
> + * If desc_read:C reads from desc_reserve:G, then desc_read:E
> + * reads from desc_reserve:F.
> + *
> + * Relies on:
> + *
> + * WMB from desc_reserve:F to desc_reserve:G
> + * matching
> + * RMB from desc_read:C to desc_read:E
> + *
> + * 2. Guarantee the record data is loaded before re-checking the
> + * state. This avoids reading an obsolete descriptor state that may
> + * not apply to the copied data. This pairs with data_alloc:A.
> + *
> + * Memory barrier involvement:
> + *
> + * If copy_data:A reads from data_alloc:B, then desc_read:E
> + * reads from desc_make_reusable:A.
> + *
> + * Relies on:
> + *
> + * MB from desc_make_reusable:A to data_alloc:B
> + * matching
> + * RMB from desc_read:C to desc_read:E
> + *
> + * Note: desc_make_reusable:A and data_alloc:B can be different
> + * CPUs. However, the data_alloc:B CPU (which performs the
> + * full memory barrier) must have previously seen
> + * desc_make_reusable:A.
> + */
> + smp_rmb(); /* LMM(desc_read:D) */
> +
> + /* Re-check the descriptor state. */
> + state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
> + return get_desc_state(id, state_val);
> +}
> +
> +/*
> + * Take a specified descriptor out of the committed state by attempting
> + * the transition from committed to reusable. Either this context or some
> + * other context will have been successful.
> + */
> +static void desc_make_reusable(struct prb_desc_ring *desc_ring,
> + unsigned long id)
> +{
> + unsigned long val_committed = id | DESC_COMMITTED_MASK;
> + unsigned long val_reusable = val_committed | DESC_REUSE_MASK;
> + struct prb_desc *desc = to_desc(desc_ring, id);
> + atomic_long_t *state_var = &desc->state_var;
> +
> + atomic_long_cmpxchg_relaxed(state_var, val_committed,
> + val_reusable); /* LMM(desc_make_reusable:A) */
> +}
> +
> +/*
> + * Given a data ring (text or dict), put the associated descriptor of each
> + * data block from @lpos_begin until @lpos_end into the reusable state.
> + *
> + * If there is any problem making the associated descriptor reusable, either
> + * the descriptor has not yet been committed or another writer context has
> + * already pushed the tail lpos past the problematic data block. Regardless,
> + * on error the caller can re-load the tail lpos to determine the situation.
> + */
> +static bool data_make_reusable(struct printk_ringbuffer *rb,
> + struct prb_data_ring *data_ring,
> + unsigned long lpos_begin,
> + unsigned long lpos_end,
> + unsigned long *lpos_out)
> +{
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> + struct prb_data_blk_lpos *blk_lpos;
> + struct prb_data_block *blk;
> + enum desc_state d_state;
> + struct prb_desc desc;
> + unsigned long id;
> +
> + /*
> + * Using the provided @data_ring, point @blk_lpos to the correct
> + * blk_lpos within the local copy of the descriptor.
> + */
> + if (data_ring == &rb->text_data_ring)
> + blk_lpos = &desc.text_blk_lpos;
> + else
> + blk_lpos = &desc.dict_blk_lpos;
> +
> + /* Loop until @lpos_begin has advanced to or beyond @lpos_end. */
> + while ((lpos_end - lpos_begin) - 1 < DATA_SIZE(data_ring)) {
> + blk = to_block(data_ring, lpos_begin);
> +
> + /*
> + * Load the block ID from the data block. This is a data race
> + * against a writer that may have newly reserved this data
> + * area. If the loaded value matches a valid descriptor ID,
> + * the blk_lpos of that descriptor will be checked to make
> + * sure it points back to this data block. If the check fails,
> + * the data area has been recycled by another writer.
> + */
> + id = blk->id; /* LMM(data_make_reusable:A) */
> +
> + d_state = desc_read(desc_ring, id, &desc); /* LMM(data_make_reusable:B) */
> +
> + switch (d_state) {
> + case desc_miss:
> + return false;
> + case desc_reserved:
> + return false;
> + case desc_committed:
> + /*
> + * This data block is invalid if the descriptor
> + * does not point back to it.
> + */
> + if (blk_lpos->begin != lpos_begin)
> + return false;
> + desc_make_reusable(desc_ring, id);
> + break;
> + case desc_reusable:
> + /*
> + * This data block is invalid if the descriptor
> + * does not point back to it.
> + */
> + if (blk_lpos->begin != lpos_begin)
> + return false;
> + break;
> + }
> +
> + /* Advance @lpos_begin to the next data block. */
> + lpos_begin = blk_lpos->next;
> + }
> +
> + *lpos_out = lpos_begin;
> + return true;
> +}
> +
> +/*
> + * Advance the data ring tail to at least @lpos. This function puts
> + * descriptors into the reusable state if the tail is pushed beyond
> + * their associated data block.
> + */
> +static bool data_push_tail(struct printk_ringbuffer *rb,
> + struct prb_data_ring *data_ring,
> + unsigned long lpos)
> +{
> + unsigned long tail_lpos_new;
> + unsigned long tail_lpos;
> + unsigned long next_lpos;
> +
> + /* If @lpos is not valid, there is nothing to do. */
> + if (lpos == INVALID_LPOS)
> + return true;
> +
> + /*
> + * Any descriptor states that have transitioned to reusable due to the
> + * data tail being pushed to this loaded value will be visible to this
> + * CPU. This pairs with data_push_tail:D.
> + *
> + * Memory barrier involvement:
> + *
> + * If data_push_tail:A reads from data_push_tail:D, then this CPU can
> + * see desc_make_reusable:A.
> + *
> + * Relies on:
> + *
> + * MB from desc_make_reusable:A to data_push_tail:D
> + * matches
> + * READFROM from data_push_tail:D to data_push_tail:A
> + * thus
> + * READFROM from desc_make_reusable:A to this CPU
> + */
> + tail_lpos = atomic_long_read(&data_ring->tail_lpos); /* LMM(data_push_tail:A) */
> +
> + /*
> + * Loop until the tail lpos is at or beyond @lpos. This condition
> + * may already be satisfied, resulting in no full memory barrier
> + * from data_push_tail:D being performed. However, since this CPU
> + * sees the new tail lpos, any descriptor states that transitioned to
> + * the reusable state must already be visible.
> + */
> + while ((lpos - tail_lpos) - 1 < DATA_SIZE(data_ring)) {
> + /*
> + * Make all descriptors reusable that are associated with
> + * data blocks before @lpos.
> + */
> + if (!data_make_reusable(rb, data_ring, tail_lpos, lpos,
> + &next_lpos)) {
> + /*
> + * 1. Guarantee the block ID loaded in
> + * data_make_reusable() is performed before
> + * reloading the tail lpos. The failed
> + * data_make_reusable() may be due to a newly
> + * recycled data area causing the tail lpos to
> + * have been previously pushed. This pairs with
> + * data_alloc:A.
> + *
> + * Memory barrier involvement:
> + *
> + * If data_make_reusable:A reads from data_alloc:B,
> + * then data_push_tail:C reads from
> + * data_push_tail:D.
> + *
> + * Relies on:
> + *
> + * MB from data_push_tail:D to data_alloc:B
> + * matching
> + * RMB from data_make_reusable:A to
> + * data_push_tail:C
> + *
> + * Note: data_push_tail:D and data_alloc:B can be
> + * different CPUs. However, the data_alloc:B
> + * CPU (which performs the full memory
> + * barrier) must have previously seen
> + * data_push_tail:D.
> + *
> + * 2. Guarantee the descriptor state loaded in
> + * data_make_reusable() is performed before
> + * reloading the tail lpos. The failed
> + * data_make_reusable() may be due to a newly
> + * recycled descriptor causing the tail lpos to
> + * have been previously pushed. This pairs with
> + * desc_reserve:D.
> + *
> + * Memory barrier involvement:
> + *
> + * If data_make_reusable:B reads from
> + * desc_reserve:F, then data_push_tail:C reads
> + * from data_push_tail:D.
> + *
> + * Relies on:
> + *
> + * MB from data_push_tail:D to desc_reserve:F
> + * matching
> + * RMB from data_make_reusable:B to
> + * data_push_tail:C
> + *
> + * Note: data_push_tail:D and desc_reserve:F can
> + * be different CPUs. However, the
> + * desc_reserve:F CPU (which performs the
> + * full memory barrier) must have previously
> + * seen data_push_tail:D.
> + */
> + smp_rmb(); /* LMM(data_push_tail:B) */
> +
> + tail_lpos_new = atomic_long_read(&data_ring->tail_lpos); /* LMM(data_push_tail:C) */
> + if (tail_lpos_new == tail_lpos)
> + return false;
> +
> + /* Another CPU pushed the tail. Try again. */
> + tail_lpos = tail_lpos_new;
> + continue;
> + }
> +
> + /*
> + * Guarantee any descriptor states that have transitioned to
> + * reusable are stored before pushing the tail lpos. A full
> + * memory barrier is needed since other CPUs may have made
> + * the descriptor states reusable. This pairs with
> + * data_push_tail:A.
> + */
> + if (atomic_long_try_cmpxchg(&data_ring->tail_lpos, &tail_lpos,
> + next_lpos)) { /* LMM(data_push_tail:D) */
> + break;
> + }
> + }
> +
> + return true;
> +}
> +
> +/*
> + * Advance the desc ring tail. This function advances the tail by one
> + * descriptor, thus invalidating the oldest descriptor. Before advancing
> + * the tail, the tail descriptor is made reusable and all data blocks up to
> + * and including the descriptor's data block are invalidated (i.e. the data
> + * ring tail is pushed past the data block of the descriptor being made
> + * reusable).
> + */
> +static bool desc_push_tail(struct printk_ringbuffer *rb,
> + unsigned long tail_id)
> +{
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> + enum desc_state d_state;
> + struct prb_desc desc;
> +
> + d_state = desc_read(desc_ring, tail_id, &desc);
> +
> + switch (d_state) {
> + case desc_miss:
> + /*
> + * If the ID is exactly 1 wrap behind the expected, it is
> + * in the process of being reserved by another writer and
> + * must be considered reserved.
> + */
> + if (DESC_ID(atomic_long_read(&desc.state_var)) ==
> + DESC_ID_PREV_WRAP(desc_ring, tail_id)) {
> + return false;
> + }
> +
> + /*
> + * The ID has changed. Another writer must have pushed the
> + * tail and recycled the descriptor already. Success is
> + * returned because the caller is only interested in the
> + * specified tail being pushed, which it was.
> + */
> + return true;
> + case desc_reserved:
> + return false;
> + case desc_committed:
> + desc_make_reusable(desc_ring, tail_id);
> + break;
> + case desc_reusable:
> + break;
> + }
> +
> + /*
> + * Data blocks must be invalidated before their associated
> + * descriptor can be made available for recycling. Invalidating
> + * them later is not possible because there is no way to trust
> + * data blocks once their associated descriptor is gone.
> + */
> +
> + if (!data_push_tail(rb, &rb->text_data_ring, desc.text_blk_lpos.next))
> + return false;
> + if (!data_push_tail(rb, &rb->dict_data_ring, desc.dict_blk_lpos.next))
> + return false;
> +
> + /*
> + * Check the next descriptor after @tail_id before pushing the tail
> + * to it because the tail must always be in a committed or reusable
> + * state. The implementation of prb_first_seq() relies on this.
> + *
> + * A successful read implies that the next descriptor is less than or
> + * equal to @head_id so there is no risk of pushing the tail past the
> + * head.
> + */
> + d_state = desc_read(desc_ring, DESC_ID(tail_id + 1), &desc); /* LMM(desc_push_tail:A) */
> +
> + if (d_state == desc_committed || d_state == desc_reusable) {
> + /*
> + * Guarantee any descriptor states that have transitioned to
> + * reusable are stored before pushing the tail ID. This allows
> + * verifying the recycled descriptor state. A full memory
> + * barrier is needed since other CPUs may have made the
> + * descriptor states reusable. This pairs with desc_reserve:D.
> + */
> + atomic_long_cmpxchg(&desc_ring->tail_id, tail_id,
> + DESC_ID(tail_id + 1)); /* LMM(desc_push_tail:B) */
> + } else {
> + /*
> + * Guarantee the last state load from desc_read() is before
> + * reloading @tail_id in order to see a new tail ID in the
> + * case that the descriptor has been recycled. This pairs
> + * with desc_reserve:D.
> + *
> + * Memory barrier involvement:
> + *
> + * If desc_push_tail:A reads from desc_reserve:F, then
> + * desc_push_tail:D reads from desc_push_tail:B.
> + *
> + * Relies on:
> + *
> + * MB from desc_push_tail:B to desc_reserve:F
> + * matching
> + * RMB from desc_push_tail:A to desc_push_tail:D
> + *
> + * Note: desc_push_tail:B and desc_reserve:F can be different
> + * CPUs. However, the desc_reserve:F CPU (which performs
> + * the full memory barrier) must have previously seen
> + * desc_push_tail:B.
> + */
> + smp_rmb(); /* LMM(desc_push_tail:C) */
> +
> + /*
> + * Re-check the tail ID. The descriptor following @tail_id is
> + * not in an allowed tail state. But if the tail has since
> + * been moved by another CPU, then it does not matter.
> + */
> + if (atomic_long_read(&desc_ring->tail_id) == tail_id) /* LMM(desc_push_tail:D) */
> + return false;
> + }
> +
> + return true;
> +}
> +
> +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> +static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
> +{
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> + unsigned long prev_state_val;
> + unsigned long id_prev_wrap;
> + struct prb_desc *desc;
> + unsigned long head_id;
> + unsigned long id;
> +
> + head_id = atomic_long_read(&desc_ring->head_id); /* LMM(desc_reserve:A) */
> +
> + do {
> + desc = to_desc(desc_ring, head_id);
> +
> + id = DESC_ID(head_id + 1);
> + id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
> +
> + /*
> + * Guarantee the head ID is read before reading the tail ID.
> + * Since the tail ID is updated before the head ID, this
> + * guarantees that @id_prev_wrap is never ahead of the tail
> + * ID. This pairs with desc_reserve:D.
> + *
> + * Memory barrier involvement:
> + *
> + * If desc_reserve:A reads from desc_reserve:D, then
> + * desc_reserve:C reads from desc_push_tail:B.
> + *
> + * Relies on:
> + *
> + * MB from desc_push_tail:B to desc_reserve:D
> + * matching
> + * RMB from desc_reserve:A to desc_reserve:C
> + *
> + * Note: desc_push_tail:B and desc_reserve:D can be different
> + * CPUs. However, the desc_reserve:D CPU (which performs
> + * the full memory barrier) must have previously seen
> + * desc_push_tail:B.
> + */
> + smp_rmb(); /* LMM(desc_reserve:B) */
> +
> + if (id_prev_wrap == atomic_long_read(&desc_ring->tail_id)) { /* LMM(desc_reserve:C) */
> + /*
> + * Make space for the new descriptor by
> + * advancing the tail.
> + */
> + if (!desc_push_tail(rb, id_prev_wrap))
> + return false;
> + }
> +
> + /*
> + * 1. Guarantee the tail ID is read before validating the
> + * recycled descriptor state. A read memory barrier is
> + * sufficient for this. This pairs with desc_push_tail:B.
> + *
> + * Memory barrier involvement:
> + *
> + * If desc_reserve:C reads from desc_push_tail:B, then
> + * desc_reserve:E reads from desc_make_reusable:A.
> + *
> + * Relies on:
> + *
> + * MB from desc_make_reusable:A to desc_push_tail:B
> + * matching
> + * RMB from desc_reserve:C to desc_reserve:E
> + *
> + * Note: desc_make_reusable:A and desc_push_tail:B can be
> + * different CPUs. However, the desc_push_tail:B CPU
> + * (which performs the full memory barrier) must have
> + * previously seen desc_make_reusable:A.
> + *
> + * 2. Guarantee the tail ID is stored before storing the head
> + * ID. This pairs with desc_reserve:B.
> + *
> + * 3. Guarantee any data ring tail changes are stored before
> + * recycling the descriptor. Data ring tail changes can
> + * happen via desc_push_tail()->data_push_tail(). A full
> + * memory barrier is needed since another CPU may have
> + * pushed the data ring tails. This pairs with
> + * data_push_tail:B.
> + *
> + * 4. Guarantee a new tail ID is stored before recycling the
> + * descriptor. A full memory barrier is needed since
> + * another CPU may have pushed the tail ID. This pairs
> + * with desc_push_tail:C and this also pairs with
> + * prb_first_seq:C.
> + */
> + } while (!atomic_long_try_cmpxchg(&desc_ring->head_id, &head_id,
> + id)); /* LMM(desc_reserve:D) */
> +
> + desc = to_desc(desc_ring, id);
> +
> + /*
> + * If the descriptor has been recycled, verify the old state val.
> + * See "ABA Issues" about why this verification is performed.
> + */
> + prev_state_val = atomic_long_read(&desc->state_var); /* LMM(desc_reserve:E) */
> + if (prev_state_val &&
> + prev_state_val != (id_prev_wrap | DESC_COMMITTED_MASK | DESC_REUSE_MASK)) {
> + WARN_ON_ONCE(1);
> + return false;
> + }
> +
> + /*
> + * Assign the descriptor a new ID and set its state to reserved.
> + * See "ABA Issues" about why cmpxchg() instead of set() is used.
> + *
> + * Guarantee the new descriptor ID and state is stored before making
> + * any other changes. A write memory barrier is sufficient for this.
> + * This pairs with desc_read:D.
> + */
> + if (!atomic_long_try_cmpxchg(&desc->state_var, &prev_state_val,
> + id | 0)) { /* LMM(desc_reserve:F) */
> + WARN_ON_ONCE(1);
> + return false;
> + }
> +
> + /* Now data in @desc can be modified: LMM(desc_reserve:G) */
> +
> + *id_out = id;
> + return true;
> +}
> +
> +/* Determine the end of a data block. */
> +static unsigned long get_next_lpos(struct prb_data_ring *data_ring,
> + unsigned long lpos, unsigned int size)
> +{
> + unsigned long begin_lpos;
> + unsigned long next_lpos;
> +
> + begin_lpos = lpos;
> + next_lpos = lpos + size;
> +
> + /* First check if the data block does not wrap. */
> + if (DATA_WRAPS(data_ring, begin_lpos) == DATA_WRAPS(data_ring, next_lpos))
> + return next_lpos;
> +
> + /* Wrapping data blocks store their data at the beginning. */
> + return (DATA_THIS_WRAP_START_LPOS(data_ring, next_lpos) + size);
> +}
> +
> +/*
> + * Allocate a new data block, invalidating the oldest data block(s)
> + * if necessary. This function also associates the data block with
> + * a specified descriptor.
> + */
> +static char *data_alloc(struct printk_ringbuffer *rb,
> + struct prb_data_ring *data_ring, unsigned int size,
> + struct prb_data_blk_lpos *blk_lpos, unsigned long id)
> +{
> + struct prb_data_block *blk;
> + unsigned long begin_lpos;
> + unsigned long next_lpos;
> +
> + if (!data_ring->data || size == 0) {
> + /* Specify a data-less block. */
> + blk_lpos->begin = INVALID_LPOS;
> + blk_lpos->next = INVALID_LPOS;
> + return NULL;
> + }
> +
> + size = to_blk_size(size);
> +
> + begin_lpos = atomic_long_read(&data_ring->head_lpos);
> +
> + do {
> + next_lpos = get_next_lpos(data_ring, begin_lpos, size);
> +
> + if (!data_push_tail(rb, data_ring, next_lpos - DATA_SIZE(data_ring))) {
> + /* Failed to allocate, specify a data-less block. */
> + blk_lpos->begin = INVALID_LPOS;
> + blk_lpos->next = INVALID_LPOS;
> + return NULL;
> + }
> +
> + /*
> + * 1. Guarantee any descriptor states that have transitioned
> + * to reusable are stored before modifying the newly
> + * allocated data area. A full memory barrier is needed
> + * since other CPUs may have made the descriptor states
> + * reusable. See data_push_tail:A about why the reusable
> + * states are visible. This pairs with desc_read:D.
> + *
> + * 2. Guarantee any updated tail lpos is stored before
> + * modifying the newly allocated data area. Another CPU may
> + * be in data_make_reusable() and is reading a block ID
> + * from this area. data_make_reusable() can handle reading
> + * a garbage block ID value, but then it must be able to
> + * load a new tail lpos. A full memory barrier is needed
> + * since other CPUs may have updated the tail lpos. This
> + * pairs with data_push_tail:B.
> + */
> + } while (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &begin_lpos,
> + next_lpos)); /* LMM(data_alloc:A) */
> +
> + blk = to_block(data_ring, begin_lpos);
> + blk->id = id; /* LMM(data_alloc:B) */
> +
> + if (DATA_WRAPS(data_ring, begin_lpos) != DATA_WRAPS(data_ring, next_lpos)) {
> + /* Wrapping data blocks store their data at the beginning. */
> + blk = to_block(data_ring, 0);
> +
> + /*
> + * Store the ID on the wrapped block for consistency.
> + * The printk_ringbuffer does not actually use it.
> + */
> + blk->id = id;
> + }
> +
> + blk_lpos->begin = begin_lpos;
> + blk_lpos->next = next_lpos;
> +
> + return &blk->data[0];
> +}
> +
> +/* Return the number of bytes used by a data block. */
> +static unsigned int space_used(struct prb_data_ring *data_ring,
> + struct prb_data_blk_lpos *blk_lpos)
> +{
> + if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next)) {
> + /* Data block does not wrap. */
> + return (DATA_INDEX(data_ring, blk_lpos->next) -
> + DATA_INDEX(data_ring, blk_lpos->begin));
> + }
> +
> + /*
> + * For wrapping data blocks, the trailing (wasted) space is
> + * also counted.
> + */
> + return (DATA_INDEX(data_ring, blk_lpos->next) +
> + DATA_SIZE(data_ring) - DATA_INDEX(data_ring, blk_lpos->begin));
> +}
> +
> +/**
> + * prb_reserve() - Reserve space in the ringbuffer.
> + *
> + * @e: The entry structure to setup.
> + * @rb: The ringbuffer to reserve data in.
> + * @r: The record structure to allocate buffers for.
> + *
> + * This is the public function available to writers to reserve data.
> + *
> + * The writer specifies the text and dict sizes to reserve by setting the
> + * @text_buf_size and @dict_buf_size fields of @r, respectively. Dictionaries
> + * are optional, so @dict_buf_size is allowed to be 0. To ensure proper
> + * initialization of @r, prb_rec_init_wr() should be used.
> + *
> + * Context: Any context. Disables local interrupts on success.
> + * Return: true if at least text data could be allocated, otherwise false.
> + *
> + * On success, the fields @info, @text_buf, @dict_buf of @r will be set by
> + * this function and should be filled in by the writer before committing. Also
> + * on success, prb_record_text_space() can be used on @e to query the actual
> + * space used for the text data block.
> + *
> + * If the function fails to reserve dictionary space (but all else succeeded),
> + * it will still report success. In that case @dict_buf is set to NULL and
> + * @dict_buf_size is set to 0. Writers must check this before writing to
> + * dictionary space.
> + *
> + * @info->text_len and @info->dict_len will already be set to @text_buf_size
> + * and @dict_buf_size, respectively. If dictionary space reservation fails,
> + * @info->dict_len is set to 0.
> + */
> +bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
> + struct printk_record *r)
> +{
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> + struct prb_desc *d;
> + unsigned long id;
> +
> + if (!data_check_size(&rb->text_data_ring, r->text_buf_size))
> + goto fail;
> +
> + /* Records are allowed to not have dictionaries. */
> + if (r->dict_buf_size) {
> + if (!data_check_size(&rb->dict_data_ring, r->dict_buf_size))
> + goto fail;
> + }
> +
> + /*
> + * Descriptors in the reserved state act as blockers to all further
> + * reservations once the desc_ring has fully wrapped. Disable
> + * interrupts during the reserve/commit window in order to minimize
> + * the likelihood of this happening.
> + */
> + local_irq_save(e->irqflags);
> +
> + if (!desc_reserve(rb, &id)) {
> + /* Descriptor reservation failures are tracked. */
> + atomic_long_inc(&rb->fail);
> + local_irq_restore(e->irqflags);
> + goto fail;
> + }
> +
> + d = to_desc(desc_ring, id);
> +
> + /*
> + * Set the @e fields here so that prb_commit() can be used if
> + * text data allocation fails.
> + */
> + e->rb = rb;
> + e->id = id;
> +
> + /*
> + * Initialize the sequence number if it has "never been set".
> + * Otherwise just increment it by a full wrap.
> + *
> + * @seq is considered "never been set" if it has a value of 0,
> + * _except_ for @descs[0], which was specially setup by the ringbuffer
> + * initializer and therefore is always considered as set.
> + *
> + * See the "Bootstrap" comment block in printk_ringbuffer.h for
> + * details about how the initializer bootstraps the descriptors.
> + */
> + if (d->info.seq == 0 && DESC_INDEX(desc_ring, id) != 0)
> + d->info.seq = DESC_INDEX(desc_ring, id);
> + else
> + d->info.seq += DESCS_COUNT(desc_ring);
> +
> + r->text_buf = data_alloc(rb, &rb->text_data_ring, r->text_buf_size,
> + &d->text_blk_lpos, id);
> + /* If text data allocation fails, a data-less record is committed. */
> + if (r->text_buf_size && !r->text_buf) {
> + d->info.text_len = 0;
> + d->info.dict_len = 0;
> + prb_commit(e);
> + /* prb_commit() re-enabled interrupts. */
> + goto fail;
> + }
> +
> + r->dict_buf = data_alloc(rb, &rb->dict_data_ring, r->dict_buf_size,
> + &d->dict_blk_lpos, id);
> + /*
> + * If dict data allocation fails, the caller can still commit
> + * text. But dictionary information will not be available.
> + */
> + if (r->dict_buf_size && !r->dict_buf)
> + r->dict_buf_size = 0;
> +
> + r->info = &d->info;
> +
> + /* Set default values for the sizes. */
> + d->info.text_len = r->text_buf_size;
> + d->info.dict_len = r->dict_buf_size;
> +
> + /* Record full text space used by record. */
> + e->text_space = space_used(&rb->text_data_ring, &d->text_blk_lpos);
> +
> + return true;
> +fail:
> + /* Make it clear to the caller that the reserve failed. */
> + memset(r, 0, sizeof(*r));
> + return false;
> +}
> +
> +/**
> + * prb_commit() - Commit (previously reserved) data to the ringbuffer.
> + *
> + * @e: The entry containing the reserved data information.
> + *
> + * This is the public function available to writers to commit data.
> + *
> + * Context: Any context. Enables local interrupts.
> + */
> +void prb_commit(struct prb_reserved_entry *e)
> +{
> + struct prb_desc_ring *desc_ring = &e->rb->desc_ring;
> + struct prb_desc *d = to_desc(desc_ring, e->id);
> + unsigned long prev_state_val = e->id | 0;
> +
> + /* Now the writer has finished all writing: LMM(prb_commit:A) */
> +
> + /*
> + * Set the descriptor as committed. See "ABA Issues" about why
> + * cmpxchg() instead of set() is used.
> + *
> + * Guarantee all record data is stored before the descriptor state
> + * is stored as committed. A write memory barrier is sufficient for
> + * this. This pairs with desc_read:B.
> + */
> + if (!atomic_long_try_cmpxchg(&d->state_var, &prev_state_val,
> + e->id | DESC_COMMITTED_MASK)) { /* LMM(prb_commit:B) */
> + WARN_ON_ONCE(1);
> + }
> +
> + /* Restore interrupts, the reserve/commit window is finished. */
> + local_irq_restore(e->irqflags);
> +}
> +
> +/*
> + * Given @blk_lpos, return a pointer to the raw data from the data block
> + * and calculate the size of the data part. A NULL pointer is returned
> + * if @blk_lpos specifies values that could never be legal.
> + *
> + * This function (used by readers) performs strict validation on the lpos
> + * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is
> + * triggered if an internal error is detected.
> + */
> +static char *get_data(struct prb_data_ring *data_ring,
> + struct prb_data_blk_lpos *blk_lpos,
> + unsigned int *data_size)
> +{
> + struct prb_data_block *db;
> +
> + /* Data-less data block description. */
> + if (blk_lpos->begin == INVALID_LPOS &&
> + blk_lpos->next == INVALID_LPOS) {
> + return NULL;
> + }
> +
> + /* Regular data block: @begin less than @next and in same wrap. */
> + if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next) &&
> + blk_lpos->begin < blk_lpos->next) {
> + db = to_block(data_ring, blk_lpos->begin);
> + *data_size = blk_lpos->next - blk_lpos->begin;
> +
> + /* Wrapping data block: @begin is one wrap behind @next. */
> + } else if (DATA_WRAPS(data_ring, blk_lpos->begin + DATA_SIZE(data_ring)) ==
> + DATA_WRAPS(data_ring, blk_lpos->next)) {
> + db = to_block(data_ring, 0);
> + *data_size = DATA_INDEX(data_ring, blk_lpos->next);
> +
> + /* Illegal block description. */
> + } else {
> + WARN_ON_ONCE(1);
> + return NULL;
> + }
> +
> + /* A valid data block will always be aligned to the ID size. */
> + if (WARN_ON_ONCE(blk_lpos->begin != ALIGN(blk_lpos->begin, sizeof(db->id))) ||
> + WARN_ON_ONCE(blk_lpos->next != ALIGN(blk_lpos->next, sizeof(db->id)))) {
> + return NULL;
> + }
> +
> + /* A valid data block will always have at least an ID. */
> + if (WARN_ON_ONCE(*data_size < sizeof(db->id)))
> + return NULL;
> +
> + /* Subtract block ID space from size to reflect data size. */
> + *data_size -= sizeof(db->id);
> +
> + return &db->data[0];
> +}
> +
> +/**
> + * prb_count_lines() - Count the number of lines in provided text.
> + *
> + * @text: The text to count the lines of.
> + * @text_size: The size of the text to process.
> + *
> + * This is the public function available to readers to count the number of
> + * lines in a text string.
> + *
> + * Context: Any context.
> + * Return: The number of lines in the text.
> + *
> + * All text has at least 1 line (even if @text_size is 0). Each '\n'
> + * processed is counted as an additional line.
> + */
> +unsigned int prb_count_lines(char *text, unsigned int text_size)
> +{
> + unsigned int next_size = text_size;
> + unsigned int line_count = 1;
> + char *next = text;
> +
> + while (next_size) {
> + next = memchr(next, '\n', next_size);
> + if (!next)
> + break;
> + line_count++;
> + next++;
> + next_size = text_size - (next - text);
> + }
> +
> + return line_count;
> +}
> +
> +/*
> + * Given @blk_lpos, copy an expected @len of data into the provided buffer.
> + * If @line_count is provided, count the number of lines in the data.
> + *
> + * This function (used by readers) performs strict validation on the data
> + * size to possibly detect bugs in the writer code. A WARN_ON_ONCE() is
> + * triggered if an internal error is detected.
> + */
> +static bool copy_data(struct prb_data_ring *data_ring,
> + struct prb_data_blk_lpos *blk_lpos, u16 len, char *buf,
> + unsigned int buf_size, unsigned int *line_count)
> +{
> + unsigned int data_size;
> + char *data;
> +
> + /* Caller might not want any data. */
> + if ((!buf || !buf_size) && !line_count)
> + return true;
> +
> + data = get_data(data_ring, blk_lpos, &data_size);
> + if (!data)
> + return false;
> +
> + /*
> + * Actual cannot be less than expected. It can be more than expected
> + * because of the trailing alignment padding.
> + */
> + if (WARN_ON_ONCE(data_size < (unsigned int)len)) {
> + pr_warn_once("wrong data size (%u, expecting %hu) for data: %.*s\n",
> + data_size, len, data_size, data);
> + return false;
> + }
> +
> + /* Caller interested in the line count? */
> + if (line_count)
> + *line_count = prb_count_lines(data, data_size);
> +
> + /* Caller interested in the data content? */
> + if (!buf || !buf_size)
> + return true;
> +
> + data_size = min_t(u16, buf_size, len);
> +
> + if (!WARN_ON_ONCE(!data_size))
> + memcpy(&buf[0], data, data_size); /* LMM(copy_data:A) */
> + return true;
> +}
> +
> +/*
> + * This is an extended version of desc_read(). It gets a copy of a specified
> + * descriptor. However, it also verifies that the record is committed and has
> + * the sequence number @seq. On success, 0 is returned.
> + *
> + * Error return values:
> + * -EINVAL: A committed record with sequence number @seq does not exist.
> + * -ENOENT: A committed record with sequence number @seq exists, but its data
> + * is not available. This is a valid record, so readers should
> + * continue with the next record.
> + */
> +static int desc_read_committed_seq(struct prb_desc_ring *desc_ring,
> + unsigned long id, u64 seq,
> + struct prb_desc *desc_out)
> +{
> + struct prb_data_blk_lpos *blk_lpos = &desc_out->text_blk_lpos;
> + enum desc_state d_state;
> +
> + d_state = desc_read(desc_ring, id, desc_out);
> +
> + /*
> + * An unexpected @id (desc_miss) or @seq mismatch means the record
> + * does not exist. A descriptor in the reserved state means the
> + * record does not yet exist for the reader.
> + */
> + if (d_state == desc_miss ||
> + d_state == desc_reserved ||
> + desc_out->info.seq != seq) {
> + return -EINVAL;
> + }
> +
> + /*
> + * A descriptor in the reusable state may no longer have its data
> + * available; report it as a data-less record. Or the record may
> + * actually be a data-less record.
> + */
> + if (d_state == desc_reusable ||
> + (blk_lpos->begin == INVALID_LPOS && blk_lpos->next == INVALID_LPOS)) {
> + return -ENOENT;
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Copy the ringbuffer data from the record with @seq to the provided
> + * @r buffer. On success, 0 is returned.
> + *
> + * See desc_read_committed_seq() for error return values.
> + */
> +static int prb_read(struct printk_ringbuffer *rb, u64 seq,
> + struct printk_record *r, unsigned int *line_count)
> +{
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> + struct prb_desc *rdesc = to_desc(desc_ring, seq);
> + atomic_long_t *state_var = &rdesc->state_var;
> + struct prb_desc desc;
> + unsigned long id;
> + int err;
> +
> + /* Extract the ID, used to specify the descriptor to read. */
> + id = DESC_ID(atomic_long_read(state_var));
> +
> + /* Get a local copy of the correct descriptor (if available). */
> + err = desc_read_committed_seq(desc_ring, id, seq, &desc);
> +
> + /*
> + * If @r is NULL, the caller is only interested in the availability
> + * of the record.
> + */
> + if (err || !r)
> + return err;
> +
> + /* If requested, copy meta data. */
> + if (r->info)
> + memcpy(r->info, &desc.info, sizeof(*(r->info)));
> +
> + /* Copy text data. If it fails, this is a data-less descriptor. */
> + if (!copy_data(&rb->text_data_ring, &desc.text_blk_lpos, desc.info.text_len,
> + r->text_buf, r->text_buf_size, line_count)) {
> + return -ENOENT;
> + }
> +
> + /*
> + * Copy dict data. Although this should not fail, dict data is not
> + * important. So if it fails, modify the copied meta data to report
> + * that there is no dict data, thus silently dropping the dict data.
> + */
> + if (!copy_data(&rb->dict_data_ring, &desc.dict_blk_lpos, desc.info.dict_len,
> + r->dict_buf, r->dict_buf_size, NULL)) {
> + if (r->info)
> + r->info->dict_len = 0;
> + }
> +
> + /* Ensure the record is still committed and has the same @seq. */
> + return desc_read_committed_seq(desc_ring, id, seq, &desc);
> +}
> +
> +/**
> + * prb_first_seq() - Get the sequence number of the tail descriptor.
> + *
> + * @rb: The ringbuffer to get the sequence number from.
> + *
> + * This is the public function available to readers to see what the
> + * first/oldest sequence number is.
> + *
> + * This provides readers a starting point to begin iterating the ringbuffer.
> + * Note that the returned sequence number might not belong to a valid record.
> + *
> + * Context: Any context.
> + * Return: The sequence number of the first/oldest record or, if the
> + * ringbuffer is empty, 0.
> + */
> +u64 prb_first_seq(struct printk_ringbuffer *rb)
> +{
> + struct prb_desc_ring *desc_ring = &rb->desc_ring;
> + enum desc_state d_state;
> + struct prb_desc desc;
> + unsigned long id;
> +
> + for (;;) {
> + id = atomic_long_read(&rb->desc_ring.tail_id); /* LMM(prb_first_seq:A) */
> +
> + d_state = desc_read(desc_ring, id, &desc); /* LMM(prb_first_seq:B) */
> +
> + /*
> + * This loop will not be infinite because the tail is
> + * _always_ in the committed or reusable state.
> + */
> + if (d_state == desc_committed || d_state == desc_reusable)
> + break;
> +
> + /*
> + * Guarantee the last state load from desc_read() is before
> + * reloading @tail_id in order to see a new tail in the case
> + * that the descriptor has been recycled. This pairs with
> + * desc_reserve:D.
> + *
> + * Memory barrier involvement:
> + *
> + * If prb_first_seq:B reads from desc_reserve:F, then
> + * prb_first_seq:A reads from desc_push_tail:B.
> + *
> + * Relies on:
> + *
> + * MB from desc_push_tail:B to desc_reserve:F
> + * matching
> + * RMB from prb_first_seq:B to prb_first_seq:A
> + */
> + smp_rmb(); /* LMM(prb_first_seq:C) */
> + }
> +
> + return desc.info.seq;
> +}
> +
> +/*
> + * Non-blocking read of a record. Updates @seq to the last committed record
> + * (which may have no data).
> + *
> + * See the description of prb_read_valid() and prb_read_valid_info()
> + * for details.
> + */
> +static bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
> + struct printk_record *r, unsigned int *line_count)
> +{
> + u64 tail_seq;
> + int err;
> +
> + while ((err = prb_read(rb, *seq, r, line_count))) {
> + tail_seq = prb_first_seq(rb);
> +
> + if (*seq < tail_seq) {
> + /*
> + * Behind the tail. Catch up and try again. This
> + * can happen for -ENOENT and -EINVAL cases.
> + */
> + *seq = tail_seq;
> +
> + } else if (err == -ENOENT) {
> + /* Record exists, but no data available. Skip. */
> + (*seq)++;
> +
> + } else {
> + /* Non-existent/non-committed record. Must stop. */
> + return false;
> + }
> + }
> +
> + return true;
> +}
> +
> +/**
> + * prb_read_valid() - Non-blocking read of a requested record or (if gone)
> + * the next available record.
> + *
> + * @rb: The ringbuffer to read from.
> + * @seq: The sequence number of the record to read.
> + * @r: A record data buffer to store the read record to.
> + *
> + * This is the public function available to readers to read a record.
> + *
> + * The reader provides the @info, @text_buf, @dict_buf buffers of @r to be
> + * filled in. Any of the buffer pointers can be set to NULL if the reader
> + * is not interested in that data. To ensure proper initialization of @r,
> + * prb_rec_init_rd() should be used.
> + *
> + * Context: Any context.
> + * Return: true if a record was read, otherwise false.
> + *
> + * On success, the reader must check r->info.seq to see which record was
> + * actually read. This allows the reader to detect dropped records.
> + *
> + * Failure means @seq refers to a not yet written record.
> + */
> +bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
> + struct printk_record *r)
> +{
> + return _prb_read_valid(rb, &seq, r, NULL);
> +}
> +
> +/**
> + * prb_read_valid_info() - Non-blocking read of meta data for a requested
> + * record or (if gone) the next available record.
> + *
> + * @rb: The ringbuffer to read from.
> + * @seq: The sequence number of the record to read.
> + * @info: A buffer to store the read record meta data to.
> + * @line_count: A buffer to store the number of lines in the record text.
> + *
> + * This is the public function available to readers to read only the
> + * meta data of a record.
> + *
> + * The reader provides the @info, @line_count buffers to be filled in.
> + * Either of the buffer pointers can be set to NULL if the reader is not
> + * interested in that data.
> + *
> + * Context: Any context.
> + * Return: true if a record's meta data was read, otherwise false.
> + *
> + * On success, the reader must check info->seq to see which record meta data
> + * was actually read. This allows the reader to detect dropped records.
> + *
> + * Failure means @seq refers to a not yet written record.
> + */
> +bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
> + struct printk_info *info, unsigned int *line_count)
> +{
> + struct printk_record r;
> +
> + prb_rec_init_rd(&r, info, NULL, 0, NULL, 0);
> +
> + return _prb_read_valid(rb, &seq, &r, line_count);
> +}
> +
> +/**
> + * prb_next_seq() - Get the sequence number after the last available record.
> + *
> + * @rb: The ringbuffer to get the sequence number from.
> + *
> + * This is the public function available to readers to see what the next
> + * newest sequence number available to readers will be.
> + *
> + * This provides readers a sequence number to jump to if all currently
> + * available records should be skipped.
> + *
> + * Context: Any context.
> + * Return: The sequence number of the next newest (not yet available) record
> + * for readers.
> + */
> +u64 prb_next_seq(struct printk_ringbuffer *rb)
> +{
> + u64 seq = 0;
> +
> + do {
> + /* Search forward from the oldest descriptor. */
> + if (!_prb_read_valid(rb, &seq, NULL, NULL))
> + return seq;
> + seq++;
> + } while (seq);
> +
> + return 0;
> +}
> +
> +/**
> + * prb_init() - Initialize a ringbuffer to use provided external buffers.
> + *
> + * @rb: The ringbuffer to initialize.
> + * @text_buf: The data buffer for text data.
> + * @textbits: The size of @text_buf as a power-of-2 value.
> + * @dict_buf: The data buffer for dictionary data.
> + * @dictbits: The size of @dict_buf as a power-of-2 value.
> + * @descs: The descriptor buffer for ringbuffer records.
> + * @descbits: The count of @descs items as a power-of-2 value.
> + *
> + * This is the public function available to writers to setup a ringbuffer
> + * during runtime using provided buffers.
> + *
> + * This must match the initialization of DECLARE_PRINTKRB().
> + *
> + * Context: Any context.
> + */
> +void prb_init(struct printk_ringbuffer *rb,
> + char *text_buf, unsigned int textbits,
> + char *dict_buf, unsigned int dictbits,
> + struct prb_desc *descs, unsigned int descbits)
> +{
> + memset(descs, 0, _DESCS_COUNT(descbits) * sizeof(descs[0]));
> +
> + rb->desc_ring.count_bits = descbits;
> + rb->desc_ring.descs = descs;
> + atomic_long_set(&rb->desc_ring.head_id, DESC0_ID(descbits));
> + atomic_long_set(&rb->desc_ring.tail_id, DESC0_ID(descbits));
> +
> + rb->text_data_ring.size_bits = textbits;
> + rb->text_data_ring.data = text_buf;
> + atomic_long_set(&rb->text_data_ring.head_lpos, BLK0_LPOS(textbits));
> + atomic_long_set(&rb->text_data_ring.tail_lpos, BLK0_LPOS(textbits));
> +
> + rb->dict_data_ring.size_bits = dictbits;
> + rb->dict_data_ring.data = dict_buf;
> + atomic_long_set(&rb->dict_data_ring.head_lpos, BLK0_LPOS(dictbits));
> + atomic_long_set(&rb->dict_data_ring.tail_lpos, BLK0_LPOS(dictbits));
> +
> + atomic_long_set(&rb->fail, 0);
> +
> + descs[0].info.seq = -(u64)_DESCS_COUNT(descbits);
> +
> + descs[_DESCS_COUNT(descbits) - 1].info.seq = 0;
> + atomic_long_set(&(descs[_DESCS_COUNT(descbits) - 1].state_var), DESC0_SV(descbits));
> + descs[_DESCS_COUNT(descbits) - 1].text_blk_lpos.begin = INVALID_LPOS;
> + descs[_DESCS_COUNT(descbits) - 1].text_blk_lpos.next = INVALID_LPOS;
> + descs[_DESCS_COUNT(descbits) - 1].dict_blk_lpos.begin = INVALID_LPOS;
> + descs[_DESCS_COUNT(descbits) - 1].dict_blk_lpos.next = INVALID_LPOS;
> +}
> +
> +/**
> + * prb_record_text_space() - Query the full actual used ringbuffer space for
> + * the text data of a reserved entry.
> + *
> + * @e: The successfully reserved entry to query.
> + *
> + * This is the public function available to writers to see how much actual
> + * space is used in the ringbuffer to store the text data of the specified
> + * entry.
> + *
> + * This function is only valid if @e has been successfully reserved using
> + * prb_reserve().
> + *
> + * Context: Any context.
> + * Return: The size in bytes used by the text data of the associated record.
> + */
> +unsigned int prb_record_text_space(struct prb_reserved_entry *e)
> +{
> + return e->text_space;
> +}
> diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
> new file mode 100644
> index 000000000000..df03039dca7e
> --- /dev/null
> +++ b/kernel/printk/printk_ringbuffer.h
> @@ -0,0 +1,352 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _KERNEL_PRINTK_RINGBUFFER_H
> +#define _KERNEL_PRINTK_RINGBUFFER_H
> +
> +#include <linux/atomic.h>
> +
> +struct printk_info {
> + u64 seq; /* sequence number */
> + u64 ts_nsec; /* timestamp in nanoseconds */
> + u16 text_len; /* length of text message */
> + u16 dict_len; /* length of dictionary message */
> + u8 facility; /* syslog facility */
> + u8 flags:5; /* internal record flags */
> + u8 level:3; /* syslog level */
> + u32 caller_id; /* thread id or processor id */
> +};
> +
> +/*
> + * A structure providing the buffers, used by writers and readers.
> + *
> + * Writers:
> + * Using prb_rec_init_wr(), a writer sets @text_buf_size and @dict_buf_size
> + * before calling prb_reserve(). On success, prb_reserve() sets @info,
> + * @text_buf, @dict_buf to buffers reserved for that writer.
> + *
> + * Readers:
> + * Using prb_rec_init_rd(), a reader sets all fields before calling
> + * prb_read_valid(). Note that the reader provides the @info, @text_buf,
> + * @dict_buf buffers. On success, the struct pointed to by @info will be
> + * filled and the char arrays pointed to by @text_buf and @dict_buf will
> + * be filled with text and dict data.
> + */
> +struct printk_record {
> + struct printk_info *info;
> + char *text_buf;
> + char *dict_buf;
> + unsigned int text_buf_size;
> + unsigned int dict_buf_size;
> +};
> +
> +/* Specifies the position/span of a data block. */
> +struct prb_data_blk_lpos {
> + unsigned long begin;
> + unsigned long next;
> +};
> +
> +/* A descriptor: the complete meta-data for a record. */
> +struct prb_desc {
> + struct printk_info info;
> + atomic_long_t state_var;
> + struct prb_data_blk_lpos text_blk_lpos;
> + struct prb_data_blk_lpos dict_blk_lpos;
> +};
> +
> +/* A ringbuffer of "ID + data" elements. */
> +struct prb_data_ring {
> + unsigned int size_bits;
> + char *data;
> + atomic_long_t head_lpos;
> + atomic_long_t tail_lpos;
> +};
> +
> +/* A ringbuffer of "struct prb_desc" elements. */
> +struct prb_desc_ring {
> + unsigned int count_bits;
> + struct prb_desc *descs;
> + atomic_long_t head_id;
> + atomic_long_t tail_id;
> +};
> +
> +/* The high level structure representing the printk ringbuffer. */
> +struct printk_ringbuffer {
> + struct prb_desc_ring desc_ring;
> + struct prb_data_ring text_data_ring;
> + struct prb_data_ring dict_data_ring;
> + atomic_long_t fail;
> +};
> +
> +/* Used by writers as a reserve/commit handle. */
> +struct prb_reserved_entry {
> + struct printk_ringbuffer *rb;
> + unsigned long irqflags;
> + unsigned long id;
> + unsigned int text_space;
> +};
> +
> +#define _DATA_SIZE(sz_bits) (1UL << (sz_bits))
> +#define _DESCS_COUNT(ct_bits) (1U << (ct_bits))
> +#define DESC_SV_BITS (sizeof(unsigned long) * 8)
> +#define DESC_COMMITTED_MASK (1UL << (DESC_SV_BITS - 1))
> +#define DESC_REUSE_MASK (1UL << (DESC_SV_BITS - 2))
> +#define DESC_FLAGS_MASK (DESC_COMMITTED_MASK | DESC_REUSE_MASK)
> +#define DESC_ID_MASK (~DESC_FLAGS_MASK)
> +#define DESC_ID(sv) ((sv) & DESC_ID_MASK)
> +#define INVALID_LPOS 1
> +
> +#define INVALID_BLK_LPOS \
> +{ \
> + .begin = INVALID_LPOS, \
> + .next = INVALID_LPOS, \
> +}
> +
> +/*
> + * Descriptor Bootstrap
> + *
> + * The descriptor array is minimally initialized to allow immediate usage
> + * by readers and writers. The requirements that the descriptor array
> + * initialization must satisfy:
> + *
> + * Req1
> + * The tail must point to an existing (committed or reusable) descriptor.
> + * This is required by the implementation of prb_first_seq().
> + *
> + * Req2
> + * Readers must see that the ringbuffer is initially empty.
> + *
> + * Req3
> + * The first record reserved by a writer is assigned sequence number 0.
> + *
> + * To satisfy Req1, the tail initially points to a descriptor that is
> + * minimally initialized (having no data block, i.e. data-less with the
> + * data block's lpos @begin and @next values set to INVALID_LPOS).
> + *
> + * To satisfy Req2, the initial tail descriptor is initialized to the
> + * reusable state. Readers recognize reusable descriptors as existing
> + * records, but skip over them.
> + *
> + * To satisfy Req3, the last descriptor in the array is used as the initial
> + * head (and tail) descriptor. This allows the first record reserved by a
> + * writer (head + 1) to be the first descriptor in the array. (Only the first
> + * descriptor in the array could have a valid sequence number of 0.)
> + *
> + * The first time a descriptor is reserved, it is assigned a sequence number
> + * with the value of the array index. A "first time reserved" descriptor can
> + * be recognized because it has a sequence number of 0 but does not have an
> + * index of 0. (Only the first descriptor in the array could have a valid
> + * sequence number of 0.) After the first reservation, all future reservations
> + * (recycling) simply involve incrementing the sequence number by the array
> + * count.
> + *
> + * Hack #1
> + * Only the first descriptor in the array is allowed to have the sequence
> + * number 0. In this case it is not possible to recognize if it is being
> + * reserved the first time (set to index value) or has been reserved
> + * previously (increment by the array count). This is handled by _always_
> + * incrementing the sequence number by the array count when reserving the
> + * first descriptor in the array. In order to satisfy Req3, the sequence
> + * number of the first descriptor in the array is initialized to minus
> + * the array count. Then, upon the first reservation, it is incremented
> + * to 0, thus satisfying Req3.
> + *
> + * Hack #2
> + * prb_first_seq() can be called at any time by readers to retrieve the
> + * sequence number of the tail descriptor. However, due to Req2 and Req3,
> + * initially there are no records to report the sequence number of
> + * (sequence numbers are u64 and there is nothing less than 0). To handle
> + * this, the sequence number of the initial tail descriptor is initialized
> + * to 0. Technically this is incorrect, because there is no record with
> + * sequence number 0 (yet) and the tail descriptor is not the first
> + * descriptor in the array. But it allows prb_read_valid() to correctly
> + * report the existence of a record for _any_ given sequence number at all
> + * times. Bootstrapping is complete when the tail is pushed the first
> + * time, thus finally pointing to the first descriptor reserved by a
> + * writer, which has the assigned sequence number 0.
> + */
> +
> +/*
> + * Initiating Logical Value Overflows
> + *
> + * Both logical position (lpos) and ID values can be mapped to array indexes
> + * but may experience overflows during the lifetime of the system. To ensure
> + * that printk_ringbuffer can handle the overflows for these types, initial
> + * values are chosen that map to the correct initial array indexes, but will
> + * result in overflows soon.
> + *
> + * BLK0_LPOS
> + * The initial @head_lpos and @tail_lpos for data rings. It is at index
> + * 0 and the lpos value is such that it will overflow on the first wrap.
> + *
> + * DESC0_ID
> + * The initial @head_id and @tail_id for the desc ring. It is at the last
> + * index of the descriptor array (see Req3 above) and the ID value is such
> + * that it will overflow on the second wrap.
> + */
> +#define BLK0_LPOS(sz_bits) (-(_DATA_SIZE(sz_bits)))
> +#define DESC0_ID(ct_bits) DESC_ID(-(_DESCS_COUNT(ct_bits) + 1))
> +#define DESC0_SV(ct_bits) (DESC_COMMITTED_MASK | DESC_REUSE_MASK | DESC0_ID(ct_bits))
> +
> +/*
> + * Declare a ringbuffer with an external text data buffer. The same as
> + * DECLARE_PRINTKRB() but requires specifying an external buffer for the
> + * text data.
> + *
> + * Note: The specified external buffer must be of the size:
> + * 2 ^ (descbits + avgtextbits)
> + */
> +#define _DECLARE_PRINTKRB(name, descbits, avgtextbits, avgdictbits, text_buf) \
> +char _##name##_dict[1U << ((avgdictbits) + (descbits))] __aligned(__alignof__(unsigned long)); \
> +struct prb_desc _##name##_descs[_DESCS_COUNT(descbits)] = { \
> + /* this will be the first record reserved by a writer */ \
> + [0] = { \
> + .info = { \
> + /* will be incremented to 0 on the first reservation */ \
> + .seq = -(u64)_DESCS_COUNT(descbits), \
> + }, \
> + }, \
> + /* the initial head and tail */ \
> + [_DESCS_COUNT(descbits) - 1] = { \
> + .info = { \
> + /* reports the first seq value during the bootstrap phase */ \
> + .seq = 0, \
> + }, \
> + /* reusable */ \
> + .state_var = ATOMIC_INIT(DESC0_SV(descbits)), \
> + /* no associated data block */ \
> + .text_blk_lpos = INVALID_BLK_LPOS, \
> + .dict_blk_lpos = INVALID_BLK_LPOS, \
> + }, \
> +}; \
> +struct printk_ringbuffer name = { \
> + .desc_ring = { \
> + .count_bits = descbits, \
> + .descs = &_##name##_descs[0], \
> + .head_id = ATOMIC_INIT(DESC0_ID(descbits)), \
> + .tail_id = ATOMIC_INIT(DESC0_ID(descbits)), \
> + }, \
> + .text_data_ring = { \
> + .size_bits = (avgtextbits) + (descbits), \
> + .data = text_buf, \
> + .head_lpos = ATOMIC_LONG_INIT(BLK0_LPOS((avgtextbits) + (descbits))), \
> + .tail_lpos = ATOMIC_LONG_INIT(BLK0_LPOS((avgtextbits) + (descbits))), \
> + }, \
> + .dict_data_ring = { \
> + .size_bits = (avgdictbits) + (descbits), \
> + .data = &_##name##_dict[0], \
> + .head_lpos = ATOMIC_LONG_INIT(BLK0_LPOS((avgtextbits) + (descbits))), \
> + .tail_lpos = ATOMIC_LONG_INIT(BLK0_LPOS((avgtextbits) + (descbits))), \
> + }, \
> + .fail = ATOMIC_LONG_INIT(0), \
> +}
> +
> +/**
> + * DECLARE_PRINTKRB() - Declare a ringbuffer.
> + *
> + * @name: The name of the ringbuffer variable.
> + * @descbits: The number of descriptors as a power-of-2 value.
> + * @avgtextbits: The average text data size per record as a power-of-2 value.
> + * @avgdictbits: The average dictionary data size per record as a
> + * power-of-2 value.
> + *
> + * This is a macro for declaring a ringbuffer and all internal structures
> + * such that it is ready for immediate use. See _DECLARE_PRINTKRB() for a
> + * variant where the text data buffer can be specified externally.
> + */
> +#define DECLARE_PRINTKRB(name, descbits, avgtextbits, avgdictbits) \
> +char _##name##_text[1U << ((avgtextbits) + (descbits))] __aligned(__alignof__(unsigned long)); \
> +_DECLARE_PRINTKRB(name, descbits, avgtextbits, avgdictbits, &_##name##_text[0])
> +
> +/* Writer Interface */
> +
> +/**
> + * prb_rec_init_wr() - Initialize a buffer for writing records.
> + *
> + * @r: The record to initialize.
> + * @text_buf_size: The needed text buffer size.
> + * @dict_buf_size: The needed dictionary buffer size.
> + *
> + * Initialize all the fields that a writer is interested in. If
> + * @dict_buf_size is 0, a dictionary buffer will not be reserved.
> + * @text_buf_size must be greater than 0.
> + *
> + * Note that although @dict_buf_size may be initialized to non-zero,
> + * its value must be rechecked after a successful call to prb_reserve()
> + * to verify a dictionary buffer was actually reserved. Dictionary buffer
> + * reservation is allowed to fail.
> + */
> +static inline void prb_rec_init_wr(struct printk_record *r,
> + unsigned int text_buf_size,
> + unsigned int dict_buf_size)
> +{
> + r->info = NULL;
> + r->text_buf = NULL;
> + r->dict_buf = NULL;
> + r->text_buf_size = text_buf_size;
> + r->dict_buf_size = dict_buf_size;
> +}
> +
> +bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
> + struct printk_record *r);
> +void prb_commit(struct prb_reserved_entry *e);
> +
> +void prb_init(struct printk_ringbuffer *rb,
> + char *text_buf, unsigned int text_buf_size,
> + char *dict_buf, unsigned int dict_buf_size,
> + struct prb_desc *descs, unsigned int descs_count_bits);
> +unsigned int prb_record_text_space(struct prb_reserved_entry *e);
> +
> +/* Reader Interface */
> +
> +/**
> + * prb_rec_init_rd() - Initialize a buffer for reading records.
> + *
> + * @r: The record to initialize.
> + * @info: A buffer to store record meta-data.
> + * @text_buf: A buffer to store text data.
> + * @text_buf_size: The size of @text_buf.
> + * @dict_buf: A buffer to store dictionary data.
> + * @dict_buf_size: The size of @dict_buf.
> + *
> + * Initialize all the fields that a reader is interested in. All arguments
> + * (except @r) are optional. Only record data for arguments that are
> + * non-NULL or non-zero will be read.
> + */
> +static inline void prb_rec_init_rd(struct printk_record *r,
> + struct printk_info *info,
> + char *text_buf, unsigned int text_buf_size,
> + char *dict_buf, unsigned int dict_buf_size)
> +{
> + r->info = info;
> + r->text_buf = text_buf;
> + r->dict_buf = dict_buf;
> + r->text_buf_size = text_buf_size;
> + r->dict_buf_size = dict_buf_size;
> +}
> +
> +/**
> + * prb_for_each_record() - Iterate over a ringbuffer.
> + *
> + * @from: The sequence number to begin with.
> + * @rb: The ringbuffer to iterate over.
> + * @s: A u64 to store the sequence number on each iteration.
> + * @r: A printk_record to store the record on each iteration.
> + *
> + * This is a macro for conveniently iterating over a ringbuffer.
> + *
> + * Context: Any context.
> + */
> +#define prb_for_each_record(from, rb, s, r) \
> +for ((s) = from; prb_read_valid(rb, s, r); (s) = (r)->info->seq + 1)
> +
> +bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
> + struct printk_record *r);
> +bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
> + struct printk_info *info, unsigned int *line_count);
> +
> +u64 prb_first_seq(struct printk_ringbuffer *rb);
> +u64 prb_next_seq(struct printk_ringbuffer *rb);
> +
> +unsigned int prb_count_lines(char *text, unsigned int text_size);
> +
> +#endif /* _KERNEL_PRINTK_RINGBUFFER_H */
> --
> 2.20.1
>
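The descriptor state encoding (DESC_*_MASK) and the "Hack #1" bootstrap arithmetic documented in the patch above can be modeled in a few lines of user-space C. This is a sketch of the arithmetic only, not the kernel code; the macros mirror the patch, but bootstrap_seq_after_first_reserve() is an invented helper for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors of the patch's state_var macros (user-space model). */
#define DESC_SV_BITS		(sizeof(unsigned long) * 8)
#define DESC_COMMITTED_MASK	(1UL << (DESC_SV_BITS - 1))
#define DESC_REUSE_MASK		(1UL << (DESC_SV_BITS - 2))
#define DESC_FLAGS_MASK		(DESC_COMMITTED_MASK | DESC_REUSE_MASK)
#define DESC_ID_MASK		(~DESC_FLAGS_MASK)
#define DESC_ID(sv)		((sv) & DESC_ID_MASK)
#define _DESCS_COUNT(ct_bits)	(1U << (ct_bits))
#define DESC0_ID(ct_bits)	DESC_ID(-(_DESCS_COUNT(ct_bits) + 1))
#define DESC0_SV(ct_bits)	(DESC_COMMITTED_MASK | DESC_REUSE_MASK | DESC0_ID(ct_bits))

/*
 * descs[0].info.seq starts at -(u64)count; every reservation of a
 * descriptor adds the array count, so the first reservation of
 * descs[0] wraps the u64 around to exactly 0, satisfying Req3.
 */
static uint64_t bootstrap_seq_after_first_reserve(unsigned int ct_bits)
{
	uint64_t seq = -(uint64_t)_DESCS_COUNT(ct_bits);

	return seq + _DESCS_COUNT(ct_bits);
}
```

Under this model, DESC0_SV() carries both flag bits set (committed + reusable), which is exactly how readers are made to skip the initial tail descriptor while still treating it as an existing record.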

2020-06-29 20:37:01

by Petr Mladek

Subject: Re: buffer allocation: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On Fri 2020-06-26 17:02:48, John Ogness wrote:
> On 2020-06-25, Petr Mladek <[email protected]> wrote:
> >> --- a/kernel/printk/printk.c
> >> +++ b/kernel/printk/printk.c
> >> + free = __LOG_BUF_LEN;
> >> + prb_for_each_record(0, &printk_rb_static, seq, &r)
> >> + free -= add_to_rb(&printk_rb_dynamic, &r);
> >> +
> >> + prb = &printk_rb_dynamic;
> >
> > This might deserve a comment that this is safe (no lost message)
> > because it is called early enough when everything is still running
> > on the boot CPU.
>
> I will add a comment and an extra check to make sure.
>
> Once everything is lockless, new messages could appear (for example, due
> to NMI messages). The simple check should probably change to a loop. But
> let us not worry about that at this point.

Yup.

> Below is a new version of the relevant patch snippets against mainline
> just to show you how I plan to make it look for the next version.
>
> John Ogness
>
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ @@
> +#define PRB_AVGBITS 5 /* 32 character average length */
> +
> +#if CONFIG_LOG_BUF_SHIFT <= PRB_AVGBITS
> +#error CONFIG_LOG_BUF_SHIFT value too small.
> +#endif
> +_DEFINE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
> + PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]);
> +
> @@ @@
> void __init setup_log_buf(int early)
> {
> + unsigned int new_descs_count;
> + struct prb_desc *new_descs;
> + struct printk_info info;
> + struct printk_record r;
> + size_t new_descs_size;
> unsigned long flags;
> + char *new_dict_buf;
> char *new_log_buf;
> unsigned int free;
> + u64 seq;
>
> /*
> * Some archs call setup_log_buf() multiple times - first is very
> @@ @@ void __init setup_log_buf(int early)
> if (!new_log_buf_len)
> return;
>
> + new_descs_count = new_log_buf_len >> PRB_AVGBITS;
> + if (new_descs_count == 0) {
> + pr_err("new_log_buf_len: %lu too small\n", new_log_buf_len);
> + return;
> + }
> +
> new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
> if (unlikely(!new_log_buf)) {
> - pr_err("log_buf_len: %lu bytes not available\n",
> - new_log_buf_len);
> + pr_err("log_buf_len: %lu text bytes not available\n",
> + new_log_buf_len);
> + return;
> + }
> +
> + new_dict_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
> + if (unlikely(!new_dict_buf)) {
> + pr_err("log_buf_len: %lu dict bytes not available\n",
> + new_log_buf_len);
> + memblock_free(__pa(new_log_buf), new_log_buf_len);
> return;
> }
>
> + new_descs_size = new_descs_count * sizeof(struct prb_desc);
> + new_descs = memblock_alloc(new_descs_size, LOG_ALIGN);
> + if (unlikely(!new_descs)) {
> + pr_err("log_buf_len: %lu desc bytes not available\n",
> + new_descs_size);
> + memblock_free(__pa(new_dict_buf), new_log_buf_len);
> + memblock_free(__pa(new_log_buf), new_log_buf_len);
> + return;
> + }
> +
> + prb_rec_init_rd(&r, &info,
> + &setup_text_buf[0], sizeof(setup_text_buf),
> + &setup_dict_buf[0], sizeof(setup_dict_buf));
> +
> + prb_init(&printk_rb_dynamic,
> + new_log_buf, order_base_2(new_log_buf_len),
> + new_dict_buf, order_base_2(new_log_buf_len),
> + new_descs, order_base_2(new_descs_count));

order_base_2() is safe. But the result might be that some allocated
space is not used.

I would prefer to make sure that new_log_buf_len is rounded, e.g.
by roundup_pow_of_two(), at the beginning of the function. Then we
could use ilog2() here.
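The difference can be sketched with user-space stand-ins for the kernel helpers (simplified loop versions written for illustration; the `_u` names are invented here): once the length has been rounded with roundup_pow_of_two(), ilog2() recovers the exponent exactly, so 2^size_bits matches the allocation byte for byte.

```c
#include <assert.h>

/* floor(log2(n)) */
static unsigned int ilog2_u(unsigned long n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/* smallest power of 2 >= n */
static unsigned long roundup_pow_of_two_u(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* rounds up: 2^order_base_2(n) >= n */
static unsigned int order_base_2_u(unsigned long n)
{
	return ilog2_u(roundup_pow_of_two_u(n));
}
```

For a power-of-two input the two approaches agree; the suggestion is simply to make the rounding explicit and early so ilog2() is exact at the prb_init() call.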

Otherwise, it looks fine to me.

Best Regards,
Petr

2020-06-29 21:55:16

by John Ogness

Subject: Re: syslog size unread: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On 2020-06-25, Petr Mladek <[email protected]> wrote:
> On Thu 2020-06-18 16:55:19, John Ogness wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1609,11 +1633,15 @@ int do_syslog(int type, char __user *buf, int len, int source)
>> break;
>> /* Number of chars in the log buffer */
>> case SYSLOG_ACTION_SIZE_UNREAD:
>> + if (source != SYSLOG_FROM_PROC) {
>> + text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
>> + if (!text)
>> + return -ENOMEM;
>
> The buffer is needed only to count lines to count the size of added
> prefixes. Could we use the new prb_read_valid_info() that allows to
> get the number of lines without actually reading the buffer?

Yes!

For the next version I introduce a macro to iterate just the meta data:

#define prb_for_each_info(from, rb, s, i, lc) \
for ((s) = from; prb_read_valid_info(rb, s, i, lc); (s) = (i)->seq + 1)

This can be used in all 3 locations where prb_count_lines() is used. In
all three places, there is no need to be copying the data. And for
SYSLOG_ACTION_SIZE_UNREAD, there is no need for the record and kmalloc'd
buffer.

This also means that prb_count_lines() will become a private static
helper of the ringbuffer.
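The line counting itself is cheap once only meta data is iterated; a sketch of the semantics (one plausible reading of the described behavior, written for illustration rather than copied from the patch):

```c
#include <assert.h>

/*
 * Count lines in unterminated record text: the text is @text_size
 * bytes, not NUL-terminated, and a record whose text does not end
 * in '\n' still counts as one line. This models the described
 * semantics; it is not the kernel's prb_count_lines() verbatim.
 */
static unsigned int count_lines(const char *text, unsigned int text_size)
{
	unsigned int lines = 1;
	unsigned int i;

	for (i = 0; i < text_size; i++) {
		if (text[i] == '\n')
			lines++;
	}
	return lines;
}
```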

John Ogness

2020-06-29 21:59:33

by John Ogness

Subject: Re: buffer allocation: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On 2020-06-29, Petr Mladek <[email protected]> wrote:
>> @@ @@ void __init setup_log_buf(int early)
>> + prb_init(&printk_rb_dynamic,
>> + new_log_buf, order_base_2(new_log_buf_len),
>> + new_dict_buf, order_base_2(new_log_buf_len),
>> + new_descs, order_base_2(new_descs_count));
>
> order_base_2() is safe. But the result might be that some allocated
> space is not used.
>
> I would prefer to make sure that new_log_buf_len is rounded, e.g.
> by roundup_pow_of_two(), at the beginning of the function. Then we
> could use ilog2() here.

new_log_buf_len can only be set within log_buf_len_update(), and it
is already doing exactly what you want:

	if (size)
		size = roundup_pow_of_two(size);
	if (size > log_buf_len)
		new_log_buf_len = (unsigned long)size;

I can switch to ilog2() instead of the more conservative order_base_2().

John Ogness

2020-07-01 19:59:31

by John Ogness

Subject: Re: pending output optimization: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On 2020-06-25, Petr Mladek <[email protected]> wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -2009,9 +2056,9 @@ asmlinkage int vprintk_emit(int facility, int level,
>>
>> /* This stops the holder of console_sem just where we want him */
>> logbuf_lock_irqsave(flags);
>> - curr_log_seq = log_next_seq;
>> + pending_output = !prb_read_valid(prb, console_seq, NULL);
>> printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
>> - pending_output = (curr_log_seq != log_next_seq);
>> + pending_output &= prb_read_valid(prb, console_seq, NULL);
>
> This will stop working after we remove the locks. Consoles will be
> able to handle messages while the new one is being added. There will
> be no guarantee that someone is still handling the previously pending
> output.
>
> Please, always handle consoles when printed_len is not zero!!!
>
> The pending output was just an optimization added recently. Nobody
> requested it. It was just an idea that made sense.

OK. I will insert a patch before this one that reverts commit
3ac37a93fa92 ("printk: lock/unlock console only for new logbuf
entries"). Then there is no @pending_output for me to implement and it
will be clear that this series is changing/reverting some printk
behavior.

John Ogness

2020-07-02 08:27:41

by Lianbo Jiang

Subject: Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

Hi, John Ogness

About the VMCOREINFO part, I ran some tests based on the v3 kernel
patch. The makedumpfile and crash-utility work as expected with your
patch (userspace patch), but, unfortunately, vmcore-dmesg (kexec-tools)
can't correctly read the printk ring buffer information and gets the
following error:

"Missing the log_buf symbol"

The kexec-tools (vmcore-dmesg) should also get a similar patch, just
like the makedumpfile and crash-utility.

BTW: If you already have a patch for kexec-tools, would you mind sharing
it? I will run a test with vmcore-dmesg.

Thanks.
Lianbo

On 2020-06-18 22:49, John Ogness wrote:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlying ringbuffer.
>
> Changes that exist due to the ringbuffer replacement:
>
> - The VMCOREINFO has been updated for the new structures.
>
> - Dictionary data is now stored in a separate data buffer from the
> human-readable messages. The dictionary data buffer is set to the
> same size as the message buffer. Therefore, the total required
> memory for both dictionary and message data is
> 2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the initial static buffers and
> 2 * log_buf_len (the kernel parameter) for the dynamic buffers.
>
> - Record meta-data is now stored in a separate array of descriptors.
> This is an additional 72 * (2 ^ (CONFIG_LOG_BUF_SHIFT - 5)) bytes
> for the static array and 72 * (log_buf_len >> 5) bytes for the
> dynamic array.
>
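The memory figures in the commit message above can be checked with a little arithmetic. The sketch below is illustrative only (not kernel code): 72 is the sizeof(struct prb_desc) quoted in the message, and 5 is the average-record-size exponent (32-byte average):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Static-buffer footprint per the commit message: text + dict data
 * buffers (2 * 2^shift bytes) plus the descriptor array (72 bytes
 * per descriptor, 2^(shift - 5) descriptors).
 */
static size_t prb_static_footprint(unsigned int log_buf_shift)
{
	size_t data = 2 * ((size_t)1 << log_buf_shift);
	size_t descs = 72 * ((size_t)1 << (log_buf_shift - 5));

	return data + descs;
}
```

For example, with CONFIG_LOG_BUF_SHIFT=17 (128 KiB of text data) this comes to 256 KiB of data buffers plus 288 KiB of descriptors.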
> Signed-off-by: John Ogness <[email protected]>
> ---
> include/linux/kmsg_dump.h | 2 -
> kernel/printk/printk.c | 944 ++++++++++++++++++++------------------
> 2 files changed, 497 insertions(+), 449 deletions(-)
>
> diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
> index 3378bcbe585e..c9b0abe5ca91 100644
> --- a/include/linux/kmsg_dump.h
> +++ b/include/linux/kmsg_dump.h
> @@ -45,8 +45,6 @@ struct kmsg_dumper {
> bool registered;
>
> /* private state of the kmsg iterator */
> - u32 cur_idx;
> - u32 next_idx;
> u64 cur_seq;
> u64 next_seq;
> };
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 8c14835be46c..7642ef634956 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -55,6 +55,7 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/printk.h>
>
> +#include "printk_ringbuffer.h"
> #include "console_cmdline.h"
> #include "braille.h"
> #include "internal.h"
> @@ -294,30 +295,24 @@ enum con_msg_format_flags {
> static int console_msg_format = MSG_FORMAT_DEFAULT;
>
> /*
> - * The printk log buffer consists of a chain of concatenated variable
> - * length records. Every record starts with a record header, containing
> - * the overall length of the record.
> + * The printk log buffer consists of a sequenced collection of records, each
> + * containing variable length message and dictionary text. Every record
> + * also contains its own meta-data (@info).
> *
> - * The heads to the first and last entry in the buffer, as well as the
> - * sequence numbers of these entries are maintained when messages are
> - * stored.
> + * Every record's meta-data carries the timestamp in nanoseconds, as well as
> + * the standard userspace syslog level and syslog facility. The usual kernel
> + * messages use LOG_KERN; userspace-injected messages always carry a matching
> + * syslog facility, by default LOG_USER. The origin of every message can be
> + * reliably determined that way.
> *
> - * If the heads indicate available messages, the length in the header
> - * tells the start next message. A length == 0 for the next message
> - * indicates a wrap-around to the beginning of the buffer.
> + * The human readable log message of a record is available in @text, the
> + * length of the message text in @text_len. The stored message is not
> + * terminated.
> *
> - * Every record carries the monotonic timestamp in microseconds, as well as
> - * the standard userspace syslog level and syslog facility. The usual
> - * kernel messages use LOG_KERN; userspace-injected messages always carry
> - * a matching syslog facility, by default LOG_USER. The origin of every
> - * message can be reliably determined that way.
> - *
> - * The human readable log message directly follows the message header. The
> - * length of the message text is stored in the header, the stored message
> - * is not terminated.
> - *
> - * Optionally, a message can carry a dictionary of properties (key/value pairs),
> - * to provide userspace with a machine-readable message context.
> + * Optionally, a record can carry a dictionary of properties (key/value
> + * pairs), to provide userspace with a machine-readable message context. The
> + * length of the dictionary is available in @dict_len. The dictionary is not
> + * terminated.
> *
> * Examples for well-defined, commonly used property names are:
> * DEVICE=b12:8 device identifier
> @@ -331,21 +326,19 @@ static int console_msg_format = MSG_FORMAT_DEFAULT;
> * follows directly after a '=' character. Every property is terminated by
> * a '\0' character. The last property is not terminated.
> *
> - * Example of a message structure:
> - * 0000 ff 8f 00 00 00 00 00 00 monotonic time in nsec
> - * 0008 34 00 record is 52 bytes long
> - * 000a 0b 00 text is 11 bytes long
> - * 000c 1f 00 dictionary is 23 bytes long
> - * 000e 03 00 LOG_KERN (facility) LOG_ERR (level)
> - * 0010 69 74 27 73 20 61 20 6c "it's a l"
> - * 69 6e 65 "ine"
> - * 001b 44 45 56 49 43 "DEVIC"
> - * 45 3d 62 38 3a 32 00 44 "E=b8:2\0D"
> - * 52 49 56 45 52 3d 62 75 "RIVER=bu"
> - * 67 "g"
> - * 0032 00 00 00 padding to next message header
> - *
> - * The 'struct printk_log' buffer header must never be directly exported to
> + * Example of record values:
> + * record.text_buf = "it's a line" (unterminated)
> + * record.dict_buf = "DEVICE=b8:2\0DRIVER=bug" (unterminated)
> + * record.info.seq = 56
> + * record.info.ts_nsec = 36863
> + * record.info.text_len = 11
> + * record.info.dict_len = 22
> + * record.info.facility = 0 (LOG_KERN)
> + * record.info.flags = 0
> + * record.info.level = 3 (LOG_ERR)
> + * record.info.caller_id = 299 (task 299)
> + *
> + * The 'struct printk_info' buffer must never be directly exported to
> * userspace, it is a kernel-private implementation detail that might
> * need to be changed in the future, when the requirements change.
> *
> @@ -365,23 +358,6 @@ enum log_flags {
> LOG_CONT = 8, /* text is a fragment of a continuation line */
> };
>
> -struct printk_log {
> - u64 ts_nsec; /* timestamp in nanoseconds */
> - u16 len; /* length of entire record */
> - u16 text_len; /* length of text buffer */
> - u16 dict_len; /* length of dictionary buffer */
> - u8 facility; /* syslog facility */
> - u8 flags:5; /* internal record flags */
> - u8 level:3; /* syslog level */
> -#ifdef CONFIG_PRINTK_CALLER
> - u32 caller_id; /* thread id or processor id */
> -#endif
> -}
> -#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> -__packed __aligned(4)
> -#endif
> -;
> -
> /*
> * The logbuf_lock protects kmsg buffer, indices, counters. This can be taken
> * within the scheduler's rq lock. It must be released before calling
> @@ -421,26 +397,16 @@ DEFINE_RAW_SPINLOCK(logbuf_lock);
> DECLARE_WAIT_QUEUE_HEAD(log_wait);
> /* the next printk record to read by syslog(READ) or /proc/kmsg */
> static u64 syslog_seq;
> -static u32 syslog_idx;
> static size_t syslog_partial;
> static bool syslog_time;
>
> -/* index and sequence number of the first record stored in the buffer */
> -static u64 log_first_seq;
> -static u32 log_first_idx;
> -
> -/* index and sequence number of the next record to store in the buffer */
> -static u64 log_next_seq;
> -static u32 log_next_idx;
> -
> /* the next printk record to write to the console */
> static u64 console_seq;
> -static u32 console_idx;
> static u64 exclusive_console_stop_seq;
> +static unsigned long console_dropped;
>
> /* the next printk record to read after the last 'clear' command */
> static u64 clear_seq;
> -static u32 clear_idx;
>
> #ifdef CONFIG_PRINTK_CALLER
> #define PREFIX_MAX 48
> @@ -453,13 +419,28 @@ static u32 clear_idx;
> #define LOG_FACILITY(v) ((v) >> 3 & 0xff)
>
> /* record buffer */
> -#define LOG_ALIGN __alignof__(struct printk_log)
> +#define LOG_ALIGN __alignof__(unsigned long)
> #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
> #define LOG_BUF_LEN_MAX (u32)(1 << 31)
> static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
> static char *log_buf = __log_buf;
> static u32 log_buf_len = __LOG_BUF_LEN;
>
> +/*
> + * Define the average message size. This only affects the number of
> + * descriptors that will be available. Underestimating is better than
> + * overestimating (too many available descriptors is better than not enough).
> + * The dictionary buffer will be the same size as the text buffer.
> + */
> +#define PRB_AVGBITS 5 /* 32 character average length */
> +
> +_DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
> + PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]);
> +
> +static struct printk_ringbuffer printk_rb_dynamic;
> +
> +static struct printk_ringbuffer *prb = &printk_rb_static;
> +
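An aside on the sizing rule above: the descriptor count is derived from the text-buffer size and the assumed 32-byte average record length, so a buffer of 2^N bytes gets 2^(N - PRB_AVGBITS) descriptors. A minimal arithmetic sketch (the helper name and the shift value 17, i.e. a hypothetical 128 KiB buffer, are illustrative, not from the patch):

```c
#include <assert.h>

#define PRB_AVGBITS 5	/* 32-byte average record length */

/* number of descriptors for a text buffer of (1 << text_shift) bytes */
static unsigned int prb_desc_count(unsigned int text_shift)
{
	return 1u << (text_shift - PRB_AVGBITS);
}
```

Underestimating the average length only means more descriptors than needed, which is the safe direction per the comment above.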
> /*
> * We cannot access per-CPU data (e.g. per-CPU flush irq_work) before
> * per_cpu_areas are initialised. This variable is set to true when
> @@ -484,108 +465,6 @@ u32 log_buf_len_get(void)
> return log_buf_len;
> }
>
> -/* human readable text of the record */
> -static char *log_text(const struct printk_log *msg)
> -{
> - return (char *)msg + sizeof(struct printk_log);
> -}
> -
> -/* optional key/value pair dictionary attached to the record */
> -static char *log_dict(const struct printk_log *msg)
> -{
> - return (char *)msg + sizeof(struct printk_log) + msg->text_len;
> -}
> -
> -/* get record by index; idx must point to valid msg */
> -static struct printk_log *log_from_idx(u32 idx)
> -{
> - struct printk_log *msg = (struct printk_log *)(log_buf + idx);
> -
> - /*
> - * A length == 0 record is the end of buffer marker. Wrap around and
> - * read the message at the start of the buffer.
> - */
> - if (!msg->len)
> - return (struct printk_log *)log_buf;
> - return msg;
> -}
> -
> -/* get next record; idx must point to valid msg */
> -static u32 log_next(u32 idx)
> -{
> - struct printk_log *msg = (struct printk_log *)(log_buf + idx);
> -
> - /* length == 0 indicates the end of the buffer; wrap */
> - /*
> - * A length == 0 record is the end of buffer marker. Wrap around and
> - * read the message at the start of the buffer as *this* one, and
> - * return the one after that.
> - */
> - if (!msg->len) {
> - msg = (struct printk_log *)log_buf;
> - return msg->len;
> - }
> - return idx + msg->len;
> -}
> -
> -/*
> - * Check whether there is enough free space for the given message.
> - *
> - * The same values of first_idx and next_idx mean that the buffer
> - * is either empty or full.
> - *
> - * If the buffer is empty, we must respect the position of the indexes.
> - * They cannot be reset to the beginning of the buffer.
> - */
> -static int logbuf_has_space(u32 msg_size, bool empty)
> -{
> - u32 free;
> -
> - if (log_next_idx > log_first_idx || empty)
> - free = max(log_buf_len - log_next_idx, log_first_idx);
> - else
> - free = log_first_idx - log_next_idx;
> -
> - /*
> - * We need space also for an empty header that signalizes wrapping
> - * of the buffer.
> - */
> - return free >= msg_size + sizeof(struct printk_log);
> -}
> -
> -static int log_make_free_space(u32 msg_size)
> -{
> - while (log_first_seq < log_next_seq &&
> - !logbuf_has_space(msg_size, false)) {
> - /* drop old messages until we have enough contiguous space */
> - log_first_idx = log_next(log_first_idx);
> - log_first_seq++;
> - }
> -
> - if (clear_seq < log_first_seq) {
> - clear_seq = log_first_seq;
> - clear_idx = log_first_idx;
> - }
> -
> - /* sequence numbers are equal, so the log buffer is empty */
> - if (logbuf_has_space(msg_size, log_first_seq == log_next_seq))
> - return 0;
> -
> - return -ENOMEM;
> -}
> -
> -/* compute the message size including the padding bytes */
> -static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
> -{
> - u32 size;
> -
> - size = sizeof(struct printk_log) + text_len + dict_len;
> - *pad_len = (-size) & (LOG_ALIGN - 1);
> - size += *pad_len;
> -
> - return size;
> -}
> -
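For readers unfamiliar with the `(-size) & (LOG_ALIGN - 1)` expression in the removed `msg_used_size()`: it is the standard two's-complement trick for computing the number of bytes needed to round up to an alignment boundary. A small sketch under the assumption that LOG_ALIGN is 8 (`__alignof__(unsigned long)` on 64-bit):

```c
#include <assert.h>

#define LOG_ALIGN 8	/* assume __alignof__(unsigned long) == 8 */

/*
 * Padding bytes needed to round @size up to the next LOG_ALIGN
 * boundary: (-size) mod LOG_ALIGN, computed with a mask since
 * LOG_ALIGN is a power of two.
 */
static unsigned int pad_len(unsigned int size)
{
	return (0u - size) & (LOG_ALIGN - 1);
}
```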
> /*
> * Define how much of the log buffer we could take at maximum. The value
> * must be greater than two. Note that only half of the buffer is available
> @@ -594,22 +473,26 @@ static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
> #define MAX_LOG_TAKE_PART 4
> static const char trunc_msg[] = "<truncated>";
>
> -static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len,
> - u16 *dict_len, u32 *pad_len)
> +static void truncate_msg(u16 *text_len, u16 *trunc_msg_len, u16 *dict_len)
> {
> /*
> * The message should not take the whole buffer. Otherwise, it might
> * get removed too soon.
> */
> u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART;
> +
> if (*text_len > max_text_len)
> *text_len = max_text_len;
> - /* enable the warning message */
> +
> + /* enable the warning message (if there is room) */
> *trunc_msg_len = strlen(trunc_msg);
> + if (*text_len >= *trunc_msg_len)
> + *text_len -= *trunc_msg_len;
> + else
> + *trunc_msg_len = 0;
> +
> /* disable the "dict" completely */
> *dict_len = 0;
> - /* compute the size again, count also the warning message */
> - return msg_used_size(*text_len + *trunc_msg_len, 0, pad_len);
> }
>
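The new `truncate_msg()` flow above first caps the text at the per-record maximum, then carves room for the `"<truncated>"` marker out of the capped length, skipping the marker entirely when the text is too short to absorb it. A standalone sketch of that flow (the function name is mine, not the patch's):

```c
#include <assert.h>
#include <string.h>

static const char trunc_msg[] = "<truncated>";

/*
 * Mirror of the truncate_msg() logic: cap the text, then make room
 * for the "<truncated>" marker only if the capped text can absorb it.
 */
static void truncate_sketch(unsigned short *text_len,
			    unsigned short *trunc_msg_len,
			    unsigned int max_text_len)
{
	if (*text_len > max_text_len)
		*text_len = max_text_len;

	*trunc_msg_len = strlen(trunc_msg);
	if (*text_len >= *trunc_msg_len)
		*text_len -= *trunc_msg_len;
	else
		*trunc_msg_len = 0;
}
```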
> /* insert record into the buffer, discard old ones, update heads */
> @@ -618,60 +501,40 @@ static int log_store(u32 caller_id, int facility, int level,
> const char *dict, u16 dict_len,
> const char *text, u16 text_len)
> {
> - struct printk_log *msg;
> - u32 size, pad_len;
> + struct prb_reserved_entry e;
> + struct printk_record r;
> u16 trunc_msg_len = 0;
>
> - /* number of '\0' padding bytes to next message */
> - size = msg_used_size(text_len, dict_len, &pad_len);
> + prb_rec_init_wr(&r, text_len, dict_len);
>
> - if (log_make_free_space(size)) {
> + if (!prb_reserve(&e, prb, &r)) {
> /* truncate the message if it is too long for empty buffer */
> - size = truncate_msg(&text_len, &trunc_msg_len,
> - &dict_len, &pad_len);
> + truncate_msg(&text_len, &trunc_msg_len, &dict_len);
> + prb_rec_init_wr(&r, text_len + trunc_msg_len, dict_len);
> /* survive when the log buffer is too small for trunc_msg */
> - if (log_make_free_space(size))
> + if (!prb_reserve(&e, prb, &r))
> return 0;
> }
>
> - if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
> - /*
> - * This message + an additional empty header does not fit
> - * at the end of the buffer. Add an empty header with len == 0
> - * to signify a wrap around.
> - */
> - memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
> - log_next_idx = 0;
> - }
> -
> /* fill message */
> - msg = (struct printk_log *)(log_buf + log_next_idx);
> - memcpy(log_text(msg), text, text_len);
> - msg->text_len = text_len;
> - if (trunc_msg_len) {
> - memcpy(log_text(msg) + text_len, trunc_msg, trunc_msg_len);
> - msg->text_len += trunc_msg_len;
> - }
> - memcpy(log_dict(msg), dict, dict_len);
> - msg->dict_len = dict_len;
> - msg->facility = facility;
> - msg->level = level & 7;
> - msg->flags = flags & 0x1f;
> + memcpy(&r.text_buf[0], text, text_len);
> + if (trunc_msg_len)
> + memcpy(&r.text_buf[text_len], trunc_msg, trunc_msg_len);
> + if (r.dict_buf)
> + memcpy(&r.dict_buf[0], dict, dict_len);
> + r.info->facility = facility;
> + r.info->level = level & 7;
> + r.info->flags = flags & 0x1f;
> if (ts_nsec > 0)
> - msg->ts_nsec = ts_nsec;
> + r.info->ts_nsec = ts_nsec;
> else
> - msg->ts_nsec = local_clock();
> -#ifdef CONFIG_PRINTK_CALLER
> - msg->caller_id = caller_id;
> -#endif
> - memset(log_dict(msg) + dict_len, 0, pad_len);
> - msg->len = size;
> + r.info->ts_nsec = local_clock();
> + r.info->caller_id = caller_id;
>
> /* insert message */
> - log_next_idx += msg->len;
> - log_next_seq++;
> + prb_commit(&e);
>
> - return msg->text_len;
> + return (text_len + trunc_msg_len);
> }
>
> int dmesg_restrict = IS_ENABLED(CONFIG_SECURITY_DMESG_RESTRICT);
> @@ -723,13 +586,13 @@ static void append_char(char **pp, char *e, char c)
> *(*pp)++ = c;
> }
>
> -static ssize_t msg_print_ext_header(char *buf, size_t size,
> - struct printk_log *msg, u64 seq)
> +static ssize_t info_print_ext_header(char *buf, size_t size,
> + struct printk_info *info)
> {
> - u64 ts_usec = msg->ts_nsec;
> + u64 ts_usec = info->ts_nsec;
> char caller[20];
> #ifdef CONFIG_PRINTK_CALLER
> - u32 id = msg->caller_id;
> + u32 id = info->caller_id;
>
> snprintf(caller, sizeof(caller), ",caller=%c%u",
> id & 0x80000000 ? 'C' : 'T', id & ~0x80000000);
> @@ -740,8 +603,8 @@ static ssize_t msg_print_ext_header(char *buf, size_t size,
> do_div(ts_usec, 1000);
>
> return scnprintf(buf, size, "%u,%llu,%llu,%c%s;",
> - (msg->facility << 3) | msg->level, seq, ts_usec,
> - msg->flags & LOG_CONT ? 'c' : '-', caller);
> + (info->facility << 3) | info->level, info->seq,
> + ts_usec, info->flags & LOG_CONT ? 'c' : '-', caller);
> }
>
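The header format above packs facility and level into one value and flags the caller id's top bit to distinguish a processor id ('C') from a task id ('T'). A self-contained sketch of that bit layout, using the LOG_LEVEL/LOG_FACILITY macros from earlier in the file (the helper function names are mine):

```c
#include <assert.h>

#define LOG_LEVEL(v)	((v) & 7)
#define LOG_FACILITY(v)	((v) >> 3 & 0xff)

/* facility occupies bits 3..10, level bits 0..2 */
static unsigned int pack_facility_level(unsigned int facility,
					unsigned int level)
{
	return (facility << 3) | level;
}

/* 'C' for a processor id (top bit set), 'T' for a task id */
static char caller_kind(unsigned int caller_id)
{
	return (caller_id & 0x80000000u) ? 'C' : 'T';
}
```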
> static ssize_t msg_print_ext_body(char *buf, size_t size,
> @@ -795,10 +658,14 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
> /* /dev/kmsg - userspace message inject/listen interface */
> struct devkmsg_user {
> u64 seq;
> - u32 idx;
> struct ratelimit_state rs;
> struct mutex lock;
> char buf[CONSOLE_EXT_LOG_MAX];
> +
> + struct printk_info info;
> + char text_buf[CONSOLE_EXT_LOG_MAX];
> + char dict_buf[CONSOLE_EXT_LOG_MAX];
> + struct printk_record record;
> };
>
> static __printf(3, 4) __cold
> @@ -881,7 +748,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
> size_t count, loff_t *ppos)
> {
> struct devkmsg_user *user = file->private_data;
> - struct printk_log *msg;
> + struct printk_record *r = &user->record;
> size_t len;
> ssize_t ret;
>
> @@ -893,7 +760,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
> return ret;
>
> logbuf_lock_irq();
> - while (user->seq == log_next_seq) {
> + if (!prb_read_valid(prb, user->seq, r)) {
> if (file->f_flags & O_NONBLOCK) {
> ret = -EAGAIN;
> logbuf_unlock_irq();
> @@ -902,30 +769,26 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
>
> logbuf_unlock_irq();
> ret = wait_event_interruptible(log_wait,
> - user->seq != log_next_seq);
> + prb_read_valid(prb, user->seq, r));
> if (ret)
> goto out;
> logbuf_lock_irq();
> }
>
> - if (user->seq < log_first_seq) {
> + if (user->seq < prb_first_seq(prb)) {
> /* our last seen message is gone, return error and reset */
> - user->idx = log_first_idx;
> - user->seq = log_first_seq;
> + user->seq = prb_first_seq(prb);
> ret = -EPIPE;
> logbuf_unlock_irq();
> goto out;
> }
>
> - msg = log_from_idx(user->idx);
> - len = msg_print_ext_header(user->buf, sizeof(user->buf),
> - msg, user->seq);
> + len = info_print_ext_header(user->buf, sizeof(user->buf), r->info);
> len += msg_print_ext_body(user->buf + len, sizeof(user->buf) - len,
> - log_dict(msg), msg->dict_len,
> - log_text(msg), msg->text_len);
> + &r->dict_buf[0], r->info->dict_len,
> + &r->text_buf[0], r->info->text_len);
>
> - user->idx = log_next(user->idx);
> - user->seq++;
> + user->seq = r->info->seq + 1;
> logbuf_unlock_irq();
>
> if (len > count) {
> @@ -957,8 +820,7 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
> switch (whence) {
> case SEEK_SET:
> /* the first record */
> - user->idx = log_first_idx;
> - user->seq = log_first_seq;
> + user->seq = prb_first_seq(prb);
> break;
> case SEEK_DATA:
> /*
> @@ -966,13 +828,11 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
> * like issued by 'dmesg -c'. Reading /dev/kmsg itself
> * changes no global state, and does not clear anything.
> */
> - user->idx = clear_idx;
> user->seq = clear_seq;
> break;
> case SEEK_END:
> /* after the last record */
> - user->idx = log_next_idx;
> - user->seq = log_next_seq;
> + user->seq = prb_next_seq(prb);
> break;
> case SEEK_CUR:
> /*
> @@ -1002,9 +862,9 @@ static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
> poll_wait(file, &log_wait, wait);
>
> logbuf_lock_irq();
> - if (user->seq < log_next_seq) {
> + if (prb_read_valid(prb, user->seq, NULL)) {
> /* return error when data has vanished underneath us */
> - if (user->seq < log_first_seq)
> + if (user->seq < prb_first_seq(prb))
> ret = EPOLLIN|EPOLLRDNORM|EPOLLERR|EPOLLPRI;
> else
> ret = EPOLLIN|EPOLLRDNORM;
> @@ -1039,9 +899,12 @@ static int devkmsg_open(struct inode *inode, struct file *file)
>
> mutex_init(&user->lock);
>
> + prb_rec_init_rd(&user->record, &user->info,
> + &user->text_buf[0], sizeof(user->text_buf),
> + &user->dict_buf[0], sizeof(user->dict_buf));
> +
> logbuf_lock_irq();
> - user->idx = log_first_idx;
> - user->seq = log_first_seq;
> + user->seq = prb_first_seq(prb);
> logbuf_unlock_irq();
>
> file->private_data = user;
> @@ -1082,23 +945,52 @@ const struct file_operations kmsg_fops = {
> */
> void log_buf_vmcoreinfo_setup(void)
> {
> - VMCOREINFO_SYMBOL(log_buf);
> - VMCOREINFO_SYMBOL(log_buf_len);
> - VMCOREINFO_SYMBOL(log_first_idx);
> - VMCOREINFO_SYMBOL(clear_idx);
> - VMCOREINFO_SYMBOL(log_next_idx);
> + VMCOREINFO_SYMBOL(prb);
> + VMCOREINFO_SYMBOL(printk_rb_static);
> + VMCOREINFO_SYMBOL(clear_seq);
> +
> /*
> - * Export struct printk_log size and field offsets. User space tools can
> + * Export struct size and field offsets. User space tools can
> * parse it and detect any changes to structure down the line.
> */
> - VMCOREINFO_STRUCT_SIZE(printk_log);
> - VMCOREINFO_OFFSET(printk_log, ts_nsec);
> - VMCOREINFO_OFFSET(printk_log, len);
> - VMCOREINFO_OFFSET(printk_log, text_len);
> - VMCOREINFO_OFFSET(printk_log, dict_len);
> -#ifdef CONFIG_PRINTK_CALLER
> - VMCOREINFO_OFFSET(printk_log, caller_id);
> -#endif
> +
> + VMCOREINFO_STRUCT_SIZE(printk_ringbuffer);
> + VMCOREINFO_OFFSET(printk_ringbuffer, desc_ring);
> + VMCOREINFO_OFFSET(printk_ringbuffer, text_data_ring);
> + VMCOREINFO_OFFSET(printk_ringbuffer, dict_data_ring);
> + VMCOREINFO_OFFSET(printk_ringbuffer, fail);
> +
> + VMCOREINFO_STRUCT_SIZE(prb_desc_ring);
> + VMCOREINFO_OFFSET(prb_desc_ring, count_bits);
> + VMCOREINFO_OFFSET(prb_desc_ring, descs);
> + VMCOREINFO_OFFSET(prb_desc_ring, head_id);
> + VMCOREINFO_OFFSET(prb_desc_ring, tail_id);
> +
> + VMCOREINFO_STRUCT_SIZE(prb_desc);
> + VMCOREINFO_OFFSET(prb_desc, info);
> + VMCOREINFO_OFFSET(prb_desc, state_var);
> + VMCOREINFO_OFFSET(prb_desc, text_blk_lpos);
> + VMCOREINFO_OFFSET(prb_desc, dict_blk_lpos);
> +
> + VMCOREINFO_STRUCT_SIZE(prb_data_blk_lpos);
> + VMCOREINFO_OFFSET(prb_data_blk_lpos, begin);
> + VMCOREINFO_OFFSET(prb_data_blk_lpos, next);
> +
> + VMCOREINFO_STRUCT_SIZE(printk_info);
> + VMCOREINFO_OFFSET(printk_info, seq);
> + VMCOREINFO_OFFSET(printk_info, ts_nsec);
> + VMCOREINFO_OFFSET(printk_info, text_len);
> + VMCOREINFO_OFFSET(printk_info, dict_len);
> + VMCOREINFO_OFFSET(printk_info, caller_id);
> +
> + VMCOREINFO_STRUCT_SIZE(prb_data_ring);
> + VMCOREINFO_OFFSET(prb_data_ring, size_bits);
> + VMCOREINFO_OFFSET(prb_data_ring, data);
> + VMCOREINFO_OFFSET(prb_data_ring, head_lpos);
> + VMCOREINFO_OFFSET(prb_data_ring, tail_lpos);
> +
> + VMCOREINFO_SIZE(atomic_long_t);
> + VMCOREINFO_TYPE_OFFSET(atomic_long_t, counter);
> }
> #endif
>
> @@ -1176,11 +1068,46 @@ static void __init set_percpu_data_ready(void)
> __printk_percpu_data_ready = true;
> }
>
> +static unsigned int __init add_to_rb(struct printk_ringbuffer *rb,
> + struct printk_record *r)
> +{
> + struct prb_reserved_entry e;
> + struct printk_record dest_r;
> +
> + prb_rec_init_wr(&dest_r, r->info->text_len, r->info->dict_len);
> +
> + if (!prb_reserve(&e, rb, &dest_r))
> + return 0;
> +
> + memcpy(&dest_r.text_buf[0], &r->text_buf[0], dest_r.text_buf_size);
> + if (dest_r.dict_buf) {
> + memcpy(&dest_r.dict_buf[0], &r->dict_buf[0],
> + dest_r.dict_buf_size);
> + }
> + dest_r.info->facility = r->info->facility;
> + dest_r.info->level = r->info->level;
> + dest_r.info->flags = r->info->flags;
> + dest_r.info->ts_nsec = r->info->ts_nsec;
> + dest_r.info->caller_id = r->info->caller_id;
> +
> + prb_commit(&e);
> +
> + return prb_record_text_space(&e);
> +}
> +
> +static char setup_text_buf[CONSOLE_EXT_LOG_MAX] __initdata;
> +static char setup_dict_buf[CONSOLE_EXT_LOG_MAX] __initdata;
> +
> void __init setup_log_buf(int early)
> {
> + struct prb_desc *new_descs;
> + struct printk_info info;
> + struct printk_record r;
> unsigned long flags;
> + char *new_dict_buf;
> char *new_log_buf;
> unsigned int free;
> + u64 seq;
>
> /*
> * Some archs call setup_log_buf() multiple times - first is very
> @@ -1201,17 +1128,50 @@ void __init setup_log_buf(int early)
>
> new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
> if (unlikely(!new_log_buf)) {
> - pr_err("log_buf_len: %lu bytes not available\n",
> + pr_err("log_buf_len: %lu text bytes not available\n",
> new_log_buf_len);
> return;
> }
>
> + new_dict_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
> + if (unlikely(!new_dict_buf)) {
> + /* dictionary failure is allowed */
> + pr_err("log_buf_len: %lu dict bytes not available\n",
> + new_log_buf_len);
> + }
> +
> + new_descs = memblock_alloc((new_log_buf_len >> PRB_AVGBITS) *
> + sizeof(struct prb_desc), LOG_ALIGN);
> + if (unlikely(!new_descs)) {
> + pr_err("log_buf_len: %lu desc bytes not available\n",
> + new_log_buf_len >> PRB_AVGBITS);
> + if (new_dict_buf)
> + memblock_free(__pa(new_dict_buf), new_log_buf_len);
> + memblock_free(__pa(new_log_buf), new_log_buf_len);
> + return;
> + }
> +
> + prb_rec_init_rd(&r, &info,
> + &setup_text_buf[0], sizeof(setup_text_buf),
> + &setup_dict_buf[0], sizeof(setup_dict_buf));
> +
> logbuf_lock_irqsave(flags);
> +
> + prb_init(&printk_rb_dynamic,
> + new_log_buf, bits_per(new_log_buf_len) - 1,
> + new_dict_buf, bits_per(new_log_buf_len) - 1,
> + new_descs, (bits_per(new_log_buf_len) - 1) - PRB_AVGBITS);
> +
> log_buf_len = new_log_buf_len;
> log_buf = new_log_buf;
> new_log_buf_len = 0;
> - free = __LOG_BUF_LEN - log_next_idx;
> - memcpy(log_buf, __log_buf, __LOG_BUF_LEN);
> +
> + free = __LOG_BUF_LEN;
> + prb_for_each_record(0, &printk_rb_static, seq, &r)
> + free -= add_to_rb(&printk_rb_dynamic, &r);
> +
> + prb = &printk_rb_dynamic;
> +
> logbuf_unlock_irqrestore(flags);
>
> pr_info("log_buf_len: %u bytes\n", log_buf_len);
> @@ -1323,18 +1283,18 @@ static size_t print_caller(u32 id, char *buf)
> #define print_caller(id, buf) 0
> #endif
>
> -static size_t print_prefix(const struct printk_log *msg, bool syslog,
> - bool time, char *buf)
> +static size_t info_print_prefix(const struct printk_info *info, bool syslog,
> + bool time, char *buf)
> {
> size_t len = 0;
>
> if (syslog)
> - len = print_syslog((msg->facility << 3) | msg->level, buf);
> + len = print_syslog((info->facility << 3) | info->level, buf);
>
> if (time)
> - len += print_time(msg->ts_nsec, buf + len);
> + len += print_time(info->ts_nsec, buf + len);
>
> - len += print_caller(msg->caller_id, buf + len);
> + len += print_caller(info->caller_id, buf + len);
>
> if (IS_ENABLED(CONFIG_PRINTK_CALLER) || time) {
> buf[len++] = ' ';
> @@ -1344,72 +1304,150 @@ static size_t print_prefix(const struct printk_log *msg, bool syslog,
> return len;
> }
>
> -static size_t msg_print_text(const struct printk_log *msg, bool syslog,
> - bool time, char *buf, size_t size)
> +static size_t record_print_text(struct printk_record *r, bool syslog,
> + bool time)
> {
> - const char *text = log_text(msg);
> - size_t text_size = msg->text_len;
> - size_t len = 0;
> + size_t text_len = r->info->text_len;
> + size_t buf_size = r->text_buf_size;
> + char *text = r->text_buf;
> char prefix[PREFIX_MAX];
> - const size_t prefix_len = print_prefix(msg, syslog, time, prefix);
> + bool truncated = false;
> + size_t prefix_len;
> + size_t line_len;
> + size_t len = 0;
> + char *next;
>
> - do {
> - const char *next = memchr(text, '\n', text_size);
> - size_t text_len;
> + prefix_len = info_print_prefix(r->info, syslog, time, prefix);
>
> + /*
> + * Add the prefix for each line by shifting the rest of the text to
> + * make room for each prefix. If the buffer is not large enough for
> + * all the prefixes, then drop the trailing text and report the
> + * largest length that includes full lines with their prefixes.
> + *
> + * @text_len: bytes of unprocessed text
> + * @line_len: bytes of current line (newline _not_ included)
> + * @text: pointer to beginning of current line
> + * @len: number of bytes processed (size of r->text_buf done)
> + */
> + for (;;) {
> + next = memchr(text, '\n', text_len);
> if (next) {
> - text_len = next - text;
> - next++;
> - text_size -= next - text;
> + line_len = next - text;
> } else {
> - text_len = text_size;
> + /*
> + * No newline. If the text was previously truncated,
> + * assume this line was truncated and do not include
> + * it.
> + */
> + if (truncated)
> + break;
> + line_len = text_len;
> }
>
> - if (buf) {
> - if (prefix_len + text_len + 1 >= size - len)
> + /*
> + * Ensure there is enough buffer available to shift this line
> + * (and add a newline at the end).
> + */
> + if (len + prefix_len + line_len + 1 > buf_size)
> + break;
> +
> + /*
> + * Ensure there is enough buffer available to shift all
> + * remaining text (and add a newline at the end). If this
> + * test fails, then there must be a newline (i.e.
> + * text_len > line_len because the previous test succeeded).
> + */
> + if (len + prefix_len + text_len + 1 > buf_size) {
> + /*
> + * Truncate @text_len so that there is enough space
> + * for a prefix. A newline will not be added because
> + * the last line of the text is now truncated and
> + * will not be included.
> + */
> + text_len = (buf_size - len) - prefix_len;
> +
> + /*
> + * Ensure there is still a newline. Otherwise this
> + * line may have been truncated and will not be
> + * included.
> + */
> + if (memchr(text, '\n', text_len) == NULL)
> break;
>
> - memcpy(buf + len, prefix, prefix_len);
> - len += prefix_len;
> - memcpy(buf + len, text, text_len);
> - len += text_len;
> - buf[len++] = '\n';
> - } else {
> - /* SYSLOG_ACTION_* buffer size only calculation */
> - len += prefix_len + text_len + 1;
> + /* Note that the last line will not be included. */
> + truncated = true;
> }
>
> - text = next;
> - } while (text);
> + memmove(text + prefix_len, text, text_len);
> + memcpy(text, prefix, prefix_len);
> +
> + /* Advance beyond the newly added prefix and existing line. */
> + text += prefix_len + line_len;
> +
> + /* The remaining text has only decreased by the line. */
> + text_len -= line_len;
> +
> + len += prefix_len + line_len + 1;
> +
> + if (text_len) {
> + /* Advance past the newline. */
> + text++;
> + text_len--;
> + } else {
> + /* The final line, add a newline. */
> + *text = '\n';
> + break;
> + }
> + }
>
> return len;
> }
>
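The in-place prefixing done by the new `record_print_text()` above is the subtle part of this hunk: each line is shifted right with memmove() to open a gap, the prefix is copied into the gap, and a trailing newline is added after the final line. A simplified, self-contained sketch of that core loop (without the truncation bookkeeping, and with a helper name of my own choosing):

```c
#include <assert.h>
#include <string.h>

/*
 * Prefix every '\n'-separated line of @text in place, as
 * record_print_text() does: shift the remaining text right to make
 * room, copy the prefix in front, and append a final newline.
 * Returns the bytes of output produced, or 0 if @buf_size is too
 * small (truncation handling is omitted in this sketch).
 */
static size_t prefix_each_line(char *text, size_t text_len,
			       size_t buf_size, const char *prefix)
{
	size_t prefix_len = strlen(prefix);
	size_t len = 0;
	char *pos = text;

	while (text_len) {
		char *next = memchr(pos, '\n', text_len);
		size_t line_len = next ? (size_t)(next - pos) : text_len;

		/* room for the prefix, the line and a newline? */
		if (len + prefix_len + line_len + 1 > buf_size)
			return 0;

		memmove(pos + prefix_len, pos, text_len);
		memcpy(pos, prefix, prefix_len);

		pos += prefix_len + line_len;
		text_len -= line_len;
		len += prefix_len + line_len + 1;

		if (text_len) {
			pos++;		/* step past the existing newline */
			text_len--;
		} else {
			*pos = '\n';	/* terminate the final line */
		}
	}
	return len;
}
```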
> +static size_t get_record_text_size(struct printk_info *info,
> + unsigned int line_count,
> + bool syslog, bool time)
> +{
> + char prefix[PREFIX_MAX];
> + size_t prefix_len;
> +
> + prefix_len = info_print_prefix(info, syslog, time, prefix);
> +
> + /*
> + * Each line will be preceded with a prefix. The intermediate
> + * newlines are already within the text, but a final trailing
> + * newline will be added.
> + */
> + return ((prefix_len * line_count) + info->text_len + 1);
> +}
> +
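`get_record_text_size()` lets SYSLOG_ACTION_SIZE_UNREAD and the first pass of syslog_print_all() estimate the rendered size without actually rendering: one prefix per line, plus the text itself, plus one trailing newline. A sketch of the count-then-size computation (`count_lines` here is my stand-in for what `prb_count_lines()` is expected to return):

```c
#include <assert.h>
#include <string.h>

/* count '\n'-separated lines in @text (a record has no trailing '\n') */
static unsigned int count_lines(const char *text, unsigned int len)
{
	unsigned int lines = 1;
	const char *p = text;

	while ((p = memchr(p, '\n', len - (p - text))) != NULL) {
		lines++;
		p++;
		if ((unsigned int)(p - text) >= len)
			break;
	}
	return lines;
}

/* rendered size: a prefix per line, the text, one final newline */
static unsigned int record_size(unsigned int prefix_len,
				unsigned int line_count,
				unsigned int text_len)
{
	return prefix_len * line_count + text_len + 1;
}
```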
> static int syslog_print(char __user *buf, int size)
> {
> + struct printk_info info;
> + struct printk_record r;
> char *text;
> - struct printk_log *msg;
> int len = 0;
>
> text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
> if (!text)
> return -ENOMEM;
>
> + prb_rec_init_rd(&r, &info, text, LOG_LINE_MAX + PREFIX_MAX, NULL, 0);
> +
> while (size > 0) {
> size_t n;
> size_t skip;
>
> logbuf_lock_irq();
> - if (syslog_seq < log_first_seq) {
> - /* messages are gone, move to first one */
> - syslog_seq = log_first_seq;
> - syslog_idx = log_first_idx;
> - syslog_partial = 0;
> - }
> - if (syslog_seq == log_next_seq) {
> + if (!prb_read_valid(prb, syslog_seq, &r)) {
> logbuf_unlock_irq();
> break;
> }
> + if (r.info->seq != syslog_seq) {
> + /* message is gone, move to next valid one */
> + syslog_seq = r.info->seq;
> + syslog_partial = 0;
> + }
>
> /*
> * To keep reading/counting partial line consistent,
> @@ -1419,13 +1457,10 @@ static int syslog_print(char __user *buf, int size)
> syslog_time = printk_time;
>
> skip = syslog_partial;
> - msg = log_from_idx(syslog_idx);
> - n = msg_print_text(msg, true, syslog_time, text,
> - LOG_LINE_MAX + PREFIX_MAX);
> + n = record_print_text(&r, true, syslog_time);
> if (n - syslog_partial <= size) {
> /* message fits into buffer, move forward */
> - syslog_idx = log_next(syslog_idx);
> - syslog_seq++;
> + syslog_seq = r.info->seq + 1;
> n -= syslog_partial;
> syslog_partial = 0;
> } else if (!len){
> @@ -1456,55 +1491,49 @@ static int syslog_print(char __user *buf, int size)
>
> static int syslog_print_all(char __user *buf, int size, bool clear)
> {
> + struct printk_info info;
> + unsigned int line_count;
> + struct printk_record r;
> char *text;
> int len = 0;
> - u64 next_seq;
> u64 seq;
> - u32 idx;
> bool time;
>
> text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
> if (!text)
> return -ENOMEM;
>
> + prb_rec_init_rd(&r, &info, text, LOG_LINE_MAX + PREFIX_MAX, NULL, 0);
> +
> time = printk_time;
> logbuf_lock_irq();
> /*
> * Find first record that fits, including all following records,
> * into the user-provided buffer for this dump.
> */
> - seq = clear_seq;
> - idx = clear_idx;
> - while (seq < log_next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> -
> - len += msg_print_text(msg, true, time, NULL, 0);
> - idx = log_next(idx);
> - seq++;
> + prb_for_each_record(clear_seq, prb, seq, &r) {
> + line_count = prb_count_lines(text, r.info->text_len);
> + len += get_record_text_size(r.info, line_count, true, time);
> }
>
> /* move first record forward until length fits into the buffer */
> - seq = clear_seq;
> - idx = clear_idx;
> - while (len > size && seq < log_next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> -
> - len -= msg_print_text(msg, true, time, NULL, 0);
> - idx = log_next(idx);
> - seq++;
> + prb_for_each_record(clear_seq, prb, seq, &r) {
> + if (len <= size)
> + break;
> + line_count = prb_count_lines(text, r.info->text_len);
> + len -= get_record_text_size(r.info, line_count, true, time);
> }
>
> - /* last message fitting into this dump */
> - next_seq = log_next_seq;
> -
> len = 0;
> - while (len >= 0 && seq < next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> - int textlen = msg_print_text(msg, true, time, text,
> - LOG_LINE_MAX + PREFIX_MAX);
> + prb_for_each_record(seq, prb, seq, &r) {
> + int textlen;
>
> - idx = log_next(idx);
> - seq++;
> + textlen = record_print_text(&r, true, time);
> +
> + if (len + textlen > size) {
> + seq--;
> + break;
> + }
>
> logbuf_unlock_irq();
> if (copy_to_user(buf + len, text, textlen))
> @@ -1513,17 +1542,12 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
> len += textlen;
> logbuf_lock_irq();
>
> - if (seq < log_first_seq) {
> - /* messages are gone, move to next one */
> - seq = log_first_seq;
> - idx = log_first_idx;
> - }
> + if (len < 0)
> + break;
> }
>
> - if (clear) {
> - clear_seq = log_next_seq;
> - clear_idx = log_next_idx;
> - }
> + if (clear)
> + clear_seq = seq;
> logbuf_unlock_irq();
>
> kfree(text);
> @@ -1533,8 +1557,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
> static void syslog_clear(void)
> {
> logbuf_lock_irq();
> - clear_seq = log_next_seq;
> - clear_idx = log_next_idx;
> + clear_seq = prb_next_seq(prb);
> logbuf_unlock_irq();
> }
>
> @@ -1542,6 +1565,7 @@ int do_syslog(int type, char __user *buf, int len, int source)
> {
> bool clear = false;
> static int saved_console_loglevel = LOGLEVEL_DEFAULT;
> + char *text = NULL;
> int error;
>
> error = check_syslog_permissions(type, source);
> @@ -1561,7 +1585,7 @@ int do_syslog(int type, char __user *buf, int len, int source)
> if (!access_ok(buf, len))
> return -EFAULT;
> error = wait_event_interruptible(log_wait,
> - syslog_seq != log_next_seq);
> + prb_read_valid(prb, syslog_seq, NULL));
> if (error)
> return error;
> error = syslog_print(buf, len);
> @@ -1609,11 +1633,15 @@ int do_syslog(int type, char __user *buf, int len, int source)
> break;
> /* Number of chars in the log buffer */
> case SYSLOG_ACTION_SIZE_UNREAD:
> + if (source != SYSLOG_FROM_PROC) {
> + text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
> + if (!text)
> + return -ENOMEM;
> + }
> logbuf_lock_irq();
> - if (syslog_seq < log_first_seq) {
> + if (syslog_seq < prb_first_seq(prb)) {
> /* messages are gone, move to first one */
> - syslog_seq = log_first_seq;
> - syslog_idx = log_first_idx;
> + syslog_seq = prb_first_seq(prb);
> syslog_partial = 0;
> }
> if (source == SYSLOG_FROM_PROC) {
> @@ -1622,24 +1650,29 @@ int do_syslog(int type, char __user *buf, int len, int source)
> * for pending data, not the size; return the count of
> * records, not the length.
> */
> - error = log_next_seq - syslog_seq;
> + error = prb_next_seq(prb) - syslog_seq;
> } else {
> - u64 seq = syslog_seq;
> - u32 idx = syslog_idx;
> bool time = syslog_partial ? syslog_time : printk_time;
> -
> - while (seq < log_next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> -
> - error += msg_print_text(msg, true, time, NULL,
> - 0);
> + struct printk_info info;
> + unsigned int line_count;
> + struct printk_record r;
> + u64 seq;
> +
> + prb_rec_init_rd(&r, &info, text,
> + LOG_LINE_MAX + PREFIX_MAX, NULL, 0);
> +
> + prb_for_each_record(syslog_seq, prb, seq, &r) {
> + line_count = prb_count_lines(text,
> + r.info->text_len);
> + error += get_record_text_size(r.info,
> + line_count,
> + true, time);
> time = printk_time;
> - idx = log_next(idx);
> - seq++;
> }
> error -= syslog_partial;
> }
> logbuf_unlock_irq();
> + kfree(text);
> break;
> /* Size of the log buffer */
> case SYSLOG_ACTION_SIZE_BUFFER:
> @@ -1806,10 +1839,22 @@ static int console_trylock_spinning(void)
> static void call_console_drivers(const char *ext_text, size_t ext_len,
> const char *text, size_t len)
> {
> + static char dropped_text[64];
> + size_t dropped_len = 0;
> struct console *con;
>
> trace_console_rcuidle(text, len);
>
> + if (!console_drivers)
> + return;
> +
> + if (console_dropped) {
> + dropped_len = snprintf(dropped_text, sizeof(dropped_text),
> + "** %lu printk messages dropped **\n",
> + console_dropped);
> + console_dropped = 0;
> + }
> +
> for_each_console(con) {
> if (exclusive_console && con != exclusive_console)
> continue;
> @@ -1822,8 +1867,11 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
> continue;
> if (con->flags & CON_EXTENDED)
> con->write(con, ext_text, ext_len);
> - else
> + else {
> + if (dropped_len)
> + con->write(con, dropped_text, dropped_len);
> con->write(con, text, len);
> + }
> }
> }
>
> @@ -1993,7 +2041,6 @@ asmlinkage int vprintk_emit(int facility, int level,
> int printed_len;
> bool in_sched = false, pending_output;
> unsigned long flags;
> - u64 curr_log_seq;
>
> /* Suppress unimportant messages after panic happens */
> if (unlikely(suppress_printk))
> @@ -2009,9 +2056,9 @@ asmlinkage int vprintk_emit(int facility, int level,
>
> /* This stops the holder of console_sem just where we want him */
> logbuf_lock_irqsave(flags);
> - curr_log_seq = log_next_seq;
> + pending_output = !prb_read_valid(prb, console_seq, NULL);
> printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
> - pending_output = (curr_log_seq != log_next_seq);
> + pending_output &= prb_read_valid(prb, console_seq, NULL);
> logbuf_unlock_irqrestore(flags);
>
> /* If called from the scheduler, we can not call up(). */
> @@ -2090,21 +2137,24 @@ EXPORT_SYMBOL(printk);
> #define PREFIX_MAX 0
> #define printk_time false
>
> +#define prb_read_valid(rb, seq, r) false
> +#define prb_first_seq(rb) 0
> +
> static u64 syslog_seq;
> -static u32 syslog_idx;
> static u64 console_seq;
> -static u32 console_idx;
> static u64 exclusive_console_stop_seq;
> -static u64 log_first_seq;
> -static u32 log_first_idx;
> -static u64 log_next_seq;
> -static char *log_text(const struct printk_log *msg) { return NULL; }
> -static char *log_dict(const struct printk_log *msg) { return NULL; }
> -static struct printk_log *log_from_idx(u32 idx) { return NULL; }
> -static u32 log_next(u32 idx) { return 0; }
> -static ssize_t msg_print_ext_header(char *buf, size_t size,
> - struct printk_log *msg,
> - u64 seq) { return 0; }
> +static unsigned long console_dropped;
> +
> +static size_t record_print_text(const struct printk_record *r,
> + bool syslog, bool time)
> +{
> + return 0;
> +}
> +static ssize_t info_print_ext_header(char *buf, size_t size,
> + struct printk_info *info)
> +{
> + return 0;
> +}
> static ssize_t msg_print_ext_body(char *buf, size_t size,
> char *dict, size_t dict_len,
> char *text, size_t text_len) { return 0; }
> @@ -2112,8 +2162,6 @@ static void console_lock_spinning_enable(void) { }
> static int console_lock_spinning_disable_and_check(void) { return 0; }
> static void call_console_drivers(const char *ext_text, size_t ext_len,
> const char *text, size_t len) {}
> -static size_t msg_print_text(const struct printk_log *msg, bool syslog,
> - bool time, char *buf, size_t size) { return 0; }
> static bool suppress_message_printing(int level) { return false; }
>
> #endif /* CONFIG_PRINTK */
> @@ -2398,14 +2446,19 @@ void console_unlock(void)
> {
> static char ext_text[CONSOLE_EXT_LOG_MAX];
> static char text[LOG_LINE_MAX + PREFIX_MAX];
> + static char dict[LOG_LINE_MAX];
> unsigned long flags;
> bool do_cond_resched, retry;
> + struct printk_info info;
> + struct printk_record r;
>
> if (console_suspended) {
> up_console_sem();
> return;
> }
>
> + prb_rec_init_rd(&r, &info, text, sizeof(text), dict, sizeof(dict));
> +
> /*
> * Console drivers are called with interrupts disabled, so
> * @console_may_schedule should be cleared before; however, we may
> @@ -2436,35 +2489,26 @@ void console_unlock(void)
> }
>
> for (;;) {
> - struct printk_log *msg;
> size_t ext_len = 0;
> size_t len;
>
> printk_safe_enter_irqsave(flags);
> raw_spin_lock(&logbuf_lock);
> - if (console_seq < log_first_seq) {
> - len = snprintf(text, sizeof(text),
> - "** %llu printk messages dropped **\n",
> - log_first_seq - console_seq);
> -
> - /* messages are gone, move to first one */
> - console_seq = log_first_seq;
> - console_idx = log_first_idx;
> - } else {
> - len = 0;
> - }
> skip:
> - if (console_seq == log_next_seq)
> + if (!prb_read_valid(prb, console_seq, &r))
> break;
>
> - msg = log_from_idx(console_idx);
> - if (suppress_message_printing(msg->level)) {
> + if (console_seq != r.info->seq) {
> + console_dropped += r.info->seq - console_seq;
> + console_seq = r.info->seq;
> + }
> +
> + if (suppress_message_printing(r.info->level)) {
> /*
> * Skip record we have buffered and already printed
> * directly to the console when we received it, and
> * record that has level above the console loglevel.
> */
> - console_idx = log_next(console_idx);
> console_seq++;
> goto skip;
> }
> @@ -2475,19 +2519,24 @@ void console_unlock(void)
> exclusive_console = NULL;
> }
>
> - len += msg_print_text(msg,
> - console_msg_format & MSG_FORMAT_SYSLOG,
> - printk_time, text + len, sizeof(text) - len);
> + /*
> + * Handle extended console text first because later
> + * record_print_text() will modify the record buffer in-place.
> + */
> if (nr_ext_console_drivers) {
> - ext_len = msg_print_ext_header(ext_text,
> + ext_len = info_print_ext_header(ext_text,
> sizeof(ext_text),
> - msg, console_seq);
> + r.info);
> ext_len += msg_print_ext_body(ext_text + ext_len,
> sizeof(ext_text) - ext_len,
> - log_dict(msg), msg->dict_len,
> - log_text(msg), msg->text_len);
> + &r.dict_buf[0],
> + r.info->dict_len,
> + &r.text_buf[0],
> + r.info->text_len);
> }
> - console_idx = log_next(console_idx);
> + len = record_print_text(&r,
> + console_msg_format & MSG_FORMAT_SYSLOG,
> + printk_time);
> console_seq++;
> raw_spin_unlock(&logbuf_lock);
>
> @@ -2527,7 +2576,7 @@ void console_unlock(void)
> * flush, no worries.
> */
> raw_spin_lock(&logbuf_lock);
> - retry = console_seq != log_next_seq;
> + retry = prb_read_valid(prb, console_seq, NULL);
> raw_spin_unlock(&logbuf_lock);
> printk_safe_exit_irqrestore(flags);
>
> @@ -2596,8 +2645,7 @@ void console_flush_on_panic(enum con_flush_mode mode)
> unsigned long flags;
>
> logbuf_lock_irqsave(flags);
> - console_seq = log_first_seq;
> - console_idx = log_first_idx;
> + console_seq = prb_first_seq(prb);
> logbuf_unlock_irqrestore(flags);
> }
> console_unlock();
> @@ -2840,7 +2888,6 @@ void register_console(struct console *newcon)
> exclusive_console = newcon;
> exclusive_console_stop_seq = console_seq;
> console_seq = syslog_seq;
> - console_idx = syslog_idx;
> logbuf_unlock_irqrestore(flags);
> }
> console_unlock();
> @@ -3229,9 +3276,7 @@ void kmsg_dump(enum kmsg_dump_reason reason)
>
> logbuf_lock_irqsave(flags);
> dumper->cur_seq = clear_seq;
> - dumper->cur_idx = clear_idx;
> - dumper->next_seq = log_next_seq;
> - dumper->next_idx = log_next_idx;
> + dumper->next_seq = prb_next_seq(prb);
> logbuf_unlock_irqrestore(flags);
>
> /* invoke dumper which will iterate over records */
> @@ -3265,28 +3310,33 @@ void kmsg_dump(enum kmsg_dump_reason reason)
> bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
> char *line, size_t size, size_t *len)
> {
> - struct printk_log *msg;
> + struct printk_info info;
> + unsigned int line_count;
> + struct printk_record r;
> size_t l = 0;
> bool ret = false;
>
> + prb_rec_init_rd(&r, &info, line, size, NULL, 0);
> +
> if (!dumper->active)
> goto out;
>
> - if (dumper->cur_seq < log_first_seq) {
> - /* messages are gone, move to first available one */
> - dumper->cur_seq = log_first_seq;
> - dumper->cur_idx = log_first_idx;
> - }
> -
> - /* last entry */
> - if (dumper->cur_seq >= log_next_seq)
> - goto out;
> + /* Read text or count text lines? */
> + if (line) {
> + if (!prb_read_valid(prb, dumper->cur_seq, &r))
> + goto out;
> + l = record_print_text(&r, syslog, printk_time);
> + } else {
> + if (!prb_read_valid_info(prb, dumper->cur_seq,
> + &info, &line_count)) {
> + goto out;
> + }
> + l = get_record_text_size(&info, line_count, syslog,
> + printk_time);
>
> - msg = log_from_idx(dumper->cur_idx);
> - l = msg_print_text(msg, syslog, printk_time, line, size);
> + }
>
> - dumper->cur_idx = log_next(dumper->cur_idx);
> - dumper->cur_seq++;
> + dumper->cur_seq = r.info->seq + 1;
> ret = true;
> out:
> if (len)
> @@ -3347,23 +3397,25 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_line);
> bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
> char *buf, size_t size, size_t *len)
> {
> + struct printk_info info;
> + unsigned int line_count;
> + struct printk_record r;
> unsigned long flags;
> u64 seq;
> - u32 idx;
> u64 next_seq;
> - u32 next_idx;
> size_t l = 0;
> bool ret = false;
> bool time = printk_time;
>
> - if (!dumper->active)
> + prb_rec_init_rd(&r, &info, buf, size, NULL, 0);
> +
> + if (!dumper->active || !buf || !size)
> goto out;
>
> logbuf_lock_irqsave(flags);
> - if (dumper->cur_seq < log_first_seq) {
> + if (dumper->cur_seq < prb_first_seq(prb)) {
> /* messages are gone, move to first available one */
> - dumper->cur_seq = log_first_seq;
> - dumper->cur_idx = log_first_idx;
> + dumper->cur_seq = prb_first_seq(prb);
> }
>
> /* last entry */
> @@ -3374,41 +3426,41 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
>
> /* calculate length of entire buffer */
> seq = dumper->cur_seq;
> - idx = dumper->cur_idx;
> - while (seq < dumper->next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> -
> - l += msg_print_text(msg, true, time, NULL, 0);
> - idx = log_next(idx);
> - seq++;
> + while (prb_read_valid_info(prb, seq, &info, &line_count)) {
> + if (r.info->seq >= dumper->next_seq)
> + break;
> + l += get_record_text_size(&info, line_count, true, time);
> + seq = r.info->seq + 1;
> }
>
> /* move first record forward until length fits into the buffer */
> seq = dumper->cur_seq;
> - idx = dumper->cur_idx;
> - while (l >= size && seq < dumper->next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> -
> - l -= msg_print_text(msg, true, time, NULL, 0);
> - idx = log_next(idx);
> - seq++;
> + while (l >= size && prb_read_valid_info(prb, seq,
> + &info, &line_count)) {
> + if (r.info->seq >= dumper->next_seq)
> + break;
> + l -= get_record_text_size(&info, line_count, true, time);
> + seq = r.info->seq + 1;
> }
>
> /* last message in next interation */
> next_seq = seq;
> - next_idx = idx;
>
> + /* actually read text into the buffer now */
> l = 0;
> - while (seq < dumper->next_seq) {
> - struct printk_log *msg = log_from_idx(idx);
> + while (prb_read_valid(prb, seq, &r)) {
> + if (r.info->seq >= dumper->next_seq)
> + break;
> +
> + l += record_print_text(&r, syslog, time);
> +
> + /* adjust record to store to remaining buffer space */
> + prb_rec_init_rd(&r, &info, buf + l, size - l, NULL, 0);
>
> - l += msg_print_text(msg, syslog, time, buf + l, size - l);
> - idx = log_next(idx);
> - seq++;
> + seq = r.info->seq + 1;
> }
>
> dumper->next_seq = next_seq;
> - dumper->next_idx = next_idx;
> ret = true;
> logbuf_unlock_irqrestore(flags);
> out:
> @@ -3431,9 +3483,7 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);
> void kmsg_dump_rewind_nolock(struct kmsg_dumper *dumper)
> {
> dumper->cur_seq = clear_seq;
> - dumper->cur_idx = clear_idx;
> - dumper->next_seq = log_next_seq;
> - dumper->next_idx = log_next_idx;
> + dumper->next_seq = prb_next_seq(prb);
> }
>
> /**
>

2020-07-02 08:35:54

by Petr Mladek

Subject: Re: [PATCH v3 2/3] printk: add lockless ringbuffer

On Thu 2020-06-18 16:55:18, John Ogness wrote:
> Introduce a multi-reader multi-writer lockless ringbuffer for storing
> the kernel log messages. Readers and writers may use their API from
> any context (including scheduler and NMI). This ringbuffer will make
> it possible to decouple printk() callers from any context, locking,
> or console constraints. It also makes it possible for readers to have
> full access to the ringbuffer contents at any time and context (for
> example from any panic situation).

Please find a few comments below. They are primarily about improving
the comments. There are also 3 ideas for cleaning up the code
a bit[*].

Everything might be solved by followup patches. That said, I would
prefer that the comments be updated in the next respin, if you do
one because of the third patch.[**]

The small code changes can be done later. I do not want to introduce
potential problems at this stage.

Feel free to use:

Reviewed-by: Petr Mladek <[email protected]>


[*] I have a pretty good feeling about the code. And I really like the
extensive comments. The remaining ideas are mostly from the nits department.

[**] We are getting close to pushing this into linux-next. I think that
it makes sense only once the 3rd patch is also ready.

Also we should have a solution for the crashdump/crash tools at hand.
But I think that this is already the case.


> --- /dev/null
> +++ b/kernel/printk/printk_ringbuffer.c
> + * Data Rings
> + * ~~~~~~~~~~
> + * The two data rings (text and dictionary) function identically. They exist
> + * separately so that their buffer sizes can be individually set and they do
> + * not affect one another.
> + *
> + * Data rings are byte arrays composed of data blocks. Data blocks are
> + * referenced by blk_lpos structs that point to the logical position of the
> + * beginning of a data block and the beginning of the next adjacent data
> + * block. Logical positions are mapped directly to index values of the byte
> + * array ringbuffer.
> + *
> + * Each data block consists of an ID followed by the raw data. The ID is the
> + * identifier of a descriptor that is associated with the data block. A data
> + * block is considered valid if all of the following conditions are met:
> + *
> + * 1) The descriptor associated with the data block is in the committed
> + * or reusable queried state.

This sounds strange. The information in the descriptor is valid when
the descriptor is in the committed or reusable state. But the data
in the data block are valid only when the related descriptor is in
the committed state.

The data block might get reused faster than the descriptor that owned
it earlier.

> + * 2) The blk_lpos struct within the descriptor associated with the data
> + * block references back to the same data block.

This is true. But this cross-check is needed only when the associated
descriptor is found via the ID stored in the data block. It is not
needed when the data block is found via the id/seq number stored in
the descriptor that is in the committed state.


> + * 3) The data block is within the head/tail logical position range.

Also this is true. But it works by design. It does not need to be
checked when validating the data.

I would like to make this clear. So, I would write something like:


The raw data stored in a data block can be considered valid when the
associated descriptor is in the committed state.

The above condition is enough when the data block is found via
information from the descriptor that was chosen by its ID.

One more check is needed when the data block is found via
blk_lpos.next stored in the descriptor for the previous data block.
The data block might point to a descriptor in a committed state just
by chance. It is the right one only when blk_lpos.begin points back
to the data block.

I do not know. It might be too complicated. I actually did not mind
the original text in previous versions. But I realize now that it
might have been the reason why the checks in data_make_reusable()
were too complicated in the previous versions.



> + * Sample reader code::
> + *
> + * struct printk_info info;
> + * struct printk_record r;
> + * char text_buf[32];
> + * char dict_buf[32];
> + * u64 seq;
> + *
> + * prb_rec_init_rd(&r, &info, &text_buf[0], sizeof(text_buf),
> + * &dict_buf[0], sizeof(dict_buf));
> + *
> + * prb_for_each_record(0, &test_rb, &seq, &r) {
> + * if (info.seq != seq)
> + * pr_warn("lost %llu records\n", info.seq - seq);
> + *
> + * if (info.text_len > r.text_buf_size) {
> + * pr_warn("record %llu text truncated\n", info.seq);
> + * text_buf[sizeof(text_buf) - 1] = 0;
> + * }

AFAIK, the trailing '\0' is missing even when the text is not
truncated. We need to always add it here. Otherwise, we could not
print the text using %s below.

Also it means that it is not safe to copy the text using
memcpy(new_buf, &r.text_buf[0], r.text_len) because r.text_len
might be greater than r.text_buf_size.

I would prefer a safer API where r.text_len is set to
the length that is really stored in r.text_buf. The information
about truncation might be passed via another struct member,
e.g. r.text_truncated or so.

Well, this might be done later. The printk() code does not have
this problem because the length of the stored messages is
limited on the store side. The reader side uses buffers that
are always big enough. But I consider this error-prone and
non-intuitive. It might hit us in the long term.


> + *
> + * if (info.dict_len > r.dict_buf_size) {
> + * pr_warn("record %llu dict truncated\n", info.seq);
> + * dict_buf[sizeof(dict_buf) - 1] = 0;
> + * }
> + *
> + * pr_info("%llu: %llu: %s;%s\n", info.seq, info.ts_nsec,
> + * &text_buf[0], info.dict_len ? &dict_buf[0] : "");
> + * }
> + *
> + * Note that additional less convenient reader functions are available to
> + * allow complex record access.
> + *
> + * ABA Issues
> + * ~~~~~~~~~~
> + * To help avoid ABA issues, descriptors are referenced by IDs (array index
> + * values with tagged states)

"array index values" gives the impression that it is a direct position
in the array. We should make it clear that it keeps growing with each
wrap.

Also my understanding is that "tagged states" are the two bits.

See below for an alternative text.

> + * and data blocks are referenced by logical
> + * positions (array index values with tagged states).

Finally, lpos does not include the tagged state. Or the meaning of "tagged
state" is rather confusing.

I would replace this with something like:

To help avoid ABA issues, both descriptors and data blocks are
referenced by logical positions that keep growing with each
wrap. The position of the descriptor is stored together
with the state bits so that they can be manipulated atomically.

> + * However, on 32-bit
> + * systems the number of tagged states is relatively small such that an ABA

"tagged states"? Should it be "number of logical positions"?

> + * incident is (at least theoretically) possible. For example, if 4 million
> + * maximally sized (1KiB) printk messages were to occur in NMI context on a
> + * 32-bit system, the interrupted context would not be able to recognize that
> + * the 32-bit integer completely wrapped and thus represents a different
> + * data block than the one the interrupted context expects.
> + *
> + * To help combat this possibility, additional state checking is performed
> + * (such as using cmpxchg() even though set() would suffice). These extra
> + * checks are commented as such and will hopefully catch any ABA issue that
> + * a 32-bit system might experience.
> + *
> + * Memory Barriers
> + * ~~~~~~~~~~~~~~~
> + * Multiple memory barriers are used. To simplify proving correctness and
> + * generating litmus tests, lines of code related to memory barriers
> + * (loads, stores, and the associated memory barriers) are labeled::
> + *
> + * LMM(function:letter)
> + *
> + * Comments reference the labels using only the "function:letter" part.
> + *
> + * The memory barrier pairs and their ordering are:
> + *
> + * desc_reserve:D / desc_reserve:B
> + * push descriptor tail (id), then push descriptor head (id)
> + *
> + * desc_reserve:D / data_push_tail:B
> + * push data tail (lpos), then set new descriptor reserved (state)
> + *
> + * desc_reserve:D / desc_push_tail:C
> + * push descriptor tail (id), then set new descriptor reserved (state)
> + *
> + * desc_reserve:D / prb_first_seq:C
> + * push descriptor tail (id), then set new descriptor reserved (state)
> + *
> + * desc_reserve:F / desc_read:D
> + * set new descriptor id and reserved (state), then allow writer changes
> + *
> + * data_alloc:A / desc_read:D
> + * set old descriptor reusable (state), then modify new data block area
> + *
> + * data_alloc:A / data_push_tail:B
> + * push data tail (lpos), then modify new data block area
> + *
> + * prb_commit:B / desc_read:B
> + * store writer changes, then set new descriptor committed (state)
> + *
> + * data_push_tail:D / data_push_tail:A
> + * set descriptor reusable (state), then push data tail (lpos)
> + *
> + * desc_push_tail:B / desc_reserve:D
> + * set descriptor reusable (state), then push descriptor tail (id)

I am not sure that this list is really needed. It might get outdated
easily. It does not help me to make sure that all these barriers
are enough to keep things consistent.

That said, feel free to keep it if it helps anyone.

> + */
> +
> +#define DATA_SIZE(data_ring) _DATA_SIZE((data_ring)->size_bits)
> +#define DATA_SIZE_MASK(data_ring) (DATA_SIZE(data_ring) - 1)
> +
> +#define DESCS_COUNT(desc_ring) _DESCS_COUNT((desc_ring)->count_bits)
> +#define DESCS_COUNT_MASK(desc_ring) (DESCS_COUNT(desc_ring) - 1)
> +
> +/* Determine the data array index from a logical position. */
> +#define DATA_INDEX(data_ring, lpos) ((lpos) & DATA_SIZE_MASK(data_ring))
> +
> +/* Determine the desc array index from an ID or sequence number. */
> +#define DESC_INDEX(desc_ring, n) ((n) & DESCS_COUNT_MASK(desc_ring))
> +
> +/* Determine how many times the data array has wrapped. */
> +#define DATA_WRAPS(data_ring, lpos) ((lpos) >> (data_ring)->size_bits)
> +
> +/* Get the logical position at index 0 of the current wrap. */
> +#define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \
> +((lpos) & ~DATA_SIZE_MASK(data_ring))
> +
> +/* Get the ID for the same index of the previous wrap as the given ID. */
> +#define DESC_ID_PREV_WRAP(desc_ring, id) \
> +DESC_ID((id) - DESCS_COUNT(desc_ring))
> +
> +/* A data block: maps to the raw data within the data ring. */

"raw data" is ambiguous here. Does it mean the "data" array or the entire
structure? See below for an alternative text.

> +struct prb_data_block {
> + unsigned long id;
> + char data[0];
> +};

I think that the relation between desc.state_val, id, and the state
bits is not described anywhere. It might be confusing for someone
trying to understand the code for the first time.

Also we should make it clear that id is the logical position without
the state bits. For example, this affects the meaning of DESCS_COUNT
and DATA_SIZE.

It might really help to describe all struct members here, e.g.

/**
* struct prb_data_block: maps to the data block stored in the data ring
*
* @id: logical position in desc array (without state bits)
* @data[0]: raw data stored by the writer
*/
struct prb_data_block {
unsigned long id;
char data[0];
};

> +
> +static struct prb_desc *to_desc(struct prb_desc_ring *desc_ring, u64 n)

This might deserve a description of possible n values: id, seq, state_val.


> +{
> + return &desc_ring->descs[DESC_INDEX(desc_ring, n)];
> +}
> +


> +/*
> + * Allocate a new data block, invalidating the oldest data block(s)
> + * if necessary. This function also associates the data block with
> + * a specified descriptor.
> + */
> +static char *data_alloc(struct printk_ringbuffer *rb,
> + struct prb_data_ring *data_ring, unsigned int size,
> + struct prb_data_blk_lpos *blk_lpos, unsigned long id)
> +{
> + struct prb_data_block *blk;
> + unsigned long begin_lpos;
> + unsigned long next_lpos;
> +
> + if (!data_ring->data || size == 0) {

Is there any specific reason to check data_ring->data is not NULL,
please? It is not a big deal. I am just curious because it is
not checked in other locations.

> + /* Specify a data-less block. */
> + blk_lpos->begin = INVALID_LPOS;
> + blk_lpos->next = INVALID_LPOS;
> + return NULL;
> + }


> +/**
> + * prb_next_seq() - Get the sequence number after the last available record.
> + *
> + * @rb: The ringbuffer to get the sequence number from.
> + *
> + * This is the public function available to readers to see what the next
> + * newest sequence number available to readers will be.
> + *
> + * This provides readers a sequence number to jump to if all currently
> + * available records should be skipped.
> + *
> + * Context: Any context.
> + * Return: The sequence number of the next newest (not yet available) record
> + * for readers.
> + */
> +u64 prb_next_seq(struct printk_ringbuffer *rb)
> +{
> + u64 seq = 0;
> +
> + do {
> + /* Search forward from the oldest descriptor. */
> + if (!_prb_read_valid(rb, &seq, NULL, NULL))
> + return seq;
> + seq++;
> + } while (seq);

seq should never become zero here. It would take ages to overflow,
and the check just creates a false assumption about the logic. I would
replace this with for (;;).

> +
> + return 0;
> +}
> +


> --- /dev/null
> +++ b/kernel/printk/printk_ringbuffer.h
> @@ -0,0 +1,352 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _KERNEL_PRINTK_RINGBUFFER_H
> +#define _KERNEL_PRINTK_RINGBUFFER_H
> +
> +#include <linux/atomic.h>
> +

It might be worth adding some information. And maybe also defining who
is supposed to manipulate the struct members.

/*
* Meta information about each stored message.
*
* They are stored and used by the printk code. The only exceptions are
* seq, text_len, and dict_len, which are also handled in the ringbuffer
* code.
*/

> +struct printk_info {
> + u64 seq; /* sequence number */
> + u64 ts_nsec; /* timestamp in nanoseconds */
> + u16 text_len; /* length of text message */
> + u16 dict_len; /* length of dictionary message */
> + u8 facility; /* syslog facility */
> + u8 flags:5; /* internal record flags */
> + u8 level:3; /* syslog level */
> + u32 caller_id; /* thread id or processor id */
> +};



> +/* Specifies the position/span of a data block. */

s/position/logical position/

> +struct prb_data_blk_lpos {
> + unsigned long begin;
> + unsigned long next;
> +};
> +
> +/* A descriptor: the complete meta-data for a record. */

state_var might deserve some description, e.g.

* @state_var combines the logical position of the descriptor
* in the array ringbuffer (id) with two state bits.

> +struct prb_desc {
> + struct printk_info info;
> + atomic_long_t state_var;
> + struct prb_data_blk_lpos text_blk_lpos;
> + struct prb_data_blk_lpos dict_blk_lpos;
> +};
> +
> +/* A ringbuffer of "ID + data" elements. */
> +struct prb_data_ring {
> + unsigned int size_bits;
> + char *data;
> + atomic_long_t head_lpos;
> + atomic_long_t tail_lpos;
> +};
> +
> +/* A ringbuffer of "struct prb_desc" elements. */
> +struct prb_desc_ring {
> + unsigned int count_bits;
> + struct prb_desc *descs;
> + atomic_long_t head_id;
> + atomic_long_t tail_id;
> +};
> +
> +/* The high level structure representing the printk ringbuffer. */

@fail might deserve some comment, e.g.

* @fail: automatically counted number of failed prb_reserve() calls.

> +struct printk_ringbuffer {
> + struct prb_desc_ring desc_ring;
> + struct prb_data_ring text_data_ring;
> + struct prb_data_ring dict_data_ring;
> + atomic_long_t fail;
> +};
> +
> +/* Used by writers as a reserve/commit handle. */

We should emphasize that it should be manipulated only by
the ringbuffer code. Also @id and @text_space might deserve
some explanation. I think that it is better to use the full
description here:

/**
* struct prb_reserved_entry - reserve/commit handle used
* by writers and readers.
*
* @rb: ring buffer where the entry has been reserved
* @irqflags: saved irq flags to be used when the entry gets committed
* @id: logical position of the reserved descriptor
* @text_space: Really used space in the text data ring, including
* the stored "id" and alignment
*
* It must be manipulated only by the ring buffer code.
*/

> +struct prb_reserved_entry {
> + struct printk_ringbuffer *rb;
> + unsigned long irqflags;
> + unsigned long id;
> + unsigned int text_space;
> +};


> +/**
> + * prb_for_each_record() - Iterate over a ringbuffer.
> + *
> + * @from: The sequence number to begin with.
> + * @rb: The ringbuffer to iterate over.
> + * @s: A u64 to store the sequence number on each iteration.
> + * @r: A printk_record to store the record on each iteration.
> + *
> + * This is a macro for conveniently iterating over a ringbuffer.

It might be worth emphasizing that @s is an internal value. The reader
should check r->info->seq to see the actually read sequence number.

IMHO, it is not clearly visible from the code. At least, I had
to check prb_read_valid() to realize this.

> + *
> + * Context: Any context.
> + */
> +#define prb_for_each_record(from, rb, s, r) \
> +for ((s) = from; prb_read_valid(rb, s, r); (s) = (r)->info->seq + 1)

That's it. Uff, it is a huge relief that there was nothing
memory-barrier related ;-)

Best Regards,
Petr

2020-07-02 09:02:54

by John Ogness

Subject: Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

On 2020-07-02, lijiang <[email protected]> wrote:
> About the VMCOREINFO part, I made some tests based on the kernel patch
> v3, the makedumpfile and crash-utility can work as expected with your
> patch(userspace patch), but, unfortunately, the vmcore-dmesg(kexec-tools)
> can't correctly read the printk ring buffer information, and get the
> following error:
>
> "Missing the log_buf symbol"
>
> The kexec-tools(vmcore-dmesg) should also have a similar patch, just like
> in the makedumpfile and crash-utility.

Yes, a patch for this is needed (as well as for any other related
software floating around the internet).

I have no RFC patches for vmcore-dmesg. Looking at the code, I think it
would be quite straightforward to port the makedumpfile patch. I will
try to make some time for this.

I do not want to patch any other software for this. I think with 3
examples (crash, makedumpfile, vmcore-dmesg), others should be able to
implement the changes to their software without needing my help.

John Ogness

2020-07-02 09:44:22

by Lianbo Jiang

Subject: Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

On 2020-07-02 17:02, John Ogness wrote:
> On 2020-07-02, lijiang <[email protected]> wrote:
>> About the VMCOREINFO part, I made some tests based on the kernel patch
>> v3, the makedumpfile and crash-utility can work as expected with your
>> patch(userspace patch), but, unfortunately, the vmcore-dmesg(kexec-tools)
>> can't correctly read the printk ring buffer information, and get the
>> following error:
>>
>> "Missing the log_buf symbol"
>>
>> The kexec-tools(vmcore-dmesg) should also have a similar patch, just like
>> in the makedumpfile and crash-utility.
>
> Yes, a patch for this is needed (as well as for any other related
> software floating around the internet).
>
> I have no RFC patches for vmcore-dmesg. Looking at the code, I think it
> would be quite straight forward to port the makedumpfile patch. I will

Yes, it should be a similar patch.

> try to make some time for this.
>
That would be nice. Thank you, John Ogness.

> I do not want to patch any other software for this. I think with 3
> examples (crash, makedumpfile, vmcore-dmesg), others should be able to

It's good enough to have the patch for makedumpfile, crash and vmcore-dmesg,
which can ensure that kdump (userspace) works well.

Thanks.
Lianbo

> implement the changes to their software without needing my help.
>
> John Ogness
>

2020-07-02 13:34:55

by Petr Mladek

Subject: Re: buffer allocation: was: [PATCH v3 3/3] printk: use the lockless ringbuffer

On Mon 2020-06-29 23:57:59, John Ogness wrote:
> On 2020-06-29, Petr Mladek <[email protected]> wrote:
> >> @@ @@ void __init setup_log_buf(int early)
> >> + prb_init(&printk_rb_dynamic,
> >> + new_log_buf, order_base_2(new_log_buf_len),
> >> + new_dict_buf, order_base_2(new_log_buf_len),
> >> + new_descs, order_base_2(new_descs_count));
> >
> > order_base_2() is safe. But the result might be that some allocated
> > space is not used.
> >
> > I would prefer to make sure that new_log_buf_len is rounded, e.g.
> > by roundup_pow_of_two(), at the beginning of the function. Then we
> > could use ilog2() here.
>
> new_log_buf_len can only be set within log_buf_len_update(), and it
> is already doing exactly what you want:
>
> 	if (size)
> 		size = roundup_pow_of_two(size);
> 	if (size > log_buf_len)
> 		new_log_buf_len = (unsigned long)size;

I know. I thought about repeating this in setup_log_buf() to make it
error-proof. But it is not really needed. I do not insist on it.

> I can switch to ilog2() instead of the more conservative order_base_2().

Yes, please change it to ilog2().

Best Regards,
Petr

2020-07-02 13:36:11

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

On Thu 2020-07-02 17:43:22, lijiang wrote:
> On 2020-07-02 17:02, John Ogness wrote:
> > On 2020-07-02, lijiang <[email protected]> wrote:
> >> About the VMCOREINFO part, I made some tests based on the kernel patch
> >> v3, the makedumpfile and crash-utility can work as expected with your
> >> patch(userspace patch), but, unfortunately, the vmcore-dmesg(kexec-tools)
> >> can't correctly read the printk ring buffer information, and get the
> >> following error:
> >>
> >> "Missing the log_buf symbol"
> >>
> >> The kexec-tools(vmcore-dmesg) should also have a similar patch, just like
> >> in the makedumpfile and crash-utility.
> >
> > Yes, a patch for this is needed (as well as for any other related
> > software floating around the internet).
> >
> > I have no RFC patches for vmcore-dmesg. Looking at the code, I think it
> > would be quite straightforward to port the makedumpfile patch. I will
>
> Yes, it should be a similar patch.
>
> > try to make some time for this.
> >
> That would be nice. Thank you, John Ogness.
>
> > I do not want to patch any other software for this. I think with 3
> > examples (crash, makedumpfile, vmcore-dmesg), others should be able to
>
> It's good enough to have patches for makedumpfile, crash and vmcore-dmesg,
> which can ensure that kdump (userspace) works well.

I agree that these three are the most important ones and should be
enough.

Thanks a lot for working on it and testing it.

Best Regards,
Petr

2020-07-03 11:58:18

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

On 2020-07-02, lijiang <[email protected]> wrote:
> About the VMCOREINFO part, I made some tests based on the kernel patch
> v3, the makedumpfile and crash-utility can work as expected with your
> patch(userspace patch), but, unfortunately, the
> vmcore-dmesg(kexec-tools) can't correctly read the printk ring buffer
> information, and get the following error:
>
> "Missing the log_buf symbol"
>
> The kexec-tools(vmcore-dmesg) should also have a similar patch, just like
> in the makedumpfile and crash-utility.

A patched kexec-tools is available here [0].

I did not test using 32-bit dumps on 64-bit machines and vice versa. But
it should work.

John Ogness

[0] https://github.com/Linutronix/kexec-tools.git (printk branch)

2020-07-04 01:13:38

by Lianbo Jiang

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

On 2020-07-02 21:31, Petr Mladek wrote:
> On Thu 2020-07-02 17:43:22, lijiang wrote:
>> On 2020-07-02 17:02, John Ogness wrote:
>>> On 2020-07-02, lijiang <[email protected]> wrote:
>>>> About the VMCOREINFO part, I made some tests based on the kernel patch
>>>> v3, the makedumpfile and crash-utility can work as expected with your
>>>> patch(userspace patch), but, unfortunately, the vmcore-dmesg(kexec-tools)
>>>> can't correctly read the printk ring buffer information, and get the
>>>> following error:
>>>>
>>>> "Missing the log_buf symbol"
>>>>
>>>> The kexec-tools(vmcore-dmesg) should also have a similar patch, just like
>>>> in the makedumpfile and crash-utility.
>>>
>>> Yes, a patch for this is needed (as well as for any other related
>>> software floating around the internet).
>>>
>>> I have no RFC patches for vmcore-dmesg. Looking at the code, I think it
>>> would be quite straightforward to port the makedumpfile patch. I will
>>
>> Yes, it should be a similar patch.
>>
>>> try to make some time for this.
>>>
>> That would be nice. Thank you, John Ogness.
>>
>>> I do not want to patch any other software for this. I think with 3
>>> examples (crash, makedumpfile, vmcore-dmesg), others should be able to
>>
>> It's good enough to have patches for makedumpfile, crash and vmcore-dmesg,
>> which can ensure that kdump (userspace) works well.
>
> I agree that these three are the most important ones and should be
> enough.
>
> Thanks a lot for working on it and testing it.
>
My pleasure. I will test the vmcore-dmesg later.

Thanks.
Lianbo

> Best Regards,
> Petr
>

2020-07-04 09:32:59

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] crash: add VMCOREINFO macro to define offset in a struct declared by typedef

On 06/18/20 at 04:55pm, John Ogness wrote:
> The existing macro VMCOREINFO_OFFSET() can't be used for structures
> declared via typedef because "struct" is not part of the type definition.
>
> Create another macro for this purpose.
>
> Signed-off-by: John Ogness <[email protected]>
> ---
> include/linux/crash_core.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index 525510a9f965..43b51c9df571 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -53,6 +53,9 @@ phys_addr_t paddr_vmcoreinfo_note(void);
> #define VMCOREINFO_OFFSET(name, field) \
> 	vmcoreinfo_append_str("OFFSET(%s.%s)=%lu\n", #name, #field, \
> 			      (unsigned long)offsetof(struct name, field))
> +#define VMCOREINFO_TYPE_OFFSET(name, field) \
> +	vmcoreinfo_append_str("OFFSET(%s.%s)=%lu\n", #name, #field, \
> +			      (unsigned long)offsetof(name, field))

Acked-by: Baoquan He <[email protected]>

> #define VMCOREINFO_LENGTH(name, value) \
> 	vmcoreinfo_append_str("LENGTH(%s)=%lu\n", #name, (unsigned long)value)
> #define VMCOREINFO_NUMBER(name) \
> --
> 2.20.1
>
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec
>

2020-07-08 05:54:27

by Lianbo Jiang

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] printk: use the lockless ringbuffer

On 2020-07-03 19:54, John Ogness wrote:
> On 2020-07-02, lijiang <[email protected]> wrote:
>> About the VMCOREINFO part, I made some tests based on the kernel patch
>> v3, the makedumpfile and crash-utility can work as expected with your
>> patch(userspace patch), but, unfortunately, the
>> vmcore-dmesg(kexec-tools) can't correctly read the printk ring buffer
>> information, and get the following error:
>>
>> "Missing the log_buf symbol"
>>
>> The kexec-tools(vmcore-dmesg) should also have a similar patch, just like
>> in the makedumpfile and crash-utility.
>
> A patched kexec-tools is available here [0].
>
> I did not test using 32-bit dumps on 64-bit machines and vice versa. But
> it should work.
>
> John Ogness
>
> [0] https://github.com/Linutronix/kexec-tools.git (printk branch)
>

After applying this patch, vmcore-dmesg works as expected.

Thank you, John Ogness.