2012-05-03 00:29:41

by Kay Sievers

[permalink] [raw]
Subject: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

From: Kay Sievers <[email protected]>
Subject: printk: convert byte-buffer to variable-length record buffer

- Record-based stream instead of the traditional byte stream
buffer. All records carry a 64 bit timestamp, the syslog facility
and priority in the record header.

- Records consume almost the same amount, sometimes less memory than
the traditional byte stream buffer (if printk_time is enabled). The record
header is 16 bytes long, plus some padding bytes at the end if needed.
The byte-stream buffer needed 3 chars for the syslog prefix, 15 char for
the timestamp and a newline.

- Buffer management is based on message sequence numbers. When records
need to be discarded, the reading heads move on to the next full
record. Unlike the byte-stream buffer, no old logged lines get
truncated or partly overwritten by new ones. Sequence numbers also
allow consumers of the log stream to get notified if any message in
the stream they are about to read gets discarded during the time
of reading.

- Better buffered IO support for KERN_CONT continuation lines, when printk()
is called multiple times for a single line. The use of KERN_CONT is now
mandatory to use continuation; a few places in the kernel need trivial fixes
here. The buffering could possibly be extended to per-cpu variables to allow
better thread-safety for multiple printk() invocations for a single line.

- Full-featured syslog facility value support. Different facilities
can tag their messages. All userspace-injected messages enforce a
facility value > 0 now, to be able to reliably distinguish them from
the kernel-generated messages. Independent subsystems like a
baseband processor running its own firmware, or a kernel-related
userspace process can use their own unique facility values. Multiple
independent log streams can co-exist that way in the same
buffer. All share the same global sequence number counter to ensure
proper ordering (and interleaving) and to allow the consumers of the
log to reliably correlate the events from different facilities.

Tested-by: William Douglas <[email protected]>
Signed-off-by: Kay Sievers <[email protected]>
---
drivers/char/mem.c | 55 +-
include/linux/printk.h | 11
kernel/printk.c | 1020 ++++++++++++++++++++++++++++---------------------
3 files changed, 642 insertions(+), 444 deletions(-)

--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -810,33 +810,54 @@ static const struct file_operations oldm
static ssize_t kmsg_writev(struct kiocb *iocb, const struct iovec *iv,
unsigned long count, loff_t pos)
{
- char *line, *p;
+ char *buf, *line;
int i;
- ssize_t ret = -EFAULT;
+ int level = default_message_loglevel;
+ int facility = 1; /* LOG_USER */
size_t len = iov_length(iv, count);
+ ssize_t ret = len;

- line = kmalloc(len + 1, GFP_KERNEL);
- if (line == NULL)
+ if (len > 1024)
+ return -EINVAL;
+ buf = kmalloc(len+1, GFP_KERNEL);
+ if (buf == NULL)
return -ENOMEM;

- /*
- * copy all vectors into a single string, to ensure we do
- * not interleave our log line with other printk calls
- */
- p = line;
+ line = buf;
for (i = 0; i < count; i++) {
- if (copy_from_user(p, iv[i].iov_base, iv[i].iov_len))
+ if (copy_from_user(line, iv[i].iov_base, iv[i].iov_len))
goto out;
- p += iv[i].iov_len;
+ line += iv[i].iov_len;
+ }
+
+ /*
+ * Extract and skip the syslog prefix <[0-9]*>. Coming from userspace
+ * the decimal value represents 32bit, the lower 3 bit are the log
+ * level, the rest are the log facility.
+ *
+ * If no prefix or no userspace facility is specified, we
+ * enforce LOG_USER, to be able to reliably distinguish
+ * kernel-generated messages from userspace-injected ones.
+ */
+ line = buf;
+ if (line[0] == '<') {
+ char *endp = NULL;
+
+ i = simple_strtoul(line+1, &endp, 10);
+ if (endp && endp[0] == '>') {
+ level = i & 7;
+ if (i >> 3)
+ facility = i >> 3;
+ endp++;
+ len -= endp - line;
+ line = endp;
+ }
}
- p[0] = '\0';
+ line[len] = '\0';

- ret = printk("%s", line);
- /* printk can add a prefix */
- if (ret > len)
- ret = len;
+ printk_emit(facility, level, NULL, 0, "%s", line);
out:
- kfree(line);
+ kfree(buf);
return ret;
}

--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -95,8 +95,19 @@ extern int printk_needs_cpu(int cpu);
extern void printk_tick(void);

#ifdef CONFIG_PRINTK
+asmlinkage __printf(5, 0)
+int vprintk_emit(int facility, int level,
+ const char *dict, size_t dictlen,
+ const char *fmt, va_list args);
+
asmlinkage __printf(1, 0)
int vprintk(const char *fmt, va_list args);
+
+asmlinkage __printf(5, 6) __cold
+asmlinkage int printk_emit(int facility, int level,
+ const char *dict, size_t dictlen,
+ const char *fmt, ...);
+
asmlinkage __printf(1, 2) __cold
int printk(const char *fmt, ...);

--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -54,8 +54,6 @@ void asmlinkage __attribute__((weak)) ea
{
}

-#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
-
/* printk's without a loglevel use this.. */
#define DEFAULT_MESSAGE_LOGLEVEL CONFIG_DEFAULT_MESSAGE_LOGLEVEL

@@ -99,24 +97,6 @@ EXPORT_SYMBOL_GPL(console_drivers);
static int console_locked, console_suspended;

/*
- * logbuf_lock protects log_buf, log_start, log_end, con_start and logged_chars
- * It is also used in interesting ways to provide interlocking in
- * console_unlock();.
- */
-static DEFINE_RAW_SPINLOCK(logbuf_lock);
-
-#define LOG_BUF_MASK (log_buf_len-1)
-#define LOG_BUF(idx) (log_buf[(idx) & LOG_BUF_MASK])
-
-/*
- * The indices into log_buf are not constrained to log_buf_len - they
- * must be masked before subscripting
- */
-static unsigned log_start; /* Index into log_buf: next char to be read by syslog() */
-static unsigned con_start; /* Index into log_buf: next char to be sent to consoles */
-static unsigned log_end; /* Index into log_buf: most-recently-written-char + 1 */
-
-/*
* If exclusive_console is non-NULL then only this console is to be printed to.
*/
static struct console *exclusive_console;
@@ -146,12 +126,176 @@ EXPORT_SYMBOL(console_set_on_cmdline);
static int console_may_schedule;

#ifdef CONFIG_PRINTK
+/*
+ * The printk log buffer consists of a chain of concatenated variable
+ * length records. Every record starts with a record header, containing
+ * the overall length of the record.
+ *
+ * The heads to the first and last entry in the buffer, as well as the
+ * sequence numbers of these both entries are maintained when messages
+ * are stored..
+ *
+ * If the heads indicate available messages, the length in the header
+ * tells the start next message. A length == 0 for the next message
+ * indicates a wrap-around to the beginning of the buffer.
+ *
+ * Every record carries the monotonic timestamp in microseconds, as well as
+ * the standard userspace syslog level and syslog facility. The usual
+ * kernel messages use LOG_KERN; userspace-injected messages always carry
+ * a matching syslog facility, by default LOG_USER. The origin of every
+ * message can be reliably determined that way.
+ *
+ * The human readable log message directly follows the message header. The
+ * length of the message text is stored in the header, the stored message
+ * is not terminated.
+ *
+ */
+
+struct log {
+ u64 ts_nsec; /* timestamp in nanoseconds */
+ u16 len; /* length of entire record */
+ u16 text_len; /* length of text buffer */
+ u16 dict_len; /* length of dictionary buffer */
+ u16 level; /* syslog level + facility */
+};
+
+/*
+ * The logbuf_lock protects kmsg buffer, indices, counters. It is also
+ * used in interesting ways to provide interlocking in console_unlock();
+ */
+static DEFINE_RAW_SPINLOCK(logbuf_lock);
+
+/* cpu currently holding logbuf_lock */
+static volatile unsigned int logbuf_cpu = UINT_MAX;
+
+#define LOG_LINE_MAX 1024

+/* record buffer */
+#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
static char __log_buf[__LOG_BUF_LEN];
static char *log_buf = __log_buf;
-static int log_buf_len = __LOG_BUF_LEN;
-static unsigned logged_chars; /* Number of chars produced since last read+clear operation */
-static int saved_console_loglevel = -1;
+static u32 log_buf_len = __LOG_BUF_LEN;
+
+/* index and sequence number of the first record stored in the buffer */
+static u64 log_first_seq;
+static u32 log_first_idx;
+
+/* index and sequence number of the next record to store in the buffer */
+static u64 log_next_seq;
+static u32 log_next_idx;
+
+/* the next printk record to read after the last 'clear' command */
+static u64 clear_seq;
+static u32 clear_idx;
+
+/* the next printk record to read by syslog(READ) or /proc/kmsg */
+static u64 syslog_seq;
+static u32 syslog_idx;
+
+/* human readable text of the record */
+static char *log_text(const struct log *msg)
+{
+ return (char *)msg + sizeof(struct log);
+}
+
+/* optional key/value pair dictionary attached to the record */
+static char *log_dict(const struct log *msg)
+{
+ return (char *)msg + sizeof(struct log) + msg->text_len;
+}
+
+/* get record by index; idx must point to valid msg */
+static struct log *log_from_idx(u32 idx)
+{
+ struct log *msg = (struct log *)(log_buf + idx);
+
+ /*
+ * A length == 0 record is the end of buffer marker. Wrap around and
+ * read the message at the start of the buffer.
+ */
+ if (!msg->len)
+ return (struct log *)log_buf;
+ return msg;
+}
+
+/* get next record; idx must point to valid msg */
+static u32 log_next(u32 idx)
+{
+ struct log *msg = (struct log *)(log_buf + idx);
+
+ /* length == 0 indicates the end of the buffer; wrap */
+ /*
+ * A length == 0 record is the end of buffer marker. Wrap around and
+ * read the message at the start of the buffer as *this* one, and
+ * return the one after that.
+ */
+ if (!msg->len) {
+ msg = (struct log *)log_buf;
+ return msg->len;
+ }
+ return idx + msg->len;
+}
+
+#if !defined(CONFIG_64BIT) || defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
+#define LOG_ALIGN 4
+#else
+#define LOG_ALIGN 8
+#endif
+
+/* insert record into the buffer, discard old ones, update heads */
+static void log_store(int facility, int level,
+ const char *dict, u16 dict_len,
+ const char *text, u16 text_len)
+{
+ struct log *msg;
+ u32 size, pad_len;
+
+ /* number of '\0' padding bytes to next message */
+ size = sizeof(struct log) + text_len + dict_len;
+ pad_len = (-size) & (LOG_ALIGN - 1);
+ size += pad_len;
+
+ while (log_first_seq < log_next_seq) {
+ u32 free;
+
+ if (log_next_idx > log_first_idx)
+ free = max(log_buf_len - log_next_idx, log_first_idx);
+ else
+ free = log_first_idx - log_next_idx;
+
+ if (free > size + sizeof(struct log))
+ break;
+
+ /* drop old messages until we have enough contiuous space */
+ log_first_idx = log_next(log_first_idx);
+ log_first_seq++;
+ }
+
+ if (log_next_idx + size + sizeof(struct log) >= log_buf_len) {
+ /*
+ * This message + an additional empty header does not fit
+ * at the end of the buffer. Add an empty header with len == 0
+ * to signify a wrap around.
+ */
+ memset(log_buf + log_next_idx, 0, sizeof(struct log));
+ log_next_idx = 0;
+ }
+
+ /* fill message */
+ msg = (struct log *)(log_buf + log_next_idx);
+ memcpy(log_text(msg), text, text_len);
+ msg->text_len = text_len;
+ memcpy(log_dict(msg), dict, dict_len);
+ msg->dict_len = dict_len;
+ msg->level = (facility << 3) | (level & 7);
+ msg->ts_nsec = local_clock();
+ memset(log_dict(msg) + dict_len, 0, pad_len);
+ msg->len = sizeof(struct log) + text_len + dict_len + pad_len;
+
+ /* insert message */
+ log_next_idx += msg->len;
+ log_next_seq++;
+}

#ifdef CONFIG_KEXEC
/*
@@ -165,9 +309,9 @@ static int saved_console_loglevel = -1;
void log_buf_kexec_setup(void)
{
VMCOREINFO_SYMBOL(log_buf);
- VMCOREINFO_SYMBOL(log_end);
VMCOREINFO_SYMBOL(log_buf_len);
- VMCOREINFO_SYMBOL(logged_chars);
+ VMCOREINFO_SYMBOL(log_first_idx);
+ VMCOREINFO_SYMBOL(log_next_idx);
}
#endif

@@ -191,7 +335,6 @@ early_param("log_buf_len", log_buf_len_s
void __init setup_log_buf(int early)
{
unsigned long flags;
- unsigned start, dest_idx, offset;
char *new_log_buf;
int free;

@@ -219,20 +362,8 @@ void __init setup_log_buf(int early)
log_buf_len = new_log_buf_len;
log_buf = new_log_buf;
new_log_buf_len = 0;
- free = __LOG_BUF_LEN - log_end;
-
- offset = start = min(con_start, log_start);
- dest_idx = 0;
- while (start != log_end) {
- unsigned log_idx_mask = start & (__LOG_BUF_LEN - 1);
-
- log_buf[dest_idx] = __log_buf[log_idx_mask];
- start++;
- dest_idx++;
- }
- log_start -= offset;
- con_start -= offset;
- log_end -= offset;
+ free = __LOG_BUF_LEN - log_next_idx;
+ memcpy(log_buf, __log_buf, __LOG_BUF_LEN);
raw_spin_unlock_irqrestore(&logbuf_lock, flags);

pr_info("log_buf_len: %d\n", log_buf_len);
@@ -332,11 +463,165 @@ static int check_syslog_permissions(int
return 0;
}

+#if defined(CONFIG_PRINTK_TIME)
+static bool printk_time = 1;
+#else
+static bool printk_time;
+#endif
+module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR);
+
+static int syslog_print_line(u32 idx, char *text, size_t size)
+{
+ struct log *msg;
+ size_t len;
+
+ msg = log_from_idx(idx);
+ if (!text) {
+ /* calculate length only */
+ len = 3;
+
+ if (msg->level > 9)
+ len++;
+ if (msg->level > 99)
+ len++;
+
+ if (printk_time)
+ len += 15;
+
+ len += msg->text_len;
+ len++;
+ return len;
+ }
+
+ len = sprintf(text, "<%u>", msg->level);
+
+ if (printk_time) {
+ unsigned long long t = msg->ts_nsec;
+ unsigned long rem_ns = do_div(t, 1000000000);
+
+ len += sprintf(text + len, "[%5lu.%06lu] ",
+ (unsigned long) t, rem_ns / 1000);
+ }
+
+ if (len + msg->text_len > size)
+ return -EINVAL;
+ memcpy(text + len, log_text(msg), msg->text_len);
+ len += msg->text_len;
+ text[len++] = '\n';
+ return len;
+}
+
+static int syslog_print(char __user *buf, int size)
+{
+ char *text;
+ int len;
+
+ text = kmalloc(LOG_LINE_MAX, GFP_KERNEL);
+ if (!text)
+ return -ENOMEM;
+
+ raw_spin_lock_irq(&logbuf_lock);
+ if (syslog_seq < log_first_seq) {
+ /* messages are gone, move to first one */
+ syslog_seq = log_first_seq;
+ syslog_idx = log_first_idx;
+ }
+ len = syslog_print_line(syslog_idx, text, LOG_LINE_MAX);
+ syslog_idx = log_next(syslog_idx);
+ syslog_seq++;
+ raw_spin_unlock_irq(&logbuf_lock);
+
+ if (len > 0 && copy_to_user(buf, text, len))
+ len = -EFAULT;
+
+ kfree(text);
+ return len;
+}
+
+static int syslog_print_all(char __user *buf, int size, bool clear)
+{
+ char *text;
+ int len = 0;
+
+ text = kmalloc(LOG_LINE_MAX, GFP_KERNEL);
+ if (!text)
+ return -ENOMEM;
+
+ raw_spin_lock_irq(&logbuf_lock);
+ if (buf) {
+ u64 next_seq;
+ u64 seq;
+ u32 idx;
+
+ if (clear_seq < log_first_seq) {
+ /* messages are gone, move to first available one */
+ clear_seq = log_first_seq;
+ clear_idx = log_first_idx;
+ }
+
+ /*
+ * Find first record that fits, including all following records,
+ * into the user-provided buffer for this dump.
+ */
+ seq = clear_seq;
+ idx = clear_idx;
+ while (seq < log_next_seq) {
+ len += syslog_print_line(idx, NULL, 0);
+ idx = log_next(idx);
+ seq++;
+ }
+ seq = clear_seq;
+ idx = clear_idx;
+ while (len > size && seq < log_next_seq) {
+ len -= syslog_print_line(idx, NULL, 0);
+ idx = log_next(idx);
+ seq++;
+ }
+
+ /* last message in this dump */
+ next_seq = log_next_seq;
+
+ len = 0;
+ while (len >= 0 && seq < next_seq) {
+ int textlen;
+
+ textlen = syslog_print_line(idx, text, LOG_LINE_MAX);
+ if (textlen < 0) {
+ len = textlen;
+ break;
+ }
+ idx = log_next(idx);
+ seq++;
+
+ raw_spin_unlock_irq(&logbuf_lock);
+ if (copy_to_user(buf + len, text, textlen))
+ len = -EFAULT;
+ else
+ len += textlen;
+ raw_spin_lock_irq(&logbuf_lock);
+
+ if (seq < log_first_seq) {
+ /* messages are gone, move to next one */
+ seq = log_first_seq;
+ idx = log_first_idx;
+ }
+ }
+ }
+
+ if (clear) {
+ clear_seq = log_next_seq;
+ clear_idx = log_next_idx;
+ }
+ raw_spin_unlock_irq(&logbuf_lock);
+
+ kfree(text);
+ return len;
+}
+
int do_syslog(int type, char __user *buf, int len, bool from_file)
{
- unsigned i, j, limit, count;
- int do_clear = 0;
- char c;
+ bool clear = false;
+ static int saved_console_loglevel = -1;
int error;

error = check_syslog_permissions(type, from_file);
@@ -364,28 +649,14 @@ int do_syslog(int type, char __user *buf
goto out;
}
error = wait_event_interruptible(log_wait,
- (log_start - log_end));
+ syslog_seq != log_next_seq);
if (error)
goto out;
- i = 0;
- raw_spin_lock_irq(&logbuf_lock);
- while (!error && (log_start != log_end) && i < len) {
- c = LOG_BUF(log_start);
- log_start++;
- raw_spin_unlock_irq(&logbuf_lock);
- error = __put_user(c,buf);
- buf++;
- i++;
- cond_resched();
- raw_spin_lock_irq(&logbuf_lock);
- }
- raw_spin_unlock_irq(&logbuf_lock);
- if (!error)
- error = i;
+ error = syslog_print(buf, len);
break;
/* Read/clear last kernel messages */
case SYSLOG_ACTION_READ_CLEAR:
- do_clear = 1;
+ clear = true;
/* FALL THRU */
/* Read last kernel messages */
case SYSLOG_ACTION_READ_ALL:
@@ -399,52 +670,11 @@ int do_syslog(int type, char __user *buf
error = -EFAULT;
goto out;
}
- count = len;
- if (count > log_buf_len)
- count = log_buf_len;
- raw_spin_lock_irq(&logbuf_lock);
- if (count > logged_chars)
- count = logged_chars;
- if (do_clear)
- logged_chars = 0;
- limit = log_end;
- /*
- * __put_user() could sleep, and while we sleep
- * printk() could overwrite the messages
- * we try to copy to user space. Therefore
- * the messages are copied in reverse. <manfreds>
- */
- for (i = 0; i < count && !error; i++) {
- j = limit-1-i;
- if (j + log_buf_len < log_end)
- break;
- c = LOG_BUF(j);
- raw_spin_unlock_irq(&logbuf_lock);
- error = __put_user(c,&buf[count-1-i]);
- cond_resched();
- raw_spin_lock_irq(&logbuf_lock);
- }
- raw_spin_unlock_irq(&logbuf_lock);
- if (error)
- break;
- error = i;
- if (i != count) {
- int offset = count-error;
- /* buffer overflow during copy, correct user buffer. */
- for (i = 0; i < error; i++) {
- if (__get_user(c,&buf[i+offset]) ||
- __put_user(c,&buf[i])) {
- error = -EFAULT;
- break;
- }
- cond_resched();
- }
- }
+ error = syslog_print_all(buf, len, clear);
break;
/* Clear ring buffer */
case SYSLOG_ACTION_CLEAR:
- logged_chars = 0;
- break;
+ syslog_print_all(NULL, 0, true);
/* Disable logging to console */
case SYSLOG_ACTION_CONSOLE_OFF:
if (saved_console_loglevel == -1)
@@ -472,7 +702,33 @@ int do_syslog(int type, char __user *buf
break;
/* Number of chars in the log buffer */
case SYSLOG_ACTION_SIZE_UNREAD:
- error = log_end - log_start;
+ raw_spin_lock_irq(&logbuf_lock);
+ if (syslog_seq < log_first_seq) {
+ /* messages are gone, move to first one */
+ syslog_seq = log_first_seq;
+ syslog_idx = log_first_idx;
+ }
+ if (from_file) {
+ /*
+ * Short-cut for poll(/"proc/kmsg") which simply checks
+ * for pending data, not the size; return the count of
+ * records, not the length.
+ */
+ error = log_next_idx - syslog_idx;
+ } else {
+ u64 seq;
+ u32 idx;
+
+ error = 0;
+ seq = syslog_seq;
+ idx = syslog_idx;
+ while (seq < log_next_seq) {
+ error += syslog_print_line(idx, NULL, 0);
+ idx = log_next(idx);
+ seq++;
+ }
+ }
+ raw_spin_unlock_irq(&logbuf_lock);
break;
/* Size of the log buffer */
case SYSLOG_ACTION_SIZE_BUFFER:
@@ -501,29 +757,11 @@ void kdb_syslog_data(char *syslog_data[4
{
syslog_data[0] = log_buf;
syslog_data[1] = log_buf + log_buf_len;
- syslog_data[2] = log_buf + log_end -
- (logged_chars < log_buf_len ? logged_chars : log_buf_len);
- syslog_data[3] = log_buf + log_end;
+ syslog_data[2] = log_buf + log_first_idx;
+ syslog_data[3] = log_buf + log_next_idx;
}
#endif /* CONFIG_KGDB_KDB */

-/*
- * Call the console drivers on a range of log_buf
- */
-static void __call_console_drivers(unsigned start, unsigned end)
-{
- struct console *con;
-
- for_each_console(con) {
- if (exclusive_console && con != exclusive_console)
- continue;
- if ((con->flags & CON_ENABLED) && con->write &&
- (cpu_online(smp_processor_id()) ||
- (con->flags & CON_ANYTIME)))
- con->write(con, &LOG_BUF(start), end - start);
- }
-}
-
static bool __read_mostly ignore_loglevel;

static int __init ignore_loglevel_setup(char *str)
@@ -540,142 +778,33 @@ MODULE_PARM_DESC(ignore_loglevel, "ignor
"print all kernel messages to the console.");

/*
- * Write out chars from start to end - 1 inclusive
- */
-static void _call_console_drivers(unsigned start,
- unsigned end, int msg_log_level)
-{
- trace_console(&LOG_BUF(0), start, end, log_buf_len);
-
- if ((msg_log_level < console_loglevel || ignore_loglevel) &&
- console_drivers && start != end) {
- if ((start & LOG_BUF_MASK) > (end & LOG_BUF_MASK)) {
- /* wrapped write */
- __call_console_drivers(start & LOG_BUF_MASK,
- log_buf_len);
- __call_console_drivers(0, end & LOG_BUF_MASK);
- } else {
- __call_console_drivers(start, end);
- }
- }
-}
-
-/*
- * Parse the syslog header <[0-9]*>. The decimal value represents 32bit, the
- * lower 3 bit are the log level, the rest are the log facility. In case
- * userspace passes usual userspace syslog messages to /dev/kmsg or
- * /dev/ttyprintk, the log prefix might contain the facility. Printk needs
- * to extract the correct log level for in-kernel processing, and not mangle
- * the original value.
- *
- * If a prefix is found, the length of the prefix is returned. If 'level' is
- * passed, it will be filled in with the log level without a possible facility
- * value. If 'special' is passed, the special printk prefix chars are accepted
- * and returned. If no valid header is found, 0 is returned and the passed
- * variables are not touched.
- */
-static size_t log_prefix(const char *p, unsigned int *level, char *special)
-{
- unsigned int lev = 0;
- char sp = '\0';
- size_t len;
-
- if (p[0] != '<' || !p[1])
- return 0;
- if (p[2] == '>') {
- /* usual single digit level number or special char */
- switch (p[1]) {
- case '0' ... '7':
- lev = p[1] - '0';
- break;
- case 'c': /* KERN_CONT */
- case 'd': /* KERN_DEFAULT */
- sp = p[1];
- break;
- default:
- return 0;
- }
- len = 3;
- } else {
- /* multi digit including the level and facility number */
- char *endp = NULL;
-
- lev = (simple_strtoul(&p[1], &endp, 10) & 7);
- if (endp == NULL || endp[0] != '>')
- return 0;
- len = (endp + 1) - p;
- }
-
- /* do not accept special char if not asked for */
- if (sp && !special)
- return 0;
-
- if (special) {
- *special = sp;
- /* return special char, do not touch level */
- if (sp)
- return len;
- }
-
- if (level)
- *level = lev;
- return len;
-}
-
-/*
* Call the console drivers, asking them to write out
* log_buf[start] to log_buf[end - 1].
* The console_lock must be held.
*/
-static void call_console_drivers(unsigned start, unsigned end)
+static void call_console_drivers(int level, const char *text, size_t len)
{
- unsigned cur_index, start_print;
- static int msg_level = -1;
+ struct console *con;

- BUG_ON(((int)(start - end)) > 0);
+ trace_console(text, 0, len, len);

- cur_index = start;
- start_print = start;
- while (cur_index != end) {
- if (msg_level < 0 && ((end - cur_index) > 2)) {
- /* strip log prefix */
- cur_index += log_prefix(&LOG_BUF(cur_index), &msg_level, NULL);
- start_print = cur_index;
- }
- while (cur_index != end) {
- char c = LOG_BUF(cur_index);
-
- cur_index++;
- if (c == '\n') {
- if (msg_level < 0) {
- /*
- * printk() has already given us loglevel tags in
- * the buffer. This code is here in case the
- * log buffer has wrapped right round and scribbled
- * on those tags
- */
- msg_level = default_message_loglevel;
- }
- _call_console_drivers(start_print, cur_index, msg_level);
- msg_level = -1;
- start_print = cur_index;
- break;
- }
- }
- }
- _call_console_drivers(start_print, end, msg_level);
-}
+ if (level >= console_loglevel && !ignore_loglevel)
+ return;
+ if (!console_drivers)
+ return;

-static void emit_log_char(char c)
-{
- LOG_BUF(log_end) = c;
- log_end++;
- if (log_end - log_start > log_buf_len)
- log_start = log_end - log_buf_len;
- if (log_end - con_start > log_buf_len)
- con_start = log_end - log_buf_len;
- if (logged_chars < log_buf_len)
- logged_chars++;
+ for_each_console(con) {
+ if (exclusive_console && con != exclusive_console)
+ continue;
+ if (!(con->flags & CON_ENABLED))
+ continue;
+ if (!con->write)
+ continue;
+ if (!cpu_online(smp_processor_id()) &&
+ !(con->flags & CON_ANYTIME))
+ continue;
+ con->write(con, text, len);
+ }
}

/*
@@ -700,16 +829,6 @@ static void zap_locks(void)
sema_init(&console_sem, 1);
}

-#if defined(CONFIG_PRINTK_TIME)
-static bool printk_time = 1;
-#else
-static bool printk_time = 0;
-#endif
-module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR);
-
-static bool always_kmsg_dump;
-module_param_named(always_kmsg_dump, always_kmsg_dump, bool, S_IRUGO | S_IWUSR);
-
/* Check if we have any console registered that can be called early in boot. */
static int have_callable_console(void)
{
@@ -722,51 +841,6 @@ static int have_callable_console(void)
return 0;
}

-/**
- * printk - print a kernel message
- * @fmt: format string
- *
- * This is printk(). It can be called from any context. We want it to work.
- *
- * We try to grab the console_lock. If we succeed, it's easy - we log the output and
- * call the console drivers. If we fail to get the semaphore we place the output
- * into the log buffer and return. The current holder of the console_sem will
- * notice the new output in console_unlock(); and will send it to the
- * consoles before releasing the lock.
- *
- * One effect of this deferred printing is that code which calls printk() and
- * then changes console_loglevel may break. This is because console_loglevel
- * is inspected when the actual printing occurs.
- *
- * See also:
- * printf(3)
- *
- * See the vsnprintf() documentation for format string extensions over C99.
- */
-
-asmlinkage int printk(const char *fmt, ...)
-{
- va_list args;
- int r;
-
-#ifdef CONFIG_KGDB_KDB
- if (unlikely(kdb_trap_printk)) {
- va_start(args, fmt);
- r = vkdb_printf(fmt, args);
- va_end(args);
- return r;
- }
-#endif
- va_start(args, fmt);
- r = vprintk(fmt, args);
- va_end(args);
-
- return r;
-}
-
-/* cpu currently holding logbuf_lock */
-static volatile unsigned int printk_cpu = UINT_MAX;
-
/*
* Can we actually use the console at this time on this cpu?
*
@@ -810,17 +884,12 @@ static int console_trylock_for_printk(un
retval = 0;
}
}
- printk_cpu = UINT_MAX;
+ logbuf_cpu = UINT_MAX;
if (wake)
up(&console_sem);
raw_spin_unlock(&logbuf_lock);
return retval;
}
-static const char recursion_bug_msg [] =
- KERN_CRIT "BUG: recent printk recursion!\n";
-static int recursion_bug;
-static int new_text_line = 1;
-static char printk_buf[1024];

int printk_delay_msec __read_mostly;

@@ -836,15 +905,22 @@ static inline void printk_delay(void)
}
}

-asmlinkage int vprintk(const char *fmt, va_list args)
-{
- int printed_len = 0;
- int current_log_level = default_message_loglevel;
+asmlinkage int vprintk_emit(int facility, int level,
+ const char *dict, size_t dictlen,
+ const char *fmt, va_list args)
+{
+ static int recursion_bug;
+ static char buf[LOG_LINE_MAX];
+ static size_t buflen;
+ static int buflevel;
+ static char textbuf[LOG_LINE_MAX];
+ char *text = textbuf;
+ size_t textlen;
unsigned long flags;
int this_cpu;
- char *p;
- size_t plen;
- char special;
+ bool newline = false;
+ bool cont = false;
+ int printed_len = 0;

boot_delay_msec();
printk_delay();
@@ -856,7 +932,7 @@ asmlinkage int vprintk(const char *fmt,
/*
* Ouch, printk recursed into itself!
*/
- if (unlikely(printk_cpu == this_cpu)) {
+ if (unlikely(logbuf_cpu == this_cpu)) {
/*
* If a crash is occurring during printk() on this CPU,
* then try to get the crash message out but make sure
@@ -873,97 +949,92 @@ asmlinkage int vprintk(const char *fmt,

lockdep_off();
raw_spin_lock(&logbuf_lock);
- printk_cpu = this_cpu;
+ logbuf_cpu = this_cpu;

if (recursion_bug) {
+ static const char recursion_msg[] =
+ "BUG: recent printk recursion!";
+
recursion_bug = 0;
- strcpy(printk_buf, recursion_bug_msg);
- printed_len = strlen(recursion_bug_msg);
- }
- /* Emit the output into the temporary buffer */
- printed_len += vscnprintf(printk_buf + printed_len,
- sizeof(printk_buf) - printed_len, fmt, args);
-
- p = printk_buf;
-
- /* Read log level and handle special printk prefix */
- plen = log_prefix(p, &current_log_level, &special);
- if (plen) {
- p += plen;
-
- switch (special) {
- case 'c': /* Strip <c> KERN_CONT, continue line */
- plen = 0;
- break;
- case 'd': /* Strip <d> KERN_DEFAULT, start new line */
- plen = 0;
- default:
- if (!new_text_line) {
- emit_log_char('\n');
- new_text_line = 1;
- }
- }
+ printed_len += strlen(recursion_msg);
+ /* emit KERN_CRIT message */
+ log_store(0, 2, NULL, 0, recursion_msg, printed_len);
}

/*
- * Copy the output into log_buf. If the caller didn't provide
- * the appropriate log prefix, we insert them here
+ * The printf needs to come first; we need the syslog
+ * prefix which might be passed-in as a parameter.
*/
- for (; *p; p++) {
- if (new_text_line) {
- new_text_line = 0;
-
- if (plen) {
- /* Copy original log prefix */
- int i;
-
- for (i = 0; i < plen; i++)
- emit_log_char(printk_buf[i]);
- printed_len += plen;
- } else {
- /* Add log prefix */
- emit_log_char('<');
- emit_log_char(current_log_level + '0');
- emit_log_char('>');
- printed_len += 3;
- }
+ textlen = vscnprintf(text, sizeof(textbuf), fmt, args);

- if (printk_time) {
- /* Add the current time stamp */
- char tbuf[50], *tp;
- unsigned tlen;
- unsigned long long t;
- unsigned long nanosec_rem;
-
- t = cpu_clock(printk_cpu);
- nanosec_rem = do_div(t, 1000000000);
- tlen = sprintf(tbuf, "[%5lu.%06lu] ",
- (unsigned long) t,
- nanosec_rem / 1000);
-
- for (tp = tbuf; tp < tbuf + tlen; tp++)
- emit_log_char(*tp);
- printed_len += tlen;
- }
+ /* mark and strip a trailing newline */
+ if (textlen && text[textlen-1] == '\n') {
+ textlen--;
+ newline = true;
+ }

- if (!*p)
- break;
+ /* strip syslog prefix and extract log level or flags */
+ if (text[0] == '<' && text[1] && text[2] == '>') {
+ switch (text[1]) {
+ case '0' ... '7':
+ if (level == -1)
+ level = text[1] - '0';
+ text += 3;
+ textlen -= 3;
+ break;
+ case 'c': /* KERN_CONT */
+ cont = true;
+ case 'd': /* KERN_DEFAULT */
+ text += 3;
+ textlen -= 3;
+ break;
}
+ }

- emit_log_char(*p);
- if (*p == '\n')
- new_text_line = 1;
+ if (buflen && (!cont || dict)) {
+ /* no continuation; flush existing buffer */
+ log_store(facility, buflevel, NULL, 0, buf, buflen);
+ printed_len += buflen;
+ buflen = 0;
+ }
+
+ if (buflen == 0) {
+ /* remember level for first message in the buffer */
+ if (level == -1)
+ buflevel = default_message_loglevel;
+ else
+ buflevel = level;
+ }
+
+ if (buflen || !newline) {
+ /* append to existing buffer, or buffer until next message */
+ if (buflen + textlen > sizeof(buf))
+ textlen = sizeof(buf) - buflen;
+ memcpy(buf + buflen, text, textlen);
+ buflen += textlen;
+ }
+
+ if (newline) {
+ /* end of line; flush buffer */
+ if (buflen) {
+ log_store(facility, buflevel,
+ dict, dictlen, buf, buflen);
+ printed_len += buflen;
+ buflen = 0;
+ } else {
+ log_store(facility, buflevel,
+ dict, dictlen, text, textlen);
+ printed_len += textlen;
+ }
}

/*
- * Try to acquire and then immediately release the
- * console semaphore. The release will do all the
- * actual magic (print out buffers, wake up klogd,
- * etc).
+ * Try to acquire and then immediately release the console semaphore.
+ * The release will print out buffers and wake up /dev/kmsg and syslog()
+ * users.
*
- * The console_trylock_for_printk() function
- * will release 'logbuf_lock' regardless of whether it
- * actually gets the semaphore or not.
+ * The console_trylock_for_printk() function will release 'logbuf_lock'
+ * regardless of whether it actually gets the console semaphore or not.
*/
if (console_trylock_for_printk(this_cpu))
console_unlock();
@@ -974,12 +1045,73 @@ out_restore_irqs:

return printed_len;
}
-EXPORT_SYMBOL(printk);
+EXPORT_SYMBOL(vprintk_emit);
+
+asmlinkage int vprintk(const char *fmt, va_list args)
+{
+ return vprintk_emit(0, -1, NULL, 0, fmt, args);
+}
EXPORT_SYMBOL(vprintk);

+asmlinkage int printk_emit(int facility, int level,
+ const char *dict, size_t dictlen,
+ const char *fmt, ...)
+{
+ va_list args;
+ int r;
+
+ va_start(args, fmt);
+ r = vprintk_emit(facility, level, dict, dictlen, fmt, args);
+ va_end(args);
+
+ return r;
+}
+EXPORT_SYMBOL(printk_emit);
+
+/**
+ * printk - print a kernel message
+ * @fmt: format string
+ *
+ * This is printk(). It can be called from any context. We want it to work.
+ *
+ * We try to grab the console_lock. If we succeed, it's easy - we log the
+ * output and call the console drivers. If we fail to get the semaphore, we
+ * place the output into the log buffer and return. The current holder of
+ * the console_sem will notice the new output in console_unlock(); and will
+ * send it to the consoles before releasing the lock.
+ *
+ * One effect of this deferred printing is that code which calls printk() and
+ * then changes console_loglevel may break. This is because console_loglevel
+ * is inspected when the actual printing occurs.
+ *
+ * See also:
+ * printf(3)
+ *
+ * See the vsnprintf() documentation for format string extensions over C99.
+ */
+asmlinkage int printk(const char *fmt, ...)
+{
+ va_list args;
+ int r;
+
+#ifdef CONFIG_KGDB_KDB
+ if (unlikely(kdb_trap_printk)) {
+ va_start(args, fmt);
+ r = vkdb_printf(fmt, args);
+ va_end(args);
+ return r;
+ }
+#endif
+ va_start(args, fmt);
+ r = vprintk_emit(0, -1, NULL, 0, fmt, args);
+ va_end(args);
+
+ return r;
+}
+EXPORT_SYMBOL(printk);
#else

-static void call_console_drivers(unsigned start, unsigned end)
+static void call_console_drivers(int level, const char *text, size_t len)
{
}

@@ -1217,7 +1349,7 @@ int is_console_locked(void)
}

/*
- * Delayed printk facility, for scheduler-internal messages:
+ * Delayed printk version, for scheduler-internal messages:
*/
#define PRINTK_BUF_SIZE 512

@@ -1253,6 +1385,10 @@ void wake_up_klogd(void)
this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP);
}

+/* the next printk record to write to the console */
+static u64 console_seq;
+static u32 console_idx;
+
/**
* console_unlock - unlock the console system
*
@@ -1263,15 +1399,16 @@ void wake_up_klogd(void)
* by printk(). If this is the case, console_unlock(); emits
* the output prior to releasing the lock.
*
- * If there is output waiting for klogd, we wake it up.
+ * If there is output waiting, we wake it /dev/kmsg and syslog() users.
*
* console_unlock(); may be called from any context.
*/
void console_unlock(void)
{
+ static u64 seen_seq;
unsigned long flags;
- unsigned _con_start, _log_end;
- unsigned wake_klogd = 0, retry = 0;
+ bool wake_klogd = false;
+ bool retry;

if (console_suspended) {
up(&console_sem);
@@ -1281,17 +1418,41 @@ void console_unlock(void)
console_may_schedule = 0;

again:
- for ( ; ; ) {
+ for (;;) {
+ struct log *msg;
+ static char text[LOG_LINE_MAX];
+ size_t len;
+ int level;
+
raw_spin_lock_irqsave(&logbuf_lock, flags);
- wake_klogd |= log_start - log_end;
- if (con_start == log_end)
- break; /* Nothing to print */
- _con_start = con_start;
- _log_end = log_end;
- con_start = log_end; /* Flush */
+ if (seen_seq != log_next_seq) {
+ wake_klogd = true;
+ seen_seq = log_next_seq;
+ }
+
+ if (console_seq < log_first_seq) {
+ /* messages are gone, move to first one */
+ console_seq = log_first_seq;
+ console_idx = log_first_idx;
+ }
+
+ if (console_seq == log_next_seq)
+ break;
+
+ msg = log_from_idx(console_idx);
+ level = msg->level & 7;
+ len = msg->text_len;
+ if (len+1 >= sizeof(text))
+ len = sizeof(text)-1;
+ memcpy(text, log_text(msg), len);
+ text[len++] = '\n';
+
+ console_idx = log_next(console_idx);
+ console_seq++;
raw_spin_unlock(&logbuf_lock);
+
stop_critical_timings(); /* don't trace print latency */
- call_console_drivers(_con_start, _log_end);
+ call_console_drivers(level, text, len);
start_critical_timings();
local_irq_restore(flags);
}
@@ -1312,8 +1473,7 @@ again:
* flush, no worries.
*/
raw_spin_lock(&logbuf_lock);
- if (con_start != log_end)
- retry = 1;
+ retry = console_seq != log_next_seq;
raw_spin_unlock_irqrestore(&logbuf_lock, flags);

if (retry && console_trylock())
@@ -1549,7 +1709,8 @@ void register_console(struct console *ne
* for us.
*/
raw_spin_lock_irqsave(&logbuf_lock, flags);
- con_start = log_start;
+ console_seq = syslog_seq;
+ console_idx = syslog_idx;
raw_spin_unlock_irqrestore(&logbuf_lock, flags);
/*
* We're about to replay the log buffer. Only do this to the
@@ -1758,6 +1919,9 @@ int kmsg_dump_unregister(struct kmsg_dum
}
EXPORT_SYMBOL_GPL(kmsg_dump_unregister);

+static bool always_kmsg_dump;
+module_param_named(always_kmsg_dump, always_kmsg_dump, bool, S_IRUGO | S_IWUSR);
+
/**
* kmsg_dump - dump kernel log to kernel message dumpers.
* @reason: the reason (oops, panic etc) for dumping
@@ -1767,8 +1931,7 @@ EXPORT_SYMBOL_GPL(kmsg_dump_unregister);
*/
void kmsg_dump(enum kmsg_dump_reason reason)
{
- unsigned long end;
- unsigned chars;
+ u64 idx;
struct kmsg_dumper *dumper;
const char *s1, *s2;
unsigned long l1, l2;
@@ -1780,24 +1943,27 @@ void kmsg_dump(enum kmsg_dump_reason rea
/* Theoretically, the log could move on after we do this, but
there's not a lot we can do about that. The new messages
will overwrite the start of what we dump. */
- raw_spin_lock_irqsave(&logbuf_lock, flags);
- end = log_end & LOG_BUF_MASK;
- chars = logged_chars;
- raw_spin_unlock_irqrestore(&logbuf_lock, flags);

- if (chars > end) {
- s1 = log_buf + log_buf_len - chars + end;
- l1 = chars - end;
+ raw_spin_lock_irqsave(&logbuf_lock, flags);
+ if (syslog_seq < log_first_seq)
+ idx = syslog_idx;
+ else
+ idx = log_first_idx;
+
+ if (idx > log_next_idx) {
+ s1 = log_buf;
+ l1 = log_next_idx;

- s2 = log_buf;
- l2 = end;
+ s2 = log_buf + idx;
+ l2 = log_buf_len - idx;
} else {
s1 = "";
l1 = 0;

- s2 = log_buf + end - chars;
- l2 = chars;
+ s2 = log_buf + idx;
+ l2 = log_next_idx - idx;
}
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);

rcu_read_lock();
list_for_each_entry_rcu(dumper, &dump_list, list)


2012-05-03 19:48:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, 2012-05-03 at 02:29 +0200, Kay Sievers wrote:
> + * This is printk(). It can be called from any context. We want it to work.

This is not true..

2012-05-03 19:55:41

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 3, 2012 at 9:48 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-05-03 at 02:29 +0200, Kay Sievers wrote:
>> + * This is printk(). It can be called from any context. We want it to work.
>
> This is not true..

I did not add that, it's just 'diff' that moves this hunk while adding
stuff before it. It's the same text as in the current code:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=kernel/printk.c;h=b663c2c95d3922e4c080e8974c09c6ce99b4355f;hb=HEAD#l729

Kay

2012-05-03 19:56:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, 2012-05-03 at 21:54 +0200, Kay Sievers wrote:
> On Thu, May 3, 2012 at 9:48 PM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, 2012-05-03 at 02:29 +0200, Kay Sievers wrote:
> >> + * This is printk(). It can be called from any context. We want it to work.
> >
> > This is not true..
>
> I did not add that, it's just 'diff' that moves this hunk while adding
> stuff before it. It's the same text as in the current code:

Sure, but its still not true.

2012-05-03 19:57:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 3, 2012 at 12:48 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-05-03 at 02:29 +0200, Kay Sievers wrote:
>> + * This is printk(). It can be called from any context. We want it to work.
>
> This is not true..

Well, it is very very nearly true, and people should basically expect it to be.

There are exceptions, but they are ridiculously specific
- no NMI context
- we've had trouble with the percpu runqueue lock (due to the stupid wakeup)
- for similar reasons, we've had trouble with printk called from
within some terminal locks.

But basically, it damn well should be considered a hard rule that you
can printk from pretty much anything. The few cases where that is not
true are big problems, and should be considered to be real problems,
not "oh, you can't printk from here".

Linus

2012-05-03 20:03:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, 2012-05-03 at 12:56 -0700, Linus Torvalds wrote:
> On Thu, May 3, 2012 at 12:48 PM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, 2012-05-03 at 02:29 +0200, Kay Sievers wrote:
> >> + * This is printk(). It can be called from any context. We want it to work.
> >
> > This is not true..
>
> Well, it is very very nearly true, and people should basically expect it to be.
>
> There are exceptions, but they are ridiculously specific
> - no NMI context

With the new record design you could actually make the thing lockless
and allow writing to it from NMI context. IIRC vsnprintf() doesn't do
any allocations/locks.

> - we've had trouble with the percpu runqueue lock (due to the stupid wakeup)

Yep :-(

> - for similar reasons, we've had trouble with printk called from
> within some terminal locks.
>
> But basically, it damn well should be considered a hard rule that you
> can printk from pretty much anything. The few cases where that is not
> true are big problems, and should be considered to be real problems,
> not "oh, you can't printk from here".

Thing is, with bonkers stuff like usb-console and kms/drm that's a _lot_
of code running under the logbuf/console locks.

2012-05-03 20:10:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 3, 2012 at 1:02 PM, Peter Zijlstra <[email protected]> wrote:
>
> Thing is, with bonkers stuff like usb-console and kms/drm that's a _lot_
> of code running under the logbuf/console locks.

The top-level console lock shouldn't be a problem - we use trylock and
delay if it is held.

It's the lower-level driver-specific locks that screw us up. And quite
frankly, I am *not* willing to say that that is a printk() problem.
That is purely a "USB serial console is damn well broken" issue, and
should not be considered a limitation of printk.

Linus

2012-05-03 20:11:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, 2012-05-03 at 13:09 -0700, Linus Torvalds wrote:
> On Thu, May 3, 2012 at 1:02 PM, Peter Zijlstra <[email protected]> wrote:
> >
> > Thing is, with bonkers stuff like usb-console and kms/drm that's a _lot_
> > of code running under the logbuf/console locks.
>
> The top-level console lock shouldn't be a problem - we use trylock and
> delay if it is held.
>
> It's the lower-level driver-specific locks that screw us up. And quite
> frankly, I am *not* willing to say that that is a printk() problem.
> That is purely a "USB serial console is damn well broken" issue, and
> should not be considered a limitation of printk.

I'm happy to call all that broken ;-)

2012-05-03 20:18:47

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 03, 2012 at 10:11:02PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-05-03 at 13:09 -0700, Linus Torvalds wrote:
> > On Thu, May 3, 2012 at 1:02 PM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > Thing is, with bonkers stuff like usb-console and kms/drm that's a _lot_
> > > of code running under the logbuf/console locks.
> >
> > The top-level console lock shouldn't be a problem - we use trylock and
> > delay if it is held.
> >
> > It's the lower-level driver-specific locks that screw us up. And quite
> > frankly, I am *not* willing to say that that is a printk() problem.
> > That is purely a "USB serial console is damn well broken" issue, and
> > should not be considered a limitation of printk.
>
> I'm happy to call all that broken ;-)

USB serial console was a drunken bet gone wrong. I'm amazed that it
even works at times, and it should never be considered critical enough
to affect any core kernel code.

greg k-h

2012-05-08 08:49:19

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

Hi Kay,

On Thu, May 3, 2012 at 2:29 AM, Kay Sievers <[email protected]> wrote:
> From: Kay Sievers <[email protected]>
> Subject: printk: convert byte-buffer to variable-length record buffer

This patch was included in today's linux-next, and it looks like it
broke printk on my configuration.

It's a bit hard to describe exactly what's wrong, but here are the main points:

1. PRINTK_TIME timing information is gone.
2. Newline issues. It appears to be adding newlines where it
shouldn't. For example:

Before:
[ 10.110626] parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE]

After:
parport0: PC-style at 0x378
, irq 7
[
PCSPP
,TRISTATE
]

3. Things are not properly aligned, such as stack traces.

2012-05-08 11:15:23

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, 2012-05-08 at 10:48 +0200, Sasha Levin wrote:
> This patch was included in today's linux-next, and it looks like it
> broke printk on my configuration.
>
> It's a bit hard to describe exactly what's wrong, but here are the main points:
>
> 1. PRINTK_TIME timing information is gone.

It's always unconditionally recorded now. It looks fine in 'dmesg'
output, right?

> 2. Newline issues. It appears to be adding newlines where it
> shouldn't.

> parport0: PC-style at 0x378
> , irq 7
> [
> PCSPP

Yeah, we need to make sure, we never merge the (always racy)
continuation printk() users with (non-racy) non-continuation users.
Therefore KERN_CONT is required to suppress the newline and to merge the
content with the earlier non-newline-terminated printk() line.

I 'm sure, I have signed up for the job to fix what's needed here. :)

Patch for parport is below.

> 3. Things are not properly aligned, such as stack traces.

Mind sending the entire output of your 'dmesg', which shows it?

We recently just needed to fix the "<%u>" in the dump, which got
recognized as syslog prority prefixes. There might be a few more
KERN_CONT missing ...

Thanks,
Kay


From: Kay Sievers <[email protected]>
Subject: parport: use KERN_CONT in printk() continuation lines

On Tue, May 8, 2012 at 10:48 AM, Sasha Levin <[email protected]> wrote:
> Before:
> [ 10.110626] parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE]
>
> After:
> parport0: PC-style at 0x378
> , irq 7
> [
> PCSPP
> ,TRISTATE
> ]

Reported-By: Sasha Levin <[email protected]>
Signed-off-by: Kay Sievers <[email protected]>
---
drivers/parport/parport_pc.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

--- a/drivers/parport/parport_pc.c
+++ b/drivers/parport/parport_pc.c
@@ -2351,7 +2351,7 @@ struct parport *parport_pc_probe_port(un

printk(KERN_INFO "%s: PC-style at 0x%lx", p->name, p->base);
if (p->base_hi && priv->ecr)
- printk(" (0x%lx)", p->base_hi);
+ printk(KERN_CONT " (0x%lx)", p->base_hi);
if (p->irq == PARPORT_IRQ_AUTO) {
p->irq = PARPORT_IRQ_NONE;
parport_irq_probe(p);
@@ -2362,7 +2362,7 @@ struct parport *parport_pc_probe_port(un
p->irq = PARPORT_IRQ_NONE;
}
if (p->irq != PARPORT_IRQ_NONE) {
- printk(", irq %d", p->irq);
+ printk(KERN_CONT ", irq %d", p->irq);
priv->ctr_writable |= 0x10;

if (p->dma == PARPORT_DMA_AUTO) {
@@ -2386,21 +2386,21 @@ struct parport *parport_pc_probe_port(un
/* p->ops->ecp_read_data = parport_pc_ecp_read_block_pio; */
#endif /* IEEE 1284 support */
if (p->dma != PARPORT_DMA_NONE) {
- printk(", dma %d", p->dma);
+ printk(KERN_CONT ", dma %d", p->dma);
p->modes |= PARPORT_MODE_DMA;
} else
- printk(", using FIFO");
+ printk(KERN_CONT ", using FIFO");
} else
/* We can't use the DMA channel after all. */
p->dma = PARPORT_DMA_NONE;
#endif /* Allowed to use FIFO/DMA */

- printk(" [");
+ printk(KERN_CONT " [");

#define printmode(x) \
{\
if (p->modes & PARPORT_MODE_##x) {\
- printk("%s%s", f ? "," : "", #x);\
+ printk(KERN_CONT "%s%s", f ? "," : "", #x);\
f++;\
} \
}
@@ -2416,9 +2416,9 @@ struct parport *parport_pc_probe_port(un
}
#undef printmode
#ifndef CONFIG_PARPORT_1284
- printk("(,...)");
+ printk(KERN_CONT "(,...)");
#endif /* CONFIG_PARPORT_1284 */
- printk("]\n");
+ printk(KERN_CONT "]\n");
if (probedirq != PARPORT_IRQ_NONE)
printk(KERN_INFO "%s: irq %d detected\n", p->name, probedirq);


2012-05-08 13:33:33

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 8, 2012 at 1:14 PM, Kay Sievers <[email protected]> wrote:
> On Tue, 2012-05-08 at 10:48 +0200, Sasha Levin wrote:
>> This patch was included in today's linux-next, and it looks like it
>> broke printk on my configuration.
>>
>> It's a bit hard to describe exactly what's wrong, but here are the main points:
>>
>> 1. PRINTK_TIME timing information is gone.
>
> It's always unconditionally recorded now. It looks fine in 'dmesg'
> output, right?

Yup. Is there no intention to show it in the console output though? it
was quite handy there.

>
>> 2. Newline issues. It appears to be adding newlines where it
>> shouldn't.
>
>> parport0: PC-style at 0x378
>> , irq 7
>> ?[
>> PCSPP
>
> Yeah, we need to make sure, we never merge the (always racy)
> continuation printk() users with (non-racy) non-continuation users.
> Therefore KERN_CONT is required to suppress the newline and to merge the
> content with the earlier non-newline-terminated printk() line.
>
> I 'm sure, I have signed up for the job to fix what's needed here. :)
>
> Patch for parport is below.
>
>> 3. Things are not properly aligned, such as stack traces.
>
> Mind sending the entire output of your 'dmesg', which shows it?
>
> We recently just needed to fix the "<%u>" in the dump, which got
> recognized as syslog prority prefixes. There might be a few more
> KERN_CONT missing ...

The 3rd one actually looks like the KERN_CONT problem now that you've
explained it. You can see it in the 'saved trace' test in the boot
log.

I've attached the boot log which shows several other places with this
issue, if you're ok with it I can try patching those myself.


Attachments:
log.txt (71.47 kB)

2012-05-08 14:29:33

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 8, 2012 at 3:33 PM, Sasha Levin <[email protected]> wrote:
> On Tue, May 8, 2012 at 1:14 PM, Kay Sievers <[email protected]> wrote:

> The 3rd one actually looks like the KERN_CONT problem now that you've
> explained it. You can see it in the 'saved trace' test in the boot
> log.
>
> I've attached the boot log which shows several other places with this
> issue,

The percpu stuff should be pending (thought it was in -next already):
https://lkml.org/lkml/2012/3/29/317

The ACPI stuff was fixed here:
https://lkml.org/lkml/2012/4/2/532

But only the vt and hpet things got merged so far:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=5da527aafed2834852fc4fe21daeaeadf7c61af3

I'll respin the ACPI parts now.

> if you're ok with it I can try patching those myself.

That would be awesome. If you decide otherwise, please let me know,
and I'll take care of them.

Even if we decide to relax the current strict behaviour later, we
should fix all stuff we run into. Userspace really relies on proper
messages, and not partial or mixed up things to guess what they mean.

The nice thing with the currently strict logic is that only two
instances of KERN_CONT running at the same time can race against each
other and mix their own messages. Unlike the old kmsg behaviour, users
which log a single proper terminated line, will never get interleaved
with the KERN_CONT users. This reduces the "risk" of garbled messages
to the users which actually decided to take this "risk".

Thanks,
Kay

2012-05-08 15:33:49

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 8, 2012 at 4:29 PM, Kay Sievers <[email protected]> wrote:
> On Tue, May 8, 2012 at 3:33 PM, Sasha Levin <[email protected]> wrote:

>> if you're ok with it I can try patching those myself.
>
> That would be awesome. If you decide otherwise, please let me know,
> and I'll take care of them.

Sent the rest out, which got lost a while ago. Here's the already
fixed and pending stuff, so you can check what's not covered and still
missing on your box.

percpu:
http://thread.gmane.org/gmane.linux.kernel/1274450

mm:
http://thread.gmane.org/gmane.linux.kernel/1293437

acpi:
http://thread.gmane.org/gmane.linux.kernel/1293438

Thanks,
Kay

2012-05-08 15:57:58

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 8, 2012 at 5:33 PM, Kay Sievers <[email protected]> wrote:
> On Tue, May 8, 2012 at 4:29 PM, Kay Sievers <[email protected]> wrote:
>> On Tue, May 8, 2012 at 3:33 PM, Sasha Levin <[email protected]> wrote:
>
>>> if you're ok with it I can try patching those myself.
>>
>> That would be awesome. If you decide otherwise, please let me know,
>> and I'll take care of them.
>
> Sent the rest out, which got lost a while ago. Here's the already
> fixed and pending stuff, so you can check what's not covered and still
> missing on your box.

> mm:
> ?http://thread.gmane.org/gmane.linux.kernel/1293437

Woops, look like I've did this one all over.


What about PRINTK_TIME? Are there plans to enable it on console output?

2012-05-08 16:28:06

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 8, 2012 at 5:57 PM, Sasha Levin <[email protected]> wrote:

> What about PRINTK_TIME? Are there plans to enable it on console output?

Yeah, we should restore the old behaviour. The timestamp is always
stored internally, regardless of the CONFIG_PRINTK_TIME or tunable
setting. At the moment, seems they are only (conditionally) added to
the syslog() output and (unconditionally) to the /dev/kmsg export.

I guess we just re-define CONFIG_PRINTK_TIME and the tunable as: "show
time in syslog() and on the console output".

Karel will likely extend dmesg(1) to use /dev/kmsg instead of syslog()
when it's readable and dmesg(1) can still show the time stamps read
from /dev/kmsg, even when CONFIG_PRINTK_TIME is not set, so the old
tunable basically becomes a "show timestamps on the console" flag. :)

Kay

2012-05-08 22:57:33

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 08, 2012 at 06:27:42PM +0200, Kay Sievers wrote:
> On Tue, May 8, 2012 at 5:57 PM, Sasha Levin <[email protected]> wrote:
>
> > What about PRINTK_TIME? Are there plans to enable it on console output?
>
> Yeah, we should restore the old behaviour. The timestamp is always
> stored internally, regardless of the CONFIG_PRINTK_TIME or tunable
> setting. At the moment, seems they are only (conditionally) added to
> the syslog() output and (unconditionally) to the /dev/kmsg export.
>
> I guess we just re-define CONFIG_PRINTK_TIME and the tunable as: "show
> time in syslog() and on the console output".

That would be good to put back, can you do that?

thanks,

greg k-h

2012-05-09 03:52:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 8, 2012 at 4:14 AM, Kay Sievers <[email protected]> wrote:
>
> Yeah, we need to make sure, we never merge the (always racy)
> continuation printk() users with (non-racy) non-continuation users.
> Therefore KERN_CONT is required to suppress the newline and to merge the
> content with the earlier non-newline-terminated printk() line.

Why?

I really think this is just a bug in the new code.

KERN_CONT should not be needed if the previous printk didn't have a final "\n".

We made it easier to use printk for a reason a few months ago. The new
rules are:

- If you have a KERN_<loglevel>, it *always* starts a new line, the
obvious exception being KERN_CONT

- the loglevels *only* matter at the start of the printk - so if you
have '\n' embedded in a single printk, that changes nothing
what-so-ever. It's not line-based.

- if you didn't have a '\n', and don't have a loglevel, KERN_CONT is implied.

Quite frankly, those three rules (a) make sense and (b) make things easy.

Breaking them now is a bug. Please don't go adding ugly KERN_CONT when
there really isn't any reason for it. Just fix the printk code you
broke.

Linus

2012-05-09 04:07:15

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, 2012-05-08 at 20:52 -0700, Linus Torvalds wrote:
> On Tue, May 8, 2012 at 4:14 AM, Kay Sievers <[email protected]> wrote:
> > Yeah, we need to make sure, we never merge the (always racy)
> > continuation printk() users with (non-racy) non-continuation users.
> > Therefore KERN_CONT is required to suppress the newline and to merge the
> > content with the earlier non-newline-terminated printk() line.
>
> Please don't go adding ugly KERN_CONT when
> there really isn't any reason for it. Just fix the printk code you
> broke.

Thanks for making that clear.

2012-05-09 04:12:14

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, May 9, 2012 at 5:52 AM, Linus Torvalds
<[email protected]> wrote:
> KERN_CONT should not be needed if the previous printk didn't have a final "\n".
>
> We made it easier to use printk for a reason a few months ago. The new
> rules are:
>
> ?- If you have a KERN_<loglevel>, it *always* starts a new line, the
> obvious exception being KERN_CONT
>
> ?- the loglevels *only* matter at the start of the printk - so if you
> have '\n' embedded in a single printk, that changes nothing
> what-so-ever. It's not line-based.
>
> ?- if you didn't have a '\n', and don't have a loglevel, KERN_CONT is implied.
>
> Quite frankly, those three rules (a) make sense and (b) make things easy.

Is there a reason to keep KERN_CONT under this set of rules at all?

I'm guessing that there are very few places that have a final '\n' but
still want to use KERN_CONT, and even in that case it should be
trivially easy to fix them up.

Besides that, from what I understand, KERN_CONT isn't really needed.

2012-05-09 04:27:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 8, 2012 at 9:11 PM, Sasha Levin <[email protected]> wrote:
>
> Is there a reason to keep KERN_CONT under this set of rules at all?

Yes, a very subtle and rare one.

If you want to print out a string that starts with "<%d>", you *have*
to use a prefix, in order to not make printk think that the "<*>" is
the prefix itself.

And if you don't want to start a new line, you need to use KERN_CONT.

So there is *one* case, and one case only, where KERN_CONT is useful,
and it's when the string you are printing out could otherwise be
mistaken for a prefix itself (the "<%d>" thing isn't the only such
string, of course - it could be a "%s" where the string has those
characters in it.

So think of KERN_CONT as a "quoting safety" thing. And in 99% of all
cases it is obviously not actually needed. So in general, KERN_CONT is
useless and should not be used, but that doesn't mean that it can be
removed as a _concept_.

Linus

2012-05-09 04:36:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 8, 2012 at 9:27 PM, Linus Torvalds
<[email protected]> wrote:
>
> So think of KERN_CONT as a "quoting safety" thing. And in 99% of all
> cases it is obviously not actually needed. So in general, KERN_CONT is
> useless and should not be used, but that doesn't mean that it can be
> removed as a _concept_.

Btw, KERN_DEFAULT is the exact same thing for the "I start a new line,
but I don't really have a special priority level".

And it's *way* overused. At one point we used to have the policy that
all new printk()'s should have a priority level, presumably so that we
at some point could have just made it a real argument instead (ie
'printk(prio, "fmt", arg)').

However, that turned out to be just stupid (the same way encouraging
people to always use KERN_CONT was/is stupid). It turns out that a lot
of printk's really don't have a good priority level, there's no reason
to force them to use KERN_DEFAULT, and it doesn't actually add *any*
value what-so-ever.

But again, it can be useful for the special case where you start a new
printk() with a string that could be mistaken for the priority marker.
So exactly like KERN_CONT, KERN_DEFAULT can be used for "quoting"
purposes.

Of course, if you print out random strings with odd random crap in the
first few characters, you're probably doing something really really
wrong. So 99% of the time, you should never need KERN_DEFAULT or
KERN_CONT. They exist for the rare exceptional case, and they do
*need* to exist, but they should not generally need to be *used*.

So it's like a defibrillator: it is good to *have* one, but it's
really bad to have to *use* one.

Linus

2012-05-09 07:07:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer


* Linus Torvalds <[email protected]> wrote:

> [...]
>
> So it's like a defibrillator: it is good to *have* one, but
> it's really bad to have to *use* one.

/me nominates this for LWN quote of the week!

Thanks for bringing sanity into printk,

Ingo

2012-05-09 09:39:01

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, May 9, 2012 at 5:52 AM, Linus Torvalds
<[email protected]> wrote:
> On Tue, May 8, 2012 at 4:14 AM, Kay Sievers <[email protected]> wrote:
>>
>> Yeah, we need to make sure, we never merge the (always racy)
>> continuation printk() users with (non-racy) non-continuation users.
>> Therefore KERN_CONT is required to suppress the newline and to merge the
>> content with the earlier non-newline-terminated printk() line.
>
> Why?

The idea was: Prefixes are not used that often, but not using a prefix
should not expose the user to wrongly get appended to an earlier
non-terminated line of another thread.

The point was to limit the "risk" of wrong merges to users of
continuation, and not to users which send ordinary "atomic" lines.

> I really think this is just a bug in the new code.
>
> KERN_CONT should not be needed if the previous printk didn't have a final "\n".

It significantly limits wrong merges, especially for users which can
rightfully expect not to get a wrong merge.

> We made it easier to use printk for a reason a few months ago. The new
> rules are:
>
>  - If you have a KERN_<loglevel>, it *always* starts a new line, the
> obvious exception being KERN_CONT
>
>  - the loglevels *only* matter at the start of the printk - so if you
> have '\n' embedded in a single printk, that changes nothing
> what-so-ever. It's not line-based.

It is a different behaviour. "Innocent" users are not exposed to "risky" users.

>  - if you didn't have a '\n', and don't have a loglevel, KERN_CONT is implied.

But a lot of stuff which does not look for continuation, has no
prefix, hasn't it? I rather make the "I want to be appended" explicit,
instead giving the "I don't care about the log level" any meaning.

I think continuation is special, and ideally should not be much used.
We should not "optimize" for it, and not accept breakage introduced to
ordinary users that way.

> Quite frankly, those three rules (a) make sense and (b) make things easy.

It's true, it's much easier, but it's also much less reliable.

> Breaking them now is a bug. Please don't go adding ugly KERN_CONT when
> there really isn't any reason for it. Just fix the printk code you
> broke.

I can do this, I just don't think it's the right thing to do.

I surely would prefer reliability over rather weird heuristics for
special cases. Today, we should be able to trust at least
non-continuation printk users, which we can't, if we do expose them to
the very real problems of continuation.

Still not convinced? :)

Kay

2012-05-09 13:21:54

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, May 9, 2012 at 9:07 AM, Ingo Molnar <[email protected]> wrote:
> * Linus Torvalds <[email protected]> wrote:
>
>> [...]
>>
>> So it's like a defibrillator: it is good to *have* one, but
>> it's really bad to have to *use* one.

I rather teach the people with known heart problems to only slap their
shocks at people who wear a sign that they are prepared to get
treated, instead of just letting them "help" the healthy random next
guy. :)

> /me nominates this for LWN quote of the week!
>
> Thanks for bringing sanity into printk,

Sounds like the view of sanity of the guys with heart problems, not so
much of the healthy guys. :)

In other words:
Isolation of risk-takers from non-risk-takers is usually a good thing.

Explicit annotation of exceptions almost every time beats expressions
which need heuristics to construct the context.

We use typed pointer instead of void too, even when we don't need it
most of the time. :)

Kay

2012-05-09 13:30:13

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, May 9, 2012 at 9:07 AM, Ingo Molnar <[email protected]> wrote:

> Thanks for bringing sanity into printk,

Just to check all the options we have, and I have no good idea how
they would work out. It's just an idea, without giving much thought
about possible side effects:

Can we somehow store the PID of the printk() thread that has left the
un-terminated line behind in the buffer, and flush it out when the
next printk() is from a differnt PID? That could prevent the mangling
of "atomic" printk()s by continuation users, and merging unrelated
continuation users together.

Would it make sense to make the printk line buffer per CPU, so they
are not shared between threads, and continuation could work more
reliable?

Kay

2012-05-09 13:51:05

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, 2012-05-09 at 11:38 +0200, Kay Sievers wrote:
> On Wed, May 9, 2012 at 5:52 AM, Linus Torvalds
> <[email protected]> wrote:
> > On Tue, May 8, 2012 at 4:14 AM, Kay Sievers <[email protected]> wrote:
> >>
> >> Yeah, we need to make sure, we never merge the (always racy)
> >> continuation printk() users with (non-racy) non-continuation users.
> >> Therefore KERN_CONT is required to suppress the newline and to merge the
> >> content with the earlier non-newline-terminated printk() line.
> >
> > Why?
>
> The idea was: Prefixes are not used that often, but not using a prefix
> should not expose the user to wrongly get appended to an earlier
> non-terminated line of another thread.
>
> The point was to limit the "risk" of wrong merges to users of
> continuation, and not to users which send ordinary "atomic" lines.

I think your premise is wrong for a couple of reasons.

1: printk content is not guaranteed to be stable.

printk content should not be guaranteed to be stable.
It'd make it difficult to simply add a new message
or extend or correct an existing one.

2: There are _thousands_ of printks without prefix levels.

KERN_CONT was always kind of a half-stupid idea.
It's only real purpose is to designate message
continuations but it never gave enough information to
correctly coalesce the messages.

I think you need to give up on the idea that printk
output can be made to be machine readable.

If you really want to have users get complete and
not-interleaved messages you need to buffer partial
messages, Whether or not printk message coalescing
needs to have a cookie or can be done well enough by
using call tree information as suggested by Andrew
Morton is tbd.

Changing printk semantics and trying to stabilize
printk message content are bad ideas though.

I think printk_emit is an OK idea. Having a
per-subsystem notification mechanism, something
like an expanded ethtool is certainly useful.

Making the new additional content of printk_emit
discoverable would be useful.

Converting log_buf from circular to record oriented
and adding another binary header/descriptor to it
is a good idea too.

cheers, Joe

2012-05-09 14:38:00

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, May 9, 2012 at 3:50 PM, Joe Perches <[email protected]> wrote:
> On Wed, 2012-05-09 at 11:38 +0200, Kay Sievers wrote:

> I think your premise is wrong for a couple of reasons.
>
> 1: printk content is not guaranteed to be stable.

Nobody talks about the content of the strings.

>   printk content should not be guaranteed to be stable.
>   It'd make it difficult to simply add a new message
>   or extend or correct an existing one.

Again, that's only what you make out of it.

My point is reliable log _storage_ and with that properly working, the
output _format_ gets as _reliable_ as possible. Continuation printk()
today is entirely unreliable, not only for the continuation users,
also for everybody else, because we merge randomly with atomic
printk() users.

We have (stupid ) scripts that diff dmesg after
supposed-to-be-harmless kernel changes to automatically spot problems.
You can see all the broken merges of continuation lines, and they also
affect lines which do simple and proper atomic printk()s. This is not
useful, and again it's the broken _logic_ of storage not the _content_
of the message.

> 2: There are _thousands_ of printks without prefix levels.

Right, and nobody is crazy enough to think we want to touch them all.

>   KERN_CONT was always kind of a half-stupid idea.
>   It's only real purpose is to designate message
>   continuations but it never gave enough information to
>   correctly coalesce the messages.

Right, it's the annotation of an exception; not nice, but pretty
well-defined, and can be isolated from the non-exception by software
instead of a human.

> I think you need to give up on the idea that printk
> output can be made to be machine readable.

Nobody talks about that in this very context. It is all about better
_reliability_ of the facility not about _content_.

The structured data, possibly attached to the log message, which
uniquely identifies the kernel device, is surely meant to be machine
readable, and no, I'm absolutely not giving up on that approach.

> If you really want to have users get complete and
> not-interleaved messages you need to buffer partial
> messages,  Whether or not printk message coalescing
> needs to have a cookie or can be done well enough by
> using call tree information as suggested by Andrew
> Morton is tbd.

I'm convinced, that we need something simpler than a new cookie logic.
It can work sure, but I doubt we really wants to change the code flow
of all the stuff.

> Changing printk semantics and trying to stabilize
> printk message content are bad ideas though.

Again, nobody talks about content of here.

But reliable context-aware facilities are almost never a bad idea. We
can certainly make printk() better without changing the whole source
tree.

You might not oppose to the heuristics and the "throw it all in a bag
an let's find out later" strategy, I certainly don't like it and think
it's very broken.

> I think printk_emit is an OK idea.  Having a
> per-subsystem notification mechanism, something
> like an expanded ethtool is certainly useful.
>
> Making the new additional content of printk_emit
> discoverable would be useful.

It is all available through /dev/kmsg,

Kay

2012-05-09 23:02:42

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, May 9, 2012 at 7:37 AM, Kay Sievers <[email protected]> wrote:
>
>> Changing printk semantics and trying to stabilize
>> printk message content are bad ideas though.
>
> Again, nobody talks about content of here.

printk_time=1 does not work anymore with this patch on serial console.

but dmesg still have that time printout.

Yinghai

2012-05-09 23:06:53

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, May 09, 2012 at 04:02:40PM -0700, Yinghai Lu wrote:
> On Wed, May 9, 2012 at 7:37 AM, Kay Sievers <[email protected]> wrote:
> >
> >> Changing printk semantics and trying to stabilize
> >> printk message content are bad ideas though.
> >
> > Again, nobody talks about content of here.
>
> printk_time=1 does not work anymore with this patch on serial console.

Known issue, I think Kay is working on the solution for this right
now...

greg k-h

2012-05-10 00:55:12

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, 2012-05-09 at 15:29 +0200, Kay Sievers wrote:

> Can we somehow store the PID of the printk() thread that has left the
> un-terminated line behind in the buffer, and flush it out when the
> next printk() is from a differnt PID? That could prevent the mangling
> of "atomic" printk()s by continuation users, and merging unrelated
> continuation users together.

How about this? It relaxes the need for KERN_CONT, but it limits
continuation lines to repeated calls of the same thread.

If things race against each other, the lines are separated and not
wrongly mixed with the data from other users.

I would be happy with that, as it protects the "atomic" users of
printk() from getting mixed up with continuation users.

Thanks,
Kay


From: Kay Sievers <[email protected]>
Subject: printk() - do not merge continuation lines of different threads

Signed-off-by: Kay Sievers <[email protected]>---
---

kernel/printk.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1230,12 +1230,13 @@ asmlinkage int vprintk_emit(int facility
static size_t buflen;
static int buflevel;
static char textbuf[LOG_LINE_MAX];
+ static struct task_struct *cont;
char *text = textbuf;
size_t textlen;
unsigned long flags;
int this_cpu;
bool newline = false;
- bool cont = false;
+ bool prefix = false;
int printed_len = 0;

boot_delay_msec();
@@ -1295,20 +1296,17 @@ asmlinkage int vprintk_emit(int facility
case '0' ... '7':
if (level == -1)
level = text[1] - '0';
- text += 3;
- textlen -= 3;
- break;
case 'c': /* KERN_CONT */
- cont = true;
case 'd': /* KERN_DEFAULT */
+ if (text[1] != 'c')
+ prefix = true;
text += 3;
textlen -= 3;
- break;
}
}

- if (buflen && (!cont || dict)) {
- /* no continuation; flush existing buffer */
+ if (buflen && (prefix || dict || cont != current)) {
+ /* flush existing buffer */
log_store(facility, buflevel, NULL, 0, buf, buflen);
printed_len += buflen;
buflen = 0;
@@ -1342,6 +1340,10 @@ asmlinkage int vprintk_emit(int facility
dict, dictlen, text, textlen);
printed_len += textlen;
}
+ cont = NULL;
+ } else {
+ /* remember thread which filled the buffer */
+ cont = current;
}

/*

2012-05-10 01:18:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, May 9, 2012 at 5:54 PM, Kay Sievers <[email protected]> wrote:
>
> How about this? It relaxes the need for KERN_CONT, but it limits
> continuation lines to repeated calls of the same thread.

Fair enough, looks reasonable.

Except your case-statement sucks. Having a "case 'c':" that basically
then immediately has an "if (it was not 'c')" is just disturbing. Just
move the code that is common to all the cases outside the case
statement entirely, and make the 'c' case not do anything at all, and
remove that crazy extraneous if-statement.

Linus

2012-05-10 02:31:18

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, 2012-05-09 at 16:06 -0700, Greg Kroah-Hartmann wrote:
> On Wed, May 09, 2012 at 04:02:40PM -0700, Yinghai Lu wrote:
> > On Wed, May 9, 2012 at 7:37 AM, Kay Sievers <[email protected]> wrote:
> > >
> > >> Changing printk semantics and trying to stabilize
> > >> printk message content are bad ideas though.
> > >
> > > Again, nobody talks about content of here.
> >
> > printk_time=1 does not work anymore with this patch on serial console.
>
> Known issue, I think Kay is working on the solution for this right
> now...

Sure, and this seems to work for me.

Thanks,
Kay


From: Kay Sievers <[email protected]>
Subject: printk() - restore timestamp printing at console output

The output of the timestamps got lost with the conversion of the
kmsg buffer to records; restore the old behavior.

Document, that CONFIG_PRINTK_TIME now only controls the output of
the timestamps in the syslog() system call and on the console, and
not the recording of the timestamps.

Signed-off-by: Kay Sievers <[email protected]>
---

kernel/printk.c | 43 ++++++++++++++++++++++++++-----------------
lib/Kconfig.debug | 16 ++++++++++------
2 files changed, 36 insertions(+), 23 deletions(-)

--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -786,6 +786,22 @@ static bool printk_time;
#endif
module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR);

+static size_t prepend_timestamp(unsigned long long t, char *buf)
+{
+ unsigned long rem_ns;
+
+ if (!printk_time)
+ return 0;
+
+ if (!buf)
+ return 15;
+
+ rem_ns = do_div(t, 1000000000);
+
+ return sprintf(buf, "[%5lu.%06lu] ",
+ (unsigned long) t, rem_ns / 1000);
+}
+
static int syslog_print_line(u32 idx, char *text, size_t size)
{
struct log *msg;
@@ -800,9 +816,7 @@ static int syslog_print_line(u32 idx, ch
len++;
if (msg->level > 99)
len++;
-
- if (printk_time)
- len += 15;
+ len += prepend_timestamp(0, NULL);

len += msg->text_len;
len++;
@@ -810,15 +824,7 @@ static int syslog_print_line(u32 idx, ch
}

len = sprintf(text, "<%u>", msg->level);
-
- if (printk_time) {
- unsigned long long t = msg->ts_nsec;
- unsigned long rem_ns = do_div(t, 1000000000);
-
- len += sprintf(text + len, "[%5lu.%06lu] ",
- (unsigned long) t, rem_ns / 1000);
- }
-
+ len += prepend_timestamp(msg->ts_nsec, text + len);
if (len + msg->text_len > size)
return -EINVAL;
memcpy(text + len, log_text(msg), msg->text_len);
@@ -1741,7 +1747,7 @@ again:
for (;;) {
struct log *msg;
static char text[LOG_LINE_MAX];
- size_t len;
+ size_t len, l;
int level;

raw_spin_lock_irqsave(&logbuf_lock, flags);
@@ -1761,10 +1767,13 @@ again:

msg = log_from_idx(console_idx);
level = msg->level & 7;
- len = msg->text_len;
- if (len+1 >= sizeof(text))
- len = sizeof(text)-1;
- memcpy(text, log_text(msg), len);
+
+ len = prepend_timestamp(msg->ts_nsec, text);
+ l = msg->text_len;
+ if (len + l + 1 >= sizeof(text))
+ l = sizeof(text) - len - 1;
+ memcpy(text + len, log_text(msg), l);
+ len += l;
text[len++] = '\n';

console_idx = log_next(console_idx);
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -3,12 +3,16 @@ config PRINTK_TIME
bool "Show timing information on printks"
depends on PRINTK
help
- Selecting this option causes timing information to be
- included in printk output. This allows you to measure
- the interval between kernel operations, including bootup
- operations. This is useful for identifying long delays
- in kernel startup. Or add printk.time=1 at boot-time.
- See Documentation/kernel-parameters.txt
+ Selecting this option causes time stamps of the printk()
+ messages to be added to the output of the syslog() system
+ call and at the console.
+
+ The timestamp is always recorded internally, and exported
+ to /dev/kmsg. This flag just specifies if the timestamp should
+ be included, not that the timestamp is recorded.
+
+ The behavior is also controlled by the kernel command line
+ parameter printk.time=1. See Documentation/kernel-parameters.txt

config DEFAULT_MESSAGE_LOGLEVEL
int "Default message log level (1-7)"

2012-05-10 02:33:26

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, 2012-05-09 at 18:18 -0700, Linus Torvalds wrote:
> On Wed, May 9, 2012 at 5:54 PM, Kay Sievers <[email protected]> wrote:
> >
> > How about this? It relaxes the need for KERN_CONT, but it limits
> > continuation lines to repeated calls of the same thread.
>
> Fair enough, looks reasonable.
>
> Except your case-statement sucks. Having a "case 'c':" that basically
> then immediately has an "if (it was not 'c')" is just disturbing. Just
> move the code that is common to all the cases outside the case
> statement entirely, and make the 'c' case not do anything at all, and
> remove that crazy extraneous if-statement.


From: Kay Sievers <[email protected]>
Subject: printk() - do not merge continuation lines of different threads

This prevents the merging of printk() continuation lines of different
threads, in the case they race against each other.

It should properly isolate "atomic" single-line printk() users from
continuation users, to make sure the single-line users will never be
merged with the racy continuation ones.

Signed-off-by: Kay Sievers <[email protected]>
---

kernel/printk.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)

--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1230,12 +1230,13 @@ asmlinkage int vprintk_emit(int facility
static size_t buflen;
static int buflevel;
static char textbuf[LOG_LINE_MAX];
+ static struct task_struct *cont;
char *text = textbuf;
size_t textlen;
unsigned long flags;
int this_cpu;
bool newline = false;
- bool cont = false;
+ bool prefix = false;
int printed_len = 0;

boot_delay_msec();
@@ -1295,20 +1296,16 @@ asmlinkage int vprintk_emit(int facility
case '0' ... '7':
if (level == -1)
level = text[1] - '0';
- text += 3;
- textlen -= 3;
- break;
- case 'c': /* KERN_CONT */
- cont = true;
case 'd': /* KERN_DEFAULT */
+ prefix = true;
+ case 'c': /* KERN_CONT */
text += 3;
textlen -= 3;
- break;
}
}

- if (buflen && (!cont || dict)) {
- /* no continuation; flush existing buffer */
+ if (buflen && (prefix || dict || cont != current)) {
+ /* flush existing buffer */
log_store(facility, buflevel, NULL, 0, buf, buflen);
printed_len += buflen;
buflen = 0;
@@ -1342,6 +1339,10 @@ asmlinkage int vprintk_emit(int facility
dict, dictlen, text, textlen);
printed_len += textlen;
}
+ cont = NULL;
+ } else {
+ /* remember thread which filled the buffer */
+ cont = current;
}

/*

2012-05-10 02:46:14

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, 2012-05-10 at 04:32 +0200, Kay Sievers wrote:
> --- a/kernel/printk.c
[]
> @@ -1295,20 +1296,16 @@ asmlinkage int vprintk_emit(int facility
> case '0' ... '7':
> if (level == -1)
> level = text[1] - '0';
> - text += 3;
> - textlen -= 3;
> - break;
> - case 'c': /* KERN_CONT */
> - cont = true;
> case 'd': /* KERN_DEFAULT */
> + prefix = true;
> + case 'c': /* KERN_CONT */

trivia:

Please add /* fallthrough */ comment lines
so it shows the "missing" breaks are intentional.

2012-05-10 16:40:21

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, 2012-05-09 at 18:18 -0700, Linus Torvalds wrote:
> On Wed, May 9, 2012 at 5:54 PM, Kay Sievers <[email protected]> wrote:
> >
> > How about this? It relaxes the need for KERN_CONT, but it limits
> > continuation lines to repeated calls of the same thread.
>
> Fair enough, looks reasonable.

Here is something which might make sense, and could become be the most
reliable and fail-tolerant version so far. It is also the least
restrictive one regarding the input format, and takes the same amount of
resources as the current implementation.

We fully isolate continuation users from non-continuation users. If a
continuation user gets interrupted by a an ordinary non-continuation
user, we will no longer touch the buffer of the continuation user, we
just emit the ordinary message.

When the same thread comes back and continues printing, we still append
to the earlier buffer we stored.

The only case where printk() still gets messed up now, is when multiple
threads use continuation at the same time, which is way less likely than
mixing with ordinary users.

We will also never merge two racing continuation users into one line;
the worst thing that can happen now, is that they end split up into more
than the intended single line.

Note: In this version, the KERN_* prefixes have all no further meaning
anymore, besides that they carry the priority, or prevent they the
content of the line to be parsed for a priority.

All the special rules are gone. KERN_CONT is the same as KERN_DEFAULT
now.

Even continuation users could use prefixes now, if they wanted to. We
should be context-aware enough now, with remembering the owner (task) of
the buffered data, that we might be able to relax all the special rules
regarding the prefixes.

Any ideas about that?

Thanks,
Kay



From: Kay Sievers <[email protected]>
Subject: printk() - fully separate continuation line users from ordinary ones
---

printk.c | 86 ++++++++++++++++++++++++++++++---------------------------------
1 file changed, 41 insertions(+), 45 deletions(-)

--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1232,17 +1232,16 @@ asmlinkage int vprintk_emit(int facility, int level,
const char *fmt, va_list args)
{
static int recursion_bug;
- static char buf[LOG_LINE_MAX];
- static size_t buflen;
- static int buflevel;
+ static char cont_buf[LOG_LINE_MAX];
+ static size_t cont_len;
+ static int cont_level;
+ static struct task_struct *cont_task;
static char textbuf[LOG_LINE_MAX];
- static struct task_struct *cont;
char *text = textbuf;
- size_t textlen;
+ size_t text_len;
unsigned long flags;
int this_cpu;
bool newline = false;
- bool prefix = false;
int printed_len = 0;

boot_delay_msec();
@@ -1288,11 +1287,11 @@ asmlinkage int vprintk_emit(int facility, int level,
* The printf needs to come first; we need the syslog
* prefix which might be passed-in as a parameter.
*/
- textlen = vscnprintf(text, sizeof(textbuf), fmt, args);
+ text_len = vscnprintf(text, sizeof(textbuf), fmt, args);

/* mark and strip a trailing newline */
- if (textlen && text[textlen-1] == '\n') {
- textlen--;
+ if (text_len && text[text_len-1] == '\n') {
+ text_len--;
newline = true;
}

@@ -1303,52 +1302,49 @@ asmlinkage int vprintk_emit(int facility, int level,
if (level == -1)
level = text[1] - '0';
case 'd': /* KERN_DEFAULT */
- prefix = true;
case 'c': /* KERN_CONT */
text += 3;
- textlen -= 3;
+ text_len -= 3;
}
}

- if (buflen && (prefix || dict || cont != current)) {
- /* flush existing buffer */
- log_store(facility, buflevel, NULL, 0, buf, buflen);
- printed_len += buflen;
- buflen = 0;
- }
+ if (level == -1)
+ level = default_message_loglevel;

- if (buflen == 0) {
- /* remember level for first message in the buffer */
- if (level == -1)
- buflevel = default_message_loglevel;
- else
- buflevel = level;
- }
+ if (!newline) {
+ if (cont_len && cont_task != current) {
+ /* flush earlier buffer from different thread */
+ log_store(facility, cont_level, NULL, 0, cont_buf, cont_len);
+ cont_len = 0;
+ }

- if (buflen || !newline) {
- /* append to existing buffer, or buffer until next message */
- if (buflen + textlen > sizeof(buf))
- textlen = sizeof(buf) - buflen;
- memcpy(buf + buflen, text, textlen);
- buflen += textlen;
- }
+ if (!cont_len) {
+ cont_level = level;
+ cont_task = current;
+ }

- if (newline) {
- /* end of line; flush buffer */
- if (buflen) {
- log_store(facility, buflevel,
- dict, dictlen, buf, buflen);
- printed_len += buflen;
- buflen = 0;
+ /* buffer, or append to earlier buffer from same thread */
+ if (cont_len + text_len > sizeof(cont_buf))
+ text_len = sizeof(cont_buf) - cont_len;
+ memcpy(cont_buf + cont_len, text, text_len);
+ cont_len += text_len;
+ } else {
+ if (cont_len && cont_task == current) {
+ /* append to earlier buffer and flush */
+ if (cont_len + text_len > sizeof(cont_buf))
+ text_len = sizeof(cont_buf) - cont_len;
+ memcpy(cont_buf + cont_len, text, text_len);
+ cont_len += text_len;
+ log_store(facility, cont_level,
+ NULL, 0, cont_buf, cont_len);
+ cont_len = 0;
+ cont_task = NULL;
+ printed_len = cont_len;
} else {
- log_store(facility, buflevel,
- dict, dictlen, text, textlen);
- printed_len += textlen;
+ log_store(facility, level,
+ dict, dictlen, text, text_len);
+ printed_len = text_len;
}
- cont = NULL;
- } else {
- /* remember thread which filled the buffer */
- cont = current;
}

/*

2012-05-10 16:47:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 10, 2012 at 9:39 AM, Kay Sievers <[email protected]> wrote:
>
> All the special rules are gone. KERN_CONT is the same as KERN_DEFAULT
> now.

That's wrong.

Key, KERN_CONT and KERN_DEFAULT really are different and have
fundamentally different semantics. The fact that you think they aren't
shows that you don't understand it.

Your "current" check doesn't change anything.

There are two main and important differences:

- people can avoid using '\n' if they know the next printk will have
a KERN_xxx marker (xxx != CONT).

This is often useful for having loops that print out individual
entries all on the same line - print all of them without the '\n'.

The printk afterwards will automatically start a new line if it has
KERN_DEFAULT.

If you make KERN_CONT and KERN_DEFAULT the same, it is a BUG. Don't do it.

- You don't seem to realize that interrupts are threaded events too,
but they will happen with the same task-struct.

An interrupt that prints out with KERN_DEFAULT had better not act
the same way as KERN_CONT.

So dammit, just stop trying to get rid of KERN_CONT or KERN_DEFAULT.
They are both real, and they are both *different*. If you think they
are the same, you are WRONG.

Don't try to change the rules because you think you are "clever".
You're only making things worse.

Linus

2012-05-10 18:49:46

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 10, 2012 at 9:47 AM, Linus Torvalds
<[email protected]> wrote:
> ? This is often useful for having loops that print out individual
> entries all on the same line - print all of them without the '\n'.

Though it is a pain when people do this and the output from each iteration
of the loop gets interleaved with other printk() output. Perhaps it is OK to
do multiple printk() calls in initialization code where it kernel is
mostly single
threaded. But it should be avoided in "oh dear, something bad happened" bits
of code (in case the badness isn't localized and all the other cpus are spitting
out partial messages too).

-Tony

2012-05-10 19:10:22

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, 2012-05-10 at 09:47 -0700, Linus Torvalds wrote:
> On Thu, May 10, 2012 at 9:39 AM, Kay Sievers <[email protected]> wrote:

> An interrupt that prints out with KERN_DEFAULT had better not act
> the same way as KERN_CONT.

Good point, understood, and adapted to the new logic.


Here is a new RFC version. I think it's slightly better than the
currently implemented logic in Greg's tree:

We fully isolate continuation users from non-continuation users. If a
continuation user gets interrupted by an ordinary non-continuation
user, we will not touch the continuation buffer, we just emit the
ordinary message. When the same thread comes back and continues its
printing, we still append to the earlier buffer we stored.

The only case where printk() still gets messed up now, is when multiple
threads (or interrupts) race against each other.

We will also never wrongly merge two racing continuation users into one
line.


Current intended behavior, and general rules:
printk() starting with KERN_* (not KERN_CONT) and ending with '\n' will
*always* logged as single line, *never* be merged with anything else.

printk() not starting with any KERN_* might get merged with an earlier
line buffered by the *same* thread.

printk() not ending with '\n' will be buffered.

Buffered lines will be flushed when a different thread emits a printk()
that needs to be buffered --> race.

Buffered line will be flushed, when the same thread emits a printk()
with a KERN_* (not KERN_CONT) --> newline missing.

Buffered line will be joined, when the same thread emits a printk()
without any KERN_* or with KERN_CONT.

Does that sounds correct? Anything to add or adjust?

Thanks,
Kay

---

printk.c | 97 +++++++++++++++++++++++++++++++++++----------------------------
1 file changed, 54 insertions(+), 43 deletions(-)

--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1232,13 +1232,13 @@ asmlinkage int vprintk_emit(int facility, int level,
const char *fmt, va_list args)
{
static int recursion_bug;
- static char buf[LOG_LINE_MAX];
- static size_t buflen;
- static int buflevel;
+ static char cont_buf[LOG_LINE_MAX];
+ static size_t cont_len;
+ static int cont_level;
+ static struct task_struct *cont_task;
static char textbuf[LOG_LINE_MAX];
- static struct task_struct *cont;
char *text = textbuf;
- size_t textlen;
+ size_t text_len;
unsigned long flags;
int this_cpu;
bool newline = false;
@@ -1288,11 +1288,11 @@ asmlinkage int vprintk_emit(int facility, int level,
* The printf needs to come first; we need the syslog
* prefix which might be passed-in as a parameter.
*/
- textlen = vscnprintf(text, sizeof(textbuf), fmt, args);
+ text_len = vscnprintf(text, sizeof(textbuf), fmt, args);

/* mark and strip a trailing newline */
- if (textlen && text[textlen-1] == '\n') {
- textlen--;
+ if (text_len && text[text_len-1] == '\n') {
+ text_len--;
newline = true;
}

@@ -1306,49 +1306,60 @@ asmlinkage int vprintk_emit(int facility, int level,
prefix = true;
case 'c': /* KERN_CONT */
text += 3;
- textlen -= 3;
+ text_len -= 3;
}
}

- if (buflen && (prefix || dict || cont != current)) {
- /* flush existing buffer */
- log_store(facility, buflevel, NULL, 0, buf, buflen);
- printed_len += buflen;
- buflen = 0;
- }
+ if (level == -1)
+ level = default_message_loglevel;

- if (buflen == 0) {
- /* remember level for first message in the buffer */
- if (level == -1)
- buflevel = default_message_loglevel;
- else
- buflevel = level;
- }
+ if (!newline) {
+ if (cont_len && (prefix || cont_task != current)) {
+ /*
+ * Flush earlier buffer, either from a different
+ * thread, or when we've seen a new prefix.
+ */
+ log_store(facility, cont_level, NULL, 0, cont_buf, cont_len);
+ cont_len = 0;
+ }

- if (buflen || !newline) {
- /* append to existing buffer, or buffer until next message */
- if (buflen + textlen > sizeof(buf))
- textlen = sizeof(buf) - buflen;
- memcpy(buf + buflen, text, textlen);
- buflen += textlen;
- }
+ if (!cont_len) {
+ cont_level = level;
+ cont_task = current;
+ }
+
+ /* buffer/append to earlier buffer from same thread */
+ if (cont_len + text_len > sizeof(cont_buf))
+ text_len = sizeof(cont_buf) - cont_len;
+ memcpy(cont_buf + cont_len, text, text_len);
+ cont_len += text_len;
+ } else {
+ if (cont_len && cont_task == current) {
+ if (prefix) {
+ /*
+ * New prefix in same thread; flush. Either
+ * no earlier newline, or in an interrupt.
+ */
+ log_store(facility, cont_level,
+ NULL, 0, cont_buf, cont_len);
+ cont_len = 0;
+ }

- if (newline) {
- /* end of line; flush buffer */
- if (buflen) {
- log_store(facility, buflevel,
- dict, dictlen, buf, buflen);
- printed_len += buflen;
- buflen = 0;
+ /* append to earlier buffer and flush */
+ if (cont_len + text_len > sizeof(cont_buf))
+ text_len = sizeof(cont_buf) - cont_len;
+ memcpy(cont_buf + cont_len, text, text_len);
+ cont_len += text_len;
+ log_store(facility, cont_level,
+ NULL, 0, cont_buf, cont_len);
+ cont_len = 0;
+ cont_task = NULL;
+ printed_len = cont_len;
} else {
- log_store(facility, buflevel,
- dict, dictlen, text, textlen);
- printed_len += textlen;
+ log_store(facility, level,
+ dict, dictlen, text, text_len);
+ printed_len = text_len;
}
- cont = NULL;
- } else {
- /* remember thread which filled the buffer */
- cont = current;
}

/*

2012-05-10 20:14:19

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 10, 2012 at 09:09:46PM +0200, Kay Sievers wrote:
> We fully isolate continuation users from non-continuation users. If a
> continuation user gets interrupted by an ordinary non-continuation
> user, we will not touch the continuation buffer, we just emit the
> ordinary message. When the same thread comes back and continues its
> printing, we still append to the earlier buffer we stored.

It's not necessarily a matter of "thread comes back", although that
situation can happen too. You can get this situation quite simply if
you have two processes in foreground kernel mode on two different
CPU's sending continuation printk's at the same time.

> We will also never wrongly merge two racing continuation users into one
> line.

I'm not sure how you guarantee this? The only way you *could*
guarantee this is if you used a continuation buffer in the task_struct
for foreground kernel code, and a per-CPU continuation buffer for
interrupt code.

> Buffered line will be joined, when the same thread emits a printk()
> without any KERN_* or with KERN_CONT.

Is there any difference in any of the cases in terms of how printk's
that are prefixed with KERN_CONT versus a printk that does not have
any KERN_* prefix? If so, is there value in keeping KERN_CONT?

- Ted

2012-05-10 20:37:09

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, 2012-05-10 at 16:14 -0400, Ted Ts'o wrote:
> Is there any difference in any of the cases in terms of how printk's
> that are prefixed with KERN_CONT versus a printk that does not have
> any KERN_* prefix? If so, is there value in keeping KERN_CONT?

As far as I know, no.

It is a useful marker to show where prints
are actually continued.

#define KERN_CONT ""

would save a small amount of text.

2012-05-10 20:38:38

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 10, 2012 at 10:14 PM, Ted Ts'o <[email protected]> wrote:
> On Thu, May 10, 2012 at 09:09:46PM +0200, Kay Sievers wrote:
>> We fully isolate continuation users from non-continuation users. If a
>> continuation user gets interrupted by an ordinary non-continuation
>> user, we will not touch the continuation buffer, we just emit the
>> ordinary message. When the same thread comes back and continues its
>> printing, we still append to the earlier buffer we stored.
>
> It's not necessarily a matter of "thread comes back", although that
> situation can happen too.  You can get this situation quite simply if
> you have two processes in foreground kernel mode on two different
> CPU's sending continuation printk's at the same time.

The access to printk is fully serialized. If only one thread does
continuation (needs buffering), it will still own the continuation
buffer. We record the "owner" (pointer to the task) of the data.

>> We will also never wrongly merge two racing continuation users into one
>> line.
>
> I'm not sure how you guarantee this?  The only way you *could*
> guarantee this is if you used a continuation buffer in the task_struct
> for foreground kernel code, and a per-CPU continuation buffer for
> interrupt code.

Yeah adding it to struct task would reliably work. :) But as we record
the "owner" of the continuation buffer, we should be able to flush the
content of the earlier buffer, instead of wrongly merging it. The same
struct task and interrupts could go wrong, that's true.

>> Buffered line will be joined, when the same thread emits a printk()
>> without any KERN_* or with KERN_CONT.
>
> Is there any difference in any of the cases in terms of how printk's
> that are prefixed with KERN_CONT versus a printk that does not have
> any KERN_* prefix?  If so, is there value in keeping KERN_CONT?

Yeah, it is. It will instruct to merge with the buffer, but also tell
not the parse the prefix, in case you want to log stuff like "<7>" as
real data and not as a prefix.

Kay

2012-05-10 20:40:23

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 10, 2012 at 10:37 PM, Joe Perches <[email protected]> wrote:
> On Thu, 2012-05-10 at 16:14 -0400, Ted Ts'o wrote:
>> Is there any difference in any of the cases in terms of how printk's
>> that are prefixed with KERN_CONT versus a printk that does not have
>> any KERN_* prefix?  If so, is there value in keeping KERN_CONT?
>
> As far as I know, no.
>
> It is a useful marker to show where prints
> are actually continued.
>
> #define KERN_CONT ""
>
> would save a small amount of text.

Nah, we can't do that. We need it to tell "here is your non-prefix to
parse, leave the data behind alone".

Kay

2012-05-10 20:46:14

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, 2012-05-10 at 22:39 +0200, Kay Sievers wrote:
> On Thu, May 10, 2012 at 10:37 PM, Joe Perches <[email protected]> wrote:
> > On Thu, 2012-05-10 at 16:14 -0400, Ted Ts'o wrote:
> >> Is there any difference in any of the cases in terms of how printk's
> >> that are prefixed with KERN_CONT versus a printk that does not have
> >> any KERN_* prefix? If so, is there value in keeping KERN_CONT?
> >
> > As far as I know, no.
> >
> > It is a useful marker to show where prints
> > are actually continued.
> >
> > #define KERN_CONT ""
> >
> > would save a small amount of text.
>
> Nah, we can't do that. We need it to tell "here is your non-prefix to
> parse, leave the data behind alone".

That's where I think you're still a bit
uncertain how the _current_ printk system
works. Your _new_ printk system should
have identical behavior. Though if you
manage to use the call tree and current to
coalesce complete messages more correctly,
that'd be great.

2012-05-10 20:53:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 10, 2012 at 1:46 PM, Joe Perches <[email protected]> wrote:
>
> That's where I think you're still a bit
> uncertain how the _current_ printk system
> works.

No, you are. Read my answer from two days ago in this thread.

KERN_CONT is *not* the same as "". Not now, not ever. If you make it
the same, you're broken.

The reason is simple: KERN_CONT "<3>" should print out the string
"<3>". If you make KERN_CONT be "", it will do the wrong thing, and
think that the <3> is a priority marker.

Please people, this is subtle, and current code does things RIGHT. Any
code that changes it to do something else is almost certainly buggy.
The new semantics had better be the same as the old one.

The change to verify that 'current' matches the previous printout is
so far the *only* sane semantic change I've seen in this thread.
Everything else has been pure garbage.

Linus

2012-05-10 21:01:44

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 10, 2012 at 10:46 PM, Joe Perches <[email protected]> wrote:
> On Thu, 2012-05-10 at 22:39 +0200, Kay Sievers wrote:

>> Nah, we can't do that. We need it to tell "here is your non-prefix to
>> parse, leave the data behind alone".
>
> That's where I think you're still a bit
> uncertain how the _current_ printk system
> works.
> Your _new_ printk system should
> have identical behavior.

We must be at least as good as we are, sure.

But the aim is to be *better* not to be *identical*, especially when
things go wrong, and they do go wrong far too often in the current
code. What we have today is really not good enough. We have a lot of
context during logging (like the thread), and we should use it to make
the best out of it _before_ we log the stuff away.

>  Though if you
> manage to use the call tree and current to
> coalesce complete messages more correctly,
> that'd be great.

That's what we try. We just need to get all the details out of the
peoples heads, which are nowhere written down, to make it happen. It's
a bit of a painful process sometimes. :)

The conversion from the "put all bytes in a bag and let's find out
later what happened"-buffer to a real separated record buffer imposed
some changes to the logic, and we need to restore/adapt some useful
rules now, which could't be preserved 1:1. But I'm confident that we
manage to get a better overall picture in the end.

Kay

2012-05-10 21:11:21

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, 2012-05-10 at 13:52 -0700, Linus Torvalds wrote:
> On Thu, May 10, 2012 at 1:46 PM, Joe Perches <[email protected]> wrote:
> >
> > That's where I think you're still a bit
> > uncertain how the _current_ printk system
> > works.
>
> No, you are. Read my answer from two days ago in this thread.
>
> KERN_CONT is *not* the same as "". Not now, not ever. If you make it
> the same, you're broken.
>
> The reason is simple: KERN_CONT "<3>" should print out the string
> "<3>".

I think it's a distinction without a difference for,
as fas as I know, that's a case that doesn't exist
in the current kernel.

$ grep -rP --include=*.[ch] "\bpr_cont\s*\(\s*\"<" *
$ grep -rP --include=*.[ch] "\bprintk\s*\(\s*KERN_CONT\s*\"<" *
arch/x86/kernel/dumpstack_32.c: printk(KERN_CONT "<%02x> ", c);
arch/x86/kernel/dumpstack_64.c: printk(KERN_CONT "<%02x> ", c);
arch/powerpc/kernel/process.c: printk(KERN_CONT "<%08x> ", instr);
drivers/media/video/tm6000/tm6000-core.c: printk(KERN_CONT "<<< ");
drivers/media/video/cx231xx/cx231xx-core.c: printk(KERN_CONT "<<<");

2012-05-10 21:15:35

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 10, 2012 at 11:11 PM, Joe Perches <[email protected]> wrote:
> On Thu, 2012-05-10 at 13:52 -0700, Linus Torvalds wrote:
>> On Thu, May 10, 2012 at 1:46 PM, Joe Perches <[email protected]> wrote:
>> >
>> > That's where I think you're still a bit
>> > uncertain how the _current_ printk system
>> > works.
>>
>> No, you are. Read my answer from two days ago in this thread.
>>
>> KERN_CONT is *not* the same as "". Not now, not ever. If you make it
>> the same, you're broken.
>>
>> The reason is simple: KERN_CONT "<3>" should print out the string
>> "<3>".
>
> I think it's a distinction without a difference for,
> as fas as I know, that's a case that doesn't exist
> in the current kernel.
>
> $ grep -rP --include=*.[ch] "\bpr_cont\s*\(\s*\"<" *
> $ grep -rP --include=*.[ch] "\bprintk\s*\(\s*KERN_CONT\s*\"<" *
> arch/x86/kernel/dumpstack_32.c:                         printk(KERN_CONT "<%02x> ", c);
> arch/x86/kernel/dumpstack_64.c:                         printk(KERN_CONT "<%02x> ", c);
> arch/powerpc/kernel/process.c:                          printk(KERN_CONT "<%08x> ", instr);
> drivers/media/video/tm6000/tm6000-core.c:                       printk(KERN_CONT "<<< ");
> drivers/media/video/cx231xx/cx231xx-core.c:                     printk(KERN_CONT "<<<");

It surely does, and we needed to add KERN_CONT in the past because
stuff got eaten by the prefix parser.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=40c8cefaaf12734327db7199a56e60058d98e7b6
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=13f541c10b30fc6529200d7f9a0073217709622f

Kay

2012-05-10 21:58:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 10, 2012 at 2:11 PM, Joe Perches <[email protected]> wrote:
>
> I think it's a distinction without a difference for,
> as fas as I know, that's a case that doesn't exist
> in the current kernel.

So? Even if we hadn't had this bug before (we have), your argument is
utter crap. Even if we weren't to have that patter right *now*, that
doesn't mean that it cannot happen.

Your grep is also not at all exhaustive. What if somebody does

printk("%s.."

how are you going to prove that '%s' doesn't contain a level marker
thing? It's much easier to just say "use a KERN_CONT" (or more
commonly KERN_DEFAULT) to avoid the problem.

We do need an escape model. The current KERN_CONT does that. Stop
arguing for crap.

Linus

2012-05-11 00:13:22

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, 2012-05-10 at 14:58 -0700, Linus Torvalds wrote:
> On Thu, May 10, 2012 at 2:11 PM, Joe Perches <[email protected]> wrote:
> >
> > I think it's a distinction without a difference for,
> > as fas as I know, that's a case that doesn't exist
> > in the current kernel.
>
> So? Even if we hadn't had this bug before (we have),

True. I've fixed the ones you've introduced when you
changed printk semantics in the past.

> your argument is utter crap.

What happened to the pragmatic Linus Torvalds?
Did the dogmatic one replace him again temporarily?

> Even if we weren't to have that patter right *now*, that
> doesn't mean that it cannot happen.

Not what I said.

> Your grep is also not at all exhaustive.

I believe I know the printk subsystem and its uses
as well as anyone. I don't think there is a case
today.

> We do need an escape model.

Or an agreement to not to emit "<[0-7cd]>"
as the first 3 bytes after a newline or as
the start of a new printk. It could just
as easily be an escape is a required leading
non "<" character.

I think that's not onerous.

> The current KERN_CONT does that. Stop
> arguing for crap.

Correctness in all cases can be a good thing.
Simplicity may be better because complexity
is reduced.

cheers, Joe

2012-05-11 00:39:02

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 10, 2012 at 10:52 PM, Linus Torvalds <[email protected]> wrote:

> The change to verify that 'current' matches the previous printout is
> so far the *only* sane semantic change I've seen in this thread.
> Everything else has been pure garbage.

Ok, let's continue with sane things then. :)

Here is a stupid kernel module which creates 3 racy printk() loops, 2 of
them do continuation, one does an atomic print. The continuation one
prints one char per printk.
http://people.freedesktop.org/~kay/printk-race/printk-race.c

The lines all try to look like:
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
(XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX)

Every loop has it's own uppercase character. The line enclosed in '()'.
'A' is the atomic one, the cont ones are 'C' and 'X'.

Our current kernel look like:
[ 99.056806] XC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 99.059772] XC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 99.062168] XCCX(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 99.066479] CXXC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 99.070063] XC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 99.073081] XCCXCXCXCXCXCXCXCXCXCXCXCXCXCXCCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXC)
[ 99.148325] (CCXCXCX)
[ 99.155468] (XXCXCXCXCCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXCXXCXCXCXCXCXCXCXCXC)
[ 99.277357] (CCXCXCXCX)
http://people.freedesktop.org/~kay/printk-race/printk-out-before.txt


The kernel with the record buffer, and the 'current' who owns the cont buffer
looks like:
[ 78.080866] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 78.082790] (CC
[ 78.084152] (XX
[ 78.084497] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 78.085494] C
[ 78.086125] X
[ 78.087188] C
[ 78.088669] X
[ 78.089803] C
[ 78.091510] X
[ 78.091924] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 78.094112] CC
[ 78.096171] X
http://people.freedesktop.org/~kay/printk-race/printk-out-after.txt

Thanks,
Kay

2012-05-11 01:24:07

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, 2012-05-11 at 02:38 +0200, Kay Sievers wrote:
> On Thu, May 10, 2012 at 10:52 PM, Linus Torvalds <[email protected]> wrote:
>
> > The change to verify that 'current' matches the previous printout is
> > so far the *only* sane semantic change I've seen in this thread.
> > Everything else has been pure garbage.
>
> Ok, let's continue with sane things then. :)
>
> Here is a stupid kernel module which creates 3 racy printk() loops, 2 of
> them do continuation, one does an atomic print. The continuation one
> prints one char per printk.
> http://people.freedesktop.org/~kay/printk-race/printk-race.c
>
> The lines all try to look like:
> (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
> (CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
> (XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX)
>
> Every loop has it's own uppercase character. The line enclosed in '()'.
> 'A' is the atomic one, the cont ones are 'C' and 'X'.

Having only one continuation user racing against an atomic printk user
looks like this:

Current kernel:
[ 54.062900] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 54.068200] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 54.071698] (CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 54.074103] C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 54.076052] C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 54.078116] C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 54.080096] C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 54.082060] CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 54.086121] C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
http://people.freedesktop.org/~kay/printk-race/printk-out-before-2.txt

Record buffer, with cont buffer owner:
[ 33.431129] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 33.435112] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 33.439131] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 33.441820] (CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
[ 33.445292] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 33.450131] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 33.453127] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 33.456084] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 33.459143] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 33.462161] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
[ 33.465165] (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
http://people.freedesktop.org/~kay/printk-race/printk-out-after-2.txt

Kay

2012-05-11 10:35:30

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, May 10, 2012 at 4:30 AM, Kay Sievers <[email protected]> wrote:
> On Wed, 2012-05-09 at 16:06 -0700, Greg Kroah-Hartmann wrote:
>> On Wed, May 09, 2012 at 04:02:40PM -0700, Yinghai Lu wrote:
>> > On Wed, May 9, 2012 at 7:37 AM, Kay Sievers <[email protected]> wrote:
>> > >
>> > >> Changing printk semantics and trying to stabilize
>> > >> printk message content are bad ideas though.
>> > >
>> > > Again, nobody talks about content of here.
>> >
>> > printk_time=1 does not work anymore with this patch on serial console.
>>
>> Known issue, I think Kay is working on the solution for this right
>> now...
>
> Sure, and this seems to work for me.
>
> Thanks,
> Kay
>
>
> From: Kay Sievers <[email protected]>
> Subject: printk() - restore timestamp printing at console output
>
> The output of the timestamps got lost with the conversion of the
> kmsg buffer to records; restore the old behavior.
>
> Document, that CONFIG_PRINTK_TIME now only controls the output of
> the timestamps in the syslog() system call and on the console, and
> not the recording of the timestamps.
>
> Signed-off-by: Kay Sievers <[email protected]>
> ---

This has a small issue with the way timing is printed compared to how
it worked previously.

Consider the following case:

printk("\n foo \n bar \n zoot")

Previously, you would see timing on each line of the print, but now
you see it only on the first.

2012-05-11 15:19:49

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, May 11, 2012 at 12:35:07PM +0200, Sasha Levin wrote:
> On Thu, May 10, 2012 at 4:30 AM, Kay Sievers <[email protected]> wrote:
> > On Wed, 2012-05-09 at 16:06 -0700, Greg Kroah-Hartmann wrote:
> >> On Wed, May 09, 2012 at 04:02:40PM -0700, Yinghai Lu wrote:
> >> > On Wed, May 9, 2012 at 7:37 AM, Kay Sievers <[email protected]> wrote:
> >> > >
> >> > >> Changing printk semantics and trying to stabilize
> >> > >> printk message content are bad ideas though.
> >> > >
> >> > > Again, nobody talks about content of here.
> >> >
> >> > printk_time=1 does not work anymore with this patch on serial console.
> >>
> >> Known issue, I think Kay is working on the solution for this right
> >> now...
> >
> > Sure, and this seems to work for me.
> >
> > Thanks,
> > Kay
> >
> >
> > From: Kay Sievers <[email protected]>
> > Subject: printk() - restore timestamp printing at console output
> >
> > The output of the timestamps got lost with the conversion of the
> > kmsg buffer to records; restore the old behavior.
> >
> > Document, that CONFIG_PRINTK_TIME now only controls the output of
> > the timestamps in the syslog() system call and on the console, and
> > not the recording of the timestamps.
> >
> > Signed-off-by: Kay Sievers <[email protected]>
> > ---
>
> This has a small issue with the way timing is printed compared to how
> it worked previously.
>
> Consider the following case:
>
> printk("\n foo \n bar \n zoot")
>
> Previously, you would see timing on each line of the print, but now
> you see it only on the first.

What code does this? Shouldn't those be individual printk lines? I
didn't think that type of line would ever work properly in the past.

thanks,

greg k-h

2012-05-11 15:22:55

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, May 11, 2012 at 5:19 PM, Greg KH <[email protected]> wrote:
> On Fri, May 11, 2012 at 12:35:07PM +0200, Sasha Levin wrote:
>> This has a small issue with the way timing is printed compared to how
>> it worked previously.
>>
>> Consider the following case:
>>
>> ? ? printk("\n foo \n bar \n zoot")
>>
>> Previously, you would see timing on each line of the print, but now
>> you see it only on the first.
>
> What code does this? ?Shouldn't those be individual printk lines? ?I
> didn't think that type of line would ever work properly in the past.

At least the lockdep code, in several places:

__print_lock_name(safe_class);
printk(");\n");
printk("\n *** DEADLOCK ***\n\n");

Which has looked fine before.

2012-05-11 15:36:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, May 11, 2012 at 8:19 AM, Greg KH <[email protected]> wrote:
>
> What code does this? ?Shouldn't those be individual printk lines? ?I
> didn't think that type of line would ever work properly in the past.

We've *always* supported multiple lines, Greg. And it has always
worked. It's very inconvenient to have to use multiple printk's when
one would do.

That said, I think it would be interesting if the timing values were
just not printed at all when they match the last one, which would
happen with this kind of setup (and also happens if your time source
sucks).

You'd still want the lines to line up, though.

Linus

2012-05-11 15:41:14

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, May 11, 2012 at 5:35 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, May 11, 2012 at 8:19 AM, Greg KH <[email protected]> wrote:
>>
>> What code does this?  Shouldn't those be individual printk lines?  I
>> didn't think that type of line would ever work properly in the past.
>
> We've *always* supported multiple lines, Greg. And it has always
> worked. It's very inconvenient to have to use multiple printk's when
> one would do.
>
> That said, I think it would be interesting if the timing values were
> just not printed at all when they match the last one, which would
> happen with this kind of setup (and also happens if your time source
> sucks).
>
> You'd still want the lines to line up, though.

Should we just pad the multi-newline-in-one-record output with 15
spaces when printk_time is on?

Kay

2012-05-11 15:47:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, May 11, 2012 at 8:40 AM, Kay Sievers <[email protected]> wrote:
>
> Should we just pad the multi-newline-in-one-record output with 15
> spaces when printk_time is on?

Yeah, I think that would be the nicest thing.

Btw, I'd also *love* to see some model (eventually - not necessarily
now) where we could have the option of the time printouts being a bit
different (rather than just on/off). It would be very nice if we had
"relative" time printouts for events close together, so the rule could
be something like:

- if time is same as last line, pad with empty

- if time is more than a minute from last one, show an absolute value
of dd hh:mm:ss

- otherwise, show a relative value of +ss.mmmmmm

So the on/off choice could be on/off/relative.

Hmm?

Linus

2012-05-11 19:51:41

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On 12-05-11 11:47 AM, Linus Torvalds wrote:
> On Fri, May 11, 2012 at 8:40 AM, Kay Sievers <[email protected]> wrote:
>>
>> Should we just pad the multi-newline-in-one-record output with 15
>> spaces when printk_time is on?
>
> Yeah, I think that would be the nicest thing.
>
> Btw, I'd also *love* to see some model (eventually - not necessarily
> now) where we could have the option of the time printouts being a bit
> different (rather than just on/off). It would be very nice if we had
> "relative" time printouts for events close together, so the rule could
> be something like:
>
> - if time is same as last line, pad with empty
>
> - if time is more than a minute from last one, show an absolute value
> of dd hh:mm:ss
>
> - otherwise, show a relative value of +ss.mmmmmm
>
> So the on/off choice could be on/off/relative.
>
> Hmm?

Breaks the "grep for specific messages" case.

2012-05-11 20:02:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, May 11, 2012 at 12:51 PM, Mark Lord <[email protected]> wrote:
>
> Breaks the "grep for specific messages" case.

Why?

Sure, if you grep for times, you'd not want to enable the relative
time. But why would you grep for a time anyway? Regardless, it would
be your choice.

Generally, relative times are much more useful - the absolute time seldom is.

Linus

2012-05-12 07:44:19

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, May 11, 2012 at 5:47 PM, Linus Torvalds
<[email protected]> wrote:
> Btw, I'd also *love* to see some model (eventually - not necessarily
> now) where we could have the option of the time printouts being a bit
> different (rather than just on/off). It would be very nice if we had
> "relative" time printouts for events close together, so the rule could
> be something like:
>
> ?- if time is same as last line, pad with empty
>
> ?- if time is more than a minute from last one, show an absolute value
> of dd hh:mm:ss
>
> ?- otherwise, show a relative value of +ss.mmmmmm
>
> So the on/off choice could be on/off/relative.
>
> Hmm?

So I got something like this:

# sleep 61 ; echo foo > /dev/kmsg ; sleep 61 ; echo bar > /dev/kmsg ;
sleep 5 ; echo zoot > /dev/kmsg ; sleep 10 ; echo foo > /dev/kmsg ;
echo bar > /dev/kmsg ; sleep 61 ; echo zoot > /dev/kmsg ; echo foo >
/dev/kmsg

[ 673.222632] foo
[ 734.315249] bar
[ +5.077527] zoot
[ +10.235225] foo
[ +0.002971] bar
[ 810.883085] zoot
[ +0.003081] foo

If that looks right, I can send a patch once this printf overhaul is
stable, unless Kay already plans on doing it as part of his work.

2012-05-12 18:04:27

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On 12-05-11 04:02 PM, Linus Torvalds wrote:
> On Fri, May 11, 2012 at 12:51 PM, Mark Lord <[email protected]> wrote:
>>
>> Breaks the "grep for specific messages" case.
>
> Why?
>
> Sure, if you grep for times, you'd not want to enable the relative
> time. But why would you grep for a time anyway? Regardless, it would
> be your choice.


I'd be grepping for specific message content, not times.
But I'd want to see the times for the messages,
to know if they're related events or not.

Relative times are great, no problem there.
But relative to what? The previous message?

That's no good unless viewing all messages,
rather than a grep'd subset.

But with a choice, we do get the best of both, so no problemo.

Cheers

2012-05-12 18:35:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Sat, May 12, 2012 at 12:43 AM, Sasha Levin <[email protected]> wrote:
>
> So I got something like this:
>
> # sleep 61 ; echo foo > /dev/kmsg ; sleep 61 ; echo bar > /dev/kmsg ;
> sleep 5 ; echo zoot > /dev/kmsg ; sleep 10 ; echo foo > /dev/kmsg ;
> echo bar > /dev/kmsg ; sleep 61 ; echo zoot > /dev/kmsg ; echo foo >
> /dev/kmsg
>
> [ ?673.222632] foo
> [ ?734.315249] bar
> [ ? +5.077527] zoot
> [ ?+10.235225] foo
> [ ? +0.002971] bar
> [ ?810.883085] zoot
> [ ? +0.003081] foo
>
> If that looks right, I can send a patch once this printf overhaul is
> stable, unless Kay already plans on doing it as part of his work.

So judging by my own use, I really think that

(a) the microseconds are *totally* useless for the absolute time. In
fact, they are worse than useless: they take up space and are actively
distracting.

If the last message was over a minute ago, *nobody* cares about
the microsecond. The microseconds are useful when there are multiple
messages and you see that "oh, those events happened within a couple
of ms of each other", but for the absolute time they are just bad.

Even the *seconds* are probably not that interesting.

(b) The one thing I have often wanted is not "after how many seconds
of boot", but "when". So it would actually be nice if the absolute
time was converted into local time. The kernel can actually do that,
so I suspect that the best format for the relative timestamps would
really be something like

> [May12 11:27] foo
> [May12 11:28] bar
> [ +5.077527] zoot
> [ +10.235225] foo
> [ +0.002971] bar
> [May12 11:29] zoot
> [ +0.003081] foo

because that would be really useful sometimes.

(And no, never mind the year. If you log those things long-term, the
year will be in the full log, so logging the year in the dmesg is just
pointless.)

I dunno. The above is what I would have liked to see quite often. I
look at dmesg, and notice that I had an WARN_ON_ONCE in it, and I have
no idea when it happened, because the "seconds since boot" is totally
useless information for me, as I'm too lazy to try to use /proc/uptime
to calculate what it is.

Linus

2012-05-13 11:08:27

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Sat, May 12, 2012 at 8:35 PM, Linus Torvalds
<[email protected]> wrote:
> On Sat, May 12, 2012 at 12:43 AM, Sasha Levin <[email protected]> wrote:
>>
>> So I got something like this:
>>
>> # sleep 61 ; echo foo > /dev/kmsg ; sleep 61 ; echo bar > /dev/kmsg ;
>> sleep 5 ; echo zoot > /dev/kmsg ; sleep 10 ; echo foo > /dev/kmsg ;
>> echo bar > /dev/kmsg ; sleep 61 ; echo zoot > /dev/kmsg ; echo foo >
>> /dev/kmsg
>>
>> [  673.222632] foo
>> [  734.315249] bar
>> [   +5.077527] zoot
>> [  +10.235225] foo
>> [   +0.002971] bar
>> [  810.883085] zoot
>> [   +0.003081] foo
>>
>> If that looks right, I can send a patch once this printf overhaul is
>> stable, unless Kay already plans on doing it as part of his work.
>
> So judging by my own use, I really think that
>
>  (a) the microseconds are *totally* useless for the absolute time. In
> fact, they are worse than useless: they take up space and are actively
> distracting.
>
>      If the last message was over a minute ago, *nobody* cares about
> the microsecond. The microseconds are useful when there are multiple
> messages and you see that "oh, those events happened within a couple
> of ms of each other", but for the absolute time they are just bad.
>
>     Even the *seconds* are probably not that interesting.
>
>  (b) The one thing I have often wanted is not "after how many seconds
> of boot", but "when". So it would actually be nice if the absolute
> time was converted into local time. The kernel can actually do that,
> so I suspect that the best format for the relative timestamps would
> really be something like
>
>   > [May12 11:27] foo
>   > [May12 11:28] bar
>   > [  +5.077527] zoot
>   > [ +10.235225] foo
>   > [  +0.002971] bar
>   > [May12 11:29] zoot
>   > [  +0.003081] foo
>
> because that would be really useful sometimes.
>
> (And no, never mind the year. If you log those things long-term, the
> year will be in the full log, so logging the year in the dmesg is just
> pointless.)
>
> I dunno. The above is what I would have liked to see quite often. I
> look at dmesg, and notice that I had an WARN_ON_ONCE in it, and I have
> no idea when it happened, because the "seconds since boot" is totally
> useless information for me, as I'm too lazy to try to use /proc/uptime
> to calculate what it is.

Yeah, that looks generally more useful.

"dmesg -T" tries to translate that to wall clock time too. At the
moment, it can't really do that reliable at the point of reading the
timestamps though, because the monotonic time used does not include
the time spent in suspend.

One other problem with the current logic is that the timestamp we get
at the moment is not guaranteed to monotonic, log lines from different
CPUs at the same time can jump forth and back, so the relative-time
calculation could, with the current code, show negative values.

Not sure what options we have to use a more better clock source, if
things are expected to work in NMIs and IRQs.

Kay

2012-05-13 13:22:08

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On 12-05-12 02:35 PM, Linus Torvalds wrote:
>
> (b) The one thing I have often wanted is not "after how many seconds
> of boot", but "when". So it would actually be nice if the absolute
> time was converted into local time. The kernel can actually do that,
> so I suspect that the best format for the relative timestamps would
> really be something like
>
> > [May12 11:27] foo
> > [May12 11:28] bar
> > [ +5.077527] zoot
> > [ +10.235225] foo
> > [ +0.002971] bar
> > [May12 11:29] zoot
> > [ +0.003081] foo
>
> because that would be really useful sometimes.
>
> (And no, never mind the year. If you log those things long-term, the
> year will be in the full log, so logging the year in the dmesg is just
> pointless.)
>
> I dunno. The above is what I would have liked to see quite often. I
> look at dmesg, and notice that I had an WARN_ON_ONCE in it, and I have
> no idea when it happened, because the "seconds since boot" is totally
> useless information for me, as I'm too lazy to try to use /proc/uptime
> to calculate what it is.


Look in syslog ?

2012-05-13 18:01:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Sun, May 13, 2012 at 6:22 AM, Mark Lord <[email protected]> wrote:
>
> Look in syslog ?

That usually requires root permission, so it's an extra (annoying)
step. It also gives all kinds or other mostly irrelevant information
for a kernel person.

Besides, according to the same argument, dmesg shouldn't have times
*at*all*. So the "look in your syslog" is pointless for discussing the
dmesg timestamp.

Linus

2012-05-13 21:31:24

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, 2012-05-11 at 12:35 +0200, Sasha Levin wrote:
> On Thu, May 10, 2012 at 4:30 AM, Kay Sievers <[email protected]> wrote:
> > On Wed, 2012-05-09 at 16:06 -0700, Greg Kroah-Hartmann wrote:
> >> On Wed, May 09, 2012 at 04:02:40PM -0700, Yinghai Lu wrote:
> >> > On Wed, May 9, 2012 at 7:37 AM, Kay Sievers <[email protected]> wrote:
> >> > >
> >> > >> Changing printk semantics and trying to stabilize
> >> > >> printk message content are bad ideas though.
> >> > >
> >> > > Again, nobody talks about content of here.
> >> >
> >> > printk_time=1 does not work anymore with this patch on serial console.
> >>
> >> Known issue, I think Kay is working on the solution for this right
> >> now...
> >
> > Sure, and this seems to work for me.
> >
> > Thanks,
> > Kay
> >
> >
> > From: Kay Sievers <[email protected]>
> > Subject: printk() - restore timestamp printing at console output
> >
> > The output of the timestamps got lost with the conversion of the
> > kmsg buffer to records; restore the old behavior.
> >
> > Document, that CONFIG_PRINTK_TIME now only controls the output of
> > the timestamps in the syslog() system call and on the console, and
> > not the recording of the timestamps.
> >
> > Signed-off-by: Kay Sievers <[email protected]>
> > ---
>
> This has a small issue with the way timing is printed compared to how
> it worked previously.
>
> Consider the following case:
>
> printk("\n foo \n bar \n zoot")
>
> Previously, you would see timing on each line of the print, but now
> you see it only on the first.

Does this fix it for you? It should fully restore the current behavior
for multi-line printing.

Thanks,
Kay



From: Kay Sievers <[email protected]>
Subject: printk() - restore prefix/timestamp printing for multi-newline strings

Calls like:
printk("\n *** DEADLOCK ***\n\n");
will print 3 properly indented, separated, syslog + timestamp prefixed lines in
the log output.


Reported-By: Sasha Levin <[email protected]>
Signed-off-by: Kay Sievers <[email protected]>
---

printk.c | 127 +++++++++++++++++++++++++++++++++++++--------------------------
1 file changed, 76 insertions(+), 51 deletions(-)

diff --git a/kernel/printk.c b/kernel/printk.c
index c42faf9..ee312c7 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -448,7 +448,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,

/* escape non-printable characters */
for (i = 0; i < msg->text_len; i++) {
- char c = log_text(msg)[i];
+ unsigned char c = log_text(msg)[i];

if (c < ' ' || c >= 128)
len += sprintf(user->buf + len, "\\x%02x", c);
@@ -461,7 +461,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
bool line = true;

for (i = 0; i < msg->dict_len; i++) {
- char c = log_dict(msg)[i];
+ unsigned char c = log_dict(msg)[i];

if (line) {
user->buf[len++] = ' ';
@@ -785,56 +791,81 @@ static bool printk_time;
#endif
module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR);

-static size_t prepend_timestamp(unsigned long long t, char *buf)
+static size_t print_prefix(const struct log *msg, bool syslog, char *buf)
{
- unsigned long rem_ns;
+ size_t len = 0;

- if (!printk_time)
- return 0;
+ if (syslog) {
+ if (buf) {
+ len += sprintf(buf, "<%u>", msg->level);
+ } else {
+ len += 3;
+ if (msg->level > 9)
+ len++;
+ if (msg->level > 99)
+ len++;
+ }
+ }

- if (!buf)
- return 15;
+ if (printk_time) {
+ if (buf) {
+ unsigned long long ts = msg->ts_nsec;
+ unsigned long rem_nsec = do_div(ts, 1000000000);

- rem_ns = do_div(t, 1000000000);
+ len += sprintf(buf + len, "[%5lu.%06lu] ",
+ (unsigned long) ts, rem_nsec / 1000);
+ } else {
+ len += 15;
+ }
+ }

- return sprintf(buf, "[%5lu.%06lu] ",
- (unsigned long) t, rem_ns / 1000);
+ return len;
}

-static int syslog_print_line(u32 idx, char *text, size_t size)
+static size_t msg_print_text(const struct log *msg, bool syslog,
+ char *buf, size_t size)
{
- struct log *msg;
- size_t len;
+ const char *text = log_text(msg);
+ size_t text_size = msg->text_len;
+ size_t len = 0;

- msg = log_from_idx(idx);
- if (!text) {
- /* calculate length only */
- len = 3;
+ do {
+ const char *next = memchr(text, '\n', text_size);
+ size_t text_len;

- if (msg->level > 9)
- len++;
- if (msg->level > 99)
- len++;
- len += prepend_timestamp(0, NULL);
+ if (next) {
+ text_len = next - text;
+ next++;
+ text_size -= next - text;
+ } else {
+ text_len = text_size;
+ }

- len += msg->text_len;
- len++;
- return len;
- }
+ if (buf) {
+ if (print_prefix(msg, syslog, NULL) +
+ text_len + 1>= size - len)
+ break;
+
+ len += print_prefix(msg, syslog, buf + len);
+ memcpy(buf + len, text, text_len);
+ len += text_len;
+ buf[len++] = '\n';
+ } else {
+ /* SYSLOG_ACTION_* buffer size only calculation */
+ len += print_prefix(msg, syslog, NULL);
+ len += text_len + 1;
+ }
+
+ text = next;
+ } while (text);

- len = sprintf(text, "<%u>", msg->level);
- len += prepend_timestamp(msg->ts_nsec, text + len);
- if (len + msg->text_len > size)
- return -EINVAL;
- memcpy(text + len, log_text(msg), msg->text_len);
- len += msg->text_len;
- text[len++] = '\n';
return len;
}

static int syslog_print(char __user *buf, int size)
{
char *text;
+ struct log *msg;
int len;

text = kmalloc(LOG_LINE_MAX, GFP_KERNEL);
@@ -847,7 +878,8 @@ static int syslog_print(char __user *buf, int size)
syslog_seq = log_first_seq;
syslog_idx = log_first_idx;
}
- len = syslog_print_line(syslog_idx, text, LOG_LINE_MAX);
+ msg = log_from_idx(syslog_idx);
+ len = msg_print_text(msg, true, text, LOG_LINE_MAX);
syslog_idx = log_next(syslog_idx);
syslog_seq++;
raw_spin_unlock_irq(&logbuf_lock);
@@ -887,14 +919,18 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
seq = clear_seq;
idx = clear_idx;
while (seq < log_next_seq) {
- len += syslog_print_line(idx, NULL, 0);
+ struct log *msg = log_from_idx(idx);
+
+ len += msg_print_text(msg, true, NULL, 0);
idx = log_next(idx);
seq++;
}
seq = clear_seq;
idx = clear_idx;
while (len > size && seq < log_next_seq) {
- len -= syslog_print_line(idx, NULL, 0);
+ struct log *msg = log_from_idx(idx);
+
+ len -= msg_print_text(msg, true, NULL, 0);
idx = log_next(idx);
seq++;
}
@@ -904,9 +940,10 @@ static int syslog_print_all(char __user *buf, int size, bool clear)

len = 0;
while (len >= 0 && seq < next_seq) {
+ struct log *msg = log_from_idx(idx);
int textlen;

- textlen = syslog_print_line(idx, text, LOG_LINE_MAX);
+ textlen = msg_print_text(msg, true, text, LOG_LINE_MAX);
if (textlen < 0) {
len = textlen;
break;
@@ -1044,7 +1081,9 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
seq = syslog_seq;
idx = syslog_idx;
while (seq < log_next_seq) {
- error += syslog_print_line(idx, NULL, 0);
+ struct log *msg = log_from_idx(idx);
+
+ error += msg_print_text(msg, true, NULL, 0);
idx = log_next(idx);
seq++;
}
@@ -1439,10 +1478,8 @@ static struct log *log_from_idx(u32 idx) { return NULL; }
static u32 log_next(u32 idx) { return 0; }
static char *log_text(const struct log *msg) { return NULL; }
static void call_console_drivers(int level, const char *text, size_t len) {}
-static size_t prepend_timestamp(unsigned long long t, char *buf)
-{
- return 0;
-}
+static size_t msg_print_text(const struct log *msg, bool syslog,
+ char *buf, size_t size) { return 0; }

#endif /* CONFIG_PRINTK */

@@ -1750,7 +1787,7 @@ again:
for (;;) {
struct log *msg;
static char text[LOG_LINE_MAX];
- size_t len, l;
+ size_t len;
int level;

raw_spin_lock_irqsave(&logbuf_lock, flags);
@@ -1771,13 +1808,7 @@ again:
msg = log_from_idx(console_idx);
level = msg->level & 7;

- len = prepend_timestamp(msg->ts_nsec, text);
- l = msg->text_len;
- if (len + l + 1 >= sizeof(text))
- l = sizeof(text) - len - 1;
- memcpy(text + len, log_text(msg), l);
- len += l;
- text[len++] = '\n';
+ len = msg_print_text(msg, false, text, sizeof(text));

console_idx = log_next(console_idx);
console_seq++;

2012-05-13 21:48:41

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Sat, May 12, 2012 at 9:43 AM, Sasha Levin <[email protected]> wrote:
> On Fri, May 11, 2012 at 5:47 PM, Linus Torvalds
> <[email protected]> wrote:
>> Btw, I'd also *love* to see some model (eventually - not necessarily
>> now) where we could have the option of the time printouts being a bit
>> different (rather than just on/off). It would be very nice if we had
>> "relative" time printouts for events close together, so the rule could
>> be something like:
>>
>>  - if time is same as last line, pad with empty
>>
>>  - if time is more than a minute from last one, show an absolute value
>> of dd hh:mm:ss
>>
>>  - otherwise, show a relative value of +ss.mmmmmm
>>
>> So the on/off choice could be on/off/relative.
>>
>> Hmm?
>
> So I got something like this:
>
> # sleep 61 ; echo foo > /dev/kmsg ; sleep 61 ; echo bar > /dev/kmsg ;
> sleep 5 ; echo zoot > /dev/kmsg ; sleep 10 ; echo foo > /dev/kmsg ;
> echo bar > /dev/kmsg ; sleep 61 ; echo zoot > /dev/kmsg ; echo foo >
> /dev/kmsg
>
> [  673.222632] foo
> [  734.315249] bar
> [   +5.077527] zoot
> [  +10.235225] foo
> [   +0.002971] bar
> [  810.883085] zoot
> [   +0.003081] foo
>
> If that looks right,
> I can send a patch once this printf overhaul is
> stable

I just sent out the fix to restore the original multi-line behaviour,
I have nothing else queued in that area at the moment.

> unless Kay already plans on doing it as part of his work.

I'm busy making the prink() output look the same as it was before but
without the wrong continuation merges. New features need to wait a
little bit, so go ahead. :)

But I want to express, that I'm very interested in finding out if we
could switch to a 'better' clock source as we have today, which can
safely be translated to wall clock time not leaving the suspend time
out, and not jumping forth and back in time between logs from
different CPUs. I think that would be very useful for your patch too.

Kay

2012-05-13 22:19:30

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On 12-05-13 02:01 PM, Linus Torvalds wrote:
> On Sun, May 13, 2012 at 6:22 AM, Mark Lord <[email protected]> wrote:
>>
>> Look in syslog ?
>
> That usually requires root permission, so it's an extra (annoying)
> step. It also gives all kinds or other mostly irrelevant information
> for a kernel person.
>
> Besides, according to the same argument, dmesg shouldn't have times
> *at*all*. So the "look in your syslog" is pointless for discussing the
> dmesg timestamp.

It's good for dates. Not as good for times.
I figure anyone trying to diagnose a kernel issue
already has access to /var/log/ as well. :)

But it's good as you propose with an option.

Cheers

2012-05-14 16:40:38

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Sun, 13 May 2012 09:22:04 -0400, Mark Lord said:
> On 12-05-12 02:35 PM, Linus Torvalds wrote:
> > I dunno. The above is what I would have liked to see quite often. I
> > look at dmesg, and notice that I had an WARN_ON_ONCE in it, and I have
> > no idea when it happened, because the "seconds since boot" is totally
> > useless information for me, as I'm too lazy to try to use /proc/uptime
> > to calculate what it is.

> Look in syslog ?

Oddly enough, just yesterday I was looking in dmesg on one of my servers, trying to
find a hint of why syslog and a number of other things had failed to start at boot...



Attachments:
(No filename) (865.00 B)

2012-05-14 18:47:06

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, 2012-05-11 at 03:23 +0200, Kay Sievers wrote:
> On Fri, 2012-05-11 at 02:38 +0200, Kay Sievers wrote:
> > On Thu, May 10, 2012 at 10:52 PM, Linus Torvalds <[email protected]> wrote:
> >
> > > The change to verify that 'current' matches the previous printout is
> > > so far the *only* sane semantic change I've seen in this thread.
> > > Everything else has been pure garbage.
> >
> > Ok, let's continue with sane things then. :)
> >
> > Here is a stupid kernel module which creates 3 racy printk() loops, 2 of
> > them do continuation, one does an atomic print. The continuation one
> > prints one char per printk.
> > http://people.freedesktop.org/~kay/printk-race/printk-race.c
> >
> > The lines all try to look like:
> > (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
> > (CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
> > (XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX)
> >
> > Every loop has it's own uppercase character. The line enclosed in '()'.
> > 'A' is the atomic one, the cont ones are 'C' and 'X'.
>
> Having only one continuation user racing against an atomic printk user
> looks like this:

This seems to work significantly better for me than all earlier
versions. The single line users like dev_printk() should be reliable
now.

Thanks,
Kay


From: Kay Sievers <[email protected]>
Subject: printk() - isolate KERN_CONT users from ordinary complete lines

Arrange the continuation printk() buffering to be fully separated from the
ordinary full line users.

Limit the exposure to races and wrong printk() line merges to users of
continuation only. Ordinary full line users racing against continuation
users will no longer affect each other.

Multiple continuation users from different threads, racing against each
other will not wrongly be merged into a single line, but printed as
separate lines.

Test output of a kernel module which starts two separate threads which
race against each other, one of them printing a single full terminated
line:
printk("(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)\n");

The other one printing the line, every character separate in a
continuation loop:
printk("(C");
for (i = 0; i < 58; i++)
printk(KERN_CONT "C");
printk(KERN_CONT "C)\n");

Behavior of single and non-thread-aware printk() buffer:
# modprobe printk-race
printk test init
(CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
(CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)

New behavior with separate and thread-aware continuation buffer:
# modprobe printk-race
printk test init
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
(CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
(CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)

Signed-off-by: Kay Sievers <[email protected]>
---

kernel/printk.c | 113 ++++++++++++++++++++++++++++++++------------------------
1 file changed, 65 insertions(+), 48 deletions(-)

--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1264,13 +1264,13 @@ asmlinkage int vprintk_emit(int facility
const char *fmt, va_list args)
{
static int recursion_bug;
- static char buf[LOG_LINE_MAX];
- static size_t buflen;
- static int buflevel;
+ static char cont_buf[LOG_LINE_MAX];
+ static size_t cont_len;
+ static int cont_level;
+ static struct task_struct *cont_task;
static char textbuf[LOG_LINE_MAX];
- static struct task_struct *cont;
char *text = textbuf;
- size_t textlen;
+ size_t text_len;
unsigned long flags;
int this_cpu;
bool newline = false;
@@ -1320,15 +1320,15 @@ asmlinkage int vprintk_emit(int facility
* The printf needs to come first; we need the syslog
* prefix which might be passed-in as a parameter.
*/
- textlen = vscnprintf(text, sizeof(textbuf), fmt, args);
+ text_len = vscnprintf(text, sizeof(textbuf), fmt, args);

/* mark and strip a trailing newline */
- if (textlen && text[textlen-1] == '\n') {
- textlen--;
+ if (text_len && text[text_len-1] == '\n') {
+ text_len--;
newline = true;
}

- /* strip syslog prefix and extract log level or flags */
+ /* strip syslog prefix and extract log level or control flags */
if (text[0] == '<' && text[1] && text[2] == '>') {
switch (text[1]) {
case '0' ... '7':
@@ -1338,49 +1338,67 @@ asmlinkage int vprintk_emit(int facility
prefix = true;
case 'c': /* KERN_CONT */
text += 3;
- textlen -= 3;
+ text_len -= 3;
}
}

- if (buflen && (prefix || dict || cont != current)) {
- /* flush existing buffer */
- log_store(facility, buflevel, NULL, 0, buf, buflen);
- printed_len += buflen;
- buflen = 0;
- }
-
- if (buflen == 0) {
- /* remember level for first message in the buffer */
- if (level == -1)
- buflevel = default_message_loglevel;
- else
- buflevel = level;
- }
-
- if (buflen || !newline) {
- /* append to existing buffer, or buffer until next message */
- if (buflen + textlen > sizeof(buf))
- textlen = sizeof(buf) - buflen;
- memcpy(buf + buflen, text, textlen);
- buflen += textlen;
- }
-
- if (newline) {
- /* end of line; flush buffer */
- if (buflen) {
- log_store(facility, buflevel,
- dict, dictlen, buf, buflen);
- printed_len += buflen;
- buflen = 0;
- } else {
- log_store(facility, buflevel,
- dict, dictlen, text, textlen);
- printed_len += textlen;
+ if (level == -1)
+ level = default_message_loglevel;
+
+ if (dict) {
+ prefix = true;
+ newline = true;
+ }
+
+ if (!newline) {
+ if (cont_len && (prefix || cont_task != current)) {
+ /*
+ * Flush earlier buffer, which is either from a
+ * different thread, or when we got a new prefix.
+ */
+ log_store(facility, cont_level, NULL, 0, cont_buf, cont_len);
+ cont_len = 0;
}
- cont = NULL;
+
+ if (!cont_len) {
+ cont_level = level;
+ cont_task = current;
+ }
+
+ /* buffer or append to earlier buffer from the same thread */
+ if (cont_len + text_len > sizeof(cont_buf))
+ text_len = sizeof(cont_buf) - cont_len;
+ memcpy(cont_buf + cont_len, text, text_len);
+ cont_len += text_len;
} else {
- /* remember thread which filled the buffer */
- cont = current;
+ if (cont_len && cont_task == current) {
+ if (prefix) {
+ /*
+ * New prefix from the same thread; flush. We
+ * either got no earlier newline, or we race
+ * with an interrupt.
+ */
+ log_store(facility, cont_level,
+ NULL, 0, cont_buf, cont_len);
+ cont_len = 0;
+ }
+
+ /* append to the earlier buffer and flush */
+ if (cont_len + text_len > sizeof(cont_buf))
+ text_len = sizeof(cont_buf) - cont_len;
+ memcpy(cont_buf + cont_len, text, text_len);
+ cont_len += text_len;
+ log_store(facility, cont_level,
+ NULL, 0, cont_buf, cont_len);
+ cont_len = 0;
+ cont_task = NULL;
+ printed_len = cont_len;
+ } else {
+ /* ordinary single and terminated line */
+ log_store(facility, level,
+ dict, dictlen, text, text_len);
+ printed_len = text_len;
+ }
}

/*
@@ -1470,7 +1488,6 @@ EXPORT_SYMBOL(printk);
#define LOG_LINE_MAX 0
static struct log *log_from_idx(u32 idx) { return NULL; }
static u32 log_next(u32 idx) { return 0; }
-static char *log_text(const struct log *msg) { return NULL; }
static void call_console_drivers(int level, const char *text, size_t len) {}
static size_t msg_print_text(const struct log *msg, bool syslog,
char *buf, size_t size) { return 0; }

2012-05-17 03:44:49

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On 05/12/2012 11:35 AM, Linus Torvalds wrote:
>
> (b) The one thing I have often wanted is not "after how many seconds
> of boot", but "when". So it would actually be nice if the absolute
> time was converted into local time. The kernel can actually do that,
> so I suspect that the best format for the relative timestamps would
> really be something like
>
> > [May12 11:27] foo
> > [May12 11:28] bar
> > [ +5.077527] zoot
> > [ +10.235225] foo
> > [ +0.002971] bar
> > [May12 11:29] zoot
> > [ +0.003081] foo
>
> because that would be really useful sometimes.
>
> (And no, never mind the year. If you log those things long-term, the
> year will be in the full log, so logging the year in the dmesg is just
> pointless.)

I would like to give a +1 on absolute time, and -1 on local timezone and
a -$LARGE_NUMBER on low resolution.

Why? Getting events accurately timed across multiple hosts can be
really useful when trying to reconstruct, ahem, "network events".

Getting UTC time at some fairly high resolution is useful for that. I
would say the useful resolution is probably milliseconds.

-hpa

2012-05-26 11:13:19

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

Hello Kay,

On Thu, May 03, 2012 at 02:29:13AM +0200, Kay Sievers wrote:
> From: Kay Sievers <[email protected]>
> Subject: printk: convert byte-buffer to variable-length record buffer

Bad news, this patch breaks all kmsg_dump users.

Per current printk.c:

* The 'struct log' buffer header must never be directly exported to
* userspace, it is a kernel-private implementation detail that might
* need to be changed in the future, when the requirements change.

But kmsg_dump users get the raw log, not text. Some users may even
use strlen() on the buffers, e.g. ramoops module was using it just
recently, before persistent_ram rework.

So, starting from this patch every kmsg_dump user is getting the
binary stuff, which is barely usable, and breaks a longstanding
user-visible ABIs (e.g. mtdoops, oops dumping into nvram on PowerPC,
and ramoops of course).

I'm not sure how we can fix this without allocating a huge buffer
(and thus wasting memory), or reworking the whole kmsg_dump concept
and its users.

Ideas?

p.s. And, btw, during HW bring-up, when there's even no serial
console usable, I somewhat liked to directly print log_buf memory
from a JTAG debugger, and thus see the kernel logs. Now that'll
look ugly. Heh. :-)

--
Anton Vorontsov
Email: [email protected]

2012-05-27 14:23:39

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Sat, May 26, 2012 at 1:11 PM, Anton Vorontsov
<[email protected]> wrote:

>> Subject: printk: convert byte-buffer to variable-length record buffer
>
> Bad news, this patch breaks all kmsg_dump users.
>
> Per current printk.c:
>
>  * The 'struct log' buffer header must never be directly exported to
>  * userspace, it is a kernel-private implementation detail that might
>  * need to be changed in the future, when the requirements change.
>
> But kmsg_dump users get the raw log, not text. Some users may even
> use strlen() on the buffers, e.g. ramoops module was using it just
> recently, before persistent_ram rework.
>
> So, starting from this patch every kmsg_dump user is getting the
> binary stuff, which is barely usable, and breaks a longstanding
> user-visible ABIs (e.g. mtdoops, oops dumping into nvram on PowerPC,
> and ramoops of course).

These cases export their own format, it's just that the dump routines
could copy the buffer in only two steps (beginning and end of the ring
buffer), while in the future their code would need an update to copy
it record for record now. The exported format in the dump does not
necessarily need to change.

It should be possible to convert them pretty easily to do that, they
operate on their own allocated buffers already. Some of them seem to
mangle the line-buffer already, so they should be able to copy
record-by-record either.

> I'm not sure how we can fix this without allocating a huge buffer
> (and thus wasting memory), or reworking the whole kmsg_dump concept
> and its users.

I think an iterator in the dumper code is nicer and simpler. It will
even provide structured data to the dumper, if the dumper wants to
make use of it in the future. The prefix mangling of the text in the
nvram dumper could surely benefit from that.

The iterator would basically work similar to the printing of the whole
buffer at a newly registered boot console, which was doing a similar
logic as the dumper does.

> p.s. And, btw, during HW bring-up, when there's even no serial
> console usable, I somewhat liked to directly print log_buf memory
> from a JTAG debugger, and thus see the kernel logs. Now that'll
> look ugly. Heh. :-)

Yeah, that probably does not look pretty anymore, but it still all
works, right? Doesn't have he debugger a strings(1)-like output to
print of a memory area?

Kay

2012-05-29 16:07:49

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Sun, 2012-05-27 at 16:23 +0200, Kay Sievers wrote:
> On Sat, May 26, 2012 at 1:11 PM, Anton Vorontsov <[email protected]> wrote:

> > So, starting from this patch every kmsg_dump user is getting the
> > binary stuff, which is barely usable, and breaks a longstanding
> > user-visible ABIs (e.g. mtdoops, oops dumping into nvram on PowerPC,
> > and ramoops of course).
>
> These cases export their own format, it's just that the dump routines
> could copy the buffer in only two steps (beginning and end of the ring
> buffer), while in the future their code would need an update to copy
> it record for record now. The exported format in the dump does not
> necessarily need to change.

Here is an untested patch, which should restore the old plain text
export format of the kmsg buffer. I adds an iterator to the dumper
routine, which can be used by the dumper to properly copy the data out
of the kmsg record buffer. All current users are converted in the same
patch.

Any feedback/help in testing/updating and getting this thing into a
mergeable state would be greatly appreciated.

There are two versions of the iterator:
kmsg_dump_get_line() starts with the oldest records. With the next call,
the iterator retrieves one line after the other. Useful for cases like
printing to a console or any other stream device.

kmsg_dump_get_buffer() fills a buffer with as many of the youngest
records as fit into it. With the next call, the iterator retrieves the
next block of older messages. Useful to dump the entire buffer with one
shot, or chunk the data into blocks and store them, with the youngest
messages being the most important, retrieved first.

Unlike the old logic, there are no stale pointers handed out to the
dumper, if case things get overwritten while a dump is in progress,
lines are not truncated or overwritten.

Thanks,
Kay

---
arch/powerpc/platforms/pseries/nvram.c | 60 +---------
arch/x86/platform/mrst/early_printk_mrst.c | 13 --
drivers/mtd/mtdoops.c | 22 ---
fs/pstore/platform.c | 34 ++---
include/linux/kmsg_dump.h | 39 +++++-
kernel/printk.c | 173 ++++++++++++++++++++++++-----
6 files changed, 204 insertions(+), 137 deletions(-)

--- a/arch/powerpc/platforms/pseries/nvram.c
+++ b/arch/powerpc/platforms/pseries/nvram.c
@@ -68,9 +68,7 @@ static const char *pseries_nvram_os_part
};

static void oops_to_nvram(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *old_msgs, unsigned long old_len,
- const char *new_msgs, unsigned long new_len);
+ enum kmsg_dump_reason reason);

static struct kmsg_dumper nvram_kmsg_dumper = {
.dump = oops_to_nvram
@@ -504,28 +502,6 @@ int __init pSeries_nvram_init(void)
}

/*
- * Try to capture the last capture_len bytes of the printk buffer. Return
- * the amount actually captured.
- */
-static size_t capture_last_msgs(const char *old_msgs, size_t old_len,
- const char *new_msgs, size_t new_len,
- char *captured, size_t capture_len)
-{
- if (new_len >= capture_len) {
- memcpy(captured, new_msgs + (new_len - capture_len),
- capture_len);
- return capture_len;
- } else {
- /* Grab the end of old_msgs. */
- size_t old_tail_len = min(old_len, capture_len - new_len);
- memcpy(captured, old_msgs + (old_len - old_tail_len),
- old_tail_len);
- memcpy(captured + old_tail_len, new_msgs, new_len);
- return old_tail_len + new_len;
- }
-}
-
-/*
* Are we using the ibm,rtas-log for oops/panic reports? And if so,
* would logging this oops/panic overwrite an RTAS event that rtas_errd
* hasn't had a chance to read and process? Return 1 if so, else 0.
@@ -541,27 +517,6 @@ static int clobbering_unread_rtas_event(
NVRAM_RTAS_READ_TIMEOUT);
}

-/* Squeeze out each line's <n> severity prefix. */
-static size_t elide_severities(char *buf, size_t len)
-{
- char *in, *out, *buf_end = buf + len;
- /* Assume a <n> at the very beginning marks the start of a line. */
- int newline = 1;
-
- in = out = buf;
- while (in < buf_end) {
- if (newline && in+3 <= buf_end &&
- *in == '<' && isdigit(in[1]) && in[2] == '>') {
- in += 3;
- newline = 0;
- } else {
- newline = (*in == '\n');
- *out++ = *in++;
- }
- }
- return out - buf;
-}
-
/* Derived from logfs_compress() */
static int nvram_compress(const void *in, void *out, size_t inlen,
size_t outlen)
@@ -619,9 +574,7 @@ static int zip_oops(size_t text_len)
* partition. If that's too much, go back and capture uncompressed text.
*/
static void oops_to_nvram(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *old_msgs, unsigned long old_len,
- const char *new_msgs, unsigned long new_len)
+ enum kmsg_dump_reason reason)
{
static unsigned int oops_count = 0;
static bool panicking = false;
@@ -660,14 +613,13 @@ static void oops_to_nvram(struct kmsg_du
return;

if (big_oops_buf) {
- text_len = capture_last_msgs(old_msgs, old_len,
- new_msgs, new_len, big_oops_buf, big_oops_buf_sz);
- text_len = elide_severities(big_oops_buf, text_len);
+ kmsg_dump_get_buffer(dumper, false,
+ big_oops_buf, big_oops_buf_sz, &text_len);
rc = zip_oops(text_len);
}
if (rc != 0) {
- text_len = capture_last_msgs(old_msgs, old_len,
- new_msgs, new_len, oops_data, oops_data_sz);
+ kmsg_dump_get_buffer(dumper, true,
+ oops_data, oops_data_sz, &text_len);
err_type = ERR_TYPE_KERNEL_PANIC;
*oops_len = (u16) text_len;
}
--- a/arch/x86/platform/mrst/early_printk_mrst.c
+++ b/arch/x86/platform/mrst/early_printk_mrst.c
@@ -110,19 +110,16 @@ static struct kmsg_dumper dw_dumper;
static int dumper_registered;

static void dw_kmsg_dump(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *s1, unsigned long l1,
- const char *s2, unsigned long l2)
+ enum kmsg_dump_reason reason)
{
- int i;
+ static char line[1024];
+ size_t len;

/* When run to this, we'd better re-init the HW */
mrst_early_console_init();

- for (i = 0; i < l1; i++)
- early_mrst_console.write(&early_mrst_console, s1 + i, 1);
- for (i = 0; i < l2; i++)
- early_mrst_console.write(&early_mrst_console, s2 + i, 1);
+ while (kmsg_dump_get_line(dumper, true, line, sizeof(line), &len))
+ early_mrst_console.write(&early_mrst_console, line, len);
}

/* Set the ratio rate to 115200, 8n1, IRQ disabled */
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -304,32 +304,17 @@ static void find_next_position(struct mt
}

static void mtdoops_do_dump(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason, const char *s1, unsigned long l1,
- const char *s2, unsigned long l2)
+ enum kmsg_dump_reason reason)
{
struct mtdoops_context *cxt = container_of(dumper,
struct mtdoops_context, dump);
- unsigned long s1_start, s2_start;
- unsigned long l1_cpy, l2_cpy;
- char *dst;
-
- if (reason != KMSG_DUMP_OOPS &&
- reason != KMSG_DUMP_PANIC)
- return;

/* Only dump oopses if dump_oops is set */
if (reason == KMSG_DUMP_OOPS && !dump_oops)
return;

- dst = cxt->oops_buf + MTDOOPS_HEADER_SIZE; /* Skip the header */
- l2_cpy = min(l2, record_size - MTDOOPS_HEADER_SIZE);
- l1_cpy = min(l1, record_size - MTDOOPS_HEADER_SIZE - l2_cpy);
-
- s2_start = l2 - l2_cpy;
- s1_start = l1 - l1_cpy;
-
- memcpy(dst, s1 + s1_start, l1_cpy);
- memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
+ kmsg_dump_get_buffer(dumper, true, cxt->oops_buf + MTDOOPS_HEADER_SIZE,
+ record_size - MTDOOPS_HEADER_SIZE, NULL);

/* Panics must be written immediately */
if (reason != KMSG_DUMP_OOPS)
@@ -375,6 +360,7 @@ static void mtdoops_notify_add(struct mt
return;
}

+ cxt->dump.max_reason = KMSG_DUMP_OOPS;
cxt->dump.dump = mtdoops_do_dump;
err = kmsg_dump_register(&cxt->dump);
if (err) {
--- a/fs/pstore/platform.c
+++ b/fs/pstore/platform.c
@@ -94,20 +94,15 @@ static const char *get_reason_str(enum k
* as we can from the end of the buffer.
*/
static void pstore_dump(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *s1, unsigned long l1,
- const char *s2, unsigned long l2)
+ enum kmsg_dump_reason reason)
{
- unsigned long s1_start, s2_start;
- unsigned long l1_cpy, l2_cpy;
- unsigned long size, total = 0;
- char *dst;
+ unsigned long total = 0;
const char *why;
u64 id;
- int hsize, ret;
unsigned int part = 1;
unsigned long flags = 0;
int is_locked = 0;
+ int ret;

why = get_reason_str(reason);

@@ -119,30 +114,25 @@ static void pstore_dump(struct kmsg_dump
spin_lock_irqsave(&psinfo->buf_lock, flags);
oopscount++;
while (total < kmsg_bytes) {
+ char *dst;
+ unsigned long size;
+ int hsize;
+ size_t len;
+
dst = psinfo->buf;
hsize = sprintf(dst, "%s#%d Part%d\n", why, oopscount, part);
size = psinfo->bufsize - hsize;
dst += hsize;

- l2_cpy = min(l2, size);
- l1_cpy = min(l1, size - l2_cpy);
-
- if (l1_cpy + l2_cpy == 0)
+ if (!kmsg_dump_get_buffer(dumper, true, dst, size, &len))
break;

- s2_start = l2 - l2_cpy;
- s1_start = l1 - l1_cpy;
-
- memcpy(dst, s1 + s1_start, l1_cpy);
- memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
-
ret = psinfo->write(PSTORE_TYPE_DMESG, reason, &id, part,
- hsize + l1_cpy + l2_cpy, psinfo);
+ hsize + len, psinfo);
if (ret == 0 && reason == KMSG_DUMP_OOPS && pstore_is_mounted())
pstore_new_entry = 1;
- l1 -= l1_cpy;
- l2 -= l2_cpy;
- total += l1_cpy + l2_cpy;
+
+ total += hsize + len;
part++;
}
if (in_nmi()) {
--- a/include/linux/kmsg_dump.h
+++ b/include/linux/kmsg_dump.h
@@ -21,6 +21,7 @@
* is passed to the kernel.
*/
enum kmsg_dump_reason {
+ KMSG_DUMP_UNDEF,
KMSG_DUMP_PANIC,
KMSG_DUMP_OOPS,
KMSG_DUMP_EMERG,
@@ -31,23 +32,35 @@ enum kmsg_dump_reason {

/**
* struct kmsg_dumper - kernel crash message dumper structure
- * @dump: The callback which gets called on crashes. The buffer is passed
- * as two sections, where s1 (length l1) contains the older
- * messages and s2 (length l2) contains the newer.
* @list: Entry in the dumper list (private)
+ * @dump: Call into dumping code which will retrieve the data with
+ * through the record iterator
+ * @max_reason: filter for highest reason number that should be dumped
* @registered: Flag that specifies if this is already registered
*/
struct kmsg_dumper {
- void (*dump)(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
- const char *s1, unsigned long l1,
- const char *s2, unsigned long l2);
struct list_head list;
- int registered;
+ void (*dump)(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason);
+ enum kmsg_dump_reason max_reason;
+ bool active;
+ bool registered;
+
+ /* private state of the kmsg iterator */
+ u32 cur_idx;
+ u32 next_idx;
+ u64 cur_seq;
+ u64 next_seq;
};

#ifdef CONFIG_PRINTK
void kmsg_dump(enum kmsg_dump_reason reason);

+bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
+ char *line, size_t size, size_t *len);
+
+bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
+ char *buf, size_t size, size_t *len);
+
int kmsg_dump_register(struct kmsg_dumper *dumper);

int kmsg_dump_unregister(struct kmsg_dumper *dumper);
@@ -56,6 +69,18 @@ static inline void kmsg_dump(enum kmsg_d
{
}

+bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
+ const char *line, size_t size, size_t *len)
+{
+ return false
+}
+
+bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
+ char *buf, size_t size, size_t *len)
+{
+ return false;
+}
+
static inline int kmsg_dump_register(struct kmsg_dumper *dumper)
{
return -EINVAL;
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -909,7 +909,7 @@ static int syslog_print_all(char __user
/*
* Find first record that fits, including all following records,
* into the user-provided buffer for this dump.
- */
+ */
seq = clear_seq;
idx = clear_idx;
while (seq < log_next_seq) {
@@ -919,6 +919,8 @@ static int syslog_print_all(char __user
idx = log_next(idx);
seq++;
}
+
+ /* move first record forward until length fits into the buffer */
seq = clear_seq;
idx = clear_idx;
while (len > size && seq < log_next_seq) {
@@ -929,7 +931,7 @@ static int syslog_print_all(char __user
seq++;
}

- /* last message in this dump */
+ /* last message fitting into this dump */
next_seq = log_next_seq;

len = 0;
@@ -2305,43 +2307,158 @@ module_param_named(always_kmsg_dump, alw
*/
void kmsg_dump(enum kmsg_dump_reason reason)
{
- u64 idx;
struct kmsg_dumper *dumper;
- const char *s1, *s2;
- unsigned long l1, l2;
unsigned long flags;

if ((reason > KMSG_DUMP_OOPS) && !always_kmsg_dump)
return;

- /* Theoretically, the log could move on after we do this, but
- there's not a lot we can do about that. The new messages
- will overwrite the start of what we dump. */
+ rcu_read_lock();
+ list_for_each_entry_rcu(dumper, &dump_list, list) {
+ if (dumper->max_reason && reason > dumper->max_reason)
+ continue;

- raw_spin_lock_irqsave(&logbuf_lock, flags);
- if (syslog_seq < log_first_seq)
- idx = syslog_idx;
- else
- idx = log_first_idx;
-
- if (idx > log_next_idx) {
- s1 = log_buf;
- l1 = log_next_idx;
+ /* initialize iterator with data about the stored records */
+ dumper->active = true;

- s2 = log_buf + idx;
- l2 = log_buf_len - idx;
- } else {
- s1 = "";
- l1 = 0;
+ raw_spin_lock_irqsave(&logbuf_lock, flags);
+ dumper->cur_seq = clear_seq;
+ dumper->cur_idx = clear_idx;
+ dumper->next_seq = log_next_seq;
+ dumper->next_idx = log_next_idx;
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);

- s2 = log_buf + idx;
- l2 = log_next_idx - idx;
+ /* invoke dumper which will iterate over records */
+ dumper->dump(dumper, reason);
+
+ /* reset iterator */
+ dumper->active = false;
}
+ rcu_read_unlock();
+}
+
+/*
+ * Start at the *oldest* kmsg record, and print one line into the provided
+ * buffer. Repeated calls will return the next record. A return value of
+ * FALSE indicates that all records are read including the youngest one.
+ */
+bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
+ char *line, size_t size, size_t *len)
+{
+ unsigned long flags;
+ struct log *msg;
+ size_t l = 0;
+ bool ret = false;
+
+ if (!dumper->active)
+ goto out;
+
+ raw_spin_lock_irqsave(&logbuf_lock, flags);
+ if (dumper->cur_seq < log_first_seq) {
+ /* messages are gone, move to first available one */
+ dumper->cur_seq = log_first_seq;
+ dumper->cur_idx = log_first_idx;
+ }
+
+ /* last entry */
+ if (dumper->cur_seq >= log_next_seq) {
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+ goto out;
+ }
+
+ msg = log_from_idx(dumper->cur_idx);
+ l = msg_print_text(msg, syslog,
+ line, size);
+
+ dumper->cur_idx = log_next(dumper->cur_idx);
+ dumper->cur_seq++;
+ ret = true;
raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+out:
+ if (len)
+ *len = l;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kmsg_dump_get_line);

- rcu_read_lock();
- list_for_each_entry_rcu(dumper, &dump_list, list)
- dumper->dump(dumper, reason, s1, l1, s2, l2);
- rcu_read_unlock();
+/*
+ * Fill the provided buffer with as many of the the *youngest* kmsg records
+ * that fit into it. Repeated calls will fill the buffer with the next block
+ * of older records, not including the former ones. A return value of FALSE
+ * indicate that all records are read, including the oldest one.
+ */
+bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
+ char *buf, size_t size, size_t *len)
+{
+ unsigned long flags;
+ u64 seq;
+ u32 idx;
+ u64 next_seq;
+ u32 next_idx;
+ size_t l = 0;
+ bool ret = false;
+
+ if (!dumper->active)
+ goto out;
+
+ raw_spin_lock_irqsave(&logbuf_lock, flags);
+ if (dumper->cur_seq < log_first_seq) {
+ /* messages are gone, move to first available one */
+ dumper->cur_seq = log_first_seq;
+ dumper->cur_idx = log_first_idx;
+ }
+
+ /* last entry */
+ if (dumper->cur_seq >= dumper->next_seq) {
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+ goto out;
+ }
+
+ /* calculate length of entire buffer */
+ seq = dumper->cur_seq;
+ idx = dumper->cur_idx;
+ while (seq < dumper->next_seq) {
+ struct log *msg = log_from_idx(idx);
+
+ l += msg_print_text(msg, true, NULL, 0);
+ idx = log_next(idx);
+ seq++;
+ }
+
+ /* move first record forward until length fits into the buffer */
+ seq = dumper->cur_seq;
+ idx = dumper->cur_idx;
+ while (l > size && seq < dumper->next_seq) {
+ struct log *msg = log_from_idx(idx);
+
+ l -= msg_print_text(msg, true, NULL, 0);
+ idx = log_next(idx);
+ seq++;
+ }
+
+ /* last message in next interation */
+ next_seq = seq;
+ next_idx = idx;
+
+ l = 0;
+ while (seq < dumper->next_seq) {
+ struct log *msg = log_from_idx(idx);
+
+ l += msg_print_text(msg, syslog,
+ buf + l, size - l);
+
+ idx = log_next(idx);
+ seq++;
+ }
+
+ dumper->next_seq = next_seq;
+ dumper->next_idx = next_idx;
+ ret = true;
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+out:
+ if (len)
+ *len = l;
+ return ret;
}
+EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);
#endif

2012-05-29 16:14:51

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, 2012-05-29 at 18:07 +0200, Kay Sievers wrote:
[]
> --- a/include/linux/kmsg_dump.h
> +++ b/include/linux/kmsg_dump.h
[]
> @@ -56,6 +69,18 @@ static inline void kmsg_dump(enum kmsg_d
> {
> }
>
> +bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
> + const char *line, size_t size, size_t *len)
> +{
> + return false
> +}

uncompilable.


2012-05-29 16:34:57

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 29, 2012 at 6:14 PM, Joe Perches <[email protected]> wrote:
> On Tue, 2012-05-29 at 18:07 +0200, Kay Sievers wrote:

>> +bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
>> +                       const char *line, size_t size, size_t *len)
>> +{
>> +     return false
>> +}
>
> uncompilable.

The damn !CONFIG_PRINTK case. :) Fixed!
http://git.kernel.org/?p=linux/kernel/git/kay/patches.git;a=commitdiff;h=607b6e8fa7ca61dd385d632fe51c6afc5f5374b7

Thanks,
Kay

2012-05-29 16:52:00

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, 2012-05-29 at 18:34 +0200, Kay Sievers wrote:
> The damn !CONFIG_PRINTK case. :) Fixed!

Maybe more appropriate is a ;) Thanks.

When you get a chance, could you comment on the
RFC patch to convert the <.> in KERN_<LEVEL>
to an ascii SOH.

https://lkml.org/lkml/2012/5/24/350

It doesn't / shouldn't affect any output anywhere
as vprintk_emit converts it back to <.> for the
actual output, it saves some space, and it makes
KERN_CONT/pr_cont unnecessary as anything other
than a notational marker.

2012-05-29 17:11:39

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

+/*
+ * Fill the provided buffer with as many of the the *youngest* kmsg records
+ * that fit into it. Repeated calls will fill the buffer with the next block
+ * of older records, not including the former ones. A return value of FALSE
+ * indicate that all records are read, including the oldest one.
+ */

I confess to not reading all of this ... but does this handle the pstore/OOPS
case? We'll use this call to pull as many records as will fit in the pstore
error record ... but almost certainly won't pull all of the console log, so
we won't see the FALSE return.

The system keeps on running.

Now another OOPS happens and pstore wants to grab the new "youngest" records,
it doesn't want to continue stepping to older records from the place it got
to on the first OOPS. Will that work?

-Tony
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?Ý¢j"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2012-05-29 17:22:40

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 29, 2012 at 7:11 PM, Luck, Tony <[email protected]> wrote:
> +/*
> + * Fill the provided buffer with as many of the the *youngest* kmsg records
> + * that fit into it. Repeated calls will fill the buffer with the next block
> + * of older records, not including the former ones. A return value of FALSE
> + * indicate that all records are read, including the oldest one.
> + */
>
> I confess to not reading all of this ... but does this handle the pstore/OOPS
> case? We'll use this call to pull as many records as will fit in the pstore
> error record ... but almost certainly won't pull all of the console log, so
> we won't see the FALSE return.

The loop in pstore gets the newest messages first, then another chunk
of the older messages with every call; pstore stops to iterate when
the defined limit it wants to store is reached, or when no messages
are available anymore.

> The system keeps on running.
>
> Now another OOPS happens and pstore wants to grab the new "youngest" records,
> it doesn't want to continue stepping to older records from the place it got
> to on the first OOPS.  Will that work?

A new oops will init the iterator again, so it will start at the end
with the new messages that happened in the meantime.

The iterator is only valid while used from dump(), it will return an
error if used outside of it.

The logic *should* work; at least it should be easy to make it work if
it doesn't. :)

Kay

2012-05-30 11:30:16

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, 2012-05-29 at 19:22 +0200, Kay Sievers wrote:
> On Tue, May 29, 2012 at 7:11 PM, Luck, Tony <[email protected]> wrote:
> > +/*
> > + * Fill the provided buffer with as many of the the *youngest* kmsg records
> > + * that fit into it. Repeated calls will fill the buffer with the next block
> > + * of older records, not including the former ones. A return value of FALSE
> > + * indicate that all records are read, including the oldest one.
> > + */
> >
> > I confess to not reading all of this ... but does this handle the pstore/OOPS
> > case? We'll use this call to pull as many records as will fit in the pstore
> > error record ... but almost certainly won't pull all of the console log, so
> > we won't see the FALSE return.
>
> The loop in pstore gets the newest messages first, then another chunk
> of the older messages with every call; pstore stops to iterate when
> the defined limit it wants to store is reached, or when no messages
> are available anymore.
>
> > The system keeps on running.
> >
> > Now another OOPS happens and pstore wants to grab the new "youngest" records,
> > it doesn't want to continue stepping to older records from the place it got
> > to on the first OOPS. Will that work?
>
> A new oops will init the iterator again, so it will start at the end
> with the new messages that happened in the meantime.
>
> The iterator is only valid while used from dump(), it will return an
> error if used outside of it.
>
> The logic *should* work; at least it should be easy to make it work if
> it doesn't. :)

Updated patch with proper kerneldoc headers, to clarify the iterator's
interface and behavior.

Tony, could you possibly give it a try with your pstore dumper, if it
behaves as expected, or what we need to fix?

Thanks,
Kay

---
arch/powerpc/platforms/pseries/nvram.c | 61 -------
arch/x86/platform/mrst/early_printk_mrst.c | 13 -
drivers/mtd/mtdoops.c | 22 --
fs/pstore/platform.c | 34 +---
include/linux/kmsg_dump.h | 45 ++++-
kernel/printk.c | 222 +++++++++++++++++++++++++----
6 files changed, 259 insertions(+), 138 deletions(-)

--- a/arch/powerpc/platforms/pseries/nvram.c
+++ b/arch/powerpc/platforms/pseries/nvram.c
@@ -68,9 +68,7 @@ static const char *pseries_nvram_os_part
};

static void oops_to_nvram(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *old_msgs, unsigned long old_len,
- const char *new_msgs, unsigned long new_len);
+ enum kmsg_dump_reason reason);

static struct kmsg_dumper nvram_kmsg_dumper = {
.dump = oops_to_nvram
@@ -504,28 +502,6 @@ int __init pSeries_nvram_init(void)
}

/*
- * Try to capture the last capture_len bytes of the printk buffer. Return
- * the amount actually captured.
- */
-static size_t capture_last_msgs(const char *old_msgs, size_t old_len,
- const char *new_msgs, size_t new_len,
- char *captured, size_t capture_len)
-{
- if (new_len >= capture_len) {
- memcpy(captured, new_msgs + (new_len - capture_len),
- capture_len);
- return capture_len;
- } else {
- /* Grab the end of old_msgs. */
- size_t old_tail_len = min(old_len, capture_len - new_len);
- memcpy(captured, old_msgs + (old_len - old_tail_len),
- old_tail_len);
- memcpy(captured + old_tail_len, new_msgs, new_len);
- return old_tail_len + new_len;
- }
-}
-
-/*
* Are we using the ibm,rtas-log for oops/panic reports? And if so,
* would logging this oops/panic overwrite an RTAS event that rtas_errd
* hasn't had a chance to read and process? Return 1 if so, else 0.
@@ -541,27 +517,6 @@ static int clobbering_unread_rtas_event(
NVRAM_RTAS_READ_TIMEOUT);
}

-/* Squeeze out each line's <n> severity prefix. */
-static size_t elide_severities(char *buf, size_t len)
-{
- char *in, *out, *buf_end = buf + len;
- /* Assume a <n> at the very beginning marks the start of a line. */
- int newline = 1;
-
- in = out = buf;
- while (in < buf_end) {
- if (newline && in+3 <= buf_end &&
- *in == '<' && isdigit(in[1]) && in[2] == '>') {
- in += 3;
- newline = 0;
- } else {
- newline = (*in == '\n');
- *out++ = *in++;
- }
- }
- return out - buf;
-}
-
/* Derived from logfs_compress() */
static int nvram_compress(const void *in, void *out, size_t inlen,
size_t outlen)
@@ -619,9 +574,7 @@ static int zip_oops(size_t text_len)
* partition. If that's too much, go back and capture uncompressed text.
*/
static void oops_to_nvram(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *old_msgs, unsigned long old_len,
- const char *new_msgs, unsigned long new_len)
+ enum kmsg_dump_reason reason)
{
static unsigned int oops_count = 0;
static bool panicking = false;
@@ -660,14 +613,14 @@ static void oops_to_nvram(struct kmsg_du
return;

if (big_oops_buf) {
- text_len = capture_last_msgs(old_msgs, old_len,
- new_msgs, new_len, big_oops_buf, big_oops_buf_sz);
- text_len = elide_severities(big_oops_buf, text_len);
+ kmsg_dump_get_buffer(dumper, false,
+ big_oops_buf, big_oops_buf_sz, &text_len);
rc = zip_oops(text_len);
}
if (rc != 0) {
- text_len = capture_last_msgs(old_msgs, old_len,
- new_msgs, new_len, oops_data, oops_data_sz);
+ kmsg_dump_rewind(dumper);
+ kmsg_dump_get_buffer(dumper, true,
+ oops_data, oops_data_sz, &text_len);
err_type = ERR_TYPE_KERNEL_PANIC;
*oops_len = (u16) text_len;
}
--- a/arch/x86/platform/mrst/early_printk_mrst.c
+++ b/arch/x86/platform/mrst/early_printk_mrst.c
@@ -110,19 +110,16 @@ static struct kmsg_dumper dw_dumper;
static int dumper_registered;

static void dw_kmsg_dump(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *s1, unsigned long l1,
- const char *s2, unsigned long l2)
+ enum kmsg_dump_reason reason)
{
- int i;
+ static char line[1024];
+ size_t len;

/* When run to this, we'd better re-init the HW */
mrst_early_console_init();

- for (i = 0; i < l1; i++)
- early_mrst_console.write(&early_mrst_console, s1 + i, 1);
- for (i = 0; i < l2; i++)
- early_mrst_console.write(&early_mrst_console, s2 + i, 1);
+ while (kmsg_dump_get_line(dumper, true, line, sizeof(line), &len))
+ early_mrst_console.write(&early_mrst_console, line, len);
}

/* Set the ratio rate to 115200, 8n1, IRQ disabled */
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -304,32 +304,17 @@ static void find_next_position(struct mt
}

static void mtdoops_do_dump(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason, const char *s1, unsigned long l1,
- const char *s2, unsigned long l2)
+ enum kmsg_dump_reason reason)
{
struct mtdoops_context *cxt = container_of(dumper,
struct mtdoops_context, dump);
- unsigned long s1_start, s2_start;
- unsigned long l1_cpy, l2_cpy;
- char *dst;
-
- if (reason != KMSG_DUMP_OOPS &&
- reason != KMSG_DUMP_PANIC)
- return;

/* Only dump oopses if dump_oops is set */
if (reason == KMSG_DUMP_OOPS && !dump_oops)
return;

- dst = cxt->oops_buf + MTDOOPS_HEADER_SIZE; /* Skip the header */
- l2_cpy = min(l2, record_size - MTDOOPS_HEADER_SIZE);
- l1_cpy = min(l1, record_size - MTDOOPS_HEADER_SIZE - l2_cpy);
-
- s2_start = l2 - l2_cpy;
- s1_start = l1 - l1_cpy;
-
- memcpy(dst, s1 + s1_start, l1_cpy);
- memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
+ kmsg_dump_get_buffer(dumper, true, cxt->oops_buf + MTDOOPS_HEADER_SIZE,
+ record_size - MTDOOPS_HEADER_SIZE, NULL);

/* Panics must be written immediately */
if (reason != KMSG_DUMP_OOPS)
@@ -375,6 +360,7 @@ static void mtdoops_notify_add(struct mt
return;
}

+ cxt->dump.max_reason = KMSG_DUMP_OOPS;
cxt->dump.dump = mtdoops_do_dump;
err = kmsg_dump_register(&cxt->dump);
if (err) {
--- a/fs/pstore/platform.c
+++ b/fs/pstore/platform.c
@@ -94,20 +94,15 @@ static const char *get_reason_str(enum k
* as we can from the end of the buffer.
*/
static void pstore_dump(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *s1, unsigned long l1,
- const char *s2, unsigned long l2)
+ enum kmsg_dump_reason reason)
{
- unsigned long s1_start, s2_start;
- unsigned long l1_cpy, l2_cpy;
- unsigned long size, total = 0;
- char *dst;
+ unsigned long total = 0;
const char *why;
u64 id;
- int hsize, ret;
unsigned int part = 1;
unsigned long flags = 0;
int is_locked = 0;
+ int ret;

why = get_reason_str(reason);

@@ -119,30 +114,25 @@ static void pstore_dump(struct kmsg_dump
spin_lock_irqsave(&psinfo->buf_lock, flags);
oopscount++;
while (total < kmsg_bytes) {
+ char *dst;
+ unsigned long size;
+ int hsize;
+ size_t len;
+
dst = psinfo->buf;
hsize = sprintf(dst, "%s#%d Part%d\n", why, oopscount, part);
size = psinfo->bufsize - hsize;
dst += hsize;

- l2_cpy = min(l2, size);
- l1_cpy = min(l1, size - l2_cpy);
-
- if (l1_cpy + l2_cpy == 0)
+ if (!kmsg_dump_get_buffer(dumper, true, dst, size, &len))
break;

- s2_start = l2 - l2_cpy;
- s1_start = l1 - l1_cpy;
-
- memcpy(dst, s1 + s1_start, l1_cpy);
- memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
-
ret = psinfo->write(PSTORE_TYPE_DMESG, reason, &id, part,
- hsize + l1_cpy + l2_cpy, psinfo);
+ hsize + len, psinfo);
if (ret == 0 && reason == KMSG_DUMP_OOPS && pstore_is_mounted())
pstore_new_entry = 1;
- l1 -= l1_cpy;
- l2 -= l2_cpy;
- total += l1_cpy + l2_cpy;
+
+ total += hsize + len;
part++;
}
if (in_nmi()) {
--- a/include/linux/kmsg_dump.h
+++ b/include/linux/kmsg_dump.h
@@ -21,6 +21,7 @@
* is passed to the kernel.
*/
enum kmsg_dump_reason {
+ KMSG_DUMP_UNDEF,
KMSG_DUMP_PANIC,
KMSG_DUMP_OOPS,
KMSG_DUMP_EMERG,
@@ -31,23 +32,37 @@ enum kmsg_dump_reason {

/**
* struct kmsg_dumper - kernel crash message dumper structure
- * @dump: The callback which gets called on crashes. The buffer is passed
- * as two sections, where s1 (length l1) contains the older
- * messages and s2 (length l2) contains the newer.
* @list: Entry in the dumper list (private)
+ * @dump: Call into dumping code which will retrieve the data with
+ * through the record iterator
+ * @max_reason: filter for highest reason number that should be dumped
* @registered: Flag that specifies if this is already registered
*/
struct kmsg_dumper {
- void (*dump)(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
- const char *s1, unsigned long l1,
- const char *s2, unsigned long l2);
struct list_head list;
- int registered;
+ void (*dump)(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason);
+ enum kmsg_dump_reason max_reason;
+ bool active;
+ bool registered;
+
+ /* private state of the kmsg iterator */
+ u32 cur_idx;
+ u32 next_idx;
+ u64 cur_seq;
+ u64 next_seq;
};

#ifdef CONFIG_PRINTK
void kmsg_dump(enum kmsg_dump_reason reason);

+bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
+ char *line, size_t size, size_t *len);
+
+bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
+ char *buf, size_t size, size_t *len);
+
+void kmsg_dump_rewind(struct kmsg_dumper *dumper);
+
int kmsg_dump_register(struct kmsg_dumper *dumper);

int kmsg_dump_unregister(struct kmsg_dumper *dumper);
@@ -56,6 +71,22 @@ static inline void kmsg_dump(enum kmsg_d
{
}

+bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
+ const char *line, size_t size, size_t *len)
+{
+ return false;
+}
+
+bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
+ char *buf, size_t size, size_t *len)
+{
+ return false;
+}
+
+void kmsg_dump_rewind(struct kmsg_dumper *dumper)
+{
+}
+
static inline int kmsg_dump_register(struct kmsg_dumper *dumper)
{
return -EINVAL;
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -909,7 +909,7 @@ static int syslog_print_all(char __user
/*
* Find first record that fits, including all following records,
* into the user-provided buffer for this dump.
- */
+ */
seq = clear_seq;
idx = clear_idx;
while (seq < log_next_seq) {
@@ -919,6 +919,8 @@ static int syslog_print_all(char __user
idx = log_next(idx);
seq++;
}
+
+ /* move first record forward until length fits into the buffer */
seq = clear_seq;
idx = clear_idx;
while (len > size && seq < log_next_seq) {
@@ -929,7 +931,7 @@ static int syslog_print_all(char __user
seq++;
}

- /* last message in this dump */
+ /* last message fitting into this dump */
next_seq = log_next_seq;

len = 0;
@@ -2300,48 +2302,210 @@ module_param_named(always_kmsg_dump, alw
* kmsg_dump - dump kernel log to kernel message dumpers.
* @reason: the reason (oops, panic etc) for dumping
*
- * Iterate through each of the dump devices and call the oops/panic
- * callbacks with the log buffer.
+ * Call each of the registered dumper's dump() callback, which can
+ * retrieve the kmsg records with kmsg_dump_get_line() or
+ * kmsg_dump_get_buffer().
*/
void kmsg_dump(enum kmsg_dump_reason reason)
{
- u64 idx;
struct kmsg_dumper *dumper;
- const char *s1, *s2;
- unsigned long l1, l2;
unsigned long flags;

if ((reason > KMSG_DUMP_OOPS) && !always_kmsg_dump)
return;

- /* Theoretically, the log could move on after we do this, but
- there's not a lot we can do about that. The new messages
- will overwrite the start of what we dump. */
+ rcu_read_lock();
+ list_for_each_entry_rcu(dumper, &dump_list, list) {
+ if (dumper->max_reason && reason > dumper->max_reason)
+ continue;
+
+ /* initialize iterator with data about the stored records */
+ dumper->active = true;
+
+ raw_spin_lock_irqsave(&logbuf_lock, flags);
+ dumper->cur_seq = clear_seq;
+ dumper->cur_idx = clear_idx;
+ dumper->next_seq = log_next_seq;
+ dumper->next_idx = log_next_idx;
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+
+ /* invoke dumper which will iterate over records */
+ dumper->dump(dumper, reason);
+
+ /* reset iterator */
+ dumper->active = false;
+ }
+ rcu_read_unlock();
+}
+
+/**
+ * kmsg_dump_get_line - retrieve one kmsg log line
+ * @dumper: registered kmsg dumper
+ * @syslog: include the "<4>" prefixes
+ * @line: buffer to copy the line to
+ * @size: maximum size of the buffer
+ * @len: length of line placed into buffer
+ *
+ * Start at the beginning of the kmsg buffer, with the oldest kmsg
+ * record, and copy one record into the provided buffer.
+ *
+ * Consecutive calls will return the next available record moving
+ * towards the end of the buffer with the youngest messages.
+ *
+ * A return value of FALSE indicate that there are no more records to
+ * read.
+ */
+bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
+ char *line, size_t size, size_t *len)
+{
+ unsigned long flags;
+ struct log *msg;
+ size_t l = 0;
+ bool ret = false;
+
+ if (!dumper->active)
+ goto out;

raw_spin_lock_irqsave(&logbuf_lock, flags);
- if (syslog_seq < log_first_seq)
- idx = syslog_idx;
- else
- idx = log_first_idx;
-
- if (idx > log_next_idx) {
- s1 = log_buf;
- l1 = log_next_idx;
+ if (dumper->cur_seq < log_first_seq) {
+ /* messages are gone, move to first available one */
+ dumper->cur_seq = log_first_seq;
+ dumper->cur_idx = log_first_idx;
+ }

- s2 = log_buf + idx;
- l2 = log_buf_len - idx;
- } else {
- s1 = "";
- l1 = 0;
+ /* last entry */
+ if (dumper->cur_seq >= log_next_seq) {
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+ goto out;
+ }

- s2 = log_buf + idx;
- l2 = log_next_idx - idx;
+ msg = log_from_idx(dumper->cur_idx);
+ l = msg_print_text(msg, syslog,
+ line, size);
+
+ dumper->cur_idx = log_next(dumper->cur_idx);
+ dumper->cur_seq++;
+ ret = true;
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+out:
+ if (len)
+ *len = l;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kmsg_dump_get_line);
+
+/**
+ * kmsg_dump_get_buffer - copy kmsg log lines
+ * @dumper: registered kmsg dumper
+ * @syslog: include the "<4>" prefixes
+ * @line: buffer to copy the line to
+ * @size: maximum size of the buffer
+ * @len: length of line placed into buffer
+ *
+ * Start at the end of the kmsg buffer and fill the provided buffer
+ * with as many of the the *youngest* kmsg records that fit into it.
+ * If the buffer is large enough, all available kmsg records will be
+ * copied with a single call.
+ *
+ * Consecutive calls will fill the buffer with the next block of
+ * available older records, not including the earlier retrieved ones.
+ *
+ * A return value of FALSE indicate that there are no more records to
+ * read.
+ */
+bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
+ char *buf, size_t size, size_t *len)
+{
+ unsigned long flags;
+ u64 seq;
+ u32 idx;
+ u64 next_seq;
+ u32 next_idx;
+ size_t l = 0;
+ bool ret = false;
+
+ if (!dumper->active)
+ goto out;
+
+ raw_spin_lock_irqsave(&logbuf_lock, flags);
+ if (dumper->cur_seq < log_first_seq) {
+ /* messages are gone, move to first available one */
+ dumper->cur_seq = log_first_seq;
+ dumper->cur_idx = log_first_idx;
+ }
+
+ /* last entry */
+ if (dumper->cur_seq >= dumper->next_seq) {
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+ goto out;
+ }
+
+ /* calculate length of entire buffer */
+ seq = dumper->cur_seq;
+ idx = dumper->cur_idx;
+ while (seq < dumper->next_seq) {
+ struct log *msg = log_from_idx(idx);
+
+ l += msg_print_text(msg, true, NULL, 0);
+ idx = log_next(idx);
+ seq++;
+ }
+
+ /* move first record forward until length fits into the buffer */
+ seq = dumper->cur_seq;
+ idx = dumper->cur_idx;
+ while (l > size && seq < dumper->next_seq) {
+ struct log *msg = log_from_idx(idx);
+
+ l -= msg_print_text(msg, true, NULL, 0);
+ idx = log_next(idx);
+ seq++;
+ }
+
+ /* last message in next interation */
+ next_seq = seq;
+ next_idx = idx;
+
+ l = 0;
+ while (seq < dumper->next_seq) {
+ struct log *msg = log_from_idx(idx);
+
+ l += msg_print_text(msg, syslog,
+ buf + l, size - l);
+
+ idx = log_next(idx);
+ seq++;
}
+
+ dumper->next_seq = next_seq;
+ dumper->next_idx = next_idx;
+ ret = true;
raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+out:
+ if (len)
+ *len = l;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);

- rcu_read_lock();
- list_for_each_entry_rcu(dumper, &dump_list, list)
- dumper->dump(dumper, reason, s1, l1, s2, l2);
- rcu_read_unlock();
+/**
+ * kmsg_dump_rewind - reset the interator
+ * @dumper: registered kmsg dumper
+ *
+ * Reset the dumper's iterator so that kmsg_dump_get_line() and
+ * kmsg_dump_get_buffer() can be called again and used multiple
+ * times within the same dumper.dump() callback.
+ */
+void kmsg_dump_rewind(struct kmsg_dumper *dumper)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&logbuf_lock, flags);
+ dumper->cur_seq = clear_seq;
+ dumper->cur_idx = clear_idx;
+ dumper->next_seq = log_next_seq;
+ dumper->next_idx = log_next_idx;
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
}
+EXPORT_SYMBOL_GPL(kmsg_dump_rewind);
#endif

2012-06-06 06:33:24

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 29, 2012 at 06:07:00PM +0200, Kay Sievers wrote:
> On Sun, 2012-05-27 at 16:23 +0200, Kay Sievers wrote:
> > On Sat, May 26, 2012 at 1:11 PM, Anton Vorontsov <[email protected]> wrote:
>
> > > So, starting from this patch every kmsg_dump user is getting the
> > > binary stuff, which is barely usable, and breaks a longstanding
> > > user-visible ABIs (e.g. mtdoops, oops dumping into nvram on PowerPC,
> > > and ramoops of course).
> >
> > These cases export their own format, it's just that the dump routines
> > could copy the buffer in only two steps (beginning and end of the ring
> > buffer), while in the future their code would need an update to copy
> > it record for record now. The exported format in the dump does not
> > necessarily need to change.
>
> Here is an untested patch, which should restore the old plain text
> export format of the kmsg buffer. I adds an iterator to the dumper
> routine, which can be used by the dumper to properly copy the data out
> of the kmsg record buffer. All current users are converted in the same
> patch.
>
> Any feedback/help in testing/updating and getting this thing into a
> mergeable state would be greatly appreciated.

Any comments on this patch? Did anyone try it out? I want to merge it
now, so unless someone objects, will do so soon.

thanks,

greg k-h

2012-06-15 00:04:30

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Tue, May 29, 2012 at 06:07:00PM +0200, Kay Sievers wrote:
> On Sun, 2012-05-27 at 16:23 +0200, Kay Sievers wrote:
> > On Sat, May 26, 2012 at 1:11 PM, Anton Vorontsov <[email protected]> wrote:
>
> > > So, starting from this patch every kmsg_dump user is getting the
> > > binary stuff, which is barely usable, and breaks a longstanding
> > > user-visible ABIs (e.g. mtdoops, oops dumping into nvram on PowerPC,
> > > and ramoops of course).
> >
> > These cases export their own format, it's just that the dump routines
> > could copy the buffer in only two steps (beginning and end of the ring
> > buffer), while in the future their code would need an update to copy
> > it record for record now. The exported format in the dump does not
> > necessarily need to change.
>
> Here is an untested patch, which should restore the old plain text
> export format of the kmsg buffer. I adds an iterator to the dumper
> routine, which can be used by the dumper to properly copy the data out
> of the kmsg record buffer. All current users are converted in the same
> patch.
>
> Any feedback/help in testing/updating and getting this thing into a
> mergeable state would be greatly appreciated.
>
> There are two versions of the iterator:
> kmsg_dump_get_line() starts with the oldest records. With the next call,
> the iterator retrieves one line after the other. Useful for cases like
> printing to a console or any other stream device.
>
> kmsg_dump_get_buffer() fills a buffer with as many of the youngest
> records as fit into it. With the next call, the iterator retrieves the
> next block of older messages. Useful to dump the entire buffer with one
> shot, or chunk the data into blocks and store them, with the youngest
> messages being the most important, retrieved first.
>
> Unlike the old logic, there are no stale pointers handed out to the
> dumper, if case things get overwritten while a dump is in progress,
> lines are not truncated or overwritten.

As no one seems to have responded, care to resend this with a
signed-off-by so I can queue it up for 3.6? If people complain, it can
still go to 3.5, but given that no one is saying anything, I guess no
one cares...

thanks,

greg k-h

2012-06-15 01:32:52

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, Jun 14, 2012 at 05:04:22PM -0700, Greg KH wrote:
[...]
> > Unlike the old logic, there are no stale pointers handed out to the
> > dumper, if case things get overwritten while a dump is in progress,
> > lines are not truncated or overwritten.
>
> As no one seems to have responded, care to resend this with a
> signed-off-by so I can queue it up for 3.6? If people complain, it can
> still go to 3.5, but given that no one is saying anything, I guess no
> one cares...

Well, I do care. Just didn't find time to test it. I'll try to speed
myself up.

Thanks,

--
Anton Vorontsov
Email: [email protected]

2012-06-15 12:08:01

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, 2012-06-14 at 17:04 -0700, Greg KH wrote:

> As no one seems to have responded, care to resend this with a
> signed-off-by so I can queue it up for 3.6? If people complain, it can
> still go to 3.5, but given that no one is saying anything, I guess no
> one cares...

Sounds like a good plan. Here it is.

Thanks,
Kay


From: Kay Sievers <[email protected]>
Subject: kmsg - kmsg_dump() use iterator to receice log buffer content

Provide an iterator to receive the log buffer content, and convert all
kmsg_dump() user to it.

The structured data in the kmsg buffer now contains binary data, which
should no longer be copied verbatim to the kmsg_dump() users.

The iterator should provide reliable access to the buffer data, and also
supports proper log line-aware chunking of data while iterating.

Signed-off-by: Kay Sievers <[email protected]>
---
arch/powerpc/platforms/pseries/nvram.c | 61 -------
arch/x86/platform/mrst/early_printk_mrst.c | 13 -
drivers/mtd/mtdoops.c | 22 --
fs/pstore/platform.c | 34 +---
include/linux/kmsg_dump.h | 45 ++++-
kernel/printk.c | 222 +++++++++++++++++++++++++----
6 files changed, 259 insertions(+), 138 deletions(-)

--- a/arch/powerpc/platforms/pseries/nvram.c
+++ b/arch/powerpc/platforms/pseries/nvram.c
@@ -68,9 +68,7 @@ static const char *pseries_nvram_os_part
};

static void oops_to_nvram(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *old_msgs, unsigned long old_len,
- const char *new_msgs, unsigned long new_len);
+ enum kmsg_dump_reason reason);

static struct kmsg_dumper nvram_kmsg_dumper = {
.dump = oops_to_nvram
@@ -504,28 +502,6 @@ int __init pSeries_nvram_init(void)
}

/*
- * Try to capture the last capture_len bytes of the printk buffer. Return
- * the amount actually captured.
- */
-static size_t capture_last_msgs(const char *old_msgs, size_t old_len,
- const char *new_msgs, size_t new_len,
- char *captured, size_t capture_len)
-{
- if (new_len >= capture_len) {
- memcpy(captured, new_msgs + (new_len - capture_len),
- capture_len);
- return capture_len;
- } else {
- /* Grab the end of old_msgs. */
- size_t old_tail_len = min(old_len, capture_len - new_len);
- memcpy(captured, old_msgs + (old_len - old_tail_len),
- old_tail_len);
- memcpy(captured + old_tail_len, new_msgs, new_len);
- return old_tail_len + new_len;
- }
-}
-
-/*
* Are we using the ibm,rtas-log for oops/panic reports? And if so,
* would logging this oops/panic overwrite an RTAS event that rtas_errd
* hasn't had a chance to read and process? Return 1 if so, else 0.
@@ -541,27 +517,6 @@ static int clobbering_unread_rtas_event(
NVRAM_RTAS_READ_TIMEOUT);
}

-/* Squeeze out each line's <n> severity prefix. */
-static size_t elide_severities(char *buf, size_t len)
-{
- char *in, *out, *buf_end = buf + len;
- /* Assume a <n> at the very beginning marks the start of a line. */
- int newline = 1;
-
- in = out = buf;
- while (in < buf_end) {
- if (newline && in+3 <= buf_end &&
- *in == '<' && isdigit(in[1]) && in[2] == '>') {
- in += 3;
- newline = 0;
- } else {
- newline = (*in == '\n');
- *out++ = *in++;
- }
- }
- return out - buf;
-}
-
/* Derived from logfs_compress() */
static int nvram_compress(const void *in, void *out, size_t inlen,
size_t outlen)
@@ -619,9 +574,7 @@ static int zip_oops(size_t text_len)
* partition. If that's too much, go back and capture uncompressed text.
*/
static void oops_to_nvram(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *old_msgs, unsigned long old_len,
- const char *new_msgs, unsigned long new_len)
+ enum kmsg_dump_reason reason)
{
static unsigned int oops_count = 0;
static bool panicking = false;
@@ -660,14 +613,14 @@ static void oops_to_nvram(struct kmsg_du
return;

if (big_oops_buf) {
- text_len = capture_last_msgs(old_msgs, old_len,
- new_msgs, new_len, big_oops_buf, big_oops_buf_sz);
- text_len = elide_severities(big_oops_buf, text_len);
+ kmsg_dump_get_buffer(dumper, false,
+ big_oops_buf, big_oops_buf_sz, &text_len);
rc = zip_oops(text_len);
}
if (rc != 0) {
- text_len = capture_last_msgs(old_msgs, old_len,
- new_msgs, new_len, oops_data, oops_data_sz);
+ kmsg_dump_rewind(dumper);
+ kmsg_dump_get_buffer(dumper, true,
+ oops_data, oops_data_sz, &text_len);
err_type = ERR_TYPE_KERNEL_PANIC;
*oops_len = (u16) text_len;
}
--- a/arch/x86/platform/mrst/early_printk_mrst.c
+++ b/arch/x86/platform/mrst/early_printk_mrst.c
@@ -110,19 +110,16 @@ static struct kmsg_dumper dw_dumper;
static int dumper_registered;

static void dw_kmsg_dump(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *s1, unsigned long l1,
- const char *s2, unsigned long l2)
+ enum kmsg_dump_reason reason)
{
- int i;
+ static char line[1024];
+ size_t len;

/* When run to this, we'd better re-init the HW */
mrst_early_console_init();

- for (i = 0; i < l1; i++)
- early_mrst_console.write(&early_mrst_console, s1 + i, 1);
- for (i = 0; i < l2; i++)
- early_mrst_console.write(&early_mrst_console, s2 + i, 1);
+ while (kmsg_dump_get_line(dumper, true, line, sizeof(line), &len))
+ early_mrst_console.write(&early_mrst_console, line, len);
}

/* Set the ratio rate to 115200, 8n1, IRQ disabled */
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -304,32 +304,17 @@ static void find_next_position(struct mt
}

static void mtdoops_do_dump(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason, const char *s1, unsigned long l1,
- const char *s2, unsigned long l2)
+ enum kmsg_dump_reason reason)
{
struct mtdoops_context *cxt = container_of(dumper,
struct mtdoops_context, dump);
- unsigned long s1_start, s2_start;
- unsigned long l1_cpy, l2_cpy;
- char *dst;
-
- if (reason != KMSG_DUMP_OOPS &&
- reason != KMSG_DUMP_PANIC)
- return;

/* Only dump oopses if dump_oops is set */
if (reason == KMSG_DUMP_OOPS && !dump_oops)
return;

- dst = cxt->oops_buf + MTDOOPS_HEADER_SIZE; /* Skip the header */
- l2_cpy = min(l2, record_size - MTDOOPS_HEADER_SIZE);
- l1_cpy = min(l1, record_size - MTDOOPS_HEADER_SIZE - l2_cpy);
-
- s2_start = l2 - l2_cpy;
- s1_start = l1 - l1_cpy;
-
- memcpy(dst, s1 + s1_start, l1_cpy);
- memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
+ kmsg_dump_get_buffer(dumper, true, cxt->oops_buf + MTDOOPS_HEADER_SIZE,
+ record_size - MTDOOPS_HEADER_SIZE, NULL);

/* Panics must be written immediately */
if (reason != KMSG_DUMP_OOPS)
@@ -375,6 +360,7 @@ static void mtdoops_notify_add(struct mt
return;
}

+ cxt->dump.max_reason = KMSG_DUMP_OOPS;
cxt->dump.dump = mtdoops_do_dump;
err = kmsg_dump_register(&cxt->dump);
if (err) {
--- a/fs/pstore/platform.c
+++ b/fs/pstore/platform.c
@@ -94,20 +94,15 @@ static const char *get_reason_str(enum k
* as we can from the end of the buffer.
*/
static void pstore_dump(struct kmsg_dumper *dumper,
- enum kmsg_dump_reason reason,
- const char *s1, unsigned long l1,
- const char *s2, unsigned long l2)
+ enum kmsg_dump_reason reason)
{
- unsigned long s1_start, s2_start;
- unsigned long l1_cpy, l2_cpy;
- unsigned long size, total = 0;
- char *dst;
+ unsigned long total = 0;
const char *why;
u64 id;
- int hsize, ret;
unsigned int part = 1;
unsigned long flags = 0;
int is_locked = 0;
+ int ret;

why = get_reason_str(reason);

@@ -119,30 +114,25 @@ static void pstore_dump(struct kmsg_dump
spin_lock_irqsave(&psinfo->buf_lock, flags);
oopscount++;
while (total < kmsg_bytes) {
+ char *dst;
+ unsigned long size;
+ int hsize;
+ size_t len;
+
dst = psinfo->buf;
hsize = sprintf(dst, "%s#%d Part%d\n", why, oopscount, part);
size = psinfo->bufsize - hsize;
dst += hsize;

- l2_cpy = min(l2, size);
- l1_cpy = min(l1, size - l2_cpy);
-
- if (l1_cpy + l2_cpy == 0)
+ if (!kmsg_dump_get_buffer(dumper, true, dst, size, &len))
break;

- s2_start = l2 - l2_cpy;
- s1_start = l1 - l1_cpy;
-
- memcpy(dst, s1 + s1_start, l1_cpy);
- memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
-
ret = psinfo->write(PSTORE_TYPE_DMESG, reason, &id, part,
- hsize + l1_cpy + l2_cpy, psinfo);
+ hsize + len, psinfo);
if (ret == 0 && reason == KMSG_DUMP_OOPS && pstore_is_mounted())
pstore_new_entry = 1;
- l1 -= l1_cpy;
- l2 -= l2_cpy;
- total += l1_cpy + l2_cpy;
+
+ total += hsize + len;
part++;
}
if (in_nmi()) {
--- a/include/linux/kmsg_dump.h
+++ b/include/linux/kmsg_dump.h
@@ -21,6 +21,7 @@
* is passed to the kernel.
*/
enum kmsg_dump_reason {
+ KMSG_DUMP_UNDEF,
KMSG_DUMP_PANIC,
KMSG_DUMP_OOPS,
KMSG_DUMP_EMERG,
@@ -31,23 +32,37 @@ enum kmsg_dump_reason {

/**
* struct kmsg_dumper - kernel crash message dumper structure
- * @dump: The callback which gets called on crashes. The buffer is passed
- * as two sections, where s1 (length l1) contains the older
- * messages and s2 (length l2) contains the newer.
* @list: Entry in the dumper list (private)
+ * @dump: Call into dumping code which will retrieve the data with
+ * through the record iterator
+ * @max_reason: filter for highest reason number that should be dumped
* @registered: Flag that specifies if this is already registered
*/
struct kmsg_dumper {
- void (*dump)(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
- const char *s1, unsigned long l1,
- const char *s2, unsigned long l2);
struct list_head list;
- int registered;
+ void (*dump)(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason);
+ enum kmsg_dump_reason max_reason;
+ bool active;
+ bool registered;
+
+ /* private state of the kmsg iterator */
+ u32 cur_idx;
+ u32 next_idx;
+ u64 cur_seq;
+ u64 next_seq;
};

#ifdef CONFIG_PRINTK
void kmsg_dump(enum kmsg_dump_reason reason);

+bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
+ char *line, size_t size, size_t *len);
+
+bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
+ char *buf, size_t size, size_t *len);
+
+void kmsg_dump_rewind(struct kmsg_dumper *dumper);
+
int kmsg_dump_register(struct kmsg_dumper *dumper);

int kmsg_dump_unregister(struct kmsg_dumper *dumper);
@@ -56,6 +71,22 @@ static inline void kmsg_dump(enum kmsg_d
{
}

+bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
+ const char *line, size_t size, size_t *len)
+{
+ return false;
+}
+
+bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
+ char *buf, size_t size, size_t *len)
+{
+ return false;
+}
+
+void kmsg_dump_rewind(struct kmsg_dumper *dumper)
+{
+}
+
static inline int kmsg_dump_register(struct kmsg_dumper *dumper)
{
return -EINVAL;
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -909,7 +909,7 @@ static int syslog_print_all(char __user
/*
* Find first record that fits, including all following records,
* into the user-provided buffer for this dump.
- */
+ */
seq = clear_seq;
idx = clear_idx;
while (seq < log_next_seq) {
@@ -919,6 +919,8 @@ static int syslog_print_all(char __user
idx = log_next(idx);
seq++;
}
+
+ /* move first record forward until length fits into the buffer */
seq = clear_seq;
idx = clear_idx;
while (len > size && seq < log_next_seq) {
@@ -929,7 +931,7 @@ static int syslog_print_all(char __user
seq++;
}

- /* last message in this dump */
+ /* last message fitting into this dump */
next_seq = log_next_seq;

len = 0;
@@ -2300,48 +2302,210 @@ module_param_named(always_kmsg_dump, alw
* kmsg_dump - dump kernel log to kernel message dumpers.
* @reason: the reason (oops, panic etc) for dumping
*
- * Iterate through each of the dump devices and call the oops/panic
- * callbacks with the log buffer.
+ * Call each of the registered dumper's dump() callback, which can
+ * retrieve the kmsg records with kmsg_dump_get_line() or
+ * kmsg_dump_get_buffer().
*/
void kmsg_dump(enum kmsg_dump_reason reason)
{
- u64 idx;
struct kmsg_dumper *dumper;
- const char *s1, *s2;
- unsigned long l1, l2;
unsigned long flags;

if ((reason > KMSG_DUMP_OOPS) && !always_kmsg_dump)
return;

- /* Theoretically, the log could move on after we do this, but
- there's not a lot we can do about that. The new messages
- will overwrite the start of what we dump. */
+ rcu_read_lock();
+ list_for_each_entry_rcu(dumper, &dump_list, list) {
+ if (dumper->max_reason && reason > dumper->max_reason)
+ continue;
+
+ /* initialize iterator with data about the stored records */
+ dumper->active = true;
+
+ raw_spin_lock_irqsave(&logbuf_lock, flags);
+ dumper->cur_seq = clear_seq;
+ dumper->cur_idx = clear_idx;
+ dumper->next_seq = log_next_seq;
+ dumper->next_idx = log_next_idx;
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+
+ /* invoke dumper which will iterate over records */
+ dumper->dump(dumper, reason);
+
+ /* reset iterator */
+ dumper->active = false;
+ }
+ rcu_read_unlock();
+}
+
+/**
+ * kmsg_dump_get_line - retrieve one kmsg log line
+ * @dumper: registered kmsg dumper
+ * @syslog: include the "<4>" prefixes
+ * @line: buffer to copy the line to
+ * @size: maximum size of the buffer
+ * @len: length of line placed into buffer
+ *
+ * Start at the beginning of the kmsg buffer, with the oldest kmsg
+ * record, and copy one record into the provided buffer.
+ *
+ * Consecutive calls will return the next available record moving
+ * towards the end of the buffer with the youngest messages.
+ *
+ * A return value of FALSE indicate that there are no more records to
+ * read.
+ */
+bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
+ char *line, size_t size, size_t *len)
+{
+ unsigned long flags;
+ struct log *msg;
+ size_t l = 0;
+ bool ret = false;
+
+ if (!dumper->active)
+ goto out;

raw_spin_lock_irqsave(&logbuf_lock, flags);
- if (syslog_seq < log_first_seq)
- idx = syslog_idx;
- else
- idx = log_first_idx;
-
- if (idx > log_next_idx) {
- s1 = log_buf;
- l1 = log_next_idx;
+ if (dumper->cur_seq < log_first_seq) {
+ /* messages are gone, move to first available one */
+ dumper->cur_seq = log_first_seq;
+ dumper->cur_idx = log_first_idx;
+ }

- s2 = log_buf + idx;
- l2 = log_buf_len - idx;
- } else {
- s1 = "";
- l1 = 0;
+ /* last entry */
+ if (dumper->cur_seq >= log_next_seq) {
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+ goto out;
+ }

- s2 = log_buf + idx;
- l2 = log_next_idx - idx;
+ msg = log_from_idx(dumper->cur_idx);
+ l = msg_print_text(msg, syslog,
+ line, size);
+
+ dumper->cur_idx = log_next(dumper->cur_idx);
+ dumper->cur_seq++;
+ ret = true;
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+out:
+ if (len)
+ *len = l;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kmsg_dump_get_line);
+
+/**
+ * kmsg_dump_get_buffer - copy kmsg log lines
+ * @dumper: registered kmsg dumper
+ * @syslog: include the "<4>" prefixes
+ * @line: buffer to copy the line to
+ * @size: maximum size of the buffer
+ * @len: length of line placed into buffer
+ *
+ * Start at the end of the kmsg buffer and fill the provided buffer
+ * with as many of the the *youngest* kmsg records that fit into it.
+ * If the buffer is large enough, all available kmsg records will be
+ * copied with a single call.
+ *
+ * Consecutive calls will fill the buffer with the next block of
+ * available older records, not including the earlier retrieved ones.
+ *
+ * A return value of FALSE indicate that there are no more records to
+ * read.
+ */
+bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
+ char *buf, size_t size, size_t *len)
+{
+ unsigned long flags;
+ u64 seq;
+ u32 idx;
+ u64 next_seq;
+ u32 next_idx;
+ size_t l = 0;
+ bool ret = false;
+
+ if (!dumper->active)
+ goto out;
+
+ raw_spin_lock_irqsave(&logbuf_lock, flags);
+ if (dumper->cur_seq < log_first_seq) {
+ /* messages are gone, move to first available one */
+ dumper->cur_seq = log_first_seq;
+ dumper->cur_idx = log_first_idx;
+ }
+
+ /* last entry */
+ if (dumper->cur_seq >= dumper->next_seq) {
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+ goto out;
+ }
+
+ /* calculate length of entire buffer */
+ seq = dumper->cur_seq;
+ idx = dumper->cur_idx;
+ while (seq < dumper->next_seq) {
+ struct log *msg = log_from_idx(idx);
+
+ l += msg_print_text(msg, true, NULL, 0);
+ idx = log_next(idx);
+ seq++;
+ }
+
+ /* move first record forward until length fits into the buffer */
+ seq = dumper->cur_seq;
+ idx = dumper->cur_idx;
+ while (l > size && seq < dumper->next_seq) {
+ struct log *msg = log_from_idx(idx);
+
+ l -= msg_print_text(msg, true, NULL, 0);
+ idx = log_next(idx);
+ seq++;
+ }
+
+ /* last message in next interation */
+ next_seq = seq;
+ next_idx = idx;
+
+ l = 0;
+ while (seq < dumper->next_seq) {
+ struct log *msg = log_from_idx(idx);
+
+ l += msg_print_text(msg, syslog,
+ buf + l, size - l);
+
+ idx = log_next(idx);
+ seq++;
}
+
+ dumper->next_seq = next_seq;
+ dumper->next_idx = next_idx;
+ ret = true;
raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+out:
+ if (len)
+ *len = l;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);

- rcu_read_lock();
- list_for_each_entry_rcu(dumper, &dump_list, list)
- dumper->dump(dumper, reason, s1, l1, s2, l2);
- rcu_read_unlock();
+/**
+ * kmsg_dump_rewind - reset the interator
+ * @dumper: registered kmsg dumper
+ *
+ * Reset the dumper's iterator so that kmsg_dump_get_line() and
+ * kmsg_dump_get_buffer() can be called again and used multiple
+ * times within the same dumper.dump() callback.
+ */
+void kmsg_dump_rewind(struct kmsg_dumper *dumper)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&logbuf_lock, flags);
+ dumper->cur_seq = clear_seq;
+ dumper->cur_idx = clear_idx;
+ dumper->next_seq = log_next_seq;
+ dumper->next_idx = log_next_idx;
+ raw_spin_unlock_irqrestore(&logbuf_lock, flags);
}
+EXPORT_SYMBOL_GPL(kmsg_dump_rewind);
#endif

2012-06-15 12:23:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer


* Kay Sievers <[email protected]> wrote:

> On Thu, 2012-06-14 at 17:04 -0700, Greg KH wrote:
>
> > As no one seems to have responded, care to resend this with a
> > signed-off-by so I can queue it up for 3.6? If people complain, it can
> > still go to 3.5, but given that no one is saying anything, I guess no
> > one cares...
>
> Sounds like a good plan. Here it is.
>
> Thanks,
> Kay
>
>
> From: Kay Sievers <[email protected]>
> Subject: kmsg - kmsg_dump() use iterator to receice log buffer content

typo.

>
> Provide an iterator to receive the log buffer content, and convert all
> kmsg_dump() user to it.

typo.

> +/**
> + * kmsg_dump_get_line - retrieve one kmsg log line
> + * @dumper: registered kmsg dumper
> + * @syslog: include the "<4>" prefixes
> + * @line: buffer to copy the line to
> + * @size: maximum size of the buffer
> + * @len: length of line placed into buffer
> + *
> + * Start at the beginning of the kmsg buffer, with the oldest kmsg
> + * record, and copy one record into the provided buffer.
> + *
> + * Consecutive calls will return the next available record moving
> + * towards the end of the buffer with the youngest messages.
> + *
> + * A return value of FALSE indicate that there are no more records to
> + * read.

typo.

> + *
> + * Start at the end of the kmsg buffer and fill the provided buffer
> + * with as many of the the *youngest* kmsg records that fit into it.
> + * If the buffer is large enough, all available kmsg records will be
> + * copied with a single call.
> + *
> + * Consecutive calls will fill the buffer with the next block of
> + * available older records, not including the earlier retrieved ones.
> + *
> + * A return value of FALSE indicate that there are no more records to
> + * read.

typo.

Thanks,

Ingo

2012-06-15 12:25:13

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, Jun 15, 2012 at 02:07:51PM +0200, Kay Sievers wrote:
[...]
> From: Kay Sievers <[email protected]>
> Subject: kmsg - kmsg_dump() use iterator to receice log buffer content
>
> Provide an iterator to receive the log buffer content, and convert all
> kmsg_dump() user to it.
>
> The structured data in the kmsg buffer now contains binary data, which
> should no longer be copied verbatim to the kmsg_dump() users.
>
> The iterator should provide reliable access to the buffer data, and also
> supports proper log line-aware chunking of data while iterating.
>
> Signed-off-by: Kay Sievers <[email protected]>

Thanks Kay, it works!

Reported-by: Anton Vorontsov <[email protected]>
Tested-by: Anton Vorontsov <[email protected]>

2012-06-15 20:54:36

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, Jun 15, 2012 at 5:07 AM, Kay Sievers <[email protected]> wrote:
> On Thu, 2012-06-14 at 17:04 -0700, Greg KH wrote:
>
>> As no one seems to have responded, care to resend this with a
>> signed-off-by so I can queue it up for 3.6? ?If people complain, it can
>> still go to 3.5, but given that no one is saying anything, I guess no
>> one cares...
>
> Sounds like a good plan. Here it is.

This got squeezed out of my TODO list ... sorry. pstore bits work fine. I'd
like to see this go into 3.5.

Thanks

Tested-by: Tony Luck <[email protected]>

2012-06-15 21:53:17

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Fri, Jun 15, 2012 at 02:23:11PM +0200, Ingo Molnar wrote:
>
> * Kay Sievers <[email protected]> wrote:
>
> > On Thu, 2012-06-14 at 17:04 -0700, Greg KH wrote:
> >
> > > As no one seems to have responded, care to resend this with a
> > > signed-off-by so I can queue it up for 3.6? If people complain, it can
> > > still go to 3.5, but given that no one is saying anything, I guess no
> > > one cares...
> >
> > Sounds like a good plan. Here it is.
> >
> > Thanks,
> > Kay
> >
> >
> > From: Kay Sievers <[email protected]>
> > Subject: kmsg - kmsg_dump() use iterator to receice log buffer content
>
> typo.
>
> >
> > Provide an iterator to receive the log buffer content, and convert all
> > kmsg_dump() user to it.
>
> typo.

<snip>

Thanks, I've fixed up all of these typos in the patch before I applied
it.

We need a "Proofread-by:" tag :)

greg k-h

Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

Hi Kay,

On Thu, May 3, 2012 at 2:29 AM, Kay Sievers <[email protected]> wrote:
> From: Kay Sievers <[email protected]>
[...]
> case SYSLOG_ACTION_SIZE_UNREAD:
> - error = log_end - log_start;
> + raw_spin_lock_irq(&logbuf_lock);
> + if (syslog_seq < log_first_seq) {
> + /* messages are gone, move to first one */
> + syslog_seq = log_first_seq;
> + syslog_idx = log_first_idx;
> + }
> + if (from_file) {
> + /*
> + * Short-cut for poll(/"proc/kmsg") which simply checks
> + * for pending data, not the size; return the count of
> + * records, not the length.
> + */
> + error = log_next_idx - syslog_idx;
> + } else {
> + u64 seq;
> + u32 idx;
> +
> + error = 0;
> + seq = syslog_seq;
> + idx = syslog_idx;
> + while (seq < log_next_seq) {
> + error += syslog_print_line(idx, NULL, 0);
> + idx = log_next(idx);
> + seq++;
> + }
> + }
> + raw_spin_unlock_irq(&logbuf_lock);
> break;
[...]

It looks as though the changes here have broken SYSLOG_ACTION_SIZE_UNREAD.

On a 2.6.31 system, immediately after SYSLOG_ACTION_READ_CLEAR, a
SYSLOG_ACTION_SIZE_UNREAD returns 0.

On 3.5, immediately after SYSLOG_ACTION_READ_CLEAR, the value returned
by SYSLOG_ACTION_SIZE_UNREAD is unchanged (i.e., assuming that the
value returned was non-zero before SYSLOG_ACTION_SIZE_UNREAD, it is
still nonzero afterward), even though a subsequent
SYSLOG_ACTION_READ_CLEAR indicates that there are zero bytes to read.

(All tests conducted with (r)syslogd stopped.)

Cheers,

Michael


--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/

2012-11-28 16:23:04

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, Nov 28, 2012 at 2:33 PM, Michael Kerrisk <[email protected]> wrote:
> On Thu, May 3, 2012 at 2:29 AM, Kay Sievers <[email protected]> wrote:
>> From: Kay Sievers <[email protected]>
> [...]
>> case SYSLOG_ACTION_SIZE_UNREAD:
>> - error = log_end - log_start;
>> + raw_spin_lock_irq(&logbuf_lock);
>> + if (syslog_seq < log_first_seq) {
>> + /* messages are gone, move to first one */
>> + syslog_seq = log_first_seq;
>> + syslog_idx = log_first_idx;
>> + }
>> + if (from_file) {
>> + /*
>> + * Short-cut for poll(/"proc/kmsg") which simply checks
>> + * for pending data, not the size; return the count of
>> + * records, not the length.
>> + */
>> + error = log_next_idx - syslog_idx;
>> + } else {
>> + u64 seq;
>> + u32 idx;
>> +
>> + error = 0;
>> + seq = syslog_seq;
>> + idx = syslog_idx;
>> + while (seq < log_next_seq) {
>> + error += syslog_print_line(idx, NULL, 0);
>> + idx = log_next(idx);
>> + seq++;
>> + }
>> + }
>> + raw_spin_unlock_irq(&logbuf_lock);
>> break;
> [...]
>
> It looks as though the changes here have broken SYSLOG_ACTION_SIZE_UNREAD.

Any specifics that it causes actual problems we need to address?

> On a 2.6.31 system, immediately after SYSLOG_ACTION_READ_CLEAR, a
> SYSLOG_ACTION_SIZE_UNREAD returns 0.

Hmm, sounds like the right thing to do.

We have read everything, even cleared the buffer for later queries. So
there is nothing to read anymore for later calls, and they will
actually never return anything if they are called, so returning 0 here
sounds like the right thing. The current SYSLOG_ACTION_SIZE_UNREAD
seems to match properly the expectations one can make for
SYSLOG_ACTION_READ_ALL.

> On 3.5, immediately after SYSLOG_ACTION_READ_CLEAR, the value returned
> by SYSLOG_ACTION_SIZE_UNREAD is unchanged
>
> (i.e., assuming that the
> value returned was non-zero before SYSLOG_ACTION_SIZE_UNREAD, it is
> still nonzero afterward), even though a subsequent
> SYSLOG_ACTION_READ_CLEAR indicates that there are zero bytes to read.

Which sounds at least like weird behaviour, if not "broken".

Any indication that we need to restore the old behaviour to fix some
weird assumptions? To me the current one sounds like the better and
more correct option, and what one would expect from it. But maybe we
cannot get away with it ...

(I hope I understood what you explained correctly, I'm a bit confused
by the issue.)


Thanks,
Kay

2012-11-28 16:37:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, Nov 28, 2012 at 8:22 AM, Kay Sievers <[email protected]> wrote:
> On Wed, Nov 28, 2012 at 2:33 PM, Michael Kerrisk <[email protected]> wrote:
>
>> On a 2.6.31 system, immediately after SYSLOG_ACTION_READ_CLEAR, a
>> SYSLOG_ACTION_SIZE_UNREAD returns 0.
>
> Hmm, sounds like the right thing to do.

Right.

And that's the *OLD* behavior (2.6.31).

>> On 3.5, immediately after SYSLOG_ACTION_READ_CLEAR, the value returned
>> by SYSLOG_ACTION_SIZE_UNREAD is unchanged

And this is the *NEW* behavior, and as you say:

> Which sounds at least like weird behaviour, if not "broken".

So the new behavior is insane and different. Let's fix it.

It looks like it is because the new SYSLOG_ACTION_SIZE_UNREAD code
does not take the new clear_seq code into account. Hmm?

Linus

2012-11-28 16:49:32

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, Nov 28, 2012 at 5:37 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Nov 28, 2012 at 8:22 AM, Kay Sievers <[email protected]> wrote:
>> On Wed, Nov 28, 2012 at 2:33 PM, Michael Kerrisk <[email protected]> wrote:
>>
>>> On a 2.6.31 system, immediately after SYSLOG_ACTION_READ_CLEAR, a
>>> SYSLOG_ACTION_SIZE_UNREAD returns 0.
>>
>> Hmm, sounds like the right thing to do.
>
> Right.
>
> And that's the *OLD* behavior (2.6.31).

Ah, hmm, I read 2.6... as 3.6... :)

> So the new behavior is insane and different. Let's fix it.

Yeah.

> It looks like it is because the new SYSLOG_ACTION_SIZE_UNREAD code
> does not take the new clear_seq code into account. Hmm?

Right, something like that. I'll take a look now ...

Thanks,
Kay

2012-11-28 17:51:31

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, 2012-11-28 at 17:49 +0100, Kay Sievers wrote:
> On Wed, Nov 28, 2012 at 5:37 PM, Linus Torvalds
> <[email protected]> wrote:
> > On Wed, Nov 28, 2012 at 8:22 AM, Kay Sievers <[email protected]> wrote:
> >> On Wed, Nov 28, 2012 at 2:33 PM, Michael Kerrisk <[email protected]> wrote:
> >>
> >>> On a 2.6.31 system, immediately after SYSLOG_ACTION_READ_CLEAR, a
> >>> SYSLOG_ACTION_SIZE_UNREAD returns 0.
> >>
> >> Hmm, sounds like the right thing to do.
> >
> > Right.
> >
> > And that's the *OLD* behavior (2.6.31).
>
> Ah, hmm, I read 2.6... as 3.6... :)
>
> > So the new behavior is insane and different. Let's fix it.
>
> Yeah.
>
> > It looks like it is because the new SYSLOG_ACTION_SIZE_UNREAD code
> > does not take the new clear_seq code into account. Hmm?
>
> Right, something like that. I'll take a look now ...

From: Kay Sievers <[email protected]>
Subject: printk: respect SYSLOG_ACTION_READ_CLEAR for SYSLOG_ACTION_SIZE_UNREAD

On Wed, Nov 28, 2012 at 2:33 PM, Michael Kerrisk <[email protected]> wrote:
> It looks as though the changes here have broken SYSLOG_ACTION_SIZE_UNREAD.
>
> On a 2.6.31 system, immediately after SYSLOG_ACTION_READ_CLEAR, a
> SYSLOG_ACTION_SIZE_UNREAD returns 0.
>
> On 3.5, immediately after SYSLOG_ACTION_READ_CLEAR, the value returned
> by SYSLOG_ACTION_SIZE_UNREAD is unchanged (i.e., assuming that the
> value returned was non-zero before SYSLOG_ACTION_SIZE_UNREAD, it is
> still nonzero afterward), even though a subsequent
> SYSLOG_ACTION_READ_CLEAR indicates that there are zero bytes to read.

Fix SYSLOG_ACTION_SIZE_UNREAD to return the amount of available
characters by starting to count at the first available record after
the last SYSLOG_ACTION_READ_CLEAR, instead of the first message
record for SYSLOG_ACTION_READ.

Before:
syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965
syslog(SYSLOG_ACTION_READ_CLEAR, "<12>"..., 1000000) = 24000
syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965

After:
syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 90402
syslog(SYSLOG_ACTION_READ_CLEAR, "<5>"..., 1000000) = 90402
syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 0

Reported-By: Michael Kerrisk <[email protected]>
Signed-Off-By: Kay Sievers <[email protected]>
---
printk.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/kernel/printk.c b/kernel/printk.c
index 2d607f4..35a7f4f 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1183,12 +1183,10 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
/* Number of chars in the log buffer */
case SYSLOG_ACTION_SIZE_UNREAD:
raw_spin_lock_irq(&logbuf_lock);
- if (syslog_seq < log_first_seq) {
+ if (clear_seq < log_first_seq) {
/* messages are gone, move to first one */
- syslog_seq = log_first_seq;
- syslog_idx = log_first_idx;
- syslog_prev = 0;
- syslog_partial = 0;
+ clear_seq = log_first_seq;
+ clear_idx = log_first_idx;
}
if (from_file) {
/*
@@ -1198,9 +1196,9 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
*/
error = log_next_idx - syslog_idx;
} else {
- u64 seq = syslog_seq;
- u32 idx = syslog_idx;
- enum log_flags prev = syslog_prev;
+ u64 seq = clear_seq;
+ u32 idx = clear_idx;
+ enum log_flags prev = 0;

error = 0;
while (seq < log_next_seq) {
@@ -1211,7 +1209,6 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
seq++;
prev = msg->flags;
}
- error -= syslog_partial;
}
raw_spin_unlock_irq(&logbuf_lock);
break;

Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Wed, Nov 28, 2012 at 6:51 PM, Kay Sievers <[email protected]> wrote:
> On Wed, 2012-11-28 at 17:49 +0100, Kay Sievers wrote:
>> On Wed, Nov 28, 2012 at 5:37 PM, Linus Torvalds
>> <[email protected]> wrote:
>> > On Wed, Nov 28, 2012 at 8:22 AM, Kay Sievers <[email protected]> wrote:
>> >> On Wed, Nov 28, 2012 at 2:33 PM, Michael Kerrisk <[email protected]> wrote:
>> >>
>> >>> On a 2.6.31 system, immediately after SYSLOG_ACTION_READ_CLEAR, a
>> >>> SYSLOG_ACTION_SIZE_UNREAD returns 0.
>> >>
>> >> Hmm, sounds like the right thing to do.
>> >
>> > Right.
>> >
>> > And that's the *OLD* behavior (2.6.31).
>>
>> Ah, hmm, I read 2.6... as 3.6... :)
>>
>> > So the new behavior is insane and different. Let's fix it.
>>
>> Yeah.
>>
>> > It looks like it is because the new SYSLOG_ACTION_SIZE_UNREAD code
>> > does not take the new clear_seq code into account. Hmm?
>>
>> Right, something like that. I'll take a look now ...
>
> From: Kay Sievers <[email protected]>
> Subject: printk: respect SYSLOG_ACTION_READ_CLEAR for SYSLOG_ACTION_SIZE_UNREAD
>
> On Wed, Nov 28, 2012 at 2:33 PM, Michael Kerrisk <[email protected]> wrote:
>> It looks as though the changes here have broken SYSLOG_ACTION_SIZE_UNREAD.
>>
>> On a 2.6.31 system, immediately after SYSLOG_ACTION_READ_CLEAR, a
>> SYSLOG_ACTION_SIZE_UNREAD returns 0.
>>
>> On 3.5, immediately after SYSLOG_ACTION_READ_CLEAR, the value returned
>> by SYSLOG_ACTION_SIZE_UNREAD is unchanged (i.e., assuming that the
>> value returned was non-zero before SYSLOG_ACTION_SIZE_UNREAD, it is
>> still nonzero afterward), even though a subsequent
>> SYSLOG_ACTION_READ_CLEAR indicates that there are zero bytes to read.
>
> Fix SYSLOG_ACTION_SIZE_UNREAD to return the amount of available
> characters by starting to count at the first available record after
> the last SYSLOG_ACTION_READ_CLEAR, instead of the first message
> record for SYSLOG_ACTION_READ.
>
> Before:
> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965
> syslog(SYSLOG_ACTION_READ_CLEAR, "<12>"..., 1000000) = 24000
> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965
>
> After:
> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 90402
> syslog(SYSLOG_ACTION_READ_CLEAR, "<5>"..., 1000000) = 90402
> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 0
>
> Reported-By: Michael Kerrisk <[email protected]>
> Signed-Off-By: Kay Sievers <[email protected]>
> ---
> printk.c | 15 ++++++---------
> 1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/printk.c b/kernel/printk.c
> index 2d607f4..35a7f4f 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -1183,12 +1183,10 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
> /* Number of chars in the log buffer */
> case SYSLOG_ACTION_SIZE_UNREAD:
> raw_spin_lock_irq(&logbuf_lock);
> - if (syslog_seq < log_first_seq) {
> + if (clear_seq < log_first_seq) {
> /* messages are gone, move to first one */
> - syslog_seq = log_first_seq;
> - syslog_idx = log_first_idx;
> - syslog_prev = 0;
> - syslog_partial = 0;
> + clear_seq = log_first_seq;
> + clear_idx = log_first_idx;
> }
> if (from_file) {
> /*
> @@ -1198,9 +1196,9 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
> */
> error = log_next_idx - syslog_idx;
> } else {
> - u64 seq = syslog_seq;
> - u32 idx = syslog_idx;
> - enum log_flags prev = syslog_prev;
> + u64 seq = clear_seq;
> + u32 idx = clear_idx;
> + enum log_flags prev = 0;
>
> error = 0;
> while (seq < log_next_seq) {
> @@ -1211,7 +1209,6 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
> seq++;
> prev = msg->flags;
> }
> - error -= syslog_partial;
> }
> raw_spin_unlock_irq(&logbuf_lock);
> break;

Hello Kay,

I'm going to call my report yesterday bogus. Somewhere along the way,
I got confused while testing something, and my statement about 2.6.31
behavior is wrong: the 2.6.31 and 3.5 behaviors are the same. As such,
your patch is unneeded. Sorry for wasting your time.

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

2012-11-29 13:28:23

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, Nov 29, 2012 at 2:18 PM, Michael Kerrisk (man-pages)
<[email protected]> wrote:
> On Wed, Nov 28, 2012 at 6:51 PM, Kay Sievers <[email protected]> wrote:

>> Before:
>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965
>> syslog(SYSLOG_ACTION_READ_CLEAR, "<12>"..., 1000000) = 24000
>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965
>>
>> After:
>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 90402
>> syslog(SYSLOG_ACTION_READ_CLEAR, "<5>"..., 1000000) = 90402
>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 0

> I'm going to call my report yesterday bogus. Somewhere along the way,
> I got confused while testing something, and my statement about 2.6.31
> behavior is wrong: the 2.6.31 and 3.5 behaviors are the same. As such,
> your patch is unneeded. Sorry for wasting your time.

I think you have been right with your report. The above pasted
before/after from the patch commit text is actually a result of real
testing with current git. And your initial description sounds right,
and the patch seems to produce the expected results here. I just
confused the numbers in your report and wrongly parsed 2.6 > 3.6.

Hmm, at least do far we did not blame anybody else than ourselves as
confused. One of us at least is right, and it looks you have been, and
I also think the patch is at least intended to be right. :)

Kay

Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, Nov 29, 2012 at 2:28 PM, Kay Sievers <[email protected]> wrote:
> On Thu, Nov 29, 2012 at 2:18 PM, Michael Kerrisk (man-pages)
> <[email protected]> wrote:
>> On Wed, Nov 28, 2012 at 6:51 PM, Kay Sievers <[email protected]> wrote:
>
>>> Before:
>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965
>>> syslog(SYSLOG_ACTION_READ_CLEAR, "<12>"..., 1000000) = 24000
>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965
>>>
>>> After:
>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 90402
>>> syslog(SYSLOG_ACTION_READ_CLEAR, "<5>"..., 1000000) = 90402
>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 0
>
>> I'm going to call my report yesterday bogus. Somewhere along the way,
>> I got confused while testing something, and my statement about 2.6.31
>> behavior is wrong: the 2.6.31 and 3.5 behaviors are the same. As such,
>> your patch is unneeded. Sorry for wasting your time.
>
> I think you have been right with your report. The above pasted
> before/after from the patch commit text is actually a result of real
> testing with current git. And your initial description sounds right,
> and the patch seems to produce the expected results here. I just
> confused the numbers in your report and wrongly parsed 2.6 > 3.6.
>
> Hmm, at least do far we did not blame anybody else than ourselves as
> confused. One of us at least is right, and it looks you have been, and
> I also think the patch is at least intended to be right. :)

Okay -- I'm pretty sure I am right about being wrong ;-).

Could you do some comparative testing please between 3.5 and pre-3.5.
I have a little test program below. When I rechecked 2.6.31 and 3.5
using this program I found the behavior was the same, which is why I
conclude my report is wrong. (And also, your proposed patch in
response to my bogus report produces different behavior from 2.6.31).

Thanks,

Michael

/*#* t_klogctl.c

klogctl(3) is the glibc interface for syslog(2)

*/

#include <sys/klog.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

#define errMsg(msg) do { perror(msg); } while (0)

#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)

#define fatal(msg) do { fprintf(stderr, "%s\n", msg); \
exit(EXIT_FAILURE); } while (0)


/* Close the log. Currently a NOP. */
#define SYSLOG_ACTION_CLOSE 0
/* Open the log. Currently a NOP. */
#define SYSLOG_ACTION_OPEN 1
/* Read from the log. */
#define SYSLOG_ACTION_READ 2
/* Read all messages remaining in the ring buffer. */
#define SYSLOG_ACTION_READ_ALL 3
/* Read and clear all messages remaining in the ring buffer */
#define SYSLOG_ACTION_READ_CLEAR 4
/* Clear ring buffer. */
#define SYSLOG_ACTION_CLEAR 5
/* Disable printk's to console */
#define SYSLOG_ACTION_CONSOLE_OFF 6
/* Enable printk's to console */
#define SYSLOG_ACTION_CONSOLE_ON 7
/* Set level of messages printed to console */
#define SYSLOG_ACTION_CONSOLE_LEVEL 8
/* Return number of unread characters in the log buffer */
#define SYSLOG_ACTION_SIZE_UNREAD 9
/* Return size of the log buffer */
#define SYSLOG_ACTION_SIZE_BUFFER 10


int
main(int argc, char *argv[])
{
int s;
#define BSIZE 500000
char buf[BSIZE];

if (argc < 2) {
fprintf(stderr, "Usage: %s command-num [arg]\n", argv[0]);
exit(EXIT_FAILURE);
}

int cmd;

cmd = atoi(argv[1]);

switch (cmd) {

case SYSLOG_ACTION_READ: /* 2 */
printf("SYSLOG_ACTION_READ\n");
s = klogctl(SYSLOG_ACTION_READ, buf,
(argc > 2) ? atoi(argv[2]) : BSIZE);
if (s == -1)
errExit("klogctl-READ");
printf("Return value: %d\n", s);

if (write(2, buf, s) != s)
errMsg("write");
break;

case SYSLOG_ACTION_READ_ALL: /* 3 */
case SYSLOG_ACTION_READ_CLEAR: /* 4 */
if (cmd == SYSLOG_ACTION_READ_CLEAR)
printf("SYSLOG_ACTION_READ_CLEAR\n");
else
printf("SYSLOG_ACTION_READ_ALL\n");


s = klogctl(cmd, buf,
(argc > 2) ? atoi(argv[2]) : BSIZE);
if (s == -1)
errExit("klogctl-READ_ALL/READ_CLEAR");
printf("Return value: %d\n", s);

if (write(2, buf, s) != s)
errMsg("write");

if (cmd == SYSLOG_ACTION_READ_CLEAR)
printf("Ring buffer cleared\n");
break;

case SYSLOG_ACTION_CLEAR: /* 5 */
printf("SYSLOG_ACTION_CLEAR\n");
s = klogctl(SYSLOG_ACTION_CLEAR, NULL, 0);
if (s == -1)
errExit("klogctl-CLEAR");
printf("Return value: %d\n", s);
printf("Ring buffer cleared\n");
break;

case SYSLOG_ACTION_SIZE_UNREAD: /* 9 */
printf("SYSLOG_ACTION_SIZE_UNREAD\n");
s = klogctl(SYSLOG_ACTION_SIZE_UNREAD, NULL, 0);
if (s == -1)
errExit("klogctl-SIZE_UNREAD");
printf("Number of unread bytes: %d\n", s);
break;

case SYSLOG_ACTION_SIZE_BUFFER: /* 10 */
printf("SYSLOG_ACTION_SIZE_BUFFER\n");
s = klogctl(SYSLOG_ACTION_SIZE_BUFFER, NULL, 0);
if (s == -1)
errExit("klogctl-SIZE_BUFFER");
printf("Size of buffer: %d\n", s);
break;

default:
fatal("Bad command");
break;
}

exit(EXIT_SUCCESS);
}

2012-11-29 14:09:03

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, Nov 29, 2012 at 2:37 PM, Michael Kerrisk (man-pages)
<[email protected]> wrote:
> On Thu, Nov 29, 2012 at 2:28 PM, Kay Sievers <[email protected]> wrote:
>> On Thu, Nov 29, 2012 at 2:18 PM, Michael Kerrisk (man-pages)
>> <[email protected]> wrote:
>>> On Wed, Nov 28, 2012 at 6:51 PM, Kay Sievers <[email protected]> wrote:
>>
>>>> Before:
>>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965
>>>> syslog(SYSLOG_ACTION_READ_CLEAR, "<12>"..., 1000000) = 24000
>>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965
>>>>
>>>> After:
>>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 90402
>>>> syslog(SYSLOG_ACTION_READ_CLEAR, "<5>"..., 1000000) = 90402
>>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 0
>>
>>> I'm going to call my report yesterday bogus. Somewhere along the way,
>>> I got confused while testing something, and my statement about 2.6.31
>>> behavior is wrong: the 2.6.31 and 3.5 behaviors are the same. As such,
>>> your patch is unneeded. Sorry for wasting your time.
>>
>> I think you have been right with your report. The above pasted
>> before/after from the patch commit text is actually a result of real
>> testing with current git. And your initial description sounds right,
>> and the patch seems to produce the expected results here. I just
>> confused the numbers in your report and wrongly parsed 2.6 > 3.6.
>>
>> Hmm, at least do far we did not blame anybody else than ourselves as
>> confused. One of us at least is right, and it looks you have been, and
>> I also think the patch is at least intended to be right. :)
>
> Okay -- I'm pretty sure I am right about being wrong ;-).
>
> Could you do some comparative testing please between 3.5 and pre-3.5.
> I have a little test program below. When I rechecked 2.6.31 and 3.5
> using this program I found the behavior was the same, which is why I
> conclude my report is wrong. (And also, your proposed patch in
> response to my bogus report produces different behavior from 2.6.31).

Oh, seems you are right.

The old kernel does not return 0, while it probably should. The
current kernel seems to do the same thing.

But the behaviour with the patch stills seems like the better and the
obvious and expected behaviour to me. :)

Thanks,
Kay

######### old fedora kernel (pre-new-logging) ###############
# uname -a
Linux nja 3.4.4-3.fc17.x86_64

# echo 1234567890 > /dev/kmsg

# ./syslog-test 9
SYSLOG_ACTION_SIZE_UNREAD
Number of unread bytes: 95299

# ./syslog-test 4
SYSLOG_ACTION_READ_CLEAR
Return value: 29
<4>[ 54.933282] 1234567890
Ring buffer cleared

# ./syslog-test 9
SYSLOG_ACTION_SIZE_UNREAD
Number of unread bytes: 95299

######### current fedora kernel (new logging) ############

# uname -a
Linux nja 3.6.6-3.fc18.x86_64

# echo 1234567890 > /dev/kmsg

# ./syslog-test 9
SYSLOG_ACTION_SIZE_UNREAD
Number of unread bytes: 286822

# ./syslog-test 4
SYSLOG_ACTION_READ_CLEAR
Return value: 31
<12>[259301.067699] 1234567890
Ring buffer cleared

# ./syslog-test 9
SYSLOG_ACTION_SIZE_UNREAD
Number of unread bytes: 286822

######### git kernel with the above patch applied #############

# uname -a
Linux nja 3.7.0-rc7+

# echo 1234567890 > /dev/kmsg

# ./syslog-test 9
SYSLOG_ACTION_SIZE_UNREAD
Number of unread bytes: 30

# ./syslog-test 4
SYSLOG_ACTION_READ_CLEAR
Return value: 30
<12>[ 69.591745] 1234567890
Ring buffer cleared

# ./syslog-test 9
SYSLOG_ACTION_SIZE_UNREAD
Number of unread bytes: 0

Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, Nov 29, 2012 at 3:08 PM, Kay Sievers <[email protected]> wrote:
> On Thu, Nov 29, 2012 at 2:37 PM, Michael Kerrisk (man-pages)
> <[email protected]> wrote:
>> On Thu, Nov 29, 2012 at 2:28 PM, Kay Sievers <[email protected]> wrote:
>>> On Thu, Nov 29, 2012 at 2:18 PM, Michael Kerrisk (man-pages)
>>> <[email protected]> wrote:
>>>> On Wed, Nov 28, 2012 at 6:51 PM, Kay Sievers <[email protected]> wrote:
>>>
>>>>> Before:
>>>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965
>>>>> syslog(SYSLOG_ACTION_READ_CLEAR, "<12>"..., 1000000) = 24000
>>>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 286965
>>>>>
>>>>> After:
>>>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 90402
>>>>> syslog(SYSLOG_ACTION_READ_CLEAR, "<5>"..., 1000000) = 90402
>>>>> syslog(SYSLOG_ACTION_SIZE_UNREAD, 0, 0) = 0
>>>
>>>> I'm going to call my report yesterday bogus. Somewhere along the way,
>>>> I got confused while testing something, and my statement about 2.6.31
>>>> behavior is wrong: the 2.6.31 and 3.5 behaviors are the same. As such,
>>>> your patch is unneeded. Sorry for wasting your time.
>>>
>>> I think you have been right with your report. The above pasted
>>> before/after from the patch commit text is actually a result of real
>>> testing with current git. And your initial description sounds right,
>>> and the patch seems to produce the expected results here. I just
>>> confused the numbers in your report and wrongly parsed 2.6 > 3.6.
>>>
>>> Hmm, at least do far we did not blame anybody else than ourselves as
>>> confused. One of us at least is right, and it looks you have been, and
>>> I also think the patch is at least intended to be right. :)
>>
>> Okay -- I'm pretty sure I am right about being wrong ;-).
>>
>> Could you do some comparative testing please between 3.5 and pre-3.5.
>> I have a little test program below. When I rechecked 2.6.31 and 3.5
>> using this program I found the behavior was the same, which is why I
>> conclude my report is wrong. (And also, your proposed patch in
>> response to my bogus report produces different behavior from 2.6.31).
>
> Oh, seems you are right.
>
> The old kernel does not return 0, while it probably should. The
> current kernel seems to do the same thing.
>
> But the behaviour with the patch stills seems like the better and the
> obvious and expected behaviour to me. :)

The point here I think is that the semantics of the various syslog()
commands are surprising, which is what led me into some confusion in
testing. Essentially, command 5 ("clear ring buffer") does not really
clear anything, it simply sets bookkeeping variables that affect the
behavior of commands 3 and 4. Of particular note is that command 5
does NOT affect commands 2 and 9, and command 9 is only returning the
number of bytes that would be read by command 2. The man page could do
with some improvement (and will get it).

So, just to be clear: you better not apply your patch; it might break
something ;-).

Cheers,

Michael

2012-11-29 14:31:27

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer

On Thu, Nov 29, 2012 at 3:18 PM, Michael Kerrisk (man-pages)
<[email protected]> wrote:
> So, just to be clear: you better not apply your patch; it might break
> something ;-).

Sounds good to me. :)

Thanks,
Kay