2019-02-12 14:43:02

by John Ogness

Subject: [RFC PATCH v1 00/25] printk: new implementation

Hello,

As probably many of you are aware, the current printk implementation
has some issues. This series (against 5.0-rc6) makes some fundamental
changes in an attempt to address these issues. The particular issues I
am referring to:

1. The printk buffer is protected by a global raw spinlock for readers
and writers. This restricts the contexts that are allowed to
access the buffer.

2. Because of #1, NMI and recursive contexts are handled by deferring
logging/printing to a spinlock-safe context. This means that
messages will not be visible if (for example) the kernel dies in
NMI context and the irq_work mechanism does not survive.

3. Because of #1, when *not* using features such as PREEMPT_RT, large
latencies exist when printing to slow consoles.

4. Because of #1, when _using_ features such as PREEMPT_RT, printing
to the consoles is further restricted to contexts that can sleep.
This can cause large latencies in seeing the messages.

5. Printing to consoles is the responsibility of the printk caller
and that caller may be required to print many messages that other
printk callers inserted. Because of this there can be enormous
variance in the runtime of a printk call.

6. The timestamps on the printk messages are generated when the
message is inserted into the buffer. But the call to printk may
have occurred long before that. This is especially true for
messages in the printk_safe context. This means that printk timing
information (although neatly "sorted") is neither accurate nor
reliable.

7. Loglevel INFO is handled the same as ERR. There seems to be an
endless effort to get printk to show _all_ messages as quickly as
possible in case of a panic (i.e. printing from any context), but
at the same time try not to have printk be too intrusive for the
callers. These are conflicting requirements that lead to a printk
implementation that does a sub-optimal job of satisfying both
sides.

To address these issues this series makes the following changes:

- Implements a new printk ringbuffer that supports lockless multiple
readers. Writers are synchronized per-cpu with support for all
contexts (including NMI). (This implementation was inspired by a
post[0] from Peter Zijlstra.)

- The new printk ringbuffer uses the initialized data segment of the
kernel for its data buffer so that it is available during early boot.

- Timestamps are captured at the beginning of the printk call.

- A dedicated kernel thread is created for printing to all consoles in
a fully preemptible context.

- A new (optional) console operation "write_atomic" is introduced that
console drivers may implement. This function must be NMI-safe. An
implementation for the 8250 UART driver is provided.

- The concept of "emergency messages" is introduced that allows
important messages (based on a new emergency loglevel threshold) to
be immediately written to any consoles supporting write_atomic,
regardless of the context. This allows non-emergency printk calls
(i.e. INFO) to run in nearly constant time, with their console
printing taking place in a separate fully preemptible context. And
emergency messages (i.e. ERR) are printed immediately for the user.

- Individual emergency messages are written contiguously and a CPU-ID
field is added to all output to allow for sorting of messages being
printed by multiple CPUs simultaneously.

Although the RFC series works, there are some open issues that I have
left open until I receive some initial feedback from the community.
These issues are:

- The behavior of LOG_CONT has been changed. Rather than using the
current task as the "cont owner", the CPU ID is used. NMIs have
their own cont buffer so NMI and non-NMI tasks can safely use
LOG_CONT.

- The runtime resizing features of the printk buffer are not
implemented.

- The exported vmcore symbols relating to the printk buffer no longer
exist and no replacements have been defined. I do not know all the
userspace consequences of making a change here.

- The introduction of the CPU-ID field not only changes the usual
printk output, but also defines a new field in the extended console
output. I do not know the userspace consequences of making a change
here.

- console_flush_on_panic() is currently a NOP. It is pretty clear how
this could be implemented if write_atomic were available. But if no
such console is registered, it is not clear what should be done. Is
this function really even needed?

- Right now the emergency messages are set apart from the
non-emergency messages using '\n'. There have been requests to
support special, specifiable markers to make parsing easier.
Possibly as CONFIG_ and boot options?

- Be aware that printk output is no longer time-sorted. Actually, it
never was, but now you see the real timestamps. This seems strange
at first.

- The ringbuffer API is not very pretty. It grew to be what it is
due to the varying requirements of the different aspects of printk
(syslog, kmsg_dump, /dev/kmsg, console) and the complexity of
handling lockless reading, which can fall behind at any moment.

- Memory barriers are not my specialty. A critical eye on their
usage (or lack thereof) in the ringbuffer code would be greatly
appreciated.

The first 7 patches introduce the new printk ringbuffer. The
remaining 18 go through and replace the various components of the
printk implementation. All patches are against 5.0-rc6 and each
yields a buildable/testable system.

John Ogness

[0] http://lkml.kernel.org/r/[email protected]

John Ogness (25):
printk-rb: add printk ring buffer documentation
printk-rb: add prb locking functions
printk-rb: define ring buffer struct and initializer
printk-rb: add writer interface
printk-rb: add basic non-blocking reading interface
printk-rb: add blocking reader support
printk-rb: add functionality required by printk
printk: add ring buffer and kthread
printk: remove exclusive console hack
printk: redirect emit/store to new ringbuffer
printk_safe: remove printk safe code
printk: minimize console locking implementation
printk: track seq per console
printk: do boot_delay_msec inside printk_delay
printk: print history for new consoles
printk: implement CON_PRINTBUFFER
printk: add processor number to output
console: add write_atomic interface
printk: introduce emergency messages
serial: 8250: implement write_atomic
printk: implement KERN_CONT
printk: implement /dev/kmsg
printk: implement syslog
printk: implement kmsg_dump
printk: remove unused code

Documentation/printk-ringbuffer.txt | 377 +++++++
drivers/tty/serial/8250/8250.h | 4 +
drivers/tty/serial/8250/8250_core.c | 19 +-
drivers/tty/serial/8250/8250_dma.c | 5 +-
drivers/tty/serial/8250/8250_port.c | 154 ++-
fs/proc/kmsg.c | 4 +-
include/linux/console.h | 6 +
include/linux/hardirq.h | 2 -
include/linux/kmsg_dump.h | 6 +-
include/linux/printk.h | 30 +-
include/linux/printk_ringbuffer.h | 114 +++
include/linux/serial_8250.h | 5 +
init/main.c | 1 -
kernel/kexec_core.c | 1 -
kernel/panic.c | 3 -
kernel/printk/Makefile | 1 -
kernel/printk/internal.h | 79 --
kernel/printk/printk.c | 1895 +++++++++++++++++------------------
kernel/printk/printk_safe.c | 427 --------
kernel/trace/trace.c | 2 -
lib/Kconfig.debug | 17 +
lib/Makefile | 2 +-
lib/bust_spinlocks.c | 3 +-
lib/nmi_backtrace.c | 6 -
lib/printk_ringbuffer.c | 583 +++++++++++
25 files changed, 2137 insertions(+), 1609 deletions(-)
create mode 100644 Documentation/printk-ringbuffer.txt
create mode 100644 include/linux/printk_ringbuffer.h
delete mode 100644 kernel/printk/internal.h
delete mode 100644 kernel/printk/printk_safe.c
create mode 100644 lib/printk_ringbuffer.c

--
2.11.0



2019-02-12 14:40:47

by John Ogness

Subject: [RFC PATCH v1 07/25] printk-rb: add functionality required by printk

The printk subsystem needs to be able to query the size of the ring
buffer, seek to specific entries within the ring buffer, and track
if records could not be stored in the ring buffer.
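
A rough sketch of how a caller might use these additions follows
(illustrative only; printk_rb is the static ring buffer declared
elsewhere in this series, and size/wanted_seq are caller-provided
values):

	struct prb_iterator iter;
	struct prb_handle h;
	char text[128];
	char *buf;
	u64 seq;

	/* writer side: flag a dropped record if reservation fails */
	buf = prb_reserve(&h, &printk_rb, size);
	if (!buf)
		prb_inc_lost(&printk_rb);

	/* reader side: jump to a known sequence number */
	prb_iter_init(&iter, &printk_rb, NULL);
	if (prb_iter_seek(&iter, wanted_seq) == 1)
		prb_iter_data(&iter, text, sizeof(text), &seq);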

Signed-off-by: John Ogness <[email protected]>
---
include/linux/printk_ringbuffer.h | 5 +++
lib/printk_ringbuffer.c | 95 +++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)

diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
index 106f20ef8b4d..ec3d7ceec378 100644
--- a/include/linux/printk_ringbuffer.h
+++ b/include/linux/printk_ringbuffer.h
@@ -17,6 +17,7 @@ struct printk_ringbuffer {
unsigned int size_bits;

u64 seq;
+ atomic_long_t lost;

atomic_long_t tail;
atomic_long_t head;
@@ -78,6 +79,7 @@ static struct printk_ringbuffer name = { \
.buffer = &_##name##_buffer[0], \
.size_bits = szbits, \
.seq = 0, \
+ .lost = ATOMIC_LONG_INIT(0), \
.tail = ATOMIC_LONG_INIT(-111 * sizeof(long)), \
.head = ATOMIC_LONG_INIT(-111 * sizeof(long)), \
.reserve = ATOMIC_LONG_INIT(-111 * sizeof(long)), \
@@ -100,9 +102,12 @@ void prb_iter_copy(struct prb_iterator *dest, struct prb_iterator *src);
int prb_iter_next(struct prb_iterator *iter, char *buf, int size, u64 *seq);
int prb_iter_wait_next(struct prb_iterator *iter, char *buf, int size,
u64 *seq);
+int prb_iter_seek(struct prb_iterator *iter, u64 seq);
int prb_iter_data(struct prb_iterator *iter, char *buf, int size, u64 *seq);

/* utility functions */
+int prb_buffer_size(struct printk_ringbuffer *rb);
+void prb_inc_lost(struct printk_ringbuffer *rb);
void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);
void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);

diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
index c2ddf4cb9f92..ce33b5add5a1 100644
--- a/lib/printk_ringbuffer.c
+++ b/lib/printk_ringbuffer.c
@@ -175,11 +175,16 @@ void prb_commit(struct prb_handle *h)
head = PRB_WRAP_LPOS(rb, head, 1);
continue;
}
+ while (atomic_long_read(&rb->lost)) {
+ atomic_long_dec(&rb->lost);
+ rb->seq++;
+ }
e->seq = ++rb->seq;
head += e->size;
changed = true;
}
atomic_long_set_release(&rb->head, res);
+
atomic_dec(&rb->ctx);

if (atomic_long_read(&rb->reserve) == res)
@@ -486,3 +491,93 @@ int prb_iter_wait_next(struct prb_iterator *iter, char *buf, int size, u64 *seq)

return ret;
}
+
+/*
+ * prb_iter_seek: Seek forward to a specific record.
+ * @iter: Iterator to advance.
+ * @seq: Record number to advance to.
+ *
+ * Advance @iter such that a following call to prb_iter_data() will provide
+ * the contents of the specified record. If a record is specified that does
+ * not yet exist, advance @iter to the end of the record list.
+ *
+ * Note that iterators cannot be rewound. So if a record is requested that
+ * exists but is previous to @iter in position, @iter is considered invalid.
+ *
+ * It is safe to call this function from any context and state.
+ *
+ * Returns 1 on success, 0 if the specified record does not yet exist (@iter is
+ * now at the end of the list), or -EINVAL if @iter is now invalid.
+ */
+int prb_iter_seek(struct prb_iterator *iter, u64 seq)
+{
+ u64 cur_seq;
+ int ret;
+
+ /* first check if the iterator is already at the wanted seq */
+ if (seq == 0) {
+ if (iter->lpos == PRB_INIT)
+ return 1;
+ else
+ return -EINVAL;
+ }
+ if (iter->lpos != PRB_INIT) {
+ if (prb_iter_data(iter, NULL, 0, &cur_seq) >= 0) {
+ if (cur_seq == seq)
+ return 1;
+ if (cur_seq > seq)
+ return -EINVAL;
+ }
+ }
+
+ /* iterate to find the wanted seq */
+ for (;;) {
+ ret = prb_iter_next(iter, NULL, 0, &cur_seq);
+ if (ret <= 0)
+ break;
+
+ if (cur_seq == seq)
+ break;
+
+ if (cur_seq > seq) {
+ ret = -EINVAL;
+ break;
+ }
+ }
+
+ return ret;
+}
+
+/*
+ * prb_buffer_size: Get the size of the ring buffer.
+ * @rb: The ring buffer to get the size of.
+ *
+ * Return the number of bytes used for the ring buffer entry storage area.
+ * Note that this area stores both entry header and entry data. Therefore
+ * this represents an upper bound to the amount of data that can be stored
+ * in the ring buffer.
+ *
+ * It is safe to call this function from any context and state.
+ *
+ * Returns the size in bytes of the entry storage area.
+ */
+int prb_buffer_size(struct printk_ringbuffer *rb)
+{
+ return PRB_SIZE(rb);
+}
+
+/*
+ * prb_inc_lost: Increment the seq counter to signal a lost record.
+ * @rb: The ring buffer to increment the seq of.
+ *
+ * Increment the seq counter so that a seq number is intentionally missing
+ * for the readers. This allows readers to identify that a record is
+ * missing. A writer will typically use this function if prb_reserve()
+ * fails.
+ *
+ * It is safe to call this function from any context and state.
+ */
+void prb_inc_lost(struct printk_ringbuffer *rb)
+{
+ atomic_long_inc(&rb->lost);
+}
--
2.11.0


2019-02-12 14:41:22

by John Ogness

Subject: [RFC PATCH v1 14/25] printk: do boot_delay_msec inside printk_delay

Both functions always need to be called one after the other, so just
integrate boot_delay_msec() into printk_delay() to simplify the
callers.
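
Call sites then collapse from (sketch):

	boot_delay_msec(msg->level);
	printk_delay();

to:

	printk_delay(msg->level);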

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 35 +++++++++++++++++------------------
1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ebd9aac06323..897219f34cab 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1453,6 +1453,21 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
return do_syslog(type, buf, len, SYSLOG_FROM_READER);
}

+int printk_delay_msec __read_mostly;
+
+static inline void printk_delay(int level)
+{
+ boot_delay_msec(level);
+ if (unlikely(printk_delay_msec)) {
+ int m = printk_delay_msec;
+
+ while (m--) {
+ mdelay(1);
+ touch_nmi_watchdog();
+ }
+ }
+}
+
static void print_console_dropped(struct console *con, u64 count)
{
char text[64];
@@ -1534,20 +1549,6 @@ static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
}
}

-int printk_delay_msec __read_mostly;
-
-static inline void printk_delay(void)
-{
- if (unlikely(printk_delay_msec)) {
- int m = printk_delay_msec;
-
- while (m--) {
- mdelay(1);
- touch_nmi_watchdog();
- }
- }
-}
-
/* FIXME: no support for LOG_CONT */
#if 0
/*
@@ -2506,10 +2507,8 @@ static int printk_kthread_func(void *data)
console_lock();
call_console_drivers(master_seq, ext_text,
ext_len, text, len);
- if (len > 0 || ext_len > 0) {
- boot_delay_msec(msg->level);
- printk_delay();
- }
+ if (len > 0 || ext_len > 0)
+ printk_delay(msg->level);
console_unlock();
}

--
2.11.0


2019-02-12 14:41:53

by John Ogness

Subject: [RFC PATCH v1 17/25] printk: add processor number to output

It can be difficult to sort out printk output when multiple
processors are printing simultaneously. Add the processor number to
the printk output to allow the messages to be sorted.
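
With this change the extended console (/dev/kmsg) header gains a
trailing CPU field, e.g. (hypothetical values):

    6,102,349681,-,3;some message text

and the human-readable output carries a fixed-width CPU prefix:

    [    0.349681] 003: some message text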

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index b97d4195b09a..cde036d8487a 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -331,6 +331,7 @@ enum log_flags {

struct printk_log {
u64 ts_nsec; /* timestamp in nanoseconds */
+ u16 cpu; /* cpu that generated record */
u16 len; /* length of entire record */
u16 text_len; /* length of text buffer */
u16 dict_len; /* length of dictionary buffer */
@@ -475,7 +476,7 @@ static u32 log_next(u32 idx)

/* insert record into the buffer, discard old ones, update heads */
static int log_store(int facility, int level,
- enum log_flags flags, u64 ts_nsec,
+ enum log_flags flags, u64 ts_nsec, u16 cpu,
const char *dict, u16 dict_len,
const char *text, u16 text_len)
{
@@ -506,6 +507,7 @@ static int log_store(int facility, int level,
msg->level = level & 7;
msg->flags = flags & 0x1f;
msg->ts_nsec = ts_nsec;
+ msg->cpu = cpu;
msg->len = size;

/* insert message */
@@ -570,9 +572,9 @@ static ssize_t msg_print_ext_header(char *buf, size_t size,

do_div(ts_usec, 1000);

- return scnprintf(buf, size, "%u,%llu,%llu,%c;",
+ return scnprintf(buf, size, "%u,%llu,%llu,%c,%hu;",
(msg->facility << 3) | msg->level, seq, ts_usec,
- msg->flags & LOG_CONT ? 'c' : '-');
+ msg->flags & LOG_CONT ? 'c' : '-', msg->cpu);
}

static ssize_t msg_print_ext_body(char *buf, size_t size,
@@ -1110,6 +1112,11 @@ static inline void boot_delay_msec(int level)
static bool printk_time = IS_ENABLED(CONFIG_PRINTK_TIME);
module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR);

+static size_t print_cpu(u16 cpu, char *buf)
+{
+ return sprintf(buf, "%03hu: ", cpu);
+}
+
static size_t print_syslog(unsigned int level, char *buf)
{
return sprintf(buf, "<%u>", level);
@@ -1132,6 +1139,7 @@ static size_t print_prefix(const struct printk_log *msg, bool syslog,
len = print_syslog((msg->facility << 3) | msg->level, buf);
if (time)
len += print_time(msg->ts_nsec, buf + len);
+ len += print_cpu(msg->cpu, buf + len);
return len;
}

@@ -1698,6 +1706,7 @@ asmlinkage int vprintk_emit(int facility, int level,
u64 ts_nsec;
char *text;
char *rbuf;
+ int cpu;

ts_nsec = local_clock();

@@ -1707,6 +1716,8 @@ asmlinkage int vprintk_emit(int facility, int level,
return printed_len;
}

+ cpu = raw_smp_processor_id();
+
text = rbuf;
text_len = vscnprintf(text, PRINTK_SPRINT_MAX, fmt, args);

@@ -1744,7 +1755,7 @@ asmlinkage int vprintk_emit(int facility, int level,
if (dict)
lflags |= LOG_PREFIX|LOG_NEWLINE;

- printed_len = log_store(facility, level, lflags, ts_nsec,
+ printed_len = log_store(facility, level, lflags, ts_nsec, cpu,
dict, dictlen, text, text_len);

prb_commit(&h);
--
2.11.0


2019-02-12 14:42:39

by John Ogness

Subject: [RFC PATCH v1 25/25] printk: remove unused code

Code relating to the safe context and anything dealing with the
previous log buffer implementation is no longer in use. Remove it.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/internal.h | 53 ---------------
kernel/printk/printk.c | 167 ++++-------------------------------------------
lib/bust_spinlocks.c | 3 +-
3 files changed, 13 insertions(+), 210 deletions(-)
delete mode 100644 kernel/printk/internal.h

diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
deleted file mode 100644
index 59ad43dba837..000000000000
--- a/kernel/printk/internal.h
+++ /dev/null
@@ -1,53 +0,0 @@
-/*
- * internal.h - printk internal definitions
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version 2
- * of the License, or (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, see <http://www.gnu.org/licenses/>.
- */
-#include <linux/percpu.h>
-
-#ifdef CONFIG_PRINTK
-
-#define PRINTK_SAFE_CONTEXT_MASK 0x3fffffff
-#define PRINTK_NMI_DIRECT_CONTEXT_MASK 0x40000000
-#define PRINTK_NMI_CONTEXT_MASK 0x80000000
-
-extern raw_spinlock_t logbuf_lock;
-
-__printf(5, 0)
-int vprintk_store(int facility, int level,
- const char *dict, size_t dictlen,
- const char *fmt, va_list args);
-
-__printf(1, 0) int vprintk_default(const char *fmt, va_list args);
-__printf(1, 0) int vprintk_deferred(const char *fmt, va_list args);
-__printf(1, 0) int vprintk_func(const char *fmt, va_list args);
-
-void defer_console_output(void);
-
-#else
-
-__printf(1, 0) int vprintk_func(const char *fmt, va_list args) { return 0; }
-
-/*
- * In !PRINTK builds we still export logbuf_lock spin_lock, console_sem
- * semaphore and some of console functions (console_unlock()/etc.), so
- * printk-safe must preserve the existing local IRQ guarantees.
- */
-#endif /* CONFIG_PRINTK */
-
-#define printk_safe_enter_irqsave(flags) local_irq_save(flags)
-#define printk_safe_exit_irqrestore(flags) local_irq_restore(flags)
-
-#define printk_safe_enter_irq() local_irq_disable()
-#define printk_safe_exit_irq() local_irq_enable()
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 6d5bb7f5f584..8ef35c398887 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -60,7 +60,6 @@

#include "console_cmdline.h"
#include "braille.h"
-#include "internal.h"

int console_printk[5] = {
CONSOLE_LOGLEVEL_DEFAULT, /* console_loglevel */
@@ -346,41 +345,6 @@ __packed __aligned(4)
#endif
;

-/*
- * The logbuf_lock protects kmsg buffer, indices, counters. This can be taken
- * within the scheduler's rq lock. It must be released before calling
- * console_unlock() or anything else that might wake up a process.
- */
-DEFINE_RAW_SPINLOCK(logbuf_lock);
-
-/*
- * Helper macros to lock/unlock logbuf_lock and switch between
- * printk-safe/unsafe modes.
- */
-#define logbuf_lock_irq() \
- do { \
- printk_safe_enter_irq(); \
- raw_spin_lock(&logbuf_lock); \
- } while (0)
-
-#define logbuf_unlock_irq() \
- do { \
- raw_spin_unlock(&logbuf_lock); \
- printk_safe_exit_irq(); \
- } while (0)
-
-#define logbuf_lock_irqsave(flags) \
- do { \
- printk_safe_enter_irqsave(flags); \
- raw_spin_lock(&logbuf_lock); \
- } while (0)
-
-#define logbuf_unlock_irqrestore(flags) \
- do { \
- raw_spin_unlock(&logbuf_lock); \
- printk_safe_exit_irqrestore(flags); \
- } while (0)
-
DECLARE_STATIC_PRINTKRB_CPULOCK(printk_cpulock);

#ifdef CONFIG_PRINTK
@@ -390,23 +354,15 @@ DECLARE_STATIC_PRINTKRB(printk_rb, CONFIG_LOG_BUF_SHIFT, &printk_cpulock);
static DEFINE_MUTEX(syslog_lock);
DECLARE_STATIC_PRINTKRB_ITER(syslog_iter, &printk_rb);

-DECLARE_WAIT_QUEUE_HEAD(log_wait);
-/* the next printk record to read by syslog(READ) or /proc/kmsg */
+/* the last printk record read by syslog(READ) or /proc/kmsg */
static u64 syslog_seq;
static size_t syslog_partial;
static bool syslog_time;

-/* index and sequence number of the first record stored in the buffer */
-static u32 log_first_idx;
-
-/* index and sequence number of the next record to store in the buffer */
-static u32 log_next_idx;
-
static DEFINE_MUTEX(kmsg_dump_lock);

-/* the next printk record to read after the last 'clear' command */
+/* the last printk record at the last 'clear' command */
static u64 clear_seq;
-static u32 clear_idx;

#define PREFIX_MAX 32
#define LOG_LINE_MAX (1024 - PREFIX_MAX)
@@ -414,26 +370,6 @@ static u32 clear_idx;
#define LOG_LEVEL(v) ((v) & 0x07)
#define LOG_FACILITY(v) ((v) >> 3 & 0xff)

-/* record buffer */
-#define LOG_ALIGN __alignof__(struct printk_log)
-#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
-#define LOG_BUF_LEN_MAX (u32)(1 << 31)
-static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
-static char *log_buf = __log_buf;
-static u32 log_buf_len = __LOG_BUF_LEN;
-
-/* Return log buffer address */
-char *log_buf_addr_get(void)
-{
- return log_buf;
-}
-
-/* Return log buffer size */
-u32 log_buf_len_get(void)
-{
- return log_buf_len;
-}
-
/* human readable text of the record */
static char *log_text(const struct printk_log *msg)
{
@@ -944,11 +880,6 @@ const struct file_operations kmsg_fops = {
*/
void log_buf_vmcoreinfo_setup(void)
{
- VMCOREINFO_SYMBOL(log_buf);
- VMCOREINFO_SYMBOL(log_buf_len);
- VMCOREINFO_SYMBOL(log_first_idx);
- VMCOREINFO_SYMBOL(clear_idx);
- VMCOREINFO_SYMBOL(log_next_idx);
/*
* Export struct printk_log size and field offsets. User space tools can
* parse it and detect any changes to structure down the line.
@@ -961,6 +892,8 @@ void log_buf_vmcoreinfo_setup(void)
}
#endif

+/* FIXME: no support for buffer resizing */
+#if 0
/* requested log_buf_len from kernel cmdline */
static unsigned long __initdata new_log_buf_len;

@@ -1026,9 +959,12 @@ static void __init log_buf_add_cpu(void)
#else /* !CONFIG_SMP */
static inline void log_buf_add_cpu(void) {}
#endif /* CONFIG_SMP */
+#endif /* 0 */

void __init setup_log_buf(int early)
{
+/* FIXME: no support for buffer resizing */
+#if 0
unsigned long flags;
char *new_log_buf;
unsigned int free;
@@ -1067,6 +1003,7 @@ void __init setup_log_buf(int early)
pr_info("log_buf_len: %u bytes\n", log_buf_len);
pr_info("early log buf free: %u(%u%%)\n",
free, (free * 100) / __LOG_BUF_LEN);
+#endif
}

static bool __read_mostly ignore_loglevel;
@@ -2023,31 +1960,6 @@ asmlinkage __visible int printk(const char *fmt, ...)
return r;
}
EXPORT_SYMBOL(printk);
-
-#else /* CONFIG_PRINTK */
-
-#define LOG_LINE_MAX 0
-#define PREFIX_MAX 0
-#define printk_time false
-
-static u64 syslog_seq;
-static u32 log_first_idx;
-static char *log_text(const struct printk_log *msg) { return NULL; }
-static char *log_dict(const struct printk_log *msg) { return NULL; }
-static struct printk_log *log_from_idx(u32 idx) { return NULL; }
-static u32 log_next(u32 idx) { return 0; }
-static ssize_t msg_print_ext_header(char *buf, size_t size,
- struct printk_log *msg,
- u64 seq) { return 0; }
-static ssize_t msg_print_ext_body(char *buf, size_t size,
- char *dict, size_t dict_len,
- char *text, size_t text_len) { return 0; }
-static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
- const char *text, size_t len, int level) {}
-static size_t msg_print_text(const struct printk_log *msg, bool syslog,
- bool time, char *buf, size_t size) { return 0; }
-static bool suppress_message_printing(int level) { return false; }
-
#endif /* CONFIG_PRINTK */

#ifdef CONFIG_EARLY_PRINTK
@@ -2343,15 +2255,10 @@ void console_unblank(void)
void console_flush_on_panic(void)
{
/*
- * If someone else is holding the console lock, trylock will fail
- * and may_schedule may be set. Ignore and proceed to unlock so
- * that messages are flushed out. As this can be called from any
- * context and we don't want to get preempted while flushing,
- * ensure may_schedule is cleared.
+ * FIXME: This is currently a NOP. Emergency messages will have been
+ * printed, but what about if write_atomic is not available on the
+ * console? What if the printk kthread is still alive?
*/
- console_trylock();
- console_may_schedule = 0;
- console_unlock();
}

/*
@@ -2700,43 +2607,6 @@ static int __init printk_late_init(void)
late_initcall(printk_late_init);

#if defined CONFIG_PRINTK
-/*
- * Delayed printk version, for scheduler-internal messages:
- */
-#define PRINTK_PENDING_WAKEUP 0x01
-#define PRINTK_PENDING_OUTPUT 0x02
-
-static DEFINE_PER_CPU(int, printk_pending);
-
-static void wake_up_klogd_work_func(struct irq_work *irq_work)
-{
- int pending = __this_cpu_xchg(printk_pending, 0);
-
- if (pending & PRINTK_PENDING_OUTPUT) {
- /* If trylock fails, someone else is doing the printing */
- if (console_trylock())
- console_unlock();
- }
-
- if (pending & PRINTK_PENDING_WAKEUP)
- wake_up_interruptible(&log_wait);
-}
-
-static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
- .func = wake_up_klogd_work_func,
- .flags = IRQ_WORK_LAZY,
-};
-
-void wake_up_klogd(void)
-{
- preempt_disable();
- if (waitqueue_active(&log_wait)) {
- this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP);
- irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));
- }
- preempt_enable();
-}
-
static int printk_kthread_func(void *data)
{
struct prb_iterator iter;
@@ -2802,22 +2672,9 @@ static int __init init_printk_kthread(void)
}
late_initcall(init_printk_kthread);

-void defer_console_output(void)
-{
- preempt_disable();
- __this_cpu_or(printk_pending, PRINTK_PENDING_OUTPUT);
- irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));
- preempt_enable();
-}
-
int vprintk_deferred(const char *fmt, va_list args)
{
- int r;
-
- r = vprintk_emit(0, LOGLEVEL_SCHED, NULL, 0, fmt, args);
- defer_console_output();
-
- return r;
+ return vprintk_emit(0, LOGLEVEL_SCHED, NULL, 0, fmt, args);
}

int printk_deferred(const char *fmt, ...)
diff --git a/lib/bust_spinlocks.c b/lib/bust_spinlocks.c
index 8be59f84eaea..c6e083323d1b 100644
--- a/lib/bust_spinlocks.c
+++ b/lib/bust_spinlocks.c
@@ -26,7 +26,6 @@ void bust_spinlocks(int yes)
unblank_screen();
#endif
console_unblank();
- if (--oops_in_progress == 0)
- wake_up_klogd();
+ --oops_in_progress;
}
}
--
2.11.0


2019-02-12 14:43:04

by John Ogness

Subject: [RFC PATCH v1 19/25] printk: introduce emergency messages

Console messages are generally either critical or non-critical.
Critical messages are messages such as crashes or sysrq output.
Critical messages should never be lost because generally they provide
important debugging information.

Since all console messages are output via a fully preemptible printk
kernel thread, it is possible that messages are not output because
that thread cannot be scheduled (BUG in the scheduler, runaway RT
task, etc.).

To allow critical messages to be output independent of the
schedulability of the printk task, introduce an emergency mechanism
that _immediately_ outputs the message to the consoles. To avoid
possible unbounded latency issues, the emergency mechanism only
outputs the printk line provided by the caller and ignores any
pending messages in the log buffer.

Critical messages are identified as messages (by default) with log
level LOGLEVEL_WARNING or more critical. This is configurable via the
kernel option CONSOLE_LOGLEVEL_EMERGENCY.

Messages already output as emergency messages are skipped by the
printk thread on those consoles that printed them.

In order for a console driver to support emergency messages, the
write_atomic function must be implemented by the driver. If not
implemented, the emergency messages are handled like all other
messages and are printed by the printk thread.
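
As a sketch, a driver opts in by providing the new callback in its
struct console (all foo_* names below are hypothetical):

	static void foo_console_write_atomic(struct console *con, const char *s,
					     unsigned int count)
	{
		/* Polls the hardware directly; must not sleep, must be NMI-safe. */
		foo_hw_write_polled(s, count);	/* hypothetical polled-I/O helper */
	}

	static struct console foo_console = {
		.name		= "foo",
		.write		= foo_console_write,	/* normal path, unchanged */
		.write_atomic	= foo_console_write_atomic,
		.flags		= CON_PRINTBUFFER,
		.index		= -1,
	};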

Signed-off-by: John Ogness <[email protected]>
---
include/linux/printk.h | 2 +
kernel/printk/printk.c | 111 ++++++++++++++++++++++++++++++++++++++++++++++---
lib/Kconfig.debug | 17 ++++++++
3 files changed, 124 insertions(+), 6 deletions(-)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index a79a736b54b6..58bd06d88ea3 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -59,6 +59,7 @@ static inline const char *printk_skip_headers(const char *buffer)
*/
#define CONSOLE_LOGLEVEL_DEFAULT CONFIG_CONSOLE_LOGLEVEL_DEFAULT
#define CONSOLE_LOGLEVEL_QUIET CONFIG_CONSOLE_LOGLEVEL_QUIET
+#define CONSOLE_LOGLEVEL_EMERGENCY CONFIG_CONSOLE_LOGLEVEL_EMERGENCY

extern int console_printk[];

@@ -66,6 +67,7 @@ extern int console_printk[];
#define default_message_loglevel (console_printk[1])
#define minimum_console_loglevel (console_printk[2])
#define default_console_loglevel (console_printk[3])
+#define emergency_console_loglevel (console_printk[4])

static inline void console_silent(void)
{
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 0ff7c3942464..eebe6f4fdbba 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -45,6 +45,7 @@
#include <linux/ctype.h>
#include <linux/uio.h>
#include <linux/kthread.h>
+#include <linux/clocksource.h>
#include <linux/printk_ringbuffer.h>
#include <linux/sched/clock.h>
#include <linux/sched/debug.h>
@@ -61,11 +62,12 @@
#include "braille.h"
#include "internal.h"

-int console_printk[4] = {
+int console_printk[5] = {
CONSOLE_LOGLEVEL_DEFAULT, /* console_loglevel */
MESSAGE_LOGLEVEL_DEFAULT, /* default_message_loglevel */
CONSOLE_LOGLEVEL_MIN, /* minimum_console_loglevel */
CONSOLE_LOGLEVEL_DEFAULT, /* default_console_loglevel */
+ CONSOLE_LOGLEVEL_EMERGENCY, /* emergency_console_loglevel */
};

atomic_t ignore_console_lock_warning __read_mostly = ATOMIC_INIT(0);
@@ -474,6 +476,9 @@ static u32 log_next(u32 idx)
return idx + msg->len;
}

+static void printk_emergency(char *buffer, int level, u64 ts_nsec, u16 cpu,
+ char *text, u16 text_len);
+
/* insert record into the buffer, discard old ones, update heads */
static int log_store(int facility, int level,
enum log_flags flags, u64 ts_nsec, u16 cpu,
@@ -1587,7 +1592,7 @@ static void printk_write_history(struct console *con, u64 master_seq)
* The console_lock must be held.
*/
static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
- const char *text, size_t len)
+ const char *text, size_t len, int level)
{
struct console *con;

@@ -1607,6 +1612,18 @@ static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
con->wrote_history = 1;
con->printk_seq = seq - 1;
}
+ if (con->write_atomic && level < emergency_console_loglevel) {
+ /* skip emergency messages, already printed */
+ if (con->printk_seq < seq)
+ con->printk_seq = seq;
+ continue;
+ }
+ if (con->flags & CON_BOOT) {
+ /* skip emergency messages, already printed */
+ if (con->printk_seq < seq)
+ con->printk_seq = seq;
+ continue;
+ }
if (!con->write)
continue;
if (!cpu_online(raw_smp_processor_id()) &&
@@ -1718,8 +1735,12 @@ asmlinkage int vprintk_emit(int facility, int level,

cpu = raw_smp_processor_id();

- text = rbuf;
- text_len = vscnprintf(text, PRINTK_SPRINT_MAX, fmt, args);
+ /*
+ * If this turns out to be an emergency message, there
+ * may need to be a prefix added. Leave room for it.
+ */
+ text = rbuf + PREFIX_MAX;
+ text_len = vscnprintf(text, PRINTK_SPRINT_MAX - PREFIX_MAX, fmt, args);

/* strip and flag a trailing newline */
if (text_len && text[text_len-1] == '\n') {
@@ -1755,6 +1776,14 @@ asmlinkage int vprintk_emit(int facility, int level,
if (dict)
lflags |= LOG_PREFIX|LOG_NEWLINE;

+ /*
+ * NOTE:
+ * - rbuf points to beginning of allocated buffer
+ * - text points to beginning of text
+ * - there is room before text for prefix
+ */
+ printk_emergency(rbuf, level, ts_nsec, cpu, text, text_len);
+
printed_len = log_store(facility, level, lflags, ts_nsec, cpu,
dict, dictlen, text, text_len);

@@ -1847,7 +1876,7 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
char *dict, size_t dict_len,
char *text, size_t text_len) { return 0; }
static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
- const char *text, size_t len) {}
+ const char *text, size_t len, int level) {}
static size_t msg_print_text(const struct printk_log *msg, bool syslog,
bool time, char *buf, size_t size) { return 0; }
static bool suppress_message_printing(int level) { return false; }
@@ -2579,7 +2608,7 @@ static int printk_kthread_func(void *data)

console_lock();
call_console_drivers(master_seq, ext_text,
- ext_len, text, len);
+ ext_len, text, len, msg->level);
if (len > 0 || ext_len > 0)
printk_delay(msg->level);
console_unlock();
@@ -2983,6 +3012,76 @@ void kmsg_dump_rewind(struct kmsg_dumper *dumper)
logbuf_unlock_irqrestore(flags);
}
EXPORT_SYMBOL_GPL(kmsg_dump_rewind);
+
+static bool console_can_emergency(int level)
+{
+ struct console *con;
+
+ for_each_console(con) {
+ if (!(con->flags & CON_ENABLED))
+ continue;
+ if (con->write_atomic && level < emergency_console_loglevel)
+ return true;
+ if (con->write && (con->flags & CON_BOOT))
+ return true;
+ }
+ return false;
+}
+
+static void call_emergency_console_drivers(int level, const char *text,
+ size_t text_len)
+{
+ struct console *con;
+
+ for_each_console(con) {
+ if (!(con->flags & CON_ENABLED))
+ continue;
+ if (con->write_atomic && level < emergency_console_loglevel) {
+ con->write_atomic(con, text, text_len);
+ continue;
+ }
+ if (con->write && (con->flags & CON_BOOT)) {
+ con->write(con, text, text_len);
+ continue;
+ }
+ }
+}
+
+static void printk_emergency(char *buffer, int level, u64 ts_nsec, u16 cpu,
+ char *text, u16 text_len)
+{
+ struct printk_log msg;
+ size_t prefix_len;
+
+ if (!console_can_emergency(level))
+ return;
+
+ msg.level = level;
+ msg.ts_nsec = ts_nsec;
+ msg.cpu = cpu;
+ msg.facility = 0;
+
+ /* "text" must have PREFIX_MAX preceding bytes available */
+
+ prefix_len = print_prefix(&msg,
+ console_msg_format & MSG_FORMAT_SYSLOG,
+ printk_time, buffer);
+ /* move the prefix forward to the beginning of the message text */
+ text -= prefix_len;
+ memmove(text, buffer, prefix_len);
+ text_len += prefix_len;
+
+ text[text_len++] = '\n';
+
+ call_emergency_console_drivers(level, text, text_len);
+
+ touch_softlockup_watchdog_sync();
+ clocksource_touch_watchdog();
+ rcu_cpu_stall_reset();
+ touch_nmi_watchdog();
+
+ printk_delay(level);
+}
#endif

void console_atomic_lock(unsigned int *flags)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d4df5b24d75e..38d2fe5df425 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -43,6 +43,23 @@ config CONSOLE_LOGLEVEL_QUIET
will be used as the loglevel. IOW passing "quiet" will be the
equivalent of passing "loglevel=<CONSOLE_LOGLEVEL_QUIET>"

+config CONSOLE_LOGLEVEL_EMERGENCY
+ int "Emergency console loglevel (1-15)"
+ range 1 15
+ default "5"
+ help
+ The loglevel to determine if a console message is an emergency
+ message.
+
+ If supported by the console driver, emergency messages will be
+ flushed to the console immediately. This can cause significant system
+ latencies so the value should be set such that only significant
+ messages are classified as emergency messages.
+
+ Setting a default here is equivalent to passing in
+ emergency_loglevel=<x> in the kernel bootargs. emergency_loglevel=<x>
+ continues to override whatever value is specified here as well.
+
config MESSAGE_LOGLEVEL_DEFAULT
int "Default message log level (1-7)"
range 1 7
--
2.11.0


2019-02-12 14:43:25

by John Ogness

Subject: [RFC PATCH v1 06/25] printk-rb: add blocking reader support

Add a blocking read function for readers. An irq_work function is
used to signal the wait queue so that the wake-up of waiting readers
can be triggered from any writer context.
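
A minimal reader loop using the new call might look like this
(sketch, assuming the static printk_rb buffer declared later in this
series):

	struct prb_iterator iter;
	char buf[256];
	u64 seq;
	int ret;

	prb_iter_init(&iter, &printk_rb, NULL);
	for (;;) {
		ret = prb_iter_wait_next(&iter, buf, sizeof(buf), &seq);
		if (ret == -ERESTARTSYS)
			break;			/* interrupted by a signal */
		if (ret == -EINVAL) {
			/* reader fell behind, restart at the oldest record */
			prb_iter_init(&iter, &printk_rb, NULL);
			continue;
		}
		/* record with sequence number seq is now in buf */
	}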

Signed-off-by: John Ogness <[email protected]>
---
include/linux/printk_ringbuffer.h | 20 ++++++++++++++++
lib/printk_ringbuffer.c | 49 +++++++++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+)

diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
index 5fdaf632c111..106f20ef8b4d 100644
--- a/include/linux/printk_ringbuffer.h
+++ b/include/linux/printk_ringbuffer.h
@@ -2,8 +2,10 @@
#ifndef _LINUX_PRINTK_RINGBUFFER_H
#define _LINUX_PRINTK_RINGBUFFER_H

+#include <linux/irq_work.h>
#include <linux/atomic.h>
#include <linux/percpu.h>
+#include <linux/wait.h>

struct prb_cpulock {
atomic_t owner;
@@ -22,6 +24,10 @@ struct printk_ringbuffer {

struct prb_cpulock *cpulock;
atomic_t ctx;
+
+ struct wait_queue_head *wq;
+ atomic_long_t wq_counter;
+ struct irq_work *wq_work;
};

struct prb_entry {
@@ -59,6 +65,15 @@ struct prb_iterator {
#define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr) \
static char _##name##_buffer[1 << (szbits)] \
__aligned(__alignof__(long)); \
+static DECLARE_WAIT_QUEUE_HEAD(_##name##_wait); \
+static void _##name##_wake_work_func(struct irq_work *irq_work) \
+{ \
+ wake_up_interruptible_all(&_##name##_wait); \
+} \
+static struct irq_work _##name##_wake_work = { \
+ .func = _##name##_wake_work_func, \
+ .flags = IRQ_WORK_LAZY, \
+}; \
static struct printk_ringbuffer name = { \
.buffer = &_##name##_buffer[0], \
.size_bits = szbits, \
@@ -68,6 +83,9 @@ static struct printk_ringbuffer name = { \
.reserve = ATOMIC_LONG_INIT(-111 * sizeof(long)), \
.cpulock = cpulockptr, \
.ctx = ATOMIC_INIT(0), \
+ .wq = &_##name##_wait, \
+ .wq_counter = ATOMIC_LONG_INIT(0), \
+ .wq_work = &_##name##_wake_work, \
}

/* writer interface */
@@ -80,6 +98,8 @@ void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb,
u64 *seq);
void prb_iter_copy(struct prb_iterator *dest, struct prb_iterator *src);
int prb_iter_next(struct prb_iterator *iter, char *buf, int size, u64 *seq);
+int prb_iter_wait_next(struct prb_iterator *iter, char *buf, int size,
+ u64 *seq);
int prb_iter_data(struct prb_iterator *iter, char *buf, int size, u64 *seq);

/* utility functions */
diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
index 1d1e886a0966..c2ddf4cb9f92 100644
--- a/lib/printk_ringbuffer.c
+++ b/lib/printk_ringbuffer.c
@@ -1,4 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/string.h>
#include <linux/errno.h>
@@ -154,6 +155,7 @@ static bool push_tail(struct printk_ringbuffer *rb, unsigned long tail)
void prb_commit(struct prb_handle *h)
{
struct printk_ringbuffer *rb = h->rb;
+ bool changed = false;
struct prb_entry *e;
unsigned long head;
unsigned long res;
@@ -175,6 +177,7 @@ void prb_commit(struct prb_handle *h)
}
e->seq = ++rb->seq;
head += e->size;
+ changed = true;
}
atomic_long_set_release(&rb->head, res);
atomic_dec(&rb->ctx);
@@ -185,6 +188,12 @@ void prb_commit(struct prb_handle *h)
}

prb_unlock(rb->cpulock, h->cpu);
+
+ if (changed) {
+ atomic_long_inc(&rb->wq_counter);
+ if (wq_has_sleeper(rb->wq))
+ irq_work_queue(rb->wq_work);
+ }
}

/*
@@ -437,3 +446,43 @@ int prb_iter_next(struct prb_iterator *iter, char *buf, int size, u64 *seq)

return 1;
}
+
+/*
+ * prb_iter_wait_next: Advance to the next record, blocking if none available.
+ * @iter: Iterator tracking the current position.
+ * @buf: A buffer to store the data of the next record. May be NULL.
+ * @size: The size of @buf. (Ignored if @buf is NULL.)
+ * @seq: The sequence number of the next record. May be NULL.
+ *
+ * If a next record is already available, this function works like
+ * prb_iter_next(). Otherwise block interruptible until a next record is
+ * available.
+ *
+ * When a next record is available, @iter is advanced and (if specified)
+ * the data and/or sequence number of that record are provided.
+ *
+ * This function might sleep.
+ *
+ * Returns 1 if @iter was advanced, -EINVAL if @iter is now invalid, or
+ * -ERESTARTSYS if interrupted by a signal.
+ */
+int prb_iter_wait_next(struct prb_iterator *iter, char *buf, int size, u64 *seq)
+{
+ unsigned long last_seen;
+ int ret;
+
+ for (;;) {
+ last_seen = atomic_long_read(&iter->rb->wq_counter);
+
+ ret = prb_iter_next(iter, buf, size, seq);
+ if (ret != 0)
+ break;
+
+ ret = wait_event_interruptible(*iter->rb->wq,
+ last_seen != atomic_long_read(&iter->rb->wq_counter));
+ if (ret < 0)
+ break;
+ }
+
+ return ret;
+}
--
2.11.0


2019-02-12 14:43:27

by John Ogness

Subject: [RFC PATCH v1 20/25] serial: 8250: implement write_atomic

Implement a non-sleeping NMI-safe write_atomic console function in
order to support emergency printk messages.

Since interrupts need to be disabled during transmit, all usage of
the IER register is wrapped with access functions that use the
console_atomic_lock function to synchronize register access while
tracking the state of the interrupts. This is necessary because
write_atomic can be called from an NMI context that has preempted
write_atomic.
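
The transformation of the open-coded save/disable/restore pattern is
sketched below:

	unsigned int ier;

	/* before: racy against a nested (e.g. NMI) writer */
	ier = serial_in(up, UART_IER);
	serial_out(up, UART_IER, 0);
	/* ... transmit ... */
	serial_out(up, UART_IER, ier);

	/* after: nestable and synchronized via console_atomic_lock() */
	clear_ier(up);
	/* ... transmit ... */
	restore_ier(up);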

Signed-off-by: John Ogness <[email protected]>
---
drivers/tty/serial/8250/8250.h | 4 +
drivers/tty/serial/8250/8250_core.c | 19 +++--
drivers/tty/serial/8250/8250_dma.c | 5 +-
drivers/tty/serial/8250/8250_port.c | 154 +++++++++++++++++++++++++++---------
include/linux/serial_8250.h | 5 ++
5 files changed, 139 insertions(+), 48 deletions(-)

diff --git a/drivers/tty/serial/8250/8250.h b/drivers/tty/serial/8250/8250.h
index ebfb0bd5bef5..1d36337d40a2 100644
--- a/drivers/tty/serial/8250/8250.h
+++ b/drivers/tty/serial/8250/8250.h
@@ -255,3 +255,7 @@ static inline int serial_index(struct uart_port *port)
{
return port->minor - 64;
}
+
+void set_ier(struct uart_8250_port *up, unsigned char ier);
+void clear_ier(struct uart_8250_port *up);
+void restore_ier(struct uart_8250_port *up);
diff --git a/drivers/tty/serial/8250/8250_core.c b/drivers/tty/serial/8250/8250_core.c
index e441221e04b9..1daea9709ce2 100644
--- a/drivers/tty/serial/8250/8250_core.c
+++ b/drivers/tty/serial/8250/8250_core.c
@@ -265,7 +265,7 @@ static void serial8250_timeout(struct timer_list *t)
static void serial8250_backup_timeout(struct timer_list *t)
{
struct uart_8250_port *up = from_timer(up, t, timer);
- unsigned int iir, ier = 0, lsr;
+ unsigned int iir, lsr;
unsigned long flags;

spin_lock_irqsave(&up->port.lock, flags);
@@ -274,10 +274,8 @@ static void serial8250_backup_timeout(struct timer_list *t)
* Must disable interrupts or else we risk racing with the interrupt
* based handler.
*/
- if (up->port.irq) {
- ier = serial_in(up, UART_IER);
- serial_out(up, UART_IER, 0);
- }
+ if (up->port.irq)
+ clear_ier(up);

iir = serial_in(up, UART_IIR);

@@ -300,7 +298,7 @@ static void serial8250_backup_timeout(struct timer_list *t)
serial8250_tx_chars(up);

if (up->port.irq)
- serial_out(up, UART_IER, ier);
+ restore_ier(up);

spin_unlock_irqrestore(&up->port.lock, flags);

@@ -578,6 +576,14 @@ serial8250_register_ports(struct uart_driver *drv, struct device *dev)

#ifdef CONFIG_SERIAL_8250_CONSOLE

+static void univ8250_console_write_atomic(struct console *co, const char *s,
+ unsigned int count)
+{
+ struct uart_8250_port *up = &serial8250_ports[co->index];
+
+ serial8250_console_write_atomic(up, s, count);
+}
+
static void univ8250_console_write(struct console *co, const char *s,
unsigned int count)
{
@@ -663,6 +669,7 @@ static int univ8250_console_match(struct console *co, char *name, int idx,

static struct console univ8250_console = {
.name = "ttyS",
+ .write_atomic = univ8250_console_write_atomic,
.write = univ8250_console_write,
.device = uart_console_device,
.setup = univ8250_console_setup,
diff --git a/drivers/tty/serial/8250/8250_dma.c b/drivers/tty/serial/8250/8250_dma.c
index bfa1a857f3ff..2fb394ddc420 100644
--- a/drivers/tty/serial/8250/8250_dma.c
+++ b/drivers/tty/serial/8250/8250_dma.c
@@ -36,7 +36,7 @@ static void __dma_tx_complete(void *param)
ret = serial8250_tx_dma(p);
if (ret) {
p->ier |= UART_IER_THRI;
- serial_port_out(&p->port, UART_IER, p->ier);
+ set_ier(p, p->ier);
}

spin_unlock_irqrestore(&p->port.lock, flags);
@@ -101,8 +101,8 @@ int serial8250_tx_dma(struct uart_8250_port *p)
if (dma->tx_err) {
dma->tx_err = 0;
if (p->ier & UART_IER_THRI) {
p->ier &= ~UART_IER_THRI;
- serial_out(p, UART_IER, p->ier);
+ set_ier(p, p->ier);
}
}
return 0;
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index d2f3310abe54..c1f308c1aaec 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -731,7 +731,7 @@ static void serial8250_set_sleep(struct uart_8250_port *p, int sleep)
serial_out(p, UART_EFR, UART_EFR_ECB);
serial_out(p, UART_LCR, 0);
}
- serial_out(p, UART_IER, sleep ? UART_IERX_SLEEP : 0);
+ set_ier(p, sleep ? UART_IERX_SLEEP : 0);
if (p->capabilities & UART_CAP_EFR) {
serial_out(p, UART_LCR, UART_LCR_CONF_MODE_B);
serial_out(p, UART_EFR, efr);
@@ -1433,7 +1433,7 @@ static void serial8250_stop_rx(struct uart_port *port)

up->ier &= ~(UART_IER_RLSI | UART_IER_RDI);
up->port.read_status_mask &= ~UART_LSR_DR;
- serial_port_out(port, UART_IER, up->ier);
+ set_ier(up, up->ier);

serial8250_rpm_put(up);
}
@@ -1451,7 +1451,7 @@ static void __do_stop_tx_rs485(struct uart_8250_port *p)
serial8250_clear_and_reinit_fifos(p);

p->ier |= UART_IER_RLSI | UART_IER_RDI;
- serial_port_out(&p->port, UART_IER, p->ier);
+ set_ier(p, p->ier);
}
}
static enum hrtimer_restart serial8250_em485_handle_stop_tx(struct hrtimer *t)
@@ -1504,7 +1504,7 @@ static inline void __do_stop_tx(struct uart_8250_port *p)
{
if (p->ier & UART_IER_THRI) {
p->ier &= ~UART_IER_THRI;
- serial_out(p, UART_IER, p->ier);
+ set_ier(p, p->ier);
serial8250_rpm_put_tx(p);
}
}
@@ -1557,7 +1557,7 @@ static inline void __start_tx(struct uart_port *port)

if (!(up->ier & UART_IER_THRI)) {
up->ier |= UART_IER_THRI;
- serial_port_out(port, UART_IER, up->ier);
+ set_ier(up, up->ier);

if (up->bugs & UART_BUG_TXEN) {
unsigned char lsr;
@@ -1663,7 +1663,7 @@ static void serial8250_disable_ms(struct uart_port *port)
return;

up->ier &= ~UART_IER_MSI;
- serial_port_out(port, UART_IER, up->ier);
+ set_ier(up, up->ier);
}

static void serial8250_enable_ms(struct uart_port *port)
@@ -1677,7 +1677,7 @@ static void serial8250_enable_ms(struct uart_port *port)
up->ier |= UART_IER_MSI;

serial8250_rpm_get(up);
- serial_port_out(port, UART_IER, up->ier);
+ set_ier(up, up->ier);
serial8250_rpm_put(up);
}

@@ -2050,6 +2050,52 @@ static void wait_for_xmitr(struct uart_8250_port *up, int bits)
}
}

+static atomic_t ier_counter = ATOMIC_INIT(0);
+static atomic_t ier_value = ATOMIC_INIT(0);
+
+void set_ier(struct uart_8250_port *up, unsigned char ier)
+{
+ struct uart_port *port = &up->port;
+ unsigned int flags;
+
+ console_atomic_lock(&flags);
+ if (atomic_read(&ier_counter) > 0)
+ atomic_set(&ier_value, ier);
+ else
+ serial_port_out(port, UART_IER, ier);
+ console_atomic_unlock(flags);
+}
+
+void clear_ier(struct uart_8250_port *up)
+{
+ struct uart_port *port = &up->port;
+ unsigned int ier_cleared = 0;
+ unsigned int flags;
+ unsigned int ier;
+
+ console_atomic_lock(&flags);
+ atomic_inc(&ier_counter);
+ ier = serial_port_in(port, UART_IER);
+ if (up->capabilities & UART_CAP_UUE)
+ ier_cleared = UART_IER_UUE;
+ if (ier != ier_cleared) {
+ serial_port_out(port, UART_IER, ier_cleared);
+ atomic_set(&ier_value, ier);
+ }
+ console_atomic_unlock(flags);
+}
+
+void restore_ier(struct uart_8250_port *up)
+{
+ struct uart_port *port = &up->port;
+ unsigned int flags;
+
+ console_atomic_lock(&flags);
+ if (atomic_fetch_dec(&ier_counter) == 1)
+ serial_port_out(port, UART_IER, atomic_read(&ier_value));
+ console_atomic_unlock(flags);
+}
+
#ifdef CONFIG_CONSOLE_POLL
/*
* Console polling routines for writing and reading from the uart while
@@ -2081,18 +2127,10 @@ static int serial8250_get_poll_char(struct uart_port *port)
static void serial8250_put_poll_char(struct uart_port *port,
unsigned char c)
{
- unsigned int ier;
struct uart_8250_port *up = up_to_u8250p(port);

serial8250_rpm_get(up);
- /*
- * First save the IER then disable the interrupts
- */
- ier = serial_port_in(port, UART_IER);
- if (up->capabilities & UART_CAP_UUE)
- serial_port_out(port, UART_IER, UART_IER_UUE);
- else
- serial_port_out(port, UART_IER, 0);
+ clear_ier(up);

wait_for_xmitr(up, BOTH_EMPTY);
/*
@@ -2105,7 +2143,7 @@ static void serial8250_put_poll_char(struct uart_port *port,
* and restore the IER
*/
wait_for_xmitr(up, BOTH_EMPTY);
- serial_port_out(port, UART_IER, ier);
+ restore_ier(up);
serial8250_rpm_put(up);
}

@@ -2417,7 +2455,7 @@ void serial8250_do_shutdown(struct uart_port *port)
*/
spin_lock_irqsave(&port->lock, flags);
up->ier = 0;
- serial_port_out(port, UART_IER, 0);
+ set_ier(up, 0);
spin_unlock_irqrestore(&port->lock, flags);

synchronize_irq(port->irq);
@@ -2728,7 +2766,7 @@ serial8250_do_set_termios(struct uart_port *port, struct ktermios *termios,
if (up->capabilities & UART_CAP_RTOIE)
up->ier |= UART_IER_RTOIE;

- serial_port_out(port, UART_IER, up->ier);
+ set_ier(up, up->ier);

if (up->capabilities & UART_CAP_EFR) {
unsigned char efr = 0;
@@ -3192,7 +3230,7 @@ EXPORT_SYMBOL_GPL(serial8250_set_defaults);

#ifdef CONFIG_SERIAL_8250_CONSOLE

-static void serial8250_console_putchar(struct uart_port *port, int ch)
+static void serial8250_console_putchar_locked(struct uart_port *port, int ch)
{
struct uart_8250_port *up = up_to_u8250p(port);

@@ -3200,6 +3238,18 @@ static void serial8250_console_putchar(struct uart_port *port, int ch)
serial_port_out(port, UART_TX, ch);
}

+static void serial8250_console_putchar(struct uart_port *port, int ch)
+{
+ struct uart_8250_port *up = up_to_u8250p(port);
+ unsigned int flags;
+
+ wait_for_xmitr(up, UART_LSR_THRE);
+
+ console_atomic_lock(&flags);
+ serial8250_console_putchar_locked(port, ch);
+ console_atomic_unlock(flags);
+}
+
/*
* Restore serial console when h/w power-off detected
*/
@@ -3221,6 +3271,42 @@ static void serial8250_console_restore(struct uart_8250_port *up)
serial8250_out_MCR(up, UART_MCR_DTR | UART_MCR_RTS);
}

+void serial8250_console_write_atomic(struct uart_8250_port *up,
+ const char *s, unsigned int count)
+{
+ struct uart_port *port = &up->port;
+ unsigned int flags;
+ bool locked;
+
+ console_atomic_lock(&flags);
+
+ /*
+ * If possible, keep any other CPUs from working with the
+ * UART until the atomic message is completed. This helps
+ * to keep the output more orderly.
+ */
+ locked = spin_trylock(&port->lock);
+
+ touch_nmi_watchdog();
+
+ clear_ier(up);
+
+ if (atomic_fetch_inc(&up->console_printing)) {
+ uart_console_write(port, "\n", 1,
+ serial8250_console_putchar_locked);
+ }
+ uart_console_write(port, s, count, serial8250_console_putchar_locked);
+ atomic_dec(&up->console_printing);
+
+ wait_for_xmitr(up, BOTH_EMPTY);
+ restore_ier(up);
+
+ if (locked)
+ spin_unlock(&port->lock);
+
+ console_atomic_unlock(flags);
+}
+
/*
* Print a string to the serial port trying not to disturb
* any possible real use of the port...
@@ -3232,27 +3318,13 @@ void serial8250_console_write(struct uart_8250_port *up, const char *s,
{
struct uart_port *port = &up->port;
unsigned long flags;
- unsigned int ier;
- int locked = 1;

touch_nmi_watchdog();

serial8250_rpm_get(up);
+ spin_lock_irqsave(&port->lock, flags);

- if (oops_in_progress)
- locked = spin_trylock_irqsave(&port->lock, flags);
- else
- spin_lock_irqsave(&port->lock, flags);
-
- /*
- * First save the IER then disable the interrupts
- */
- ier = serial_port_in(port, UART_IER);
-
- if (up->capabilities & UART_CAP_UUE)
- serial_port_out(port, UART_IER, UART_IER_UUE);
- else
- serial_port_out(port, UART_IER, 0);
+ clear_ier(up);

/* check scratch reg to see if port powered off during system sleep */
if (up->canary && (up->canary != serial_port_in(port, UART_SCR))) {
@@ -3260,14 +3332,16 @@ void serial8250_console_write(struct uart_8250_port *up, const char *s,
up->canary = 0;
}

+ atomic_inc(&up->console_printing);
uart_console_write(port, s, count, serial8250_console_putchar);
+ atomic_dec(&up->console_printing);

/*
* Finally, wait for transmitter to become empty
* and restore the IER
*/
wait_for_xmitr(up, BOTH_EMPTY);
- serial_port_out(port, UART_IER, ier);
+ restore_ier(up);

/*
* The receive handling will happen properly because the
@@ -3279,8 +3353,7 @@ void serial8250_console_write(struct uart_8250_port *up, const char *s,
if (up->msr_saved_flags)
serial8250_modem_status(up);

- if (locked)
- spin_unlock_irqrestore(&port->lock, flags);
+ spin_unlock_irqrestore(&port->lock, flags);
serial8250_rpm_put(up);
}

@@ -3301,6 +3374,7 @@ static unsigned int probe_baud(struct uart_port *port)

int serial8250_console_setup(struct uart_port *port, char *options, bool probe)
{
+ struct uart_8250_port *up = up_to_u8250p(port);
int baud = 9600;
int bits = 8;
int parity = 'n';
@@ -3309,6 +3383,8 @@ int serial8250_console_setup(struct uart_port *port, char *options, bool probe)
if (!port->iobase && !port->membase)
return -ENODEV;

+ atomic_set(&up->console_printing, 0);
+
if (options)
uart_parse_options(options, &baud, &parity, &bits, &flow);
else if (probe)
diff --git a/include/linux/serial_8250.h b/include/linux/serial_8250.h
index 5a655ba8d273..a2dbeb0bc005 100644
--- a/include/linux/serial_8250.h
+++ b/include/linux/serial_8250.h
@@ -11,6 +11,7 @@
#ifndef _LINUX_SERIAL_8250_H
#define _LINUX_SERIAL_8250_H

+#include <linux/atomic.h>
#include <linux/serial_core.h>
#include <linux/serial_reg.h>
#include <linux/platform_device.h>
@@ -126,6 +127,8 @@ struct uart_8250_port {
#define MSR_SAVE_FLAGS UART_MSR_ANY_DELTA
unsigned char msr_saved_flags;

+ atomic_t console_printing;
+
struct uart_8250_dma *dma;
const struct uart_8250_ops *ops;

@@ -177,6 +180,8 @@ void serial8250_init_port(struct uart_8250_port *up);
void serial8250_set_defaults(struct uart_8250_port *up);
void serial8250_console_write(struct uart_8250_port *up, const char *s,
unsigned int count);
+void serial8250_console_write_atomic(struct uart_8250_port *up, const char *s,
+ unsigned int count);
int serial8250_console_setup(struct uart_port *port, char *options, bool probe);

extern void serial8250_set_isa_configurator(void (*v)
--
2.11.0


2019-02-12 14:45:29

by John Ogness

Subject: [RFC PATCH v1 04/25] printk-rb: add writer interface

Add the writer functions prb_reserve() and prb_commit(). These make
use of processor-reentrant spin locks to limit the number of possible
interruption scenarios for the writers.
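
Basic writer usage is sketched below (assuming a buffer declared with
DECLARE_STATIC_PRINTKRB() as printk_rb; data and len are
caller-provided):

	struct prb_handle h;
	char *buf;

	/* reserve space for a record of len bytes */
	buf = prb_reserve(&h, &printk_rb, len);
	if (buf) {
		memcpy(buf, data, len);	/* fill the reserved data block */
		prb_commit(&h);		/* make it visible to readers */
	}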

Signed-off-by: John Ogness <[email protected]>
---
include/linux/printk_ringbuffer.h | 17 ++++
lib/printk_ringbuffer.c | 172 ++++++++++++++++++++++++++++++++++++++
2 files changed, 189 insertions(+)

diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
index 0e6e8dd0d01e..1aec9d5666b1 100644
--- a/include/linux/printk_ringbuffer.h
+++ b/include/linux/printk_ringbuffer.h
@@ -24,6 +24,18 @@ struct printk_ringbuffer {
atomic_t ctx;
};

+struct prb_entry {
+ unsigned int size;
+ u64 seq;
+ char data[0];
+};
+
+struct prb_handle {
+ struct printk_ringbuffer *rb;
+ unsigned int cpu;
+ struct prb_entry *entry;
+};
+
#define DECLARE_STATIC_PRINTKRB_CPULOCK(name) \
static DEFINE_PER_CPU(unsigned long, _##name##_percpu_irqflags); \
static struct prb_cpulock name = { \
@@ -45,6 +57,11 @@ static struct printk_ringbuffer name = { \
.ctx = ATOMIC_INIT(0), \
}

+/* writer interface */
+char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
+ unsigned int size);
+void prb_commit(struct prb_handle *h);
+
/* utility functions */
void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);
void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);
diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
index 28958b0cf774..90c7f9a9f861 100644
--- a/lib/printk_ringbuffer.c
+++ b/lib/printk_ringbuffer.c
@@ -2,6 +2,14 @@
#include <linux/smp.h>
#include <linux/printk_ringbuffer.h>

+#define PRB_SIZE(rb) (1 << (rb)->size_bits)
+#define PRB_SIZE_BITMASK(rb) (PRB_SIZE(rb) - 1)
+#define PRB_INDEX(rb, lpos) ((lpos) & PRB_SIZE_BITMASK(rb))
+#define PRB_WRAPS(rb, lpos) ((lpos) >> (rb)->size_bits)
+#define PRB_WRAP_LPOS(rb, lpos, xtra) \
+ ((PRB_WRAPS(rb, lpos) + (xtra)) << (rb)->size_bits)
+#define PRB_DATA_ALIGN sizeof(long)
+
static bool __prb_trylock(struct prb_cpulock *cpu_lock,
unsigned int *cpu_store)
{
@@ -75,3 +83,167 @@ void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store)

put_cpu();
}
+
+static struct prb_entry *to_entry(struct printk_ringbuffer *rb,
+ unsigned long lpos)
+{
+ char *buffer = rb->buffer;
+ buffer += PRB_INDEX(rb, lpos);
+ return (struct prb_entry *)buffer;
+}
+
+static int calc_next(struct printk_ringbuffer *rb, unsigned long tail,
+ unsigned long lpos, int size, unsigned long *calced_next)
+{
+ unsigned long next_lpos;
+ int ret = 0;
+again:
+ next_lpos = lpos + size;
+ if (next_lpos - tail > PRB_SIZE(rb))
+ return -1;
+
+ if (PRB_WRAPS(rb, lpos) != PRB_WRAPS(rb, next_lpos)) {
+ lpos = PRB_WRAP_LPOS(rb, next_lpos, 0);
+ ret |= 1;
+ goto again;
+ }
+
+ *calced_next = next_lpos;
+ return ret;
+}
+
+static bool push_tail(struct printk_ringbuffer *rb, unsigned long tail)
+{
+ unsigned long new_tail;
+ struct prb_entry *e;
+ unsigned long head;
+
+ if (tail != atomic_long_read(&rb->tail))
+ return true;
+
+ e = to_entry(rb, tail);
+ if (e->size != -1)
+ new_tail = tail + e->size;
+ else
+ new_tail = PRB_WRAP_LPOS(rb, tail, 1);
+
+ /* make sure the new tail does not overtake the head */
+ head = atomic_long_read(&rb->head);
+ if (head - new_tail > PRB_SIZE(rb))
+ return false;
+
+ atomic_long_cmpxchg(&rb->tail, tail, new_tail);
+ return true;
+}
+
+/*
+ * prb_commit: Commit a reserved entry to the ring buffer.
+ * @h: An entry handle referencing the data entry to commit.
+ *
+ * Commit data that has been reserved using prb_reserve(). Once the data
+ * block has been committed, it can be invalidated at any time. If a writer
+ * is interested in using the data after committing, the writer should make
+ * its own copy first or use the prb_iter_ reader functions to access the
+ * data in the ring buffer.
+ *
+ * It is safe to call this function from any context and state.
+ */
+void prb_commit(struct prb_handle *h)
+{
+ struct printk_ringbuffer *rb = h->rb;
+ struct prb_entry *e;
+ unsigned long head;
+ unsigned long res;
+
+ for (;;) {
+ if (atomic_read(&rb->ctx) != 1) {
+ /* the interrupted context will fixup head */
+ atomic_dec(&rb->ctx);
+ break;
+ }
+ /* assign sequence numbers before moving head */
+ head = atomic_long_read(&rb->head);
+ res = atomic_long_read(&rb->reserve);
+ while (head != res) {
+ e = to_entry(rb, head);
+ if (e->size == -1) {
+ head = PRB_WRAP_LPOS(rb, head, 1);
+ continue;
+ }
+ e->seq = ++rb->seq;
+ head += e->size;
+ }
+ atomic_long_set_release(&rb->head, res);
+ atomic_dec(&rb->ctx);
+
+ if (atomic_long_read(&rb->reserve) == res)
+ break;
+ atomic_inc(&rb->ctx);
+ }
+
+ prb_unlock(rb->cpulock, h->cpu);
+}
+
+/*
+ * prb_reserve: Reserve an entry within a ring buffer.
+ * @h: An entry handle to be set up to reference the reserved entry.
+ * @rb: A ring buffer to reserve data within.
+ * @size: The number of bytes to reserve.
+ *
+ * Reserve an entry of at least @size bytes to be used by the caller. If
+ * successful, the data region of the entry belongs to the caller and cannot
+ * be invalidated by any other task/context. For this reason, the caller
+ * should call prb_commit() as quickly as possible so that other
+ * tasks/contexts are not blocked from reserving data once the ring
+ * buffer has wrapped.
+ *
+ * It is safe to call this function from any context and state.
+ *
+ * Returns a pointer to the reserved entry (and @h is set up to reference that
+ * entry) or NULL if it was not possible to reserve data.
+ */
+char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
+ unsigned int size)
+{
+ unsigned long tail, res1, res2;
+ int ret;
+
+ if (size == 0)
+ return NULL;
+ size += sizeof(struct prb_entry);
+ size += PRB_DATA_ALIGN - 1;
+ size &= ~(PRB_DATA_ALIGN - 1);
+ if (size >= PRB_SIZE(rb))
+ return NULL;
+
+ h->rb = rb;
+ prb_lock(rb->cpulock, &h->cpu);
+
+ atomic_inc(&rb->ctx);
+
+ do {
+ for (;;) {
+ tail = atomic_long_read(&rb->tail);
+ res1 = atomic_long_read(&rb->reserve);
+ ret = calc_next(rb, tail, res1, size, &res2);
+ if (ret >= 0)
+ break;
+ if (!push_tail(rb, tail)) {
+ prb_commit(h);
+ return NULL;
+ }
+ }
+ } while (!atomic_long_try_cmpxchg_acquire(&rb->reserve, &res1, res2));
+
+ h->entry = to_entry(rb, res1);
+
+ if (ret) {
+ /* handle wrap */
+ h->entry->size = -1;
+ h->entry = to_entry(rb, PRB_WRAP_LPOS(rb, res2, 0));
+ }
+
+ h->entry->size = size;
+
+ return &h->entry->data[0];
+}
--
2.11.0


2019-02-12 14:46:11

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 02/25] printk-rb: add prb locking functions

Add processor-reentrant spin locking functions. These allow
restricting the number of possible contexts to 2, which can simplify
implementing code that also supports NMI interruptions.

prb_lock();

/*
* This code is synchronized with all contexts
* except an NMI on the same processor.
*/

prb_unlock();

In order to support printk's emergency messages, a
processor-reentrant spin lock will be used to control raw access to
the emergency console. However, it must be the same
processor-reentrant spin lock as the one used by the ring buffer,
otherwise a deadlock can occur:

CPU1: printk lock -> emergency -> serial lock
CPU2: serial lock -> printk lock

By making the processor-reentrant implementation available externally,
printk can use the same atomic_t for the ring buffer as for the
emergency console and thus avoid the above deadlock.
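
With the declared lock, the calls take the lock object plus a CPU
store for nested use. A sketch against the interface below
(my_cpulock is a hypothetical name for illustration):

    DECLARE_STATIC_PRINTKRB_CPULOCK(my_cpulock);

    unsigned int cpu_store;

    prb_lock(&my_cpulock, &cpu_store);
    /* synchronized against all contexts except NMI on this CPU */
    prb_unlock(&my_cpulock, cpu_store);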

Signed-off-by: John Ogness <[email protected]>
---
include/linux/printk_ringbuffer.h | 24 ++++++++++++
lib/Makefile | 2 +-
lib/printk_ringbuffer.c | 77 +++++++++++++++++++++++++++++++++++++++
3 files changed, 102 insertions(+), 1 deletion(-)
create mode 100644 include/linux/printk_ringbuffer.h
create mode 100644 lib/printk_ringbuffer.c

diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
new file mode 100644
index 000000000000..75f5708ea902
--- /dev/null
+++ b/include/linux/printk_ringbuffer.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PRINTK_RINGBUFFER_H
+#define _LINUX_PRINTK_RINGBUFFER_H
+
+#include <linux/atomic.h>
+#include <linux/percpu.h>
+
+struct prb_cpulock {
+ atomic_t owner;
+ unsigned long __percpu *irqflags;
+};
+
+#define DECLARE_STATIC_PRINTKRB_CPULOCK(name) \
+static DEFINE_PER_CPU(unsigned long, _##name##_percpu_irqflags); \
+static struct prb_cpulock name = { \
+ .owner = ATOMIC_INIT(-1), \
+ .irqflags = &_##name##_percpu_irqflags, \
+}
+
+/* utility functions */
+void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);
+void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);
+
+#endif /* _LINUX_PRINTK_RINGBUFFER_H */
diff --git a/lib/Makefile b/lib/Makefile
index e1b59da71418..77a20bfd232e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -19,7 +19,7 @@ KCOV_INSTRUMENT_dynamic_debug.o := n

lib-y := ctype.o string.o vsprintf.o cmdline.o \
rbtree.o radix-tree.o timerqueue.o xarray.o \
- idr.o int_sqrt.o extable.o \
+ idr.o int_sqrt.o extable.o printk_ringbuffer.o \
sha1.o chacha.o irq_regs.o argv_split.o \
flex_proportions.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
new file mode 100644
index 000000000000..28958b0cf774
--- /dev/null
+++ b/lib/printk_ringbuffer.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/smp.h>
+#include <linux/printk_ringbuffer.h>
+
+static bool __prb_trylock(struct prb_cpulock *cpu_lock,
+ unsigned int *cpu_store)
+{
+ unsigned long *flags;
+ unsigned int cpu;
+
+ cpu = get_cpu();
+
+ *cpu_store = atomic_read(&cpu_lock->owner);
+ /* memory barrier to ensure the current lock owner is visible */
+ smp_rmb();
+ if (*cpu_store == -1) {
+ flags = per_cpu_ptr(cpu_lock->irqflags, cpu);
+ local_irq_save(*flags);
+ if (atomic_try_cmpxchg_acquire(&cpu_lock->owner,
+ cpu_store, cpu)) {
+ return true;
+ }
+ local_irq_restore(*flags);
+ } else if (*cpu_store == cpu) {
+ return true;
+ }
+
+ put_cpu();
+ return false;
+}
+
+/*
+ * prb_lock: Perform a processor-reentrant spin lock.
+ * @cpu_lock: A pointer to the lock object.
+ * @cpu_store: A "flags" pointer to store lock status information.
+ *
+ * If no processor has the lock, the calling processor takes the lock and
+ * becomes the owner. If the calling processor is already the owner of the
+ * lock, this function succeeds immediately. If the lock is held by another
+ * processor, this function spins until the calling processor becomes the
+ * owner.
+ *
+ * It is safe to call this function from any context and state.
+ */
+void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store)
+{
+ for (;;) {
+ if (__prb_trylock(cpu_lock, cpu_store))
+ break;
+ cpu_relax();
+ }
+}
+
+/*
+ * prb_unlock: Perform a processor-reentrant spin unlock.
+ * @cpu_lock: A pointer to the lock object.
+ * @cpu_store: A "flags" object storing lock status information.
+ *
+ * Release the lock. The calling processor must be the owner of the lock.
+ *
+ * It is safe to call this function from any context and state.
+ */
+void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store)
+{
+ unsigned long *flags;
+ unsigned int cpu;
+
+ cpu = atomic_read(&cpu_lock->owner);
+ atomic_set_release(&cpu_lock->owner, cpu_store);
+
+ if (cpu_store == -1) {
+ flags = per_cpu_ptr(cpu_lock->irqflags, cpu);
+ local_irq_restore(*flags);
+ }
+
+ put_cpu();
+}
--
2.11.0


2019-02-12 14:46:15

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 24/25] printk: implement kmsg_dump

Since printk messages are now logged to a new ring buffer, update
the kmsg_dump functions to pull the messages from there.
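
The updated dump paths all share the same reader pattern; roughly (a
sketch, with "msgbuf" standing in for a caller-supplied record buffer
of PRINTK_RECORD_MAX bytes):

    struct prb_iterator iter;
    u64 seq;
    int ret;

    prb_iter_init(&iter, &printk_rb, NULL);
    for (;;) {
        ret = prb_iter_next(&iter, msgbuf, PRINTK_RECORD_MAX, &seq);
        if (ret == 0)
            break;          /* end of list */
        if (ret < 0) {
            /* iterator invalid, restart from the head */
            prb_iter_init(&iter, &printk_rb, NULL);
            continue;
        }
        /* a valid record with sequence number "seq" is in msgbuf */
    }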

Signed-off-by: John Ogness <[email protected]>
---
include/linux/kmsg_dump.h | 6 +-
kernel/printk/printk.c | 258 ++++++++++++++++++++++++----------------------
2 files changed, 139 insertions(+), 125 deletions(-)

diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
index 2e7a1e032c71..ede5066663b9 100644
--- a/include/linux/kmsg_dump.h
+++ b/include/linux/kmsg_dump.h
@@ -46,10 +46,8 @@ struct kmsg_dumper {
bool registered;

/* private state of the kmsg iterator */
- u32 cur_idx;
- u32 next_idx;
- u64 cur_seq;
- u64 next_seq;
+ u64 line_seq;
+ u64 buffer_end_seq;
};

#ifdef CONFIG_PRINTK
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 241d142a2755..6d5bb7f5f584 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -397,13 +397,13 @@ static size_t syslog_partial;
static bool syslog_time;

/* index and sequence number of the first record stored in the buffer */
-static u64 log_first_seq;
static u32 log_first_idx;

/* index and sequence number of the next record to store in the buffer */
-static u64 log_next_seq;
static u32 log_next_idx;

+static DEFINE_MUTEX(kmsg_dump_lock);
+
/* the next printk record to read after the last 'clear' command */
static u64 clear_seq;
static u32 clear_idx;
@@ -446,38 +446,6 @@ static char *log_dict(const struct printk_log *msg)
return (char *)msg + sizeof(struct printk_log) + msg->text_len;
}

-/* get record by index; idx must point to valid msg */
-static struct printk_log *log_from_idx(u32 idx)
-{
- struct printk_log *msg = (struct printk_log *)(log_buf + idx);
-
- /*
- * A length == 0 record is the end of buffer marker. Wrap around and
- * read the message at the start of the buffer.
- */
- if (!msg->len)
- return (struct printk_log *)log_buf;
- return msg;
-}
-
-/* get next record; idx must point to valid msg */
-static u32 log_next(u32 idx)
-{
- struct printk_log *msg = (struct printk_log *)(log_buf + idx);
-
- /* length == 0 indicates the end of the buffer; wrap */
- /*
- * A length == 0 record is the end of buffer marker. Wrap around and
- * read the message at the start of the buffer as *this* one, and
- * return the one after that.
- */
- if (!msg->len) {
- msg = (struct printk_log *)log_buf;
- return msg->len;
- }
- return idx + msg->len;
-}
-
static void printk_emergency(char *buffer, int level, u64 ts_nsec, u16 cpu,
char *text, u16 text_len);

@@ -2063,9 +2031,7 @@ EXPORT_SYMBOL(printk);
#define printk_time false

static u64 syslog_seq;
-static u64 log_first_seq;
static u32 log_first_idx;
-static u64 log_next_seq;
static char *log_text(const struct printk_log *msg) { return NULL; }
static char *log_dict(const struct printk_log *msg) { return NULL; }
static struct printk_log *log_from_idx(u32 idx) { return NULL; }
@@ -2974,7 +2940,6 @@ module_param_named(always_kmsg_dump, always_kmsg_dump, bool, S_IRUGO | S_IWUSR);
void kmsg_dump(enum kmsg_dump_reason reason)
{
struct kmsg_dumper *dumper;
- unsigned long flags;

if ((reason > KMSG_DUMP_OOPS) && !always_kmsg_dump)
return;
@@ -2987,12 +2952,7 @@ void kmsg_dump(enum kmsg_dump_reason reason)
/* initialize iterator with data about the stored records */
dumper->active = true;

- logbuf_lock_irqsave(flags);
- dumper->cur_seq = clear_seq;
- dumper->cur_idx = clear_idx;
- dumper->next_seq = log_next_seq;
- dumper->next_idx = log_next_idx;
- logbuf_unlock_irqrestore(flags);
+ kmsg_dump_rewind(dumper);

/* invoke dumper which will iterate over records */
dumper->dump(dumper, reason);
@@ -3025,33 +2985,67 @@ void kmsg_dump(enum kmsg_dump_reason reason)
bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
char *line, size_t size, size_t *len)
{
+ struct prb_iterator iter;
struct printk_log *msg;
- size_t l = 0;
- bool ret = false;
+ struct prb_handle h;
+ bool cont = false;
+ char *msgbuf;
+ char *rbuf;
+ size_t l;
+ u64 seq;
+ int ret;

if (!dumper->active)
- goto out;
+ return cont;

- if (dumper->cur_seq < log_first_seq) {
- /* messages are gone, move to first available one */
- dumper->cur_seq = log_first_seq;
- dumper->cur_idx = log_first_idx;
+ rbuf = prb_reserve(&h, &sprint_rb, PRINTK_RECORD_MAX);
+ if (!rbuf)
+ return cont;
+ msgbuf = rbuf;
+retry:
+ for (;;) {
+ prb_iter_init(&iter, &printk_rb, &seq);
+
+ if (dumper->line_seq == seq) {
+ /* already where we want to be */
+ break;
+ } else if (dumper->line_seq < seq) {
+ /* messages are gone, move to first available one */
+ dumper->line_seq = seq;
+ break;
+ }
+
+ ret = prb_iter_seek(&iter, dumper->line_seq);
+ if (ret > 0) {
+ /* seeked to line_seq */
+ break;
+ } else if (ret == 0) {
+ /*
+ * The end of the list was hit without ever seeing
+ * line_seq. Reset it to the beginning of the list.
+ */
+ prb_iter_init(&iter, &printk_rb, &dumper->line_seq);
+ break;
+ }
+ /* iterator invalid, start over */
}

- /* last entry */
- if (dumper->cur_seq >= log_next_seq)
+ ret = prb_iter_next(&iter, msgbuf, PRINTK_RECORD_MAX,
+ &dumper->line_seq);
+ if (ret == 0)
goto out;
+ else if (ret < 0)
+ goto retry;

- msg = log_from_idx(dumper->cur_idx);
+ msg = (struct printk_log *)msgbuf;
l = msg_print_text(msg, syslog, printk_time, line, size);

- dumper->cur_idx = log_next(dumper->cur_idx);
- dumper->cur_seq++;
- ret = true;
-out:
if (len)
*len = l;
- return ret;
+ cont = true;
+out:
+ prb_commit(&h);
+ return cont;
}

/**
@@ -3074,12 +3068,11 @@ bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
char *line, size_t size, size_t *len)
{
- unsigned long flags;
bool ret;

- logbuf_lock_irqsave(flags);
+ mutex_lock(&kmsg_dump_lock);
ret = kmsg_dump_get_line_nolock(dumper, syslog, line, size, len);
- logbuf_unlock_irqrestore(flags);
+ mutex_unlock(&kmsg_dump_lock);

return ret;
}
@@ -3107,74 +3100,101 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_line);
bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
char *buf, size_t size, size_t *len)
{
- unsigned long flags;
- u64 seq;
- u32 idx;
- u64 next_seq;
- u32 next_idx;
- size_t l = 0;
- bool ret = false;
+ struct prb_iterator iter;
bool time = printk_time;
+ struct printk_log *msg;
+ u64 new_end_seq = 0;
+ struct prb_handle h;
+ bool cont = false;
+ char *msgbuf;
+ u64 end_seq;
+ int textlen;
+ u64 seq = 0;
+ char *rbuf;
+ int l = 0;
+ int ret;

if (!dumper->active)
- goto out;
+ return cont;

- logbuf_lock_irqsave(flags);
- if (dumper->cur_seq < log_first_seq) {
- /* messages are gone, move to first available one */
- dumper->cur_seq = log_first_seq;
- dumper->cur_idx = log_first_idx;
- }
+ rbuf = prb_reserve(&h, &sprint_rb, PRINTK_RECORD_MAX);
+ if (!rbuf)
+ return cont;
+ msgbuf = rbuf;

- /* last entry */
- if (dumper->cur_seq >= dumper->next_seq) {
- logbuf_unlock_irqrestore(flags);
- goto out;
- }
+ prb_iter_init(&iter, &printk_rb, NULL);

- /* calculate length of entire buffer */
- seq = dumper->cur_seq;
- idx = dumper->cur_idx;
- while (seq < dumper->next_seq) {
- struct printk_log *msg = log_from_idx(idx);
+ /*
+ * seek to the start record, which is set/modified
+ * by kmsg_dump_get_line_nolock()
+ */
+ ret = prb_iter_seek(&iter, dumper->line_seq);
+ if (ret <= 0)
+ prb_iter_init(&iter, &printk_rb, &seq);
+
+ /* work with a local end seq to have a constant value */
+ end_seq = dumper->buffer_end_seq;
+ if (!end_seq) {
+ /* initialize end seq to "infinity" */
+ end_seq = -1;
+ dumper->buffer_end_seq = end_seq;
+ }
+retry:
+ if (seq >= end_seq)
+ goto out;

- l += msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
- }
+ /* count the total bytes after seq */
+ textlen = count_remaining(&iter, end_seq, msgbuf,
+ PRINTK_RECORD_MAX, 0, time);

- /* move first record forward until length fits into the buffer */
- seq = dumper->cur_seq;
- idx = dumper->cur_idx;
- while (l > size && seq < dumper->next_seq) {
- struct printk_log *msg = log_from_idx(idx);
+ /* move iter forward until length fits into the buffer */
+ while (textlen > size) {
+ ret = prb_iter_next(&iter, msgbuf, PRINTK_RECORD_MAX, &seq);
+ if (ret == 0) {
+ break;
+ } else if (ret < 0) {
+ prb_iter_init(&iter, &printk_rb, &seq);
+ goto retry;
+ }

- l -= msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
+ msg = (struct printk_log *)msgbuf;
+ textlen -= msg_print_text(msg, true, time, NULL, 0);
}

- /* last message in next interation */
- next_seq = seq;
- next_idx = idx;
+ /* save end seq for the next iteration */
+ new_end_seq = seq + 1;

- l = 0;
- while (seq < dumper->next_seq) {
- struct printk_log *msg = log_from_idx(idx);
+ /* copy messages to buffer */
+ while (l < size) {
+ ret = prb_iter_next(&iter, msgbuf, PRINTK_RECORD_MAX, &seq);
+ if (ret == 0) {
+ break;
+ } else if (ret < 0) {
+ /*
+ * iterator (and thus also the start position)
+ * invalid, start over from beginning of list
+ */
+ prb_iter_init(&iter, &printk_rb, NULL);
+ continue;
+ }

- l += msg_print_text(msg, syslog, time, buf + l, size - l);
- idx = log_next(idx);
- seq++;
+ if (seq >= end_seq)
+ break;
+
+ msg = (struct printk_log *)msgbuf;
+ textlen = msg_print_text(msg, syslog, time, buf + l, size - l);
+ if (textlen > 0)
+ l += textlen;
+ cont = true;
}

- dumper->next_seq = next_seq;
- dumper->next_idx = next_idx;
- ret = true;
- logbuf_unlock_irqrestore(flags);
-out:
- if (len)
+ if (cont && len)
*len = l;
- return ret;
+out:
+ prb_commit(&h);
+ if (new_end_seq)
+ dumper->buffer_end_seq = new_end_seq;
+ return cont;
}
EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);

@@ -3190,10 +3210,8 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);
*/
void kmsg_dump_rewind_nolock(struct kmsg_dumper *dumper)
{
- dumper->cur_seq = clear_seq;
- dumper->cur_idx = clear_idx;
- dumper->next_seq = log_next_seq;
- dumper->next_idx = log_next_idx;
+ dumper->line_seq = 0;
+ dumper->buffer_end_seq = 0;
}

/**
@@ -3206,11 +3224,9 @@ void kmsg_dump_rewind_nolock(struct kmsg_dumper *dumper)
*/
void kmsg_dump_rewind(struct kmsg_dumper *dumper)
{
- unsigned long flags;
-
- logbuf_lock_irqsave(flags);
+ mutex_lock(&kmsg_dump_lock);
kmsg_dump_rewind_nolock(dumper);
- logbuf_unlock_irqrestore(flags);
+ mutex_unlock(&kmsg_dump_lock);
}
EXPORT_SYMBOL_GPL(kmsg_dump_rewind);

--
2.11.0


2019-02-12 14:48:21

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 23/25] printk: implement syslog

Since printk messages are now logged to a new ring buffer, update
the syslog functions to pull the messages from there.
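
One pattern worth noting below: the syslog code often needs to peek
at the next record without advancing its read position. It does this
by operating on a copy of the shared iterator, roughly (a sketch;
msgbuf is again a PRINTK_RECORD_MAX-sized record buffer):

    struct prb_iterator iter;
    u64 seq;
    int ret;

    prb_iter_copy(&iter, &syslog_iter);
    ret = prb_iter_next(&iter, msgbuf, PRINTK_RECORD_MAX, &seq);
    if (ret > 0) {
        /* record consumed; now advance the shared iterator */
        prb_iter_next(&syslog_iter, NULL, 0, &syslog_seq);
    }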

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 340 ++++++++++++++++++++++++++++++++++---------------
1 file changed, 235 insertions(+), 105 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ed1ec8c23e97..241d142a2755 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -387,10 +387,12 @@ DECLARE_STATIC_PRINTKRB_CPULOCK(printk_cpulock);
/* record buffer */
DECLARE_STATIC_PRINTKRB(printk_rb, CONFIG_LOG_BUF_SHIFT, &printk_cpulock);

+static DEFINE_MUTEX(syslog_lock);
+DECLARE_STATIC_PRINTKRB_ITER(syslog_iter, &printk_rb);
+
DECLARE_WAIT_QUEUE_HEAD(log_wait);
/* the next printk record to read by syslog(READ) or /proc/kmsg */
static u64 syslog_seq;
-static u32 syslog_idx;
static size_t syslog_partial;
static bool syslog_time;

@@ -1249,30 +1251,42 @@ static size_t msg_print_text(const struct printk_log *msg, bool syslog,
return len;
}

-static int syslog_print(char __user *buf, int size)
+static int syslog_print(char __user *buf, int size, char *text,
+ char *msgbuf, int *locked)
{
- char *text;
+ struct prb_iterator iter;
struct printk_log *msg;
int len = 0;
-
- text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
- if (!text)
- return -ENOMEM;
+ u64 seq;
+ int ret;

while (size > 0) {
size_t n;
size_t skip;

- logbuf_lock_irq();
- if (syslog_seq < log_first_seq) {
- /* messages are gone, move to first one */
- syslog_seq = log_first_seq;
- syslog_idx = log_first_idx;
- syslog_partial = 0;
+ for (;;) {
+ prb_iter_copy(&iter, &syslog_iter);
+ ret = prb_iter_next(&iter, msgbuf,
+ PRINTK_RECORD_MAX, &seq);
+ if (ret < 0) {
+ /* messages are gone, move to first one */
+ prb_iter_init(&syslog_iter, &printk_rb,
+ &syslog_seq);
+ syslog_partial = 0;
+ continue;
+ }
+ break;
}
- if (syslog_seq == log_next_seq) {
- logbuf_unlock_irq();
+ if (ret == 0)
break;
+
+ /*
+ * If messages have been missed, the partial tracker
+ * is no longer valid and must be reset.
+ */
+ if (syslog_seq > 0 && seq - 1 != syslog_seq) {
+ syslog_seq = seq - 1;
+ syslog_partial = 0;
}

/*
@@ -1282,131 +1296,212 @@ static int syslog_print(char __user *buf, int size)
if (!syslog_partial)
syslog_time = printk_time;

+ msg = (struct printk_log *)msgbuf;
+
skip = syslog_partial;
- msg = log_from_idx(syslog_idx);
n = msg_print_text(msg, true, syslog_time, text,
- LOG_LINE_MAX + PREFIX_MAX);
+ PRINTK_SPRINT_MAX);
if (n - syslog_partial <= size) {
/* message fits into buffer, move forward */
- syslog_idx = log_next(syslog_idx);
- syslog_seq++;
+ prb_iter_next(&syslog_iter, NULL, 0, &syslog_seq);
n -= syslog_partial;
syslog_partial = 0;
- } else if (!len){
+ } else if (!len) {
/* partial read(), remember position */
n = size;
syslog_partial += n;
} else
n = 0;
- logbuf_unlock_irq();

if (!n)
break;

+ mutex_unlock(&syslog_lock);
if (copy_to_user(buf, text + skip, n)) {
if (!len)
len = -EFAULT;
+ *locked = 0;
break;
}
+ ret = mutex_lock_interruptible(&syslog_lock);

len += n;
size -= n;
buf += n;
+
+ if (ret) {
+ if (!len)
+ len = ret;
+ *locked = 0;
+ break;
+ }
}

- kfree(text);
return len;
}

-static int syslog_print_all(char __user *buf, int size, bool clear)
+static int count_remaining(struct prb_iterator *iter, u64 until_seq,
+ char *msgbuf, int size, bool records, bool time)
{
- char *text;
+ struct prb_iterator local_iter;
+ struct printk_log *msg;
int len = 0;
- u64 next_seq;
u64 seq;
- u32 idx;
+ int ret;
+
+ prb_iter_copy(&local_iter, iter);
+ for (;;) {
+ ret = prb_iter_next(&local_iter, msgbuf, size, &seq);
+ if (ret == 0) {
+ break;
+ } else if (ret < 0) {
+ /* the iter is invalid, restart from head */
+ prb_iter_init(&local_iter, &printk_rb, NULL);
+ len = 0;
+ continue;
+ }
+
+ if (until_seq && seq >= until_seq)
+ break;
+
+ if (records) {
+ len++;
+ } else {
+ msg = (struct printk_log *)msgbuf;
+ len += msg_print_text(msg, true, time, NULL, 0);
+ }
+ }
+
+ return len;
+}
+
+static void syslog_clear(void)
+{
+ struct prb_iterator iter;
+ int ret;
+
+ prb_iter_init(&iter, &printk_rb, &clear_seq);
+ for (;;) {
+ ret = prb_iter_next(&iter, NULL, 0, &clear_seq);
+ if (ret == 0)
+ break;
+ else if (ret < 0)
+ prb_iter_init(&iter, &printk_rb, &clear_seq);
+ }
+}
+
+static int syslog_print_all(char __user *buf, int size, bool clear)
+{
+ struct prb_iterator iter;
+ struct printk_log *msg;
+ char *msgbuf = NULL;
+ char *text = NULL;
+ int textlen;
+ u64 seq = 0;
+ int len = 0;
bool time;
+ int ret;

- text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
+ text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL);
if (!text)
return -ENOMEM;
+ msgbuf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
+ if (!msgbuf) {
+ kfree(text);
+ return -ENOMEM;
+ }

time = printk_time;
- logbuf_lock_irq();
+
/*
- * Find first record that fits, including all following records,
- * into the user-provided buffer for this dump.
+ * Set up iter to the last event before clear. Clear
+ * may be lost, but keep going with a best effort.
*/
- seq = clear_seq;
- idx = clear_idx;
- while (seq < log_next_seq) {
- struct printk_log *msg = log_from_idx(idx);
+ prb_iter_init(&iter, &printk_rb, NULL);
+ prb_iter_seek(&iter, clear_seq);

- len += msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
- }
+ /* count the total bytes after clear */
+ len = count_remaining(&iter, 0, msgbuf, PRINTK_RECORD_MAX,
+ false, time);

- /* move first record forward until length fits into the buffer */
- seq = clear_seq;
- idx = clear_idx;
- while (len > size && seq < log_next_seq) {
- struct printk_log *msg = log_from_idx(idx);
+ /* move iter forward until length fits into the buffer */
+ while (len > size) {
+ ret = prb_iter_next(&iter, msgbuf,
+ PRINTK_RECORD_MAX, &seq);
+ if (ret == 0) {
+ break;
+ } else if (ret < 0) {
+ /*
+ * The iter is now invalid so clear will
+ * also be invalid. Restart from the head.
+ */
+ prb_iter_init(&iter, &printk_rb, NULL);
+ len = count_remaining(&iter, 0, msgbuf,
+ PRINTK_RECORD_MAX, false, time);
+ continue;
+ }

+ msg = (struct printk_log *)msgbuf;
len -= msg_print_text(msg, true, time, NULL, 0);
- idx = log_next(idx);
- seq++;
- }

- /* last message fitting into this dump */
- next_seq = log_next_seq;
+ if (clear)
+ clear_seq = seq;
+ }

+ /* copy messages to buffer */
len = 0;
- while (len >= 0 && seq < next_seq) {
- struct printk_log *msg = log_from_idx(idx);
- int textlen = msg_print_text(msg, true, time, text,
- LOG_LINE_MAX + PREFIX_MAX);
+ while (len >= 0 && len < size) {
+ if (clear)
+ clear_seq = seq;

- idx = log_next(idx);
- seq++;
+ ret = prb_iter_next(&iter, msgbuf,
+ PRINTK_RECORD_MAX, &seq);
+ if (ret == 0) {
+ break;
+ } else if (ret < 0) {
+ /*
+ * The iter is now invalid. Make a best
+ * effort to grab the rest of the log
+ * from the new head.
+ */
+ prb_iter_init(&iter, &printk_rb, NULL);
+ continue;
+ }
+
+ msg = (struct printk_log *)msgbuf;
+ textlen = msg_print_text(msg, true, time, text,
+ PRINTK_SPRINT_MAX);
+ if (textlen < 0) {
+ len = textlen;
+ break;
+ }

- logbuf_unlock_irq();
if (copy_to_user(buf + len, text, textlen))
len = -EFAULT;
else
len += textlen;
- logbuf_lock_irq();
-
- if (seq < log_first_seq) {
- /* messages are gone, move to next one */
- seq = log_first_seq;
- idx = log_first_idx;
- }
}

- if (clear) {
- clear_seq = log_next_seq;
- clear_idx = log_next_idx;
- }
- logbuf_unlock_irq();
+ if (clear && !seq)
+ syslog_clear();

- kfree(text);
+ if (text)
+ kfree(text);
+ if (msgbuf)
+ kfree(msgbuf);
return len;
}

-static void syslog_clear(void)
-{
- logbuf_lock_irq();
- clear_seq = log_next_seq;
- clear_idx = log_next_idx;
- logbuf_unlock_irq();
-}
-
int do_syslog(int type, char __user *buf, int len, int source)
{
bool clear = false;
static int saved_console_loglevel = LOGLEVEL_DEFAULT;
+ struct prb_iterator iter;
+ char *msgbuf = NULL;
+ char *text = NULL;
+ int locked;
int error;
+ int ret;

error = check_syslog_permissions(type, source);
if (error)
@@ -1424,11 +1519,49 @@ int do_syslog(int type, char __user *buf, int len, int source)
return 0;
if (!access_ok(buf, len))
return -EFAULT;
- error = wait_event_interruptible(log_wait,
- syslog_seq != log_next_seq);
+
+ text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL);
+ msgbuf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
+ if (!text || !msgbuf) {
+ error = -ENOMEM;
+ goto out;
+ }
+
+ error = mutex_lock_interruptible(&syslog_lock);
if (error)
- return error;
- error = syslog_print(buf, len);
+ goto out;
+
+ /*
+ * Wait until a first message is available. Use a copy
+ * so that the shared syslog iterator is not advanced here.
+ */
+ for (;;) {
+ prb_iter_copy(&iter, &syslog_iter);
+
+ mutex_unlock(&syslog_lock);
+ ret = prb_iter_wait_next(&iter, NULL, 0, NULL);
+ if (ret == -ERESTARTSYS) {
+ error = ret;
+ goto out;
+ }
+ error = mutex_lock_interruptible(&syslog_lock);
+ if (error)
+ goto out;
+
+ if (ret == -EINVAL) {
+ prb_iter_init(&syslog_iter, &printk_rb,
+ &syslog_seq);
+ syslog_partial = 0;
+ continue;
+ }
+ break;
+ }
+
+ /* print as much as will fit in the user buffer */
+ locked = 1;
+ error = syslog_print(buf, len, text, msgbuf, &locked);
+ if (locked)
+ mutex_unlock(&syslog_lock);
break;
/* Read/clear last kernel messages */
case SYSLOG_ACTION_READ_CLEAR:
@@ -1473,47 +1606,45 @@ int do_syslog(int type, char __user *buf, int len, int source)
break;
/* Number of chars in the log buffer */
case SYSLOG_ACTION_SIZE_UNREAD:
- logbuf_lock_irq();
- if (syslog_seq < log_first_seq) {
- /* messages are gone, move to first one */
- syslog_seq = log_first_seq;
- syslog_idx = log_first_idx;
- syslog_partial = 0;
- }
+ msgbuf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
+ if (!msgbuf)
+ return -ENOMEM;
+
+ error = mutex_lock_interruptible(&syslog_lock);
+ if (error)
+ goto out;
+
if (source == SYSLOG_FROM_PROC) {
/*
* Short-cut for poll(/"proc/kmsg") which simply checks
* for pending data, not the size; return the count of
* records, not the length.
*/
- error = log_next_seq - syslog_seq;
+ error = count_remaining(&syslog_iter, 0, msgbuf,
+ PRINTK_RECORD_MAX, true,
+ printk_time);
} else {
- u64 seq = syslog_seq;
- u32 idx = syslog_idx;
- bool time = syslog_partial ? syslog_time : printk_time;
-
- while (seq < log_next_seq) {
- struct printk_log *msg = log_from_idx(idx);
-
- error += msg_print_text(msg, true, time, NULL,
- 0);
- time = printk_time;
- idx = log_next(idx);
- seq++;
- }
+ error = count_remaining(&syslog_iter, 0, msgbuf,
+ PRINTK_RECORD_MAX, false,
+ printk_time);
error -= syslog_partial;
}
- logbuf_unlock_irq();
+
+ mutex_unlock(&syslog_lock);
break;
/* Size of the log buffer */
case SYSLOG_ACTION_SIZE_BUFFER:
- error = log_buf_len;
+ error = prb_buffer_size(&printk_rb);
break;
default:
error = -EINVAL;
break;
}
-
+out:
+ if (msgbuf)
+ kfree(msgbuf);
+ if (text)
+ kfree(text);
return error;
}

@@ -1932,7 +2063,6 @@ EXPORT_SYMBOL(printk);
#define printk_time false

static u64 syslog_seq;
-static u32 syslog_idx;
static u64 log_first_seq;
static u32 log_first_idx;
static u64 log_next_seq;
--
2.11.0


2019-02-12 14:49:08

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 22/25] printk: implement /dev/kmsg

Since printk messages are now logged to a new ring buffer, update
the /dev/kmsg functions to pull the messages from there.
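
Each open file descriptor now carries its own iterator in struct
devkmsg_user, so a blocking read reduces to (a sketch, using the
fields added below):

    ret = prb_iter_wait_next(&user->iter, &user->msgbuf[0],
                             sizeof(user->msgbuf), &seq);

where a negative return indicates either an invalid iterator
(-EINVAL, the reader was overtaken) or an interrupting signal
(-ERESTARTSYS), both handled in devkmsg_read() below.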

Signed-off-by: John Ogness <[email protected]>
---
fs/proc/kmsg.c | 4 +-
include/linux/printk.h | 1 +
kernel/printk/printk.c | 162 +++++++++++++++++++++++++++++++++----------------
3 files changed, 113 insertions(+), 54 deletions(-)

diff --git a/fs/proc/kmsg.c b/fs/proc/kmsg.c
index 4f4a2abb225e..4e62963a87ca 100644
--- a/fs/proc/kmsg.c
+++ b/fs/proc/kmsg.c
@@ -18,8 +18,6 @@
#include <linux/uaccess.h>
#include <asm/io.h>

-extern wait_queue_head_t log_wait;
-
static int kmsg_open(struct inode * inode, struct file * file)
{
return do_syslog(SYSLOG_ACTION_OPEN, NULL, 0, SYSLOG_FROM_PROC);
@@ -42,7 +40,7 @@ static ssize_t kmsg_read(struct file *file, char __user *buf,

static __poll_t kmsg_poll(struct file *file, poll_table *wait)
{
- poll_wait(file, &log_wait, wait);
+ poll_wait(file, printk_wait_queue(), wait);
if (do_syslog(SYSLOG_ACTION_SIZE_UNREAD, NULL, 0, SYSLOG_FROM_PROC))
return EPOLLIN | EPOLLRDNORM;
return 0;
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 58bd06d88ea3..bef0b5c5fcbf 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -191,6 +191,7 @@ __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...);
void dump_stack_print_info(const char *log_lvl);
void show_regs_print_info(const char *log_lvl);
extern asmlinkage void dump_stack(void) __cold;
+struct wait_queue_head *printk_wait_queue(void);
#else
static inline __printf(1, 0)
int vprintk(const char *s, va_list args)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 306e7575499c..ed1ec8c23e97 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -637,10 +637,11 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
/* /dev/kmsg - userspace message inject/listen interface */
struct devkmsg_user {
u64 seq;
- u32 idx;
+ struct prb_iterator iter;
struct ratelimit_state rs;
struct mutex lock;
char buf[CONSOLE_EXT_LOG_MAX];
+ char msgbuf[PRINTK_RECORD_MAX];
};

static __printf(3, 4) __cold
@@ -723,9 +724,11 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
struct devkmsg_user *user = file->private_data;
+ struct prb_iterator backup_iter;
struct printk_log *msg;
- size_t len;
ssize_t ret;
+ size_t len;
+ u64 seq;

if (!user)
return -EBADF;
@@ -734,52 +737,67 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
if (ret)
return ret;

- logbuf_lock_irq();
- while (user->seq == log_next_seq) {
- if (file->f_flags & O_NONBLOCK) {
- ret = -EAGAIN;
- logbuf_unlock_irq();
- goto out;
- }
+ /* make a backup copy in case there is a problem */
+ prb_iter_copy(&backup_iter, &user->iter);

- logbuf_unlock_irq();
- ret = wait_event_interruptible(log_wait,
- user->seq != log_next_seq);
- if (ret)
- goto out;
- logbuf_lock_irq();
+ if (file->f_flags & O_NONBLOCK) {
+ ret = prb_iter_next(&user->iter, &user->msgbuf[0],
+ sizeof(user->msgbuf), &seq);
+ } else {
+ ret = prb_iter_wait_next(&user->iter, &user->msgbuf[0],
+ sizeof(user->msgbuf), &seq);
}
-
- if (user->seq < log_first_seq) {
- /* our last seen message is gone, return error and reset */
- user->idx = log_first_idx;
- user->seq = log_first_seq;
+ if (ret == 0) {
+ /* end of list */
+ ret = -EAGAIN;
+ goto out;
+ } else if (ret == -EINVAL) {
+ /* iterator invalid, return error and reset */
ret = -EPIPE;
- logbuf_unlock_irq();
+ prb_iter_init(&user->iter, &printk_rb, &user->seq);
+ goto out;
+ } else if (ret < 0) {
+ /* interrupted by signal */
goto out;
}

- msg = log_from_idx(user->idx);
+ if (user->seq == 0) {
+ user->seq = seq;
+ } else {
+ user->seq++;
+ if (user->seq < seq) {
+ ret = -EPIPE;
+ goto restore_out;
+ }
+ }
+
+ msg = (struct printk_log *)&user->msgbuf[0];
len = msg_print_ext_header(user->buf, sizeof(user->buf),
msg, user->seq);
len += msg_print_ext_body(user->buf + len, sizeof(user->buf) - len,
log_dict(msg), msg->dict_len,
log_text(msg), msg->text_len);

- user->idx = log_next(user->idx);
- user->seq++;
- logbuf_unlock_irq();
-
if (len > count) {
ret = -EINVAL;
- goto out;
+ goto restore_out;
}

if (copy_to_user(buf, user->buf, len)) {
ret = -EFAULT;
- goto out;
+ goto restore_out;
}
+
ret = len;
+ goto out;
+restore_out:
+ /*
+ * There was an error, but this message should not be
+ * lost because of it. Restore the backup and set up
+ * seq so that it will work with the next read.
+ */
+ prb_iter_copy(&user->iter, &backup_iter);
+ user->seq = seq - 1;
out:
mutex_unlock(&user->lock);
return ret;
@@ -788,19 +806,21 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
{
struct devkmsg_user *user = file->private_data;
- loff_t ret = 0;
+ loff_t ret;

if (!user)
return -EBADF;
if (offset)
return -ESPIPE;

- logbuf_lock_irq();
+ ret = mutex_lock_interruptible(&user->lock);
+ if (ret)
+ return ret;
+
switch (whence) {
case SEEK_SET:
/* the first record */
- user->idx = log_first_idx;
- user->seq = log_first_seq;
+ prb_iter_init(&user->iter, &printk_rb, &user->seq);
break;
case SEEK_DATA:
/*
@@ -808,40 +828,83 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
* like issued by 'dmesg -c'. Reading /dev/kmsg itself
* changes no global state, and does not clear anything.
*/
- user->idx = clear_idx;
- user->seq = clear_seq;
+ for (;;) {
+ prb_iter_init(&user->iter, &printk_rb, NULL);
+ ret = prb_iter_seek(&user->iter, clear_seq);
+ if (ret > 0) {
+ /* seeked to clear seq */
+ user->seq = clear_seq;
+ break;
+ } else if (ret == 0) {
+ /*
+ * The end of the list was hit without
+ * ever seeing the clear seq. Just
+ * seek to the beginning of the list.
+ */
+ prb_iter_init(&user->iter, &printk_rb,
+ &user->seq);
+ break;
+ }
+ /* iterator invalid, start over */
+ }
+ ret = 0;
break;
case SEEK_END:
/* after the last record */
- user->idx = log_next_idx;
- user->seq = log_next_seq;
+ for (;;) {
+ ret = prb_iter_next(&user->iter, NULL, 0, &user->seq);
+ if (ret == 0)
+ break;
+ else if (ret > 0)
+ continue;
+ /* iterator invalid, start over */
+ prb_iter_init(&user->iter, &printk_rb, &user->seq);
+ }
+ ret = 0;
break;
default:
ret = -EINVAL;
}
- logbuf_unlock_irq();
+
+ mutex_unlock(&user->lock);
return ret;
}

+struct wait_queue_head *printk_wait_queue(void)
+{
+ /* FIXME: using prb internals! */
+ return printk_rb.wq;
+}
+
static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
{
struct devkmsg_user *user = file->private_data;
+ struct prb_iterator iter;
__poll_t ret = 0;
+ int rbret;
+ u64 seq;

if (!user)
return EPOLLERR|EPOLLNVAL;

- poll_wait(file, &log_wait, wait);
+ poll_wait(file, printk_wait_queue(), wait);

- logbuf_lock_irq();
- if (user->seq < log_next_seq) {
- /* return error when data has vanished underneath us */
- if (user->seq < log_first_seq)
- ret = EPOLLIN|EPOLLRDNORM|EPOLLERR|EPOLLPRI;
- else
- ret = EPOLLIN|EPOLLRDNORM;
- }
- logbuf_unlock_irq();
+ mutex_lock(&user->lock);
+
+ /* use copy so no actual iteration takes place */
+ prb_iter_copy(&iter, &user->iter);
+
+ rbret = prb_iter_next(&iter, &user->msgbuf[0],
+ sizeof(user->msgbuf), &seq);
+ if (rbret == 0)
+ goto out;
+
+ ret = EPOLLIN|EPOLLRDNORM;
+
+ if (rbret < 0 || (seq - user->seq) != 1)
+ ret |= EPOLLERR|EPOLLPRI;
+out:
+ mutex_unlock(&user->lock);

return ret;
}
@@ -871,10 +934,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)

mutex_init(&user->lock);

- logbuf_lock_irq();
- user->idx = log_first_idx;
- user->seq = log_first_seq;
- logbuf_unlock_irq();
+ prb_iter_init(&user->iter, &printk_rb, &user->seq);

file->private_data = user;
return 0;
--
2.11.0


2019-02-12 14:49:38

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 11/25] printk_safe: remove printk safe code

vprintk variants are now NMI-safe so there is no longer a need for
the "safe" calls.

NOTE: This also removes printk flushing functionality.

Signed-off-by: John Ogness <[email protected]>
---
include/linux/hardirq.h | 2 -
include/linux/printk.h | 27 ---
init/main.c | 1 -
kernel/kexec_core.c | 1 -
kernel/panic.c | 3 -
kernel/printk/Makefile | 1 -
kernel/printk/internal.h | 30 +---
kernel/printk/printk.c | 13 +-
kernel/printk/printk_safe.c | 427 --------------------------------------------
kernel/trace/trace.c | 2 -
lib/nmi_backtrace.c | 6 -
11 files changed, 7 insertions(+), 506 deletions(-)
delete mode 100644 kernel/printk/printk_safe.c

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 0fbbcdf0c178..c1effa24a71d 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -62,7 +62,6 @@ extern void irq_exit(void);

#define nmi_enter() \
do { \
- printk_nmi_enter(); \
lockdep_off(); \
ftrace_nmi_enter(); \
BUG_ON(in_nmi()); \
@@ -79,7 +78,6 @@ extern void irq_exit(void);
preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET); \
ftrace_nmi_exit(); \
lockdep_on(); \
- printk_nmi_exit(); \
} while (0)

#endif /* LINUX_HARDIRQ_H */
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 77740a506ebb..a79a736b54b6 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -145,18 +145,6 @@ static inline __printf(1, 2) __cold
void early_printk(const char *s, ...) { }
#endif

-#ifdef CONFIG_PRINTK_NMI
-extern void printk_nmi_enter(void);
-extern void printk_nmi_exit(void);
-extern void printk_nmi_direct_enter(void);
-extern void printk_nmi_direct_exit(void);
-#else
-static inline void printk_nmi_enter(void) { }
-static inline void printk_nmi_exit(void) { }
-static inline void printk_nmi_direct_enter(void) { }
-static inline void printk_nmi_direct_exit(void) { }
-#endif /* PRINTK_NMI */
-
#ifdef CONFIG_PRINTK
asmlinkage __printf(5, 0)
int vprintk_emit(int facility, int level,
@@ -201,9 +189,6 @@ __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...);
void dump_stack_print_info(const char *log_lvl);
void show_regs_print_info(const char *log_lvl);
extern asmlinkage void dump_stack(void) __cold;
-extern void printk_safe_init(void);
-extern void printk_safe_flush(void);
-extern void printk_safe_flush_on_panic(void);
#else
static inline __printf(1, 0)
int vprintk(const char *s, va_list args)
@@ -267,18 +252,6 @@ static inline void show_regs_print_info(const char *log_lvl)
static inline void dump_stack(void)
{
}
-
-static inline void printk_safe_init(void)
-{
-}
-
-static inline void printk_safe_flush(void)
-{
-}
-
-static inline void printk_safe_flush_on_panic(void)
-{
-}
#endif

extern int kptr_restrict;
diff --git a/init/main.c b/init/main.c
index e2e80ca3165a..aec02435f00b 100644
--- a/init/main.c
+++ b/init/main.c
@@ -648,7 +648,6 @@ asmlinkage __visible void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
- printk_safe_init();
perf_event_init();
profile_init();
call_function_init();
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index d7140447be75..bbe21da47e2e 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -972,7 +972,6 @@ void crash_kexec(struct pt_regs *regs)
old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
if (old_cpu == PANIC_CPU_INVALID) {
/* This is the 1st CPU which comes here, so go ahead. */
- printk_safe_flush_on_panic();
__crash_kexec(regs);

/*
diff --git a/kernel/panic.c b/kernel/panic.c
index f121e6ba7e11..09a836b3c687 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -223,7 +223,6 @@ void panic(const char *fmt, ...)
* Bypass the panic_cpu check and call __crash_kexec directly.
*/
if (!_crash_kexec_post_notifiers) {
- printk_safe_flush_on_panic();
__crash_kexec(NULL);

/*
@@ -247,8 +246,6 @@ void panic(const char *fmt, ...)
*/
atomic_notifier_call_chain(&panic_notifier_list, 0, buf);

- /* Call flush even twice. It tries harder with a single online CPU */
- printk_safe_flush_on_panic();
kmsg_dump(KMSG_DUMP_PANIC);

/*
diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
index 4a2ffc39eb95..85405bdcf2b3 100644
--- a/kernel/printk/Makefile
+++ b/kernel/printk/Makefile
@@ -1,3 +1,2 @@
obj-y = printk.o
-obj-$(CONFIG_PRINTK) += printk_safe.o
obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 0f1898820cba..59ad43dba837 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -32,32 +32,6 @@ int vprintk_store(int facility, int level,
__printf(1, 0) int vprintk_default(const char *fmt, va_list args);
__printf(1, 0) int vprintk_deferred(const char *fmt, va_list args);
__printf(1, 0) int vprintk_func(const char *fmt, va_list args);
-void __printk_safe_enter(void);
-void __printk_safe_exit(void);
-
-#define printk_safe_enter_irqsave(flags) \
- do { \
- local_irq_save(flags); \
- __printk_safe_enter(); \
- } while (0)
-
-#define printk_safe_exit_irqrestore(flags) \
- do { \
- __printk_safe_exit(); \
- local_irq_restore(flags); \
- } while (0)
-
-#define printk_safe_enter_irq() \
- do { \
- local_irq_disable(); \
- __printk_safe_enter(); \
- } while (0)
-
-#define printk_safe_exit_irq() \
- do { \
- __printk_safe_exit(); \
- local_irq_enable(); \
- } while (0)

void defer_console_output(void);

@@ -70,10 +44,10 @@ __printf(1, 0) int vprintk_func(const char *fmt, va_list args) { return 0; }
* semaphore and some of console functions (console_unlock()/etc.), so
* printk-safe must preserve the existing local IRQ guarantees.
*/
+#endif /* CONFIG_PRINTK */
+
#define printk_safe_enter_irqsave(flags) local_irq_save(flags)
#define printk_safe_exit_irqrestore(flags) local_irq_restore(flags)

#define printk_safe_enter_irq() local_irq_disable()
#define printk_safe_exit_irq() local_irq_enable()
-
-#endif /* CONFIG_PRINTK */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index b6a6f1002741..073ff9fd6872 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1675,13 +1675,6 @@ static bool cont_add(int facility, int level, enum log_flags flags, const char *
}
#endif /* 0 */

-int vprintk_store(int facility, int level,
- const char *dict, size_t dictlen,
- const char *fmt, va_list args)
-{
- return vprintk_emit(facility, level, dict, dictlen, fmt, args);
-}
-
/* ring buffer used as memory allocator for temporary sprint buffers */
DECLARE_STATIC_PRINTKRB(sprint_rb,
ilog2(PRINTK_RECORD_MAX + sizeof(struct prb_entry) +
@@ -1752,6 +1745,11 @@ asmlinkage int vprintk_emit(int facility, int level,
}
EXPORT_SYMBOL(vprintk_emit);

+__printf(1, 0) int vprintk_func(const char *fmt, va_list args)
+{
+ return vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
+}
+
asmlinkage int vprintk(const char *fmt, va_list args)
{
return vprintk_func(fmt, args);
@@ -3142,5 +3140,4 @@ void kmsg_dump_rewind(struct kmsg_dumper *dumper)
logbuf_unlock_irqrestore(flags);
}
EXPORT_SYMBOL_GPL(kmsg_dump_rewind);
-
#endif
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
deleted file mode 100644
index 0913b4d385de..000000000000
--- a/kernel/printk/printk_safe.c
+++ /dev/null
@@ -1,427 +0,0 @@
-/*
- * printk_safe.c - Safe printk for printk-deadlock-prone contexts
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version 2
- * of the License, or (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, see <http://www.gnu.org/licenses/>.
- */
-
-#include <linux/preempt.h>
-#include <linux/spinlock.h>
-#include <linux/debug_locks.h>
-#include <linux/smp.h>
-#include <linux/cpumask.h>
-#include <linux/irq_work.h>
-#include <linux/printk.h>
-
-#include "internal.h"
-
-/*
- * printk() could not take logbuf_lock in NMI context. Instead,
- * it uses an alternative implementation that temporary stores
- * the strings into a per-CPU buffer. The content of the buffer
- * is later flushed into the main ring buffer via IRQ work.
- *
- * The alternative implementation is chosen transparently
- * by examinig current printk() context mask stored in @printk_context
- * per-CPU variable.
- *
- * The implementation allows to flush the strings also from another CPU.
- * There are situations when we want to make sure that all buffers
- * were handled or when IRQs are blocked.
- */
-static int printk_safe_irq_ready __read_mostly;
-
-#define SAFE_LOG_BUF_LEN ((1 << CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT) - \
- sizeof(atomic_t) - \
- sizeof(atomic_t) - \
- sizeof(struct irq_work))
-
-struct printk_safe_seq_buf {
- atomic_t len; /* length of written data */
- atomic_t message_lost;
- struct irq_work work; /* IRQ work that flushes the buffer */
- unsigned char buffer[SAFE_LOG_BUF_LEN];
-};
-
-static DEFINE_PER_CPU(struct printk_safe_seq_buf, safe_print_seq);
-static DEFINE_PER_CPU(int, printk_context);
-
-#ifdef CONFIG_PRINTK_NMI
-static DEFINE_PER_CPU(struct printk_safe_seq_buf, nmi_print_seq);
-#endif
-
-/* Get flushed in a more safe context. */
-static void queue_flush_work(struct printk_safe_seq_buf *s)
-{
- if (printk_safe_irq_ready)
- irq_work_queue(&s->work);
-}
-
-/*
- * Add a message to per-CPU context-dependent buffer. NMI and printk-safe
- * have dedicated buffers, because otherwise printk-safe preempted by
- * NMI-printk would have overwritten the NMI messages.
- *
- * The messages are flushed from irq work (or from panic()), possibly,
- * from other CPU, concurrently with printk_safe_log_store(). Should this
- * happen, printk_safe_log_store() will notice the buffer->len mismatch
- * and repeat the write.
- */
-static __printf(2, 0) int printk_safe_log_store(struct printk_safe_seq_buf *s,
- const char *fmt, va_list args)
-{
- int add;
- size_t len;
- va_list ap;
-
-again:
- len = atomic_read(&s->len);
-
- /* The trailing '\0' is not counted into len. */
- if (len >= sizeof(s->buffer) - 1) {
- atomic_inc(&s->message_lost);
- queue_flush_work(s);
- return 0;
- }
-
- /*
- * Make sure that all old data have been read before the buffer
- * was reset. This is not needed when we just append data.
- */
- if (!len)
- smp_rmb();
-
- va_copy(ap, args);
- add = vscnprintf(s->buffer + len, sizeof(s->buffer) - len, fmt, ap);
- va_end(ap);
- if (!add)
- return 0;
-
- /*
- * Do it once again if the buffer has been flushed in the meantime.
- * Note that atomic_cmpxchg() is an implicit memory barrier that
- * makes sure that the data were written before updating s->len.
- */
- if (atomic_cmpxchg(&s->len, len, len + add) != len)
- goto again;
-
- queue_flush_work(s);
- return add;
-}
-
-static inline void printk_safe_flush_line(const char *text, int len)
-{
- /*
- * Avoid any console drivers calls from here, because we may be
- * in NMI or printk_safe context (when in panic). The messages
- * must go only into the ring buffer at this stage. Consoles will
- * get explicitly called later when a crashdump is not generated.
- */
- printk_deferred("%.*s", len, text);
-}
-
-/* printk part of the temporary buffer line by line */
-static int printk_safe_flush_buffer(const char *start, size_t len)
-{
- const char *c, *end;
- bool header;
-
- c = start;
- end = start + len;
- header = true;
-
- /* Print line by line. */
- while (c < end) {
- if (*c == '\n') {
- printk_safe_flush_line(start, c - start + 1);
- start = ++c;
- header = true;
- continue;
- }
-
- /* Handle continuous lines or missing new line. */
- if ((c + 1 < end) && printk_get_level(c)) {
- if (header) {
- c = printk_skip_level(c);
- continue;
- }
-
- printk_safe_flush_line(start, c - start);
- start = c++;
- header = true;
- continue;
- }
-
- header = false;
- c++;
- }
-
- /* Check if there was a partial line. Ignore pure header. */
- if (start < end && !header) {
- static const char newline[] = KERN_CONT "\n";
-
- printk_safe_flush_line(start, end - start);
- printk_safe_flush_line(newline, strlen(newline));
- }
-
- return len;
-}
-
-static void report_message_lost(struct printk_safe_seq_buf *s)
-{
- int lost = atomic_xchg(&s->message_lost, 0);
-
- if (lost)
- printk_deferred("Lost %d message(s)!\n", lost);
-}
-
-/*
- * Flush data from the associated per-CPU buffer. The function
- * can be called either via IRQ work or independently.
- */
-static void __printk_safe_flush(struct irq_work *work)
-{
- static raw_spinlock_t read_lock =
- __RAW_SPIN_LOCK_INITIALIZER(read_lock);
- struct printk_safe_seq_buf *s =
- container_of(work, struct printk_safe_seq_buf, work);
- unsigned long flags;
- size_t len;
- int i;
-
- /*
- * The lock has two functions. First, one reader has to flush all
- * available message to make the lockless synchronization with
- * writers easier. Second, we do not want to mix messages from
- * different CPUs. This is especially important when printing
- * a backtrace.
- */
- raw_spin_lock_irqsave(&read_lock, flags);
-
- i = 0;
-more:
- len = atomic_read(&s->len);
-
- /*
- * This is just a paranoid check that nobody has manipulated
- * the buffer an unexpected way. If we printed something then
- * @len must only increase. Also it should never overflow the
- * buffer size.
- */
- if ((i && i >= len) || len > sizeof(s->buffer)) {
- const char *msg = "printk_safe_flush: internal error\n";
-
- printk_safe_flush_line(msg, strlen(msg));
- len = 0;
- }
-
- if (!len)
- goto out; /* Someone else has already flushed the buffer. */
-
- /* Make sure that data has been written up to the @len */
- smp_rmb();
- i += printk_safe_flush_buffer(s->buffer + i, len - i);
-
- /*
- * Check that nothing has got added in the meantime and truncate
- * the buffer. Note that atomic_cmpxchg() is an implicit memory
- * barrier that makes sure that the data were copied before
- * updating s->len.
- */
- if (atomic_cmpxchg(&s->len, len, 0) != len)
- goto more;
-
-out:
- report_message_lost(s);
- raw_spin_unlock_irqrestore(&read_lock, flags);
-}
-
-/**
- * printk_safe_flush - flush all per-cpu nmi buffers.
- *
- * The buffers are flushed automatically via IRQ work. This function
- * is useful only when someone wants to be sure that all buffers have
- * been flushed at some point.
- */
-void printk_safe_flush(void)
-{
- int cpu;
-
- for_each_possible_cpu(cpu) {
-#ifdef CONFIG_PRINTK_NMI
- __printk_safe_flush(&per_cpu(nmi_print_seq, cpu).work);
-#endif
- __printk_safe_flush(&per_cpu(safe_print_seq, cpu).work);
- }
-}
-
-/**
- * printk_safe_flush_on_panic - flush all per-cpu nmi buffers when the system
- * goes down.
- *
- * Similar to printk_safe_flush() but it can be called even in NMI context when
- * the system goes down. It does the best effort to get NMI messages into
- * the main ring buffer.
- *
- * Note that it could try harder when there is only one CPU online.
- */
-void printk_safe_flush_on_panic(void)
-{
- /*
- * Make sure that we could access the main ring buffer.
- * Do not risk a double release when more CPUs are up.
- */
- if (raw_spin_is_locked(&logbuf_lock)) {
- if (num_online_cpus() > 1)
- return;
-
- debug_locks_off();
- raw_spin_lock_init(&logbuf_lock);
- }
-
- printk_safe_flush();
-}
-
-#ifdef CONFIG_PRINTK_NMI
-/*
- * Safe printk() for NMI context. It uses a per-CPU buffer to
- * store the message. NMIs are not nested, so there is always only
- * one writer running. But the buffer might get flushed from another
- * CPU, so we need to be careful.
- */
-static __printf(1, 0) int vprintk_nmi(const char *fmt, va_list args)
-{
- struct printk_safe_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
-
- return printk_safe_log_store(s, fmt, args);
-}
-
-void notrace printk_nmi_enter(void)
-{
- this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
-}
-
-void notrace printk_nmi_exit(void)
-{
- this_cpu_and(printk_context, ~PRINTK_NMI_CONTEXT_MASK);
-}
-
-/*
- * Marks a code that might produce many messages in NMI context
- * and the risk of losing them is more critical than eventual
- * reordering.
- *
- * It has effect only when called in NMI context. Then printk()
- * will try to store the messages into the main logbuf directly
- * and use the per-CPU buffers only as a fallback when the lock
- * is not available.
- */
-void printk_nmi_direct_enter(void)
-{
- if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK)
- this_cpu_or(printk_context, PRINTK_NMI_DIRECT_CONTEXT_MASK);
-}
-
-void printk_nmi_direct_exit(void)
-{
- this_cpu_and(printk_context, ~PRINTK_NMI_DIRECT_CONTEXT_MASK);
-}
-
-#else
-
-static __printf(1, 0) int vprintk_nmi(const char *fmt, va_list args)
-{
- return 0;
-}
-
-#endif /* CONFIG_PRINTK_NMI */
-
-/*
- * Lock-less printk(), to avoid deadlocks should the printk() recurse
- * into itself. It uses a per-CPU buffer to store the message, just like
- * NMI.
- */
-static __printf(1, 0) int vprintk_safe(const char *fmt, va_list args)
-{
- struct printk_safe_seq_buf *s = this_cpu_ptr(&safe_print_seq);
-
- return printk_safe_log_store(s, fmt, args);
-}
-
-/* Can be preempted by NMI. */
-void __printk_safe_enter(void)
-{
- this_cpu_inc(printk_context);
-}
-
-/* Can be preempted by NMI. */
-void __printk_safe_exit(void)
-{
- this_cpu_dec(printk_context);
-}
-
-__printf(1, 0) int vprintk_func(const char *fmt, va_list args)
-{
- /*
- * Try to use the main logbuf even in NMI. But avoid calling console
- * drivers that might have their own locks.
- */
- if ((this_cpu_read(printk_context) & PRINTK_NMI_DIRECT_CONTEXT_MASK) &&
- raw_spin_trylock(&logbuf_lock)) {
- int len;
-
- len = vprintk_store(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
- raw_spin_unlock(&logbuf_lock);
- defer_console_output();
- return len;
- }
-
- /* Use extra buffer in NMI when logbuf_lock is taken or in safe mode. */
- if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK)
- return vprintk_nmi(fmt, args);
-
- /* Use extra buffer to prevent a recursion deadlock in safe mode. */
- if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK)
- return vprintk_safe(fmt, args);
-
- /* No obstacles. */
- return vprintk_default(fmt, args);
-}
-
-void __init printk_safe_init(void)
-{
- int cpu;
-
- for_each_possible_cpu(cpu) {
- struct printk_safe_seq_buf *s;
-
- s = &per_cpu(safe_print_seq, cpu);
- init_irq_work(&s->work, __printk_safe_flush);
-
-#ifdef CONFIG_PRINTK_NMI
- s = &per_cpu(nmi_print_seq, cpu);
- init_irq_work(&s->work, __printk_safe_flush);
-#endif
- }
-
- /*
- * In the highly unlikely event that a NMI were to trigger at
- * this moment. Make sure IRQ work is set up before this
- * variable is set.
- */
- barrier();
- printk_safe_irq_ready = 1;
-
- /* Flush pending messages that did not have scheduled IRQ works. */
- printk_safe_flush();
-}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c521b7347482..cfce391621c0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8363,7 +8363,6 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
tracing_off();

local_irq_save(flags);
- printk_nmi_direct_enter();

/* Simulate the iterator */
trace_init_global_iter(&iter);
@@ -8444,7 +8443,6 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
atomic_dec(&per_cpu_ptr(iter.trace_buffer->data, cpu)->disabled);
}
atomic_dec(&dump_running);
- printk_nmi_direct_exit();
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(ftrace_dump);
diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
index 15ca78e1c7d4..77bf84987cda 100644
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@ -75,12 +75,6 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
touch_softlockup_watchdog();
}

- /*
- * Force flush any remote buffers that might be stuck in IRQ context
- * and therefore could not run their irq_work.
- */
- printk_safe_flush();
-
clear_bit_unlock(0, &backtrace_flag);
put_cpu();
}
--
2.11.0


2019-02-12 14:50:55

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 18/25] console: add write_atomic interface

Add a write_atomic callback to the console. This is an optional
function for console drivers. The function must be able to write to
the console atomically, including from NMI context.

Console drivers must still implement the write callback. The
write_atomic callback will only be used for emergency messages.

Creating an NMI-safe write_atomic that synchronizes with write
requires a careful console driver implementation. To aid with this,
a set of console_atomic_* functions is provided:

void console_atomic_lock(unsigned int *flags);
void console_atomic_unlock(unsigned int flags);

These functions synchronize using the processor-reentrant cpu lock of
the printk buffer.
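
For example, a driver could route both callbacks through a helper that
brackets the hardware access with the cpu lock. This is only a sketch;
the foo_* names are placeholders, not part of this series:

  static void foo_console_write(struct console *con, const char *s,
                                unsigned int count)
  {
          unsigned int flags;

          console_atomic_lock(&flags);
          /* hardware access is serialized across CPUs and contexts */
          foo_uart_emit_chars(con, s, count);
          console_atomic_unlock(flags);
  }

In the simplest case the console's write and write_atomic members could
both point at such a helper.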

Signed-off-by: John Ogness <[email protected]>
---
include/linux/console.h | 4 ++++
kernel/printk/printk.c | 12 ++++++++++++
2 files changed, 16 insertions(+)

diff --git a/include/linux/console.h b/include/linux/console.h
index 633fb741e871..13482565c08c 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -145,6 +145,7 @@ static inline int con_debug_leave(void)
struct console {
char name[16];
void (*write)(struct console *, const char *, unsigned);
+ void (*write_atomic)(struct console *, const char *, unsigned);
int (*read)(struct console *, char *, unsigned);
struct tty_driver *(*device)(struct console *, int *);
void (*unblank)(void);
@@ -231,4 +232,7 @@ extern void console_init(void);
void dummycon_register_output_notifier(struct notifier_block *nb);
void dummycon_unregister_output_notifier(struct notifier_block *nb);

+extern void console_atomic_lock(unsigned int *flags);
+extern void console_atomic_unlock(unsigned int flags);
+
#endif /* _LINUX_CONSOLE_H */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index cde036d8487a..0ff7c3942464 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2984,3 +2984,15 @@ void kmsg_dump_rewind(struct kmsg_dumper *dumper)
}
EXPORT_SYMBOL_GPL(kmsg_dump_rewind);
#endif
+
+void console_atomic_lock(unsigned int *flags)
+{
+ prb_lock(&printk_cpulock, flags);
+}
+EXPORT_SYMBOL(console_atomic_lock);
+
+void console_atomic_unlock(unsigned int flags)
+{
+ prb_unlock(&printk_cpulock, flags);
+}
+EXPORT_SYMBOL(console_atomic_unlock);
--
2.11.0


2019-02-12 14:53:40

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 01/25] printk-rb: add printk ring buffer documentation

The full documentation file for the printk ring buffer.

Signed-off-by: John Ogness <[email protected]>
---
Documentation/printk-ringbuffer.txt | 377 ++++++++++++++++++++++++++++++++++++
1 file changed, 377 insertions(+)
create mode 100644 Documentation/printk-ringbuffer.txt

diff --git a/Documentation/printk-ringbuffer.txt b/Documentation/printk-ringbuffer.txt
new file mode 100644
index 000000000000..6bde5dbd8545
--- /dev/null
+++ b/Documentation/printk-ringbuffer.txt
@@ -0,0 +1,377 @@
+struct printk_ringbuffer
+------------------------
+John Ogness <[email protected]>
+
+Overview
+~~~~~~~~
+As the name suggests, this ring buffer was implemented specifically to serve
+the needs of the printk() infrastructure. The ring buffer itself is not
+specific to printk and could be used for other purposes. _However_, the
+requirements and semantics of printk are rather unique. If you intend to use
+this ring buffer for anything other than printk, you need to be very clear on
+its features, behavior, and pitfalls.
+
+Features
+^^^^^^^^
+The printk ring buffer has the following features:
+
+- single global buffer
+- resides in initialized data section (available at early boot)
+- lockless readers
+- supports multiple writers
+- supports multiple non-consuming readers
+- safe from any context (including NMI)
+- groups bytes into variable length blocks (referenced by entries)
+- entries tagged with sequence numbers
+
+Behavior
+^^^^^^^^
+Since the printk ring buffer readers are lockless, there exists no
+synchronization between readers and writers. Basically writers are the tasks
+in control and may overwrite any and all committed data at any time and from
+any context. For this reason readers can miss entries if they are overwritten
+before the reader is able to access the data. The reader API implementation
+is such that reader access to entries is atomic, so there is no risk of
+readers having to deal with partial or corrupt data. Also, entries are
+tagged with sequence numbers so readers can recognize if entries were missed.
+
+Writing to the ring buffer consists of 2 steps. First a writer must reserve
+an entry of desired size. After this step the writer has exclusive access
+to the memory region. Once the data has been written to memory, it needs to
+be committed to the ring buffer. After this step the entry has been inserted
+into the ring buffer and assigned an appropriate sequence number.
+
+Once committed, a writer must no longer access the data directly. This is
+because the data may have been overwritten and no longer exists. If a
+writer must access the data, it should either keep a private copy before
+committing the entry or use the reader API to gain access to the data.
+
+Because of how the data backend is implemented, entries that have been
+reserved but not yet committed act as barriers, preventing future writers
+from filling the ring buffer beyond the location of the reserved but not
+yet committed entry region. For this reason it is *important* that writers
+perform both reserve and commit as quickly as possible. Also, be aware that
+preemption and local interrupts are disabled and writing to the ring buffer
+is processor-reentrant locked during the reserve/commit window. Writers in
+NMI contexts can still preempt any other writers, but as long as these
+writers do not write a large amount of data with respect to the ring buffer
+size, this should not become an issue.
+
+API
+~~~
+
+Declaration
+^^^^^^^^^^^
+The printk ring buffer can be instantiated as a static structure:
+
+ /* declare a static struct printk_ringbuffer */
+ #define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr)
+
+The szbits value specifies the size of the ring buffer as a power of 2:
+the data buffer will be 2^szbits bytes. The cpulockptr argument is a
+pointer to a prb_cpulock struct that is used to perform processor-reentrant
+spin locking for the writers. It is specified externally because it may be
+shared by multiple ring buffers (or other code) to synchronize writers
+without risk of deadlock.
+
+Here is an example of a declaration of a printk ring buffer specifying a
+32KB (2^15) ring buffer:
+
+....
+DECLARE_STATIC_PRINTKRB_CPULOCK(rb_cpulock);
+DECLARE_STATIC_PRINTKRB(rb, 15, &rb_cpulock);
+....
+
+If writers will be using multiple ring buffers and the ordering of that
+usage is not clear, the same prb_cpulock should be used for all of them.
+
+Writer API
+^^^^^^^^^^
+The writer API consists of 2 functions. The first is to reserve an entry in
+the ring buffer, the second is to commit that data to the ring buffer. The
+reserved entry information is stored within a provided `struct prb_handle`.
+
+ /* reserve an entry */
+ char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
+ unsigned int size);
+
+ /* commit a reserved entry to the ring buffer */
+ void prb_commit(struct prb_handle *h);
+
+Here is an example of a function to write data to a ring buffer:
+
+....
+int write_data(struct printk_ringbuffer *rb, char *data, int size)
+{
+ struct prb_handle h;
+ char *buf;
+
+ buf = prb_reserve(&h, rb, size);
+ if (!buf)
+ return -1;
+ memcpy(buf, data, size);
+ prb_commit(&h);
+
+ return 0;
+}
+....
+
+Pitfalls
+++++++++
+Be aware that prb_reserve() can fail. A retry might be successful, but it
+depends entirely on whether or not the next part of the ring buffer to
+overwrite belongs to reserved but not yet committed entries of other writers.
+Writers can use the prb_inc_lost() function to allow readers to notice that a
+message was lost.
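+
+A minimal sketch of a writer that retries a few times before giving up
+(the retry bound and error value are arbitrary, for illustration only):
+
+....
+int write_data_retry(struct printk_ringbuffer *rb, char *data, int size)
+{
+	struct prb_handle h;
+	char *buf;
+	int try;
+
+	for (try = 0; try < 3; try++) {
+		buf = prb_reserve(&h, rb, size);
+		if (buf) {
+			memcpy(buf, data, size);
+			prb_commit(&h);
+			return 0;
+		}
+	}
+
+	/* give up: let readers see that a record was lost */
+	prb_inc_lost(rb);
+	return -1;
+}
+....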
+
+Reader API
+^^^^^^^^^^
+The reader API utilizes a `struct prb_iterator` to track the reader's
+position in the ring buffer.
+
+ /* declare a pre-initialized static iterator for a ring buffer */
+ #define DECLARE_STATIC_PRINTKRB_ITER(name, rbaddr)
+
+ /* initialize iterator for a ring buffer (if static macro NOT used) */
+ void prb_iter_init(struct prb_iterator *iter,
+ struct printk_ringbuffer *rb, u64 *seq);
+
+ /* make a deep copy of an iterator */
+ void prb_iter_copy(struct prb_iterator *dest,
+ struct prb_iterator *src);
+
+ /* non-blocking, advance to next entry (and read the data) */
+ int prb_iter_next(struct prb_iterator *iter, char *buf,
+ int size, u64 *seq);
+
+ /* blocking, advance to next entry (and read the data) */
+ int prb_iter_wait_next(struct prb_iterator *iter, char *buf,
+ int size, u64 *seq);
+
+ /* position iterator at the entry seq */
+ int prb_iter_seek(struct prb_iterator *iter, u64 seq);
+
+ /* read data at current position */
+ int prb_iter_data(struct prb_iterator *iter, char *buf,
+ int size, u64 *seq);
+
+Typically prb_iter_data() is not needed because the data can be retrieved
+directly with prb_iter_next().
+
+Here is an example of a non-blocking function that will read all the data in
+a ring buffer:
+
+....
+void read_all_data(struct printk_ringbuffer *rb, char *buf, int size)
+{
+ struct prb_iterator iter;
+ u64 prev_seq = 0;
+ u64 seq;
+ int ret;
+
+ prb_iter_init(&iter, rb, NULL);
+
+ for (;;) {
+ ret = prb_iter_next(&iter, buf, size, &seq);
+ if (ret > 0) {
+ if (seq != ++prev_seq) {
+ /* "seq - prev_seq" entries missed */
+ prev_seq = seq;
+ }
+ /* process buf here */
+ } else if (ret == 0) {
+ /* hit the end, done */
+ break;
+ } else if (ret < 0) {
+ /*
+ * iterator is invalid, a writer overtook us, reset the
+ * iterator and keep going, entries were missed
+ */
+ prb_iter_init(&iter, rb, NULL);
+ }
+ }
+}
+....
+
+Pitfalls
+++++++++
+The reader's iterator can become invalid at any time because the reader was
+overtaken by a writer. Typically the reader should reset the iterator back
+to the current oldest entry (which will be newer than the entry the reader
+was at) and continue, noting the number of entries that were missed.
+
+Utility API
+^^^^^^^^^^^
+Several functions are available as convenience for external code.
+
+ /* query the size of the data buffer */
+ int prb_buffer_size(struct printk_ringbuffer *rb);
+
+ /* skip a seq number to signify a lost record */
+ void prb_inc_lost(struct printk_ringbuffer *rb);
+
+ /* processor-reentrant spin lock */
+ void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);
+
+ /* processor-reentrant spin unlock */
+ void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);
+
+Pitfalls
+++++++++
+Although the value returned by prb_buffer_size() does represent an absolute
+upper bound, the amount of data that can be stored within the ring buffer
+is actually less because of the additional storage space of a header for each
+entry.
+
+The prb_lock() and prb_unlock() functions can be used to synchronize between
+ring buffer writers and other external activities. The function of a
+processor-reentrant spin lock is to disable preemption and local interrupts
+and synchronize against other processors. It does *not* protect against
+multiple contexts of a single processor, i.e. NMI.
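+
+A minimal sketch of synchronizing an external activity against the ring
+buffer writers (do_external_activity() is a placeholder, not part of the
+API):
+
+....
+unsigned int cpu_store;
+
+prb_lock(&rb_cpulock, &cpu_store);
+/* preemption and local irqs are off; other CPUs are locked out */
+do_external_activity();
+prb_unlock(&rb_cpulock, cpu_store);
+....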
+
+Implementation
+~~~~~~~~~~~~~~
+This section describes several of the implementation concepts and details to
+help developers better understand the code.
+
+Entries
+^^^^^^^
+All ring buffer data is stored within a single static byte array. The reason
+for this is to ensure that any pointers to the data (past and present) will
+always point to valid memory. This is important because the lockless readers
+may be preempted for long periods of time and when they resume may be working
+with expired pointers.
+
+Entries are identified by start index and size. (The start index plus size
+is the start index of the next entry.) The start index is not simply an
+offset into the byte array, but rather a logical position (lpos) that maps
+directly to byte array offsets.
+
+For example, for a byte array of 1000, an entry may have a start index
+of 100. Another entry may have a start index of 1100, and yet another
+of 2100. All of these entries map to the same memory region, but only
+the most recent entry is valid. The other entries point to valid memory,
+but represent entries that have been overwritten.
+
+Note that due to overflowing, the most recent entry is not necessarily the one
+with the highest lpos value. Indeed, the printk ring buffer initializes its
+data such that an overflow happens relatively quickly in order to validate the
+handling of this situation. The implementation assumes that an lpos (unsigned
+long) will never completely wrap while a reader is preempted. If this were to
+become an issue, the seq number (which never wraps) could be used to increase
+the robustness of handling this situation.
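+
+Conceptually, mapping an lpos to a byte array offset is a mask with the
+power-of-2 buffer size that the ring buffer actually uses (the 1000-byte
+array above was chosen only for readability). A sketch of the macros;
+the exact definitions live in the implementation:
+
+....
+#define PRB_SIZE(rb)        (1 << (rb)->size_bits)
+#define PRB_INDEX(rb, lpos) ((lpos) & (PRB_SIZE(rb) - 1))
+....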
+
+Buffer Wrapping
+^^^^^^^^^^^^^^^
+If an entry starts near the end of the byte array but would extend beyond it,
+a special terminating entry (size = -1) is inserted into the byte array and
+the real entry is placed at the beginning of the byte array. This can waste
+space at the end of the byte array, but simplifies the implementation by
+allowing writers to always work with contiguous buffers.
+
+Note that the size field is the first 4 bytes of the entry header. Also note
+that calc_next() always ensures that there are at least 4 bytes left at the
+end of the byte array to allow room for a terminating entry.
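+
+In sketch form (with entry_size standing in for the header-plus-data size;
+error handling and memory barriers omitted), the wrap decision a writer
+makes when reserving looks like:
+
+....
+if (PRB_INDEX(rb, lpos) + entry_size > PRB_SIZE(rb)) {
+	/* store a terminating entry (size = -1) at lpos ... */
+	/* ... then place the real entry at the next wrap boundary */
+	lpos = PRB_WRAP_LPOS(rb, lpos, 1);
+}
+....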
+
+Ring Buffer Pointers
+^^^^^^^^^^^^^^^^^^^^
+Three pointers (lpos values) are used to manage the ring buffer:
+
+ - _tail_: points to the oldest entry
+ - _head_: points to where the next new committed entry will be
+ - _reserve_: points to where the next new reserved entry will be
+
+These pointers always maintain a logical ordering:
+
+ tail <= head <= reserve
+
+The reserve pointer moves forward when a writer reserves a new entry. The
+head pointer moves forward when a writer commits a new entry.
+
+The reserve pointer cannot overwrite the tail pointer in a wrap situation. In
+such a situation, the tail pointer must be "pushed forward", thus
+invalidating that oldest entry. Readers identify if they are accessing a
+valid entry by ensuring their entry pointer is `>= tail && < head`.
+
+If the tail pointer is equal to the head pointer, it cannot be pushed and any
+reserve operation will fail. The only resolution is for writers to commit
+their reserved entries.
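+
+Expressed with unsigned arithmetic, the validity test is immune to lpos
+overflow. A sketch mirroring the implementation's is_valid() helper:
+
+....
+/* valid iff tail <= lpos < head, even across wrap-around */
+bool lpos_is_valid(struct printk_ringbuffer *rb, unsigned long lpos)
+{
+	unsigned long tail = atomic_long_read(&rb->tail);
+	unsigned long head = atomic_long_read(&rb->head);
+
+	return (lpos - tail) < (head - tail);
+}
+....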
+
+Processor-Reentrant Locking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The purpose of the processor-reentrant locking is to limit the interruption
+scenarios of writers to 2 contexts (a sketch of the locking idea follows
+the list below). This allows for a simplified implementation where:
+
+- The reserve/commit window only exists on 1 processor at a time. A reserve
+ can never fail due to uncommitted entries of other processors.
+
+- When committing entries, it is trivial to handle the situation when
+ subsequent entries have already been committed, i.e. managing the head
+ pointer.
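+
+The core idea can be sketched as follows, ignoring the irq disabling and
+the nesting bookkeeping that the real prb_lock()/prb_unlock() pair performs
+via cpu_store (the owner field and the -1 sentinel are illustrative):
+
+....
+for (;;) {
+	cpu = get_cpu();
+	prev = atomic_cmpxchg(&lock->owner, -1, cpu);
+	if (prev == -1 || prev == cpu)
+		break;	/* acquired, or re-entered on the owning CPU */
+	put_cpu();
+	cpu_relax();
+}
+....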
+
+Performance
+~~~~~~~~~~~
+Some basic tests were performed on a quad Intel(R) Xeon(R) CPU E5-2697 v4 at
+2.30GHz (36 cores / 72 threads). All tests involved writing a total of
+32,000,000 records at an average of 33 bytes each. Each writer was pinned to
+its own CPU and would write as fast as it could until a total of 32,000,000
+records were written. All tests involved 2 readers that were both pinned
+together to another CPU. Each reader would read as fast as it could and track
+how many of the 32,000,000 records it could read. All tests used a ring buffer
+of 16KB in size, which holds around 350 records (header + data for each
+entry).
+
+The only difference between the tests is the number of writers (and thus also
+the number of records per writer). As more writers are added, the time to
+write a record increases. This is because the data pointers (modified via
+cmpxchg) and global data access in general become more contended.
+
+1 writer
+^^^^^^^^
+ runtime: 0m 18s
+ reader1: 16219900/32000000 (50%) records
+ reader2: 16141582/32000000 (50%) records
+
+2 writers
+^^^^^^^^^
+ runtime: 0m 32s
+ reader1: 16327957/32000000 (51%) records
+ reader2: 16313988/32000000 (50%) records
+
+4 writers
+^^^^^^^^^
+ runtime: 0m 42s
+ reader1: 16421642/32000000 (51%) records
+ reader2: 16417224/32000000 (51%) records
+
+8 writers
+^^^^^^^^^
+ runtime: 0m 43s
+ reader1: 16418300/32000000 (51%) records
+ reader2: 16432222/32000000 (51%) records
+
+16 writers
+^^^^^^^^^^
+ runtime: 0m 54s
+ reader1: 16539189/32000000 (51%) records
+ reader2: 16542711/32000000 (51%) records
+
+32 writers
+^^^^^^^^^^
+ runtime: 1m 13s
+ reader1: 16731808/32000000 (52%) records
+ reader2: 16735119/32000000 (52%) records
+
+Comments
+^^^^^^^^
+It is particularly interesting to compare/contrast the 1-writer and 32-writer
+tests. Despite the writing of the 32,000,000 records taking over 4 times
+longer, the readers (which perform no cmpxchg) were still unable to keep up.
+This shows that the memory contention between the increasing number of CPUs
+also has a dramatic effect on readers.
+
+It should also be noted that in all cases each reader was able to read >=50%
+of the records. This means that a single reader would have been able to
+keep up with the writer(s) in all cases, and keeping up becomes slightly
+easier as more writers are added. This was the purpose of pinning 2 readers
+to 1 CPU: to observe how maximum reader performance changes.
--
2.11.0


2019-02-12 14:53:51

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 21/25] printk: implement KERN_CONT

Implement KERN_CONT based on the printing CPU rather than on the
printing task. As long as the KERN_CONT messages come from the same
CPU and are not interleaved with non-KERN_CONT messages, they are
assumed to belong together.
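
For example, the fragments below coalesce into one record as long as no
other CPU interleaves a message in between; the trailing newline flushes
the buffered line:

  pr_info("checking feature X:");
  pr_cont(" step1 ok");
  pr_cont(" step2 ok\n");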

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 73 +++++++++++++++++++++++++++++---------------------
1 file changed, 42 insertions(+), 31 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index eebe6f4fdbba..306e7575499c 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1649,8 +1649,6 @@ static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
}
}

-/* FIXME: no support for LOG_CONT */
-#if 0
/*
* Continuation lines are buffered, and not committed to the record buffer
* until the line is complete, or a race forces it. The line fragments
@@ -1660,52 +1658,57 @@ static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
static struct cont {
char buf[LOG_LINE_MAX];
size_t len; /* length == 0 means unused buffer */
- struct task_struct *owner; /* task of first print*/
+ int cpu_owner; /* cpu of first print*/
u64 ts_nsec; /* time of first print */
u8 level; /* log level of first message */
u8 facility; /* log facility of first message */
enum log_flags flags; /* prefix, newline flags */
-} cont;
+} cont[2];

-static void cont_flush(void)
+static void cont_flush(int ctx)
{
- if (cont.len == 0)
+ struct cont *c = &cont[ctx];
+
+ if (c->len == 0)
return;

- log_store(cont.facility, cont.level, cont.flags, cont.ts_nsec,
- NULL, 0, cont.buf, cont.len);
- cont.len = 0;
+ log_store(c->facility, c->level, c->flags, c->ts_nsec, c->cpu_owner,
+ NULL, 0, c->buf, c->len);
+ c->len = 0;
}

-static bool cont_add(int facility, int level, enum log_flags flags, const char *text, size_t len)
+static void cont_add(int ctx, int cpu, int facility, int level,
+ enum log_flags flags, const char *text, size_t len)
{
+ struct cont *c = &cont[ctx];
+
+ if (cpu != c->cpu_owner || !(flags & LOG_CONT))
+ cont_flush(ctx);
+
/* If the line gets too long, split it up in separate records. */
- if (cont.len + len > sizeof(cont.buf)) {
- cont_flush();
- return false;
- }
+ while (c->len + len > sizeof(c->buf))
+ cont_flush(ctx);

- if (!cont.len) {
- cont.facility = facility;
- cont.level = level;
- cont.owner = current;
- cont.ts_nsec = local_clock();
- cont.flags = flags;
+ if (!c->len) {
+ c->facility = facility;
+ c->level = level;
+ c->cpu_owner = cpu;
+ c->ts_nsec = local_clock();
+ c->flags = flags;
}

- memcpy(cont.buf + cont.len, text, len);
- cont.len += len;
+ memcpy(c->buf + c->len, text, len);
+ c->len += len;

- // The original flags come from the first line,
- // but later continuations can add a newline.
+ /*
+ * The original flags come from the first line,
+ * but later continuations can add a newline.
+ */
if (flags & LOG_NEWLINE) {
- cont.flags |= LOG_NEWLINE;
- cont_flush();
+ c->flags |= LOG_NEWLINE;
+ cont_flush(ctx);
}
-
- return true;
}
-#endif /* 0 */

/* ring buffer used as memory allocator for temporary sprint buffers */
DECLARE_STATIC_PRINTKRB(sprint_rb,
@@ -1717,6 +1720,7 @@ asmlinkage int vprintk_emit(int facility, int level,
const char *fmt, va_list args)
{
enum log_flags lflags = 0;
+ int ctx = !!in_nmi();
int printed_len = 0;
struct prb_handle h;
size_t text_len;
@@ -1784,8 +1788,15 @@ asmlinkage int vprintk_emit(int facility, int level,
*/
printk_emergency(rbuf, level, ts_nsec, cpu, text, text_len);

- printed_len = log_store(facility, level, lflags, ts_nsec, cpu,
- dict, dictlen, text, text_len);
+ if ((lflags & LOG_CONT) || !(lflags & LOG_NEWLINE)) {
+ cont_add(ctx, cpu, facility, level, lflags, text, text_len);
+ printed_len = text_len;
+ } else {
+ if (cpu == cont[ctx].cpu_owner)
+ cont_flush(ctx);
+ printed_len = log_store(facility, level, lflags, ts_nsec, cpu,
+ dict, dictlen, text, text_len);
+ }

prb_commit(&h);
return printed_len;
--
2.11.0


2019-02-12 14:55:01

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 16/25] printk: implement CON_PRINTBUFFER

If the CON_PRINTBUFFER flag is not set, do not replay the history
for that console.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 25 ++++++-------------------
1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 6c875abd7b17..b97d4195b09a 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -399,10 +399,6 @@ static u32 log_first_idx;
static u64 log_next_seq;
static u32 log_next_idx;

-/* the next printk record to write to the console */
-static u64 console_seq;
-static u32 console_idx;
-
/* the next printk record to read after the last 'clear' command */
static u64 clear_seq;
static u32 clear_idx;
@@ -1596,8 +1592,12 @@ static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
if (!(con->flags & CON_ENABLED))
continue;
if (!con->wrote_history) {
- printk_write_history(con, seq);
- continue;
+ if (con->flags & CON_PRINTBUFFER) {
+ printk_write_history(con, seq);
+ continue;
+ }
+ con->wrote_history = 1;
+ con->printk_seq = seq - 1;
}
if (!con->write)
continue;
@@ -1822,8 +1822,6 @@ EXPORT_SYMBOL(printk);

static u64 syslog_seq;
static u32 syslog_idx;
-static u64 console_seq;
-static u32 console_idx;
static u64 log_first_seq;
static u32 log_first_idx;
static u64 log_next_seq;
@@ -2224,7 +2222,6 @@ early_param("keep_bootcon", keep_bootcon_setup);
void register_console(struct console *newcon)
{
int i;
- unsigned long flags;
struct console *bcon = NULL;
struct console_cmdline *c;
static bool has_preferred;
@@ -2340,16 +2337,6 @@ void register_console(struct console *newcon)
if (newcon->flags & CON_EXTENDED)
nr_ext_console_drivers++;

- if (newcon->flags & CON_PRINTBUFFER) {
- /*
- * console_unlock(); will print out the buffered messages
- * for us.
- */
- logbuf_lock_irqsave(flags);
- console_seq = syslog_seq;
- console_idx = syslog_idx;
- logbuf_unlock_irqrestore(flags);
- }
console_unlock();
console_sysfs_notify();

--
2.11.0


2019-02-12 14:55:13

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 15/25] printk: print history for new consoles

When new consoles register, they currently print how many messages
they have missed. However, many (or all) of those messages may still
be in the ring buffer. Add functionality to print as much of the
history as is available. This is a clean replacement of the old
exclusive console hack.

Signed-off-by: John Ogness <[email protected]>
---
include/linux/console.h | 1 +
kernel/printk/printk.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 76 insertions(+)

diff --git a/include/linux/console.h b/include/linux/console.h
index 7fa06a058339..633fb741e871 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -154,6 +154,7 @@ struct console {
short index;
int cflag;
unsigned long printk_seq;
+ int wrote_history;
void *data;
struct console *next;
};
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 897219f34cab..6c875abd7b17 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1506,6 +1506,77 @@ static void format_text(struct printk_log *msg, u64 seq,
}
}

+static void printk_write_history(struct console *con, u64 master_seq)
+{
+ struct prb_iterator iter;
+ bool time = printk_time;
+ static char *ext_text;
+ static char *text;
+ static char *buf;
+ u64 seq;
+
+ ext_text = kmalloc(CONSOLE_EXT_LOG_MAX, GFP_KERNEL);
+ text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL);
+ buf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
+ if (!ext_text || !text || !buf)
+ return;
+
+ if (!(con->flags & CON_ENABLED))
+ goto out;
+
+ if (!con->write)
+ goto out;
+
+ if (!cpu_online(raw_smp_processor_id()) &&
+ !(con->flags & CON_ANYTIME))
+ goto out;
+
+ prb_iter_init(&iter, &printk_rb, NULL);
+
+ for (;;) {
+ struct printk_log *msg;
+ size_t ext_len;
+ size_t len;
+ int ret;
+
+ ret = prb_iter_next(&iter, buf, PRINTK_RECORD_MAX, &seq);
+ if (ret == 0) {
+ break;
+ } else if (ret < 0) {
+ prb_iter_init(&iter, &printk_rb, NULL);
+ continue;
+ }
+
+ if (seq > master_seq)
+ break;
+
+ con->printk_seq++;
+ if (con->printk_seq < seq) {
+ print_console_dropped(con, seq - con->printk_seq);
+ con->printk_seq = seq;
+ }
+
+ msg = (struct printk_log *)buf;
+ format_text(msg, master_seq, ext_text, &ext_len, text,
+ &len, time);
+
+ if (len == 0 && ext_len == 0)
+ continue;
+
+ if (con->flags & CON_EXTENDED)
+ con->write(con, ext_text, ext_len);
+ else
+ con->write(con, text, len);
+
+ printk_delay(msg->level);
+ }
+out:
+ con->wrote_history = 1;
+ kfree(ext_text);
+ kfree(text);
+ kfree(buf);
+}
+
/*
* Call the console drivers, asking them to write out
* log_buf[start] to log_buf[end - 1].
@@ -1524,6 +1595,10 @@ static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
for_each_console(con) {
if (!(con->flags & CON_ENABLED))
continue;
+ if (!con->wrote_history) {
+ printk_write_history(con, seq);
+ continue;
+ }
if (!con->write)
continue;
if (!cpu_online(raw_smp_processor_id()) &&
--
2.11.0


2019-02-12 14:55:24

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 05/25] printk-rb: add basic non-blocking reading interface

Add reader iterator static declaration/initializer, dynamic
initializer, and functions to iterate and retrieve ring buffer data.

Signed-off-by: John Ogness <[email protected]>
---
include/linux/printk_ringbuffer.h | 20 ++++
lib/printk_ringbuffer.c | 190 ++++++++++++++++++++++++++++++++++++++
2 files changed, 210 insertions(+)

diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
index 1aec9d5666b1..5fdaf632c111 100644
--- a/include/linux/printk_ringbuffer.h
+++ b/include/linux/printk_ringbuffer.h
@@ -43,6 +43,19 @@ static struct prb_cpulock name = { \
.irqflags = &_##name##_percpu_irqflags, \
}

+#define PRB_INIT ((unsigned long)-1)
+
+#define DECLARE_STATIC_PRINTKRB_ITER(name, rbaddr) \
+static struct prb_iterator name = { \
+ .rb = rbaddr, \
+ .lpos = PRB_INIT, \
+}
+
+struct prb_iterator {
+ struct printk_ringbuffer *rb;
+ unsigned long lpos;
+};
+
#define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr) \
static char _##name##_buffer[1 << (szbits)] \
__aligned(__alignof__(long)); \
@@ -62,6 +75,13 @@ char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
unsigned int size);
void prb_commit(struct prb_handle *h);

+/* reader interface */
+void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb,
+ u64 *seq);
+void prb_iter_copy(struct prb_iterator *dest, struct prb_iterator *src);
+int prb_iter_next(struct prb_iterator *iter, char *buf, int size, u64 *seq);
+int prb_iter_data(struct prb_iterator *iter, char *buf, int size, u64 *seq);
+
/* utility functions */
void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);
void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);
diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
index 90c7f9a9f861..1d1e886a0966 100644
--- a/lib/printk_ringbuffer.c
+++ b/lib/printk_ringbuffer.c
@@ -1,5 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/smp.h>
+#include <linux/string.h>
+#include <linux/errno.h>
#include <linux/printk_ringbuffer.h>

#define PRB_SIZE(rb) (1 << rb->size_bits)
@@ -8,6 +10,7 @@
#define PRB_WRAPS(rb, lpos) (lpos >> rb->size_bits)
#define PRB_WRAP_LPOS(rb, lpos, xtra) \
((PRB_WRAPS(rb, lpos) + xtra) << rb->size_bits)
+#define PRB_DATA_SIZE(e) (e->size - sizeof(struct prb_entry))
#define PRB_DATA_ALIGN sizeof(long)

static bool __prb_trylock(struct prb_cpulock *cpu_lock,
@@ -247,3 +250,190 @@ char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,

return &h->entry->data[0];
}
+
+/*
+ * prb_iter_copy: Copy an iterator.
+ * @dest: The iterator to copy to.
+ * @src: The iterator to copy from.
+ *
+ * Make a deep copy of an iterator. This is particularly useful for making
+ * backup copies of an iterator in case a form of rewinding is needed.
+ *
+ * It is safe to call this function from any context and state. But
+ * note that this function is not atomic. Callers should not make copies
+ * to/from iterators that can be accessed by other tasks/contexts.
+ */
+void prb_iter_copy(struct prb_iterator *dest, struct prb_iterator *src)
+{
+ memcpy(dest, src, sizeof(*dest));
+}
+
+/*
+ * prb_iter_init: Initialize an iterator for a ring buffer.
+ * @iter: The iterator to initialize.
+ * @rb: The ring buffer over which @iter should iterate.
+ * @seq: The sequence number of the position preceding the first record.
+ * May be NULL.
+ *
+ * Initialize an iterator to be used with a specified ring buffer. If @seq
+ * is non-NULL, it will be set such that prb_iter_next() will provide a
+ * sequence value of "@seq + 1" if no records were missed.
+ *
+ * It is safe to call this function from any context and state.
+ */
+void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb,
+ u64 *seq)
+{
+ memset(iter, 0, sizeof(*iter));
+ iter->rb = rb;
+ iter->lpos = PRB_INIT;
+
+ if (!seq)
+ return;
+
+ for (;;) {
+ struct prb_iterator tmp_iter;
+ int ret;
+
+ prb_iter_copy(&tmp_iter, iter);
+
+ ret = prb_iter_next(&tmp_iter, NULL, 0, seq);
+ if (ret < 0)
+ continue;
+
+ if (ret == 0)
+ *seq = 0;
+ else
+ (*seq)--;
+ break;
+ }
+}
+
+static bool is_valid(struct printk_ringbuffer *rb, unsigned long lpos)
+{
+ unsigned long head, tail;
+
+ tail = atomic_long_read(&rb->tail);
+ head = atomic_long_read(&rb->head);
+ head -= tail;
+ lpos -= tail;
+
+ if (lpos >= head)
+ return false;
+ return true;
+}
+
+/*
+ * prb_iter_data: Retrieve the record data at the current position.
+ * @iter: Iterator tracking the current position.
+ * @buf: A buffer to store the data of the record. May be NULL.
+ * @size: The size of @buf. (Ignored if @buf is NULL.)
+ * @seq: The sequence number of the record. May be NULL.
+ *
+ * If @iter is at a record, provide the data and/or sequence number of that
+ * record (if specified by the caller).
+ *
+ * It is safe to call this function from any context and state.
+ *
+ * Returns >=0 if the current record contains valid data (returns 0 if @buf
+ * is NULL or returns the size of the data block if @buf is non-NULL) or
+ * -EINVAL if @iter is now invalid.
+ */
+int prb_iter_data(struct prb_iterator *iter, char *buf, int size, u64 *seq)
+{
+ struct printk_ringbuffer *rb = iter->rb;
+ unsigned long lpos = iter->lpos;
+ unsigned int datsize = 0;
+ struct prb_entry *e;
+
+ if (buf || seq) {
+ e = to_entry(rb, lpos);
+ if (!is_valid(rb, lpos))
+ return -EINVAL;
+ /* memory barrier to ensure valid lpos */
+ smp_rmb();
+ if (buf) {
+ datsize = PRB_DATA_SIZE(e);
+ /* memory barrier to ensure load of datsize */
+ smp_rmb();
+ if (!is_valid(rb, lpos))
+ return -EINVAL;
+ if (PRB_INDEX(rb, lpos) + datsize >
+ PRB_SIZE(rb) - PRB_DATA_ALIGN) {
+ return -EINVAL;
+ }
+ if (size > datsize)
+ size = datsize;
+ memcpy(buf, &e->data[0], size);
+ }
+ if (seq)
+ *seq = e->seq;
+ /* memory barrier to ensure loads of entry data */
+ smp_rmb();
+ }
+
+ if (!is_valid(rb, lpos))
+ return -EINVAL;
+
+ return datsize;
+}
+
+/*
+ * prb_iter_next: Advance to the next record.
+ * @iter: Iterator tracking the current position.
+ * @buf: A buffer to store the data of the next record. May be NULL.
+ * @size: The size of @buf. (Ignored if @buf is NULL.)
+ * @seq: The sequence number of the next record. May be NULL.
+ *
+ * If a next record is available, @iter is advanced and (if specified)
+ * the data and/or sequence number of that record are provided.
+ *
+ * It is safe to call this function from any context and state.
+ *
+ * Returns 1 if @iter was advanced, 0 if @iter is at the end of the list, or
+ * -EINVAL if @iter is now invalid.
+ */
+int prb_iter_next(struct prb_iterator *iter, char *buf, int size, u64 *seq)
+{
+ struct printk_ringbuffer *rb = iter->rb;
+ unsigned long next_lpos;
+ struct prb_entry *e;
+ unsigned int esize;
+
+ if (iter->lpos == PRB_INIT) {
+ next_lpos = atomic_long_read(&rb->tail);
+ } else {
+ if (!is_valid(rb, iter->lpos))
+ return -EINVAL;
+ /* memory barrier to ensure valid lpos */
+ smp_rmb();
+ e = to_entry(rb, iter->lpos);
+ esize = e->size;
+ /* memory barrier to ensure load of size */
+ smp_rmb();
+ if (!is_valid(rb, iter->lpos))
+ return -EINVAL;
+ next_lpos = iter->lpos + esize;
+ }
+ if (next_lpos == atomic_long_read(&rb->head))
+ return 0;
+ if (!is_valid(rb, next_lpos))
+ return -EINVAL;
+ /* memory barrier to ensure valid lpos */
+ smp_rmb();
+
+ iter->lpos = next_lpos;
+ e = to_entry(rb, iter->lpos);
+ esize = e->size;
+ /* memory barrier to ensure load of size */
+ smp_rmb();
+ if (!is_valid(rb, iter->lpos))
+ return -EINVAL;
+ if (esize == -1)
+ iter->lpos = PRB_WRAP_LPOS(rb, iter->lpos, 1);
+
+ if (prb_iter_data(iter, buf, size, seq) < 0)
+ return -EINVAL;
+
+ return 1;
+}
--
2.11.0


2019-02-12 14:55:27

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 10/25] printk: redirect emit/store to new ringbuffer

vprintk_emit and vprintk_store are the main functions that all printk
variants eventually go through. Change these to store the message in
the new printk ring buffer that the printk kthread is reading.

Remove functions no longer in use because of the changes to
vprintk_emit and vprintk_store.

In order to handle interrupts and NMIs, a second ring buffer
(sprint_rb) is added. This ring buffer is used as an NMI-safe memory
allocator for formatting the printk messages.

NOTE: LOG_CONT is ignored for now and handled as individual messages.
LOG_CONT functions are masked behind "#if 0" blocks until their
functionality can be restored.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 319 ++++++++-----------------------------------------
1 file changed, 51 insertions(+), 268 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 5a5a685bb128..b6a6f1002741 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -493,90 +493,6 @@ static u32 log_next(u32 idx)
return idx + msg->len;
}

-/*
- * Check whether there is enough free space for the given message.
- *
- * The same values of first_idx and next_idx mean that the buffer
- * is either empty or full.
- *
- * If the buffer is empty, we must respect the position of the indexes.
- * They cannot be reset to the beginning of the buffer.
- */
-static int logbuf_has_space(u32 msg_size, bool empty)
-{
- u32 free;
-
- if (log_next_idx > log_first_idx || empty)
- free = max(log_buf_len - log_next_idx, log_first_idx);
- else
- free = log_first_idx - log_next_idx;
-
- /*
- * We need space also for an empty header that signalizes wrapping
- * of the buffer.
- */
- return free >= msg_size + sizeof(struct printk_log);
-}
-
-static int log_make_free_space(u32 msg_size)
-{
- while (log_first_seq < log_next_seq &&
- !logbuf_has_space(msg_size, false)) {
- /* drop old messages until we have enough contiguous space */
- log_first_idx = log_next(log_first_idx);
- log_first_seq++;
- }
-
- if (clear_seq < log_first_seq) {
- clear_seq = log_first_seq;
- clear_idx = log_first_idx;
- }
-
- /* sequence numbers are equal, so the log buffer is empty */
- if (logbuf_has_space(msg_size, log_first_seq == log_next_seq))
- return 0;
-
- return -ENOMEM;
-}
-
-/* compute the message size including the padding bytes */
-static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
-{
- u32 size;
-
- size = sizeof(struct printk_log) + text_len + dict_len;
- *pad_len = (-size) & (LOG_ALIGN - 1);
- size += *pad_len;
-
- return size;
-}
-
-/*
- * Define how much of the log buffer we could take at maximum. The value
- * must be greater than two. Note that only half of the buffer is available
- * when the index points to the middle.
- */
-#define MAX_LOG_TAKE_PART 4
-static const char trunc_msg[] = "<truncated>";
-
-static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len,
- u16 *dict_len, u32 *pad_len)
-{
- /*
- * The message should not take the whole buffer. Otherwise, it might
- * get removed too soon.
- */
- u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART;
- if (*text_len > max_text_len)
- *text_len = max_text_len;
- /* enable the warning message */
- *trunc_msg_len = strlen(trunc_msg);
- /* disable the "dict" completely */
- *dict_len = 0;
- /* compute the size again, count also the warning message */
- return msg_used_size(*text_len + *trunc_msg_len, 0, pad_len);
-}
-
/* insert record into the buffer, discard old ones, update heads */
static int log_store(int facility, int level,
enum log_flags flags, u64 ts_nsec,
@@ -584,54 +500,36 @@ static int log_store(int facility, int level,
const char *text, u16 text_len)
{
struct printk_log *msg;
- u32 size, pad_len;
- u16 trunc_msg_len = 0;
-
- /* number of '\0' padding bytes to next message */
- size = msg_used_size(text_len, dict_len, &pad_len);
-
- if (log_make_free_space(size)) {
- /* truncate the message if it is too long for empty buffer */
- size = truncate_msg(&text_len, &trunc_msg_len,
- &dict_len, &pad_len);
- /* survive when the log buffer is too small for trunc_msg */
- if (log_make_free_space(size))
- return 0;
- }
+ struct prb_handle h;
+ char *rbuf;
+ u32 size;

- if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
+ size = sizeof(*msg) + text_len + dict_len;
+
+ rbuf = prb_reserve(&h, &printk_rb, size);
+ if (!rbuf) {
/*
- * This message + an additional empty header does not fit
- * at the end of the buffer. Add an empty header with len == 0
- * to signify a wrap around.
+ * An emergency message would have been printed, but
+ * it cannot be stored in the log.
*/
- memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
- log_next_idx = 0;
+ prb_inc_lost(&printk_rb);
+ return 0;
}

/* fill message */
- msg = (struct printk_log *)(log_buf + log_next_idx);
+ msg = (struct printk_log *)rbuf;
memcpy(log_text(msg), text, text_len);
msg->text_len = text_len;
- if (trunc_msg_len) {
- memcpy(log_text(msg) + text_len, trunc_msg, trunc_msg_len);
- msg->text_len += trunc_msg_len;
- }
memcpy(log_dict(msg), dict, dict_len);
msg->dict_len = dict_len;
msg->facility = facility;
msg->level = level & 7;
msg->flags = flags & 0x1f;
- if (ts_nsec > 0)
- msg->ts_nsec = ts_nsec;
- else
- msg->ts_nsec = local_clock();
- memset(log_dict(msg) + dict_len, 0, pad_len);
+ msg->ts_nsec = ts_nsec;
msg->len = size;

/* insert message */
- log_next_idx += msg->len;
- log_next_seq++;
+ prb_commit(&h);

return msg->text_len;
}
@@ -1675,70 +1573,6 @@ static int console_lock_spinning_disable_and_check(void)
return 1;
}

-/**
- * console_trylock_spinning - try to get console_lock by busy waiting
- *
- * This allows to busy wait for the console_lock when the current
- * owner is running in specially marked sections. It means that
- * the current owner is running and cannot reschedule until it
- * is ready to lose the lock.
- *
- * Return: 1 if we got the lock, 0 othrewise
- */
-static int console_trylock_spinning(void)
-{
- struct task_struct *owner = NULL;
- bool waiter;
- bool spin = false;
- unsigned long flags;
-
- if (console_trylock())
- return 1;
-
- printk_safe_enter_irqsave(flags);
-
- raw_spin_lock(&console_owner_lock);
- owner = READ_ONCE(console_owner);
- waiter = READ_ONCE(console_waiter);
- if (!waiter && owner && owner != current) {
- WRITE_ONCE(console_waiter, true);
- spin = true;
- }
- raw_spin_unlock(&console_owner_lock);
-
- /*
- * If there is an active printk() writing to the
- * consoles, instead of having it write our data too,
- * see if we can offload that load from the active
- * printer, and do some printing ourselves.
- * Go into a spin only if there isn't already a waiter
- * spinning, and there is an active printer, and
- * that active printer isn't us (recursive printk?).
- */
- if (!spin) {
- printk_safe_exit_irqrestore(flags);
- return 0;
- }
-
- /* We spin waiting for the owner to release us */
- spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
- /* Owner will clear console_waiter on hand off */
- while (READ_ONCE(console_waiter))
- cpu_relax();
- spin_release(&console_owner_dep_map, 1, _THIS_IP_);
-
- printk_safe_exit_irqrestore(flags);
- /*
- * The owner passed the console lock to us.
- * Since we did not spin on console lock, annotate
- * this as a trylock. Otherwise lockdep will
- * complain.
- */
- mutex_acquire(&console_lock_dep_map, 0, 1, _THIS_IP_);
-
- return 1;
-}
-
/*
* Call the console drivers, asking them to write out
* log_buf[start] to log_buf[end - 1].
@@ -1759,7 +1593,7 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
continue;
if (!con->write)
continue;
- if (!cpu_online(smp_processor_id()) &&
+ if (!cpu_online(raw_smp_processor_id()) &&
!(con->flags & CON_ANYTIME))
continue;
if (con->flags & CON_EXTENDED)
@@ -1783,6 +1617,8 @@ static inline void printk_delay(void)
}
}

+/* FIXME: no support for LOG_CONT */
+#if 0
/*
* Continuation lines are buffered, and not committed to the record buffer
* until the line is complete, or a race forces it. The line fragments
@@ -1837,53 +1673,44 @@ static bool cont_add(int facility, int level, enum log_flags flags, const char *

return true;
}
+#endif /* 0 */

-static size_t log_output(int facility, int level, enum log_flags lflags, const char *dict, size_t dictlen, char *text, size_t text_len)
-{
- /*
- * If an earlier line was buffered, and we're a continuation
- * write from the same process, try to add it to the buffer.
- */
- if (cont.len) {
- if (cont.owner == current && (lflags & LOG_CONT)) {
- if (cont_add(facility, level, lflags, text, text_len))
- return text_len;
- }
- /* Otherwise, make sure it's flushed */
- cont_flush();
- }
-
- /* Skip empty continuation lines that couldn't be added - they just flush */
- if (!text_len && (lflags & LOG_CONT))
- return 0;
-
- /* If it doesn't end in a newline, try to buffer the current line */
- if (!(lflags & LOG_NEWLINE)) {
- if (cont_add(facility, level, lflags, text, text_len))
- return text_len;
- }
-
- /* Store it in the record log */
- return log_store(facility, level, lflags, 0, dict, dictlen, text, text_len);
-}
-
-/* Must be called under logbuf_lock. */
int vprintk_store(int facility, int level,
const char *dict, size_t dictlen,
const char *fmt, va_list args)
{
- static char textbuf[LOG_LINE_MAX];
- char *text = textbuf;
- size_t text_len;
+ return vprintk_emit(facility, level, dict, dictlen, fmt, args);
+}
+
+/* ring buffer used as memory allocator for temporary sprint buffers */
+DECLARE_STATIC_PRINTKRB(sprint_rb,
+ ilog2(PRINTK_RECORD_MAX + sizeof(struct prb_entry) +
+ sizeof(long)) + 2, &printk_cpulock);
+
+asmlinkage int vprintk_emit(int facility, int level,
+ const char *dict, size_t dictlen,
+ const char *fmt, va_list args)
+{
enum log_flags lflags = 0;
+ int printed_len = 0;
+ struct prb_handle h;
+ size_t text_len;
+ u64 ts_nsec;
+ char *text;
+ char *rbuf;

- /*
- * The printf needs to come first; we need the syslog
- * prefix which might be passed-in as a parameter.
- */
- text_len = vscnprintf(text, sizeof(textbuf), fmt, args);
+ ts_nsec = local_clock();

- /* mark and strip a trailing newline */
+ rbuf = prb_reserve(&h, &sprint_rb, PRINTK_SPRINT_MAX);
+ if (!rbuf) {
+ prb_inc_lost(&printk_rb);
+ return printed_len;
+ }
+
+ text = rbuf;
+ text_len = vscnprintf(text, PRINTK_SPRINT_MAX, fmt, args);
+
+ /* strip and flag a trailing newline */
if (text_len && text[text_len-1] == '\n') {
text_len--;
lflags |= LOG_NEWLINE;
@@ -1917,54 +1744,10 @@ int vprintk_store(int facility, int level,
if (dict)
lflags |= LOG_PREFIX|LOG_NEWLINE;

- return log_output(facility, level, lflags,
- dict, dictlen, text, text_len);
-}
-
-asmlinkage int vprintk_emit(int facility, int level,
- const char *dict, size_t dictlen,
- const char *fmt, va_list args)
-{
- int printed_len;
- bool in_sched = false, pending_output;
- unsigned long flags;
- u64 curr_log_seq;
-
- if (level == LOGLEVEL_SCHED) {
- level = LOGLEVEL_DEFAULT;
- in_sched = true;
- }
-
- boot_delay_msec(level);
- printk_delay();
-
- /* This stops the holder of console_sem just where we want him */
- logbuf_lock_irqsave(flags);
- curr_log_seq = log_next_seq;
- printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
- pending_output = (curr_log_seq != log_next_seq);
- logbuf_unlock_irqrestore(flags);
-
- /* If called from the scheduler, we can not call up(). */
- if (!in_sched && pending_output) {
- /*
- * Disable preemption to avoid being preempted while holding
- * console_sem which would prevent anyone from printing to
- * console
- */
- preempt_disable();
- /*
- * Try to acquire and then immediately release the console
- * semaphore. The release will print out buffers and wake up
- * /dev/kmsg and syslog() users.
- */
- if (console_trylock_spinning())
- console_unlock();
- preempt_enable();
- }
+ printed_len = log_store(facility, level, lflags, ts_nsec,
+ dict, dictlen, text, text_len);

- if (pending_output)
- wake_up_klogd();
+ prb_commit(&h);
return printed_len;
}
EXPORT_SYMBOL(vprintk_emit);
@@ -2429,7 +2212,7 @@ void console_unlock(void)
console_lock_spinning_enable();

stop_critical_timings(); /* don't trace print latency */
- call_console_drivers(ext_text, ext_len, text, len);
+ //call_console_drivers(ext_text, ext_len, text, len);
start_critical_timings();

if (console_lock_spinning_disable_and_check()) {
--
2.11.0


2019-02-12 14:55:35

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 12/25] printk: minimize console locking implementation

Since printing of the printk buffer is now handled by the printk
kthread, minimize the console locking functions to just handle
locking of the console.

NOTE: With this change, console_flush_on_panic() will no longer flush.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 255 +------------------------------------------------
1 file changed, 1 insertion(+), 254 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 073ff9fd6872..ece54c24ea0d 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -209,19 +209,7 @@ static int nr_ext_console_drivers;

static int __down_trylock_console_sem(unsigned long ip)
{
- int lock_failed;
- unsigned long flags;
-
- /*
- * Here and in __up_console_sem() we need to be in safe mode,
- * because spindump/WARN/etc from under console ->lock will
- * deadlock in printk()->down_trylock_console_sem() otherwise.
- */
- printk_safe_enter_irqsave(flags);
- lock_failed = down_trylock(&console_sem);
- printk_safe_exit_irqrestore(flags);
-
- if (lock_failed)
+ if (down_trylock(&console_sem))
return 1;
mutex_acquire(&console_lock_dep_map, 0, 1, ip);
return 0;
@@ -230,13 +218,9 @@ static int __down_trylock_console_sem(unsigned long ip)

static void __up_console_sem(unsigned long ip)
{
- unsigned long flags;
-
mutex_release(&console_lock_dep_map, 1, ip);

- printk_safe_enter_irqsave(flags);
up(&console_sem);
- printk_safe_exit_irqrestore(flags);
}
#define up_console_sem() __up_console_sem(_RET_IP_)

@@ -1498,82 +1482,6 @@ static void format_text(struct printk_log *msg, u64 seq,
}

/*
- * Special console_lock variants that help to reduce the risk of soft-lockups.
- * They allow to pass console_lock to another printk() call using a busy wait.
- */
-
-#ifdef CONFIG_LOCKDEP
-static struct lockdep_map console_owner_dep_map = {
- .name = "console_owner"
-};
-#endif
-
-static DEFINE_RAW_SPINLOCK(console_owner_lock);
-static struct task_struct *console_owner;
-static bool console_waiter;
-
-/**
- * console_lock_spinning_enable - mark beginning of code where another
- * thread might safely busy wait
- *
- * This basically converts console_lock into a spinlock. This marks
- * the section where the console_lock owner can not sleep, because
- * there may be a waiter spinning (like a spinlock). Also it must be
- * ready to hand over the lock at the end of the section.
- */
-static void console_lock_spinning_enable(void)
-{
- raw_spin_lock(&console_owner_lock);
- console_owner = current;
- raw_spin_unlock(&console_owner_lock);
-
- /* The waiter may spin on us after setting console_owner */
- spin_acquire(&console_owner_dep_map, 0, 0, _THIS_IP_);
-}
-
-/**
- * console_lock_spinning_disable_and_check - mark end of code where another
- * thread was able to busy wait and check if there is a waiter
- *
- * This is called at the end of the section where spinning is allowed.
- * It has two functions. First, it is a signal that it is no longer
- * safe to start busy waiting for the lock. Second, it checks if
- * there is a busy waiter and passes the lock rights to her.
- *
- * Important: Callers lose the lock if there was a busy waiter.
- * They must not touch items synchronized by console_lock
- * in this case.
- *
- * Return: 1 if the lock rights were passed, 0 otherwise.
- */
-static int console_lock_spinning_disable_and_check(void)
-{
- int waiter;
-
- raw_spin_lock(&console_owner_lock);
- waiter = READ_ONCE(console_waiter);
- console_owner = NULL;
- raw_spin_unlock(&console_owner_lock);
-
- if (!waiter) {
- spin_release(&console_owner_dep_map, 1, _THIS_IP_);
- return 0;
- }
-
- /* The waiter is now free to continue */
- WRITE_ONCE(console_waiter, false);
-
- spin_release(&console_owner_dep_map, 1, _THIS_IP_);
-
- /*
- * Hand off console_lock to waiter. The waiter will perform
- * the up(). After this, the waiter is the console_lock owner.
- */
- mutex_release(&console_lock_dep_map, 1, _THIS_IP_);
- return 1;
-}
-
-/*
* Call the console drivers, asking them to write out
* log_buf[start] to log_buf[end - 1].
* The console_lock must be held.
@@ -1830,8 +1738,6 @@ static ssize_t msg_print_ext_header(char *buf, size_t size,
static ssize_t msg_print_ext_body(char *buf, size_t size,
char *dict, size_t dict_len,
char *text, size_t text_len) { return 0; }
-static void console_lock_spinning_enable(void) { }
-static int console_lock_spinning_disable_and_check(void) { return 0; }
static void call_console_drivers(const char *ext_text, size_t ext_len,
const char *text, size_t len) {}
static size_t msg_print_text(const struct printk_log *msg, bool syslog,
@@ -2066,35 +1972,6 @@ int is_console_locked(void)
{
return console_locked;
}
-EXPORT_SYMBOL(is_console_locked);
-
-/*
- * Check if we have any console that is capable of printing while cpu is
- * booting or shutting down. Requires console_sem.
- */
-static int have_callable_console(void)
-{
- struct console *con;
-
- for_each_console(con)
- if ((con->flags & CON_ENABLED) &&
- (con->flags & CON_ANYTIME))
- return 1;
-
- return 0;
-}
-
-/*
- * Can we actually use the console at this time on this cpu?
- *
- * Console drivers may assume that per-cpu resources have been allocated. So
- * unless they're explicitly marked as being able to cope (CON_ANYTIME) don't
- * call them until this CPU is officially up.
- */
-static inline int can_use_console(void)
-{
- return cpu_online(raw_smp_processor_id()) || have_callable_console();
-}

/**
* console_unlock - unlock the console system
@@ -2102,147 +1979,17 @@ static inline int can_use_console(void)
* Releases the console_lock which the caller holds on the console system
* and the console driver list.
*
- * While the console_lock was held, console output may have been buffered
- * by printk(). If this is the case, console_unlock(); emits
- * the output prior to releasing the lock.
- *
- * If there is output waiting, we wake /dev/kmsg and syslog() users.
- *
* console_unlock(); may be called from any context.
*/
void console_unlock(void)
{
- static char ext_text[CONSOLE_EXT_LOG_MAX];
- static char text[LOG_LINE_MAX + PREFIX_MAX];
- unsigned long flags;
- bool do_cond_resched, retry;
-
if (console_suspended) {
up_console_sem();
return;
}

- /*
- * Console drivers are called with interrupts disabled, so
- * @console_may_schedule should be cleared before; however, we may
- * end up dumping a lot of lines, for example, if called from
- * console registration path, and should invoke cond_resched()
- * between lines if allowable. Not doing so can cause a very long
- * scheduling stall on a slow console leading to RCU stall and
- * softlockup warnings which exacerbate the issue with more
- * messages practically incapacitating the system.
- *
- * console_trylock() is not able to detect the preemptive
- * context reliably. Therefore the value must be stored before
- * and cleared after the the "again" goto label.
- */
- do_cond_resched = console_may_schedule;
-again:
- console_may_schedule = 0;
-
- /*
- * We released the console_sem lock, so we need to recheck if
- * cpu is online and (if not) is there at least one CON_ANYTIME
- * console.
- */
- if (!can_use_console()) {
- console_locked = 0;
- up_console_sem();
- return;
- }
-
- for (;;) {
- struct printk_log *msg;
- size_t ext_len = 0;
- size_t len;
-
- printk_safe_enter_irqsave(flags);
- raw_spin_lock(&logbuf_lock);
- if (console_seq < log_first_seq) {
- len = sprintf(text,
- "** %llu printk messages dropped **\n",
- log_first_seq - console_seq);
-
- /* messages are gone, move to first one */
- console_seq = log_first_seq;
- console_idx = log_first_idx;
- } else {
- len = 0;
- }
-skip:
- if (console_seq == log_next_seq)
- break;
-
- msg = log_from_idx(console_idx);
- if (suppress_message_printing(msg->level)) {
- /*
- * Skip record we have buffered and already printed
- * directly to the console when we received it, and
- * record that has level above the console loglevel.
- */
- console_idx = log_next(console_idx);
- console_seq++;
- goto skip;
- }
-
- len += msg_print_text(msg,
- console_msg_format & MSG_FORMAT_SYSLOG,
- printk_time, text + len, sizeof(text) - len);
- if (nr_ext_console_drivers) {
- ext_len = msg_print_ext_header(ext_text,
- sizeof(ext_text),
- msg, console_seq);
- ext_len += msg_print_ext_body(ext_text + ext_len,
- sizeof(ext_text) - ext_len,
- log_dict(msg), msg->dict_len,
- log_text(msg), msg->text_len);
- }
- console_idx = log_next(console_idx);
- console_seq++;
- raw_spin_unlock(&logbuf_lock);
-
- /*
- * While actively printing out messages, if another printk()
- * were to occur on another CPU, it may wait for this one to
- * finish. This task can not be preempted if there is a
- * waiter waiting to take over.
- */
- console_lock_spinning_enable();
-
- stop_critical_timings(); /* don't trace print latency */
- //call_console_drivers(ext_text, ext_len, text, len);
- start_critical_timings();
-
- if (console_lock_spinning_disable_and_check()) {
- printk_safe_exit_irqrestore(flags);
- return;
- }
-
- printk_safe_exit_irqrestore(flags);
-
- if (do_cond_resched)
- cond_resched();
- }
-
console_locked = 0;
-
- raw_spin_unlock(&logbuf_lock);
-
up_console_sem();
-
- /*
- * Someone could have filled up the buffer again, so re-check if there's
- * something to flush. In case we cannot trylock the console_sem again,
- * there's a new owner and the console_unlock() from them will do the
- * flush, no worries.
- */
- raw_spin_lock(&logbuf_lock);
- retry = console_seq != log_next_seq;
- raw_spin_unlock(&logbuf_lock);
- printk_safe_exit_irqrestore(flags);
-
- if (retry && console_trylock())
- goto again;
}
EXPORT_SYMBOL(console_unlock);

--
2.11.0


2019-02-12 14:55:44

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 03/25] printk-rb: define ring buffer struct and initializer

See Documentation/printk-ringbuffer.txt for details about the
initializer arguments.
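
For example, the ring buffer is later instantiated in the series as:

	DECLARE_STATIC_PRINTKRB(printk_rb, CONFIG_LOG_BUF_SHIFT, &printk_cpulock);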

Signed-off-by: John Ogness <[email protected]>
---
include/linux/printk_ringbuffer.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
index 75f5708ea902..0e6e8dd0d01e 100644
--- a/include/linux/printk_ringbuffer.h
+++ b/include/linux/printk_ringbuffer.h
@@ -10,6 +10,20 @@ struct prb_cpulock {
unsigned long __percpu *irqflags;
};

+struct printk_ringbuffer {
+ void *buffer;
+ unsigned int size_bits;
+
+ u64 seq;
+
+ atomic_long_t tail;
+ atomic_long_t head;
+ atomic_long_t reserve;
+
+ struct prb_cpulock *cpulock;
+ atomic_t ctx;
+};
+
#define DECLARE_STATIC_PRINTKRB_CPULOCK(name) \
static DEFINE_PER_CPU(unsigned long, _##name##_percpu_irqflags); \
static struct prb_cpulock name = { \
@@ -17,6 +31,20 @@ static struct prb_cpulock name = { \
.irqflags = &_##name##_percpu_irqflags, \
}

+#define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr) \
+static char _##name##_buffer[1 << (szbits)] \
+ __aligned(__alignof__(long)); \
+static struct printk_ringbuffer name = { \
+ .buffer = &_##name##_buffer[0], \
+ .size_bits = szbits, \
+ .seq = 0, \
+ .tail = ATOMIC_LONG_INIT(-111 * sizeof(long)), \
+ .head = ATOMIC_LONG_INIT(-111 * sizeof(long)), \
+ .reserve = ATOMIC_LONG_INIT(-111 * sizeof(long)), \
+ .cpulock = cpulockptr, \
+ .ctx = ATOMIC_INIT(0), \
+}
+
/* utility functions */
void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);
void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);
--
2.11.0


2019-02-12 14:56:08

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 09/25] printk: remove exclusive console hack

In order to support printing the printk log history when new
consoles are registered, a global exclusive_console variable is
temporarily set. This only works because printk runs with
preemption disabled.

When console printing is moved to a fully preemptible dedicated
kthread, this hack no longer works.

Remove exclusive_console usage.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 30 ++++--------------------------
1 file changed, 4 insertions(+), 26 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 08e079b95652..5a5a685bb128 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -251,11 +251,6 @@ static void __up_console_sem(unsigned long ip)
static int console_locked, console_suspended;

/*
- * If exclusive_console is non-NULL then only this console is to be printed to.
- */
-static struct console *exclusive_console;
-
-/*
* Array of consoles built from command line options (console=)
*/

@@ -423,7 +418,6 @@ static u32 log_next_idx;
/* the next printk record to write to the console */
static u64 console_seq;
static u32 console_idx;
-static u64 exclusive_console_stop_seq;

/* the next printk record to read after the last 'clear' command */
static u64 clear_seq;
@@ -1761,8 +1755,6 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
return;

for_each_console(con) {
- if (exclusive_console && con != exclusive_console)
- continue;
if (!(con->flags & CON_ENABLED))
continue;
if (!con->write)
@@ -2044,7 +2036,6 @@ static u64 syslog_seq;
static u32 syslog_idx;
static u64 console_seq;
static u32 console_idx;
-static u64 exclusive_console_stop_seq;
static u64 log_first_seq;
static u32 log_first_idx;
static u64 log_next_seq;
@@ -2413,12 +2404,6 @@ void console_unlock(void)
goto skip;
}

- /* Output to all consoles once old messages replayed. */
- if (unlikely(exclusive_console &&
- console_seq >= exclusive_console_stop_seq)) {
- exclusive_console = NULL;
- }
-
len += msg_print_text(msg,
console_msg_format & MSG_FORMAT_SYSLOG,
printk_time, text + len, sizeof(text) - len);
@@ -2736,17 +2721,6 @@ void register_console(struct console *newcon)
logbuf_lock_irqsave(flags);
console_seq = syslog_seq;
console_idx = syslog_idx;
- /*
- * We're about to replay the log buffer. Only do this to the
- * just-registered console to avoid excessive message spam to
- * the already-registered consoles.
- *
- * Set exclusive_console with disabled interrupts to reduce
- * race window with eventual console_flush_on_panic() that
- * ignores console_lock.
- */
- exclusive_console = newcon;
- exclusive_console_stop_seq = console_seq;
logbuf_unlock_irqrestore(flags);
}
console_unlock();
@@ -2758,6 +2732,10 @@ void register_console(struct console *newcon)
* boot consoles, real consoles, etc - this is to ensure that end
* users know there might be something in the kernel's log buffer that
* went to the bootconsole (that they do not see on the real console)
+ *
+ * This message is also important because it will trigger the
+ * printk kthread to begin dumping the log buffer to the newly
+ * registered console.
*/
pr_info("%sconsole [%s%d] enabled\n",
(newcon->flags & CON_BOOT) ? "boot" : "" ,
--
2.11.0


2019-02-12 14:56:44

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 13/25] printk: track seq per console

Allow each console to track which seq record was last printed. This
simplifies identifying dropped records.

Signed-off-by: John Ogness <[email protected]>
---
include/linux/console.h | 1 +
kernel/printk/printk.c | 30 +++++++++++++++++++++++++++---
2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index ec9bdb3d7bab..7fa06a058339 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -153,6 +153,7 @@ struct console {
short flags;
short index;
int cflag;
+ unsigned long printk_seq;
void *data;
struct console *next;
};
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ece54c24ea0d..ebd9aac06323 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1453,6 +1453,16 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
return do_syslog(type, buf, len, SYSLOG_FROM_READER);
}

+static void print_console_dropped(struct console *con, u64 count)
+{
+ char text[64];
+ int len;
+
+ len = sprintf(text, "** %llu printk message%s dropped **\n",
+ count, count > 1 ? "s" : "");
+ con->write(con, text, len);
+}
+
static void format_text(struct printk_log *msg, u64 seq,
char *ext_text, size_t *ext_len,
char *text, size_t *len, bool time)
@@ -1486,7 +1496,7 @@ static void format_text(struct printk_log *msg, u64 seq,
* log_buf[start] to log_buf[end - 1].
* The console_lock must be held.
*/
-static void call_console_drivers(const char *ext_text, size_t ext_len,
+static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
const char *text, size_t len)
{
struct console *con;
@@ -1504,6 +1514,19 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
if (!cpu_online(raw_smp_processor_id()) &&
!(con->flags & CON_ANYTIME))
continue;
+ if (con->printk_seq >= seq)
+ continue;
+
+ con->printk_seq++;
+ if (con->printk_seq < seq) {
+ print_console_dropped(con, seq - con->printk_seq);
+ con->printk_seq = seq;
+ }
+
+ /* for suppressed messages, only seq is updated */
+ if (len == 0 && ext_len == 0)
+ continue;
+
if (con->flags & CON_EXTENDED)
con->write(con, ext_text, ext_len);
else
@@ -1738,7 +1761,7 @@ static ssize_t msg_print_ext_header(char *buf, size_t size,
static ssize_t msg_print_ext_body(char *buf, size_t size,
char *dict, size_t dict_len,
char *text, size_t text_len) { return 0; }
-static void call_console_drivers(const char *ext_text, size_t ext_len,
+static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
const char *text, size_t len) {}
static size_t msg_print_text(const struct printk_log *msg, bool syslog,
bool time, char *buf, size_t size) { return 0; }
@@ -2481,8 +2504,9 @@ static int printk_kthread_func(void *data)
&len, printk_time);

console_lock();
+ call_console_drivers(master_seq, ext_text,
+ ext_len, text, len);
if (len > 0 || ext_len > 0) {
- call_console_drivers(ext_text, ext_len, text, len);
boot_delay_msec(msg->level);
printk_delay();
}
--
2.11.0


2019-02-12 14:56:58

by John Ogness

[permalink] [raw]
Subject: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

The printk ring buffer provides an NMI-safe interface for writing
messages to a ring buffer. Using such a buffer relieves printk
callers of the current burden of disabled preemption while calling
the console drivers (and possibly printing out many messages that
another task put into the log buffer).

Create a ring buffer to be used for storing messages to be
printed to the consoles.

Create a dedicated printk kthread to block on the ring buffer
and call the console drivers for the messages it reads.

NOTE: The printk_delay is relocated to _after_ the message is
printed, where it makes more sense.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 105 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index d3d170374ceb..08e079b95652 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -44,6 +44,8 @@
#include <linux/irq_work.h>
#include <linux/ctype.h>
#include <linux/uio.h>
+#include <linux/kthread.h>
+#include <linux/printk_ringbuffer.h>
#include <linux/sched/clock.h>
#include <linux/sched/debug.h>
#include <linux/sched/task_stack.h>
@@ -397,7 +399,12 @@ DEFINE_RAW_SPINLOCK(logbuf_lock);
printk_safe_exit_irqrestore(flags); \
} while (0)

+DECLARE_STATIC_PRINTKRB_CPULOCK(printk_cpulock);
+
#ifdef CONFIG_PRINTK
+/* record buffer */
+DECLARE_STATIC_PRINTKRB(printk_rb, CONFIG_LOG_BUF_SHIFT, &printk_cpulock);
+
DECLARE_WAIT_QUEUE_HEAD(log_wait);
/* the next printk record to read by syslog(READ) or /proc/kmsg */
static u64 syslog_seq;
@@ -744,6 +751,10 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
return p - buf;
}

+#define PRINTK_SPRINT_MAX (LOG_LINE_MAX + PREFIX_MAX)
+#define PRINTK_RECORD_MAX (sizeof(struct printk_log) + \
+ CONSOLE_EXT_LOG_MAX + PRINTK_SPRINT_MAX)
+
/* /dev/kmsg - userspace message inject/listen interface */
struct devkmsg_user {
u64 seq;
@@ -1566,6 +1577,34 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
return do_syslog(type, buf, len, SYSLOG_FROM_READER);
}

+static void format_text(struct printk_log *msg, u64 seq,
+ char *ext_text, size_t *ext_len,
+ char *text, size_t *len, bool time)
+{
+ if (suppress_message_printing(msg->level)) {
+ /*
+ * Skip record that has level above the console
+ * loglevel and update each console's local seq.
+ */
+ *len = 0;
+ *ext_len = 0;
+ return;
+ }
+
+ *len = msg_print_text(msg, console_msg_format & MSG_FORMAT_SYSLOG,
+ time, text, PRINTK_SPRINT_MAX);
+ if (nr_ext_console_drivers) {
+ *ext_len = msg_print_ext_header(ext_text, CONSOLE_EXT_LOG_MAX,
+ msg, seq);
+ *ext_len += msg_print_ext_body(ext_text + *ext_len,
+ CONSOLE_EXT_LOG_MAX - *ext_len,
+ log_dict(msg), msg->dict_len,
+ log_text(msg), msg->text_len);
+ } else {
+ *ext_len = 0;
+ }
+}
+
/*
* Special console_lock variants that help to reduce the risk of soft-lockups.
* They allow to pass console_lock to another printk() call using a busy wait.
@@ -2899,6 +2938,72 @@ void wake_up_klogd(void)
preempt_enable();
}

+static int printk_kthread_func(void *data)
+{
+ struct prb_iterator iter;
+ struct printk_log *msg;
+ size_t ext_len;
+ char *ext_text;
+ u64 master_seq;
+ size_t len;
+ char *text;
+ char *buf;
+ int ret;
+
+ ext_text = kmalloc(CONSOLE_EXT_LOG_MAX, GFP_KERNEL);
+ text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL);
+ buf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
+ if (!ext_text || !text || !buf)
+ return -1;
+
+ prb_iter_init(&iter, &printk_rb, NULL);
+
+ /* the printk kthread never exits */
+ for (;;) {
+ ret = prb_iter_wait_next(&iter, buf,
+ PRINTK_RECORD_MAX, &master_seq);
+ if (ret == -ERESTARTSYS) {
+ continue;
+ } else if (ret < 0) {
+ /* iterator invalid, start over */
+ prb_iter_init(&iter, &printk_rb, NULL);
+ continue;
+ }
+
+ msg = (struct printk_log *)buf;
+ format_text(msg, master_seq, ext_text, &ext_len, text,
+ &len, printk_time);
+
+ console_lock();
+ if (len > 0 || ext_len > 0) {
+ call_console_drivers(ext_text, ext_len, text, len);
+ boot_delay_msec(msg->level);
+ printk_delay();
+ }
+ console_unlock();
+ }
+
+ kfree(ext_text);
+ kfree(text);
+ kfree(buf);
+
+ return 0;
+}
+
+static int __init init_printk_kthread(void)
+{
+ struct task_struct *thread;
+
+ thread = kthread_run(printk_kthread_func, NULL, "printk");
+ if (IS_ERR(thread)) {
+ pr_err("printk: unable to create printing thread\n");
+ return PTR_ERR(thread);
+ }
+
+ return 0;
+}
+late_initcall(init_printk_kthread);
+
void defer_console_output(void)
{
preempt_disable();
--
2.11.0


2019-02-12 14:59:29

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [RFC PATCH v1 03/25] printk-rb: define ring buffer struct and initializer

On Tue, Feb 12, 2019 at 03:29:41PM +0100, John Ogness wrote:
> See Documentation/printk-ringbuffer.txt for details about the
> initializer arguments.

You can put that documentation here in the .h file and have it pulled
out automatically into the documentation files when they are created.
That way you always keep everything in sync properly.
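
For example, kernel-doc comments above the macro would be pulled in
automatically; the parameter names below match the patch, the
descriptions are only illustrative:

	/**
	 * DECLARE_STATIC_PRINTKRB() - declare and initialize a static ring buffer
	 * @name:       name for the struct printk_ringbuffer variable
	 * @szbits:     size of the data buffer as a power of 2
	 * @cpulockptr: pointer to the prb_cpulock used for writer synchronization
	 */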

thanks,

greg k-h

2019-02-12 15:00:39

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [RFC PATCH v1 01/25] printk-rb: add printk ring buffer documentation

On Tue, Feb 12, 2019 at 03:29:39PM +0100, John Ogness wrote:
> The full documentation file for the printk ring buffer.
>
> Signed-off-by: John Ogness <[email protected]>
> ---
> Documentation/printk-ringbuffer.txt | 377 ++++++++++++++++++++++++++++++++++++

Nit, shouldn't this be in .rst format and tied into the "build the
kernel documentation" process somehow?

thanks,

greg k-h

2019-02-12 15:48:43

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On (02/12/19 15:29), John Ogness wrote:
[..]
> +static int printk_kthread_func(void *data)
> +{
> + struct prb_iterator iter;
> + struct printk_log *msg;
> + size_t ext_len;
> + char *ext_text;
> + u64 master_seq;
> + size_t len;
> + char *text;
> + char *buf;
> + int ret;
> +
> + ext_text = kmalloc(CONSOLE_EXT_LOG_MAX, GFP_KERNEL);
> + text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL);
> + buf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
> + if (!ext_text || !text || !buf)
> + return -1;
> +
> + prb_iter_init(&iter, &printk_rb, NULL);
> +
> + /* the printk kthread never exits */
> + for (;;) {
> + ret = prb_iter_wait_next(&iter, buf,
> + PRINTK_RECORD_MAX, &master_seq);
> + if (ret == -ERESTARTSYS) {
> + continue;
> + } else if (ret < 0) {
> + /* iterator invalid, start over */
> + prb_iter_init(&iter, &printk_rb, NULL);
> + continue;
> + }
> +
> + msg = (struct printk_log *)buf;
> + format_text(msg, master_seq, ext_text, &ext_len, text,
> + &len, printk_time);
> +
> + console_lock();
> + if (len > 0 || ext_len > 0) {
> + call_console_drivers(ext_text, ext_len, text, len);
> + boot_delay_msec(msg->level);
> + printk_delay();
> + }
> + console_unlock();
> + }

One thing that I have learned is that preemptible printk does not work
as expected; it wants to be 'atomic' and just stay busy as long as it can.
We tried preemptible printk at Samsung and the result was just bad:
preempted printk kthread + slow serial console = lots of lost messages

We also had preemptible printk in the upstream kernel and reverted the
patch (see fd5f7cde1b85d4c8e09); same reasons - we had reports that
preemptible printk could "stall" for minutes.

-ss

2019-02-12 17:37:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH v1 07/25] printk-rb: add functionality required by printk

On Tue, Feb 12, 2019 at 6:30 AM John Ogness <[email protected]> wrote:
>
> + while (atomic_long_read(&rb->lost)) {
> + atomic_long_dec(&rb->lost);
> + rb->seq++;
> + }

This looks like crazy garbage. It's neither atomic nor sane.

Why isn't it something like

if (atomic_long_read(&rb->lost)) {
long lost = atomic_xchg(&rb->lost, 0);
rb->seq += lost;
}

instead?

Linus

2019-02-13 02:38:39

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On (02/12/19 15:29), John Ogness wrote:
>
> 1. The printk buffer is protected by a global raw spinlock for readers
> and writers. This restricts the contexts that are allowed to
> access the buffer.

[..]

> 2. Because of #1, NMI and recursive contexts are handled by deferring
> logging/printing to a spinlock-safe context. This means that
> messages will not be visible if (for example) the kernel dies in
> NMI context and the irq_work mechanism does not survive.

panic() calls printk_safe_flush_on_panic(), which iterates all per-CPU
buffers and moves data to the main logbuf; so then we can flush pending
logbuf messages

panic()
printk_safe_flush_on_panic();
console_flush_on_panic();

We don't really use irq_work mechanism for that.

> 3. Because of #1, when *not* using features such as PREEMPT_RT, large
> latencies exist when printing to slow consoles.

Because of #1? I'm not familiar with PREEMPT_RT; but logbuf spinlock
should be unlocked while we print messages to slow consoles
(call_console_drivers() is protected by console_sem, not logbuf
lock).

So it's

spin_lock_irqsave(logbuf);
vsprintf();
memcpy();
spin_unlock_irqrestore(logbuf);

console_trylock();
for (;;)
call_console_drivers();
// console_owner handover
console_unlock();

Do you see large latencies because of logbuf spinlock?

> 5. Printing to consoles is the responsibility of the printk caller
> and that caller may be required to print many messages that other
> printk callers inserted. Because of this there can be enormous
> variance in the runtime of a printk call.

That's complicated. Steven's console_owner handover patch makes
printk() more fair. We can have "winner takes it all" scenarios,
but significantly less often, IMO. Do you have any data that
suggest otherwise?

> 7. Loglevel INFO is handled the same as ERR. There seems to be an
> endless effort to get printk to show _all_ messages as quickly as
> possible in case of a panic (i.e. printing from any context), but
> at the same time try not to have printk be too intrusive for the
> callers. These are conflicting requirements that lead to a printk
> implementation that does a sub-optimal job of satisfying both
> sides.

Per my experience, fully preemptible "print it sometime maybe"
printk() does not work equally well for everyone.

-ss

2019-02-13 02:39:00

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On (02/12/19 15:29), John Ogness wrote:
> - console_flush_on_panic() currently is a NOP. It is pretty clear how
> this could be implemented if atomic_write was available. But if no
> such console is registered, it is not clear what should be done. Is
> this function really even needed?

If you now rely on a fully preemptible printk kthread to flush
pending logbuf messages, then console_flush_on_panic() is your
only chance to see those pending logbuf messages on the serial
console when the system dies.

Non-atomic consoles should become atomic once you call bust_spinlocks(1),
this is what we currently have:

panic()
bust_spinlocks(1) // sets oops_in_progress
console_flush_on_panic()
call_console_drivers()
-> serial_driver_write()
if (oops_in_progress)
locked = spin_trylock_irqsave(&port->lock);
uart_console_write();
if (locked)
spin_unlock_irqrestore(&port->lock);

-ss

2019-02-13 02:56:50

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On (02/12/19 15:29), John Ogness wrote:
>
> - A dedicated kernel thread is created for printing to all consoles in
> a fully preemptible context.

How do you handle sysrq-<foo> printouts on systems which can't
schedule printk-kthread?

-ss

2019-02-13 13:30:21

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 07/25] printk-rb: add functionality required by printk

On 2019-02-12, Linus Torvalds <[email protected]> wrote:
> On Tue, Feb 12, 2019 at 6:30 AM John Ogness <[email protected]> wrote:
>>
>> + while (atomic_long_read(&rb->lost)) {
>> + atomic_long_dec(&rb->lost);
>> + rb->seq++;
>> + }
>
> This looks like crazy garbage. It's neither atomic nor sane.

It works because because only 1 context on a single CPU can hit that
loop. But yes, it is crazy.

> Why isn't it something like
>
> if (atomic_long_read(&rb->lost)) {
> long lost = atomic_xchg(&rb->lost, 0);
> rb->seq += lost;
> }
>
> instead?

Yes, it should be like you suggest. Thanks.

John Ogness

2019-02-13 15:01:03

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

Hi Sergey,

I am glad to see that you are getting involved here. Your previous
talks, work, and discussions were a large part of my research when
preparing for this work.

My response to your comments inline...

On 2019-02-13, Sergey Senozhatsky <[email protected]> wrote:
> On (02/12/19 15:29), John Ogness wrote:
>>
>> 1. The printk buffer is protected by a global raw spinlock for readers
>> and writers. This restricts the contexts that are allowed to
>> access the buffer.
>
> [..]
>
>> 2. Because of #1, NMI and recursive contexts are handled by deferring
>> logging/printing to a spinlock-safe context. This means that
>> messages will not be visible if (for example) the kernel dies in
>> NMI context and the irq_work mechanism does not survive.
>
> panic() calls printk_safe_flush_on_panic(), which iterates all per-CPU
> buffers and moves data to the main logbuf; so then we can flush pending
> logbuf message
>
> panic()
> printk_safe_flush_on_panic();
> console_flush_on_panic();

If we are talking about an SMP system where logbuf_lock is locked, the
call chain is actually:

panic()
crash_smp_send_stop()
... wait for "num_online_cpus() == 1" ...
printk_safe_flush_on_panic();
console_flush_on_panic();

Is it guaranteed that the kernel will successfully stop the other CPUs
so that it can print to the console?

And then there is console_flush_on_panic(), which will ignore locks and
write to the consoles, expecting them to check "oops_in_progress" and
ignore their own internal locks.

Is it guaranteed that locks can just be ignored and backtraces will be
seen and legible to the user?

With the proposed emergency messages, panic() can write immediately to
the guaranteed NMI-safe write_atomic console without having to first do
anything with other CPUs (IPIs, NMIs, waiting, whatever) and without
ignoring locks.

> We don't really use irq_work mechanism for that.

Sorry. I didn't mean to imply that panic() uses irq_work. For non-panic
NMI situations irq_work is required (for example, WARN_ON). If such a
WARN_ON occurred but the hardware locked up before the irq_work
engaged, the message is never seen.

Obviously I am talking about rare situations. But these situations do
occur, and it is quite unfortunate (and frustrating!) when the kernel
has the important messages ready, but could not get them out.

>> 3. Because of #1, when *not* using features such as PREEMPT_RT, large
>> latencies exist when printing to slow consoles.
>
> Because of #1? I'm not familiar with PREEMPT_RT; but logbuf spinlock
> should be unlocked while we print messages to slow consoles
> (call_console_drivers() is protected by console_sem, not logbuf
> lock).
>
> So it's
>
> spin_lock_irqsave(logbuf);
> vsprintf();
> memcpy();
> spin_unlock_irqrestore(logbuf);
>
> console_trylock();
> for (;;)
> call_console_drivers();
> // console_owner handover
> console_unlock();
>
> Do you see large latencies because of logbuf spinlock?

Sorry. As you point out, this issue is not because of #1. It is because
of the call sequence:

vprintk_emit()
preempt_disable()
console_unlock()
call_console_drivers()
preempt_enable()

For slow consoles, this can cause large latencies for some unfortunate
tasks.
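
(As a rough illustration: a 115200 baud serial console moves about
11 KB/s, so a task that ends up flushing a 100 KB backlog through this
sequence keeps preemption disabled for roughly 9 seconds.)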

>> 5. Printing to consoles is the responsibility of the printk caller
>> and that caller may be required to print many messages that other
>> printk callers inserted. Because of this there can be enormous
>> variance in the runtime of a printk call.
>
> That's complicated. Steven's console_owner handover patch makes
> printk() more fair. We can have "winner takes it all" scenarios,
> but significantly less often, IMO. Do you have any data that
> suggest otherwise?

Steven's console_owner handover was a definite improvement. But as you
said, we can have "winner takes it all" scenarios (or rather "the last
one to leave the bar pays the bill"). This still should not be
acceptable. Let's let some other _preemptible_ task pay the bill.

I am proposing to change printk so that a single pr_info() call can be
made without the fear that it might have to print thousands of messages
to multiple consoles, all with preemption disabled.

>> 7. Loglevel INFO is handled the same as ERR. There seems to be an
>> endless effort to get printk to show _all_ messages as quickly as
>> possible in case of a panic (i.e. printing from any context), but
>> at the same time try not to have printk be too intrusive for the
>> callers. These are conflicting requirements that lead to a printk
>> implementation that does a sub-optimal job of satisfying both
>> sides.
>
> Per my experience, fully preemptible "print it sometime maybe"
> printk() does not work equally well for everyone.

(I'm also including your previous relevant comment[0].)

> One thing that I have learned is that preemptible printk does not work
> as expected; it wants to be 'atomic' and just stay busy as long as it
> can.
> We tried preemptible printk at Samsung and the result was just bad:
> preempted printk kthread + slow serial console = lots of lost
> messages

As long as all critical messages are printed directly and immediately to
an emergency console, why is it a problem if the informational messages
to consoles are sometimes delayed or lost? And if those informational
messages _are_ so important, there are things the user can do. For
example, create a realtime userspace task to read /dev/kmsg.
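
A minimal sketch of such a reader (the SCHED_FIFO priority of 10 is an
arbitrary choice):

	#include <fcntl.h>
	#include <sched.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		struct sched_param sp = { .sched_priority = 10 };
		char buf[8192];
		ssize_t len;
		int fd;

		if (sched_setscheduler(0, SCHED_FIFO, &sp))
			perror("sched_setscheduler");

		fd = open("/dev/kmsg", O_RDONLY);
		if (fd < 0)
			return 1;

		/* each read() returns exactly one log record */
		while ((len = read(fd, buf, sizeof(buf))) > 0)
			fwrite(buf, 1, len, stdout);

		return 0;
	}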

> We also had preemptible printk in the upstream kernel and reverted the
> patch (see fd5f7cde1b85d4c8e09); same reasons - we had reports that
> preemptible printk could "stall" for minutes.

But in this case the preemptible task was used for printing critical
messages as well. Then the stall really is a problem. I am proposing to
rely on emergency consoles for critical messages. By changing printk to
support 2 different channels (emergency and non-emergency), we can focus
on making each of those channels optimal.

This is exactly what point #7 is talking about: How we currently have
only 1 channel to try and satisfy all needs (whether critical or console
noise).

John Ogness

[0] http://lkml.kernel.org/r/[email protected]

2019-02-13 15:12:34

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On 2019-02-13, Sergey Senozhatsky <[email protected]> wrote:
>> - console_flush_on_panic() currently is a NOP. It is pretty clear how
>> this could be implemented if atomic_write was available. But if no
>> such console is registered, it is not clear what should be done. Is
>> this function really even needed?
>
> If you now rely on a fully preemptible printk kthread to flush
> pending logbuf messages, then console_flush_on_panic() is your
> only chance to see those pending logbuf messages on the serial
> console when the system dies.

Anything critical would have already been immediately printed to the
emergency consoles. And if an emergency console was available,
console_flush_on_panic() could be a special case where _all_ unseen
messages (regardless of importance) are printed to the emergency
console.

> Non-atomic consoles should become atomic once you call bust_spinlocks(1),
> this is what we currently have:
>
> panic()
> bust_spinlocks(1) // sets oops_in_progress
> console_flush_on_panic()
> call_console_drivers()
> -> serial_driver_write()
> if (oops_in_progress)
> locked = spin_trylock_irqsave(&port->lock);
> uart_console_write();
> if (locked)
> spin_unlock_irqrestore(&port->lock);

I don't like bust_spinlocks() because drivers end up implementing
oops_in_progress with exactly that... ignoring their own locks. I prefer
consoles are provided with a locking mechanism that they can use to
support a separate NMI-safe write function. My series introduces
console_atomic_lock() for exactly this purpose.
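
As a rough sketch only (assuming console_atomic_lock()/unlock() pass a
cpu_store token the way prb_lock()/prb_unlock() do; serial8250_putc()
is a hypothetical helper):

	static void serial8250_write_atomic(struct console *con,
					    const char *s, unsigned int count)
	{
		unsigned int cpu_store;
		unsigned int i;

		console_atomic_lock(&cpu_store);

		for (i = 0; i < count; i++)
			serial8250_putc(con, s[i]); /* poll, then write one char */

		console_atomic_unlock(cpu_store);
	}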

But this doesn't help here. Here we are talking about a crashing system
that does _not_ have an emergency console. And in this case I would say
messages would be lost (just like they are now if all you have is a vt
console and it is busy).

I suppose we could keep the current bust_spinlocks() stuff for the
special case that there are no emergency consoles available. It's better
than nothing, but also not really reliable. Preferably we figure out
how to implement write_atomic for all console drivers.

John Ogness

2019-02-13 15:21:39

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On 2019-02-13, Sergey Senozhatsky <[email protected]> wrote:
>> - A dedicated kernel thread is created for printing to all consoles in
>> a fully preemptible context.
>
> How do you handle sysrq-<foo> printouts on systems which can't
> schedule printk-kthread?

If those sysrq printouts are at the emergency loglevel (which most are),
then they are printed immediately to the emergency consoles. This has
already proved useful for our own kernel debugging work. For example,
currently sysrq-z for very large traces results in messages being dropped
because of printk buffer overflows. But with the emergency console we
always see the full trace buffer.

Because you have already done so much work and experimentation with
printk-kthreads, I feel like many of your comments are related to your
kthread work in this area. Really the big design change I make with my
printk-kthread is that it is only for non-critical messages. For
anything critical, users should rely on an emergency console.

John Ogness

2019-02-13 16:12:29

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 02/25] printk-rb: add prb locking functions

On Tue 2019-02-12 15:29:40, John Ogness wrote:
> Add processor-reentrant spin locking functions. These allow
> restricting the number of possible contexts to 2, which can simplify
> implementing code that also supports NMI interruptions.
>
> prb_lock();
>
> /*
> * This code is synchronized with all contexts
> * except an NMI on the same processor.
> */
>
> prb_unlock();
>
> In order to support printk's emergency messages, a
> processor-reentrant spin lock will be used to control raw access to
> the emergency console. However, it must be the same
> processor-reentrant spin lock as the one used by the ring buffer,
> otherwise a deadlock can occur:
>
> CPU1: printk lock -> emergency -> serial lock
> CPU2: serial lock -> printk lock
>
> By making the processor-reentrant implementation available externally,
> printk can use the same atomic_t for the ring buffer as for the
> emergency console and thus avoid the above deadlock.

Interesting idea. I just wonder if it might cause some problems
when it is shared between printk() and many other consoles.

It sounds like the big kernel lock or console_lock. They
both caused many troubles.


> diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
> new file mode 100644
> index 000000000000..28958b0cf774
> --- /dev/null
> +++ b/lib/printk_ringbuffer.c
> @@ -0,0 +1,77 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/smp.h>
> +#include <linux/printk_ringbuffer.h>
> +
> +static bool __prb_trylock(struct prb_cpulock *cpu_lock,
> + unsigned int *cpu_store)
> +{
> + unsigned long *flags;
> + unsigned int cpu;
> +
> + cpu = get_cpu();
> +
> + *cpu_store = atomic_read(&cpu_lock->owner);
> + /* memory barrier to ensure the current lock owner is visible */
> + smp_rmb();
> + if (*cpu_store == -1) {
> + flags = per_cpu_ptr(cpu_lock->irqflags, cpu);
> + local_irq_save(*flags);
> + if (atomic_try_cmpxchg_acquire(&cpu_lock->owner,
> + cpu_store, cpu)) {
> + return true;
> + }
> + local_irq_restore(*flags);
> + } else if (*cpu_store == cpu) {
> + return true;
> + }
> +
> + put_cpu();

Is there any reason why you get/put CPU and enable/disable
in each iteration?

It is a spin lock after all. We do busy waiting anyway. This looks like
burning CPU power for no real gain. Simple cpu_relax() should be enough.

> + return false;
> +}
> +
> +/*
> + * prb_lock: Perform a processor-reentrant spin lock.
> + * @cpu_lock: A pointer to the lock object.
> + * @cpu_store: A "flags" pointer to store lock status information.
> + *
> + * If no processor has the lock, the calling processor takes the lock and
> + * becomes the owner. If the calling processor is already the owner of the
> + * lock, this function succeeds immediately. If lock is locked by another
> + * processor, this function spins until the calling processor becomes the
> + * owner.
> + *
> + * It is safe to call this function from any context and state.
> + */
> +void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store)
> +{
> + for (;;) {
> + if (__prb_trylock(cpu_lock, cpu_store))
> + break;
> + cpu_relax();
> + }
> +}
> +
> +/*
> + * prb_unlock: Perform a processor-reentrant spin unlock.
> + * @cpu_lock: A pointer to the lock object.
> + * @cpu_store: A "flags" object storing lock status information.
> + *
> + * Release the lock. The calling processor must be the owner of the lock.
> + *
> + * It is safe to call this function from any context and state.
> + */
> +void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store)
> +{
> + unsigned long *flags;
> + unsigned int cpu;
> +
> + cpu = atomic_read(&cpu_lock->owner);
> + atomic_set_release(&cpu_lock->owner, cpu_store);
> +
> + if (cpu_store == -1) {
> + flags = per_cpu_ptr(cpu_lock->irqflags, cpu);
> + local_irq_restore(*flags);
> + }

cpu_store looks like an implementation detail. The caller
needs to remember it to handle the nesting properly.

We could achieve the same with a recursion counter hidden
in struct prb_lock.

Best Regards,
Petr


PS: This is the most complex patchset that I have ever reviewed.
I am not sure what the best approach is. I am going to understand
it and comment on what catches my eye. I will comment on the overall
design later after I have a better understanding.

The first feeling is that it would be nice to be able to
store messages into a single log buffer from every context.
It will depend on whether the new approach is safe and maintainable.

The offloading of console handling into a kthread might be
problematic. We were pushing it for years and never succeeded.
People preferred to minimize the risk that messages would never
appear on the console.

Well, I still think that it might be needed because Steven's
console waiter logic does not prevent softlockups completely.
And realtime, with its latency requirements, has much bigger problems
with unpredictable random printk-console lockups. IMHO, we need
a solution for the realtime mode, and the normal one could just benefit
from it. We have some ideas in the drawer. And this patchset
brings some new. Let's see.

Best Regards,
Petr

2019-02-13 16:55:26

by David Laight

[permalink] [raw]
Subject: RE: [RFC PATCH v1 00/25] printk: new implementation

From: John Ogness
> Sent: 12 February 2019 14:30
...
> - A dedicated kernel thread is created for printing to all consoles in
> a fully preemptible context.
>
> - A new (optional) console operation "write_atomic" is introduced that
> console drivers may implement. This function must be NMI-safe. An
> implementation for the 8250 UART driver is provided.
>
> - The concept of "emergency messages" is introduced that allows
> important messages (based on a new emergency loglevel threshold) to
> be immediately written to any consoles supporting write_atomic,
> regardless of the context.
...

Does this address my usual 'gripe' that the output is written to the console
by syslogd and not by the kernel itself?
When you are trying to find out where the system is completely deadlocking
you need the 'old fashioned' completely synchronous kernel printf().

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2019-02-14 08:56:14

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 02/25] printk-rb: add prb locking functions

On 2019-02-13, Petr Mladek <[email protected]> wrote:
>> Add processor-reentrant spin locking functions. These allow
>> restricting the number of possible contexts to 2, which can simplify
>> implementing code that also supports NMI interruptions.
>>
>> prb_lock();
>>
>> /*
>> * This code is synchronized with all contexts
>> * except an NMI on the same processor.
>> */
>>
>> prb_unlock();
>>
>> In order to support printk's emergency messages, a
>> processor-reentrant spin lock will be used to control raw access to
>> the emergency console. However, it must be the same
>> processor-reentrant spin lock as the one used by the ring buffer,
>> otherwise a deadlock can occur:
>>
>> CPU1: printk lock -> emergency -> serial lock
>> CPU2: serial lock -> printk lock
>>
>> By making the processor-reentrant implementation available externally,
>> printk can use the same atomic_t for the ring buffer as for the
>> emergency console and thus avoid the above deadlock.
>
> Interesting idea. I just wonder if it might cause some problems
> when it is shared between printk() and many other consoles.
>
> It sounds like the big kernel lock or console_lock. They
> both caused many troubles.

It causes big troubles (deadlocks) if you have multiple instances of
it. Mainly because printk can be called from _any_ line of code in the
kernel. That is the reason I decided that it needs to be shared and only
used in call chains that are carefully constructed such as:

printk -> write_atomic

and NMI contexts are _never_ allowed to do things that rely on waiting
forever for other CPUs. For that reason it does kinda feel like a BKL.

If we do find some problems, we may want to switch to a ringbuffer
implementation that is fully lockless for both multi-readers and
multi-writers. I have written such a beast, but it is less efficient and
more complex than the ringbuffer in this series. Also, that only shrinks
the window since write_atomic would still need to make use of the
processor-reentrant spinlock to synchronize the console output. That's
why I decided to RFC with the simpler ringbuffer implementation.

>> diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
>> new file mode 100644
>> index 000000000000..28958b0cf774
>> --- /dev/null
>> +++ b/lib/printk_ringbuffer.c
>> @@ -0,0 +1,77 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <linux/smp.h>
>> +#include <linux/printk_ringbuffer.h>
>> +
>> +static bool __prb_trylock(struct prb_cpulock *cpu_lock,
>> + unsigned int *cpu_store)
>> +{
>> + unsigned long *flags;
>> + unsigned int cpu;
>> +
>> + cpu = get_cpu();
>> +
>> + *cpu_store = atomic_read(&cpu_lock->owner);
>> + /* memory barrier to ensure the current lock owner is visible */
>> + smp_rmb();
>> + if (*cpu_store == -1) {
>> + flags = per_cpu_ptr(cpu_lock->irqflags, cpu);
>> + local_irq_save(*flags);
>> + if (atomic_try_cmpxchg_acquire(&cpu_lock->owner,
>> + cpu_store, cpu)) {
>> + return true;
>> + }
>> + local_irq_restore(*flags);
>> + } else if (*cpu_store == cpu) {
>> + return true;
>> + }
>> +
>> + put_cpu();
>
> Is there any reason why you get/put CPU and enable/disable
> in each iteration?
>
> It is a spin lock after all. We do busy waiting anyway. This looks like
> burning CPU power for no real gain. Simple cpu_relax() should be
> enough.

Agreed.

>> + return false;
>> +}
>> +
>> +/*
>> + * prb_lock: Perform a processor-reentrant spin lock.
>> + * @cpu_lock: A pointer to the lock object.
>> + * @cpu_store: A "flags" pointer to store lock status information.
>> + *
>> + * If no processor has the lock, the calling processor takes the lock and
>> + * becomes the owner. If the calling processor is already the owner of the
>> + * lock, this function succeeds immediately. If lock is locked by another
>> + * processor, this function spins until the calling processor becomes the
>> + * owner.
>> + *
>> + * It is safe to call this function from any context and state.
>> + */
>> +void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store)
>> +{
>> + for (;;) {
>> + if (__prb_trylock(cpu_lock, cpu_store))
>> + break;
>> + cpu_relax();
>> + }
>> +}
>> +
>> +/*
>> + * prb_unlock: Perform a processor-reentrant spin unlock.
>> + * @cpu_lock: A pointer to the lock object.
>> + * @cpu_store: A "flags" object storing lock status information.
>> + *
>> + * Release the lock. The calling processor must be the owner of the lock.
>> + *
>> + * It is safe to call this function from any context and state.
>> + */
>> +void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store)
>> +{
>> + unsigned long *flags;
>> + unsigned int cpu;
>> +
>> + cpu = atomic_read(&cpu_lock->owner);
>> + atomic_set_release(&cpu_lock->owner, cpu_store);
>> +
>> + if (cpu_store == -1) {
>> + flags = per_cpu_ptr(cpu_lock->irqflags, cpu);
>> + local_irq_restore(*flags);
>> + }
>
> cpu_store looks like an implementation detail. The caller
> needs to remember it to handle the nesting properly.

It's really no different than "flags" in irqsave/irqrestore.

> We could achieve the same with a recursion counter hidden
> in struct prb_lock.

The only way I see how that could be implemented is if the cmpxchg
encoded the cpu owner and counter into a single integer. (Upper half as
counter, lower half as cpu owner.) Both fields would need to be updated
with a single cmpxchg. The critical cmpxchg being the one where the CPU
becomes unlocked (counter goes from 1 to 0 and cpu owner goes from N to
-1).
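
A rough sketch of such an encoding (the 16-bit field widths are an
arbitrary choice):

	/* pack the owner CPU in the low 16 bits, the nesting count above it */
	#define PRB_LOCK_OWNER(v)	((v) & 0xffff)
	#define PRB_LOCK_COUNT(v)	((unsigned int)(v) >> 16)
	#define PRB_LOCK_VAL(cpu, cnt)	(((cnt) << 16) | ((cpu) & 0xffff))

	/* unlocked: count == 0, owner == 0xffff (the 16-bit "-1") */
	#define PRB_LOCK_UNLOCKED	PRB_LOCK_VAL(0xffff, 0)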

That seems like a lot of extra code just to avoid passing a "flags"
argument.

John Ogness

2019-02-14 09:23:44

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On 2019-02-13, David Laight <[email protected]> wrote:
> ...
>> - A dedicated kernel thread is created for printing to all consoles in
>> a fully preemptible context.
>>
>> - A new (optional) console operation "write_atomic" is introduced that
>> console drivers may implement. This function must be NMI-safe. An
>> implementation for the 8250 UART driver is provided.
>>
>> - The concept of "emergency messages" is introduced that allows
>> important messages (based on a new emergency loglevel threshold) to
>> be immediately written to any consoles supporting write_atomic,
>> regardless of the context.
> ...
>
> Does this address my usual 'gripe' that the output is written to the
> console by syslogd and not by the kernel itself?

If I understand it correctly, your usual 'gripe' is aimed at
distributions that are turning off the kernel writing directly to the
console. I don't see how that is a kernel issue.

> When you are trying to find out where the system is completely
> deadlocking you need the 'old fashioned' completely synchronous kernel
> printf().

Emergency messages will give you that. They differ from the current
implementation by changing printk to have the caller print only _their_
message directly without concern for past unseen non-emergency messages
or which context they are in.

John Ogness

2019-02-14 09:34:42

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 17/25] printk: add processor number to output

On 2019-02-12, John Ogness <[email protected]> wrote:
> It can be difficult to sort printk out if multiple processors are
> printing simultaneously. Add the processor number to the printk
> output to allow the messages to be sorted.

I just discovered Tetsuo's recently accepted work[0]. So obviously it
obsoletes this patch.

John Ogness

[0] http://lkml.kernel.org/r/1543045075-3008-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp

2019-02-14 18:01:10

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 02/25] printk-rb: add prb locking functions

On Wed 2019-02-13 22:39:20, John Ogness wrote:
> On 2019-02-13, Petr Mladek <[email protected]> wrote:
> >> +/*
> >> + * prb_unlock: Perform a processor-reentrant spin unlock.
> >> + * @cpu_lock: A pointer to the lock object.
> >> + * @cpu_store: A "flags" object storing lock status information.
> >> + *
> >> + * Release the lock. The calling processor must be the owner of the lock.
> >> + *
> >> + * It is safe to call this function from any context and state.
> >> + */
> >> +void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store)
> >> +{
> >> + unsigned long *flags;
> >> + unsigned int cpu;
> >> +
> >> + cpu = atomic_read(&cpu_lock->owner);
> >> + atomic_set_release(&cpu_lock->owner, cpu_store);
> >> +
> >> + if (cpu_store == -1) {
> >> + flags = per_cpu_ptr(cpu_lock->irqflags, cpu);
> >> + local_irq_restore(*flags);
> >> + }
> >
> > cpu_store looks like an implementation detail. The caller
> > needs to remember it to handle the nesting properly.
>
> It's really no different than "flags" in irqsave/irqrestore.
>
> > We could achieve the same with a recursion counter hidden
> > in struct prb_lock.
>
> The only way I see how that could be implemented is if the cmpxchg
> encoded the cpu owner and counter into a single integer. (Upper half as
> counter, lower half as cpu owner.) Both fields would need to be updated
> with a single cmpxchg. The critical cmpxchg being the one where the CPU
> becomes unlocked (counter goes from 1 to 0 and cpu owner goes from N to
> -1).

The atomic operations are tricky. I feel rather lost in them.
Well, I still think that it might be easier to detect nesting
on the same CPU, see below.

Also there is no need to store irq flags in a per-CPU variable.
Only the first owner of the lock needs to store the flags. The others
are spinning or nested.

struct prb_cpulock {
atomic_t owner;
unsigned int flags;
int nested; /* initialized to 0 */
};

void prb_lock(struct prb_cpulock *cpu_lock)
{
unsigned int flags;
int cpu;

/*
* The next condition might be valid only when
* we are nested on the same CPU. It means
* the IRQs are already disabled and no
* memory barrier is needed.
*/
if (atomic_read(&cpu_lock->owner) == smp_processor_id()) {
cpu_lock->nested++;
return;
}

/* Not nested. Take the lock */
local_irq_save(flags);
cpu = smp_processor_id();

for (;;) {
if (atomic_try_cmpxchg_acquire(&cpu_lock->owner,
-1, cpu)) {
cpu_lock->flags = flags;
break;
}

cpu_relax();
}
}

void prb_unlock(struct prb_cpulock *cpu_lock)
{
unsigned int flags;

if (cpu_lock->nested) {
cpu_lock->nested--;
return;
}

/* We must be the first lock owner */
flags = cpu_lock->flags;
atomic_set_release(&cpu_lock->owner, -1);
local_irq_restore(flags);
}

Or do I miss anything?

Best Regards,
Petr

2019-02-14 23:08:33

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 02/25] printk-rb: add prb locking functions

On 2019-02-14, Petr Mladek <[email protected]> wrote:
>>> cpu_store looks like an implementation detail. The caller
>>> needs to remember it to handle the nesting properly.
>>>
>>> We could achieve the same with a recursion counter hidden
>>> in struct prb_lock.
>
> The atomic operations are tricky. I feel rather lost in them.
> Well, I still think that it might be easier to detect nesting
> on the same CPU, see below.
>
> Also there is no need to store irq flags in a per-CPU variable.
> Only the first owner of the lock needs to store the flags. The others
> are spinning or nested.
>
> struct prb_cpulock {
> atomic_t owner;
> unsigned int flags;
> int nested; /* initialized to 0 */
> };
>
> void prb_lock(struct prb_cpulock *cpu_lock)
> {
> unsigned int flags;
> int cpu;

I added an explicit preempt_disable here:

cpu = get_cpu();

> /*
> * The next condition might be valid only when
> * we are nested on the same CPU. It means
> * the IRQs are already disabled and no
> * memory barrier is needed.
> */
> if (atomic_read(&cpu_lock->owner) == smp_processor_id()) {
> cpu_lock->nested++;
> return;
> }
>
> /* Not nested. Take the lock */
> local_irq_save(flags);
> cpu = smp_processor_id();
>
> for (;;) {

With fixups so it builds/runs:

unsigned int prev_cpu = -1;

> if (atomic_try_cmpxchg_acquire(&cpu_lock->owner,
&prev_cpu, cpu)) {
> cpu_lock->flags = flags;
> break;
> }
>
> cpu_relax();
> }
> }
>
> void prb_unlock(struct prb_cpulock *cpu_lock)
> {
> unsigned int flags;
>
> if (cpu_lock->nested) {
> cpu_lock->nested--;

And the matching preempt_enable().

goto out;

> }
>
> /* We must be the first lock owner */
> flags = cpu_lock->flags;
> atomic_set_release(&cpu_lock->owner, -1);
> local_irq_restore(flags);

out:
put_cpu();

> }
>
> Or do I miss anything?

It looks great. I've run my stress tests on it and everything is running
well.

Thanks for simplifying this!

John Ogness

2019-02-14 23:18:01

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 03/25] printk-rb: define ring buffer struct and initializer

On Tue 2019-02-12 15:46:40, Greg Kroah-Hartman wrote:
> On Tue, Feb 12, 2019 at 03:29:41PM +0100, John Ogness wrote:
> > See Documentation/printk-ringbuffer.txt for details about the
> > initializer arguments.
>
> You can put that documentation here in the .h file and have it pulled
> out automatically into the documentation files when they are created.
> That way you always keep everything in sync properly.

Yes, please, move the documentation into the sources.

It is so much easier to get the info via editor+cscope support
than via an external text file that many people do not
know about ;-)

For example, see include/linux/livepatch.h for inspiration.

Best Regards,
Petr

2019-02-15 00:45:04

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 04/25] printk-rb: add writer interface

On Tue 2019-02-12 15:29:42, John Ogness wrote:
> Add the writer functions prb_reserve() and prb_commit(). These make
> use of processor-reentrant spin locks to limit the number of possible
> interruption scenarios for the writers.
>
> Signed-off-by: John Ogness <[email protected]>
> ---
> include/linux/printk_ringbuffer.h | 17 ++++
> lib/printk_ringbuffer.c | 172 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 189 insertions(+)
>
> diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
> index 0e6e8dd0d01e..1aec9d5666b1 100644
> --- a/include/linux/printk_ringbuffer.h
> +++ b/include/linux/printk_ringbuffer.h
> @@ -24,6 +24,18 @@ struct printk_ringbuffer {
> atomic_t ctx;
> };
>
> +struct prb_entry {
> + unsigned int size;
> + u64 seq;
> + char data[0];
> +};
> +
> +struct prb_handle {
> + struct printk_ringbuffer *rb;
> + unsigned int cpu;
> + struct prb_entry *entry;
> +};

Please add a comment explaining what these structures are for.

> #define DECLARE_STATIC_PRINTKRB_CPULOCK(name) \
> static DEFINE_PER_CPU(unsigned long, _##name##_percpu_irqflags); \
> static struct prb_cpulock name = { \
> @@ -45,6 +57,11 @@ static struct printk_ringbuffer name = { \
> .ctx = ATOMIC_INIT(0), \
> }
>
> +/* writer interface */
> +char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
> + unsigned int size);
> +void prb_commit(struct prb_handle *h);
> +
> /* utility functions */
> void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);
> void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);
> diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
> index 28958b0cf774..90c7f9a9f861 100644
> --- a/lib/printk_ringbuffer.c
> +++ b/lib/printk_ringbuffer.c
> @@ -2,6 +2,14 @@
> #include <linux/smp.h>
> #include <linux/printk_ringbuffer.h>
>
> +#define PRB_SIZE(rb) (1 << rb->size_bits)

1 -> 1L

> +#define PRB_SIZE_BITMASK(rb) (PRB_SIZE(rb) - 1)
> +#define PRB_INDEX(rb, lpos) (lpos & PRB_SIZE_BITMASK(rb))
> +#define PRB_WRAPS(rb, lpos) (lpos >> rb->size_bits)
> +#define PRB_WRAP_LPOS(rb, lpos, xtra) \
> + ((PRB_WRAPS(rb, lpos) + xtra) << rb->size_bits)

It took me quite some time to understand the WRAP macros.
The extra parameter makes it even worse.

I suggest to distinguish the two situation by the macro names.
For example:

PRB_THIS_WRAP_START_LPOS(rb, lpos)
PRB_NEXT_WRAP_START_LPOS(rb, lpos)

Also they might deserve a comment.


> +#define PRB_DATA_ALIGN sizeof(long)
> +
> static bool __prb_trylock(struct prb_cpulock *cpu_lock,
> unsigned int *cpu_store)
> {
> @@ -75,3 +83,167 @@ void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store)
>
> put_cpu();
> }
> +
> +static struct prb_entry *to_entry(struct printk_ringbuffer *rb,
> + unsigned long lpos)
> +{
> + char *buffer = rb->buffer;
> + buffer += PRB_INDEX(rb, lpos);
> + return (struct prb_entry *)buffer;
> +}
> +
> +static int calc_next(struct printk_ringbuffer *rb, unsigned long tail,
> + unsigned long lpos, int size, unsigned long *calced_next)
> +{

The function is so tricky that it deserves a comment.

Well, I am getting really lost because of the generic name
and all the parameters. For example, I wonder what "calced" stands
for.

I think that it will be much easier to follow the logic if the entire
for-cycle around calc_next() is implemented in a single function.
The function push_tail() should get called from inside this function.

> + unsigned long next_lpos;
> + int ret = 0;
> +again:
> + next_lpos = lpos + size;
> + if (next_lpos - tail > PRB_SIZE(rb))
> + return -1;

push_tail() should get called here. prb_reserve() should bail
out when the tail could not get pushed.

> +
> + if (PRB_WRAPS(rb, lpos) != PRB_WRAPS(rb, next_lpos)) {
> + lpos = PRB_WRAP_LPOS(rb, next_lpos, 0);
> + ret |= 1;

This is a strange trick. The function should either return a valid
lpos that might get reserved or an error. The error means that
prb_reserve() must fail.

> + goto again;
> + }
> +
> + *calced_next = next_lpos;
> + return ret;
> +}
> +

/* Try to remove the oldest message */
> +static bool push_tail(struct printk_ringbuffer *rb, unsigned long tail)
> +{
> + unsigned long new_tail;
> + struct prb_entry *e;
> + unsigned long head;
> +
> + if (tail != atomic_long_read(&rb->tail))
> + return true;
> +
> + e = to_entry(rb, tail);
> + if (e->size != -1)
> + new_tail = tail + e->size;
> + else
> + new_tail = PRB_WRAP_LPOS(rb, tail, 1);
> +
> + /* make sure the new tail does not overtake the head */
> + head = atomic_long_read(&rb->head);
> + if (head - new_tail > PRB_SIZE(rb))
> + return false;
> +
> + atomic_long_cmpxchg(&rb->tail, tail, new_tail);
> + return true;
> +}
> +
> +/*
> + * prb_commit: Commit a reserved entry to the ring buffer.
> + * @h: An entry handle referencing the data entry to commit.
> + *
> + * Commit data that has been reserved using prb_reserve(). Once the data
> + * block has been committed, it can be invalidated at any time. If a writer
> + * is interested in using the data after committing, the writer should make
> + * its own copy first or use the prb_iter_ reader functions to access the
> + * data in the ring buffer.
> + *
> + * It is safe to call this function from any context and state.
> + */
> +void prb_commit(struct prb_handle *h)
> +{
> + struct printk_ringbuffer *rb = h->rb;
> + struct prb_entry *e;
> + unsigned long head;
> + unsigned long res;
> +
> + for (;;) {
> + if (atomic_read(&rb->ctx) != 1) {
> + /* the interrupted context will fixup head */
> + atomic_dec(&rb->ctx);
> + break;
> + }
> + /* assign sequence numbers before moving head */
> + head = atomic_long_read(&rb->head);
> + res = atomic_long_read(&rb->reserve);
> + while (head != res) {
> + e = to_entry(rb, head);
> + if (e->size == -1) {
> + head = PRB_WRAP_LPOS(rb, head, 1);
> + continue;
> + }
> + e->seq = ++rb->seq;
> + head += e->size;
> + }
> + atomic_long_set_release(&rb->head, res);

This looks really weird. It looks like you are committing all
reserved entries between the current head and this entry.

I would expect that every prb_entry has its own flag whether
it was committed or not. This function should set this flag
for its own entry. Then it should move the head to the
first uncommitted entry.

It will be racy because more CPUs might commit their
own entries in parallel and they might miss each other's
commit flags.

A solution might be to implement prb_push_head() that will
do the safe thing. Then we could call it here, from push_tail()
and also from readers. I am still not sure if it will be
race-free but it looks promising.

> + atomic_dec(&rb->ctx);

With the above approach you will not need rb->ctx. It is
racy anyway, see below.

> +
> + if (atomic_long_read(&rb->reserve) == res)
> + break;
> + atomic_inc(&rb->ctx);
> + }
> +
> + prb_unlock(rb->cpulock, h->cpu);
> +}
> +
> +/*
> + * prb_reserve: Reserve an entry within a ring buffer.
> + * @h: An entry handle to be setup and reference an entry.
> + * @rb: A ring buffer to reserve data within.
> + * @size: The number of bytes to reserve.
> + *
> + * Reserve an entry of at least @size bytes to be used by the caller. If
> + * successful, the data region of the entry belongs to the caller and cannot
> + * be invalidated by any other task/context. For this reason, the caller
> + * should call prb_commit() as quickly as possible in order to avoid preventing
> + * other tasks/contexts from reserving data in the case that the ring buffer
> + * has wrapped.
> + *
> + * It is safe to call this function from any context and state.
> + *
> + * Returns a pointer to the reserved entry (and @h is setup to reference that
> + * entry) or NULL if it was not possible to reserve data.
> + */
> +char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
> + unsigned int size)
> +{
> + unsigned long tail, res1, res2;

Please, better distinguish res1 and res2, e.g. old_res, new_res.

> + int ret;
> +
> + if (size == 0)
> + return NULL;
> + size += sizeof(struct prb_entry);
> + size += PRB_DATA_ALIGN - 1;
> + size &= ~(PRB_DATA_ALIGN - 1);

The above two lines should get hidden into PRB_ALIGN_SIZE() or so.
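
For example (a hypothetical name, untested):

#define PRB_ALIGN_SIZE(sz) \
	(((sz) + PRB_DATA_ALIGN - 1) & ~(PRB_DATA_ALIGN - 1))

Then prb_reserve() could simply do:

	size = PRB_ALIGN_SIZE(size + sizeof(struct prb_entry));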

> + if (size >= PRB_SIZE(rb))
> + return NULL;
> +
> + h->rb = rb;
> + prb_lock(rb->cpulock, &h->cpu);
> +
> + atomic_inc(&rb->ctx);

This looks racy. NMI could come between prb_lock() and this atomic_inc().


> + do {
> + for (;;) {
> + tail = atomic_long_read(&rb->tail);
> + res1 = atomic_long_read(&rb->reserve);
> + ret = calc_next(rb, tail, res1, size, &res2);
> + if (ret >= 0)
> + break;
> + if (!push_tail(rb, tail)) {
> + prb_commit(h);

I am a bit confused. Is it committing a handle that hasn't
been reserved yet? Why, please?

> + return NULL;
> + }
> + }

Please, try to refactor the above as commented in calc_next().

> + } while (!atomic_long_try_cmpxchg_acquire(&rb->reserve, &res1, res2));
> +
> + h->entry = to_entry(rb, res1);
> +
> + if (ret) {
> + /* handle wrap */

/* Write wrapping entry that is part of our reservation. */

> + h->entry->size = -1;
> + h->entry = to_entry(rb, PRB_WRAP_LPOS(rb, res2, 0));
> + }
> +
> + h->entry->size = size;
> +
> + return &h->entry->data[0];
> +}

Best Regards,
Petr

2019-02-15 02:23:34

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 04/25] printk-rb: add writer interface

On 2019-02-14, Petr Mladek <[email protected]> wrote:
>> Add the writer functions prb_reserve() and prb_commit(). These make
>> use of processor-reentrant spin locks to limit the number of possible
>> interruption scenarios for the writers.
>>
>> Signed-off-by: John Ogness <[email protected]>
>> ---
>> include/linux/printk_ringbuffer.h | 17 ++++
>> lib/printk_ringbuffer.c | 172 ++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 189 insertions(+)
>>
>> diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
>> index 0e6e8dd0d01e..1aec9d5666b1 100644
>> --- a/include/linux/printk_ringbuffer.h
>> +++ b/include/linux/printk_ringbuffer.h
>> @@ -24,6 +24,18 @@ struct printk_ringbuffer {
>> atomic_t ctx;
>> };
>>
>> +struct prb_entry {
>> + unsigned int size;
>> + u64 seq;
>> + char data[0];
>> +};
>> +
>> +struct prb_handle {
>> + struct printk_ringbuffer *rb;
>> + unsigned int cpu;
>> + struct prb_entry *entry;
>> +};
>
> Please, add a comment explaining what these structures are for.

OK.

>> #define DECLARE_STATIC_PRINTKRB_CPULOCK(name) \
>> static DEFINE_PER_CPU(unsigned long, _##name##_percpu_irqflags); \
>> static struct prb_cpulock name = { \
>> @@ -45,6 +57,11 @@ static struct printk_ringbuffer name = { \
>> .ctx = ATOMIC_INIT(0), \
>> }
>>
>> +/* writer interface */
>> +char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
>> + unsigned int size);
>> +void prb_commit(struct prb_handle *h);
>> +
>> /* utility functions */
>> void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);
>> void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);
>> diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
>> index 28958b0cf774..90c7f9a9f861 100644
>> --- a/lib/printk_ringbuffer.c
>> +++ b/lib/printk_ringbuffer.c
>> @@ -2,6 +2,14 @@
>> #include <linux/smp.h>
>> #include <linux/printk_ringbuffer.h>
>>
>> +#define PRB_SIZE(rb) (1 << rb->size_bits)
>
> 1 -> 1L

OK.

>> +#define PRB_SIZE_BITMASK(rb) (PRB_SIZE(rb) - 1)
>> +#define PRB_INDEX(rb, lpos) (lpos & PRB_SIZE_BITMASK(rb))
>> +#define PRB_WRAPS(rb, lpos) (lpos >> rb->size_bits)
>> +#define PRB_WRAP_LPOS(rb, lpos, xtra) \
>> + ((PRB_WRAPS(rb, lpos) + xtra) << rb->size_bits)
>
> It took me quite some time to understand the WRAP macros.
> The extra parameter makes it even worse.
>
> I suggest to distinguish the two situations by the macro names.
> For example:
>
> PRB_THIS_WRAP_START_LPOS(rb, lpos)
> PRB_NEXT_WRAP_START_LPOS(rb, lpos)

OK.

> Also they might deserve a comment.

Agreed.

>> +#define PRB_DATA_ALIGN sizeof(long)
>> +
>> static bool __prb_trylock(struct prb_cpulock *cpu_lock,
>> unsigned int *cpu_store)
>> {
>> @@ -75,3 +83,167 @@ void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store)
>>
>> put_cpu();
>> }
>> +
>> +static struct prb_entry *to_entry(struct printk_ringbuffer *rb,
>> + unsigned long lpos)
>> +{
>> + char *buffer = rb->buffer;
>> + buffer += PRB_INDEX(rb, lpos);
>> + return (struct prb_entry *)buffer;
>> +}
>> +
>> +static int calc_next(struct printk_ringbuffer *rb, unsigned long tail,
>> + unsigned long lpos, int size, unsigned long *calced_next)
>> +{
>
> The function is so tricky that it deserves a comment.
>
> Well, I am getting really lost because of the generic name
> and all the parameters. For example, I wonder what "calced" stands
> for.

"calced" is short for "calculated". Maybe "lpos_next" would be a good
name?

The function is only doing this: Given a reserve position and size to
reserve, calculate what the next reserve position would be.

It might seem complicated because it also detects/reports the special
cases where the tail would be overwritten (returns -1) or where the
ringbuffer wraps while performing the reserve (returns 1).

> I think that it will be much easier to follow the logic if the entire
> for-loop around calc_next() is implemented in a single function.

calc_next() is already sitting in 2 nested loops. And calc_next()
performs manual tail-recursion using a goto. I doubt it becomes easier
to follow when calc_next is inlined in prb_reserve().

> The function push_tail() should get called from inside this function.

I disagree. calc_next's job isn't to make any changes. It only
calculates what should be done. Outer cmpxchg loops take on the
responsibility for making the change. push_tail() is not trivial because
it must deal with the situation where it fails. Below you specifically ask
about this, so I'll go deeper into push_tail() there.

>> + unsigned long next_lpos;
>> + int ret = 0;
>> +again:
>> + next_lpos = lpos + size;
>> + if (next_lpos - tail > PRB_SIZE(rb))
>> + return -1;
>
> push_tail() should get called here. prb_reserve() should bail
> out when the tail could not get pushed.

prb_reserve() does bail out if push_tail() fails. But I think it makes
more sense that it is clearly visible from prb_reserve() and not hidden
as a side-effect of calc_next(). And as I mentioned, I think things
become much more difficult to follow if calc_next is inlined in
prb_reserve().

>> +
>> + if (PRB_WRAPS(rb, lpos) != PRB_WRAPS(rb, next_lpos)) {
>> + lpos = PRB_WRAP_LPOS(rb, next_lpos, 0);
>> + ret |= 1;
>
> This is a strange trick. The function should either return a valid
> lpos that might get reserved or an error. The error means that
> prb_reserve() must fail.

Again, calc_next() does not _do_ anything. It only calculates what needs
to be done. By returning 1, it is saying, "here is the next lpos value
and by the way, it is wrapped around". The caller could see this for
itself by comparing the PRB_WRAPS of lpos and lpos_next, but since
calc_next() already had this information I figured I might as well save
some CPU cycles and inform the caller.

The calc_next() caller is the one that will need to _do_ something about
this. In the case of a wrap, the caller will need to create the
terminating entry and provide the writer with the buffer at the
beginning of the data array.

>> + goto again;
>> + }
>> +
>> + *calced_next = next_lpos;
>> + return ret;
>> +}
>> +
>
> /* Try to remove the oldest message */

That is the kind of comment that I usually get in trouble for (saying
the obvious). But I have no problems adding it.

>> +static bool push_tail(struct printk_ringbuffer *rb, unsigned long tail)
>> +{
>> + unsigned long new_tail;
>> + struct prb_entry *e;
>> + unsigned long head;
>> +
>> + if (tail != atomic_long_read(&rb->tail))
>> + return true;
>> +
>> + e = to_entry(rb, tail);
>> + if (e->size != -1)
>> + new_tail = tail + e->size;
>> + else
>> + new_tail = PRB_WRAP_LPOS(rb, tail, 1);
>> +
>> + /* make sure the new tail does not overtake the head */
>> + head = atomic_long_read(&rb->head);
>> + if (head - new_tail > PRB_SIZE(rb))
>> + return false;
>> +
>> + atomic_long_cmpxchg(&rb->tail, tail, new_tail);
>> + return true;
>> +}
>> +
>> +/*
>> + * prb_commit: Commit a reserved entry to the ring buffer.
>> + * @h: An entry handle referencing the data entry to commit.
>> + *
>> + * Commit data that has been reserved using prb_reserve(). Once the data
>> + * block has been committed, it can be invalidated at any time. If a writer
>> + * is interested in using the data after committing, the writer should make
>> + * its own copy first or use the prb_iter_ reader functions to access the
>> + * data in the ring buffer.
>> + *
>> + * It is safe to call this function from any context and state.
>> + */
>> +void prb_commit(struct prb_handle *h)
>> +{
>> + struct printk_ringbuffer *rb = h->rb;
>> + struct prb_entry *e;
>> + unsigned long head;
>> + unsigned long res;
>> +
>> + for (;;) {
>> + if (atomic_read(&rb->ctx) != 1) {
>> + /* the interrupted context will fixup head */
>> + atomic_dec(&rb->ctx);
>> + break;
>> + }
>> + /* assign sequence numbers before moving head */
>> + head = atomic_long_read(&rb->head);
>> + res = atomic_long_read(&rb->reserve);
>> + while (head != res) {
>> + e = to_entry(rb, head);
>> + if (e->size == -1) {
>> + head = PRB_WRAP_LPOS(rb, head, 1);
>> + continue;
>> + }
>> + e->seq = ++rb->seq;
>> + head += e->size;
>> + }
>> + atomic_long_set_release(&rb->head, res);
>
> This looks really weird. It looks like you are committing all
> reserved entries between the current head and this entry.

I am.

> I would expect that every prb_entry has its own flag whether
> it was committed or not. This function should set this flag
> for its own entry. Then it should move the head to the
> first uncommitted entry.

How could there be a reserved but uncommitted entry before this one? Or
after this one? The reserve/commit window is under the prb_cpulock. No
other CPU can be involved. If an NMI occurred anywhere here and it did a
reserve, it already did the matching commit.

> It will be racy because more CPUs might commit their
> own entries in parallel and they might miss each other's
> commit flags.

No other CPUs here.

> A solution might be to implement prb_push_head() that will
> do the safe thing. Then we could call it here, from push_tail()
> and also from readers. I am still not sure if it will be
> race-free but it looks promising.

Yes, this all can be implemented lockless, but it is considerably more
complex. Let's not think about that unless we decide we need it.

>> + atomic_dec(&rb->ctx);
>
> With the above approach you will not need rb->ctx. It is
> racy anyway, see below.

See my comments below.

>> +
>> + if (atomic_long_read(&rb->reserve) == res)
>> + break;
>> + atomic_inc(&rb->ctx);
>> + }
>> +
>> + prb_unlock(rb->cpulock, h->cpu);
>> +}
>> +
>> +/*
>> + * prb_reserve: Reserve an entry within a ring buffer.
>> + * @h: An entry handle to be setup and reference an entry.
>> + * @rb: A ring buffer to reserve data within.
>> + * @size: The number of bytes to reserve.
>> + *
>> + * Reserve an entry of at least @size bytes to be used by the caller. If
>> + * successful, the data region of the entry belongs to the caller and cannot
>> + * be invalidated by any other task/context. For this reason, the caller
>> + * should call prb_commit() as quickly as possible in order to avoid preventing
>> + * other tasks/contexts from reserving data in the case that the ring buffer
>> + * has wrapped.
>> + *
>> + * It is safe to call this function from any context and state.
>> + *
>> + * Returns a pointer to the reserved entry (and @h is setup to reference that
>> + * entry) or NULL if it was not possible to reserve data.
>> + */
>> +char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
>> + unsigned int size)
>> +{
>> + unsigned long tail, res1, res2;
>
> Please, better distinguish res1 and res2, e.g. old_res, new_res.

OK.

>> + int ret;
>> +
>> + if (size == 0)
>> + return NULL;
>> + size += sizeof(struct prb_entry);
>> + size += PRB_DATA_ALIGN - 1;
>> + size &= ~(PRB_DATA_ALIGN - 1);
>
> The above two lines should get hidden into PRB_ALIGN_SIZE() or so.

OK.

>> + if (size >= PRB_SIZE(rb))
>> + return NULL;
>> +
>> + h->rb = rb;
>> + prb_lock(rb->cpulock, &h->cpu);
>> +
>> + atomic_inc(&rb->ctx);
>
> This looks racy. NMI could come between prb_lock() and this
> atomic_inc().

It wouldn't matter. I haven't done anything before the inc so NMIs can
come in and do as much reserve/committing as they want.

>> + do {
>> + for (;;) {
>> + tail = atomic_long_read(&rb->tail);
>> + res1 = atomic_long_read(&rb->reserve);
>> + ret = calc_next(rb, tail, res1, size, &res2);
>> + if (ret >= 0)
>> + break;
>> + if (!push_tail(rb, tail)) {
>> + prb_commit(h);
>
> I am a bit confused. Is it committing a handle that hasn't
> been reserved yet? Why, please?

If ctx is 1 we have the special responsibility of moving the head past
all the entries that interrupting NMIs have reserve/committed. (See the
check for "ctx != 1" in prb_commit().) The NMIs are already gone so we
are the only one that can do this.

Here we are in prb_reserve() and have already incremented ctx. We might
be the "ctx == 1" task and NMIs may have reserve/committed entries after
we incremented ctx, which means that they did not push the head. If we
now bail out because we couldn't push the tail, we still are obligated
to push the head if we are "ctx == 1".

prb_commit() does not actually care what is in the handle. It is going
to commit everything up to the reserve. The fact that I pass it a handle
is because that is what the function expects. I suppose I could create a
_prb_commit() that takes only a ringbuffer argument and prb_commit()
simply calls _prb_commit(h->rb). Then the bailout would be:

_prb_commit(rb);

Or maybe it should have a more descriptive name:

_prb_commit_all_reserved(rb);
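
The split would then look roughly like this (sketch only, names
not final):

static void _prb_commit_all_reserved(struct printk_ringbuffer *rb)
{
	/* the current prb_commit() implementation: move the head
	 * pointer past all reserved entries and unlock */
}

void prb_commit(struct prb_handle *h)
{
	_prb_commit_all_reserved(h->rb);
}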

>> + return NULL;
>> + }
>> + }
>
> Please, try to refactor the above as commented in calc_next().

I'll play with it and see how it looks.

>> + } while (!atomic_long_try_cmpxchg_acquire(&rb->reserve, &res1, res2));
>> +
>> + h->entry = to_entry(rb, res1);
>> +
>> + if (ret) {
>> + /* handle wrap */
>
> /* Write wrapping entry that is part of our reservation. */

OK.

>> + h->entry->size = -1;
>> + h->entry = to_entry(rb, PRB_WRAP_LPOS(rb, res2, 0));
>> + }
>> +
>> + h->entry->size = size;
>> +
>> + return &h->entry->data[0];
>> +}

Thank you for your comments.

John Ogness

2019-02-15 02:27:47

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 04/25] printk-rb: add writer interface

On 2019-02-15, John Ogness <[email protected]> wrote:
> prb_commit() does not actually care what is in the handle. It is going
> to commit everything up to the reserve.

After thinking about what I wrote here, I realized that the struct
prb_handle has no purpose in this ringbuffer implementation. We really
could simplify the writer interface to:

char *prb_reserve(struct printk_ringbuffer *rb, unsigned int size);

void prb_commit(struct printk_ringbuffer *rb);

That probably feels really strange because the writer doesn't specify
_what_ to commit. But this ringbuffer implementation doesn't need to
know that.
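
A writer would then look something like this (just a sketch):

	char *buf;

	buf = prb_reserve(&printk_rb, len);
	if (buf) {
		memcpy(buf, text, len);
		prb_commit(&printk_rb);
	}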

The only reason I can think of for having a handle is if there should be
any statistics, debugging, or sanity checking added. (For example if a
writer tried to commit something it did not reserve.)

John Ogness

2019-02-15 16:08:07

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 02/25] printk-rb: add prb locking functions

On Thu 2019-02-14 13:10:28, John Ogness wrote:
> On 2019-02-14, Petr Mladek <[email protected]> wrote:
> >>> cpu_store looks like an implementation detail. The caller
> >>> needs to remember it to handle the nesting properly.
> >>>
> >>> We could achieve the same with a recursion counter hidden
> >>> in struct prb_lock.
> >
> > The atomic operations are tricky. I feel rather lost in them.
> > Well, I still think that it might be easier to detect nesting
> > on the same CPU, see below.
> >
> > Also there is no need to store irq flags in a per-CPU variable.
> > Only the first owner of the lock needs to store the flags. The others
> > are spinning or nested.
> >
> > struct prb_cpulock {
> > atomic_t owner;
> > unsigned int flags;
> > int nesting; /* initialized to 0 */
> > };
> >
> > void prb_lock(struct prb_cpulock *cpu_lock)
> > {
> > unsigned int flags;
> > int cpu;
>
> I added an explicit preempt_disable here:
>
> cpu = get_cpu();

It is superfluous. Preemption is not possible when interrupts
are disabled.


> It looks great. I've run my stress tests on it and everything is running
> well.

I am glad to read this.

> Thanks for simplifying this!

You are welcome.

Best Regards,
Petr

2019-02-15 16:09:47

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 02/25] printk-rb: add prb locking functions

On 2019-02-15, Petr Mladek <[email protected]> wrote:
>>> void prb_lock(struct prb_cpulock *cpu_lock)
>>> {
>>> unsigned int flags;
>>> int cpu;
>>
>> I added an explicit preempt_disable here:
>>
>> cpu = get_cpu();
>
> It is superfluous. Preemption is not possible when interrupts
> are disabled.

Interrupts are not necessarily disabled here. They get disabled later if
the lock needs to be taken (i.e. we are not nested).

John Ogness

2019-02-15 16:17:45

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 04/25] printk-rb: add writer interface

On Fri 2019-02-15 00:36:49, John Ogness wrote:
> On 2019-02-14, Petr Mladek <[email protected]> wrote:
> >> Add the writer functions prb_reserve() and prb_commit(). These make
> >> use of processor-reentrant spin locks to limit the number of possible
> >> interruption scenarios for the writers.
> >>
> >> --- a/lib/printk_ringbuffer.c
> >> +++ b/lib/printk_ringbuffer.c
> >> static bool __prb_trylock(struct prb_cpulock *cpu_lock,
> >> unsigned int *cpu_store)
> >> {
> >> @@ -75,3 +83,167 @@ void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store)
> >>
> >> put_cpu();
> >> }
> >> +
> >> +static struct prb_entry *to_entry(struct printk_ringbuffer *rb,
> >> + unsigned long lpos)
> >> +{
> >> + char *buffer = rb->buffer;
> >> + buffer += PRB_INDEX(rb, lpos);
> >> + return (struct prb_entry *)buffer;
> >> +}
> >> +
> >> +static int calc_next(struct printk_ringbuffer *rb, unsigned long tail,
> >> + unsigned long lpos, int size, unsigned long *calced_next)
> >> +{
> >
> > The function is so tricky that it deserves a comment.
> >
> > Well, I am getting really lost because of the generic name
> > and all the parameters. For example, I wonder what "calced" stands
> > for.
>
> "calced" is short for "calculated". Maybe "lpos_next" would be a good
> name?

Yes. It would help if the name were the same as the variable passed
from prb_reserve().


> The function is only doing this: Given a reserve position and size to
> reserve, calculate what the next reserve position would be.
>
> It might seem complicated because it also detects/reports the special
> cases where the tail would be overwritten (returns -1) or where the
> ringbuffer wraps while performing the reserve (returns 1).

> > I think that it will be much easier to follow the logic if the entire
> > for-loop around calc_next() is implemented in a single function.
>
> calc_next() is already sitting in 2 nested loops. And calc_next()
> performs manual tail-recursion using a goto. I doubt it becomes easier
> to follow when calc_next is inlined in prb_reserve().

Of course, it is possible that it will be worse. But we need to do
something to make it better. The current interaction between calc_next(),
push_tail(), and prb_reserve() using the -1/0/1 values and many parameters
is really hard to follow.


> > The function push_tail() should get called from inside this function.
>
> I disagree. calc_next's job isn't to make any changes.

It is only about the name. If you rename it to prb_get_next_reserve()
then it will be allowed to push the tail ;-) I believe that it will
simplify the code. I might be wrong but please try it.

[...]

> >
> > /* Try to remove the oldest message */
>
> That is the kind of comment that I usually get in trouble for (saying
> the obvious). But I have no problems adding it.

It is obvious to you. But it helps people who see the code for
the first time to understand the meaning. Especially
when they need to get familiar with the tail/head/reserve
naming scheme. Note that the current printk buffer uses
first/next names.

> >> +static bool push_tail(struct printk_ringbuffer *rb, unsigned long tail)

[...]

> >> +/*
> >> + * prb_commit: Commit a reserved entry to the ring buffer.
> >> + * @h: An entry handle referencing the data entry to commit.
> >> + *
> >> + * Commit data that has been reserved using prb_reserve(). Once the data
> >> + * block has been committed, it can be invalidated at any time. If a writer
> >> + * is interested in using the data after committing, the writer should make
> >> + * its own copy first or use the prb_iter_ reader functions to access the
> >> + * data in the ring buffer.
> >> + *
> >> + * It is safe to call this function from any context and state.
> >> + */
> >> +void prb_commit(struct prb_handle *h)
> >> +{
> >> + struct printk_ringbuffer *rb = h->rb;
> >> + struct prb_entry *e;
> >> + unsigned long head;
> >> + unsigned long res;
> >> +
> >> + for (;;) {
> >> + if (atomic_read(&rb->ctx) != 1) {
> >> + /* the interrupted context will fixup head */
> >> + atomic_dec(&rb->ctx);
> >> + break;
> >> + }
> >> + /* assign sequence numbers before moving head */
> >> + head = atomic_long_read(&rb->head);
> >> + res = atomic_long_read(&rb->reserve);
> >> + while (head != res) {
> >> + e = to_entry(rb, head);
> >> + if (e->size == -1) {
> >> + head = PRB_WRAP_LPOS(rb, head, 1);
> >> + continue;
> >> + }
> >> + e->seq = ++rb->seq;
> >> + head += e->size;
> >> + }
> >> + atomic_long_set_release(&rb->head, res);
> >
> > This looks really weird. It looks like you are committing all
> > reserved entries between the current head and this entry.
>
> I am.
>
> > I would expect that every prb_entry has its own flag whether
> > it was committed or not. This function should set this flag
> > for its own entry. Then it should move the head to the
> > first uncommitted entry.
>
> How could there be a reserved but uncommitted entry before this one? Or
> after this one? The reserve/commit window is under the prb_cpulock. No
> other CPU can be involved. If an NMI occurred anywhere here and it did a
> reserve, it already did the matching commit.

Heh, I missed this. But then all the reserve/commit complexity is
used just to handle the race with NMIs. Other contexts do both actions
atomically.

Hmm, the prb_reserve() code looks almost ready to be used
lockless. And once you have a reservation, it looks natural
to leave the lock and fill the data lockless.

I understand that your approach solves some problems, especially
with the commit. Also I know that fixing all races often
makes the code much more complicated than one expects.

OK, I am going to continue with the review and will think about it.
Some complexity is surely needed also because of the readers.
I have to get familiar with them as well.


> >> +char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
> >> + unsigned int size)
> >> +{

[...]

> >> + do {
> >> + for (;;) {
> >> + tail = atomic_long_read(&rb->tail);
> >> + res1 = atomic_long_read(&rb->reserve);
> >> + ret = calc_next(rb, tail, res1, size, &res2);
> >> + if (ret >= 0)
> >> + break;
> >> + if (!push_tail(rb, tail)) {
> >> + prb_commit(h);
> >
> > I am a bit confused. Is it committing a handle that hasn't
> > been reserved yet? Why, please?
>
> If ctx is 1 we have the special responsibility of moving the head past
> all the entries that interrupting NMIs have reserve/committed. (See the
> check for "ctx != 1" in prb_commit().) The NMIs are already gone so we
> are the only one that can do this.
>
> Here we are in prb_reserve() and have already incremented ctx. We might
> be the "ctx == 1" task and NMIs may have reserve/committed entries after
> we incremented ctx, which means that they did not push the head. If we
> now bail out because we couldn't push the tail, we still are obligated
> to push the head if we are "ctx == 1".
>
> prb_commit() does not actually care what is in the handle. It is going
> to commit everything up to the reserve. The fact that I pass it a handle
> is because that is what the function expects. I suppose I could create a
> _prb_commit() that takes only a ringbuffer argument and prb_commit()
> simply calls _prb_commit(h->rb). Then the bailout would be:

This is tricky like hell. Please add more comments in your code.
For example, see rb_remove_pages() or rb_tail_page_update().
Even trivial operations are commented there:

+ describe complex interactions between various variables,
flags, etc.

+ the author spent non-trivial time to realize that
the operation has to be done exactly there

+ some non-trivial computing

+ some corner case is handled

+ the operation has some non-obvious side effects or
prerequisites


> Or maybe it should have a more descriptive name:
>
> _prb_commit_all_reserved(rb);

This would have helped. But it would still deserve a comment
explaining why it is called there.


Best Regards,
Petr

2019-02-17 08:11:04

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 04/25] printk-rb: add writer interface

Hi Petr,

I've made changes to the patch that hopefully align with what you are
looking for. I would appreciate it if you could go over it and see if
the changes are in the right direction. And if so, you should decide
whether I should make these kinds of changes for the whole series and
submit a v2 before you continue with the review.

The list of changes:

- Added comments everywhere I think they could be useful. Is it too
much?

- Renamed struct prb_handle to prb_reserved_entry (more appropriate).

- Fixed up macros as you requested.

- The implementation from prb_commit() has been moved to a new
prb_commit_all_reserved(). This should resolve the confusion in the
"failed to push_tail()" code.

- I tried moving calc_next() into prb_reserve(), but it was pure
insanity. I played with refactoring for a while until I found
something that I think looks nice. I moved the implementation of
calc_next() along with its containing loop into a new function
find_res_ptrs(). This function does what calc_next() and push_tail()
did. With this solution, I think prb_reserve() looks pretty
clean. However, the optimization of communicating about the wrap is
gone. So even though find_res_ptrs() knew if a wrap occurred,
prb_reserve() figures it out again for itself. If we want the
optimization, I still think the best approach is the -1,0,1 return
value of find_res_ptrs().

I'm looking forward to your response.

John Ogness


diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
index 4239dc86e029..ab6177c9fe0a 100644
--- a/include/linux/printk_ringbuffer.h
+++ b/include/linux/printk_ringbuffer.h
@@ -25,6 +25,23 @@ struct printk_ringbuffer {
atomic_t ctx;
};

+/*
+ * struct prb_reserved_entry: Reserved but not yet committed entry.
+ * @rb: The printk_ringbuffer where the entry was reserved.
+ *
+ * This is a handle used by the writer to represent an entry that has been
+ * reserved but not yet committed.
+ *
+ * The structure does not actually store any information about the entry that
+ * has been reserved because this information is not required by the
+ * implementation. The struct could prove useful if extra tracking or even
+ * fundamental changes to the ringbuffer were to be implemented. Such changes
+ * would then not require changes to writers.
+ */
+struct prb_reserved_entry {
+ struct printk_ringbuffer *rb;
+};
+
#define DECLARE_STATIC_PRINTKRB_CPULOCK(name) \
static struct prb_cpulock name = { \
.owner = ATOMIC_INIT(-1), \
@@ -46,6 +63,11 @@ static struct printk_ringbuffer name = { \
.ctx = ATOMIC_INIT(0), \
}

+/* writer interface */
+char *prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
+ unsigned int size);
+void prb_commit(struct prb_reserved_entry *e);
+
/* utility functions */
void prb_lock(struct prb_cpulock *cpu_lock);
void prb_unlock(struct prb_cpulock *cpu_lock);
diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
index 54c750092810..fbe1d92b9b60 100644
--- a/lib/printk_ringbuffer.c
+++ b/lib/printk_ringbuffer.c
@@ -2,6 +2,59 @@
#include <linux/smp.h>
#include <linux/printk_ringbuffer.h>

+/*
+ * struct prb_entry: An entry within the ringbuffer.
+ * @size: The size in bytes of the entry or -1 if terminating.
+ * @seq: The unique sequence number of the entry.
+ * @data: The data bytes of the entry.
+ *
+ * The struct is typecasted directly into the ringbuffer data array to access
+ * an entry. The @size specifies the complete size of the entry including any
+ * padding. The next entry will be located at &this_entry + this_entry.size.
+ * The only exception is if the entry is terminating (size = -1). In this case
+ * @seq and @data are invalid and the next entry is at the beginning of the
+ * ringbuffer data array.
+ */
+struct prb_entry {
+ unsigned int size;
+ u64 seq;
+ char data[0];
+};
+
+/* the size and size bitmask of the ringbuffer data array */
+#define PRB_SIZE(rb) (1L << rb->size_bits)
+#define PRB_SIZE_BITMASK(rb) (PRB_SIZE(rb) - 1)
+
+/* given a logical position, return its index in the ringbuffer data array */
+#define PRB_INDEX(rb, lpos) (lpos & PRB_SIZE_BITMASK(rb))
+
+/*
+ * given a logical position, return how many times the data buffer has
+ * wrapped, where logical position 0 begins at index 0 with no wraps
+ */
+#define PRB_WRAPS(rb, lpos) (lpos >> rb->size_bits)
+
+/*
+ * given a logical position, return the logical position that represents the
+ * beginning of the ringbuffer data array for this wrap
+ */
+#define PRB_THIS_WRAP_START_LPOS(rb, lpos) \
+ (PRB_WRAPS(rb, lpos) << rb->size_bits)
+
+/*
+ * given a logical position, return the logical position that represents the
+ * beginning of the ringbuffer data array for the next wrap
+ */
+#define PRB_NEXT_WRAP_START_LPOS(rb, lpos) \
+ ((PRB_WRAPS(rb, lpos) + 1) << rb->size_bits)
+
+/*
+ * entries are aligned to allow direct typecasts as struct prb_entry
+ */
+#define PRB_ENTRY_ALIGN sizeof(long)
+#define PRB_ENTRY_ALIGN_SIZE(sz) \
+ ((sz + (PRB_ENTRY_ALIGN - 1)) & ~(PRB_ENTRY_ALIGN - 1))
+
/*
* prb_lock: Perform a processor-reentrant spin lock.
* @cpu_lock: A pointer to the lock object.
@@ -58,3 +111,257 @@ void prb_unlock(struct prb_cpulock *cpu_lock)
}
put_cpu();
}
+
+/* translate a logical position to an entry in the data array */
+static struct prb_entry *to_entry(struct printk_ringbuffer *rb,
+ unsigned long lpos)
+{
+ char *buffer = rb->buffer;
+
+ buffer += PRB_INDEX(rb, lpos);
+ return (struct prb_entry *)buffer;
+}
+
+/* try to move the tail pointer forward, thus removing the oldest entry */
+static bool push_tail(struct printk_ringbuffer *rb, unsigned long tail)
+{
+ unsigned long new_tail, head;
+ struct prb_entry *e;
+
+ /* maybe another context already pushed the tail */
+ if (tail != atomic_long_read(&rb->tail))
+ return true;
+
+ /*
+ * Determine what the new tail should be. If the tail is a
+ * terminating entry, the new tail will be beyond the entry
+ * at the beginning of the data array.
+ */
+ e = to_entry(rb, tail);
+ if (e->size != -1)
+ new_tail = tail + e->size;
+ else
+ new_tail = PRB_NEXT_WRAP_START_LPOS(rb, tail);
+
+ /* make sure the new tail does not overtake the head */
+ head = atomic_long_read(&rb->head);
+ if (head - new_tail > PRB_SIZE(rb))
+ return false;
+
+ /*
+ * The result of this cmpxchg does not matter. If it succeeds,
+ * this context pushed the tail. If it fails, some other context
+ * pushed the tail. Either way, the tail was pushed.
+ */
+ atomic_long_cmpxchg(&rb->tail, tail, new_tail);
+ return true;
+}
+
+/*
+ * If this context incremented rb->ctx to 1, move the head pointer
+ * beyond all reserved entries.
+ */
+static void prb_commit_all_reserved(struct printk_ringbuffer *rb)
+{
+ unsigned long head, res;
+ struct prb_entry *e;
+
+ for (;;) {
+ if (atomic_read(&rb->ctx) > 1) {
+ /* another context will adjust the head pointer */
+ atomic_dec(&rb->ctx);
+ break;
+ }
+
+ /*
+ * This is the only context that will adjust the head pointer.
+ * If NMIs interrupt at any time, they can reserve/commit new
+ * entries, but they will not adjust the head pointer.
+ */
+
+ /* assign sequence numbers before moving the head pointer */
+ head = atomic_long_read(&rb->head);
+ res = atomic_long_read(&rb->reserve);
+ while (head != res) {
+ e = to_entry(rb, head);
+ if (e->size == -1) {
+ head = PRB_NEXT_WRAP_START_LPOS(rb, head);
+ continue;
+ }
+ e->seq = ++rb->seq;
+ head += e->size;
+ }
+
+ /*
+ * move the head pointer, thus making all reserved entries
+ * visible to any readers
+ */
+ atomic_long_set_release(&rb->head, res);
+
+ atomic_dec(&rb->ctx);
+ if (atomic_long_read(&rb->reserve) == res)
+ break;
+ /*
+ * The reserve pointer is different than previously read. New
+ * entries were reserve/committed by NMI contexts, possibly
+ * before ctx was decremented by this context. Go back and move
+ * the head pointer beyond those entries as well.
+ */
+ atomic_inc(&rb->ctx);
+ }
+
+ /* Enable interrupts and allow other CPUs to reserve/commit. */
+ prb_unlock(rb->cpulock);
+}
+
+/*
+ * prb_commit: Commit a reserved entry to the ring buffer.
+ * @e: A structure referencing the reserved entry to commit.
+ *
+ * Commit data that has been reserved using prb_reserve(). Once the entry
+ * has been committed, it can be invalidated at any time. If a writer is
+ * interested in using the data after committing, the writer should make
+ * its own copy first or use the prb_iter_ reader functions to access the
+ * data in the ring buffer.
+ *
+ * It is safe to call this function from any context and state.
+ */
+void prb_commit(struct prb_reserved_entry *e)
+{
+ prb_commit_all_reserved(e->rb);
+}
+
+/* given the size to reserve, determine current and next reserve pointers */
+static bool find_res_ptrs(struct printk_ringbuffer *rb, unsigned long *res_old,
+ unsigned long *res_new, unsigned int size)
+{
+ unsigned long tail, entry_begin;
+
+ /*
+ * The reserve pointer is not allowed to overtake the index of the
+ * tail pointer. If this would happen, the tail pointer must be
+ * pushed, thus removing the oldest entry.
+ */
+ for (;;) {
+ tail = atomic_long_read(&rb->tail);
+ *res_old = atomic_long_read(&rb->reserve);
+
+ /*
+ * If the new reserve pointer wraps, the new entry will
+ * begin at the beginning of the data array. This loop
+ * exists only to handle the wrap.
+ */
+ for (entry_begin = *res_old;;) {
+
+ *res_new = entry_begin + size;
+
+ if (*res_new - tail > PRB_SIZE(rb)) {
+ /* would overtake tail, push tail */
+
+ if (!push_tail(rb, tail)) {
+ /* couldn't push tail, can't reserve */
+ return false;
+ }
+
+ /* tail pushed, try again */
+ break;
+ }
+
+ if (PRB_WRAPS(rb, entry_begin) ==
+ PRB_WRAPS(rb, *res_new)) {
+ /* reserve pointer values determined */
+ return true;
+ }
+
+ /*
+ * The new entry will wrap. Calculate the new reserve
+ * pointer based on the beginning of the data array
+ * for the wrap of the new reserve pointer.
+ */
+ entry_begin = PRB_THIS_WRAP_START_LPOS(rb, *res_new);
+ }
+ }
+}
+
+/*
+ * prb_reserve: Reserve an entry within a ring buffer.
+ * @e: A structure to be setup and reference a reserved entry.
+ * @rb: A ring buffer to reserve data within.
+ * @size: The number of bytes to reserve.
+ *
+ * Reserve an entry of at least @size bytes to be used by the caller. If
+ * successful, the data region of the entry belongs to the caller and cannot
+ * be invalidated by any other task/context. For this reason, the caller
+ * should call prb_commit() as quickly as possible in order to avoid preventing
+ * other tasks/contexts from reserving data in the case that the ring buffer
+ * has wrapped.
+ *
+ * It is safe to call this function from any context and state.
+ *
+ * Returns a pointer to the reserved data (and @e is setup to reference the
+ * entry containing that data) or NULL if it was not possible to reserve data.
+ */
+char *prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
+ unsigned int size)
+{
+ unsigned long res_old, res_new;
+ struct prb_entry *entry;
+
+ if (size == 0)
+ return NULL;
+
+ /* add entry header to size and align for the following entry */
+ size = PRB_ENTRY_ALIGN_SIZE(sizeof(struct prb_entry) + size);
+
+ if (size >= PRB_SIZE(rb))
+ return NULL;
+
+ /*
+ * Lock out all other CPUs and disable interrupts. Only NMIs will
+ * be able to interrupt. It will stay this way until the matching
+ * commit is called.
+ */
+ prb_lock(rb->cpulock);
+
+ /*
+ * Clarify the responsibility of this context. If this context
+ * increments ctx to 1, this context is responsible for pushing
+ * the head pointer beyond all reserved entries on commit.
+ */
+ atomic_inc(&rb->ctx);
+
+ /*
+ * Move the reserve pointer forward. Since NMIs can interrupt at any
+ * time, modifying the reserve pointer is done in a cmpxchg loop.
+ */
+ do {
+ if (!find_res_ptrs(rb, &res_old, &res_new, size)) {
+ /*
+ * Not possible to move the reserve pointer. Try to
+ * commit all reserved entries because this context
+ * might have that responsibility (if it incremented
+ * ctx to 1).
+ */
+ prb_commit_all_reserved(rb);
+ return NULL;
+ }
+ } while (!atomic_long_try_cmpxchg_acquire(&rb->reserve,
+ &res_old, res_new));
+
+ entry = to_entry(rb, res_old);
+ if (PRB_WRAPS(rb, res_old) != PRB_WRAPS(rb, res_new)) {
+ /*
+ * The reserve wraps. Create the terminating entry and get the
+ * pointer to the actually reserved entry at the beginning of
+ * the data array on the wrap of the new reserve pointer.
+ */
+ entry->size = -1;
+ entry = to_entry(rb, PRB_THIS_WRAP_START_LPOS(rb, res_new));
+ }
+
+ /* The size is set now. The seq is set later, on commit. */
+ entry->size = size;
+
+ e->rb = rb;
+ return &entry->data[0];
+}

2019-02-18 13:38:22

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 05/25] printk-rb: add basic non-blocking reading interface

On Tue 2019-02-12 15:29:43, John Ogness wrote:
> Add reader iterator static declaration/initializer, dynamic
> initializer, and functions to iterate and retrieve ring buffer data.
>
> Signed-off-by: John Ogness <[email protected]>
> ---
> include/linux/printk_ringbuffer.h | 20 ++++
> lib/printk_ringbuffer.c | 190 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 210 insertions(+)
>
> diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
> index 1aec9d5666b1..5fdaf632c111 100644
> --- a/include/linux/printk_ringbuffer.h
> +++ b/include/linux/printk_ringbuffer.h
> @@ -43,6 +43,19 @@ static struct prb_cpulock name = { \
> .irqflags = &_##name##_percpu_irqflags, \
> }
>
> +#define PRB_INIT ((unsigned long)-1)
> +
> +#define DECLARE_STATIC_PRINTKRB_ITER(name, rbaddr) \
> +static struct prb_iterator name = { \

Please, define the macro without static. The static variable
should get declared as:

static DECLARE_PRINTKRB_ITER();

> + .rb = rbaddr, \
> + .lpos = PRB_INIT, \
> +}
> +
> +struct prb_iterator {
> + struct printk_ringbuffer *rb;
> + unsigned long lpos;
> +};

Please, define the structure before it is used (even in macros).
It is strange to initialize something that is not yet defined.


> #define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr) \
> static char _##name##_buffer[1 << (szbits)] \
> __aligned(__alignof__(long)); \
> @@ -62,6 +75,13 @@ char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
> unsigned int size);
> void prb_commit(struct prb_handle *h);
>
> +/* reader interface */
> +void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb,
> + u64 *seq);
> +void prb_iter_copy(struct prb_iterator *dest, struct prb_iterator *src);
> +int prb_iter_next(struct prb_iterator *iter, char *buf, int size, u64 *seq);
> +int prb_iter_data(struct prb_iterator *iter, char *buf, int size, u64 *seq);
> +
> /* utility functions */
> void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);
> void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);
> diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
> index 90c7f9a9f861..1d1e886a0966 100644
> --- a/lib/printk_ringbuffer.c
> +++ b/lib/printk_ringbuffer.c
> @@ -1,5 +1,7 @@
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/smp.h>
> +#include <linux/string.h>
> +#include <linux/errno.h>
> #include <linux/printk_ringbuffer.h>
>
> #define PRB_SIZE(rb) (1 << rb->size_bits)
> @@ -8,6 +10,7 @@
> #define PRB_WRAPS(rb, lpos) (lpos >> rb->size_bits)
> #define PRB_WRAP_LPOS(rb, lpos, xtra) \
> ((PRB_WRAPS(rb, lpos) + xtra) << rb->size_bits)
> +#define PRB_DATA_SIZE(e) (e->size - sizeof(struct prb_entry))
> #define PRB_DATA_ALIGN sizeof(long)
>
> static bool __prb_trylock(struct prb_cpulock *cpu_lock,
> @@ -247,3 +250,190 @@ char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
>
> return &h->entry->data[0];
> }
> +
> +/*
> + * prb_iter_copy: Copy an iterator.
> + * @dest: The iterator to copy to.
> + * @src: The iterator to copy from.
> + *
> + * Make a deep copy of an iterator. This is particularly useful for making
> + * backup copies of an iterator in case a form of rewinding it needed.
> + *
> + * It is safe to call this function from any context and state. But
> + * note that this function is not atomic. Callers should not make copies
> + * to/from iterators that can be accessed by other tasks/contexts.
> + */
> +void prb_iter_copy(struct prb_iterator *dest, struct prb_iterator *src)
> +{
> + memcpy(dest, src, sizeof(*dest));
> +}
> +
> +/*
> + * prb_iter_init: Initialize an iterator for a ring buffer.
> + * @iter: The iterator to initialize.
> + * @rb: A ring buffer to that @iter should iterate.
> + * @seq: The sequence number of the position preceding the first record.
> + * May be NULL.
> + *
> + * Initialize an iterator to be used with a specified ring buffer. If @seq
> + * is non-NULL, it will be set such that prb_iter_next() will provide a
> + * sequence value of "@seq + 1" if no records were missed.
> + *
> + * It is safe to call this function from any context and state.
> + */
> +void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb,
> + u64 *seq)
> +{
> + memset(iter, 0, sizeof(*iter));
> + iter->rb = rb;
> + iter->lpos = PRB_INIT;
> +
> + if (!seq)
> + return;
> +
> + for (;;) {
> + struct prb_iterator tmp_iter;
> + int ret;
> +
> + prb_iter_copy(&tmp_iter, iter);

It looks strange to copy something that has been barely initialized.
I hope that we could do this without a copy, see below.

> +
> + ret = prb_iter_next(&tmp_iter, NULL, 0, seq);

prb_iter_next() and prb_iter_data() are too complex spaghetti
code. They do basically the same thing but they do not share
any helper function. The error handling is different,
which is really confusing. See below.


> + if (ret < 0)
> + continue;
> +
> + if (ret == 0)
> + *seq = 0;
> + else
> + (*seq)--;

The decrement is another suspicious thing here.

> + break;

Finally, I am confused by the semantics of this function.
What are the initialized lpos and seq number supposed to be,
please?

I would expect a function that initializes the iterator
either to the first valid entry (tail-one) or
to the one defined by the given seq number.


Well, I think that we need to start with more low-level functions.
IMHO, we need something to read one entry in a safe way. Then it will
be much easier to live with races in the rest of the code:

/*
* Return a valid entry at the given lpos. Data are copied
* only when data_buf_size is not zero.
*/
int prb_get_entry(struct printk_ringbuffer *rb,
unsigned long lpos,
struct prb_entry *entry,
unsigned int data_buf_size)
{
/*
* Pointer into the ring buffer. The data might get lost
* at any time.
*/
struct prb_entry *weak_entry;

if (!is_valid(rb, lpos))
return -EINVAL;

/* Make sure that data are valid for the given valid lpos. */
smp_rmb();

weak_entry = to_entry(rb, lpos);
entry->seq = weak_entry->seq;

if (data_buf_size) {
unsigned int size;

size = min(data_buf_size, weak_entry->size);
memcpy(entry->data, weak_entry->data, size);
entry->size = size;
} else {
entry->size = weak_entry->size;
}

/* Make sure that the copy is done before we check validity. */
smp_mb();

/* 0 on success; non-zero when the entry was overwritten meanwhile */
return is_valid(rb, lpos) ? 0 : -EAGAIN;
}

Then I would do the support for iterating the following way.
First, I would extend the structure:

struct prb_iterator {
struct printk_ringbuffer *rb;
struct prb_entry *entry;
unsigned int data_buffer_size;
unsigned long lpos;
};

And do something like:

void prb_iter_init(struct printk_ringbuffer *rb,
struct prb_entry *entry,
unsigned int data_buffer_size,
struct prb_iterator *iter)
{
iter->rb = rb;
iter->entry = entry;
iter->data_buffer_size = data_buffer_size;
iter->lpos = 0UL;
}

Then we could do iterator support the following way:

/* Start iteration with reading the tail entry. */
int prb_iter_tail_entry(struct prb_iterator *iter)
{
unsigned long tail;
int ret;

for (;;) {
tail = atomic_long_read(&iter->rb->tail);

/* Ring buffer is empty? */
if (unlikely(!is_valid(iter->rb, tail)))
return -EINVAL;

ret = prb_get_entry(iter->rb, tail,
iter->entry, iter->data_buffer_size);
if (!ret) {
iter->lpos = tail;
break;
}
}

return 0;
}

unsigned long next_lpos(unsigned long lpos, struct prb_entry *entry)
{
return lpos + sizeof(struct prb_entry) + entry->size;
}

/* Try to get next entry using a valid iterator */
int prb_iter_next_entry(struct prb_iterator *iter)
{
iter->lpos = next_lpos(iter->lpos, iter->entry);

return prb_get_entry(iter->rb, iter->lpos, iter->entry, iter->data_buffer_size);
}

/* Try to get the next entry. Allow to skip lost messages. */
int prb_iter_next_valid_entry(struct prb_iterator *iter)
{
int ret;

ret = prb_iter_next_entry(iter);
if (!ret)
return 0;

/* Next entry has been lost. Skip to the current tail. */
return prb_iter_tail_entry(iter);
}
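
A reader draining all currently committed entries could then look
something like this (untested sketch based on the API above;
MAX_DATA_SIZE is a made-up constant, alignment ignored for brevity):

	char buf[sizeof(struct prb_entry) + MAX_DATA_SIZE];
	struct prb_entry *entry = (struct prb_entry *)buf;
	struct prb_iterator iter;
	int ret;

	prb_iter_init(rb, entry, MAX_DATA_SIZE, &iter);

	for (ret = prb_iter_tail_entry(&iter);
	     !ret;
	     ret = prb_iter_next_entry(&iter)) {
		/* entry->seq and entry->data are valid here */
	}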

> +static bool is_valid(struct printk_ringbuffer *rb, unsigned long lpos)
> +{
> + unsigned long head, tail;
> +
> + tail = atomic_long_read(&rb->tail);
> + head = atomic_long_read(&rb->head);
> + head -= tail;
> + lpos -= tail;
> +
> + if (lpos >= head)
> + return false;
> + return true;
> +}
> +

IMPORTANT:

Please, do not start rewriting the entire patchset after reading this
mail. I suggest to take a rest from coding. Just read the feedback,
ask if anything is unclear, and let your brain process it
in the background.

The motivation:

1. This is a really complex patchset and it will be a long run. For
example, I worked on the atomic livepatch replace patchset for more
than 1 year. There were 15 iterations, see
https://lkml.kernel.org/r/[email protected]
And I was really familiar with this subsystem, was reviewer from
the beginning.

Another example is the kthread worker API. It took more than
1 year from RFC until v10 was accepted, see
[email protected]

In both cases, the final revision looked completely
different from the initial RFC.

The printk world is even more complicated. I worked for several months
on a race-free NMI buffer (it was my first big kernel project).
It was put into the garbage dump within 1 day, see
https://lkml.kernel.org/r/CA+55aFzseOqF-EpKMvwKpfhBJZQSLqKpJ3shzVee9s0+mvyCuA@mail.gmail.com

The patches offloading printk console output to a kthread were
floating around for a few years and were never accepted.

That said, I do not want to discourage you. Your patchset has
interesting ideas that are worth finishing. I just think
that it is better to take a rest when you need to wait
for feedback. It will help you to see it from another
perspective and start working on v2 with a fresh mind.


2. There are 25 patches already. It might be hard to follow
the discussion. And it will be even harder if there
are more variants of the patches discussed in the same
thread.

I suggest to just process the feedback. Ask if anything
is unclear. If needed, discuss code alternatives but
do not send entire patches. Send a full v2. Then we could
start with a clean slate again.


3. I have seen only 5 patches out of 25 so far. My feedback
is based on past experience and common sense. I might
view some things differently once I get to the other
patches.

You might feel frustrated when you rework something
based on my feedback and I later change my mind
and suggest something different.

I do not want to make you frustrated. Therefore I feel
stressed and afraid to send more feedback before I get
the full picture. But then it might take weeks until
I send something. Many ideas might be lost in the meantime.
The result might be hard to understand because I would
describe some final statements without explaining
all the reasons.


4. There might be feedback from other people and they
might have another opinion.


Thanks for working on it.

Best Regards,
Petr

2019-02-18 14:06:52

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 06/25] printk-rb: add blocking reader support

On Tue 2019-02-12 15:29:44, John Ogness wrote:
> Add a blocking read function for readers. An irq_work function is
> used to signal the wait queue so that write notification can
> be triggered from any context.

I would be more precise about what exactly is problematic in which context.
Something like:

An irq_work function is used because wake_up() cannot be called safely
from NMI and scheduler context.

> Signed-off-by: John Ogness <[email protected]>
> ---
> include/linux/printk_ringbuffer.h | 20 ++++++++++++++++
> lib/printk_ringbuffer.c | 49 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 69 insertions(+)
>
> diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
> index 5fdaf632c111..106f20ef8b4d 100644
> --- a/include/linux/printk_ringbuffer.h
> +++ b/include/linux/printk_ringbuffer.h
> @@ -2,8 +2,10 @@
> #ifndef _LINUX_PRINTK_RINGBUFFER_H
> #define _LINUX_PRINTK_RINGBUFFER_H
>
> +#include <linux/irq_work.h>
> #include <linux/atomic.h>
> #include <linux/percpu.h>
> +#include <linux/wait.h>
>
> struct prb_cpulock {
> atomic_t owner;
> @@ -22,6 +24,10 @@ struct printk_ringbuffer {
>
> struct prb_cpulock *cpulock;
> atomic_t ctx;
> +
> + struct wait_queue_head *wq;
> + atomic_long_t wq_counter;
> + struct irq_work *wq_work;
> };
>
> struct prb_entry {
> @@ -59,6 +65,15 @@ struct prb_iterator {
> #define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr) \
> static char _##name##_buffer[1 << (szbits)] \
> __aligned(__alignof__(long)); \
> +static DECLARE_WAIT_QUEUE_HEAD(_##name##_wait); \
> +static void _##name##_wake_work_func(struct irq_work *irq_work) \
> +{ \
> + wake_up_interruptible_all(&_##name##_wait); \
> +} \

All ring buffers might share the same generic function, something like:

void prb_wake_readers_work_func(struct irq_work *irq_work)
{
struct printk_ringbuffer *rb;

rb = container_of(irq_work, struct printk_ringbuffer, wq_work);
wake_up_interruptible_all(rb->wq);
}
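
Then DECLARE_STATIC_PRINTKRB() would only need to declare:

static struct irq_work _##name##_wake_work = { \
	.func = prb_wake_readers_work_func, \
	.flags = IRQ_WORK_LAZY, \
}; \

Note that container_of() only works when the irq_work is embedded
in struct printk_ringbuffer; with the current wq_work pointer the
member would have to be embedded instead.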


> +static struct irq_work _##name##_wake_work = { \
> + .func = _##name##_wake_work_func, \
> + .flags = IRQ_WORK_LAZY, \
> +}; \
> static struct printk_ringbuffer name = { \
> .buffer = &_##name##_buffer[0], \
> .size_bits = szbits, \
> diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
> index 1d1e886a0966..c2ddf4cb9f92 100644
> --- a/lib/printk_ringbuffer.c
> +++ b/lib/printk_ringbuffer.c
> @@ -185,6 +188,12 @@ void prb_commit(struct prb_handle *h)
> }
>
> prb_unlock(rb->cpulock, h->cpu);
> +
> + if (changed) {
> + atomic_long_inc(&rb->wq_counter);
> + if (wq_has_sleeper(rb->wq))
> + irq_work_queue(rb->wq_work);
> + }
> }
>
> /*
> @@ -437,3 +446,43 @@ int prb_iter_next(struct prb_iterator *iter, char *buf, int size, u64 *seq)
>
> return 1;
> }
> +
> +/*
> + * prb_iter_wait_next: Advance to the next record, blocking if none available.
> + * @iter: Iterator tracking the current position.
> + * @buf: A buffer to store the data of the next record. May be NULL.
> + * @size: The size of @buf. (Ignored if @buf is NULL.)
> + * @seq: The sequence number of the next record. May be NULL.
> + *
> + * If a next record is already available, this function works like
> + * prb_iter_next(). Otherwise block interruptible until a next record is
> + * available.
> + *
> + * When a next record is available, @iter is advanced and (if specified)
> + * the data and/or sequence number of that record are provided.
> + *
> + * This function might sleep.
> + *
> + * Returns 1 if @iter was advanced, -EINVAL if @iter is now invalid, or
> + * -ERESTARTSYS if interrupted by a signal.
> + */
> +int prb_iter_wait_next(struct prb_iterator *iter, char *buf, int size, u64 *seq)
> +{
> + unsigned long last_seen;
> + int ret;
> +
> + for (;;) {
> + last_seen = atomic_long_read(&iter->rb->wq_counter);
> +
> + ret = prb_iter_next(iter, buf, size, seq);
> + if (ret != 0)
> + break;
> +
> + ret = wait_event_interruptible(*iter->rb->wq,
> + last_seen != atomic_long_read(&iter->rb->wq_counter));

Do we really need yet another counter here?

I think that rb->seq might do the same job. Or if there is a problem
with atomicity then rb->head might work as well. Or am I missing
something?
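
I mean something like this (untested):

	last_seen = atomic_long_read(&iter->rb->head);

	ret = prb_iter_next(iter, buf, size, seq);
	if (ret != 0)
		break;

	ret = wait_event_interruptible(*iter->rb->wq,
		last_seen != atomic_long_read(&iter->rb->head));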

Best Regards,
Petr

> + if (ret < 0)
> + break;
> + }
> +
> + return ret;
> +}
> --
> 2.11.0
>

2019-02-18 18:50:52

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 07/25] printk-rb: add functionality required by printk

On Tue 2019-02-12 15:29:45, John Ogness wrote:
> The printk subsystem needs to be able to query the size of the ring
> buffer, seek to specific entries within the ring buffer, and track
> if records could not be stored in the ring buffer.
>
> diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
> index c2ddf4cb9f92..ce33b5add5a1 100644
> --- a/lib/printk_ringbuffer.c
> +++ b/lib/printk_ringbuffer.c
> @@ -175,11 +175,16 @@ void prb_commit(struct prb_handle *h)
> head = PRB_WRAP_LPOS(rb, head, 1);
> continue;
> }
> + while (atomic_long_read(&rb->lost)) {
> + atomic_long_dec(&rb->lost);
> + rb->seq++;

The lost-messages handling should be in a separate patch.
At least I do not see any close relation with prb_iter_seek().

I would personally move prb_iter_seek() to the 5th patch that
implements the other get/iterator functions.

On the contrary, the patch adding support for lost messages
should implement a way to inform the user about lost messages,
e.g. adding a warning when some space becomes available again.

> + }
> e->seq = ++rb->seq;
> head += e->size;
> changed = true;
> }
> atomic_long_set_release(&rb->head, res);
> +
> atomic_dec(&rb->ctx);
>
> if (atomic_long_read(&rb->reserve) == res)
> @@ -486,3 +491,93 @@ int prb_iter_wait_next(struct prb_iterator *iter, char *buf, int size, u64 *seq)
>
> return ret;
> }
> +
> +/*
> + * prb_iter_seek: Seek forward to a specific record.
> + * @iter: Iterator to advance.
> + * @seq: Record number to advance to.
> + *
> + * Advance @iter such that a following call to prb_iter_data() will provide
> + * the contents of the specified record. If a record is specified that does
> + * not yet exist, advance @iter to the end of the record list.
> + *
> + * Note that iterators cannot be rewound. So if a record is requested that
> + * exists but is previous to @iter in position, @iter is considered invalid.
> + *
> + * It is safe to call this function from any context and state.
> + *
> + * Returns 1 on success, 0 if specified record does not yet exist (@iter is
> + * now at the end of the list), or -EINVAL if @iter is now invalid.
> + */

Do we really need to distinguish when the iterator is invalid and when
we are at the end of the buffer?

It seems that the reaction in both situations is always to call
prb_iter_init(&iter, &printk_rb, &some_seq). I am still a bit
confused about what your prb_iter_init() does, and therefore I am
not sure what it is supposed to do.

Anyway, it seems to be typically used when you need to start
from the tail. I would personally do something like this (based on my
code in reply to the 5th patch):

int prb_iter_seek_to_seq(struct prb_iterator *iter, u64 seq)
{
int ret;

ret = prb_iter_tail_entry(iter);

while (!ret && iter->entry->seq != seq)
ret = prb_iter_next_valid_entry(iter);

return ret;
}

Looking at it, I would make the functionality more obvious
by renaming:

prb_iter_tail_entry() -> prb_iter_set_tail_entry()


> +int prb_iter_seek(struct prb_iterator *iter, u64 seq)
> +{
> + u64 cur_seq;
> + int ret;
> +
> + /* first check if the iterator is already at the wanted seq */
> + if (seq == 0) {
> + if (iter->lpos == PRB_INIT)
> + return 1;
> + else
> + return -EINVAL;
> + }
> + if (iter->lpos != PRB_INIT) {
> + if (prb_iter_data(iter, NULL, 0, &cur_seq) >= 0) {
> + if (cur_seq == seq)
> + return 1;
> + if (cur_seq > seq)
> + return -EINVAL;
> + }
> + }
> +
> + /* iterate to find the wanted seq */
> + for (;;) {
> + ret = prb_iter_next(iter, NULL, 0, &cur_seq);
> + if (ret <= 0)
> + break;
> +
> + if (cur_seq == seq)
> + break;
> +
> + if (cur_seq > seq) {
> + ret = -EINVAL;
> + break;
> + }

This is a good example of why prb_iter_data() and prb_iter_next() are
a weird interface. You need to read the documentation very carefully
to understand the difference (functionality, error codes). At least
for me, it is far from obvious why they are used this way and
if it is correct.

Best Regards,
Petr

> + }
> +
> + return ret;
> +}

2019-02-19 13:55:30

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On Tue 2019-02-12 15:29:46, John Ogness wrote:
> The printk ring buffer provides an NMI-safe interface for writing
> messages to a ring buffer. Using such a buffer for alleviates printk
> callers from the current burdens of disabled preemption while calling
> the console drivers (and possibly printing out many messages that
> another task put into the log buffer).
>
> Create a ring buffer to be used for storing messages to be
> printed to the consoles.
>
> Create a dedicated printk kthread to block on the ring buffer
> and call the console drivers for the read messages.

It was already mentioned in the reply to the cover letter.
Offloading console handling to a kthread is very problematic:

+ It reduces the chance that the messages will ever reach
the console when the system becomes unstable or is dying.
Note that panic() is the easier case; the system could
die without ever calling panic().

+ There are situations when the kthread will not be usable
by definition, for example, early boot, suspend, kexec,
shutdown.

+ It is hard to make a single thread efficient enough. It cannot
have too high a priority, to avoid stalling normal processes,
yet it won't be scheduled in time when processes with a higher
priority produce too many messages.


That said, I think that some offloading would be useful and
even necessary, especially on the real time systems. But we
need to be more conservative, for example:

+ offload only when it takes too long
+ do not offload in emergency situations
+ keep the console owner logic

The above ideas come from the old discussions about introducing
printk kthread. This patchset brings one more idea to push
emergency messages directly to "lockless" consoles.

I could imagine that the default setting will be more
conservative while real time kernel would do a more
aggressive offloading.
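
For illustration, the "offload only when it takes too long" rule could
be a simple time check in the direct-printing loop. A rough sketch
(the threshold constant and the wake helper are made-up names, only
to show the idea):

	u64 start = local_clock();

	while (prb_iter_next(&iter, buf, PRINTK_RECORD_MAX, &seq) > 0) {
		call_console_drivers(ext_text, ext_len, text, len);

		/* Printing directly for too long? Hand over to the kthread. */
		if (local_clock() - start > PRINTK_DIRECT_LIMIT_NS) {
			wake_up_printk_kthread();
			break;
		}
	}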

Anyway, I would split this entire patchset into two.
The first one would replace the existing main buffer
and per-cpu safe buffers with the lockless one.
The other patchset would try to reduce delays caused
by console handling by introducing lockless consoles,
offloading, ...


> NOTE: The printk_delay is relocated to _after_ the message is
> printed, where it makes more sense.
>
> Signed-off-by: John Ogness <[email protected]>
> ---
> kernel/printk/printk.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 105 insertions(+)
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index d3d170374ceb..08e079b95652 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -44,6 +44,8 @@
> #include <linux/irq_work.h>
> #include <linux/ctype.h>
> #include <linux/uio.h>
> +#include <linux/kthread.h>
> +#include <linux/printk_ringbuffer.h>
> #include <linux/sched/clock.h>
> #include <linux/sched/debug.h>
> #include <linux/sched/task_stack.h>
> @@ -397,7 +399,12 @@ DEFINE_RAW_SPINLOCK(logbuf_lock);
> printk_safe_exit_irqrestore(flags); \
> } while (0)
>
> +DECLARE_STATIC_PRINTKRB_CPULOCK(printk_cpulock);
> +
> #ifdef CONFIG_PRINTK
> +/* record buffer */
> +DECLARE_STATIC_PRINTKRB(printk_rb, CONFIG_LOG_BUF_SHIFT, &printk_cpulock);
> +
> DECLARE_WAIT_QUEUE_HEAD(log_wait);
> /* the next printk record to read by syslog(READ) or /proc/kmsg */
> static u64 syslog_seq;
> @@ -744,6 +751,10 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
> return p - buf;
> }
>
> +#define PRINTK_SPRINT_MAX (LOG_LINE_MAX + PREFIX_MAX)
> +#define PRINTK_RECORD_MAX (sizeof(struct printk_log) + \
> + CONSOLE_EXT_LOG_MAX + PRINTK_SPRINT_MAX)
> +
> /* /dev/kmsg - userspace message inject/listen interface */
> struct devkmsg_user {
> u64 seq;
> @@ -1566,6 +1577,34 @@ SYSCALL_DEFINE3(syslog, int, type, char __user *, buf, int, len)
> return do_syslog(type, buf, len, SYSLOG_FROM_READER);
> }
>
> +static void format_text(struct printk_log *msg, u64 seq,
> + char *ext_text, size_t *ext_len,
> + char *text, size_t *len, bool time)
> +{
> + if (suppress_message_printing(msg->level)) {
> + /*
> + * Skip record that has level above the console
> + * loglevel and update each console's local seq.
> + */
> + *len = 0;
> + *ext_len = 0;
> + return;
> + }
> +
> + *len = msg_print_text(msg, console_msg_format & MSG_FORMAT_SYSLOG,
> + time, text, PRINTK_SPRINT_MAX);
> + if (nr_ext_console_drivers) {
> + *ext_len = msg_print_ext_header(ext_text, CONSOLE_EXT_LOG_MAX,
> + msg, seq);
> + *ext_len += msg_print_ext_body(ext_text + *ext_len,
> + CONSOLE_EXT_LOG_MAX - *ext_len,
> + log_dict(msg), msg->dict_len,
> + log_text(msg), msg->text_len);
> + } else {
> + *ext_len = 0;
> + }
> +}

Please, refactor the existing code instead of cut&pasting. It will
help in checking for eventual regressions. It will also help in
understanding the evolution when digging through the git history. [*]

> +
> /*
> * Special console_lock variants that help to reduce the risk of soft-lockups.
> * They allow to pass console_lock to another printk() call using a busy wait.
> @@ -2899,6 +2938,72 @@ void wake_up_klogd(void)
> preempt_enable();
> }
>
> +static int printk_kthread_func(void *data)
> +{
> + struct prb_iterator iter;
> + struct printk_log *msg;
> + size_t ext_len;
> + char *ext_text;
> + u64 master_seq;
> + size_t len;
> + char *text;
> + char *buf;
> + int ret;
> +
> + ext_text = kmalloc(CONSOLE_EXT_LOG_MAX, GFP_KERNEL);
> + text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL);
> + buf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
> + if (!ext_text || !text || !buf)
> + return -1;

We need to have some fallback solution when the kthread can't
get started properly.


> +
> + prb_iter_init(&iter, &printk_rb, NULL);
> +
> + /* the printk kthread never exits */
> + for (;;) {
> + ret = prb_iter_wait_next(&iter, buf,
> + PRINTK_RECORD_MAX, &master_seq);
> + if (ret == -ERESTARTSYS) {
> + continue;
> + } else if (ret < 0) {
> + /* iterator invalid, start over */
> + prb_iter_init(&iter, &printk_rb, NULL);
> + continue;
> + }
> +
> + msg = (struct printk_log *)buf;
> + format_text(msg, master_seq, ext_text, &ext_len, text,
> + &len, printk_time);
> +
> + console_lock();
> + if (len > 0 || ext_len > 0) {
> + call_console_drivers(ext_text, ext_len, text, len);
> + boot_delay_msec(msg->level);
> + printk_delay();
> + }
> + console_unlock();

This calls console_unlock(), which still calls the console drivers at
this stage. Note that each patch should keep the kernel buildable and
avoid regressions. Otherwise, it would break bisection. [*]

> + }
> +
> + kfree(ext_text);
> + kfree(text);
> + kfree(buf);

This will never get called. Well, I think that the kthread
will look different when it is used only as a callback.
We will also need some buffers when the console output
is handled by the printk() caller directly.


> + return 0;
> +}
> +

[*] I am not sure whether I should comment on these details at this
[RFC] stage. I guess that you have worked on this patchset for many
weeks or even months. You tried various approaches and
refactored the code several times. It is hard to keep
such a big patchset clean. It might even be a waste of time.


Best Regards,
Petr

2019-02-19 14:05:53

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 09/25] printk: remove exclusive console hack

On Tue 2019-02-12 15:29:47, John Ogness wrote:
> In order to support printing the printk log history when new
> consoles are registered, a global exclusive_console variable is
> temporarily set. This only works because printk runs with
> preemption disabled.
>
> When console printing is moved to a fully preemptible dedicated
> kthread, this hack no longer works.

We need to keep this functionality. Otherwise, people would miss
the beginning of the log on some consoles. Kernel does not know
what console is important for monitoring or debugging.

Fortunately, it should be easy to implement this with
the new ring buffer in a much cleaner way. Anyway, you
need to switch to the new implementation in a single
patch to avoid regression.
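
For illustration, the replay could be as simple as giving the new
console its own iterator starting at the oldest record. A sketch only
(newcon->seq and the write helper are hypothetical names, and all
locking is omitted):

	/* register_console(): start the new console at the oldest record. */
	prb_iter_init(&iter, &printk_rb, &newcon->seq);

	/* The printing kthread then replays the history for it. */
	while (prb_iter_next(&iter, buf, PRINTK_RECORD_MAX, &seq) > 0)
		write_console_msg(newcon, buf);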

I would personally postpone this into a separate patchset.

Best Regards,
Petr

2019-02-19 21:45:01

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 05/25] printk-rb: add basic non-blocking reading interface

Hi Petr,

Below I make several comments, responding to your questions. But I like
the new API I believe you are trying to propose. So really only my final
comments are of particular importance. There I show you what I think
reader code would look like using your proposed API.

On 2019-02-18, Petr Mladek <[email protected]> wrote:
>> Add reader iterator static declaration/initializer, dynamic
>> initializer, and functions to iterate and retrieve ring buffer data.
>>
>> Signed-off-by: John Ogness <[email protected]>
>> ---
>> include/linux/printk_ringbuffer.h | 20 ++++
>> lib/printk_ringbuffer.c | 190 ++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 210 insertions(+)
>>
>> diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
>> index 1aec9d5666b1..5fdaf632c111 100644
>> --- a/include/linux/printk_ringbuffer.h
>> +++ b/include/linux/printk_ringbuffer.h
>> @@ -43,6 +43,19 @@ static struct prb_cpulock name = { \
>> .irqflags = &_##name##_percpu_irqflags, \
>> }
>>
>> +#define PRB_INIT ((unsigned long)-1)
>> +
>> +#define DECLARE_STATIC_PRINTKRB_ITER(name, rbaddr) \
>> +static struct prb_iterator name = { \
>
> Please, define the macro without static. The static variable
> should get declared as:
>
> static DECLARE_PRINTKRB_ITER();

OK.

>> + .rb = rbaddr, \
>> + .lpos = PRB_INIT, \
>> +}
>> +
>> +struct prb_iterator {
>> + struct printk_ringbuffer *rb;
>> + unsigned long lpos;
>> +};
>
> Please, define the structure before it is used (even in macros).
> It is strange to initialize something that is not yet defined.

Agreed.

>> #define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr) \
>> static char _##name##_buffer[1 << (szbits)] \
>> __aligned(__alignof__(long)); \
>> @@ -62,6 +75,13 @@ char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
>> unsigned int size);
>> void prb_commit(struct prb_handle *h);
>>
>> +/* reader interface */
>> +void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb,
>> + u64 *seq);
>> +void prb_iter_copy(struct prb_iterator *dest, struct prb_iterator *src);
>> +int prb_iter_next(struct prb_iterator *iter, char *buf, int size, u64 *seq);
>> +int prb_iter_data(struct prb_iterator *iter, char *buf, int size, u64 *seq);
>> +
>> /* utility functions */
>> void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store);
>> void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);
>> diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
>> index 90c7f9a9f861..1d1e886a0966 100644
>> --- a/lib/printk_ringbuffer.c
>> +++ b/lib/printk_ringbuffer.c
>> @@ -1,5 +1,7 @@
>> // SPDX-License-Identifier: GPL-2.0
>> #include <linux/smp.h>
>> +#include <linux/string.h>
>> +#include <linux/errno.h>
>> #include <linux/printk_ringbuffer.h>
>>
>> #define PRB_SIZE(rb) (1 << rb->size_bits)
>> @@ -8,6 +10,7 @@
>> #define PRB_WRAPS(rb, lpos) (lpos >> rb->size_bits)
>> #define PRB_WRAP_LPOS(rb, lpos, xtra) \
>> ((PRB_WRAPS(rb, lpos) + xtra) << rb->size_bits)
>> +#define PRB_DATA_SIZE(e) (e->size - sizeof(struct prb_entry))
>> #define PRB_DATA_ALIGN sizeof(long)
>>
>> static bool __prb_trylock(struct prb_cpulock *cpu_lock,
>> @@ -247,3 +250,190 @@ char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb,
>>
>> return &h->entry->data[0];
>> }
>> +
>> +/*
>> + * prb_iter_copy: Copy an iterator.
>> + * @dest: The iterator to copy to.
>> + * @src: The iterator to copy from.
>> + *
>> + * Make a deep copy of an iterator. This is particularly useful for making
>> + * backup copies of an iterator in case a form of rewinding is needed.
>> + *
>> + * It is safe to call this function from any context and state. But
>> + * note that this function is not atomic. Callers should not make copies
>> + * to/from iterators that can be accessed by other tasks/contexts.
>> + */
>> +void prb_iter_copy(struct prb_iterator *dest, struct prb_iterator *src)
>> +{
>> + memcpy(dest, src, sizeof(*dest));
>> +}
>> +
>> +/*
>> + * prb_iter_init: Initialize an iterator for a ring buffer.
>> + * @iter: The iterator to initialize.
>> + * @rb: A ring buffer to that @iter should iterate.
>> + * @seq: The sequence number of the position preceding the first record.
>> + * May be NULL.
>> + *
>> + * Initialize an iterator to be used with a specified ring buffer. If @seq
>> + * is non-NULL, it will be set such that prb_iter_next() will provide a
>> + * sequence value of "@seq + 1" if no records were missed.
>> + *
>> + * It is safe to call this function from any context and state.
>> + */
>> +void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb,
>> + u64 *seq)
>> +{
>> + memset(iter, 0, sizeof(*iter));
>> + iter->rb = rb;
>> + iter->lpos = PRB_INIT;
>> +
>> + if (!seq)
>> + return;
>> +
>> + for (;;) {
>> + struct prb_iterator tmp_iter;
>> + int ret;
>> +
>> + prb_iter_copy(&tmp_iter, iter);
>
> It looks strange to copy something that has barely been initialized.
> I hope that we could do this without a copy, see below.
>
>> +
>> + ret = prb_iter_next(&tmp_iter, NULL, 0, seq);
>
> prb_iter_next() and prb_iter_data() are too complex spaghetti
> code. They do basically the same thing but they do not share
> any helper function. The error handling is different
> which is really confusing. See below.

I don't follow why you think they do basically the same
thing. prb_iter_next() moves forward to the next entry, then calls
prb_iter_data() to retrieve the data. prb_iter_data() _is_ the helper
function.

(I do not intend to defend the ringbuffer API. I am only addressing your
comment that they do the same thing. I mentioned in the cover letter
that the API is not pretty. I am thankful for tips to improve it.)

>> + if (ret < 0)
>> + continue;
>> +
>> + if (ret == 0)
>> + *seq = 0;
>> + else
>> + (*seq)--;
>
> The decrement is another suspicious thing here.

"seq" represents the last entry that we have seen. If we initialize the
iterator and then call prb_iter_next(), it is expected that we will see
the first unseen entry. The caller expects prb_iter_next() to yield a
seq that is 1 more than what prb_iter_init() set.

The purpose of the interface working this way is to have loops where we
do not need to call prb_iter_data() for the first entry and
prb_iter_next() for all other elements. The call order would be:

prb_iter_init();
while (prb_iter_next()) {
}

If prb_iter_init() can return a seq value, then the logic to check if
any entries were missed will verify that seq has incremented by exactly
1 for each prb_iter_next() call.
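
Spelled out, the intended pattern is something like this sketch of the
semantics described above:

	u64 seq, last_seq;

	prb_iter_init(&iter, &printk_rb, &last_seq);
	while (prb_iter_next(&iter, buf, size, &seq) > 0) {
		if (seq - last_seq != 1) {
			/* seq - last_seq - 1 records were missed */
		}
		last_seq = seq;
		/* process buf */
	}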


>> + break;
>
> Finally, I am confused by the semantics of this function.
> What are the initialized lpos and seq number supposed to be,
> please?

lpos (an internal field) is initialized with the PRB_INIT constant to
mark that the iterator has been initialized but not yet used. This is so
that the next call to prb_iter_next() will be valid no matter what
(after which lpos is set to the position of that returned entry). I did not
like the idea of iterators becoming invalid before they were even used.

seq (an optional by-reference variable from the caller), if non-NULL,
will be set with a seq number such that the next call to prb_iter_next()
will return that value + 1. As mentioned above, these semantics are to
simplify the init/next loop/logic.

But prb_iter_init() is not only used for read loops. It is also used,
for example, by printk to set the clear_seq marker in some cases of
"dmesg -c". As I've mentioned, the API grew out of printk
requirements. I am more than happy to simplify this.

> I would expect a function that initializes the iterator
> either to the first valid entry (tail-one) or
> to the one defined by the given seq number.
>
> Well, I think that we need to start with more low-level functions.
> IMHO, we need something to read one entry in a safe way. Then it will
> be much easier to live with races in the rest of the code:
>
> /*
> * Return valid entry on the given lpos. Data are read
> * only when the buffer size is not zero.
> */
> int prb_get_entry(struct printk_ringbuffer *rb,
> unsigned long lpos,
> struct prb_entry *entry,
> unsigned int data_buf_size)
> {
> /*
> * Pointer to the ring buffer. The data might get lost
> * at any time.
> */
> struct prb_entry *weak_entry;
>
> if (!is_valid(lpos))
> return -EINVAL;
>
> /* Make sure that data are valid for the given valid lpos. */
> smp_rmb();
>
> weak_entry = to_entry(lpos);
> entry->seq = weak_entry->seq;
>
> if (data_buf_size) {
> unsigned int size;
>
> size = min(data_buf_size, weak_entry->size);

weak_entry->size is untrusted data here. The following memcpy could grab
data beyond the data array. (But we can ignore these details for now. I
realize you are trying to refactor, not focus on these details.)

> memcpy(entry->data, weak_entry->data, size);
> entry->size = size;
> } else {
> entry->size = weak_entry->size;
> }
>
> /* Make sure that the copy is done before we check validity. */
> smp_mb();
>
> return is_valid(lpos);
> }
>
> Then I would do the support for iterating the following way.
> First, I would extend the structure:
>
> struct prb_iterator {
> struct printk_ringbuffer *rb;
> struct prb_entry *entry;
> unsigned int data_buffer_size;
> unsigned long lpos;
> };
>
> And do something like:
>
> void prb_iter_init(struct printk_ringbuffer *rb,
> struct prb_entry *entry,
> unsigned int data_buffer_size,
> struct prb_iterator *iter)
> {
> iter->rb = rb;
> iter->entry = entry;
> iter->data_buffer_size = data_buffer_size;
> iter->lpos = 0UL;
> }
>
> Then we could do iterator support the following way:
>
> /* Start iteration with reading the tail entry. */
> int prb_iter_tail_entry(struct prb_iterator *iter);

A name like prb_iter_oldest_entry() might simplify things. I really
don't want the caller to be concerned with heads and tails and which is
which. That's an implementation detail of the ringbuffer.

> {
> unsigned long tail;
> int ret;
>
> for (;;) {
> tail = atomic_long_read(&rb->tail);
>
> /* Ring buffer is empty? */

The ring buffer is only empty at the beginning (when tail == head). Readers
are non-consuming, so it is never empty again once an entry is committed.

if (unlikely(atomic_long_read(&rb->head) == atomic_long_read(&rb->tail)))
return -EINVAL;

The check for valid is to make sure the tail we just read hasn't already
been overtaken by writers. I suppose this could be put into a nested
loop so that we continue trying again until we get a valid tail.

> if (unlikely(!is_valid(tail)))
> return -EINVAL;
>
> ret = prb_get_entry(iter->rb, tail,
> iter->entry, iter->data_buf_size);
> if (!ret) {
> iter->lpos = tail;
> break;
> }
> }
>
> return 0;
> }
>
> unsigned long next_lpos(unsigned long lpos, struct prb_entry *entry)
> {
> return lpos + sizeof(struct prb_entry) + entry->size;

entry->size already includes sizeof(struct prb_entry) plus alignment
padding. (Again, not important right now.)

> }
>
> /* Try to get next entry using a valid iterator */
> int prb_iter_next_entry(struct prb_iterator *iter)
> {
> iter->lpos = next_lpos(iter->lpos, iter->entry);
>
> return prb_get_entry(rb, lpos, entry, data_buf_size);
> }
>
> /* Try to get the next entry. Allow to skip lost messages. */
> int prb_iter_next_valid_entry(struct prb_iterator *iter)
> {
> int ret;
>
> ret = prb_iter_next_entry(iter);
> if (!ret)
> return 0;
>
> /* Next entry has been lost. Skip to the current tail. */
> return prb_iter_tail_entry(rb, *lpos, entry, data_buf_size);

Your return values are getting mixed up here, and you are not
distinguishing between being overtaken by writers and hitting the end
of the ringbuffer. But I think what you are trying to write is that
prb_iter_next_valid_entry() should either return 0 for success or !0 if
the end of the ringbuffer was hit.

> }
>
>> +static bool is_valid(struct printk_ringbuffer *rb, unsigned long lpos)
>> +{
>> + unsigned long head, tail;
>> +
>> + tail = atomic_long_read(&rb->tail);
>> + head = atomic_long_read(&rb->head);
>> + head -= tail;
>> + lpos -= tail;
>> +
>> + if (lpos >= head)
>> + return false;
>> + return true;
>> +}

I am trying to understand what you think the reader code should look
like. Keep in mind that readers can be overtaken at any moment. They
also need to track entries they have missed. If I understand your idea,
it is something like this (trying to keep it as simple as possible):

void print_all_entries(...)
{
char buf[sizeof(struct prb_entry) + DATASIZE + sizeof(long)];
struct prb_entry *entry = (struct prb_entry *)&buf[0];
struct prb_iterator iter;
u64 last_seq;

prb_iter_init(rb, entry, DATASIZE, &iter);
if (prb_iter_oldest_entry(&iter) != 0)
return; /* ringbuffer empty */

for (;;) {
do_print_entry(entry);
last_seq = entry->seq;
if (prb_iter_next_valid_entry(&iter) != 0)
break; /* ringbuffer empty */
if (entry->seq - last_seq != 1)
print("lost %d\n", ((entry->seq - last_seq) + 1));
}
}

I like the idea of the caller passing in a prb_entry struct. That really
helps to reduce the parameters. And having the functions
prb_iter_oldest_entry() and prb_iter_next_valid_entry() internally loop
until they get a valid result also helps so that the reader doesn't have
to do that. And if the reader doesn't have to track lost entries (for
example, /dev/kmsg), then it becomes even less code.

> IMPORTANT:
>
> Please, do not start rewriting the entire patchset after reading this
> mail. I suggest taking a rest from coding. Just read the feedback,
> ask if anything is unclear, and let your brain process it in the
> background.

Thank you for this tip.

John Ogness

2019-02-19 21:48:36

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 06/25] printk-rb: add blocking reader support

On 2019-02-18, Petr Mladek <[email protected]> wrote:
>> Add a blocking read function for readers. An irq_work function is
>> used to signal the wait queue so that write notification can
>> be triggered from any context.
>
> I would be more precise about what exactly is problematic in which context.
> Something like:
>
> An irq_work function is used because wake_up() cannot be called safely
> from NMI and scheduler context.

OK.

>> Signed-off-by: John Ogness <[email protected]>
>> ---
>> include/linux/printk_ringbuffer.h | 20 ++++++++++++++++
>> lib/printk_ringbuffer.c | 49 +++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 69 insertions(+)
>>
>> diff --git a/include/linux/printk_ringbuffer.h b/include/linux/printk_ringbuffer.h
>> index 5fdaf632c111..106f20ef8b4d 100644
>> --- a/include/linux/printk_ringbuffer.h
>> +++ b/include/linux/printk_ringbuffer.h
>> @@ -2,8 +2,10 @@
>> #ifndef _LINUX_PRINTK_RINGBUFFER_H
>> #define _LINUX_PRINTK_RINGBUFFER_H
>>
>> +#include <linux/irq_work.h>
>> #include <linux/atomic.h>
>> #include <linux/percpu.h>
>> +#include <linux/wait.h>
>>
>> struct prb_cpulock {
>> atomic_t owner;
>> @@ -22,6 +24,10 @@ struct printk_ringbuffer {
>>
>> struct prb_cpulock *cpulock;
>> atomic_t ctx;
>> +
>> + struct wait_queue_head *wq;
>> + atomic_long_t wq_counter;
>> + struct irq_work *wq_work;
>> };
>>
>> struct prb_entry {
>> @@ -59,6 +65,15 @@ struct prb_iterator {
>> #define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr) \
>> static char _##name##_buffer[1 << (szbits)] \
>> __aligned(__alignof__(long)); \
>> +static DECLARE_WAIT_QUEUE_HEAD(_##name##_wait); \
>> +static void _##name##_wake_work_func(struct irq_work *irq_work) \
>> +{ \
>> + wake_up_interruptible_all(&_##name##_wait); \
>> +} \
>
> All ring buffers might share the same generic function, something like:
>
> void prb_wake_readers_work_func(struct irq_work *irq_work)
> {
> struct printk_ringbuffer *rb;
>
> rb = container_of(irq_work, struct printk_ringbuffer, wq_work);
> wake_up_interruptible_all(rb->wq);
> }

Agreed.

>> +static struct irq_work _##name##_wake_work = { \
>> + .func = _##name##_wake_work_func, \
>> + .flags = IRQ_WORK_LAZY, \
>> +}; \
>> static struct printk_ringbuffer name = { \
>> .buffer = &_##name##_buffer[0], \
>> .size_bits = szbits, \
>> diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
>> index 1d1e886a0966..c2ddf4cb9f92 100644
>> --- a/lib/printk_ringbuffer.c
>> +++ b/lib/printk_ringbuffer.c
>> @@ -185,6 +188,12 @@ void prb_commit(struct prb_handle *h)
>> }
>>
>> prb_unlock(rb->cpulock, h->cpu);
>> +
>> + if (changed) {
>> + atomic_long_inc(&rb->wq_counter);
>> + if (wq_has_sleeper(rb->wq))
>> + irq_work_queue(rb->wq_work);
>> + }
>> }
>>
>> /*
>> @@ -437,3 +446,43 @@ int prb_iter_next(struct prb_iterator *iter, char *buf, int size, u64 *seq)
>>
>> return 1;
>> }
>> +
>> +/*
>> + * prb_iter_wait_next: Advance to the next record, blocking if none available.
>> + * @iter: Iterator tracking the current position.
>> + * @buf: A buffer to store the data of the next record. May be NULL.
>> + * @size: The size of @buf. (Ignored if @buf is NULL.)
>> + * @seq: The sequence number of the next record. May be NULL.
>> + *
>> + * If a next record is already available, this function works like
>> + * prb_iter_next(). Otherwise block interruptible until a next record is
>> + * available.
>> + *
>> + * When a next record is available, @iter is advanced and (if specified)
>> + * the data and/or sequence number of that record are provided.
>> + *
>> + * This function might sleep.
>> + *
>> + * Returns 1 if @iter was advanced, -EINVAL if @iter is now invalid, or
>> + * -ERESTARTSYS if interrupted by a signal.
>> + */
>> +int prb_iter_wait_next(struct prb_iterator *iter, char *buf, int size, u64 *seq)
>> +{
>> + unsigned long last_seen;
>> + int ret;
>> +
>> + for (;;) {
>> + last_seen = atomic_long_read(&iter->rb->wq_counter);
>> +
>> + ret = prb_iter_next(iter, buf, size, seq);
>> + if (ret != 0)
>> + break;
>> +
>> + ret = wait_event_interruptible(*iter->rb->wq,
>> + last_seen != atomic_long_read(&iter->rb->wq_counter));
>
> Do we really need yet another counter here?
>
> I think that rb->seq might do the same job. Or if there is a problem
> with atomicity, then rb->head might work as well. Or am I missing
> anything?

You are correct. rb->head would be appropriate.
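
The wait in prb_iter_wait_next() would then become something like this
(untested), and prb_commit() could drop the wq_counter increment while
keeping the wq_has_sleeper()/irq_work_queue() part:

	last_seen = atomic_long_read(&iter->rb->head);

	ret = prb_iter_next(iter, buf, size, seq);
	if (ret != 0)
		break;

	ret = wait_event_interruptible(*iter->rb->wq,
		last_seen != atomic_long_read(&iter->rb->head));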

John Ogness

2019-02-19 22:10:27

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 07/25] printk-rb: add functionality required by printk

On 2019-02-18, Petr Mladek <[email protected]> wrote:
>> The printk subsystem needs to be able to query the size of the ring
>> buffer, seek to specific entries within the ring buffer, and track
>> if records could not be stored in the ring buffer.
>>
>> diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
>> index c2ddf4cb9f92..ce33b5add5a1 100644
>> --- a/lib/printk_ringbuffer.c
>> +++ b/lib/printk_ringbuffer.c
>> @@ -175,11 +175,16 @@ void prb_commit(struct prb_handle *h)
>> head = PRB_WRAP_LPOS(rb, head, 1);
>> continue;
>> }
>> + while (atomic_long_read(&rb->lost)) {
>> + atomic_long_dec(&rb->lost);
>> + rb->seq++;
>
> The lost-messages handling should be in a separate patch.
> At least I do not see any close relation with prb_iter_seek().

Agreed.

> I would personally move prb_iter_seek() to the 5th patch that
> implements the other get/iterator functions.

Agreed.

> On the contrary, the patch adding support for lost messages
> should implement a way to inform the user about lost messages,
> e.g. adding a warning when some space becomes available again.

The readers will see that messages were lost. I think that is enough. I
don't know how useful it would be to notify writers that space is
available. The writers are holding the prb_cpulock, so they definitely
shouldn't be waiting around for anything.

This situation should be quite rare because it means the _entire_ ring
buffer was filled up by an NMI context that interrupted a context that
was in the reserve/commit window. NMI contexts probably should not be
doing _so_ much printk'ing within a single NMI.

>> + }
>> e->seq = ++rb->seq;
>> head += e->size;
>> changed = true;
>> }
>> atomic_long_set_release(&rb->head, res);
>> +
>> atomic_dec(&rb->ctx);
>>
>> if (atomic_long_read(&rb->reserve) == res)
>> @@ -486,3 +491,93 @@ int prb_iter_wait_next(struct prb_iterator *iter, char *buf, int size, u64 *seq)
>>
>> return ret;
>> }
>> +
>> +/*
>> + * prb_iter_seek: Seek forward to a specific record.
>> + * @iter: Iterator to advance.
>> + * @seq: Record number to advance to.
>> + *
>> + * Advance @iter such that a following call to prb_iter_data() will provide
>> + * the contents of the specified record. If a record is specified that does
>> + * not yet exist, advance @iter to the end of the record list.
>> + *
>> + * Note that iterators cannot be rewound. So if a record is requested that
>> + * exists but is previous to @iter in position, @iter is considered invalid.
>> + *
>> + * It is safe to call this function from any context and state.
>> + *
>> + * Returns 1 on success, 0 if specified record does not yet exist (@iter is
>> + * now at the end of the list), or -EINVAL if @iter is now invalid.
>> + */
>
> Do we really need to distinguish when the iterator is invalid and when
> we are at the end of the buffer?

Sure! There is a big difference between "stop iterating because we hit the
newest entry" and "reset the iterator to the oldest entry because we
were overtaken by a writer".

> It seems that the reaction in both situations is always to call
> prb_iter_init(&iter, &printk_rb, &some_seq).

prb_iter_init() is only called to reset the iterator to the oldest
entry. That's all it is really doing. The fact that it can optionally
return a sequence number is just a convenience side-effect implemented
for some printk demands.

> I am still a bit
> confused about what your prb_iter_init() does, and therefore I am
> not sure what it is supposed to do.
>
> Anyway, it seems to be typically used when you need to start
> from the tail. I would personally do something like this (based on my
> code in reply to the 5th patch):
>
> int prb_iter_seek_to_seq(struct prb_iterator *iter, u64 seq)
> {
> int ret;
>
> ret = prb_iter_tail_entry(iter);
>
> while (!ret && iter->entry->seq != seq)
> ret = prb_iter_next_valid_entry(iter);
>
> return ret;
> }

Yes. Moving the loops inside prb_iter_tail_entry() and
prb_iter_next_valid_entry() definitely simplifies the code.

> Looking at it, I would make the functionality more obvious
> by renaming:
>
> prb_iter_tail_entry() -> prb_iter_set_tail_entry()

I would say: prb_iter_set_oldest_entry()

>> +int prb_iter_seek(struct prb_iterator *iter, u64 seq)
>> +{
>> + u64 cur_seq;
>> + int ret;
>> +
>> + /* first check if the iterator is already at the wanted seq */
>> + if (seq == 0) {
>> + if (iter->lpos == PRB_INIT)
>> + return 1;
>> + else
>> + return -EINVAL;
>> + }
>> + if (iter->lpos != PRB_INIT) {
>> + if (prb_iter_data(iter, NULL, 0, &cur_seq) >= 0) {
>> + if (cur_seq == seq)
>> + return 1;
>> + if (cur_seq > seq)
>> + return -EINVAL;
>> + }
>> + }
>> +
>> + /* iterate to find the wanted seq */
>> + for (;;) {
>> + ret = prb_iter_next(iter, NULL, 0, &cur_seq);
>> + if (ret <= 0)
>> + break;
>> +
>> + if (cur_seq == seq)
>> + break;
>> +
>> + if (cur_seq > seq) {
>> + ret = -EINVAL;
>> + break;
>> + }
>
> This is a good example of why prb_iter_data() and prb_iter_next() are
> a weird interface. You need to read the documentation very carefully
> to understand the difference (functionality, error codes). At least
> for me, it is far from obvious why they are used this way and
> if it is correct.

Agreed. I prefer your suggested API. It significantly simplifies the
reader code, which, as you'll see in later printk.c patches, is
everywhere.

John Ogness

2019-02-20 09:02:35

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 10/25] printk: redirect emit/store to new ringbuffer

On Tue 2019-02-12 15:29:48, John Ogness wrote:
> vprintk_emit and vprintk_store are the main functions that all printk
> variants eventually go through. Change these to store the message in
> the new printk ring buffer that the printk kthread is reading.

We need to switch the two buffers in a single commit
without disabling important functionality.

In other words, we need to change vprintk_emit(), vprintk_store(),
console_unlock(), syslog(), and devkmsg() in one patch.

The only exception might be continuous-line handling. We might
temporarily store them right away. It should not break bisectability.

The patch will be huge but I do not see another reasonable solution
at the moment. In any case, the patch should do only
a "straightforward" switch. Any refactoring or logical
changes should be done in preliminary patches.


> Remove functions no longer in use because of the changes to
> vprintk_emit and vprintk_store.
>
> In order to handle interrupts and NMIs, a second per-cpu ring buffer
> (sprint_rb) is added. This ring buffer is used for NMI-safe memory
> allocation in order to format the printk messages.
>
> NOTE: LOG_CONT is ignored for now and handled as individual messages.
> LOG_CONT functions are masked behind "#if 0" blocks until their
> functionality can be restored.
>
> Signed-off-by: John Ogness <[email protected]>
> ---
> kernel/printk/printk.c | 319 ++++++++-----------------------------------------
> 1 file changed, 51 insertions(+), 268 deletions(-)
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 5a5a685bb128..b6a6f1002741 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -584,54 +500,36 @@ static int log_store(int facility, int level,
> const char *text, u16 text_len)

> memcpy(log_dict(msg), dict, dict_len);
> msg->dict_len = dict_len;
> msg->facility = facility;
> msg->level = level & 7;
> msg->flags = flags & 0x1f;

The existing struct printk_log is stored into the data field
of struct prb_entry. This is because printk_ringbuffer is supposed
to be a generic ring buffer.

It makes the code more complicated. Also it needs more space for
the size and seq items from struct prb_entry.

printk() is already very complicated code. We should not make
it unnecessarily worse.

Please, are there any candidates or plans to reuse the new ring
buffer implementation? For example, would it be usable
for ftrace? Steven?

If not, I would prefer to make it printk-specific
and hopefully simplify the code a bit.


> - if (ts_nsec > 0)
> - msg->ts_nsec = ts_nsec;
> - else
> - msg->ts_nsec = local_clock();
> - memset(log_dict(msg) + dict_len, 0, pad_len);
> + msg->ts_nsec = ts_nsec;
> msg->len = size;
>
> /* insert message */
> - log_next_idx += msg->len;
> - log_next_seq++;
> + prb_commit(&h);
>
> return msg->text_len;
> }

[...]

> int vprintk_store(int facility, int level,
> const char *dict, size_t dictlen,
> const char *fmt, va_list args)
> {
> - static char textbuf[LOG_LINE_MAX];
> - char *text = textbuf;
> - size_t text_len;
> + return vprintk_emit(facility, level, dict, dictlen, fmt, args);
> +}
> +
> +/* ring buffer used as memory allocator for temporary sprint buffers */
> +DECLARE_STATIC_PRINTKRB(sprint_rb,
> + ilog2(PRINTK_RECORD_MAX + sizeof(struct prb_entry) +
> + sizeof(long)) + 2, &printk_cpulock);
> +
> +asmlinkage int vprintk_emit(int facility, int level,
> + const char *dict, size_t dictlen,
> + const char *fmt, va_list args)

[...]

> + rbuf = prb_reserve(&h, &sprint_rb, PRINTK_SPRINT_MAX);

The second ring buffer for temporary buffers is a really interesting
idea.

Well, it brings up some questions. For example, how many users might
need a reservation in parallel? Or whether the nested use might cause
some problems if we decide to use a printk-specific ring buffer
implementation. I still have to think about it.


> - /* If called from the scheduler, we can not call up(). */
> - if (!in_sched && pending_output) {
> - /*
> - * Disable preemption to avoid being preempted while holding
> - * console_sem which would prevent anyone from printing to
> - * console
> - */
> - preempt_disable();
> - /*
> - * Try to acquire and then immediately release the console
> - * semaphore. The release will print out buffers and wake up
> - * /dev/kmsg and syslog() users.
> - */
> - if (console_trylock_spinning())
> - console_unlock();
> - preempt_enable();
> - }

I guess that it is clear from the other mails. But to be sure:
this patch should just switch the buffers. The console handling
optimizations/fixes should be done in later patches or even
in a separate patchset.

Best Regards,
Petr

PS: I have just finished a mail that I started writing yesterday
evening. I am going to process some other pending mails now. I'll
come back to this thread soon.

2019-02-20 21:27:25

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 10/25] printk: redirect emit/store to new ringbuffer

On 2019-02-20, Petr Mladek <[email protected]> wrote:
>> vprintk_emit and vprintk_store are the main functions that all printk
>> variants eventually go through. Change these to store the message in
>> the new printk ring buffer that the printk kthread is reading.
>
> We need to switch the two buffers in a single commit
> without disabling important functionality.
>
> In other words, we need to change vprintk_emit(), vprintk_store(),
> console_unlock(), syslog(), and devkmsg() in one patch.

Agreed. But for the review process I expect it makes things much easier
to change them one at a time. Patch-squashing is not a problem once all
the individual patches have been ack'd.

>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index 5a5a685bb128..b6a6f1002741 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -584,54 +500,36 @@ static int log_store(int facility, int level,
>> const char *text, u16 text_len)
>
>> memcpy(log_dict(msg), dict, dict_len);
>> msg->dict_len = dict_len;
>> msg->facility = facility;
>> msg->level = level & 7;
>> msg->flags = flags & 0x1f;
>
> The existing struct printk_log is stored into the data field
> of struct prb_entry. This is because printk_ringbuffer is supposed
> to be a generic ring buffer.

Yes.

> It makes the code more complicated. Also it needs more space for
> the size and seq items from struct prb_entry.
>
> printk() is already very complicated code. We should not make
> it unnecessarily worse.

In my opinion it makes things considerably easier. My experience with
the printk code is that it is so complicated because it mixes
printk features with ring buffer handling code. By providing a strict
API (and hiding the details) of the ring buffer, the implementation of
the printk features became pretty straightforward.

Now I will admit that the ring buffer API I proposed is not easy to
digest. Mostly because I leave a lot of work up to the readers and have
lots of arguments. Your proposed changes of passing a struct and moving
loops under the ring buffer API should provide some major
simplification.

> Please, are there any candidates or plans to reuse the new ring
> buffer implementation?

As you pointed out below, this patch already uses the ring buffer
implementation for a totally different purpose: NMI-safe dynamic memory
allocation.

> For example, would it be usable for ftrace? Steven?
>
> If not, I would prefer to make it printk-specific
> and hopefully simplify the code a bit.
>
>
>> - if (ts_nsec > 0)
>> - msg->ts_nsec = ts_nsec;
>> - else
>> - msg->ts_nsec = local_clock();
>> - memset(log_dict(msg) + dict_len, 0, pad_len);
>> + msg->ts_nsec = ts_nsec;
>> msg->len = size;
>>
>> /* insert message */
>> - log_next_idx += msg->len;
>> - log_next_seq++;
>> + prb_commit(&h);
>>
>> return msg->text_len;
>> }
>
> [...]
>
>> int vprintk_store(int facility, int level,
>> const char *dict, size_t dictlen,
>> const char *fmt, va_list args)
>> {
>> - static char textbuf[LOG_LINE_MAX];
>> - char *text = textbuf;
>> - size_t text_len;
>> + return vprintk_emit(facility, level, dict, dictlen, fmt, args);
>> +}
>> +
>> +/* ring buffer used as memory allocator for temporary sprint buffers */
>> +DECLARE_STATIC_PRINTKRB(sprint_rb,
>> + ilog2(PRINTK_RECORD_MAX + sizeof(struct prb_entry) +
>> + sizeof(long)) + 2, &printk_cpulock);
>> +
>> +asmlinkage int vprintk_emit(int facility, int level,
>> + const char *dict, size_t dictlen,
>> + const char *fmt, va_list args)
>
> [...]
>
>> + rbuf = prb_reserve(&h, &sprint_rb, PRINTK_SPRINT_MAX);
>
> The second ring buffer for temporary buffers is a really interesting
> idea.
>
> Well, it brings up some questions. For example, how many users might
> need a reservation in parallel? Or whether the nested use might cause
> some problems if we decide to use a printk-specific ring buffer
> implementation. I still have to think about it.

Keep in mind that it is only used by the writers, which hold the
prb_cpulock. Typically there would be at most 2 users: a non-NMI writer
that was interrupted during the reserve/commit window and the
interrupting NMI that does printk. The only exception would be if the
printk code itself triggers a BUG_ON or WARN_ON within the
reserve/commit window. Then you will have an additional user per
recursion level.
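
As a rough sizing check (my reading of the DECLARE_STATIC_PRINTKRB()
arguments above, not something stated in the patch): the backing buffer
is 1 << (ilog2(record) + 2) bytes, where record is PRINTK_RECORD_MAX +
sizeof(struct prb_entry) + sizeof(long). Since ilog2() rounds down,
that leaves between two and four maximal records of headroom, which
matches the handful of parallel users described above.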

John Ogness

2019-02-21 13:53:32

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 04/25] printk-rb: add writer interface

On Sun 2019-02-17 02:32:22, John Ogness wrote:
> Hi Petr,
>
> I've made changes to the patch that hopefully align with what you are
> looking for. I would appreciate it if you could go over it and see if
> the changes are in the right direction. And if so, you should decide
> whether I should make these kinds of changes for the whole series and
> submit a v2 before you continue with the review.
>
> The list of changes:
>
> - Added comments everywhere I think they could be useful. Is it too
> much?

Some comments probably can get shortened. But I personally find
them really helpful.

I am not going to do a detailed review of this variant at the moment.
I would like to finish the review of the entire patchset first.

> - I tried moving calc_next() into prb_reserve(), but it was pure
> insanity. I played with refactoring for a while until I found
> something that I think looks nice. I moved the implementation of
> calc_next() along with its containing loop into a new function
> find_res_ptrs(). This function does what calc_next() and push_tail()
> did. With this solution, I think prb_reserve() looks pretty
> clean. However, the optimization of communicating about the wrap is
> gone. So even though find_res_ptrs() knew if a wrap occurred,
> prb_reserve() figures it out again for itself. If we want the
> optimization, I still think the best approach is the -1,0,1 return
> value of find_res_ptrs().

I still have to go more deeply into it. Anyway, the new code looks
much better than the previous one.

Best Regards,
Petr

2019-02-21 16:24:03

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 05/25] printk-rb: add basic non-blocking reading interface

On Tue 2019-02-19 22:44:07, John Ogness wrote:
> Hi Petr,
>
> Below I make several comments, responding to your questions. But I like
> the new API I believe you are trying to propose. So really only my final
> comments are of particular importance. There I show you what I think
> reader code would look like using your proposed API.
>
> On 2019-02-18, Petr Mladek <[email protected]> wrote:
> >> + * prb_iter_init: Initialize an iterator for a ring buffer.
> >> + * @iter: The iterator to initialize.
> >> + * @rb: A ring buffer to that @iter should iterate.
> >> + * @seq: The sequence number of the position preceding the first record.
> >> + * May be NULL.
> >> + *
> >> + * Initialize an iterator to be used with a specified ring buffer. If @seq
> >> + * is non-NULL, it will be set such that prb_iter_next() will provide a
> >> + * sequence value of "@seq + 1" if no records were missed.
> >> + *
> >> + * It is safe to call this function from any context and state.
> >> + */
> >> +void prb_iter_init(struct prb_iterator *iter, struct printk_ringbuffer *rb,
> >> + u64 *seq)
> >> +{
> >> + memset(iter, 0, sizeof(*iter));
> >> + iter->rb = rb;
> >> + iter->lpos = PRB_INIT;
> >> +
> >> + if (!seq)
> >> + return;
> >> +
> >> + for (;;) {
> >> + struct prb_iterator tmp_iter;
> >> + int ret;
> >> +
> >> + prb_iter_copy(&tmp_iter, iter);
> >
> > It looks strange to copy something that has barely been initialized.
> > I hope that we could do this without a copy, see below.
> >
> >> +
> >> + ret = prb_iter_next(&tmp_iter, NULL, 0, seq);
> >
> > prb_iter_next() and prb_iter_data() are too complex spaghetti
> > code. They do basically the same thing but they do not share
> > any helper function. The error handling is different
> > which is really confusing. See below.
>
> I don't follow why you think they do basically the same
> thing. prb_iter_next() moves forward to the next entry, then calls
> prb_iter_data() to retrieve the data. prb_iter_data() _is_ the helper
> function.

Ah, I missed the prb_iter_data() call in prb_iter_next(). I have to
admit that I did not have the courage to check them carefully. Both
functions looked like a compact maze of to_entry(), is_valid() and
smp_rmb() calls ;-)

I got only the basic idea and started thinking about how
to achieve the same in an easier-to-understand way.

> > Well, I think that we need to start with more low-level functions.
> > IMHO, we need something to read one entry in a safe way. Then it will
> > be much easier to live with races in the rest of the code:
> >
> > /*
> > * Return valid entry on the given lpos. Data are read
> > * only when the buffer size is not zero.
> > */
> > int prb_get_entry(struct printk_ringbuffer *rb,
> > unsigned long lpos,
> > struct prb_entry *entry,
> > unsigned int data_buf_size)
> > {
> > /*
> > * Pointer to the ring buffer. The data might get lost
> > * at any time.
> > */
> > struct prb_entry *weak_entry;
> >
> > if (!is_valid(lpos))
> > return -EINVAL;
> >
> > /* Make sure that data are valid for the given valid lpos. */
> > smp_rmb();
> >
> > weak_entry = to_entry(lpos);
> > entry->seq = weak_entry->seq;
> >
> > if (data_buf_size) {
> > unsigned int size;
> >
> > size = min(data_buf_size, weak_entry->size);
>
> weak_entry->size is untrusted data here. The following memcpy could grab
> data beyond the data array. (But we can ignore these details for now. I
> realize you are trying to refactor, not focus on these details.)

Great catch! Yes, we would need to check the overflow here.

> > memcpy(entry->data, weak_entry->data, size);
> > entry->size = size;
> > } else {
> > entry->size = weak_entry->size;
> > }
> >
> > /* Make sure that the copy is done before we check validity. */
> > smp_mb();
> >
> > return is_valid(lpos);
> > }
> >
> > Then I would do the support for iterating the following way.
> > First, I would extend the structure:
> >
> > struct prb_iterator {
> > struct printk_ringbuffer *rb;
> > struct prb_entry *entry;
> > unsigned int data_buffer_size;
> > unsigned long lpos;
> > };
> >
> > And do something like:
> >
> > void prb_iter_init(struct printk_ringbuffer *rb,
> > struct prb_entry *entry,
> > unsigned int data_buffer_size,
> > struct prb_iterator *iter)
> > {
> > iter->rb = rb;
> > iter->entry = entry;
> > iter->data_buffer_size = data_buffer_size;
> > iter->lpos = 0UL;
> > }
> >
> > Then we could do iterator support the following way:
> >
> > /* Start iteration with reading the tail entry. */
> > int prb_iter_tail_entry(struct prb_iterator *iter);
>
> A name like prb_iter_oldest_entry() might simplify things. I really
> don't want the caller to be concerned with heads and tails and which is
> which. That's an implementation detail of the ringbuffer.

Makes sense.

> > {
> > unsigned long tail;
> > int ret;
> >
> > for (;;) {
> > tail = atomic_long_read(&rb->tail);
> >
> > /* Ring buffer is empty? */
>
> The ring buffer is only empty at the beginning (when tail == head). Readers
> are non-consuming, so it is never empty again once an entry is committed.

> if (unlikely(atomic_long_read(&rb->head) == atomic_long_read(&rb->tail)))
> return -EINVAL;

Yes, this is a check for an empty buffer. And yes, it can be done
outside the loop.

> The check for valid is to make sure the tail we just read hasn't already
> been overtaken by writers. I suppose this could be put into a nested
> loop so that we continue trying again until we get a valid tail.
>
> > if (unlikely(!is_valid(tail)))
> > return -EINVAL;

Heh, I wanted to add a check for the empty buffer (the comment was
correct), but I added an is_valid() check instead.

We could remove it; it is hidden in prb_get_entry(). The for loop
will get repeated when it fails.

> >
> > ret = prb_get_entry(iter->rb, tail,
> > iter->entry, iter->data_buf_size);
> > if (!ret) {
> > iter->lpos = tail;
> > break;
> > }
> > }
> >
> > return 0;
> > }
> >
> > unsigned long next_lpos(unsigned long lpos, struct prb_entry *entry)
> > {
> > return lpos + sizeof(struct prb_entry) + entry->size;
>
> entry->size already includes sizeof(struct prb_entry) plus alignment
> padding. (Again, not important right now.)

I see. I have missed it.

> > }
> >
> > /* Try to get next entry using a valid iterator */
> > int prb_iter_next_entry(struct prb_iterator *iter)
> > {
> > iter->lpos = next_lpos(iter->lpos, iter->entry);
> >
> > return prb_get_entry(rb, lpos, entry, data_buf_size);
> > }
> >
> > /* Try to get the next entry. Allow to skip lost messages. */
> > int prb_iter_next_valid_entry(struct prb_iterator *iter)
> > {
> > int ret;
> >
> > ret = prb_iter_next_entry(iter);
> > if (!ret)
> > return 0;
> >
> > /* Next entry has been lost. Skip to the current tail. */
> > return prb_iter_tail_entry(rb, *lpos, entry, data_buf_size);
>
> Your return values are getting mixed up here, and you are not
> distinguishing between being overtaken by writers and hitting the end
> of the ringbuffer. But I think what you are trying to write is that
> prb_iter_next_valid_entry() should either return 0 for success or !0 if
> the end of the ringbuffer was hit.

The -1,0,1 return values were hard to follow. It might be improved
by using -EINVAL,0,-EAGAIN or so. But I hope that the two return
values are even better and will be enough.

First, I proposed two functions to distinguish the two situations:

prb_iter_next_entry() - fails when the next entry gets
overtaken or at the end of the buffer
prb_iter_next_valid_entry() - fails only at the end of the buffer

Second, the lost messages might get counted by comparing seq numbers.
You actually used this in the sample code below.


> > }
> >
> >> +static bool is_valid(struct printk_ringbuffer *rb, unsigned long lpos)
> >> +{
> >> + unsigned long head, tail;
> >> +
> >> + tail = atomic_long_read(&rb->tail);
> >> + head = atomic_long_read(&rb->head);
> >> + head -= tail;
> >> + lpos -= tail;
> >> +
> >> + if (lpos >= head)
> >> + return false;
> >> + return true;
> >> +}
>
> I am trying to understand what you think the reader code should look
> like. Keep in mind that readers can be overtaken at any moment. They
> also need to track entries they have missed. If I understand your idea,
> it is something like this (trying to keep it as simple as possible):
>
> void print_all_entries(...)
> {
> char buf[sizeof(struct prb_entry) + DATASIZE + sizeof(long)];
> struct prb_entry *entry = (struct prb_entry *)&buf[0];
> struct prb_iterator iter;
> u64 last_seq;
>
> prb_iter_init(rb, entry, DATASIZE, &iter);
> if (prb_iter_oldest_entry(&iter) != 0)
> return; /* ringbuffer empty */
>
> for (;;) {
> do_print_entry(entry);
> last_seq = entry->seq;
> if (prb_iter_next_valid_entry(&iter) != 0)
> break; /* ringbuffer empty */
> if (entry->seq - last_seq != 1)
> print("lost %d\n", ((entry->seq - last_seq) + 1));
> }
> }

Yes, this was my idea.

I wonder if we might allow omitting the prb_iter_oldest_entry() call.
In fact, prb_iter_next_valid_entry() restarts from the oldest
entry when the expected one gets lost. But it might cause more
harm than good. I need to see it in the various existing use cases.

Anyway, the data_buf_size parameter is supposed to allow reading only
the metadata. This is useful in situations where you just need
to find which entries would fit into the given buffer. For example,
the first loop in syslog_print_all().
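
For example, such a counting pass might look like this (a sketch
against the API proposed above, with data_buf_size == 0 so that no
record data is copied):

	struct prb_entry entry;	/* header only, no data copied */
	struct prb_iterator iter;
	size_t total = 0;

	prb_iter_init(rb, &entry, 0, &iter);
	if (prb_iter_oldest_entry(&iter) == 0) {
		do {
			total += entry.size;	/* decide what still fits */
		} while (prb_iter_next_valid_entry(&iter) == 0);
	}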

> I like the idea of the caller passing in a prb_entry struct. That really
> helps to reduce the parameters. And having the functions
> prb_iter_oldest_entry() and prb_iter_next_valid_entry() internally loop
> until they get a valid result also helps so that the reader doesn't have
> to do that. And if the reader doesn't have to track lost entries (for
> example, /dev/kmsg), then it becomes even less code.

I am glad to read this.

Best Regards,
Petr

2019-02-22 09:59:15

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 07/25] printk-rb: add functionality required by printk

On Tue 2019-02-19 23:08:20, John Ogness wrote:
> On 2019-02-18, Petr Mladek <[email protected]> wrote:
> >> The printk subsystem needs to be able to query the size of the ring
> >> buffer, seek to specific entries within the ring buffer, and track
> >> if records could not be stored in the ring buffer.
> >>
> >> diff --git a/lib/printk_ringbuffer.c b/lib/printk_ringbuffer.c
> >> index c2ddf4cb9f92..ce33b5add5a1 100644
> >> --- a/lib/printk_ringbuffer.c
> >> +++ b/lib/printk_ringbuffer.c
> >> @@ -175,11 +175,16 @@ void prb_commit(struct prb_handle *h)
> >> head = PRB_WRAP_LPOS(rb, head, 1);
> >> continue;
> >> }
> >> + while (atomic_long_read(&rb->lost)) {
> >> + atomic_long_dec(&rb->lost);
> >> + rb->seq++;
>
> > On the contrary, the patch adding support for lost messages
> > should implement a way how to inform the user about lost messages.
> > E.g. to add a warning when some space becomes available again.
>
> The readers will see that messages were lost. I think that is enough. I
> don't know how useful it would be to notify writers that space is
> available. The writers are holding the prb_cpulock, so they definitely
> shouldn't be waiting around for anything.

I see your intention. Well, it forces all readers to implement the
check and write a message. It might be fine if the code can be shared.

My original idea was the following: if any later writer succeeds
in reserving space for its own message, it would try to reserve
space also for the warning. If it succeeds, it would just
write the warning there (like a nested context or so). Then
all readers would get the warning for free.

But you inspired me to another idea. We could handle this in
the krb_iter_get_next_entry() calls. They could fill
the given buffer with the warning message when they
detect missed messages. The real message might get
added into the same buffer. Or we might add a flag
into struct prb_iter so that the reader would need
to call krb_iter_get_next_entry() again to get the real
message on the current lpos.

Both solutions allow getting the warnings transparently.
There would be no duplicated extra code. Also, all readers
would handle it consistently.

But there is a difference:

If we store the warning into the ring buffer directly,
then we do not need to store the seq number. I mean
that we would not need to bump seq when a reservation fails.
The amount of lost messages is handled by another
counter anyway.

On the other hand, using the fake messages would allow
transparently handling even the messages lost by readers.
I mean that krb_iter_get_next_valid_entry() might fill
the given buffer with a warning when the next message
was overwritten and it had to reset lpos to the tail.

I am not sure off the top of my head which is better. I would need
to play with the code.
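
Roughly, the second variant might look like this (last_seq, msg_lpos,
and fill_warning_entry() are made up for the sketch):

int krb_iter_get_next_valid_entry(struct prb_iterator *iter)
{
	u64 lost;

	if (prb_iter_next_valid_entry(iter) != 0)
		return 1;		/* end of the ring buffer */

	lost = iter->entry->seq - iter->last_seq - 1;
	if (lost) {
		/*
		 * A gap was detected. Hand out a synthesized
		 * "lost N messages" record instead of the real
		 * message, rewind the iterator, and return the
		 * real message on the next call.
		 */
		iter->lpos = iter->msg_lpos;
		iter->last_seq = iter->entry->seq - 1;
		fill_warning_entry(iter->entry, lost);
		return 0;
	}

	iter->last_seq = iter->entry->seq;
	return 0;
}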


> This situation should be quite rare because it means the _entire_ ring
> buffer was filled up by an NMI context that interrupted a context that
> was in the reserve/commit window. NMI contexts probably should not be
> doing _so_ much printk'ing within a single NMI.

Sure.

Best Regards,
Petr

2019-02-22 10:38:10

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 11/25] printk_safe: remove printk safe code

On Tue 2019-02-12 15:29:49, John Ogness wrote:
> vprintk variants are now NMI-safe so there is no longer a need for
> the "safe" calls.
>
> NOTE: This also removes printk flushing functionality.
>
> Signed-off-by: John Ogness <[email protected]>
> ---
> include/linux/hardirq.h | 2 -
> include/linux/printk.h | 27 ---
> init/main.c | 1 -
> kernel/kexec_core.c | 1 -
> kernel/panic.c | 3 -
> kernel/printk/Makefile | 1 -
> kernel/printk/internal.h | 30 +---
> kernel/printk/printk.c | 13 +-
> kernel/printk/printk_safe.c | 427 --------------------------------------------
> kernel/trace/trace.c | 2 -
> lib/nmi_backtrace.c | 6 -
> 11 files changed, 7 insertions(+), 506 deletions(-)
> delete mode 100644 kernel/printk/printk_safe.c

From my POV, this is the primary selling argument for the new
ring buffer.


> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index b6a6f1002741..073ff9fd6872 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1752,6 +1745,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> }
> EXPORT_SYMBOL(vprintk_emit);
>
> +__printf(1, 0) int vprintk_func(const char *fmt, va_list args)
> +{
> + return vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args);
> +}

All vprintk_func() calls should get replaced with vprintk_default().
It includes a crazy hack that allows kdb to reuse kernel code
that calls printk().

> asmlinkage int vprintk(const char *fmt, va_list args)
> {
> return vprintk_func(fmt, args);
> diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
> index 15ca78e1c7d4..77bf84987cda 100644
> --- a/lib/nmi_backtrace.c
> +++ b/lib/nmi_backtrace.c
> @@ -75,12 +75,6 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
> touch_softlockup_watchdog();
> }
>
> - /*
> - * Force flush any remote buffers that might be stuck in IRQ context
> - * and therefore could not run their irq_work.
> - */
> - printk_safe_flush();
> -
> clear_bit_unlock(0, &backtrace_flag);
> put_cpu();
> }

This reminds me that we need to add back the locking that was
removed in the commit 03fc7f9c99c1e7ae2925d45 ("printk/nmi:
Prevent deadlock when accessing the main log buffer in NMI").

Otherwise, backtraces from different CPUs would get mixed.

We need to add this before redirecting printk() to
the new ring buffer.

Best Regards,
Petr

2019-02-22 13:39:19

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 11/25] printk_safe: remove printk safe code

On 2019-02-22, Petr Mladek <[email protected]> wrote:
>> diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
>> index 15ca78e1c7d4..77bf84987cda 100644
>> --- a/lib/nmi_backtrace.c
>> +++ b/lib/nmi_backtrace.c
>> @@ -75,12 +75,6 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
>> touch_softlockup_watchdog();
>> }
>>
>> - /*
>> - * Force flush any remote buffers that might be stuck in IRQ context
>> - * and therefore could not run their irq_work.
>> - */
>> - printk_safe_flush();
>> -
>> clear_bit_unlock(0, &backtrace_flag);
>> put_cpu();
>> }
>
> This reminds me that we need to add back the locking that was
> removed in the commit 03fc7f9c99c1e7ae2925d45 ("printk/nmi:
> Prevent deadlock when accessing the main log buffer in NMI").

No, that commit is needed. You cannot have NMIs waiting on other CPUs.

> Otherwise, backtraces from different CPUs would get mixed.

A later patch (#17) adds CPU IDs to the printk messages so that this
isn't a problem. (That patch is actually obsolete now because Sergey has
already merged work for linux-next that includes this information.)

John Ogness

2019-02-22 14:45:00

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 10/25] printk: redirect emit/store to new ringbuffer

On Wed 2019-02-20 22:25:00, John Ogness wrote:
> On 2019-02-20, Petr Mladek <[email protected]> wrote:
> >> vprintk_emit and vprintk_store are the main functions that all printk
> >> variants eventually go through. Change these to store the message in
> >> the new printk ring buffer that the printk kthread is reading.
> >
> > We need to switch the two buffers in a single commit
> > without disabling important functionality.
> >
> > By other words, we need to change vprintk_emit(), vprintk_store(),
> > console_unlock(), syslog(), devkmsg(), and syslog in one patch.
>
> Agreed. But for the review process I expect it makes things much easier
> to change them one at a time. Patch-squashing is not a problem once all
> the individuals have been ack'd.

Good point. I would personally prefer to keep it in a single patch
even for review.

It would help me to see what the different readers have in common
and what can get optimized. Anyway, we should prepare the patch
in a way that can also be understood by people reading the git history.


> >> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> >> index 5a5a685bb128..b6a6f1002741 100644
> >> --- a/kernel/printk/printk.c
> >> +++ b/kernel/printk/printk.c
> >> @@ -584,54 +500,36 @@ static int log_store(int facility, int level,
> >> const char *text, u16 text_len)
> >
> >> memcpy(log_dict(msg), dict, dict_len);
> >> msg->dict_len = dict_len;
> >> msg->facility = facility;
> >> msg->level = level & 7;
> >> msg->flags = flags & 0x1f;
> >
> > The existing struct printk_log is stored into the data field
> > of struct prb_entry. It is because printk_ring_buffer is supposed
> > to be a generic ring buffer.
>
> Yes.
>
> > It makes the code more complicated. Also it needs more space for
> > the size and seq items from struct prb_entry.
> >
> > printk() is already very complicated code. We should not make
> > it unnecessarily worse.
>
> In my opinion it makes things considerably easier. My experience with
> printk-code is that it is so complicated because it is mixing
> printk-features with ring buffer handling code. By providing a strict
> API (and hiding the details) of the ring buffer, the implementation of
> the printk-features became pretty straight forward.

It sounds reasonable. Well, the separation is not completely clear.
We have three layers:

+ struct prb_entry: ring buffer metadata
+ struct printk_log: message metadata
+ text, dict: message strings

The sequence number is part of prb_entry, but it is also an important
part of the printk logic.

Also it does not make sense to read text and dict when we just
calculate the space taken by the messages.

That said, I agree that printk code is complicated and we should
do better. This patchset goes in the right direction.

I personally hate the following things:

+ There are too many global values. Many of them are related,
e.g. first_idx, next_idx, first_seq, next_seq. They should
be in some structures.

+ Too many variables are passed by parameters. They should be
in some structures as well.

+ The name of struct printk_log is really confusing. It
should have been printk_msg or so.

+ The continuation-lines buffer makes the buffer-related code
much more complicated.

+ Especially the console-related code is full of hacks.

printk code was not really maintained for most of its history.
Random people just fixed/extended the code for their needs.

> > Please, are there any candidates or plans to reuse the new ring
> > buffer implementation?
>
> As you pointed out below, this patch already uses the ring buffer
> implementation for a totally different purpose: NMI safe dynamic memory
> allocation.

I am not sure that this 2nd usage is worth it, see below.

> >> int vprintk_store(int facility, int level,
> >> const char *dict, size_t dictlen,
> >> const char *fmt, va_list args)
> >> {
> >> - static char textbuf[LOG_LINE_MAX];
> >> - char *text = textbuf;
> >> - size_t text_len;
> >> + return vprintk_emit(facility, level, dict, dictlen, fmt, args);
> >> +}
> >> +
> >> +/* ring buffer used as memory allocator for temporary sprint buffers */
> >> +DECLARE_STATIC_PRINTKRB(sprint_rb,
> >> + ilog2(PRINTK_RECORD_MAX + sizeof(struct prb_entry) +
> >> + sizeof(long)) + 2, &printk_cpulock);
> >> +
> >> +asmlinkage int vprintk_emit(int facility, int level,
> >> + const char *dict, size_t dictlen,
> >> + const char *fmt, va_list args)
> >
> > [...]
> >
> >> + rbuf = prb_reserve(&h, &sprint_rb, PRINTK_SPRINT_MAX);
> >
> > The second ring buffer for temporary buffers is really interesting
> > idea.
> >
> > Well, it brings some questions. For example, how many users might
> > need a reservation in parallel. Or if the nested use might cause
> > some problems when we decide to use printk-specific ring buffer
> > implementation. I still have to think about it.
>
> Keep in mind that it is only used by the writers, which have the
> prb_cpulock. Typically there would only be 2 max users: a non-NMI writer
> that was interrupted during the reserve/commit window and the
> interrupting NMI that does printk. The only exception would be if the
> printk-code code itself triggers a BUG_ON or WARN_ON within the
> reserve/commit window. Then you will have an additional user per
> recursion level.

I am not sure it is worth calling the ring buffer machinery just
to handle 2-3 buffers.

Well, it might be just my mental block. We need to be really
careful to avoid infinite recursion when storing messages
into the log buffer. The nested reserve/commit calls provoke
my brain to spin around. It is possible that I would love
this idea once my brain stops spinning ;-)
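
For comparison, the hand-rolled variant might be something like this
(all names are made up; this_cpu_inc_return() is a single NMI-safe
operation on the local CPU, and the prb_cpulock already prevents
migration):

#define SPRINT_MAX_NESTING	4	/* task + irq + NMI + one spare */

static DEFINE_PER_CPU(char, sprint_bufs[SPRINT_MAX_NESTING]
				       [PRINTK_SPRINT_MAX]);
static DEFINE_PER_CPU(int, sprint_nesting);

/* Called with the prb_cpulock held, so no migration. */
static char *sprint_buf_get(void)
{
	int level = this_cpu_inc_return(sprint_nesting) - 1;

	if (level >= SPRINT_MAX_NESTING) {
		this_cpu_dec(sprint_nesting);
		return NULL;	/* nested too deeply, drop */
	}
	return this_cpu_ptr(sprint_bufs[level]);
}

static void sprint_buf_put(void)
{
	this_cpu_dec(sprint_nesting);
}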

Best Regards,
Petr

2019-02-22 15:08:31

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 10/25] printk: redirect emit/store to new ringbuffer

On 2019-02-22, Petr Mladek <[email protected]> wrote:
>>>> + rbuf = prb_reserve(&h, &sprint_rb, PRINTK_SPRINT_MAX);
>>>
>>> The second ring buffer for temporary buffers is really interesting
>>> idea.
>>>
>>> Well, it brings some questions. For example, how many users might
>>> need a reservation in parallel. Or if the nested use might cause
>>> some problems when we decide to use printk-specific ring buffer
>>> implementation. I still have to think about it.
>>
>> Keep in mind that it is only used by the writers, which have the
>> prb_cpulock. Typically there would only be 2 max users: a non-NMI
>> writer that was interrupted during the reserve/commit window and the
>> interrupting NMI that does printk. The only exception would be if the
>> printk-code code itself triggers a BUG_ON or WARN_ON within the
>> reserve/commit window. Then you will have an additional user per
>> recursion level.
>
> I am not sure it is worth calling the ring buffer machinery just
> to handle 2-3 buffers.

It may be slightly overkill, but:

1. We have the prb_cpulock at this point anyway, so it will be
fast. (Both ring buffers share the same prb_cpulock.)

2. Getting a safe buffer is just 1 line of code: prb_reserve()

3. Why should we waste _any_ lines of code implementing the handling of
these special 3-4 buffers?

> Well, it might be just my mental block. We need to be really careful
> to avoid infinite recursion when storing messages into the log
> buffer.

The recursion works well. I inserted a triggerable BUG_ON() in
vprintk_emit() _within_ the reserve/commit window and I see a clean
backtrace on the emergency console.

John Ogness

2019-02-22 15:16:16

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 11/25] printk_safe: remove printk safe code

On Fri 2019-02-22 14:38:28, John Ogness wrote:
> On 2019-02-22, Petr Mladek <[email protected]> wrote:
> >> diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
> >> index 15ca78e1c7d4..77bf84987cda 100644
> >> --- a/lib/nmi_backtrace.c
> >> +++ b/lib/nmi_backtrace.c
> >> @@ -75,12 +75,6 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
> >> touch_softlockup_watchdog();
> >> }
> >>
> >> - /*
> >> - * Force flush any remote buffers that might be stuck in IRQ context
> >> - * and therefore could not run their irq_work.
> >> - */
> >> - printk_safe_flush();
> >> -
> >> clear_bit_unlock(0, &backtrace_flag);
> >> put_cpu();
> >> }
> >
> > This reminds me that we need to add back the locking that was
> > removed in the commit 03fc7f9c99c1e7ae2925d45 ("printk/nmi:
> > Prevent deadlock when accessing the main log buffer in NMI").
>
> No, that commit is needed. You cannot have NMIs waiting on other CPUs.

It sounds weird, but it is safe to use a lock when it is used only
in the NMI context.

The lock has always been there. For example, I found it in
the commit 1fb9d6ad2766a1dd70 ("nmi_watchdog: Add new,
generic implementation, using perf events") from v2.6.36.

It could get removed only because we switched to the per-CPU
buffers.


> > Otherwise, backtraces from different CPUs would get mixed.
>
> A later patch (#17) adds CPU IDs to the printk messages so that this
> isn't a problem. (That patch is actually obsolete now because Sergey has
> already merged work for linux-next that includes this information.)

No, this is not enough. First, the CPU IDs were primarily added for
kernel testers (0-day robot, kernel-ci, syzkaller). They are not
enabled by default and are not handled by the kmsg interface.
Also, sorting the messages is not very user friendly.
It should be the last resort when no other solution
is possible.

Best Regards,
Petr

2019-02-22 15:26:34

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 10/25] printk: redirect emit/store to new ringbuffer

On Fri 2019-02-22 16:06:26, John Ogness wrote:
> On 2019-02-22, Petr Mladek <[email protected]> wrote:
> >>>> + rbuf = prb_reserve(&h, &sprint_rb, PRINTK_SPRINT_MAX);
> >>>
> >>> The second ring buffer for temporary buffers is really interesting
> >>> idea.
> >>>
> >>> Well, it brings some questions. For example, how many users might
> >>> need a reservation in parallel. Or if the nested use might cause
> >>> some problems when we decide to use printk-specific ring buffer
> >>> implementation. I still have to think about it.
> >>
> >> Keep in mind that it is only used by the writers, which have the
> >> prb_cpulock. Typically there would only be 2 max users: a non-NMI
> >> writer that was interrupted during the reserve/commit window and the
> >> interrupting NMI that does printk. The only exception would be if the
> >> printk-code code itself triggers a BUG_ON or WARN_ON within the
> >> reserve/commit window. Then you will have an additional user per
> >> recursion level.
> >
> > I am not sure it is worth calling the ring buffer machinery just
> > to handle 2-3 buffers.
>
> It may be slightly overkill, but:
>
> 1. We have the prb_cpulock at this point anyway, so it will be
> fast. (Both ring buffers share the same prb_cpulock.)

I am still not persuaded that we really need the lock. The
implementation looks almost ready for fully lockless
writers. But I might be wrong.

The lock might be fine when it makes the code easier and does
not bring any deadlocks.


> 2. Getting a safe buffer is just 1 line of code: prb_reserve()

The problem is how much complicated code is hidden behind
this one line of code.


> 3. Why should we waste _any_ lines of code implementing the handling of
> these special 3-4 buffers?

It might be worth it if it makes the code more straightforward
and less prone to bugs.


> > Well, it might be just my mental block. We need to be really careful
> > to avoid infinite recursion when storing messages into the log
> > buffer.
>
> The recursion works well. I inserted a triggerable BUG_ON() in
> vprintk_emit() _within_ the reserve/commit window and I see a clean
> backtrace on the emergency console.

Have you tested all possible error situations that might happen?
Testing helps a lot, but real life often brings surprises.

Best Regards,
Petr

2019-02-25 12:12:17

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 10/25] printk: redirect emit/store to new ringbuffer

On Wed 2019-02-20 22:25:00, John Ogness wrote:
> On 2019-02-20, Petr Mladek <[email protected]> wrote:
> >> vprintk_emit and vprintk_store are the main functions that all printk
> >> variants eventually go through. Change these to store the message in
> >> the new printk ring buffer that the printk kthread is reading.
> >
> > Please, are there any candidates or plans to reuse the new ring
> > buffer implementation?
>
> As you pointed out below, this patch already uses the ring buffer
> implementation for a totally different purpose: NMI safe dynamic memory
> allocation.

I have found an alternative solution. We could calculate the length
of the formatted string without any buffer:

va_list args_copy;

va_copy(args_copy, args);
len = vsnprintf(NULL, 0, fmt, args_copy);
va_end(args_copy);

This vsprintf() mode was implemented for exactly this purpose.
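
Then vprintk_store() could reserve exactly what is needed, something
like this (a sketch that ignores the struct printk_log header for
brevity; prb_reserve()/prb_commit() as used elsewhere in the series):

	struct prb_handle h;
	va_list args_copy;
	char *rbuf;
	int len;

	/* First pass: compute the formatted length only. */
	va_copy(args_copy, args);
	len = vsnprintf(NULL, 0, fmt, args_copy);
	va_end(args_copy);

	/* Second pass: format directly into the reserved space. */
	rbuf = prb_reserve(&h, &printk_rb, len + 1);
	if (rbuf) {
		vsnprintf(rbuf, len + 1, fmt, args);
		prb_commit(&h);
	}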

Best Regards,
Petr

2019-02-25 13:45:38

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 12/25] printk: minimize console locking implementation

On Tue 2019-02-12 15:29:50, John Ogness wrote:
> Since printing of the printk buffer is now handled by the printk
> kthread, minimize the console locking functions to just handle
> locking of the console.
>
> NOTE: With this console_flush_on_panic will no longer flush.
>
> Signed-off-by: John Ogness <[email protected]>
> ---
> kernel/printk/printk.c | 255 +------------------------------------------------
> 1 file changed, 1 insertion(+), 254 deletions(-)
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 073ff9fd6872..ece54c24ea0d 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -209,19 +209,7 @@ static int nr_ext_console_drivers;
>
> static int __down_trylock_console_sem(unsigned long ip)
> {
> - int lock_failed;
> - unsigned long flags;
> -
> - /*
> - * Here and in __up_console_sem() we need to be in safe mode,
> - * because spindump/WARN/etc from under console ->lock will
> - * deadlock in printk()->down_trylock_console_sem() otherwise.
> - */
> - printk_safe_enter_irqsave(flags);
> - lock_failed = down_trylock(&console_sem);
> - printk_safe_exit_irqrestore(flags);
> -
> - if (lock_failed)
> + if (down_trylock(&console_sem))
> return 1;
> mutex_acquire(&console_lock_dep_map, 0, 1, ip);
> return 0;
> @@ -230,13 +218,9 @@ static int __down_trylock_console_sem(unsigned long ip)
>
> static void __up_console_sem(unsigned long ip)
> {
> - unsigned long flags;
> -
> mutex_release(&console_lock_dep_map, 1, ip);
>
> - printk_safe_enter_irqsave(flags);
> up(&console_sem);
> - printk_safe_exit_irqrestore(flags);
> }
> #define up_console_sem() __up_console_sem(_RET_IP_)
>

It might be obvious from the previous mails. But just to be sure.

I would remove printk_safe stuff in one patch after switching
to the new ring buffer implementation.


> @@ -1498,82 +1482,6 @@ static void format_text(struct printk_log *msg, u64 seq,
> }
>
> /*
> - * Special console_lock variants that help to reduce the risk of soft-lockups.
> - * They allow to pass console_lock to another printk() call using a busy wait.
> - */
[...]
> -static void console_lock_spinning_enable(void)

The console waiter logic is another story. It can get removed only
after we have a reasonable alternative. That means an acceptable
offload that handles emergency situations and sudden death
reasonably well.

I would move this into a separate patchset.

Best Regards,
Petr

2019-02-25 15:00:35

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 13/25] printk: track seq per console

On Tue 2019-02-12 15:29:51, John Ogness wrote:
> Allow each console to track which seq record was last printed. This
> simplifies identifying dropped records.

I would like to see a better justification here. And I think
that it is part of this patchset.

One reason is to implement console replay in a cleaner way.
Another reason is to allow calling lockless consoles
directly in emergency situations.

Both reasons are independent of the log buffer implementation.
Therefore I suggest moving it into another patchset.


> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index ece54c24ea0d..ebd9aac06323 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1504,6 +1514,19 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
> if (!cpu_online(raw_smp_processor_id()) &&
> !(con->flags & CON_ANYTIME))
> continue;
> + if (con->printk_seq >= seq)
> + continue;
> +
> + con->printk_seq++;
> + if (con->printk_seq < seq) {
> + print_console_dropped(con, seq - con->printk_seq);
> + con->printk_seq = seq;

It would be great to print this message only when the real one
is not suppressed.

The suppressed messages can be processed quickly. They allow consoles
to catch up with the message producers. But if we spend time
printing the warning, we just risk losing more messages
again and again.

Heh, there is a bug in the current code. The warning is not
printed when the currently processed message is suppressed.
I am going to prepare a patch for this.

Best Regards,
Petr

2019-02-25 16:43:28

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 10/25] printk: redirect emit/store to new ringbuffer

On 2019-02-25, Petr Mladek <[email protected]> wrote:
>> >> vprintk_emit and vprintk_store are the main functions that all printk
>> >> variants eventually go through. Change these to store the message in
>> >> the new printk ring buffer that the printk kthread is reading.
>> >
>> > Please, are there any candidates or plans to reuse the new ring
>> > buffer implementation?
>>
>> As you pointed out below, this patch already uses the ring buffer
>> implementation for a totally different purpose: NMI safe dynamic memory
>> allocation.
>
> I have found an alternative solution. We could calculate the length
> of the formatted string without any buffer:
>
> va_list args_copy;
>
> va_copy(args_copy, args);
> len = vsnprintf(NULL, 0, fmt, args_copy);
> va_end(args_copy);
>
> This vsprintf() mode was implemented for exactly this purpose.

For vprintk_emit() that would work. As you will see later (patch 23),
the sprint_rb ringbuffer is used for dynamic memory allocation for
kmsg_dump functions as well.

The current printk implementation allows readers to read directly from
the ringbuffer. The proposed ringbuffer requires the reader (printk) to
have its own buffers.

We may be able to find an alternate solution here as well if that is
desired.

John Ogness

2019-02-26 08:45:56

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 13/25] printk: track seq per console

On 2019-02-25, Petr Mladek <[email protected]> wrote:
>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index ece54c24ea0d..ebd9aac06323 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1504,6 +1514,19 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
>> if (!cpu_online(raw_smp_processor_id()) &&
>> !(con->flags & CON_ANYTIME))
>> continue;
>> + if (con->printk_seq >= seq)
>> + continue;
>> +
>> + con->printk_seq++;
>> + if (con->printk_seq < seq) {
>> + print_console_dropped(con, seq - con->printk_seq);
>> + con->printk_seq = seq;
>
> It would be great to print this message only when the real one
> is not suppressed.

You mean if there was some function to check if "seq" is the newest
entry. And only in that situation would any dropped information be
presented?

> The suppressed messages can be processed quickly. They allow consoles
> to catch up with the message producers. But if we spend time
> printing the warning, we just risk losing more messages
> again and again.

So instead of:

message A
message B
3 messages dropped
message F
message G
2 messages dropped
message J
message K

you would prefer to see:

message A
message B
message F
message G
message J
message K
5 messages dropped

... with the hope that maybe we are able to print messages H and/or I by
not spending time on the first dropped message?

If there are a lot of printk's coming (sysrq-z) then probably there will
be many dropped during output. With your proposal, it wouldn't be seen
that so many intermediate messages were dropped. Only at the end of the
output would the user be presented with some huge dropped count.

With my implementation, if you see 2 printk lines together you can be
sure that nothing was dropped between those two lines. I think that is
more valuable than the possibility of maybe losing slightly fewer
messages in a flood situation.

John Ogness

2019-02-26 09:47:32

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 10/25] printk: redirect emit/store to new ringbuffer

On Mon 2019-02-25 17:41:50, John Ogness wrote:
> On 2019-02-25, Petr Mladek <[email protected]> wrote:
> >> >> vprintk_emit and vprintk_store are the main functions that all printk
> >> >> variants eventually go through. Change these to store the message in
> >> >> the new printk ring buffer that the printk kthread is reading.
> >> >
> >> > Please, are there any candidates or plans to reuse the new ring
> >> > buffer implementation?
> >>
> >> As you pointed out below, this patch already uses the ring buffer
> >> implementation for a totally different purpose: NMI safe dynamic memory
> >> allocation.
> >
> > I have found an alternative solution. We could calculate the length
> > of the formatted string without any buffer:
> >
> > va_list args_copy;
> >
> > va_copy(args_copy, args);
> > len = vsnprintf(NULL, 0, fmt, args_copy);
> > va_end(args_copy);
> >
> > This vsprintf() mode was implemented for exactly this purpose.
>
> For vprintk_emit() that would work. As you will see later (patch 23),
> the sprint_rb ringbuffer is used for dynamic memory allocation for
> kmsg_dump functions as well.

It looks dangerous to share a limited buffer between core kernel
functionality and user-space-triggered operations. I mean that an
unlimited number of devkmsg operations must not cause printk()
messages to be lost.

> The current printk implementation allows readers to read directly from
> the ringbuffer. The proposed ringbuffer requires the reader (printk) to
> have its own buffers.
>
> We may be able to find an alternate solution here as well if that is
> desired.

I hope that we will be able to find one. The previous implementation
needed some buffers as well. We should be able to use the same
approach.

I guess that one problem is that the new ringbuffer API is not
able to copy just the text directly into the user-provided buffer.
It might get solved by extending the API.
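
For example (just a made-up prototype):

/*
 * Hypothetical extension: copy only the message text of the next
 * entry straight into a caller-supplied buffer, without requiring
 * room for the prb_entry/printk_log metadata.
 */
int prb_iter_next_text(struct prb_iterator *iter, char *buf,
		       size_t size, u64 *seq);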

Anyway, I still have to look at the remaining patches.

Best Regards,
Petr

2019-02-26 13:13:23

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 13/25] printk: track seq per console

On Tue 2019-02-26 09:45:02, John Ogness wrote:
> On 2019-02-25, Petr Mladek <[email protected]> wrote:
> >> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> >> index ece54c24ea0d..ebd9aac06323 100644
> >> --- a/kernel/printk/printk.c
> >> +++ b/kernel/printk/printk.c
> >> @@ -1504,6 +1514,19 @@ static void call_console_drivers(const char *ext_text, size_t ext_len,
> >> if (!cpu_online(raw_smp_processor_id()) &&
> >> !(con->flags & CON_ANYTIME))
> >> continue;
> >> + if (con->printk_seq >= seq)
> >> + continue;
> >> +
> >> + con->printk_seq++;
> >> + if (con->printk_seq < seq) {
> >> + print_console_dropped(con, seq - con->printk_seq);
> >> + con->printk_seq = seq;
> >
> > It would be great to print this message only when the real one
> > is not suppressed.
>
> You mean if there was some function to check if "seq" is the newest
> entry. And only in that situation would any dropped information be
> presented?

Not newest but not suppressed.

Example: only every 10th message is important enough to reach the
console (see suppress_message_printing()).

Instead of seeing:

message 10
message 20
7 messages dropped
20 more messages dropped because of printing the warning
20 more messages dropped because of printing the warning
20 more messages dropped because of printing the warning
message 100

see something like:

message 10
message 20
13 messages dropped
message 40
13 messages dropped
message 60
13 messages dropped
message 80
message 90
message 100

The original code would print only warnings because the important
lines are lost when printing the warnings.

Better code would show more messages because the warning
is printed when a visible message is already buffered.

You might wonder why there are only 13 messages dropped in
the new state. It is because the other 6 messages are
quickly read and suppressed (filtered out). They are
skipped intentionally.
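
In call_console_drivers() terms, the deferral might look like this
(printk_dropped is a made-up per-console counter;
suppress_message_printing() is the existing loglevel filter):

	/* Count the gap, but do not print the warning yet. */
	if (seq > con->printk_seq + 1)
		con->printk_dropped += seq - con->printk_seq - 1;
	con->printk_seq = seq;

	/* Report only before a message that will really be shown. */
	if (suppress_message_printing(msg->level))
		continue;

	if (con->printk_dropped) {
		print_console_dropped(con, con->printk_dropped);
		con->printk_dropped = 0;
	}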

I hope that it will be more clear with the patch that I have
just sent, see
https://lkml.kernel.org/r/[email protected]

Best Regards,
Petr

2019-02-26 14:59:30

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 15/25] printk: print history for new consoles

On Tue 2019-02-12 15:29:53, John Ogness wrote:
> When new consoles register, they currently print how many messages
> they have missed. However, many (or all) of those messages may still
> be in the ring buffer. Add functionality to print as much of the
> history as available. This is a clean replacement of the old
> exclusive console hack.
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 897219f34cab..6c875abd7b17 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1506,6 +1506,77 @@ static void format_text(struct printk_log *msg, u64 seq,
> }
> }
>
> +static void printk_write_history(struct console *con, u64 master_seq)
> +{
> + struct prb_iterator iter;
> + bool time = printk_time;
> + static char *ext_text;
> + static char *text;
> + static char *buf;
> + u64 seq;
> +
> + ext_text = kmalloc(CONSOLE_EXT_LOG_MAX, GFP_KERNEL);
> + text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL);
> + buf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
> + if (!ext_text || !text || !buf)
> + return;

We need to free buffers that were successfully allocated.

> + if (!(con->flags & CON_ENABLED))
> + goto out;
> +
> + if (!con->write)
> + goto out;
> +
> + if (!cpu_online(raw_smp_processor_id()) &&
> + !(con->flags & CON_ANYTIME))
> + goto out;
> +
> + prb_iter_init(&iter, &printk_rb, NULL);
> +
> + for (;;) {
> + struct printk_log *msg;
> + size_t ext_len;
> + size_t len;
> + int ret;
> +
> + ret = prb_iter_next(&iter, buf, PRINTK_RECORD_MAX, &seq);
> + if (ret == 0) {
> + break;
> + } else if (ret < 0) {
> + prb_iter_init(&iter, &printk_rb, NULL);
> + continue;
> + }
> +
> + if (seq > master_seq)
> + break;
> +
> + con->printk_seq++;
> + if (con->printk_seq < seq) {
> + print_console_dropped(con, seq - con->printk_seq);
> + con->printk_seq = seq;
> + }
> +
> + msg = (struct printk_log *)buf;
> + format_text(msg, master_seq, ext_text, &ext_len, text,
> + &len, time);
> +
> + if (len == 0 && ext_len == 0)
> + continue;
> +
> + if (con->flags & CON_EXTENDED)
> + con->write(con, ext_text, ext_len);
> + else
> + con->write(con, text, len);
> +
> + printk_delay(msg->level);

Hmm, this duplicates a lot of code from call_console_drivers() and
maybe also from printk_kthread_func(). It is error prone. People
will forget to update this function when working on the main one.

We need to put the shared parts into separate functions.
For example, the per-console code might be moved to
call_console_write(struct console *con, ...). Then
call_console_drivers() might look like:

static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
const char *text, size_t len, int level)
{
struct console *con;

trace_console_rcuidle(text, len);

if (!console_drivers)
return;

for_each_console(con) {
call_console_write(con, seq, ext_text, ext_len,
text, len, level);
}
}

And you could call call_console_write() for the particular console
from printk_write_history().
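
The factored-out helper could then carry the per-console logic that is
currently duplicated in both places, something like:

static void call_console_write(struct console *con, u64 seq,
			       const char *ext_text, size_t ext_len,
			       const char *text, size_t len, int level)
{
	if (!(con->flags & CON_ENABLED) || !con->write)
		return;

	if (!cpu_online(raw_smp_processor_id()) &&
	    !(con->flags & CON_ANYTIME))
		return;

	if (con->printk_seq >= seq)
		return;

	con->printk_seq++;
	if (con->printk_seq < seq) {
		print_console_dropped(con, seq - con->printk_seq);
		con->printk_seq = seq;
	}

	if (con->flags & CON_EXTENDED)
		con->write(con, ext_text, ext_len);
	else
		con->write(con, text, len);

	printk_delay(level);
}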

> + }
> +out:
> + con->wrote_history = 1;
> + kfree(ext_text);
> + kfree(text);
> + kfree(buf);
> +}
> +
> /*
> * Call the console drivers, asking them to write out
> * log_buf[start] to log_buf[end - 1].
> @@ -1524,6 +1595,10 @@ static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
> for_each_console(con) {
> if (!(con->flags & CON_ENABLED))
> continue;
> + if (!con->wrote_history) {
> + printk_write_history(con, seq);

This looks like an alien. The code is supposed to write one message
from the given buffer. And some huge job is well hidden there.

In addition, the code is actually recursive. It will become
clear when it is deduplicated as suggested above. We should
avoid it when it is not necessary. Note that recursive code
is always more prone to mistakes and harder to reason about.

I guess that the motivation is to do everything from the printk
kthread. Is it really necessary? register_console() takes
console_lock(). It has to be sleepable context by definition.

Anyway, the new console is added under console_lock().
Any new console_lock owner could always check which new
consoles need to replay the log before it starts processing
new messages.

Best Regards,
Petr

2019-02-26 15:23:36

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 15/25] printk: print history for new consoles

On 2019-02-26, Petr Mladek <[email protected]> wrote:
>> When new consoles register, they currently print how many messages
>> they have missed. However, many (or all) of those messages may still
>> be in the ring buffer. Add functionality to print as much of the
>> history as available. This is a clean replacement of the old
>> exclusive console hack.
>>
>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index 897219f34cab..6c875abd7b17 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1506,6 +1506,77 @@ static void format_text(struct printk_log *msg, u64 seq,
>> }
>> }
>>
>> +static void printk_write_history(struct console *con, u64 master_seq)
>> +{
>> + struct prb_iterator iter;
>> + bool time = printk_time;
>> + static char *ext_text;
>> + static char *text;
>> + static char *buf;
>> + u64 seq;
>> +
>> + ext_text = kmalloc(CONSOLE_EXT_LOG_MAX, GFP_KERNEL);
>> + text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL);
>> + buf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
>> + if (!ext_text || !text || !buf)
>> + return;
>
> We need to free buffers that were successfully allocated.

Ouch. You just found some crazy garbage. The char-pointers are
static. The bug is that it allocates each time a console is
registered. It was supposed to be lazy allocation:

if (!ext_text)
ext_text = kmalloc(CONSOLE_EXT_LOG_MAX, GFP_KERNEL);
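
That is, for all three buffers (which also answers the leak you found:
nothing needs to be freed, the static buffers are simply kept for the
next registration):

	if (!ext_text)
		ext_text = kmalloc(CONSOLE_EXT_LOG_MAX, GFP_KERNEL);
	if (!text)
		text = kmalloc(PRINTK_SPRINT_MAX, GFP_KERNEL);
	if (!buf)
		buf = kmalloc(PRINTK_RECORD_MAX, GFP_KERNEL);
	if (!ext_text || !text || !buf)
		return;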

>> + if (!(con->flags & CON_ENABLED))
>> + goto out;
>> +
>> + if (!con->write)
>> + goto out;
>> +
>> + if (!cpu_online(raw_smp_processor_id()) &&
>> + !(con->flags & CON_ANYTIME))
>> + goto out;
>> +
>> + prb_iter_init(&iter, &printk_rb, NULL);
>> +
>> + for (;;) {
>> + struct printk_log *msg;
>> + size_t ext_len;
>> + size_t len;
>> + int ret;
>> +
>> + ret = prb_iter_next(&iter, buf, PRINTK_RECORD_MAX, &seq);
>> + if (ret == 0) {
>> + break;
>> + } else if (ret < 0) {
>> + prb_iter_init(&iter, &printk_rb, NULL);
>> + continue;
>> + }
>> +
>> + if (seq > master_seq)
>> + break;
>> +
>> + con->printk_seq++;
>> + if (con->printk_seq < seq) {
>> + print_console_dropped(con, seq - con->printk_seq);
>> + con->printk_seq = seq;
>> + }
>> +
>> + msg = (struct printk_log *)buf;
>> + format_text(msg, master_seq, ext_text, &ext_len, text,
>> + &len, time);
>> +
>> + if (len == 0 && ext_len == 0)
>> + continue;
>> +
>> + if (con->flags & CON_EXTENDED)
>> + con->write(con, ext_text, ext_len);
>> + else
>> + con->write(con, text, len);
>> +
>> + printk_delay(msg->level);
>
> Hmm, this duplicates a lot of code from call_console_drivers() and
> maybe also from printk_kthread_func(). It is error prone. People
> will forget to update this function when working on the main one.
>
> We need to put the shared parts into separate functions.

Agreed.

>> + }
>> +out:
>> + con->wrote_history = 1;
>> + kfree(ext_text);
>> + kfree(text);
>> + kfree(buf);
>> +}
>> +
>> /*
>> * Call the console drivers, asking them to write out
>> * log_buf[start] to log_buf[end - 1].
>> @@ -1524,6 +1595,10 @@ static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
>> for_each_console(con) {
>> if (!(con->flags & CON_ENABLED))
>> continue;
>> + if (!con->wrote_history) {
>> + printk_write_history(con, seq);
>
> This looks like an alien. The code is supposed to write one message
> from the given buffer. And some huge job is well hidden there.

This is a very simple implementation of a printk kthread. It probably
makes more sense to have a printk kthread per console. That would allow
fast consoles to not be penalized by slow consoles. Due to the
per-console seq tracking, the code would already support it.

> In addition, the code is actually recursive. It will become
> clear when it is deduplicated as suggested above. We should
> avoid it when it is not necessary. Note that recursive code
> is always more prone to mistakes and harder to reason about.

Agreed.

> I guess that the motivation is to do everything from the printk
> kthread. Is it really necessary? register_console() takes
> console_lock(). It has to be sleepable context by definition.

It is not necessary. It is desired. Why should _any_ task be punished
with console writing? That is what the printk kthread is for.

John Ogness

2019-02-26 15:38:57

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 16/25] printk: implement CON_PRINTBUFFER

On Tue 2019-02-12 15:29:54, John Ogness wrote:
> If the CON_PRINTBUFFER flag is not set, do not replay the history
> for that console.

This patch fixes a regression caused by the previous patches.
We need to do it in a way that does not cause the regression
and does not break bisection.

> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 6c875abd7b17..b97d4195b09a 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1596,8 +1592,12 @@ static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
> if (!(con->flags & CON_ENABLED))
> continue;
> if (!con->wrote_history) {
> - printk_write_history(con, seq);
> - continue;
> + if (con->flags & CON_PRINTBUFFER) {
> + printk_write_history(con, seq);
> + continue;
> + }
> + con->wrote_history = 1;

I have just got an idea: we do not need a new flag.
We could clear the CON_PRINTBUFFER bit instead.

Best Regards,
Petr

2019-02-27 09:03:36

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 15/25] printk: print history for new consoles

On Tue 2019-02-26 16:22:01, John Ogness wrote:
> On 2019-02-26, Petr Mladek <[email protected]> wrote:
> >> When new consoles register, they currently print how many messages
> >> they have missed. However, many (or all) of those messages may still
> >> be in the ring buffer. Add functionality to print as much of the
> >> history as available. This is a clean replacement of the old
> >> exclusive console hack.
> >>
> >> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> >> index 897219f34cab..6c875abd7b17 100644
> >> --- a/kernel/printk/printk.c
> >> +++ b/kernel/printk/printk.c
> >> @@ -1524,6 +1595,10 @@ static void call_console_drivers(u64 seq, const char *ext_text, size_t ext_len,
> >> for_each_console(con) {
> >> if (!(con->flags & CON_ENABLED))
> >> continue;
> >> + if (!con->wrote_history) {
> >> + printk_write_history(con, seq);
> >
> > This looks like an alien. The code is supposed to write one message
> > from the given buffer. And some huge job is well hidden there.
>
> This is a very simple implementation of a printk kthread. It probably
> makes more sense to have a printk kthread per console. That would allow
> fast consoles to not be penalized by slow consoles. Due to the
> per-console seq tracking, the code would already support it.

I mean that your patch does the replay in a very hidden location.
I think that a cleaner design would be to implement something like:

void console_check_and_replay(void)
{
struct console *con;

if (!console_drivers)
return;

for_each_console(con) {
if (con->flags & CON_PRINTBUFFER) {
printk_write_history(con, console_seq);
con->flags &= ~CON_PRINTBUFFER;
}
}
}

Then there is no recursion. Also, you are much more flexible.
You could call this from any reasonable place. For example, you
could call it in the printk_kthread right after taking
the console_lock and before processing new messages.


Regarding the per-console kthread. It would make sense if
we stop handling all consoles synchronously. For example,
when we push messages to fast consoles immediately and
offload the work for slow consoles.

Anyway, we first need to make the offload reliable enough.
It is not acceptable to always offload all messages.
We have been there the last few years. We must keep a high
chance of seeing the messages. Any warning might be important
when it causes the system to die. Nobody knows in advance
which message is the important one.


> > In addition, the code is actually recursive. It will become
> > clear when it is deduplicated as suggested above. We should
> > avoid it when it is not necessary. Note that recursive code
> > is always more prone to mistakes and harder to reason about.
>
> Agreed.
>
> > I guess that the motivation is to do everything from the printk
> > kthread. Is it really necessary? register_console() takes
> > console_lock(). It has to be sleepable context by definition.
>
> It is not necessary. It is desired. Why should _any_ task be punished
> with console writing? That is what the printk kthread is for.

I do not know about any acceptable solution without punishing
the tasks. But we might find a better compromise between the
punishment and reliability.

Best Regards,
Petr

2019-02-27 09:49:06

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 20/25] serial: 8250: implement write_atomic

On Tue 2019-02-12 15:29:58, John Ogness wrote:
> Implement a non-sleeping NMI-safe write_atomic console function in
> order to support emergency printk messages.

It uses console_atomic_lock() added in 18th patch. That one uses
prb_lock() added by 2nd patch.

Now, prb_lock() allows recursion on the same CPU. But it still needs
to wait until it is released on another CPU.

It means that it is not completely safe when NMIs happen on multiple
CPUs in parallel, for example, when calling
nmi_trigger_cpumask_backtrace().

OK, it would be safe when prb_lock() is the only lock taken
in the NMI handler. But printk() should not impose such a limitation
on the rest of the system. Not to mention that we would most
likely need to add a lock back into nmi_cpu_backtrace()
to keep the output sane.


Peter Zijlstra several times talked about fully lockless
consoles. He is using the early console for debugging, see
the patchset
https://lkml.kernel.org/r/[email protected]

I am not sure if it is always possible. I personally see
the following way:

1. Make the printk ring buffer fully lockless. Then we reduce
the problem only to console locking. And we could
have a per-console-driver lock (no the big lock like
prb_lock()).

2. I am afraid that we need to add some locking between CPUs
to avoid mixing characters from directly printed messages.
This would be safe everywhere except in NMI. Then we could
either risk ignoring the lock in NMI (there should be few
messages anyway, and the backtraces would get synchronized
another way), or we might need a compromise between
handling the console via the current owner and offloading.


Best Regards,
Petr

2019-02-27 10:04:02

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 15/25] printk: print history for new consoles

On 2019-02-27, Petr Mladek <[email protected]> wrote:
> I mean that your patch does the replay in a very hidden location.

Right. I understand that and I agree.

> Regarding the per-console kthread. It would make sense if
> we stop handling all consoles synchronously. For example,
> when we push messages to fast consoles immediately and
> offload the work for slow consoles.

My per-console kthread suggestion relating to fast consoles is so that
some consoles (such as netconsole, which is quite fast) could drop fewer
messages than a slow console (such as a uart).

> Anyway, we first need to make the offload reliable enough.
> It is not acceptable to always offload all messages.
> We have been there the last few years. We must keep a high
> chance of seeing the messages. Any warning might be important
> when it causes the system to die. Nobody knows in advance
> which message is the important one.

You seem to be missing the point of the series. It _is_ acceptable to
offload all messages because they are being offloaded to non-emergency
consoles. If messages are lost, it sucks (and the appropriate "dropped"
messages are sent), but it isn't critical. Once we can agree to this
point, printk becomes so much easier to work with.

Emergency consoles exist for handling important messages. They will not
drop messages. They are synchronous and immediate.

>> It is not necessary. It is desired. Why should _any_ task be punished
>> with console writing? That is what the printk kthread is for.
>
> I do not know about any acceptable solution without punishing
> the tasks. But we might find a better compromise between the
> punishment and reliability.

I do not want printk to compromise. That compromise is part of the
problem. Let's partition printk into important and non-important so that
we can optimize both. _That_ is the heart of this series.

John Ogness

2019-02-27 10:32:54

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 20/25] serial: 8250: implement write_atomic

On 2019-02-27, Petr Mladek <[email protected]> wrote:
>> Implement a non-sleeping NMI-safe write_atomic console function in
>> order to support emergency printk messages.
>
> It uses console_atomic_lock() added in 18th patch. That one uses
> prb_lock() added by 2nd patch.
>
> Now, prb_lock() allows recursion on the same CPU. But it still needs
> to wait until it is released on another CPU.
>
> [...]
>
> OK, it would be safe when prb_lock() is the only lock taken
> in the NMI handler.

Which is the case. As I wrote to you already [0], NMI contexts are
_never_ allowed to do things that rely on waiting forever for other
CPUs. I could not find any instances where that is the
case. nmi_cpu_backtrace() used to do this, but it does not anymore.

> But printk() should not impose such a limitation
> on the rest of the system.

That is something we have to decide. It is the one factor that makes
prb_lock() feel a hell of a lot like BKL.

> Not to mention that we would most
> likely need to add a lock back into nmi_cpu_backtrace()
> to keep the output sane.

No. That is why CPU-IDs were added to the output. It is quite sane and
easy to read.

> Peter Zijlstra several times talked about fully lockless
> consoles. He is using the early console for debugging, see
> the patchset
> https://lkml.kernel.org/r/[email protected]

That is an interesting thread to quote. In that thread Peter actually
wrote the exact implementation of prb_lock() as the method to
synchronize access to the serial console.

> I am not sure if it is always possible. I personally see
> the following way:
>
> 1. Make the printk ring buffer fully lockless. Then we reduce
> the problem only to console locking. And we could
> have a per-console-driver lock (no the big lock like
> prb_lock()).

A fully lockless ring buffer is an option. But as you said, it only
reduces the window, which is why I decided it is not so important (at
least for now). Creating a per-console-driver lock would probably be a
good idea anyway as long as we can guarantee the ordering (which
shouldn't be a problem as long as emergency console ordering remains
fixed and emergency writers always follow that ordering).

> 2. I am afraid that we need to add some locking between CPUs
> to avoid mixing characters from directly printed messages.

That is exactly what console_atomic_lock() (actually prb_lock) is!

John Ogness

[0] https://lkml.kernel.org/r/[email protected]

2019-02-27 13:18:16

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 15/25] printk: print history for new consoles

On Wed 2019-02-27 11:02:53, John Ogness wrote:
> On 2019-02-27, Petr Mladek <[email protected]> wrote:
> > I mean that your patch does the replay in a very hidden location.
>
> Right. I understand that and I agree.
>
> > Regarding the per-console kthread. It would make sense if
> > we stop handling all consoles synchronously. For example,
> > when we push messages to fast consoles immediately and
> > offload the work for slow consoles.
>
> My per-console kthread suggestion relating to fast consoles is so that
> some consoles (such as netconsole, which is quite fast) could drop less
> messages than a slow console (such as uart).

OK, it was not clear from the context.


> > Anyway, we first need to make the offload reliable enough.
> > It is not acceptable to always offload all messages.
> > We have been there the last few years. We must keep a high
> > chance of seeing the messages. Any warning might be important
> > when it causes the system to die. Nobody knows in advance
> > which message is the important one.
>
> You seem to be missing the point of the series. It _is_ acceptable to
> offload all messages because they are being offloaded to non-emergency
> consoles. If messages are lost, it sucks (and the appropriate "dropped"
> messages are sent), but it isn't critical. Once we can agree to this
> point, printk becomes so much easier to work with.
>
> Emergency consoles exist for handling important messages. They will not
> drop messages. They are synchronous and immediate.

We might start thinking about this only when the most common consoles
support the emergency mode. This patchset implements it only
for serial consoles, which are often very slow. That contradicts
the above statement about fast consoles.

Also the emergency messages from different CPUs are synchronized.
This slows down all affected CPUs. They are serialized and blocked
by the speed of the consoles. It was the reason to handle all pending
messages only by the current owner. I am sure that it would cause
regressions.

Not to mention that the synchronization is done using an unfair lock.
One CPU can simply get starved by the others for an unpredictable time.
This is why ticket spinlocks were invented.

You might argue that the amount of emergency messages should
be small but see below.


> >> It is not necessary. It is desired. Why should _any_ task be punished
> >> with console writing? That is what the printk kthread is for.
> >
> > I do not know about any acceptable solution without punishing
> > the tasks. But we might find a better compromise between the
> > punishment and reliability.
>
> I do not want printk to compromise. That compromise is part of the
> problem. Let's partition printk into important and non-important so that
> we can optimize both. _That_ is the heart of this series.

No, this is just another compromise. Let's look at it from another
side.

The important and non-important messages already existed. The split
was done by console_loglevel. The emergency level just adds one
more category (show, show later, ignore). It allows a more
fine-grained setting, but it does not remove the compromise.

People would still need to choose which messages should be seen
reliably and which might get lost. And the problem will still be
the same: the more messages are printed reliably, the more delayed
printk() callers might get. It might prevent softlockups, but only
at the cost that all parallel writers are blocked waiting for
the console.

Also note that the printk configuration already is too complicated.
See the four numbers in /proc/sys/kernel/printk. Many people
would have trouble setting them reasonably even with the
documentation. A fifth number would only make it worse.


And it is even more complicated because people are inconsistent
with using the log levels, see
https://lkml.kernel.org/r/[email protected]
It is a lost fight. People always need to see messages from
the code that they work on. If we make it harder to see
some levels, people will just start using levels that are
not filtered.


This is why I suggest splitting the work on the ring buffer
and the consoles. The new ring buffer might be a clear win,
while the console handling is really complicated. But
I still think that we can and should do better even
in the consoles.

Best Regards,
Petr

2019-02-27 13:55:51

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 20/25] serial: 8250: implement write_atomic

On Wed 2019-02-27 11:32:05, John Ogness wrote:
> On 2019-02-27, Petr Mladek <[email protected]> wrote:
> >> Implement a non-sleeping NMI-safe write_atomic console function in
> >> order to support emergency printk messages.
> >
> > It uses console_atomic_lock() added in 18th patch. That one uses
> > prb_lock() added by 2nd patch.
> >
> > Now, prb_lock() allows recursion on the same CPU. But it still needs
> > to wait until it is released on another CPU.
> >
> > [...]
> >
> > OK, it would be safe when prb_lock() is the only lock taken
> > in the NMI handler.
>
> Which is the case. As I wrote to you already [0], NMI contexts are
> _never_ allowed to do things that rely on waiting forever for other
> CPUs.

Who says _never_? I agree that it is not reasonable, but
history shows that it happens. In principle, there is nothing wrong
with using a spinlock in NMI when it is used only in NMI.


> > Not to mention that we would most
> > likely need to add a lock back into nmi_cpu_backtrace()
> > to keep the output sane.
>
> No. That is why CPU-IDs were added to the output. It is quite sane and
> easy to read.

And I already wrote that they are not added by default and that they
do not solve the kmsg interface.

Also, we might need to provide userspace support in advance.
We cannot release a kernel that makes the logs hard to read
without post-processing. At least I do not have the balls
to do so.


> > Peter Zijlstra several times talked about fully lockless
> > consoles. He is using the early console for debugging, see
> > the patchset
> > https://lkml.kernel.org/r/[email protected]
>
> That is an interesting thread to quote. In that thread Peter actually
> wrote the exact implementation of prb_lock() as the method to
> synchronize access to the serial console.

The synchronization was added just for that thread. I am not sure
if Peter is using it in real life.


> > I am not sure if it is always possible. I personally see
> > the following way:
> >
> > 1. Make the printk ring buffer fully lockless. Then we reduce
> > the problem only to console locking. And we could
> > have a per-console-driver lock (not one big lock like
> > prb_lock()).
>
> A fully lockless ring buffer is an option. But as you said, it only
> reduces the window, which is why I decided it is not so important (at
> least for now). Creating a per-console-driver lock would probably be a
> good idea anyway as long as we can guarantee the ordering (which
> shouldn't be a problem as long as emergency console ordering remains
> fixed and emergency writers always follow that ordering).
>
> > 2. I am afraid that we need to add some locking between CPUs
> > to avoid mixing characters from directly printed messages.
>
> That is exactly what console_atomic_lock() (actually prb_lock) is!

Sure. But it should not be a common lock for the ring buffer and
all consoles. Also there are still open questions with NMI
and the direct console handling itself.

Best Regards,
Petr

2019-03-04 05:24:20

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On (02/13/19 15:43), John Ogness wrote:
> On 2019-02-13, Sergey Senozhatsky <[email protected]> wrote:
> >> - A dedicated kernel thread is created for printing to all consoles in
> >> a fully preemptible context.
> >
> > How do you handle sysrq-<foo> printouts on systems which can't
> > schedule printk-kthread?
>
> If those sysrq printouts are at the emergency loglevel (which most are),
> then they are printed immediately to the emergency consoles. This has
> already proved useful for our own kernel debugging work. For example,
> currently sysrq-z for very large traces results in messages being dropped
> because of printk buffer overflows. But with the emergency console we
> always see the full trace buffer.

Are we sure that all systems will always have ->atomic console(-s)
enabled? Is it possible to convert all console drivers to ->atomic?
fbcon, for instance (with scrolling and font scaling, etc)? If there
are setups which can be fully !atomic (in terms of console output)
then we, essentially, have a fully preemptible kthread printk
implementation.

> Because you have already done so much work and experimentation with
> printk-kthreads, I feel like many of your comments are related to your
> kthread work in this area. Really the big design change I make with my
> printk-kthread is that it is only for non-critical messages. For
> anything critical, users should rely on an emergency console.

Fair point. Well maybe my printk-kthread comments are not utterly
unreasonable, but who knows :)

-ss

2019-03-04 05:32:10

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On (02/13/19 15:15), John Ogness wrote:
> I don't like bust_spinlocks() because drivers end up implementing
> oops_in_progress with exactly that... ignoring their own locks. I prefer
> consoles are provided with a locking mechanism that they can use to
> support a separate NMI-safe write function. My series introduces
> console_atomic_lock() for exactly this purpose.
>
> But this doesn't help here. Here we are talking about a crashing system
> that does _not_ have an emergency console. And in this case I would say
> messages would be lost (just like they are now if all you have is a vt
> console and it was busy).
>
> I suppose we could keep the current bust_spinlocks() stuff for the
> special case that there are no emergency consoles available. It's better
> than nothing, but also not really reliable. Preferably we figure out
> how to implement write_atomic for all console drivers.

Right. We set the console loglevel to verbose before the final panic
flush, so flushing all unseen messages (regardless of importance) does
look quite reasonable (and this is what panic() has been doing for
many years).

-ss

2019-03-04 06:41:11

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

Hi John,

On (02/13/19 14:43), John Ogness wrote:
> Hi Sergey,
>
> I am glad to see that you are getting involved here. Your previous
> talks, work, and discussions were a large part of my research when
> preparing for this work.

YAYY! Thanks!

That's a pretty massive research and a patch set!

[..]
> If we are talking about an SMP system where logbuf_lock is locked, the
> call chain is actually:
>
> panic()
>   crash_smp_send_stop()
>     ... wait for "num_online_cpus() == 1" ...
>   printk_safe_flush_on_panic();
>   console_flush_on_panic();
>
> Is it guaranteed that the kernel will successfully stop the other CPUs
> so that it can print to the console?

Right. By the way, this reminds me that I sort of wanted to send a patch
which would unconditionally raw_spin_lock_init(&logbuf_lock) (without
the num_online_cpus() check) in printk_safe_flush_on_panic().
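Roughly this, as a sketch of the idea (not an actual posted patch):

    /* printk_safe_flush_on_panic() without the num_online_cpus()
     * check: always re-initialize logbuf_lock, so that a lock left
     * held by a stopped CPU cannot deadlock the panic CPU during
     * the final flush. */
    void printk_safe_flush_on_panic(void)
    {
            if (raw_spin_is_locked(&logbuf_lock)) {
                    debug_locks_off();
                    raw_spin_lock_init(&logbuf_lock);
            }

            printk_safe_flush();
    }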

> And then there is console_flush_on_panic(), which will ignore locks and
> write to the consoles, expecting them to check "oops_in_progress" and
> ignore their own internal locks.
>
> Is it guaranteed that locks can just be ignored and backtraces will be
> seen and legible to the user?

That's a tricky question. In the same way, we may have no guarantee that
all consoles can sport an ->atomic() write API, and then no guarantee
that every system will have ->atomic consoles.

> > Do you see large latencies because of logbuf spinlock?
>
[..]
>
> For slow consoles, this can cause large latencies for some misfortunate
> tasks.

Yes, makes sense.

> > One thing that I have learned is that preemptible printk does not work
> > as expected; it wants to be 'atomic' and just stay busy as long as it
> > can.
> > We tried preemptible printk at Samsung and the result was just bad:
> > preempted printk kthread + slow serial console = lots of lost
> > messages
>
> As long as all critical messages are printed directly and immediately to
> an emergency console, why is it a problem if the informational messages
> to consoles are sometimes delayed or lost? And if those informational
> messages _are_ so important, there are things the user can do. For
> example, create a realtime userspace task to read /dev/kmsg.
>
> > We also had preemptible printk in the upstream kernel and reverted the
> > patch (see fd5f7cde1b85d4c8e09); same reasons - we had reports that
> > preemptible printk could "stall" for minutes.
>
> But in this case the preemptible task was used for printing critical
> messages as well. Then the stall really is a problem. I am proposing to
> rely on emergency consoles for critical messages. By changing printk to
> support 2 different channels (emergency and non-emergency), we can focus
> on making each of those channels optimal.

Right. Assuming that we always have at least one ->atomic channel
we can prioritize (and sacrifice !atomic channels, etc.). People,
sort of, already can prioritize some channels; IIRC, netcon can be
configured to print messages only when oops_in_progress and to drop
messages otherwise.

Things can get different if an ->atomic channel is not available.

-ss

2019-03-04 07:39:43

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On (02/12/19 15:29), John Ogness wrote:
[..]
> +        /* the printk kthread never exits */
> +        for (;;) {
> +                ret = prb_iter_wait_next(&iter, buf,
> +                                         PRINTK_RECORD_MAX, &master_seq);
> +                if (ret == -ERESTARTSYS) {
> +                        continue;
> +                } else if (ret < 0) {
> +                        /* iterator invalid, start over */
> +                        prb_iter_init(&iter, &printk_rb, NULL);
> +                        continue;
> +                }
> +
> +                msg = (struct printk_log *)buf;
> +                format_text(msg, master_seq, ext_text, &ext_len, text,
> +                            &len, printk_time);
> +
> +                console_lock();
> +                if (len > 0 || ext_len > 0) {
> +                        call_console_drivers(ext_text, ext_len, text, len);
> +                        boot_delay_msec(msg->level);
> +                        printk_delay();
> +                }
> +                console_unlock();
> +        }

This, theoretically, creates a whole new world of possibilities for
console drivers. Now they can do GFP_KERNEL allocations and stall
printk_kthread during OOM; or they can explicitly reschedule from
->write() callback (via console_conditional_schedule()) because
console_lock() sets console_may_schedule.

It's one thing to do cond_resched() (or to let preemption take over)
after call_console_drivers() (when we are done printing a message to all
console drivers) and another thing to let preemption take over while
we are printing a message to the consoles. It probably would make sense
to disable preemption around call_console_drivers().
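I.e., something along these lines in the quoted kthread loop (a sketch
of the suggestion, not code from the series):

    console_lock();
    if (len > 0 || ext_len > 0) {
            /* stay preemptible between records, but not while a
             * record is being pushed to the console drivers */
            preempt_disable();
            call_console_drivers(ext_text, ext_len, text, len);
            preempt_enable();
    }
    console_unlock();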

-ss

2019-03-04 09:27:07

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 15/25] printk: print history for new consoles

On (02/26/19 16:22), John Ogness wrote:
> > This looks like an alien. The code is supposed to write one message
> > from the given buffer. And some huge job is well hidden there.
>
> This is a very simple implementation of a printk kthread. It probably
> makes more sense to have a printk kthread per console. That would allow
> fast consoles to not be penalized by slow consoles. Due to the
> per-console seq tracking, the code would already support it.

I believe we discussed "polling consoles" several times.

printk-kthread is one way to implement polling. Another one might
already be implemented in, probably, all serial drivers and we just
need to extend it a bit - polling from console's IRQ handler.

Serial drivers poll UART xmit buffer and print (usually) up to
`count' bytes:

static irqreturn_t foo_irq_handler(int irq, void *id)
{
        int count = 512;

        [...]
        while (count > 0 && !uart_circ_empty(xmit)) {
                wr_regb(port, TX, xmit->buf[xmit->tail]);
                xmit->tail = (xmit->tail + 1) & (UART_XMIT_SIZE - 1);
                count--;
        }
        [...]

        return IRQ_HANDLED;
}

So we can also grab NUM (e.g. max 64) pending logbuf messages
and print them from the device's ISR.
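As a sketch of that idea (next_pending_log_msg() and
uart_write_string() are hypothetical helpers, not existing API):

    static irqreturn_t foo_irq_handler(int irq, void *id)
    {
            struct uart_port *port = id;
            char msg[1024];
            size_t len;
            int budget = 64;        /* cap the work done per interrupt */

            /* drain a bounded number of pending logbuf messages */
            while (budget-- && next_pending_log_msg(msg, sizeof(msg), &len))
                    uart_write_string(port, msg, len);

            return IRQ_HANDLED;
    }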

-ss

2019-03-04 10:03:17

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On (03/04/19 16:38), Sergey Senozhatsky wrote:
> This, theoretically, creates a whole new world of possibilities for
> console drivers. Now they can do GFP_KERNEL allocations and stall
> printk_kthread during OOM; or they can explicitly reschedule from
> ->write() callback (via console_conditional_schedule()) because
> console_lock() sets console_may_schedule.

To demonstrate what kind of damage preemptible printk can do, here are
some of the reports I have in my inbox:

> ** 45 printk messages dropped ** [ 2637.275312] i2c-msm-v2 7af5000.i2c: NACK: slave not responding, ensure its powered: msgs(n:1 cur:0 tx) bc(rx:0 tx:2) mode:FIFO slv_addr:0x68 MSTR_STS:0x011363c8 OPER:0x00000090
> ** 59 printk messages dropped ** [ 2637.294937] i2c-msm-v2 7af5000.i2c: NACK: slave not responding, ensure its powered: msgs(n:1 cur:0 tx) bc(rx:0 tx:2) mode:FIFO slv_addr:0x68 MSTR_STS:0x011363c8 OPER:0x00000090
> ** 54 printk messages dropped ** ...

or

[..]
> ** 2499 printk messages dropped ** [ 60.095515] CPU: 1 PID: 7148 Comm: syz-executor5 Tainted: G B 4.4.104-ged884eb #2
> ** 5042 printk messages dropped ** [ 60.107433] [<ffffffff82564f65>] sg_finish_rem_req+0x255/0x2f0
> ** 3861 printk messages dropped ** [ 60.116522] entry_SYSCALL_64_fastpath+0x16/0x76
> ** 3313 printk messages dropped ** [ 60.124312] Object ffff8800b903e960: 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
> ** 5311 printk messages dropped ** [ 60.136772] INFO: Freed in fasync_free_rcu+0x14/0x20 age=624 cpu=0 pid=3
> ** 4200 printk messages dropped ** [ 60.146612] __slab_free+0x18c/0x2b0
> ** 2864 printk messages dropped ** [ 60.153322] Object ffff8800b903e990: 00 50 8b 83 ff ff ff ff 01 46 00 00 07 00 00 00 .P.......F......
> ** 5323 printk messages dropped ** [ 60.165806] Object ffff8800b903e980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> ** 5308 printk messages dropped ** [ 60.178233] entry_SYSCALL_64_fastpath+0x16/0x76
> ** 3313 printk messages dropped ** [ 60.186014] Object ffff8800b903e960: 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
> ** 5306 printk messages dropped ** [ 60.198451] kmem_cache_alloc+0x155/0x290
[..]

One can lose tens, hundreds or even thousands of messages between
consecutive call_console_drivers() calls. These reports are from the
days when printk used to be preemptible. I don't see that many dropped
messages starting from 4.15 (when we disabled preemption), at least not
in those syzbot reports which I have.

Some of those lost messages are probably going to be handled by the
->atomic path (depending on the loglevel), assuming that an ->atomic
console is available. At the same time we might see a notable conversion
of some pr_foo-s to "safer" emergency levels.

But in general, channels which depend on preemptible printk will become
totally useless in some cases.

-ss

2019-03-04 11:08:08

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On (03/04/19 19:00), Sergey Senozhatsky wrote:
>
> But in general, channels which depend on preemptible printk will become
> totally useless in some cases.
>

Which brings me to a question - what are those messages/channels?
Not important enough to be printed on consoles immediately, yet important
enough to pass the suppress_message_printing() check. We may wave those
semi-important messages goodbye, I'm afraid; preemptible printk will
take care of it.

So... do we have a case here? Do we really need printk-kthread?

-ss

2019-03-05 21:02:24

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

Hi Sergey,

Thanks for your feedback.

I am responding to this comment ahead of your previous comments because
it really cuts at the heart of the proposed design. After addressing
this point it will make it easier for me to respond to your other
comments.

NOTE: This is a lengthy response.

On 2019-03-04, Sergey Senozhatsky <[email protected]> wrote:
>> But in general, channels which depend on preemptible printk will
>> become totally useless in some cases.
>
> Which brings me to a question - what are those messages/channels? Not
> important enough to be printed on consoles immediately, yet important
> enough to pass the suppress_message_printing() check.

I would like to clarify that message suppression (i.e. console loglevel)
is a method of reducing what is printed. It does nothing to address the
issues related to console printing. My proposal focuses on addressing
the issues related to console printing.

Console printing is a convenient feature to allow a kernel to
communicate information to a user without any reliance on
userspace. IMHO there are 2 categories of messages that the kernel will
communicate. The first is informational (usb events, wireless and
ethernet connectivity, filesystem events, etc.). Since this category of
messages occurs during normal runtime, we should expect that it does not
cause adverse effects to the rest of the system (such as latencies and
non-deterministic behavior).

The second category is for emergency situations, where the kernel needs
to report something unusual (panic, BUG, WARN, etc.). In some of these
situations, it may be the last thing the kernel ever does. We should
expect this category to focus on getting the message out as reliably as
possible. Even if it means disturbing the system with large latencies.

_Both_ categories are important for the user, but their requirements are
different:

informational: non-disturbing
emergency: reliable

But what if a console doesn't support the write_atomic() that the
emergency category requires? Then implement it. We currently have about
80 console drivers.

But what if it can't be implemented? The vt console, for example? Yes,
the vt console would be tricky. It doesn't even support the current
bust_spinlocks/oops_in_progress. But since the emergency category has a
clear requirement (reliability), it means that a vt write_atomic() does
not need to be concerned with system disturbance. That could help to
find an implementation that will work, even for vt.

> We may wave those semi-important messages good bye, I'm afraid,
> preemptible printk will take care of it.

You are talking about a system that is overloaded with messages to print
to the console. The current printk implementation will do a better job
of getting the informational messages out, but at an enormous cost to
all the tasks on the system (including the realtime tasks). I am
proposing a printk implementation where the tasks are not affected by
console printing floods. When the CPU is allowed to dedicate itself to
tasks, this obviously reduces the CPU available for console printing,
and thus more messages will be lost. It is a choice to clarify printk's
role (non-disturbance) and at the same time guarantee more determinism
for the kernel and its tasks.

As I've said, the messages of the informational category are also
important. There are things that can be done to help get these messages
out. For example:

- Creating printk-kthreads per console (and thus per-console locks) so
  that printk-buffer readers are not slowing each other down (a rough
  sketch follows this list).

- Having printk-threads use priority-buckets based on loglevels so that
(like the rt scheduler) more important messages are printed first.

- Assigning the printk-kthread of more important consoles an appropriate
realtime priority.
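For the first idea, a rough sketch (reusing the iterator API from this
series; formatting, loglevel handling and error paths are omitted):

    /* one printer kthread per console, so a slow console cannot
     * delay a fast one; each kthread tracks its own sequence number */
    static int per_console_printer(void *data)
    {
            struct console *con = data;
            struct prb_iterator iter;
            char buf[PRINTK_RECORD_MAX];
            u64 seq = 0;

            prb_iter_init(&iter, &printk_rb, NULL);

            for (;;) {
                    if (prb_iter_wait_next(&iter, buf,
                                           PRINTK_RECORD_MAX, &seq) < 0)
                            continue;
                    con->write(con, buf, strlen(buf));
            }

            return 0;
    }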

> So... do we have a case here? Do we really need printk-kthread?

Obviously I answer yes to that.

I want messages of the information category to cause no disturbance to
the system. Give the kernel the freedom to communicate to users without
destroying its own performance. This can only be achieved if the
messages are printed from a _fully_ preemptible context.

And I want messages of the emergency category to be as reliable as
possible, regardless of the costs to the system. Give the kernel a clear
mechanism to _reliably_ communicate critical information. Such messages
should never appear on a correctly functioning system.

And again, both of the above have nothing to do with message
suppression. Here I am addressing the console printing issues:
reliability and disturbance.

John Ogness

2019-03-06 19:00:43

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On Tue 2019-03-05 22:00:58, John Ogness wrote:
> Hi Sergey,
>
> Thanks for your feedback.
>
> I am responding to this comment ahead of your previous comments because
> it really cuts at the heart of the proposed design. After addressing
> this point it will make it easier for me to respond to your other
> comments.
>
> NOTE: This is a lengthy response.
>
> On 2019-03-04, Sergey Senozhatsky <[email protected]> wrote:
> >> But in general, channels which depend on preemptible printk will
> >> become totally useless in some cases.
> >
> > Which brings me to a question - what are those messages/channels? Not
> > important enough to be printed on consoles immediately, yet important
> > enough to pass the suppress_message_printing() check.
>
> I would like to clarify that message suppression (i.e. console loglevel)
> is a method of reducing what is printed. It does nothing to address the
> issues related to console printing. My proposal focuses on addressing
> the issues related to console printing.
>
> Console printing is a convenient feature to allow a kernel to
> communicate information to a user without any reliance on
> userspace. IMHO there are 2 categories of messages that the kernel will
> communicate. The first is informational (usb events, wireless and
> ethernet connectivity, filesystem events, etc.). Since this category of
> messages occurs during normal runtime, we should expect that it does not
> cause adverse effects to the rest of the system (such as latencies and
> non-deterministic behavior).
>
> The second category is for emergency situations, where the kernel needs
> to report something unusual (panic, BUG, WARN, etc.). In some of these
> situations, it may be the last thing the kernel ever does. We should
> expect this category to focus on getting the message out as reliably as
> possible. Even if it means disturbing the system with large latencies.
>
> _Both_ categories are important for the user, but their requirements are
> different:
>
> informational: non-disturbing
> emergency: reliable

Isn't this already handled by the console_level?

The informational messages can be reliably read via syslog, /dev/kmsg.
They are related to normal operation when the system works well.

The emergency messages (errors, warnings) are printed in emergency
situations. They are printed as reliably as possible to the console
because the userspace might not be reliable enough.


That said, the "atomic" consoles bring new possibilities
and might be very useful in some scenarios. Also a more
fine-grained prioritization might be helpful.

But each solution might just bring new problems. For example,
the atomic consoles are still serialized between CPUs. It might
slow down the entire system, and not only one task. If it gets
blocked for some reason (nobody is perfect) it would block
all the other serialized CPUs as well.

In each case, we really need to be careful about the complexity.
printk() is already complex enough. It might be acceptable if
it makes the design cleaner and less tangled. printk() would
deserve a redesign.

Best Regards,
Petr

2019-03-06 21:18:55

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On 2019-03-06, Petr Mladek <[email protected]> wrote:
>> I would like to clarify that message suppression (i.e. console loglevel)
>> is a method of reducing what is printed. It does nothing to address the
>> issues related to console printing. My proposal focuses on addressing
>> the issues related to console printing.
>>
>> Console printing is a convenient feature to allow a kernel to
>> communicate information to a user without any reliance on
>> userspace. IMHO there are 2 categories of messages that the kernel will
>> communicate. The first is informational (usb events, wireless and
>> ethernet connectivity, filesystem events, etc.). Since this category of
>> messages occurs during normal runtime, we should expect that it does not
>> cause adverse effects to the rest of the system (such as latencies and
>> non-deterministic behavior).
>>
>> The second category is for emergency situations, where the kernel needs
>> to report something unusual (panic, BUG, WARN, etc.). In some of these
>> situations, it may be the last thing the kernel ever does. We should
>> expect this category to focus on getting the message out as reliably as
>> possible. Even if it means disturbing the system with large latencies.
>>
>> _Both_ categories are important for the user, but their requirements are
>> different:
>>
>> informational: non-disturbing
>> emergency: reliable
>
> Isn't this already handled by the console_level?

You mean that the current console level is being used to set the
boundary between emergency and informational messages? Definitely no!
Take any Linux distribution and look at their default console_level
setting. Even the kernel code defaults to a value of 7!

> The informational messages can be reliably read via syslog, /dev/kmsg.
> They are related to normal operation when the system works well.

Yes, this is how things _could_ be. But why are users currently using
the kernel's console printing for informational messages? And why is the
kernel code encouraging it? Perhaps because users like being able to
receive messages without relying on userspace tools? IMO it is this mass
use of console printing for informational messages that is preventing
the implementation from becoming optimally reliable.

My proposal is making this distinction clearer: a significant increase
in reliability for emergency messages, and a fully preemptible printer
for informational messages. The fully preemptible printer will work just
as well as any userspace tool, but doesn't require userspace. Not
requiring userspace seems to me to be the part users are interested
in.

(But I might be wrong on this. Perhaps Linux is just "marketing" its
console printing feature incorrectly and users aren't aware that it is
only meant for emergencies.)

> The emergency messages (errors, warnings) are printed in emergency
> situations. They are printed as reliably as possible to the console
> because the userspace might not be reliable enough.

As reliably as _possible_? I hope that my series at least helps to show
that we can do a lot better about reliability.

> That said, the "atomic" consoles bring new possibilities
> and might be very useful in some scenarios. Also a more
> fine-grained prioritization might be helpful.
>
> But each solution might just bring new problems. For example,
> the atomic consoles are still serialized between CPUs. It might
> slow down the entire system, and not only one task.

Why is that a problem? The focus is reliability. We are talking about
emergency messages here. Messages that should never occur for a
correctly functioning system. It does not matter if the entire system
slows down because of it.

> If it gets blocked for some reason (nobody is perfect) it would
> block all the other serialized CPUs as well.

Yes, blocking in an atomic context would be bad for any code.

> In each case, we really need to be careful about the complexity.
> printk() is already complex enough. It might be acceptable if
> it makes the design cleaner and less tangled. printk() would
> deserve a redesign.

It is my belief that I am significantly simplifying printk because there
are no more exotic contexts and situations. Emergency messages are
atomic and immediate. Context does not matter. Informational messages
are printed fully preemptible, so console drivers are free to do
whatever magic they want to do. Do you see that as more complex than the
current implementation of safe buffers, defers, hand-offs, exclusive
consoles, and cond_rescheds?

John Ogness

2019-03-06 22:23:59

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On 2019-03-06, Petr Mladek <[email protected]> wrote:
>> _Both_ categories are important for the user, but their requirements
>> are different:
>>
>> informational: non-disturbing
>> emergency: reliable
>
> Isn't this already handled by the console_level?
>
> The informational messages can be reliably read via syslog, /dev/kmsg.
> > They are related to normal operation when the system works well.
>
> The emergency messages (errors, warnings) are printed in emergency
> situations. They are printed as reliably as possible to the console
> because the userspace might not be reliable enough.

I've never viewed console_level this way. _If_ console_level really is
supposed to define the emergency/informational boundary, all
informational messages are supposed to be handled by userspace, and
console printing's main objective is reliability... then I would change
my proposal such that:

- if a console supports write_atomic(), _all_ console printing for that
console would use write_atomic()

- only consoles without write_atomic() will be printing via the
printk-kthread(s)

IMO, for consoles with write_atomic(), this would increase reliability
over the current mainline implementation. It would also simplify
write_atomic() implementations because they would no longer need to
synchronize against write().

For those consoles that cannot implement write_atomic() (vt and
netconsole come to mind), or as a transition period until remaining
console drivers have implemented write_atomic(), these would use the
"fallback" of printing fully preemptively in their own kthread using
write().

Does this better align with the concept of the console_loglevel and the
purpose of console printing?

John Ogness

2019-03-07 02:13:37

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 02/25] printk-rb: add prb locking functions

On (02/12/19 15:29), John Ogness wrote:
> +static bool __prb_trylock(struct prb_cpulock *cpu_lock,
> +                          unsigned int *cpu_store)
> +{
> +        unsigned long *flags;
> +        unsigned int cpu;
> +
> +        cpu = get_cpu();
> +
> +        *cpu_store = atomic_read(&cpu_lock->owner);
> +        /* memory barrier to ensure the current lock owner is visible */
> +        smp_rmb();
> +        if (*cpu_store == -1) {
> +                flags = per_cpu_ptr(cpu_lock->irqflags, cpu);
> +                local_irq_save(*flags);
> +                if (atomic_try_cmpxchg_acquire(&cpu_lock->owner,
> +                                               cpu_store, cpu)) {
> +                        return true;
> +                }
> +                local_irq_restore(*flags);
> +        } else if (*cpu_store == cpu) {
> +                return true;
> +        }
> +
> +        put_cpu();
> +        return false;
> +}
> +
> +/*
> + * prb_lock: Perform a processor-reentrant spin lock.
> + * @cpu_lock: A pointer to the lock object.
> + * @cpu_store: A "flags" pointer to store lock status information.
> + *
> + * If no processor has the lock, the calling processor takes the lock and
> + * becomes the owner. If the calling processor is already the owner of the
> + * lock, this function succeeds immediately. If lock is locked by another
> + * processor, this function spins until the calling processor becomes the
> + * owner.
> + *
> + * It is safe to call this function from any context and state.
> + */
> +void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store)
> +{
> +        for (;;) {
> +                if (__prb_trylock(cpu_lock, cpu_store))
> +                        break;
> +                cpu_relax();
> +        }
> +}

Any chance to make it more fair? A ticket-based lock, perhaps?
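For illustration, a ticket-based variant of the same CPU-reentrant idea
might look like this (a sketch only; the irq disabling and memory
barriers of __prb_trylock() are omitted):

    struct prb_ticketlock {
            atomic_t next;          /* next ticket to hand out */
            atomic_t serving;       /* ticket currently being served */
            atomic_t owner;         /* owning CPU, -1 when unlocked */
            int depth;              /* recursion depth of the owner */
    };

    void prb_ticket_lock(struct prb_ticketlock *l)
    {
            /* assumes preemption is already disabled, as in prb_lock() */
            int cpu = smp_processor_id();
            int ticket;

            if (atomic_read(&l->owner) == cpu) {
                    l->depth++;     /* same-CPU recursion */
                    return;
            }

            ticket = atomic_fetch_inc(&l->next);
            while (atomic_read(&l->serving) != ticket)
                    cpu_relax();    /* FIFO: no CPU can starve */

            atomic_set(&l->owner, cpu);
            l->depth = 1;
    }

    void prb_ticket_unlock(struct prb_ticketlock *l)
    {
            if (--l->depth)
                    return;

            atomic_set(&l->owner, -1);
            atomic_inc(&l->serving);        /* hand over to next waiter */
    }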

-ss

2019-03-07 05:16:11

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

Hi John,

On (03/05/19 22:00), John Ogness wrote:
> Hi Sergey,
>
[..]
> Console printing is a convenient feature to allow a kernel to
> communicate information to a user without any reliance on
> userspace. IMHO there are 2 categories of messages that the kernel will
> communicate. The first is informational (usb events, wireless and
> ethernet connectivity, filesystem events, etc.). Since this category of
> messages occurs during normal runtime, we should expect that it does not
> cause adverse effects to the rest of the system (such as latencies and
> non-deterministic behavior).
>
> The second category is for emergency situations, where the kernel needs
> to report something unusual (panic, BUG, WARN, etc.). In some of these
> situations, it may be the last thing the kernel ever does. We should
> expect this category to focus on getting the message out as reliably as
> possible. Even if it means disturbing the system with large latencies.
>
> _Both_ categories are important for the user, but their requirements are
> different:
>
> informational: non-disturbing
> emergency: reliable

That's one way of looking at this. And it's reasonable.

Another way could be:
- anything that passes the loglevel check (suppress_message_printing())
is considered to be important

- anything else is just "noise" which should be suppressed. This
is what loglevel and suppress_message_printing() are for - to tell
the kernel what we want and what we don't want to be on the consoles.
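For reference, the mainline check is essentially this
(kernel/printk/printk.c):

    static bool suppress_message_printing(int level)
    {
            /* messages at or above console_loglevel are not printed */
            return (level >= console_loglevel && !ignore_loglevel);
    }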

> But what if it can't be implemented? The vt console, for example? Yes, the vt
> console would be tricky. It doesn't even support the current
> bust_spinlocks/oops_in_progress. But since the emergency category has a
> clear requirement (reliability)

"Reliability" - yes; the existence of emergency messages - no.

"to report something unusual (panic, BUG, WARN, etc.). In some of
these situations, it may be the last thing the kernel ever does."

But so may be the "informational" message. For example, not all ARCHs
sport an NMI to detect and warn about a lockup/deadlock somewhere in usb
or wifi. The "informational" message can be the last thing the kernel
has to say.

> The current printk implementation will do a better job of getting the
> informational messages out, but at an enormous cost to all the tasks
> on the system (including the realtime tasks). I am proposing a printk
> implementation where the tasks are not affected by console printing
> floods.

In the new printk design the tasks are still affected by printing floods.
Tasks have to line up and (busy) wait for each other, regardless of
context.

One of the late patch sets which I had (I never ever published it) was
a different kind of printk-kthread offloading. The idea was that whatever
passes suppress_message_printing() should be printed. We obviously can't
loop in console_unlock() forever, and there is only one way to figure out
whether we can print more messages; that's why printk became RCU stall
detector and watchdog aware, and printk would break out and wake up
printk_kthread if it saw that the watchdog was about to get angry on
that particular CPU. printk_kthread would run with preemption disabled
and do the same thing: if it spent watchdog_threshold / 2 printing -
break out, enable local IRQs, cond_resched(). IOW, the watchdogs
determine how much time we can spend on printing.
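Roughly, as a sketch (have_pending_records(), print_one_record() and
watchdog_budget_ns() are hypothetical names, not the unpublished code):

    /* bound the time spent printing directly by half the watchdog
     * threshold, then offload the rest to printk_kthread */
    u64 deadline = local_clock() + watchdog_budget_ns() / 2;

    while (have_pending_records()) {
            print_one_record();
            if (local_clock() >= deadline) {
                    wake_up_process(printk_kthread);
                    break;
            }
    }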

[..]
> I want messages of the information category to cause no disturbance to
> the system. Give the kernel the freedom to communicate to users without
> destroying its own performance. This can only be achieved if the
> messages are printed from a _fully_ preemptible context.
[..]
> And I want messages of the emergency category to be as reliable as
> possible, regardless of the costs to the system. Give the kernel a
> clear mechanism to _reliably_ communicate critical information.
> Such messages should never appear on a correctly functioning system.

I don't really understand the role of loglevel anymore.

When I do ./a.out --loglevel=X I have a clear understanding that
all messages which fall into [critical, X] range will be in the logs,
because I told that application that those messages are important to
me right now. And it used to be the same with the kernel loglevel.
But now the kernel will do its own thing:

- what the kernel considers important will go into the logs
- what the kernel doesn't consider important _maybe_ will end up
in the logs (preemptible printk kthread). And this is where
loglevel is now. After the _maybe_ part.

If I'm not mistaken, Tetsuo reported that on a box under heavy OOM
pressure he saw preemptible printk dragging 5 minutes behind the
logbuf head. Preemptible printk is good for nothing. It's beyond
useless, it's something else.

-ss

2019-03-07 06:42:36

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On (03/06/19 23:22), John Ogness wrote:
> On 2019-03-06, Petr Mladek <[email protected]> wrote:
> >> _Both_ categories are important for the user, but their requirements
> >> are different:
> >>
> >> informational: non-disturbing
> >> emergency: reliable
> >
> > Isn't this already handled by the console_level?
> >
> > The informational messages can be reliably read via syslog, /dev/kmsg.
> > They are related to normal operation when the system works well.
> >
> > The emergency messages (errors, warnings) are printed in emergency
> > situations. They are printed as reliably as possible to the console
> > because the userspace might not be reliable enough.
>
> I've never viewed console_level this way. _If_ console_level really is
> supposed to define the emergency/informational boundary, all
> informational messages are supposed to be handled by userspace, and
> console printing's main objective is reliability... then I would change
> my proposal such that:

OK, you guys are ahead of me.

FB folks want to have a per-console sysfs knob to dynamically adjust
loglevel on each console. The use case is to temporarily set loglevel
to, say, debug on fast consoles, gather some data/logs, set loglevel
back to less verbose afterwards.

Preserving the existing loglevel behaviour looks right to me.

> - if a console supports write_atomic(), _all_ console printing for that
> console would use write_atomic()

Sounds right.
But Big-Konsole-Lock looks suspicious.

> - only consoles without write_atomic() will be printing via the
> printk-kthread(s)
>
> IMO, for consoles with write_atomic(), this would increase reliability
> over the current mainline implementation. It would also simplify
> write_atomic() implementations because they would no longer need to
> synchronize against write().

[..]
> For those consoles that cannot implement write_atomic() (vt and
> netconsole come to mind), or as a transition period until remaining
> console drivers have implemented write_atomic(), these would use the
> "fallback" of printing fully preemptively in their own kthread using
> write().

This sounds concerning. IMHO, netconsole is too important to rely
on preemptible printk and scheduler. Especially those netcons which
run in "report only when oops_in_progress" mode. Sometimes netconsole
is the fastest console available, but preemptible printk may turn it
into the slowest one.

-ss

2019-03-07 06:51:59

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On (03/07/19 15:41), Sergey Senozhatsky wrote:
> This sounds concerning. IMHO, netconsole is too important to rely
> on preemptible printk and scheduler. Especially those netcons which
> run in "report only when oops_in_progress" mode. Sometimes netconsole
> is the fastest console available, but preemptible printk may turn it
> into the slowest one.

+ the oops may be over by the time the kthread becomes active.

-ss

2019-03-07 07:32:35

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 19/25] printk: introduce emergency messages

On (02/12/19 15:29), John Ogness wrote:
[..]
> +static bool console_can_emergency(int level)
> +{
> +        struct console *con;
> +
> +        for_each_console(con) {
> +                if (!(con->flags & CON_ENABLED))
> +                        continue;
> +                if (con->write_atomic && level < emergency_console_loglevel)
> +                        return true;
> +                if (con->write && (con->flags & CON_BOOT))
> +                        return true;
> +        }
> +        return false;
> +}
> +
> +static void call_emergency_console_drivers(int level, const char *text,
> +                                           size_t text_len)
> +{
> +        struct console *con;
> +
> +        for_each_console(con) {
> +                if (!(con->flags & CON_ENABLED))
> +                        continue;
> +                if (con->write_atomic && level < emergency_console_loglevel) {
> +                        con->write_atomic(con, text, text_len);
> +                        continue;
> +                }
> +                if (con->write && (con->flags & CON_BOOT)) {
> +                        con->write(con, text, text_len);
> +                        continue;
> +                }
> +        }
> +}
> +
> +static void printk_emergency(char *buffer, int level, u64 ts_nsec, u16 cpu,
> +                             char *text, u16 text_len)
> +{
> +        struct printk_log msg;
> +        size_t prefix_len;
> +
> +        if (!console_can_emergency(level))
> +                return;
> +
> +        msg.level = level;
> +        msg.ts_nsec = ts_nsec;
> +        msg.cpu = cpu;
> +        msg.facility = 0;
> +
> +        /* "text" must have PREFIX_MAX preceding bytes available */
> +
> +        prefix_len = print_prefix(&msg,
> +                                  console_msg_format & MSG_FORMAT_SYSLOG,
> +                                  printk_time, buffer);
> +        /* move the prefix forward to the beginning of the message text */
> +        text -= prefix_len;
> +        memmove(text, buffer, prefix_len);
> +        text_len += prefix_len;
> +
> +        text[text_len++] = '\n';
> +
> +        call_emergency_console_drivers(level, text, text_len);

So this iterates the console list and calls the consoles' callbacks, but
what prevents a console driver from being rmmod-ed under us?

CPU0                                      CPU1

printk_emergency()                        rmmod netcon
  call_emergency_console_drivers()
    con_foo->flags & CON_ENABLED == 1
                                          unregister_console(con_foo)
                                            con_foo->flags &= ~CON_ENABLED
                                          __exit  // con_foo gone ?
    con_foo->write()

We use console_lock()/console_trylock() in order to protect the list and
the console drivers; but this brings the scheduler into the picture, with
all its locks.

Or am I missing something?

-ss

2019-03-07 09:55:01

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On 2019-03-04, Sergey Senozhatsky <[email protected]> wrote:
>>>> - A dedicated kernel thread is created for printing to all consoles
>>>> in a fully preemptible context.
>>>
>>> How do you handle sysrq-<foo> printouts on systems which can't
>>> schedule printk-kthread?
>>
>> If those sysrq printouts are at the emergency loglevel (which most
>> are), then they are printed immediately to the emergency
>> consoles. This has already proved useful for our own kernel debugging
>> work. For example, currently sysrq-z for very large traces result in
>> messages being dropped because of printk buffer overflows. But with
>> the emergency console we always see the full trace buffer.
>
> Are we sure that all systems will always have ->atomic console(-s)
> enabled? Is it possible to convert all console drivers to ->atomic?
> fbcon, for instance (with scrolling and font scaling, etc)?

No, I am not sure if we can convert all console drivers to atomic
consoles. But I think if we don't have to fear disturbing the system,
the possibilities for such an implementation are greater.

> If there are setups which can be fully !atomic (in terms of console
> output) then we, essentially, have a fully preemptible kthread printk
> implementation.

Correct. I've mentioned in another response[0] some ideas about what
could be done to aid this.

I understand that fully preemptible kthread printing is unacceptable for
you. Since all current console drivers are already irq safe, I'm
wondering if using irq_work to handle the emergency printing for console
drivers without write_atomic() would help. (If the printk caller is in a
context that write() supports, then write() could be called directly.)
This would also demand that the irq-safe requirements for write() are
not relaxed. The printk-kthread might still be faster than irq_work, but
it might increase reliability if an irq_work is triggered as an extra
precaution.
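As a sketch of that fallback (console_has_write_atomic() and
flush_emergency_messages() are hypothetical; the handler would have to
iterate the ring buffer itself, since irq_work carries no payload):

    static struct irq_work emergency_flush_work;

    static void emergency_flush_fn(struct irq_work *work)
    {
            /* runs in hardirq context, where write() must already be safe */
            flush_emergency_messages();
    }

    /* once, during init: */
    init_irq_work(&emergency_flush_work, emergency_flush_fn);

    /* in printk(), for an emergency message: */
    if (!console_has_write_atomic())
            irq_work_queue(&emergency_flush_work);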

John Ogness

[0] https://lkml.kernel.org/r/[email protected]

2019-03-07 12:07:37

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On 2019-03-04, Sergey Senozhatsky <[email protected]> wrote:
>> +        /* the printk kthread never exits */
>> +        for (;;) {
>> +                ret = prb_iter_wait_next(&iter, buf,
>> +                                         PRINTK_RECORD_MAX, &master_seq);
>> +                if (ret == -ERESTARTSYS) {
>> +                        continue;
>> +                } else if (ret < 0) {
>> +                        /* iterator invalid, start over */
>> +                        prb_iter_init(&iter, &printk_rb, NULL);
>> +                        continue;
>> +                }
>> +
>> +                msg = (struct printk_log *)buf;
>> +                format_text(msg, master_seq, ext_text, &ext_len, text,
>> +                            &len, printk_time);
>> +
>> +                console_lock();
>> +                if (len > 0 || ext_len > 0) {
>> +                        call_console_drivers(ext_text, ext_len, text, len);
>> +                        boot_delay_msec(msg->level);
>> +                        printk_delay();
>> +                }
>> +                console_unlock();
>> +        }
>
> This, theoretically, creates a whole new world of possibilities for
> console drivers. Now they can do GFP_KERNEL allocations and stall
> printk_kthread during OOM; or they can explicitly reschedule from
> ->write() callback (via console_conditional_schedule()) because
> console_lock() sets console_may_schedule.

This was the intention. Although, as I mentioned in a previous
response[0], perhaps we should not loosen the requirements on write().

> It's one thing to do cond_resched() (or to let preemption take
> over) after call_console_drivers() (when we are done printing a
> message to all console drivers) and another thing to let preemption
> take over while we are printing a message to the consoles. It
> probably would make sense to disable preemption around
> call_console_drivers().

I could see disabling preemption and interrupts for emergency messages
in the printk-kthread in order to synchronize against an irq_work
secondary printer as suggested in my response[0]. But I don't see an
advantage to disabling preemption in general for
call_console_drivers(). It is exactly that preempt_disable() that is so
harmful for realtime tasks.

John Ogness

[0] https://lkml.kernel.org/r/[email protected]

2019-03-07 12:51:32

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On Wed 2019-03-06 22:17:12, John Ogness wrote:
> On 2019-03-06, Petr Mladek <[email protected]> wrote:
> >> I would like to clarify that message suppression (i.e. console loglevel)
> >> is a method of reducing what is printed. It does nothing to address the
> >> issues related to console printing. My proposal focuses on addressing
> >> the issues related to console printing.
> >>
> >> Console printing is a convenient feature to allow a kernel to
> >> communicate information to a user without any reliance on
> >> userspace. IMHO there are 2 categories of messages that the kernel will
> >> communicate. The first is informational (usb events, wireless and
> >> ethernet connectivity, filesystem events, etc.). Since this category of
> >> messages occurs during normal runtime, we should expect that it does not
> >> cause adverse effects to the rest of the system (such as latencies and
> >> non-deterministic behavior).
> >>
> >> The second category is for emergency situations, where the kernel needs
> >> to report something unusual (panic, BUG, WARN, etc.). In some of these
> >> situations, it may be the last thing the kernel ever does. We should
> >> expect this category to focus on getting the message out as reliably as
> >> possible. Even if it means disturbing the system with large latencies.
> >>
> >> _Both_ categories are important for the user, but their requirements are
> >> different:
> >>
> >> informational: non-disturbing
> >> emergency: reliable
> >
> > Isn't this already handled by the console_level?
>
> You mean that the current console level is being used to set the
> boundary between emergency and informational messages? Definitely no!
> Take any Linux distribution and look at their default console_level
> setting. Even the kernel code defaults to a value of 7!

CONSOLE_LOGLEVEL_DEFAULT is one thing. Real life is another thing.
I could hardly imagine any distribution that would scare users with
all kernel messages.

For example, the SUSE distro boots with the "quiet" kernel parameter.
It seems that RedHat does the same, as suggested by
https://lkml.kernel.org/r/[email protected]
It means that level '4' is typically used and an even lower number
is wanted.

Well, SUSE (or systemd folks?) go even further. rsyslog sets
the console_loglevel to "1".


BTW: The default has been set to "7" in kernel-1.3.71 (1996-01-01).
There is no explanation for this. Well, it looks like a pre-SMP
era. printk() was not serialized with any lock. Also messages
to tty were written separately by tty_write_message(). I guess
the default was primarily for serial consoles and developers.


> My proposal is making this distinction clearer: a significant increase
> in reliability for emergency messages.

This is true if we fulfil several conditions. I am not sure that
the following list is complete but:

+ The commonly used consoles must provide the atomic write.
Otherwise, it would have only a few users.

+ The printk code and console atomic write must be really
safe and maintainable.

+ The serialization of atomic consoles must not cause more
problems than the current approach (always-slow pr_err() vs.
randomly slow printk(), unfair locking) [*]

+ The mixed order of directly printed and postponed messages
must not cause too much confusion. We might need some
userspace tools to sort them.


[*] I agree that the direct console handling has several advantages:

+ Avoid all the problems with console_lock.

+ It is more predictable (no random victims spending
time handling all messages).

+ It is a natural throttling of heavy printk() users.
Fewer lost messages (important ones never).
Might reduce (or remove) the problem with softlockups.


I am just not able to predict all the eventual problems. There was
a reason for the current state. People did not want an always-slow
printk().


> Why is that a problem? The focus is reliability. We are talking about
> emergency messages here. Messages that should never occur for a
> correctly functioning system. It does not matter if the entire system
> slows down because of it.

This sounds too idealistic:

+ It expects that everyone is using the log levels consistently and
with care. But many people believe that their messages are
the most important ones.

+ Also it expects that all users use similar console_loglevels
in all situations. But it depends on the use case. Sometimes
warning and error messages do not make much sense without
the context, e.g. info or even debug messages.


> > If it gets blocked for some reason (nobody is perfect) it would
> > block all the other serialized CPUs as well.
>
> Yes, blocking in an atomic context would be bad for any code.
>
> > In each case, we really need to be careful about the complexity.
> > printk() is already complex enough. It might be acceptable if
> > it makes the design cleaner and less tangled. printk() would
> > deserve a redesign.
>
> It is my belief that I am significantly simplifying printk because there
> are no more exotic contexts and situations. Emergency messages are
> atomic and immediate. Context does not matter. Informational messages
> are printed fully preemptible, so console drivers are free to do
> whatever magic they want to do. Do you see that as more complex than the
> current implementation of safe buffers, defers, hand-offs, exclusive
> consoles, and cond_rescheds?

This expects that we would be able to always offload all
non-emergency messages. I am not convinced that it is acceptable.
And I wrote this several times.

Also, please, keep the log buffer and console handling separate.
The new "lockless" log buffer solves problems that are currently
being solved by the printk_safe code, while the atomic consoles
and the kthread try to solve problems with the unpredictable
printk() cost (softlockups) and to make the emergency messages
more reliable (fewer lost messages).

The complexity depends on how the final code would look.
It might be a win if it is more straightforward and organized.
But there are some alarming things:

+ lockless code is always tricky

+ another console_loglevel setting is added

+ one more way to push messages to the console

+ messages are split into 3 groups:
+ one group was a subset of the other in the past
+ now, two groups need to be joined and sorted
to reconstruct the original one

Best Regards,
Petr

2019-03-08 01:32:18

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On (03/07/19 13:06), John Ogness wrote:
> On 2019-03-04, Sergey Senozhatsky <[email protected]> wrote:
> >> +        /* the printk kthread never exits */
> >> +        for (;;) {
> >> +                ret = prb_iter_wait_next(&iter, buf,
> >> +                                         PRINTK_RECORD_MAX, &master_seq);
> >> +                if (ret == -ERESTARTSYS) {
> >> +                        continue;
> >> +                } else if (ret < 0) {
> >> +                        /* iterator invalid, start over */
> >> +                        prb_iter_init(&iter, &printk_rb, NULL);
> >> +                        continue;
> >> +                }
> >> +
> >> +                msg = (struct printk_log *)buf;
> >> +                format_text(msg, master_seq, ext_text, &ext_len, text,
> >> +                            &len, printk_time);
> >> +
> >> +                console_lock();
> >> +                if (len > 0 || ext_len > 0) {
> >> +                        call_console_drivers(ext_text, ext_len, text, len);
> >> +                        boot_delay_msec(msg->level);
> >> +                        printk_delay();
> >> +                }
> >> +                console_unlock();
> >> +        }
> >
> > This, theoretically, creates a whole new world of possibilities for
> > console drivers. Now they can do GFP_KERNEL allocations and stall
> > printk_kthread during OOM; or they can explicitly reschedule from
> > ->write() callback (via console_conditional_schedule()) because
> > console_lock() sets console_may_schedule.
>
> This was the intention.

This can stall the entire printing pipeline:

OOM -> printk_kthread() -> console_lock() -> con_foo() -> kmalloc(GFP_KERNEL) -> OOM

> Although, as I mentioned in a previous response[0], perhaps we should
> not loosen the requirements on write().

Right. Console drivers better stay restricted; very restricted.

> It is exactly that preempt_disable() that is so harmful for realtime tasks.

I'll reply in another email (later today, or tomorrow).

-ss

2019-03-08 04:06:41

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 20/25] serial: 8250: implement write_atomic

On 2019-02-27, Petr Mladek <[email protected]> wrote:
>>>> Implement a non-sleeping NMI-safe write_atomic console function in
>>>> order to support emergency printk messages.
>>>
>>> OK, it would be safe when prb_lock() is the only lock taken
>>> in the NMI handler.
>>
>> Which is the case. As I wrote to you already [0], NMI contexts are
>> _never_ allowed to do things that rely on waiting forever for other
>> CPUs.
>
> Who says _never_? I agree that it is not reasonable. But the
> history shows that it happens.

Right, which is why it would need to become policy.

The emergency messages (aka write_atomic) introduce a new requirement to
the kernel because this callback must be callable from any context. The
console drivers must have some way of synchronizing. The CPU-reentrant
spin lock is the only solution I am aware of.

> In principle, there is nothing wrong in using spinlock in NMI
> when it is used only in NMI.

The CPU-reentrant spin lock _will_ be used in NMI context and
potentially could be used from any line of NMI code (if, for example, a
panic is triggered). The problem is when you have 2 different spin locks
in NMI context and their ordering cannot be guaranteed. And since I am
introducing an implicit spin lock that potentially could be locked from
any line of code, any explicit use of a spin lock in NMI code would
really be adding a 2nd spin lock and thus deadlock potential.

If the ringbuffer was fully lockless, we should be able to have
per-console CPU-reentrant spin locks as long as the ordering is
preserved, which I expect shouldn't be a problem. If any NMI context
needed a spin lock for its own purposes, it would need to use the
CPU-reentrant spin lock of the first console so as to preserve the
ordering in case of a panic.

>>> 2. I am afraid that we need to add some locking between CPUs
>>> to avoid mixing characters from directly printed messages.
>>
>> That is exactly what console_atomic_lock() (actually prb_lock) is!
>
> Sure. But it should not be a common lock for the ring buffer and
> all consoles.

As long as the ring buffer requires a CPU-reentrant spin lock, I expect
that it _must_ be a common lock for all. Consider the situation that the
ring buffer writer code causes a panic. I think it is beneficial if at
least 1 level of printk recursion is supported so that even these
backtraces make it out on the emergency consoles.

If the ring buffer becomes fully lockless, then we could move to
per-console CPU-reentrant spin locks.

John Ogness

2019-03-08 04:18:21

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 20/25] serial: 8250: implement write_atomic

On 2019-03-08, John Ogness <[email protected]> wrote:
> If the ringbuffer was fully lockless, we should be able to have
> per-console CPU-reentrant spin locks as long as the ordering is
> preserved, which I expect shouldn't be a problem. If any NMI context
> needed a spin lock for its own purposes, it would need to use the
> CPU-reentrant spin lock of the first console so as to preserve the
> ordering in case of a panic.

This point is garbage. Sorry. I do not see how we could safely have
multiple CPU-reentrant spin locks. Example of a deadlock:

CPU0                     CPU1
printk                   printk
  console2.lock            console1.lock
    NMI                      NMI
      printk                   printk
        console1.lock            console2.lock

>> ... it should not be a common lock for the ring buffer and all
>> consoles.
>
> If the ring buffer becomes fully lockless, then we could move to
> per-console CPU-reentrant spin locks.

A fully lockless ring buffer will reduce the scope of the one, global
CPU-reentrant spin lock. But I do not see how we can safely have
multiple of these. If it is part of printk, it is already implicitly on
every line of code.

John Ogness

2019-03-08 10:00:59

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On Thu 2019-03-07 10:53:48, John Ogness wrote:
> On 2019-03-04, Sergey Senozhatsky <[email protected]> wrote:
> > If there are setups which can be fully !atomic (in terms of console
> > output) then we, essentially, have a fully preemptible kthread printk
> > implementation.
>
> Correct. I've mentioned in another response[0] some ideas about what
> could be done to aid this.
>
> I understand that fully preemptible kthread printing is unacceptable for
> you. Since all current console drivers are already irq safe, I'm
> wondering if using irq_work to handle the emergency printing for console
> drivers without write_atomic() would help. (If the printk caller is in a
> context that write() supports, then write() could be called directly.)
> This would also demand that the irq-safe requirements for write() are
> not relaxed. The printk-kthread might still be faster than irq_work, but
> it might increase reliability if an irq_work is triggered as an extra
> precaution.

It is getting more and more complicated. The messages would be pushed
directly, from irq, and from a kthread. It would depend on how the code
would look, but I am not very optimistic.

Note that you could not pass any data to the irq_work handler.
It would need to iterate over the logbuffer and take care of
all non-handled emergency messages.
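
To illustrate what that would mean in practice (the handler and helper
names here are invented for the sketch; only the irq_work API itself is
real):

    #include <linux/irq_work.h>

    /* Hypothetical: walk the log buffer from the last handled record
     * and write any pending emergency messages to the consoles. */
    static void flush_unhandled_emergency_messages(void)
    {
    }

    static void emergency_irq_work_fn(struct irq_work *work)
    {
        /* No payload can be passed through the irq_work itself. */
        flush_unhandled_emergency_messages();
    }

    static DEFINE_IRQ_WORK(emergency_irq_work, emergency_irq_work_fn);

    /* From a printk() context that cannot call con->write() directly: */
    static void defer_emergency_output(void)
    {
        irq_work_queue(&emergency_irq_work);
    }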

Anyway, we could solve this later. We need to keep the current
console_unlock() handling as a fallback until enough consoles
support the direct mode.

Best Regards,
Petr

2019-03-08 10:05:30

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On Fri 2019-03-08 10:31:34, Sergey Senozhatsky wrote:
> On (03/07/19 13:06), John Ogness wrote:
> > On 2019-03-04, Sergey Senozhatsky <[email protected]> wrote:
> > > This, theoretically, creates a whole new world of possibilities for
> > > console drivers. Now they can do GFP_KERNEL allocations and stall
> > > printk_kthread during OOM; or they can explicitly reschedule from
> > > ->write() callback (via console_conditional_schedule()) because
> > > console_lock() sets console_may_schedule.
> >
> This can stall the entire printing pipeline
>
> OOM -> printk_kthread() -> console_lock() -> con_foo() -> kmalloc(GFP_KERNEL) -> OOM
>
> > Although, as I mentioned in a previous response[0], perhaps we should
> > not loosen the requirements on write().
>
> Right. Console drivers better stay restricted; very restricted.

I agree here.

> > It is exactly that disable_preempt() that is so harmful for realtime tasks.

> I'll reply in another email (later today, or tomorrow).

I see. But I have trouble imagining preemptible direct console
output. The result would be mixed messages on a single line.

Best Regards,
Petr

2019-03-08 10:30:45

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 20/25] serial: 8250: implement write_atomic

On Fri 2019-03-08 05:05:12, John Ogness wrote:
> On 2019-02-27, Petr Mladek <[email protected]> wrote:
> >>>> Implement a non-sleeping NMI-safe write_atomic console function in
> >>>> order to support emergency printk messages.
> >>>
> >>> OK, it would be safe when prb_lock() is the only lock taken
> >>> in the NMI handler.
> >>
> >> Which is the case. As I wrote to you already [0], NMI contexts are
> >> _never_ allowed to do things that rely on waiting forever for other
> >> CPUs.
> >
> > Who says _never_? I agree that it is not reasonable. But the
> > history shows that it happens.
>
> Right, which is why it would need to become policy.
>
> The emergency messages (aka write_atomic) introduce a new requirement to
> the kernel because this callback must be callable from any context. The
> console drivers must have some way of synchronizing. The CPU-reentrant
> spin lock is the only solution I am aware of.

I am not in a position to decide whether this requirement is acceptable
or not. We would need an opinion from Linus at minimum.

But it is not acceptable from my point of view. Note that
many spinlocks might be safely used in NMI in principle.

You want to introduce a single spinlock just because
printk() could be called anywhere. Most other subsystems
do not have this problem because they are more self-contained.

> If the ringbuffer was fully lockless

Exactly. I think that the solution would be a fully lockless logbuffer.


> we should be able to have per-console CPU-reentrant spin locks
> as long as the ordering is preserved, which I expect shouldn't
> be a problem. If any NMI context needed a spin lock for its own
> purposes, it would need to use the CPU-reentrant spin lock of
> the first console so as to preserve the ordering in case of a panic.

IMHO, this is not acceptable. It would be nice to have direct output
from NMI but the cost looks too high here.

A solution would be to defer the console to irq_work or so.
We might also consider using trylock in NMI and have
the irq_work just as a fallback.

Best Regards,
Petr

2019-03-08 10:32:25

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 19/25] printk: introduce emergency messages

On Thu 2019-03-07 16:30:29, Sergey Senozhatsky wrote:
> On (02/12/19 15:29), John Ogness wrote:
> [..]
> > +static bool console_can_emergency(int level)
> > +{
> > + struct console *con;
> > +
> > + for_each_console(con) {
> > + if (!(con->flags & CON_ENABLED))
> > + continue;
> > + if (con->write_atomic && level < emergency_console_loglevel)
> > + return true;
> > + if (con->write && (con->flags & CON_BOOT))
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +static void call_emergency_console_drivers(int level, const char *text,
> > + size_t text_len)
> > +{
> > + struct console *con;
> > +
> > + for_each_console(con) {
> > + if (!(con->flags & CON_ENABLED))
> > + continue;
> > + if (con->write_atomic && level < emergency_console_loglevel) {
> > + con->write_atomic(con, text, text_len);
> > + continue;
> > + }
> > + if (con->write && (con->flags & CON_BOOT)) {
> > + con->write(con, text, text_len);
> > + continue;
> > + }
> > + }
> > +}
> > +
> > +static void printk_emergency(char *buffer, int level, u64 ts_nsec, u16 cpu,
> > + char *text, u16 text_len)
> > +{
> > + struct printk_log msg;
> > + size_t prefix_len;
> > +
> > + if (!console_can_emergency(level))
> > + return;
> > +
> > + msg.level = level;
> > + msg.ts_nsec = ts_nsec;
> > + msg.cpu = cpu;
> > + msg.facility = 0;
> > +
> > + /* "text" must have PREFIX_MAX preceding bytes available */
> > +
> > + prefix_len = print_prefix(&msg,
> > + console_msg_format & MSG_FORMAT_SYSLOG,
> > + printk_time, buffer);
> > + /* move the prefix forward to the beginning of the message text */
> > + text -= prefix_len;
> > + memmove(text, buffer, prefix_len);
> > + text_len += prefix_len;
> > +
> > + text[text_len++] = '\n';
> > +
> > + call_emergency_console_drivers(level, text, text_len);
>
> So this iterates the console list and calls consoles' callbacks, but what
> prevents console driver to be rmmod-ed under us?
>
> CPU0 CPU1
>
> printk_emergency() rmmod netcon
> call_emergency_console_drivers()
> con_foo->flags & CON_ENABLED == 1
> unregister_console(con_foo)
> con_foo->flags &= ~CON_ENABLED
> __exit // con_foo gone ?
> con_foo->write()
>
> We use console_lock()/console_trylock() in order to protect the list and
> console drivers; but this brings scheduler to the picture, with all its
> locks.

Great catch!

I think that it is doable to guard the list using RCU.
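
A hedged sketch of the RCU idea, reusing the callback from the quoted
patch (the RCU annotations and grace-period handling are assumptions;
the mainline console list is not RCU-protected today):

    static void call_emergency_console_drivers(int level, const char *text,
                                               size_t text_len)
    {
        struct console *con;

        rcu_read_lock();
        for (con = rcu_dereference(console_drivers); con;
             con = rcu_dereference(con->next)) {
            if (!(con->flags & CON_ENABLED))
                continue;
            if (con->write_atomic &&
                level < emergency_console_loglevel)
                con->write_atomic(con, text, text_len);
        }
        rcu_read_unlock();
    }

unregister_console() would then unlink the entry with
rcu_assign_pointer() and call synchronize_rcu() before returning, so no
CPU can still be inside con->write_atomic() when the module is freed.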

Best Regards,
Petr

by Sebastian Andrzej Siewior

[permalink] [raw]
Subject: Re: [RFC PATCH v1 25/25] printk: remove unused code

On 2019-02-12 15:30:03 [+0100], John Ogness wrote:

you removed the whole `irq_work' thing. You can also remove the include
for linux/irq_work.h.

Sebastian

2019-03-11 03:43:18

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 25/25] printk: remove unused code

On (03/08/19 15:02), Sebastian Andrzej Siewior wrote:
> On 2019-02-12 15:30:03 [+0100], John Ogness wrote:
>
> you removed the whole `irq_work' thing. You can also remove the include
> for linux/irq_work.h.

It may be too early to remove the whole `irq_work' thing.
printk()->call_console_driver() should take the console_sem lock.

-ss

by Sebastian Andrzej Siewior

[permalink] [raw]
Subject: Re: [RFC PATCH v1 25/25] printk: remove unused code

On 2019-03-11 11:46:00 [+0900], Sergey Senozhatsky wrote:
> On (03/08/19 15:02), Sebastian Andrzej Siewior wrote:
> > On 2019-02-12 15:30:03 [+0100], John Ogness wrote:
> >
> > you removed the whole `irq_work' thing. You can also remove the include
> > for linux/irq_work.h.
>
> It may be too early to remove the whole `irq_work' thing.
> printk()->call_console_driver() should take the console_sem lock.

I would be _very_ glad to see that irq_work thingy gone. I just stumbled
upon this irq_work and cursed a little while doing other things.
Checking John's series and seeing that it was gone was a relief.
Printing the whole thing in irq context does not look sane. Printing the
important things right away and printing the remaining things later in a
kthread looks good to me.

> -ss

Sebastian

2019-03-11 10:53:19

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On 2019-03-07, Sergey Senozhatsky <[email protected]> wrote:
>> The current printk implementation will do a better job of getting the
>> informational messages out, but at an enormous cost to all the tasks
>> on the system (including the realtime tasks). I am proposing a printk
>> implementation where the tasks are not affected by console printing
>> floods.
>
> In new printk design the tasks are still affected by printing floods.
> Tasks have to line up and (busy) wait for each other, regardless of
> contexts.

The only place they line up and busy wait is to add the informational
message to the ring buffer. The current printk implementation is the
same in this respect. And as you've noted, the logbuf spinlock is not
a source of latencies.

> One of the late patch sets which I had (I never ever published it) was
> a different kind of printk-kthread offloading. The idea was that
> whatever should be printed (suppress_message_printing()) should be
> printed. We obviously can't loop in console_unlock() for ever and
> there is only one way to figure out if we can print out more messages,
> that's why printk became RCU stall detector and watchdog aware; and
> printk would break out and wake up printk_kthread if it sees that
> watchdog is about to get angry on that particular CPU. printk_kthread
> would run with preemption disabled and do the same thing: if it spent
> watchdog_threshold / 2 printing - breakout, enable local IRQ,
> cond_resched(). IOW watchdogs determine how much time we can spend on
> printing.

I studied and experimented with this (from your git). It was an
interesting idea of keeping the current logic, but allowing offload
to a separate kthread if things were getting too overloaded. (It is also
where I got term "emergency" from.)

But I was satisfied with neither the direct printing (winner takes all,
printk-safe defers) nor _relying_ on a kthread for important messages in
an offload situation. This is what convinced me that the kernel needs a
new interface so that it can communicate the _really_ important things
synchronously: write_atomic().
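
For reference, the shape of the proposed callback as it appears in the
patches quoted elsewhere in this thread (the struct layout shown here
is abbreviated and partly assumed):

    struct console {
        char    name[16];
        /* existing callback; may rely on irqs and deferring: */
        void    (*write)(struct console *con, const char *s,
                         unsigned int count);
        /* new, optional; must complete synchronously and be safe
         * to call from any context, including NMI: */
        void    (*write_atomic)(struct console *con, const char *s,
                                unsigned int count);
        short   flags;
        /* ... */
        struct console *next;
    };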

>> I want messages of the information category to cause no disturbance
>> to the system. Give the kernel the freedom to communicate to users
>> without destroying its own performance. This can only be achieved if
>> the messages are printed from a _fully_ preemptible context.
> [..]
>> And I want messages of the emergency category to be as reliable as
>> possible, regardless of the costs to the system. Give the kernel a
>> clear mechanism to _reliably_ communicate critical information. Such
>> messages should never appear on a correctly functioning system.
>
> I don't really understand the role of loglevel anymore.
>
> When I do ./a.out --loglevel=X I have a clear understanding that
> all messages which fall into [critical, X] range will be in the logs,
> because I told that application that those messages are important to
> me right now. And it used to be the same with the kernel loglevel.

The loglevel is not related to logging. It specifies the amount of
console printing. But I will assume you are referring to creating log
files by having an external device store the console printing.

> But now the kernel will do its own thing:
>
> - what the kernel considers important will go into the logs
> - what the kernel doesn't consider important _maybe_ will end up
> in the logs (preemptible printk kthread). And this is where
> loglevel is now. After the _maybe_ part.

"what the kernel considers" is a configuration option of the
administrator. The administrator can increase the verbosity of the
console (loglevel) without having negative effects on the system
itself. Also, if the system were to suddenly crash, those crash messages
shouldn't be in jeopardy just because the verbosity of the console was
turned up.

You (and Petr) talk as though _all_ console printing is for
emergencies. That if an administrator sets the loglevel to 7 it is
because the pr_info messages are just as important as the pr_emerg. And
if that is indeed the intention of console printing and loglevel, then
why are asynchronous printk calls for console messages even allowed
today? IMO that isn't taking the importance of the message very
seriously.

> If I'm not mistaken, Tetsuo reported that on a box under heavy OOM
> pressure he saw preemptible printk dragging 5 minutes behind the
> logbuf head. Preemptible printk is good for nothing. It's beyond
> useless, it's something else.

The informational messages are correctly timestamped and can be sorted
offline. They are informational, so any loss is less tragic. And they
aren't affecting system performance because they are being printed in
preemptible contexts.

John Ogness

2019-03-11 10:55:06

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On (03/07/19 10:53), John Ogness wrote:
[..]
>
> No, I am not sure if we can convert all console drivers to atomic
> consoles. But I think if we don't have to fear disturbing the system,
> the possibilities for such an implementation are greater.

> > If there are setups which can be fully !atomic (in terms of console
> > output) then we, essentially, have a fully preemptible kthread printk
> > implementation.
>
> Correct. I've mentioned in another response[0] some ideas about what
> could be done to aid this.
>
> I understand that fully preemptible kthread printing is unacceptable for
> you.

Well, it's not like it's unacceptable for me. It's just we've been
there, we had preemptible printk(); and people were not happy with
it. Just to demonstrate that I'm not making this up:

> > From: Tetsuo Handa <[email protected]>:
> > [..]
> >
> > Using a reproducer [..] you will find that calling cond_resched()
> > (from console_unlock() from printk()) can cause a delay of nearly
> > one minute, and it can cause a delay of nearly 5 minutes to complete
> > one out_of_memory() call.

preemptible printk() and printk() do opposite things.

I can't really say that I care for fbcon; but fully preemptible
netcon is going to hurt.

> Since all current console drivers are already irq safe, I'm
> wondering if using irq_work to handle the emergency printing for console
> drivers without write_atomic() would help. (If the printk caller is in a
> context that write() supports, then write() could be called directly.)
> This would also demand that the irq-safe requirements for write() are
> not relaxed. The printk-kthread might still be faster than irq_work, but
> it might increase reliability if an irq_work is triggered as an extra
> precaution.

Hmm. OK. So one of the things with printk is that it's fully sequential.
We call console drivers one by one. Slow consoles can affect what appears
on the fast consoles; fast consoles have no impact on slow ones.

    call_console_drivers()
        for_each_console(c)
            c->write(c, text, text_len);

So a list of (slow_serial serial netcon) console drivers is a camel train;
fast netcon is not fast anymore, and slow consoles sometimes are the reason
we have dropped messages. And if we drop messages we drop them for all
consoles, including fast netcon. Turning that sequential pipeline into a
bunch of per-console kthreads/irq and letting fast consoles be fast is
not a completely bad thing. Let's think more about this, I'd like to read
more opinions.

-ss

2019-03-11 12:05:36

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 19/25] printk: introduce emergency messages

On 2019-03-08, Petr Mladek <[email protected]> wrote:
>>> +static bool console_can_emergency(int level)
>>> +{
>>> + struct console *con;
>>> +
>>> + for_each_console(con) {
>>> + if (!(con->flags & CON_ENABLED))
>>> + continue;
>>> + if (con->write_atomic && level < emergency_console_loglevel)
>>> + return true;
>>> + if (con->write && (con->flags & CON_BOOT))
>>> + return true;
>>> + }
>>> + return false;
>>> +}
>>> +
>>> +static void call_emergency_console_drivers(int level, const char *text,
>>> + size_t text_len)
>>> +{
>>> + struct console *con;
>>> +
>>> + for_each_console(con) {
>>> + if (!(con->flags & CON_ENABLED))
>>> + continue;
>>> + if (con->write_atomic && level < emergency_console_loglevel) {
>>> + con->write_atomic(con, text, text_len);
>>> + continue;
>>> + }
>>> + if (con->write && (con->flags & CON_BOOT)) {
>>> + con->write(con, text, text_len);
>>> + continue;
>>> + }
>>> + }
>>> +}
>>> +
>>> +static void printk_emergency(char *buffer, int level, u64 ts_nsec, u16 cpu,
>>> + char *text, u16 text_len)
>>> +{
>>> + struct printk_log msg;
>>> + size_t prefix_len;
>>> +
>>> + if (!console_can_emergency(level))
>>> + return;
>>> +
>>> + msg.level = level;
>>> + msg.ts_nsec = ts_nsec;
>>> + msg.cpu = cpu;
>>> + msg.facility = 0;
>>> +
>>> + /* "text" must have PREFIX_MAX preceding bytes available */
>>> +
>>> + prefix_len = print_prefix(&msg,
>>> + console_msg_format & MSG_FORMAT_SYSLOG,
>>> + printk_time, buffer);
>>> + /* move the prefix forward to the beginning of the message text */
>>> + text -= prefix_len;
>>> + memmove(text, buffer, prefix_len);
>>> + text_len += prefix_len;
>>> +
>>> + text[text_len++] = '\n';
>>> +
>>> + call_emergency_console_drivers(level, text, text_len);
>>
>> So this iterates the console list and calls consoles' callbacks, but
>> what prevents console driver to be rmmod-ed under us?
>>
>> CPU0 CPU1
>>
>> printk_emergency() rmmod netcon
>> call_emergency_console_drivers()
>> con_foo->flags & CON_ENABLED == 1
>> unregister_console(con_foo)
>> con_foo->flags &= ~CON_ENABLED
>> __exit // con_foo gone ?
>> con_foo->write()
>>
>> We use console_lock()/console_trylock() in order to protect the list
>> and console drivers; but this brings scheduler to the picture, with
>> all its locks.
>
> Great catch!

Yes, thanks!

> I think that it is doable to guard the list using RCU.

I think it would be enough to take the prb_cpulock when modifying the
console linked list. That will keep printk_emergency() out until the
list has been updated. (registering/unregistering consoles is not
something that happens often.)
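
A rough sketch of that idea (using hypothetical CPU-lock helpers in the
style of prb_lock; none of this is actual patch code):

    extern struct cpu_lock prb_cpulock;        /* hypothetical */
    extern struct console *console_drivers;

    static void emergency_safe_unregister(struct console *con)
    {
        struct console **pcon;
        unsigned long flags;

        cpu_lock_acquire(&prb_cpulock, &flags);
        for (pcon = &console_drivers; *pcon; pcon = &(*pcon)->next) {
            if (*pcon == con) {
                *pcon = con->next;
                break;
            }
        }
        cpu_lock_release(&prb_cpulock, &flags);

        /* After release, printk_emergency() (which holds the same
         * lock while iterating) can no longer see @con. */
    }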

John Ogness

2019-03-12 02:52:36

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 19/25] printk: introduce emergency messages

On (03/11/19 13:04), John Ogness wrote:
> > Great catch!
>
> Yes, thanks!
>
> > I think that it is doable to guard the list using RCU.
>
> I think it would be enough to take the prb_cpulock when modifying the
> console linked list. That will keep printk_emergency() out until the
> list has been updated. (registering/unregistering consoles is not
> something that happens often.)

console_sem protects a bit more than just registering/unregistering.
Especially when it comes to VT, fbcon and ioctls.

$ git grep console_lock drivers/tty/ | wc -l
82

$ git grep console_lock drivers/video/fbdev/ | wc -l
80

-ss

2019-03-12 02:59:34

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 19/25] printk: introduce emergency messages

On (03/08/19 11:31), Petr Mladek wrote:
> Great catch!
>
> I think that it is doable to guard the list using RCU.

I think console_sem is more than just a list lock.

E.g. fb_flashcursor() - console_sem protects framebuffer from concurrent
modifications. And many other examples.

I think the last time we talked about it (two+ years ago) we counted
5 or 7 (don't remember exactly) different things which console_sem
does.

-ss

2019-03-12 09:40:34

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 25/25] printk: remove unused code

On Mon 2019-03-11 09:18:26, Sebastian Andrzej Siewior wrote:
> On 2019-03-11 11:46:00 [+0900], Sergey Senozhatsky wrote:
> > On (03/08/19 15:02), Sebastian Andrzej Siewior wrote:
> > > On 2019-02-12 15:30:03 [+0100], John Ogness wrote:
> > >
> > > you removed the whole `irq_work' thing. You can also remove the include
> > > for linux/irq_work.h.
> >
> > It may be too early to remove the whole `irq_work' thing.
> > printk()->call_console_driver() should take the console_sem lock.
>
> I would be _very_ glad to see that irq_work thingy gone. I just stumbled
> upon this irq_work and cursed a little while doing other things.

Have you seen stalls caused by the irq work? Or was it just
from glancing over the printk code?


> Checking John's series and seeing that it was gone was a relief.
> Printing the whole thing in irq context does not look sane. Printing the
> important things right away and printing the remaining things later in a
> kthread looks good to me.

The irq_work was originally added to handle messages from the
scheduler by printk_deferred(). It was later used also to
handle messages from NMI and printk recursion.

It means that the use is pretty limited. It is more reliable
than a kthread, especially when the scheduler is reporting
problems. IMHO, it is a reasonable solution as long
as the amount of messages is low.

The real time kernel is another story but it has special
handling in many other situations.

That said, the kthread might still make sense as a fallback.
It would be nice to have a hard limit for handling messages
in irq context.

Best Regards,
Petr

2019-03-12 09:59:36

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On (03/11/19 11:51), John Ogness wrote:
> > In new printk design the tasks are still affected by printing floods.
> > Tasks have to line up and (busy) wait for each other, regardless of
> > contexts.
>
> The only place they line up and busy wait is to add the informational message to
> the ring buffer. The current printk implementation is the same in this
> respect. And as you've noted, the logbuf spinlock is not a source of
> latencies.

I was talking about prb_lock().

> > When I do ./a.out --loglevel=X I have a clear understanding that
> > all messages which fall into [critical, X] range will be in the logs,
> > because I told that application that those messages are important to
> > me right now. And it used to be the same with the kernel loglevel.
>
> The loglevel is not related to logging. It specifies the amount of
> console printing. But I will assume you are referring to creating log
> files by having an external device store the console printing.

Right. E.g. screenlog.0

> > But now the kernel will do its own thing:
> >
> > - what the kernel considers important will go into the logs
> > - what the kernel doesn't consider important _maybe_ will end up
> > in the logs (preemptible printk kthread). And this is where
> > loglevel is now. After the _maybe_ part.
>
> "what the kernel considers" is a configuration option of the
> administrator. The administrator can increase the verbosity of the
> console (loglevel) without having negative effects on the system
> itself. Also, if the system were to suddenly crash, those crash messages
> shouldn't be in jeopardy just because the verbosity of the console was
> turned up.

Right. I'm not very sure about yet another knob which everyone
should figure out. I guess I won't be surprised to find out that
people set it to the loglevel value.

-ss

2019-03-12 10:31:02

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 08/25] printk: add ring buffer and kthread

On Mon 2019-03-11 11:51:49, John Ogness wrote:
> On 2019-03-07, Sergey Senozhatsky <[email protected]> wrote:
> > I don't really understand the role of loglevel anymore.
>
> "what the kernel considers" is a configuration option of the
> administrator. The administrator can increase the verbosity of the
> console (loglevel) without having negative effects on the system
> itself.

Where do you get the confidence that the atomic console will not
slow down the system? Have you tried it on a real-life workload
when debugging a real-life bug?

Some benchmarks might help. Well, they would need to
trigger some messages and show how the different
approaches affect the overall system performance.


> Also, if the system were to suddenly crash, those crash messages
> shouldn't be in jeopardy just because the verbocity of the console was
> turned up.

This assumes that the error messages alone will be enough to discover
and fix the problem.


> You (and Petr) talk as though _all_ console printing is for
> emergencies. That if an administrator sets the loglevel to 7 it is
> because the pr_info messages are just as important as the pr_emerg.

It might be true when the messages with higher level (more critical)
are not enough to understand the situation.


> And if that is indeed the intention of console printing and loglevel, then
> why are asynchronous printk calls for console messages even allowed
> today? IMO that isn't taking the importance of the message very
> seriously.

Because it was working pretty well in the past. The amount of messages
is still growing (code complexity, more CPUs, more devices, ...).
Our customers started reporting softlockups "only" 7 years ago
or so.

We currently have two-level handling of messages:

+ all messages can be seen from userspace
+ messages below console_loglevel can be seen
  on the console

You are introducing one more level of handling:

+ critical messages are printed on the console directly
  even before the queued less critical ones


The third level would be acceptable when:

+ atomic consoles are reliable enough
+ the code complexity is worth the gain


IMHO, we mix too many things here:

+ log buffer implementation
+ console offload
+ direct console handling using atomic consoles

I see the potential in all areas:

+ lockless ring buffer helps to avoid deadlocks,
  and extra log buffers

+ console offload prevents too long stalls (softlockups)

+ direct console handling might help to avoid deadlocks
  and might make the output more reliable.


I think that we are on the same page here.

But we must use an incremental approach. It is not acceptable
to replace everything by a single patch. And it is not acceptable
to break important functionality and implement alternative
solution several patches later.

Also no solution is as ideal as it is sometimes presented
in this thread.

Best Regards,
Petr

2019-03-12 12:39:56

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On Mon 2019-03-11 19:54:11, Sergey Senozhatsky wrote:
> On (03/07/19 10:53), John Ogness wrote:
> > Since all current console drivers are already irq safe, I'm
> > wondering if using irq_work to handle the emergency printing for console
> > drivers without write_atomic() would help. (If the printk caller is in a
> > context that write() supports, then write() could be called directly.)
> > This would also demand that the irq-safe requirements for write() are
> > not relaxed. The printk-kthread might still be faster than irq_work, but
> > it might increase reliability if an irq_work is triggered as an extra
> > precaution.
>
> Hmm. OK. So one of the things with printk is that it's fully sequential.
> We call console drivers one by one. Slow consoles can affect what appears
> on the fast consoles; fast consoles have no impact on slow ones.
>
>     call_console_drivers()
>         for_each_console(c)
>             c->write(c, text, text_len);
>
> So a list of (slow_serial serial netcon) console drivers is a camel train;
> fast netcon is not fast anymore, and slow consoles sometimes are the reason
> we have dropped messages. And if we drop messages we drop them for all
> consoles, including fast netcon. Turning that sequential pipeline into a
> bunch of per-console kthreads/irq and letting fast consoles be fast is
> not a completely bad thing. Let's think more about this, I'd like to read
> more opinions.

Per-console kthread sounds interesting but there is the problem with
reliability. I mean that a kthread need not get scheduled.

Some of these problems might get solved by the per-console loglevel
patchset.

Sigh, any feature might be useful in some situation. But we always
have to consider the cost and the gain. I wonder how common it is
to actively use two consoles at the same time and what would
be the motivation.

Best Regards,
Petr

2019-03-12 15:17:29

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On 2019-03-12, Petr Mladek <[email protected]> wrote:
> Per-console kthread sounds interesting but there is the problem with
> reliability. I mean that a kthread need not get scheduled.
>
> Some of these problems might get solved by the per-console loglevel
> patchset.
>
> Sigh, any feature might be useful in some situation. But we always
> have to consider the cost and the gain. I wonder how common it is
> to actively use two consoles at the same time and what would
> be the motivation.

The following is from the linux-serial mailing list:
"Re: Serial console is causing system lock-up" [0]

I'm responding to it here because it really belongs in this thread.

On 2019-03-12, Petr Mladek <[email protected]> wrote:
> On Tue 2019-03-12 09:17:49, John Ogness wrote:
>> The current printk implementation is handling all console printing as
>> best effort. Trying hard enough to dramatically affect the system, but
>> not trying hard enough to guarantee success.
>
> I agree that direct output is more reliable. It might be very useful
> for debugging some types of problems. The question is if it is worth
> the cost (code complexity, serializing CPUs == slowing down the
> entire system).
>
> But it is possible that a reasonable offloading (in the direction
> of Sergey's last approach) might be a better deal.
>
>
> I suggest the following way forward (separate patchsets):
>
> 1. Replace log buffer (least controversial thing)

Yes. I will post a series that only implements the ringbuffer using your
simplified API. That will be enough to remove printk_safe and actually
does most of the work of updating devkmsg, kmsg_dump, and syslog.

> 2. Reliable offload to kthread (would be useful anyway)

Yes. I would like to implement per-console kthreads for this series. I
think the advantages are obvious. For PREEMPT_RT the offloading will
need to be always active. (PREEMPT_RT cannot call the console->write()
from atomic contexts.) But I think this would be acceptable at first. It
would certainly be better than what PREEMPT_RT is doing now.

> 3. Atomic consoles (a lot of tricky code, might not be
> worth the effort)

I think this will be necessary. PREEMPT_RT cannot support reliable
emergency console messages without it. And for kernel developers this is
also very helpful. People like PeterZ are using their own patches
because the mainline kernel is not providing this functionality.

The decision about _when_ to use it is still in the air. But I guess
we'll worry about that when we get that far. There's enough to do until
then.

> Could we agree on this?

I think this is a sane path forward. Thank you for providing some
direction here.

John Ogness

[0] https://marc.info/?l=linux-serial&m=155239250721543

2019-03-13 02:01:17

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On (03/12/19 13:38), Petr Mladek wrote:
> > Hmm. OK. So one of the things with printk is that it's fully sequential.
> > We call console drivers one by one. Slow consoles can affect what appears
> > on the fast consoles; fast consoles have no impact on slow ones.
> >
> >     call_console_drivers()
> >         for_each_console(c)
> >             c->write(c, text, text_len);
> >
> > So a list of (slow_serial serial netcon) console drivers is a camel train;
> > fast netcon is not fast anymore, and slow consoles sometimes are the reason
> > we have dropped messages. And if we drop messages we drop them for all
> > consoles, including fast netcon. Turning that sequential pipeline into a
> > bunch of per-console kthreads/irq and letting fast consoles be fast is
> > not a completely bad thing. Let's think more about this, I'd like to read
> > more opinions.
>
> Per-console kthread sounds interesting but there is the problem with
> reliability. I mean that a kthread need not get scheduled.

Correct, it has to get scheduled. From that point of view IRQ offloading
looks better - either to irq_work (like John suggested) or to serial
drivers' irq handler (poll uart xmit + logbuf).

kthread offloading is not super reliable. That's why I played tricks
with CPU affinity - the scheduler sometimes schedules printk_kthread on
the same CPU which spins in the console_unlock() loop printing the
messages, so printk_kthread offloading never happens. It was first
discovered by Jan Kara (back in the days of the async-printk patch
set). I think at some point Jan's async-printk patch set had two printk
kthreads.

We also had some concerns regarding offloading on UP systems.

> Some of these problems might get solved by the per-console loglevel
> patchset.

Yes, some.

> Sigh, any feature might be useful in some situation. But we always
> have to consider the cost and the gain. I wonder how common it is
> to actively use two consoles at the same time and what would
> be the motivation.

Facebook fleet for example. The motivation is to have a fancy fast
console that does things which simple serial consoles cannot do, and
a slow serial console, which is sometimes more reliable, as a last
resort. Fancy stuff usually means dependencies - net, mm, etc. So when
the fancy console stops working, the slow serial console still works.

-ss

2019-03-13 02:16:54

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On (03/12/19 16:15), John Ogness wrote:
> > I suggest the following way forward (separate patchsets):
> >
> > 1. Replace log buffer (least controversial thing)
>
> Yes. I will post a series that only implements the ringbuffer using your
> simplified API. That will be enough to remove printk_safe and actually
> does most of the work of updating devkmsg, kmsg_dump, and syslog.

This may _not_ be enough to remove printk_safe. One of the reasons
printk_safe "condom" came into existence was console_sem (which
is a bit too important to ignore):

printk()
 console_trylock()
 console_unlock()
  up()
   raw_spin_lock_irqsave(&sem->lock, flags)
   __up()
    wake_up_process()
     WARN/etc
      printk()
       console_trylock()
        down_trylock()
         raw_spin_lock_irqsave(&sem->lock, flags)   << deadlock

Back then we were looking at

printk->console_sem->lock->printk->console_sem->lock

deadlock report from LG, if I'm not mistaken.

-ss

2019-03-13 08:20:23

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On 2019-03-13, Sergey Senozhatsky <[email protected]> wrote:
>>> I suggest the following way forward (separate patchsets):
>>>
>>> 1. Replace log buffer (least controversial thing)
>>
>> Yes. I will post a series that only implements the ringbuffer using
>> your simplified API. That will be enough to remove printk_safe and
>> actually does most of the work of updating devkmsg, kmsg_dump, and
>> syslog.
>
> This may _not_ be enough to remove printk_safe. One of the reasons
> printk_safe "condom" came into existence was console_sem (which
> is a bit too important to ignore):
>
> printk()
>  console_trylock()
>  console_unlock()
>   up()
>    raw_spin_lock_irqsave(&sem->lock, flags)
>    __up()
>     wake_up_process()
>      WARN/etc
>       printk()
>        console_trylock()
>         down_trylock()
>          raw_spin_lock_irqsave(&sem->lock, flags)   << deadlock
>
> Back then we were looking at
>
> printk->console_sem->lock->printk->console_sem->lock
>
> deadlock report from LG, if I'm not mistaken.

The main drawback of printk_safe is the safe buffers, which, aside from
bogus timestamping, may never make it back to the printk log buffer.

With the new ring buffer the safe buffers are not needed, even in the
recursive situation. As you are pointing out, the notification/wake
component of printk_safe will still be needed. I will leave that (small)
part in printk_safe.c.

John Ogness

by Sebastian Andrzej Siewior

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On 2019-03-13 09:19:32 [+0100], John Ogness wrote:
> recursive situation. As you are pointing out, the notification/wake
> component of printk_safe will still be needed. I will leave that (small)
> part in printk_safe.c.

Does this mean we keep irq_work part or we bury it and solve it by other
means?

> John Ogness

Sebastian

2019-03-13 08:46:41

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On (03/13/19 09:19), John Ogness wrote:
> >> Yes. I will post a series that only implements the ringbuffer using
> >> your simplified API. That will be enough to remove printk_safe and
> >> actually does most of the work of updating devkmsg, kmsg_dump, and
> >> syslog.
> >
> > This may _not_ be enough to remove printk_safe. One of the reasons
> > printk_safe "condom" came into existence was console_sem (which
> > is a bit too important to ignore):
> >
> > printk()
> >  console_trylock()
> >  console_unlock()
> >   up()
> >    raw_spin_lock_irqsave(&sem->lock, flags)
> >    __up()
> >     wake_up_process()
> >      WARN/etc
> >       printk()
> >        console_trylock()
> >         down_trylock()
> >          raw_spin_lock_irqsave(&sem->lock, flags)   << deadlock
> >
> > Back then we were looking at
> >
> > printk->console_sem->lock->printk->console_sem->lock
> >
> > deadlock report from LG, if I'm not mistaken.
>
> The main drawback of printk_safe is the safe buffers, which, aside from
> bogus timestamping, may never make it back to the printk log buffer.
>
> With the new ring buffer the safe buffers are not needed, even in the
> recursive situation. As you are pointing out, the notification/wake
> component of printk_safe will still be needed. I will leave that (small)
> part in printk_safe.c.

Yeah, all I'm saying is that as it stands the new printk() is missing a
huge and necessary part - the console semaphore. And things can get very
different once you add that missing part. It brings a lot of stuff back
to printk.

logbuf and logbuf_lock are not really huge printk problems. scheduler,
timekeeping locks, etc. are much bigger ones. Those dependencies don't
come from logbuf/logbuf_lock.

-ss

2019-03-13 09:29:59

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On (03/13/19 09:40), Sebastian Siewior wrote:
> On 2019-03-13 09:19:32 [+0100], John Ogness wrote:
> > recursive situation. As you are pointing out, the notification/wake
> > component of printk_safe will still be needed. I will leave that (small)
> > part in printk_safe.c.
>
> Does this mean we keep irq_work part or we bury it and solve it by other
> means?

That's a very good question. Because if we add console_trylock()
to printk(), then we can't invoke ->atomic() consoles when console_sem
is already locked, because one of the registered drivers is currently
being modified by a 3rd party and printk(), thus, must stay away.
Once that modification is done, console_unlock() will print all
pending messages. This is the current design. And this conflicts with
the whole idea of ->atomic() consoles.

So maybe we need a whole new scheme in this department as well.

For instance [*and this is a completely untested idea* !!!]

*May be* we can take a closer look and find cases when ->atomic
consoles don't really depend on console_sem. And *may be* we can
split the console drivers list and somehow forbid removal and
modification (ioctl) of ->atomic consoles under us. Assuming that
this is doable we then can start iterating ->atomic consoles list
with unlocked console_sem.
Non->atomic consoles or consoles which depend on console_sem (VT,
fbcon and so on) will stay in another list, and we will take
console_sem before we iterate that list and invoke those drivers.

One more time - a completely random thought.
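
For what it's worth, a crude sketch of the split (all names invented;
it assumes consoles gain a list_head member and that mutation of the
atomic list is made NMI-safe somehow):

    static LIST_HEAD(atomic_consoles);   /* have ->write_atomic() */
    static LIST_HEAD(regular_consoles);  /* VT, fbcon, ... need console_sem */

    static void emergency_write(const char *text, size_t len)
    {
        struct console *con;

        /* No console_sem taken here. */
        list_for_each_entry(con, &atomic_consoles, node)
            con->write_atomic(con, text, len);
    }

    static void normal_write(const char *text, size_t len)
    {
        struct console *con;

        console_lock();
        list_for_each_entry(con, &regular_consoles, node)
            con->write(con, text, len);
        console_unlock();
    }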

-ss

2019-03-13 10:07:46

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On (03/13/19 18:27), Sergey Senozhatsky wrote:
> > Does this mean we keep irq_work part or we bury it and solve it by other
> > means?
>
> That's a very good question. Because if we add console_trylock()
> to printk(), then we can't invoke ->atomic() consoles when console_sem
> is already locked, because one of the registered drivers is currently
> being modified by a 3rd party and printk(), thus, must stay away.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
one of the drivers or the list itself.

-ss

2019-03-14 09:15:38

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On Tue 2019-03-12 16:15:55, John Ogness wrote:
> On 2019-03-12, Petr Mladek <[email protected]> wrote:
> > On Tue 2019-03-12 09:17:49, John Ogness wrote:
> >> The current printk implementation is handling all console printing as
> >> best effort. Trying hard enough to dramatically affect the system, but
> >> not trying hard enough to guarantee success.
> >
> > I agree that direct output is more reliable. It might be very useful
> > for debugging some types of problems. The question is if it is worth
> > the cost (code complexity, serializing CPUs == slowing down the
> > entire system).
> >
> > But it is possible that a reasonable offloading (in the direction
> > of Sergey's last approach) might be a better deal.
> >
> >
> > I suggest the following way forward (separate patchsets):
> >
> > 1. Replace log buffer (least controversial thing)
>
> Yes. I will post a series that only implements the ringbuffer using your
> simplified API. That will be enough to remove printk_safe and actually
> does most of the work of updating devkmsg, kmsg_dump, and syslog.

Great. I just wonder if it is going to be fully lockless or
still using the prb_lock. I can understand that a fully lockless
solution will be much more complicated. But I think that it would
make a lot of things easier in the long run. Especially it might
help to avoid the big-kernel-lock-like approach.


> > 2. Reliable offload to kthread (would be useful anyway)
>
> Yes. I would like to implement per-console kthreads for this series. I
> think the advantages are obvious. For PREEMPT_RT the offloading will
> need to be always active. (PREEMPT_RT cannot call the console->write()
> from atomic contexts.) But I think this would be acceptable at first. It
> would certainly be better than what PREEMPT_RT is doing now.

I would personally start with one kthread. I am afraid that
the discussion about it will be complicated enough. We could
always make it more complicated later.

I understand the immediate offloading might be necessary for
PREEMPT_RT. But a more conservative approach is needed for
non-rt kernels.

Well, if there won't be a big difference in the complexity
between one and more threads then we could mix it. But
I personally see this as two steps that are better done
separately.


> > 3. Atomic consoles (a lot of tricky code, might not be
> > worth the effort)
>
> I think this will be necessary. PREEMPT_RT cannot support reliable
> emergency console messages without it. And for kernel developers this is
> also very helpful. People like PeterZ are using their own patches
> because the mainline kernel is not providing this functionality.
>
> The decision about _when_ to use it is still in the air. But I guess
> we'll worry about that when we get that far. There's enough to do until
> then.

I am sure that there are situations where the direct output
to atomic consoles would help with debugging. PeterZ and Steven
are using their own patches for a reason.

Let's see how the code is complicated and how many consoles
might get supported a reasonable way.


Anyway, it will be a long run. I am personally curious where
this will end :-)

Best Regards,
Petr

2019-03-14 09:28:22

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On Wed 2019-03-13 18:27:59, Sergey Senozhatsky wrote:
> On (03/13/19 09:40), Sebastian Siewior wrote:
> > On 2019-03-13 09:19:32 [+0100], John Ogness wrote:
> > > recursive situation. As you are pointing out, the notification/wake
> > > component of printk_safe will still be needed. I will leave that (small)
> > > part in printk_safe.c.
> >
> > Does this mean we keep irq_work part or we bury it and solve it by other
> > means?
>
> *May be* we can take a closer look and find cases when ->atomic
> consoles don't really depend on console_sem. And *may be* we can
> split the console drivers list and somehow forbid removal and
> modification (ioctl) of ->atomic consoles under us. Assuming that
> this is doable we then can start iterating ->atomic consoles list
> with unlocked console_sem.

I believe that this is actually the plan. Atomic consoles depending
on console_sem would not be a real step forward.


> Non->atomic consoles or consoles which depend on console_sem (VT,
> fbcon and so on) will stay in another list, and we will take
> console_sem before we iterate that list and invoke those drivers.

This might also be needed for "less" important messages
that people would not want to get to the console atomically
because it would serialize CPUs and slow down the entire
system.

I think that we would still need irq_work for this mode.
But it should be necessary only for messages from NMI context
and printk() recursion. It means that it should be a rare
situation and the amount of messages should get limited.
It should not be much worse than handling few printk()
calls from any irq handler.

Best Regards,
Petr

2019-03-14 09:36:15

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On 2019-03-14, Petr Mladek <[email protected]> wrote:
>>> I suggest the following way forward (separate patchsets):
>>>
>>> 1. Replace log buffer (least controversial thing)
>>
>> Yes. I will post a series that only implements the ringbuffer using
>> your simplified API. That will be enough to remove printk_safe and
>> actually does most of the work of updating devkmsg, kmsg_dump, and
>> syslog.
>
> Great. I just wonder if it is going to be fully lockless or
> still using the prb_lock. I can understand that a fully lockless
> solution will be much more complicated. But I think that it would
> make a lot of things easier in the long run. Especially it might
> help to avoid the big-kernel-lock-like approach.

I will re-visit my lockless implementation and see if I can minimize
complexity. We have since identified a couple of other negative effects
of the prb_lock for PREEMPT_RT relating to CPU-isolated setups. So the
lockless version is starting to look more attractive from a PREEMPT_RT
perspective as well.
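
To give a flavor of the lockless direction (greatly simplified; the
real design also needs wrap-around handling, commit states and reader
support, and every name here is an assumption):

    #define PRB_DATA_SIZE   (1 << 17)   /* power of two, for masking */

    struct prb {
        atomic_long_t   head;           /* next free byte, monotonic */
        char            data[PRB_DATA_SIZE];
    };

    /* Writers may race from any CPU or context (including NMI);
     * the cmpxchg loop orders them without a shared lock. */
    static char *prb_reserve(struct prb *rb, size_t len)
    {
        long old, new;

        do {
            old = atomic_long_read(&rb->head);
            new = old + len;
        } while (atomic_long_cmpxchg(&rb->head, old, new) != old);

        return &rb->data[old & (PRB_DATA_SIZE - 1)];
    }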

>>> 2. Reliable offload to kthread (would be useful anyway)
>>
>> Yes. I would like to implement per-console kthreads for this
>> series. I think the advantages are obvious. For PREEMPT_RT the
>> offloading will need to be always active. (PREEMPT_RT cannot call the
>> console->write() from atomic contexts.) But I think this would be
>> acceptable at first. It would certainly be better than what
>> PREEMPT_RT is doing now.
>
> I would personally start with one kthread. I am afraid that
> the discussion about it will be complicated enough. We could
> always make it more complicated later.
>
> I understand the immediate offloading might be necessary for
> PREEMPT_RT. But a more conservative approach is needed for
> non-rt kernels.
>
> Well, if there won't be a big difference in the complexity
> between one and more threads then we could mix it. But
> I personally see this as two steps that are better done
> separately.

I will create the series in a way that a complete solution with 1
kthread exists and all following patches in the series add per-console
kthreads. Then we have a clean cut if we decide we only want the first
part of the series.

>>> 3. Atomic consoles (a lot of tricky code, might not be
>>> worth the effort)
>>
>> I think this will be necessary. PREEMPT_RT cannot support reliable
>> emergency console messages without it. And for kernel developers this
>> is also very helpful. People like PeterZ are using their own patches
>> because the mainline kernel is not providing this functionality.
>>
>> The decision about _when_ to use it is still in the air. But I guess
>> we'll worry about that when we get that far. There's enough to do
>> until then.
>
> I am sure that there are situations where the direct output
> to atomic consoles would help with debugging. PeterZ and Steven
> are using their own patches for a reason.
>
> Let's see how the code is complicated and how many consoles
> might get supported a reasonable way.

Agreed.

I will post each of the above series as RFCv2 because I expect we still
need some discussion. Especially if I post the fully lockless version of
the ringbuffer.

I have taken a *lot* of notes about things that need changing during
this thread. I think a lot of great feedback has come out of this
RFC. Thank you to everyone that responded (publicly and privately)! I'll
need several weeks before the RFCv2 for the ringbuffer is ready.

John Ogness

2020-01-20 23:16:38

by Eugeniu Rosca

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

Hello John, all,

Cc: Geert, Morimoto-san,

On Tue, Feb 12, 2019 at 03:29:38PM +0100, John Ogness wrote:
> Hello,
>
> As probably many of you are aware, the current printk implementation
> has some issues. This series (against 5.0-rc6) makes some fundamental
> changes in an attempt to address these issues. The particular issues I
> am referring to:
>
> 1. The printk buffer is protected by a global raw spinlock for readers
> and writers. This restricts the contexts that are allowed to
> access the buffer.
>
> 2. Because of #1, NMI and recursive contexts are handled by deferring
> logging/printing to a spinlock-safe context. This means that
> messages will not be visible if (for example) the kernel dies in
> NMI context and the irq_work mechanism does not survive.
>
> 3. Because of #1, when *not* using features such as PREEMPT_RT, large
> latencies exist when printing to slow consoles.

This [1] is a fairly old thread, but I only recently stumbled upon it,
while co-investigating the audio distortions below [2] on R-Car3 ARM64
boards, which can be reproduced by stressing [3] the serial console.

The investigation started a few months ago, when users reported
audio drops during the first seconds of system startup. Only after
a few weeks it became clear (thanks to some people in Cc) that the
distortions were contributed by the above-average serial console load
during the early boot. Once understood, we were able to come up with
a synthetic test [2-3].

I thought it would be interesting to share the reproduction matrix
below, in order to contrast vanilla with linux-rt-devel [4], as well as to
compare various preemption models.

                          | Ser.console   Ser.console
                          | stressed      at rest or disabled
---------------------------------------------------------------
v5.5-rc6 (PREEMPT=y)      | distorted     clean
v5.4.5-rt3 (PREEMPT=y)    | distorted     clean
v5.4.5-rt3 (PREEMPT_RT=y) | clean         clean

My feeling is that the results probably do not surprise linux-rt people.

My first question is, should there be any improvement in the case of
v5.4.5-rt3 (PREEMPT=y), which I do not observe? I would expect so, based
on the cover letter of this series (pointing out the advantages of the
redesigned printk mechanism).

And the other question is, how would you, generally speaking, tackle
the problem, given that backporting the linux-rt patches is *not* an
option and enabling serial console is a must?

[1] https://lore.kernel.org/lkml/[email protected]/
[2] H3ULCB> speaker-test -f24_LE -c2 -t wav -Dplughw:rcarsound -b 4000
https://vocaroo.com/9NV98mMgdjX
[3] https://github.com/erosca/linux/tree/stress-serial
[4] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/

--
Best Regards,
Eugeniu

2020-01-21 23:58:30

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

Hello Eugeniu,

On 2020-01-21, Eugeniu Rosca <[email protected]> wrote:
> This [1] is a fairly old thread, but I only recently stumbled upon it,
> while co-investigating the audio distortions below [2] on R-Car3 ARM64
> boards, which can be reproduced by stressing [3] the serial console.
>
> The investigation started a few months ago, when users reported audio
> drops during the first seconds of system startup. Only after a few
> weeks it became clear (thanks to some people in Cc) that the
> distortions were contributed by the above-average serial console load
> during the early boot. Once understood, we were able to come up with a
> synthetic test [2-3].
>
> I thought it would be interesting to share the reproduction matrix
> below, in order to contrast vanilla with linux-rt-devel [4], as well as to
> compare various preemption models.
>
>                           | Ser.console   Ser.console
>                           | stressed      at rest or disabled
> ---------------------------------------------------------------
> v5.5-rc6 (PREEMPT=y)      | distorted     clean
> v5.4.5-rt3 (PREEMPT=y)    | distorted     clean
> v5.4.5-rt3 (PREEMPT_RT=y) | clean         clean
>
> My feeling is that the results probably do not surprise linux-rt
> people.
>
> My first question is, should there be any improvement in the case of
> v5.4.5-rt3 (PREEMPT=y), which I do not observe? I would expect so, based
> on the cover letter of this series (pointing out the advantages of the
> redesigned printk mechanism).

The problem you are reporting is not the problem that the printk rework
is trying to solve.

In your chart, v5.4.5-rt3 (PREEMPT_RT=y) is the only configuration that
is _not_ disabling hardware interrupts during UART activity. I would
guess your problem is due to interrupts being disabled for unacceptable
lengths of time. You need a low-latency system, so PREEMPT_RT=y _is_ the
correct (and only) solution if a verbose serial console is a must.

The printk rework focusses on making printk non-interfering by
decoupling console printing from printk() callers. However, the console
printing itself will still do just as much interrupt disabling as
before. That is driver-related, not printk-related.

> And the other question is, how would you, generally speaking, tackle
> the problem, given that backporting the linux-rt patches is *not* an
> option and enabling serial console is a must?

The linux-rt patches (which include this printk rework) *are* being
ported to mainline now. My recommendation is to continue using the
linux-rt patches (with PREEMPT_RT=y) until PREEMPT_RT is available
mainline.

John Ogness

> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] H3ULCB> speaker-test -f24_LE -c2 -t wav -Dplughw:rcarsound -b 4000
> https://vocaroo.com/9NV98mMgdjX
> [3] https://github.com/erosca/linux/tree/stress-serial
> [4] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/

2020-01-22 02:35:43

by Eugeniu Rosca

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

Hi John,

Thank you for the comprehensive feedback. Some replies below.

On Wed, Jan 22, 2020 at 12:56:48AM +0100, John Ogness wrote:
> On 2020-01-21, Eugeniu Rosca <[email protected]> wrote:
> > This [1] is a fairly old thread, but I only recently stumbled upon it,
> > while co-investigating the audio distortions below [2] on R-Car3 ARM64
> > boards, which can be reproduced by stressing [3] the serial console.
> >
> > The investigation started a few months ago, when users reported audio
> > drops during the first seconds of system startup. Only after a few
> > weeks it became clear (thanks to some people in Cc) that the
> > distortions were contributed by the above-average serial console load
> > during the early boot. Once understood, we were able to come up with a
> > synthetic test [2-3].
> >
> > I thought it would be interesting to share the reproduction matrix
> > below, in order to contrast vanilla with linux-rt-devel [4], as well as to
> > compare various preemption models.
> >
> >                           | Ser.console   Ser.console
> >                           | stressed      at rest or disabled
> > ---------------------------------------------------------------
> > v5.5-rc6 (PREEMPT=y)      | distorted     clean
> > v5.4.5-rt3 (PREEMPT=y)    | distorted     clean
> > v5.4.5-rt3 (PREEMPT_RT=y) | clean         clean
> >
> > My feeling is that the results probably do not surprise linux-rt
> > people.
> >
> > My first question is, should there be any improvement in the case of
> > v5.4.5-rt3 (PREEMPT=y), which I do not observe? I would expect so, based
> > on the cover letter of this series (pointing out the advantages of the
> > redesigned printk mechanism).
>
> The problem you are reporting is not the problem that the printk rework
> is trying to solve.

In general, agreed. But there are some quirks in how the issue
(increased audio latency) manifests itself on R-Car3, which might
leave room for a (relatively simple) workaround in the printk
mechanism.

With that said, I need to diverge a bit from the platform-agnostic
scope of this series.

So, what's specific to R-Car3, based on my testing, is that the issue
can only be reproduced if the printk storm originates on CPU0 (it does
not matter if from interrupt or task context, both have been tested). If
the printk storm is initiated on any other CPU (there are 7 secondary
ones on R-Car H3), there is no regression in the audio quality/latency.

I cannot fully explain this empirical observation, but it directs my
mind to the following workaround, for which I have a PoC:
- employ vprintk_safe() any time CPU0 is the owner/caller of printk
- tie CPU0-private printk internal IRQ workers to another CPU

The above makes sure nothing is printed to the serial console on
behalf of CPU0. I don't expect this to be accepted by the community,
but could you please share your opinion on whether the idea itself is
sane?
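
For illustration, the core of the PoC looks roughly like the sketch
below (the hook point in vprintk_func() is simplified here; the
existing NMI/recursion checks and the second bullet, moving the
printk_safe IRQ-work flush off CPU0, are omitted):

/*
 * Sketch only: divert all printk activity on CPU0 into the per-CPU
 * printk_safe buffer, so that CPU0 never prints to the console
 * directly. Hooked into the vprintk_func() dispatcher in
 * kernel/printk/printk_safe.c.
 */
__printf(1, 0) int vprintk_func(const char *fmt, va_list args)
{
	/* WORKAROUND: keep CPU0 away from direct console output */
	if (raw_smp_processor_id() == 0)
		return vprintk_safe(fmt, args);

	/* ... existing NMI/recursion handling ... */
	return vprintk_default(fmt, args);
}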

>
> In your chart, v5.4.5-rt3 (PREEMPT_RT=y) is the only configuration that
> is _not_ disabling hardware interrupts during UART activity. I would
> guess your problem is due to interrupts being disabled for unacceptable
> lengths of time.

This confirms the internally established view on the issue.

> You need a low-latency system, so PREEMPT_RT=y _is_ the
> correct (and only) solution if a verbose serial console is a must.

It's helpful to have your feedback on this.

>
> The printk rework focusses on making printk non-interfering by
> decoupling console printing from printk() callers. However, the console
> printing itself will still do just as much interrupt disabling as
> before. That is driver-related, not printk-related.

I didn't dive into the internals of this series, but decoupling the
execution context of the serial driver from the execution context of
the printk callers sounds very good to me (this is what I try to
achieve via vanilla vprintk_safe). I wonder if it's easier to remove
CPU0 from the equation with this series applied.

>
> > And the other question is, how would you, generally speaking, tackle
> > the problem, given that backporting the linux-rt patches is *not* an
> > option and enabling serial console is a must?
>
> The linux-rt patches (which include this printk rework) *are* being
> ported to mainline now. My recommendation is to continue using the
> linux-rt patches (with PREEMPT_RT=y) until PREEMPT_RT is available
> mainline.

That's extremely useful feedback. However, I still see non-trivial
differences between mainline and linux-rt-devel:

$ git diff --shortstat v5.4.13..v5.4.13-rt7
401 files changed, 9577 insertions(+), 3616 deletions(-)

I would be happy to see this slim down over time. If there is any
roadmap publicly available, I would appreciate a reference.

>
> John Ogness

Thanks again!

>
> > [1] https://lore.kernel.org/lkml/[email protected]/
> > [2] H3ULCB> speaker-test -f24_LE -c2 -t wav -Dplughw:rcarsound -b 4000
> > https://vocaroo.com/9NV98mMgdjX
> > [3] https://github.com/erosca/linux/tree/stress-serial
> > [4] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/

--
Best Regards
Eugeniu Rosca

2020-01-22 07:33:05

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

Hi Eugeniu,

On Wed, Jan 22, 2020 at 3:34 AM Eugeniu Rosca <[email protected]> wrote:
> On Wed, Jan 22, 2020 at 12:56:48AM +0100, John Ogness wrote:
> > On 2020-01-21, Eugeniu Rosca <[email protected]> wrote:
> > > This [1] is a fairly old thread, but I only recently stumbled upon it,
> > > while co-investigating below audio distortions [2] on R-Car3 ARM64
> > > boards, which can be reproduced by stressing [3] the serial console.
> > >
> > > The investigation started a few months ago, when users reported audio
> > > drops during the first seconds of system startup. Only after a few
> > > weeks it became clear (thanks to some people in Cc) that the
> > > distortions were contributed by the above-average serial console load
> > > during the early boot. Once understood, we were able to come up with a
> > > synthetic test [2-3].
> > >
> > > I thought it would be interesting to share below reproduction matrix,
> > > in order to contrast vanilla to linux-rt-devel [4], as well as to
> > > compare various preemption models.
> > >
> > >                           | Ser.console   Ser.console
> > >                           | stressed      at rest or disabled
> > > ------------------------------------------------------------
> > > v5.5-rc6 (PREEMPT=y)      | distorted     clean
> > > v5.4.5-rt3 (PREEMPT=y)    | distorted     clean
> > > v5.4.5-rt3 (PREEMPT_RT=y) | clean         clean
> > >
> > > My feeling is that the results probably do not surprise linux-rt
> > > people.
> > >
> > > My first question is, should there be any improvement in the case of
> > > v5.4.5-rt3 (PREEMPT=y), which I do not sense? I would expect so, based
> > > on the cover letter of this series (pointing out the advantages of the
> > > redesigned printk mechanism).
> >
> > The problem you are reporting is not the problem that the printk rework
> > is trying to solve.
>
> In general, agreed. But there are some quirks in how the issue
> (increased audio latency) manifests itself on R-Car3, which might
> leave room for a (relatively simple) workaround in the printk
> mechanism.
>
> With that said, I need to diverge a bit from the platform-agnostic
> scope of this series.
>
> So, what's specific to R-Car3, based on my testing, is that the issue
> can only be reproduced if the printk storm originates on CPU0 (it does
> not matter if from interrupt or task context, both have been tested). If
> the printk storm is initiated on any other CPU (there are 7 secondary
> ones on R-Car H3), there is no regression in the audio quality/latency.

The secure stuff is running on CPU0, isn't it?
Is that a coincidence?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2020-01-22 10:35:01

by John Ogness

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On 2020-01-22, Eugeniu Rosca <[email protected]> wrote:
> So, what's specific to R-Car3, based on my testing, is that the issue
> can only be reproduced if the printk storm originates on CPU0 (it does
> not matter if from interrupt or task context, both have been
> tested). If the printk storm is initiated on any other CPU (there are
> 7 secondary ones on R-Car H3), there is no regression in the audio
> quality/latency.
>
> I cannot fully explain this empirical observation, but it directs my
> mind to the following workaround, for which I have a PoC:
> - employ vprintk_safe() any time CPU0 is the owner/caller of printk
> - tie CPU0-private printk internal IRQ workers to another CPU
>
> The above makes sure nothing is printed to the serial console on
> behalf of CPU0. I don't expect this to be accepted by the community,
> but could you please share your opinion on whether the idea itself is
> sane?

It is a problem-specific hack. You will need to be certain that CPU1-7
will never have problems with console printing storms.

Be aware that vprintk_safe() is not particularly reliable in many crash
scenarios. If seeing oops output on the console is important, this can
be a risky hack.

Also, be aware that it has its own config option for the safe buffer
size: PRINTK_SAFE_LOG_BUF_SHIFT
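
For example, something like this in the kernel config would give each
CPU a 128 KiB safe buffer instead of the default 8 KiB (pick a value
matching your expected message volume):

    CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=17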

>> The printk rework focusses on making printk non-interfering by
>> decoupling console printing from printk() callers. However, the
>> console printing itself will still do just as much interrupt
>> disabling as before. That is driver-related, not printk-related.
>
> I didn't dive into the internals of this series, but decoupling the
> execution context of the serial driver from the execution context of
> the printk callers sounds very good to me (this is what I try to
> achieve via vanilla vprintk_safe). I wonder if it's easier to remove
> CPU0 from the equation with this series applied.

Yes, it would be quite easy. The console printers run as dedicated
kthreads. It is only a matter of setting the CPU affinity for the
related kthread.
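
For a quick test this can even be done from userspace, since the
kthread (named "printk") is created with kthread_run() and therefore
without PF_NO_SETAFFINITY. Something like this (hypothetical,
untested) should pin it to CPU7:

    taskset -pc 7 "$(pgrep -x printk)"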

>> The linux-rt patches (which include this printk rework) *are* being
>> ported to mainline now. My recommendation is to continue using the
>> linux-rt patches (with PREEMPT_RT=y) until PREEMPT_RT is available
>> mainline.
>
> If there is any roadmap publicly available, I would appreciate a
> reference.

I am only aware of the quilt "series" file [0] that is roughly
documenting the status of the effort.

John Ogness

[0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/series?h=linux-5.4.y-rt-patches

2020-01-22 17:00:16

by Eugeniu Rosca

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

Hi Geert,

On Wed, Jan 22, 2020 at 08:31:44AM +0100, Geert Uytterhoeven wrote:
> On Wed, Jan 22, 2020 at 3:34 AM Eugeniu Rosca <[email protected]> wrote:
> >
> > So, what's specific to R-Car3, based on my testing, is that the issue
> > can only be reproduced if the printk storm originates on CPU0 (it does
> > not matter if from interrupt or task context, both have been tested). If
> > the printk storm is initiated on any other CPU (there are 7 secondary
> > ones on R-Car H3), there is no regression in the audio quality/latency.
>
> The secure stuff is running on CPU0, isn't it?
> Is that a coincidence?

Nobody has ruled this out so far. As a side note, except for the ARMv8
generic IPs, there seems to be quite poor IRQ balancing between the
CPU cores of R-Car H3 (although this might be unrelated to the issue):

$ cat /proc/interrupts | egrep -v "(0[ ]*){8}"
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
3: 55879 17835 14132 33882 6626 4331 6710 4532 GICv2 30 Level arch_timer
16: 1 0 0 0 0 0 0 0 GICv2 38 Level e6052000.gpio
32: 203 0 0 0 0 0 0 0 GICv2 51 Level e66d8000.i2c
33: 95 0 0 0 0 0 0 0 GICv2 205 Level e60b0000.i2c
94: 19339 0 0 0 0 0 0 0 GICv2 71 Level eth0:ch0:rx_be
112: 20599 0 0 0 0 0 0 0 GICv2 89 Level eth0:ch18:tx_be
118: 2 0 0 0 0 0 0 0 GICv2 95 Level eth0:ch24:emac
122: 442092 0 0 0 0 0 0 0 GICv2 196 Level e6e88000.serial:mux
124: 2776685 0 0 0 0 0 0 0 GICv2 352 Level ec700000.dma-controller:0
160: 2896 0 0 0 0 0 0 0 GICv2 197 Level ee100000.sd
161: 5652 0 0 0 0 0 0 0 GICv2 199 Level ee140000.sd
162: 147 0 0 0 0 0 0 0 GICv2 200 Level ee160000.sd
197: 5 0 0 0 0 0 0 0 GICv2 384 Level ec500000.sound
208: 1 0 0 0 0 0 0 0 gpio-rcar 11 Level e6800000.ethernet-ffffffff:00
IPI0: 12701 366358 545059 1869017 9817 8065 9327 10644 Rescheduling interrupts
IPI1: 21 34 111 86 238 191 149 161 Function call interrupts
IPI5: 16422 709 509 637 0 0 3346 0 IRQ work interrupts

BTW/FYI, I filed a bug report with Renesas and specifically asked them
to approach you, hoping that your extensive experience with the serial
drivers will help. If you arrive at any conclusions in that context,
we would be delighted to hear from you.

--
Best Regards
Eugeniu Rosca

2020-01-22 19:49:42

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

Hi Eugeniu,

On Wed, Jan 22, 2020 at 5:59 PM Eugeniu Rosca <[email protected]> wrote:
> On Wed, Jan 22, 2020 at 08:31:44AM +0100, Geert Uytterhoeven wrote:
> > On Wed, Jan 22, 2020 at 3:34 AM Eugeniu Rosca <[email protected]> wrote:
> > > So, what's specific to R-Car3, based on my testing, is that the issue
> > > can only be reproduced if the printk storm originates on CPU0 (it does
> > > not matter if from interrupt or task context, both have been tested). If
> > > the printk storm is initiated on any other CPU (there are 7 secondary
> > > ones on R-Car H3), there is no regression in the audio quality/latency.
> >
> > The secure stuff is running on CPU0, isn't it?
> > Is that a coincidence?
>
> Nobody has ruled this out so far. As a side note, except for the ARMv8
> generic IPs, there seems to be quite poor IRQ balancing between the
> CPU cores of R-Car H3 (although this might be unrelated to the issue):
>
> $ cat /proc/interrupts | egrep -v "(0[ ]*){8}"
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
> 3: 55879 17835 14132 33882 6626 4331 6710 4532 GICv2 30 Level arch_timer
> 16: 1 0 0 0 0 0 0 0 GICv2 38 Level e6052000.gpio
> 32: 203 0 0 0 0 0 0 0 GICv2 51 Level e66d8000.i2c
> 33: 95 0 0 0 0 0 0 0 GICv2 205 Level e60b0000.i2c
> 94: 19339 0 0 0 0 0 0 0 GICv2 71 Level eth0:ch0:rx_be
> 112: 20599 0 0 0 0 0 0 0 GICv2 89 Level eth0:ch18:tx_be
> 118: 2 0 0 0 0 0 0 0 GICv2 95 Level eth0:ch24:emac
> 122: 442092 0 0 0 0 0 0 0 GICv2 196 Level e6e88000.serial:mux
> 124: 2776685 0 0 0 0 0 0 0 GICv2 352 Level ec700000.dma-controller:0
> 160: 2896 0 0 0 0 0 0 0 GICv2 197 Level ee100000.sd
> 161: 5652 0 0 0 0 0 0 0 GICv2 199 Level ee140000.sd
> 162: 147 0 0 0 0 0 0 0 GICv2 200 Level ee160000.sd
> 197: 5 0 0 0 0 0 0 0 GICv2 384 Level ec500000.sound
> 208: 1 0 0 0 0 0 0 0 gpio-rcar 11 Level e6800000.ethernet-ffffffff:00
> IPI0: 12701 366358 545059 1869017 9817 8065 9327 10644 Rescheduling interrupts
> IPI1: 21 34 111 86 238 191 149 161 Function call interrupts
> IPI5: 16422 709 509 637 0 0 3346 0 IRQ work interrupts

Yeah, cpu0 is always heavily loaded w.r.t. interrupts.
Can you reproduce the problem after forcing all interrupts to e.g. cpu1?
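
Untested, but for a quick test something like this should move
everything that can be moved (assuming irqbalance is not running):

    # 0x2 == CPU1; some affinity masks are read-only, hence 2>/dev/null
    for irq in /proc/irq/[0-9]*; do
            echo 2 > "$irq/smp_affinity" 2>/dev/null
    done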

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2020-01-24 14:36:05

by Eugeniu Rosca

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

Hi John,

On Wed, Jan 22, 2020 at 11:33:12AM +0100, John Ogness wrote:
> On 2020-01-22, Eugeniu Rosca <[email protected]> wrote:
> > So, what's specific to R-Car3, based on my testing, is that the issue
> > can only be reproduced if the printk storm originates on CPU0 (it does
> > not matter if from interrupt or task context, both have been
> > tested). If the printk storm is initiated on any other CPU (there are
> > 7 secondary ones on R-Car H3), there is no regression in the audio
> > quality/latency.
> >
> > I cannot fully explain this empirical observation, but it directs my
> > mind to the following workaround, for which I have a PoC:
> > - employ vprintk_safe() any time CPU0 is the owner/caller of printk
> > - tie CPU0-private printk internal IRQ workers to another CPU
> >
> > The above makes sure nothing is printed to the serial console on
> > behalf of CPU0. I don't expect this to be accepted by the community,
> > but could you please share your opinion on whether the idea itself
> > is sane?
>
> It is a problem-specific hack. You will need to be certain that CPU1-7
> will never have problems with console printing storms.
>
> Be aware that vprintk_safe() is not particularly reliable in many crash
> scenarios. If seeing oops output on the console is important, this can
> be a risky hack.
>
> Also, be aware that it has its own config option for the safe buffer
> size: PRINTK_SAFE_LOG_BUF_SHIFT

The warnings and pitfalls are much appreciated. Also, this whole
discussion has been referenced in the recently started communication
thread with Renesas, to raise awareness of what looks to be a
limitation not only of the Renesas BSP, but of the mainline kernel in
the first place.

> >> The printk rework focusses on making printk non-interfering by
> >> decoupling console printing from printk() callers. However, the
> >> console printing itself will still do just as much interrupt
> >> disabling as before. That is driver-related, not printk-related.
> >
> > I didn't dive into the internals of this series, but decoupling the
> > execution context of the serial driver from the execution context of
> > the printk callers sounds very good to me (this is what I try to
> > achieve via vanilla vprintk_safe). I wonder if it's easier to remove
> > CPU0 from the equation with this series applied.
>
> Yes, it would be quite easy. The console printers run as dedicated
> kthreads. It is only a matter of setting the CPU affinity for the
> related kthread.

Confirmed. The two lines below do the job (v5.4.13-rt7+).

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 0605a74ad76b..7bc2cdabf516 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2733,11 +2733,12 @@ static int __init init_printk_kthread(void)
 {
 	struct task_struct *thread;
 
-	thread = kthread_run(printk_kthread_func, NULL, "printk");
+	thread = kthread_create_on_cpu(printk_kthread_func, NULL, 7, "printk");
 	if (IS_ERR(thread)) {
 		pr_err("printk: unable to create printing thread\n");
 		return PTR_ERR(thread);
 	}
+	wake_up_process(thread);
 
 	return 0;
 }
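
(The explicit wake_up_process() is needed because, unlike
kthread_run(), kthread_create_on_cpu() creates the thread without
starting it.)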

>
> >> The linux-rt patches (which include this printk rework) *are* being
> >> ported to mainline now. My recommendation is to continue using the
> >> linux-rt patches (with PREEMPT_RT=y) until PREEMPT_RT is available
> >> mainline.

This has been relayed to Renesas. Thanks.

> >
> > If there is any roadmap publicly available, I would appreciate a
> > reference.
>
> I am only aware of the quilt "series" file [0] that is roughly
> documenting the status of the effort.
>
> John Ogness
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/series?h=linux-5.4.y-rt-patches

Great.

--
Best Regards
Eugeniu Rosca

2020-01-24 20:54:01

by Eugeniu Rosca

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

Hi Geert,

On Wed, Jan 22, 2020 at 08:48:12PM +0100, Geert Uytterhoeven wrote:
> On Wed, Jan 22, 2020 at 5:59 PM Eugeniu Rosca <[email protected]> wrote:
> > On Wed, Jan 22, 2020 at 08:31:44AM +0100, Geert Uytterhoeven wrote:
> > > On Wed, Jan 22, 2020 at 3:34 AM Eugeniu Rosca <[email protected]> wrote:
> > > > So, what's specific to R-Car3, based on my testing, is that the issue
> > > > can only be reproduced if the printk storm originates on CPU0

A slight amendment to the above statement. The results below were
obtained on an R-Car H3ULCB running renesas-drivers-2020-01-14-v5.5-rc6
(cX stands for CPUx, whitespace stands for clean audio, '!' stands for
distorted audio):

                   printk @:
                   c0  c1  c2  c3  c4  c5  c6  c7
speaker-test @ c0  !
               c1  !   !
               c2  !       !
               c3  !           !
               c4  !               !
               c5  !                   !
               c6  !                       !
               c7  !                           !

One can see two patterns in the chart. The audio has glitches whenever:
- printk and the audio application run on the same CPU, or:
- printk runs on CPU0

> Yeah, cpu0 is always heavily loaded w.r.t. interrupts.
> Can you reproduce the problem after forcing all interrupts to e.g. cpu1?

With the instrumentation shown in [1], the chart changes as follows:

                   printk @:
                   c0  c1  c2  c3  c4  c5  c6  c7
speaker-test @ c0  !   !
               c1      !
               c2      !   !
               c3      !       !
               c4      !           !
               c5      !               !
               c6      !                   !
               c7      !                       !

Any comments on the above empirical results?

[1] IRQ affinity set to CPU1

diff --git a/drivers/dma/sh/rcar-dmac.c b/drivers/dma/sh/rcar-dmac.c
index f06016d38a05..40003a3af4e0 100644
--- a/drivers/dma/sh/rcar-dmac.c
+++ b/drivers/dma/sh/rcar-dmac.c
@@ -1786,6 +1786,12 @@ static int rcar_dmac_chan_probe(struct rcar_dmac *dmac,
 		return ret;
 	}
 
+	ret = irq_set_affinity(rchan->irq, cpumask_of(1));
+	if (ret) {
+		dev_err(dmac->dev, "failed to set IRQ affinity %u (%d)\n", rchan->irq, ret);
+		return ret;
+	}
+
 	return 0;
 }

diff --git a/drivers/tty/serial/sh-sci.c b/drivers/tty/serial/sh-sci.c
index 9b4ff872e297..c76b38626b6b 100644
--- a/drivers/tty/serial/sh-sci.c
+++ b/drivers/tty/serial/sh-sci.c
@@ -1926,6 +1926,11 @@ static int sci_request_irq(struct sci_port *port)
 			dev_err(up->dev, "Can't allocate %s IRQ\n", desc->desc);
 			goto out_noirq;
 		}
+		ret = irq_set_affinity(irq, cpumask_of(1));
+		if (ret) {
+			dev_err(up->dev, "failed to set IRQ affinity %u (%d)\n", irq, ret);
+			return ret;
+		}
 	}
 
 	return 0;

--
Best Regards
Eugeniu Rosca

2020-01-27 12:52:21

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

On Fri 2020-01-24 17:09:29, Eugeniu Rosca wrote:
> Hi Geert,
>
> On Wed, Jan 22, 2020 at 08:48:12PM +0100, Geert Uytterhoeven wrote:
> > On Wed, Jan 22, 2020 at 5:59 PM Eugeniu Rosca <[email protected]> wrote:
> > > On Wed, Jan 22, 2020 at 08:31:44AM +0100, Geert Uytterhoeven wrote:
> > > > On Wed, Jan 22, 2020 at 3:34 AM Eugeniu Rosca <[email protected]> wrote:
> > > > > So, what's specific to R-Car3, based on my testing, is that the issue
> > > > > can only be reproduced if the printk storm originates on CPU0
>
> A slight amendment to the above statement. The results below were
> obtained on an R-Car H3ULCB running renesas-drivers-2020-01-14-v5.5-rc6
> (cX stands for CPUx, whitespace stands for clean audio, '!' stands for
> distorted audio):
>
>                    printk @:
>                    c0  c1  c2  c3  c4  c5  c6  c7
> speaker-test @ c0  !
>                c1  !   !
>                c2  !       !
>                c3  !           !
>                c4  !               !
>                c5  !                   !
>                c6  !                       !
>                c7  !                           !
>
> One can see two patterns in the chart. The audio has glitches whenever:
> - printk and the audio application run on the same CPU, or:
> - printk runs on CPU0

The proper longterm solution seems to be offloading printk console
handling to a kthread so that it can be bound to a particular CPU
and does not block audio.

Anyway, there is the question of whether you really need to send all
messages via the serial console. It might make sense to filter out less
important messages using the "loglevel=" or "quiet" kernel parameters.
The full log can still be read later from userspace (dmesg, syslog,
/dev/kmsg). The filtering can be disabled when debugging a non-booting
kernel; in that case, distorted audio is the least of your problems.
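
For example, something like the below on the kernel command line (the
console device is only an example) lets only KERN_ERR and more severe
messages reach the serial console, while the complete log remains
available via dmesg:

    console=ttySC0,115200 loglevel=4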

Best Regards,
Petr

2020-01-27 13:47:57

by Eugeniu Rosca

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/25] printk: new implementation

Hi Petr,

Thank you for your insights.

On Mon, Jan 27, 2020 at 01:32:49PM +0100, Petr Mladek wrote:
> On Fri 2020-01-24 17:09:29, Eugeniu Rosca wrote:
> > On Wed, Jan 22, 2020 at 08:48:12PM +0100, Geert Uytterhoeven wrote:
> > > On Wed, Jan 22, 2020 at 5:59 PM Eugeniu Rosca <[email protected]> wrote:
> > > > On Wed, Jan 22, 2020 at 08:31:44AM +0100, Geert Uytterhoeven wrote:
> > > > > On Wed, Jan 22, 2020 at 3:34 AM Eugeniu Rosca <[email protected]> wrote:
> > > > > > So, what's specific to R-Car3, based on my testing, is that the issue
> > > > > > can only be reproduced if the printk storm originates on CPU0
> >
> > A slight amendment to the above statement. The results below were
> > obtained on an R-Car H3ULCB running renesas-drivers-2020-01-14-v5.5-rc6
> > (cX stands for CPUx, whitespace stands for clean audio, '!' stands for
> > distorted audio):
> >
> >                    printk @:
> >                    c0  c1  c2  c3  c4  c5  c6  c7
> > speaker-test @ c0  !
> >                c1  !   !
> >                c2  !       !
> >                c3  !           !
> >                c4  !               !
> >                c5  !                   !
> >                c6  !                       !
> >                c7  !                           !
> >
> > One can see two patterns in the chart. The audio has glitches whenever:
> > - printk and the audio application run on the same CPU, or:
> > - printk runs on CPU0
>
> The proper longterm solution seems to be offloading printk console
> handling to a kthread so that it can be bound to a particular CPU
> and does not block audio.

Same understanding here. I don't think this is possible w/o the full
switch to the kthread-based concept proposed in this series (I sought
an easier way out, but failed).

Even after pinning the printk kthread to CPUn, we still have to live
with the huge latencies of the console drivers on CPUn. To avoid audio
glitches caused by the serial console, userspace would additionally
need to exclude CPUn from any RT workloads by setting the CPU affinity
of the audio applications appropriately.

The above imposes certain constraints on the CPU partitioning in the
system, but that's the most mainline-friendly solution I can think of.
Any alternative views would be appreciated.
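
To make it concrete, a hypothetical partitioning along these lines
(CPU numbers are arbitrary):

    # kernel command line: keep regular tasks off CPU7, which is
    # reserved for the pinned printk kthread
    isolcpus=7

    # userspace: keep the RT audio workload away from CPU0 and CPU7
    taskset -c 1-6 speaker-test -f24_LE -c2 -t wav -Dplughw:rcarsound -b 4000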

>
> Anyway, there is the question of whether you really need to send all
> messages via the serial console. It might make sense to filter out less
> important messages using the "loglevel=" or "quiet" kernel parameters.
> The full log can still be read later from userspace (dmesg, syslog,
> /dev/kmsg). The filtering can be disabled when debugging a non-booting
> kernel; in that case, distorted audio is the least of your problems.

This has been discussed in detail with the reporters of the issue.
Yes, it might be an infrequent requirement in general, but the goal
is to achieve clean audio even (and specifically) with verbose
serial console output.

--
Best Regards
Eugeniu Rosca