We have a number of systems industry-wide that have a subset of their
functionality that works as follows:
1. Receive a message from local kmsg, serial console, or netconsole;
2. Apply a set of rules to classify the message;
3. Do something based on this classification (like scheduling a
remediation for the machine), rinse, and repeat.
As a couple of examples of places we have this implemented just inside
Facebook, although this isn't a Facebook-specific problem, we have this
inside our netconsole processing (for alarm classification), and as part
of our machine health checking. We use these messages to determine
fairly important metrics around production health, and it's important
that we get them right.
While for some kinds of issues we have counters, tracepoints, or metrics
with a stable interface which can reliably indicate the issue, in order
to react to production issues quickly we need to work with the interface
which most kernel developers naturally use when developing: printk.
Most production issues come from unexpected phenomena, and as such
usually the code in question doesn't have easily usable tracepoints or
other counters available for the specific problem being mitigated. We
have a number of lines of monitoring defence against problems in
production (host metrics, process metrics, service metrics, etc), and
where it's not feasible to reliably monitor at another level, this kind
of pragmatic netconsole monitoring is essential.
As you'd expect, monitoring using printk is rather brittle for a number
of reasons -- most notably that the message might disappear entirely in
a new version of the kernel, or that the message may change in some way
that the regex or other classification methods start to silently fail.
One factor that makes this even harder is that, under normal operation,
many of these messages are never expected to be hit. For example, there
may be some rare hardware bug which you want to detect if it was to ever
happen again, but its recurrence is not likely or anticipated. This
precludes using something like checking whether the printk in question
was printed somewhere fleetwide recently to determine whether the
message in question is still present or not, since we don't anticipate
that it should be printed anywhere, but still need to monitor for its
future presence in the long-term.
This class of issue has happened on a number of occasions, causing
unhealthy machines with hardware issues to remain in production for
longer than ideal. As a recent example, some monitoring around
blk_update_request fell out of date and caused semi-broken machines to
remain in production for longer than would be desirable.
Searching through the codebase to find the message is also extremely
fragile, because many of the messages are further constructed beyond
their callsite (eg. btrfs_printk and other module-specific wrappers,
each with their own functionality). Even if they aren't, guessing the
format and formulation of the underlying message based on the aesthetics
of the message emitted is not a recipe for success at scale, and our
previous issues with fleetwide machine health checking demonstrate as
much.
This patch provides a solution to the issue of silently changed or
deleted printks: we record pointers to all printk format strings known
at compile time into a new .printk_fmts section, both in vmlinux and
modules. At runtime, this can then be iterated by looking at
<debugfs>/printk/formats/<module>, which emits the same format as
`printk` itself, which we already export elsewhere (for example, in
netconsole).
As an example of how simple a parser for this format can be:
$ cat pf.py
#!/usr/bin/env python
with open("/sys/kernel/debug/printk/formats/vmlinux") as f:
raw_fmts = f.read().split("\x00")[:-1]
for raw_fmt in raw_fmts:
level, fmt = raw_fmt[1], raw_fmt[2:]
print(f"Level {level}: {fmt!r}")
$ ./pf.py | shuf -n 5
Level 4: 'Build ID is too large to include in vmcoreinfo: %u > %u\n'
Level 3: 'BIOS bug, no explicit IRQ entries, using default mptable. (tell your hw vendor)\n'
Level 3: 'Failed to execute %s (error %d)\n'
Level 3: 'CHRDEV "%s" minor range requested (%u-%u) is out of range of maximum range (%u-%u) for a single major\n'
Level 3: "slub_debug option '%c' unknown. skipped\n"
This mitigates the majority of cases where we have a highly-specific
printk which we want to match on, as we can now enumerate and check
whether the format changed or the printk callsite disappeared entirely
in userspace. This allows us to catch changes to printks we monitor
earlier and decide what to do about it before it becomes problematic.
There is no additional runtime cost for printk callers or printk itself,
and the assembly generated is exactly the same.
Signed-off-by: Chris Down <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: John Ogness <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andrew Morton <[email protected]>
---
v2:
- Use seq_printf instead of step by step accumulation
- Scope fptr closer to its use
- Prevent seq_file from needing to alloc a new buffer
- Always provide KERN_SOH + level, even if caller omitted it
- Provide one file per module
- Update changelog to show ease of parsing
- Provide printk -> _printk for ARCH=um (thanks, LKP robot)
- Move to debugfs (thanks, Steven)
---
v3:
- Reduce locking during open by moving size lifetime to the inode
- Don't explicitly check for debugfs errors (thanks, gregkh)
---
v4:
- Fix a missed `extern int printk` fixup in ia64 (thanks, lkp)
---
arch/arm/kernel/entry-v7m.S | 2 +-
arch/arm/lib/backtrace-clang.S | 2 +-
arch/arm/lib/backtrace.S | 2 +-
arch/arm/mach-rpc/io-acorn.S | 2 +-
arch/arm/vfp/vfphw.S | 6 +-
arch/ia64/include/uapi/asm/cmpxchg.h | 4 +-
arch/openrisc/kernel/entry.S | 6 +-
arch/powerpc/kernel/head_fsl_booke.S | 2 +-
arch/um/include/shared/user.h | 3 +-
arch/x86/kernel/head_32.S | 2 +-
include/asm-generic/vmlinux.lds.h | 13 ++
include/linux/module.h | 5 +
include/linux/printk.h | 43 ++++-
init/Kconfig | 14 ++
kernel/module.c | 5 +
kernel/printk/printk.c | 231 ++++++++++++++++++++++++++-
16 files changed, 318 insertions(+), 24 deletions(-)
diff --git a/arch/arm/kernel/entry-v7m.S b/arch/arm/kernel/entry-v7m.S
index d0e898608d30..7bde93c10962 100644
--- a/arch/arm/kernel/entry-v7m.S
+++ b/arch/arm/kernel/entry-v7m.S
@@ -23,7 +23,7 @@ __invalid_entry:
adr r0, strerr
mrs r1, ipsr
mov r2, lr
- bl printk
+ bl _printk
#endif
mov r0, sp
bl show_regs
diff --git a/arch/arm/lib/backtrace-clang.S b/arch/arm/lib/backtrace-clang.S
index 6174c45f53a5..5b2cdb1003e3 100644
--- a/arch/arm/lib/backtrace-clang.S
+++ b/arch/arm/lib/backtrace-clang.S
@@ -202,7 +202,7 @@ finished_setup:
1006: adr r0, .Lbad
mov r1, loglvl
mov r2, frame
- bl printk
+ bl _printk
no_frame: ldmfd sp!, {r4 - r9, fp, pc}
ENDPROC(c_backtrace)
.pushsection __ex_table,"a"
diff --git a/arch/arm/lib/backtrace.S b/arch/arm/lib/backtrace.S
index 872f658638d9..e8408f22d4dc 100644
--- a/arch/arm/lib/backtrace.S
+++ b/arch/arm/lib/backtrace.S
@@ -103,7 +103,7 @@ for_each_frame: tst frame, mask @ Check for address exceptions
1006: adr r0, .Lbad
mov r1, loglvl
mov r2, frame
- bl printk
+ bl _printk
no_frame: ldmfd sp!, {r4 - r9, pc}
ENDPROC(c_backtrace)
diff --git a/arch/arm/mach-rpc/io-acorn.S b/arch/arm/mach-rpc/io-acorn.S
index b9082a2a2a01..aa9bf0d771c0 100644
--- a/arch/arm/mach-rpc/io-acorn.S
+++ b/arch/arm/mach-rpc/io-acorn.S
@@ -25,4 +25,4 @@ ENTRY(insl)
ENTRY(outsl)
adr r0, .Liosl_warning
mov r1, lr
- b printk
+ b _printk
diff --git a/arch/arm/vfp/vfphw.S b/arch/arm/vfp/vfphw.S
index d5837bf05a9a..6f7926c9c179 100644
--- a/arch/arm/vfp/vfphw.S
+++ b/arch/arm/vfp/vfphw.S
@@ -23,7 +23,7 @@
#ifdef DEBUG
stmfd sp!, {r0-r3, ip, lr}
ldr r0, =1f
- bl printk
+ bl _printk
ldmfd sp!, {r0-r3, ip, lr}
.pushsection .rodata, "a"
@@ -38,7 +38,7 @@
stmfd sp!, {r0-r3, ip, lr}
mov r1, \arg
ldr r0, =1f
- bl printk
+ bl _printk
ldmfd sp!, {r0-r3, ip, lr}
.pushsection .rodata, "a"
@@ -55,7 +55,7 @@
mov r2, \arg2
mov r1, \arg1
ldr r0, =1f
- bl printk
+ bl _printk
ldmfd sp!, {r0-r3, ip, lr}
.pushsection .rodata, "a"
diff --git a/arch/ia64/include/uapi/asm/cmpxchg.h b/arch/ia64/include/uapi/asm/cmpxchg.h
index 5d90307fd6e0..d96438322d13 100644
--- a/arch/ia64/include/uapi/asm/cmpxchg.h
+++ b/arch/ia64/include/uapi/asm/cmpxchg.h
@@ -139,9 +139,9 @@ extern long ia64_cmpxchg_called_with_bad_pointer(void);
do { \
if (_cmpxchg_bugcheck_count-- <= 0) { \
void *ip; \
- extern int printk(const char *fmt, ...); \
+ extern int _printk(const char *fmt, ...); \
ip = (void *) ia64_getreg(_IA64_REG_IP); \
- printk("CMPXCHG_BUGCHECK: stuck at %p on word %p\n", ip, (v));\
+ _printk("CMPXCHG_BUGCHECK: stuck at %p on word %p\n", ip, (v));\
break; \
} \
} while (0)
diff --git a/arch/openrisc/kernel/entry.S b/arch/openrisc/kernel/entry.S
index bc657e55c15f..947613f61d4a 100644
--- a/arch/openrisc/kernel/entry.S
+++ b/arch/openrisc/kernel/entry.S
@@ -551,7 +551,7 @@ EXCEPTION_ENTRY(_external_irq_handler)
l.movhi r3,hi(42f)
l.ori r3,r3,lo(42f)
l.sw 0x0(r1),r3
- l.jal printk
+ l.jal _printk
l.sw 0x4(r1),r4
l.addi r1,r1,0x8
@@ -681,8 +681,8 @@ _syscall_debug:
l.sw -4(r1),r27
l.sw -8(r1),r11
l.addi r1,r1,-8
- l.movhi r27,hi(printk)
- l.ori r27,r27,lo(printk)
+ l.movhi r27,hi(_printk)
+ l.ori r27,r27,lo(_printk)
l.jalr r27
l.nop
l.addi r1,r1,8
diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index fdd4d274c245..f2c7ecef9b14 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -852,7 +852,7 @@ KernelSPE:
ori r3,r3,87f@l
mr r4,r2 /* current */
lwz r5,_NIP(r1)
- bl printk
+ bl _printk
#endif
b ret_from_except
#ifdef CONFIG_PRINTK
diff --git a/arch/um/include/shared/user.h b/arch/um/include/shared/user.h
index e793e4212f0a..dd4badffdeb3 100644
--- a/arch/um/include/shared/user.h
+++ b/arch/um/include/shared/user.h
@@ -38,7 +38,8 @@ extern void panic(const char *fmt, ...)
#define UM_KERN_CONT KERN_CONT
#ifdef UML_CONFIG_PRINTK
-extern int printk(const char *fmt, ...)
+#define printk(...) _printk(__VA_ARGS__)
+extern int _printk(const char *fmt, ...)
__attribute__ ((format (printf, 1, 2)));
#else
static inline int printk(const char *fmt, ...)
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 7ed84c282233..c207b789dd22 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -446,7 +446,7 @@ SYM_FUNC_START(early_ignore_irq)
pushl 32(%esp)
pushl 40(%esp)
pushl $int_msg
- call printk
+ call _printk
call dump_stack
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 34b7e0d2346c..0ca6e28e05d6 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -309,6 +309,17 @@
#define ACPI_PROBE_TABLE(name)
#endif
+#ifdef CONFIG_PRINTK_ENUMERATION
+#define PRINTK_FMTS \
+ .printk_fmts : AT(ADDR(.printk_fmts) - LOAD_OFFSET) { \
+ __start_printk_fmts = .; \
+ *(.printk_fmts) \
+ __stop_printk_fmts = .; \
+ }
+#else
+#define PRINTK_FMTS
+#endif
+
#ifdef CONFIG_THERMAL
#define THERMAL_TABLE(name) \
. = ALIGN(8); \
@@ -480,6 +491,8 @@
\
TRACEDATA \
\
+ PRINTK_FMTS \
+ \
/* Kernel symbol table: Normal symbols */ \
__ksymtab : AT(ADDR(__ksymtab) - LOAD_OFFSET) { \
__start___ksymtab = .; \
diff --git a/include/linux/module.h b/include/linux/module.h
index 7a0bcb5b1ffc..4235b14a22ef 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -516,6 +516,11 @@ struct module {
struct klp_modinfo *klp_info;
#endif
+#ifdef CONFIG_PRINTK_ENUMERATION
+ unsigned int printk_fmts_sec_size;
+ const char **printk_fmts_start;
+#endif
+
#ifdef CONFIG_MODULE_UNLOAD
/* What modules depend on me? */
struct list_head source_list;
diff --git a/include/linux/printk.h b/include/linux/printk.h
index d3c08095a9a3..4aebb96c3694 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -173,12 +173,12 @@ asmlinkage __printf(1, 0)
int vprintk(const char *fmt, va_list args);
asmlinkage __printf(1, 2) __cold
-int printk(const char *fmt, ...);
+int _printk(const char *fmt, ...);
/*
* Special printk facility for scheduler/timekeeping use only, _DO_NOT_USE_ !
*/
-__printf(1, 2) __cold int printk_deferred(const char *fmt, ...);
+__printf(1, 2) __cold int _printk_deferred(const char *fmt, ...);
/*
* Please don't use printk_ratelimit(), because it shares ratelimiting state
@@ -216,12 +216,12 @@ int vprintk(const char *s, va_list args)
return 0;
}
static inline __printf(1, 2) __cold
-int printk(const char *s, ...)
+int _printk(const char *s, ...)
{
return 0;
}
static inline __printf(1, 2) __cold
-int printk_deferred(const char *s, ...)
+int _printk_deferred(const char *s, ...)
{
return 0;
}
@@ -284,6 +284,11 @@ static inline void printk_safe_flush_on_panic(void)
extern int kptr_restrict;
+#ifdef CONFIG_PRINTK_ENUMERATION
+extern const char *__start_printk_fmts[];
+extern const char *__stop_printk_fmts[];
+#endif
+
/**
* pr_fmt - used by the pr_*() macros to generate the printk format string
* @fmt: format string passed from a pr_*() macro
@@ -301,6 +306,36 @@ extern int kptr_restrict;
#define pr_fmt(fmt) fmt
#endif
+#ifdef CONFIG_PRINTK_ENUMERATION
+#define printk_store_fmt(func, fmt, ...) \
+ ({ \
+ int _printk_ret; \
+ \
+ if (__builtin_constant_p(fmt)) { \
+ /*
+ * The compiler may not be able to eliminate this, so
+ * we need to make sure that it doesn't see any
+ * hypothetical assignment for non-constants even
+ * though this is already inside the
+ * __builtin_constant_p guard.
+ */ \
+ static const char *_fmt __section(".printk_fmts") = \
+ __builtin_constant_p(fmt) ? fmt : NULL; \
+ _printk_ret = func(_fmt, ##__VA_ARGS__); \
+ } else \
+ _printk_ret = func(fmt, ##__VA_ARGS__); \
+ \
+ _printk_ret; \
+ })
+
+#define printk(fmt, ...) printk_store_fmt(_printk, fmt, ##__VA_ARGS__)
+#define printk_deferred(fmt, ...) \
+ printk_store_fmt(_printk_deferred, fmt, ##__VA_ARGS__)
+#else /* !CONFIG_PRINTK_ENUMERATION */
+#define printk(...) _printk(__VA_ARGS__)
+#define printk_deferred(...) _printk_deferred(__VA_ARGS__)
+#endif /* CONFIG_PRINTK_ENUMERATION */
+
/**
* pr_emerg - Print an emergency-level message
* @fmt: format string
diff --git a/init/Kconfig b/init/Kconfig
index ab2e916f7e8b..f1f37a060235 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -764,6 +764,20 @@ config PRINTK_SAFE_LOG_BUF_SHIFT
13 => 8 KB for each CPU
12 => 4 KB for each CPU
+config PRINTK_ENUMERATION
+ bool "Printk enumeration debugfs interface"
+ depends on PRINTK && DEBUG_FS
+ help
+ Add support for enumeration of all printk formats known at compile
+ time at <debugfs>/printk/formats/<module>.
+
+ This can be used as part of maintaining daemons which monitor
+ /dev/kmsg, as it permits auditing the printk formats present in a
+ kernel, allowing monitoring of cases where monitored printks are
+ changed or no longer present.
+
+ There is no additional runtime cost to printk with this enabled.
+
#
# Architectures with an unreliable sched_clock() should select this:
#
diff --git a/kernel/module.c b/kernel/module.c
index 1e5aad812310..7b0f2a2f428e 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3429,6 +3429,11 @@ static int find_module_sections(struct module *mod, struct load_info *info)
sizeof(unsigned long),
&mod->num_kprobe_blacklist);
#endif
+#ifdef CONFIG_PRINTK_ENUMERATION
+ mod->printk_fmts_start = section_objs(info, ".printk_fmts",
+ sizeof(*mod->printk_fmts_start),
+ &mod->printk_fmts_sec_size);
+#endif
#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
mod->static_call_sites = section_objs(info, ".static_call_sites",
sizeof(*mod->static_call_sites),
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 5a95c688621f..adf545ba9eb9 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -47,6 +47,8 @@
#include <linux/sched/clock.h>
#include <linux/sched/debug.h>
#include <linux/sched/task_stack.h>
+#include <linux/debugfs.h>
+#include <linux/hashtable.h>
#include <linux/uaccess.h>
#include <asm/sections.h>
@@ -617,6 +619,222 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
return len;
}
+#ifdef CONFIG_PRINTK_ENUMERATION
+
+/*
+ * debugfs/printk/formats/ - userspace enumeration of printk formats
+ *
+ * The format is the same as typically used by printk, <KERN_SOH><level>fmt,
+ * with each distinct format separated by \0.
+ */
+
+struct printk_fmt_sec {
+ struct hlist_node hnode;
+ struct module *module;
+ struct dentry *file;
+ const char **start;
+ const char **end;
+};
+
+/* The base dir for module formats, typically debugfs/printk/formats/ */
+struct dentry *dfs_formats;
+
+/*
+ * Stores .printk_fmt section boundaries for vmlinux and all loaded modules.
+ * Add entries with store_printk_fmt_sec, remove entries with
+ * remove_printk_fmt_sec.
+ */
+static DEFINE_HASHTABLE(printk_fmts_mod_sections, 8);
+
+/* Protects printk_fmts_mod_sections */
+static DEFINE_MUTEX(printk_fmts_mutex);
+
+static const char *ps_get_module_name(const struct printk_fmt_sec *ps);
+static int debugfs_pf_open(struct inode *inode, struct file *file);
+
+static const struct file_operations dfs_formats_fops = {
+ .open = debugfs_pf_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static size_t printk_fmt_size(const char *fmt)
+{
+ size_t sz = strlen(fmt) + 1;
+
+ /*
+ * Some printk formats don't start with KERN_SOH + level. We will add
+ * it later when rendering the output.
+ */
+ if (unlikely(fmt[0] != KERN_SOH_ASCII))
+ sz += 2;
+
+ return sz;
+}
+
+static struct printk_fmt_sec *find_printk_fmt_sec(struct module *mod)
+{
+ struct printk_fmt_sec *ps = NULL;
+
+ hash_for_each_possible(printk_fmts_mod_sections, ps, hnode,
+ (unsigned long)mod)
+ if (ps->module == mod)
+ return ps;
+
+ return NULL;
+}
+
+static void store_printk_fmt_sec(struct module *mod, const char **start,
+ const char **end)
+{
+ struct printk_fmt_sec *ps = NULL;
+ const char **fptr = NULL;
+ size_t size = 0;
+
+ ps = kmalloc(sizeof(struct printk_fmt_sec), GFP_KERNEL);
+ if (!ps)
+ return;
+
+ ps->module = mod;
+ ps->start = start;
+ ps->end = end;
+
+ for (fptr = ps->start; fptr < ps->end; fptr++)
+ size += printk_fmt_size(*fptr);
+
+ mutex_lock(&printk_fmts_mutex);
+ hash_add(printk_fmts_mod_sections, &ps->hnode, (unsigned long)mod);
+ mutex_unlock(&printk_fmts_mutex);
+
+ ps->file = debugfs_create_file(ps_get_module_name(ps), 0444,
+ dfs_formats, mod, &dfs_formats_fops);
+
+ if (!IS_ERR(ps->file))
+ d_inode(ps->file)->i_size = size;
+}
+
+#ifdef CONFIG_MODULES
+static void remove_printk_fmt_sec(struct module *mod)
+{
+ struct printk_fmt_sec *ps = NULL;
+
+ if (WARN_ON_ONCE(!mod))
+ return;
+
+ mutex_lock(&printk_fmts_mutex);
+
+ ps = find_printk_fmt_sec(mod);
+ if (!ps) {
+ mutex_unlock(&printk_fmts_mutex);
+ return;
+ }
+
+ hash_del(&ps->hnode);
+
+ mutex_unlock(&printk_fmts_mutex);
+
+ debugfs_remove(ps->file);
+ kfree(ps);
+}
+
+static int module_printk_fmts_notify(struct notifier_block *self,
+ unsigned long val, void *data)
+{
+ struct module *mod = data;
+
+ if (mod->printk_fmts_sec_size) {
+ switch (val) {
+ case MODULE_STATE_COMING:
+ store_printk_fmt_sec(mod, mod->printk_fmts_start,
+ mod->printk_fmts_start +
+ mod->printk_fmts_sec_size);
+ break;
+
+ case MODULE_STATE_GOING:
+ remove_printk_fmt_sec(mod);
+ break;
+ }
+ }
+
+ return NOTIFY_OK;
+}
+
+static const char *ps_get_module_name(const struct printk_fmt_sec *ps)
+{
+ return ps->module ? ps->module->name : "vmlinux";
+}
+
+static struct notifier_block module_printk_fmts_nb = {
+ .notifier_call = module_printk_fmts_notify,
+};
+
+static int __init module_printk_fmts_init(void)
+{
+ return register_module_notifier(&module_printk_fmts_nb);
+}
+
+core_initcall(module_printk_fmts_init);
+
+#else /* !CONFIG_MODULES */
+static const char *ps_get_module_name(const struct printk_fmt_sec *ps)
+{
+ return "vmlinux";
+}
+#endif /* CONFIG_MODULES */
+
+static int debugfs_pf_show(struct seq_file *s, void *v)
+{
+ struct module *mod = s->file->f_inode->i_private;
+ struct printk_fmt_sec *ps = NULL;
+ const char **fptr = NULL;
+ int ret = 0;
+
+ mutex_lock(&printk_fmts_mutex);
+
+ /*
+ * The entry might have been invalidated in the hlist between _open and
+ * _show, so we need to eyeball the entries under printk_fmts_mutex
+ * again.
+ */
+ ps = find_printk_fmt_sec(mod);
+ if (unlikely(!ps)) {
+ ret = -ENOENT;
+ goto out_unlock;
+ }
+
+ for (fptr = ps->start; fptr < ps->end; fptr++) {
+ /* For callsites without KERN_SOH + level preamble. */
+ if (unlikely(*fptr[0] != KERN_SOH_ASCII))
+ seq_printf(s, "%c%d", KERN_SOH_ASCII,
+ MESSAGE_LOGLEVEL_DEFAULT);
+ seq_printf(s, "%s%c", *fptr, '\0');
+ }
+
+out_unlock:
+ mutex_unlock(&printk_fmts_mutex);
+ return ret;
+}
+
+static int debugfs_pf_open(struct inode *inode, struct file *file)
+{
+ return single_open_size(file, debugfs_pf_show, NULL, inode->i_size);
+}
+
+static int __init init_printk_fmts(void)
+{
+ struct dentry *dfs_root = debugfs_create_dir("printk", NULL);
+
+ dfs_formats = debugfs_create_dir("formats", dfs_root);
+ store_printk_fmt_sec(NULL, __start_printk_fmts, __stop_printk_fmts);
+
+ return 0;
+}
+
+core_initcall(init_printk_fmts);
+
+#endif /* CONFIG_PRINTK_ENUMERATION */
+
/* /dev/kmsg - userspace message inject/listen interface */
struct devkmsg_user {
u64 seq;
@@ -2111,10 +2329,13 @@ int vprintk_default(const char *fmt, va_list args)
EXPORT_SYMBOL_GPL(vprintk_default);
/**
- * printk - print a kernel message
+ * _printk - print a kernel message
* @fmt: format string
*
- * This is printk(). It can be called from any context. We want it to work.
+ * This is _printk(). It can be called from any context. We want it to work.
+ *
+ * If printk enumeration is enabled, _printk() is called from printk_store_fmt.
+ * Otherwise, printk is simply #defined to _printk.
*
* We try to grab the console_lock. If we succeed, it's easy - we log the
* output and call the console drivers. If we fail to get the semaphore, we
@@ -2131,7 +2352,7 @@ EXPORT_SYMBOL_GPL(vprintk_default);
*
* See the vsnprintf() documentation for format string extensions over C99.
*/
-asmlinkage __visible int printk(const char *fmt, ...)
+asmlinkage __visible int _printk(const char *fmt, ...)
{
va_list args;
int r;
@@ -2142,7 +2363,7 @@ asmlinkage __visible int printk(const char *fmt, ...)
return r;
}
-EXPORT_SYMBOL(printk);
+EXPORT_SYMBOL(_printk);
#else /* CONFIG_PRINTK */
@@ -3133,7 +3354,7 @@ int vprintk_deferred(const char *fmt, va_list args)
return r;
}
-int printk_deferred(const char *fmt, ...)
+int _printk_deferred(const char *fmt, ...)
{
va_list args;
int r;
--
2.30.1
Hi Chris,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on jeyu/modules-next]
[also build test ERROR on linux/master soc/for-next openrisc/for-next powerpc/next uml/linux-next asm-generic/master linus/master v5.11-rc7 next-20210211]
[cannot apply to tip/x86/core pmladek/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Chris-Down/printk-Userspace-format-enumeration-support/20210212-233240
base: https://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux.git modules-next
config: ia64-randconfig-r016-20210209 (attached as .config)
compiler: ia64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/bff4c8b4b0b9bcc31917d3b0f1a01a7573a1f473
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Chris-Down/printk-Userspace-format-enumeration-support/20210212-233240
git checkout bff4c8b4b0b9bcc31917d3b0f1a01a7573a1f473
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=ia64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
In file included from include/linux/kernel.h:16,
from include/asm-generic/bug.h:20,
from arch/ia64/include/asm/bug.h:17,
from include/linux/bug.h:5,
from include/linux/page-flags.h:10,
from kernel/bounds.c:10:
>> include/linux/printk.h:219:5: error: static declaration of '_printk' follows non-static declaration
219 | int _printk(const char *s, ...)
| ^~~~~~~
In file included from arch/ia64/include/uapi/asm/intrinsics.h:22,
from arch/ia64/include/asm/intrinsics.h:11,
from arch/ia64/include/asm/bitops.h:19,
from include/linux/bitops.h:32,
from include/linux/kernel.h:11,
from include/asm-generic/bug.h:20,
from arch/ia64/include/asm/bug.h:17,
from include/linux/bug.h:5,
from include/linux/page-flags.h:10,
from kernel/bounds.c:10:
arch/ia64/include/uapi/asm/cmpxchg.h:142:14: note: previous declaration of '_printk' was here
142 | extern int _printk(const char *fmt, ...); \
| ^~~~~~~
arch/ia64/include/asm/bitops.h:309:3: note: in expansion of macro 'CMPXCHG_BUGCHECK'
309 | CMPXCHG_BUGCHECK(m);
| ^~~~~~~~~~~~~~~~
--
In file included from include/linux/kernel.h:16,
from include/asm-generic/bug.h:20,
from arch/ia64/include/asm/bug.h:17,
from include/linux/bug.h:5,
from include/linux/page-flags.h:10,
from kernel/bounds.c:10:
>> include/linux/printk.h:219:5: error: static declaration of '_printk' follows non-static declaration
219 | int _printk(const char *s, ...)
| ^~~~~~~
In file included from arch/ia64/include/uapi/asm/intrinsics.h:22,
from arch/ia64/include/asm/intrinsics.h:11,
from arch/ia64/include/asm/bitops.h:19,
from include/linux/bitops.h:32,
from include/linux/kernel.h:11,
from include/asm-generic/bug.h:20,
from arch/ia64/include/asm/bug.h:17,
from include/linux/bug.h:5,
from include/linux/page-flags.h:10,
from kernel/bounds.c:10:
arch/ia64/include/uapi/asm/cmpxchg.h:142:14: note: previous declaration of '_printk' was here
142 | extern int _printk(const char *fmt, ...); \
| ^~~~~~~
arch/ia64/include/asm/bitops.h:309:3: note: in expansion of macro 'CMPXCHG_BUGCHECK'
309 | CMPXCHG_BUGCHECK(m);
| ^~~~~~~~~~~~~~~~
make[2]: *** [scripts/Makefile.build:117: kernel/bounds.s] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [Makefile:1206: prepare0] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:185: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.
vim +/_printk +219 include/linux/printk.h
201
202 char *log_buf_addr_get(void);
203 u32 log_buf_len_get(void);
204 void log_buf_vmcoreinfo_setup(void);
205 void __init setup_log_buf(int early);
206 __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...);
207 void dump_stack_print_info(const char *log_lvl);
208 void show_regs_print_info(const char *log_lvl);
209 extern asmlinkage void dump_stack(void) __cold;
210 extern void printk_safe_flush(void);
211 extern void printk_safe_flush_on_panic(void);
212 #else
213 static inline __printf(1, 0)
214 int vprintk(const char *s, va_list args)
215 {
216 return 0;
217 }
218 static inline __printf(1, 2) __cold
> 219 int _printk(const char *s, ...)
220 {
221 return 0;
222 }
223 static inline __printf(1, 2) __cold
224 int _printk_deferred(const char *s, ...)
225 {
226 return 0;
227 }
228 static inline int printk_ratelimit(void)
229 {
230 return 0;
231 }
232 static inline bool printk_timed_ratelimit(unsigned long *caller_jiffies,
233 unsigned int interval_msec)
234 {
235 return false;
236 }
237
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
kernel test robot writes:
>Hi Chris,
>
>Thank you for the patch! Yet something to improve:
I'm pretty sure !CONFIG_PRINTK && CONFIG_IA64_DEBUG_CMPXCHG has been broken
like this long before this change.
With !CONFIG_PRINTK, printk() is static in the header, but ia64's cmpxchg.h
with CONFIG_IA64_DEBUG_CMPXCHG doesn't take this into account before trying to
use it as extern.
CONFIG_IA64_DEBUG_CMPXCHG should be dependent on CONFIG_PRINTK. I'll send a
separate patch for that, no reason to put it in this series. Not sure why LKP
only caught it now though.
Here's how it is even on the parent commit LKP applied this patch to:
% git checkout bff4c8b4b0b9bcc31917d3b0f1a01a7573a1f473^
Previous HEAD position was bff4c8b4b0b9 printk: Userspace format enumeration support
HEAD is now at 1fa67f8391ac module: mark TRIM_UNUSED_KSYMS as BROKEN on powerpc
% COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 /tmp/make.cross ARCH=ia64 -s
Compiler will be installed in /home/cdown/0day
make W=1 CROSS_COMPILE=/home/cdown/0day/gcc-9.3.0-nolibc/ia64-linux/bin/ia64-linux- --jobs=32 ARCH=ia64 -s
In file included from ./include/linux/kernel.h:16,
from ./include/asm-generic/bug.h:20,
from ./arch/ia64/include/asm/bug.h:17,
from ./include/linux/bug.h:5,
from ./include/linux/page-flags.h:10,
from kernel/bounds.c:10:
./include/linux/printk.h:219:5: error: static declaration of 'printk' follows non-static declaration
219 | int printk(const char *s, ...)
| ^~~~~~
In file included from ./arch/ia64/include/uapi/asm/intrinsics.h:22,
from ./arch/ia64/include/asm/intrinsics.h:11,
from ./arch/ia64/include/asm/bitops.h:19,
from ./include/linux/bitops.h:32,
from ./include/linux/kernel.h:11,
from ./include/asm-generic/bug.h:20,
from ./arch/ia64/include/asm/bug.h:17,
from ./include/linux/bug.h:5,
from ./include/linux/page-flags.h:10,
from kernel/bounds.c:10:
./arch/ia64/include/uapi/asm/cmpxchg.h:142:14: note: previous declaration of 'printk' was here
142 | extern int printk(const char *fmt, ...); \
| ^~~~~~
./arch/ia64/include/asm/bitops.h:309:3: note: in expansion of macro 'CMPXCHG_BUGCHECK'
309 | CMPXCHG_BUGCHECK(m);
| ^~~~~~~~~~~~~~~~
make[1]: *** [scripts/Makefile.build:117: kernel/bounds.s] Error 1
make[1]: *** Waiting for unfinished jobs....
make: *** [Makefile:1206: prepare0] Error 2
Chris Down writes:
>kernel test robot writes:
>>Hi Chris,
>>
>>Thank you for the patch! Yet something to improve:
>
>I'm pretty sure !CONFIG_PRINTK && CONFIG_IA64_DEBUG_CMPXCHG has been
>broken like this long before this change.
I sent "ia64: Depend on non-static printk for cmpxchg debug"[0] to ia64 folks.
It shouldn't block this patch, this is already broken.
0: https://lore.kernel.org/patchwork/patch/1380380/
On Fri 2021-02-12 15:30:16, Chris Down wrote:
> We have a number of systems industry-wide that have a subset of their
> functionality that works as follows:
>
> 1. Receive a message from local kmsg, serial console, or netconsole;
> 2. Apply a set of rules to classify the message;
> 3. Do something based on this classification (like scheduling a
> remediation for the machine), rinse, and repeat.
>
> As a couple of examples of places we have this implemented just inside
> Facebook, although this isn't a Facebook-specific problem, we have this
> inside our netconsole processing (for alarm classification), and as part
> of our machine health checking. We use these messages to determine
> fairly important metrics around production health, and it's important
> that we get them right.
>
> This patch provides a solution to the issue of silently changed or
> deleted printks: we record pointers to all printk format strings known
> at compile time into a new .printk_fmts section, both in vmlinux and
> modules. At runtime, this can then be iterated by looking at
> <debugfs>/printk/formats/<module>, which emits the same format as
> `printk` itself, which we already export elsewhere (for example, in
> netconsole).
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 5a95c688621f..adf545ba9eb9 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> +static int debugfs_pf_show(struct seq_file *s, void *v)
> +{
> + struct module *mod = s->file->f_inode->i_private;
> + struct printk_fmt_sec *ps = NULL;
> + const char **fptr = NULL;
> + int ret = 0;
> +
> + mutex_lock(&printk_fmts_mutex);
> +
> + /*
> + * The entry might have been invalidated in the hlist between _open and
> + * _show, so we need to eyeball the entries under printk_fmts_mutex
> + * again.
> + */
> + ps = find_printk_fmt_sec(mod);
> + if (unlikely(!ps)) {
> + ret = -ENOENT;
> + goto out_unlock;
> + }
> +
> + for (fptr = ps->start; fptr < ps->end; fptr++) {
> + /* For callsites without KERN_SOH + level preamble. */
> + if (unlikely(*fptr[0] != KERN_SOH_ASCII))
> + seq_printf(s, "%c%d", KERN_SOH_ASCII,
> + MESSAGE_LOGLEVEL_DEFAULT);
> + seq_printf(s, "%s%c", *fptr, '\0');
> + }
This produces something like:
3Warning: unable to open an initial console.
3Failed to execute %s (error %d)
6Kernel memory protection disabled.
3Starting init: %s exists but couldn't execute it (error %d)
6Run %s as init process
7initcall %pS returned %d after %lld usecs
7calling %pS @ %i
2initrd overwritten (0x%08lx < 0x%08lx) - disabling it.
The loglevel is not well separated. It is neither human readable
nor safe for a machine processing . It works only for single digit.
Also it includes the KERN_SOH and other non-printable blobs.
It looks in less like:
^A3Warning: unable to open an initial console.
^@^A3Failed to execute %s (error %d)
^@^A6Kernel memory protection disabled.
^@^A3Starting init: %s exists but couldn't execute it (error %d)
^@^A6Run %s as init process
^@^A7initcall %pS returned %d after %lld usecs
^@^A7calling %pS @ %i
^@^A2initrd overwritten (0x%08lx < 0x%08lx) - disabling it.
^@^A5Kernel command line: %s
^@^A5%s^@^A6mem auto-init: clearing system memory may take some time...
I would like to use something already existing. We add only loglevel,
so the syslog format would be the best
<3> Warning: unable to open an initial console.
<3> Failed to execute %s (error %d)
<6> Kernel memory protection disabled.
<3> Starting init: %s exists but couldn't execute it (error %d)
<6> Run %s as init process
<7> initcall %pS returned %d after %lld usecs
<7> calling %pS @ %i
<2> initrd overwritten (0x%08lx < 0x%08lx) - disabling it.
Regarding the non-printable characters, I would either use the same
escape sequences as in /sys/kernel/debug/dynamic_debug/control
But even better might be to preserve to origianal text
as much as possible. It have to the following text
with the patch below:
<3> Warning: unable to open an initial console.\n
<3> Failed to execute %s (error %d)\n
<6> Kernel memory protection disabled.\n
<3> Starting init: %s exists but couldn't execute it (error %d)\n
<6> Run %s as init process\n
<7> initcall %pS returned %d after %lld usecs\n
<7> calling %pS @ %i\n
<2> initrd overwritten (0x%08lx < 0x%08lx) - disabling it.\n
<5> Kernel command line: %s\n
Well, it still might be non-trivial to find the string in the code
and see what exactly has changed. It might be pretty useful
to mention even the source_file:line, for example:
<3> init/main.c:1489: Warning: unable to open an initial console.\n
<3> init/main.c:1446: Failed to execute %s (error %d)\n
<6> init/main.c:1398: Kernel memory protection disabled.\n
<3> init/main.c:1366: Starting init: %s exists but couldn't execute it (error %d)\n
The way how to store the filename and line number might be found
in the dynamic_debug code.
> +
> +out_unlock:
> + mutex_unlock(&printk_fmts_mutex);
> + return ret;
> +}
I have achieved the last but one output with the following patch
on top of the original one:
From d21d1c6d45fc66beb90de8d3309f925ecb501f1a Mon Sep 17 00:00:00 2001
From: Petr Mladek <[email protected]>
Date: Fri, 12 Feb 2021 18:43:54 +0100
Subject: [PATCH] better output
---
fs/seq_file.c | 21 +++++++++++++++++++++
include/linux/seq_file.h | 1 +
kernel/printk/printk.c | 20 ++++++++++++++------
3 files changed, 36 insertions(+), 6 deletions(-)
diff --git a/fs/seq_file.c b/fs/seq_file.c
index 03a369ccd28c..364587eff9d2 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -376,6 +376,27 @@ void seq_escape(struct seq_file *m, const char *s, const char *esc)
}
EXPORT_SYMBOL(seq_escape);
+/**
+ * seq_escape+printf_format - print string into buffer, escaping
+ * characters that are escaped in printf format,
+ * @m: target buffer
+ * @s: string
+ *
+ * Puts string into buffer and escape characters that are
+ * escaped in printf format.
+ * Use seq_has_overflowed() to check for errors.
+ */
+void seq_escape_printf_format(struct seq_file *m, const char *s)
+{
+ char *buf;
+ size_t size = seq_get_buf(m, &buf);
+ int ret;
+
+ ret = string_escape_str(s, buf, size, ESCAPE_SPACE | ESCAPE_SPECIAL, NULL);
+ seq_commit(m, ret < size ? ret : -1);
+}
+EXPORT_SYMBOL(seq_escape_printf_format);
+
void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz)
{
char *buf;
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index b83b3ae3c877..cfc504a30429 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -127,6 +127,7 @@ void seq_put_hex_ll(struct seq_file *m, const char *delimiter,
unsigned long long v, unsigned int width);
void seq_escape(struct seq_file *m, const char *s, const char *esc);
+void seq_escape_printf_format(struct seq_file *m, const char *s);
void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz);
void seq_hex_dump(struct seq_file *m, const char *prefix_str, int prefix_type,
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index adf545ba9eb9..77a43b483b7b 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -783,6 +783,8 @@ static const char *ps_get_module_name(const struct printk_fmt_sec *ps)
}
#endif /* CONFIG_MODULES */
+static u16 parse_prefix(const char *text, int *level, enum log_flags *lflags);
+
static int debugfs_pf_show(struct seq_file *s, void *v)
{
struct module *mod = s->file->f_inode->i_private;
@@ -804,11 +806,17 @@ static int debugfs_pf_show(struct seq_file *s, void *v)
}
for (fptr = ps->start; fptr < ps->end; fptr++) {
- /* For callsites without KERN_SOH + level preamble. */
- if (unlikely(*fptr[0] != KERN_SOH_ASCII))
- seq_printf(s, "%c%d", KERN_SOH_ASCII,
- MESSAGE_LOGLEVEL_DEFAULT);
- seq_printf(s, "%s%c", *fptr, '\0');
+ int level;
+ u16 prefix_len;
+ const char *fmt;
+
+ level = LOGLEVEL_DEFAULT;
+ prefix_len = parse_prefix(fptr[0], &level, NULL);
+ fmt = fptr[0] + prefix_len;
+
+ seq_printf(s, "<%d> ", level);
+ seq_escape_printf_format(s, fmt);
+ seq_printf(s, "\n");
}
out_unlock:
@@ -2113,7 +2121,7 @@ static inline u32 printk_caller_id(void)
*
* Return: The length of the parsed level and control flags.
*/
-static u16 parse_prefix(char *text, int *level, enum log_flags *lflags)
+static u16 parse_prefix(const char *text, int *level, enum log_flags *lflags)
{
u16 prefix_len = 0;
int kern_level;
--
2.26.2
On Fri 2021-02-12 15:30:16, Chris Down wrote:
> We have a number of systems industry-wide that have a subset of their
> functionality that works as follows:
>
> 1. Receive a message from local kmsg, serial console, or netconsole;
> 2. Apply a set of rules to classify the message;
> 3. Do something based on this classification (like scheduling a
> remediation for the machine), rinse, and repeat.
>
> As a couple of examples of places we have this implemented just inside
> Facebook, although this isn't a Facebook-specific problem, we have this
> inside our netconsole processing (for alarm classification), and as part
> of our machine health checking. We use these messages to determine
> fairly important metrics around production health, and it's important
> that we get them right.
>
>
> This patch provides a solution to the issue of silently changed or
> deleted printks: we record pointers to all printk format strings known
> at compile time into a new .printk_fmts section, both in vmlinux and
> modules. At runtime, this can then be iterated by looking at
> <debugfs>/printk/formats/<module>, which emits the same format as
> `printk` itself, which we already export elsewhere (for example, in
> netconsole).
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 5a95c688621f..adf545ba9eb9 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> +
> +static const struct file_operations dfs_formats_fops = {
> + .open = debugfs_pf_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = single_release,
> +};
> +
> +static size_t printk_fmt_size(const char *fmt)
> +{
> + size_t sz = strlen(fmt) + 1;
> +
> + /*
> + * Some printk formats don't start with KERN_SOH + level. We will add
> + * it later when rendering the output.
> + */
> + if (unlikely(fmt[0] != KERN_SOH_ASCII))
> + sz += 2;
This approach is hard to maintain. It might be pretty hard and error
prone to count the size if we want to provide more information.
There are many files in debugfs with not-well defined size.
They are opened by seq_open_private(). It allows to add
a line by line by an iterator.
For example:
+ /sys/kernel/debug/dynamic_debug/control is opened by
ddebug_proc_open() in lib/dynamic_debug.c
+ /sys/kernel/debug/tracing/available_filter_functions
is opened by ftrace_avail_open() in kernel/trace/ftrace.c
> +
> + return sz;
> +}
> +
> +static struct printk_fmt_sec *find_printk_fmt_sec(struct module *mod)
> +{
> + struct printk_fmt_sec *ps = NULL;
> +
> + hash_for_each_possible(printk_fmts_mod_sections, ps, hnode,
> + (unsigned long)mod)
> + if (ps->module == mod)
> + return ps;
> +
> + return NULL;
> +}
> +
> +static void store_printk_fmt_sec(struct module *mod, const char **start,
> + const char **end)
> +{
> + struct printk_fmt_sec *ps = NULL;
> + const char **fptr = NULL;
> + size_t size = 0;
> +
> + ps = kmalloc(sizeof(struct printk_fmt_sec), GFP_KERNEL);
> + if (!ps)
> + return;
> +
> + ps->module = mod;
> + ps->start = start;
> + ps->end = end;
> +
> + for (fptr = ps->start; fptr < ps->end; fptr++)
> + size += printk_fmt_size(*fptr);
> +
> + mutex_lock(&printk_fmts_mutex);
> + hash_add(printk_fmts_mod_sections, &ps->hnode, (unsigned long)mod);
> + mutex_unlock(&printk_fmts_mutex);
> +
> + ps->file = debugfs_create_file(ps_get_module_name(ps), 0444,
> + dfs_formats, mod, &dfs_formats_fops);
> +
> + if (!IS_ERR(ps->file))
> + d_inode(ps->file)->i_size = size;
We should revert the changes when the file could not get crated.
It does not make sense to keep the structure when the file is not
there.
I guess that remove_printk_fmt_sec() would even crash when
ps->file was set to an error code.
> +}
> +
> +#ifdef CONFIG_MODULES
> +static void remove_printk_fmt_sec(struct module *mod)
> +{
> + struct printk_fmt_sec *ps = NULL;
> +
> + if (WARN_ON_ONCE(!mod))
> + return;
> +
> + mutex_lock(&printk_fmts_mutex);
> +
> + ps = find_printk_fmt_sec(mod);
> + if (!ps) {
> + mutex_unlock(&printk_fmts_mutex);
> + return;
> + }
> +
> + hash_del(&ps->hnode);
> +
> + mutex_unlock(&printk_fmts_mutex);
> +
> + debugfs_remove(ps->file);
IMHO, we should remove the file before we remove the way how
to read it. This should be done in the opposite order
than in store_printk_fmt_sec().
> + kfree(ps);
> +}
> +
Best Regards,
Petr
Hey Petr,
Petr Mladek writes:
>This produces something like:
>
>3Warning: unable to open an initial console.
>3Failed to execute %s (error %d)
>6Kernel memory protection disabled.
>3Starting init: %s exists but couldn't execute it (error %d)
>6Run %s as init process
>7initcall %pS returned %d after %lld usecs
>7calling %pS @ %i
>2initrd overwritten (0x%08lx < 0x%08lx) - disabling it.
>
>The loglevel is not well separated. It is neither human readable
>nor safe for a machine processing . It works only for single digit.
>[...]
>It looks in less like: [...]
Hmm, why is it important that debugfs output is human readable? My impression
was that it's fine to have machine-readable stuff there.
Re: not being not safe for machine processing because it only works for a
single digit, I'm a little confused. KERN_* levels are, as far as I know, only
a single byte wide, and we rely on that already (eg. in printk_skip_header()).
We also already have precedent for null-separation/control characters in (for
example) /proc/pid/cmdline.
What am I missing? :-)
>Well, it still might be non-trivial to find the string in the code
>and see what exactly has changed. It might be pretty useful
>to mention even the source_file:line, for example:
>
><3> init/main.c:1489: Warning: unable to open an initial console.\n
><3> init/main.c:1446: Failed to execute %s (error %d)\n
><6> init/main.c:1398: Kernel memory protection disabled.\n
><3> init/main.c:1366: Starting init: %s exists but couldn't execute it (error %d)\n
Almost certainly a theoretical concern, but I am not a big fan of this format,
because it relies on a character (":") which is legal in paths (although as
you'd expect, we don't have any cases in the kernel right now). That's one of
the reasons why I preferred to use nulls, which can't be in a filename.
Hi,
this is from Nitpicker's department.
On Fri 2021-02-12 15:30:16, Chris Down wrote:
> We have a number of systems industry-wide that have a subset of their
> functionality that works as follows:
>
> 1. Receive a message from local kmsg, serial console, or netconsole;
> 2. Apply a set of rules to classify the message;
> 3. Do something based on this classification (like scheduling a
> remediation for the machine), rinse, and repeat.
>
> As a couple of examples of places we have this implemented just inside
> Facebook, although this isn't a Facebook-specific problem, we have this
> inside our netconsole processing (for alarm classification), and as part
> of our machine health checking. We use these messages to determine
> fairly important metrics around production health, and it's important
> that we get them right.
>
> This patch provides a solution to the issue of silently changed or
> deleted printks: we record pointers to all printk format strings known
> at compile time into a new .printk_fmts section, both in vmlinux and
> modules. At runtime, this can then be iterated by looking at
> <debugfs>/printk/formats/<module>, which emits the same format as
> `printk` itself, which we already export elsewhere (for example, in
> netconsole).
Please show how the format really look like. It is not
exactly as in netconsole.
> As an example of how simple a parser for this format can be:
>
> $ cat pf.py
> #!/usr/bin/env python
> with open("/sys/kernel/debug/printk/formats/vmlinux") as f:
> raw_fmts = f.read().split("\x00")[:-1]
> for raw_fmt in raw_fmts:
> level, fmt = raw_fmt[1], raw_fmt[2:]
> print(f"Level {level}: {fmt!r}")
>
> $ ./pf.py | shuf -n 5
> Level 4: 'Build ID is too large to include in vmcoreinfo: %u > %u\n'
> Level 3: 'BIOS bug, no explicit IRQ entries, using default mptable. (tell your hw vendor)\n'
> Level 3: 'Failed to execute %s (error %d)\n'
> Level 3: 'CHRDEV "%s" minor range requested (%u-%u) is out of range of maximum range (%u-%u) for a single major\n'
> Level 3: "slub_debug option '%c' unknown. skipped\n"
This looks much better than the original content. The kernel should
produce something human readable out of box.
Ah, this should have been mentioned in the reply prefixed by "output".
> diff --git a/init/Kconfig b/init/Kconfig
> index ab2e916f7e8b..f1f37a060235 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -764,6 +764,20 @@ config PRINTK_SAFE_LOG_BUF_SHIFT
> 13 => 8 KB for each CPU
> 12 => 4 KB for each CPU
>
> +config PRINTK_ENUMERATION
> + bool "Printk enumeration debugfs interface"
> +
> + depends on PRINTK && DEBUG_FS
> + help
> + Add support for enumeration of all printk formats known at compile
> + time at <debugfs>/printk/formats/<module>.
> +
> + This can be used as part of maintaining daemons which monitor
> + /dev/kmsg, as it permits auditing the printk formats present in a
> + kernel, allowing monitoring of cases where monitored printks are
> + changed or no longer present.
> +
> + There is no additional runtime cost to printk with this enabled.
> +
> #
> # Architectures with an unreliable sched_clock() should select this:
> #
> diff --git a/kernel/module.c b/kernel/module.c
> index 1e5aad812310..7b0f2a2f428e 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3429,6 +3429,11 @@ static int find_module_sections(struct module *mod, struct load_info *info)
> sizeof(unsigned long),
> &mod->num_kprobe_blacklist);
> #endif
> +#ifdef CONFIG_PRINTK_ENUMERATION
> + mod->printk_fmts_start = section_objs(info, ".printk_fmts",
> + sizeof(*mod->printk_fmts_start),
> + &mod->printk_fmts_sec_size);
I wonder if we could find a better name for the configure switch.
I have troubles to imagine what printk enumeration might mean.
Well, it might be because I am not a native speaker.
Anyway, the word "enumeration" is used only in the configure option.
Everything else is "printk_fmt"
What about DEBUG_PRINTK_FORMATS?
> +#endif
> #ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> mod->static_call_sites = section_objs(info, ".static_call_sites",
> sizeof(*mod->static_call_sites),
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 5a95c688621f..adf545ba9eb9 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -47,6 +47,8 @@
> #include <linux/sched/clock.h>
> #include <linux/sched/debug.h>
> #include <linux/sched/task_stack.h>
> +#include <linux/debugfs.h>
> +#include <linux/hashtable.h>
>
> #include <linux/uaccess.h>
> #include <asm/sections.h>
> @@ -617,6 +619,222 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
> return len;
> }
printk.c is already too big. Please, implement this feature in a
separate source file, e.g. kernel/printk/debug_formats.c.
Please, use kernel/printk/internal.h if you need some function
from printk.c that have not been public before.
> +#ifdef CONFIG_PRINTK_ENUMERATION
> +
> +/*
> + * debugfs/printk/formats/ - userspace enumeration of printk formats
> + *
> + * The format is the same as typically used by printk, <KERN_SOH><level>fmt,
> + * with each distinct format separated by \0.
> + */
> struct printk_fmt_sec {
> + struct hlist_node hnode;
> + struct module *module;
Please, use "struct module *mod". It is a more common.
> + struct dentry *file;
> + const char **start;
> + const char **end;
> +};
Please, document the meaning of the fields, ideally using the doc
style or how is the style called:
/**
* struct printk_fmt_sec -
* @hnode: node for the hash table
* @new_func: pointer to the patched function code
> +
> +/* The base dir for module formats, typically debugfs/printk/formats/ */
> +struct dentry *dfs_formats;
> +
> +/*
> + * Stores .printk_fmt section boundaries for vmlinux and all loaded modules.
> + * Add entries with store_printk_fmt_sec, remove entries with
> + * remove_printk_fmt_sec.
> + */
> +static DEFINE_HASHTABLE(printk_fmts_mod_sections, 8);
The hash table looks like an overkill. This is slow path. There are
typically only tens of loaded modules. Even the module loader
uses plain list for iterating the list of modules.
> +
> +/* Protects printk_fmts_mod_sections */
> +static DEFINE_MUTEX(printk_fmts_mutex);
What is the rule for using "printk_fmts" and "printk_fmt", please?
I can't find the system here ;-)
Anyway, I would prefer to use "printk_fmt" everywhere.
Or maybe even "pf_".
> +
> +static const char *ps_get_module_name(const struct printk_fmt_sec *ps);
> +static int debugfs_pf_open(struct inode *inode, struct file *file);
There are used many different:
+ shortcuts: fmt, fmts, ps, fmt_sec, dfs
+ styles/ordering: ps_get_module_name() vs.
find_printk_fmt_sec() vs.
printk_fmt_size() vs.
debugfs_pf_open()
It might be bearable because there is not much code. But it would
still help a lot when we make it more consistent. Many subsystems
prefer using a feature-specific prefix.
It might be "printk_fmt_" or "pf_" [*] in this case. And we could use
names like:
+ struct pf_object [**]
+ pf_get_object_name()
+ pf_find_object()
+ pf_fops,
+ pf_open()
+ pf_release()
+ pf_mutex,
+ pf_add_object()
+ pf_remove_object()
+ pf_module_notify
[*] "pf_" is inspired by kernel/printk/printk_ringbuffer.c that
uses "prb_" prefix.
[**] The "object" is either vmlinux or module. I took this naming from
include/linux/livepatch.h and kernel/livepatch/*.
The livepatch code needs to handle vmlinux name a special way
as well. Also notice that the livepatch code uses klp_ prefix ;-)
Best Regards,
Petr
Petr Mladek writes:
>> +static size_t printk_fmt_size(const char *fmt)
>> +{
>> + size_t sz = strlen(fmt) + 1;
>> +
>> + /*
>> + * Some printk formats don't start with KERN_SOH + level. We will add
>> + * it later when rendering the output.
>> + */
>> + if (unlikely(fmt[0] != KERN_SOH_ASCII))
>> + sz += 2;
>
>This approach is hard to maintain. It might be pretty hard and error
>prone to count the size if we want to provide more information.
>
>There are many files in debugfs with not-well defined size.
>They are opened by seq_open_private(). It allows to add
>a line by line by an iterator.
Hmm, this is optional -- it was just to avoid seq_file having to realloc the
buffer. I originally used an iterator and I'm happy to go back to it if it
proves more convenient.
>We should revert the changes when the file could not get crated.
>It does not make sense to keep the structure when the file is not
>there.
See the reply from gregkh on v2, who was quite insistent that we should not
check debugfs error codes. I'm happy to do either, but I can't please you both
:-)
>I guess that remove_printk_fmt_sec() would even crash when
>ps->file was set to an error code.
debugfs checks if its input is an error, so it shouldn't, unless that's not
what you're referring to?
>> +}
>> +
>> +#ifdef CONFIG_MODULES
>> +static void remove_printk_fmt_sec(struct module *mod)
>> +{
>> + struct printk_fmt_sec *ps = NULL;
>> +
>> + if (WARN_ON_ONCE(!mod))
>> + return;
>> +
>> + mutex_lock(&printk_fmts_mutex);
>> +
>> + ps = find_printk_fmt_sec(mod);
>> + if (!ps) {
>> + mutex_unlock(&printk_fmts_mutex);
>> + return;
>> + }
>> +
>> + hash_del(&ps->hnode);
>> +
>> + mutex_unlock(&printk_fmts_mutex);
>> +
>> + debugfs_remove(ps->file);
>
>IMHO, we should remove the file before we remove the way how
>to read it. This should be done in the opposite order
>than in store_printk_fmt_sec().
There is a subtle issue with doing this as-is: debugfs_remove(ps->file) cannot
be called under printk_fmts_mutex, because we may deadlock due to a pinned
debugfs refcnt if debugfs_remove() and _show happen at the same time.
Imagine we go into remove_printk_fmt_sec and grab printk_fmts_lock. On another
CPU, we call _show for the same file, which takes a reference in debugfs, but
it will stall waiting for printk_fmts_lock. Now we go back into
remove_printk_fmt_sec and can't make any forward progress, because
debugfs_remove will stall until all reference holders have finished, and there
is a deadlock.
That's the reason that debugfs_remove() must be called after we have already
finished with the mutex and have the printk_fmt_sec, since we need to know that
it's still valid, and we also need to not be under it at the time of removal.
One way to do what you're asking might be to have a flag in the printk_fmt_sec
which indicates that we are freeing something, and then take and release the
lock twice in remove_printk_fmt_sec. Personally, I feel indifferent to either
the current solution or something like that, but if you have a preference for
adding a flag or another similar solution, that's fine with me. Just let me
know. :-)
Petr Mladek writes:
>I wonder if we could find a better name for the configure switch.
>I have troubles to imagine what printk enumeration might mean.
>Well, it might be because I am not a native speaker.
>
>Anyway, the word "enumeration" is used only in the configure option.
>Everything else is "printk_fmt"
>
>What about DEBUG_PRINTK_FORMATS?
Hmm, I don't like DEBUG_PRINTK_FMTS because it's not about debugging, it's
about enumeration, I guess :-)
The name should reflect that this catalogues the available printks in the
kernel -- "debugging" seems to imply something different.
I'm ok with a different name like "printk catalogue" or something like that if
you prefer. Personally I think "printk enumeration" is fairly clear -- it's
about enumerating the available printks -- but anything that captures that
spirit is fine.
>printk.c is already too big. Please, implement this feature in a
>separate source file, e.g. kernel/printk/debug_formats.c.
Sure, that's fine.
>> struct printk_fmt_sec {
>> + struct hlist_node hnode;
>> + struct module *module;
>
>Please, use "struct module *mod". It is a more common.
>
>> + struct dentry *file;
>> + const char **start;
>> + const char **end;
>> +};
>
>Please, document the meaning of the fields, ideally using the doc
>style or how is the style called:
>
>/**
> * struct printk_fmt_sec -
> * @hnode: node for the hash table
> * @new_func: pointer to the patched function code
Roger to both. :-)
>> +
>> +/* The base dir for module formats, typically debugfs/printk/formats/ */
>> +struct dentry *dfs_formats;
>> +
>> +/*
>> + * Stores .printk_fmt section boundaries for vmlinux and all loaded modules.
>> + * Add entries with store_printk_fmt_sec, remove entries with
>> + * remove_printk_fmt_sec.
>> + */
>> +static DEFINE_HASHTABLE(printk_fmts_mod_sections, 8);
>The hash table looks like an overkill. This is slow path. There are
>typically only tens of loaded modules. Even the module loader
>uses plain list for iterating the list of modules.
I don't think it's overkill -- we have prod systems with hundreds. Hell, even
my laptop has over 150 loaded. If someone needs to walk all of the files in
debugfs, it seems unreasonable to do an O(n^2) walk when an O(n) one would
suffice.
Unless you have a practical concern, I think this is a distinct case from the
module loader with its own unique requirements, so I'd prefer to use the hash
table.
>> +
>> +/* Protects printk_fmts_mod_sections */
>> +static DEFINE_MUTEX(printk_fmts_mutex);
>
>What is the rule for using "printk_fmts" and "printk_fmt", please?
>I can't find the system here ;-)
>
>Anyway, I would prefer to use "printk_fmt" everywhere.
>Or maybe even "pf_".
pf_ sounds fine. :-)
>> +
>> +static const char *ps_get_module_name(const struct printk_fmt_sec *ps);
>> +static int debugfs_pf_open(struct inode *inode, struct file *file);
>
>There are used many different:
>
> + shortcuts: fmt, fmts, ps, fmt_sec, dfs
>
> + styles/ordering: ps_get_module_name() vs.
> find_printk_fmt_sec() vs.
> printk_fmt_size() vs.
> debugfs_pf_open()
>
>It might be bearable because there is not much code. But it would
>still help a lot when we make it more consistent. Many subsystems
>prefer using a feature-specific prefix.
>
>It might be "printk_fmt_" or "pf_" [*] in this case. And we could use
>names like:
>
> + struct pf_object [**]
> + pf_get_object_name()
> + pf_find_object()
> + pf_fops,
> + pf_open()
> + pf_release()
> + pf_mutex,
> + pf_add_object()
> + pf_remove_object()
> + pf_module_notify
Oh, I meant to change the name for v4 but neglected to do so. Sounds good, will
do.
On Tue, Feb 16, 2021 at 05:27:08PM +0000, Chris Down wrote:
> Petr Mladek writes:
> > I wonder if we could find a better name for the configure switch.
> > I have troubles to imagine what printk enumeration might mean.
> > Well, it might be because I am not a native speaker.
> >
> > Anyway, the word "enumeration" is used only in the configure option.
> > Everything else is "printk_fmt"
> >
> > What about DEBUG_PRINTK_FORMATS?
>
> Hmm, I don't like DEBUG_PRINTK_FMTS because it's not about debugging, it's
> about enumeration, I guess :-)
>
> The name should reflect that this catalogues the available printks in the
> kernel -- "debugging" seems to imply something different.
>
> I'm ok with a different name like "printk catalogue" or something like that
> if you prefer. Personally I think "printk enumeration" is fairly clear --
> it's about enumerating the available printks -- but anything that captures
> that spirit is fine.
How about config PRINTK_INDEX?
Johannes Weiner writes:
>On Tue, Feb 16, 2021 at 05:27:08PM +0000, Chris Down wrote:
>> Petr Mladek writes:
>> > I wonder if we could find a better name for the configure switch.
>> > I have troubles to imagine what printk enumeration might mean.
>> > Well, it might be because I am not a native speaker.
>> >
>> > Anyway, the word "enumeration" is used only in the configure option.
>> > Everything else is "printk_fmt"
>> >
>> > What about DEBUG_PRINTK_FORMATS?
>>
>> Hmm, I don't like DEBUG_PRINTK_FMTS because it's not about debugging, it's
>> about enumeration, I guess :-)
>>
>> The name should reflect that this catalogues the available printks in the
>> kernel -- "debugging" seems to imply something different.
>>
>> I'm ok with a different name like "printk catalogue" or something like that
>> if you prefer. Personally I think "printk enumeration" is fairly clear --
>> it's about enumerating the available printks -- but anything that captures
>> that spirit is fine.
>
>How about config PRINTK_INDEX?
Ah yes, I also like that. PRINTK_INDEX is fine from my perspective and is more
straightforward than "enumeration", thanks.
On Tue 2021-02-16 16:52:39, Chris Down wrote:
> Hey Petr,
>
> Petr Mladek writes:
> > This produces something like:
> >
> > 3Warning: unable to open an initial console.
> > 3Failed to execute %s (error %d)
> > 6Kernel memory protection disabled.
> > 3Starting init: %s exists but couldn't execute it (error %d)
> > 6Run %s as init process
> > 7initcall %pS returned %d after %lld usecs
> > 7calling %pS @ %i
> > 2initrd overwritten (0x%08lx < 0x%08lx) - disabling it.
> >
> > The loglevel is not well separated. It is neither human readable
> > nor safe for a machine processing . It works only for single digit.
> > [...]
> > It looks in less like: [...]
>
> Hmm, why is it important that debugfs output is human readable? My
> impression was that it's fine to have machine-readable stuff there.
We do not know how different people would want to use the
interface. Why not make it easier out of box when it is simple?
I guess that you already use it internally and have its own tooling
around it. You actually provided a snippet that made the context
better readable:
$ cat pf.py
#!/usr/bin/env python
with open("/sys/kernel/debug/printk/formats/vmlinux") as f:
raw_fmts = f.read().split("\x00")[:-1]
for raw_fmt in raw_fmts:
level, fmt = raw_fmt[1], raw_fmt[2:]
print(f"Level {level}: {fmt!r}")
I wonder how it would look in another scripting languages, like
bash or perl. Any non-printable character is tricky and would
complicate it.
> Re: not being not safe for machine processing because it only works for a
> single digit, I'm a little confused. KERN_* levels are, as far as I know,
> only a single byte wide, and we rely on that already (eg. in
> printk_skip_header()).
It is great that you mentioned it. KERN_ levels are implementation
detail.
It used to be "<level>". The binary KERN_SOH has been introduced
in v3.6-rc1 by the commit 04d2c8c83d0e3ac5f ("printk: convert
the format for KERN_<LEVEL> to a 2 byte pattern").
In v4.9-rc1, the commit 4bcc595ccd80decb4 ("printk: reinstate KERN_CONT
for printing continuation lines") added the possibility to define
both KERN_LEVEL + KERN_CONT at the same time. It is not handled
by your python snippet above.
The levels 0-7 are there basically from the beginning. But nobody
knows if one digit will be enough forever like 640kB mem.
Actually level 'c' has been added later for continuous lines which
is not even number. There are the following limits defined
in the sources.
#define CONSOLE_LOGLEVEL_DEBUG 10 /* issue debug messages */
#define CONSOLE_LOGLEVEL_MOTORMOUTH 15 /* You can't shut this one up */
No I rally do not want to expose this binary blobs to the user space
where we need to maintain backward compatibility.
> We also already have precedent for
> null-separation/control characters in (for example) /proc/pid/cmdline.
Yes, there is a precedent. But it does not mean that we should not
try to do it better.
> What am I missing? :-)
>
> > Well, it still might be non-trivial to find the string in the code
> > and see what exactly has changed. It might be pretty useful
> > to mention even the source_file:line, for example:
> >
> > <3> init/main.c:1489: Warning: unable to open an initial console.\n
> > <3> init/main.c:1446: Failed to execute %s (error %d)\n
> > <6> init/main.c:1398: Kernel memory protection disabled.\n
> > <3> init/main.c:1366: Starting init: %s exists but couldn't execute it (error %d)\n
>
> Almost certainly a theoretical concern, but I am not a big fan of this
> format, because it relies on a character (":") which is legal in paths
> (although as you'd expect, we don't have any cases in the kernel right now).
> That's one of the reasons why I preferred to use nulls, which can't be in a
> filename.
Well, it is also a very common format used almost anywhere,
for example, compilers, grep -n.
I think that many tools around kernel will have problems when
such problematic file names are used.
The main question is whether we want the information. IMHO, it might
safe a lot of time when solving problems about the modified strings.
Best Regards,
Petr
Petr Mladek writes:
>I guess that you already use it internally and have its own tooling
>around it.
Hmm, we're actually not using it yet widely because I don't want to backport it
to our prod kernel until we're reasonably agreed on the format :-)
My main concern is making sure that parsing is reliable, and there's as
little "well, there shouldn't ever be a <char> here" as possible. That's why I
preferred originally to use the already well established char array/printk
rules where possible, since they're mature and well tested.
I'm not against some other solution though, as long as it meets these
requirements. It looks like your proposed format with escaping prior to sending
to userspace also meets that requirement.
> $ cat pf.py
> #!/usr/bin/env python
> with open("/sys/kernel/debug/printk/formats/vmlinux") as f:
> raw_fmts = f.read().split("\x00")[:-1]
> for raw_fmt in raw_fmts:
> level, fmt = raw_fmt[1], raw_fmt[2:]
> print(f"Level {level}: {fmt!r}")
>
>I wonder how it would look in another scripting languages, like
>bash or perl. Any non-printable character is tricky and would
>complicate it.
It's really not that complicated there, either. Since you bring up bash, it's
even less work than the Python solution:
while IFS= read -r -d $'\0' fmt; do
printf 'Level %s: %q\n' "${fmt:1:1}" "${fmt:2}"
done < /sys/kernel/debug/printk/formats/vmlinux
The changelog describes the use case: automated detection of printk fmts
changing. For that reason, I don't understand why there's a desire to produce a
human readable format by default when the only known consumer of this is going
to be automation anyway.
If a use case for that comes up, we can always have a directory producing human
readable versions. I personally don't see the need though.
As long as it's reliably parseable though, I won't die on this hill, I just
don't understand the rationale :-)
>> Re: not being not safe for machine processing because it only works for a
>> single digit, I'm a little confused. KERN_* levels are, as far as I know,
>> only a single byte wide, and we rely on that already (eg. in
>> printk_skip_header()).
>
>It is great that you mentioned it. KERN_ levels are implementation
>detail.
>
>It used to be "<level>". The binary KERN_SOH has been introduced
>in v3.6-rc1 by the commit 04d2c8c83d0e3ac5f ("printk: convert
>the format for KERN_<LEVEL> to a 2 byte pattern").
>
>In v4.9-rc1, the commit 4bcc595ccd80decb4 ("printk: reinstate KERN_CONT
>for printing continuation lines") added the possibility to define
>both KERN_LEVEL + KERN_CONT at the same time. It is not handled
>by your python snippet above.
Thanks, this is a good callout. I will make sure v5 handles that.
In a hypothetical scenario where we do go towards something human-readable, how
do you perceive that should look? Something like this?
% less ...
<c, 5> 'Some fmt with cont + level 5\n'
<5> 'Some fmt with only level 5\n'
<c> 'Some fmt with only LOG_CONT\n'
>> We also already have precedent for
>> null-separation/control characters in (for example) /proc/pid/cmdline.
>
>Yes, there is a precedent. But it does not mean that we should not
>try to do it better.
To be clear, I'm actually asserting that I believe the machine-readable version
_is_ better, not that it's simply unnecessary to produce a human-readable
version :-)
As mentioned earlier in this e-mail though, it's not a hill I want to die on.
If you believe it should be human-readable, as long as its reliably parseable,
I'm happy to do that.
On Tue 2021-02-16 17:18:58, Chris Down wrote:
> Petr Mladek writes:
> > > +static size_t printk_fmt_size(const char *fmt)
> > > +{
> > > + size_t sz = strlen(fmt) + 1;
> > > +
> > > + /*
> > > + * Some printk formats don't start with KERN_SOH + level. We will add
> > > + * it later when rendering the output.
> > > + */
> > > + if (unlikely(fmt[0] != KERN_SOH_ASCII))
> > > + sz += 2;
> >
> > This approach is hard to maintain. It might be pretty hard and error
> > prone to count the size if we want to provide more information.
> >
> > There are many files in debugfs with not-well defined size.
> > They are opened by seq_open_private(). It allows to add
> > a line by line by an iterator.
>
> Hmm, this is optional -- it was just to avoid seq_file having to realloc the
> buffer. I originally used an iterator and I'm happy to go back to it if it
> proves more convenient.
Please, go back to iterators ;-)
> > We should revert the changes when the file could not get crated.
> > It does not make sense to keep the structure when the file is not
> > there.
>
> See the reply from gregkh on v2, who was quite insistent that we should not
> check debugfs error codes. I'm happy to do either, but I can't please you
> both :-)
I see. OK, it typically does not make any sense to handle the error.
The API is safe enough to handle ERR_PTR values. The files allow
to show something. The user will notice when they are not created.
But I believe that our case is different. We need to allocate
a structure, put it into hash table, to be able to show the content
by the debugfs interface. The only purpose of the structure is
to provide information for the debugfs file.
It does not make sense to keep the structure when the file was not
created.
Also it might make sense to print an error to inform the user
the some printk formats might be missing. Well, the error might
make sense only when format for others are shown.
That said, I do not resist on the error handling. The code will
be safe even without it.
> > I guess that remove_printk_fmt_sec() would even crash when
> > ps->file was set to an error code.
>
> debugfs checks if its input is an error, so it shouldn't, unless that's not
> what you're referring to?
Good to know.
> > > +}
> > > +
> > > +#ifdef CONFIG_MODULES
> > > +static void remove_printk_fmt_sec(struct module *mod)
> > > +{
> > > + struct printk_fmt_sec *ps = NULL;
> > > +
> > > + if (WARN_ON_ONCE(!mod))
> > > + return;
> > > +
> > > + mutex_lock(&printk_fmts_mutex);
> > > +
> > > + ps = find_printk_fmt_sec(mod);
> > > + if (!ps) {
> > > + mutex_unlock(&printk_fmts_mutex);
> > > + return;
> > > + }
> > > +
> > > + hash_del(&ps->hnode);
> > > +
> > > + mutex_unlock(&printk_fmts_mutex);
> > > +
> > > + debugfs_remove(ps->file);
> >
> > IMHO, we should remove the file before we remove the way how
> > to read it. This should be done in the opposite order
> > than in store_printk_fmt_sec().
>
> There is a subtle issue with doing this as-is: debugfs_remove(ps->file)
> cannot be called under printk_fmts_mutex, because we may deadlock due to a
> pinned debugfs refcnt if debugfs_remove() and _show happen at the same time.
Do we need to call debugfs_remove(ps->file) under printk_fmts_mutex?
> Imagine we go into remove_printk_fmt_sec and grab printk_fmts_lock. On
> another CPU, we call _show for the same file, which takes a reference in
> debugfs, but it will stall waiting for printk_fmts_lock. Now we go back into
> remove_printk_fmt_sec and can't make any forward progress, because
> debugfs_remove will stall until all reference holders have finished, and
> there is a deadlock.
Sure. But this will not happen when debugfs_remove(ps->file) was
called without the mutex. It is safe. The code will wait here
until all references are released and the file is gone.
It will prevent an non-necessary error when reading file:
CPU0: CPU1
monitor_check()
cat debug/printk/formats/*
open(debug/printk/formats/moduleA)
# successfully opened the file
# took reference count
rmmod moduleA
remove_printk_fmt_sec()
mutex_lock(&printk_fmts_mutex);
hash_del(&ps->hnode);
mutex_unlock(&printk_fmts_mutex);
debugfs_remove(ps->file);
debugfs_pf_show()
mutex_lock(&printk_fmts_mutex);
ps = find_printk_fmt_sec(mod);
if (unlikely(!ps)) {
ret = -ENOENT;
RESTULT: There was no deadlock. The file was opened but the show
callback failed.
Best Regards,
Petr
On Tue 2021-02-16 21:05:48, Chris Down wrote:
> Johannes Weiner writes:
> > On Tue, Feb 16, 2021 at 05:27:08PM +0000, Chris Down wrote:
> > > Petr Mladek writes:
> > > > I wonder if we could find a better name for the configure switch.
> > > > I have troubles to imagine what printk enumeration might mean.
> > > > Well, it might be because I am not a native speaker.
> > > >
> > > > Anyway, the word "enumeration" is used only in the configure option.
> > > > Everything else is "printk_fmt"
> > > >
> > > > What about DEBUG_PRINTK_FORMATS?
> > >
> > > Hmm, I don't like DEBUG_PRINTK_FMTS because it's not about debugging, it's
> > > about enumeration, I guess :-)
> > >
> > > The name should reflect that this catalogues the available printks in the
> > > kernel -- "debugging" seems to imply something different.
> > >
> > > I'm ok with a different name like "printk catalogue" or something like that
> > > if you prefer. Personally I think "printk enumeration" is fairly clear --
> > > it's about enumerating the available printks -- but anything that captures
> > > that spirit is fine.
> >
> > How about config PRINTK_INDEX?
>
> Ah yes, I also like that. PRINTK_INDEX is fine from my perspective and is
> more straightforward than "enumeration", thanks.
It is better than enumeration. But there is still the same
problem. The word "index" is used neither in the code
nor in the debugfs interface. It is like enabling cars and
seeing apples.
What about CONFIG_PRINTK_DEBUGFS?
It seems that various subsystems use CONFIG_<SUBSYSTEM>_DEBUGFS
pattern when they expose some internals in debugfs.
Best Regards,
Petr
Petr Mladek writes:
>> > > + debugfs_remove(ps->file);
>> >
>> > IMHO, we should remove the file before we remove the way how
>> > to read it. This should be done in the opposite order
>> > than in store_printk_fmt_sec().
>>
>> There is a subtle issue with doing this as-is: debugfs_remove(ps->file)
>> cannot be called under printk_fmts_mutex, because we may deadlock due to a
>> pinned debugfs refcnt if debugfs_remove() and _show happen at the same time.
>
>Do we need to call debugfs_remove(ps->file) under printk_fmts_mutex?
Ah, my concern was simultaneous entries into remove_printk_fmt_sec (which would
require setting a separate flag under the mutex), but now I think about it, the
module notifier synchronously waits, so that can't happen anyway.
As such it should be safe to just do:
remove()
{
mutex_lock(&printk_fmts_mutex);
ps = find_printk_fmt_sec();
mutex_unlock(&printk_fmts_mutex);
if (!ps)
return;
/* waits for _show */
debugfs_remove(ps->file);
mutex_lock(&printk_fmts_mutex);
/* Do the data structure teardown */
mutex_unlock(&printk_fmts_mutex);
}
Sounds good to me, I'll do that for v5. Thanks! :-)
Petr Mladek writes:
>> > How about config PRINTK_INDEX?
>>
>> Ah yes, I also like that. PRINTK_INDEX is fine from my perspective and is
>> more straightforward than "enumeration", thanks.
>
>It is better than enumeration. But there is still the same
>problem. The word "index" is used neither in the code
>nor in the debugfs interface. It is like enabling cars and
>seeing apples.
>
>What about CONFIG_PRINTK_DEBUGFS?
>
>It seems that various subsystems use CONFIG_<SUBSYSTEM>_DEBUGFS
>pattern when they expose some internals in debugfs.
The thing I don't like about that is that it describes a largely
inconsequential implementation detail rather than the semantic intent of the
config change, which is what the person deciding what to include in their
config is likely to care about. Often when I see "XXX debug interface" when
doing `make oldconfig` I think to myself "yes, but what does the debugfs
interface _do_?".
If someone else was writing this patch, and I saw "CONFIG_PRINTK_DEBUGFS"
appear in my prod kernel, I'd probably say N, because I don't need printk
debugging information. On the other hand, if I saw "CONFIG_PRINTK_INDEX", I'd
immediately understand that it's probably applicable to me.
I'm happy to rename the debugfs structure as <debugfs>/printk/fmt_index if it
helps, but personally I really feel CONFIG_PRINTK_{INDEX,ENUMERATION,CATALOGUE}
is a lot more descriptive than just saying "it has a debugfs interface" in the
config name for that reason.
On Tue 2021-02-16 17:27:08, Chris Down wrote:
> Petr Mladek writes:
> > > +/*
> > > + * Stores .printk_fmt section boundaries for vmlinux and all loaded modules.
> > > + * Add entries with store_printk_fmt_sec, remove entries with
> > > + * remove_printk_fmt_sec.
> > > + */
> > > +static DEFINE_HASHTABLE(printk_fmts_mod_sections, 8);
>
> > The hash table looks like an overkill. This is slow path. There are
> > typically only tens of loaded modules. Even the module loader
> > uses plain list for iterating the list of modules.
>
> I don't think it's overkill -- we have prod systems with hundreds. Hell,
> even my laptop has over 150 loaded. If someone needs to walk all of the
> files in debugfs, it seems unreasonable to do an O(n^2) walk when an O(n)
> one would suffice.
>
> Unless you have a practical concern, I think this is a distinct case from
> the module loader with its own unique requirements, so I'd prefer to use the
> hash table.
OK, it is true that the module API is either called with a particular
struct module pointer. Or it has to iterate over all modules anyway,
for example, when looking for a symbol.
Well, do we need access to struct module at all?
What about storing the pointer to struct pf_object into
struct printk_fmt_sec *ps into the s->file->f_inode->i_private?
Then we would not need any global list/table at all.
> > > +
> > > +/* Protects printk_fmts_mod_sections */
> > > +static DEFINE_MUTEX(printk_fmts_mutex);
> >
> > What is the rule for using "printk_fmts" and "printk_fmt", please?
> > I can't find the system here ;-)
> >
> > Anyway, I would prefer to use "printk_fmt" everywhere.
> > Or maybe even "pf_".
>
> pf_ sounds fine. :-)
>
> > > +
> > > +static const char *ps_get_module_name(const struct printk_fmt_sec *ps);
> > > +static int debugfs_pf_open(struct inode *inode, struct file *file);
> >
> > There are used many different:
> >
> > + shortcuts: fmt, fmts, ps, fmt_sec, dfs
> >
> > + styles/ordering: ps_get_module_name() vs.
> > find_printk_fmt_sec() vs.
> > printk_fmt_size() vs.
> > debugfs_pf_open()
> >
> > It might be bearable because there is not much code. But it would
> > still help a lot when we make it more consistent. Many subsystems
> > prefer using a feature-specific prefix.
> >
> > It might be "printk_fmt_" or "pf_" [*] in this case. And we could use
> > names like:
> >
> > + struct pf_object [**]
> > + pf_get_object_name()
> > + pf_find_object()
> > + pf_fops,
> > + pf_open()
> > + pf_release()
> > + pf_mutex,
> > + pf_add_object()
> > + pf_remove_object()
> > + pf_module_notify
>
> Oh, I meant to change the name for v4 but neglected to do so. Sounds good,
> will do.
Thanks a lot. I am sorry that I ask you to do so many changes.
I talked about the style early enough to make the review easy.
Also I think that it is not ideal and annoing to do these
mass changes and refactoring when the code is already reviewed,
tested, and functional.
Best Regards,
Petr
Petr Mladek writes:
>What about storing the pointer to struct pf_object into
>struct printk_fmt_sec *ps into the s->file->f_inode->i_private?
>Then we would not need any global list/table at all.
Unless I'm misreading the debugfs code, I think the following is possible:
open(f);
debugfs_file_get(f);
fops->open();
inode->private = ps;
debugfs_file_put(f);
remove_printk_fmt_sec(); /* kfree ps */
read(f);
debugfs_file_get(f);
fops->read();
ps = inode->private; /* invalid */
debugfs_file_put(f);
That's the reason why the code looks up from the module address again during
_read. Maybe I'm missing something? :-)
>> Oh, I meant to change the name for v4 but neglected to do so. Sounds good,
>> will do.
>
>Thanks a lot. I am sorry that I ask you to do so many changes.
>I talked about the style early enough to make the review easy.
>Also I think that it is not ideal and annoing to do these
>mass changes and refactoring when the code is already reviewed,
>tested, and functional.
Quite the opposite: thanks a lot for taking so much time to provide valuable
feedback :-) As someone who mostly works on mm code, having you to provide
feedback as printk maintainer is really helpful. Even if we disagree on some
stuff, it's really important that we have a good shared understanding of what
we eventually agree upon.
Chris Down writes:
>open(f);
> debugfs_file_get(f);
> fops->open();
> inode->private = ps;
> debugfs_file_put(f);
>
>remove_printk_fmt_sec(); /* kfree ps */
>
>read(f);
> debugfs_file_get(f);
> fops->read();
> ps = inode->private; /* invalid */
> debugfs_file_put(f);
Er, sorry, inode->private is populated at creation time, not at open(). The
same general concern applies though -- as far as I can tell there's some period
where we may be able to _read() and `ps` has already been freed.
On Wed, Feb 17, 2021 at 04:45:51PM +0100, Petr Mladek wrote:
> On Tue 2021-02-16 21:05:48, Chris Down wrote:
> > Johannes Weiner writes:
> > > On Tue, Feb 16, 2021 at 05:27:08PM +0000, Chris Down wrote:
> > > > Petr Mladek writes:
> > > > > I wonder if we could find a better name for the configure switch.
> > > > > I have troubles to imagine what printk enumeration might mean.
> > > > > Well, it might be because I am not a native speaker.
> > > > >
> > > > > Anyway, the word "enumeration" is used only in the configure option.
> > > > > Everything else is "printk_fmt"
> > > > >
> > > > > What about DEBUG_PRINTK_FORMATS?
> > > >
> > > > Hmm, I don't like DEBUG_PRINTK_FMTS because it's not about debugging, it's
> > > > about enumeration, I guess :-)
> > > >
> > > > The name should reflect that this catalogues the available printks in the
> > > > kernel -- "debugging" seems to imply something different.
> > > >
> > > > I'm ok with a different name like "printk catalogue" or something like that
> > > > if you prefer. Personally I think "printk enumeration" is fairly clear --
> > > > it's about enumerating the available printks -- but anything that captures
> > > > that spirit is fine.
> > >
> > > How about config PRINTK_INDEX?
> >
> > Ah yes, I also like that. PRINTK_INDEX is fine from my perspective and is
> > more straightforward than "enumeration", thanks.
>
> It is better than enumeration. But there is still the same
> problem. The word "index" is used neither in the code
> nor in the debugfs interface. It is like enabling cars and
> seeing apples.
I assumed code and interface would then also be changed to
printk_index, struct printk_index_section, pi_foo, ...
On Wed, 17 Feb 2021 15:28:24 +0000
Chris Down <[email protected]> wrote:
> Petr Mladek writes:
> >I guess that you already use it internally and have its own tooling
> >around it.
>
> Hmm, we're actually not using it yet widely because I don't want to backport it
> to our prod kernel until we're reasonably agreed on the format :-)
>
> My main concern is making sure that parsing is reliable, and there's as
> little "well, there shouldn't ever be a <char> here" as possible. That's why I
> preferred originally to use the already well established char array/printk
> rules where possible, since they're mature and well tested.
>
> I'm not against some other solution though, as long as it meets these
> requirements. It looks like your proposed format with escaping prior to sending
> to userspace also meets that requirement.
>
> > $ cat pf.py
> > #!/usr/bin/env python
> > with open("/sys/kernel/debug/printk/formats/vmlinux") as f:
> > raw_fmts = f.read().split("\x00")[:-1]
> > for raw_fmt in raw_fmts:
> > level, fmt = raw_fmt[1], raw_fmt[2:]
> > print(f"Level {level}: {fmt!r}")
Honestly, I have no idea what the above is doing. And python people tell me
that Perl is hard to read :-p
> >
> >I wonder how it would look in another scripting languages, like
> >bash or perl. Any non-printable character is tricky and would
> >complicate it.
>
> It's really not that complicated there, either. Since you bring up bash, it's
> even less work than the Python solution:
>
> while IFS= read -r -d $'\0' fmt; do
> printf 'Level %s: %q\n' "${fmt:1:1}" "${fmt:2}"
OK, now do the same in C. "%q" "and I guess that "f" in the print statement
in python (but I don't know for sure) does some magic with converting the
"\n" and such.
I agree with Petr on this. Print the format itself, and not what is
converted. It's much easier to convert "\t" and such to what they mean,
than to go the other way around.
> done < /sys/kernel/debug/printk/formats/vmlinux
>
> The changelog describes the use case: automated detection of printk fmts
> changing. For that reason, I don't understand why there's a desire to produce a
> human readable format by default when the only known consumer of this is going
> to be automation anyway.
It may be the reason its being added, but that doesn't mean it will be the
only use case. Otherwise I would just throw a NAK at this whole thing.
>
> If a use case for that comes up, we can always have a directory producing human
> readable versions. I personally don't see the need though.
I rather not add more cruft then needed.
>
> As long as it's reliably parseable though, I won't die on this hill, I just
> don't understand the rationale :-)
I rather not have a binary file listing for printk format listings. Just
for sanity sake if not for anything else.
>
> >> Re: not being not safe for machine processing because it only works for a
> >> single digit, I'm a little confused. KERN_* levels are, as far as I know,
> >> only a single byte wide, and we rely on that already (eg. in
> >> printk_skip_header()).
> >
> >It is great that you mentioned it. KERN_ levels are implementation
> >detail.
> >
> >It used to be "<level>". The binary KERN_SOH has been introduced
> >in v3.6-rc1 by the commit 04d2c8c83d0e3ac5f ("printk: convert
> >the format for KERN_<LEVEL> to a 2 byte pattern").
> >
> >In v4.9-rc1, the commit 4bcc595ccd80decb4 ("printk: reinstate KERN_CONT
> >for printing continuation lines") added the possibility to define
> >both KERN_LEVEL + KERN_CONT at the same time. It is not handled
> >by your python snippet above.
>
> Thanks, this is a good callout. I will make sure v5 handles that.
>
> In a hypothetical scenario where we do go towards something human-readable, how
> do you perceive that should look? Something like this?
>
> % less ...
> <c, 5> 'Some fmt with cont + level 5\n'
> <5> 'Some fmt with only level 5\n'
> <c> 'Some fmt with only LOG_CONT\n'
>
> >> We also already have precedent for
> >> null-separation/control characters in (for example) /proc/pid/cmdline.
> >
> >Yes, there is a precedent. But it does not mean that we should not
> >try to do it better.
>
> To be clear, I'm actually asserting that I believe the machine-readable version
> _is_ better, not that it's simply unnecessary to produce a human-readable
> version :-)
>
> As mentioned earlier in this e-mail though, it's not a hill I want to die on.
> If you believe it should be human-readable, as long as its reliably parseable,
> I'm happy to do that.
I too vote for a human readable machine parseable format ;-)
-- Steve
Steven Rostedt writes:
>OK, now do the same in C. "%q" "and I guess that "f" in the print statement
>in python (but I don't know for sure) does some magic with converting the
>"\n" and such.
>
>I agree with Petr on this. Print the format itself, and not what is
>converted. It's much easier to convert "\t" and such to what they mean,
>than to go the other way around.
To be clear, you're advocating for escaping tabs/newlines/etc on the kernel
side, right?
That seems to be implied by your first paragraph, but "print the format itself
and not what is converted" sounds like the opposite of that to me. I assume
your meaning is "escape the format, don't print it raw", which is what Petr was
also advocating for? :-)
On Wed 2021-02-17 16:32:21, Chris Down wrote:
> Chris Down writes:
> > open(f);
> > debugfs_file_get(f);
> > fops->open();
> > inode->private = ps;
> > debugfs_file_put(f);
> >
> > remove_printk_fmt_sec(); /* kfree ps */
> >
> > read(f);
> > debugfs_file_get(f);
> > fops->read();
> > ps = inode->private; /* invalid */
> > debugfs_file_put(f);
>
> Er, sorry, inode->private is populated at creation time, not at open(). The
> same general concern applies though -- as far as I can tell there's some
> period where we may be able to _read() and `ps` has already been freed.
Honestly, I am a bit lost here. Also I have realized that you needed to
release the struct from the module going notifier. And there you have
only pointer to struct module.
The thing is that I do not see similar tricks anywhere else. Well, other
users are easier because they create the debugfs file for it own purpose.
In our case, we actually create the file for another module.
Anyway, we are going to use the seq_buf iterator API. IMHO, the
seq_buf iterator functions should be able to get the structure
directly via the data pointer.
I wonder if it is similar to proc_seq_open() and proc_seq_release().
It is using the seq_buf iterator as well. It is created used
by proc_create_seq_private(). This API is used, for example,
in decnet_init(). It is a module, so there must be the similar
problems. All data are gone when the module is removed.
The only remaining problem is the module going notifier. For this
purpose, we could store the pointer to struct module. There
are many other fields for similar purposes. I am pretty sure that
the module loader maintainer will agree.
The result will be using some existing patterns. It should reduce
problems with possible races. It should make the life easier for
all involved APIs.
Best Regards,
Petr
On Wed 2021-02-17 15:56:38, Chris Down wrote:
> Petr Mladek writes:
> > > > How about config PRINTK_INDEX?
> > >
> > > Ah yes, I also like that. PRINTK_INDEX is fine from my perspective and is
> > > more straightforward than "enumeration", thanks.
> >
> > It is better than enumeration. But there is still the same
> > problem. The word "index" is used neither in the code
> > nor in the debugfs interface. It is like enabling cars and
> > seeing apples.
> >
> > What about CONFIG_PRINTK_DEBUGFS?
> >
> > It seems that various subsystems use CONFIG_<SUBSYSTEM>_DEBUGFS
> > pattern when they expose some internals in debugfs.
>
> The thing I don't like about that is that it describes a largely
> inconsequential implementation detail rather than the semantic intent of the
> config change, which is what the person deciding what to include in their
> config is likely to care about. Often when I see "XXX debug interface" when
> doing `make oldconfig` I think to myself "yes, but what does the debugfs
> interface _do_?".
I see.
> If someone else was writing this patch, and I saw "CONFIG_PRINTK_DEBUGFS"
> appear in my prod kernel, I'd probably say N, because I don't need printk
> debugging information. On the other hand, if I saw "CONFIG_PRINTK_INDEX", I'd
> immediately understand that it's probably applicable to me.
>
> I'm happy to rename the debugfs structure as <debugfs>/printk/fmt_index if it
> helps, but personally I really feel CONFIG_PRINTK_{INDEX,ENUMERATION,CATALOGUE}
> is a lot more descriptive than just saying "it has a debugfs interface" in the
> config name for that reason.
PRINTK_INDEX sounds the best to me. Keep in mind that I am not a
native speaker.
And my concern will be gone when we use it also in the API and debugfs
hierarchy as suggested by Johannes.
Another compromise might be to have CONFIG_PRINTK_FORMATS_INDEX.
Then the prefix printk_format_, pf_ would still match the option.
Or we could use printk_format_index_m, pfi_ indexes.
Best Regards,
Petr
PS: I feel that I have enough bike-shading. I think that I will be
fine with anything that you choose ;-)
On Wed 2021-02-17 15:28:24, Chris Down wrote:
> Petr Mladek writes:
> > I guess that you already use it internally and have its own tooling
> > around it.
>
> Hmm, we're actually not using it yet widely because I don't want to backport
> it to our prod kernel until we're reasonably agreed on the format :-)
>
> My main concern is making sure that parsing is reliable, and there's as
> little "well, there shouldn't ever be a <char> here" as possible. That's why I
> preferred originally to use the already well established char array/printk
> rules where possible, since they're mature and well tested.
>
> I'm not against some other solution though, as long as it meets these
> requirements. It looks like your proposed format with escaping prior to
> sending to userspace also meets that requirement.
Great.
> > $ cat pf.py
> > #!/usr/bin/env python
> > with open("/sys/kernel/debug/printk/formats/vmlinux") as f:
> > raw_fmts = f.read().split("\x00")[:-1]
> > for raw_fmt in raw_fmts:
> > level, fmt = raw_fmt[1], raw_fmt[2:]
> > print(f"Level {level}: {fmt!r}")
> >
> > I wonder how it would look in another scripting languages, like
> > bash or perl. Any non-printable character is tricky and would
> > complicate it.
>
> It's really not that complicated there, either. Since you bring up bash, it's
> even less work than the Python solution:
>
> while IFS= read -r -d $'\0' fmt; do
> printf 'Level %s: %q\n' "${fmt:1:1}" "${fmt:2}"
> done < /sys/kernel/debug/printk/formats/vmlinux
I do not see how it handles KERN_SOH. Are they silently ignored?
Is it caused by -r -d read options?
And this is one big problem. Non-printable characters sucks. And in
this particular case, we are even disclosing more implementation details
than it is really necessary.
> The changelog describes the use case: automated detection of printk fmts
> changing. For that reason, I don't understand why there's a desire to produce a
> human readable format by default when the only known consumer of this is going
> to be automation anyway.
>
> If a use case for that comes up, we can always have a directory producing
> human readable versions. I personally don't see the need though.
>
> As long as it's reliably parseable though, I won't die on this hill, I just
> don't understand the rationale :-)
>
> > > Re: not being not safe for machine processing because it only works for a
> > > single digit, I'm a little confused. KERN_* levels are, as far as I know,
> > > only a single byte wide, and we rely on that already (eg. in
> > > printk_skip_header()).
> >
> > It is great that you mentioned it. KERN_ levels are implementation
> > detail.
> >
> > It used to be "<level>". The binary KERN_SOH has been introduced
> > in v3.6-rc1 by the commit 04d2c8c83d0e3ac5f ("printk: convert
> > the format for KERN_<LEVEL> to a 2 byte pattern").
> >
> > In v4.9-rc1, the commit 4bcc595ccd80decb4 ("printk: reinstate KERN_CONT
> > for printing continuation lines") added the possibility to define
> > both KERN_LEVEL + KERN_CONT at the same time. It is not handled
> > by your python snippet above.
>
> Thanks, this is a good callout. I will make sure v5 handles that.
>
> In a hypothetical scenario where we do go towards something human-readable,
> how do you perceive that should look? Something like this?
>
> % less ...
> <c, 5> 'Some fmt with cont + level 5\n'
> <5> 'Some fmt with only level 5\n'
> <c> 'Some fmt with only LOG_CONT\n'
The main question is whether 'c' is really important. But it might be.
It is a hint that it might have different log level depending if it
was successfully connected or not.
Note that it is not connected when messages are printed from another
context, CPU, or process in the meantime.
The space in <c, 5> makes it more human readable. But it might
unnecessary complicate machine processing. I would use a compromise
here:
<c,5> "Some fmt with cont + level 5\n"
Alternative might be the following format. But it also might be worse
for machine processing:
<c><5> "Some fmt with cont + level 5\n"
Note, that I used double quotation marks "" instead of single ones ''.
The quotation marks would make sense. And I would prefer the double
ones because they are used in the sources and also in other
interfaces, for example, /sys/kernel/debug/dynamic_debug/control
What about the file name and line number? IMHO, it would be
pretty useful.
> > > We also already have precedent for
> > > null-separation/control characters in (for example) /proc/pid/cmdline.
> >
> > Yes, there is a precedent. But it does not mean that we should not
> > try to do it better.
>
> To be clear, I'm actually asserting that I believe the machine-readable
> version _is_ better, not that it's simply unnecessary to produce a
> human-readable version :-)
I am sorry but I still haven't hear any convincing arguments why it is better.
> As mentioned earlier in this e-mail though, it's not a hill I want to die
> on. If you believe it should be human-readable, as long as its reliably
> parseable, I'm happy to do that.
I believe that it should be human readable :-)
Best Regards,
Petr
Petr Mladek writes:
>> - Move to another file, kernel/printk/debug_formats.c or similar
>
>Just to be sure. The filename should be ideally based on the configure
>option and API names, e.g. formats_index.c or so.
>
>The printk_ prefix is not strictly necessary. The file is in printk/
>directory. IMHO, we should have used ringbuffer.c but ...
Sure thing.
>> - Use `struct module *mod` instead of calling it module
>> - Add documentation for printk_fmt_sec (or whatever it will be called)
>> - Rename things to pf_, pi_, or something
>> - See if it's safe to pass a printk_fmt_sec to seq_file instead of a module
>
>Also it might be needed to store the pointer to struct module.
You mean, have a `struct module` entry for this? I somewhat suspect that
module.c maintainers are not likely to be happy about injecting non-generic
code into there if it's possible to be avoided, but maybe I'm misunderstanding?
>Both things together might allow to remove the global hash table and likely
>even the mutex.
>
>> - Handle cont + level
>> - Don't expose level/KERN_SOH directly
>
>I can't remember anything else. I am curious how v5 would look like.
You can join the club on that one... ;-)
Let me know if I understood you correctly on the `struct module` point, and
after that I'll start work on v5. Thanks!
On Thu 2021-02-18 12:21:55, Chris Down wrote:
> Thanks for all your feedback, Petr and Steven. :-)
>
> Petr, I believe this is a comprehensive checklist of everything we discussed
> for v5 -- any chance you could double check I'm not missing anything you
> folks wanted? Thanks!
>
> - Use seq_file iterator again instead of simple_open + size
> - Remove debugfs file first to avoid ENOENT
> - Tear down datastructures if debugfs fails
> - Human readable output format
> - Display file/line
> - Rename to CONFIG_PRINTK_INDEX, or... something
> - Move to another file, kernel/printk/debug_formats.c or similar
Just to be sure. The filename should be ideally based on the configure
option and API names, e.g. formats_index.c or so.
The printk_ prefix is not strictly necessary. The file is in printk/
directory. IMHO, we should have used ringbuffer.c but ...
> - Use `struct module *mod` instead of calling it module
> - Add documentation for printk_fmt_sec (or whatever it will be called)
> - Rename things to pf_, pi_, or something
> - See if it's safe to pass a printk_fmt_sec to seq_file instead of a module
Also it might be needed to store the pointer to struct module.
Both things together might allow to remove the global hash table and likely
even the mutex.
> - Handle cont + level
> - Don't expose level/KERN_SOH directly
I can't remember anything else. I am curious how v5 would look like.
Best Regards,
Petr
Thanks for all your feedback, Petr and Steven. :-)
Petr, I believe this is a comprehensive checklist of everything we discussed
for v5 -- any chance you could double check I'm not missing anything you folks
wanted? Thanks!
- Use seq_file iterator again instead of simple_open + size
- Remove debugfs file first to avoid ENOENT
- Tear down datastructures if debugfs fails
- Human readable output format
- Display file/line
- Rename to CONFIG_PRINTK_INDEX, or... something
- Move to another file, kernel/printk/debug_formats.c or similar
- Use `struct module *mod` instead of calling it module
- Add documentation for printk_fmt_sec (or whatever it will be called)
- Rename things to pf_, pi_, or something
- See if it's safe to pass a printk_fmt_sec to seq_file instead of a module
- Handle cont + level
- Don't expose level/KERN_SOH directly
On Thu 2021-02-18 12:41:39, Chris Down wrote:
> Petr Mladek writes:
> > > - See if it's safe to pass a printk_fmt_sec to seq_file instead of a module
> >
> > Also it might be needed to store the pointer to struct module.
>
> You mean, have a `struct module` entry for this? I somewhat suspect that
> module.c maintainers are not likely to be happy about injecting non-generic
> code into there if it's possible to be avoided, but maybe I'm
> misunderstanding?
Yes, I suggest to store the pointer into struct module. It includes
many external entries. It is similar to struct task_struct.
I am active also in the kernel livepatch subsystem. We have added
there three values:
#ifdef CONFIG_LIVEPATCH
bool klp; /* Is this a livepatch module? */
bool klp_alive;
/* Elf information */
struct klp_modinfo *klp_info;
#endif
Many other subsystems have their stuff there, for example:
#ifdef CONFIG_TRACING
unsigned int num_trace_bprintk_fmt;
const char **trace_bprintk_fmt_start;
#endif
#ifdef CONFIG_EVENT_TRACING
struct trace_event_call **trace_events;
unsigned int num_trace_events;
struct trace_eval_map **trace_evals;
unsigned int num_trace_evals;
#endif
#ifdef CONFIG_FTRACE_MCOUNT_RECORD
unsigned int num_ftrace_callsites;
unsigned long *ftrace_callsites;
#endif
#ifdef CONFIG_KPROBES
void *kprobes_text_start;
unsigned int kprobes_text_size;
unsigned long *kprobe_blacklist;
unsigned int num_kprobe_blacklist;
#endif
BTW: Jessica originally worked on the kernel livepatching.
She became module loader code maintainer because we needed
even more hacks there and the original maintainer got
busy with other stuff at that time ;-)
I am pretty sure that she would accept it. We need a per-module
value. It is not necessary to maintain separate global list/hash
table just to store these values.
Best Regards,
Petr
Petr Mladek writes:
>On Thu 2021-02-18 12:41:39, Chris Down wrote:
>> Petr Mladek writes:
>> > > - See if it's safe to pass a printk_fmt_sec to seq_file instead of a module
>> >
>> > Also it might be needed to store the pointer to struct module.
>>
>> You mean, have a `struct module` entry for this? I somewhat suspect that
>> module.c maintainers are not likely to be happy about injecting non-generic
>> code into there if it's possible to be avoided, but maybe I'm
>> misunderstanding?
>
>Yes, I suggest to store the pointer into struct module. It includes
>many external entries. It is similar to struct task_struct.
Ah yes, now you mention it that does look fine. Thanks!
Expect v5 in the coming days, then. :-)