2020-07-17 23:50:45

by John Ogness

[permalink] [raw]
Subject: [PATCH 0/4] printk: reimplement LOG_CONT handling

Hello,

Here is the second series to rework the printk subsystem. This series
removes LOG_CONT handling from printk() callers, storing all LOG_CONT
parts individually in the ringbuffer. With this series, LOG_CONT
handling is moved to the ringbuffer readers that provide the record
contents to users (console printing, syslog, /dev/kmsg).

This change is necessary in order to support the upcoming move to a
fully lockless printk() implementation.

This series is in line with the agreements [0] made at the meeting
during LPC2019 in Lisbon, with 1 exception: For the /dev/kmsg
interface, empty line placeholder records are reported for the
LOG_CONT parts.

Using placeholders avoids tools such as systemd-journald from
erroneously reporting missed messages. However, it also means that
empty placeholder records are visible in systemd-journald logs and
displayed in tools such as dmesg.

The effect can be easily observed with the sysrq help:

$ echo h | sudo tee /proc/sysrq-trigger
$ sudo dmesg | tail -n 30
$ sudo journalctl -k -n 30

Providing the placeholder entries allows a userspace tool to identify
if records were actually lost. IMHO this an important feature. Its
side effect can be addressed by userspace tools if they change to
silently consume empty records.

For dump tools that process the ringbuffer directly (such as crash,
makedumpfile, kexec-tools), they will need to implement LOG_CONT
handling if they want to present clean continuous line messages.

Finally, by moving LOG_CONT handling from writers to readers, some
incorrect pr_cont() usage is revealed. Patch 4 of this series
addresses one such example.

This series is based on the printk git tree [1] printk-rework branch.

[0] https://lkml.kernel.org/r/[email protected]
[1] https://git.kernel.org/pub/scm/linux/kernel/git/printk/linux.git (printk-rework branch)

John Ogness (4):
printk: ringbuffer: support dataless records
printk: store instead of processing cont parts
printk: process cont records during reading
ipconfig: cleanup printk usage

kernel/printk/printk.c | 569 ++++++++++++++++++++----------
kernel/printk/printk_ringbuffer.c | 58 ++-
kernel/printk/printk_ringbuffer.h | 15 +-
net/ipv4/ipconfig.c | 25 +-
4 files changed, 434 insertions(+), 233 deletions(-)

--
2.20.1


2020-07-17 23:52:06

by John Ogness

[permalink] [raw]
Subject: [PATCH 2/4] printk: store instead of processing cont parts

Instead of buffering continuous line parts before storing the full
line into the ringbuffer, store each part as its own record.

Signed-off-by: John Ogness <[email protected]>
---
kernel/printk/printk.c | 114 ++++-------------------------------------
1 file changed, 11 insertions(+), 103 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index fec71229169e..c4274c867771 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -495,9 +495,14 @@ static void truncate_msg(u16 *text_len, u16 *trunc_msg_len)
*trunc_msg_len = 0;
}

+static inline u32 printk_caller_id(void)
+{
+ return in_task() ? task_pid_nr(current) :
+ 0x80000000 + raw_smp_processor_id();
+}
+
/* insert record into the buffer, discard old ones, update heads */
-static int log_store(u32 caller_id, int facility, int level,
- enum log_flags flags, u64 ts_nsec,
+static int log_store(int facility, int level, enum log_flags flags,
const char *dict, u16 dict_len,
const char *text, u16 text_len)
{
@@ -525,11 +530,8 @@ static int log_store(u32 caller_id, int facility, int level,
r.info->facility = facility;
r.info->level = level & 7;
r.info->flags = flags & 0x1f;
- if (ts_nsec > 0)
- r.info->ts_nsec = ts_nsec;
- else
- r.info->ts_nsec = local_clock();
- r.info->caller_id = caller_id;
+ r.info->ts_nsec = local_clock();
+ r.info->caller_id = printk_caller_id();

/* insert message */
prb_commit(&e);
@@ -1874,100 +1876,6 @@ static inline void printk_delay(void)
}
}

-static inline u32 printk_caller_id(void)
-{
- return in_task() ? task_pid_nr(current) :
- 0x80000000 + raw_smp_processor_id();
-}
-
-/*
- * Continuation lines are buffered, and not committed to the record buffer
- * until the line is complete, or a race forces it. The line fragments
- * though, are printed immediately to the consoles to ensure everything has
- * reached the console in case of a kernel crash.
- */
-static struct cont {
- char buf[LOG_LINE_MAX];
- size_t len; /* length == 0 means unused buffer */
- u32 caller_id; /* printk_caller_id() of first print */
- u64 ts_nsec; /* time of first print */
- u8 level; /* log level of first message */
- u8 facility; /* log facility of first message */
- enum log_flags flags; /* prefix, newline flags */
-} cont;
-
-static void cont_flush(void)
-{
- if (cont.len == 0)
- return;
-
- log_store(cont.caller_id, cont.facility, cont.level, cont.flags,
- cont.ts_nsec, NULL, 0, cont.buf, cont.len);
- cont.len = 0;
-}
-
-static bool cont_add(u32 caller_id, int facility, int level,
- enum log_flags flags, const char *text, size_t len)
-{
- /* If the line gets too long, split it up in separate records. */
- if (cont.len + len > sizeof(cont.buf)) {
- cont_flush();
- return false;
- }
-
- if (!cont.len) {
- cont.facility = facility;
- cont.level = level;
- cont.caller_id = caller_id;
- cont.ts_nsec = local_clock();
- cont.flags = flags;
- }
-
- memcpy(cont.buf + cont.len, text, len);
- cont.len += len;
-
- // The original flags come from the first line,
- // but later continuations can add a newline.
- if (flags & LOG_NEWLINE) {
- cont.flags |= LOG_NEWLINE;
- cont_flush();
- }
-
- return true;
-}
-
-static size_t log_output(int facility, int level, enum log_flags lflags, const char *dict, size_t dictlen, char *text, size_t text_len)
-{
- const u32 caller_id = printk_caller_id();
-
- /*
- * If an earlier line was buffered, and we're a continuation
- * write from the same context, try to add it to the buffer.
- */
- if (cont.len) {
- if (cont.caller_id == caller_id && (lflags & LOG_CONT)) {
- if (cont_add(caller_id, facility, level, lflags, text, text_len))
- return text_len;
- }
- /* Otherwise, make sure it's flushed */
- cont_flush();
- }
-
- /* Skip empty continuation lines that couldn't be added - they just flush */
- if (!text_len && (lflags & LOG_CONT))
- return 0;
-
- /* If it doesn't end in a newline, try to buffer the current line */
- if (!(lflags & LOG_NEWLINE)) {
- if (cont_add(caller_id, facility, level, lflags, text, text_len))
- return text_len;
- }
-
- /* Store it in the record log */
- return log_store(caller_id, facility, level, lflags, 0,
- dict, dictlen, text, text_len);
-}
-
/* Must be called under logbuf_lock. */
int vprintk_store(int facility, int level,
const char *dict, size_t dictlen,
@@ -2015,8 +1923,8 @@ int vprintk_store(int facility, int level,
if (dict)
lflags |= LOG_NEWLINE;

- return log_output(facility, level, lflags,
- dict, dictlen, text, text_len);
+ return log_store(facility, level, lflags, dict, dictlen,
+ text, text_len);
}

asmlinkage int vprintk_emit(int facility, int level,
--
2.20.1

2020-07-18 00:03:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Fri, Jul 17, 2020 at 4:48 PM John Ogness <[email protected]> wrote:
>
> Using placeholders avoids tools such as systemd-journald from
> erroneously reporting missed messages. However, it also means that
> empty placeholder records are visible in systemd-journald logs and
> displayed in tools such as dmesg.

As long as the readers still reliably do the joining, this is fine.

HOWEVER.

Make sure you test the case of "fast concurrent readers". The last
time we did things like this, it was a disaster, because a concurrent
reader would see and return the _incomplete_ line, and the next entry
was still being generated on another CPU.

The reader would then decide to return that incomplete line, because
it had something.

And while in theory this could then be handled properly in user space,
in practice it wasn't. So you'd see a lot of logging tools that would
then report all those continuations as separate log events.

Which is the whole point of LOG_CONT - for that *not* to happen.

So this is just a heads-up that I will not pull something that breaks
LOG_CONT because it thinks "user space can handle it". No. User space
does not handle it, and we need to handle it for the user.

So as long as the kernel makes sure the joining does happen it at some
point, it's all fine. It obviously doesn't have to happen at printk()
time, just as long as incomplete records aren't exposed even to
concurrent readers.

A test-case with a short delay in between writes might be a good idea,
although the last time this broke it was very clear in peoples syslogs
and dmesg because it turns out there are lots of concurrent readers at
boot time and _somebody_ will hit the race.

Linus

2020-07-18 14:43:32

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On 2020-07-17, Linus Torvalds <[email protected]> wrote:
> Make sure you test the case of "fast concurrent readers". The last
> time we did things like this, it was a disaster, because a concurrent
> reader would see and return the _incomplete_ line, and the next entry
> was still being generated on another CPU.
>
> The reader would then decide to return that incomplete line, because
> it had something.
>
> And while in theory this could then be handled properly in user space,
> in practice it wasn't. So you'd see a lot of logging tools that would
> then report all those continuations as separate log events.
>
> Which is the whole point of LOG_CONT - for that *not* to happen.

I expect this is handled correctly since the reader is not given any
parts until a full line is ready, but I will put more focus on testing
this to make sure. Thanks for the regression and testing tips.

> So this is just a heads-up that I will not pull something that breaks
> LOG_CONT because it thinks "user space can handle it". No. User space
> does not handle it, and we need to handle it for the user.

Understood. Petr and Sergey are also strict about this. We are making a
serious effort to avoid breaking things for userspace.

John Ogness

2020-07-18 17:45:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Sat, Jul 18, 2020 at 7:43 AM John Ogness <[email protected]> wrote:
>
> I expect this is handled correctly since the reader is not given any
> parts until a full line is ready, but I will put more focus on testing
> this to make sure.

Yeah, the patches looked fine, but I only scanned them, and just
wanted to make sure.

Over the years, we've gotten printk wrong so many times that I get a
bit paranoid. Things can look fine on the screen, but then have odd
line breaks in the logs. Or vice versa. Or work fine on some machine,
but consistently show some race on another.

And some of the more complex features are hardly ever actually used -
I'm not sure the optional message context (aka dictionary) is ever
actually used.

Yes, all the "dev_printk()" helpers fill it in with the device
information (create_syslog_header()), and you _can_ use them if you
know about them (ie

journalctl -b _KERNEL_SUBSYSTEM=pci_bus

but I sometimes wonder how many people use all this complexity. And
how many people even know about it..)

So there are hidden things in there that can easily break *subtly* and
then take ages for people to notice, because while some are very
obvious indeed ("why is my module list message broken up into a
hundred lines?") others might be things people aren't even aware of.

Maybe a lot of system tools use those kernel dictionaries. Maybe it
would break immediately. I just sometimes wonder...

Linus

2020-07-19 14:38:11

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH 2/4] printk: store instead of processing cont parts

On (20/07/18 01:54), John Ogness wrote:
> ---
> kernel/printk/printk.c | 114 ++++-------------------------------------
> 1 file changed, 11 insertions(+), 103 deletions(-)
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index fec71229169e..c4274c867771 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -495,9 +495,14 @@ static void truncate_msg(u16 *text_len, u16 *trunc_msg_len)
> *trunc_msg_len = 0;
> }
>
> +static inline u32 printk_caller_id(void)
> +{
> + return in_task() ? task_pid_nr(current) :
> + 0x80000000 + raw_smp_processor_id();
> +}

Can we merge lines that we don't want to merge?

pr_cont() -> IRQ -> pr_cont() -> NMI -> pr_cont()

|__________________________________|
different same 0x80000000 + cpu_id caller id
caller id

-ss

2020-07-19 18:29:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/4] printk: store instead of processing cont parts

On Sun, Jul 19, 2020 at 7:35 AM Sergey Senozhatsky
<[email protected]> wrote:
>
> Can we merge lines that we don't want to merge?
>
> pr_cont() -> IRQ -> pr_cont() -> NMI -> pr_cont()

That pr_cont in either IRQ or NMI context would be a bug.

You can't validly have a PR_CONT without the non-cont that precedes it.

Linus

2020-07-20 01:53:39

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH 2/4] printk: store instead of processing cont parts

On (20/07/19 11:27), Linus Torvalds wrote:
> On Sun, Jul 19, 2020 at 7:35 AM Sergey Senozhatsky
> <[email protected]> wrote:
> >
> > Can we merge lines that we don't want to merge?
> >
> > pr_cont() -> IRQ -> pr_cont() -> NMI -> pr_cont()
>
> That pr_cont in either IRQ or NMI context would be a bug.
>
> You can't validly have a PR_CONT without the non-cont that precedes it.

Do I get it right, what you are saying is - when we process a PR_CONT
message the cont buffer should already contain previous non-LOG_NEWLINE
and non-PR_CONT message, otherwise it's a bug?

lockdep (I'll trim the code)

static void __print_lock_name(struct lock_class *class)
{
..
name = class->name;
if (!name) {
name = __get_key_name(class->key, str);
printk(KERN_CONT "%s", name);
} else {
printk(KERN_CONT "%s", name);
if (class->name_version > 1)
printk(KERN_CONT "#%d", class->name_version);
if (class->subclass)
printk(KERN_CONT "/%d", class->subclass);
}
}

static void print_lock_name(struct lock_class *class)
{
printk(KERN_CONT " (");
__print_lock_name(class);
printk(KERN_CONT "){%s}-{%hd:%hd}", usage, ...
}

static void
print_bad_irq_dependency(struct task_struct *curr,
{
..
pr_warn("which would create a new lock dependency:\n");
print_lock_name(hlock_class(prev));
pr_cont(" ->");
print_lock_name(hlock_class(next));
pr_cont("\n");
..
}

pr_warn() is LOG_NEWLINE, so cont buffer is empty by the time
we call print_lock_name()->__print_lock_name(), which do several
pr_cont() print outs.

I'm quite sure there is more code that does similar things.

But, overall, isn't it by design that we can process pr_cont()
message with no preceding non-cont message? Because of preliminary
flushes. Example:

CPU0

pr_info("foo"); // !LOG_NEWLINE goes into the cont buffer
pr_cont("1"); // OK
-> IRQ / NMI / exception / etc
pr_alert("error\n"); // flush cont buffer, log_store error message (it's LOG_NEWLINE)
<- iret
pr_cont("2"); // cont buffer was flushed. There is no preceding non-cont message
pr_cont("3");
pr_cont("\n");

Or am I misunderstanding what you saying?

-ss

2020-07-20 18:31:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/4] printk: store instead of processing cont parts

On Sun, Jul 19, 2020 at 6:51 PM Sergey Senozhatsky
<[email protected]> wrote:
>
> Do I get it right, what you are saying is - when we process a PR_CONT
> message the cont buffer should already contain previous non-LOG_NEWLINE
> and non-PR_CONT message, otherwise it's a bug?

No.

I'm saying that the code that does PR_CONT should have done *some*
printing before, otherwise it's at the very least questionable.

IOW, you can't just randomly start printing with PR_CONT, without
having established _some_ context for it.

But that context could be a previous newline you created (the PR_CONT
will be a no-op). That's certainly useful for printing a header and
then after that printing possible other complex data that may or may
not have line breaks in it.

So your example looks fine. The context starts out with

pr_warn("which would create a new lock dependency:\n");

and after that you can use KERN_CONT / pr_cont() as much as you want,
since you've established a context for what you're printing.

And then it ends with 'pr_cont("\n")'.

So anything that was interrupted by this, and uses KERN_CONT /
pr_cont() will have no ambiguous issues. The code you pointed at both
started and ended a line.

That said, we have traditionally used not just "current process", but
also "last irq-level" as the context information, so I do think it
would be good to continue to do that.

At that point, "an interrupt printed something in the middle" isn't
even an issue any more, because it's clear that the context has
changed.

Linus

2020-07-21 03:54:33

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH 2/4] printk: store instead of processing cont parts

On Mon, 2020-07-20 at 11:30 -0700, Linus Torvalds wrote:
> On Sun, Jul 19, 2020 at 6:51 PM Sergey Senozhatsky
> <[email protected]> wrote:
> > Do I get it right, what you are saying is - when we process a PR_CONT
> > message the cont buffer should already contain previous non-LOG_NEWLINE
> > and non-PR_CONT message, otherwise it's a bug?
>
> No.
>
> I'm saying that the code that does PR_CONT should have done *some*
> printing before, otherwise it's at the very least questionable.
>
> IOW, you can't just randomly start printing with PR_CONT, without
> having established _some_ context for it.

I believe there are at least a few cases that
_only_ use pr_cont to emit
complete lines.

For example: SEQ_printf in kernel/sched/debug.c


2020-07-21 14:44:01

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH 2/4] printk: store instead of processing cont parts

On (20/07/20 11:30), Linus Torvalds wrote:
> > Do I get it right, what you are saying is - when we process a PR_CONT
> > message the cont buffer should already contain previous non-LOG_NEWLINE
> > and non-PR_CONT message, otherwise it's a bug?
>
> No.
>
> I'm saying that the code that does PR_CONT should have done *some*
> printing before, otherwise it's at the very least questionable.
>
> IOW, you can't just randomly start printing with PR_CONT, without
> having established _some_ context for it.

OK, I see. I sort of suspect that we may actually have code that does
just pr_cont() (e.g. what Joe pointed out). It doesn't seem like that
"establish a context" was ever enforced, doing a bunch of pr_cont()
simply works.

[..]
> That said, we have traditionally used not just "current process", but
> also "last irq-level" as the context information, so I do think it
> would be good to continue to do that.

OK, so basically, extending printk_caller_id() so that for IRQ/NMI
we will have more info than just "0x80000000 + raw_smp_processor_id()".

-ss

2020-07-21 14:57:50

by John Ogness

[permalink] [raw]
Subject: Re: [PATCH 2/4] printk: store instead of processing cont parts

On 2020-07-21, Sergey Senozhatsky <[email protected]> wrote:
>> That said, we have traditionally used not just "current process", but
>> also "last irq-level" as the context information, so I do think it
>> would be good to continue to do that.
>
> OK, so basically, extending printk_caller_id() so that for IRQ/NMI
> we will have more info than just "0x80000000 + raw_smp_processor_id()".

If bit31 is set, the upper 8 bits could specify what the lower 24 bits
represent. That would give some freedom for the future.

For example:

0x80 = cpu id (generic context)
0x81 = interrupt number
0x82 = cpu id (nmi context)

Or maybe ascii should be used instead?

0x80 | '\0' = cpu id (generic context)
0x80 | 'i' = interrupt number
0x80 | 'n' = cpu id (nmi context)

Just an idea.

John Ogness

2020-07-21 15:42:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/4] printk: store instead of processing cont parts

On Tue, Jul 21, 2020 at 7:42 AM Sergey Senozhatsky
<[email protected]> wrote:
>
> OK, so basically, extending printk_caller_id() so that for IRQ/NMI
> we will have more info than just "0x80000000 + raw_smp_processor_id()".

I think it's really preempt_count() that we want to have there.

That has the softirq/hardirq/NMI information in it.

So the "context" identifier of an incomplete write would be { task,
cpu, preempt_count() } of the writer. If a new KERN_CONT writer comes
in, it would have to match in that context to actually combine.

You can probably play "hide the bits" tricks to not *ac tually* use
three words for it. The task structure in particular tends to be very
aligned, you could hide bits in the low bits there. The CPU number
would fit in there, for example.

Linus

2020-07-28 02:24:07

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH 2/4] printk: store instead of processing cont parts

On (20/07/21 08:40), Linus Torvalds wrote:
> On Tue, Jul 21, 2020 at 7:42 AM Sergey Senozhatsky
> <[email protected]> wrote:
> >
> > OK, so basically, extending printk_caller_id() so that for IRQ/NMI
> > we will have more info than just "0x80000000 + raw_smp_processor_id()".
>
> I think it's really preempt_count() that we want to have there.
>
> That has the softirq/hardirq/NMI information in it.
>
> So the "context" identifier of an incomplete write would be { task,
> cpu, preempt_count() } of the writer. If a new KERN_CONT writer comes
> in, it would have to match in that context to actually combine.
>
> You can probably play "hide the bits" tricks to not *ac tually* use
> three words for it. The task structure in particular tends to be very
> aligned, you could hide bits in the low bits there. The CPU number
> would fit in there, for example.

The purpose of callerid has changed. We used to keep it _only_ in
struct printk_log and never used it for anything but

if (cont.owner == current && (lflags & LOG_CONT))
...

so it was easy to use task_struct pointers - we only cared if cont.owner
matches current and never dereferenced the cont.owner pointer.

But we do show caller id to users now (CONFIG_PRINTK_CALLER), so it
makes sense to keep there something more or less human readable, it
helps syzkaller/fuzzer people. That's why we keep PID in [30,0] bits
of callerid if the printk was called in_task(), and CPU_ID otherwise.

Replacing easy to understand PID/CPU_ID with some magic hex, e.g.
hash(current) >> shift or hash(smp_processor_id()) >> shift, will
make things less convenient, so I think it is reasonable to preserve
the existing scheme.

I like what John suggests:

- if printk was called from in_task(), then we keep PID in [30,0] bits
- otherwise we keep smp_processor_id() in [27,0] bits and in the upper
bits we keep the IRQ/NMI/etc flags.

It may make sense to "promote" u32 callerid to u64, just in case if
someone sometime in the future will have boxes with more than several
billion PIDs.

-ss

2020-07-29 02:04:43

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH 2/4] printk: store instead of processing cont parts

On (20/07/21 17:03), John Ogness wrote:
> On 2020-07-21, Sergey Senozhatsky <[email protected]> wrote:
> >> That said, we have traditionally used not just "current process", but
> >> also "last irq-level" as the context information, so I do think it
> >> would be good to continue to do that.
> >
> > OK, so basically, extending printk_caller_id() so that for IRQ/NMI
> > we will have more info than just "0x80000000 + raw_smp_processor_id()".
>
> If bit31 is set, the upper 8 bits could specify what the lower 24 bits
> represent. That would give some freedom for the future.
>
> For example:
>
> 0x80 = cpu id (generic context)
> 0x81 = interrupt number
> 0x82 = cpu id (nmi context)

Looks good to me. I'd probably add in_softirq state. So the contexts
are: in_task, in_softirq, in_irq, in_nmi

For the purpose of future extensions, I would probably consider
u64 for callerid. So that we can store more "state bits" in the future,
like disabled/enabled preemption; disabled/enabled local irqs; etc.

-ss

2020-08-11 16:07:35

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Sat 2020-07-18 16:48:55, John Ogness wrote:
> On 2020-07-17, Linus Torvalds <[email protected]> wrote:
> > Make sure you test the case of "fast concurrent readers". The last
> > time we did things like this, it was a disaster, because a concurrent
> > reader would see and return the _incomplete_ line, and the next entry
> > was still being generated on another CPU.
> >
> > The reader would then decide to return that incomplete line, because
> > it had something.
> >
> > And while in theory this could then be handled properly in user space,
> > in practice it wasn't. So you'd see a lot of logging tools that would
> > then report all those continuations as separate log events.
> >
> > Which is the whole point of LOG_CONT - for that *not* to happen.
>
> I expect this is handled correctly since the reader is not given any
> parts until a full line is ready, but I will put more focus on testing
> this to make sure. Thanks for the regression and testing tips.

Hmm, the current patchset has different problem. The continuation
pieces are correctly passed as a single lines. But empty line is
printed for each unused sequence number to avoid warnings about
missed messages in journactl. It looks like:

Aug 11 17:08:16 sle15-sp1 kernel: x86/fpu: xstate_offset[3]: 832, xstate_sizes[3]: 64
Aug 11 17:08:16 sle15-sp1 kernel: x86/fpu: xstate_offset[4]: 896, xstate_sizes[4]: 64
Aug 11 17:08:16 sle15-sp1 kernel: x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
Aug 11 17:08:16 sle15-sp1 kernel: BIOS-provided physical RAM map:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
[...]
Aug 11 17:08:16 sle15-sp1 kernel: kvm-clock: using sched offset of 85305506196438 cycles
Aug 11 17:08:16 sle15-sp1 kernel: clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 88>
Aug 11 17:08:16 sle15-sp1 kernel: tsc: Detected 2112.000 MHz processor
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
Aug 11 17:08:16 sle15-sp1 kernel: last_pfn = 0x180000 max_arch_pfn = 0x400000000
Aug 11 17:08:16 sle15-sp1 kernel: MTRR default type: write-back
Aug 11 17:08:16 sle15-sp1 kernel: MTRR fixed ranges enabled:
[...]
Aug 11 17:08:16 sle15-sp1 kernel: clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519>
Aug 11 17:08:16 sle15-sp1 kernel: setup_percpu: NR_CPUS:8192 nr_cpumask_bits:3 nr_cpu_ids:3 nr_node_ids:1
Aug 11 17:08:16 sle15-sp1 kernel: percpu: Embedded 508 pages/cpu s2043904 r8192 d28672 u2097152
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel: pcpu-alloc: s2043904 r8192 d28672 u2097152 alloc=1*2097152
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel:
Aug 11 17:08:16 sle15-sp1 kernel: pcpu-alloc: [0] 0 [0] 1 [0] 2
Aug 11 17:08:16 sle15-sp1 kernel: KVM setup async PF for cpu 0
Aug 11 17:08:16 sle15-sp1 kernel: kvm-stealtime: cpu 0, msr 17f9f2080
Aug 11 17:08:16 sle15-sp1 kernel: Built 1 zonelists, mobility grouping on. Total pages: 1031901
Aug 11 17:08:16 sle15-sp1 kernel: Policy zone: Normal


> > So this is just a heads-up that I will not pull something that breaks
> > LOG_CONT because it thinks "user space can handle it". No. User space
> > does not handle it, and we need to handle it for the user.

I am afraid that this not acceptable for Linus either because people
will report bugs and complain about regression.

Slightly better solution is to skip the unused sequence numbers. It
looks good in both dmesg/journalctl by default. Note that journalctl
shows the warnings about missing lines "only" with "-a" option.
Sigh, but I am afraid that this is still not acceptable.


Hmm, I can't find any simple or even working solution for maintaining
a monotonic sequence number a lockless way that would be the same
for all stored pieces.

I am afraid that the only working solution is to store all pieces
in a single lockless transaction. I think that John already
proposed using 2nd small lockless buffer for this. The problem
might be how to synchronize flushing the pieces into the final
buffer.

Another solution would be to use separate buffer for each context
and CPU. The problem is a missing final "\n". It might cause
that a buffer is not flushed for a long time until another
message is printed in the same context on the same CPU.

The 2nd small logbuffer looks like a better choice if we
are able to solve the lockless flush.

Best Regards,
Petr

2020-08-12 16:41:20

by Petr Mladek

[permalink] [raw]
Subject: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Tue 2020-08-11 18:05:51, Petr Mladek wrote:
> On Sat 2020-07-18 16:48:55, John Ogness wrote:
> > On 2020-07-17, Linus Torvalds <[email protected]> wrote:
> > > Make sure you test the case of "fast concurrent readers". The last
> > > time we did things like this, it was a disaster, because a concurrent
> > > reader would see and return the _incomplete_ line, and the next entry
> > > was still being generated on another CPU.
> > >
> > > The reader would then decide to return that incomplete line, because
> > > it had something.
> > >
> > > And while in theory this could then be handled properly in user space,
> > > in practice it wasn't. So you'd see a lot of logging tools that would
> > > then report all those continuations as separate log events.
> > >
> > > Which is the whole point of LOG_CONT - for that *not* to happen.
> >
> > I expect this is handled correctly since the reader is not given any
> > parts until a full line is ready, but I will put more focus on testing
> > this to make sure. Thanks for the regression and testing tips.
>
> Hmm, the current patchset has different problem. The continuation
> pieces are correctly passed as a single lines. But empty line is
> printed for each unused sequence number to avoid warnings about
> missed messages in journactl. It looks like:
>
> I am afraid that the only working solution is to store all pieces
> in a single lockless transaction. I think that John already
> proposed using 2nd small lockless buffer for this. The problem
> might be how to synchronize flushing the pieces into the final
> buffer.

Do not panic! It might look scary. But I am less scared
after I wrote some pieces of the pseudo code.

So, I have one crazy idea to add one more state bit so that we
could have:

+ committed: set when the data are written into the data ring.
+ final: set when the data block could not longer get reopened
+ reuse: set when the desctiptor/data block could get reused


"final" bit will define when the descriptor could not longer
get reopened (cleared committed bit) and the data block could
not get extended.

The logic would be the following:

bool prb_reserve() {

desc = try_reopen_desc(seq);
if (desc) {
text_buf = data_alloc_continuous();
if (text_buf)
goto success;
else
/* commit the reopened desc back again */
prb_commit(desc);
}

/* Otherwise, do as before */
desc = desc_reserve();
if (!desc)
goto fail;

text_buf = data_alloc();
...

where:

static struct prb_desc *try_reopen_desc(seq)
{
struct prb_desc *desc;

enum desc_state d_state;
struct prb_desc desc;

d_state = desc_read(desc_ring, seq, &desc);
if (d_state != committed_and_not_finalized)
return NULL;

if (!is_same_context(desc))
return NULL;

/* try to reopen only when the state is still the same */
if(!atomic_long_cmpxchg_relaxed(state_var,
val_committed_and_not_finished,
val_reserved))
return NULL;

return desc;
}

static char *data_alloc_continuous()
{
/*
* Same as data_alloc() with one added parameter:
* unsigned long requested_begin_lpos;
*/

begin_lpos = atomic_long_read(&data_ring->head_lpos);

do {
if (begin_lpos != requested_begin_lpos)
return NULL;

... same as before

} while (!atomic_long_try_cmpxchg(&data_ring->head_lpos, &begin_lpos,
next_lpos)); /* LMM(data_alloc:A) */

if (requested_begin_lpos) {
/* only update tail lpos */
blk_lpos->next = next_lpos;
/* return pointer to the new data space */
return &blk->data[0];
}

/* For completely new block do everything as before */
blk = to_block(data_ring, begin_lpos);
blk->id = id; /* LMM(data_alloc:B) */
...
}

void prb_commit_and_finalize()
{
/* Same as prb_commit() + it will set also 'final' bit */
}


Addintional changes in the code:

+ desc_resrved() will also set 'final' bit in the previous
descriptor so that the descriptor could not longer get reopended
once committed.

+ prb_commit_and_finalize() will be called instead of prb_commit()
when the message ends with '\n'.

+ prb_read() will allow to read the data only when
the state is "committed_and_finalized".

+ desc_make_reusable() can be called only when the desciptor
is in "commited_and_finalized" state.


I am not sure if it is everything. Also it might need some code
refactoring.

But it looks like it might work. And it should not require new barriers.

2020-08-13 00:25:14

by John Ogness

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On 2020-08-12, Petr Mladek <[email protected]> wrote:
> So, I have one crazy idea to add one more state bit so that we
> could have:
>
> + committed: set when the data are written into the data ring.
> + final: set when the data block could not longer get reopened
> + reuse: set when the desctiptor/data block could get reused
>
> "final" bit will define when the descriptor could not longer
> get reopened (cleared committed bit) and the data block could
> not get extended.

I had not thought of extending data blocks. That is clever!

I implemented this solution for myself and am currently running more
tests. Some things that I changed from your suggestion:

1. I created a separate prb_reserve_cont() function. The reason for this
is because the caller needs to understand what is happening. The caller
is getting an existing record with existing data and must append new
data. The @text_len field of the info reports how long the existing data
is. So the LOG_CONT handling code in printk.c looks something like this:

if (lflags & LOG_CONT) {
struct prb_reserved_entry e;
struct printk_record r;

prb_rec_init_wr(&r, text_len, 0);

if (prb_reserve_cont(&e, prb, &r, caller_id)) {
memcpy(&r.text_buf[r.info->text_len], text, text_len);
r.info->text_len += text_len;

if (lflags & LOG_NEWLINE)
r.info->flags |= LOG_NEWLINE;

if (r.info->flags & LOG_NEWLINE)
prb_commit_finalize(&e);
else
prb_commit(&e);

return text_len;
}
}

This seemed simpler than trying to extend prb_reserve() to secretly
support LOG_CONT records.

2. I haven't yet figured out how to preserve calling context when a
newline appears. For example:

pr_info("text");
pr_cont(" 1");
pr_cont(" 2\n");
pr_cont("3");
pr_cont(" 4\n");

For "3" the calling context (info, timestamp) is lost because with "2"
the record is finalized. Perhaps the above is invalid usage of LOG_CONT?

3. There are some memory barriers introduced, but it looks like it
shouldn't add too much complexity.

I will continue to refine my working version and post a patch so that we
have something to work with. This looks to be the most promising way
forward. Thanks.

John Ogness

2020-08-13 05:22:21

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On (20/08/13 02:30), John Ogness wrote:
> 2. I haven't yet figured out how to preserve calling context when a
> newline appears. For example:
>
> pr_info("text");
> pr_cont(" 1");
> pr_cont(" 2\n");
> pr_cont("3");
> pr_cont(" 4\n");
>
> For "3" the calling context (info, timestamp) is lost because with "2"
> the record is finalized. Perhaps the above is invalid usage of LOG_CONT?

This is not an unseen pattern, I'm afraid. And the problem here can
be more general:

pr_info("text");
pr_cont("1");
exception/IRQ/NMI
pr_alert("text\n");
pr_cont("2");
pr_cont("\n");

I guess the solution would be to store "last log_level" in task_struct
and get current (new) timestamp for broken cont line?

We have this problem now. E.g. early boot messages from one of my boxes:

6,173,41837,-;x86: Booting SMP configuration:
6,174,41838,-;.... node #0, CPUs: #1 #2 #3 #4
4,175,44993,-;MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/...
4,176,44993,c; #5 #6 #7

"CPUs: #1 #2 #3 #4 #5 #6 #7" is supposed to be one cont line with
loglevel 6. But it gets interrupted and flushed so that "#5 #6 #7"
appears with the different loglevel.

-ss

2020-08-13 07:33:20

by Petr Mladek

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Thu 2020-08-13 02:30:02, John Ogness wrote:
> On 2020-08-12, Petr Mladek <[email protected]> wrote:
> > So, I have one crazy idea to add one more state bit so that we
> > could have:
> >
> > + committed: set when the data are written into the data ring.
> > + final: set when the data block could not longer get reopened
> > + reuse: set when the desctiptor/data block could get reused
> >
> > "final" bit will define when the descriptor could not longer
> > get reopened (cleared committed bit) and the data block could
> > not get extended.
>
> I implemented this solution for myself and am currently running more
> tests. Some things that I changed from your suggestion:
>
> 1. I created a separate prb_reserve_cont() function. The reason for this
> is because the caller needs to understand what is happening. The caller
> is getting an existing record with existing data and must append new
> data. The @text_len field of the info reports how long the existing data
> is.

I see.

> So the LOG_CONT handling code in printk.c looks something like this:

Yeah, it makes sense.


> 2. I haven't yet figured out how to preserve calling context when a
> newline appears. For example:
>
> pr_info("text");
> pr_cont(" 1");
> pr_cont(" 2\n");
> pr_cont("3");
> pr_cont(" 4\n");
>
> For "3" the calling context (info, timestamp) is lost because with "2"
> the record is finalized. Perhaps the above is invalid usage of LOG_CONT?

I am going to answer Sergey's reply.


> 3. There are some memory barriers introduced, but it looks like it
> shouldn't add too much complexity.

Uff. I have hooped for this.

> I will continue to refine my working version and post a patch so that we
> have something to work with. This looks to be the most promising way
> forward. Thanks.

Uff, I am happy that it seems working. Several other approaches looked
much more complicated or they caused regressions.

Best Regards,
Petr

2020-08-13 07:45:29

by John Ogness

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On 2020-08-13, Sergey Senozhatsky <[email protected]> wrote:
> This is not an unseen pattern, I'm afraid. And the problem here can
> be more general:
>
> pr_info("text");
> pr_cont("1");
> exception/IRQ/NMI
> pr_alert("text\n");
> pr_cont("2");
> pr_cont("\n");
>
> I guess the solution would be to store "last log_level" in task_struct
> and get current (new) timestamp for broken cont line?

(Warning: new ideas ahead)

The fundamental problem is that there is no real association between
the cont parts. So any interruption results in a broken record. If we
really want to do this correctly, we need real association.

With the new finalize flag for records, I thought about perhaps adding
support for chaining data blocks.

A data block currently stores an unsigned long for the ID of the
associated descriptor. But it could optionally include a second unsigned
long, which is the lpos of the next text part. All the data blocks of a
chain would point back to the same descriptor. The descriptor would only
point to the first data block of the chain and include a flag that it is
using chained data blocks.

Then we would only need to track the sequence number of the open record
and new data blocks could be added to the data block chain of the
correct record. Readers cannot see the record until it is finalized.

Also, since only finalized records can be invalidated, there are no
races of chains becoming invalidated while being appended.

My concerns about this idea:

- What if the printk user does not correctly terminate the cont message?
There is no mechanism to allow that open record to be force-finalized
so that readers can read newer records.

- For tasks, the sequence number of the open record could be stored on
the task_struct. For non-tasks, we could use a global per-cpu variable
where each CPU stores 2 sequence numbers: the sequence number of the
open record for the non-task and the sequence number of the open
record for an interrupting NMI. Is that sufficient?

John Ogness

2020-08-13 07:52:35

by Petr Mladek

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Thu 2020-08-13 14:18:53, Sergey Senozhatsky wrote:
> On (20/08/13 02:30), John Ogness wrote:
> > 2. I haven't yet figured out how to preserve calling context when a
> > newline appears. For example:
> >
> > pr_info("text");
> > pr_cont(" 1");
> > pr_cont(" 2\n");
> > pr_cont("3");
> > pr_cont(" 4\n");
> >
> > For "3" the calling context (info, timestamp) is lost because with "2"
> > the record is finalized. Perhaps the above is invalid usage of LOG_CONT?

If I get it correctly, the original code has the same problem.

The cont buffer is flushed when the cont piece ends with newline:

static bool cont_add(u32 caller_id, int facility, int level,
enum log_flags flags, const char *text, size_t len)
{
[...]

// The original flags come from the first line,
// but later continuations can add a newline.
if (flags & LOG_NEWLINE) {
cont.flags |= LOG_NEWLINE;
cont_flush();
}

return true;
}

cont_flush sets cont.len = 0;

static void cont_flush(void)
{
[...]
cont.len = 0;
}


The messages is appended only when cont.len != 0 in log_output:

static size_t log_output(int facility, int level, enum log_flags lflags, const char *dict, size_t dictlen, char *text, size_t text_len)
{
const u32 caller_id = printk_caller_id();

/*
* If an earlier line was buffered, and we're a continuation
* write from the same context, try to add it to the buffer.
*/
if (cont.len) {
if (cont.caller_id == caller_id && (lflags & LOG_CONT)) {
if (cont_add(caller_id, facility, level, lflags, text, text_len))
return text_len;
}

[...]
}


Also the original context is overridden when the cont buffer is empty:

static bool cont_add(u32 caller_id, int facility, int level,
enum log_flags flags, const char *text, size_t len)
{
[...]

if (!cont.len) {
cont.facility = facility;
cont.level = level;
cont.caller_id = caller_id;
cont.ts_nsec = local_clock();
cont.flags = flags;
}

[...]
}

So I would ignore this problem for now.


> This is not an unseen pattern, I'm afraid. And the problem here can
> be more general:
>
> pr_info("text");
> pr_cont("1");
> exception/IRQ/NMI
> pr_alert("text\n");
> pr_cont("2");
> pr_cont("\n");
>

Good point.


> I guess the solution would be to store "last log_level" in task_struct
> and get current (new) timestamp for broken cont line?

I think about storing the context in per-CPU and per-context array.
It should be more memory efficient than task_struct and it should
solve even the above problem.


> We have this problem now. E.g. early boot messages from one of my boxes:
>
> 6,173,41837,-;x86: Booting SMP configuration:
> 6,174,41838,-;.... node #0, CPUs: #1 #2 #3 #4
> 4,175,44993,-;MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/...
> 4,176,44993,c; #5 #6 #7
>
> "CPUs: #1 #2 #3 #4 #5 #6 #7" is supposed to be one cont line with
> loglevel 6. But it gets interrupted and flushed so that "#5 #6 #7"
> appears with the different loglevel.

Nice example. It would be nice to fix this. But it should be done
separately.

Best Regards,
Petr

2020-08-13 08:42:36

by Petr Mladek

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Thu 2020-08-13 09:50:25, John Ogness wrote:
> On 2020-08-13, Sergey Senozhatsky <[email protected]> wrote:
> > This is not an unseen pattern, I'm afraid. And the problem here can
> > be more general:
> >
> > pr_info("text");
> > pr_cont("1");
> > exception/IRQ/NMI
> > pr_alert("text\n");
> > pr_cont("2");
> > pr_cont("\n");
> >
> > I guess the solution would be to store "last log_level" in task_struct
> > and get current (new) timestamp for broken cont line?
>
> (Warning: new ideas ahead)
>
> The fundamental problem is that there is no real association between
> the cont parts. So any interruption results in a broken record. If we
> really want to do this correctly, we need real association.
>
> With the new finalize flag for records, I thought about perhaps adding
> support for chaining data blocks.
>
> A data block currently stores an unsigned long for the ID of the
> associated descriptor. But it could optionally include a second unsigned
> long, which is the lpos of the next text part. All the data blocks of a
> chain would point back to the same descriptor. The descriptor would only
> point to the first data block of the chain and include a flag that it is
> using chained data blocks.
>
> Then we would only need to track the sequence number of the open record
> and new data blocks could be added to the data block chain of the
> correct record. Readers cannot see the record until it is finalized.

Note that we still must try to append the continuous piece into the
same data block. We could not concatenate them in the reader API
because it would create the problem with unused/skipped sequence numbers.

Also it would complicate the reader code. It would need to know
whether a record has already been printed together with a previous
one.


> Also, since only finalized records can be invalidated, there are no
> races of chains becoming invalidated while being appended.
>
> My concerns about this idea:
>
> - What if the printk user does not correctly terminate the cont message?
> There is no mechanism to allow that open record to be force-finalized
> so that readers can read newer records.

This is a real problem. And it is the reason why the cont buffer is
currently flushed (finalized) by the next message from another context.


> - For tasks, the sequence number of the open record could be stored on
> the task_struct. For non-tasks, we could use a global per-cpu variable
> where each CPU stores 2 sequence numbers: the sequence number of the
> open record for the non-task and the sequence number of the open
> record for an interrupting NMI. Is that sufficient?

Yeah, it would be possible.

Anyway, this would fix an already existing problem. It might get
complicated and tricky. I am afraid that it comes from the "perfection
is the enemy of good" department.

A good compromise might be to just store "loglevel" from the previous
message in the given context. It could then be used for pr_cont()
messages in the same context.

Anyway, I would solve this later because it is already existing
problem.

Best Regards,
Petr

2020-08-13 10:30:48

by John Ogness

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On 2020-08-13, Petr Mladek <[email protected]> wrote:
> On Thu 2020-08-13 09:50:25, John Ogness wrote:
>> On 2020-08-13, Sergey Senozhatsky <[email protected]> wrote:
>> > This is not an unseen pattern, I'm afraid. And the problem here can
>> > be more general:
>> >
>> > pr_info("text");
>> > pr_cont("1");
>> > exception/IRQ/NMI
>> > pr_alert("text\n");
>> > pr_cont("2");
>> > pr_cont("\n");
>> >
>> > I guess the solution would be to store "last log_level" in task_struct
>> > and get current (new) timestamp for broken cont line?
>>
>> (Warning: new ideas ahead)
>>
>> The fundamental problem is that there is no real association between
>> the cont parts. So any interruption results in a broken record. If we
>> really want to do this correctly, we need real association.

I believe I failed to recognize the fundamental problem. The fundamental
problem is that the pr_cont() semantics are very poor. I now strongly
believe that we need to fix those semantics by having the pr_cont() user
take responsibility for buffering the message. Patching the ~2000
pr_cont() users will be far easier than continuing to twist ourselves
around this madness.

Here is an example for a new pr_cont() API:

struct pr_cont c;

pr_cont_alloc_info(&c);
(or alternatively)
dev_cont_alloc_info(dev, &c);

pr_cont(&c, "1");
pr_cont(&c, "2");

pr_cont_flush(&c);

Using macro magic, there can be the usual dbg, warn, err, etc. variants
of the alloc functions.

The alloc function would need to work for any context, but that would
not be an issue. If the cont message started to get too large, pr_cont()
could do its own flushing in between, while still holding on to the
context information. If for some reason the alloc function could not
allocate a buffer, all the pr_cont() calls could fallback to logging the
individual cont parts.

I believe this would solve all cont-related problems while also allowing
the new ringbuffer to remain as it already is in linux-next.

John Ogness

2020-08-13 11:31:56

by Petr Mladek

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Thu 2020-08-13 12:35:47, John Ogness wrote:
> On 2020-08-13, Petr Mladek <[email protected]> wrote:
> > On Thu 2020-08-13 09:50:25, John Ogness wrote:
> >> On 2020-08-13, Sergey Senozhatsky <[email protected]> wrote:
> >> > This is not an unseen pattern, I'm afraid. And the problem here can
> >> > be more general:
> >> >
> >> > pr_info("text");
> >> > pr_cont("1");
> >> > exception/IRQ/NMI
> >> > pr_alert("text\n");
> >> > pr_cont("2");
> >> > pr_cont("\n");
> >> >
> >> > I guess the solution would be to store "last log_level" in task_struct
> >> > and get current (new) timestamp for broken cont line?
> >>
> >> (Warning: new ideas ahead)
> >>
> >> The fundamental problem is that there is no real association between
> >> the cont parts. So any interruption results in a broken record. If we
> >> really want to do this correctly, we need real association.
>
> I believe I failed to recognize the fundamental problem. The fundamental
> problem is that the pr_cont() semantics are very poor. I now strongly
> believe that we need to fix those semantics by having the pr_cont() user
> take responsibility for buffering the message. Patching the ~2000
> pr_cont() users will be far easier than continuing to twist ourselves
> around this madness.

Yes, this would make the API much more reliable. But it is not completely
easy. It has been discussed several times. The most serious attempt
is https://lore.kernel.org/lkml/1541165517-3557-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/#t

lockdep.c is probably the most complicated pr_cont() user. The
conversion complicated the code a bit, see
https://lore.kernel.org/lkml/1541165517-3557-3-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/

Linus also wanted to have an API to flush part of the message
before some action and finalize it later. It would be usable
for things like:

pr_info("Doing something dangerous... ");
printk_flush();
do_dangerous_thing();
pr_cont("ok\n");

Where pr_cont() would allow to flush part of the message to the
consoles and add the end of the line later when it did not crash.

It would need some support in the lockless ring buffer API anyway.
But it is only nice to have from my POV.


> The alloc function would need to work for any context, but that would
> not be an issue. If the cont message started to get too large, pr_cont()
> could do its own flushing in between, while still holding on to the
> context information. If for some reason the alloc function could not
> allocate a buffer, all the pr_cont() calls could fallback to logging the
> individual cont parts.

Also this was discussed several times. The most acceptable solution
was to use "small" buffer on the stack (128 bytes or so).


> I believe this would solve all cont-related problems while also allowing
> the new ringbuffer to remain as it already is in linux-next.

You are right. The buffering is really good step. It would make
the API sane and reliable. It has been proposed and discussed many
times. The main obstacle was that there was no volunteer to get this
done. If you are able to do so, it would be amazing.

I am just not sure what is more complicated. Getting lockless
operation correctly is always tricky. But going though 2000
callers individually is also a lot of work.

Best Regards,
Petr

2020-08-13 11:58:07

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On (20/08/13 10:41), Petr Mladek wrote:
> > My concerns about this idea:
> >
> > - What if the printk user does not correctly terminate the cont message?
> > There is no mechanism to allow that open record to be force-finalized
> > so that readers can read newer records.
>
> This is a real problem. And it is the reason why the cont buffer is
> currently flushed (finalized) by the next message from another context.

I understand that you think that this should be discussed and addressed
later in a separate patch, but, since we are on pr_cont topic right now,
can we slow down and maybe re-think what is actually expected from
pr_cont()? IOW, have the "what is expect from this feature" thread?

For instance, is missing \n the one and only reason why printk-s from
another context flush cont buffer now? Because I can see some more reasons
for current behaviour and I'd like to question those reasons.

I think what Linus said a long time ago was that the initial purpose of
pr_cont was

pr_info("Initialize feature foo...");
if (init_feature_foo() == 0)
pr_cont("ok\n");
else
pr_cont("not ok\n");

And if init_feature_foo() crashes the kernel then the first printk()
form panic() will flush the cont buffer.

We can handle this by realizing that new printk() message has LOG_NEWLINE
and has different log_level (not pr_cont), maybe.

Let's look at the more general case:

CPU0 .. CPU255
pr_info("text");
pr_alert("boom\n");
pr_cont("1);
pr_cont("2\n");

Do we really need to preliminary flush CPU0 pr_cont buffer in this
case? There is no connection between messages from CPU0 and CPU255.
Maybe (maybe!) what matters here is keeping the order of messages
per-context rather than globally system-wide?

-ss

2020-08-14 03:36:28

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On (20/08/13 12:35), John Ogness wrote:
> I believe I failed to recognize the fundamental problem. The fundamental
> problem is that the pr_cont() semantics are very poor.

The semantics is pretty clear - use it only in UP early bootup,
anything else is broken :)

/*
* Annotation for a "continued" line of log printout (only done after a
* line that had no enclosing \n). Only to be used by core/arch code
* during early bootup (a continued line is not SMP-safe otherwise).
*/
#define KERN_CONT KERN_SOH "c"

> I now strongly believe that we need to fix those semantics by having the
> pr_cont() user take responsibility for buffering the message. Patching the
> ~2000 pr_cont() users will be far easier than continuing to twist ourselves
> around this madness.

I welcome this effort. We've been talking about the fact that pr_cont() is
not something we can ignore anymore (we have more and more SMP users of
it) since the Kernel Summit in Santa Fe, NM, but the general response back
then was "oh my god, who cares" (pretty sure this is very close to what Ted
Ts'o said during the printk session).

> Here is an example for a new pr_cont() API:
>
> struct pr_cont c;
>
> pr_cont_alloc_info(&c);
> (or alternatively)
> dev_cont_alloc_info(dev, &c);
>
> pr_cont(&c, "1");
> pr_cont(&c, "2");
>
> pr_cont_flush(&c);

This might be a bit more complex.

One thing that we need to handle here, I believe, is that the context
which crashes the kernel should flush its cont buffer, because the
information there is relevant to the crash:

pr_cont_alloc_info(&c);
pr_cont(&c, "1");
pr_cont(&c, "2");
>>
oops
panic()
<<
pr_cont_flush(&c);

We better flush that context's pr_cont buffer during panic().

Another example:


pr_cont_alloc_info(&c);

for (i = 0; i < p->sz; i++)
pr_cont(&c, p->buf[i]);
>>
page fault
exit
<<
pr_cont_flush(&c);

I believe we need to preliminary flush pr_cont() in this case as well,
because the information there might be very helpful.

-ss

2020-08-14 08:21:00

by John Ogness

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On 2020-08-14, Sergey Senozhatsky <[email protected]> wrote:
> One thing that we need to handle here, I believe, is that the context
> which crashes the kernel should flush its cont buffer, because the
> information there is relevant to the crash:
>
> pr_cont_alloc_info(&c);
> pr_cont(&c, "1");
> pr_cont(&c, "2");
> >>
> oops
> panic()
> <<
> pr_cont_flush(&c);
>
> We better flush that context's pr_cont buffer during panic().

I am not convinced of the general usefulness of partial messages, but as
long as we have an API that includes registration, usage, and
deregistration of some sort of handle, then we leave the window open for
such implementations.

John Ogness

2020-08-14 09:06:12

by Petr Mladek

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Thu 2020-08-13 20:54:35, Sergey Senozhatsky wrote:
> On (20/08/13 10:41), Petr Mladek wrote:
> > > My concerns about this idea:
> > >
> > > - What if the printk user does not correctly terminate the cont message?
> > > There is no mechanism to allow that open record to be force-finalized
> > > so that readers can read newer records.
> >
> > This is a real problem. And it is the reason why the cont buffer is
> > currently flushed (finalized) by the next message from another context.
>
> I understand that you think that this should be discussed and addressed
> later in a separate patch, but, since we are on pr_cont topic right now,
> can we slow down and maybe re-think what is actually expected from
> pr_cont()? IOW, have the "what is expect from this feature" thread?
>
> For instance, is missing \n the one and only reason why printk-s from
> another context flush cont buffer now? Because I can see some more reasons
> for current behaviour and I'd like to question those reasons.
>
> I think what Linus said a long time ago was that the initial purpose of
> pr_cont was
>
> pr_info("Initialize feature foo...");
> if (init_feature_foo() == 0)
> pr_cont("ok\n");
> else
> pr_cont("not ok\n");
>
> And if init_feature_foo() crashes the kernel then the first printk()
> form panic() will flush the cont buffer.
>
> We can handle this by realizing that new printk() message has LOG_NEWLINE
> and has different log_level (not pr_cont), maybe.

Yes, this is a handy behavior. But it is also complicated
on the implementation side. It requires that consoles are able to
print the existing part of the continuous line and print only
the rest later.

BTW: It used to work before the commit 5c2992ee7fd8a29d041 ("printk:
remove console flushing special cases for partial buffered
lines");

BTW2: It will be much easier to implement when only the last
message can be partially shown on the consoles. Each console
driver will need to track the position only in one message.

Also it will be easier when the part of the message if stored
in the main lockless ring buffer. Then the driver could just
try to reread the last message and see if it was concatenated.


> Let's look at the more general case:
>
> CPU0 .. CPU255
> pr_info("text");
> pr_alert("boom\n");
> pr_cont("1);
> pr_cont("2\n");
>
> Do we really need to preliminary flush CPU0 pr_cont buffer in this
> case? There is no connection between messages from CPU0 and CPU255.
> Maybe (maybe!) what matters here is keeping the order of messages
> per-context rather than globally system-wide?

Honestly, I have no idea how many newlines are missing. They are often
not noticed because the buffered message is flushed later by some
other one.

The chance that some other "random" message will flush the pending
message is much lower if we have many cont buffers per-context
and per-cpu.

I am not brave enough to add more cont buffers without some fallback
mechanism to flush them later (irq_work?, timer?) or without
audit of all callers.

Where the audit is implicit when all callers are converted to
the buffered printk API.


There is one more problem. Any buffering might cause that nobody
will be able to see the message when things go wrong. Flushing
during panic() might help but only when panic() is called
and when there are system-wide cont buffers.

By other words, the current pr_cont() behavior causes mixed output
from time to time. But it increases the chance to see the messages.
And it makes it easier to find them in crashdump. Perfect output
is nice. But it will not help when the messages gets lost.

All I want to say that it is not black and white.


My opinion:

I will leave the decision on John. If he thinks that converting
all pr_cont() users to a buffered API is easier I will be fine
with it. It was proposed and requested several times.

If John realizes that my proposal to allow to reopen committed
messages is easier, I will be fine with it as well. We could
create the buffered API and convert the most critical
users one by one later. Also the context information will
allow to connect the broken messages in userspace and
do not complicate the kernel side.

Anyway, the lockless printk() feature is more important for
me that a perfect output of continuous lines.

Best Regards,
Petr

2020-08-14 09:57:31

by Petr Mladek

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Fri 2020-08-14 10:22:35, John Ogness wrote:
> On 2020-08-14, Sergey Senozhatsky <[email protected]> wrote:
> > One thing that we need to handle here, I believe, is that the context
> > which crashes the kernel should flush its cont buffer, because the
> > information there is relevant to the crash:
> >
> > pr_cont_alloc_info(&c);
> > pr_cont(&c, "1");
> > pr_cont(&c, "2");
> > >>
> > oops
> > panic()
> > <<
> > pr_cont_flush(&c);
> >
> > We better flush that context's pr_cont buffer during panic().
>
> I am not convinced of the general usefulness of partial messages, but as
> long as we have an API that includes registration, usage, and
> deregistration of some sort of handle, then we leave the window open for
> such implementations.

Registering some handle is an interesting idea. But it rules out
buffers on the stack. And we should avoid dynamic allocation.
printk() must be reliable also when there is not enough memory.

Please, keep it simple and do not add dependencies on new subsystems.
Using external code in printk() is always a call for problems.
The called code must be lockless and must not call printk().

Best Regards,
Petr

2020-08-14 23:26:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Thu, Aug 13, 2020 at 4:54 AM Sergey Senozhatsky
<[email protected]> wrote:
>
> I think what Linus said a long time ago was that the initial purpose of
> pr_cont was
>
> pr_info("Initialize feature foo...");
> if (init_feature_foo() == 0)
> pr_cont("ok\n");
> else
> pr_cont("not ok\n");
>
> And if init_feature_foo() crashes the kernel then the first printk()
> form panic() will flush the cont buffer.

Right.

This is why I think any discussion that says "people should buffer
their lines themselves and we should get rid if pr_cont()" is
fundamentally broken.

Don't go down that hole. I won't take it. It's wrong.

The fact is, pr_cont() goes back to the original kernel. No, it wasn't
pr_cont() back then, and no, there were no actual explicit markers for
"this is a continuation" at all, it was all just "the last printk
didn't have a newline, so we continue where we left off".

We've added pr_cont (and KERN_CONT) since then, and I realize that a
lot of people hate the complexity it introduces, but it's a
fundamental complexity that you have to live with.

If you can't live with pr_cont(), you shouldn't be working on
printk(), and find some other area of the kernel that you _can_ live
with.

It really is that simple.

Linus

2020-08-14 23:53:35

by Joe Perches

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Fri, 2020-08-14 at 15:46 -0700, Linus Torvalds wrote:
> On Thu, Aug 13, 2020 at 4:54 AM Sergey Senozhatsky
> <[email protected]> wrote:
> > I think what Linus said a long time ago was that the initial purpose of
> > pr_cont was
> >
> > pr_info("Initialize feature foo...");
> > if (init_feature_foo() == 0)
> > pr_cont("ok\n");
> > else
> > pr_cont("not ok\n");
> >
> > And if init_feature_foo() crashes the kernel then the first printk()
> > form panic() will flush the cont buffer.
>
> Right.
>
> This is why I think any discussion that says "people should buffer
> their lines themselves and we should get rid if pr_cont()" is
> fundamentally broken.
>
> Don't go down that hole. I won't take it. It's wrong.

I don't think it's wrong per se.

It's reasonable to avoid pr_cont when appropriate.

Trivial buffering, or adding and using YA vsprintf
extension can avoid unnecessary message interleaving.

For instance, I just sent this patch to allow removal
of print_vma_addr and its use of pr_cont.

https://lore.kernel.org/lkml/[email protected]/

This is similar to the dump_flags_names removal back
in commit edf14cdbf9a0 ("mm, printk: introduce new format
string for flags")

> The fact is, pr_cont() goes back to the original kernel. No, it wasn't
> pr_cont() back then, and no, there were no actual explicit markers for
> "this is a continuation" at all, it was all just "the last printk
> didn't have a newline, so we continue where we left off".
>
> We've added pr_cont (and KERN_CONT) since then, and I realize that a
> lot of people hate the complexity it introduces, but it's a
> fundamental complexity that you have to live with.
>
> If you can't live with pr_cont(), you shouldn't be working on
> printk(), and find some other area of the kernel that you _can_ live
> with.
>
> It really is that simple.
>

2020-08-15 21:53:04

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On (20/08/14 15:46), Linus Torvalds wrote:
> On Thu, Aug 13, 2020 at 4:54 AM Sergey Senozhatsky
> <[email protected]> wrote:
> >
> > I think what Linus said a long time ago was that the initial purpose of
> > pr_cont was
> >
> > pr_info("Initialize feature foo...");
> > if (init_feature_foo() == 0)
> > pr_cont("ok\n");
> > else
> > pr_cont("not ok\n");
> >
> > And if init_feature_foo() crashes the kernel then the first printk()
> > form panic() will flush the cont buffer.
>
> Right.
>
> This is why I think any discussion that says "people should buffer
> their lines themselves and we should get rid if pr_cont()" is
> fundamentally broken.

I think what we've been talking about so far was "how do we buffer
cont lines now", what are the problems of current buffering and what
can we do to make that buffering SMP safe (preserving the context of
broken cont buffer, etc. etc.). I don't think that anyone concluded
to just "s/pr_cont/printk/g", I don't see this.

-ss

2020-08-15 21:57:38

by Joe Perches

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Fri, 2020-08-14 at 19:33 -0700, Linus Torvalds wrote:
> On Fri, Aug 14, 2020 at 4:52 PM Joe Perches <[email protected]> wrote:
> > On Fri, 2020-08-14 at 15:46 -0700, Linus Torvalds wrote:
> > > This is why I think any discussion that says "people should buffer
> > > their lines themselves and we should get rid if pr_cont()" is
> > > fundamentally broken.
> > >
> > > Don't go down that hole. I won't take it. It's wrong.
> >
> > I don't think it's wrong per se.
>
> It's *absolutely* and 100% wrong.
>
> Yes, any random *user* of pr_cont() can decide to buffer on it's own.

Which I believe is the point of the discussion,
not the complete removal of KERN_CONT.

> But when the discussion is about printk() - the implementation, not
> the users - then it's complete and utter BS to talk about trying to
> get rid of pr_cont().
>
> See the difference?

Sure, but I fail to see where anyone said get rid of pr_cont
in this thread. It seems all that was discussed was just
various schemes to improve coalescing output.


2020-08-15 21:57:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

On Fri, Aug 14, 2020 at 4:52 PM Joe Perches <[email protected]> wrote:
>
> On Fri, 2020-08-14 at 15:46 -0700, Linus Torvalds wrote:
> >
> > This is why I think any discussion that says "people should buffer
> > their lines themselves and we should get rid if pr_cont()" is
> > fundamentally broken.
> >
> > Don't go down that hole. I won't take it. It's wrong.
>
> I don't think it's wrong per se.

It's *absolutely* and 100% wrong.

Yes, any random *user* of pr_cont() can decide to buffer on it's own.

But when the discussion is about printk() - the implementation, not
the users - then it's complete and utter BS to talk about trying to
get rid of pr_cont().

See the difference?

Linus

2020-08-16 09:10:18

by David Laight

[permalink] [raw]
Subject: RE: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling

From: Joe Perches
> Sent: 15 August 2020 00:52
...
> > This is why I think any discussion that says "people should buffer
> > their lines themselves and we should get rid if pr_cont()" is
> > fundamentally broken.
> >
> > Don't go down that hole. I won't take it. It's wrong.
>
> I don't think it's wrong per se.
>
> It's reasonable to avoid pr_cont when appropriate.
>
> Trivial buffering, or adding and using YA vsprintf
> extension can avoid unnecessary message interleaving.
>
> For instance, I just sent this patch to allow removal
> of print_vma_addr and its use of pr_cont.
>
> https://lore.kernel.org/lkml/[email protected]/

ISTM that is a bit complex for a printf format.
Even with the 'noinline_for_stack' it is going to
use a lot of stack - and error printfs are already likely
to be near the stack limit.
The recursion for %pV might also cause grief.

In that case formatting the data into an on-stack char[]
before the printf is probably the simplest solution.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)