2020-08-11 09:46:01

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH] printk: Change timestamp to triplet as mono, boot and real

On Tue 2020-08-11 12:40:22, Orson Zhai wrote:
> From: Thomas Gleixner <[email protected]>
>
> Timestamps printed in kernel log are retrieved by local_clock which reads
> jiffies as a referrence clock source.
> But it is diffcult to be synchronized with logs generated out of kernel,
> say some remote processors (Modem) in the Soc of mobile phones.
> Jiffies will be unchanged when system is in suspend mode but Modem will
> not.
>
> So timestamps are required to be compensated with suspend time in some
> complecated scenarios especially for Android system. As an initial
> implementation was rejected by Linus, Thomas made this patch [1] 2 yeas
> ago to replace ts_nsec with a struct ts that consists 3 types of time:
> mono, boot and real.
>
> This is an updated version which comes from patch [1] written by Thomas
> and suggestion [2] about VMCORE_INFO given by Linus.
>
> It provides the prerequisite code to print kernel logs with boot or real
> timestamp in the future.
>
> This patch has been tested in qemu-x86-system. One problem is the timestamp
> in kernel log will be printed [ 0.000000] for longer time than before.

This would be a regression. People put huge effort into having early boot
timestamps, see
https://lore.kernel.org/lkml/[email protected]/
Adding some active people from this patchset into CC.

I wonder if we could have these early timestamps also in the mono clock.

> 1 [ 0.000000] microcode: microcode updated early to revision 0x28, date = 2019-11-12
> 2 [ 0.000000] Linux version 5.8.0+ (root@spreadtrum) (gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #2 SMP Fri Aug 7 21:30:51 CST 2020
> 3 [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.8.0+ root=UUID=4d889bca-be18-433d-9b86-c1c1714cc293 ro quiet splash vt.handoff=1
> 4 [ 0.000000] KERNEL supported cpus:
> 5 [ 0.000000] Intel GenuineIntel
> ....
> 223 [ 0.000000] Tracing variant of Tasks RCU enabled.
> 224 [ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> 225 [ 0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
> 226 [ 0.000000] NR_IRQS: 524544, nr_irqs: 456, preallocated irqs: 16
> 227 [ 0.059662] random: crng done (trusting CPU's manufacturer)
> ^^^^^^^^
> 228 [ 0.059662] Console: colour dummy device 80x25
> 229 [ 0.059662] printk: console [tty0] enabled
> 230 [ 0.059662] ACPI: Core revision 20200528
>
> compared to old log:
>
> 69 [ 0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000021fdfffff] usable
> 70 [ 0.000000] efi: EFI v2.31 by American Megatrends
> 71 [ 0.000000] efi: ESRT=0xdbf69818 ACPI=0xdaef8000 ACPI 2.0=0xdaef8000 SMBIOS=0xf0480
> 72 [ 0.000000] SMBIOS 2.8 present.
> 73 [ 0.000000] DMI: LENOVO ThinkCentre M4500t-N000/, BIOS FCKT58AUS 09/17/2014
> 74 [ 0.000000] tsc: Fast TSC calibration using PIT
> 75 [ 0.000000] tsc: Detected 3292.477 MHz processor
> 76 [ 0.000503] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> ^^^^^^^^
> 77 [ 0.000504] e820: remove [mem 0x000a0000-0x000fffff] usable
> 78 [ 0.000510] last_pfn = 0x21fe00 max_arch_pfn = 0x400000000
> 79 [ 0.000513] MTRR default type: uncachable

[...]

> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -366,7 +366,8 @@ enum log_flags {
> };
>
> struct printk_log {
> - u64 ts_nsec; /* timestamp in nanoseconds */
> + /* Timestamps in nanoseconds */
> + struct timestamps ts;
> u16 len; /* length of entire record */
> u16 text_len; /* length of text buffer */
> u16 dict_len; /* length of dictionary buffer */
> @@ -1090,7 +1091,9 @@ void log_buf_vmcoreinfo_setup(void)
> * parse it and detect any changes to structure down the line.
> */
> VMCOREINFO_STRUCT_SIZE(printk_log);
> - VMCOREINFO_OFFSET(printk_log, ts_nsec);
> + /* ts.mono used to be called "ts_nsec" */
> + VMCOREINFO_FIELD_OFFSET(printk_log, ts_nsec, ts.mono);
> + VMCOREINFO_OFFSET(printk_log, ts);
> VMCOREINFO_OFFSET(printk_log, len);
> VMCOREINFO_OFFSET(printk_log, text_len);
> VMCOREINFO_OFFSET(printk_log, dict_len);

At least "crash" tool would need an update anyway. AFAIK, it checks
the size of struct printk_log and refuses to read it when it changes.

It means that the hack with VMCOREINFO_FIELD_OFFSET probably is not
needed because we would need to update the crashdump-related tools anyway.

Well, the timing is good. We are about to switch the printk ring
buffer into a lockless one. It requires updating the crashdump tools
as well. We could do this at the same time. The lockless ring buffer
already is in linux-next. It is aimed for 5.10 or 5.11. We are still
discussion handling of continuous lines that might need another change
in the internal structures, see
https://lore.kernel.org/r/[email protected]

It would be great to synchronize all these changes changes of the
printk log buffer structures.

Best Regards,
Petr


2020-08-11 12:06:43

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH] printk: Change timestamp to triplet as mono, boot and real

Petr Mladek <[email protected]> writes:
> On Tue 2020-08-11 12:40:22, Orson Zhai wrote:
>> This is an updated version which comes from patch [1] written by Thomas
>> and suggestion [2] about VMCORE_INFO given by Linus.

All of that want's to be properly distangled into seperate patches.

>> This patch has been tested in qemu-x86-system. One problem is the timestamp
>> in kernel log will be printed [ 0.000000] for longer time than before.
>
> This would be a regression. People put huge effort into having early boot
> timestamps, see
> https://lore.kernel.org/lkml/[email protected]/
> Adding some active people from this patchset into CC.
>
> I wonder if we could have these early timestamps also in the mono
> clock.

Not really. timekeeping init happens way after the early TSC (or
whatever clock) is registered as sched_clock(). And there is no
realistic way to move timekeeping init earlier.

What we could do instead is to utilize sched_clock() up to the point
where timekeeping becomes available and ensure that monotonic time is
not jumping backwards vs. sched_clock() when switching over. For this
early boot phase, clock realtime timestamps would be invalid of course
and they can stay invalid even after timekeeping is initialized on
systems where the RTC is not available in the early boot process.

> At least "crash" tool would need an update anyway. AFAIK, it checks
> the size of struct printk_log and refuses to read it when it changes.
>
> It means that the hack with VMCOREINFO_FIELD_OFFSET probably is not
> needed because we would need to update the crashdump-related tools anyway.
>
> Well, the timing is good. We are about to switch the printk ring
> buffer into a lockless one. It requires updating the crashdump tools
> as well. We could do this at the same time. The lockless ring buffer
> already is in linux-next. It is aimed for 5.10 or 5.11.
...
> It would be great to synchronize all these changes changes of the
> printk log buffer structures.

I agree that having one update is a good thing, but pretty please can we
finally make progress with this and not create yet another dependency?

Thanks,

tglx

2020-08-11 13:03:18

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH] printk: Change timestamp to triplet as mono, boot and real

On Tue 2020-08-11 14:05:12, Thomas Gleixner wrote:
> Petr Mladek <[email protected]> writes:
> > At least "crash" tool would need an update anyway. AFAIK, it checks
> > the size of struct printk_log and refuses to read it when it changes.
> >
> > It means that the hack with VMCOREINFO_FIELD_OFFSET probably is not
> > needed because we would need to update the crashdump-related tools anyway.
> >
> > Well, the timing is good. We are about to switch the printk ring
> > buffer into a lockless one. It requires updating the crashdump tools
> > as well. We could do this at the same time. The lockless ring buffer
> > already is in linux-next. It is aimed for 5.10 or 5.11.
> ...
> > It would be great to synchronize all these changes changes of the
> > printk log buffer structures.
>
> I agree that having one update is a good thing, but pretty please can we
> finally make progress with this and not create yet another dependency?

To make it clear. I definitely do not want to block lockless printk by
this.

BTW: I am not 100% convinced that storing all three timestamps is
worth it. It increases the code complexity, metadata size. It needs
an interface with the userspace that has to stay backward compatible.

Also it still will be racy because the timestamp is taken when the message
is printed. It might be "long" before or after the event that
it talks about.

There is still the alternative to print all three timestamps regularly
for those interested. It is less user convenient but much easier
to maintain.

Best Regards,
Petr

2020-08-11 13:44:45

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [RFC PATCH] printk: Change timestamp to triplet as mono, boot and real



On 8/11/20 9:02 AM, Petr Mladek wrote:
> On Tue 2020-08-11 14:05:12, Thomas Gleixner wrote:
>> Petr Mladek <[email protected]> writes:
>>> At least "crash" tool would need an update anyway. AFAIK, it checks
>>> the size of struct printk_log and refuses to read it when it changes.
>>>
>>> It means that the hack with VMCOREINFO_FIELD_OFFSET probably is not
>>> needed because we would need to update the crashdump-related tools anyway.
>>>
>>> Well, the timing is good. We are about to switch the printk ring
>>> buffer into a lockless one. It requires updating the crashdump tools
>>> as well. We could do this at the same time. The lockless ring buffer
>>> already is in linux-next. It is aimed for 5.10 or 5.11.
>> ...
>>> It would be great to synchronize all these changes changes of the
>>> printk log buffer structures.
>>
>> I agree that having one update is a good thing, but pretty please can we
>> finally make progress with this and not create yet another dependency?
>
> To make it clear. I definitely do not want to block lockless printk by
> this.

Thanks for clarifying that. I had the same concern that tglx had.

>
> BTW: I am not 100% convinced that storing all three timestamps is
> worth it. It increases the code complexity, metadata size. It needs
> an interface with the userspace that has to stay backward compatible.
>
> Also it still will be racy because the timestamp is taken when the message
> is printed. It might be "long" before or after the event that
> it talks about.

That scenario plays out today with the current timestamp. Printing debug with a
better timestamp doesn't resolve that problem nor is it intended to.

>
> There is still the alternative to print all three timestamps regularly
> for those interested. It is less user convenient but much easier
> to maintain.

While I agree, in general, your alternative is useful for in-person debugging it
is not helpful in cases where a user only has an image of a panic where the
printk time-synch message has scrolled off the screen.

Consider this real debug case where I hacked the boottime stamp into printk: We
had some systems with flaky HW that hit panics at 10 hours [1] of uptime. Since
these systems came from different vendors with different HW the clocks were
out-of-sync. I had a suspicion that it was some human-coded event causing the
HW to die but wasn't sure until I did a boottime-stamped printk to prove that
all the systems died after 10 hours.

I could have set a stopwatch or timer to somehow catch this kind of event. I
could have set up video cameras to watch the consoles, etc. There are a lot of
other ways I could have debugged this situation but ultimately the fastest thing
to do was code and provide kernels to the various HW companies and see if
everything lined up as I thought it would.

This problem happens far more often then I'd like to admit and we still see this
type of problem with new HW and FW. I also recall that other kernel groups
(storage, networking, etc.) were interested in the timestamps as it would make
their debugging easier to have a true synchronized timestamp available.

One other thing -- IIRC I had to modify the dmesg code by providing a sysfs (or
proc?) interface for dmesg to identify the stamp. It's something that should be
investigated with this code too.

P.

[1] It was 3+ years ago. I can't remember if it was 10 or 100 but you get the
point.

>
> Best Regards,
> Petr
>

2020-08-13 01:56:08

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH] printk: Change timestamp to triplet as mono, boot and real

On (20/08/11 15:02), Petr Mladek wrote:
> On Tue 2020-08-11 14:05:12, Thomas Gleixner wrote:
> > Petr Mladek <[email protected]> writes:
> > > At least "crash" tool would need an update anyway. AFAIK, it checks
> > > the size of struct printk_log and refuses to read it when it changes.
> > >
> > > It means that the hack with VMCOREINFO_FIELD_OFFSET probably is not
> > > needed because we would need to update the crashdump-related tools anyway.
> > >
> > > Well, the timing is good. We are about to switch the printk ring
> > > buffer into a lockless one. It requires updating the crashdump tools
> > > as well. We could do this at the same time. The lockless ring buffer
> > > already is in linux-next. It is aimed for 5.10 or 5.11.
> > ...
> > > It would be great to synchronize all these changes changes of the
> > > printk log buffer structures.
> >
> > I agree that having one update is a good thing, but pretty please can we
> > finally make progress with this and not create yet another dependency?
>
> To make it clear. I definitely do not want to block lockless printk by
> this.
>
> BTW: I am not 100% convinced that storing all three timestamps is
> worth it. It increases the code complexity, metadata size. It needs
> an interface with the userspace that has to stay backward compatible.

Can we, perhaps, store those various "alternative" timestamps in dict so
then whoever wants to read them can just parse the dict key:value pairs
attach to each printk message?

> Also it still will be racy because the timestamp is taken when the message
> is printed. It might be "long" before or after the event that
> it talks about.
>
> There is still the alternative to print all three timestamps regularly
> for those interested. It is less user convenient but much easier
> to maintain.

Yes, that's a nice alternative.

-ss

2020-08-13 10:24:03

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH] printk: Change timestamp to triplet as mono, boot and real

On Thu 2020-08-13 10:55:00, Sergey Senozhatsky wrote:
> On (20/08/11 15:02), Petr Mladek wrote:
> > On Tue 2020-08-11 14:05:12, Thomas Gleixner wrote:
> > > Petr Mladek <[email protected]> writes:
> > > > At least "crash" tool would need an update anyway. AFAIK, it checks
> > > > the size of struct printk_log and refuses to read it when it changes.
> > > >
> > > > It means that the hack with VMCOREINFO_FIELD_OFFSET probably is not
> > > > needed because we would need to update the crashdump-related tools anyway.
> > > >
> > > > Well, the timing is good. We are about to switch the printk ring
> > > > buffer into a lockless one. It requires updating the crashdump tools
> > > > as well. We could do this at the same time. The lockless ring buffer
> > > > already is in linux-next. It is aimed for 5.10 or 5.11.
> > > ...
> > > > It would be great to synchronize all these changes changes of the
> > > > printk log buffer structures.
> > >
> > > I agree that having one update is a good thing, but pretty please can we
> > > finally make progress with this and not create yet another dependency?
> >
> > To make it clear. I definitely do not want to block lockless printk by
> > this.
> >
> > BTW: I am not 100% convinced that storing all three timestamps is
> > worth it. It increases the code complexity, metadata size. It needs
> > an interface with the userspace that has to stay backward compatible.
>
> Can we, perhaps, store those various "alternative" timestamps in dict so
> then whoever wants to read them can just parse the dict key:value pairs
> attach to each printk message?

Interesting idea. It might be a way how to add optional metadata
without breaking compatibility with crashdump tools.

Well, I have bad feeling about it. Some of the reasons might be:

+ would take more space (prefix + text vs. binary representation)
+ not reliable because dict is currently dropped when no space
+ it would make the controversial dictionary feature more important


I would prefer to solve this by storing the timestamps in the
structure with metadata.

Best Regards,
Petr

2020-08-13 10:27:30

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH] printk: Change timestamp to triplet as mono, boot and real

Sergey Senozhatsky <[email protected]> writes:
> On (20/08/11 15:02), Petr Mladek wrote:
>> There is still the alternative to print all three timestamps regularly
>> for those interested. It is less user convenient but much easier
>> to maintain.
>
> Yes, that's a nice alternative.

It's trivial on the kernel side, annoying for all people who do not care
about them because they show up in syslog and it's a fricking nightmare
to reconstruct over a large cluster of machines.

Thanks,

tglx

2020-08-13 11:33:26

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [RFC PATCH] printk: Change timestamp to triplet as mono, boot and real

On (20/08/13 12:22), Petr Mladek wrote:
>
> + would take more space (prefix + text vs. binary representation)

Dict buffer is allocated regardless of how we use it, and only printks
from drivers/* (dev_printk*) add dict payload. It might be the case
that on some (if not most) systems dict pages are not used 90% of times
(if not 99%).

> + not reliable because dict is currently dropped when no space

Well, in the perfect world this is a problem, but "maybe not having
alternative timestamps sometimes" can be OK approach for people who
wants to use those triplet timestamps.

But, in general, how real this problem is? What I sae so far (on my boxes)
was that printk messages are longer than dict payload.

Example:

6,223,93298,-,caller=T1;acpi PNP0A08:00: _OSC: OS supports [Segments MSI HPX-Type3]
SUBSYSTEM=acpi
DEVICE=+acpi:PNP0A08:00
6,224,93413,-,caller=T1;acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
SUBSYSTEM=acpi
DEVICE=+acpi:PNP0A08:00
6,225,93540,-,caller=T1;PCI host bridge to bus 0000:00
6,226,93541,-,caller=T1;pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7 window]
SUBSYSTEM=pci_bus
DEVICE=+pci_bus:0000:00

It might be the case, that overflowing dict buffer, without first overflowing
printk buffer, can be rather challenging.

> + it would make the controversial dictionary feature more important

Yes. But, for the time being, this controversial dict feature allocates
pages for dict buffer anyway.

-ss

2020-08-14 10:09:22

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH] printk: Change timestamp to triplet as mono, boot and real

On Thu 2020-08-13 20:31:55, Sergey Senozhatsky wrote:
> On (20/08/13 12:22), Petr Mladek wrote:
> >
> > + would take more space (prefix + text vs. binary representation)
>
> Dict buffer is allocated regardless of how we use it, and only printks
> from drivers/* (dev_printk*) add dict payload. It might be the case
> that on some (if not most) systems dict pages are not used 90% of times
> (if not 99%).
>
> > + not reliable because dict is currently dropped when no space
>
> Well, in the perfect world this is a problem, but "maybe not having
> alternative timestamps sometimes" can be OK approach for people who
> wants to use those triplet timestamps.

I am afraid the people would see missing timestamps as a bug.
They want it because they want to corelate kernel and userspace logs.
The timestamps from realtime clock are supposed to make it
straightforward.

More importantly. I do not see a reasonable way how to handle it.
No timestamp is worse than timestamps from non-synchronized clock.
Mixing timestamps from different clocks (as fallback) would cause
a lot of confusion (more harm than good).

> But, in general, how real this problem is? What I sae so far (on my boxes)
> was that printk messages are longer than dict payload.

Yeah, I think that dict are less used on normal systems. But it might
be very different when a driver gets more verbose during debugging
and the amount of dev_printk() calls is much higher.

Best Regards,
Petr