2019-07-22 14:27:23

by Marc Zyngier

[permalink] [raw]
Subject: [PATCH 0/3] arm64: Allow early timestamping of kernel log

So far, we've let the arm64 kernel start its meaningful time stamping
of the kernel log pretty late, which is caused by sched_clock() being
initialised rather late compared to other architectures.

Pavel Tatashin proposed[1] to move the initialisation of sched_clock
much earlier, which I had objections to. The reason for initialising
sched_clock late is that a number of systems have broken counters, and
we need to apply all kind of terrifying workarounds to avoid time
going backward on the affected platforms. Being able to identify the
right workaround comes pretty late in the kernel boot, and providing
an unreliable sched_clock, even for a short period of time, isn't an
appealing prospect.

To address this, I'm proposing that we allow an architecture to chose
to (1) divorce time stamping and sched_clock during the early phase of
booting, and (2) inherit the time stamping clock as the new epoch the
first time a sched_sched clock gets registered.

(1) would allow arm64 to provide a time stamping clock, however
unreliable it might be, while (2) would allow sched_clock to provide
time stamps that are continuous with the time-stamping clock.

The last patch in the series adds the necessary logic to arm64,
allowing the (potentially unreliable) time stamping of early kernel
messages.

Tested on a bunch of arm64 systems, both bare-metal and in VMs. Boot
tested on a x86 guest.

[1] https://lore.kernel.org/patchwork/patch/1015110/

Marc Zyngier (3):
printk: Allow architecture-specific timestamping function
sched/clock: Allow sched_clock to inherit timestamp_clock epoch
arm64: Allow early time stamping

arch/arm64/Kconfig | 3 +++
arch/arm64/kernel/setup.c | 44 +++++++++++++++++++++++++++++++++++++
include/linux/sched/clock.h | 13 +++++++++++
kernel/printk/printk.c | 4 ++--
kernel/time/sched_clock.c | 10 +++++++++
5 files changed, 72 insertions(+), 2 deletions(-)

--
2.20.1


2019-07-22 14:28:16

by Marc Zyngier

[permalink] [raw]
Subject: [PATCH 1/3] printk: Allow architecture-specific timestamping function

printk currently relies on local_clock to time-stamp the kernel
messages. In order to allow the timestamping (and only that)
to be overridden by architecture-specific code, let's declare
a new timestamp_clock() function, which gets used by the printk
code. Architectures willing to make use of this facility will
have to define CONFIG_ARCH_HAS_TIMESTAMP_CLOCK.

The default is of course to return local_clock(), so that the
existing behaviour stays unchanged.

Signed-off-by: Marc Zyngier <[email protected]>
---
include/linux/sched/clock.h | 13 +++++++++++++
kernel/printk/printk.c | 4 ++--
2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/clock.h b/include/linux/sched/clock.h
index 867d588314e0..3cf4b2a8ce18 100644
--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -98,4 +98,17 @@ static inline void enable_sched_clock_irqtime(void) {}
static inline void disable_sched_clock_irqtime(void) {}
#endif

+#ifdef CONFIG_ARCH_HAS_TIMESTAMP_CLOCK
+/* Special need architectures can provide their timestamping function */
+extern u64 timestamp_clock(void);
+
+#else
+
+static inline u64 timestamp_clock(void)
+{
+ return local_clock();
+}
+
+#endif
+
#endif /* _LINUX_SCHED_CLOCK_H */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 1888f6a3b694..166702316714 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -638,7 +638,7 @@ static int log_store(u32 caller_id, int facility, int level,
if (ts_nsec > 0)
msg->ts_nsec = ts_nsec;
else
- msg->ts_nsec = local_clock();
+ msg->ts_nsec = timestamp_clock();
#ifdef CONFIG_PRINTK_CALLER
msg->caller_id = caller_id;
#endif
@@ -1841,7 +1841,7 @@ static bool cont_add(u32 caller_id, int facility, int level,
cont.facility = facility;
cont.level = level;
cont.caller_id = caller_id;
- cont.ts_nsec = local_clock();
+ cont.ts_nsec = timestamp_clock();
cont.flags = flags;
}

--
2.20.1

2019-07-22 14:28:56

by Marc Zyngier

[permalink] [raw]
Subject: [PATCH 2/3] sched/clock: Allow sched_clock to inherit timestamp_clock epoch

Now that we can let an architecture override the timestamping
function, it becomes desirable to ensure that, should the
architecture code switch its timestamping code to sched_clock
once it has been registered, the sched_clock inherits the
timestamp value as its new epoch.

This ensures that the time stamps are continuous and that there
is no jitter other than that introduced by the lack of quality
of the timestamping clock.

Signed-off-by: Marc Zyngier <[email protected]>
---
kernel/time/sched_clock.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 142b07619918..ee1bd449ec81 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -192,6 +192,16 @@ sched_clock_register(u64 (*read)(void), int bits, unsigned long rate)
new_epoch = read();
cyc = cd.actual_read_sched_clock();
ns = rd.epoch_ns + cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask, rd.mult, rd.shift);
+
+ /*
+ * If the architecture has a timestamp clock, and this is the
+ * first time we register a new sched_clock, use the timestamp
+ * clock as the epoch.
+ */
+ if (IS_ENABLED(CONFIG_ARCH_HAS_TIMESTAMP_CLOCK) &&
+ unlikely(cd.actual_read_sched_clock == jiffy_sched_clock_read))
+ ns = timestamp_clock();
+
cd.actual_read_sched_clock = read;

rd.read_sched_clock = read;
--
2.20.1

2019-07-22 14:28:56

by Marc Zyngier

[permalink] [raw]
Subject: [PATCH 3/3] arm64: Allow early time stamping

In order to provide early timestamping on arm64 systems, we provide
a timestamp_clock() function that is available as early as possible.
This function simply returns the current counter value scales in
nanoseconds, and 0-based.

In order to deal with the idiosyncrasies of some broken platforms,
we condition this on the frequency being set in the CNTFRQ_EL0
register, and revert back to using local_clock() as soon as it starts
returning non-zero values.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm64/Kconfig | 3 +++
arch/arm64/kernel/setup.c | 44 +++++++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3adcec05b1f6..ac3882ddc1c2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -911,6 +911,9 @@ config ARCH_WANT_HUGE_PMD_SHARE
config ARCH_HAS_CACHE_LINE_SIZE
def_bool y

+config ARCH_HAS_TIMESTAMP_CLOCK
+ def_bool y
+
config ARCH_ENABLE_SPLIT_PMD_PTLOCK
def_bool y if PGTABLE_LEVELS > 2

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 9c4bad7d7131..cf21e3df7165 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -15,6 +15,7 @@
#include <linux/initrd.h>
#include <linux/console.h>
#include <linux/cache.h>
+#include <linux/clocksource.h>
#include <linux/screen_info.h>
#include <linux/init.h>
#include <linux/kexec.h>
@@ -28,10 +29,12 @@
#include <linux/of_fdt.h>
#include <linux/efi.h>
#include <linux/psci.h>
+#include <linux/sched/clock.h>
#include <linux/sched/task.h>
#include <linux/mm.h>

#include <asm/acpi.h>
+#include <asm/arch_timer.h>
#include <asm/fixmap.h>
#include <asm/cpu.h>
#include <asm/cputype.h>
@@ -269,8 +272,49 @@ arch_initcall(reserve_memblock_reserved_regions);

u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };

+static u64 notrace ts_counter_read_cc(const struct cyclecounter *cc)
+{
+ return __arch_counter_get_cntvct();
+}
+
+static struct cyclecounter ts_cc __ro_after_init = {
+ .read = ts_counter_read_cc,
+ .mask = CLOCKSOURCE_MASK(56),
+};
+
+static struct timecounter ts_tc;
+
+static bool ts_cc_valid __ro_after_init = false;
+
+static __init void timestamp_clock_init(void)
+{
+ u64 frq = arch_timer_get_cntfrq();
+
+ if (!frq)
+ return;
+
+ clocks_calc_mult_shift(&ts_cc.mult, &ts_cc.shift,
+ frq, NSEC_PER_SEC, 3600);
+ /* timestamp starts at 0 (local_clock is a good enough approximation) */
+ timecounter_init(&ts_tc, &ts_cc, local_clock());
+ ts_cc_valid = true;
+ pr_info("Using timestamp clock @%lluMHz\n", frq / 1000 / 1000);
+}
+
+u64 timestamp_clock(void)
+{
+ u64 ns = local_clock();
+
+ if (likely(ns || !ts_cc_valid))
+ return ns;
+
+ return timecounter_read(&ts_tc);
+}
+
void __init setup_arch(char **cmdline_p)
{
+ timestamp_clock_init();
+
init_mm.start_code = (unsigned long) _text;
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
--
2.20.1

2019-07-23 06:16:13

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 0/3] arm64: Allow early timestamping of kernel log

On Mon, Jul 22, 2019 at 3:33 AM Marc Zyngier <[email protected]> wrote:
>
> So far, we've let the arm64 kernel start its meaningful time stamping
> of the kernel log pretty late, which is caused by sched_clock() being
> initialised rather late compared to other architectures.
>
> Pavel Tatashin proposed[1] to move the initialisation of sched_clock
> much earlier, which I had objections to. The reason for initialising
> sched_clock late is that a number of systems have broken counters, and
> we need to apply all kind of terrifying workarounds to avoid time
> going backward on the affected platforms. Being able to identify the
> right workaround comes pretty late in the kernel boot, and providing
> an unreliable sched_clock, even for a short period of time, isn't an
> appealing prospect.
>
> To address this, I'm proposing that we allow an architecture to chose
> to (1) divorce time stamping and sched_clock during the early phase of
> booting, and (2) inherit the time stamping clock as the new epoch the
> first time a sched_sched clock gets registered.

Could we have a stable clock config for arm64: if it is known that
this Linux build is going to run on non-broken firmware, and with a
known stable cntvct_el0 register it can be optionally configured to
use that stable sched_clock instead of generic clock that arm64 is
using? This way, the early printk are going to be available on those
systems without adding a different method for printk's only. It would
also mean that other users of early sched_clock() such as ftrace could
benefit from it.

>
> (1) would allow arm64 to provide a time stamping clock, however
> unreliable it might be, while (2) would allow sched_clock to provide
> time stamps that are continuous with the time-stamping clock.
>
> The last patch in the series adds the necessary logic to arm64,
> allowing the (potentially unreliable) time stamping of early kernel
> messages.
>
> Tested on a bunch of arm64 systems, both bare-metal and in VMs. Boot
> tested on a x86 guest.
>
> [1] https://lore.kernel.org/patchwork/patch/1015110/
>
> Marc Zyngier (3):
> printk: Allow architecture-specific timestamping function
> sched/clock: Allow sched_clock to inherit timestamp_clock epoch
> arm64: Allow early time stamping
>
> arch/arm64/Kconfig | 3 +++
> arch/arm64/kernel/setup.c | 44 +++++++++++++++++++++++++++++++++++++
> include/linux/sched/clock.h | 13 +++++++++++
> kernel/printk/printk.c | 4 ++--
> kernel/time/sched_clock.c | 10 +++++++++
> 5 files changed, 72 insertions(+), 2 deletions(-)
>
> --
> 2.20.1
>

2019-07-23 09:17:22

by Hanjun Guo

[permalink] [raw]
Subject: Re: [PATCH 0/3] arm64: Allow early timestamping of kernel log

On 2019/7/22 18:33, Marc Zyngier wrote:
> So far, we've let the arm64 kernel start its meaningful time stamping
> of the kernel log pretty late, which is caused by sched_clock() being
> initialised rather late compared to other architectures.
>
> Pavel Tatashin proposed[1] to move the initialisation of sched_clock
> much earlier, which I had objections to. The reason for initialising
> sched_clock late is that a number of systems have broken counters, and
> we need to apply all kind of terrifying workarounds to avoid time
> going backward on the affected platforms. Being able to identify the
> right workaround comes pretty late in the kernel boot, and providing
> an unreliable sched_clock, even for a short period of time, isn't an
> appealing prospect.
>
> To address this, I'm proposing that we allow an architecture to chose
> to (1) divorce time stamping and sched_clock during the early phase of
> booting, and (2) inherit the time stamping clock as the new epoch the
> first time a sched_sched clock gets registered.
>
> (1) would allow arm64 to provide a time stamping clock, however
> unreliable it might be, while (2) would allow sched_clock to provide
> time stamps that are continuous with the time-stamping clock.
>
> The last patch in the series adds the necessary logic to arm64,
> allowing the (potentially unreliable) time stamping of early kernel
> messages.
>
> Tested on a bunch of arm64 systems, both bare-metal and in VMs. Boot
> tested on a x86 guest.

This makes the boot log more useful and I can debug some time consuming
issue easier before the arch timer initialization, tested on my ARM64
server, I can see the timestamping from the start [1],

Tested-by: Hanjun Guo <[email protected]>

Thanks
Hanjun

[1]:
[ 0.000000] Booting Linux on physical CPU 0x0000080000 [0x481fd010]
[ 0.000000] Linux version 5.2.0+ ([email protected]) (gcc version 9.0.1 20190312 (Red Hat 9.0.1-0.10) (GCC)) #45 SMP Tue Jul 23 09:17:48 CST 2019
[ 0.000000] Using timestamp clock @100MHz
[ 0.000074] efi: Getting EFI parameters from FDT:
[ 0.000082] efi: EFI v2.70 by EDK II
[ 0.000083] efi: ACPI 2.0=0x3a300000 SMBIOS 3.0=0x39f80000 MEMATTR=0x30996018 MEMRESERVE=0x30997e18
[ 0.000122] crashkernel reserved: 0x000000000ba00000 - 0x000000002ba00000 (512 MB)
[ 0.000126] cma: Failed to reserve 512 MiB
[ 0.185111] ACPI: Early table checksum verification disabled
[ 0.185115] ACPI: RSDP 0x000000003A300000 000024 (v02 HISI )
[ 0.185120] ACPI: XSDT 0x000000003A270000 00009C (v01 HISI HIP08 00000000 01000013)
[ 0.185127] ACPI: FACP 0x0000000039B10000 000114 (v06 HISI HIP08 00000000 HISI 20151124)
[ 0.185134] ACPI: DSDT 0x0000000039AB0000 0084E4 (v02 HISI HIP08 00000000 INTL 20181213)
[ 0.185139] ACPI: PCCT 0x0000000039FB0000 00008A (v01 HISI HIP08 00000000 HISI 20151124)
[ 0.185143] ACPI: SSDT 0x0000000039F90000 01021A (v02 HISI HIP07 00000000 INTL 20181213)
[ 0.185147] ACPI: BERT 0x0000000039F50000 000030 (v01 HISI HIP08 00000000 HISI 20151124)
[ 0.185150] ACPI: HEST 0x0000000039F30000 000308 (v01 HISI HIP08 00000000 HISI 20151124)
[ 0.185154] ACPI: ERST 0x0000000039EF0000 000230 (v01 HISI HIP08 00000000 HISI 20151124)
[ 0.185158] ACPI: EINJ 0x0000000039EE0000 000170 (v01 HISI HIP08 00000000 HISI 20151124)
[ 0.185162] ACPI: SLIT 0x0000000039B30000 00003C (v01 HISI HIP08 00000000 HISI 20151124)
[ 0.185166] ACPI: GTDT 0x0000000039B00000 00007C (v02 HISI HIP08 00000000 HISI 20151124)
[ 0.185169] ACPI: MCFG 0x0000000039AF0000 00003C (v01 HISI HIP08 00000000 HISI 20151124)
[ 0.185173] ACPI: SPCR 0x0000000039AE0000 000050 (v02 HISI HIP08 00000000 HISI 20151124)
[ 0.185177] ACPI: SRAT 0x0000000039AD0000 0007D0 (v03 HISI HIP08 00000000 HISI 20151124)
[ 0.185181] ACPI: APIC 0x0000000039AC0000 001E6C (v04 HISI HIP08 00000000 HISI 20151124)
[ 0.185185] ACPI: IORT 0x0000000039AA0000 001310 (v00 HISI HIP08 00000000 INTL 20181213)
[ 0.185189] ACPI: PPTT 0x0000000030970000 0031B0 (v01 HISI HIP08 00000000 HISI 20151124)
[ 0.185196] ACPI: SPCR: console: pl011,mmio32,0x94080000,115200
[ 0.185208] ACPI: SRAT: Node 0 PXM 0 [mem 0x2080000000-0x2fffffffff]
[ 0.185210] ACPI: SRAT: Node 1 PXM 1 [mem 0x3000000000-0x3fffffffff]
[ 0.185212] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[ 0.185213] ACPI: SRAT: Node 2 PXM 2 [mem 0x202000000000-0x202fffffffff]
[ 0.185215] ACPI: SRAT: Node 3 PXM 3 [mem 0x203000000000-0x203fffffffff]
[ 0.185221] NUMA: NODE_DATA [mem 0x2fffffe3c0-0x2fffffffff]
[ 0.185224] NUMA: NODE_DATA [mem 0x3fffffe3c0-0x3fffffffff]
[ 0.185226] NUMA: NODE_DATA [mem 0x202fffffe3c0-0x202fffffffff]
[ 0.185229] NUMA: NODE_DATA [mem 0x203ffdfde3c0-0x203ffdfdffff]


2019-07-23 14:45:52

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 0/3] arm64: Allow early timestamping of kernel log

On Mon, 22 Jul 2019 21:52:42 +0100,
Pavel Tatashin <[email protected]> wrote:
>
> On Mon, Jul 22, 2019 at 3:33 AM Marc Zyngier <[email protected]> wrote:
> >
> > So far, we've let the arm64 kernel start its meaningful time stamping
> > of the kernel log pretty late, which is caused by sched_clock() being
> > initialised rather late compared to other architectures.
> >
> > Pavel Tatashin proposed[1] to move the initialisation of sched_clock
> > much earlier, which I had objections to. The reason for initialising
> > sched_clock late is that a number of systems have broken counters, and
> > we need to apply all kind of terrifying workarounds to avoid time
> > going backward on the affected platforms. Being able to identify the
> > right workaround comes pretty late in the kernel boot, and providing
> > an unreliable sched_clock, even for a short period of time, isn't an
> > appealing prospect.
> >
> > To address this, I'm proposing that we allow an architecture to chose
> > to (1) divorce time stamping and sched_clock during the early phase of
> > booting, and (2) inherit the time stamping clock as the new epoch the
> > first time a sched_sched clock gets registered.
>
> Could we have a stable clock config for arm64: if it is known that
> this Linux build is going to run on non-broken firmware, and with a
> known stable cntvct_el0 register it can be optionally configured to
> use that stable sched_clock instead of generic clock that arm64 is
> using?

Hmmm. Then I guess a prerequisite is this patch:

https://lwn.net/Articles/490040/

as the number of systems this will reliably run on is a close
approximation to zero. Yes, counting is hard...

More seriously, we've been there before (cue the 32bit ARM port 8
years ago), and really don't want to go back to a time where we had
multiple config options for everything. There is one kernel, which
should run *reliably* on everything.

> This way, the early printk are going to be available on those
> systems without adding a different method for printk's only. It would
> also mean that other users of early sched_clock() such as ftrace could
> benefit from it.

See above. But maybe the real thing to do is to allow local_clock() to
be overridden. Same effect, same complexity, just hidden away from any
given subsystem.

Thanks,

M.

--
Jazz is not dead, it just smells funny.

2019-09-25 23:09:42

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 0/3] arm64: Allow early timestamping of kernel log

On Tue, Jul 23, 2019 at 3:17 AM Marc Zyngier <[email protected]> wrote:
>
> On Mon, 22 Jul 2019 21:52:42 +0100,
> Pavel Tatashin <[email protected]> wrote:
> >
> > On Mon, Jul 22, 2019 at 3:33 AM Marc Zyngier <[email protected]> wrote:
> > >
> > > So far, we've let the arm64 kernel start its meaningful time stamping
> > > of the kernel log pretty late, which is caused by sched_clock() being
> > > initialised rather late compared to other architectures.
> > >
> > > Pavel Tatashin proposed[1] to move the initialisation of sched_clock
> > > much earlier, which I had objections to. The reason for initialising
> > > sched_clock late is that a number of systems have broken counters, and
> > > we need to apply all kind of terrifying workarounds to avoid time
> > > going backward on the affected platforms. Being able to identify the
> > > right workaround comes pretty late in the kernel boot, and providing
> > > an unreliable sched_clock, even for a short period of time, isn't an
> > > appealing prospect.
> > >
> > > To address this, I'm proposing that we allow an architecture to chose
> > > to (1) divorce time stamping and sched_clock during the early phase of
> > > booting, and (2) inherit the time stamping clock as the new epoch the
> > > first time a sched_sched clock gets registered.

Hi Marc,

I know we briefly discussed this at plumbers, but I want to bring it
up again, because I am still puzzled why it is not possible to
stabilize unstable clock early in boot.

Here is an example where clock is stabilized:
https://soleen.com/source/xref/linux/kernel/sched/clock.c?r=457c8996#265

It uses a value that is read at last ticks to normalize clock, and
because ticks are not available early in boot instead we can make sure
that early in boot sched_clock() never returns value smaller than
previously returned value, and if we want to be extra careful, we can
also make sure that sched_clock() early in boot does not jump ahead by
more than some fixed amount of time i.e. more than one hour.

If sched_clock() is available early we will get the benefit of having
other tracers that use it to debug early boot information.

Pasha