2018-06-15 17:46:11

by Pavel Tatashin

Subject: [PATCH v10 0/7] Early boot time stamps for x86

changelog
---------
v10 - v9
- Added another patch to this series that removes the dependency
between the KVM clock and the memblock allocator. The benefit is that
all clocks can now be initialized even earlier.
v9 - v8
- Addressed more comments from Dou Liyang

v8 - v7
- Addressed comments from Dou Liyang:
- Moved tsc_early_init() and tsc_early_fini() to be all inside
tsc.c, and changed them to be static.
- Removed warning when notsc parameter is used.
- Merged with:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

v7 - v6
- Removed tsc_disabled flag; now notsc is equivalent to
tsc=unstable
- Simplified changes to sched/clock.c, by removing the
sched_clock_early() and friends as requested by Peter Zijlstra.
We now always use sched_clock().
- Modified x86 sched_clock() to return either early boot time or
regular.
- Added another example of why early boot time is important

v6 - v5
- Added a new patch:
time: sync read_boot_clock64() with persistent clock
which fixes the missing __init macro and enables the time discrepancy
fix that was noted by Thomas Gleixner
- Split "x86/time: read_boot_clock64() implementation" into a
separate patch

v5 - v4
- Fix compiler warnings on systems with stable clocks.

v4 - v3
- Fixed tsc_early_fini() call to be in the 2nd patch as reported
by Dou Liyang
- Improved comment before __use_sched_clock_early to explain why
we need both booleans.
- Simplified valid_clock logic in read_boot_clock64().

v3 - v2
- Addressed comment from Thomas Gleixner
- Timestamps are available a little later in boot but still much
earlier than in mainline. This significantly simplified this
work.

v2 - v1
In patch "x86/tsc: tsc early":
- added tsc_adjusted_early()
- fixed 32-bit compile error by using do_div()

The early boot time stamps were discussed recently in these threads:
http://lkml.kernel.org/r/[email protected]
http://lkml.kernel.org/r/[email protected]

I updated my series to the latest mainline and am sending it again.

Peter mentioned he did not like patches 6 and 7, and we can discuss a better
way to do that, but I think patches 1-5 can be accepted separately, since
they already enable early timestamps on platforms where sched_clock() is
available early, such as KVM.

Adding early boot time stamp support for x86 machines.
SPARC patches for early boot time stamps are already integrated into
mainline Linux.

Sample output
-------------
Before:
https://paste.ubuntu.com/26133428/

After:
https://paste.ubuntu.com/26133523/

For examples of how early time stamps are used, see this work:
Example 1:
https://lwn.net/Articles/734374/
- Without early boot time stamps we would not know about the extra time
that is spent zeroing struct pages early in boot even when deferred
page initialization is used.

Example 2:
https://patchwork.kernel.org/patch/10021247/
- If early boot timestamps were available, the engineer who introduced
this bug would have noticed the extra time that is spent early in boot.

Example 3:
http://lkml.kernel.org/r/[email protected]
- Needed early time stamps to show improvement

Pavel Tatashin (7):
x86/tsc: remove tsc_disabled flag
time: sync read_boot_clock64() with persistent clock
x86/time: read_boot_clock64() implementation
sched: early boot clock
kvm/x86: remove kvm memblock dependency
x86/paravirt: add active_sched_clock to pv_time_ops
x86/tsc: use tsc early

arch/arm/kernel/time.c | 2 +-
arch/s390/kernel/time.c | 2 +-
arch/x86/include/asm/paravirt.h | 2 +-
arch/x86/include/asm/paravirt_types.h | 1 +
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/kvmclock.c | 64 ++------------
arch/x86/kernel/paravirt.c | 1 +
arch/x86/kernel/setup.c | 7 +-
arch/x86/kernel/time.c | 30 +++++++
arch/x86/kernel/tsc.c | 116 ++++++++++++++++++++++----
arch/x86/xen/time.c | 7 +-
include/linux/timekeeping.h | 2 +-
kernel/sched/clock.c | 10 ++-
kernel/time/timekeeping.c | 8 +-
14 files changed, 165 insertions(+), 88 deletions(-)

--
2.17.1



2018-06-15 17:44:16

by Pavel Tatashin

Subject: [PATCH v10 5/7] kvm/x86: remove kvm memblock dependency

The KVM clock is initialized later than other hypervisor clocks because it
has a dependency on the memblock allocator.

Remove this dependency by using static memory from the BSS instead of
allocating it; the pattern is sketched below.

The benefits:
- remove the ifdef from common code
- earlier availability of the TSC
- remove the dependency on memblock, and reduce code
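
As a rough illustration of that pattern, here is a minimal userspace sketch
(stand-in names and sizes, not the kernel code itself): page-aligned,
zero-initialized storage is reserved in .bss at build time, so it is usable
before any allocator is up.

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
#define NR_CPUS		64	/* illustrative build-time constant */

/* stand-in for struct pvclock_vsyscall_time_info */
struct pvti { unsigned long pad[8]; };

/* one page-aligned, zero-initialized region sized for all possible CPUs */
static unsigned char hv_clock_mem[PAGE_ALIGN(sizeof(struct pvti) * NR_CPUS)]
	__attribute__((aligned(PAGE_SIZE)));

int main(void)
{
	struct pvti *hv_clock = (struct pvti *)hv_clock_mem;

	printf("reserved %zu bytes at %p without any allocator\n",
	       sizeof(hv_clock_mem), (void *)hv_clock);
	return 0;
}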

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/kvmclock.c | 64 ++++++--------------------------------
arch/x86/kernel/setup.c | 7 ++---
3 files changed, 12 insertions(+), 60 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5b2300b818af..c65c232d3ddd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -628,6 +628,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.name = "KVM",
.detect = kvm_detect,
.type = X86_HYPER_KVM,
+ .init.init_platform = kvmclock_init,
.init.guest_late_init = kvm_guest_init,
.init.x2apic_available = kvm_para_available,
};
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index bf8d1eb7fca3..01558d41ec2c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,9 +23,9 @@
#include <asm/apic.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
-#include <linux/memblock.h>
#include <linux/sched.h>
#include <linux/sched/clock.h>
+#include <linux/mm.h>

#include <asm/mem_encrypt.h>
#include <asm/x86_init.h>
@@ -44,6 +44,11 @@ static int parse_no_kvmclock(char *arg)
}
early_param("no-kvmclock", parse_no_kvmclock);

+/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
+#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define WALL_CLOCK_SIZE (sizeof(struct pvclock_wall_clock))
+static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
/* The hypervisor will put information about time periodically here */
static struct pvclock_vsyscall_time_info *hv_clock;
static struct pvclock_wall_clock *wall_clock;
@@ -244,43 +249,12 @@ static void kvm_shutdown(void)
native_machine_shutdown();
}

-static phys_addr_t __init kvm_memblock_alloc(phys_addr_t size,
- phys_addr_t align)
-{
- phys_addr_t mem;
-
- mem = memblock_alloc(size, align);
- if (!mem)
- return 0;
-
- if (sev_active()) {
- if (early_set_memory_decrypted((unsigned long)__va(mem), size))
- goto e_free;
- }
-
- return mem;
-e_free:
- memblock_free(mem, size);
- return 0;
-}
-
-static void __init kvm_memblock_free(phys_addr_t addr, phys_addr_t size)
-{
- if (sev_active())
- early_set_memory_encrypted((unsigned long)__va(addr), size);
-
- memblock_free(addr, size);
-}
-
void __init kvmclock_init(void)
{
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned long mem, mem_wall_clock;
- int size, cpu, wall_clock_size;
+ int cpu;
u8 flags;

- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
if (!kvm_para_available())
return;

@@ -290,28 +264,11 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;

- wall_clock_size = PAGE_ALIGN(sizeof(struct pvclock_wall_clock));
- mem_wall_clock = kvm_memblock_alloc(wall_clock_size, PAGE_SIZE);
- if (!mem_wall_clock)
- return;
-
- wall_clock = __va(mem_wall_clock);
- memset(wall_clock, 0, wall_clock_size);
-
- mem = kvm_memblock_alloc(size, PAGE_SIZE);
- if (!mem) {
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
- wall_clock = NULL;
- return;
- }
-
- hv_clock = __va(mem);
- memset(hv_clock, 0, size);
+ wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
+ hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;

if (kvm_register_clock("primary cpu clock")) {
hv_clock = NULL;
- kvm_memblock_free(mem, size);
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
wall_clock = NULL;
return;
}
@@ -354,13 +311,10 @@ int __init kvm_setup_vsyscall_timeinfo(void)
int cpu;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned int size;

if (!hv_clock)
return 0;

- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
cpu = get_cpu();

vcpu_time = &hv_clock[cpu].pvti;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..5194d9c38a43 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1012,6 +1012,8 @@ void __init setup_arch(char **cmdline_p)
*/
init_hypervisor_platform();

+ tsc_early_delay_calibrate();
+
x86_init.resources.probe_roms();

/* after parse_early_param, so could debug it */
@@ -1197,11 +1199,6 @@ void __init setup_arch(char **cmdline_p)

memblock_find_dma_reserve();

-#ifdef CONFIG_KVM_GUEST
- kvmclock_init();
-#endif
-
- tsc_early_delay_calibrate();
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();

--
2.17.1


2018-06-15 17:44:42

by Pavel Tatashin

Subject: [PATCH v10 4/7] sched: early boot clock

Allow sched_clock() to be used before sched_clock_init() and
sched_clock_init_late() are called. This provides us with a way to get
early boot timestamps on machines with unstable clocks.
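
To illustrate the hand-off (a minimal userspace sketch with made-up numbers,
not kernel code): sched_clock_init_late() below picks __gtod_offset so that
the GTOD-based side of the clock (ktime plus __gtod_offset) continues exactly
where the early sched_clock()-based reading left off, i.e. there is no jump
at the switch.

#include <stdio.h>

int main(void)
{
	/* early path: sched_clock() + __sched_clock_offset at init_late time */
	unsigned long long early_ns = 10000000000ULL;
	/* GTOD path: ktime_get_ns() at the same moment */
	unsigned long long gtod_ns  =  9700000000ULL;

	unsigned long long gtod_offset = early_ns - gtod_ns;

	printf("__gtod_offset = %llu ns\n", gtod_offset);
	printf("gtod-based clock resumes at %llu ns == early clock %llu ns\n",
	       gtod_ns + gtod_offset, early_ns);
	return 0;
}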

Signed-off-by: Pavel Tatashin <[email protected]>
---
kernel/sched/clock.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 10c83e73837a..f034392b0f6c 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -205,6 +205,11 @@ void clear_sched_clock_stable(void)
*/
static int __init sched_clock_init_late(void)
{
+ /* Transition to unstable clock from early clock */
+ local_irq_disable();
+ __gtod_offset = sched_clock() + __sched_clock_offset - ktime_get_ns();
+ local_irq_enable();
+
sched_clock_running = 2;
/*
* Ensure that it is impossible to not do a static_key update.
@@ -350,8 +355,9 @@ u64 sched_clock_cpu(int cpu)
if (sched_clock_stable())
return sched_clock() + __sched_clock_offset;

- if (unlikely(!sched_clock_running))
- return 0ull;
+ /* Use early clock until sched_clock_init_late() */
+ if (unlikely(sched_clock_running < 2))
+ return sched_clock() + __sched_clock_offset;

preempt_disable_notrace();
scd = cpu_sdc(cpu);
--
2.17.1


2018-06-15 17:45:45

by Pavel Tatashin

Subject: [PATCH v10 1/7] x86/tsc: remove tsc_disabled flag

tsc_disabled is set when notsc is passed as kernel parameter. The reason we
have notsc is to avoid timing problems on multi-socket systems. We already
have a mechanism, however, to detect and resolve these issues by invoking
tsc unstable path. Thus, make notsc to behave the same as tsc=unstable.

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Dou Liyang <[email protected]>
---
arch/x86/kernel/tsc.c | 18 +++---------------
1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 74392d9d51e0..186395041725 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -38,11 +38,6 @@ EXPORT_SYMBOL(tsc_khz);
*/
static int __read_mostly tsc_unstable;

-/* native_sched_clock() is called before tsc_init(), so
- we must start with the TSC soft disabled to prevent
- erroneous rdtsc usage on !boot_cpu_has(X86_FEATURE_TSC) processors */
-static int __read_mostly tsc_disabled = -1;
-
static DEFINE_STATIC_KEY_FALSE(__use_tsc);

int tsc_clocksource_reliable;
@@ -248,8 +243,7 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
#ifdef CONFIG_X86_TSC
int __init notsc_setup(char *str)
{
- pr_warn("Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely\n");
- tsc_disabled = 1;
+ mark_tsc_unstable("boot parameter notsc");
return 1;
}
#else
@@ -1307,7 +1301,7 @@ static void tsc_refine_calibration_work(struct work_struct *work)

static int __init init_tsc_clocksource(void)
{
- if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_disabled > 0 || !tsc_khz)
+ if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
return 0;

if (tsc_unstable)
@@ -1414,12 +1408,6 @@ void __init tsc_init(void)
set_cyc2ns_scale(tsc_khz, cpu, cyc);
}

- if (tsc_disabled > 0)
- return;
-
- /* now allow native_sched_clock() to use rdtsc */
-
- tsc_disabled = 0;
static_branch_enable(&__use_tsc);

if (!no_sched_irq_time)
@@ -1455,7 +1443,7 @@ unsigned long calibrate_delay_is_known(void)
int constant_tsc = cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC);
const struct cpumask *mask = topology_core_cpumask(cpu);

- if (tsc_disabled || !constant_tsc || !mask)
+ if (!constant_tsc || !mask)
return 0;

sibling = cpumask_any_but(mask, cpu);
--
2.17.1


2018-06-15 17:45:55

by Pavel Tatashin

Subject: [PATCH v10 7/7] x86/tsc: use tsc early

This patch adds early clock feature to x86 platforms.

tsc_early_init():
Determines offset, shift and multiplier for the early clock based on the
TSC frequency.

tsc_early_fini()
Implement the finish part of early tsc feature, prints message about the
offset, which can be useful to find out how much time was spent in post and
boot manager (if TSC starts from 0 during boot)

sched_clock_early():
TSC based implementation of early clock and is called from sched_clock().

start_early_clock():
Calls tsc_early_init(), and makes sched_clock() to use early boot clock

set_final_clock():
Sets the final clock which is either platform specific or
native_sched_clock(). Also calls tsc_early_fini() if early clock was
previously initialized.

Call start_early_clock() to start using early boot time stamps
functionality on the supported x86 platforms, and call set_final_clock() to
finish this feature and switch back to the default clock. The supported x86
systems are those where TSC frequency is determined early in boot.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/paravirt.h | 2 +-
arch/x86/kernel/tsc.c | 98 ++++++++++++++++++++++++++++++++-
2 files changed, 97 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d49bbf4bb5c8..553b6c81c320 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -172,7 +172,7 @@ static inline int rdmsrl_safe(unsigned msr, unsigned long long *p)

static inline unsigned long long paravirt_sched_clock(void)
{
- return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock);
+ return PVOP_CALL0(unsigned long long, pv_time_ops.active_sched_clock);
}

struct static_key;
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 186395041725..59772750d634 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -182,6 +182,84 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
local_irq_restore(flags);
}

+static struct cyc2ns_data cyc2ns_early;
+
+static u64 sched_clock_early(void)
+{
+ u64 ns = mul_u64_u32_shr(rdtsc(), cyc2ns_early.cyc2ns_mul,
+ cyc2ns_early.cyc2ns_shift);
+ return ns + cyc2ns_early.cyc2ns_offset;
+}
+
+/*
+ * Initialize clock for early time stamps
+ */
+static void __init tsc_early_init(unsigned int khz)
+{
+ clocks_calc_mult_shift(&cyc2ns_early.cyc2ns_mul,
+ &cyc2ns_early.cyc2ns_shift,
+ khz, NSEC_PER_MSEC, 0);
+ cyc2ns_early.cyc2ns_offset = -sched_clock_early();
+}
+
+/*
+ * Finish clock for early time stamps, and hand over to permanent clock by
+ * setting __sched_clock_offset appropriately for continued time keeping.
+ */
+static void __init tsc_early_fini(void)
+{
+ unsigned long long t;
+ unsigned long r;
+
+ t = -cyc2ns_early.cyc2ns_offset;
+ r = do_div(t, NSEC_PER_SEC);
+
+ __sched_clock_offset = sched_clock_early() - sched_clock();
+ pr_info("early sched clock is finished, offset [%lld.%09lds]\n", t, r);
+}
+
+#ifdef CONFIG_PARAVIRT
+static inline void __init start_early_clock(void)
+{
+ tsc_early_init(tsc_khz);
+ pv_time_ops.active_sched_clock = sched_clock_early;
+}
+
+static inline void __init set_final_clock(void)
+{
+ pv_time_ops.active_sched_clock = pv_time_ops.sched_clock;
+
+ /* We did not have early sched clock if multiplier is 0 */
+ if (cyc2ns_early.cyc2ns_mul)
+ tsc_early_fini();
+}
+#else /* CONFIG_PARAVIRT */
+/*
+ * For native clock we use two switches static and dynamic, the static switch is
+ * initially true, so we check the dynamic switch, which is initially false.
+ * Later when early clock is disabled, we can alter the static switch in order
+ * to avoid branch check on every sched_clock() call.
+ */
+static bool __tsc_early;
+static DEFINE_STATIC_KEY_TRUE(__tsc_early_static);
+
+static inline void __init start_early_clock(void)
+{
+ tsc_early_init(tsc_khz);
+ __tsc_early = true;
+}
+
+static inline void __init set_final_clock(void)
+{
+ __tsc_early = false;
+ static_branch_disable(&__tsc_early_static);
+
+ /* We did not have early sched clock if multiplier is 0 */
+ if (cyc2ns_early.cyc2ns_mul)
+ tsc_early_fini();
+}
+#endif /* CONFIG_PARAVIRT */
+
/*
* Scheduler clock - returns current time in nanosec units.
*/
@@ -194,6 +272,13 @@ u64 native_sched_clock(void)
return cycles_2_ns(tsc_now);
}

+#ifndef CONFIG_PARAVIRT
+ if (static_branch_unlikely(&__tsc_early_static)) {
+ if (__tsc_early)
+ return sched_clock_early();
+ }
+#endif /* !CONFIG_PARAVIRT */
+
/*
* Fall back to jiffies if there's no TSC available:
* ( But note that we still use it if the TSC is marked
@@ -1352,6 +1437,7 @@ void __init tsc_early_delay_calibrate(void)
lpj = tsc_khz * 1000;
do_div(lpj, HZ);
loops_per_jiffy = lpj;
+ start_early_clock();
}

void __init tsc_init(void)
@@ -1361,7 +1447,7 @@ void __init tsc_init(void)

if (!boot_cpu_has(X86_FEATURE_TSC)) {
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
- return;
+ goto final_sched_clock;
}

cpu_khz = x86_platform.calibrate_cpu();
@@ -1380,7 +1466,7 @@ void __init tsc_init(void)
if (!tsc_khz) {
mark_tsc_unstable("could not calculate TSC khz");
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
- return;
+ goto final_sched_clock;
}

pr_info("Detected %lu.%03lu MHz processor\n",
@@ -1428,6 +1514,14 @@ void __init tsc_init(void)

clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
detect_art();
+final_sched_clock:
+ /*
+ * Final sched clock is either platform specific clock when
+ * CONFIG_PARAVIRT is defined, or native_sched_clock() with disabled
+ * static branch for early tsc clock. We must call this function even if
+ * start_early_clock() was never called.
+ */
+ set_final_clock();
}

#ifdef CONFIG_SMP
--
2.17.1


2018-06-15 17:46:06

by Pavel Tatashin

Subject: [PATCH v10 2/7] time: sync read_boot_clock64() with persistent clock

read_boot_clock64() returns a boot start timestamp from epoch. Some arches
may need to access the persistent clock interface in order to calculate the
epoch offset. The resolution of the persistent clock, however, might be
low. Therefore, in order to avoid time discrepancies a new argument 'now'
is added to read_boot_clock64() parameters. Arch may decide to use it
instead of accessing persistent clock again.

Also, change read_boot_clock64() to have an __init prototype, since it is
called only during boot.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm/kernel/time.c | 2 +-
arch/s390/kernel/time.c | 2 +-
include/linux/timekeeping.h | 2 +-
kernel/time/timekeeping.c | 8 ++++++--
4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c
index cf2701cb0de8..40fd2e036eef 100644
--- a/arch/arm/kernel/time.c
+++ b/arch/arm/kernel/time.c
@@ -90,7 +90,7 @@ void read_persistent_clock64(struct timespec64 *ts)
__read_persistent_clock(ts);
}

-void read_boot_clock64(struct timespec64 *ts)
+void __init read_boot_clock64(struct timespec64 *now, struct timespec64 *ts)
{
__read_boot_clock(ts);
}
diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index cf561160ea88..9b0c69e3d795 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -221,7 +221,7 @@ void read_persistent_clock64(struct timespec64 *ts)
ext_to_timespec64(clk, ts);
}

-void read_boot_clock64(struct timespec64 *ts)
+void __init read_boot_clock64(struct timespec64 *now, struct timespec64 *ts)
{
unsigned char clk[STORE_CLOCK_EXT_SIZE];
__u64 delta;
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 86bc2026efce..6b607e8bedfa 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -243,7 +243,7 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
extern int persistent_clock_is_local;

extern void read_persistent_clock64(struct timespec64 *ts);
-extern void read_boot_clock64(struct timespec64 *ts);
+extern void read_boot_clock64(struct timespec64 *now, struct timespec64 *ts);
extern int update_persistent_clock64(struct timespec64 now);

/*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4786df904c22..fb6898fab374 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1502,9 +1502,13 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
* Function to read the exact time the system has been started.
* Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
*
+ * Argument 'now' contains time from persistent clock to calculate offset from
+ * epoch. May contain zeros if persist ant clock is not available.
+ *
* XXX - Do be sure to remove it once all arches implement it.
*/
-void __weak read_boot_clock64(struct timespec64 *ts)
+void __weak __init read_boot_clock64(struct timespec64 *now,
+ struct timespec64 *ts)
{
ts->tv_sec = 0;
ts->tv_nsec = 0;
@@ -1535,7 +1539,7 @@ void __init timekeeping_init(void)
} else if (now.tv_sec || now.tv_nsec)
persistent_clock_exists = true;

- read_boot_clock64(&boot);
+ read_boot_clock64(&now, &boot);
if (!timespec64_valid_strict(&boot)) {
pr_warn("WARNING: Boot clock returned invalid value!\n"
" Check your CMOS/BIOS settings.\n");
--
2.17.1


2018-06-15 17:46:10

by Pavel Tatashin

Subject: [PATCH v10 6/7] x86/paravirt: add active_sched_clock to pv_time_ops

Early boot clock might differ from the clock that is used later on,
therefore add a new field to pv_time_ops, that shows currently active
clock. If platform supports early boot clock, this field will be changed
to use that clock early in boot, and later will be replaced with the
permanent clock.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/paravirt_types.h | 1 +
arch/x86/kernel/paravirt.c | 1 +
arch/x86/xen/time.c | 7 ++++---
3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 180bc0bff0fb..d555b66c5bbb 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -99,6 +99,7 @@ struct pv_lazy_ops {
struct pv_time_ops {
unsigned long long (*sched_clock)(void);
unsigned long long (*steal_clock)(int cpu);
+ unsigned long long (*active_sched_clock)(void);
} __no_randomize_layout;

struct pv_cpu_ops {
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 99dc79e76bdc..5bfdb72f152f 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -320,6 +320,7 @@ struct pv_init_ops pv_init_ops = {
struct pv_time_ops pv_time_ops = {
.sched_clock = native_sched_clock,
.steal_clock = native_steal_clock,
+ .active_sched_clock = native_sched_clock,
};

__visible struct pv_irq_ops pv_irq_ops = {
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index e0f1bcf01d63..29b07020c827 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -505,8 +505,8 @@ static void __init xen_time_init(void)

void __ref xen_init_time_ops(void)
{
- pv_time_ops = xen_time_ops;
-
+ pv_time_ops.sched_clock = xen_time_ops.sched_clock;
+ pv_time_ops.steal_clock = xen_time_ops.steal_clock;
x86_init.timers.timer_init = xen_time_init;
x86_init.timers.setup_percpu_clockev = x86_init_noop;
x86_cpuinit.setup_percpu_clockev = x86_init_noop;
@@ -547,7 +547,8 @@ void __init xen_hvm_init_time_ops(void)
return;
}

- pv_time_ops = xen_time_ops;
+ pv_time_ops.sched_clock = xen_time_ops.sched_clock;
+ pv_time_ops.steal_clock = xen_time_ops.steal_clock;
x86_init.timers.setup_percpu_clockev = xen_time_init;
x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;

--
2.17.1


2018-06-15 17:46:37

by Pavel Tatashin

Subject: [PATCH v10 3/7] x86/time: read_boot_clock64() implementation

read_boot_clock64() returns time of when system was started. Now, that
early boot clock is going to be available on x86 it is possible to
implement x86 specific version of read_boot_clock64() that takes advantage
of this new feature.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/time.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
index 774ebafa97c4..32dff35719d9 100644
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -15,6 +15,7 @@
#include <linux/i8253.h>
#include <linux/time.h>
#include <linux/export.h>
+#include <linux/sched/clock.h>

#include <asm/vsyscall.h>
#include <asm/x86_init.h>
@@ -104,3 +105,32 @@ void __init time_init(void)
{
late_time_init = x86_late_time_init;
}
+
+/*
+ * Called once during boot to initialize the boot time.
+ * This function returns a timestamp in timespec format, which is sec/nsec
+ * from the epoch of when boot started.
+ * We use sched_clock_cpu(), which gives us nanoseconds from when this clock
+ * was started; it happens quite early during the boot process. To calculate
+ * the offset from epoch we use the information provided in 'now' by the caller.
+ *
+ * If sched_clock_cpu() is not available or if there is any kind of error,
+ * i.e. the time from epoch is smaller than the boot time, we must return zeros
+ * in ts, and the caller will take care of the error by assuming that the time
+ * when this function was called is the beginning of boot time.
+ */
+void __init read_boot_clock64(struct timespec64 *now, struct timespec64 *ts)
+{
+ u64 ns_boot = sched_clock_cpu(smp_processor_id());
+ bool valid_clock;
+ u64 ns_now;
+
+ ns_now = timespec64_to_ns(now);
+ valid_clock = ns_boot && timespec64_valid_strict(now) &&
+ (ns_now > ns_boot);
+
+ if (!valid_clock)
+ *ts = (struct timespec64){0, 0};
+ else
+ *ts = ns_to_timespec64(ns_now - ns_boot);
+}
--
2.17.1


2018-06-18 08:43:49

by Andy Shevchenko

Subject: Re: [PATCH v10 3/7] x86/time: read_boot_clock64() implementation

On Fri, Jun 15, 2018 at 8:48 PM Pavel Tatashin
<[email protected]> wrote:
>
> read_boot_clock64() returns time of when system was started. Now, that
> early boot clock is going to be available on x86 it is possible to
> implement x86 specific version of read_boot_clock64() that takes advantage
> of this new feature.

> +void __init read_boot_clock64(struct timespec64 *now, struct timespec64 *ts)
> +{
> + u64 ns_boot = sched_clock_cpu(smp_processor_id());
> + bool valid_clock;
> + u64 ns_now;
> +
> + ns_now = timespec64_to_ns(now);
> + valid_clock = ns_boot && timespec64_valid_strict(now) &&
> + (ns_now > ns_boot);
> +

> + if (!valid_clock)
> + *ts = (struct timespec64){0, 0};
> + else
> + *ts = ns_to_timespec64(ns_now - ns_boot);



> +}


--
With Best Regards,
Andy Shevchenko

2018-06-18 08:47:25

by Andy Shevchenko

Subject: Re: [PATCH v10 3/7] x86/time: read_boot_clock64() implementation

On Mon, Jun 18, 2018 at 11:42 AM Andy Shevchenko
<[email protected]> wrote:
>
> On Fri, Jun 15, 2018 at 8:48 PM Pavel Tatashin <[email protected]> wrote:
> >
> > read_boot_clock64() returns time of when system was started. Now, that
> > early boot clock is going to be available on x86 it is possible to
> > implement x86 specific version of read_boot_clock64() that takes advantage
> > of this new feature.
>

Oops, sorry for previous empty mail.

> > +void __init read_boot_clock64(struct timespec64 *now, struct timespec64 *ts)
> > +{
> > + u64 ns_boot = sched_clock_cpu(smp_processor_id());
> > + bool valid_clock;
> > + u64 ns_now;
> > +
> > + ns_now = timespec64_to_ns(now);
> > + valid_clock = ns_boot && timespec64_valid_strict(now) &&
> > + (ns_now > ns_boot);
> > +
>


> > + if (!valid_clock)

Are we expecting more often clock to be non-valid?
Perhaps change to positive conditional?

> > + *ts = (struct timespec64){0, 0};

I dunno if additional variable would be better for readability, like

struct timespec64 null_ts = {0,0};
...
*ts = null_ts;

> > + else
> > + *ts = ns_to_timespec64(ns_now - ns_boot);

But I'm fine as long as Thomas is okay with this code.

> > +}


--
With Best Regards,
Andy Shevchenko

2018-06-19 14:26:25

by Pavel Tatashin

Subject: Re: [PATCH v10 3/7] x86/time: read_boot_clock64() implementation

> > > +void __init read_boot_clock64(struct timespec64 *now, struct timespec64 *ts)
> > > +{
> > > + u64 ns_boot = sched_clock_cpu(smp_processor_id());
> > > + bool valid_clock;
> > > + u64 ns_now;
> > > +
> > > + ns_now = timespec64_to_ns(now);
> > > + valid_clock = ns_boot && timespec64_valid_strict(now) &&
> > > + (ns_now > ns_boot);
> > > +
> >
>
>
> > > + if (!valid_clock)
>
> Are we expecting more often clock to be non-valid?
> Perhaps change to positive conditional?

Hi Andy,

Sure, I will change to:
if (valid_clock)
...
else
...

>
> > > + *ts = (struct timespec64){0, 0};
>
> I dunno if additional variable would be better for readability, like
>
> struct timespec64 null_ts = {0,0};

I don't mind adding ts_null, but I think, as-is ok here,

> ...
> *ts = null_ts;
>
> > > + else
> > > + *ts = ns_to_timespec64(ns_now - ns_boot);
>
> But I'm fine as long as Thomas is okay with this code.
>

Thank you for the review!

Pavel

2018-06-19 17:35:12

by Thomas Gleixner

Subject: Re: [PATCH v10 1/7] x86/tsc: remove tsc_disabled flag

On Fri, 15 Jun 2018, Pavel Tatashin wrote:

> tsc_disabled is set when notsc is passed as kernel parameter. The reason we
> have notsc is to avoid timing problems on multi-socket systems. We already
> have a mechanism, however, to detect and resolve these issues by invoking
> tsc unstable path. Thus, make notsc to behave the same as tsc=unstable.

notsc also excludes TSC from being used at all, while with your change its
suddenly used in sched_clock().

I'm not opposed to that change but the changelog is completely misleading
as it suggests that this is a change which has no functional impact.

Thanks,

tglx

2018-06-19 17:52:47

by Pavel Tatashin

Subject: Re: [PATCH v10 1/7] x86/tsc: remove tsc_disabled flag

> > tsc_disabled is set when notsc is passed as kernel parameter. The reason we
> > have notsc is to avoid timing problems on multi-socket systems. We already
> > have a mechanism, however, to detect and resolve these issues by invoking
> > tsc unstable path. Thus, make notsc to behave the same as tsc=unstable.
>
> notsc also excludes TSC from being used at all, while with your change its
> suddenly used in sched_clock().

Hi Thomas,

Thank you for the review. I will reword the patch comment. Currently,
at the end it says: "Thus, make notsc to behave the same as
tsc=unstable." To emphasize this, I will change the patch title to
this:

x86/tsc: redefine notsc to behave as tsc=unstable

And in the comment I will mention that as a consequence of this change
tsc is going to be used by sched_clock() when notsc is provided.

Thank you,
Pavel

2018-06-19 18:51:44

by Peter Zijlstra

Subject: Re: [PATCH v10 1/7] x86/tsc: remove tsc_disabled flag

On Tue, Jun 19, 2018 at 07:32:49PM +0200, Thomas Gleixner wrote:
> On Fri, 15 Jun 2018, Pavel Tatashin wrote:
>
> > tsc_disabled is set when notsc is passed as kernel parameter. The reason we
> > have notsc is to avoid timing problems on multi-socket systems. We already
> > have a mechanism, however, to detect and resolve these issues by invoking
> > tsc unstable path. Thus, make notsc to behave the same as tsc=unstable.
>
> notsc also excludes TSC from being used at all

It does not; there is TSC usage even if you boot with notsc on. See how
it does not clear X86_FEATURE_TSC for instance.



2018-06-19 19:14:36

by Thomas Gleixner

Subject: Re: [PATCH v10 1/7] x86/tsc: remove tsc_disabled flag

On Tue, 19 Jun 2018, Peter Zijlstra wrote:
> On Tue, Jun 19, 2018 at 07:32:49PM +0200, Thomas Gleixner wrote:
> > On Fri, 15 Jun 2018, Pavel Tatashin wrote:
> >
> > > tsc_disabled is set when notsc is passed as kernel parameter. The reason we
> > > have notsc is to avoid timing problems on multi-socket systems. We already
> > > have a mechanism, however, to detect and resolve these issues by invoking
> > > tsc unstable path. Thus, make notsc to behave the same as tsc=unstable.
> >
> > notsc also excludes TSC from being used at all
>
> It does not; there is TSC usage even if you boot with notsc on. See how
> it does not clear X86_FEATURE_TSC for instance.

Well, kinda. There is some stuff in the apic calibration which uses TSC
independent of tsc_disabled, but that's about it.

Thanks,

tglx

2018-06-19 20:03:33

by Peter Zijlstra

Subject: Re: [PATCH v10 1/7] x86/tsc: remove tsc_disabled flag

On Tue, Jun 19, 2018 at 09:12:45PM +0200, Thomas Gleixner wrote:
> On Tue, 19 Jun 2018, Peter Zijlstra wrote:
> > It does not; there is TSC usage even if you boot with notsc on. See how
> > it does not clear X86_FEATURE_TSC for instance.
>
> Well, kinda. There is some stuff in the apic calibration which uses TSC
> independent of tsc_disabled, but that's about it.

Still, that precludes booting on hardware without TSC, so what's the
point of having notsc? I'm all for killing the thing.

2018-06-19 20:41:21

by Andy Shevchenko

Subject: Re: [PATCH v10 3/7] x86/time: read_boot_clock64() implementation

On Tue, Jun 19, 2018 at 5:23 PM, Pavel Tatashin
<[email protected]> wrote:

>> > > + *ts = (struct timespec64){0, 0};
>>
>> I dunno if additional variable would be better for readability, like
>>
>> struct timespec64 null_ts = {0,0};
>
> I don't mind adding ts_null, but I think, as-is ok here,

Actually I meant precisely null_ts, and the reason why is that the alias
is much easier to parse to get the idea than reading the entire "struct bla
bla bla".

But again, this is my personal point of view.

>
>> ...
>> *ts = null_ts;
>>
>> > > + else
>> > > + *ts = ns_to_timespec64(ns_now - ns_boot);
>>
>> But I'm fine as long as Thomas is okay with this code.

> Thank you for the review!

You're welcome!

--
With Best Regards,
Andy Shevchenko

2018-06-19 20:59:06

by Thomas Gleixner

Subject: Re: [PATCH v10 1/7] x86/tsc: remove tsc_disabled flag

On Tue, 19 Jun 2018, Peter Zijlstra wrote:

> On Tue, Jun 19, 2018 at 09:12:45PM +0200, Thomas Gleixner wrote:
> > On Tue, 19 Jun 2018, Peter Zijlstra wrote:
> > > It does not; there is TSC usage even if you boot with notsc on. See how
> > > it does not clear X86_FEATURE_TSC for instance.
> >
> > Well, kinda. There is some stuff in the apic calibration which uses TSC
> > independent of tsc_disabled, but that's about it.
>
> Still, that precludes booting on hardware without TSC, so what's the
> point of having notsc? I'm all for killing the thing.

I'm not arguing against removing it. I just refuse to accept changelogs
which suggest that there is no functional change.

Thanks,

tglx


2018-06-19 21:02:42

by Peter Zijlstra

Subject: Re: [PATCH v10 1/7] x86/tsc: remove tsc_disabled flag

On Tue, Jun 19, 2018 at 10:57:35PM +0200, Thomas Gleixner wrote:

> I'm not arguing against removing it. I just refuse to accept changelogs
> which suggest that there is no functional change.

Fair enough..

2018-06-19 21:16:29

by Thomas Gleixner

Subject: Re: [PATCH v10 2/7] time: sync read_boot_clock64() with persistent clock

On Fri, 15 Jun 2018, Pavel Tatashin wrote:

> read_boot_clock64() returns a boot start timestamp from epoch. Some arches
> may need to access the persistent clock interface in order to calculate the
> epoch offset. The resolution of the persistent clock, however, might be
> low. Therefore, in order to avoid time discrepancies a new argument 'now'
> is added to read_boot_clock64() parameters. Arch may decide to use it
> instead of accessing persistent clock again.

I kinda know what you are trying to say, but it's all but obvious.

> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 4786df904c22..fb6898fab374 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1502,9 +1502,13 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
> * Function to read the exact time the system has been started.
> * Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
> *
> + * Argument 'now' contains time from persistent clock to calculate offset from
> + * epoch. May contain zeros if persist ant clock is not available.

What's a 'persist ant' ?

> + *
> * XXX - Do be sure to remove it once all arches implement it.
> */
> -void __weak read_boot_clock64(struct timespec64 *ts)
> +void __weak __init read_boot_clock64(struct timespec64 *now,
> + struct timespec64 *ts)
> {
> ts->tv_sec = 0;
> ts->tv_nsec = 0;
> @@ -1535,7 +1539,7 @@ void __init timekeeping_init(void)
> } else if (now.tv_sec || now.tv_nsec)
> persistent_clock_exists = true;
>
> - read_boot_clock64(&boot);
> + read_boot_clock64(&now, &boot);

This interface was already bolted on and this extension just makes it
worse. If 'now' is invalid then you need the same checks as after
read_persistent_clock() replicated along with conditionals of some sort.

'boot' time is used to adjust the wall to monotonic offset. So looking at
the math behind all of that:

read_persistent_clock(&now);
read_boot_clock(&boot);

tk_set_xtime(tk, now)
tk->xtime_sec = now.sec;
tk->xtime_nsec = now.nsec;

tk_set_wall_to_mono(tk, -boot);
tk->wall_to_mono = boot;
tk->offs_real = -boot;

timekeeping_update(tk)
tk->tkr_mono = tk->xtime + tk->wall_to_mono;

tkr_mono.base is guaranteed to be >= 0. So you need to make sure that

tk->xtime + tk->wall_to_mono >= 0

Yet another check which you need to do in that magic function. That's just
wrong. If this grows more users then they all have to copy the same sanity
checks over and over and half of them will be wrong.

Fortunately there is only a single real user of read_boot_clock() in the
tree: S390. By virtue of being S390 there is no worry about any sanity
checks. It just works.

ARM has the function, but there is no single subarch which actually
implements the thing, so this is easy to fix by removing it.

So the right thing to do is the following:

Provide a new function:

void __weak read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
ktime_t *boot_offset)
{
read_persistent_clock(wall_time);
}

Then add the new function to S390:

void read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
ktime_t *boot_offset)
{
unsigned char clk[STORE_CLOCK_EXT_SIZE];
struct timespec64 boot_time;
__u64 delta;

read_persistent_clock(wall_time);

delta = initial_leap_seconds + TOD_UNIX_EPOCH;
memcpy(clk, tod_clock_base, 16);
*(__u64 *) &clk[1] -= delta;
if (*(__u64 *) &clk[1] > delta)
clk[0]--;
ext_to_timespec64(clk, boot_time);
*boot_offset = timespec64_to_ns(timespec64_sub(wall_time, boot_time));
}

Then rework timekeeping_init():

timekeeping_init()
struct timespec64 wall_time, wall_to_mono, offs;
ktime_t boot_offset = 0;

read_persistent_wall_and_boot_offset(&walltime, &boot_offset);

if (!valid(walltime))
boottime = wall_time.tv_sec = wall_time.tv_nsec = 0;
else
persistent_clock = true;

if (boot_offset > timespec64_to_nsec(wall_time))
offs.tv_sec = offs.tv_nsec = 0;
else
offs = ns_to_timespec64(boot_offset);

wall_to_mono = timespec64_sub(offs, wall_time);
tk_set_wall_to_mono(tk, wall_time);


After that remove the read_boot_time() leftovers all over the place. And
then the x86 implementation boils down to:

void read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
ktime_t *boot_offset)
{
x86_platform.get_wallclock(ts);
*boot_offset = sched_clock();
}

And doing it this way - by adding the extra math to the only architecture
on the planet which has sane timers - is the right thing to do because all
other architectures will have to fall back to tricks similar to x86 by
doing clocksource/schedclock based guestimation of the time where the
machine actually reached the kernel.

Hmm?

Thanks,

tglx

2018-06-19 21:29:46

by Pavel Tatashin

Subject: Re: [PATCH v10 2/7] time: sync read_boot_clock64() with persistent clock

On Tue, Jun 19, 2018 at 5:17 PM Thomas Gleixner <[email protected]> wrote:
>
> On Fri, 15 Jun 2018, Pavel Tatashin wrote:
>
> > read_boot_clock64() returns a boot start timestamp from epoch. Some arches
> > may need to access the persistent clock interface in order to calculate the
> > epoch offset. The resolution of the persistent clock, however, might be
> > low. Therefore, in order to avoid time discrepancies a new argument 'now'
> > is added to read_boot_clock64() parameters. Arch may decide to use it
> > instead of accessing persistent clock again.
>
> I kinda know what you are trying to say, but it's all but obvious.
>
> > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> > index 4786df904c22..fb6898fab374 100644
> > --- a/kernel/time/timekeeping.c
> > +++ b/kernel/time/timekeeping.c
> > @@ -1502,9 +1502,13 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
> > * Function to read the exact time the system has been started.
> > * Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
> > *
> > + * Argument 'now' contains time from persistent clock to calculate offset from
> > + * epoch. May contain zeros if persist ant clock is not available.
>
> What's a 'persist ant' ?

persistent, but I think spell checker decided that I was writing about
some ants. :)

>
> > + *
> > * XXX - Do be sure to remove it once all arches implement it.
> > */
> > -void __weak read_boot_clock64(struct timespec64 *ts)
> > +void __weak __init read_boot_clock64(struct timespec64 *now,
> > + struct timespec64 *ts)
> > {
> > ts->tv_sec = 0;
> > ts->tv_nsec = 0;
> > @@ -1535,7 +1539,7 @@ void __init timekeeping_init(void)
> > } else if (now.tv_sec || now.tv_nsec)
> > persistent_clock_exists = true;
> >
> > - read_boot_clock64(&boot);
> > + read_boot_clock64(&now, &boot);
>
> This interface was already bolted on and this extension just makes it
> worse. If 'now' is invalid then you need the same checks as after
> read_persistent_clock() replicated along with conditionals of some sort.
>
> 'boot' time is used to adjust the wall to monotonic offset. So looking at
> the math behind all of that:
>
> read_persistent_clock(&now);
> read_boot_clock(&boot);
>
> tk_set_xtime(tk, now)
> tk->xtime_sec = now.sec;
> tk->xtime_nsec = now.nsec;
>
> tk_set_wall_to_mono(tk, -boot);
> tk->wall_to_mono = boot;
> tk->offs_real = -boot;
>
> timekeeping_update(tk)
> tk->tkr_mono = tk->xtime + tk->wall_to_mono;
>
> tkr_mono.base is guaranteed to be >= 0. So you need to make sure that
>
> tk->xtime + tk->wall_to_mono >= 0
>
> Yet another check which you need to do in that magic function. That's just
> wrong. If this grows more users then they all have to copy the same sanity
> checks over and over and half of them will be wrong.
>
> Fortunately there is only a single real user of read_boot_clock() in the
> tree: S390. By virtue of being S390 there is no worry about any sanity
> checks. It just works.
>
> ARM has the function, but there is no single subarch which actually
> implements the thing, so this is easy to fix by removing it.
>
> So the right thing to do is the following:
>
> Provide a new function:
>
> void __weak read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
> ktime_t *boot_offset)
> {
> read_persistent_clock(wall_time);
> }
>
> Then add the new function to S390:
>
> void read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
> ktime_t *boot_offset)
> {
> unsigned char clk[STORE_CLOCK_EXT_SIZE];
> struct timespec64 boot_time;
> __u64 delta;
>
> read_persistent_clock(wall_time);
>
> delta = initial_leap_seconds + TOD_UNIX_EPOCH;
> memcpy(clk, tod_clock_base, 16);
> *(__u64 *) &clk[1] -= delta;
> if (*(__u64 *) &clk[1] > delta)
> clk[0]--;
> ext_to_timespec64(clk, boot_time);
> *boot_offset = timespec64_to_ns(timespec64_sub(wall_time, boot_time));
> }
>
> Then rework timekeeping_init():
>
> timekeeping_init()
> struct timespec64 wall_time, wall_to_mono, offs;
> ktime_t boot_offset = 0;
>
> read_persistent_wall_and_boot_offset(&walltime, &boot_offset);
>
> if (!valid(walltime))
> boottime = wall_time.tv_sec = wall_time.tv_nsec = 0;
> else
> persistent_clock = true;
>
> if (boot_offset > timespec64_to_nsec(wall_time))
> offs.tv_sec = offs.tv_nsec = 0;
> else
> offs = ns_to_timespec64(boot_offset);
>
> wall_to_mono = timespec64_sub(offs, wall_time);
> tk_set_wall_to_mono(tk, wall_time);
>
>
> After that remove the read_boot_time() leftovers all over the place. And
> then the x86 implementation boils down to:
>
> void read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
> ktime_t *boot_offset)
> {
> x86_platform.get_wallclock(ts);
> *boot_offset = sched_clock();
> }
>
> And doing it this way - by adding the extra math to the only architecture
> on the planet which has sane timers - is the right thing to do because all
> other architectures will have to fall back to tricks similar to x86 by
> doing clocksource/schedclock based guestimation of the time where the
> machine actually reached the kernel.
>
> Hmm?

Sounds good, I will redo this and the next patch with your suggestions.

Thank you,
Pavel

2018-06-19 22:55:33

by Thomas Gleixner

Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

On Fri, 15 Jun 2018, Pavel Tatashin wrote:
> This patch adds early clock feature to x86 platforms.

See Documentation about 'This patch' .... We already know that this is a
patch otherwise it would not be marked as such ...

> +/*
> + * Finish clock for early time stamps, and hand over to permanent clock by
> + * setting __sched_clock_offset appropriately for continued time keeping.
> + */
> +static void __init tsc_early_fini(void)
> +{
> + unsigned long long t;
> + unsigned long r;
> +
> + t = -cyc2ns_early.cyc2ns_offset;
> + r = do_div(t, NSEC_PER_SEC);
> +
> + __sched_clock_offset = sched_clock_early() - sched_clock();

As this is called _AFTER_ the TSC initialization which also includes
sanitizing of TSC_ADJUST, this can result in a completely bogus
__sched_clock_offset because sched_clock_early() will return nonsense when
TSC_ADJUST has been modified.

> + pr_info("early sched clock is finished, offset [%lld.%09lds]\n", t, r);

That only gives useful information when the machine boots cold and when the
machine has a sane BIOS. So it's almost guaranteed to print useless
information.

Thanks,

tglx

2018-06-19 23:15:12

by Thomas Gleixner

Subject: Re: [PATCH v10 6/7] x86/paravirt: add active_sched_clock to pv_time_ops

On Fri, 15 Jun 2018, Pavel Tatashin wrote:
> Early boot clock might differ from the clock that is used later on,
> therefore add a new field to pv_time_ops, that shows currently active
> clock. If platform supports early boot clock, this field will be changed
> to use that clock early in boot, and later will be replaced with the
> permanent clock.

I'm not seeing why this is required. If a hypervisor provides a
sched_clock() override then it's not the kernel's problem to make sure that
it is usable.

Thanks,

tglx

2018-06-19 23:21:29

by Thomas Gleixner

Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

On Fri, 15 Jun 2018, Pavel Tatashin wrote:
> cpu_khz = x86_platform.calibrate_cpu();
> @@ -1380,7 +1466,7 @@ void __init tsc_init(void)
> if (!tsc_khz) {
> mark_tsc_unstable("could not calculate TSC khz");
> setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
> - return;
> + goto final_sched_clock;
> }
>
> pr_info("Detected %lu.%03lu MHz processor\n",
> @@ -1428,6 +1514,14 @@ void __init tsc_init(void)
>
> clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
> detect_art();

I'm pretty sure you missed at least one instance of 'return'. That's just
error prone and any new exit path of tsc_init() adds another chance to miss
it.

Thanks,

tglx

2018-06-19 23:54:40

by Thomas Gleixner

Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

On Fri, 15 Jun 2018, Pavel Tatashin wrote:

> tsc_early_init():
> Determines offset, shift and multiplier for the early clock based on the
> TSC frequency.
>
> tsc_early_fini()
> Implement the finish part of early tsc feature, prints message about the
> offset, which can be useful to find out how much time was spent in post and
> boot manager (if TSC starts from 0 during boot)
>
> sched_clock_early():
> TSC based implementation of early clock and is called from sched_clock().
>
> start_early_clock():
> Calls tsc_early_init(), and makes sched_clock() to use early boot clock
>
> set_final_clock():
> Sets the final clock which is either platform specific or
> native_sched_clock(). Also calls tsc_early_fini() if early clock was
> previously initialized.
>
> Call start_early_clock() to start using early boot time stamps
> functionality on the supported x86 platforms, and call set_final_clock() to
> finish this feature and switch back to the default clock. The supported x86
> systems are those where TSC frequency is determined early in boot.

Lots of functions for dubious value.

> +static struct cyc2ns_data cyc2ns_early;
> +
> +static u64 sched_clock_early(void)
> +{
> + u64 ns = mul_u64_u32_shr(rdtsc(), cyc2ns_early.cyc2ns_mul,
> + cyc2ns_early.cyc2ns_shift);
> + return ns + cyc2ns_early.cyc2ns_offset;
> +}
> +
> +/*
> + * Initialize clock for early time stamps
> + */
> +static void __init tsc_early_init(unsigned int khz)
> +{
> + clocks_calc_mult_shift(&cyc2ns_early.cyc2ns_mul,
> + &cyc2ns_early.cyc2ns_shift,
> + khz, NSEC_PER_MSEC, 0);
> + cyc2ns_early.cyc2ns_offset = -sched_clock_early();
> +}
> +
> +/*
> + * Finish clock for early time stamps, and hand over to permanent clock by
> + * setting __sched_clock_offset appropriately for continued time keeping.
> + */
> +static void __init tsc_early_fini(void)
> +{
> + unsigned long long t;
> + unsigned long r;
> +
> + t = -cyc2ns_early.cyc2ns_offset;
> + r = do_div(t, NSEC_PER_SEC);
> +
> + __sched_clock_offset = sched_clock_early() - sched_clock();
> + pr_info("early sched clock is finished, offset [%lld.%09lds]\n", t, r);
> +}
> +
> +#ifdef CONFIG_PARAVIRT
> +static inline void __init start_early_clock(void)
> +{
> + tsc_early_init(tsc_khz);
> + pv_time_ops.active_sched_clock = sched_clock_early;
> +}
> +
> +static inline void __init set_final_clock(void)
> +{
> + pv_time_ops.active_sched_clock = pv_time_ops.sched_clock;
> +
> + /* We did not have early sched clock if multiplier is 0 */
> + if (cyc2ns_early.cyc2ns_mul)
> + tsc_early_fini();
> +}
> +#else /* CONFIG_PARAVIRT */
> +/*
> + * For native clock we use two switches static and dynamic, the static switch is
> + * initially true, so we check the dynamic switch, which is initially false.
> + * Later when early clock is disabled, we can alter the static switch in order
> + * to avoid branch check on every sched_clock() call.
> + */
> +static bool __tsc_early;
> +static DEFINE_STATIC_KEY_TRUE(__tsc_early_static);
> +
> +static inline void __init start_early_clock(void)
> +{
> + tsc_early_init(tsc_khz);
> + __tsc_early = true;
> +}
> +
> +static inline void __init set_final_clock(void)
> +{
> + __tsc_early = false;
> + static_branch_disable(&__tsc_early_static);
> +
> + /* We did not have early sched clock if multiplier is 0 */
> + if (cyc2ns_early.cyc2ns_mul)
> + tsc_early_fini();
> +}
> +#endif /* CONFIG_PARAVIRT */
> +
> /*
> * Scheduler clock - returns current time in nanosec units.
> */
> @@ -194,6 +272,13 @@ u64 native_sched_clock(void)
> return cycles_2_ns(tsc_now);
> }
>
> +#ifndef CONFIG_PARAVIRT
> + if (static_branch_unlikely(&__tsc_early_static)) {
> + if (__tsc_early)
> + return sched_clock_early();
> + }
> +#endif /* !CONFIG_PARAVIRT */
> +

This whole function maze plus the ifdeffery which comes with it is really
horrible and not required. What's wrong with reusing the existing
functionality?

The patch below (uncompiled and untested) should achieve the same thing
without all the paravirt muck (which can be easily added w/o all the
ifdeffery if really required) by just reusing the existing conversion and
initialization functions.

If I'm not completely mistaken then the second invocation of
set_cyc2ns_scale() from tsc_init() will also take care of the smooth
sched_clock() transition from early to final w/o touching the core
__sched_clock_offset at all. Though my tired brain might trick me.
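
(For reference, a compile-and-run userspace sketch, with arbitrary example
numbers rather than real kernel state, of why the clock stays continuous
across a mult/shift change when the offset is recomputed against the current
counter value, mirroring the offset recomputation done by set_cyc2ns_scale():)

#include <stdio.h>
#include <stdint.h>

static uint64_t cyc2ns(uint64_t cyc, uint32_t mul, uint32_t shift, int64_t off)
{
	return (uint64_t)(((unsigned __int128)cyc * mul) >> shift) + off;
}

int main(void)
{
	uint64_t tsc_now = 123456789012ULL;		/* counter at rescale time */
	uint32_t old_mul = 400000, new_mul = 419430;	/* arbitrary scales */
	uint32_t shift = 20;

	uint64_t ns_now = cyc2ns(tsc_now, old_mul, shift, 0);
	/* new offset chosen so the new parameters yield the same ns right now */
	int64_t new_off = (int64_t)ns_now -
			  (int64_t)cyc2ns(tsc_now, new_mul, shift, 0);

	printf("old: %llu ns, new: %llu ns (no jump at the switch point)\n",
	       (unsigned long long)ns_now,
	       (unsigned long long)cyc2ns(tsc_now, new_mul, shift, new_off));
	return 0;
}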

It might not work as is, but it should not be rocket science to make it do
so.

Thanks,

tglx

8<----------------------

tsc.c | 59 ++++++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 44 insertions(+), 15 deletions(-)

--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -39,6 +39,9 @@ EXPORT_SYMBOL(tsc_khz);
static int __read_mostly tsc_unstable;

static DEFINE_STATIC_KEY_FALSE(__use_tsc);
+static DEFINE_STATIC_KEY_TRUE(tsc_early_enabled);
+
+static bool tsc_early_sched_clock;

int tsc_clocksource_reliable;

@@ -133,18 +136,12 @@ static inline unsigned long long cycles_
return ns;
}

-static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+static void __set_cyc2ns_scale(unsigned long khz, int cpu,
+ unsigned long long tsc_now)
{
unsigned long long ns_now;
struct cyc2ns_data data;
struct cyc2ns *c2n;
- unsigned long flags;
-
- local_irq_save(flags);
- sched_clock_idle_sleep_event();
-
- if (!khz)
- goto done;

ns_now = cycles_2_ns(tsc_now);

@@ -176,22 +173,46 @@ static void set_cyc2ns_scale(unsigned lo
c2n->data[0] = data;
raw_write_seqcount_latch(&c2n->seq);
c2n->data[1] = data;
+}
+
+static void set_cyc2ns_scale(unsigned long khz, int cpu,
+ unsigned long long tsc_now)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ if (khz)
+ __set_cyc2ns_scale(khz, cpu, tsc_now);

-done:
sched_clock_idle_wakeup_event();
local_irq_restore(flags);
}

+static void __init sched_clock_early_init(unsigned int khz)
+{
+ cyc2ns_init(smp_processor_id());
+ __set_cyc2ns_scale(khz, smp_processor_id(), rdtsc());
+ tsc_early_sched_clock = true;
+}
+
+static void __init sched_clock_early_exit(void)
+{
+ static_branch_disable(&tsc_early_enabled);
+}
+
/*
* Scheduler clock - returns current time in nanosec units.
*/
u64 native_sched_clock(void)
{
- if (static_branch_likely(&__use_tsc)) {
- u64 tsc_now = rdtsc();
+ if (static_branch_likely(&__use_tsc))
+ return cycles_2_ns(rdtsc());

- /* return the value in ns */
- return cycles_2_ns(tsc_now);
+ if (static_branch_unlikely(&tsc_early_enabled)) {
+ if (tsc_early_sched_clock)
+ return cycles_2_ns(rdtsc());
}

/*
@@ -1332,9 +1353,10 @@ void __init tsc_early_delay_calibrate(vo
lpj = tsc_khz * 1000;
do_div(lpj, HZ);
loops_per_jiffy = lpj;
+ sched_clock_early_init(tsc_khz);
}

-void __init tsc_init(void)
+static void __init __tsc_init(void)
{
u64 lpj, cyc;
int cpu;
@@ -1384,7 +1406,8 @@ void __init tsc_init(void)
*/
cyc = rdtsc();
for_each_possible_cpu(cpu) {
- cyc2ns_init(cpu);
+ if (!tsc_early_sched_clock || cpu != smp_processor_id())
+ cyc2ns_init(cpu);
set_cyc2ns_scale(tsc_khz, cpu, cyc);
}

@@ -1411,6 +1434,12 @@ void __init tsc_init(void)
detect_art();
}

+void __init tsc_init(void)
+{
+ __tsc_init();
+ sched_clock_early_exit();
+}
+
#ifdef CONFIG_SMP
/*
* If we have a constant TSC and are using the TSC for the delay loop,




2018-06-20 00:02:32

by Pavel Tatashin

Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early



On 06/19/2018 07:52 PM, Thomas Gleixner wrote:
> On Fri, 15 Jun 2018, Pavel Tatashin wrote:
>
>> tsc_early_init():
>> Determines offset, shift and multiplier for the early clock based on the
>> TSC frequency.
>>
>> tsc_early_fini()
>> Implement the finish part of early tsc feature, prints message about the
>> offset, which can be useful to find out how much time was spent in post and
>> boot manager (if TSC starts from 0 during boot)
>>
>> sched_clock_early():
>> TSC based implementation of early clock and is called from sched_clock().
>>
>> start_early_clock():
>> Calls tsc_early_init(), and makes sched_clock() to use early boot clock
>>
>> set_final_clock():
>> Sets the final clock which is either platform specific or
>> native_sched_clock(). Also calls tsc_early_fini() if early clock was
>> previously initialized.
>>
>> Call start_early_clock() to start using early boot time stamps
>> functionality on the supported x86 platforms, and call set_final_clock() to
>> finish this feature and switch back to the default clock. The supported x86
>> systems are those where TSC frequency is determined early in boot.
>
> Lots of functions for dubious value.
>
>> +static struct cyc2ns_data cyc2ns_early;
>> +
>> +static u64 sched_clock_early(void)
>> +{
>> + u64 ns = mul_u64_u32_shr(rdtsc(), cyc2ns_early.cyc2ns_mul,
>> + cyc2ns_early.cyc2ns_shift);
>> + return ns + cyc2ns_early.cyc2ns_offset;
>> +}
>> +
>> +/*
>> + * Initialize clock for early time stamps
>> + */
>> +static void __init tsc_early_init(unsigned int khz)
>> +{
>> + clocks_calc_mult_shift(&cyc2ns_early.cyc2ns_mul,
>> + &cyc2ns_early.cyc2ns_shift,
>> + khz, NSEC_PER_MSEC, 0);
>> + cyc2ns_early.cyc2ns_offset = -sched_clock_early();
>> +}
>> +
>> +/*
>> + * Finish clock for early time stamps, and hand over to permanent clock by
>> + * setting __sched_clock_offset appropriately for continued time keeping.
>> + */
>> +static void __init tsc_early_fini(void)
>> +{
>> + unsigned long long t;
>> + unsigned long r;
>> +
>> + t = -cyc2ns_early.cyc2ns_offset;
>> + r = do_div(t, NSEC_PER_SEC);
>> +
>> + __sched_clock_offset = sched_clock_early() - sched_clock();
>> + pr_info("early sched clock is finished, offset [%lld.%09lds]\n", t, r);
>> +}
>> +
>> +#ifdef CONFIG_PARAVIRT
>> +static inline void __init start_early_clock(void)
>> +{
>> + tsc_early_init(tsc_khz);
>> + pv_time_ops.active_sched_clock = sched_clock_early;
>> +}
>> +
>> +static inline void __init set_final_clock(void)
>> +{
>> + pv_time_ops.active_sched_clock = pv_time_ops.sched_clock;
>> +
>> + /* We did not have early sched clock if multiplier is 0 */
>> + if (cyc2ns_early.cyc2ns_mul)
>> + tsc_early_fini();
>> +}
>> +#else /* CONFIG_PARAVIRT */
>> +/*
>> + * For native clock we use two switches static and dynamic, the static switch is
>> + * initially true, so we check the dynamic switch, which is initially false.
>> + * Later when early clock is disabled, we can alter the static switch in order
>> + * to avoid branch check on every sched_clock() call.
>> + */
>> +static bool __tsc_early;
>> +static DEFINE_STATIC_KEY_TRUE(__tsc_early_static);
>> +
>> +static inline void __init start_early_clock(void)
>> +{
>> + tsc_early_init(tsc_khz);
>> + __tsc_early = true;
>> +}
>> +
>> +static inline void __init set_final_clock(void)
>> +{
>> + __tsc_early = false;
>> + static_branch_disable(&__tsc_early_static);
>> +
>> + /* We did not have early sched clock if multiplier is 0 */
>> + if (cyc2ns_early.cyc2ns_mul)
>> + tsc_early_fini();
>> +}
>> +#endif /* CONFIG_PARAVIRT */
>> +
>> /*
>> * Scheduler clock - returns current time in nanosec units.
>> */
>> @@ -194,6 +272,13 @@ u64 native_sched_clock(void)
>> return cycles_2_ns(tsc_now);
>> }
>>
>> +#ifndef CONFIG_PARAVIRT
>> + if (static_branch_unlikely(&__tsc_early_static)) {
>> + if (__tsc_early)
>> + return sched_clock_early();
>> + }
>> +#endif /* !CONFIG_PARAVIRT */
>> +
>
> This whole function maze plus the ifdeffery which comes with it is really
> horrible and not required. What's wrong with reusing the existing
> functionality?
>
> The patch below (uncompiled and untested) should achieve the same thing
> without all the paravirt muck (which can be easily added w/o all the
> ifdeffery if really required) by just reusing the existing conversion and
> initialization functions.
>
> If I'm not completely mistaken then the second invocation of
> set_cyc2ns_scale() from tsc_init() will also take care of the smooth
> sched_clock() transition from early to final w/o touching the core
> __sched_clock_offset at all. Though my tired brain might trick me.
>
> It might not work as is, but it should not be rocket science to make it do
> so.
>
> Thanks,
>
> tglx
>
> 8<----------------------
>
> tsc.c | 59 ++++++++++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 44 insertions(+), 15 deletions(-)
>
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -39,6 +39,9 @@ EXPORT_SYMBOL(tsc_khz);
> static int __read_mostly tsc_unstable;
>
> static DEFINE_STATIC_KEY_FALSE(__use_tsc);
> +static DEFINE_STATIC_KEY_TRUE(tsc_early_enabled);
> +
> +static bool tsc_early_sched_clock;
>
> int tsc_clocksource_reliable;
>
> @@ -133,18 +136,12 @@ static inline unsigned long long cycles_
> return ns;
> }
>
> -static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
> +static void __set_cyc2ns_scale(unsigned long khz, int cpu,
> + unsigned long long tsc_now)
> {
> unsigned long long ns_now;
> struct cyc2ns_data data;
> struct cyc2ns *c2n;
> - unsigned long flags;
> -
> - local_irq_save(flags);
> - sched_clock_idle_sleep_event();
> -
> - if (!khz)
> - goto done;
>
> ns_now = cycles_2_ns(tsc_now);
>
> @@ -176,22 +173,46 @@ static void set_cyc2ns_scale(unsigned lo
> c2n->data[0] = data;
> raw_write_seqcount_latch(&c2n->seq);
> c2n->data[1] = data;
> +}
> +
> +static void set_cyc2ns_scale(unsigned long khz, int cpu,
> + unsigned long long tsc_now)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + sched_clock_idle_sleep_event();
> +
> + if (khz)
> + __set_cyc2ns_scale(khz, cpu, tsc_now);
>
> -done:
> sched_clock_idle_wakeup_event();
> local_irq_restore(flags);
> }
>
> +static void __init sched_clock_early_init(unsigned int khz)
> +{
> + cyc2ns_init(smp_processor_id());
> + __set_cyc2ns_scale(khz, smp_processor_id(), rdtsc());
> + tsc_early_sched_clock = true;
> +}
> +
> +static void __init sched_clock_early_exit(void)
> +{
> + static_branch_disable(&tsc_early_enabled);
> +}
> +
> /*
> * Scheduler clock - returns current time in nanosec units.
> */
> u64 native_sched_clock(void)
> {
> - if (static_branch_likely(&__use_tsc)) {
> - u64 tsc_now = rdtsc();
> + if (static_branch_likely(&__use_tsc))
> + return cycles_2_ns(rdtsc());
>
> - /* return the value in ns */
> - return cycles_2_ns(tsc_now);
> + if (static_branch_unlikely(&tsc_early_enabled)) {
> + if (tsc_early_sched_clock)
> + return cycles_2_ns(rdtsc());
> }
>
> /*
> @@ -1332,9 +1353,10 @@ void __init tsc_early_delay_calibrate(vo
> lpj = tsc_khz * 1000;
> do_div(lpj, HZ);
> loops_per_jiffy = lpj;
> + sched_clock_early_init(tsc_khz);
> }
>
> -void __init tsc_init(void)
> +static void __init __tsc_init(void)
> {
> u64 lpj, cyc;
> int cpu;
> @@ -1384,7 +1406,8 @@ void __init tsc_init(void)
> */
> cyc = rdtsc();
> for_each_possible_cpu(cpu) {
> - cyc2ns_init(cpu);
> + if (!tsc_early_sched_clock || cpu != smp_processor_id())
> + cyc2ns_init(cpu);
> set_cyc2ns_scale(tsc_khz, cpu, cyc);
> }
>
> @@ -1411,6 +1434,12 @@ void __init tsc_init(void)
> detect_art();
> }
>
> +void __init tsc_init(void)
> +{
> + __tsc_init();
> + sched_clock_early_exit();
> +}
> +
> #ifdef CONFIG_SMP
> /*
> * If we have a constant TSC and are using the TSC for the delay loop,
>
>
>

Hi Thomas,

I like this approach. As you said in patch 6, new paravirt functions are not needed: if the hypervisor provides sched_clock(), we just use it, so we do not need ifdefs for the paravirt stuff here.

I will update the whole series and send it out again. It will be in much better shape after your review!

Thank you,
Pavel

2018-06-20 09:16:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

On Wed, Jun 20, 2018 at 01:52:10AM +0200, Thomas Gleixner wrote:
> u64 native_sched_clock(void)
> {
> + if (static_branch_likely(&__use_tsc))
> + return cycles_2_ns(rdtsc());
>
> + if (static_branch_unlikely(&tsc_early_enabled)) {
> + if (tsc_early_sched_clock)
> + return cycles_2_ns(rdtsc());
> }

I'm still puzzled by the entire need for tsc_early_enabled and all that.
Esp. since both branches do the exact same thing:

return cycles_2_ns(rdtsc());

2018-06-20 10:11:40

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

Hi Thomas,

On Wed, Jun 20, 2018 at 01:52:10AM +0200, Thomas Gleixner wrote:
> On Fri, 15 Jun 2018, Pavel Tatashin wrote:
>
> > tsc_early_init():
> > Determines offset, shift and multiplier for the early clock based on the
> > TSC frequency.
> >
> > tsc_early_fini()
> > Implement the finish part of early tsc feature, prints message about the
> > offset, which can be useful to find out how much time was spent in post and
> > boot manager (if TSC starts from 0 during boot)
> >
> > sched_clock_early():
> > TSC based implementation of early clock and is called from sched_clock().
> >
> > start_early_clock():
> > Calls tsc_early_init(), and makes sched_clock() to use early boot clock
> >
> > set_final_clock():
> > Sets the final clock which is either platform specific or
> > native_sched_clock(). Also calls tsc_early_fini() if early clock was
> > previously initialized.
> >
> > Call start_early_clock() to start using early boot time stamps
> > functionality on the supported x86 platforms, and call set_final_clock() to
> > finish this feature and switch back to the default clock. The supported x86
> > systems are those where TSC frequency is determined early in boot.
>
> Lots of functions for dubious value.
>
> > +static struct cyc2ns_data cyc2ns_early;
> > +
> > +static u64 sched_clock_early(void)
> > +{
> > + u64 ns = mul_u64_u32_shr(rdtsc(), cyc2ns_early.cyc2ns_mul,
> > + cyc2ns_early.cyc2ns_shift);
> > + return ns + cyc2ns_early.cyc2ns_offset;
> > +}
> > +
> > +/*
> > + * Initialize clock for early time stamps
> > + */
> > +static void __init tsc_early_init(unsigned int khz)
> > +{
> > + clocks_calc_mult_shift(&cyc2ns_early.cyc2ns_mul,
> > + &cyc2ns_early.cyc2ns_shift,
> > + khz, NSEC_PER_MSEC, 0);
> > + cyc2ns_early.cyc2ns_offset = -sched_clock_early();
> > +}
> > +
> > +/*
> > + * Finish clock for early time stamps, and hand over to permanent clock by
> > + * setting __sched_clock_offset appropriately for continued time keeping.
> > + */
> > +static void __init tsc_early_fini(void)
> > +{
> > + unsigned long long t;
> > + unsigned long r;
> > +
> > + t = -cyc2ns_early.cyc2ns_offset;
> > + r = do_div(t, NSEC_PER_SEC);
> > +
> > + __sched_clock_offset = sched_clock_early() - sched_clock();
> > + pr_info("early sched clock is finished, offset [%lld.%09lds]\n", t, r);
> > +}
> > +
> > +#ifdef CONFIG_PARAVIRT
> > +static inline void __init start_early_clock(void)
> > +{
> > + tsc_early_init(tsc_khz);
> > + pv_time_ops.active_sched_clock = sched_clock_early;
> > +}
> > +
> > +static inline void __init set_final_clock(void)
> > +{
> > + pv_time_ops.active_sched_clock = pv_time_ops.sched_clock;
> > +
> > + /* We did not have early sched clock if multiplier is 0 */
> > + if (cyc2ns_early.cyc2ns_mul)
> > + tsc_early_fini();
> > +}
> > +#else /* CONFIG_PARAVIRT */
> > +/*
> > + * For native clock we use two switches static and dynamic, the static switch is
> > + * initially true, so we check the dynamic switch, which is initially false.
> > + * Later when early clock is disabled, we can alter the static switch in order
> > + * to avoid branch check on every sched_clock() call.
> > + */
> > +static bool __tsc_early;
> > +static DEFINE_STATIC_KEY_TRUE(__tsc_early_static);
> > +
> > +static inline void __init start_early_clock(void)
> > +{
> > + tsc_early_init(tsc_khz);
> > + __tsc_early = true;
> > +}
> > +
> > +static inline void __init set_final_clock(void)
> > +{
> > + __tsc_early = false;
> > + static_branch_disable(&__tsc_early_static);
> > +
> > + /* We did not have early sched clock if multiplier is 0 */
> > + if (cyc2ns_early.cyc2ns_mul)
> > + tsc_early_fini();
> > +}
> > +#endif /* CONFIG_PARAVIRT */
> > +
> > /*
> > * Scheduler clock - returns current time in nanosec units.
> > */
> > @@ -194,6 +272,13 @@ u64 native_sched_clock(void)
> > return cycles_2_ns(tsc_now);
> > }
> >
> > +#ifndef CONFIG_PARAVIRT
> > + if (static_branch_unlikely(&__tsc_early_static)) {
> > + if (__tsc_early)
> > + return sched_clock_early();
> > + }
> > +#endif /* !CONFIG_PARAVIRT */
> > +
>
> This whole function maze plus the ifdeffery which comes with it is really
> horrible and not required. What's wrong with reusing the existing
> functionality?
>
> The patch below (uncompiled and untested) should achieve the same thing
> without all the paravirt muck (which can be easily added w/o all the
> ifdeffery if really required) by just reusing the existing conversion and
> initialization functions.
>
> If I'm not completely mistaken then the second invocation of
> set_cyc2ns_scale() from tsc_init() will also take care of the smooth
> sched_clock() transition from early to final w/o touching the core
> __sched_clock_offset at all. Though my tired brain might trick me.
>
> It might not work as is, but it should not be rocket science to make it do
> so.
>
> Thanks,
>
> tglx
>
> 8<----------------------
>
> tsc.c | 59 ++++++++++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 44 insertions(+), 15 deletions(-)
>
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -39,6 +39,9 @@ EXPORT_SYMBOL(tsc_khz);
> static int __read_mostly tsc_unstable;
>
> static DEFINE_STATIC_KEY_FALSE(__use_tsc);
> +static DEFINE_STATIC_KEY_TRUE(tsc_early_enabled);

One potential problem may be the several static_keys used here
("__use_tsc", "__sched_clock_stable"): they may not be usable very
early in the boot phase, as static_branch_enable() uses paging-related
code while paging is not set up yet.

Some details on https://lkml.org/lkml/2018/6/13/92

Thanks,
Feng

> +
> +static bool tsc_early_sched_clock;
>
> int tsc_clocksource_reliable;
>
> @@ -133,18 +136,12 @@ static inline unsigned long long cycles_
> return ns;
> }
>
> -static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
> +static void __set_cyc2ns_scale(unsigned long khz, int cpu,
> + unsigned long long tsc_now)
> {
> unsigned long long ns_now;
> struct cyc2ns_data data;
> struct cyc2ns *c2n;
> - unsigned long flags;
> -
> - local_irq_save(flags);
> - sched_clock_idle_sleep_event();
> -
> - if (!khz)
> - goto done;
>
> ns_now = cycles_2_ns(tsc_now);
>
> @@ -176,22 +173,46 @@ static void set_cyc2ns_scale(unsigned lo
> c2n->data[0] = data;
> raw_write_seqcount_latch(&c2n->seq);
> c2n->data[1] = data;
> +}
> +
> +static void set_cyc2ns_scale(unsigned long khz, int cpu,
> + unsigned long long tsc_now)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + sched_clock_idle_sleep_event();
> +
> + if (khz)
> + __set_cyc2ns_scale(khz, cpu, tsc_now);
>
> -done:
> sched_clock_idle_wakeup_event();
> local_irq_restore(flags);
> }
>
> +static void __init sched_clock_early_init(unsigned int khz)
> +{
> + cyc2ns_init(smp_processor_id());
> + __set_cyc2ns_scale(khz, smp_processor_id(), rdtsc());
> + tsc_early_sched_clock = true;
> +}
> +
> +static void __init sched_clock_early_exit(void)
> +{
> + static_branch_disable(&tsc_early_enabled);
> +}
> +
> /*
> * Scheduler clock - returns current time in nanosec units.
> */
> u64 native_sched_clock(void)
> {
> - if (static_branch_likely(&__use_tsc)) {
> - u64 tsc_now = rdtsc();
> + if (static_branch_likely(&__use_tsc))
> + return cycles_2_ns(rdtsc());
>
> - /* return the value in ns */
> - return cycles_2_ns(tsc_now);
> + if (static_branch_unlikely(&tsc_early_enabled)) {
> + if (tsc_early_sched_clock)
> + return cycles_2_ns(rdtsc());
> }
>
> /*
> @@ -1332,9 +1353,10 @@ void __init tsc_early_delay_calibrate(vo
> lpj = tsc_khz * 1000;
> do_div(lpj, HZ);
> loops_per_jiffy = lpj;
> + sched_clock_early_init(tsc_khz);
> }
>
> -void __init tsc_init(void)
> +static void __init __tsc_init(void)
> {
> u64 lpj, cyc;
> int cpu;
> @@ -1384,7 +1406,8 @@ void __init tsc_init(void)
> */
> cyc = rdtsc();
> for_each_possible_cpu(cpu) {
> - cyc2ns_init(cpu);
> + if (!tsc_early_sched_clock || cpu != smp_processor_id())
> + cyc2ns_init(cpu);
> set_cyc2ns_scale(tsc_khz, cpu, cyc);
> }
>
> @@ -1411,6 +1434,12 @@ void __init tsc_init(void)
> detect_art();
> }
>
> +void __init tsc_init(void)
> +{
> + __tsc_init();
> + sched_clock_early_exit();
> +}
> +
> #ifdef CONFIG_SMP
> /*
> * If we have a constant TSC and are using the TSC for the delay loop,
>
>

2018-06-20 10:18:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

On Wed, 20 Jun 2018, Feng Tang wrote:
> On Wed, Jun 20, 2018 at 01:52:10AM +0200, Thomas Gleixner wrote:
> >
> > static DEFINE_STATIC_KEY_FALSE(__use_tsc);
> > +static DEFINE_STATIC_KEY_TRUE(tsc_early_enabled);
>
> One potential problem may be the several static_keys used here
> ("__use_tsc", "__sched_clock_stable"): they may not be usable very
> early in the boot phase, as static_branch_enable() uses paging-related
> code while paging is not set up yet.

I know how static keys work and that's the reason for having the extra
conditional. The key is disabled at a point where paging is available.

> > /*
> > * Scheduler clock - returns current time in nanosec units.
> > */
> > u64 native_sched_clock(void)
> > {
> > - if (static_branch_likely(&__use_tsc)) {
> > - u64 tsc_now = rdtsc();
> > + if (static_branch_likely(&__use_tsc))
> > + return cycles_2_ns(rdtsc());
> >
> > - /* return the value in ns */
> > - return cycles_2_ns(tsc_now);
> > + if (static_branch_unlikely(&tsc_early_enabled)) {
> > + if (tsc_early_sched_clock)
> > + return cycles_2_ns(rdtsc());
> > }

Thanks,

tglx

2018-06-20 10:29:21

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

On Wed, Jun 20, 2018 at 12:16:36PM +0200, Thomas Gleixner wrote:
> On Wed, 20 Jun 2018, Feng Tang wrote:
> > On Wed, Jun 20, 2018 at 01:52:10AM +0200, Thomas Gleixner wrote:
> > >
> > > static DEFINE_STATIC_KEY_FALSE(__use_tsc);
> > > +static DEFINE_STATIC_KEY_TRUE(tsc_early_enabled);
> >
> > One potential problem may be the several static_keys used here
> > ("__use_tsc", "__sched_clock_stable"): they may not be usable very
> > early in the boot phase, as static_branch_enable() uses paging-related
> > code while paging is not set up yet.
>
> I know how static keys work and that's the reason for having the extra
> conditional. The key is disabled at a point where paging is available.

Maybe I'm wrong, but for the printk timestamp it uses local_clock(), thus:
    local_clock()
        sched_clock_cpu()
            sched_clock_stable()
which checks the static_key "__sched_clock_stable"; that key is defined
FALSE and needs a static_branch_enable() to take effect.
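
For reference, roughly what that gate looks like (simplified sketch of
the kernel/sched/clock.c bits, not the exact code):

	/* the stable-clock fast path sits behind a compile-time FALSE key */
	static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);

	int sched_clock_stable(void)
	{
		return static_branch_likely(&__sched_clock_stable);
	}

so until static_branch_enable(&__sched_clock_stable) has run, which needs
the text-patching machinery and thus paging, sched_clock_cpu() keeps
taking the unstable path.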

Thanks,
Feng

>
> > > /*
> > > * Scheduler clock - returns current time in nanosec units.
> > > */
> > > u64 native_sched_clock(void)
> > > {
> > > - if (static_branch_likely(&__use_tsc)) {
> > > - u64 tsc_now = rdtsc();
> > > + if (static_branch_likely(&__use_tsc))
> > > + return cycles_2_ns(rdtsc());
> > >
> > > - /* return the value in ns */
> > > - return cycles_2_ns(tsc_now);
> > > + if (static_branch_unlikely(&tsc_early_enabled)) {
> > > + if (tsc_early_sched_clock)
> > > + return cycles_2_ns(rdtsc());
> > > }
>
> Thanks,
>
> tglx

2018-06-20 10:32:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

On Wed, 20 Jun 2018, Feng Tang wrote:

> On Wed, Jun 20, 2018 at 12:16:36PM +0200, Thomas Gleixner wrote:
> > On Wed, 20 Jun 2018, Feng Tang wrote:
> > > On Wed, Jun 20, 2018 at 01:52:10AM +0200, Thomas Gleixner wrote:
> > > >
> > > > static DEFINE_STATIC_KEY_FALSE(__use_tsc);
> > > > +static DEFINE_STATIC_KEY_TRUE(tsc_early_enabled);
> > >
> > > One potential problem may be the several static_keys used here
> > > ("__use_tsc", "__sched_clock_stable"): they may not be usable very
> > > early in the boot phase, as static_branch_enable() uses paging-related
> > > code while paging is not set up yet.
> >
> > I know how static keys work and that's the reason for having the extra
> > conditional. The key is disabled at a point where paging is available.
>
> Maybe I'm wrong, but for the printk timestamp it uses local_clock(), thus:
>     local_clock()
>         sched_clock_cpu()
>             sched_clock_stable()
> which checks the static_key "__sched_clock_stable"; that key is defined
> FALSE and needs a static_branch_enable() to take effect.

And that has nothing to do with this particular patch. Please read the rest
of the series, especially

https://lkml.kernel.org/r/[email protected]

Thanks,

tglx

2018-06-20 10:38:59

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

On Wed, Jun 20, 2018 at 12:30:58PM +0200, Thomas Gleixner wrote:
> On Wed, 20 Jun 2018, Feng Tang wrote:
>
> > On Wed, Jun 20, 2018 at 12:16:36PM +0200, Thomas Gleixner wrote:
> > > On Wed, 20 Jun 2018, Feng Tang wrote:
> > > > On Wed, Jun 20, 2018 at 01:52:10AM +0200, Thomas Gleixner wrote:
> > > > >
> > > > > static DEFINE_STATIC_KEY_FALSE(__use_tsc);
> > > > > +static DEFINE_STATIC_KEY_TRUE(tsc_early_enabled);
> > > >
> > > > One potential problem may be the several static_keys used here
> > > > ("__use_tsc", "__sched_clock_stable"): they may not be usable very
> > > > early in the boot phase, as static_branch_enable() uses paging-related
> > > > code while paging is not set up yet.
> > >
> > > I know how static keys work and that's the reason for having the extra
> > > conditional. The key is disabled at a point where paging is available.
> >
> > Maybe I'm wrong, but for the printk timestamp it uses local_clock(), thus:
> >     local_clock()
> >         sched_clock_cpu()
> >             sched_clock_stable()
> > which checks the static_key "__sched_clock_stable"; that key is defined
> > FALSE and needs a static_branch_enable() to take effect.
>
> And that has nothing to do with this particular patch. Please read the rest
> of the series, especially
>
> https://lkml.kernel.org/r/[email protected]

I had not read that patch, thanks for the pointer. No problem now :)

- Feng

2018-06-20 10:44:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

On Wed, 20 Jun 2018, Peter Zijlstra wrote:

> On Wed, Jun 20, 2018 at 01:52:10AM +0200, Thomas Gleixner wrote:
> > u64 native_sched_clock(void)
> > {
> > + if (static_branch_likely(&__use_tsc))
> > + return cycles_2_ns(rdtsc());
> >
> > + if (static_branch_unlikely(&tsc_early_enabled)) {
> > + if (tsc_early_sched_clock)
> > + return cycles_2_ns(rdtsc());
> > }
>
> I'm still puzzled by the entire need for tsc_early_enabled and all that.
> Esp. since both branches do the exact same thing:
>
> return cycles_2_ns(rdtsc());

Right. But up to the point where the real sched_clock initialization can be
done and the static keys can be flipped, there must be a way to
conditionally use TSC depending on availability and early initialization.

We cannot make the __use_tsc key TRUE at compile time unless we add an
extra conditional into that code path, which makes the whole static key moot.

So the second static key for the early stuff is TRUE and has the extra
conditional:

if (tsc_early_sched_clock)
return cycles_2_ns(rdtsc());

And that code path gets nuked by flipping the key to FALSE in the later
init code, so that in the jiffies case we do not have the extra conditional
in every sched_clock() invocation.
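
Spelled out as a boot flow (just a sketch, using the names from the patch
above):

	tsc_early_delay_calibrate()
		sched_clock_early_init(tsc_khz)
			/* tsc_early_sched_clock = true: sched_clock() now
			 * returns cycles_2_ns(rdtsc()) via the early check */

	tsc_init()
		set_cyc2ns_scale() for each cpu		/* final conversion */
		sched_clock_early_exit()
			/* static_branch_disable(&tsc_early_enabled): the
			 * early conditional is patched out of the jiffies
			 * fallback path */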

You might argue that we shouldn't care because the jiffies case is just
the worst case fallback anyway. I wouldn't even disagree, as those old
machines which have TSC varying with the CPU frequency really should not
matter anymore. Pavel might disagree of course.

Thanks,

tglx


2018-06-20 12:33:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

On Wed, Jun 20, 2018 at 12:42:40PM +0200, Thomas Gleixner wrote:
> On Wed, 20 Jun 2018, Peter Zijlstra wrote:

> > I'm still puzzled by the entire need for tsc_early_enabled and all that.
> > Esp. since both branches do the exact same thing:
> >
> > return cycles_2_ns(rdtsc());
>
> Right. But up to the point where the real sched_clock initialization can be
> done and the static keys can be flipped, there must be a way to
> conditionally use TSC depending on availability and early initialization.

Ah, so we want to flip keys early, can be done, see below.

> You might argue that we shouldn't care because the jiffies case is just
> the worst case fallback anyway. I wouldn't even disagree, as those old
> machines which have TSC varying with the CPU frequency really should not
> matter anymore. Pavel might disagree of course.

You forgot (rightfully) that we even use TSC on those !constant
machines; we adjust the cycles_2_ns thing from the cpufreq notifiers.

The only case we should _ever_ use that jiffies callback is when TSC
really isn't there. Basically, if we kill notsc, we could make
native_sched_clock() := cycles_2_ns(rdtsc()) (for CONFIG_X86_TSC), the
end.

No static keys, nothing.
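
I.e. (sketch of what that would collapse to, CONFIG_X86_TSC assumed):

	u64 native_sched_clock(void)
	{
		/* no static keys, no early/late split, just the TSC */
		return cycles_2_ns(rdtsc());
	}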

That said; flipping static keys early isn't hard. We should call
jump_label_init() early, because we want the entries sorted and the
key->entries link set. It will also replace the GENERIC_NOP5_ATOMIC
thing, which means we need to also do arch_init_ideal_nop() early, but
since that is pure CPUID based that should be doable.

And then something like the below could be used.

---
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e56c95be2808..2dd8c5bdd87b 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -140,4 +140,38 @@ __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry,
__jump_label_transform(entry, type, text_poke_early, 1);
}

+void jump_label_update_early(struct static_key *key, bool enable)
+{
+ struct jump_entry *entry, *stop = __stop___jump_table;
+
+ /*
+ * We need the table sorted and key->entries set up.
+ */
+ WARN_ON_ONCE(!static_key_initialized);
+
+ entry = static_key_entries(key);
+
+ /*
+ * Sanity check for early users, there had better be a core kernel user.
+ */
+ if (!entry || !entry->code || !core_kernel_text(entry->code)) {
+ WARN_ON(1);
+ return;
+ }
+
+ for ( ; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
+ enum jump_label_type type = enable ^ jump_entry_branch(entry);
+ __jump_label_transform(entry, type, text_poke_early, 0);
+ }
+
+ atomic_set_release(&key->enabled, !!enable);
+}
+
+#else
+
+void jump_label_update_early(struct static_key *key, bool enable)
+{
+ atomic_set(&key->enabled, !!enable);
+}
+
#endif
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index b46b541c67c4..cac61beca25f 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -79,6 +79,7 @@

#include <linux/types.h>
#include <linux/compiler.h>
+#include <linux/bug.h>

extern bool static_key_initialized;

@@ -110,6 +111,17 @@ struct static_key {
};
};

+#define JUMP_TYPE_FALSE 0UL
+#define JUMP_TYPE_TRUE 1UL
+#define JUMP_TYPE_LINKED 2UL
+#define JUMP_TYPE_MASK 3UL
+
+static inline struct jump_entry *static_key_entries(struct static_key *key)
+{
+ WARN_ON_ONCE(key->type & JUMP_TYPE_LINKED);
+ return (struct jump_entry *)(key->type & ~JUMP_TYPE_MASK);
+}
+
#else
struct static_key {
atomic_t enabled;
@@ -119,6 +131,17 @@ struct static_key {

#ifdef HAVE_JUMP_LABEL
#include <asm/jump_label.h>
+
+static inline struct static_key *jump_entry_key(struct jump_entry *entry)
+{
+ return (struct static_key *)((unsigned long)entry->key & ~1UL);
+}
+
+static inline bool jump_entry_branch(struct jump_entry *entry)
+{
+ return (unsigned long)entry->key & 1UL;
+}
+
#endif

#ifndef __ASSEMBLY__
@@ -132,11 +155,6 @@ struct module;

#ifdef HAVE_JUMP_LABEL

-#define JUMP_TYPE_FALSE 0UL
-#define JUMP_TYPE_TRUE 1UL
-#define JUMP_TYPE_LINKED 2UL
-#define JUMP_TYPE_MASK 3UL
-
static __always_inline bool static_key_false(struct static_key *key)
{
return arch_static_branch(key, false);
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 01ebdf1f9f40..9710fa7582aa 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -295,12 +295,6 @@ void __weak __init_or_module arch_jump_label_transform_static(struct jump_entry
arch_jump_label_transform(entry, type);
}

-static inline struct jump_entry *static_key_entries(struct static_key *key)
-{
- WARN_ON_ONCE(key->type & JUMP_TYPE_LINKED);
- return (struct jump_entry *)(key->type & ~JUMP_TYPE_MASK);
-}
-
static inline bool static_key_type(struct static_key *key)
{
return key->type & JUMP_TYPE_TRUE;
@@ -321,16 +315,6 @@ static inline void static_key_set_linked(struct static_key *key)
key->type |= JUMP_TYPE_LINKED;
}

-static inline struct static_key *jump_entry_key(struct jump_entry *entry)
-{
- return (struct static_key *)((unsigned long)entry->key & ~1UL);
-}
-
-static bool jump_entry_branch(struct jump_entry *entry)
-{
- return (unsigned long)entry->key & 1UL;
-}
-
/***
* A 'struct static_key' uses a union such that it either points directly
* to a table of 'struct jump_entry' or to a linked list of modules which in
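
A caller on the tsc.c side could then be as simple as something like this
(hypothetical usage sketch, not part of the diff above; assumes
jump_label_init() and arch_init_ideal_nop() already ran by that point):

	/* flip the existing __use_tsc key as soon as the early TSC
	 * frequency is known, instead of adding a second early key */
	if (tsc_khz)
		jump_label_update_early(&__use_tsc.key, true);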

2018-06-20 13:31:30

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

On Wed, 20 Jun 2018, Peter Zijlstra wrote:
> On Wed, Jun 20, 2018 at 12:42:40PM +0200, Thomas Gleixner wrote:
> > On Wed, 20 Jun 2018, Peter Zijlstra wrote:
>
> > > I'm still puzzled by the entire need for tsc_early_enabled and all that.
> > > Esp. since both branches do the exact same thing:
> > >
> > > return cycles_2_ns(rdtsc());
> >
> > Right. But up to the point where the real sched_clock initialization can be
> > done and the static keys can be flipped, there must be a way to
> > conditionally use TSC depending on availability and early initialization.
>
> Ah, so we want to flip keys early, can be done, see below.
>
> > You might argue that we shouldn't care because the jiffies case is just
> > the worst case fallback anyway. I wouldn't even disagree, as those old
> > machines which have TSC varying with the CPU frequency really should not
> > matter anymore. Pavel might disagree of course.
>
> You forgot (rightfully) that we even use TSC on those !constant
> machines; we adjust the cycles_2_ns thing from the cpufreq notifiers.
>
> The only case we should _ever_ use that jiffies callback is when TSC
> really isn't there. Basically, if we kill notsc, we could make
> native_sched_clock() := cycles_2_ns(rdtsc()) (for CONFIG_X86_TSC), the
> end.
>
> No static keys, nothing.

So much for the theory. There are CPUs out there where you can't figure out
the TSC frequency, which in turn means TSC has to be discarded as well. Yes,
it's all utter crap....

> That said; flipping static keys early isn't hard. We should call
> jump_label_init() early, because we want the entries sorted and the
> key->entries link set. It will also replace the GENERIC_NOP5_ATOMIC
> thing, which means we need to also do arch_init_ideal_nop() early, but
> since that is pure CPUID based that should be doable.

Yes, that should work and then we'd just have a single static key.

Thanks,

tglx

2018-06-20 21:32:44

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v10 7/7] x86/tsc: use tsc early

Hi Peter,

> That said; flipping static keys early isn't hard. We should call
> jump_label_init() early, because we want the entries sorted and the
> key->entries link set. It will also replace the GENERIC_NOP5_ATOMIC
> thing, which means we need to also do arch_init_ideal_nop() early, but
> since that is pure CPUID based that should be doable.
>
> And then something like the below could be used.

I like the idea of making static branches available early, as they can
be used in more places during boot.

However, that should be part of a separate project, and a follow-up
cleanup can be done for the places that benefit from it, such as tsc.c
and perhaps sched.c.

Thank you,
Pavel