2018-07-18 02:24:50

by Pavel Tatashin

Subject: [PATCH v14 00/25] Early boot time stamps

changelog
---------
v14 - v13
- Included Thomas' KVM clock series, addressed comments from
reviewers.
http://lkml.kernel.org/r/[email protected]
- Fixed xen hvm panic reported by Boris
- Fixed build issue on microblaze

v13 - v12
- Addressed comments from Thomas Gleixner.
- Addressed comments from Peter Zijlstra.
- Added a patch from Borislav Petkov
- Added a new patch: sched: use static key for sched_clock_running
- Added xen pv fixes, so clock is initialized when other
hypervisors initialize their clocks.
Note: I am including kvm/x86: remove kvm memblock dependency, which
is part of this series:
http://lkml.kernel.org/r/[email protected]
Because without this patch it is not possible to test this series on
KVM.

v12 - v11
- split time: replace read_boot_clock64() with
read_persistent_wall_and_boot_offset() into four patches
- Added two patches: one fixes an existing bug in text_poke(), the
other enables static branches early. Note, because I found
and fixed the text_poke() bug, enabling static branching became
super easy, as no changes to jump_label* are needed.
- Modified x86/tsc: use tsc early to use static branches early, and
thus native_sched_clock() is not changed at all.

v11 - v10
- Addressed all the comments from Thomas Gleixner.
- I added one more patch:
"x86/tsc: prepare for early sched_clock" which fixes a problem
that I discovered while testing. I am not particularly happy with
the fix, as it adds a new argument that is used only in one
place, but if you have a suggestion for a different approach on
how to address this problem please let me know.

v10 - v9
- Added another patch to this series that removes the dependency
between the KVM clock and the memblock allocator. The benefit is that
all clocks can now be initialized even earlier.

v9 - v8
- Addressed more comments from Dou Liyang

v8 - v7
- Addressed comments from Dou Liyang:
- Moved tsc_early_init() and tsc_early_fini() to be all inside
tsc.c, and changed them to be static.
- Removed warning when notsc parameter is used.
- Merged with:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

v7 - v6
- Removed tsc_disabled flag; notsc is now equivalent to
tsc=unstable
- Simplified changes to sched/clock.c, by removing the
sched_clock_early() and friends as requested by Peter Zijlstra.
We now always use sched_clock()
- Modified x86 sched_clock() to return either early boot time or
regular.
- Added another example of why early boot time is important

v5 - v6
- Added a new patch:
time: sync read_boot_clock64() with persistent clock
Which fixes missing __init macro, and enabled time discrepancy
fix that was noted by Thomas Gleixner
- Split "x86/time: read_boot_clock64() implementation" into a
separate patch

v4 - v5
- Fix compiler warnings on systems with stable clocks.

v3 - v4
- Fixed tsc_early_fini() call to be in the 2nd patch as reported
by Dou Liyang
- Improved comment before __use_sched_clock_early to explain why
we need both booleans.
- Simplified valid_clock logic in read_boot_clock64().

v2 - v3
- Addressed comment from Thomas Gleixner
- Timestamps are available a little later in boot but still much
earlier than in mainline. This significantly simplified this
work.

v1 - v2
In patch "x86/tsc: tsc early":
- added tsc_adjusted_early()
- fixed 32-bit compile error use do_div()

The early boot time stamps were discussed recently in these threads:
http://lkml.kernel.org/r/[email protected]
http://lkml.kernel.org/r/[email protected]

I updated my series to the latest mainline and am sending it again.

Peter mentioned he did not like patches 6 and 7, and we can discuss a
better way to do that, but I think patches 1-5 can be accepted separately,
since they already enable early timestamps on platforms where sched_clock()
is available early, such as KVM.

Adding early boot time stamps support for x86 machines.
SPARC patches for early boot time stamps are already integrated into
mainline linux.

Sample output
-------------
Before:
https://paste.ubuntu.com/26133428/

After:
https://paste.ubuntu.com/26133523/

For examples of how early time stamps are used, see this work:
Example 1:
https://lwn.net/Articles/734374/
- Without early boot time stamps we would not know about the extra time
that is spent zeroing struct pages early in boot even when deferred
page initialization is enabled.

Example 2:
https://patchwork.kernel.org/patch/10021247/
- If early boot timestamps were available, the engineer who introduced
this bug would have noticed the extra time that is spent early in boot.

Pavel Tatashin (7):
x86/tsc: remove tsc_disabled flag
time: sync read_boot_clock64() with persistent clock
x86/time: read_boot_clock64() implementation
sched: early boot clock
kvm/x86: remove kvm memblock dependency
x86/paravirt: add active_sched_clock to pv_time_ops
x86/tsc: use tsc early

Example 3:
http://lkml.kernel.org/r/[email protected]
- Needed early time stamps to show improvement

Borislav Petkov (1):
x86/CPU: Call detect_nopl() only on the BSP

Pavel Tatashin (17):
x86/kvmclock: Remove memblock dependency
x86: text_poke() may access uninitialized struct pages
x86: initialize static branching early
x86/tsc: redefine notsc to behave as tsc=unstable
x86/xen/time: initialize pv xen time in init_hypervisor_platform
x86/xen/time: output xen sched_clock time from 0
s390/time: add read_persistent_wall_and_boot_offset()
time: replace read_boot_clock64() with
read_persistent_wall_and_boot_offset()
time: default boot time offset to local_clock()
s390/time: remove read_boot_clock64()
ARM/time: remove read_boot_clock64()
x86/tsc: calibrate tsc only once
x86/tsc: initialize cyc2ns when tsc freq. is determined
x86/tsc: use tsc early
sched: move sched clock initialization and merge with generic clock
sched: early boot clock
sched: use static key for sched_clock_running

Peter Zijlstra (1):
x86/kvmclock: Avoid TSC recalibration

Thomas Gleixner (6):
x86/kvmclock: Remove page size requirement from wall_clock
x86/kvmclock: Decrapify kvm_register_clock()
x86/kvmclock: Cleanup the code
x86/kvmclock: Mark variables __initdata and __ro_after_init
x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock
x86/kvmclock: Switch kvmclock data to a PER_CPU variable

.../admin-guide/kernel-parameters.txt | 2 -
Documentation/x86/x86_64/boot-options.txt | 4 +-
arch/arm/include/asm/mach/time.h | 3 +-
arch/arm/kernel/time.c | 15 +-
arch/arm/plat-omap/counter_32k.c | 2 +-
arch/s390/kernel/time.c | 15 +-
arch/x86/include/asm/kvm_guest.h | 7 -
arch/x86/include/asm/kvm_para.h | 1 -
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/include/asm/tsc.h | 2 +-
arch/x86/kernel/alternative.c | 7 +
arch/x86/kernel/cpu/amd.c | 13 +-
arch/x86/kernel/cpu/common.c | 40 +--
arch/x86/kernel/jump_label.c | 11 +-
arch/x86/kernel/kvm.c | 14 +-
arch/x86/kernel/kvmclock.c | 266 ++++++++----------
arch/x86/kernel/setup.c | 10 +-
arch/x86/kernel/tsc.c | 187 ++++++------
arch/x86/xen/enlighten_pv.c | 51 ++--
arch/x86/xen/mmu_pv.c | 6 +-
arch/x86/xen/suspend_pv.c | 5 +-
arch/x86/xen/time.c | 17 +-
arch/x86/xen/xen-ops.h | 6 +-
drivers/clocksource/tegra20_timer.c | 2 +-
include/linux/sched_clock.h | 5 +-
include/linux/timekeeping.h | 3 +-
init/main.c | 4 +-
kernel/sched/clock.c | 49 ++--
kernel/sched/core.c | 1 -
kernel/sched/debug.c | 2 -
kernel/time/sched_clock.c | 2 +-
kernel/time/timekeeping.c | 62 ++--
32 files changed, 386 insertions(+), 429 deletions(-)
delete mode 100644 arch/x86/include/asm/kvm_guest.h

--
2.18.0



2018-07-18 02:24:20

by Pavel Tatashin

Subject: [PATCH v14 02/25] x86/kvmclock: Remove page size requirement from wall_clock

From: Thomas Gleixner <[email protected]>

There is no requirement for wall_clock data to be page aligned or page
sized.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/kernel/kvmclock.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 890e9e58e4bf..e9863639312c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -46,14 +46,12 @@ early_param("no-kvmclock", parse_no_kvmclock);

/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
-#define WALL_CLOCK_SIZE (sizeof(struct pvclock_wall_clock))

static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
-static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);

/* The hypervisor will put information about time periodically here */
static struct pvclock_vsyscall_time_info *hv_clock;
-static struct pvclock_wall_clock *wall_clock;
+static struct pvclock_wall_clock wall_clock;

/*
* The wallclock is the time of day when we booted. Since then, some time may
@@ -66,15 +64,15 @@ static void kvm_get_wallclock(struct timespec64 *now)
int low, high;
int cpu;

- low = (int)slow_virt_to_phys(wall_clock);
- high = ((u64)slow_virt_to_phys(wall_clock) >> 32);
+ low = (int)slow_virt_to_phys(&wall_clock);
+ high = ((u64)slow_virt_to_phys(&wall_clock) >> 32);

native_write_msr(msr_kvm_wall_clock, low, high);

cpu = get_cpu();

vcpu_time = &hv_clock[cpu].pvti;
- pvclock_read_wallclock(wall_clock, vcpu_time, now);
+ pvclock_read_wallclock(&wall_clock, vcpu_time, now);

put_cpu();
}
@@ -266,12 +264,10 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;

- wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;

if (kvm_register_clock("primary cpu clock")) {
hv_clock = NULL;
- wall_clock = NULL;
return;
}

--
2.18.0


2018-07-18 02:25:08

by Pavel Tatashin

Subject: [PATCH v14 13/25] x86/xen/time: initialize pv xen time in init_hypervisor_platform

In every hypervisor except Xen PV, time ops are initialized in
init_hypervisor_platform().

Xen PV domains initialize time ops in x86_init.paging.pagetable_init()
by calling xen_setup_shared_info(), which is a poor design, as time is
needed before the memory allocator is available.

xen_setup_shared_info() is called from two places: during boot, and
after suspend. Split the content of xen_setup_shared_info() into
three places:

1. add the clock-relevant data to a new xen pv init_platform vector, and
set clock ops in there.

2. move xen_setup_vcpu_info_placement() to new xen_pv_guest_late_init()
call.

3. copy the parts that re-initialize shared info to xen_pv_post_suspend()
to be symmetric to xen_pv_pre_suspend()

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/xen/enlighten_pv.c | 51 +++++++++++++++++--------------------
arch/x86/xen/mmu_pv.c | 6 ++---
arch/x86/xen/suspend_pv.c | 5 ++--
arch/x86/xen/time.c | 7 +++--
arch/x86/xen/xen-ops.h | 6 ++---
5 files changed, 34 insertions(+), 41 deletions(-)

diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 439a94bf89ad..105a57d73701 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -119,6 +119,27 @@ static void __init xen_banner(void)
version >> 16, version & 0xffff, extra.extraversion,
xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
}
+
+static void __init xen_pv_init_platform(void)
+{
+ set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+ HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+
+ /* xen clock uses per-cpu vcpu_info, need to init it for boot cpu */
+ xen_vcpu_info_reset(0);
+
+ /* pvclock is in shared info area */
+ xen_init_time_ops();
+}
+
+static void __init xen_pv_guest_late_init(void)
+{
+#ifndef CONFIG_SMP
+ /* Setup shared vcpu info for non-smp configurations */
+ xen_setup_vcpu_info_placement();
+#endif
+}
+
/* Check if running on Xen version (major, minor) or later */
bool
xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -947,34 +968,8 @@ static void xen_write_msr(unsigned int msr, unsigned low, unsigned high)
xen_write_msr_safe(msr, low, high);
}

-void xen_setup_shared_info(void)
-{
- set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
-
- HYPERVISOR_shared_info =
- (struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
-
- xen_setup_mfn_list_list();
-
- if (system_state == SYSTEM_BOOTING) {
-#ifndef CONFIG_SMP
- /*
- * In UP this is as good a place as any to set up shared info.
- * Limit this to boot only, at restore vcpu setup is done via
- * xen_vcpu_restore().
- */
- xen_setup_vcpu_info_placement();
-#endif
- /*
- * Now that shared info is set up we can start using routines
- * that point to pvclock area.
- */
- xen_init_time_ops();
- }
-}
-
/* This is called once we have the cpu_possible_mask */
-void __ref xen_setup_vcpu_info_placement(void)
+void __init xen_setup_vcpu_info_placement(void)
{
int cpu;

@@ -1228,6 +1223,8 @@ asmlinkage __visible void __init xen_start_kernel(void)
x86_init.irqs.intr_mode_init = x86_init_noop;
x86_init.oem.arch_setup = xen_arch_setup;
x86_init.oem.banner = xen_banner;
+ x86_init.hyper.init_platform = xen_pv_init_platform;
+ x86_init.hyper.guest_late_init = xen_pv_guest_late_init;

/*
* Set up some pagetable state before starting to set any ptes.
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2c30cabfda90..52206ad81e4b 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1230,8 +1230,7 @@ static void __init xen_pagetable_p2m_free(void)
* We roundup to the PMD, which means that if anybody at this stage is
* using the __ka address of xen_start_info or
* xen_start_info->shared_info they are in going to crash. Fortunatly
- * we have already revectored in xen_setup_kernel_pagetable and in
- * xen_setup_shared_info.
+ * we have already revectored in xen_setup_kernel_pagetable.
*/
size = roundup(size, PMD_SIZE);

@@ -1292,8 +1291,7 @@ static void __init xen_pagetable_init(void)

/* Remap memory freed due to conflicts with E820 map */
xen_remap_memory();
-
- xen_setup_shared_info();
+ xen_setup_mfn_list_list();
}
static void xen_write_cr2(unsigned long cr2)
{
diff --git a/arch/x86/xen/suspend_pv.c b/arch/x86/xen/suspend_pv.c
index a2e0f110af56..8303b58c79a9 100644
--- a/arch/x86/xen/suspend_pv.c
+++ b/arch/x86/xen/suspend_pv.c
@@ -27,8 +27,9 @@ void xen_pv_pre_suspend(void)
void xen_pv_post_suspend(int suspend_cancelled)
{
xen_build_mfn_list_list();
-
- xen_setup_shared_info();
+ set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+ HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+ xen_setup_mfn_list_list();

if (suspend_cancelled) {
xen_start_info->store_mfn =
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index e0f1bcf01d63..53bb7a8d10b5 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -40,7 +40,7 @@ static unsigned long xen_tsc_khz(void)
return pvclock_tsc_khz(info);
}

-u64 xen_clocksource_read(void)
+static u64 xen_clocksource_read(void)
{
struct pvclock_vcpu_time_info *src;
u64 ret;
@@ -503,7 +503,7 @@ static void __init xen_time_init(void)
pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
}

-void __ref xen_init_time_ops(void)
+void __init xen_init_time_ops(void)
{
pv_time_ops = xen_time_ops;

@@ -542,8 +542,7 @@ void __init xen_hvm_init_time_ops(void)
return;

if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
- printk(KERN_INFO "Xen doesn't support pvclock on HVM,"
- "disable pv timer\n");
+ pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
return;
}

diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 3b34745d0a52..e78684597f57 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -31,7 +31,6 @@ extern struct shared_info xen_dummy_shared_info;
extern struct shared_info *HYPERVISOR_shared_info;

void xen_setup_mfn_list_list(void);
-void xen_setup_shared_info(void);
void xen_build_mfn_list_list(void);
void xen_setup_machphys_mapping(void);
void xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn);
@@ -68,12 +67,11 @@ void xen_init_irq_ops(void);
void xen_setup_timer(int cpu);
void xen_setup_runstate_info(int cpu);
void xen_teardown_timer(int cpu);
-u64 xen_clocksource_read(void);
void xen_setup_cpu_clockevents(void);
void xen_save_time_memory_area(void);
void xen_restore_time_memory_area(void);
-void __ref xen_init_time_ops(void);
-void __init xen_hvm_init_time_ops(void);
+void xen_init_time_ops(void);
+void xen_hvm_init_time_ops(void);

irqreturn_t xen_debug_interrupt(int irq, void *dev_id);

--
2.18.0


2018-07-18 02:25:15

by Pavel Tatashin

Subject: [PATCH v14 08/25] x86/kvmclock: Avoid TSC recalibration

From: Peter Zijlstra <[email protected]>

If the host gives us a TSC rate, assume it is good and don't try and
recalibrate things against virtual timer hardware.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/kvmclock.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ed170171fe49..da0ede8ac8f6 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -141,7 +141,16 @@ static inline void kvm_sched_clock_init(bool stable)
*/
static unsigned long kvm_get_tsc_khz(void)
{
- return pvclock_tsc_khz(this_cpu_pvti());
+ unsigned long tsc_khz = pvclock_tsc_khz(this_cpu_pvti());
+
+ /*
+ * TSC frequency is reported by the host; calibration against (virtual)
+ * HPET/PM-timer in a guest is dodgy and pointless since the host
+ * already did it for us where required.
+ */
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+ return tsc_khz;
}

static void kvm_get_preset_lpj(void)
--
2.18.0


2018-07-18 02:25:23

by Pavel Tatashin

Subject: [PATCH v14 15/25] s390/time: add read_persistent_wall_and_boot_offset()

read_persistent_wall_and_boot_offset() will replace read_boot_clock64()
because on some architectures it is more convenient to read both sources
together, as one may depend on the other. For s390, the implementation is
the same as read_boot_clock64(), but it also calls
read_persistent_clock64() and returns its value.
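
As a rough userspace illustration of the relationship between the two
values (not the s390 code itself), the boot offset is the current wall
time minus the wall time at which the machine booted. The names struct
ts64, ts64_sub() and wall_and_boot_offset() are made up for this sketch
and only mimic timespec64 and timespec64_sub():

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical stand-in for struct timespec64. */
struct ts64 {
	int64_t sec;
	long nsec;
};

/* a - b, assuming both inputs are normalized (0 <= nsec < NSEC_PER_SEC). */
static struct ts64 ts64_sub(struct ts64 a, struct ts64 b)
{
	struct ts64 r;

	assert(a.nsec >= 0 && a.nsec < NSEC_PER_SEC);
	assert(b.nsec >= 0 && b.nsec < NSEC_PER_SEC);

	r.sec = a.sec - b.sec;
	r.nsec = a.nsec - b.nsec;
	if (r.nsec < 0) {
		/* Borrow one second to keep nsec in range. */
		r.sec--;
		r.nsec += NSEC_PER_SEC;
	}
	return r;
}

/*
 * Return the current wall time and how long the system has been up,
 * given the wall time now and the wall time at boot.
 */
static void wall_and_boot_offset(struct ts64 wall_now, struct ts64 boot_time,
				 struct ts64 *wall_time,
				 struct ts64 *boot_offset)
{
	*wall_time = wall_now;
	*boot_offset = ts64_sub(wall_now, boot_time);
}
```

With a wall time of 100.2s and a boot time of 90.7s, the offset comes
out as 9.5s.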

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Martin Schwidefsky <[email protected]>
---
arch/s390/kernel/time.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index cf561160ea88..d1f5447d5687 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -221,6 +221,24 @@ void read_persistent_clock64(struct timespec64 *ts)
ext_to_timespec64(clk, ts);
}

+void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+ struct timespec64 *boot_offset)
+{
+ unsigned char clk[STORE_CLOCK_EXT_SIZE];
+ struct timespec64 boot_time;
+ __u64 delta;
+
+ delta = initial_leap_seconds + TOD_UNIX_EPOCH;
+ memcpy(clk, tod_clock_base, STORE_CLOCK_EXT_SIZE);
+ *(__u64 *)&clk[1] -= delta;
+ if (*(__u64 *)&clk[1] > delta)
+ clk[0]--;
+ ext_to_timespec64(clk, &boot_time);
+
+ read_persistent_clock64(wall_time);
+ *boot_offset = timespec64_sub(*wall_time, boot_time);
+}
+
void read_boot_clock64(struct timespec64 *ts)
{
unsigned char clk[STORE_CLOCK_EXT_SIZE];
--
2.18.0


2018-07-18 02:25:24

by Pavel Tatashin

Subject: [PATCH v14 03/25] x86/kvmclock: Decrapify kvm_register_clock()

From: Thomas Gleixner <[email protected]>

The return value is pointless because the wrmsr cannot fail if
KVM_FEATURE_CLOCKSOURCE or KVM_FEATURE_CLOCKSOURCE2 are set.

kvm_register_clock() is only called locally so wants to be static.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 1 -
arch/x86/kernel/kvmclock.c | 34 +++++++++++----------------------
2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 3aea2658323a..4c723632c036 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -7,7 +7,6 @@
#include <uapi/asm/kvm_para.h>

extern void kvmclock_init(void);
-extern int kvm_register_clock(char *txt);

#ifdef CONFIG_KVM_GUEST
bool kvm_check_and_clear_guest_paused(void);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index e9863639312c..cbf0a6b9217b 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -186,23 +186,19 @@ struct clocksource kvm_clock = {
};
EXPORT_SYMBOL_GPL(kvm_clock);

-int kvm_register_clock(char *txt)
+static void kvm_register_clock(char *txt)
{
- int cpu = smp_processor_id();
- int low, high, ret;
struct pvclock_vcpu_time_info *src;
+ int cpu = smp_processor_id();
+ u64 pa;

if (!hv_clock)
- return 0;
+ return;

src = &hv_clock[cpu].pvti;
- low = (int)slow_virt_to_phys(src) | 1;
- high = ((u64)slow_virt_to_phys(src) >> 32);
- ret = native_write_msr_safe(msr_kvm_system_time, low, high);
- printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
- cpu, high, low, txt);
-
- return ret;
+ pa = slow_virt_to_phys(src) | 0x01ULL;
+ wrmsrl(msr_kvm_system_time, pa);
+ pr_info("kvm-clock: cpu %d, msr %llx, %s\n", cpu, pa, txt);
}

static void kvm_save_sched_clock_state(void)
@@ -217,11 +213,7 @@ static void kvm_restore_sched_clock_state(void)
#ifdef CONFIG_X86_LOCAL_APIC
static void kvm_setup_secondary_clock(void)
{
- /*
- * Now that the first cpu already had this clocksource initialized,
- * we shouldn't fail.
- */
- WARN_ON(kvm_register_clock("secondary cpu clock"));
+ kvm_register_clock("secondary cpu clock");
}
#endif

@@ -264,16 +256,12 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;

- hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
-
- if (kvm_register_clock("primary cpu clock")) {
- hv_clock = NULL;
- return;
- }
-
printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);

+ hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
+ kvm_register_clock("primary cpu clock");
+
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);

--
2.18.0


2018-07-18 02:25:37

by Pavel Tatashin

Subject: [PATCH v14 05/25] x86/kvmclock: Mark variables __initdata and __ro_after_init

From: Thomas Gleixner <[email protected]>

The kvmclock parameter is init data and the other variables are not
modified after init.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/kernel/kvmclock.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 04d2f5e1d783..f312d7f6de57 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -32,10 +32,10 @@
#include <asm/reboot.h>
#include <asm/kvmclock.h>

-static int kvmclock __ro_after_init = 1;
-static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
-static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
-static u64 kvm_sched_clock_offset;
+static int kvmclock __initdata = 1;
+static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
+static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
+static u64 kvm_sched_clock_offset __ro_after_init;

static int __init parse_no_kvmclock(char *arg)
{
@@ -50,7 +50,7 @@ early_param("no-kvmclock", parse_no_kvmclock);
static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);

/* The hypervisor will put information about time periodically here */
-static struct pvclock_vsyscall_time_info *hv_clock;
+static struct pvclock_vsyscall_time_info *hv_clock __ro_after_init;
static struct pvclock_wall_clock wall_clock;

/*
--
2.18.0


2018-07-18 02:25:47

by Pavel Tatashin

Subject: [PATCH v14 07/25] x86/kvmclock: Switch kvmclock data to a PER_CPU variable

From: Thomas Gleixner <[email protected]>

The previous removal of the memblock dependency from kvmclock introduced a
static data array sized 64bytes * CONFIG_NR_CPUS. That's wasteful on large
systems when kvmclock is not used.

Replace it with:

- A static page sized array of pvclock data. It's page sized because the
pvclock data of the boot cpu is mapped into the vDSO, so otherwise random
other data would be exposed to the vDSO

- A PER_CPU variable of pvclock data pointers. This is used to access the
pvclock data storage on each CPU.

The setup is done in two stages:

- Early boot stores the pointer to the static page for the boot CPU in
the per cpu data.

- In the preparatory stage of CPU hotplug assign either an element of
the static array (when the CPU number is in that range) or allocate
memory and initialize the per cpu pointer.
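
The two-stage setup can be sketched in plain userspace C as below. This
is only a simplified model under stated assumptions: MAX_CPUS,
BOOT_SLOTS, struct time_info, early_init() and prepare_cpu() are
hypothetical names, a plain array stands in for the per-cpu variable,
calloc() stands in for kzalloc(), and the check for replicated CPU0
pointers done by the real hotplug callback is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CPUS   8	/* hypothetical number of possible CPUs */
#define BOOT_SLOTS 4	/* stands in for HVC_BOOT_ARRAY_SIZE */

/* Hypothetical stand-in for struct pvclock_vsyscall_time_info. */
struct time_info {
	unsigned long long data[8];
};

/* Static storage that covers the first BOOT_SLOTS CPUs. */
static struct time_info boot_slots[BOOT_SLOTS];

/* Plain array standing in for the per-cpu pointer variable. */
static struct time_info *per_cpu_slot[MAX_CPUS];

/* Stage 1: early boot points the boot CPU at the static storage. */
static void early_init(void)
{
	assert(per_cpu_slot[0] == NULL);
	per_cpu_slot[0] = &boot_slots[0];
}

/* Stage 2: called before a CPU comes up; returns 0 on success. */
static int prepare_cpu(unsigned int cpu)
{
	if (cpu >= MAX_CPUS)
		return -1;
	if (per_cpu_slot[cpu])
		return 0;	/* already set up */

	/* Use the static array for the first CPUs, allocate otherwise. */
	if (cpu < BOOT_SLOTS)
		per_cpu_slot[cpu] = &boot_slots[cpu];
	else
		per_cpu_slot[cpu] = calloc(1, sizeof(struct time_info));

	return per_cpu_slot[cpu] ? 0 : -1;
}
```

The point of the split is that only a small fixed amount of static
storage is reserved, and CPUs beyond it pay for their slot only when
they are actually brought up.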

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/kernel/kvmclock.c | 99 ++++++++++++++++++++++++--------------
1 file changed, 62 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 08e7726a5e62..ed170171fe49 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,6 +23,7 @@
#include <asm/apic.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
+#include <linux/cpuhotplug.h>
#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/mm.h>
@@ -55,12 +56,23 @@ early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);

/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define HVC_BOOT_ARRAY_SIZE \
+ (PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))

-static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
-
-/* The hypervisor will put information about time periodically here */
-static struct pvclock_vsyscall_time_info *hv_clock __ro_after_init;
+static struct pvclock_vsyscall_time_info
+ hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE);
static struct pvclock_wall_clock wall_clock;
+static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
+
+static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
+{
+ return &this_cpu_read(hv_clock_per_cpu)->pvti;
+}
+
+static inline struct pvclock_vsyscall_time_info *this_cpu_hvclock(void)
+{
+ return this_cpu_read(hv_clock_per_cpu);
+}

/*
* The wallclock is the time of day when we booted. Since then, some time may
@@ -69,17 +81,10 @@ static struct pvclock_wall_clock wall_clock;
*/
static void kvm_get_wallclock(struct timespec64 *now)
{
- struct pvclock_vcpu_time_info *vcpu_time;
- int cpu;
-
wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));
-
- cpu = get_cpu();
-
- vcpu_time = &hv_clock[cpu].pvti;
- pvclock_read_wallclock(&wall_clock, vcpu_time, now);
-
- put_cpu();
+ preempt_disable();
+ pvclock_read_wallclock(&wall_clock, this_cpu_pvti(), now);
+ preempt_enable();
}

static int kvm_set_wallclock(const struct timespec64 *now)
@@ -89,14 +94,10 @@ static int kvm_set_wallclock(const struct timespec64 *now)

static u64 kvm_clock_read(void)
{
- struct pvclock_vcpu_time_info *src;
u64 ret;
- int cpu;

preempt_disable_notrace();
- cpu = smp_processor_id();
- src = &hv_clock[cpu].pvti;
- ret = pvclock_clocksource_read(src);
+ ret = pvclock_clocksource_read(this_cpu_pvti());
preempt_enable_notrace();
return ret;
}
@@ -140,7 +141,7 @@ static inline void kvm_sched_clock_init(bool stable)
*/
static unsigned long kvm_get_tsc_khz(void)
{
- return pvclock_tsc_khz(&hv_clock[0].pvti);
+ return pvclock_tsc_khz(this_cpu_pvti());
}

static void kvm_get_preset_lpj(void)
@@ -157,15 +158,14 @@ static void kvm_get_preset_lpj(void)

bool kvm_check_and_clear_guest_paused(void)
{
- struct pvclock_vcpu_time_info *src;
+ struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
bool ret = false;

- if (!hv_clock)
+ if (!src)
return ret;

- src = &hv_clock[smp_processor_id()].pvti;
- if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
- src->flags &= ~PVCLOCK_GUEST_STOPPED;
+ if ((src->pvti.flags & PVCLOCK_GUEST_STOPPED) != 0) {
+ src->pvti.flags &= ~PVCLOCK_GUEST_STOPPED;
pvclock_touch_watchdogs();
ret = true;
}
@@ -183,17 +183,15 @@ EXPORT_SYMBOL_GPL(kvm_clock);

static void kvm_register_clock(char *txt)
{
- struct pvclock_vcpu_time_info *src;
- int cpu = smp_processor_id();
+ struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
u64 pa;

- if (!hv_clock)
+ if (!src)
return;

- src = &hv_clock[cpu].pvti;
- pa = slow_virt_to_phys(src) | 0x01ULL;
+ pa = slow_virt_to_phys(&src->pvti) | 0x01ULL;
wrmsrl(msr_kvm_system_time, pa);
- pr_info("kvm-clock: cpu %d, msr %llx, %s", cpu, pa, txt);
+ pr_info("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
}

static void kvm_save_sched_clock_state(void)
@@ -241,20 +239,42 @@ static int __init kvm_setup_vsyscall_timeinfo(void)
#ifdef CONFIG_X86_64
u8 flags;

- if (!hv_clock || !kvmclock_vsyscall)
+ if (!per_cpu(hv_clock_per_cpu, 0) || !kvmclock_vsyscall)
return 0;

- flags = pvclock_read_flags(&hv_clock[0].pvti);
+ flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
if (!(flags & PVCLOCK_TSC_STABLE_BIT))
- return 1;
+ return 0;

- pvclock_set_pvti_cpu0_va(hv_clock);
+ pvclock_set_pvti_cpu0_va(hv_clock_boot);
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
#endif
return 0;
}
early_initcall(kvm_setup_vsyscall_timeinfo);

+static int kvmclock_setup_percpu(unsigned int cpu)
+{
+ struct pvclock_vsyscall_time_info *p = per_cpu(hv_clock_per_cpu, cpu);
+
+ /*
+ * The per cpu area setup replicates CPU0 data to all cpu
+ * pointers. So carefully check. CPU0 has been set up in init
+ * already.
+ */
+ if (!cpu || (p && p != per_cpu(hv_clock_per_cpu, 0)))
+ return 0;
+
+ /* Use the static page for the first CPUs, allocate otherwise */
+ if (cpu < HVC_BOOT_ARRAY_SIZE)
+ p = &hv_clock_boot[cpu];
+ else
+ p = kzalloc(sizeof(*p), GFP_KERNEL);
+
+ per_cpu(hv_clock_per_cpu, cpu) = p;
+ return p ? 0 : -ENOMEM;
+}
+
void __init kvmclock_init(void)
{
u8 flags;
@@ -269,16 +289,21 @@ void __init kvmclock_init(void)
return;
}

+ if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
+ kvmclock_setup_percpu, NULL) < 0) {
+ return;
+ }
+
pr_info("kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);

- hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
+ this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
kvm_register_clock("primary cpu clock");

if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);

- flags = pvclock_read_flags(&hv_clock[0].pvti);
+ flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);

x86_platform.calibrate_tsc = kvm_get_tsc_khz;
--
2.18.0


2018-07-18 02:25:53

by Pavel Tatashin

Subject: [PATCH v14 14/25] x86/xen/time: output xen sched_clock time from 0

sched_clock() is expected to start counting from 0 when the system boots.
Add an offset, xen_sched_clock_offset (similar to how it is done in other
hypervisors, e.g. kvm_sched_clock_offset), so that sched_clock() counts
from 0 from the moment time is first initialized.
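
The idea, reduced to a userspace sketch: record the raw clock value once
at init and subtract it on every read. Here fake_raw_ns and
raw_clock_read() are hypothetical stand-ins for the hypervisor clock
(xen_clocksource_read() in this patch), and the raw clock is assumed to
be monotonic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical raw clock value in ns; already non-zero at "boot". */
static uint64_t fake_raw_ns = 5000000000ULL;

/* Stand-in for the raw hypervisor clock read. */
static uint64_t raw_clock_read(void)
{
	return fake_raw_ns;
}

/* Raw clock value captured when time is first initialized. */
static uint64_t sched_clock_offset;

static void clock_init(void)
{
	sched_clock_offset = raw_clock_read();
}

/* Nanoseconds elapsed since clock_init(). */
static uint64_t sched_clock_from_zero(void)
{
	uint64_t now = raw_clock_read();

	/* Relies on the raw clock never going backwards. */
	assert(now >= sched_clock_offset);
	return now - sched_clock_offset;
}
```

Right after clock_init() the function returns 0, and later reads return
only the time elapsed since then, regardless of what the raw clock
showed at boot.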

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/xen/time.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 53bb7a8d10b5..25a780d89b7a 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -31,6 +31,8 @@
/* Xen may fire a timer up to this many ns early */
#define TIMER_SLOP 100000

+static u64 xen_sched_clock_offset __read_mostly;
+
/* Get the TSC speed from Xen */
static unsigned long xen_tsc_khz(void)
{
@@ -57,6 +59,11 @@ static u64 xen_clocksource_get_cycles(struct clocksource *cs)
return xen_clocksource_read();
}

+static u64 xen_sched_clock(void)
+{
+ return xen_clocksource_read() - xen_sched_clock_offset;
+}
+
static void xen_read_wallclock(struct timespec64 *ts)
{
struct shared_info *s = HYPERVISOR_shared_info;
@@ -367,7 +374,7 @@ void xen_timer_resume(void)
}

static const struct pv_time_ops xen_time_ops __initconst = {
- .sched_clock = xen_clocksource_read,
+ .sched_clock = xen_sched_clock,
.steal_clock = xen_steal_clock,
};

@@ -505,6 +512,7 @@ static void __init xen_time_init(void)

void __init xen_init_time_ops(void)
{
+ xen_sched_clock_offset = xen_clocksource_read();
pv_time_ops = xen_time_ops;

x86_init.timers.timer_init = xen_time_init;
--
2.18.0


2018-07-18 02:26:08

by Pavel Tatashin

Subject: [PATCH v14 09/25] x86: text_poke() may access uninitialized struct pages

It is supposed to be safe to modify static branches after jump_label_init().
But because the static key modifying code eventually calls text_poke(), we
may end up accessing struct pages that have not been initialized.

Here is how to quickly reproduce the problem. Insert code like this
into init/main.c:

| +static DEFINE_STATIC_KEY_FALSE(__test);
| asmlinkage __visible void __init start_kernel(void)
| {
| char *command_line;
|@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
| vfs_caches_init_early();
| sort_main_extable();
| trap_init();
|+ {
|+ static_branch_enable(&__test);
|+ WARN_ON(!static_branch_likely(&__test));
|+ }
| mm_init();

The following warning shows up:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc1_pt_t1 #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.11.0-20171110_100015-anatol 04/01/2014
RIP: 0010:text_poke+0x20d/0x230
Code: 0f 0b 4c 89 e2 4c 89 ee 4c 89 f7 e8 7d 4b 9b 00 31 d2 31 f6 bf 86 02
00 00 48 8b 05 95 8e 24 01 e8 78 18 d8 00 e9 55 ff ff ff <0f> 0b e9 54 fe
ff ff 48 8b 05 75 a8 38 01 e9 64 fe ff ff 48 8b 1d
RSP: 0000:ffffffff94e03e30 EFLAGS: 00010046
RAX: 0100000000000000 RBX: fffff7b2c011f300 RCX: ffffffff94fcccf4
RDX: 0000000000000001 RSI: ffffffff94e03e77 RDI: ffffffff94fcccef
RBP: ffffffff94fcccef R08: 00000000fffffe00 R09: 00000000000000a0
R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000001
R13: ffffffff94e03e77 R14: ffffffff94fcdcef R15: fffff7b2c0000000
FS: 0000000000000000(0000) GS:ffff9adc87c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff9adc8499d000 CR3: 000000000460a001 CR4: 00000000000606b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? start_kernel+0x23e/0x4c8
? start_kernel+0x23f/0x4c8
? text_poke_bp+0x50/0xda
? arch_jump_label_transform+0x89/0xe0
? __jump_label_update+0x78/0xb0
? static_key_enable_cpuslocked+0x4d/0x80
? static_key_enable+0x11/0x20
? start_kernel+0x23e/0x4c8
? secondary_startup_64+0xa5/0xb0
---[ end trace abdc99c031b8a90a ]---

If the code above is moved after mm_init(), no warning is shown, as struct
pages are initialized during handover from memblock.

Use text_poke_early() in static branching until early boot IRQs are
enabled, at which time switch to text_poke(). Also, ensure text_poke() is
never invoked when uninitialized memory access may happen by adding a
BUG_ON(!after_bootmem) assertion.
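
The selection logic can be sketched in user-space C as follows. This is only
an illustration: sketch_early_boot stands in for early_boot_irqs_disabled, and
sketch_poke_early(), sketch_poke(), sketch_transform() and sketch_text are
hypothetical names that simply copy bytes and count calls. The real functions
patch kernel text and do considerably more.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef void *(*poke_fn)(void *addr, const void *opcode, size_t len);

/* Stand-in for early_boot_irqs_disabled: true until IRQs are enabled. */
static bool sketch_early_boot = true;

/* Call counters and a small buffer, used only to observe the behavior. */
static int early_calls, late_calls;
static unsigned char sketch_text[4];

/* Early variant: plain copy, needs no struct page. */
static void *sketch_poke_early(void *addr, const void *opcode, size_t len)
{
	early_calls++;
	return memcpy(addr, opcode, len);
}

/* Late variant: in the kernel this one relies on initialized struct pages. */
static void *sketch_poke(void *addr, const void *opcode, size_t len)
{
	late_calls++;
	return memcpy(addr, opcode, len);
}

/* Override the caller's poker while still in early boot. */
static void *sketch_transform(void *addr, const void *opcode, size_t len,
			      poke_fn poker)
{
	if (sketch_early_boot)
		poker = sketch_poke_early;
	return poker(addr, opcode, len);
}
```

Callers keep passing the late variant; the override happens in one place, so
none of the callers need to know which boot phase they are in.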

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 7 +++++++
arch/x86/kernel/jump_label.c | 11 +++++++----
3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 2ecd34e2d46c..e85ff65c43c3 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -37,5 +37,6 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern int after_bootmem;

#endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index a481763a3776..014f214da581 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -668,6 +668,7 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
local_irq_save(flags);
memcpy(addr, opcode, len);
local_irq_restore(flags);
+ sync_core();
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */
return addr;
@@ -693,6 +694,12 @@ void *text_poke(void *addr, const void *opcode, size_t len)
struct page *pages[2];
int i;

+ /*
+ * While boot memory allocator is runnig we cannot use struct
+ * pages as they are not yet initialized.
+ */
+ BUG_ON(!after_bootmem);
+
if (!core_kernel_text((unsigned long)addr)) {
pages[0] = vmalloc_to_page(addr);
pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e56c95be2808..eeea935e9bb5 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -37,15 +37,18 @@ static void bug_at(unsigned char *ip, int line)
BUG();
}

-static void __jump_label_transform(struct jump_entry *entry,
- enum jump_label_type type,
- void *(*poker)(void *, const void *, size_t),
- int init)
+static void __ref __jump_label_transform(struct jump_entry *entry,
+ enum jump_label_type type,
+ void *(*poker)(void *, const void *, size_t),
+ int init)
{
union jump_code_union code;
const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];

+ if (early_boot_irqs_disabled)
+ poker = text_poke_early;
+
if (type == JUMP_LABEL_JMP) {
if (init) {
/*
--
2.18.0


2018-07-18 02:26:18

by Pavel Tatashin

Subject: [PATCH v14 01/25] x86/kvmclock: Remove memblock dependency

KVM clock is initialized later compared to other hypervisor clocks because
it has a dependency on the memblock allocator.

Bring it in line with other hypervisors by using memory from the BSS
instead of allocating it.

The benefits:

- Remove ifdef from common code
- Earlier availability of the clock
- Remove dependency on memblock, and reduce code

The downside:

- Static allocation of the per-cpu data structures, sized NR_CPUS * 64 bytes.
Will be addressed in follow-up patches.
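
To give a feel for the cost of the static allocation, here is a small
user-space sketch of the size calculation. It assumes a 4096-byte page and the
64-byte per-CPU entry mentioned above; SKETCH_PAGE_SIZE, SKETCH_ENTRY_SIZE,
sketch_page_align() and sketch_hv_clock_bytes() are hypothetical names, with
sketch_page_align() doing the same rounding as PAGE_ALIGN().

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE  4096UL	/* assumed page size */
#define SKETCH_ENTRY_SIZE 64UL		/* per-CPU entry size from the text above */

/* Round size up to the next page boundary. */
static size_t sketch_page_align(size_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Bytes reserved in BSS for nr_cpus per-CPU entries. */
static size_t sketch_hv_clock_bytes(size_t nr_cpus)
{
	return sketch_page_align(nr_cpus * SKETCH_ENTRY_SIZE);
}
```

Under these assumptions a single page covers 64 CPUs, and a kernel built with
NR_CPUS=8192 would reserve 512 KiB regardless of how many CPUs are present.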

[ tglx: Split out from larger series ]

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/kvmclock.c | 66 +++++++-------------------------------
arch/x86/kernel/setup.c | 4 ---
3 files changed, 12 insertions(+), 59 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5b2300b818af..c65c232d3ddd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -628,6 +628,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.name = "KVM",
.detect = kvm_detect,
.type = X86_HYPER_KVM,
+ .init.init_platform = kvmclock_init,
.init.guest_late_init = kvm_guest_init,
.init.x2apic_available = kvm_para_available,
};
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index bf8d1eb7fca3..890e9e58e4bf 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,9 +23,9 @@
#include <asm/apic.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
-#include <linux/memblock.h>
#include <linux/sched.h>
#include <linux/sched/clock.h>
+#include <linux/mm.h>

#include <asm/mem_encrypt.h>
#include <asm/x86_init.h>
@@ -44,6 +44,13 @@ static int parse_no_kvmclock(char *arg)
}
early_param("no-kvmclock", parse_no_kvmclock);

+/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
+#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define WALL_CLOCK_SIZE (sizeof(struct pvclock_wall_clock))
+
+static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+
/* The hypervisor will put information about time periodically here */
static struct pvclock_vsyscall_time_info *hv_clock;
static struct pvclock_wall_clock *wall_clock;
@@ -244,43 +251,12 @@ static void kvm_shutdown(void)
native_machine_shutdown();
}

-static phys_addr_t __init kvm_memblock_alloc(phys_addr_t size,
- phys_addr_t align)
-{
- phys_addr_t mem;
-
- mem = memblock_alloc(size, align);
- if (!mem)
- return 0;
-
- if (sev_active()) {
- if (early_set_memory_decrypted((unsigned long)__va(mem), size))
- goto e_free;
- }
-
- return mem;
-e_free:
- memblock_free(mem, size);
- return 0;
-}
-
-static void __init kvm_memblock_free(phys_addr_t addr, phys_addr_t size)
-{
- if (sev_active())
- early_set_memory_encrypted((unsigned long)__va(addr), size);
-
- memblock_free(addr, size);
-}
-
void __init kvmclock_init(void)
{
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned long mem, mem_wall_clock;
- int size, cpu, wall_clock_size;
+ int cpu;
u8 flags;

- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
if (!kvm_para_available())
return;

@@ -290,28 +266,11 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;

- wall_clock_size = PAGE_ALIGN(sizeof(struct pvclock_wall_clock));
- mem_wall_clock = kvm_memblock_alloc(wall_clock_size, PAGE_SIZE);
- if (!mem_wall_clock)
- return;
-
- wall_clock = __va(mem_wall_clock);
- memset(wall_clock, 0, wall_clock_size);
-
- mem = kvm_memblock_alloc(size, PAGE_SIZE);
- if (!mem) {
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
- wall_clock = NULL;
- return;
- }
-
- hv_clock = __va(mem);
- memset(hv_clock, 0, size);
+ wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
+ hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;

if (kvm_register_clock("primary cpu clock")) {
hv_clock = NULL;
- kvm_memblock_free(mem, size);
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
wall_clock = NULL;
return;
}
@@ -354,13 +313,10 @@ int __init kvm_setup_vsyscall_timeinfo(void)
int cpu;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned int size;

if (!hv_clock)
return 0;

- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
cpu = get_cpu();

vcpu_time = &hv_clock[cpu].pvti;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..da1dbd99cb6e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1197,10 +1197,6 @@ void __init setup_arch(char **cmdline_p)

memblock_find_dma_reserve();

-#ifdef CONFIG_KVM_GUEST
- kvmclock_init();
-#endif
-
tsc_early_delay_calibrate();
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();
--
2.18.0


2018-07-18 02:26:18

by Pavel Tatashin

Subject: [PATCH v14 06/25] x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock

From: Thomas Gleixner <[email protected]>

There is no point in having this in the kvm code itself and calling it from
there. This can be called from an initcall and the parameter is cleared
when the hypervisor is not KVM.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_guest.h | 7 -----
arch/x86/kernel/kvm.c | 13 ---------
arch/x86/kernel/kvmclock.c | 46 +++++++++++++++++++-------------
3 files changed, 28 insertions(+), 38 deletions(-)
delete mode 100644 arch/x86/include/asm/kvm_guest.h

diff --git a/arch/x86/include/asm/kvm_guest.h b/arch/x86/include/asm/kvm_guest.h
deleted file mode 100644
index 46185263d9c2..000000000000
--- a/arch/x86/include/asm/kvm_guest.h
+++ /dev/null
@@ -1,7 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_X86_KVM_GUEST_H
-#define _ASM_X86_KVM_GUEST_H
-
-int kvm_setup_vsyscall_timeinfo(void);
-
-#endif /* _ASM_X86_KVM_GUEST_H */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c65c232d3ddd..a560750cc76f 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -45,7 +45,6 @@
#include <asm/apic.h>
#include <asm/apicdef.h>
#include <asm/hypervisor.h>
-#include <asm/kvm_guest.h>

static int kvmapf = 1;

@@ -66,15 +65,6 @@ static int __init parse_no_stealacc(char *arg)

early_param("no-steal-acc", parse_no_stealacc);

-static int kvmclock_vsyscall = 1;
-static int __init parse_no_kvmclock_vsyscall(char *arg)
-{
- kvmclock_vsyscall = 0;
- return 0;
-}
-
-early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
-
static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
static DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64);
static int has_steal_clock = 0;
@@ -560,9 +550,6 @@ static void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);

- if (kvmclock_vsyscall)
- kvm_setup_vsyscall_timeinfo();
-
#ifdef CONFIG_SMP
smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f312d7f6de57..08e7726a5e62 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -27,12 +27,14 @@
#include <linux/sched/clock.h>
#include <linux/mm.h>

+#include <asm/hypervisor.h>
#include <asm/mem_encrypt.h>
#include <asm/x86_init.h>
#include <asm/reboot.h>
#include <asm/kvmclock.h>

static int kvmclock __initdata = 1;
+static int kvmclock_vsyscall __initdata = 1;
static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
static u64 kvm_sched_clock_offset __ro_after_init;
@@ -44,6 +46,13 @@ static int __init parse_no_kvmclock(char *arg)
}
early_param("no-kvmclock", parse_no_kvmclock);

+static int __init parse_no_kvmclock_vsyscall(char *arg)
+{
+ kvmclock_vsyscall = 0;
+ return 0;
+}
+early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
+
/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)

@@ -227,6 +236,25 @@ static void kvm_shutdown(void)
native_machine_shutdown();
}

+static int __init kvm_setup_vsyscall_timeinfo(void)
+{
+#ifdef CONFIG_X86_64
+ u8 flags;
+
+ if (!hv_clock || !kvmclock_vsyscall)
+ return 0;
+
+ flags = pvclock_read_flags(&hv_clock[0].pvti);
+ if (!(flags & PVCLOCK_TSC_STABLE_BIT))
+ return 1;
+
+ pvclock_set_pvti_cpu0_va(hv_clock);
+ kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
+#endif
+ return 0;
+}
+early_initcall(kvm_setup_vsyscall_timeinfo);
+
void __init kvmclock_init(void)
{
u8 flags;
@@ -270,21 +298,3 @@ void __init kvmclock_init(void)
clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
pv_info.name = "KVM";
}
-
-int __init kvm_setup_vsyscall_timeinfo(void)
-{
-#ifdef CONFIG_X86_64
- u8 flags;
-
- if (!hv_clock)
- return 0;
-
- flags = pvclock_read_flags(&hv_clock[0].pvti);
- if (!(flags & PVCLOCK_TSC_STABLE_BIT))
- return 1;
-
- pvclock_set_pvti_cpu0_va(hv_clock);
- kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
-#endif
- return 0;
-}
--
2.18.0


2018-07-18 02:26:27

by Pavel Tatashin

Subject: [PATCH v14 20/25] x86/tsc: calibrate tsc only once

During boot, tsc is calibrated twice: once in tsc_early_delay_calibrate(),
and a second time in tsc_init().

Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
that the calibration is done only early, and make tsc_init() use the values
already determined in tsc_early_init().

Sometimes it is not possible to determine tsc early, because the subsystem
that is required is not yet initialized. In that case, try again later in
tsc_init().
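
The calibrate-once-with-retry flow can be sketched in user-space C as below.
This is an illustration under stated assumptions, not the patch itself:
next_result fakes what a calibration attempt returns, and all sketch_* names
and calibrate_calls are hypothetical. The loops-per-jiffy helper does its
arithmetic in 64 bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_KHZ 1000ULL

static unsigned int sketch_tsc_khz;	/* 0 means "not calibrated yet" */
static unsigned int next_result;	/* faked calibration outcome */
static int calibrate_calls;		/* counts calibration attempts */

/* One calibration attempt; returns false if the frequency is unknown. */
static bool sketch_determine_freq(void)
{
	calibrate_calls++;
	sketch_tsc_khz = next_result;
	return sketch_tsc_khz != 0;
}

/* Early init: try once, silently give up if it is too early. */
static void sketch_early_init(void)
{
	sketch_determine_freq();
}

/* Late init: retry only if the early attempt did not succeed. */
static void sketch_late_init(void)
{
	if (!sketch_tsc_khz)
		sketch_determine_freq();
}

/* loops per jiffy = (tsc_khz * 1000) / HZ, computed in 64 bits. */
static uint64_t sketch_loops_per_jiffy(uint64_t tsc_khz, uint64_t hz)
{
	return tsc_khz * SKETCH_KHZ / hz;
}
```

When the early attempt succeeds, the late path performs no calibration at all,
which is the point of the rework.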

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/tsc.h | 2 +-
arch/x86/kernel/setup.c | 2 +-
arch/x86/kernel/tsc.c | 86 ++++++++++++++++++++------------------
3 files changed, 48 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 2701d221583a..c4368ff73652 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -33,7 +33,7 @@ static inline cycles_t get_cycles(void)
extern struct system_counterval_t convert_art_to_tsc(u64 art);
extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);

-extern void tsc_early_delay_calibrate(void);
+extern void tsc_early_init(void);
extern void tsc_init(void);
extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 7490de925a81..5d32c55aeb8b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1014,6 +1014,7 @@ void __init setup_arch(char **cmdline_p)
*/
init_hypervisor_platform();

+ tsc_early_init();
x86_init.resources.probe_roms();

/* after parse_early_param, so could debug it */
@@ -1199,7 +1200,6 @@ void __init setup_arch(char **cmdline_p)

memblock_find_dma_reserve();

- tsc_early_delay_calibrate();
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 186395041725..bc8eb82050a3 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -33,6 +33,8 @@ EXPORT_SYMBOL(cpu_khz);
unsigned int __read_mostly tsc_khz;
EXPORT_SYMBOL(tsc_khz);

+#define KHZ 1000
+
/*
* TSC can be unstable due to cpufreq or due to unsynced TSCs
*/
@@ -1335,34 +1337,10 @@ static int __init init_tsc_clocksource(void)
*/
device_initcall(init_tsc_clocksource);

-void __init tsc_early_delay_calibrate(void)
-{
- unsigned long lpj;
-
- if (!boot_cpu_has(X86_FEATURE_TSC))
- return;
-
- cpu_khz = x86_platform.calibrate_cpu();
- tsc_khz = x86_platform.calibrate_tsc();
-
- tsc_khz = tsc_khz ? : cpu_khz;
- if (!tsc_khz)
- return;
-
- lpj = tsc_khz * 1000;
- do_div(lpj, HZ);
- loops_per_jiffy = lpj;
-}
-
-void __init tsc_init(void)
+static bool determine_cpu_tsc_frequncies(void)
{
- u64 lpj, cyc;
- int cpu;
-
- if (!boot_cpu_has(X86_FEATURE_TSC)) {
- setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
- return;
- }
+ /* Make sure that cpu and tsc are not already calibrated */
+ WARN_ON(cpu_khz || tsc_khz);

cpu_khz = x86_platform.calibrate_cpu();
tsc_khz = x86_platform.calibrate_tsc();
@@ -1377,20 +1355,51 @@ void __init tsc_init(void)
else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
cpu_khz = tsc_khz;

- if (!tsc_khz) {
- mark_tsc_unstable("could not calculate TSC khz");
- setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
- return;
- }
+ if (tsc_khz == 0)
+ return false;

pr_info("Detected %lu.%03lu MHz processor\n",
- (unsigned long)cpu_khz / 1000,
- (unsigned long)cpu_khz % 1000);
+ (unsigned long)cpu_khz / KHZ,
+ (unsigned long)cpu_khz % KHZ);

if (cpu_khz != tsc_khz) {
pr_info("Detected %lu.%03lu MHz TSC",
- (unsigned long)tsc_khz / 1000,
- (unsigned long)tsc_khz % 1000);
+ (unsigned long)tsc_khz / KHZ,
+ (unsigned long)tsc_khz % KHZ);
+ }
+ return true;
+}
+
+static unsigned long get_loops_per_jiffy(void)
+{
+ unsigned long lpj = tsc_khz * KHZ;
+
+ do_div(lpj, HZ);
+ return lpj;
+}
+
+void __init tsc_early_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TSC))
+ return;
+ if (!determine_cpu_tsc_frequncies())
+ return;
+ loops_per_jiffy = get_loops_per_jiffy();
+}
+
+void __init tsc_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TSC)) {
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+ return;
+ }
+
+ if (!tsc_khz) {
+ /* We failed to determine frequencies earlier, try again */
+ if (!determine_cpu_tsc_frequncies()) {
+ mark_tsc_unstable("could not calculate TSC khz");
+ return;
+ }
}

/* Sanitize TSC ADJUST before cyc2ns gets initialized */
@@ -1413,10 +1422,7 @@ void __init tsc_init(void)
if (!no_sched_irq_time)
enable_sched_clock_irqtime();

- lpj = ((u64)tsc_khz * 1000);
- do_div(lpj, HZ);
- lpj_fine = lpj;
-
+ lpj_fine = get_loops_per_jiffy();
use_tsc_delay();

check_system_tsc_reliable();
--
2.18.0


2018-07-18 02:26:37

by Pavel Tatashin

Subject: [PATCH v14 04/25] x86/kvmclock: Cleanup the code

From: Thomas Gleixner <[email protected]>

- Clean up the MSR write for wall clock. The type casts to (int) are sloppy
because the wrmsr parameters are u32, and aside from that wrmsrl() already
provides the high/low split for free.

- Remove the pointless get_cpu()/put_cpu() dance from various
functions. Either they are called during early init where CPU is
guaranteed to be 0 or they are already called from non preemptible
context where smp_processor_id() can be used safely.

- Simplify the convoluted check for kvmclock in the init function.

- Mark the parameter parsing function __init. No point in keeping it
around.

- Convert to pr_info()

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/kernel/kvmclock.c | 76 ++++++++++++--------------------------
1 file changed, 23 insertions(+), 53 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index cbf0a6b9217b..04d2f5e1d783 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -37,7 +37,7 @@ static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
static u64 kvm_sched_clock_offset;

-static int parse_no_kvmclock(char *arg)
+static int __init parse_no_kvmclock(char *arg)
{
kvmclock = 0;
return 0;
@@ -61,13 +61,9 @@ static struct pvclock_wall_clock wall_clock;
static void kvm_get_wallclock(struct timespec64 *now)
{
struct pvclock_vcpu_time_info *vcpu_time;
- int low, high;
int cpu;

- low = (int)slow_virt_to_phys(&wall_clock);
- high = ((u64)slow_virt_to_phys(&wall_clock) >> 32);
-
- native_write_msr(msr_kvm_wall_clock, low, high);
+ wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));

cpu = get_cpu();

@@ -117,11 +113,11 @@ static inline void kvm_sched_clock_init(bool stable)
kvm_sched_clock_offset = kvm_clock_read();
pv_time_ops.sched_clock = kvm_sched_clock_read;

- printk(KERN_INFO "kvm-clock: using sched offset of %llu cycles\n",
- kvm_sched_clock_offset);
+ pr_info("kvm-clock: using sched offset of %llu cycles",
+ kvm_sched_clock_offset);

BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
- sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
+ sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
}

/*
@@ -135,15 +131,7 @@ static inline void kvm_sched_clock_init(bool stable)
*/
static unsigned long kvm_get_tsc_khz(void)
{
- struct pvclock_vcpu_time_info *src;
- int cpu;
- unsigned long tsc_khz;
-
- cpu = get_cpu();
- src = &hv_clock[cpu].pvti;
- tsc_khz = pvclock_tsc_khz(src);
- put_cpu();
- return tsc_khz;
+ return pvclock_tsc_khz(&hv_clock[0].pvti);
}

static void kvm_get_preset_lpj(void)
@@ -160,29 +148,27 @@ static void kvm_get_preset_lpj(void)

bool kvm_check_and_clear_guest_paused(void)
{
- bool ret = false;
struct pvclock_vcpu_time_info *src;
- int cpu = smp_processor_id();
+ bool ret = false;

if (!hv_clock)
return ret;

- src = &hv_clock[cpu].pvti;
+ src = &hv_clock[smp_processor_id()].pvti;
if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
src->flags &= ~PVCLOCK_GUEST_STOPPED;
pvclock_touch_watchdogs();
ret = true;
}
-
return ret;
}

struct clocksource kvm_clock = {
- .name = "kvm-clock",
- .read = kvm_clock_get_cycles,
- .rating = 400,
- .mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .name = "kvm-clock",
+ .read = kvm_clock_get_cycles,
+ .rating = 400,
+ .mask = CLOCKSOURCE_MASK(64),
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS,
};
EXPORT_SYMBOL_GPL(kvm_clock);

@@ -198,7 +184,7 @@ static void kvm_register_clock(char *txt)
src = &hv_clock[cpu].pvti;
pa = slow_virt_to_phys(src) | 0x01ULL;
wrmsrl(msr_kvm_system_time, pa);
- pr_info("kvm-clock: cpu %d, msr %llx, %s\n", cpu, pa, txt);
+ pr_info("kvm-clock: cpu %d, msr %llx, %s", cpu, pa, txt);
}

static void kvm_save_sched_clock_state(void)
@@ -243,20 +229,19 @@ static void kvm_shutdown(void)

void __init kvmclock_init(void)
{
- struct pvclock_vcpu_time_info *vcpu_time;
- int cpu;
u8 flags;

- if (!kvm_para_available())
+ if (!kvm_para_available() || !kvmclock)
return;

- if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
+ if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
- } else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
+ } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
return;
+ }

- printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+ pr_info("kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);

hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
@@ -265,20 +250,15 @@ void __init kvmclock_init(void)
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);

- cpu = get_cpu();
- vcpu_time = &hv_clock[cpu].pvti;
- flags = pvclock_read_flags(vcpu_time);
-
+ flags = pvclock_read_flags(&hv_clock[0].pvti);
kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
- put_cpu();

x86_platform.calibrate_tsc = kvm_get_tsc_khz;
x86_platform.calibrate_cpu = kvm_get_tsc_khz;
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
- x86_cpuinit.early_percpu_clock_init =
- kvm_setup_secondary_clock;
+ x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
#endif
x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
@@ -294,26 +274,16 @@ void __init kvmclock_init(void)
int __init kvm_setup_vsyscall_timeinfo(void)
{
#ifdef CONFIG_X86_64
- int cpu;
u8 flags;
- struct pvclock_vcpu_time_info *vcpu_time;

if (!hv_clock)
return 0;

- cpu = get_cpu();
-
- vcpu_time = &hv_clock[cpu].pvti;
- flags = pvclock_read_flags(vcpu_time);
-
- if (!(flags & PVCLOCK_TSC_STABLE_BIT)) {
- put_cpu();
+ flags = pvclock_read_flags(&hv_clock[0].pvti);
+ if (!(flags & PVCLOCK_TSC_STABLE_BIT))
return 1;
- }

pvclock_set_pvti_cpu0_va(hv_clock);
- put_cpu();
-
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
#endif
return 0;
--
2.18.0


2018-07-18 02:26:39

by Pavel Tatashin

Subject: [PATCH v14 22/25] x86/tsc: use tsc early

Make timestamps and a high-resolution clock available to us as early as
possible.

native_sched_clock() outputs time based either on tsc after tsc_init() is
called later in boot, or on jiffies when clock interrupts are enabled,
which also happens later in boot.

On the other hand, tsc frequency is known from as early as when
tsc_early_init() is called.

Use the early tsc calibration to output timestamps early.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/tsc.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 0b1abe7fdd8e..39ff2881f622 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1408,6 +1408,7 @@ void __init tsc_early_init(void)
/* Sanitize TSC ADJUST before cyc2ns gets initialized */
tsc_store_and_check_tsc_adjust(true);
cyc2ns_init_boot_cpu();
+ static_branch_enable(&__use_tsc);
}

void __init tsc_init(void)
--
2.18.0


2018-07-18 02:26:40

by Pavel Tatashin

Subject: [PATCH v14 12/25] x86/tsc: redefine notsc to behave as tsc=unstable

Currently, the notsc kernel parameter disables the use of the tsc register
by sched_clock(). However, this parameter does not prevent Linux from
accessing tsc in other places in the kernel.

The only rationale for booting with notsc is to avoid timing discrepancies
on multi-socket systems where different tsc frequencies may be present, and
thus fall back to jiffies as the clock source.

However, there is another way to solve the above problem: boot with the
tsc=unstable parameter. This parameter allows sched_clock() to use tsc, but
if tsc is outside of the expected interval it is corrected back to a
sane value.

This is why there is no reason to keep notsc, and it can be removed. But,
for compatibility reasons, we will keep this parameter and change its
definition to be the same as tsc=unstable.

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Dou Liyang <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 2 --
Documentation/x86/x86_64/boot-options.txt | 4 +---
arch/x86/kernel/tsc.c | 18 +++---------------
3 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 533ff5c68970..5aed30cd0350 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2835,8 +2835,6 @@

nosync [HW,M68K] Disables sync negotiation for all devices.

- notsc [BUGS=X86-32] Disable Time Stamp Counter
-
nowatchdog [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 8d109ef67ab6..66114ab4f9fe 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -92,9 +92,7 @@ APICs
Timing

notsc
- Don't use the CPU time stamp counter to read the wall time.
- This can be used to work around timing problems on multiprocessor systems
- with not properly synchronized CPUs.
+ Deprecated, use tsc=unstable instead.

nohpet
Don't use the HPET timer.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 74392d9d51e0..186395041725 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -38,11 +38,6 @@ EXPORT_SYMBOL(tsc_khz);
*/
static int __read_mostly tsc_unstable;

-/* native_sched_clock() is called before tsc_init(), so
- we must start with the TSC soft disabled to prevent
- erroneous rdtsc usage on !boot_cpu_has(X86_FEATURE_TSC) processors */
-static int __read_mostly tsc_disabled = -1;
-
static DEFINE_STATIC_KEY_FALSE(__use_tsc);

int tsc_clocksource_reliable;
@@ -248,8 +243,7 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
#ifdef CONFIG_X86_TSC
int __init notsc_setup(char *str)
{
- pr_warn("Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely\n");
- tsc_disabled = 1;
+ mark_tsc_unstable("boot parameter notsc");
return 1;
}
#else
@@ -1307,7 +1301,7 @@ static void tsc_refine_calibration_work(struct work_struct *work)

static int __init init_tsc_clocksource(void)
{
- if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_disabled > 0 || !tsc_khz)
+ if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
return 0;

if (tsc_unstable)
@@ -1414,12 +1408,6 @@ void __init tsc_init(void)
set_cyc2ns_scale(tsc_khz, cpu, cyc);
}

- if (tsc_disabled > 0)
- return;
-
- /* now allow native_sched_clock() to use rdtsc */
-
- tsc_disabled = 0;
static_branch_enable(&__use_tsc);

if (!no_sched_irq_time)
@@ -1455,7 +1443,7 @@ unsigned long calibrate_delay_is_known(void)
int constant_tsc = cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC);
const struct cpumask *mask = topology_core_cpumask(cpu);

- if (tsc_disabled || !constant_tsc || !mask)
+ if (!constant_tsc || !mask)
return 0;

sibling = cpumask_any_but(mask, cpu);
--
2.18.0


2018-07-18 02:26:41

by Pavel Tatashin

Subject: [PATCH v14 17/25] time: default boot time offset to local_clock()

read_persistent_wall_and_boot_offset() is called during boot to read
the persistent clock and to return the offset between the boot time
and the value of the persistent clock.

Change the default boot_offset from zero to local_clock(), so that
architectures that do not have a dedicated boot_clock but do have an early
sched_clock(), such as SPARCv9, x86, and possibly more, benefit from this
change by getting a better and more consistent estimate of the boot time
without the need for an arch-specific implementation.

Signed-off-by: Pavel Tatashin <[email protected]>
---
kernel/time/timekeeping.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index cb738f825c12..30d7f64ffc87 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1503,14 +1503,17 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
* Weak dummy function for arches that do not yet support it.
* wall_time - current time as returned by persistent clock
* boot_offset - offset that is defined as wall_time - boot_time
- * default to 0.
+ * The default function calculates offset based on the current value of
+ * local_clock(). This way architectures that support sched_clock() but don't
+ * support dedicated boot time clock will provide the best estimate of the
+ * boot time.
*/
void __weak __init
read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
struct timespec64 *boot_offset)
{
read_persistent_clock64(wall_time);
- *boot_offset = (struct timespec64){0};
+ *boot_offset = ns_to_timespec64(local_clock());
}

/* Flag for if timekeeping_resume() has injected sleeptime */
--
2.18.0
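
Note for readers of the archive: the new default fills boot_offset by
converting the nanosecond reading of local_clock() into a timespec64. The
conversion can be sketched in user space as below. This is only an
illustration under stated assumptions: struct ts64, ns_to_ts64() and
default_boot_offset() are hypothetical stand-ins, not kernel code, and the
sketch handles non-negative values only.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical user-space stand-in for the kernel's struct timespec64. */
struct ts64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Split a non-negative nanosecond count into seconds and nanoseconds. */
static struct ts64 ns_to_ts64(int64_t ns)
{
	struct ts64 ts;

	assert(ns >= 0);	/* the sketch only models time since boot */
	ts.tv_sec = ns / NSEC_PER_SEC;
	ts.tv_nsec = (long)(ns % NSEC_PER_SEC);
	return ts;
}

/*
 * Model of the new default: the boot offset is whatever the early clock
 * reads at the moment the persistent clock is sampled. The caller passes
 * the clock reading in, since there is no local_clock() in user space.
 */
static struct ts64 default_boot_offset(int64_t local_clock_ns)
{
	return ns_to_ts64(local_clock_ns);
}
```

With an early clock reading of 2.5 seconds, the sketch yields a boot
offset of 2 seconds and 500000000 nanoseconds; a reading of zero reproduces
the old default.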


2018-07-18 02:26:48

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v14 24/25] sched: early boot clock

Allow sched_clock() to be used before sched_clock_init() is called.
This provides a way to get early boot timestamps on machines with
unstable clocks.

Signed-off-by: Pavel Tatashin <[email protected]>
---
init/main.c | 2 +-
kernel/sched/clock.c | 10 +++++++++-
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index 162d931c9511..ff0a24170b95 100644
--- a/init/main.c
+++ b/init/main.c
@@ -642,7 +642,6 @@ asmlinkage __visible void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
- sched_clock_init();
printk_safe_init();
perf_event_init();
profile_init();
@@ -697,6 +696,7 @@ asmlinkage __visible void __init start_kernel(void)
acpi_early_init();
if (late_time_init)
late_time_init();
+ sched_clock_init();
calibrate_delay();
pid_idr_init();
anon_vma_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 0e9dbb2d9aea..7a8a63b940ee 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -202,7 +202,15 @@ static void __sched_clock_gtod_offset(void)

void __init sched_clock_init(void)
{
+ unsigned long flags;
+
sched_clock_running = 1;
+
+ /* Adjust __gtod_offset for contigious transition from early clock */
+ local_irq_save(flags);
+ sched_clock_tick();
+ local_irq_restore(flags);
+ __sched_clock_gtod_offset();
}
/*
* We run this as late_initcall() such that it runs after all built-in drivers,
@@ -356,7 +364,7 @@ u64 sched_clock_cpu(int cpu)
return sched_clock() + __sched_clock_offset;

if (unlikely(!sched_clock_running))
- return 0ull;
+ return sched_clock();

preempt_disable_notrace();
scd = cpu_sdc(cpu);
--
2.18.0
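
Note for readers of the archive: the patch above returns the raw
sched_clock() value before sched_clock_init() runs and recomputes
__gtod_offset at init time so that the reading does not jump at the switch.
The idea can be sketched with a deliberately simplified user-space model.
Everything here (struct clock_model, model_clock(), model_clock_init()) is
hypothetical; the real sched_clock_cpu() works on per-CPU data and is more
involved than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two pieces of state the patch touches. */
struct clock_model {
	bool running;		/* models sched_clock_running */
	int64_t gtod_offset;	/* models __gtod_offset */
};

/*
 * Before init: report the raw counter, as the patched early path does.
 * After init: report the timekeeping value shifted by the stored offset.
 */
static uint64_t model_clock(const struct clock_model *m,
			    uint64_t raw_ns, uint64_t gtod_ns)
{
	if (!m->running)
		return raw_ns;
	return (uint64_t)((int64_t)gtod_ns + m->gtod_offset);
}

/*
 * At init: pick the offset so that the post-init reading equals the raw
 * reading taken at the same instant, which keeps the clock continuous.
 */
static void model_clock_init(struct clock_model *m,
			     uint64_t raw_ns, uint64_t gtod_ns)
{
	assert(!m->running);	/* init is expected to run once */
	m->gtod_offset = (int64_t)raw_ns - (int64_t)gtod_ns;
	m->running = true;
}
```

In this model a raw reading of 3000 ns with a timekeeping reading of 100 ns
gives an offset of 2900 ns, so the value reported right after the switch is
still 3000 ns and later readings advance with the timekeeping clock.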


2018-07-18 02:26:53

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v14 23/25] sched: move sched clock initialization and merge with generic clock

sched_clock_postinit() initializes a generic clock on systems where no
other clock is provided. This function may be called only after
timekeeping_init().

Rename sched_clock_postinit() to generic_sched_clock_init() and call it
from sched_clock_init(). Move the call to sched_clock_init() to after
time_init().

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/sched_clock.h | 5 ++---
init/main.c | 4 ++--
kernel/sched/clock.c | 27 +++++++++++++++++----------
kernel/sched/core.c | 1 -
kernel/time/sched_clock.c | 2 +-
5 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched_clock.h b/include/linux/sched_clock.h
index 411b52e424e1..abe28d5cb3f4 100644
--- a/include/linux/sched_clock.h
+++ b/include/linux/sched_clock.h
@@ -9,17 +9,16 @@
#define LINUX_SCHED_CLOCK

#ifdef CONFIG_GENERIC_SCHED_CLOCK
-extern void sched_clock_postinit(void);
+extern void generic_sched_clock_init(void);

extern void sched_clock_register(u64 (*read)(void), int bits,
unsigned long rate);
#else
-static inline void sched_clock_postinit(void) { }
+static inline void generic_sched_clock_init(void) { }

static inline void sched_clock_register(u64 (*read)(void), int bits,
unsigned long rate)
{
- ;
}
#endif

diff --git a/init/main.c b/init/main.c
index 3b4ada11ed52..162d931c9511 100644
--- a/init/main.c
+++ b/init/main.c
@@ -79,7 +79,7 @@
#include <linux/pti.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>
-#include <linux/sched_clock.h>
+#include <linux/sched/clock.h>
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/context_tracking.h>
@@ -642,7 +642,7 @@ asmlinkage __visible void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
- sched_clock_postinit();
+ sched_clock_init();
printk_safe_init();
perf_event_init();
profile_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 10c83e73837a..0e9dbb2d9aea 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -53,6 +53,7 @@
*
*/
#include "sched.h"
+#include <linux/sched_clock.h>

/*
* Scheduler clock - returns current time in nanosec units.
@@ -68,11 +69,6 @@ EXPORT_SYMBOL_GPL(sched_clock);

__read_mostly int sched_clock_running;

-void sched_clock_init(void)
-{
- sched_clock_running = 1;
-}
-
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
/*
* We must start with !__sched_clock_stable because the unstable -> stable
@@ -199,6 +195,15 @@ void clear_sched_clock_stable(void)
__clear_sched_clock_stable();
}

+static void __sched_clock_gtod_offset(void)
+{
+ __gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();
+}
+
+void __init sched_clock_init(void)
+{
+ sched_clock_running = 1;
+}
/*
* We run this as late_initcall() such that it runs after all built-in drivers,
* notably: acpi_processor and intel_idle, which can mark the TSC as unstable.
@@ -385,8 +390,6 @@ void sched_clock_tick(void)

void sched_clock_tick_stable(void)
{
- u64 gtod, clock;
-
if (!sched_clock_stable())
return;

@@ -398,9 +401,7 @@ void sched_clock_tick_stable(void)
* TSC to be unstable, any computation will be computing crap.
*/
local_irq_disable();
- gtod = ktime_get_ns();
- clock = sched_clock();
- __gtod_offset = (clock + __sched_clock_offset) - gtod;
+ __sched_clock_gtod_offset();
local_irq_enable();
}

@@ -434,6 +435,12 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);

#else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */

+void __init sched_clock_init(void)
+{
+ sched_clock_running = 1;
+ generic_sched_clock_init();
+}
+
u64 sched_clock_cpu(int cpu)
{
if (unlikely(!sched_clock_running))
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..552406e9713b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5954,7 +5954,6 @@ void __init sched_init(void)
int i, j;
unsigned long alloc_size = 0, ptr;

- sched_clock_init();
wait_bit_init();

#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 2d8f05aad442..cbc72c2c1fca 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -237,7 +237,7 @@ sched_clock_register(u64 (*read)(void), int bits, unsigned long rate)
pr_debug("Registered %pF as sched_clock source\n", read);
}

-void __init sched_clock_postinit(void)
+void __init generic_sched_clock_init(void)
{
/*
* If no sched_clock() function has been provided at that point,
--
2.18.0


2018-07-18 02:26:58

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v14 16/25] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset()

If an architecture does not support an exact boot time, it is challenging
to estimate the boot time without a reference to the current persistent
clock value. Yet the architecture code cannot read the persistent clock
again, because the caller of read_boot_clock64() has already read the
persistent clock at a different time, and the two readings may lead to
math discrepancies.

This is why it is better to provide two values simultaneously: the
persistent clock value, and the boot time.

Replace read_boot_clock64() with:
read_persistent_wall_and_boot_offset(wall_time, boot_offset)

Where wall_time is returned by read_persistent_clock64()
And boot_offset is wall_time - boot time, which defaults to 0.

Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/timekeeping.h | 3 +-
kernel/time/timekeeping.c | 59 +++++++++++++++++++------------------
2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 86bc2026efce..686bc27acef0 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -243,7 +243,8 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
extern int persistent_clock_is_local;

extern void read_persistent_clock64(struct timespec64 *ts);
-extern void read_boot_clock64(struct timespec64 *ts);
+void read_persistent_wall_and_boot_offset(struct timespec64 *wall_clock,
+ struct timespec64 *boot_offset);
extern int update_persistent_clock64(struct timespec64 now);

/*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4786df904c22..cb738f825c12 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -17,6 +17,7 @@
#include <linux/nmi.h>
#include <linux/sched.h>
#include <linux/sched/loadavg.h>
+#include <linux/sched/clock.h>
#include <linux/syscore_ops.h>
#include <linux/clocksource.h>
#include <linux/jiffies.h>
@@ -1496,18 +1497,20 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
}

/**
- * read_boot_clock64 - Return time of the system start.
+ * read_persistent_wall_and_boot_offset - Read persistent clock, and also offset
+ * from the boot.
*
* Weak dummy function for arches that do not yet support it.
- * Function to read the exact time the system has been started.
- * Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
- *
- * XXX - Do be sure to remove it once all arches implement it.
+ * wall_time - current time as returned by persistent clock
+ * boot_offset - offset that is defined as wall_time - boot_time
+ * default to 0.
*/
-void __weak read_boot_clock64(struct timespec64 *ts)
+void __weak __init
+read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+ struct timespec64 *boot_offset)
{
- ts->tv_sec = 0;
- ts->tv_nsec = 0;
+ read_persistent_clock64(wall_time);
+ *boot_offset = (struct timespec64){0};
}

/* Flag for if timekeeping_resume() has injected sleeptime */
@@ -1521,28 +1524,29 @@ static bool persistent_clock_exists;
*/
void __init timekeeping_init(void)
{
+ struct timespec64 wall_time, boot_offset, wall_to_mono;
struct timekeeper *tk = &tk_core.timekeeper;
struct clocksource *clock;
unsigned long flags;
- struct timespec64 now, boot, tmp;
-
- read_persistent_clock64(&now);
- if (!timespec64_valid_strict(&now)) {
- pr_warn("WARNING: Persistent clock returned invalid value!\n"
- " Check your CMOS/BIOS settings.\n");
- now.tv_sec = 0;
- now.tv_nsec = 0;
- } else if (now.tv_sec || now.tv_nsec)
- persistent_clock_exists = true;

- read_boot_clock64(&boot);
- if (!timespec64_valid_strict(&boot)) {
- pr_warn("WARNING: Boot clock returned invalid value!\n"
- " Check your CMOS/BIOS settings.\n");
- boot.tv_sec = 0;
- boot.tv_nsec = 0;
+ read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
+ if (timespec64_valid_strict(&wall_time) &&
+ timespec64_to_ns(&wall_time) > 0) {
+ persistent_clock_exists = true;
+ } else {
+ pr_warn("Persistent clock returned invalid value");
+ wall_time = (struct timespec64){0};
}

+ if (timespec64_compare(&wall_time, &boot_offset) < 0)
+ boot_offset = (struct timespec64){0};
+
+ /*
+ * We want set wall_to_mono, so the following is true:
+ * wall time + wall_to_mono = boot time
+ */
+ wall_to_mono = timespec64_sub(boot_offset, wall_time);
+
raw_spin_lock_irqsave(&timekeeper_lock, flags);
write_seqcount_begin(&tk_core.seq);
ntp_init();
@@ -1552,13 +1556,10 @@ void __init timekeeping_init(void)
clock->enable(clock);
tk_setup_internals(tk, clock);

- tk_set_xtime(tk, &now);
+ tk_set_xtime(tk, &wall_time);
tk->raw_sec = 0;
- if (boot.tv_sec == 0 && boot.tv_nsec == 0)
- boot = tk_xtime(tk);

- set_normalized_timespec64(&tmp, -boot.tv_sec, -boot.tv_nsec);
- tk_set_wall_to_mono(tk, tmp);
+ tk_set_wall_to_mono(tk, wall_to_mono);

timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);

--
2.18.0
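
Note for readers of the archive: the reworked timekeeping_init() discards
a boot offset that is larger than the wall time and then sets wall_to_mono
to boot_offset - wall_time. The arithmetic can be sketched in user space as
below. The type and helper names (struct ts64, ts64_compare(), ts64_sub(),
compute_wall_to_mono()) are hypothetical stand-ins for the timespec64
helpers used in the patch, and inputs are assumed to be normalized.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical user-space stand-in for the kernel's struct timespec64. */
struct ts64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Return <0, 0 or >0 if a is earlier than, equal to, or later than b. */
static int ts64_compare(struct ts64 a, struct ts64 b)
{
	if (a.tv_sec != b.tv_sec)
		return a.tv_sec < b.tv_sec ? -1 : 1;
	if (a.tv_nsec != b.tv_nsec)
		return a.tv_nsec < b.tv_nsec ? -1 : 1;
	return 0;
}

/* Compute a - b and normalize so that 0 <= tv_nsec < NSEC_PER_SEC. */
static struct ts64 ts64_sub(struct ts64 a, struct ts64 b)
{
	struct ts64 r;

	assert(a.tv_nsec >= 0 && a.tv_nsec < NSEC_PER_SEC);
	assert(b.tv_nsec >= 0 && b.tv_nsec < NSEC_PER_SEC);
	r.tv_sec = a.tv_sec - b.tv_sec;
	r.tv_nsec = a.tv_nsec - b.tv_nsec;
	if (r.tv_nsec < 0) {
		r.tv_sec--;
		r.tv_nsec += NSEC_PER_SEC;
	}
	return r;
}

/* Mirror of the two steps in the patch: sanity check, then subtract. */
static struct ts64 compute_wall_to_mono(struct ts64 wall_time,
					struct ts64 boot_offset)
{
	struct ts64 zero = { 0, 0 };

	/* An offset larger than the wall time cannot be right: drop it. */
	if (ts64_compare(wall_time, boot_offset) < 0)
		boot_offset = zero;

	return ts64_sub(boot_offset, wall_time);
}
```

For a wall time of 1000.2 s and a boot offset of 5.1 s the sketch returns
{-996, 900000000}, so wall time plus wall_to_mono gives back 5.1 s, which
is the relation stated in the comment added by the patch.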


2018-07-18 02:27:07

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v14 21/25] x86/tsc: initialize cyc2ns when tsc freq. is determined

cyc2ns converts tsc to nanoseconds, and it is handled in a per-cpu data
structure.

Currently, the setup code for cyc2ns data runs the same sequence of
calculations for every possible CPU, even though each calculation is based
on the same tsc frequency as the boot CPU, so repeating it is not necessary.

Initialize the boot cpu when tsc frequency is determined. Copy the
calculated data from the boot CPU to the other CPUs in tsc_init().

In addition do the following:

- Remove unnecessary zeroing of cyc2ns data by removing cyc2ns_data_init()
- Split set_cyc2ns_scale() into two functions: set_cyc2ns_scale() can be
called when the system is up and wraps __set_cyc2ns_scale(), which can
be called directly while the system is booting and avoids saving and
restoring IRQs and the idle sleep and wakeup events.

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/tsc.c | 94 ++++++++++++++++++++++++-------------------
1 file changed, 53 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index bc8eb82050a3..0b1abe7fdd8e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -103,23 +103,6 @@ void cyc2ns_read_end(void)
* [email protected] "math is hard, lets go shopping!"
*/

-static void cyc2ns_data_init(struct cyc2ns_data *data)
-{
- data->cyc2ns_mul = 0;
- data->cyc2ns_shift = 0;
- data->cyc2ns_offset = 0;
-}
-
-static void __init cyc2ns_init(int cpu)
-{
- struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
-
- cyc2ns_data_init(&c2n->data[0]);
- cyc2ns_data_init(&c2n->data[1]);
-
- seqcount_init(&c2n->seq);
-}
-
static inline unsigned long long cycles_2_ns(unsigned long long cyc)
{
struct cyc2ns_data data;
@@ -135,18 +118,11 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
return ns;
}

-static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
{
unsigned long long ns_now;
struct cyc2ns_data data;
struct cyc2ns *c2n;
- unsigned long flags;
-
- local_irq_save(flags);
- sched_clock_idle_sleep_event();
-
- if (!khz)
- goto done;

ns_now = cycles_2_ns(tsc_now);

@@ -178,12 +154,55 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
c2n->data[0] = data;
raw_write_seqcount_latch(&c2n->seq);
c2n->data[1] = data;
+}
+
+static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ if (khz)
+ __set_cyc2ns_scale(khz, cpu, tsc_now);

-done:
sched_clock_idle_wakeup_event();
local_irq_restore(flags);
}

+/*
+ * Initialize cyc2ns for boot cpu
+ */
+static void __init cyc2ns_init_boot_cpu(void)
+{
+ struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+
+ seqcount_init(&c2n->seq);
+ __set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
+}
+
+/*
+ * Secondary CPUs do not run through cyc2ns_init(), so set up
+ * all the scale factors for all CPUs, assuming the same
+ * speed as the bootup CPU. (cpufreq notifiers will fix this
+ * up if their speed diverges)
+ */
+static void __init cyc2ns_init_secondary_cpus(void)
+{
+ unsigned int cpu, this_cpu = smp_processor_id();
+ struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+ struct cyc2ns_data *data = c2n->data;
+
+ for_each_possible_cpu(cpu) {
+ if (cpu != this_cpu) {
+ seqcount_init(&c2n->seq);
+ c2n = per_cpu_ptr(&cyc2ns, cpu);
+ c2n->data[0] = data[0];
+ c2n->data[1] = data[1];
+ }
+ }
+}
+
/*
* Scheduler clock - returns current time in nanosec units.
*/
@@ -1385,6 +1404,10 @@ void __init tsc_early_init(void)
if (!determine_cpu_tsc_frequncies())
return;
loops_per_jiffy = get_loops_per_jiffy();
+
+ /* Sanitize TSC ADJUST before cyc2ns gets initialized */
+ tsc_store_and_check_tsc_adjust(true);
+ cyc2ns_init_boot_cpu();
}

void __init tsc_init(void)
@@ -1400,23 +1423,12 @@ void __init tsc_init(void)
mark_tsc_unstable("could not calculate TSC khz");
return;
}
+ /* Sanitize TSC ADJUST before cyc2ns gets initialized */
+ tsc_store_and_check_tsc_adjust(true);
+ cyc2ns_init_boot_cpu();
}

- /* Sanitize TSC ADJUST before cyc2ns gets initialized */
- tsc_store_and_check_tsc_adjust(true);
-
- /*
- * Secondary CPUs do not run through tsc_init(), so set up
- * all the scale factors for all CPUs, assuming the same
- * speed as the bootup CPU. (cpufreq notifiers will fix this
- * up if their speed diverges)
- */
- cyc = rdtsc();
- for_each_possible_cpu(cpu) {
- cyc2ns_init(cpu);
- set_cyc2ns_scale(tsc_khz, cpu, cyc);
- }
-
+ cyc2ns_init_secondary_cpus();
static_branch_enable(&__use_tsc);

if (!no_sched_irq_time)
--
2.18.0
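
Note for readers of the archive: the patch above computes the cyc2ns
scale once on the boot CPU and copies it to the other CPUs. A user-space
sketch of that idea follows. It is an illustration under assumptions: the
names (struct cyc2ns_sketch, cyc2ns_calc(), cycles_to_ns(),
cyc2ns_copy_to_all()) are hypothetical, the shift is fixed at 10 for
simplicity, and the offset is left at zero. The kernel derives the
multiplier and shift in a part of __set_cyc2ns_scale() that the hunk does
not show.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_MSEC	1000000ULL
#define SKETCH_SHIFT	10	/* assumption: fixed shift for the sketch */
#define SKETCH_NR_CPUS	4	/* assumption: small fixed CPU count */

/* Hypothetical per-CPU scale data: ns = offset + (cyc * mul >> shift). */
struct cyc2ns_sketch {
	uint32_t mul;
	uint32_t shift;
	uint64_t offset;
};

/* Derive the multiplier from the counter frequency in kHz. */
static struct cyc2ns_sketch cyc2ns_calc(uint32_t khz)
{
	struct cyc2ns_sketch d;

	assert(khz != 0);
	d.shift = SKETCH_SHIFT;
	d.mul = (uint32_t)((NSEC_PER_MSEC << SKETCH_SHIFT) / khz);
	d.offset = 0;
	return d;
}

/*
 * Convert cycles to nanoseconds. The 64-bit cycle count is split into
 * 32-bit halves so that neither product overflows 64 bits; the final
 * shift of the high half can still overflow for very large counts.
 */
static uint64_t cycles_to_ns(uint64_t cyc, const struct cyc2ns_sketch *d)
{
	uint64_t lo = (cyc & 0xffffffffULL) * d->mul;
	uint64_t hi = (cyc >> 32) * d->mul;

	return d->offset + (lo >> d->shift) + (hi << (32 - d->shift));
}

/* Copy the boot CPU's data to every other CPU instead of recomputing. */
static void cyc2ns_copy_to_all(struct cyc2ns_sketch percpu[SKETCH_NR_CPUS],
			       int boot_cpu)
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (cpu != boot_cpu)
			percpu[cpu] = percpu[boot_cpu];
	}
}
```

For a 2 GHz counter (2000000 kHz) the multiplier is 512, so 2000 cycles
convert to 1000 ns. Copying is cheap compared with recomputing, which is
the point of the change.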


2018-07-18 02:27:14

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v14 19/25] ARM/time: remove read_boot_clock64()

read_boot_clock64() is deleted, and replaced with
read_persistent_wall_and_boot_offset().

The default implementation of read_persistent_wall_and_boot_offset()
provides a better fallback than the current read_boot_clock64() stubs on
arm, which have no users, so remove the old code.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm/include/asm/mach/time.h | 3 +--
arch/arm/kernel/time.c | 15 ++-------------
arch/arm/plat-omap/counter_32k.c | 2 +-
drivers/clocksource/tegra20_timer.c | 2 +-
4 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mach/time.h b/arch/arm/include/asm/mach/time.h
index 0f79e4dec7f9..4ac3a019a46f 100644
--- a/arch/arm/include/asm/mach/time.h
+++ b/arch/arm/include/asm/mach/time.h
@@ -13,7 +13,6 @@
extern void timer_tick(void);

typedef void (*clock_access_fn)(struct timespec64 *);
-extern int register_persistent_clock(clock_access_fn read_boot,
- clock_access_fn read_persistent);
+extern int register_persistent_clock(clock_access_fn read_persistent);

#endif
diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c
index cf2701cb0de8..078b259ead4e 100644
--- a/arch/arm/kernel/time.c
+++ b/arch/arm/kernel/time.c
@@ -83,29 +83,18 @@ static void dummy_clock_access(struct timespec64 *ts)
}

static clock_access_fn __read_persistent_clock = dummy_clock_access;
-static clock_access_fn __read_boot_clock = dummy_clock_access;

void read_persistent_clock64(struct timespec64 *ts)
{
__read_persistent_clock(ts);
}

-void read_boot_clock64(struct timespec64 *ts)
-{
- __read_boot_clock(ts);
-}
-
-int __init register_persistent_clock(clock_access_fn read_boot,
- clock_access_fn read_persistent)
+int __init register_persistent_clock(clock_access_fn read_persistent)
{
/* Only allow the clockaccess functions to be registered once */
- if (__read_persistent_clock == dummy_clock_access &&
- __read_boot_clock == dummy_clock_access) {
- if (read_boot)
- __read_boot_clock = read_boot;
+ if (__read_persistent_clock == dummy_clock_access) {
if (read_persistent)
__read_persistent_clock = read_persistent;
-
return 0;
}

diff --git a/arch/arm/plat-omap/counter_32k.c b/arch/arm/plat-omap/counter_32k.c
index 2438b96004c1..fcc5bfec8bd1 100644
--- a/arch/arm/plat-omap/counter_32k.c
+++ b/arch/arm/plat-omap/counter_32k.c
@@ -110,7 +110,7 @@ int __init omap_init_clocksource_32k(void __iomem *vbase)
}

sched_clock_register(omap_32k_read_sched_clock, 32, 32768);
- register_persistent_clock(NULL, omap_read_persistent_clock64);
+ register_persistent_clock(omap_read_persistent_clock64);
pr_info("OMAP clocksource: 32k_counter at 32768 Hz\n");

return 0;
diff --git a/drivers/clocksource/tegra20_timer.c b/drivers/clocksource/tegra20_timer.c
index c337a8100a7b..2242a36fc5b0 100644
--- a/drivers/clocksource/tegra20_timer.c
+++ b/drivers/clocksource/tegra20_timer.c
@@ -259,6 +259,6 @@ static int __init tegra20_init_rtc(struct device_node *np)
else
clk_prepare_enable(clk);

- return register_persistent_clock(NULL, tegra_read_persistent_clock64);
+ return register_persistent_clock(tegra_read_persistent_clock64);
}
TIMER_OF_DECLARE(tegra20_rtc, "nvidia,tegra20-rtc", tegra20_init_rtc);
--
2.18.0


2018-07-18 02:27:18

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v14 25/25] sched: use static key for sched_clock_running

sched_clock_running may be read every time sched_clock_cpu() is called.
Yet, this variable is updated only twice during boot, and never changes
again, therefore it is better to make it a static key.

Signed-off-by: Pavel Tatashin <[email protected]>
---
kernel/sched/clock.c | 16 ++++++++--------
kernel/sched/debug.c | 2 --
2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 7a8a63b940ee..858a1c8f594c 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -67,7 +67,7 @@ unsigned long long __weak sched_clock(void)
}
EXPORT_SYMBOL_GPL(sched_clock);

-__read_mostly int sched_clock_running;
+static DEFINE_STATIC_KEY_FALSE(sched_clock_running);

#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
/*
@@ -191,7 +191,7 @@ void clear_sched_clock_stable(void)

smp_mb(); /* matches sched_clock_init_late() */

- if (sched_clock_running == 2)
+ if (static_key_count(&sched_clock_running.key) == 2)
__clear_sched_clock_stable();
}

@@ -204,7 +204,7 @@ void __init sched_clock_init(void)
{
unsigned long flags;

- sched_clock_running = 1;
+ static_branch_inc(&sched_clock_running);

/* Adjust __gtod_offset for contigious transition from early clock */
local_irq_save(flags);
@@ -218,7 +218,7 @@ void __init sched_clock_init(void)
*/
static int __init sched_clock_init_late(void)
{
- sched_clock_running = 2;
+ static_branch_inc(&sched_clock_running);
/*
* Ensure that it is impossible to not do a static_key update.
*
@@ -363,7 +363,7 @@ u64 sched_clock_cpu(int cpu)
if (sched_clock_stable())
return sched_clock() + __sched_clock_offset;

- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return sched_clock();

preempt_disable_notrace();
@@ -386,7 +386,7 @@ void sched_clock_tick(void)
if (sched_clock_stable())
return;

- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return;

lockdep_assert_irqs_disabled();
@@ -445,13 +445,13 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);

void __init sched_clock_init(void)
{
- sched_clock_running = 1;
+ static_branch_inc(&sched_clock_running);
generic_sched_clock_init();
}

u64 sched_clock_cpu(int cpu)
{
- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return 0;

return sched_clock();
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e593b4118578..b0212f489a33 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -623,8 +623,6 @@ void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
#undef PU
}

-extern __read_mostly int sched_clock_running;
-
static void print_cpu(struct seq_file *m, int cpu)
{
struct rq *rq = cpu_rq(cpu);
--
2.18.0


2018-07-18 02:27:22

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v14 18/25] s390/time: remove read_boot_clock64()

read_boot_clock64() was replaced by read_persistent_wall_and_boot_offset()
so remove it.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/s390/kernel/time.c | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index d1f5447d5687..e8766beee5ad 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -239,19 +239,6 @@ void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
*boot_offset = timespec64_sub(*wall_time, boot_time);
}

-void read_boot_clock64(struct timespec64 *ts)
-{
- unsigned char clk[STORE_CLOCK_EXT_SIZE];
- __u64 delta;
-
- delta = initial_leap_seconds + TOD_UNIX_EPOCH;
- memcpy(clk, tod_clock_base, 16);
- *(__u64 *) &clk[1] -= delta;
- if (*(__u64 *) &clk[1] > delta)
- clk[0]--;
- ext_to_timespec64(clk, ts);
-}
-
static u64 read_tod_clock(struct clocksource *cs)
{
unsigned long long now, adj;
--
2.18.0


2018-07-18 02:27:39

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v14 10/25] x86: initialize static branching early

Static branching is useful to hot-patch branches that are used in hot
paths but are infrequently changed.

The x86 clock framework is one example: it uses static branches to set up
the best clock during boot and never changes it again.

Since we plan to enable the clock early, we need static branching
functionality early as well.

Static branching requires patching nop instructions; thus, we need
arch_init_ideal_nops() to be called prior to jump_label_init().

Here we do all the necessary steps to call arch_init_ideal_nops()
after early_cpu_init().

Signed-off-by: Pavel Tatashin <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 13 +++++++-----
arch/x86/kernel/cpu/common.c | 38 +++++++++++++++++++-----------------
arch/x86/kernel/setup.c | 4 ++--
3 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 38915fbfae73..b732438c1a1e 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -232,8 +232,6 @@ static void init_amd_k7(struct cpuinfo_x86 *c)
}
}

- set_cpu_cap(c, X86_FEATURE_K7);
-
/* calling is from identify_secondary_cpu() ? */
if (!c->cpu_index)
return;
@@ -617,6 +615,14 @@ static void early_init_amd(struct cpuinfo_x86 *c)

early_init_amd_mc(c);

+#ifdef CONFIG_X86_32
+ if (c->x86 == 6)
+ set_cpu_cap(c, X86_FEATURE_K7);
+#endif
+
+ if (c->x86 >= 0xf)
+ set_cpu_cap(c, X86_FEATURE_K8);
+
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);

/*
@@ -863,9 +869,6 @@ static void init_amd(struct cpuinfo_x86 *c)

init_amd_cacheinfo(c);

- if (c->x86 >= 0xf)
- set_cpu_cap(c, X86_FEATURE_K8);
-
if (cpu_has(c, X86_FEATURE_XMM2)) {
unsigned long long val;
int ret;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index eb4cb3efd20e..71281ac43b15 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1015,6 +1015,24 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
}

+/*
+ * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
+ * unfortunately, that's not true in practice because of early VIA
+ * chips and (more importantly) broken virtualizers that are not easy
+ * to detect. In the latter case it doesn't even *fail* reliably, so
+ * probing for it doesn't even work. Disable it completely on 32-bit
+ * unless we can find a reliable way to detect all the broken cases.
+ * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
+ */
+static void detect_nopl(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_32
+ clear_cpu_cap(c, X86_FEATURE_NOPL);
+#else
+ set_cpu_cap(c, X86_FEATURE_NOPL);
+#endif
+}
+
/*
* Do minimum CPU detection early.
* Fields really needed: vendor, cpuid_level, family, model, mask,
@@ -1089,6 +1107,8 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
*/
if (!pgtable_l5_enabled())
setup_clear_cpu_cap(X86_FEATURE_LA57);
+
+ detect_nopl(c);
}

void __init early_cpu_init(void)
@@ -1124,24 +1144,6 @@ void __init early_cpu_init(void)
early_identify_cpu(&boot_cpu_data);
}

-/*
- * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
- * unfortunately, that's not true in practice because of early VIA
- * chips and (more importantly) broken virtualizers that are not easy
- * to detect. In the latter case it doesn't even *fail* reliably, so
- * probing for it doesn't even work. Disable it completely on 32-bit
- * unless we can find a reliable way to detect all the broken cases.
- * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
- */
-static void detect_nopl(struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_X86_32
- clear_cpu_cap(c, X86_FEATURE_NOPL);
-#else
- set_cpu_cap(c, X86_FEATURE_NOPL);
-#endif
-}
-
static void detect_null_seg_behavior(struct cpuinfo_x86 *c)
{
#ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index da1dbd99cb6e..7490de925a81 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -866,6 +866,8 @@ void __init setup_arch(char **cmdline_p)

idt_setup_early_traps();
early_cpu_init();
+ arch_init_ideal_nops();
+ jump_label_init();
early_ioremap_init();

setup_olpc_ofw_pgd();
@@ -1268,8 +1270,6 @@ void __init setup_arch(char **cmdline_p)

mcheck_init();

- arch_init_ideal_nops();
-
register_refined_jiffies(CLOCK_TICK_RATE);

#ifdef CONFIG_EFI
--
2.18.0


2018-07-18 02:28:11

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v14 11/25] x86/CPU: Call detect_nopl() only on the BSP

From: Borislav Petkov <[email protected]>

Make it use the setup_* variants and have it be called only on the BSP
and drop the call in generic_identify() - X86_FEATURE_NOPL will be
replicated to the APs through the forced caps. Helps keep the mess at a
manageable level.

Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/cpu/common.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 71281ac43b15..46408a8cdf62 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1024,12 +1024,12 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
* unless we can find a reliable way to detect all the broken cases.
* Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
*/
-static void detect_nopl(struct cpuinfo_x86 *c)
+static void detect_nopl(void)
{
#ifdef CONFIG_X86_32
- clear_cpu_cap(c, X86_FEATURE_NOPL);
+ setup_clear_cpu_cap(X86_FEATURE_NOPL);
#else
- set_cpu_cap(c, X86_FEATURE_NOPL);
+ setup_force_cpu_cap(X86_FEATURE_NOPL);
#endif
}

@@ -1108,7 +1108,7 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
if (!pgtable_l5_enabled())
setup_clear_cpu_cap(X86_FEATURE_LA57);

- detect_nopl(c);
+ detect_nopl();
}

void __init early_cpu_init(void)
@@ -1206,8 +1206,6 @@ static void generic_identify(struct cpuinfo_x86 *c)

get_model_name(c); /* Default name */

- detect_nopl(c);
-
detect_null_seg_behavior(c);

/*
--
2.18.0


2018-07-18 11:15:46

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v14 08/25] x86/kvmclock: Avoid TSC recalibration

On 18/07/2018 04:21, Pavel Tatashin wrote:
> From: Peter Zijlstra <[email protected]>
>
> If the host gives us a TSC rate, assume it is good and don't try and
> recalibrate things against virtual timer hardware.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> arch/x86/kernel/kvmclock.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index ed170171fe49..da0ede8ac8f6 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -141,7 +141,16 @@ static inline void kvm_sched_clock_init(bool stable)
> */
> static unsigned long kvm_get_tsc_khz(void)
> {
> - return pvclock_tsc_khz(this_cpu_pvti());
> + unsigned long tsc_khz = pvclock_tsc_khz(this_cpu_pvti());
> +
> + /*
> + * TSC frequency is reported by the host; calibration against (virtual)
> + * HPET/PM-timer in a guest is dodgy and pointless since the host
> + * already did it for us where required.
> + */
> + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> +
> + return tsc_khz;
> }
>
> static void kvm_get_preset_lpj(void)
>

This patch (really a similar one) has just been sent to Linus.

Paolo

2018-07-18 13:35:10

by Pavel Tatashin

Subject: Re: [PATCH v14 08/25] x86/kvmclock: Avoid TSC recalibration

Thank you Paolo for letting me know. I will remove this patch from the
series, or it can be removed by whomever adds the series.

Thank you,
Pavel
On Wed, Jul 18, 2018 at 7:14 AM Paolo Bonzini <[email protected]> wrote:
>
> On 18/07/2018 04:21, Pavel Tatashin wrote:
> > From: Peter Zijlstra <[email protected]>
> >
> > If the host gives us a TSC rate, assume it is good and don't try and
> > recalibrate things against virtual timer hardware.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > Signed-off-by: Pavel Tatashin <[email protected]>
> > ---
> > arch/x86/kernel/kvmclock.c | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index ed170171fe49..da0ede8ac8f6 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -141,7 +141,16 @@ static inline void kvm_sched_clock_init(bool stable)
> > */
> > static unsigned long kvm_get_tsc_khz(void)
> > {
> > - return pvclock_tsc_khz(this_cpu_pvti());
> > + unsigned long tsc_khz = pvclock_tsc_khz(this_cpu_pvti());
> > +
> > + /*
> > + * TSC frequency is reported by the host; calibration against (virtual)
> > + * HPET/PM-timer in a guest is dodgy and pointless since the host
> > + * already did it for us where required.
> > + */
> > + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > +
> > + return tsc_khz;
> > }
> >
> > static void kvm_get_preset_lpj(void)
> >
>
> This patch (really a similar one) has just been sent to Linus.
>
> Paolo

2018-07-19 05:34:30

by Dou Liyang

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once

Hi, Pavel

I am sorry, I didn't point out the typo clearly in the previous version.
Please see my comments below. ;-)

At 07/18/2018 10:22 AM, Pavel Tatashin wrote:
> During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(),
> and the second time in tsc_init().
>
> Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
> the calibration is done only early, and make tsc_init() to use the values
> already determined in tsc_early_init().
>
> Sometimes it is not possible to determine tsc early, as the subsystem that
> is required is not yet initialized, in such case try again later in
> tsc_init().
>
> Suggested-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> arch/x86/include/asm/tsc.h | 2 +-
> arch/x86/kernel/setup.c | 2 +-
> arch/x86/kernel/tsc.c | 86 ++++++++++++++++++++------------------
> 3 files changed, 48 insertions(+), 42 deletions(-)
>
> diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
> index 2701d221583a..c4368ff73652 100644
> --- a/arch/x86/include/asm/tsc.h
> +++ b/arch/x86/include/asm/tsc.h
> @@ -33,7 +33,7 @@ static inline cycles_t get_cycles(void)
> extern struct system_counterval_t convert_art_to_tsc(u64 art);
> extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);
>
> -extern void tsc_early_delay_calibrate(void);
> +extern void tsc_early_init(void);
> extern void tsc_init(void);
> extern void mark_tsc_unstable(char *reason);
> extern int unsynchronized_tsc(void);
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 7490de925a81..5d32c55aeb8b 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1014,6 +1014,7 @@ void __init setup_arch(char **cmdline_p)
> */
> init_hypervisor_platform();
>
> + tsc_early_init();
> x86_init.resources.probe_roms();
>
> /* after parse_early_param, so could debug it */
> @@ -1199,7 +1200,6 @@ void __init setup_arch(char **cmdline_p)
>
> memblock_find_dma_reserve();
>
> - tsc_early_delay_calibrate();
> if (!early_xdbc_setup_hardware())
> early_xdbc_register_console();
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 186395041725..bc8eb82050a3 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -33,6 +33,8 @@ EXPORT_SYMBOL(cpu_khz);
> unsigned int __read_mostly tsc_khz;
> EXPORT_SYMBOL(tsc_khz);
>
> +#define KHZ 1000
> +
> /*
> * TSC can be unstable due to cpufreq or due to unsynced TSCs
> */
> @@ -1335,34 +1337,10 @@ static int __init init_tsc_clocksource(void)
> */
> device_initcall(init_tsc_clocksource);
>
> -void __init tsc_early_delay_calibrate(void)
> -{
> - unsigned long lpj;
> -
> - if (!boot_cpu_has(X86_FEATURE_TSC))
> - return;
> -
> - cpu_khz = x86_platform.calibrate_cpu();
> - tsc_khz = x86_platform.calibrate_tsc();
> -
> - tsc_khz = tsc_khz ? : cpu_khz;
> - if (!tsc_khz)
> - return;
> -
> - lpj = tsc_khz * 1000;
> - do_div(lpj, HZ);
> - loops_per_jiffy = lpj;
> -}
> -
> -void __init tsc_init(void)
> +static bool determine_cpu_tsc_frequncies(void)

This function needs to be marked as __init.

There is also a typo in the name: s/frequncies/frequencies/

> {
> - u64 lpj, cyc;
> - int cpu;
> -
> - if (!boot_cpu_has(X86_FEATURE_TSC)) {
> - setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
> - return;
> - }
> + /* Make sure that cpu and tsc are not already calibrated */
> + WARN_ON(cpu_khz || tsc_khz);
>
> cpu_khz = x86_platform.calibrate_cpu();
> tsc_khz = x86_platform.calibrate_tsc();
> @@ -1377,20 +1355,51 @@ void __init tsc_init(void)
> else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
> cpu_khz = tsc_khz;
>
> - if (!tsc_khz) {
> - mark_tsc_unstable("could not calculate TSC khz");
> - setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
> - return;
> - }
> + if (tsc_khz == 0)
> + return false;
>
> pr_info("Detected %lu.%03lu MHz processor\n",
> - (unsigned long)cpu_khz / 1000,
> - (unsigned long)cpu_khz % 1000);
> + (unsigned long)cpu_khz / KHZ,
> + (unsigned long)cpu_khz % KHZ);
>
> if (cpu_khz != tsc_khz) {
> pr_info("Detected %lu.%03lu MHz TSC",
> - (unsigned long)tsc_khz / 1000,
> - (unsigned long)tsc_khz % 1000);
> + (unsigned long)tsc_khz / KHZ,
> + (unsigned long)tsc_khz % KHZ);
> + }

These curly brackets can be removed.

> + return true;
> +}
> +
> +static unsigned long get_loops_per_jiffy(void)

This should be marked as __init as well.

> +{
> + unsigned long lpj = tsc_khz * KHZ;
> +
> + do_div(lpj, HZ);
> + return lpj;
> +}
> +
> +void __init tsc_early_init(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_TSC))
> + return;
> + if (!determine_cpu_tsc_frequncies())
> + return;
> + loops_per_jiffy = get_loops_per_jiffy();
> +}
> +
> +void __init tsc_init(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_TSC)) {
> + setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
> + return;
> + }
> +
> + if (!tsc_khz) {
> + /* We failed to determine frequencies earlier, try again */
> + if (!determine_cpu_tsc_frequncies()) {

The "setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER)" call for the
local APIC is missing here.

Thanks,
dou
> + mark_tsc_unstable("could not calculate TSC khz");
> + return;
> + }
> }
>
> /* Sanitize TSC ADJUST before cyc2ns gets initialized */
> @@ -1413,10 +1422,7 @@ void __init tsc_init(void)
> if (!no_sched_irq_time)
> enable_sched_clock_irqtime();
>
> - lpj = ((u64)tsc_khz * 1000);
> - do_div(lpj, HZ);
> - lpj_fine = lpj;
> -
> + lpj_fine = get_loops_per_jiffy();
> use_tsc_delay();
>
> check_system_tsc_reliable();
>
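
The loops-per-jiffy arithmetic that the patch factors into get_loops_per_jiffy() can be sketched in plain C as follows. This is only an illustration, not the kernel helper: loops_per_jiffy_from_khz() and its tick_hz parameter are hypothetical names, tick_hz stands in for the kernel's HZ, and plain division replaces do_div(). The 64-bit intermediate mirrors the (u64) cast in the lines removed from tsc_init().

```c
#include <stdint.h>

#define KHZ 1000

/*
 * Hypothetical user-space stand-in for the patch's get_loops_per_jiffy().
 * tsc_khz is the TSC rate in kHz; tick_hz plays the role of HZ.
 */
static uint64_t loops_per_jiffy_from_khz(unsigned int tsc_khz,
					 unsigned int tick_hz)
{
	/* Widen before multiplying so that kHz * 1000 cannot wrap at 32 bits. */
	uint64_t cycles_per_sec = (uint64_t)tsc_khz * KHZ;

	/* One jiffy lasts 1 / tick_hz seconds. */
	return cycles_per_sec / tick_hz;
}
```

For example, a 2,400,000 kHz TSC with a 250 Hz tick gives 9,600,000 loops per jiffy.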



2018-07-19 06:28:08

by Thomas Gleixner

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once

On Thu, 19 Jul 2018, Dou Liyang wrote:
> At 07/18/2018 10:22 AM, Pavel Tatashin wrote:
> > + (unsigned long)cpu_khz % KHZ);
> > if (cpu_khz != tsc_khz) {
> > pr_info("Detected %lu.%03lu MHz TSC",
> > - (unsigned long)tsc_khz / 1000,
> > - (unsigned long)tsc_khz % 1000);
> > + (unsigned long)tsc_khz / KHZ,
> > + (unsigned long)tsc_khz % KHZ);
> > + }
>
> this curly brackets can be removed

No. They want to stay, really.

https://lkml.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos

The pr_info() is a multiline statement due to the line breaks.

Thanks,

tglx

2018-07-19 06:50:07

by Dou Liyang

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once

Hi Thomas,

At 07/19/2018 02:25 PM, Thomas Gleixner wrote:
> On Thu, 19 Jul 2018, Dou Liyang wrote:
>> At 07/18/2018 10:22 AM, Pavel Tatashin wrote:
>>> + (unsigned long)cpu_khz % KHZ);
>>> if (cpu_khz != tsc_khz) {
>>> pr_info("Detected %lu.%03lu MHz TSC",
>>> - (unsigned long)tsc_khz / 1000,
>>> - (unsigned long)tsc_khz % 1000);
>>> + (unsigned long)tsc_khz / KHZ,
>>> + (unsigned long)tsc_khz % KHZ);
>>> + }
>>
>> this curly brackets can be removed
>
> No. They want to stay, really.
>
> https://lkml.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos
>
> The pr_info() is a multiline statement due to the line breaks.
>

I see, I'll keep that in mind.

Thanks,

dou



2018-07-19 10:37:57

by Peter Zijlstra

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once

On Tue, Jul 17, 2018 at 10:22:06PM -0400, Pavel Tatashin wrote:
> During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(),
> and the second time in tsc_init().
>
> Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
> the calibration is done only early, and make tsc_init() to use the values
> already determined in tsc_early_init().
>
> Sometimes it is not possible to determine tsc early, as the subsystem that
> is required is not yet initialized, in such case try again later in
> tsc_init().

It might be nice to preserve some of the information tglx dug out during
review of all this. Like the various methods of calibrate_*() and their
dependencies.

And I note that this patch relies on the magic of native_calibrate_cpu()
working really early and not exploding in the quick calibration run.
This either wants fixing or documenting.

I think the initial idea was to only do the fast_calibrate (cpuid, msr
and possibly the quick_pit) things early and delay the HPET/PMTIMER
magic until later.

2018-07-19 10:41:47

by Peter Zijlstra

Subject: Re: [PATCH v14 24/25] sched: early boot clock

On Tue, Jul 17, 2018 at 10:22:10PM -0400, Pavel Tatashin wrote:

> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> index 0e9dbb2d9aea..7a8a63b940ee 100644
> --- a/kernel/sched/clock.c
> +++ b/kernel/sched/clock.c
> @@ -202,7 +202,15 @@ static void __sched_clock_gtod_offset(void)
>
> void __init sched_clock_init(void)
> {
> + unsigned long flags;
> +
> sched_clock_running = 1;
> +
> + /* Adjust __gtod_offset for contiguous transition from early clock */
> + local_irq_save(flags);
> + sched_clock_tick();
> + local_irq_restore(flags);
> + __sched_clock_gtod_offset();

I think we want to keep __sched_clock_gtod_offset() inside the IRQ
disabled region.

And I just looked at my patch:

https://lkml.kernel.org/r/[email protected]

and that had a comment about how we wanted to set the gtod offset
_before_ setting sched_clock_running, yet here you do it the other way
around. Hmm?

> }
> /*
> * We run this as late_initcall() such that it runs after all built-in drivers,
> @@ -356,7 +364,7 @@ u64 sched_clock_cpu(int cpu)
> return sched_clock() + __sched_clock_offset;
>
> if (unlikely(!sched_clock_running))
> - return 0ull;
> + return sched_clock();
>
> preempt_disable_notrace();
> scd = cpu_sdc(cpu);

2018-07-19 10:51:02

by Peter Zijlstra

Subject: Re: [PATCH v14 25/25] sched: use static key for sched_clock_running

On Tue, Jul 17, 2018 at 10:22:11PM -0400, Pavel Tatashin wrote:
> sched_clock_running may be read every time sched_clock_cpu() is called.
> Yet, this variable is updated only twice during boot, and never changes
> again, therefore it is better to make it a static key.

Right, so the focus was always on making the sane TSC case fast, and if
TSC isn't stable we'd just make do and not care too much.

But this certainly isn't wrong, so ACK.

2018-07-19 11:02:46

by Thomas Gleixner

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once

On Thu, 19 Jul 2018, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 10:22:06PM -0400, Pavel Tatashin wrote:
> > During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(),
> > and the second time in tsc_init().
> >
> > Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
> > the calibration is done only early, and make tsc_init() to use the values
> > already determined in tsc_early_init().
> >
> > Sometimes it is not possible to determine tsc early, as the subsystem that
> > is required is not yet initialized, in such case try again later in
> > tsc_init().
>
> It might be nice to preserve some of the information tglx dug out during
> review of all this. Like the various methods of calibrate_*() and their
> dependencies.
>
> And I note that this patch relies on the magic of native_calibrate_cpu()
> working really early and not exploding in the quick calibration run.
> This either wants fixing or documenting.
>
> I think the initial idea was to only do the fast_calibrate (cpuid, msr
> and possibly the quick_pit) things early and delay the HPET/PMTIMER
> magic until later.

Yes. I really would prefer to have this as an explicit expressed mechanism
rather than relying on magic variables not being initialized.

Thanks

tglx


2018-07-19 14:18:03

by Pavel Tatashin

Subject: Re: [PATCH v14 24/25] sched: early boot clock

On Thu, Jul 19, 2018 at 6:40 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Jul 17, 2018 at 10:22:10PM -0400, Pavel Tatashin wrote:
>
> > diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> > index 0e9dbb2d9aea..7a8a63b940ee 100644
> > --- a/kernel/sched/clock.c
> > +++ b/kernel/sched/clock.c
> > @@ -202,7 +202,15 @@ static void __sched_clock_gtod_offset(void)
> >
> > void __init sched_clock_init(void)
> > {
> > + unsigned long flags;
> > +
> > sched_clock_running = 1;
> > +
> > + /* Adjust __gtod_offset for contiguous transition from early clock */
> > + local_irq_save(flags);
> > + sched_clock_tick();
> > + local_irq_restore(flags);
> > + __sched_clock_gtod_offset();
>
> I think we want to keep __sched_clock_gtod_offset() inside the IRQ
> disabled region.

Fixed.

>
> And I just looked at my patch:
>
> https://lkml.kernel.org/r/[email protected]
>
> and that had a comment about how we wanted to set the gtod offset
> _before_ setting sched_clock_running, yet here you do it the other way
> around. Hmm?

Fixed, and added your comment.

Thank you,
Pavel
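
The ordering Peter asks for (compute the offset first, then flip sched_clock_running) can be illustrated with a toy user-space model. This is not the kernel's code: struct clock_model, model_read() and model_switch() are hypothetical names, and the raw "early" and "gtod" readings are passed in as plain numbers. The point is only that a reader who sees the flag set also sees an offset that makes the new clock continue where the early clock left off.

```c
#include <stdint.h>

/* Hypothetical model of the two pieces of state discussed above. */
struct clock_model {
	uint64_t offset;	/* plays the role of __gtod_offset */
	int running;		/* plays the role of sched_clock_running */
};

/* Before the switch, report the early clock; afterwards, gtod plus offset. */
static uint64_t model_read(const struct clock_model *c,
			   uint64_t early_ns, uint64_t gtod_ns)
{
	return c->running ? gtod_ns + c->offset : early_ns;
}

/* Set the offset first, then publish the flag, so no reader sees a jump. */
static void model_switch(struct clock_model *c,
			 uint64_t early_ns, uint64_t gtod_ns)
{
	/* Unsigned arithmetic keeps this correct even if gtod_ns > early_ns. */
	c->offset = early_ns - gtod_ns;
	c->running = 1;
}
```

In the kernel both steps would additionally run with interrupts disabled, as requested in the review; the model leaves that out.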

2018-07-19 14:26:13

by Pavel Tatashin

Subject: Re: [PATCH v14 25/25] sched: use static key for sched_clock_running

On Thu, Jul 19, 2018 at 6:49 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Jul 17, 2018 at 10:22:11PM -0400, Pavel Tatashin wrote:
> > sched_clock_running may be read every time sched_clock_cpu() is called.
> > Yet, this variable is updated only twice during boot, and never changes
> > again, therefore it is better to make it a static key.
>
> Right, so the focus was always on making the sane TSC case fast, and if
> TSC isn't stable we'd just make do and not care too much.
>

True for CONFIG_HAVE_UNSTABLE_SCHED_CLOCK, but for other systems like
SPARC, it hurts to have this variable accessed every time, even though
they have a sane sched_clock().

> But this certainly isn't wrong, so ACK.

Thank you,
Pavel

2018-07-19 16:01:57

by Pavel Tatashin

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once



On 07/19/2018 07:01 AM, Thomas Gleixner wrote:
> On Thu, 19 Jul 2018, Peter Zijlstra wrote:
>> On Tue, Jul 17, 2018 at 10:22:06PM -0400, Pavel Tatashin wrote:
>>> During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(),
>>> and the second time in tsc_init().
>>>
>>> Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
>>> the calibration is done only early, and make tsc_init() to use the values
>>> already determined in tsc_early_init().
>>>
>>> Sometimes it is not possible to determine tsc early, as the subsystem that
>>> is required is not yet initialized, in such case try again later in
>>> tsc_init().
>>
>> It might be nice to preserve some of the information tglx dug out during
>> review of all this. Like the various methods of calibrate_*() and their
>> dependencies.
>>
>> And I note that this patch relies on the magic of native_calibrate_cpu()
>> working really early and not exploding in the quick calibration run.
>> This either wants fixing or documenting.
>>
>> I think the initial idea was to only do the fast_calibrate (cpuid, msr
>> and possibly the quick_pit) things early and delay the HPET/PMTIMER
>> magic until later.
>
> Yes. I really would prefer to have this as an explicit expressed mechanism
> rather than relying on magic variables not being initialized.

What is the best way to achieve this?

I did the following:

static bool __init determine_cpu_tsc_frequencies(bool early)
{
	/* Make sure that cpu and tsc are not already calibrated */
	WARN_ON(cpu_khz || tsc_khz);

	if (early) {
		cpu_khz = x86_platform.calibrate_cpu();
		tsc_khz = x86_platform.calibrate_tsc();
	} else {
		cpu_khz = hpet_pmtime_calibrate_cpu();
	}

/**
 * native_calibrate_cpu - calibrate the cpu on boot
 */
unsigned long native_calibrate_cpu(void)
{
	unsigned long flags, fast_calibrate;

	fast_calibrate = cpu_khz_from_cpuid();
	if (fast_calibrate)
		return fast_calibrate;

	fast_calibrate = cpu_khz_from_msr();
	if (fast_calibrate)
		return fast_calibrate;

	local_irq_save(flags);
	fast_calibrate = quick_pit_calibrate();
	local_irq_restore(flags);
	if (fast_calibrate)
		return fast_calibrate;

	return hpet_pmtime_calibrate_cpu();
}


And hpet_pmtime_calibrate_cpu() contains all the hpet/pmtime stuff.

However, when cpu_khz = x86_platform.calibrate_cpu() is called the first time, we still call hpet_pmtime_calibrate_cpu() from native_calibrate_cpu(). We cannot simply split native_calibrate_cpu() into two independent functions because it is also called from recalibrate_cpu_khz().

So, the question is how to enforce that the first time we do not call hpet/pmtime?

1. Use a new global variable? Kind of ugly.
2. Use system_state == SYSTEM_BOOTING ? Ugly, and probably not very safe.

Any other suggestion?

Thank you,
Pavel

2018-07-19 16:21:50

by Thomas Gleixner

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once

On Thu, 19 Jul 2018, Pavel Tatashin wrote:
> On 07/19/2018 07:01 AM, Thomas Gleixner wrote:
>
> And hpet_pmtime_calibrate_cpu() contains all the hpet/pmtime stuff.
>
> However, when cpu_khz = x86_platform.calibrate_cpu() is called the first
> time, we still call hpet_pmtime_calibrate_cpu() from
> native_calibrate_cpu(). We cannot simply split native_calibrate_cpu()
> into two independent functions because it is also called from
> recalibrate_cpu_khz().

> So, the question is how to enforce that the first time we do not call hpet/pmtime?
>
> 1. Use a new global variable? Kind of ugly.
> 2. Use system_state == SYSTEM_BOOTING ? Ugly, and probably not very safe.

Both are horrible.

So create two functions. native_...early..() and native....(). The early
one does not contain the hpet/pmtimer stuff and it replaces the ops.pointer
with the late one which contains all of it.

Hmm?

Thanks,

tglx
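
The early/late split suggested above can be sketched in user-space C roughly as follows. Everything here is hypothetical and for illustration only: struct platform_ops stands in for x86_platform, the fake_*_khz variables stand in for real probe results (0 meaning "method unavailable"), and the function names are not the ones the series ends up using. The early variant tries only the fast method and then repoints the ops pointer at the late variant, which contains all methods.

```c
/* Hypothetical stand-in for x86_platform; only one hook is modelled. */
struct platform_ops {
	unsigned long (*calibrate_cpu)(void);
};

/* Hypothetical probe results; 0 means the method is unavailable. */
static unsigned long fake_fast_khz;
static unsigned long fake_slow_khz = 2400000;

static unsigned long calibrate_cpu_early(void);
static unsigned long calibrate_cpu_late(void);

/* Boot starts with the early variant installed. */
static struct platform_ops platform = {
	.calibrate_cpu = calibrate_cpu_early,
};

/* Late variant: fast methods first, then the slow HPET/PM-timer style path. */
static unsigned long calibrate_cpu_late(void)
{
	if (fake_fast_khz)
		return fake_fast_khz;
	return fake_slow_khz;
}

/* Early variant: fast methods only, then hand over to the late variant. */
static unsigned long calibrate_cpu_early(void)
{
	unsigned long khz = fake_fast_khz;

	/* Later callers go through the full variant. */
	platform.calibrate_cpu = calibrate_cpu_late;
	return khz;
}
```

With this shape no caller needs a global flag or a system_state check: the first call through platform.calibrate_cpu() may return 0, and any later call automatically reaches the slow path.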

2018-07-19 16:51:08

by Pavel Tatashin

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once

> So create two functions. native_...early..() and native....(). The early
> one does not contain the hpet/pmtimer stuff and it replaces the ops.pointer
> with the late one which contains all of it.

Good idea. Actually, the late one will contain only hpet/pmtimer, and I
will set it only if the tsc frequency was not determined.

Thank you,
Pavel

2018-07-19 18:39:39

by Pavel Tatashin

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once

On Thu, Jul 19, 2018 at 12:49 PM Pavel Tatashin
<[email protected]> wrote:
>
> > So create two functions. native_...early..() and native....(). The early
> > one does not contain the hpet/pmtimer stuff and it replaces the ops.pointer
> > with the late one which contains all of it.
>
> Good idea. Actually, the late one will contain only hpet/pmtimer and I
> will set it only if tsc frequency was not determined only.

If we determined tsc early in boot using one of the quick methods
(cpuid/msr/quick_pit), can we assume that the frequencies of all other
CPUs will be determined the same way? Or do we still have to fall back
to PIT/HPET/PMTIMER? I am wondering if we support heterogeneous
multi-socket platforms with different CPUs, because those are the only
platforms where I see such a scenario being possible.

Thank you,
Pavel

2018-07-19 20:46:02

by Thomas Gleixner

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once

On Thu, 19 Jul 2018, Pavel Tatashin wrote:

> On Thu, Jul 19, 2018 at 12:49 PM Pavel Tatashin
> <[email protected]> wrote:
> >
> > > So create two functions. native_...early..() and native....(). The early
> > > one does not contain the hpet/pmtimer stuff and it replaces the ops.pointer
> > > with the late one which contains all of it.
> >
> > Good idea. Actually, the late one will contain only hpet/pmtimer and I
> > will set it only if tsc frequency was not determined only.
>
> If we determined tsc early in boot using one of the quick methods:
> from cpuid/msr/quick_pit, can we assume that frequencies of all other
> CPUs will be determined the same way? Or do we still have to fallback
> to PIT/HPET/PMTIMER? I wondering if we support heterogeneous
> multi-socket platforms with different CPUs, because that the only
> platforms where I see such scenario is possible.

The frequency for secondary CPUs is usually taken from the boot CPU and the
only reason why recalibration can happen is when the CPU does not have a
constant frequency TSC.

For that case the quick PIT + hpet/pmtimer calibration bundle is
required. So yes, the early calibration might work with quick PIT (those
CPUs definitely do not have MSR/CPUID based calibration), but the
recalibration might fail the quick PIT calibration for various reasons.

Thanks,

tglx

2018-07-19 20:48:31

by Pavel Tatashin

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once



On 07/19/2018 04:44 PM, Thomas Gleixner wrote:
> On Thu, 19 Jul 2018, Pavel Tatashin wrote:
>
>> On Thu, Jul 19, 2018 at 12:49 PM Pavel Tatashin
>> <[email protected]> wrote:
>>>
>>>> So create two functions. native_...early..() and native....(). The early
>>>> one does not contain the hpet/pmtimer stuff and it replaces the ops.pointer
>>>> with the late one which contains all of it.
>>>
>>> Good idea. Actually, the late one will contain only hpet/pmtimer and I
>>> will set it only if tsc frequency was not determined only.
>>
>> If we determined tsc early in boot using one of the quick methods:
>> from cpuid/msr/quick_pit, can we assume that frequencies of all other
>> CPUs will be determined the same way? Or do we still have to fallback
>> to PIT/HPET/PMTIMER? I wondering if we support heterogeneous
>> multi-socket platforms with different CPUs, because that the only
>> platforms where I see such scenario is possible.
>
> The frequency for secondary CPUs is usually taken from the boot CPU and the
> only reason why recalibration can happen is when the CPU does not have a
> constant frequency TSC.
>
> For that case the quick PIT + hpet/pmtimer calibration bundle is
> required. So yes, the early calibration might work with quick PIT (those
> CPUs definitely do not have MSR/CPUID based calibration), but the
> recalibration might fail the quick PIT calibration for various reasons.

OK, good, I implemented it with this in mind. I will send out a new series shortly.

Thank you,
Pavel

>
> Thanks,
>
> tglx
>

2018-07-23 09:31:25

by Alan Cox

Subject: Re: [PATCH v14 20/25] x86/tsc: calibrate tsc only once

> >> If we determined tsc early in boot using one of the quick methods:
> >> from cpuid/msr/quick_pit, can we assume that frequencies of all other
> >> CPUs will be determined the same way? Or do we still have to fallback

Not on 32bit at least. You can have a mixed slot 1 SMP system such as
an ASUS BP6 with cores at say 300 and 450MHz.

Now whether you'll find any such system in the world today is another
question.

Alan