2018-07-12 03:07:31

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v13 00/18] Early boot time stamps

changelog
---------
v13 - v12
- Addressed comments from Thomas Gleixner.
- Addressed comments from Peter Zijlstra.
- Added a patch from Borislav Petkov
- Added a new patch: sched: use static key for sched_clock_running
- Added xen pv fixes, so clock is initialized when other
hypervisors initialize their clocks.
Note: I am including "kvm/x86: remove kvm memblock dependency", which
is part of this series:
http://lkml.kernel.org/r/[email protected]
because without that patch it is not possible to test this series on
KVM.

v12 - v11
- Split "time: replace read_boot_clock64() with
read_persistent_wall_and_boot_offset()" into four patches
- Added two patches: one fixes an existing bug with text_poke(),
the other enables static branches early. Note: because I found
and fixed the text_poke() bug, enabling static branching became
straightforward, as no changes to jump_label* are needed.
- Modified "x86/tsc: use tsc early" to use static branches early, so
native_sched_clock() is not changed at all.
v11 - v10
- Addressed all the comments from Thomas Gleixner.
- I added one more patch:
"x86/tsc: prepare for early sched_clock" which fixes a problem
that I discovered while testing. I am not particularly happy with
the fix, as it adds a new argument that is used only in one
place, but if you have a suggestion for a different approach on
how to address this problem please let me know.

v10 - v9
- Added another patch to this series that removes the dependency
between the KVM clock and the memblock allocator. The benefit is that
all clocks can now be initialized even earlier.
v9 - v8
- Addressed more comments from Dou Liyang

v8 - v7
- Addressed comments from Dou Liyang:
- Moved tsc_early_init() and tsc_early_fini() to be all inside
tsc.c, and changed them to be static.
- Removed warning when notsc parameter is used.
- Merged with:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

v7 - v6
- Removed tsc_disabled flag, now notsc is equivalent of
tsc=unstable
- Simplified changes to sched/clock.c, by removing the
sched_clock_early() and friends as requested by Peter Zijlstra.
We now always use sched_clock().
- Modified x86 sched_clock() to return either early boot time or
regular.
- Added another example of why early boot time is important

v6 - v5
- Added a new patch:
time: sync read_boot_clock64() with persistent clock
which fixes a missing __init macro and enables the time discrepancy
fix that was noted by Thomas Gleixner
- Split "x86/time: read_boot_clock64() implementation" into a
separate patch

v5 - v4
- Fix compiler warnings on systems with stable clocks.

v4 - v3
- Fixed tsc_early_fini() call to be in the 2nd patch as reported
by Dou Liyang
- Improved comment before __use_sched_clock_early to explain why
we need both booleans.
- Simplified valid_clock logic in read_boot_clock64().

v3 - v2
- Addressed comment from Thomas Gleixner
- Timestamps are available a little later in boot but still much
earlier than in mainline. This significantly simplified this
work.

v2 - v1
In patch "x86/tsc: tsc early":
- added tsc_adjusted_early()
- fixed 32-bit compile error use do_div()

The early boot time stamps were discussed recently in these threads:
http://lkml.kernel.org/r/[email protected]
http://lkml.kernel.org/r/[email protected]

I updated my series to the latest mainline and sending it again.

Peter mentioned he did not like patches 6 and 7, and we can discuss a
better way to do that, but I think patches 1-5 can be accepted
separately, since they already enable early timestamps on platforms
where sched_clock() is available early, such as KVM.

This series adds early boot time stamp support for x86 machines.
The SPARC patches for early boot time stamps are already integrated
into mainline Linux.

Sample output
-------------
Before:
https://paste.ubuntu.com/26133428/

After:
https://paste.ubuntu.com/26133523/

For examples of how early time stamps are used, see the following:
Example 1:
https://lwn.net/Articles/734374/
- Without early boot time stamps we would not know about the extra time
that is spent zeroing struct pages early in boot even when deferred
page initialization is used.

Example 2:
https://patchwork.kernel.org/patch/10021247/
- If early boot timestamps were available, the engineer who introduced
this bug would have noticed the extra time that is spent early in boot.

Example 3:
http://lkml.kernel.org/r/[email protected]
- Needed early time stamps to show improvement

Borislav Petkov (1):
x86/CPU: Call detect_nopl() only on the BSP

Pavel Tatashin (17):
x86: text_poke() may access uninitialized struct pages
x86: initialize static branching early
x86/tsc: redefine notsc to behave as tsc=unstable
kvm/x86: remove kvm memblock dependency
x86/xen/time: initialize pv xen time in init_hypervisor_platform
x86/xen/time: output xen sched_clock time from 0
s390/time: add read_persistent_wall_and_boot_offset()
time: replace read_boot_clock64() with
read_persistent_wall_and_boot_offset()
time: default boot time offset to local_clock()
s390/time: remove read_boot_clock64()
ARM/time: remove read_boot_clock64()
x86/tsc: calibrate tsc only once
x86/tsc: initialize cyc2ns when tsc freq. is determined
x86/tsc: use tsc early
sched: move sched clock initialization and merge with generic clock
sched: early boot clock
sched: use static key for sched_clock_running

.../admin-guide/kernel-parameters.txt | 2 -
Documentation/x86/x86_64/boot-options.txt | 4 +-
arch/arm/include/asm/mach/time.h | 3 +-
arch/arm/kernel/time.c | 15 +-
arch/arm/plat-omap/counter_32k.c | 2 +-
arch/s390/kernel/time.c | 15 +-
arch/x86/include/asm/kvm_para.h | 2 +-
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/include/asm/tsc.h | 2 +-
arch/x86/kernel/alternative.c | 7 +
arch/x86/kernel/cpu/amd.c | 13 +-
arch/x86/kernel/cpu/common.c | 40 ++--
arch/x86/kernel/jump_label.c | 11 +-
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/kvmclock.c | 66 +------
arch/x86/kernel/setup.c | 11 +-
arch/x86/kernel/tsc.c | 187 +++++++++---------
arch/x86/xen/enlighten_pv.c | 51 +++--
arch/x86/xen/mmu_pv.c | 6 +-
arch/x86/xen/suspend_pv.c | 5 +-
arch/x86/xen/time.c | 24 +--
arch/x86/xen/xen-ops.h | 6 +-
drivers/clocksource/tegra20_timer.c | 2 +-
include/linux/sched_clock.h | 5 +-
include/linux/timekeeping.h | 3 +-
init/main.c | 4 +-
kernel/sched/clock.c | 48 +++--
kernel/sched/core.c | 1 -
kernel/sched/debug.c | 2 -
kernel/time/sched_clock.c | 2 +-
kernel/time/timekeeping.c | 62 +++---
31 files changed, 288 insertions(+), 315 deletions(-)

--
2.18.0



2018-07-12 03:06:23

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v13 11/18] s390/time: remove read_boot_clock64()

read_boot_clock64() was replaced by read_persistent_wall_and_boot_offset(),
so remove it.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/s390/kernel/time.c | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index d1f5447d5687..e8766beee5ad 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -239,19 +239,6 @@ void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
*boot_offset = timespec64_sub(*wall_time, boot_time);
}

-void read_boot_clock64(struct timespec64 *ts)
-{
- unsigned char clk[STORE_CLOCK_EXT_SIZE];
- __u64 delta;
-
- delta = initial_leap_seconds + TOD_UNIX_EPOCH;
- memcpy(clk, tod_clock_base, 16);
- *(__u64 *) &clk[1] -= delta;
- if (*(__u64 *) &clk[1] > delta)
- clk[0]--;
- ext_to_timespec64(clk, ts);
-}
-
static u64 read_tod_clock(struct clocksource *cs)
{
unsigned long long now, adj;
--
2.18.0


2018-07-12 03:06:28

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v13 16/18] sched: move sched clock initialization and merge with generic clock

sched_clock_postinit() initializes a generic clock on systems where no
other clock is provided. This function may be called only after
timekeeping_init().

Rename sched_clock_postinit() to generic_sched_clock_init() and call it
from sched_clock_init(). Move the call to sched_clock_init() to after
time_init().

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/sched_clock.h | 5 ++---
init/main.c | 4 ++--
kernel/sched/clock.c | 26 ++++++++++++++++----------
kernel/sched/core.c | 1 -
kernel/time/sched_clock.c | 2 +-
5 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched_clock.h b/include/linux/sched_clock.h
index 411b52e424e1..abe28d5cb3f4 100644
--- a/include/linux/sched_clock.h
+++ b/include/linux/sched_clock.h
@@ -9,17 +9,16 @@
#define LINUX_SCHED_CLOCK

#ifdef CONFIG_GENERIC_SCHED_CLOCK
-extern void sched_clock_postinit(void);
+extern void generic_sched_clock_init(void);

extern void sched_clock_register(u64 (*read)(void), int bits,
unsigned long rate);
#else
-static inline void sched_clock_postinit(void) { }
+static inline void generic_sched_clock_init(void) { }

static inline void sched_clock_register(u64 (*read)(void), int bits,
unsigned long rate)
{
- ;
}
#endif

diff --git a/init/main.c b/init/main.c
index 3b4ada11ed52..162d931c9511 100644
--- a/init/main.c
+++ b/init/main.c
@@ -79,7 +79,7 @@
#include <linux/pti.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>
-#include <linux/sched_clock.h>
+#include <linux/sched/clock.h>
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/context_tracking.h>
@@ -642,7 +642,7 @@ asmlinkage __visible void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
- sched_clock_postinit();
+ sched_clock_init();
printk_safe_init();
perf_event_init();
profile_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 10c83e73837a..3b02be1fdfda 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -68,11 +68,6 @@ EXPORT_SYMBOL_GPL(sched_clock);

__read_mostly int sched_clock_running;

-void sched_clock_init(void)
-{
- sched_clock_running = 1;
-}
-
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
/*
* We must start with !__sched_clock_stable because the unstable -> stable
@@ -199,6 +194,15 @@ void clear_sched_clock_stable(void)
__clear_sched_clock_stable();
}

+static void __sched_clock_gtod_offset(void)
+{
+ __gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();
+}
+
+void __init sched_clock_init(void)
+{
+ sched_clock_running = 1;
+}
/*
* We run this as late_initcall() such that it runs after all built-in drivers,
* notably: acpi_processor and intel_idle, which can mark the TSC as unstable.
@@ -385,8 +389,6 @@ void sched_clock_tick(void)

void sched_clock_tick_stable(void)
{
- u64 gtod, clock;
-
if (!sched_clock_stable())
return;

@@ -398,9 +400,7 @@ void sched_clock_tick_stable(void)
* TSC to be unstable, any computation will be computing crap.
*/
local_irq_disable();
- gtod = ktime_get_ns();
- clock = sched_clock();
- __gtod_offset = (clock + __sched_clock_offset) - gtod;
+ __sched_clock_gtod_offset();
local_irq_enable();
}

@@ -434,6 +434,12 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);

#else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */

+void __init sched_clock_init(void)
+{
+ sched_clock_running = 1;
+ generic_sched_clock_init();
+}
+
u64 sched_clock_cpu(int cpu)
{
if (unlikely(!sched_clock_running))
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..552406e9713b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5954,7 +5954,6 @@ void __init sched_init(void)
int i, j;
unsigned long alloc_size = 0, ptr;

- sched_clock_init();
wait_bit_init();

#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 2d8f05aad442..cbc72c2c1fca 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -237,7 +237,7 @@ sched_clock_register(u64 (*read)(void), int bits, unsigned long rate)
pr_debug("Registered %pF as sched_clock source\n", read);
}

-void __init sched_clock_postinit(void)
+void __init generic_sched_clock_init(void)
{
/*
* If no sched_clock() function has been provided at that point,
--
2.18.0


2018-07-12 03:06:32

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v13 13/18] x86/tsc: calibrate tsc only once

During boot the TSC is calibrated twice: once in
tsc_early_delay_calibrate(), and a second time in tsc_init().

Rename tsc_early_delay_calibrate() to tsc_early_init(), rework it so the
calibration is done only early, and make tsc_init() use the values
already determined in tsc_early_init().

Sometimes it is not possible to determine the TSC frequency early,
because a required subsystem is not yet initialized; in that case, try
again later in tsc_init().

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/tsc.h | 2 +-
arch/x86/kernel/setup.c | 2 +-
arch/x86/kernel/tsc.c | 86 ++++++++++++++++++++------------------
3 files changed, 48 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 2701d221583a..c4368ff73652 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -33,7 +33,7 @@ static inline cycles_t get_cycles(void)
extern struct system_counterval_t convert_art_to_tsc(u64 art);
extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);

-extern void tsc_early_delay_calibrate(void);
+extern void tsc_early_init(void);
extern void tsc_init(void);
extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 01fcc8bf7c8f..07445482bb57 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1014,7 +1014,7 @@ void __init setup_arch(char **cmdline_p)
*/
init_hypervisor_platform();

- tsc_early_delay_calibrate();
+ tsc_early_init();

x86_init.resources.probe_roms();

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 186395041725..bc8eb82050a3 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -33,6 +33,8 @@ EXPORT_SYMBOL(cpu_khz);
unsigned int __read_mostly tsc_khz;
EXPORT_SYMBOL(tsc_khz);

+#define KHZ 1000
+
/*
* TSC can be unstable due to cpufreq or due to unsynced TSCs
*/
@@ -1335,34 +1337,10 @@ static int __init init_tsc_clocksource(void)
*/
device_initcall(init_tsc_clocksource);

-void __init tsc_early_delay_calibrate(void)
-{
- unsigned long lpj;
-
- if (!boot_cpu_has(X86_FEATURE_TSC))
- return;
-
- cpu_khz = x86_platform.calibrate_cpu();
- tsc_khz = x86_platform.calibrate_tsc();
-
- tsc_khz = tsc_khz ? : cpu_khz;
- if (!tsc_khz)
- return;
-
- lpj = tsc_khz * 1000;
- do_div(lpj, HZ);
- loops_per_jiffy = lpj;
-}
-
-void __init tsc_init(void)
+static bool determine_cpu_tsc_frequncies(void)
{
- u64 lpj, cyc;
- int cpu;
-
- if (!boot_cpu_has(X86_FEATURE_TSC)) {
- setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
- return;
- }
+ /* Make sure that cpu and tsc are not already calibrated */
+ WARN_ON(cpu_khz || tsc_khz);

cpu_khz = x86_platform.calibrate_cpu();
tsc_khz = x86_platform.calibrate_tsc();
@@ -1377,20 +1355,51 @@ void __init tsc_init(void)
else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
cpu_khz = tsc_khz;

- if (!tsc_khz) {
- mark_tsc_unstable("could not calculate TSC khz");
- setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
- return;
- }
+ if (tsc_khz == 0)
+ return false;

pr_info("Detected %lu.%03lu MHz processor\n",
- (unsigned long)cpu_khz / 1000,
- (unsigned long)cpu_khz % 1000);
+ (unsigned long)cpu_khz / KHZ,
+ (unsigned long)cpu_khz % KHZ);

if (cpu_khz != tsc_khz) {
pr_info("Detected %lu.%03lu MHz TSC",
- (unsigned long)tsc_khz / 1000,
- (unsigned long)tsc_khz % 1000);
+ (unsigned long)tsc_khz / KHZ,
+ (unsigned long)tsc_khz % KHZ);
+ }
+ return true;
+}
+
+static unsigned long get_loops_per_jiffy(void)
+{
+ unsigned long lpj = tsc_khz * KHZ;
+
+ do_div(lpj, HZ);
+ return lpj;
+}
+
+void __init tsc_early_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TSC))
+ return;
+ if (!determine_cpu_tsc_frequncies())
+ return;
+ loops_per_jiffy = get_loops_per_jiffy();
+}
+
+void __init tsc_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TSC)) {
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+ return;
+ }
+
+ if (!tsc_khz) {
+ /* We failed to determine frequencies earlier, try again */
+ if (!determine_cpu_tsc_frequncies()) {
+ mark_tsc_unstable("could not calculate TSC khz");
+ return;
+ }
}

/* Sanitize TSC ADJUST before cyc2ns gets initialized */
@@ -1413,10 +1422,7 @@ void __init tsc_init(void)
if (!no_sched_irq_time)
enable_sched_clock_irqtime();

- lpj = ((u64)tsc_khz * 1000);
- do_div(lpj, HZ);
- lpj_fine = lpj;
-
+ lpj_fine = get_loops_per_jiffy();
use_tsc_delay();

check_system_tsc_reliable();
--
2.18.0


2018-07-12 03:06:56

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v13 06/18] x86/xen/time: initialize pv xen time in init_hypervisor_platform

In every hypervisor except Xen PV, time ops are initialized in
init_hypervisor_platform().

Xen PV domains initialize time ops in x86_init.paging.pagetable_init()
by calling xen_setup_shared_info(), which is a poor design, as time is
needed before the memory allocator is up.

xen_setup_shared_info() is called from two places: during boot, and
after suspend. Split the content of xen_setup_shared_info() into three
places:

1. Add the clock-relevant data into the new Xen PV init_platform vector,
and set clock ops there.

2. Move xen_setup_vcpu_info_placement() to the new
xen_pv_guest_late_init() call.

3. Move re-initialization of parts of the shared info copy to
xen_pv_post_suspend(), to be symmetric with xen_pv_pre_suspend().

Also, remove code duplication by calling xen_init_time_ops() from
xen_hvm_init_time_ops().

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/xen/enlighten_pv.c | 51 +++++++++++++++++--------------------
arch/x86/xen/mmu_pv.c | 6 ++---
arch/x86/xen/suspend_pv.c | 5 ++--
arch/x86/xen/time.c | 14 +++-------
arch/x86/xen/xen-ops.h | 6 ++---
5 files changed, 35 insertions(+), 47 deletions(-)

diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 8d4e2e1ae60b..615ad0f16848 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -119,6 +119,27 @@ static void __init xen_banner(void)
version >> 16, version & 0xffff, extra.extraversion,
xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
}
+
+static void __init xen_pv_init_platform(void)
+{
+ set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+ HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+
+ /* xen clock uses per-cpu vcpu_info, need to init it for boot cpu */
+ xen_vcpu_info_reset(0);
+
+ /* pvclock is in shared info area */
+ xen_init_time_ops();
+}
+
+static void __init xen_pv_guest_late_init(void)
+{
+#ifndef CONFIG_SMP
+ /* Setup shared vcpu info for non-smp configurations */
+ xen_setup_vcpu_info_placement();
+#endif
+}
+
/* Check if running on Xen version (major, minor) or later */
bool
xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -947,34 +968,8 @@ static void xen_write_msr(unsigned int msr, unsigned low, unsigned high)
xen_write_msr_safe(msr, low, high);
}

-void xen_setup_shared_info(void)
-{
- set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
-
- HYPERVISOR_shared_info =
- (struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
-
- xen_setup_mfn_list_list();
-
- if (system_state == SYSTEM_BOOTING) {
-#ifndef CONFIG_SMP
- /*
- * In UP this is as good a place as any to set up shared info.
- * Limit this to boot only, at restore vcpu setup is done via
- * xen_vcpu_restore().
- */
- xen_setup_vcpu_info_placement();
-#endif
- /*
- * Now that shared info is set up we can start using routines
- * that point to pvclock area.
- */
- xen_init_time_ops();
- }
-}
-
/* This is called once we have the cpu_possible_mask */
-void __ref xen_setup_vcpu_info_placement(void)
+void __init xen_setup_vcpu_info_placement(void)
{
int cpu;

@@ -1220,6 +1215,8 @@ asmlinkage __visible void __init xen_start_kernel(void)
x86_init.irqs.intr_mode_init = x86_init_noop;
x86_init.oem.arch_setup = xen_arch_setup;
x86_init.oem.banner = xen_banner;
+ x86_init.hyper.init_platform = xen_pv_init_platform;
+ x86_init.hyper.guest_late_init = xen_pv_guest_late_init;

/*
* Set up some pagetable state before starting to set any ptes.
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2c30cabfda90..52206ad81e4b 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1230,8 +1230,7 @@ static void __init xen_pagetable_p2m_free(void)
* We roundup to the PMD, which means that if anybody at this stage is
* using the __ka address of xen_start_info or
* xen_start_info->shared_info they are in going to crash. Fortunatly
- * we have already revectored in xen_setup_kernel_pagetable and in
- * xen_setup_shared_info.
+ * we have already revectored in xen_setup_kernel_pagetable.
*/
size = roundup(size, PMD_SIZE);

@@ -1292,8 +1291,7 @@ static void __init xen_pagetable_init(void)

/* Remap memory freed due to conflicts with E820 map */
xen_remap_memory();
-
- xen_setup_shared_info();
+ xen_setup_mfn_list_list();
}
static void xen_write_cr2(unsigned long cr2)
{
diff --git a/arch/x86/xen/suspend_pv.c b/arch/x86/xen/suspend_pv.c
index a2e0f110af56..8303b58c79a9 100644
--- a/arch/x86/xen/suspend_pv.c
+++ b/arch/x86/xen/suspend_pv.c
@@ -27,8 +27,9 @@ void xen_pv_pre_suspend(void)
void xen_pv_post_suspend(int suspend_cancelled)
{
xen_build_mfn_list_list();
-
- xen_setup_shared_info();
+ set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+ HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+ xen_setup_mfn_list_list();

if (suspend_cancelled) {
xen_start_info->store_mfn =
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index e0f1bcf01d63..2ad61bf896d6 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -40,7 +40,7 @@ static unsigned long xen_tsc_khz(void)
return pvclock_tsc_khz(info);
}

-u64 xen_clocksource_read(void)
+static u64 xen_clocksource_read(void)
{
struct pvclock_vcpu_time_info *src;
u64 ret;
@@ -503,7 +503,7 @@ static void __init xen_time_init(void)
pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
}

-void __ref xen_init_time_ops(void)
+void __init xen_init_time_ops(void)
{
pv_time_ops = xen_time_ops;

@@ -542,17 +542,11 @@ void __init xen_hvm_init_time_ops(void)
return;

if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
- printk(KERN_INFO "Xen doesn't support pvclock on HVM,"
- "disable pv timer\n");
+ pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
return;
}
-
- pv_time_ops = xen_time_ops;
+ xen_init_time_ops();
x86_init.timers.setup_percpu_clockev = xen_time_init;
x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
-
- x86_platform.calibrate_tsc = xen_tsc_khz;
- x86_platform.get_wallclock = xen_get_wallclock;
- x86_platform.set_wallclock = xen_set_wallclock;
}
#endif
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 3b34745d0a52..e78684597f57 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -31,7 +31,6 @@ extern struct shared_info xen_dummy_shared_info;
extern struct shared_info *HYPERVISOR_shared_info;

void xen_setup_mfn_list_list(void);
-void xen_setup_shared_info(void);
void xen_build_mfn_list_list(void);
void xen_setup_machphys_mapping(void);
void xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn);
@@ -68,12 +67,11 @@ void xen_init_irq_ops(void);
void xen_setup_timer(int cpu);
void xen_setup_runstate_info(int cpu);
void xen_teardown_timer(int cpu);
-u64 xen_clocksource_read(void);
void xen_setup_cpu_clockevents(void);
void xen_save_time_memory_area(void);
void xen_restore_time_memory_area(void);
-void __ref xen_init_time_ops(void);
-void __init xen_hvm_init_time_ops(void);
+void xen_init_time_ops(void);
+void xen_hvm_init_time_ops(void);

irqreturn_t xen_debug_interrupt(int irq, void *dev_id);

--
2.18.0


2018-07-12 03:07:06

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v13 03/18] x86/CPU: Call detect_nopl() only on the BSP

From: Borislav Petkov <[email protected]>

Make it use the setup_* variants and have it be called only on the BSP
and drop the call in generic_identify() - X86_FEATURE_NOPL will be
replicated to the APs through the forced caps. Helps keep the mess at a
manageable level.

Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/cpu/common.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 71281ac43b15..46408a8cdf62 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1024,12 +1024,12 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
* unless we can find a reliable way to detect all the broken cases.
* Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
*/
-static void detect_nopl(struct cpuinfo_x86 *c)
+static void detect_nopl(void)
{
#ifdef CONFIG_X86_32
- clear_cpu_cap(c, X86_FEATURE_NOPL);
+ setup_clear_cpu_cap(X86_FEATURE_NOPL);
#else
- set_cpu_cap(c, X86_FEATURE_NOPL);
+ setup_force_cpu_cap(X86_FEATURE_NOPL);
#endif
}

@@ -1108,7 +1108,7 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
if (!pgtable_l5_enabled())
setup_clear_cpu_cap(X86_FEATURE_LA57);

- detect_nopl(c);
+ detect_nopl();
}

void __init early_cpu_init(void)
@@ -1206,8 +1206,6 @@ static void generic_identify(struct cpuinfo_x86 *c)

get_model_name(c); /* Default name */

- detect_nopl(c);
-
detect_null_seg_behavior(c);

/*
--
2.18.0


2018-07-12 03:07:12

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v13 04/18] x86/tsc: redefine notsc to behave as tsc=unstable

Currently, the notsc kernel parameter disables the use of the TSC
register by sched_clock(). However, this parameter does not prevent
Linux from accessing the TSC elsewhere in the kernel.

The only rationale for booting with notsc is to avoid timing
discrepancies on multi-socket systems where different TSC frequencies
may be present, and thus to fall back to jiffies as the clock source.

However, there is another way to solve the above problem: boot with the
tsc=unstable parameter. This parameter allows sched_clock() to use the
TSC, but if the TSC is outside of the expected interval it is corrected
back to a sane value.

Thus, there is no reason to keep notsc and it could be removed. But for
compatibility reasons, keep the parameter and change its definition to
be the same as tsc=unstable.

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Dou Liyang <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 2 --
Documentation/x86/x86_64/boot-options.txt | 4 +---
arch/x86/kernel/tsc.c | 18 +++---------------
3 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index efc7aa7a0670..f7123d28f318 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2835,8 +2835,6 @@

nosync [HW,M68K] Disables sync negotiation for all devices.

- notsc [BUGS=X86-32] Disable Time Stamp Counter
-
nowatchdog [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 8d109ef67ab6..66114ab4f9fe 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -92,9 +92,7 @@ APICs
Timing

notsc
- Don't use the CPU time stamp counter to read the wall time.
- This can be used to work around timing problems on multiprocessor systems
- with not properly synchronized CPUs.
+ Deprecated, use tsc=unstable instead.

nohpet
Don't use the HPET timer.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 74392d9d51e0..186395041725 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -38,11 +38,6 @@ EXPORT_SYMBOL(tsc_khz);
*/
static int __read_mostly tsc_unstable;

-/* native_sched_clock() is called before tsc_init(), so
- we must start with the TSC soft disabled to prevent
- erroneous rdtsc usage on !boot_cpu_has(X86_FEATURE_TSC) processors */
-static int __read_mostly tsc_disabled = -1;
-
static DEFINE_STATIC_KEY_FALSE(__use_tsc);

int tsc_clocksource_reliable;
@@ -248,8 +243,7 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
#ifdef CONFIG_X86_TSC
int __init notsc_setup(char *str)
{
- pr_warn("Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely\n");
- tsc_disabled = 1;
+ mark_tsc_unstable("boot parameter notsc");
return 1;
}
#else
@@ -1307,7 +1301,7 @@ static void tsc_refine_calibration_work(struct work_struct *work)

static int __init init_tsc_clocksource(void)
{
- if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_disabled > 0 || !tsc_khz)
+ if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
return 0;

if (tsc_unstable)
@@ -1414,12 +1408,6 @@ void __init tsc_init(void)
set_cyc2ns_scale(tsc_khz, cpu, cyc);
}

- if (tsc_disabled > 0)
- return;
-
- /* now allow native_sched_clock() to use rdtsc */
-
- tsc_disabled = 0;
static_branch_enable(&__use_tsc);

if (!no_sched_irq_time)
@@ -1455,7 +1443,7 @@ unsigned long calibrate_delay_is_known(void)
int constant_tsc = cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC);
const struct cpumask *mask = topology_core_cpumask(cpu);

- if (tsc_disabled || !constant_tsc || !mask)
+ if (!constant_tsc || !mask)
return 0;

sibling = cpumask_any_but(mask, cpu);
--
2.18.0


2018-07-12 03:07:19

by Pavel Tatashin

[permalink] [raw]
Subject: [PATCH v13 12/18] ARM/time: remove read_boot_clock64()

read_boot_clock64() was replaced with
read_persistent_wall_and_boot_offset().

The default implementation of read_persistent_wall_and_boot_offset()
provides a better fallback than the current ARM stubs for
read_boot_clock64(), which have no users, so remove the old code.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm/include/asm/mach/time.h | 3 +--
arch/arm/kernel/time.c | 15 ++-------------
arch/arm/plat-omap/counter_32k.c | 2 +-
drivers/clocksource/tegra20_timer.c | 2 +-
4 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mach/time.h b/arch/arm/include/asm/mach/time.h
index 0f79e4dec7f9..4ac3a019a46f 100644
--- a/arch/arm/include/asm/mach/time.h
+++ b/arch/arm/include/asm/mach/time.h
@@ -13,7 +13,6 @@
extern void timer_tick(void);

typedef void (*clock_access_fn)(struct timespec64 *);
-extern int register_persistent_clock(clock_access_fn read_boot,
- clock_access_fn read_persistent);
+extern int register_persistent_clock(clock_access_fn read_persistent);

#endif
diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c
index cf2701cb0de8..078b259ead4e 100644
--- a/arch/arm/kernel/time.c
+++ b/arch/arm/kernel/time.c
@@ -83,29 +83,18 @@ static void dummy_clock_access(struct timespec64 *ts)
}

static clock_access_fn __read_persistent_clock = dummy_clock_access;
-static clock_access_fn __read_boot_clock = dummy_clock_access;

void read_persistent_clock64(struct timespec64 *ts)
{
__read_persistent_clock(ts);
}

-void read_boot_clock64(struct timespec64 *ts)
-{
- __read_boot_clock(ts);
-}
-
-int __init register_persistent_clock(clock_access_fn read_boot,
- clock_access_fn read_persistent)
+int __init register_persistent_clock(clock_access_fn read_persistent)
{
/* Only allow the clockaccess functions to be registered once */
- if (__read_persistent_clock == dummy_clock_access &&
- __read_boot_clock == dummy_clock_access) {
- if (read_boot)
- __read_boot_clock = read_boot;
+ if (__read_persistent_clock == dummy_clock_access) {
if (read_persistent)
__read_persistent_clock = read_persistent;
-
return 0;
}

diff --git a/arch/arm/plat-omap/counter_32k.c b/arch/arm/plat-omap/counter_32k.c
index 2438b96004c1..fcc5bfec8bd1 100644
--- a/arch/arm/plat-omap/counter_32k.c
+++ b/arch/arm/plat-omap/counter_32k.c
@@ -110,7 +110,7 @@ int __init omap_init_clocksource_32k(void __iomem *vbase)
}

sched_clock_register(omap_32k_read_sched_clock, 32, 32768);
- register_persistent_clock(NULL, omap_read_persistent_clock64);
+ register_persistent_clock(omap_read_persistent_clock64);
pr_info("OMAP clocksource: 32k_counter at 32768 Hz\n");

return 0;
diff --git a/drivers/clocksource/tegra20_timer.c b/drivers/clocksource/tegra20_timer.c
index c337a8100a7b..2242a36fc5b0 100644
--- a/drivers/clocksource/tegra20_timer.c
+++ b/drivers/clocksource/tegra20_timer.c
@@ -259,6 +259,6 @@ static int __init tegra20_init_rtc(struct device_node *np)
else
clk_prepare_enable(clk);

- return register_persistent_clock(NULL, tegra_read_persistent_clock64);
+ return register_persistent_clock(tegra_read_persistent_clock64);
}
TIMER_OF_DECLARE(tegra20_rtc, "nvidia,tegra20-rtc", tegra20_init_rtc);
--
2.18.0


2018-07-12 03:07:25

by Pavel Tatashin

Subject: [PATCH v13 07/18] x86/xen/time: output xen sched_clock time from 0

sched_clock() is expected to count from 0 when the system boots.
Add an offset, xen_sched_clock_offset (similar to how it is done in other
hypervisors, e.g. kvm_sched_clock_offset), so that sched_clock() counts
from 0 starting from the point when time is first initialized.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/xen/time.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 2ad61bf896d6..3c6f3d603373 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -31,6 +31,8 @@
/* Xen may fire a timer up to this many ns early */
#define TIMER_SLOP 100000

+static u64 xen_sched_clock_offset __read_mostly;
+
/* Get the TSC speed from Xen */
static unsigned long xen_tsc_khz(void)
{
@@ -57,6 +59,11 @@ static u64 xen_clocksource_get_cycles(struct clocksource *cs)
return xen_clocksource_read();
}

+static u64 xen_sched_clock(void)
+{
+ return xen_clocksource_read() - xen_sched_clock_offset;
+}
+
static void xen_read_wallclock(struct timespec64 *ts)
{
struct shared_info *s = HYPERVISOR_shared_info;
@@ -367,7 +374,7 @@ void xen_timer_resume(void)
}

static const struct pv_time_ops xen_time_ops __initconst = {
- .sched_clock = xen_clocksource_read,
+ .sched_clock = xen_sched_clock,
.steal_clock = xen_steal_clock,
};

@@ -505,6 +512,7 @@ static void __init xen_time_init(void)

void __init xen_init_time_ops(void)
{
+ xen_sched_clock_offset = xen_clocksource_read();
pv_time_ops = xen_time_ops;

x86_init.timers.timer_init = xen_time_init;
--
2.18.0


2018-07-12 03:07:25

by Pavel Tatashin

Subject: [PATCH v13 14/18] x86/tsc: initialize cyc2ns when tsc freq. is determined

cyc2ns converts TSC cycles to nanoseconds and is maintained in a per-CPU
data structure.

Currently, the setup code runs the same sequence of calculations for every
possible CPU, even though each calculation is based on the same TSC
frequency as the boot CPU, so repeating it is unnecessary.

Initialize the boot CPU when the TSC frequency is determined. Copy the
calculated data from the boot CPU to the other CPUs in tsc_init().

In addition, do the following:

- Remove the unnecessary zeroing of cyc2ns data by removing
cyc2ns_data_init().
- Split set_cyc2ns_scale() into two functions, so that set_cyc2ns_scale()
can be called when the system is up; it wraps __set_cyc2ns_scale(), which
can be called directly while the system is booting and avoids
saving/restoring IRQs and the idle sleep/wakeup events.

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/tsc.c | 94 ++++++++++++++++++++++++-------------------
1 file changed, 53 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index bc8eb82050a3..0b1abe7fdd8e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -103,23 +103,6 @@ void cyc2ns_read_end(void)
* [email protected] "math is hard, lets go shopping!"
*/

-static void cyc2ns_data_init(struct cyc2ns_data *data)
-{
- data->cyc2ns_mul = 0;
- data->cyc2ns_shift = 0;
- data->cyc2ns_offset = 0;
-}
-
-static void __init cyc2ns_init(int cpu)
-{
- struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
-
- cyc2ns_data_init(&c2n->data[0]);
- cyc2ns_data_init(&c2n->data[1]);
-
- seqcount_init(&c2n->seq);
-}
-
static inline unsigned long long cycles_2_ns(unsigned long long cyc)
{
struct cyc2ns_data data;
@@ -135,18 +118,11 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
return ns;
}

-static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
{
unsigned long long ns_now;
struct cyc2ns_data data;
struct cyc2ns *c2n;
- unsigned long flags;
-
- local_irq_save(flags);
- sched_clock_idle_sleep_event();
-
- if (!khz)
- goto done;

ns_now = cycles_2_ns(tsc_now);

@@ -178,12 +154,55 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
c2n->data[0] = data;
raw_write_seqcount_latch(&c2n->seq);
c2n->data[1] = data;
+}
+
+static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ if (khz)
+ __set_cyc2ns_scale(khz, cpu, tsc_now);

-done:
sched_clock_idle_wakeup_event();
local_irq_restore(flags);
}

+/*
+ * Initialize cyc2ns for boot cpu
+ */
+static void __init cyc2ns_init_boot_cpu(void)
+{
+ struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+
+ seqcount_init(&c2n->seq);
+ __set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
+}
+
+/*
+ * Secondary CPUs do not run through cyc2ns_init(), so set up
+ * all the scale factors for all CPUs, assuming the same
+ * speed as the bootup CPU. (cpufreq notifiers will fix this
+ * up if their speed diverges)
+ */
+static void __init cyc2ns_init_secondary_cpus(void)
+{
+ unsigned int cpu, this_cpu = smp_processor_id();
+ struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+ struct cyc2ns_data *data = c2n->data;
+
+ for_each_possible_cpu(cpu) {
+ if (cpu != this_cpu) {
+ seqcount_init(&c2n->seq);
+ c2n = per_cpu_ptr(&cyc2ns, cpu);
+ c2n->data[0] = data[0];
+ c2n->data[1] = data[1];
+ }
+ }
+}
+
/*
* Scheduler clock - returns current time in nanosec units.
*/
@@ -1385,6 +1404,10 @@ void __init tsc_early_init(void)
if (!determine_cpu_tsc_frequncies())
return;
loops_per_jiffy = get_loops_per_jiffy();
+
+ /* Sanitize TSC ADJUST before cyc2ns gets initialized */
+ tsc_store_and_check_tsc_adjust(true);
+ cyc2ns_init_boot_cpu();
}

void __init tsc_init(void)
@@ -1400,23 +1423,12 @@ void __init tsc_init(void)
mark_tsc_unstable("could not calculate TSC khz");
return;
}
+ /* Sanitize TSC ADJUST before cyc2ns gets initialized */
+ tsc_store_and_check_tsc_adjust(true);
+ cyc2ns_init_boot_cpu();
}

- /* Sanitize TSC ADJUST before cyc2ns gets initialized */
- tsc_store_and_check_tsc_adjust(true);
-
- /*
- * Secondary CPUs do not run through tsc_init(), so set up
- * all the scale factors for all CPUs, assuming the same
- * speed as the bootup CPU. (cpufreq notifiers will fix this
- * up if their speed diverges)
- */
- cyc = rdtsc();
- for_each_possible_cpu(cpu) {
- cyc2ns_init(cpu);
- set_cyc2ns_scale(tsc_khz, cpu, cyc);
- }
-
+ cyc2ns_init_secondary_cpus();
static_branch_enable(&__use_tsc);

if (!no_sched_irq_time)
--
2.18.0


2018-07-12 03:07:39

by Pavel Tatashin

Subject: [PATCH v13 09/18] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset()

If the architecture does not support an exact boot time, it is challenging
to estimate the boot time without a reference to the current persistent
clock value. Yet, the persistent clock cannot be read a second time,
because this may lead to math discrepancies with the caller of
read_boot_clock64(), which may have read the persistent clock at a
different time.

This is why it is better to provide both values simultaneously: the
persistent clock value, and the boot time.

Replace read_boot_clock64() with:
read_persistent_wall_and_boot_offset(wall_time, boot_offset)

where wall_time is the value returned by read_persistent_clock64(),
and boot_offset is wall_time - boot_time, which defaults to 0.

Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/timekeeping.h | 3 +-
kernel/time/timekeeping.c | 59 +++++++++++++++++++------------------
2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 86bc2026efce..686bc27acef0 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -243,7 +243,8 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
extern int persistent_clock_is_local;

extern void read_persistent_clock64(struct timespec64 *ts);
-extern void read_boot_clock64(struct timespec64 *ts);
+void read_persistent_clock_and_boot_offset(struct timespec64 *wall_clock,
+ struct timespec64 *boot_offset);
extern int update_persistent_clock64(struct timespec64 now);

/*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4786df904c22..cb738f825c12 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -17,6 +17,7 @@
#include <linux/nmi.h>
#include <linux/sched.h>
#include <linux/sched/loadavg.h>
+#include <linux/sched/clock.h>
#include <linux/syscore_ops.h>
#include <linux/clocksource.h>
#include <linux/jiffies.h>
@@ -1496,18 +1497,20 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
}

/**
- * read_boot_clock64 - Return time of the system start.
+ * read_persistent_wall_and_boot_offset - Read persistent clock, and also offset
+ * from the boot.
*
* Weak dummy function for arches that do not yet support it.
- * Function to read the exact time the system has been started.
- * Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
- *
- * XXX - Do be sure to remove it once all arches implement it.
+ * wall_time - current time as returned by persistent clock
+ * boot_offset - offset that is defined as wall_time - boot_time
+ * default to 0.
*/
-void __weak read_boot_clock64(struct timespec64 *ts)
+void __weak __init
+read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+ struct timespec64 *boot_offset)
{
- ts->tv_sec = 0;
- ts->tv_nsec = 0;
+ read_persistent_clock64(wall_time);
+ *boot_offset = (struct timespec64){0};
}

/* Flag for if timekeeping_resume() has injected sleeptime */
@@ -1521,28 +1524,29 @@ static bool persistent_clock_exists;
*/
void __init timekeeping_init(void)
{
+ struct timespec64 wall_time, boot_offset, wall_to_mono;
struct timekeeper *tk = &tk_core.timekeeper;
struct clocksource *clock;
unsigned long flags;
- struct timespec64 now, boot, tmp;
-
- read_persistent_clock64(&now);
- if (!timespec64_valid_strict(&now)) {
- pr_warn("WARNING: Persistent clock returned invalid value!\n"
- " Check your CMOS/BIOS settings.\n");
- now.tv_sec = 0;
- now.tv_nsec = 0;
- } else if (now.tv_sec || now.tv_nsec)
- persistent_clock_exists = true;

- read_boot_clock64(&boot);
- if (!timespec64_valid_strict(&boot)) {
- pr_warn("WARNING: Boot clock returned invalid value!\n"
- " Check your CMOS/BIOS settings.\n");
- boot.tv_sec = 0;
- boot.tv_nsec = 0;
+ read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
+ if (timespec64_valid_strict(&wall_time) &&
+ timespec64_to_ns(&wall_time) > 0) {
+ persistent_clock_exists = true;
+ } else {
+ pr_warn("Persistent clock returned invalid value");
+ wall_time = (struct timespec64){0};
}

+ if (timespec64_compare(&wall_time, &boot_offset) < 0)
+ boot_offset = (struct timespec64){0};
+
+ /*
+ * We want to set wall_to_mono, so the following is true:
+ * wall time + wall_to_mono = boot time
+ */
+ wall_to_mono = timespec64_sub(boot_offset, wall_time);
+
raw_spin_lock_irqsave(&timekeeper_lock, flags);
write_seqcount_begin(&tk_core.seq);
ntp_init();
@@ -1552,13 +1556,10 @@ void __init timekeeping_init(void)
clock->enable(clock);
tk_setup_internals(tk, clock);

- tk_set_xtime(tk, &now);
+ tk_set_xtime(tk, &wall_time);
tk->raw_sec = 0;
- if (boot.tv_sec == 0 && boot.tv_nsec == 0)
- boot = tk_xtime(tk);

- set_normalized_timespec64(&tmp, -boot.tv_sec, -boot.tv_nsec);
- tk_set_wall_to_mono(tk, tmp);
+ tk_set_wall_to_mono(tk, wall_to_mono);

timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);

--
2.18.0


2018-07-12 03:07:56

by Pavel Tatashin

Subject: [PATCH v13 08/18] s390/time: add read_persistent_wall_and_boot_offset()

read_persistent_wall_and_boot_offset() will replace read_boot_clock64(),
because on some architectures it is more convenient to read both sources
together, as one may depend on the other. For s390, the implementation is
the same as read_boot_clock64(), but it additionally calls
read_persistent_clock64() and returns its value as well.

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Martin Schwidefsky <[email protected]>
---
arch/s390/kernel/time.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index cf561160ea88..d1f5447d5687 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -221,6 +221,24 @@ void read_persistent_clock64(struct timespec64 *ts)
ext_to_timespec64(clk, ts);
}

+void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+ struct timespec64 *boot_offset)
+{
+ unsigned char clk[STORE_CLOCK_EXT_SIZE];
+ struct timespec64 boot_time;
+ __u64 delta;
+
+ delta = initial_leap_seconds + TOD_UNIX_EPOCH;
+ memcpy(clk, tod_clock_base, STORE_CLOCK_EXT_SIZE);
+ *(__u64 *)&clk[1] -= delta;
+ if (*(__u64 *)&clk[1] > delta)
+ clk[0]--;
+ ext_to_timespec64(clk, &boot_time);
+
+ read_persistent_clock64(wall_time);
+ *boot_offset = timespec64_sub(*wall_time, boot_time);
+}
+
void read_boot_clock64(struct timespec64 *ts)
{
unsigned char clk[STORE_CLOCK_EXT_SIZE];
--
2.18.0


2018-07-12 03:08:02

by Pavel Tatashin

Subject: [PATCH v13 02/18] x86: initialize static branching early

Static branching is useful for hot-patching branches that are used in a
hot path but are infrequently changed.

The x86 clock framework is one example: it uses static branches to set up
the best clock during boot and never changes it again.

Since we plan to enable the clock early, we need static branching
functionality early as well.

Static branching requires patching NOP instructions, so
arch_init_ideal_nops() must be called prior to jump_label_init().

Do all the necessary steps to call arch_init_ideal_nops()
after early_cpu_init().

Signed-off-by: Pavel Tatashin <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 13 +++++++-----
arch/x86/kernel/cpu/common.c | 38 +++++++++++++++++++-----------------
arch/x86/kernel/setup.c | 4 ++--
3 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 38915fbfae73..b732438c1a1e 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -232,8 +232,6 @@ static void init_amd_k7(struct cpuinfo_x86 *c)
}
}

- set_cpu_cap(c, X86_FEATURE_K7);
-
/* calling is from identify_secondary_cpu() ? */
if (!c->cpu_index)
return;
@@ -617,6 +615,14 @@ static void early_init_amd(struct cpuinfo_x86 *c)

early_init_amd_mc(c);

+#ifdef CONFIG_X86_32
+ if (c->x86 == 6)
+ set_cpu_cap(c, X86_FEATURE_K7);
+#endif
+
+ if (c->x86 >= 0xf)
+ set_cpu_cap(c, X86_FEATURE_K8);
+
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);

/*
@@ -863,9 +869,6 @@ static void init_amd(struct cpuinfo_x86 *c)

init_amd_cacheinfo(c);

- if (c->x86 >= 0xf)
- set_cpu_cap(c, X86_FEATURE_K8);
-
if (cpu_has(c, X86_FEATURE_XMM2)) {
unsigned long long val;
int ret;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index eb4cb3efd20e..71281ac43b15 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1015,6 +1015,24 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
}

+/*
+ * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
+ * unfortunately, that's not true in practice because of early VIA
+ * chips and (more importantly) broken virtualizers that are not easy
+ * to detect. In the latter case it doesn't even *fail* reliably, so
+ * probing for it doesn't even work. Disable it completely on 32-bit
+ * unless we can find a reliable way to detect all the broken cases.
+ * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
+ */
+static void detect_nopl(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_32
+ clear_cpu_cap(c, X86_FEATURE_NOPL);
+#else
+ set_cpu_cap(c, X86_FEATURE_NOPL);
+#endif
+}
+
/*
* Do minimum CPU detection early.
* Fields really needed: vendor, cpuid_level, family, model, mask,
@@ -1089,6 +1107,8 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
*/
if (!pgtable_l5_enabled())
setup_clear_cpu_cap(X86_FEATURE_LA57);
+
+ detect_nopl(c);
}

void __init early_cpu_init(void)
@@ -1124,24 +1144,6 @@ void __init early_cpu_init(void)
early_identify_cpu(&boot_cpu_data);
}

-/*
- * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
- * unfortunately, that's not true in practice because of early VIA
- * chips and (more importantly) broken virtualizers that are not easy
- * to detect. In the latter case it doesn't even *fail* reliably, so
- * probing for it doesn't even work. Disable it completely on 32-bit
- * unless we can find a reliable way to detect all the broken cases.
- * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
- */
-static void detect_nopl(struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_X86_32
- clear_cpu_cap(c, X86_FEATURE_NOPL);
-#else
- set_cpu_cap(c, X86_FEATURE_NOPL);
-#endif
-}
-
static void detect_null_seg_behavior(struct cpuinfo_x86 *c)
{
#ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..403b2d2c31d2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -866,6 +866,8 @@ void __init setup_arch(char **cmdline_p)

idt_setup_early_traps();
early_cpu_init();
+ arch_init_ideal_nops();
+ jump_label_init();
early_ioremap_init();

setup_olpc_ofw_pgd();
@@ -1272,8 +1274,6 @@ void __init setup_arch(char **cmdline_p)

mcheck_init();

- arch_init_ideal_nops();
-
register_refined_jiffies(CLOCK_TICK_RATE);

#ifdef CONFIG_EFI
--
2.18.0


2018-07-12 03:08:06

by Pavel Tatashin

Subject: [PATCH v13 01/18] x86: text_poke() may access uninitialized struct pages

It is supposed to be safe to modify static branches after jump_label_init().
But, because static key modifying code eventually calls text_poke(), we
may end up accessing struct pages that have not been initialized.

Here is how to quickly reproduce the problem. Insert code like this
into init/main.c:

| +static DEFINE_STATIC_KEY_FALSE(__test);
| asmlinkage __visible void __init start_kernel(void)
| {
| char *command_line;
|@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
| vfs_caches_init_early();
| sort_main_extable();
| trap_init();
|+ {
|+ static_branch_enable(&__test);
|+ WARN_ON(!static_branch_likely(&__test));
|+ }
| mm_init();

The following warning shows up:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc1_pt_t1 #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.11.0-20171110_100015-anatol 04/01/2014
RIP: 0010:text_poke+0x20d/0x230
Code: 0f 0b 4c 89 e2 4c 89 ee 4c 89 f7 e8 7d 4b 9b 00 31 d2 31 f6 bf 86 02
00 00 48 8b 05 95 8e 24 01 e8 78 18 d8 00 e9 55 ff ff ff <0f> 0b e9 54 fe
ff ff 48 8b 05 75 a8 38 01 e9 64 fe ff ff 48 8b 1d
RSP: 0000:ffffffff94e03e30 EFLAGS: 00010046
RAX: 0100000000000000 RBX: fffff7b2c011f300 RCX: ffffffff94fcccf4
RDX: 0000000000000001 RSI: ffffffff94e03e77 RDI: ffffffff94fcccef
RBP: ffffffff94fcccef R08: 00000000fffffe00 R09: 00000000000000a0
R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000001
R13: ffffffff94e03e77 R14: ffffffff94fcdcef R15: fffff7b2c0000000
FS: 0000000000000000(0000) GS:ffff9adc87c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff9adc8499d000 CR3: 000000000460a001 CR4: 00000000000606b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? start_kernel+0x23e/0x4c8
? start_kernel+0x23f/0x4c8
? text_poke_bp+0x50/0xda
? arch_jump_label_transform+0x89/0xe0
? __jump_label_update+0x78/0xb0
? static_key_enable_cpuslocked+0x4d/0x80
? static_key_enable+0x11/0x20
? start_kernel+0x23e/0x4c8
? secondary_startup_64+0xa5/0xb0
---[ end trace abdc99c031b8a90a ]---

If the code above is moved after mm_init(), no warning is shown, as struct
pages are initialized during the handover from memblock.

Use text_poke_early() in static branching until early boot IRQs are
enabled, at which time switch to text_poke(). Also, ensure text_poke() is
never invoked when uninitialized memory accesses may happen, by using the
BUG_ON(!after_bootmem) assertion.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 7 +++++++
arch/x86/kernel/jump_label.c | 11 +++++++----
3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 2ecd34e2d46c..e85ff65c43c3 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -37,5 +37,6 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern int after_bootmem;

#endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index a481763a3776..014f214da581 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -668,6 +668,7 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
local_irq_save(flags);
memcpy(addr, opcode, len);
local_irq_restore(flags);
+ sync_core();
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */
return addr;
@@ -693,6 +694,12 @@ void *text_poke(void *addr, const void *opcode, size_t len)
struct page *pages[2];
int i;

+ /*
+ * While boot memory allocator is running we cannot use struct
+ * pages as they are not yet initialized.
+ */
+ BUG_ON(!after_bootmem);
+
if (!core_kernel_text((unsigned long)addr)) {
pages[0] = vmalloc_to_page(addr);
pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e56c95be2808..eeea935e9bb5 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -37,15 +37,18 @@ static void bug_at(unsigned char *ip, int line)
BUG();
}

-static void __jump_label_transform(struct jump_entry *entry,
- enum jump_label_type type,
- void *(*poker)(void *, const void *, size_t),
- int init)
+static void __ref __jump_label_transform(struct jump_entry *entry,
+ enum jump_label_type type,
+ void *(*poker)(void *, const void *, size_t),
+ int init)
{
union jump_code_union code;
const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];

+ if (early_boot_irqs_disabled)
+ poker = text_poke_early;
+
if (type == JUMP_LABEL_JMP) {
if (init) {
/*
--
2.18.0


2018-07-12 03:08:11

by Pavel Tatashin

Subject: [PATCH v13 17/18] sched: early boot clock

Allow sched_clock() to be used before sched_clock_init() is called.
This provides a way to get early boot timestamps on machines with
unstable clocks.

Signed-off-by: Pavel Tatashin <[email protected]>
---
init/main.c | 2 +-
kernel/sched/clock.c | 10 +++++++++-
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index 162d931c9511..ff0a24170b95 100644
--- a/init/main.c
+++ b/init/main.c
@@ -642,7 +642,6 @@ asmlinkage __visible void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
- sched_clock_init();
printk_safe_init();
perf_event_init();
profile_init();
@@ -697,6 +696,7 @@ asmlinkage __visible void __init start_kernel(void)
acpi_early_init();
if (late_time_init)
late_time_init();
+ sched_clock_init();
calibrate_delay();
pid_idr_init();
anon_vma_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 3b02be1fdfda..4d07b785d566 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -201,7 +201,15 @@ static void __sched_clock_gtod_offset(void)

void __init sched_clock_init(void)
{
+ unsigned long flags;
+
sched_clock_running = 1;
+
+ /* Adjust __gtod_offset for contiguous transition from early clock */
+ local_irq_save(flags);
+ sched_clock_tick();
+ local_irq_restore(flags);
+ __sched_clock_gtod_offset();
}
/*
* We run this as late_initcall() such that it runs after all built-in drivers,
@@ -355,7 +363,7 @@ u64 sched_clock_cpu(int cpu)
return sched_clock() + __sched_clock_offset;

if (unlikely(!sched_clock_running))
- return 0ull;
+ return sched_clock();

preempt_disable_notrace();
scd = cpu_sdc(cpu);
--
2.18.0


2018-07-12 03:08:14

by Pavel Tatashin

Subject: [PATCH v13 10/18] time: default boot time offset to local_clock()

read_persistent_wall_and_boot_offset() is called during boot to read
both the persistent clock and also return the offset between the boot time
and the value of persistent clock.

Change the default boot_offset from zero to local_clock() so architectures,
that do not have a dedicated boot_clock but have early sched_clock(), such
as SPARCv9, x86, and possibly more will benefit from this change by getting
a better and more consistent estimate of the boot time without need for an
arch specific implementation.

Signed-off-by: Pavel Tatashin <[email protected]>
---
kernel/time/timekeeping.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index cb738f825c12..30d7f64ffc87 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1503,14 +1503,17 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
* Weak dummy function for arches that do not yet support it.
* wall_time - current time as returned by persistent clock
* boot_offset - offset that is defined as wall_time - boot_time
- * default to 0.
+ * The default function calculates offset based on the current value of
+ * local_clock(). This way architectures that support sched_clock() but don't
+ * support dedicated boot time clock will provide the best estimate of the
+ * boot time.
*/
void __weak __init
read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
struct timespec64 *boot_offset)
{
read_persistent_clock64(wall_time);
- *boot_offset = (struct timespec64){0};
+ *boot_offset = ns_to_timespec64(local_clock());
}

/* Flag for if timekeeping_resume() has injected sleeptime */
--
2.18.0


2018-07-12 03:08:15

by Pavel Tatashin

Subject: [PATCH v13 15/18] x86/tsc: use tsc early

Make timestamps and a high-resolution clock available to us as early as
possible.

native_sched_clock() outputs time based either on the TSC, after tsc_init()
is called later in boot, or on jiffies, once clock interrupts are enabled,
which also happens later in boot.

On the other hand, the TSC frequency is known as early as when
tsc_early_init() is called.

Use the early TSC calibration to output timestamps early.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/tsc.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 0b1abe7fdd8e..39ff2881f622 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1408,6 +1408,7 @@ void __init tsc_early_init(void)
/* Sanitize TSC ADJUST before cyc2ns gets initialized */
tsc_store_and_check_tsc_adjust(true);
cyc2ns_init_boot_cpu();
+ static_branch_enable(&__use_tsc);
}

void __init tsc_init(void)
--
2.18.0


2018-07-12 03:08:17

by Pavel Tatashin

Subject: [PATCH v13 18/18] sched: use static key for sched_clock_running

sched_clock_running may be read every time sched_clock_cpu() is called.
Yet, this variable is updated only twice during boot, and never changes
again; therefore, it is better to make it a static key.

Signed-off-by: Pavel Tatashin <[email protected]>
---
kernel/sched/clock.c | 16 ++++++++--------
kernel/sched/debug.c | 2 --
2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 4d07b785d566..07e54073d6af 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -66,7 +66,7 @@ unsigned long long __weak sched_clock(void)
}
EXPORT_SYMBOL_GPL(sched_clock);

-__read_mostly int sched_clock_running;
+static DEFINE_STATIC_KEY_FALSE(sched_clock_running);

#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
/*
@@ -190,7 +190,7 @@ void clear_sched_clock_stable(void)

smp_mb(); /* matches sched_clock_init_late() */

- if (sched_clock_running == 2)
+ if (static_key_count(&sched_clock_running.key) == 2)
__clear_sched_clock_stable();
}

@@ -203,7 +203,7 @@ void __init sched_clock_init(void)
{
unsigned long flags;

- sched_clock_running = 1;
+ static_branch_inc(&sched_clock_running);

/* Adjust __gtod_offset for contiguous transition from early clock */
local_irq_save(flags);
@@ -217,7 +217,7 @@ void __init sched_clock_init(void)
*/
static int __init sched_clock_init_late(void)
{
- sched_clock_running = 2;
+ static_branch_inc(&sched_clock_running);
/*
* Ensure that it is impossible to not do a static_key update.
*
@@ -362,7 +362,7 @@ u64 sched_clock_cpu(int cpu)
if (sched_clock_stable())
return sched_clock() + __sched_clock_offset;

- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return sched_clock();

preempt_disable_notrace();
@@ -385,7 +385,7 @@ void sched_clock_tick(void)
if (sched_clock_stable())
return;

- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return;

lockdep_assert_irqs_disabled();
@@ -444,13 +444,13 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);

void __init sched_clock_init(void)
{
- sched_clock_running = 1;
+ static_branch_inc(&sched_clock_running);
generic_sched_clock_init();
}

u64 sched_clock_cpu(int cpu)
{
- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return 0;

return sched_clock();
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e593b4118578..b0212f489a33 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -623,8 +623,6 @@ void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
#undef PU
}

-extern __read_mostly int sched_clock_running;
-
static void print_cpu(struct seq_file *m, int cpu)
{
struct rq *rq = cpu_rq(cpu);
--
2.18.0


2018-07-12 03:08:50

by Pavel Tatashin

Subject: [PATCH v13 05/18] kvm/x86: remove kvm memblock dependency

KVM clock is initialized later than the clocks of other hypervisors,
because it has a dependency on the memblock allocator.

Let's bring it in line with the other hypervisors by removing this
dependency and using memory from BSS instead of allocating it.

The benefits:
- removes an ifdef from common code
- earlier availability of the TSC
- removes the dependency on memblock, and reduces code
- earlier KVM sched_clock()

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 2 +-
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/kvmclock.c | 66 +++++----------------------------
arch/x86/kernel/setup.c | 7 +---
4 files changed, 14 insertions(+), 62 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 3aea2658323a..62872a862914 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,7 +6,7 @@
#include <asm/alternative.h>
#include <uapi/asm/kvm_para.h>

-extern void kvmclock_init(void);
+extern void kvm_platform_init(void);
extern int kvm_register_clock(char *txt);

#ifdef CONFIG_KVM_GUEST
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5b2300b818af..874c53dd8ba2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -628,6 +628,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.name = "KVM",
.detect = kvm_detect,
.type = X86_HYPER_KVM,
+ .init.init_platform = kvm_platform_init,
.init.guest_late_init = kvm_guest_init,
.init.x2apic_available = kvm_para_available,
};
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index bf8d1eb7fca3..d2af1b1f060e 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,9 +23,9 @@
#include <asm/apic.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
-#include <linux/memblock.h>
#include <linux/sched.h>
#include <linux/sched/clock.h>
+#include <linux/mm.h>

#include <asm/mem_encrypt.h>
#include <asm/x86_init.h>
@@ -44,6 +44,11 @@ static int parse_no_kvmclock(char *arg)
}
early_param("no-kvmclock", parse_no_kvmclock);

+/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
+#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define WALL_CLOCK_SIZE (sizeof(struct pvclock_wall_clock))
+static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
/* The hypervisor will put information about time periodically here */
static struct pvclock_vsyscall_time_info *hv_clock;
static struct pvclock_wall_clock *wall_clock;
@@ -244,43 +249,12 @@ static void kvm_shutdown(void)
native_machine_shutdown();
}

-static phys_addr_t __init kvm_memblock_alloc(phys_addr_t size,
- phys_addr_t align)
-{
- phys_addr_t mem;
-
- mem = memblock_alloc(size, align);
- if (!mem)
- return 0;
-
- if (sev_active()) {
- if (early_set_memory_decrypted((unsigned long)__va(mem), size))
- goto e_free;
- }
-
- return mem;
-e_free:
- memblock_free(mem, size);
- return 0;
-}
-
-static void __init kvm_memblock_free(phys_addr_t addr, phys_addr_t size)
-{
- if (sev_active())
- early_set_memory_encrypted((unsigned long)__va(addr), size);
-
- memblock_free(addr, size);
-}
-
-void __init kvmclock_init(void)
+void __init kvm_platform_init(void)
{
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned long mem, mem_wall_clock;
- int size, cpu, wall_clock_size;
+ int cpu;
u8 flags;

- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
if (!kvm_para_available())
return;

@@ -290,28 +264,11 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;

- wall_clock_size = PAGE_ALIGN(sizeof(struct pvclock_wall_clock));
- mem_wall_clock = kvm_memblock_alloc(wall_clock_size, PAGE_SIZE);
- if (!mem_wall_clock)
- return;
-
- wall_clock = __va(mem_wall_clock);
- memset(wall_clock, 0, wall_clock_size);
-
- mem = kvm_memblock_alloc(size, PAGE_SIZE);
- if (!mem) {
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
- wall_clock = NULL;
- return;
- }
-
- hv_clock = __va(mem);
- memset(hv_clock, 0, size);
+ wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
+ hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;

if (kvm_register_clock("primary cpu clock")) {
hv_clock = NULL;
- kvm_memblock_free(mem, size);
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
wall_clock = NULL;
return;
}
@@ -354,13 +311,10 @@ int __init kvm_setup_vsyscall_timeinfo(void)
int cpu;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned int size;

if (!hv_clock)
return 0;

- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
cpu = get_cpu();

vcpu_time = &hv_clock[cpu].pvti;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 403b2d2c31d2..01fcc8bf7c8f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1014,6 +1014,8 @@ void __init setup_arch(char **cmdline_p)
*/
init_hypervisor_platform();

+ tsc_early_delay_calibrate();
+
x86_init.resources.probe_roms();

/* after parse_early_param, so could debug it */
@@ -1199,11 +1201,6 @@ void __init setup_arch(char **cmdline_p)

memblock_find_dma_reserve();

-#ifdef CONFIG_KVM_GUEST
- kvmclock_init();
-#endif
-
- tsc_early_delay_calibrate();
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();

--
2.18.0


2018-07-12 23:52:50

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v13 16/18] sched: move sched clock initialization and merge with generic clock

Hi Pavel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc4 next-20180712]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Pavel-Tatashin/Early-boot-time-stamps/20180712-200238
config: microblaze-mmu_defconfig (attached as .config)
compiler: microblaze-linux-gcc (GCC) 8.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=microblaze

All errors (new ones prefixed by >>):

kernel/sched/clock.c: In function 'sched_clock_init':
>> kernel/sched/clock.c:440:2: error: implicit declaration of function 'generic_sched_clock_init'; did you mean 'sched_clock_init'? [-Werror=implicit-function-declaration]
generic_sched_clock_init();
^~~~~~~~~~~~~~~~~~~~~~~~
sched_clock_init
cc1: some warnings being treated as errors

vim +440 kernel/sched/clock.c

436
437 void __init sched_clock_init(void)
438 {
439 sched_clock_running = 1;
> 440 generic_sched_clock_init();
441 }
442

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
.config.gz (12.20 kB)

2018-07-13 01:59:19

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v13 06/18] x86/xen/time: initialize pv xen time in init_hypervisor_platform

> -void __ref xen_init_time_ops(void)
> +void __init xen_init_time_ops(void)
> {
> pv_time_ops = xen_time_ops;
>
> @@ -542,17 +542,11 @@ void __init xen_hvm_init_time_ops(void)
> return;
>
> if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
> - printk(KERN_INFO "Xen doesn't support pvclock on HVM,"
> - "disable pv timer\n");
> + pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
> return;
> }
> -
> - pv_time_ops = xen_time_ops;
> + xen_init_time_ops();
> x86_init.timers.setup_percpu_clockev = xen_time_init;
> x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;

Boris reported a bug on HVM that causes a panic in
x86_late_time_init(). It is introduced here: xen_init_time_ops() sets
x86_init.timers.timer_init = xen_time_init, which used to be
hpet_time_init() on HVM. However, we might not even need hpet here.
Thus, adding x86_init.timers.timer_init = x86_init_noop; at the end of
xen_hvm_init_time_ops() should be sufficient.

Thank you,
Pavel

2018-07-13 07:23:43

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH v13 13/18] x86/tsc: calibrate tsc only once


At 07/12/2018 08:04 AM, Pavel Tatashin wrote:
> During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(),
> and the second time in tsc_init().
>
> Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
> the calibration is done only early, and make tsc_init() to use the values
> already determined in tsc_early_init().
>
> Sometimes it is not possible to determine tsc early, as the subsystem that
> is required is not yet initialized, in such case try again later in
> tsc_init().
>
> Suggested-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Pavel Tatashin <[email protected]>

Hi Pavel,

Aha, a complex solution for a simple problem! ;-) And I didn't find any
benefits of doing that. Did I miss something?

Since cpu_khz and tsc_khz are global variables, and tsc_khz may be
reset to cpu_khz, how about the following patch?

Thanks,
dou
------------------------8<-----------------------------------

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 74392d9d51e0..e54fa1037d45 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1370,8 +1370,10 @@ void __init tsc_init(void)
return;
}

- cpu_khz = x86_platform.calibrate_cpu();
- tsc_khz = x86_platform.calibrate_tsc();
+ if (!tsc_khz) {
+ cpu_khz = x86_platform.calibrate_cpu();
+ tsc_khz = x86_platform.calibrate_tsc();
+ }

/*
* Trust non-zero tsc_khz as authorative,

> ---
> arch/x86/include/asm/tsc.h | 2 +-
> arch/x86/kernel/setup.c | 2 +-
> arch/x86/kernel/tsc.c | 86 ++++++++++++++++++++------------------
> 3 files changed, 48 insertions(+), 42 deletions(-)
>
> diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
> index 2701d221583a..c4368ff73652 100644
> --- a/arch/x86/include/asm/tsc.h
> +++ b/arch/x86/include/asm/tsc.h
> @@ -33,7 +33,7 @@ static inline cycles_t get_cycles(void)
> extern struct system_counterval_t convert_art_to_tsc(u64 art);
> extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);
>
> -extern void tsc_early_delay_calibrate(void);
> +extern void tsc_early_init(void);
> extern void tsc_init(void);
> extern void mark_tsc_unstable(char *reason);
> extern int unsynchronized_tsc(void);
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 01fcc8bf7c8f..07445482bb57 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1014,7 +1014,7 @@ void __init setup_arch(char **cmdline_p)
> */
> init_hypervisor_platform();
>
> - tsc_early_delay_calibrate();
> + tsc_early_init();
>
> x86_init.resources.probe_roms();
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 186395041725..bc8eb82050a3 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -33,6 +33,8 @@ EXPORT_SYMBOL(cpu_khz);
> unsigned int __read_mostly tsc_khz;
> EXPORT_SYMBOL(tsc_khz);
>
> +#define KHZ 1000
> +
> /*
> * TSC can be unstable due to cpufreq or due to unsynced TSCs
> */
> @@ -1335,34 +1337,10 @@ static int __init init_tsc_clocksource(void)
> */
> device_initcall(init_tsc_clocksource);
>
> -void __init tsc_early_delay_calibrate(void)
> -{
> - unsigned long lpj;
> -
> - if (!boot_cpu_has(X86_FEATURE_TSC))
> - return;
> -
> - cpu_khz = x86_platform.calibrate_cpu();
> - tsc_khz = x86_platform.calibrate_tsc();
> -
> - tsc_khz = tsc_khz ? : cpu_khz;
> - if (!tsc_khz)
> - return;
> -
> - lpj = tsc_khz * 1000;
> - do_div(lpj, HZ);
> - loops_per_jiffy = lpj;
> -}
> -
> -void __init tsc_init(void)
> +static bool determine_cpu_tsc_frequncies(void)
> {
> - u64 lpj, cyc;
> - int cpu;
> -
> - if (!boot_cpu_has(X86_FEATURE_TSC)) {
> - setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
> - return;
> - }
> + /* Make sure that cpu and tsc are not already calibrated */
> + WARN_ON(cpu_khz || tsc_khz);
>
> cpu_khz = x86_platform.calibrate_cpu();
> tsc_khz = x86_platform.calibrate_tsc();
> @@ -1377,20 +1355,51 @@ void __init tsc_init(void)
> else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
> cpu_khz = tsc_khz;
>
> - if (!tsc_khz) {
> - mark_tsc_unstable("could not calculate TSC khz");
> - setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
> - return;
> - }
> + if (tsc_khz == 0)
> + return false;
>
> pr_info("Detected %lu.%03lu MHz processor\n",
> - (unsigned long)cpu_khz / 1000,
> - (unsigned long)cpu_khz % 1000);
> + (unsigned long)cpu_khz / KHZ,
> + (unsigned long)cpu_khz % KHZ);
>
> if (cpu_khz != tsc_khz) {
> pr_info("Detected %lu.%03lu MHz TSC",
> - (unsigned long)tsc_khz / 1000,
> - (unsigned long)tsc_khz % 1000);
> + (unsigned long)tsc_khz / KHZ,
> + (unsigned long)tsc_khz % KHZ);
> + }
> + return true;
> +}
> +
> +static unsigned long get_loops_per_jiffy(void)
> +{
> + unsigned long lpj = tsc_khz * KHZ;
> +
> + do_div(lpj, HZ);
> + return lpj;
> +}
> +
> +void __init tsc_early_init(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_TSC))
> + return;
> + if (!determine_cpu_tsc_frequncies())
> + return;
> + loops_per_jiffy = get_loops_per_jiffy();
> +}
> +
> +void __init tsc_init(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_TSC)) {
> + setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
> + return;
> + }
> +
> + if (!tsc_khz) {
> + /* We failed to determine frequencies earlier, try again */
> + if (!determine_cpu_tsc_frequncies()) {
> + mark_tsc_unstable("could not calculate TSC khz");
> + return;
> + }
> }
>
> /* Sanitize TSC ADJUST before cyc2ns gets initialized */
> @@ -1413,10 +1422,7 @@ void __init tsc_init(void)
> if (!no_sched_irq_time)
> enable_sched_clock_irqtime();
>
> - lpj = ((u64)tsc_khz * 1000);
> - do_div(lpj, HZ);
> - lpj_fine = lpj;
> -
> + lpj_fine = get_loops_per_jiffy();
> use_tsc_delay();
>
> check_system_tsc_reliable();
>



2018-07-13 09:13:16

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH v13 14/18] x86/tsc: initialize cyc2ns when tsc freq. is determined



At 07/12/2018 08:04 AM, Pavel Tatashin wrote:
> cyc2ns converts tsc to nanoseconds, and it is handled in a per-cpu data
> structure.
>
> Currently, the setup code for c2ns data for every possible CPU goes through
> the same sequence of calculations as for the boot CPU, but is based on the
> same tsc frequency as the boot CPU, and thus this is not necessary.
>
> Initialize the boot cpu when tsc frequency is determined. Copy the
> calculated data from the boot CPU to the other CPUs in tsc_init().
>
> In addition do the following:
>
> - Remove unnecessary zeroing of c2ns data by removing cyc2ns_data_init()
> - Split set_cyc2ns_scale() into two functions, so set_cyc2ns_scale() can be
> called when system is up, and wraps around __set_cyc2ns_scale() that can
> be called directly when system is booting but avoids saving restoring
> IRQs and going and waking up from idle.
>
> Suggested-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> arch/x86/kernel/tsc.c | 94 ++++++++++++++++++++++++-------------------
> 1 file changed, 53 insertions(+), 41 deletions(-)
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index bc8eb82050a3..0b1abe7fdd8e 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -103,23 +103,6 @@ void cyc2ns_read_end(void)
> * [email protected] "math is hard, lets go shopping!"
> */
>
> -static void cyc2ns_data_init(struct cyc2ns_data *data)
> -{
> - data->cyc2ns_mul = 0;
> - data->cyc2ns_shift = 0;
> - data->cyc2ns_offset = 0;
> -}
> -
> -static void __init cyc2ns_init(int cpu)
> -{
> - struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
> -
> - cyc2ns_data_init(&c2n->data[0]);
> - cyc2ns_data_init(&c2n->data[1]);
> -
> - seqcount_init(&c2n->seq);
> -}
> -
> static inline unsigned long long cycles_2_ns(unsigned long long cyc)
> {
> struct cyc2ns_data data;
> @@ -135,18 +118,11 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
> return ns;
> }
>
> -static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
> +static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
> {
> unsigned long long ns_now;
> struct cyc2ns_data data;
> struct cyc2ns *c2n;
> - unsigned long flags;
> -
> - local_irq_save(flags);
> - sched_clock_idle_sleep_event();
> -
> - if (!khz)
> - goto done;
>
> ns_now = cycles_2_ns(tsc_now);
>
> @@ -178,12 +154,55 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
> c2n->data[0] = data;
> raw_write_seqcount_latch(&c2n->seq);
> c2n->data[1] = data;
> +}
> +
> +static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + sched_clock_idle_sleep_event();
> +
> + if (khz)
> + __set_cyc2ns_scale(khz, cpu, tsc_now);
>
> -done:
> sched_clock_idle_wakeup_event();
> local_irq_restore(flags);
> }
>
> +/*
> + * Initialize cyc2ns for boot cpu
> + */
> +static void __init cyc2ns_init_boot_cpu(void)
> +{
> + struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
> +
> + seqcount_init(&c2n->seq);
> + __set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
> +}
Hi Pavel,

Here we didn't protect it with local_irq_save()/local_irq_restore()

> +
> +/*
> + * Secondary CPUs do not run through cyc2ns_init(), so set up
> + * all the scale factors for all CPUs, assuming the same
> + * speed as the bootup CPU. (cpufreq notifiers will fix this
> + * up if their speed diverges)
> + */
> +static void __init cyc2ns_init_secondary_cpus(void)
> +{
> + unsigned int cpu, this_cpu = smp_processor_id();
> + struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
> + struct cyc2ns_data *data = c2n->data;
> +
> + for_each_possible_cpu(cpu) {
> + if (cpu != this_cpu) {
> + seqcount_init(&c2n->seq);
> + c2n = per_cpu_ptr(&cyc2ns, cpu);
> + c2n->data[0] = data[0];
> + c2n->data[1] = data[1];
> + }
> + }
> +}
> +
> /*
> * Scheduler clock - returns current time in nanosec units.
> */
> @@ -1385,6 +1404,10 @@ void __init tsc_early_init(void)
> if (!determine_cpu_tsc_frequncies())
> return;
> loops_per_jiffy = get_loops_per_jiffy();
> +
> + /* Sanitize TSC ADJUST before cyc2ns gets initialized */
> + tsc_store_and_check_tsc_adjust(true);
> + cyc2ns_init_boot_cpu();
> }
>
> void __init tsc_init(void)
> @@ -1400,23 +1423,12 @@ void __init tsc_init(void)
> mark_tsc_unstable("could not calculate TSC khz");
> return;
> }
> + /* Sanitize TSC ADJUST before cyc2ns gets initialized */
> + tsc_store_and_check_tsc_adjust(true);
> + cyc2ns_init_boot_cpu();

In tsc_init(), the timer interrupt is already working. Is that OK?

IMO, there is no need to change so much; just use the original code,
e.g.:

+static void __init cyc2ns_init_bsp(void)
+{
+ cyc2ns_init(0);
+ set_cyc2ns_scale(tsc_khz, 0, rdtsc());
+}
+
+static void __init cyc2ns_init_cpus(void)
+{
+ struct cyc2ns *c2n0 = &per_cpu(cyc2ns, 0);
+ struct cyc2ns *c2n;
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ if (cpu != 0) {
+ cyc2ns_init(cpu);
+ c2n = &per_cpu(cyc2ns, cpu);
+ c2n->data[0] = c2n0->data[0];
+ c2n->data[1] = c2n0->data[1];
+ }
+ }
+}
+


Thanks,
dou

> }
>
> - /* Sanitize TSC ADJUST before cyc2ns gets initialized */
> - tsc_store_and_check_tsc_adjust(true);
> -
> - /*
> - * Secondary CPUs do not run through tsc_init(), so set up
> - * all the scale factors for all CPUs, assuming the same
> - * speed as the bootup CPU. (cpufreq notifiers will fix this
> - * up if their speed diverges)
> - */
> - cyc = rdtsc();
> - for_each_possible_cpu(cpu) {
> - cyc2ns_init(cpu);
> - set_cyc2ns_scale(tsc_khz, cpu, cyc);
> - }
> -
> + cyc2ns_init_secondary_cpus();
> static_branch_enable(&__use_tsc);
>
> if (!no_sched_irq_time)
>



2018-07-13 11:32:09

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v13 13/18] x86/tsc: calibrate tsc only once

On Fri, Jul 13, 2018 at 3:24 AM Dou Liyang <[email protected]> wrote:
>
>
> At 07/12/2018 08:04 AM, Pavel Tatashin wrote:
> > During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(),
> > and the second time in tsc_init().
> >
> > Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
> > the calibration is done only early, and make tsc_init() to use the values
> > already determined in tsc_early_init().
> >
> > Sometimes it is not possible to determine tsc early, as the subsystem that
> > is required is not yet initialized, in such case try again later in
> > tsc_init().
> >
> > Suggested-by: Thomas Gleixner <[email protected]>
> > Signed-off-by: Pavel Tatashin <[email protected]>
>
> Hi Pavel,
>
> Aha, a complex solution for a simple problem! ;-) And I didn't find any
> benefits of doing that. Did I miss something?

Hi Dou,

I had this in a previous version: init early, and unconditionally
re-init later (which required resyncing sched clocks for continuity,
and checking for some other corner cases). Thomas did not like the
idea, as it is less deterministic: it leads to code that works by
accident, where we might get one tsc frequency early and another
later, and so on. The request was to initialize only once, and if that
fails do it again later. This way, if early initialization is broken,
we will know and fix it.

>
> Since cpu_khz and tsc_khz are global variables, and tsc_khz may be
> reset to cpu_khz, how about the following patch?

Could you please explain where you think this patch can be applied,
and what it fixes?

Thank you,
Pavel

>
> Thanks,
> dou
> ------------------------8<-----------------------------------
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 74392d9d51e0..e54fa1037d45 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1370,8 +1370,10 @@ void __init tsc_init(void)
> return;
> }
>
> - cpu_khz = x86_platform.calibrate_cpu();
> - tsc_khz = x86_platform.calibrate_tsc();
> + if (!tsc_khz) {
> + cpu_khz = x86_platform.calibrate_cpu();
> + tsc_khz = x86_platform.calibrate_tsc();
> + }
>
> /*
> * Trust non-zero tsc_khz as authorative,

2018-07-13 11:40:37

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v13 14/18] x86/tsc: initialize cyc2ns when tsc freq. is determined

On Fri, Jul 13, 2018 at 5:13 AM Dou Liyang <[email protected]> wrote:
>
>
>
> At 07/12/2018 08:04 AM, Pavel Tatashin wrote:
> > cyc2ns converts tsc to nanoseconds, and it is handled in a per-cpu data
> > structure.
> >
> > Currently, the setup code for c2ns data for every possible CPU goes through
> > the same sequence of calculations as for the boot CPU, but is based on the
> > same tsc frequency as the boot CPU, and thus this is not necessary.
> >
> > Initialize the boot cpu when tsc frequency is determined. Copy the
> > calculated data from the boot CPU to the other CPUs in tsc_init().
> >
> > In addition do the following:
> >
> > - Remove unnecessary zeroing of c2ns data by removing cyc2ns_data_init()
> > - Split set_cyc2ns_scale() into two functions, so set_cyc2ns_scale() can be
> > called when system is up, and wraps around __set_cyc2ns_scale() that can
> > be called directly when system is booting but avoids saving restoring
> > IRQs and going and waking up from idle.
> >
> > Suggested-by: Thomas Gleixner <[email protected]>
> > Signed-off-by: Pavel Tatashin <[email protected]>
> > ---
> > arch/x86/kernel/tsc.c | 94 ++++++++++++++++++++++++-------------------
> > 1 file changed, 53 insertions(+), 41 deletions(-)
> >
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index bc8eb82050a3..0b1abe7fdd8e 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -103,23 +103,6 @@ void cyc2ns_read_end(void)
> > * [email protected] "math is hard, lets go shopping!"
> > */
> >
> > -static void cyc2ns_data_init(struct cyc2ns_data *data)
> > -{
> > - data->cyc2ns_mul = 0;
> > - data->cyc2ns_shift = 0;
> > - data->cyc2ns_offset = 0;
> > -}
> > -
> > -static void __init cyc2ns_init(int cpu)
> > -{
> > - struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
> > -
> > - cyc2ns_data_init(&c2n->data[0]);
> > - cyc2ns_data_init(&c2n->data[1]);
> > -
> > - seqcount_init(&c2n->seq);
> > -}
> > -
> > static inline unsigned long long cycles_2_ns(unsigned long long cyc)
> > {
> > struct cyc2ns_data data;
> > @@ -135,18 +118,11 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
> > return ns;
> > }
> >
> > -static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
> > +static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
> > {
> > unsigned long long ns_now;
> > struct cyc2ns_data data;
> > struct cyc2ns *c2n;
> > - unsigned long flags;
> > -
> > - local_irq_save(flags);
> > - sched_clock_idle_sleep_event();
> > -
> > - if (!khz)
> > - goto done;
> >
> > ns_now = cycles_2_ns(tsc_now);
> >
> > @@ -178,12 +154,55 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
> > c2n->data[0] = data;
> > raw_write_seqcount_latch(&c2n->seq);
> > c2n->data[1] = data;
> > +}
> > +
> > +static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
> > +{
> > + unsigned long flags;
> > +
> > + local_irq_save(flags);
> > + sched_clock_idle_sleep_event();
> > +
> > + if (khz)
> > + __set_cyc2ns_scale(khz, cpu, tsc_now);
> >
> > -done:
> > sched_clock_idle_wakeup_event();
> > local_irq_restore(flags);
> > }
> >
> > +/*
> > + * Initialize cyc2ns for boot cpu
> > + */
> > +static void __init cyc2ns_init_boot_cpu(void)
> > +{
> > + struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
> > +
> > + seqcount_init(&c2n->seq);
> > + __set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
> > +}
> Hi Paval,
>
> Here we didn't protect it with local_irq_save()/local_irq_restore()

Hi Dou,

Thank you for reviewing this work.

Correct, but there is no need to do that: we do local_irq_save()/restore()
only in set_cyc2ns_scale(), because that is called when the system is
live and the frequency changes.

cyc2ns_init_boot_cpu(), and cyc2ns_init_secondary_cpus(), are called
during boot, before smp_init(), and frequency does not change during
that time. So, nothing to protect us from.

> In tsc_init(), the timer interrupt is already working. Is that OK?
>
> IMO, no need to change so much, just use the original code,
> eg:

The request and suggestions on how to cleanup cyc2ns came from Thomas.
Overall, code is cleaner now.

Thank you,
Pavel

2018-07-16 09:33:16

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH v13 13/18] x86/tsc: calibrate tsc only once



At 07/13/2018 07:30 PM, Pavel Tatashin wrote:
> On Fri, Jul 13, 2018 at 3:24 AM Dou Liyang <[email protected]> wrote:
>>
>>
>> At 07/12/2018 08:04 AM, Pavel Tatashin wrote:
>>> During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(),
>>> and the second time in tsc_init().
>>>
>>> Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
>>> the calibration is done only early, and make tsc_init() to use the values
>>> already determined in tsc_early_init().
>>>
>>> Sometimes it is not possible to determine tsc early, as the subsystem that
>>> is required is not yet initialized, in such case try again later in
>>> tsc_init().
>>>
>>> Suggested-by: Thomas Gleixner <[email protected]>
>>> Signed-off-by: Pavel Tatashin <[email protected]>
>>
>> Hi Pavel,
>>
>> Aha, a complex solution for a simple problem! ;-) And I didn't find any
>> benefits of doing that. Did I miss something?
>
> Hi Dou,
>
> I had this in a previous version: init early, and unconditionally
> re-init later (which required resyncing sched clocks for continuity,
> and checking for some other corner cases). Thomas did not like the
> idea, as it is less deterministic: it leads to code that works by
> accident, where we might get one tsc frequency early and another
> later, and so on. The request was to initialize only once, and if
> that fails do it again later. This way, if early initialization is
> broken, we will know and fix it.
>

Hi Pavel,

Yes, right, I have seen the purpose in v12.

>>
>> Since cpu_khz and tsc_khz are global variables, and tsc_khz may be
>> reset to cpu_khz, how about the following patch?
>
> Could you please explain where you think this patch can be applied,
> and what it fixes?
>

This patch is just a simple alternative to achieve what you want in
your patch. Your patch is also good, but complex, and needs some
scrubbing, e.g.:
  - Is it suitable to use WARN_ON() there?
  - The name of the determine_cpu_tsc_frequncies() function:
s/frequncies/frequencies/. BTW, how about tsc_calibrate_frequencies()?
  - ...

BTW, before this patch, it seems we need to make sure Xen works well. I
can investigate and try to test whether we can also move pagetable_init()
in front of tsc_early_init() for you.

Thanks,
dou

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index da1dbd99cb6e..74cb16d89e25 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1197,12 +1197,12 @@ void __init setup_arch(char **cmdline_p)

memblock_find_dma_reserve();

+ x86_init.paging.pagetable_init();
+
tsc_early_delay_calibrate();
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();

- x86_init.paging.pagetable_init();
-
kasan_init();

/*

> Thank you,
> Pavel
>
>>
>> Thanks,
>> dou
>> ------------------------8<-----------------------------------
>>
>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>> index 74392d9d51e0..e54fa1037d45 100644
>> --- a/arch/x86/kernel/tsc.c
>> +++ b/arch/x86/kernel/tsc.c
>> @@ -1370,8 +1370,10 @@ void __init tsc_init(void)
>> return;
>> }
>>
>> - cpu_khz = x86_platform.calibrate_cpu();
>> - tsc_khz = x86_platform.calibrate_tsc();
>> + if (!tsc_khz) {
>> + cpu_khz = x86_platform.calibrate_cpu();
>> + tsc_khz = x86_platform.calibrate_tsc();
>> + }
>>
>> /*
>> * Trust non-zero tsc_khz as authorative,
>
>
>



2018-07-16 13:37:22

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v13 13/18] x86/tsc: calibrate tsc only once

On Mon, Jul 16, 2018 at 5:33 AM Dou Liyang <[email protected]> wrote:
>
>
>
> At 07/13/2018 07:30 PM, Pavel Tatashin wrote:
> > On Fri, Jul 13, 2018 at 3:24 AM Dou Liyang <[email protected]> wrote:
> >>
> >>
> >> At 07/12/2018 08:04 AM, Pavel Tatashin wrote:
> >>> During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(),
> >>> and the second time in tsc_init().
> >>>
> >>> Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
> >>> the calibration is done only early, and make tsc_init() to use the values
> >>> already determined in tsc_early_init().
> >>>
> >>> Sometimes it is not possible to determine tsc early, as the subsystem that
> >>> is required is not yet initialized, in such case try again later in
> >>> tsc_init().
> >>>
> >>> Suggested-by: Thomas Gleixner <[email protected]>
> >>> Signed-off-by: Pavel Tatashin <[email protected]>
> >>
> >> Hi Pavel,
> >>
> >> Aha, a complex solution for a simple problem! ;-) And I didn't find any
> >> benefits of doing that. Did I miss something?
> >
> > Hi Dou,
> >
> > I had this in a previous version: init early, and unconditionally
> > re-init later (which required resyncing sched clocks for continuity,
> > and checking for some other corner cases). Thomas did not like the
> > idea, as it is less deterministic: it leads to code that works by
> > accident, where we might get one tsc frequency early and another
> > later, and so on. The request was to initialize only once, and if
> > that fails do it again later. This way, if early initialization is
> > broken, we will know and fix it.
> >
>
> Hi Pavel,
>
> Yes, right, I have seen the purpose in v12.
>
> >>
> >> Since cpu_khz and tsc_khz are global variables, and tsc_khz may be
> >> reset to cpu_khz, how about the following patch?
> >
> > Could you please explain where you think this patch can be applied,
> > and what it fixes?
> >
>
> This patch is just a simple alternative to achieve what you want in
> your patch. Your patch is also good, but complex, and needs some
> scrubbing, e.g.:
> - Is it suitable to use WARN_ON() there?
> - The name of the determine_cpu_tsc_frequncies() function:
> s/frequncies/frequencies/. BTW, how about tsc_calibrate_frequencies()?
> - ...
>
> BTW, before this patch, it seems we need to make sure Xen works well. I
> can investigate and try to test whether we can also move pagetable_init()
> in front of tsc_early_init() for you.

v13 has xen patches, so xen timestamps work early as well now.

Thank you,
Pavel

2018-07-17 09:01:34

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH v13 13/18] x86/tsc: calibrate tsc only once



At 07/16/2018 09:35 PM, Pavel Tatashin wrote:
[...]
> v13 has xen patches, so xen timestamps work early as well now.
>

Oops, yes, my mistake. I will test the patchset with Thomas's kvm patch
for you.

Thanks,
dou

> Thank you,
> Pavel
>
>
>



2018-07-17 14:38:09

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v13 13/18] x86/tsc: calibrate tsc only once

>
> Oops, yes, my mistake. I will test the patchset with Thomas's kvm patch
> for you.

Thank you Dou. Please wait for updated series before you test it. I
will include patches from Thomas in the updated series.

Thank you,
Pavel

2018-07-17 15:30:45

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH v13 06/18] x86/xen/time: initialize pv xen time in init_hypervisor_platform

On 07/11/2018 08:04 PM, Pavel Tatashin wrote:
> In every hypervisor except for xen pv time ops are initialized in
> init_hypervisor_platform().
>
> Xen PV domains initialize time ops in x86_init.paging.pagetable_init(),
> by calling xen_setup_shared_info() which is a poor design, as time is
> needed prior to memory allocator.
>
> xen_setup_shared_info() is called from two places: during boot, and
> after suspend. Split the content of xen_setup_shared_info() into
> three places:
>
> 1. Add the clock-relevant data into the new xen pv init_platform vector,
> and set clock ops there.
>
> 2. Move xen_setup_vcpu_info_placement() to the new xen_pv_guest_late_init()
> call.
>
> 3. Move re-initialization of parts of the shared info copy to
> xen_pv_post_suspend() to be symmetric with xen_pv_pre_suspend().
>
> Also, remove code duplication by calling xen_init_time_ops() from
> xen_hvm_init_time_ops().
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> arch/x86/xen/enlighten_pv.c | 51 +++++++++++++++++--------------------
> arch/x86/xen/mmu_pv.c | 6 ++---
> arch/x86/xen/suspend_pv.c | 5 ++--
> arch/x86/xen/time.c | 14 +++-------
> arch/x86/xen/xen-ops.h | 6 ++---
> 5 files changed, 35 insertions(+), 47 deletions(-)
>
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 8d4e2e1ae60b..615ad0f16848 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -119,6 +119,27 @@ static void __init xen_banner(void)
> version >> 16, version & 0xffff, extra.extraversion,
> xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
> }
> +
> +static void __init xen_pv_init_platform(void)
> +{
> + set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
> + HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
> +
> + /* xen clock uses per-cpu vcpu_info, need to init it for boot cpu */
> + xen_vcpu_info_reset(0);


I don't believe this is necessary, it has been done in
xen_start_kernel() for PV guests.


> +
> + /* pvclock is in shared info area */
> + xen_init_time_ops();
> +}
> +



> }
>
> -void __ref xen_init_time_ops(void)
> +void __init xen_init_time_ops(void)
> {
> pv_time_ops = xen_time_ops;
>
> @@ -542,17 +542,11 @@ void __init xen_hvm_init_time_ops(void)
> return;
>
> if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
> - printk(KERN_INFO "Xen doesn't support pvclock on HVM,"
> - "disable pv timer\n");
> + pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
> return;
> }
> -
> - pv_time_ops = xen_time_ops;
> + xen_init_time_ops();


As we discussed elsewhere, now that HVM guests call this routine as well
we need to make sure that x86_init.timers.timer_init is not updated
there for HVM since those guests expect a "real" timer to be connected
to IO-APIC.

-boris

> x86_init.timers.setup_percpu_clockev = xen_time_init;
> x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
> -
> - x86_platform.calibrate_tsc = xen_tsc_khz;
> - x86_platform.get_wallclock = xen_get_wallclock;
> - x86_platform.set_wallclock = xen_set_wallclock;
> }
> #endif
>


2018-07-17 15:45:49

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH v13 07/18] x86/xen/time: output xen sched_clock time from 0

On 07/11/2018 08:04 PM, Pavel Tatashin wrote:
> It is expected for sched_clock() to output data from 0 when the system
> boots. Add an offset, xen_sched_clock_offset (similar to how it is done
> in other hypervisors, e.g. kvm_sched_clock_offset), to count
> sched_clock() from 0 when time is first initialized.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> arch/x86/xen/time.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 2ad61bf896d6..3c6f3d603373 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -31,6 +31,8 @@
> /* Xen may fire a timer up to this many ns early */
> #define TIMER_SLOP 100000
>
> +static u64 xen_sched_clock_offset __read_mostly;
> +
> /* Get the TSC speed from Xen */
> static unsigned long xen_tsc_khz(void)
> {
> @@ -57,6 +59,11 @@ static u64 xen_clocksource_get_cycles(struct clocksource *cs)
> return xen_clocksource_read();
> }
>
> +static u64 xen_sched_clock(void)
> +{
> + return xen_clocksource_read() - xen_sched_clock_offset;
> +}


Should other invocations of xen_clocksource_read() also be offset?

-boris


> +
> static void xen_read_wallclock(struct timespec64 *ts)
> {
> struct shared_info *s = HYPERVISOR_shared_info;
> @@ -367,7 +374,7 @@ void xen_timer_resume(void)
> }
>
> static const struct pv_time_ops xen_time_ops __initconst = {
> - .sched_clock = xen_clocksource_read,
> + .sched_clock = xen_sched_clock,
> .steal_clock = xen_steal_clock,
> };
>
> @@ -505,6 +512,7 @@ static void __init xen_time_init(void)
>
> void __init xen_init_time_ops(void)
> {
> + xen_sched_clock_offset = xen_clocksource_read();
> pv_time_ops = xen_time_ops;
>
> x86_init.timers.timer_init = xen_time_init;


2018-07-18 01:40:12

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v13 06/18] x86/xen/time: initialize pv xen time in init_hypervisor_platform

> > + set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
> > + HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
> > +
> > + /* xen clock uses per-cpu vcpu_info, need to init it for boot cpu */
> > + xen_vcpu_info_reset(0);
>
>
> I don't believe this is necessary, it has been done in
> xen_start_kernel() for PV guests.

We need it, because HYPERVISOR_shared_info has changed from the dummy
page. And, to output timestamps, we must have access to the actual shared
page.

> > +
> > + /* pvclock is in shared info area */
> > + xen_init_time_ops();
> > +}
> > +
>
>
>
> > }
> >
> > -void __ref xen_init_time_ops(void)
> > +void __init xen_init_time_ops(void)
> > {
> > pv_time_ops = xen_time_ops;
> >
> > @@ -542,17 +542,11 @@ void __init xen_hvm_init_time_ops(void)
> > return;
> >
> > if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
> > - printk(KERN_INFO "Xen doesn't support pvclock on HVM,"
> > - "disable pv timer\n");
> > + pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
> > return;
> > }
> > -
> > - pv_time_ops = xen_time_ops;
> > + xen_init_time_ops();
>
>
> As we discussed elsewhere, now that HVM guests call this routine as well
> we need to make sure that x86_init.timers.timer_init is not updated
> there for HVM since those guests expect a "real" timer to be connected
> to IO-APIC.

Yes, I decided to keep xen_init_time_ops() and xen_hvm_init_time_ops()
separate. The unification does not save that many LOC, but increases
the complexity.

Thank you,
Pavel

2018-07-18 02:00:56

by Pavel Tatashin

[permalink] [raw]
Subject: Re: [PATCH v13 07/18] x86/xen/time: output xen sched_clock time from 0

> > +static u64 xen_sched_clock(void)
> > +{
> > + return xen_clocksource_read() - xen_sched_clock_offset;
> > +}
>
>
> Should other invocations of xen_clocksource_read() also be offset?
>

I do not believe so. Look in arch/x86/kernel/kvmclock.c:
kvm_clock_get_cycles() is not normalized, for instance. We need to
normalize kvm_sched_clock_read() and xen_sched_clock() because they are
used by printk() as the sched_clock() prefix. So as not to be confusing,
we print time starting from 0; the other places where
xen_clocksource_read() is used do not need that.

Thank you,
Pavel