2018-07-19 20:58:43

by Pavel Tatashin

Subject: [PATCH v15 00/26] Early boot time stamps

changelog
---------
v15 - v14
Repo: https://github.com/soleen/time_15.git
- dropped "x86/kvmclock: Avoid TSC recalibration" as Paolo Bonzini
suggested
- Fixed in "sched: early boot clock" when sched_clock_running is set,
and moved __sched_clock_gtod_offset inside IRQ as Peter noticed.
- Addressed comments from Dou Liyang: added missing __inits, and
X86_FEATURE_TSC_DEADLINE_TIMER, spelling.
- Fixed xen_sched_clock_offset on xen hvm (noticed by Boris
Ostrovsky).
- Added two patches to address Peter Zijlstra's request to split
native cpu calibration into early and late parts. The patches
are:
x86/tsc: split native_calibrate_cpu() into early and late parts
x86/tsc: use tsc_calibrate_cpu_early and
pit_hpet_ptimer_calibrate_cpu

v14 - v13
- Included Thomas' KVM clock series, addressed comments from
reviewers.
http://lkml.kernel.org/r/[email protected]
- Fixed xen hvm panic reported by Boris
- Fixed build issue on microblaze

v13 - v12
- Addressed comments from Thomas Gleixner.
- Addressed comments from Peter Zijlstra.
- Added a patch from Borislav Petkov
- Added a new patch: sched: use static key for sched_clock_running
- Added xen pv fixes, so clock is initialized when other
hypervisors initialize their clocks.
Note: I am including kvm/x86: remove kvm memblock dependency, which
is part of this series:
http://lkml.kernel.org/r/[email protected]
Because without this patch it is not possible to test this series on
KVM.

v12 - v11
- split time: replace read_boot_clock64() with
read_persistent_wall_and_boot_offset() into four patches
- Added two patches one fixes an existing bug with text_poke()
another one enables static branches early. Note, because I found
and fixed the text_poke() bug, enabling static branching became
super easy, as no changes to jump_label* is needed.
- Modified x86/tsc: use tsc early to use static branches early, and
thus native_sched_clock() is not changed at all.
v11 - v10
- Addressed all the comments from Thomas Gleixner.
- I added one more patch:
"x86/tsc: prepare for early sched_clock" which fixes a problem
that I discovered while testing. I am not particularly happy with
the fix, as it adds a new argument that is used only in one
place, but if you have a suggestion for a different approach on
how to address this problem please let me know.

v10 - v9
- Added another patch to this series that removes dependency
between KVM clock, and memblock allocator. The benefit is that
all clocks can now be initialized even earlier.
v9 - v8
- Addressed more comments from Dou Liyang

v8 - v7
- Addressed comments from Dou Liyang:
- Moved tsc_early_init() and tsc_early_fini() to be all inside
tsc.c, and changed them to be static.
- Removed warning when notsc parameter is used.
- Merged with:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

v7 - v6
- Removed tsc_disabled flag, now notsc is equivalent of
tsc=unstable
- Simplified changes to sched/clock.c, by removing the
sched_clock_early() and friends as requested by Peter Zijlstra.
We now always use sched_clock()
- Modified x86 sched_clock() to return either early boot time or
regular.
- Added another example why early boot time is important

v5 - v6
- Added a new patch:
time: sync read_boot_clock64() with persistent clock
Which fixes missing __init macro, and enabled time discrepancy
fix that was noted by Thomas Gleixner
- Split "x86/time: read_boot_clock64() implementation" into a
separate patch

v4 - v5
- Fix compiler warnings on systems with stable clocks.

v3 - v4
- Fixed tsc_early_fini() call to be in the 2nd patch as reported
by Dou Liyang
- Improved comment before __use_sched_clock_early to explain why
we need both booleans.
- Simplified valid_clock logic in read_boot_clock64().

v2 - v3
- Addressed comment from Thomas Gleixner
- Timestamps are available a little later in boot but still much
earlier than in mainline. This significantly simplified this
work.

v1 - v2
In patch "x86/tsc: tsc early":
- added tsc_adjusted_early()
- fixed 32-bit compile error use do_div()

The early boot time stamps were discussed recently in these threads:
http://lkml.kernel.org/r/[email protected]
http://lkml.kernel.org/r/[email protected]

I updated my series to the latest mainline and am sending it again.

Peter mentioned he did not like patches 6 and 7, and we can discuss a better
way to do that, but I think patches 1-5 can be accepted separately, since
they already enable early timestamps on platforms where sched_clock() is
available early, such as KVM.

Adding early boot time stamps support for x86 machines.
SPARC patches for early boot time stamps are already integrated into
mainline linux.

Sample output
-------------
Before:
https://paste.ubuntu.com/26133428/

After:
https://paste.ubuntu.com/26133523/

For examples of how early time stamps are used, see this work:
Example 1:
https://lwn.net/Articles/734374/
- Without early boot time stamps we would not know about the extra time
that is spent zeroing struct pages early in boot even when deferred
page initialization is enabled.

Example 2:
https://patchwork.kernel.org/patch/10021247/
- If early boot timestamps were available, the engineer who introduced
this bug would have noticed the extra time that is spent early in boot.
Pavel Tatashin (7):
x86/tsc: remove tsc_disabled flag
time: sync read_boot_clock64() with persistent clock
x86/time: read_boot_clock64() implementation
sched: early boot clock
kvm/x86: remove kvm memblock dependency
x86/paravirt: add active_sched_clock to pv_time_ops
x86/tsc: use tsc early

Example 3:
http://lkml.kernel.org/r/[email protected]
- Needed early time stamps to show improvement
Borislav Petkov (1):
x86/CPU: Call detect_nopl() only on the BSP

Pavel Tatashin (19):
x86/kvmclock: Remove memblock dependency
x86: text_poke() may access uninitialized struct pages
x86: initialize static branching early
x86/tsc: redefine notsc to behave as tsc=unstable
x86/xen/time: initialize pv xen time in init_hypervisor_platform
x86/xen/time: output xen sched_clock time from 0
s390/time: add read_persistent_wall_and_boot_offset()
time: replace read_boot_clock64() with
read_persistent_wall_and_boot_offset()
time: default boot time offset to local_clock()
s390/time: remove read_boot_clock64()
ARM/time: remove read_boot_clock64()
x86/tsc: calibrate tsc only once
x86/tsc: initialize cyc2ns when tsc freq. is determined
x86/tsc: use tsc early
sched: move sched clock initialization and merge with generic clock
sched: early boot clock
sched: use static key for sched_clock_running
x86/tsc: split native_calibrate_cpu() into early and late parts
x86/tsc: use tsc_calibrate_cpu_early and
pit_hpet_ptimer_calibrate_cpu

Thomas Gleixner (6):
x86/kvmclock: Remove page size requirement from wall_clock
x86/kvmclock: Decrapify kvm_register_clock()
x86/kvmclock: Cleanup the code
x86/kvmclock: Mark variables __initdata and __ro_after_init
x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock
x86/kvmclock: Switch kvmclock data to a PER_CPU variable

.../admin-guide/kernel-parameters.txt | 2 -
Documentation/x86/x86_64/boot-options.txt | 4 +-
arch/arm/include/asm/mach/time.h | 3 +-
arch/arm/kernel/time.c | 15 +-
arch/arm/plat-omap/counter_32k.c | 2 +-
arch/s390/kernel/time.c | 15 +-
arch/x86/include/asm/kvm_guest.h | 7 -
arch/x86/include/asm/kvm_para.h | 1 -
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/include/asm/tsc.h | 4 +-
arch/x86/kernel/alternative.c | 7 +
arch/x86/kernel/cpu/amd.c | 13 +-
arch/x86/kernel/cpu/common.c | 40 +--
arch/x86/kernel/jump_label.c | 11 +-
arch/x86/kernel/kvm.c | 14 +-
arch/x86/kernel/kvmclock.c | 256 +++++++-----------
arch/x86/kernel/setup.c | 10 +-
arch/x86/kernel/tsc.c | 253 +++++++++--------
arch/x86/kernel/x86_init.c | 2 +-
arch/x86/xen/enlighten_pv.c | 51 ++--
arch/x86/xen/mmu_pv.c | 6 +-
arch/x86/xen/suspend_pv.c | 5 +-
arch/x86/xen/time.c | 18 +-
arch/x86/xen/xen-ops.h | 6 +-
drivers/clocksource/tegra20_timer.c | 2 +-
include/linux/sched_clock.h | 5 +-
include/linux/timekeeping.h | 3 +-
init/main.c | 4 +-
kernel/sched/clock.c | 59 ++--
kernel/sched/core.c | 1 -
kernel/sched/debug.c | 2 -
kernel/time/sched_clock.c | 2 +-
kernel/time/timekeeping.c | 62 +++--
33 files changed, 439 insertions(+), 447 deletions(-)
delete mode 100644 arch/x86/include/asm/kvm_guest.h

--
2.18.0



2018-07-19 20:58:09

by Pavel Tatashin

Subject: [PATCH v15 07/26] x86/kvmclock: Switch kvmclock data to a PER_CPU variable

From: Thomas Gleixner <[email protected]>

The previous removal of the memblock dependency from kvmclock introduced a
static data array sized 64bytes * CONFIG_NR_CPUS. That's wasteful on large
systems when kvmclock is not used.

Replace it with:

- A static page sized array of pvclock data. It's page sized because the
pvclock data of the boot cpu is mapped into the vDSO, so otherwise random
other data would be exposed to the vDSO.

- A PER_CPU variable of pvclock data pointers. This is used to access the
pvclock data storage on each CPU.

The setup is done in two stages:

- Early boot stores the pointer to the static page for the boot CPU in
the per cpu data.

- In the preparatory stage of CPU hotplug assign either an element of
the static array (when the CPU number is in that range) or allocate
memory and initialize the per cpu pointer.
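The two-stage setup described above can be sketched in plain userspace C.
This is only a model under stated assumptions, not the kernel code: every
model_* name is hypothetical, the per-CPU variable is replaced by an
ordinary array indexed by CPU number, and calloc() stands in for kzalloc().
Because the model does not replicate CPU0's pointer into the other slots,
it only needs a NULL check where the patch also compares against CPU0's
pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel constants and types. */
#define MODEL_PAGE_SIZE 4096
#define MODEL_NR_CPUS   128

struct model_time_info {
	unsigned char pad[64];	/* one 64-byte slot per CPU, as in the text */
};

#define MODEL_BOOT_ARRAY_SIZE \
	(MODEL_PAGE_SIZE / sizeof(struct model_time_info))

/* One statically allocated page shared by the first CPUs. */
static struct model_time_info model_boot[MODEL_BOOT_ARRAY_SIZE];

/* Models the per-CPU pointer; a plain array indexed by CPU number. */
static struct model_time_info *model_per_cpu[MODEL_NR_CPUS];

/* Stage 1: early boot points the boot CPU at the start of the page. */
static void model_init_boot_cpu(void)
{
	model_per_cpu[0] = &model_boot[0];
}

/* Stage 2: hotplug "prepare" step. Returns 0 on success, -1 on failure. */
static int model_setup_percpu(unsigned int cpu)
{
	struct model_time_info *p;

	assert(cpu < MODEL_NR_CPUS);
	p = model_per_cpu[cpu];

	/* CPU 0 was handled in stage 1; skip CPUs that already own storage. */
	if (cpu == 0 || p != NULL)
		return 0;

	/* Use the static page for the first CPUs, allocate otherwise. */
	if (cpu < MODEL_BOOT_ARRAY_SIZE)
		p = &model_boot[cpu];
	else
		p = calloc(1, sizeof(*p));

	model_per_cpu[cpu] = p;
	return p ? 0 : -1;
}
```

With a 4096-byte page and 64-byte slots the static page covers CPUs 0-63 in
this model; only CPUs beyond that pay for a separate allocation.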

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/kernel/kvmclock.c | 99 ++++++++++++++++++++++++--------------
1 file changed, 62 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 7d690d2238f8..91b94c0ae4e3 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,6 +23,7 @@
#include <asm/apic.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
+#include <linux/cpuhotplug.h>
#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/mm.h>
@@ -55,12 +56,23 @@ early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);

/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define HVC_BOOT_ARRAY_SIZE \
+ (PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))

-static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
-
-/* The hypervisor will put information about time periodically here */
-static struct pvclock_vsyscall_time_info *hv_clock __ro_after_init;
+static struct pvclock_vsyscall_time_info
+ hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE);
static struct pvclock_wall_clock wall_clock;
+static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
+
+static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
+{
+ return &this_cpu_read(hv_clock_per_cpu)->pvti;
+}
+
+static inline struct pvclock_vsyscall_time_info *this_cpu_hvclock(void)
+{
+ return this_cpu_read(hv_clock_per_cpu);
+}

/*
* The wallclock is the time of day when we booted. Since then, some time may
@@ -69,17 +81,10 @@ static struct pvclock_wall_clock wall_clock;
*/
static void kvm_get_wallclock(struct timespec64 *now)
{
- struct pvclock_vcpu_time_info *vcpu_time;
- int cpu;
-
wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));
-
- cpu = get_cpu();
-
- vcpu_time = &hv_clock[cpu].pvti;
- pvclock_read_wallclock(&wall_clock, vcpu_time, now);
-
- put_cpu();
+ preempt_disable();
+ pvclock_read_wallclock(&wall_clock, this_cpu_pvti(), now);
+ preempt_enable();
}

static int kvm_set_wallclock(const struct timespec64 *now)
@@ -89,14 +94,10 @@ static int kvm_set_wallclock(const struct timespec64 *now)

static u64 kvm_clock_read(void)
{
- struct pvclock_vcpu_time_info *src;
u64 ret;
- int cpu;

preempt_disable_notrace();
- cpu = smp_processor_id();
- src = &hv_clock[cpu].pvti;
- ret = pvclock_clocksource_read(src);
+ ret = pvclock_clocksource_read(this_cpu_pvti());
preempt_enable_notrace();
return ret;
}
@@ -141,7 +142,7 @@ static inline void kvm_sched_clock_init(bool stable)
static unsigned long kvm_get_tsc_khz(void)
{
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
- return pvclock_tsc_khz(&hv_clock[0].pvti);
+ return pvclock_tsc_khz(this_cpu_pvti());
}

static void kvm_get_preset_lpj(void)
@@ -158,15 +159,14 @@ static void kvm_get_preset_lpj(void)

bool kvm_check_and_clear_guest_paused(void)
{
- struct pvclock_vcpu_time_info *src;
+ struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
bool ret = false;

- if (!hv_clock)
+ if (!src)
return ret;

- src = &hv_clock[smp_processor_id()].pvti;
- if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
- src->flags &= ~PVCLOCK_GUEST_STOPPED;
+ if ((src->pvti.flags & PVCLOCK_GUEST_STOPPED) != 0) {
+ src->pvti.flags &= ~PVCLOCK_GUEST_STOPPED;
pvclock_touch_watchdogs();
ret = true;
}
@@ -184,17 +184,15 @@ EXPORT_SYMBOL_GPL(kvm_clock);

static void kvm_register_clock(char *txt)
{
- struct pvclock_vcpu_time_info *src;
- int cpu = smp_processor_id();
+ struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
u64 pa;

- if (!hv_clock)
+ if (!src)
return;

- src = &hv_clock[cpu].pvti;
- pa = slow_virt_to_phys(src) | 0x01ULL;
+ pa = slow_virt_to_phys(&src->pvti) | 0x01ULL;
wrmsrl(msr_kvm_system_time, pa);
- pr_info("kvm-clock: cpu %d, msr %llx, %s", cpu, pa, txt);
+ pr_info("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
}

static void kvm_save_sched_clock_state(void)
@@ -242,12 +240,12 @@ static int __init kvm_setup_vsyscall_timeinfo(void)
#ifdef CONFIG_X86_64
u8 flags;

- if (!hv_clock || !kvmclock_vsyscall)
+ if (!per_cpu(hv_clock_per_cpu, 0) || !kvmclock_vsyscall)
return 0;

- flags = pvclock_read_flags(&hv_clock[0].pvti);
+ flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
if (!(flags & PVCLOCK_TSC_STABLE_BIT))
- return 1;
+ return 0;

kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
#endif
@@ -255,6 +253,28 @@ static int __init kvm_setup_vsyscall_timeinfo(void)
}
early_initcall(kvm_setup_vsyscall_timeinfo);

+static int kvmclock_setup_percpu(unsigned int cpu)
+{
+ struct pvclock_vsyscall_time_info *p = per_cpu(hv_clock_per_cpu, cpu);
+
+ /*
+ * The per cpu area setup replicates CPU0 data to all cpu
+ * pointers. So carefully check. CPU0 has been set up in init
+ * already.
+ */
+ if (!cpu || (p && p != per_cpu(hv_clock_per_cpu, 0)))
+ return 0;
+
+ /* Use the static page for the first CPUs, allocate otherwise */
+ if (cpu < HVC_BOOT_ARRAY_SIZE)
+ p = &hv_clock_boot[cpu];
+ else
+ p = kzalloc(sizeof(*p), GFP_KERNEL);
+
+ per_cpu(hv_clock_per_cpu, cpu) = p;
+ return p ? 0 : -ENOMEM;
+}
+
void __init kvmclock_init(void)
{
u8 flags;
@@ -269,17 +289,22 @@ void __init kvmclock_init(void)
return;
}

+ if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
+ kvmclock_setup_percpu, NULL) < 0) {
+ return;
+ }
+
pr_info("kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);

- hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
+ this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
kvm_register_clock("primary cpu clock");
- pvclock_set_pvti_cpu0_va(hv_clock);
+ pvclock_set_pvti_cpu0_va(hv_clock_boot);

if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);

- flags = pvclock_read_flags(&hv_clock[0].pvti);
+ flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);

x86_platform.calibrate_tsc = kvm_get_tsc_khz;
--
2.18.0


2018-07-19 20:58:23

by Pavel Tatashin

Subject: [PATCH v15 11/26] x86/tsc: redefine notsc to behave as tsc=unstable

Currently, the notsc kernel parameter disables the use of the tsc register
by sched_clock(). However, this parameter does not prevent linux from
accessing tsc in other places in the kernel.

The only rationale to boot with notsc is to avoid timing discrepancies on
multi-socket systems where different tsc frequencies may be present, and
thus fall back to jiffies as the clock source.

However, there is another way to solve the above problem: boot with the
tsc=unstable parameter. This parameter allows sched_clock() to use tsc,
but if tsc is outside of the expected interval it is corrected back to a
sane value.

This is why there is no reason to keep notsc, and it can be removed. But,
for compatibility reasons we will keep this parameter but change its
definition to be the same as tsc=unstable.
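The resulting behaviour can be illustrated with a small userspace model.
It is a sketch only: the model_* names are hypothetical and the real
parsing goes through the kernel's boot parameter machinery. The point it
shows is that both spellings end up on the same "mark the TSC unstable"
path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model state: true once the TSC has been marked unstable. */
static bool model_tsc_unstable;

/* Stand-in for mark_tsc_unstable(); only records the decision. */
static void model_mark_tsc_unstable(void)
{
	model_tsc_unstable = true;
}

/* Handle one boot parameter; returns true if it was recognised. */
static bool model_parse_param(const char *param)
{
	assert(param != NULL);

	/* After this patch both spellings take the same path. */
	if (strcmp(param, "notsc") == 0 ||
	    strcmp(param, "tsc=unstable") == 0) {
		model_mark_tsc_unstable();
		return true;
	}
	return false;
}
```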

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Dou Liyang <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 2 --
Documentation/x86/x86_64/boot-options.txt | 4 +---
arch/x86/kernel/tsc.c | 18 +++---------------
3 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 533ff5c68970..5aed30cd0350 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2835,8 +2835,6 @@

nosync [HW,M68K] Disables sync negotiation for all devices.

- notsc [BUGS=X86-32] Disable Time Stamp Counter
-
nowatchdog [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 8d109ef67ab6..66114ab4f9fe 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -92,9 +92,7 @@ APICs
Timing

notsc
- Don't use the CPU time stamp counter to read the wall time.
- This can be used to work around timing problems on multiprocessor systems
- with not properly synchronized CPUs.
+ Deprecated, use tsc=unstable instead.

nohpet
Don't use the HPET timer.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 74392d9d51e0..186395041725 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -38,11 +38,6 @@ EXPORT_SYMBOL(tsc_khz);
*/
static int __read_mostly tsc_unstable;

-/* native_sched_clock() is called before tsc_init(), so
- we must start with the TSC soft disabled to prevent
- erroneous rdtsc usage on !boot_cpu_has(X86_FEATURE_TSC) processors */
-static int __read_mostly tsc_disabled = -1;
-
static DEFINE_STATIC_KEY_FALSE(__use_tsc);

int tsc_clocksource_reliable;
@@ -248,8 +243,7 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
#ifdef CONFIG_X86_TSC
int __init notsc_setup(char *str)
{
- pr_warn("Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely\n");
- tsc_disabled = 1;
+ mark_tsc_unstable("boot parameter notsc");
return 1;
}
#else
@@ -1307,7 +1301,7 @@ static void tsc_refine_calibration_work(struct work_struct *work)

static int __init init_tsc_clocksource(void)
{
- if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_disabled > 0 || !tsc_khz)
+ if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
return 0;

if (tsc_unstable)
@@ -1414,12 +1408,6 @@ void __init tsc_init(void)
set_cyc2ns_scale(tsc_khz, cpu, cyc);
}

- if (tsc_disabled > 0)
- return;
-
- /* now allow native_sched_clock() to use rdtsc */
-
- tsc_disabled = 0;
static_branch_enable(&__use_tsc);

if (!no_sched_irq_time)
@@ -1455,7 +1443,7 @@ unsigned long calibrate_delay_is_known(void)
int constant_tsc = cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC);
const struct cpumask *mask = topology_core_cpumask(cpu);

- if (tsc_disabled || !constant_tsc || !mask)
+ if (!constant_tsc || !mask)
return 0;

sibling = cpumask_any_but(mask, cpu);
--
2.18.0


2018-07-19 20:58:28

by Pavel Tatashin

Subject: [PATCH v15 12/26] x86/xen/time: initialize pv xen time in init_hypervisor_platform

In every hypervisor except for Xen PV, time ops are initialized in
init_hypervisor_platform().

Xen PV domains initialize time ops in x86_init.paging.pagetable_init()
by calling xen_setup_shared_info(), which is a poor design, as time is
needed prior to the memory allocator.

xen_setup_shared_info() is called from two places: during boot, and
after suspend. Split the content of xen_setup_shared_info() into
three places:

1. Add the clock-relevant data to the new xen pv init_platform vector, and
set clock ops in there.

2. Move xen_setup_vcpu_info_placement() to the new xen_pv_guest_late_init()
call.

3. Copy the re-initializing parts of shared info to xen_pv_post_suspend(),
to be symmetric to xen_pv_pre_suspend().

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/xen/enlighten_pv.c | 51 +++++++++++++++++--------------------
arch/x86/xen/mmu_pv.c | 6 ++---
arch/x86/xen/suspend_pv.c | 5 ++--
arch/x86/xen/time.c | 7 +++--
arch/x86/xen/xen-ops.h | 6 ++---
5 files changed, 34 insertions(+), 41 deletions(-)

diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 439a94bf89ad..105a57d73701 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -119,6 +119,27 @@ static void __init xen_banner(void)
version >> 16, version & 0xffff, extra.extraversion,
xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
}
+
+static void __init xen_pv_init_platform(void)
+{
+ set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+ HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+
+ /* xen clock uses per-cpu vcpu_info, need to init it for boot cpu */
+ xen_vcpu_info_reset(0);
+
+ /* pvclock is in shared info area */
+ xen_init_time_ops();
+}
+
+static void __init xen_pv_guest_late_init(void)
+{
+#ifndef CONFIG_SMP
+ /* Setup shared vcpu info for non-smp configurations */
+ xen_setup_vcpu_info_placement();
+#endif
+}
+
/* Check if running on Xen version (major, minor) or later */
bool
xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -947,34 +968,8 @@ static void xen_write_msr(unsigned int msr, unsigned low, unsigned high)
xen_write_msr_safe(msr, low, high);
}

-void xen_setup_shared_info(void)
-{
- set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
-
- HYPERVISOR_shared_info =
- (struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
-
- xen_setup_mfn_list_list();
-
- if (system_state == SYSTEM_BOOTING) {
-#ifndef CONFIG_SMP
- /*
- * In UP this is as good a place as any to set up shared info.
- * Limit this to boot only, at restore vcpu setup is done via
- * xen_vcpu_restore().
- */
- xen_setup_vcpu_info_placement();
-#endif
- /*
- * Now that shared info is set up we can start using routines
- * that point to pvclock area.
- */
- xen_init_time_ops();
- }
-}
-
/* This is called once we have the cpu_possible_mask */
-void __ref xen_setup_vcpu_info_placement(void)
+void __init xen_setup_vcpu_info_placement(void)
{
int cpu;

@@ -1228,6 +1223,8 @@ asmlinkage __visible void __init xen_start_kernel(void)
x86_init.irqs.intr_mode_init = x86_init_noop;
x86_init.oem.arch_setup = xen_arch_setup;
x86_init.oem.banner = xen_banner;
+ x86_init.hyper.init_platform = xen_pv_init_platform;
+ x86_init.hyper.guest_late_init = xen_pv_guest_late_init;

/*
* Set up some pagetable state before starting to set any ptes.
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2c30cabfda90..52206ad81e4b 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1230,8 +1230,7 @@ static void __init xen_pagetable_p2m_free(void)
* We roundup to the PMD, which means that if anybody at this stage is
* using the __ka address of xen_start_info or
* xen_start_info->shared_info they are in going to crash. Fortunatly
- * we have already revectored in xen_setup_kernel_pagetable and in
- * xen_setup_shared_info.
+ * we have already revectored in xen_setup_kernel_pagetable.
*/
size = roundup(size, PMD_SIZE);

@@ -1292,8 +1291,7 @@ static void __init xen_pagetable_init(void)

/* Remap memory freed due to conflicts with E820 map */
xen_remap_memory();
-
- xen_setup_shared_info();
+ xen_setup_mfn_list_list();
}
static void xen_write_cr2(unsigned long cr2)
{
diff --git a/arch/x86/xen/suspend_pv.c b/arch/x86/xen/suspend_pv.c
index a2e0f110af56..8303b58c79a9 100644
--- a/arch/x86/xen/suspend_pv.c
+++ b/arch/x86/xen/suspend_pv.c
@@ -27,8 +27,9 @@ void xen_pv_pre_suspend(void)
void xen_pv_post_suspend(int suspend_cancelled)
{
xen_build_mfn_list_list();
-
- xen_setup_shared_info();
+ set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+ HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+ xen_setup_mfn_list_list();

if (suspend_cancelled) {
xen_start_info->store_mfn =
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index e0f1bcf01d63..53bb7a8d10b5 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -40,7 +40,7 @@ static unsigned long xen_tsc_khz(void)
return pvclock_tsc_khz(info);
}

-u64 xen_clocksource_read(void)
+static u64 xen_clocksource_read(void)
{
struct pvclock_vcpu_time_info *src;
u64 ret;
@@ -503,7 +503,7 @@ static void __init xen_time_init(void)
pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
}

-void __ref xen_init_time_ops(void)
+void __init xen_init_time_ops(void)
{
pv_time_ops = xen_time_ops;

@@ -542,8 +542,7 @@ void __init xen_hvm_init_time_ops(void)
return;

if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
- printk(KERN_INFO "Xen doesn't support pvclock on HVM,"
- "disable pv timer\n");
+ pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
return;
}

diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 3b34745d0a52..e78684597f57 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -31,7 +31,6 @@ extern struct shared_info xen_dummy_shared_info;
extern struct shared_info *HYPERVISOR_shared_info;

void xen_setup_mfn_list_list(void);
-void xen_setup_shared_info(void);
void xen_build_mfn_list_list(void);
void xen_setup_machphys_mapping(void);
void xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn);
@@ -68,12 +67,11 @@ void xen_init_irq_ops(void);
void xen_setup_timer(int cpu);
void xen_setup_runstate_info(int cpu);
void xen_teardown_timer(int cpu);
-u64 xen_clocksource_read(void);
void xen_setup_cpu_clockevents(void);
void xen_save_time_memory_area(void);
void xen_restore_time_memory_area(void);
-void __ref xen_init_time_ops(void);
-void __init xen_hvm_init_time_ops(void);
+void xen_init_time_ops(void);
+void xen_hvm_init_time_ops(void);

irqreturn_t xen_debug_interrupt(int irq, void *dev_id);

--
2.18.0


2018-07-19 20:58:32

by Pavel Tatashin

Subject: [PATCH v15 14/26] s390/time: add read_persistent_wall_and_boot_offset()

read_persistent_wall_and_boot_offset() will replace read_boot_clock64()
because on some architectures it is more convenient to read both sources,
as one may depend on the other. For s390, the implementation is the same
as read_boot_clock64(), but it also calls read_persistent_clock64() and
returns its value.

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Martin Schwidefsky <[email protected]>
---
arch/s390/kernel/time.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index cf561160ea88..d1f5447d5687 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -221,6 +221,24 @@ void read_persistent_clock64(struct timespec64 *ts)
ext_to_timespec64(clk, ts);
}

+void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+ struct timespec64 *boot_offset)
+{
+ unsigned char clk[STORE_CLOCK_EXT_SIZE];
+ struct timespec64 boot_time;
+ __u64 delta;
+
+ delta = initial_leap_seconds + TOD_UNIX_EPOCH;
+ memcpy(clk, tod_clock_base, STORE_CLOCK_EXT_SIZE);
+ *(__u64 *)&clk[1] -= delta;
+ if (*(__u64 *)&clk[1] > delta)
+ clk[0]--;
+ ext_to_timespec64(clk, &boot_time);
+
+ read_persistent_clock64(wall_time);
+ *boot_offset = timespec64_sub(*wall_time, boot_time);
+}
+
void read_boot_clock64(struct timespec64 *ts)
{
unsigned char clk[STORE_CLOCK_EXT_SIZE];
--
2.18.0


2018-07-19 20:58:41

by Pavel Tatashin

Subject: [PATCH v15 10/26] x86/CPU: Call detect_nopl() only on the BSP

From: Borislav Petkov <[email protected]>

Make it use the setup_* variants and have it be called only on the BSP
and drop the call in generic_identify() - X86_FEATURE_NOPL will be
replicated to the APs through the forced caps. Helps keep the mess at a
manageable level.
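How a forced capability recorded once on the BSP reaches every other CPU
can be modelled as two global masks that are applied on top of whatever
each CPU detects. The sketch below is a userspace illustration under that
assumption; the model_* names and the bit number are hypothetical, and a
single 32-bit word stands in for the kernel's capability arrays.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bit number, chosen only for illustration. */
#define MODEL_FEATURE_NOPL 3u

/* Global masks, written once during boot CPU setup. */
static uint32_t model_caps_set;
static uint32_t model_caps_cleared;

/* Model of a setup_force_*() style call: always report this feature. */
static void model_setup_force_cap(unsigned int bit)
{
	assert(bit < 32);
	model_caps_set |= UINT32_C(1) << bit;
}

/* Model of a setup_clear_*() style call: never report this feature. */
static void model_setup_clear_cap(unsigned int bit)
{
	assert(bit < 32);
	model_caps_cleared |= UINT32_C(1) << bit;
}

/* Run for every CPU after detection, so all CPUs end up consistent. */
static uint32_t model_apply_forced_caps(uint32_t detected)
{
	return (detected & ~model_caps_cleared) | model_caps_set;
}
```

Since the masks are applied to every CPU, the detection routine itself
only has to run once, which is what lets the per-CPU call be dropped.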

Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/cpu/common.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 71281ac43b15..46408a8cdf62 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1024,12 +1024,12 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
* unless we can find a reliable way to detect all the broken cases.
* Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
*/
-static void detect_nopl(struct cpuinfo_x86 *c)
+static void detect_nopl(void)
{
#ifdef CONFIG_X86_32
- clear_cpu_cap(c, X86_FEATURE_NOPL);
+ setup_clear_cpu_cap(X86_FEATURE_NOPL);
#else
- set_cpu_cap(c, X86_FEATURE_NOPL);
+ setup_force_cpu_cap(X86_FEATURE_NOPL);
#endif
}

@@ -1108,7 +1108,7 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
if (!pgtable_l5_enabled())
setup_clear_cpu_cap(X86_FEATURE_LA57);

- detect_nopl(c);
+ detect_nopl();
}

void __init early_cpu_init(void)
@@ -1206,8 +1206,6 @@ static void generic_identify(struct cpuinfo_x86 *c)

get_model_name(c); /* Default name */

- detect_nopl(c);
-
detect_null_seg_behavior(c);

/*
--
2.18.0


2018-07-19 20:58:52

by Pavel Tatashin

Subject: [PATCH v15 15/26] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset()

If an architecture does not support exact boot time, it is challenging to
estimate boot time without having a reference to the current persistent
clock value. Yet, it cannot read the persistent clock again, because this
may lead to math discrepancies with the caller of read_boot_clock64(),
which has read the persistent clock at a different time.

This is why it is better to provide two values simultaneously: the
persistent clock value, and the boot time.

Replace read_boot_clock64() with:
read_persistent_wall_and_boot_offset(wall_time, boot_offset)

Where wall_time is returned by read_persistent_clock64(),
and boot_offset is wall_time - boot time, which defaults to 0.
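The arithmetic that timekeeping_init() performs on these two values can be
checked with a small userspace sketch. The patch computes wall_to_mono as
timespec64_sub(boot_offset, wall_time), so that wall_time + wall_to_mono
gives back boot_offset. Everything named model_* below is a hypothetical
stand-in (struct model_ts for struct timespec64, and so on), and the
helpers assume normalised inputs.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000L

/* Hypothetical stand-in for struct timespec64. */
struct model_ts {
	int64_t sec;
	long nsec;	/* kept in [0, MODEL_NSEC_PER_SEC) */
};

/* Return a - b, borrowing from the seconds field when needed. */
static struct model_ts model_ts_sub(struct model_ts a, struct model_ts b)
{
	struct model_ts r = { a.sec - b.sec, a.nsec - b.nsec };

	if (r.nsec < 0) {
		r.sec -= 1;
		r.nsec += MODEL_NSEC_PER_SEC;
	}
	return r;
}

/* Return a + b, carrying into the seconds field when needed. */
static struct model_ts model_ts_add(struct model_ts a, struct model_ts b)
{
	struct model_ts r = { a.sec + b.sec, a.nsec + b.nsec };

	if (r.nsec >= MODEL_NSEC_PER_SEC) {
		r.sec += 1;
		r.nsec -= MODEL_NSEC_PER_SEC;
	}
	return r;
}

/* Chosen so that wall_time + wall_to_mono == boot_offset. */
static struct model_ts model_wall_to_mono(struct model_ts wall_time,
					  struct model_ts boot_offset)
{
	assert(wall_time.nsec >= 0 && wall_time.nsec < MODEL_NSEC_PER_SEC);
	assert(boot_offset.nsec >= 0 && boot_offset.nsec < MODEL_NSEC_PER_SEC);
	return model_ts_sub(boot_offset, wall_time);
}
```

With the default boot_offset of 0 this reduces to wall_to_mono being the
negated wall time, so the monotonic clock starts from zero.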

Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/timekeeping.h | 3 +-
kernel/time/timekeeping.c | 59 +++++++++++++++++++------------------
2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 86bc2026efce..686bc27acef0 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -243,7 +243,8 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
extern int persistent_clock_is_local;

extern void read_persistent_clock64(struct timespec64 *ts);
-extern void read_boot_clock64(struct timespec64 *ts);
+void read_persistent_wall_and_boot_offset(struct timespec64 *wall_clock,
+ struct timespec64 *boot_offset);
extern int update_persistent_clock64(struct timespec64 now);

/*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4786df904c22..cb738f825c12 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -17,6 +17,7 @@
#include <linux/nmi.h>
#include <linux/sched.h>
#include <linux/sched/loadavg.h>
+#include <linux/sched/clock.h>
#include <linux/syscore_ops.h>
#include <linux/clocksource.h>
#include <linux/jiffies.h>
@@ -1496,18 +1497,20 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
}

/**
- * read_boot_clock64 - Return time of the system start.
+ * read_persistent_wall_and_boot_offset - Read persistent clock, and also offset
+ * from the boot.
*
* Weak dummy function for arches that do not yet support it.
- * Function to read the exact time the system has been started.
- * Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
- *
- * XXX - Do be sure to remove it once all arches implement it.
+ * wall_time - current time as returned by persistent clock
+ * boot_offset - offset that is defined as wall_time - boot_time
+ * default to 0.
*/
-void __weak read_boot_clock64(struct timespec64 *ts)
+void __weak __init
+read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+ struct timespec64 *boot_offset)
{
- ts->tv_sec = 0;
- ts->tv_nsec = 0;
+ read_persistent_clock64(wall_time);
+ *boot_offset = (struct timespec64){0};
}

/* Flag for if timekeeping_resume() has injected sleeptime */
@@ -1521,28 +1524,29 @@ static bool persistent_clock_exists;
*/
void __init timekeeping_init(void)
{
+ struct timespec64 wall_time, boot_offset, wall_to_mono;
struct timekeeper *tk = &tk_core.timekeeper;
struct clocksource *clock;
unsigned long flags;
- struct timespec64 now, boot, tmp;
-
- read_persistent_clock64(&now);
- if (!timespec64_valid_strict(&now)) {
- pr_warn("WARNING: Persistent clock returned invalid value!\n"
- " Check your CMOS/BIOS settings.\n");
- now.tv_sec = 0;
- now.tv_nsec = 0;
- } else if (now.tv_sec || now.tv_nsec)
- persistent_clock_exists = true;

- read_boot_clock64(&boot);
- if (!timespec64_valid_strict(&boot)) {
- pr_warn("WARNING: Boot clock returned invalid value!\n"
- " Check your CMOS/BIOS settings.\n");
- boot.tv_sec = 0;
- boot.tv_nsec = 0;
+ read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
+ if (timespec64_valid_strict(&wall_time) &&
+ timespec64_to_ns(&wall_time) > 0) {
+ persistent_clock_exists = true;
+ } else {
+ pr_warn("Persistent clock returned invalid value");
+ wall_time = (struct timespec64){0};
}

+ if (timespec64_compare(&wall_time, &boot_offset) < 0)
+ boot_offset = (struct timespec64){0};
+
+ /*
+ * We want to set wall_to_mono, so the following is true:
+ * wall time + wall_to_mono = boot time
+ */
+ wall_to_mono = timespec64_sub(boot_offset, wall_time);
+
raw_spin_lock_irqsave(&timekeeper_lock, flags);
write_seqcount_begin(&tk_core.seq);
ntp_init();
@@ -1552,13 +1556,10 @@ void __init timekeeping_init(void)
clock->enable(clock);
tk_setup_internals(tk, clock);

- tk_set_xtime(tk, &now);
+ tk_set_xtime(tk, &wall_time);
tk->raw_sec = 0;
- if (boot.tv_sec == 0 && boot.tv_nsec == 0)
- boot = tk_xtime(tk);

- set_normalized_timespec64(&tmp, -boot.tv_sec, -boot.tv_nsec);
- tk_set_wall_to_mono(tk, tmp);
+ tk_set_wall_to_mono(tk, wall_to_mono);

timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);

--
2.18.0


2018-07-19 20:58:58

by Pavel Tatashin

Subject: [PATCH v15 16/26] time: default boot time offset to local_clock()

read_persistent_wall_and_boot_offset() is called during boot to read the
persistent clock and to return the offset between the boot time and the
value of the persistent clock.

Change the default boot_offset from zero to local_clock(), so that
architectures that do not have a dedicated boot clock but do have an early
sched_clock(), such as SPARCv9, x86, and possibly more, get a better and
more consistent estimate of the boot time without needing an arch-specific
implementation.
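
As a rough user-space illustration of what the new default buys, the
sketch below converts a nanosecond reading into seconds/nanoseconds and
derives a boot time estimate from it. The names struct ts, ns_to_ts() and
estimate_boot_time() are hypothetical, and the clock reading is passed in
as a plain parameter instead of calling local_clock().

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for the kernel's struct timespec64. */
struct ts {
	int64_t sec;
	long nsec;
};

/* Split a nanosecond count, such as a local_clock() reading, into sec/nsec. */
static struct ts ns_to_ts(uint64_t ns)
{
	struct ts r;

	r.sec = (int64_t)(ns / NSEC_PER_SEC);
	r.nsec = (long)(ns % NSEC_PER_SEC);
	return r;
}

/*
 * Estimated wall-clock time of boot: the persistent clock reading minus
 * the time the system has already been running (boot_offset).
 */
static struct ts estimate_boot_time(struct ts wall_time, uint64_t clock_ns)
{
	struct ts off = ns_to_ts(clock_ns);
	struct ts r;

	r.sec = wall_time.sec - off.sec;
	r.nsec = wall_time.nsec - off.nsec;
	if (r.nsec < 0) {	/* borrow one second */
		r.nsec += (long)NSEC_PER_SEC;
		r.sec--;
	}
	return r;
}
```

With an offset of zero the estimate degenerates to the wall time itself,
which is what the previous default produced.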

Signed-off-by: Pavel Tatashin <[email protected]>
---
kernel/time/timekeeping.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index cb738f825c12..30d7f64ffc87 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1503,14 +1503,17 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
* Weak dummy function for arches that do not yet support it.
* wall_time - current time as returned by persistent clock
* boot_offset - offset that is defined as wall_time - boot_time
- * default to 0.
+ * The default function calculates offset based on the current value of
+ * local_clock(). This way architectures that support sched_clock() but don't
+ * support dedicated boot time clock will provide the best estimate of the
+ * boot time.
*/
void __weak __init
read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
struct timespec64 *boot_offset)
{
read_persistent_clock64(wall_time);
- *boot_offset = (struct timespec64){0};
+ *boot_offset = ns_to_timespec64(local_clock());
}

/* Flag for if timekeeping_resume() has injected sleeptime */
--
2.18.0


2018-07-19 20:58:58

by Pavel Tatashin

Subject: [PATCH v15 02/26] x86/kvmclock: Remove page size requirement from wall_clock

From: Thomas Gleixner <[email protected]>

There is no requirement for wall_clock data to be page aligned or page
sized.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/kernel/kvmclock.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 1f6ac5aaa904..a995d7d7164c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -46,14 +46,12 @@ early_param("no-kvmclock", parse_no_kvmclock);

/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
-#define WALL_CLOCK_SIZE (sizeof(struct pvclock_wall_clock))

static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
-static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);

/* The hypervisor will put information about time periodically here */
static struct pvclock_vsyscall_time_info *hv_clock;
-static struct pvclock_wall_clock *wall_clock;
+static struct pvclock_wall_clock wall_clock;

/*
* The wallclock is the time of day when we booted. Since then, some time may
@@ -66,15 +64,15 @@ static void kvm_get_wallclock(struct timespec64 *now)
int low, high;
int cpu;

- low = (int)slow_virt_to_phys(wall_clock);
- high = ((u64)slow_virt_to_phys(wall_clock) >> 32);
+ low = (int)slow_virt_to_phys(&wall_clock);
+ high = ((u64)slow_virt_to_phys(&wall_clock) >> 32);

native_write_msr(msr_kvm_wall_clock, low, high);

cpu = get_cpu();

vcpu_time = &hv_clock[cpu].pvti;
- pvclock_read_wallclock(wall_clock, vcpu_time, now);
+ pvclock_read_wallclock(&wall_clock, vcpu_time, now);

put_cpu();
}
@@ -267,12 +265,10 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;

- wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;

if (kvm_register_clock("primary cpu clock")) {
hv_clock = NULL;
- wall_clock = NULL;
return;
}

--
2.18.0


2018-07-19 20:59:02

by Pavel Tatashin

Subject: [PATCH v15 17/26] s390/time: remove read_boot_clock64()

read_boot_clock64() was replaced by read_persistent_wall_and_boot_offset()
so remove it.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/s390/kernel/time.c | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index d1f5447d5687..e8766beee5ad 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -239,19 +239,6 @@ void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
*boot_offset = timespec64_sub(*wall_time, boot_time);
}

-void read_boot_clock64(struct timespec64 *ts)
-{
- unsigned char clk[STORE_CLOCK_EXT_SIZE];
- __u64 delta;
-
- delta = initial_leap_seconds + TOD_UNIX_EPOCH;
- memcpy(clk, tod_clock_base, 16);
- *(__u64 *) &clk[1] -= delta;
- if (*(__u64 *) &clk[1] > delta)
- clk[0]--;
- ext_to_timespec64(clk, ts);
-}
-
static u64 read_tod_clock(struct clocksource *cs)
{
unsigned long long now, adj;
--
2.18.0


2018-07-19 20:59:02

by Pavel Tatashin

Subject: [PATCH v15 18/26] ARM/time: remove read_boot_clock64()

read_boot_clock64() is deleted and replaced with
read_persistent_wall_and_boot_offset().

The default implementation of read_persistent_wall_and_boot_offset()
provides a better fallback than the current read_boot_clock64() stubs on
ARM, which have no users, so remove the old code.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm/include/asm/mach/time.h | 3 +--
arch/arm/kernel/time.c | 15 ++-------------
arch/arm/plat-omap/counter_32k.c | 2 +-
drivers/clocksource/tegra20_timer.c | 2 +-
4 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mach/time.h b/arch/arm/include/asm/mach/time.h
index 0f79e4dec7f9..4ac3a019a46f 100644
--- a/arch/arm/include/asm/mach/time.h
+++ b/arch/arm/include/asm/mach/time.h
@@ -13,7 +13,6 @@
extern void timer_tick(void);

typedef void (*clock_access_fn)(struct timespec64 *);
-extern int register_persistent_clock(clock_access_fn read_boot,
- clock_access_fn read_persistent);
+extern int register_persistent_clock(clock_access_fn read_persistent);

#endif
diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c
index cf2701cb0de8..078b259ead4e 100644
--- a/arch/arm/kernel/time.c
+++ b/arch/arm/kernel/time.c
@@ -83,29 +83,18 @@ static void dummy_clock_access(struct timespec64 *ts)
}

static clock_access_fn __read_persistent_clock = dummy_clock_access;
-static clock_access_fn __read_boot_clock = dummy_clock_access;

void read_persistent_clock64(struct timespec64 *ts)
{
__read_persistent_clock(ts);
}

-void read_boot_clock64(struct timespec64 *ts)
-{
- __read_boot_clock(ts);
-}
-
-int __init register_persistent_clock(clock_access_fn read_boot,
- clock_access_fn read_persistent)
+int __init register_persistent_clock(clock_access_fn read_persistent)
{
/* Only allow the clockaccess functions to be registered once */
- if (__read_persistent_clock == dummy_clock_access &&
- __read_boot_clock == dummy_clock_access) {
- if (read_boot)
- __read_boot_clock = read_boot;
+ if (__read_persistent_clock == dummy_clock_access) {
if (read_persistent)
__read_persistent_clock = read_persistent;
-
return 0;
}

diff --git a/arch/arm/plat-omap/counter_32k.c b/arch/arm/plat-omap/counter_32k.c
index 2438b96004c1..fcc5bfec8bd1 100644
--- a/arch/arm/plat-omap/counter_32k.c
+++ b/arch/arm/plat-omap/counter_32k.c
@@ -110,7 +110,7 @@ int __init omap_init_clocksource_32k(void __iomem *vbase)
}

sched_clock_register(omap_32k_read_sched_clock, 32, 32768);
- register_persistent_clock(NULL, omap_read_persistent_clock64);
+ register_persistent_clock(omap_read_persistent_clock64);
pr_info("OMAP clocksource: 32k_counter at 32768 Hz\n");

return 0;
diff --git a/drivers/clocksource/tegra20_timer.c b/drivers/clocksource/tegra20_timer.c
index c337a8100a7b..2242a36fc5b0 100644
--- a/drivers/clocksource/tegra20_timer.c
+++ b/drivers/clocksource/tegra20_timer.c
@@ -259,6 +259,6 @@ static int __init tegra20_init_rtc(struct device_node *np)
else
clk_prepare_enable(clk);

- return register_persistent_clock(NULL, tegra_read_persistent_clock64);
+ return register_persistent_clock(tegra_read_persistent_clock64);
}
TIMER_OF_DECLARE(tegra20_rtc, "nvidia,tegra20-rtc", tegra20_init_rtc);
--
2.18.0


2018-07-19 20:59:09

by Pavel Tatashin

Subject: [PATCH v15 19/26] x86/tsc: calibrate tsc only once

During boot, tsc is calibrated twice: once in tsc_early_delay_calibrate(),
and a second time in tsc_init().

Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
that the calibration is done only early. Make tsc_init() use the values
already determined in tsc_early_init().

Sometimes it is not possible to determine the tsc frequency early, because
the subsystem that is required is not yet initialized. In that case, try
again later in tsc_init().
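
The calibrate-once flow with a late retry can be modelled in user space
as follows. This is a sketch under assumptions: fake_calibrate_tsc(),
early_init(), late_init(), lpj and the HZ value are hypothetical, and the
real functions do considerably more work.

```c
#include <assert.h>
#include <stdbool.h>

#define HZ  250		/* assumed tick rate, for illustration only */
#define KHZ 1000

static unsigned int tsc_khz;	/* 0 means "not yet calibrated" */
static unsigned long lpj;	/* models the loops-per-jiffy value */
static int calibrate_calls;	/* hypothetical counter for the demo */

/* Hypothetical calibration source: fails (returns 0) on the first call. */
static unsigned int fake_calibrate_tsc(void)
{
	return ++calibrate_calls == 1 ? 0 : 2400000;	/* 2.4 GHz */
}

static bool determine_frequency(void)
{
	tsc_khz = fake_calibrate_tsc();
	return tsc_khz != 0;
}

static unsigned long get_lpj(void)
{
	return (unsigned long)((unsigned long long)tsc_khz * KHZ / HZ);
}

/* Early attempt: may fail if the needed subsystem is not ready. */
static void early_init(void)
{
	if (determine_frequency())
		lpj = get_lpj();
}

/* Late init: calibrate only if the early attempt did not succeed. */
static bool late_init(void)
{
	if (!tsc_khz && !determine_frequency())
		return false;
	lpj = get_lpj();
	return true;
}
```

Calling early_init() and then late_init() performs two calibration
attempts in this model only because the first one fails. When the early
attempt succeeds, the late path reuses tsc_khz without recalibrating.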

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/tsc.h | 2 +-
arch/x86/kernel/setup.c | 2 +-
arch/x86/kernel/tsc.c | 87 ++++++++++++++++++++------------------
3 files changed, 49 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 2701d221583a..c4368ff73652 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -33,7 +33,7 @@ static inline cycles_t get_cycles(void)
extern struct system_counterval_t convert_art_to_tsc(u64 art);
extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);

-extern void tsc_early_delay_calibrate(void);
+extern void tsc_early_init(void);
extern void tsc_init(void);
extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 7490de925a81..5d32c55aeb8b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1014,6 +1014,7 @@ void __init setup_arch(char **cmdline_p)
*/
init_hypervisor_platform();

+ tsc_early_init();
x86_init.resources.probe_roms();

/* after parse_early_param, so could debug it */
@@ -1199,7 +1200,6 @@ void __init setup_arch(char **cmdline_p)

memblock_find_dma_reserve();

- tsc_early_delay_calibrate();
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 186395041725..4cab2236169e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -33,6 +33,8 @@ EXPORT_SYMBOL(cpu_khz);
unsigned int __read_mostly tsc_khz;
EXPORT_SYMBOL(tsc_khz);

+#define KHZ 1000
+
/*
* TSC can be unstable due to cpufreq or due to unsynced TSCs
*/
@@ -1335,34 +1337,10 @@ static int __init init_tsc_clocksource(void)
*/
device_initcall(init_tsc_clocksource);

-void __init tsc_early_delay_calibrate(void)
-{
- unsigned long lpj;
-
- if (!boot_cpu_has(X86_FEATURE_TSC))
- return;
-
- cpu_khz = x86_platform.calibrate_cpu();
- tsc_khz = x86_platform.calibrate_tsc();
-
- tsc_khz = tsc_khz ? : cpu_khz;
- if (!tsc_khz)
- return;
-
- lpj = tsc_khz * 1000;
- do_div(lpj, HZ);
- loops_per_jiffy = lpj;
-}
-
-void __init tsc_init(void)
+static bool __init determine_cpu_tsc_frequencies(void)
{
- u64 lpj, cyc;
- int cpu;
-
- if (!boot_cpu_has(X86_FEATURE_TSC)) {
- setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
- return;
- }
+ /* Make sure that cpu and tsc are not already calibrated */
+ WARN_ON(cpu_khz || tsc_khz);

cpu_khz = x86_platform.calibrate_cpu();
tsc_khz = x86_platform.calibrate_tsc();
@@ -1377,20 +1355,52 @@ void __init tsc_init(void)
else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
cpu_khz = tsc_khz;

- if (!tsc_khz) {
- mark_tsc_unstable("could not calculate TSC khz");
- setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
- return;
- }
+ if (tsc_khz == 0)
+ return false;

pr_info("Detected %lu.%03lu MHz processor\n",
- (unsigned long)cpu_khz / 1000,
- (unsigned long)cpu_khz % 1000);
+ (unsigned long)cpu_khz / KHZ,
+ (unsigned long)cpu_khz % KHZ);

if (cpu_khz != tsc_khz) {
pr_info("Detected %lu.%03lu MHz TSC",
- (unsigned long)tsc_khz / 1000,
- (unsigned long)tsc_khz % 1000);
+ (unsigned long)tsc_khz / KHZ,
+ (unsigned long)tsc_khz % KHZ);
+ }
+ return true;
+}
+
+static unsigned long __init get_loops_per_jiffy(void)
+{
+ unsigned long lpj = tsc_khz * KHZ;
+
+ do_div(lpj, HZ);
+ return lpj;
+}
+
+void __init tsc_early_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TSC))
+ return;
+ if (!determine_cpu_tsc_frequencies())
+ return;
+ loops_per_jiffy = get_loops_per_jiffy();
+}
+
+void __init tsc_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TSC)) {
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+ return;
+ }
+
+ if (!tsc_khz) {
+ /* We failed to determine frequencies earlier, try again */
+ if (!determine_cpu_tsc_frequencies()) {
+ mark_tsc_unstable("could not calculate TSC khz");
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+ return;
+ }
}

/* Sanitize TSC ADJUST before cyc2ns gets initialized */
@@ -1413,10 +1423,7 @@ void __init tsc_init(void)
if (!no_sched_irq_time)
enable_sched_clock_irqtime();

- lpj = ((u64)tsc_khz * 1000);
- do_div(lpj, HZ);
- lpj_fine = lpj;
-
+ lpj_fine = get_loops_per_jiffy();
use_tsc_delay();

check_system_tsc_reliable();
--
2.18.0


2018-07-19 20:59:15

by Pavel Tatashin

Subject: [PATCH v15 24/26] sched: use static key for sched_clock_running

sched_clock_running may be read every time sched_clock_cpu() is called.
Yet this variable is updated only twice during boot and never changes
again. Therefore, it is better to make it a static key.
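
The counting behaviour that the conversion relies on can be pictured with
a small user-space model. It is only a sketch: struct key_model,
key_inc(), key_enabled() and clock_cpu() are hypothetical names, and no
code patching happens here, which is the actual point of a static key.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a counting key; no code patching happens here. */
struct key_model {
	int count;
};

static void key_inc(struct key_model *k)
{
	k->count++;
}

/* The branch is taken whenever the count is non-zero. */
static bool key_enabled(const struct key_model *k)
{
	return k->count > 0;
}

static struct key_model sched_clock_running_model;

/* Mirrors the shape of the early-return check in sched_clock_cpu(). */
static unsigned long long clock_cpu(unsigned long long raw,
				    unsigned long long adjusted)
{
	if (!key_enabled(&sched_clock_running_model))
		return raw;
	return adjusted;
}
```

Two increments leave the count at 2, which is how the old
"sched_clock_running == 2" state can still be told apart from the first
stage.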

Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
---
kernel/sched/clock.c | 16 ++++++++--------
kernel/sched/debug.c | 2 --
2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 422cd63f8f17..c5c47ad3f386 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -67,7 +67,7 @@ unsigned long long __weak sched_clock(void)
}
EXPORT_SYMBOL_GPL(sched_clock);

-__read_mostly int sched_clock_running;
+static DEFINE_STATIC_KEY_FALSE(sched_clock_running);

#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
/*
@@ -191,7 +191,7 @@ void clear_sched_clock_stable(void)

smp_mb(); /* matches sched_clock_init_late() */

- if (sched_clock_running == 2)
+ if (static_key_count(&sched_clock_running.key) == 2)
__clear_sched_clock_stable();
}

@@ -215,7 +215,7 @@ void __init sched_clock_init(void)
__sched_clock_gtod_offset();
local_irq_restore(flags);

- sched_clock_running = 1;
+ static_branch_inc(&sched_clock_running);

/* Now that sched_clock_running is set adjust scd */
local_irq_save(flags);
@@ -228,7 +228,7 @@ void __init sched_clock_init(void)
*/
static int __init sched_clock_init_late(void)
{
- sched_clock_running = 2;
+ static_branch_inc(&sched_clock_running);
/*
* Ensure that it is impossible to not do a static_key update.
*
@@ -373,7 +373,7 @@ u64 sched_clock_cpu(int cpu)
if (sched_clock_stable())
return sched_clock() + __sched_clock_offset;

- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return sched_clock();

preempt_disable_notrace();
@@ -396,7 +396,7 @@ void sched_clock_tick(void)
if (sched_clock_stable())
return;

- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return;

lockdep_assert_irqs_disabled();
@@ -455,13 +455,13 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);

void __init sched_clock_init(void)
{
- sched_clock_running = 1;
+ static_branch_inc(&sched_clock_running);
generic_sched_clock_init();
}

u64 sched_clock_cpu(int cpu)
{
- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return 0;

return sched_clock();
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e593b4118578..b0212f489a33 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -623,8 +623,6 @@ void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
#undef PU
}

-extern __read_mostly int sched_clock_running;
-
static void print_cpu(struct seq_file *m, int cpu)
{
struct rq *rq = cpu_rq(cpu);
--
2.18.0


2018-07-19 20:59:15

by Pavel Tatashin

Subject: [PATCH v15 22/26] sched: move sched clock initialization and merge with generic clock

sched_clock_postinit() initializes a generic clock on systems where no
other clock is provided. This function may be called only after
timekeeping_init().

Rename sched_clock_postinit() to generic_sched_clock_init() and call it
from sched_clock_init(). Move the call to sched_clock_init() to after
time_init().

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/sched_clock.h | 5 ++---
init/main.c | 4 ++--
kernel/sched/clock.c | 27 +++++++++++++++++----------
kernel/sched/core.c | 1 -
kernel/time/sched_clock.c | 2 +-
5 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched_clock.h b/include/linux/sched_clock.h
index 411b52e424e1..abe28d5cb3f4 100644
--- a/include/linux/sched_clock.h
+++ b/include/linux/sched_clock.h
@@ -9,17 +9,16 @@
#define LINUX_SCHED_CLOCK

#ifdef CONFIG_GENERIC_SCHED_CLOCK
-extern void sched_clock_postinit(void);
+extern void generic_sched_clock_init(void);

extern void sched_clock_register(u64 (*read)(void), int bits,
unsigned long rate);
#else
-static inline void sched_clock_postinit(void) { }
+static inline void generic_sched_clock_init(void) { }

static inline void sched_clock_register(u64 (*read)(void), int bits,
unsigned long rate)
{
- ;
}
#endif

diff --git a/init/main.c b/init/main.c
index 3b4ada11ed52..162d931c9511 100644
--- a/init/main.c
+++ b/init/main.c
@@ -79,7 +79,7 @@
#include <linux/pti.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>
-#include <linux/sched_clock.h>
+#include <linux/sched/clock.h>
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/context_tracking.h>
@@ -642,7 +642,7 @@ asmlinkage __visible void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
- sched_clock_postinit();
+ sched_clock_init();
printk_safe_init();
perf_event_init();
profile_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 10c83e73837a..0e9dbb2d9aea 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -53,6 +53,7 @@
*
*/
#include "sched.h"
+#include <linux/sched_clock.h>

/*
* Scheduler clock - returns current time in nanosec units.
@@ -68,11 +69,6 @@ EXPORT_SYMBOL_GPL(sched_clock);

__read_mostly int sched_clock_running;

-void sched_clock_init(void)
-{
- sched_clock_running = 1;
-}
-
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
/*
* We must start with !__sched_clock_stable because the unstable -> stable
@@ -199,6 +195,15 @@ void clear_sched_clock_stable(void)
__clear_sched_clock_stable();
}

+static void __sched_clock_gtod_offset(void)
+{
+ __gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();
+}
+
+void __init sched_clock_init(void)
+{
+ sched_clock_running = 1;
+}
/*
* We run this as late_initcall() such that it runs after all built-in drivers,
* notably: acpi_processor and intel_idle, which can mark the TSC as unstable.
@@ -385,8 +390,6 @@ void sched_clock_tick(void)

void sched_clock_tick_stable(void)
{
- u64 gtod, clock;
-
if (!sched_clock_stable())
return;

@@ -398,9 +401,7 @@ void sched_clock_tick_stable(void)
* TSC to be unstable, any computation will be computing crap.
*/
local_irq_disable();
- gtod = ktime_get_ns();
- clock = sched_clock();
- __gtod_offset = (clock + __sched_clock_offset) - gtod;
+ __sched_clock_gtod_offset();
local_irq_enable();
}

@@ -434,6 +435,12 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);

#else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */

+void __init sched_clock_init(void)
+{
+ sched_clock_running = 1;
+ generic_sched_clock_init();
+}
+
u64 sched_clock_cpu(int cpu)
{
if (unlikely(!sched_clock_running))
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..552406e9713b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5954,7 +5954,6 @@ void __init sched_init(void)
int i, j;
unsigned long alloc_size = 0, ptr;

- sched_clock_init();
wait_bit_init();

#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 2d8f05aad442..cbc72c2c1fca 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -237,7 +237,7 @@ sched_clock_register(u64 (*read)(void), int bits, unsigned long rate)
pr_debug("Registered %pF as sched_clock source\n", read);
}

-void __init sched_clock_postinit(void)
+void __init generic_sched_clock_init(void)
{
/*
* If no sched_clock() function has been provided at that point,
--
2.18.0


2018-07-19 20:59:18

by Pavel Tatashin

Subject: [PATCH v15 25/26] x86/tsc: split native_calibrate_cpu() into early and late parts

Early in boot, the CPU can be calibrated using the cpuid, msr, and quick
pit methods. The other methods (pit/hpet/pmtimer) are available only after
acpi is initialized.

Split native_calibrate_cpu() into early and late parts so they can be
called separately during early and late tsc calibration.
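
The resulting fallback order can be sketched in user space as below. The
struct calib_results type and the calibrate_early()/calibrate_full()
helpers are hypothetical: each method is modelled as a precomputed result
in kHz, where 0 means that the method is unavailable.

```c
#include <assert.h>

/* Hypothetical results of each method in kHz; 0 means "method unavailable". */
struct calib_results {
	unsigned long cpuid_khz;
	unsigned long msr_khz;
	unsigned long quick_pit_khz;
	unsigned long slow_khz;		/* pit/hpet/pmtimer, needs acpi */
};

/* Methods usable early in boot, tried in order until one succeeds. */
static unsigned long calibrate_early(const struct calib_results *r)
{
	unsigned long khz = r->cpuid_khz;

	if (!khz)
		khz = r->msr_khz;
	if (!khz)
		khz = r->quick_pit_khz;
	return khz;
}

/* Full calibration: early methods first, slow methods as a last resort. */
static unsigned long calibrate_full(const struct calib_results *r)
{
	unsigned long khz = calibrate_early(r);

	if (!khz)
		khz = r->slow_khz;
	return khz;
}
```

The slow path is consulted only when every early method returns 0, so a
caller that runs early can use calibrate_early() alone and retry with the
remaining methods later.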

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/tsc.c | 54 +++++++++++++++++++++++++-------------
2 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c4368ff73652..88140e4f2292 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -40,6 +40,7 @@ extern int unsynchronized_tsc(void);
extern int check_tsc_unstable(void);
extern void mark_tsc_async_resets(char *reason);
extern unsigned long native_calibrate_cpu(void);
+extern unsigned long native_calibrate_cpu_early(void);
extern unsigned long native_calibrate_tsc(void);
extern unsigned long long native_sched_clock_from_tsc(u64 tsc);

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 9277ae9b68b3..60586779b02c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -680,30 +680,17 @@ static unsigned long cpu_khz_from_cpuid(void)
return eax_base_mhz * 1000;
}

-/**
- * native_calibrate_cpu - calibrate the cpu on boot
+/*
+ * calibrate cpu using pit, hpet, and ptimer methods. They are available
+ * later in boot after acpi is initialized.
*/
-unsigned long native_calibrate_cpu(void)
+static unsigned long pit_hpet_ptimer_calibrate_cpu(void)
{
u64 tsc1, tsc2, delta, ref1, ref2;
unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
- unsigned long flags, latch, ms, fast_calibrate;
+ unsigned long flags, latch, ms;
int hpet = is_hpet_enabled(), i, loopmin;

- fast_calibrate = cpu_khz_from_cpuid();
- if (fast_calibrate)
- return fast_calibrate;
-
- fast_calibrate = cpu_khz_from_msr();
- if (fast_calibrate)
- return fast_calibrate;
-
- local_irq_save(flags);
- fast_calibrate = quick_pit_calibrate();
- local_irq_restore(flags);
- if (fast_calibrate)
- return fast_calibrate;
-
/*
* Run 5 calibration loops to get the lowest frequency value
* (the best estimate). We use two different calibration modes
@@ -846,6 +833,37 @@ unsigned long native_calibrate_cpu(void)
return tsc_pit_min;
}

+/**
+ * native_calibrate_cpu_early - can calibrate the cpu early in boot
+ */
+unsigned long native_calibrate_cpu_early(void)
+{
+ unsigned long flags, fast_calibrate = cpu_khz_from_cpuid();
+
+ if (!fast_calibrate)
+ fast_calibrate = cpu_khz_from_msr();
+ if (!fast_calibrate) {
+ local_irq_save(flags);
+ fast_calibrate = quick_pit_calibrate();
+ local_irq_restore(flags);
+ }
+ return fast_calibrate;
+}
+
+
+/**
+ * native_calibrate_cpu - calibrate the cpu
+ */
+unsigned long native_calibrate_cpu(void)
+{
+ unsigned long tsc_freq = native_calibrate_cpu_early();
+
+ if (!tsc_freq)
+ tsc_freq = pit_hpet_ptimer_calibrate_cpu();
+
+ return tsc_freq;
+}
+
void recalibrate_cpu_khz(void)
{
#ifndef CONFIG_SMP
--
2.18.0


2018-07-19 20:59:46

by Pavel Tatashin

Subject: [PATCH v15 23/26] sched: early boot clock

Allow sched_clock() to be used before sched_clock_init() is called.
This provides a way to get early boot timestamps on machines with
unstable clocks.
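
The continuity requirement can be illustrated with a small user-space
model of the offset arithmetic. The names struct clock_model,
set_gtod_offset() and gtod_based_clock() are hypothetical, all values are
plain nanosecond counts, and the per-cpu bookkeeping of the real code is
left out.

```c
#include <assert.h>
#include <stdint.h>

/* All values are nanoseconds; names are hypothetical stand-ins. */
struct clock_model {
	uint64_t sched_clock_offset;	/* models __sched_clock_offset */
	uint64_t gtod_offset;		/* models __gtod_offset */
};

/* Record the difference between the raw clock and the timekeeping clock. */
static void set_gtod_offset(struct clock_model *m, uint64_t raw_ns,
			    uint64_t ktime_ns)
{
	m->gtod_offset = (raw_ns + m->sched_clock_offset) - ktime_ns;
}

/* Timekeeping-based reading, shifted so that it lines up with the raw clock. */
static uint64_t gtod_based_clock(const struct clock_model *m, uint64_t ktime_ns)
{
	return ktime_ns + m->gtod_offset;
}
```

At the instant the offset is recorded, the shifted reading equals the raw
clock, so timestamps taken before and after the switch stay on the same
timeline.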

Signed-off-by: Pavel Tatashin <[email protected]>
---
init/main.c | 2 +-
kernel/sched/clock.c | 20 +++++++++++++++++++-
2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index 162d931c9511..ff0a24170b95 100644
--- a/init/main.c
+++ b/init/main.c
@@ -642,7 +642,6 @@ asmlinkage __visible void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
- sched_clock_init();
printk_safe_init();
perf_event_init();
profile_init();
@@ -697,6 +696,7 @@ asmlinkage __visible void __init start_kernel(void)
acpi_early_init();
if (late_time_init)
late_time_init();
+ sched_clock_init();
calibrate_delay();
pid_idr_init();
anon_vma_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 0e9dbb2d9aea..422cd63f8f17 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -202,7 +202,25 @@ static void __sched_clock_gtod_offset(void)

void __init sched_clock_init(void)
{
+ unsigned long flags;
+
+ /*
+ * Set __gtod_offset such that once we mark sched_clock_running,
+ * sched_clock_tick() continues where sched_clock() left off.
+ *
+ * Even if TSC is buggered, we're still UP at this point so it
+ * can't really be out of sync.
+ */
+ local_irq_save(flags);
+ __sched_clock_gtod_offset();
+ local_irq_restore(flags);
+
sched_clock_running = 1;
+
+ /* Now that sched_clock_running is set adjust scd */
+ local_irq_save(flags);
+ sched_clock_tick();
+ local_irq_restore(flags);
}
/*
* We run this as late_initcall() such that it runs after all built-in drivers,
@@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
return sched_clock() + __sched_clock_offset;

if (unlikely(!sched_clock_running))
- return 0ull;
+ return sched_clock();

preempt_disable_notrace();
scd = cpu_sdc(cpu);
--
2.18.0


2018-07-19 20:59:53

by Pavel Tatashin

Subject: [PATCH v15 20/26] x86/tsc: initialize cyc2ns when tsc freq. is determined

cyc2ns converts tsc to nanoseconds, and it is handled in a per-cpu data
structure.

Currently, the setup code for the cyc2ns data of every possible CPU goes
through the same sequence of calculations as for the boot CPU, and is based
on the same tsc frequency as the boot CPU, so repeating the calculation is
not necessary.

Initialize the boot cpu when tsc frequency is determined. Copy the
calculated data from the boot CPU to the other CPUs in tsc_init().

In addition do the following:

- Remove unnecessary zeroing of c2ns data by removing cyc2ns_data_init()
- Split set_cyc2ns_scale() into two functions: __set_cyc2ns_scale(), which
can be called directly while the system is booting and avoids saving and
restoring IRQs and the sched_clock idle sleep/wakeup events, and
set_cyc2ns_scale(), which wraps it and can be called once the system is up.
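
The idea of computing the scale once and copying it can be sketched in
user space. Everything here is an assumption made for illustration: the
struct cyc2ns_model type, the helper names, the fixed shift of 10 and the
CPU count are hypothetical, and the kernel derives its multiplier and
shift differently.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL
#define SCALE_SHIFT   10	/* assumed fixed shift, for illustration only */
#define NR_CPUS_MODEL 4		/* hypothetical CPU count */

struct cyc2ns_model {
	uint32_t mul;
	uint32_t shift;
	uint64_t offset;
};

static struct cyc2ns_model percpu[NR_CPUS_MODEL];

/* ns = cycles * 10^6 / khz, expressed as a multiply and a shift. */
static void init_boot_cpu(uint32_t khz)
{
	percpu[0].shift = SCALE_SHIFT;
	percpu[0].mul = (uint32_t)((NSEC_PER_MSEC << SCALE_SHIFT) / khz);
	percpu[0].offset = 0;
}

/* The other CPUs start from a copy of the boot CPU's values. */
static void init_secondary_cpus(void)
{
	int cpu;

	for (cpu = 1; cpu < NR_CPUS_MODEL; cpu++)
		percpu[cpu] = percpu[0];
}

/* Note: the 64-bit product can overflow for large cycle counts. */
static uint64_t cycles_to_ns(int cpu, uint64_t cyc)
{
	const struct cyc2ns_model *d = &percpu[cpu];

	return d->offset + ((cyc * d->mul) >> d->shift);
}
```

Because every CPU is assumed to run at the boot CPU's frequency, copying
the structure gives the same conversion on each CPU without repeating the
division.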

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/tsc.c | 94 ++++++++++++++++++++++++-------------------
1 file changed, 53 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 4cab2236169e..7ea0718a4c75 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -103,23 +103,6 @@ void cyc2ns_read_end(void)
* [email protected] "math is hard, lets go shopping!"
*/

-static void cyc2ns_data_init(struct cyc2ns_data *data)
-{
- data->cyc2ns_mul = 0;
- data->cyc2ns_shift = 0;
- data->cyc2ns_offset = 0;
-}
-
-static void __init cyc2ns_init(int cpu)
-{
- struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
-
- cyc2ns_data_init(&c2n->data[0]);
- cyc2ns_data_init(&c2n->data[1]);
-
- seqcount_init(&c2n->seq);
-}
-
static inline unsigned long long cycles_2_ns(unsigned long long cyc)
{
struct cyc2ns_data data;
@@ -135,18 +118,11 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
return ns;
}

-static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
{
unsigned long long ns_now;
struct cyc2ns_data data;
struct cyc2ns *c2n;
- unsigned long flags;
-
- local_irq_save(flags);
- sched_clock_idle_sleep_event();
-
- if (!khz)
- goto done;

ns_now = cycles_2_ns(tsc_now);

@@ -178,12 +154,55 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
c2n->data[0] = data;
raw_write_seqcount_latch(&c2n->seq);
c2n->data[1] = data;
+}
+
+static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ if (khz)
+ __set_cyc2ns_scale(khz, cpu, tsc_now);

-done:
sched_clock_idle_wakeup_event();
local_irq_restore(flags);
}

+/*
+ * Initialize cyc2ns for boot cpu
+ */
+static void __init cyc2ns_init_boot_cpu(void)
+{
+ struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+
+ seqcount_init(&c2n->seq);
+ __set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
+}
+
+/*
+ * Secondary CPUs do not run through cyc2ns_init(), so set up
+ * all the scale factors for all CPUs, assuming the same
+ * speed as the bootup CPU. (cpufreq notifiers will fix this
+ * up if their speed diverges)
+ */
+static void __init cyc2ns_init_secondary_cpus(void)
+{
+ unsigned int cpu, this_cpu = smp_processor_id();
+ struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+ struct cyc2ns_data *data = c2n->data;
+
+ for_each_possible_cpu(cpu) {
+ if (cpu != this_cpu) {
+ seqcount_init(&c2n->seq);
+ c2n = per_cpu_ptr(&cyc2ns, cpu);
+ c2n->data[0] = data[0];
+ c2n->data[1] = data[1];
+ }
+ }
+}
+
/*
* Scheduler clock - returns current time in nanosec units.
*/
@@ -1385,6 +1404,10 @@ void __init tsc_early_init(void)
if (!determine_cpu_tsc_frequencies())
return;
loops_per_jiffy = get_loops_per_jiffy();
+
+ /* Sanitize TSC ADJUST before cyc2ns gets initialized */
+ tsc_store_and_check_tsc_adjust(true);
+ cyc2ns_init_boot_cpu();
}

void __init tsc_init(void)
@@ -1401,23 +1424,12 @@ void __init tsc_init(void)
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
return;
}
+ /* Sanitize TSC ADJUST before cyc2ns gets initialized */
+ tsc_store_and_check_tsc_adjust(true);
+ cyc2ns_init_boot_cpu();
}

- /* Sanitize TSC ADJUST before cyc2ns gets initialized */
- tsc_store_and_check_tsc_adjust(true);
-
- /*
- * Secondary CPUs do not run through tsc_init(), so set up
- * all the scale factors for all CPUs, assuming the same
- * speed as the bootup CPU. (cpufreq notifiers will fix this
- * up if their speed diverges)
- */
- cyc = rdtsc();
- for_each_possible_cpu(cpu) {
- cyc2ns_init(cpu);
- set_cyc2ns_scale(tsc_khz, cpu, cyc);
- }
-
+ cyc2ns_init_secondary_cpus();
static_branch_enable(&__use_tsc);

if (!no_sched_irq_time)
--
2.18.0
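
The cyc2ns helpers touched by the diff above convert TSC cycles to
nanoseconds with a fixed-point multiply and shift. The following is only a
userspace sketch of that arithmetic, not the kernel code: the struct, the
sketch_* names and the fixed shift of 10 are assumptions made for
illustration, and the real code additionally keeps a continuity offset and
publishes the data through a seqcount latch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the per-CPU cyc2ns data. */
struct cyc2ns_sketch {
	uint32_t mul;    /* fixed-point multiplier */
	uint32_t shift;  /* fixed-point shift */
	uint64_t offset; /* ns offset; kept at 0 in this sketch */
};

#define SKETCH_SHIFT		10u	/* assumed shift, chosen for simplicity */
#define SKETCH_NSEC_PER_MSEC	1000000ull

/* Derive mul/shift from a frequency in kHz so that ns = (cyc * mul) >> shift. */
void sketch_set_scale(struct cyc2ns_sketch *d, uint32_t khz)
{
	assert(khz != 0);
	d->shift = SKETCH_SHIFT;
	d->mul = (uint32_t)((SKETCH_NSEC_PER_MSEC << SKETCH_SHIFT) / khz);
	d->offset = 0;
}

/* Multiply a 64-bit value by a 32-bit value and shift right (shift <= 32). */
uint64_t sketch_mul_u64_u32_shr(uint64_t a, uint32_t mul, unsigned int shift)
{
	uint32_t al = (uint32_t)a;
	uint32_t ah = (uint32_t)(a >> 32);
	uint64_t ret = ((uint64_t)al * mul) >> shift;

	if (ah)
		ret += ((uint64_t)ah * mul) << (32 - shift);
	return ret;
}

/* Convert a cycle count to nanoseconds using the stored scale. */
uint64_t sketch_cycles_2_ns(const struct cyc2ns_sketch *d, uint64_t cyc)
{
	return d->offset + sketch_mul_u64_u32_shr(cyc, d->mul, d->shift);
}
```

With a 2 GHz clock (khz = 2000000) the multiplier comes out as 512, so two
billion cycles map to one second. Copying one CPU's mul/shift/offset to all
other CPUs, as cyc2ns_init_secondary_cpus() does, amounts to assuming every
CPU runs at the boot CPU's frequency.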


2018-07-19 21:00:03

by Pavel Tatashin

Subject: [PATCH v15 26/26] x86/tsc: use tsc_calibrate_cpu_early and pit_hpet_ptimer_calibrate_cpu

Early in boot, use native_calibrate_cpu_early() and switch to
native_calibrate_cpu() only later. Do this unconditionally, because it is
unknown which methods other CPUs will use to calibrate once they are
onlined.

If the TSC frequency is still unknown by the time tsc_init() is called,
calibrate using only pit_hpet_ptimer_calibrate_cpu(), as this function
contains the only methods that have not already been tried earlier in boot.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/tsc.h | 1 -
arch/x86/kernel/tsc.c | 25 +++++++++++++++++++------
arch/x86/kernel/x86_init.c | 2 +-
3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 88140e4f2292..eb5bbfeccb66 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -39,7 +39,6 @@ extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
extern int check_tsc_unstable(void);
extern void mark_tsc_async_resets(char *reason);
-extern unsigned long native_calibrate_cpu(void);
extern unsigned long native_calibrate_cpu_early(void);
extern unsigned long native_calibrate_tsc(void);
extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 60586779b02c..02e416b87ac1 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -854,7 +854,7 @@ unsigned long native_calibrate_cpu_early(void)
/**
* native_calibrate_cpu - calibrate the cpu
*/
-unsigned long native_calibrate_cpu(void)
+static unsigned long native_calibrate_cpu(void)
{
unsigned long tsc_freq = native_calibrate_cpu_early();

@@ -1374,13 +1374,19 @@ static int __init init_tsc_clocksource(void)
*/
device_initcall(init_tsc_clocksource);

-static bool __init determine_cpu_tsc_frequencies(void)
+static bool __init determine_cpu_tsc_frequencies(bool early)
{
/* Make sure that cpu and tsc are not already calibrated */
WARN_ON(cpu_khz || tsc_khz);

- cpu_khz = x86_platform.calibrate_cpu();
- tsc_khz = x86_platform.calibrate_tsc();
+ if (early) {
+ cpu_khz = x86_platform.calibrate_cpu();
+ tsc_khz = x86_platform.calibrate_tsc();
+ } else {
+ /* We should not be here with non-native cpu calibration */
+ WARN_ON(x86_platform.calibrate_cpu != native_calibrate_cpu);
+ cpu_khz = pit_hpet_ptimer_calibrate_cpu();
+ }

/*
* Trust non-zero tsc_khz as authorative,
@@ -1419,7 +1425,7 @@ void __init tsc_early_init(void)
{
if (!boot_cpu_has(X86_FEATURE_TSC))
return;
- if (!determine_cpu_tsc_frequencies())
+ if (!determine_cpu_tsc_frequencies(true))
return;
loops_per_jiffy = get_loops_per_jiffy();

@@ -1431,6 +1437,13 @@ void __init tsc_early_init(void)

void __init tsc_init(void)
{
+ /*
+ * native_calibrate_cpu_early can only calibrate using methods that are
+ * available early in boot.
+ */
+ if (x86_platform.calibrate_cpu == native_calibrate_cpu_early)
+ x86_platform.calibrate_cpu = native_calibrate_cpu;
+
if (!boot_cpu_has(X86_FEATURE_TSC)) {
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
return;
@@ -1438,7 +1451,7 @@ void __init tsc_init(void)

if (!tsc_khz) {
/* We failed to determine frequencies earlier, try again */
- if (!determine_cpu_tsc_frequencies()) {
+ if (!determine_cpu_tsc_frequencies(false)) {
mark_tsc_unstable("could not calculate TSC khz");
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
return;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 3ab867603e81..2792b5573818 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -109,7 +109,7 @@ struct x86_cpuinit_ops x86_cpuinit = {
static void default_nmi_init(void) { };

struct x86_platform_ops x86_platform __ro_after_init = {
- .calibrate_cpu = native_calibrate_cpu,
+ .calibrate_cpu = native_calibrate_cpu_early,
.calibrate_tsc = native_calibrate_tsc,
.get_wallclock = mach_get_cmos_time,
.set_wallclock = mach_set_rtc_mmss,
--
2.18.0
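
The early/late split described in this patch can be modelled in userspace
as a function pointer that starts out pointing at the early routine and is
swapped later. This is a rough sketch only: every sketch_* name is
hypothetical, and the returned frequencies are made-up values, not anything
the kernel reports.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long (*calibrate_fn)(void);

/* Early methods: in this sketch they find nothing and return 0. */
unsigned long sketch_calibrate_early(void)
{
	return 0;
}

/* Methods only usable later in boot; the value is made up. */
unsigned long sketch_calibrate_late_only(void)
{
	return 2000000;
}

/* Full calibration: try the early methods first, then the late ones. */
unsigned long sketch_calibrate_full(void)
{
	unsigned long khz = sketch_calibrate_early();

	return khz ? khz : sketch_calibrate_late_only();
}

struct sketch_platform_ops {
	calibrate_fn calibrate_cpu;
};

/* The hook starts out pointing at the early routine. */
struct sketch_platform_ops sketch_platform = {
	.calibrate_cpu = sketch_calibrate_early,
};

unsigned long sketch_cpu_khz;

void sketch_early_init(void)
{
	assert(sketch_platform.calibrate_cpu != NULL);
	sketch_cpu_khz = sketch_platform.calibrate_cpu();
}

void sketch_late_init(void)
{
	/* Swap the hook whenever it is still the early one. */
	if (sketch_platform.calibrate_cpu == sketch_calibrate_early)
		sketch_platform.calibrate_cpu = sketch_calibrate_full;

	/* Only rerun the methods that were not tried earlier. */
	if (!sketch_cpu_khz)
		sketch_cpu_khz = sketch_calibrate_late_only();
}
```

The pointer comparison before the swap matters: a hypervisor that installed
its own calibrate_cpu hook keeps it, and only the native early hook is
upgraded.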


2018-07-19 21:00:06

by Pavel Tatashin

Subject: [PATCH v15 21/26] x86/tsc: use tsc early

Get timestamps and a high-resolution clock available to us as early as
possible.

native_sched_clock() outputs time based either on the TSC, after tsc_init()
is called later in boot, or on jiffies, once clock interrupts are enabled,
which also happens later in boot.

On the other hand, the TSC frequency is known from as early as when
tsc_early_init() is called.

Use the early tsc calibration to output timestamps early.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/tsc.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 7ea0718a4c75..9277ae9b68b3 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1408,6 +1408,7 @@ void __init tsc_early_init(void)
/* Sanitize TSC ADJUST before cyc2ns gets initialized */
tsc_store_and_check_tsc_adjust(true);
cyc2ns_init_boot_cpu();
+ static_branch_enable(&__use_tsc);
}

void __init tsc_init(void)
--
2.18.0
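
The effect of enabling __use_tsc early can be pictured with a small
userspace model of a scheduler clock that falls back to jiffies until the
TSC is usable. This is a sketch under stated assumptions: the sketch_*
names, the plain bool standing in for the static key, and the HZ value of
1000 are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_HZ		1000ull		/* assumed tick rate */
#define SKETCH_NSEC_PER_SEC	1000000000ull

bool sketch_use_tsc;		/* plain flag standing in for the static key */
uint64_t sketch_tsc_khz;	/* set once calibration has succeeded */

/*
 * Return nanoseconds from the best source available so far.
 * The naive tsc * 1000000 multiplication can overflow for very large
 * cycle counts; that is acceptable for a sketch only.
 */
uint64_t sketch_sched_clock(uint64_t tsc, uint64_t jiffies)
{
	if (sketch_use_tsc) {
		assert(sketch_tsc_khz != 0);
		return tsc * 1000000ull / sketch_tsc_khz;
	}

	/* Coarse fallback: one tick is 1/HZ seconds. */
	return jiffies * (SKETCH_NSEC_PER_SEC / SKETCH_HZ);
}
```

Before the flag is set, timestamps advance only in 1 ms steps, and only
once the tick is running. Setting it as soon as the frequency is known
gives nanosecond-granular timestamps from that point on.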


2018-07-19 21:00:13

by Pavel Tatashin

Subject: [PATCH v15 09/26] x86: initialize static branching early

Static branching is useful for hot-patching branches that are used in hot
paths but are infrequently changed.

The x86 clock framework is one example: it uses static branches to set up
the best clock during boot and never changes it again.

Since we plan to enable the clock early, we need static branching
functionality early as well.

Static branching requires patching nop instructions, so
arch_init_ideal_nops() must be called prior to jump_label_init().

Here we do all the necessary steps to call arch_init_ideal_nops()
after early_cpu_init().

Signed-off-by: Pavel Tatashin <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 13 +++++++-----
arch/x86/kernel/cpu/common.c | 38 +++++++++++++++++++-----------------
arch/x86/kernel/setup.c | 4 ++--
3 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 38915fbfae73..b732438c1a1e 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -232,8 +232,6 @@ static void init_amd_k7(struct cpuinfo_x86 *c)
}
}

- set_cpu_cap(c, X86_FEATURE_K7);
-
/* calling is from identify_secondary_cpu() ? */
if (!c->cpu_index)
return;
@@ -617,6 +615,14 @@ static void early_init_amd(struct cpuinfo_x86 *c)

early_init_amd_mc(c);

+#ifdef CONFIG_X86_32
+ if (c->x86 == 6)
+ set_cpu_cap(c, X86_FEATURE_K7);
+#endif
+
+ if (c->x86 >= 0xf)
+ set_cpu_cap(c, X86_FEATURE_K8);
+
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);

/*
@@ -863,9 +869,6 @@ static void init_amd(struct cpuinfo_x86 *c)

init_amd_cacheinfo(c);

- if (c->x86 >= 0xf)
- set_cpu_cap(c, X86_FEATURE_K8);
-
if (cpu_has(c, X86_FEATURE_XMM2)) {
unsigned long long val;
int ret;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index eb4cb3efd20e..71281ac43b15 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1015,6 +1015,24 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
}

+/*
+ * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
+ * unfortunately, that's not true in practice because of early VIA
+ * chips and (more importantly) broken virtualizers that are not easy
+ * to detect. In the latter case it doesn't even *fail* reliably, so
+ * probing for it doesn't even work. Disable it completely on 32-bit
+ * unless we can find a reliable way to detect all the broken cases.
+ * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
+ */
+static void detect_nopl(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_32
+ clear_cpu_cap(c, X86_FEATURE_NOPL);
+#else
+ set_cpu_cap(c, X86_FEATURE_NOPL);
+#endif
+}
+
/*
* Do minimum CPU detection early.
* Fields really needed: vendor, cpuid_level, family, model, mask,
@@ -1089,6 +1107,8 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
*/
if (!pgtable_l5_enabled())
setup_clear_cpu_cap(X86_FEATURE_LA57);
+
+ detect_nopl(c);
}

void __init early_cpu_init(void)
@@ -1124,24 +1144,6 @@ void __init early_cpu_init(void)
early_identify_cpu(&boot_cpu_data);
}

-/*
- * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
- * unfortunately, that's not true in practice because of early VIA
- * chips and (more importantly) broken virtualizers that are not easy
- * to detect. In the latter case it doesn't even *fail* reliably, so
- * probing for it doesn't even work. Disable it completely on 32-bit
- * unless we can find a reliable way to detect all the broken cases.
- * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
- */
-static void detect_nopl(struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_X86_32
- clear_cpu_cap(c, X86_FEATURE_NOPL);
-#else
- set_cpu_cap(c, X86_FEATURE_NOPL);
-#endif
-}
-
static void detect_null_seg_behavior(struct cpuinfo_x86 *c)
{
#ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index da1dbd99cb6e..7490de925a81 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -866,6 +866,8 @@ void __init setup_arch(char **cmdline_p)

idt_setup_early_traps();
early_cpu_init();
+ arch_init_ideal_nops();
+ jump_label_init();
early_ioremap_init();

setup_olpc_ofw_pgd();
@@ -1268,8 +1270,6 @@ void __init setup_arch(char **cmdline_p)

mcheck_init();

- arch_init_ideal_nops();
-
register_refined_jiffies(CLOCK_TICK_RATE);

#ifdef CONFIG_EFI
--
2.18.0


2018-07-19 21:00:21

by Pavel Tatashin

Subject: [PATCH v15 08/26] x86: text_poke() may access uninitialized struct pages

It is supposed to be safe to modify static branches after jump_label_init().
But, because static key modifying code eventually calls text_poke(), we
may end up accessing struct pages that have not been initialized.

Here is how to quickly reproduce the problem. Insert code like this
into init/main.c:

| +static DEFINE_STATIC_KEY_FALSE(__test);
| asmlinkage __visible void __init start_kernel(void)
| {
| char *command_line;
|@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
| vfs_caches_init_early();
| sort_main_extable();
| trap_init();
|+ {
|+ static_branch_enable(&__test);
|+ WARN_ON(!static_branch_likely(&__test));
|+ }
| mm_init();

The following warning shows up:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc1_pt_t1 #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.11.0-20171110_100015-anatol 04/01/2014
RIP: 0010:text_poke+0x20d/0x230
Code: 0f 0b 4c 89 e2 4c 89 ee 4c 89 f7 e8 7d 4b 9b 00 31 d2 31 f6 bf 86 02
00 00 48 8b 05 95 8e 24 01 e8 78 18 d8 00 e9 55 ff ff ff <0f> 0b e9 54 fe
ff ff 48 8b 05 75 a8 38 01 e9 64 fe ff ff 48 8b 1d
RSP: 0000:ffffffff94e03e30 EFLAGS: 00010046
RAX: 0100000000000000 RBX: fffff7b2c011f300 RCX: ffffffff94fcccf4
RDX: 0000000000000001 RSI: ffffffff94e03e77 RDI: ffffffff94fcccef
RBP: ffffffff94fcccef R08: 00000000fffffe00 R09: 00000000000000a0
R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000001
R13: ffffffff94e03e77 R14: ffffffff94fcdcef R15: fffff7b2c0000000
FS: 0000000000000000(0000) GS:ffff9adc87c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff9adc8499d000 CR3: 000000000460a001 CR4: 00000000000606b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? start_kernel+0x23e/0x4c8
? start_kernel+0x23f/0x4c8
? text_poke_bp+0x50/0xda
? arch_jump_label_transform+0x89/0xe0
? __jump_label_update+0x78/0xb0
? static_key_enable_cpuslocked+0x4d/0x80
? static_key_enable+0x11/0x20
? start_kernel+0x23e/0x4c8
? secondary_startup_64+0xa5/0xb0
---[ end trace abdc99c031b8a90a ]---

If the code above is moved after mm_init(), no warning is shown, as struct
pages are initialized during handover from memblock.

Use text_poke_early() in static branching until early boot IRQs are
enabled, at which time switch to text_poke(). Also, ensure text_poke() is
never invoked when uninitialized memory access may happen by adding a
BUG_ON(!after_bootmem) assertion.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 7 +++++++
arch/x86/kernel/jump_label.c | 11 +++++++----
3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 2ecd34e2d46c..e85ff65c43c3 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -37,5 +37,6 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern int after_bootmem;

#endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index a481763a3776..014f214da581 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -668,6 +668,7 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
local_irq_save(flags);
memcpy(addr, opcode, len);
local_irq_restore(flags);
+ sync_core();
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */
return addr;
@@ -693,6 +694,12 @@ void *text_poke(void *addr, const void *opcode, size_t len)
struct page *pages[2];
int i;

+ /*
+ * While boot memory allocator is running we cannot use struct
+ * pages as they are not yet initialized.
+ */
+ BUG_ON(!after_bootmem);
+
if (!core_kernel_text((unsigned long)addr)) {
pages[0] = vmalloc_to_page(addr);
pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e56c95be2808..eeea935e9bb5 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -37,15 +37,18 @@ static void bug_at(unsigned char *ip, int line)
BUG();
}

-static void __jump_label_transform(struct jump_entry *entry,
- enum jump_label_type type,
- void *(*poker)(void *, const void *, size_t),
- int init)
+static void __ref __jump_label_transform(struct jump_entry *entry,
+ enum jump_label_type type,
+ void *(*poker)(void *, const void *, size_t),
+ int init)
{
union jump_code_union code;
const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];

+ if (early_boot_irqs_disabled)
+ poker = text_poke_early;
+
if (type == JUMP_LABEL_JMP) {
if (init) {
/*
--
2.18.0
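
The poker selection added to __jump_label_transform() can be modelled in
userspace as follows. This is a sketch only: the sketch_* names are
hypothetical, the two flags merely stand in for early_boot_irqs_disabled
and after_bootmem, and both pokers are reduced to a plain memcpy().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef void *(*poker_fn)(void *addr, const void *opcode, size_t len);

bool sketch_early_boot = true;		/* models early_boot_irqs_disabled */
bool sketch_after_bootmem = false;	/* models after_bootmem */

/* Early variant: a direct copy that needs no page metadata. */
void *sketch_poke_early(void *addr, const void *opcode, size_t len)
{
	return memcpy(addr, opcode, len);
}

/* Late variant: refuses to run before page metadata exists. */
void *sketch_poke(void *addr, const void *opcode, size_t len)
{
	assert(sketch_after_bootmem);
	return memcpy(addr, opcode, len);
}

/* Patch one site, overriding the requested poker during early boot. */
void sketch_transform(void *site, const void *opcode, size_t len,
		      poker_fn poker)
{
	if (sketch_early_boot)
		poker = sketch_poke_early;

	poker(site, opcode, len);
}
```

Callers keep passing the late poker unconditionally. The override inside
the transform routine is what makes the call safe before the assertion in
the late variant can be satisfied.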


2018-07-19 21:00:51

by Pavel Tatashin

Subject: [PATCH v15 05/26] x86/kvmclock: Mark variables __initdata and __ro_after_init

From: Thomas Gleixner <[email protected]>

The kvmclock parameter is init data and the other variables are not
modified after init.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/kernel/kvmclock.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 4afb03e49a4f..78aec160f5e0 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -32,10 +32,10 @@
#include <asm/reboot.h>
#include <asm/kvmclock.h>

-static int kvmclock __ro_after_init = 1;
-static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
-static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
-static u64 kvm_sched_clock_offset;
+static int kvmclock __initdata = 1;
+static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
+static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
+static u64 kvm_sched_clock_offset __ro_after_init;

static int __init parse_no_kvmclock(char *arg)
{
@@ -50,7 +50,7 @@ early_param("no-kvmclock", parse_no_kvmclock);
static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);

/* The hypervisor will put information about time periodically here */
-static struct pvclock_vsyscall_time_info *hv_clock;
+static struct pvclock_vsyscall_time_info *hv_clock __ro_after_init;
static struct pvclock_wall_clock wall_clock;

/*
--
2.18.0


2018-07-19 21:01:07

by Pavel Tatashin

Subject: [PATCH v15 01/26] x86/kvmclock: Remove memblock dependency

KVM clock is initialized later compared to other hypervisor clocks because
it has a dependency on the memblock allocator.

Bring it in line with other hypervisors by using memory from the BSS
instead of allocating it.

The benefits:

- Remove ifdef from common code
- Earlier availability of the clock
- Remove dependency on memblock, and reduce code

The downside:

- Static allocation of the per-CPU data structures, sized NR_CPUS * 64 bytes.
This will be addressed in follow-up patches.

[ tglx: Split out from larger series ]

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/kvmclock.c | 66 +++++++-------------------------------
arch/x86/kernel/setup.c | 4 ---
3 files changed, 12 insertions(+), 59 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5b2300b818af..c65c232d3ddd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -628,6 +628,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.name = "KVM",
.detect = kvm_detect,
.type = X86_HYPER_KVM,
+ .init.init_platform = kvmclock_init,
.init.guest_late_init = kvm_guest_init,
.init.x2apic_available = kvm_para_available,
};
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 3b8e7c13c614..1f6ac5aaa904 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,9 +23,9 @@
#include <asm/apic.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
-#include <linux/memblock.h>
#include <linux/sched.h>
#include <linux/sched/clock.h>
+#include <linux/mm.h>

#include <asm/mem_encrypt.h>
#include <asm/x86_init.h>
@@ -44,6 +44,13 @@ static int parse_no_kvmclock(char *arg)
}
early_param("no-kvmclock", parse_no_kvmclock);

+/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
+#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define WALL_CLOCK_SIZE (sizeof(struct pvclock_wall_clock))
+
+static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+
/* The hypervisor will put information about time periodically here */
static struct pvclock_vsyscall_time_info *hv_clock;
static struct pvclock_wall_clock *wall_clock;
@@ -245,43 +252,12 @@ static void kvm_shutdown(void)
native_machine_shutdown();
}

-static phys_addr_t __init kvm_memblock_alloc(phys_addr_t size,
- phys_addr_t align)
-{
- phys_addr_t mem;
-
- mem = memblock_alloc(size, align);
- if (!mem)
- return 0;
-
- if (sev_active()) {
- if (early_set_memory_decrypted((unsigned long)__va(mem), size))
- goto e_free;
- }
-
- return mem;
-e_free:
- memblock_free(mem, size);
- return 0;
-}
-
-static void __init kvm_memblock_free(phys_addr_t addr, phys_addr_t size)
-{
- if (sev_active())
- early_set_memory_encrypted((unsigned long)__va(addr), size);
-
- memblock_free(addr, size);
-}
-
void __init kvmclock_init(void)
{
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned long mem, mem_wall_clock;
- int size, cpu, wall_clock_size;
+ int cpu;
u8 flags;

- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
if (!kvm_para_available())
return;

@@ -291,28 +267,11 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;

- wall_clock_size = PAGE_ALIGN(sizeof(struct pvclock_wall_clock));
- mem_wall_clock = kvm_memblock_alloc(wall_clock_size, PAGE_SIZE);
- if (!mem_wall_clock)
- return;
-
- wall_clock = __va(mem_wall_clock);
- memset(wall_clock, 0, wall_clock_size);
-
- mem = kvm_memblock_alloc(size, PAGE_SIZE);
- if (!mem) {
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
- wall_clock = NULL;
- return;
- }
-
- hv_clock = __va(mem);
- memset(hv_clock, 0, size);
+ wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
+ hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;

if (kvm_register_clock("primary cpu clock")) {
hv_clock = NULL;
- kvm_memblock_free(mem, size);
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
wall_clock = NULL;
return;
}
@@ -357,13 +316,10 @@ int __init kvm_setup_vsyscall_timeinfo(void)
int cpu;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned int size;

if (!hv_clock)
return 0;

- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
cpu = get_cpu();

vcpu_time = &hv_clock[cpu].pvti;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..da1dbd99cb6e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1197,10 +1197,6 @@ void __init setup_arch(char **cmdline_p)

memblock_find_dma_reserve();

-#ifdef CONFIG_KVM_GUEST
- kvmclock_init();
-#endif
-
tsc_early_delay_calibrate();
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();
--
2.18.0
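
The move from memblock allocation to BSS storage boils down to a statically
sized, page-aligned byte array. Below is a userspace sketch of that
pattern. The sketch_* and SKETCH_* names, the CPU count of 64 and the page
size of 4096 are assumptions for illustration, and the aligned attribute is
a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096UL	/* assumed page size */
#define SKETCH_PAGE_ALIGN(x) \
	(((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))
#define SKETCH_NR_CPUS		64	/* assumed CPU count */

/* 64-byte per-CPU record, matching the size quoted in the commit message. */
struct sketch_time_info {
	uint8_t pad[64];
};

#define SKETCH_HV_CLOCK_SIZE \
	(sizeof(struct sketch_time_info) * SKETCH_NR_CPUS)

/* Lives in BSS: zero-initialized, no allocator and no error path needed. */
static uint8_t sketch_hv_clock_mem[SKETCH_PAGE_ALIGN(SKETCH_HV_CLOCK_SIZE)]
	__attribute__((aligned(4096)));

struct sketch_time_info *sketch_hv_clock_init(void)
{
	/* The buffer must start on a page boundary. */
	assert(((uintptr_t)sketch_hv_clock_mem % SKETCH_PAGE_SIZE) == 0);
	return (struct sketch_time_info *)sketch_hv_clock_mem;
}
```

The trade-off is the one named under "The downside": the array is sized for
the compile-time maximum number of CPUs whether or not they exist at run
time.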


2018-07-19 21:01:13

by Pavel Tatashin

Subject: [PATCH v15 06/26] x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock

From: Thomas Gleixner <[email protected]>

There is no point in having this in the kvm code itself and calling it
from there. This can be called from an initcall, and the parameter is
cleared when the hypervisor is not KVM.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_guest.h | 7 -----
arch/x86/kernel/kvm.c | 13 ----------
arch/x86/kernel/kvmclock.c | 44 ++++++++++++++++++++------------
3 files changed, 27 insertions(+), 37 deletions(-)
delete mode 100644 arch/x86/include/asm/kvm_guest.h

diff --git a/arch/x86/include/asm/kvm_guest.h b/arch/x86/include/asm/kvm_guest.h
deleted file mode 100644
index 46185263d9c2..000000000000
--- a/arch/x86/include/asm/kvm_guest.h
+++ /dev/null
@@ -1,7 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_X86_KVM_GUEST_H
-#define _ASM_X86_KVM_GUEST_H
-
-int kvm_setup_vsyscall_timeinfo(void);
-
-#endif /* _ASM_X86_KVM_GUEST_H */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c65c232d3ddd..a560750cc76f 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -45,7 +45,6 @@
#include <asm/apic.h>
#include <asm/apicdef.h>
#include <asm/hypervisor.h>
-#include <asm/kvm_guest.h>

static int kvmapf = 1;

@@ -66,15 +65,6 @@ static int __init parse_no_stealacc(char *arg)

early_param("no-steal-acc", parse_no_stealacc);

-static int kvmclock_vsyscall = 1;
-static int __init parse_no_kvmclock_vsyscall(char *arg)
-{
- kvmclock_vsyscall = 0;
- return 0;
-}
-
-early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
-
static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
static DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64);
static int has_steal_clock = 0;
@@ -560,9 +550,6 @@ static void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);

- if (kvmclock_vsyscall)
- kvm_setup_vsyscall_timeinfo();
-
#ifdef CONFIG_SMP
smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 78aec160f5e0..7d690d2238f8 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -27,12 +27,14 @@
#include <linux/sched/clock.h>
#include <linux/mm.h>

+#include <asm/hypervisor.h>
#include <asm/mem_encrypt.h>
#include <asm/x86_init.h>
#include <asm/reboot.h>
#include <asm/kvmclock.h>

static int kvmclock __initdata = 1;
+static int kvmclock_vsyscall __initdata = 1;
static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
static u64 kvm_sched_clock_offset __ro_after_init;
@@ -44,6 +46,13 @@ static int __init parse_no_kvmclock(char *arg)
}
early_param("no-kvmclock", parse_no_kvmclock);

+static int __init parse_no_kvmclock_vsyscall(char *arg)
+{
+ kvmclock_vsyscall = 0;
+ return 0;
+}
+early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
+
/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)

@@ -228,6 +237,24 @@ static void kvm_shutdown(void)
native_machine_shutdown();
}

+static int __init kvm_setup_vsyscall_timeinfo(void)
+{
+#ifdef CONFIG_X86_64
+ u8 flags;
+
+ if (!hv_clock || !kvmclock_vsyscall)
+ return 0;
+
+ flags = pvclock_read_flags(&hv_clock[0].pvti);
+ if (!(flags & PVCLOCK_TSC_STABLE_BIT))
+ return 1;
+
+ kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
+#endif
+ return 0;
+}
+early_initcall(kvm_setup_vsyscall_timeinfo);
+
void __init kvmclock_init(void)
{
u8 flags;
@@ -272,20 +299,3 @@ void __init kvmclock_init(void)
clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
pv_info.name = "KVM";
}
-
-int __init kvm_setup_vsyscall_timeinfo(void)
-{
-#ifdef CONFIG_X86_64
- u8 flags;
-
- if (!hv_clock)
- return 0;
-
- flags = pvclock_read_flags(&hv_clock[0].pvti);
- if (!(flags & PVCLOCK_TSC_STABLE_BIT))
- return 1;
-
- kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
-#endif
- return 0;
-}
--
2.18.0


2018-07-19 21:01:16

by Pavel Tatashin

Subject: [PATCH v15 04/26] x86/kvmclock: Cleanup the code

From: Thomas Gleixner <[email protected]>

- Clean up the MSR write for wall clock. The type casts to (int) are sloppy
because the wrmsr parameters are u32, and aside from that wrmsrl() already
provides the high/low split for free.

- Remove the pointless get_cpu()/put_cpu() dance from various
functions. Either they are called during early init where CPU is
guaranteed to be 0, or they are already called from non-preemptible
context where smp_processor_id() can be used safely.

- Simplify the convoluted check for kvmclock in the init function.

- Mark the parameter parsing function __init. No point in keeping it
around.

- Convert to pr_info()

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/kernel/kvmclock.c | 72 ++++++++++++--------------------------
1 file changed, 22 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f0a0aef5e9fa..4afb03e49a4f 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -37,7 +37,7 @@ static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
static u64 kvm_sched_clock_offset;

-static int parse_no_kvmclock(char *arg)
+static int __init parse_no_kvmclock(char *arg)
{
kvmclock = 0;
return 0;
@@ -61,13 +61,9 @@ static struct pvclock_wall_clock wall_clock;
static void kvm_get_wallclock(struct timespec64 *now)
{
struct pvclock_vcpu_time_info *vcpu_time;
- int low, high;
int cpu;

- low = (int)slow_virt_to_phys(&wall_clock);
- high = ((u64)slow_virt_to_phys(&wall_clock) >> 32);
-
- native_write_msr(msr_kvm_wall_clock, low, high);
+ wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));

cpu = get_cpu();

@@ -117,11 +113,11 @@ static inline void kvm_sched_clock_init(bool stable)
kvm_sched_clock_offset = kvm_clock_read();
pv_time_ops.sched_clock = kvm_sched_clock_read;

- printk(KERN_INFO "kvm-clock: using sched offset of %llu cycles\n",
- kvm_sched_clock_offset);
+ pr_info("kvm-clock: using sched offset of %llu cycles",
+ kvm_sched_clock_offset);

BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
- sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
+ sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
}

/*
@@ -135,16 +131,8 @@ static inline void kvm_sched_clock_init(bool stable)
*/
static unsigned long kvm_get_tsc_khz(void)
{
- struct pvclock_vcpu_time_info *src;
- int cpu;
- unsigned long tsc_khz;
-
- cpu = get_cpu();
- src = &hv_clock[cpu].pvti;
- tsc_khz = pvclock_tsc_khz(src);
- put_cpu();
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
- return tsc_khz;
+ return pvclock_tsc_khz(&hv_clock[0].pvti);
}

static void kvm_get_preset_lpj(void)
@@ -161,29 +149,27 @@ static void kvm_get_preset_lpj(void)

bool kvm_check_and_clear_guest_paused(void)
{
- bool ret = false;
struct pvclock_vcpu_time_info *src;
- int cpu = smp_processor_id();
+ bool ret = false;

if (!hv_clock)
return ret;

- src = &hv_clock[cpu].pvti;
+ src = &hv_clock[smp_processor_id()].pvti;
if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
src->flags &= ~PVCLOCK_GUEST_STOPPED;
pvclock_touch_watchdogs();
ret = true;
}
-
return ret;
}

struct clocksource kvm_clock = {
- .name = "kvm-clock",
- .read = kvm_clock_get_cycles,
- .rating = 400,
- .mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .name = "kvm-clock",
+ .read = kvm_clock_get_cycles,
+ .rating = 400,
+ .mask = CLOCKSOURCE_MASK(64),
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS,
};
EXPORT_SYMBOL_GPL(kvm_clock);

@@ -199,7 +185,7 @@ static void kvm_register_clock(char *txt)
src = &hv_clock[cpu].pvti;
pa = slow_virt_to_phys(src) | 0x01ULL;
wrmsrl(msr_kvm_system_time, pa);
- pr_info("kvm-clock: cpu %d, msr %llx, %s\n", cpu, pa, txt);
+ pr_info("kvm-clock: cpu %d, msr %llx, %s", cpu, pa, txt);
}

static void kvm_save_sched_clock_state(void)
@@ -244,20 +230,19 @@ static void kvm_shutdown(void)

void __init kvmclock_init(void)
{
- struct pvclock_vcpu_time_info *vcpu_time;
- int cpu;
u8 flags;

- if (!kvm_para_available())
+ if (!kvm_para_available() || !kvmclock)
return;

- if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
+ if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
- } else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
+ } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
return;
+ }

- printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+ pr_info("kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);

hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
@@ -267,20 +252,15 @@ void __init kvmclock_init(void)
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);

- cpu = get_cpu();
- vcpu_time = &hv_clock[cpu].pvti;
- flags = pvclock_read_flags(vcpu_time);
-
+ flags = pvclock_read_flags(&hv_clock[0].pvti);
kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
- put_cpu();

x86_platform.calibrate_tsc = kvm_get_tsc_khz;
x86_platform.calibrate_cpu = kvm_get_tsc_khz;
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
- x86_cpuinit.early_percpu_clock_init =
- kvm_setup_secondary_clock;
+ x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
#endif
x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
@@ -296,20 +276,12 @@ void __init kvmclock_init(void)
int __init kvm_setup_vsyscall_timeinfo(void)
{
#ifdef CONFIG_X86_64
- int cpu;
u8 flags;
- struct pvclock_vcpu_time_info *vcpu_time;

if (!hv_clock)
return 0;

- cpu = get_cpu();
-
- vcpu_time = &hv_clock[cpu].pvti;
- flags = pvclock_read_flags(vcpu_time);
-
- put_cpu();
-
+ flags = pvclock_read_flags(&hv_clock[0].pvti);
if (!(flags & PVCLOCK_TSC_STABLE_BIT))
return 1;

--
2.18.0


2018-07-19 21:01:18

by Pavel Tatashin

Subject: [PATCH v15 13/26] x86/xen/time: output xen sched_clock time from 0

sched_clock() is expected to count from 0 when the system boots.
Add an offset, xen_sched_clock_offset (similar to how it is done for other
hypervisors, e.g. kvm_sched_clock_offset), so that sched_clock() counts from 0
when time is first initialized.

Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/xen/time.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 53bb7a8d10b5..c84f1e039d84 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -31,6 +31,8 @@
/* Xen may fire a timer up to this many ns early */
#define TIMER_SLOP 100000

+static u64 xen_sched_clock_offset __read_mostly;
+
/* Get the TSC speed from Xen */
static unsigned long xen_tsc_khz(void)
{
@@ -57,6 +59,11 @@ static u64 xen_clocksource_get_cycles(struct clocksource *cs)
return xen_clocksource_read();
}

+static u64 xen_sched_clock(void)
+{
+ return xen_clocksource_read() - xen_sched_clock_offset;
+}
+
static void xen_read_wallclock(struct timespec64 *ts)
{
struct shared_info *s = HYPERVISOR_shared_info;
@@ -367,7 +374,7 @@ void xen_timer_resume(void)
}

static const struct pv_time_ops xen_time_ops __initconst = {
- .sched_clock = xen_clocksource_read,
+ .sched_clock = xen_sched_clock,
.steal_clock = xen_steal_clock,
};

@@ -505,6 +512,7 @@ static void __init xen_time_init(void)

void __init xen_init_time_ops(void)
{
+ xen_sched_clock_offset = xen_clocksource_read();
pv_time_ops = xen_time_ops;

x86_init.timers.timer_init = xen_time_init;
@@ -546,6 +554,7 @@ void __init xen_hvm_init_time_ops(void)
return;
}

+ xen_sched_clock_offset = xen_clocksource_read();
pv_time_ops = xen_time_ops;
x86_init.timers.setup_percpu_clockev = xen_time_init;
x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
--
2.18.0
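
The offset technique described in this patch can be sketched in plain userspace C. This is only an illustration under stated assumptions: fake_raw_ns and raw_clock_read() are hypothetical stand-ins for xen_clocksource_read(), and the other names are invented for the example rather than taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a free-running hypervisor clock, in ns. */
static uint64_t fake_raw_ns;

static uint64_t raw_clock_read(void)
{
	return fake_raw_ns;
}

/* Raw clock value captured once, when the time ops are installed. */
static uint64_t sched_clock_offset;

static void zero_based_clock_init(void)
{
	sched_clock_offset = raw_clock_read();
}

/* Reports the time elapsed since zero_based_clock_init(). */
static uint64_t zero_based_clock_read(void)
{
	uint64_t now = raw_clock_read();

	/* The raw clock is assumed monotonic, so this cannot underflow. */
	assert(now >= sched_clock_offset);
	return now - sched_clock_offset;
}
```

Calling zero_based_clock_init() once at setup makes the first zero_based_clock_read() return 0 regardless of how long the underlying clock has already been running, which mirrors what the patch does in both xen_init_time_ops() and xen_hvm_init_time_ops().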


2018-07-19 21:01:21

by Pavel Tatashin

Subject: [PATCH v15 03/26] x86/kvmclock: Decrapify kvm_register_clock()

From: Thomas Gleixner <[email protected]>

The return value is pointless because the wrmsr cannot fail if
KVM_FEATURE_CLOCKSOURCE or KVM_FEATURE_CLOCKSOURCE2 are set.

kvm_register_clock() is only called locally so wants to be static.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 1 -
arch/x86/kernel/kvmclock.c | 33 ++++++++++-----------------------
2 files changed, 10 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 3aea2658323a..4c723632c036 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -7,7 +7,6 @@
#include <uapi/asm/kvm_para.h>

extern void kvmclock_init(void);
-extern int kvm_register_clock(char *txt);

#ifdef CONFIG_KVM_GUEST
bool kvm_check_and_clear_guest_paused(void);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index a995d7d7164c..f0a0aef5e9fa 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -187,23 +187,19 @@ struct clocksource kvm_clock = {
};
EXPORT_SYMBOL_GPL(kvm_clock);

-int kvm_register_clock(char *txt)
+static void kvm_register_clock(char *txt)
{
- int cpu = smp_processor_id();
- int low, high, ret;
struct pvclock_vcpu_time_info *src;
+ int cpu = smp_processor_id();
+ u64 pa;

if (!hv_clock)
- return 0;
+ return;

src = &hv_clock[cpu].pvti;
- low = (int)slow_virt_to_phys(src) | 1;
- high = ((u64)slow_virt_to_phys(src) >> 32);
- ret = native_write_msr_safe(msr_kvm_system_time, low, high);
- printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
- cpu, high, low, txt);
-
- return ret;
+ pa = slow_virt_to_phys(src) | 0x01ULL;
+ wrmsrl(msr_kvm_system_time, pa);
+ pr_info("kvm-clock: cpu %d, msr %llx, %s\n", cpu, pa, txt);
}

static void kvm_save_sched_clock_state(void)
@@ -218,11 +214,7 @@ static void kvm_restore_sched_clock_state(void)
#ifdef CONFIG_X86_LOCAL_APIC
static void kvm_setup_secondary_clock(void)
{
- /*
- * Now that the first cpu already had this clocksource initialized,
- * we shouldn't fail.
- */
- WARN_ON(kvm_register_clock("secondary cpu clock"));
+ kvm_register_clock("secondary cpu clock");
}
#endif

@@ -265,16 +257,11 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;

- hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
-
- if (kvm_register_clock("primary cpu clock")) {
- hv_clock = NULL;
- return;
- }
-
printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);

+ hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
+ kvm_register_clock("primary cpu clock");
pvclock_set_pvti_cpu0_va(hv_clock);

if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
--
2.18.0
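
As a rough illustration of why the hand-rolled high/low split can be replaced by a single 64-bit value, the following userspace sketch compares the two ways of building the MSR payload. The struct and function names are hypothetical, split_u64() only models the split that a wrmsrl()-style helper is assumed to perform, and no MSR is actually written.

```c
#include <assert.h>
#include <stdint.h>

struct msr_halves {
	uint32_t low;
	uint32_t high;
};

/* Old style: split the address by hand and set the enable bit in the low half. */
static struct msr_halves split_by_hand(uint64_t pa)
{
	struct msr_halves h;

	h.low = (uint32_t)pa | 1u;
	h.high = (uint32_t)(pa >> 32);
	return h;
}

/* New style: build one 64-bit value with the enable bit (bit 0) set. */
static uint64_t system_time_msr_value(uint64_t pa)
{
	/* Bit 0 doubles as the enable flag, so the address must leave it free. */
	assert((pa & 1) == 0);
	return pa | 0x01ULL;
}

/* Models the high/low split that a 64-bit MSR write helper does internally. */
static struct msr_halves split_u64(uint64_t val)
{
	struct msr_halves h = { (uint32_t)val, (uint32_t)(val >> 32) };

	return h;
}
```

For any address with bit 0 clear, split_by_hand(pa) and split_u64(system_time_msr_value(pa)) produce the same halves, so the 64-bit form loses nothing while avoiding the signed (int) casts of the original code.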


Subject: [tip:x86/timers] x86/kvmclock: Remove memblock dependency

Commit-ID: 368a540e0232ad446931f5a4e8a5e06f69f21343
Gitweb: https://git.kernel.org/tip/368a540e0232ad446931f5a4e8a5e06f69f21343
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:20 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:36 +0200

x86/kvmclock: Remove memblock dependency

KVM clock is initialized later compared to other hypervisor clocks because
it has a dependency on the memblock allocator.

Bring it in line with other hypervisors by using memory from the BSS
instead of allocating it.

The benefits:

- Remove ifdef from common code
- Earlier availability of the clock
- Remove dependency on memblock, and reduce code

The downside:

- Static allocation of the per cpu data structures, sized NR_CPUS * 64 bytes.
This will be addressed in follow-up patches.

[ tglx: Split out from larger series ]

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/kvmclock.c | 66 ++++++++--------------------------------------
arch/x86/kernel/setup.c | 4 ---
3 files changed, 12 insertions(+), 59 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5b2300b818af..c65c232d3ddd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -628,6 +628,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.name = "KVM",
.detect = kvm_detect,
.type = X86_HYPER_KVM,
+ .init.init_platform = kvmclock_init,
.init.guest_late_init = kvm_guest_init,
.init.x2apic_available = kvm_para_available,
};
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 3b8e7c13c614..1f6ac5aaa904 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,9 +23,9 @@
#include <asm/apic.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
-#include <linux/memblock.h>
#include <linux/sched.h>
#include <linux/sched/clock.h>
+#include <linux/mm.h>

#include <asm/mem_encrypt.h>
#include <asm/x86_init.h>
@@ -44,6 +44,13 @@ static int parse_no_kvmclock(char *arg)
}
early_param("no-kvmclock", parse_no_kvmclock);

+/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
+#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define WALL_CLOCK_SIZE (sizeof(struct pvclock_wall_clock))
+
+static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+
/* The hypervisor will put information about time periodically here */
static struct pvclock_vsyscall_time_info *hv_clock;
static struct pvclock_wall_clock *wall_clock;
@@ -245,43 +252,12 @@ static void kvm_shutdown(void)
native_machine_shutdown();
}

-static phys_addr_t __init kvm_memblock_alloc(phys_addr_t size,
- phys_addr_t align)
-{
- phys_addr_t mem;
-
- mem = memblock_alloc(size, align);
- if (!mem)
- return 0;
-
- if (sev_active()) {
- if (early_set_memory_decrypted((unsigned long)__va(mem), size))
- goto e_free;
- }
-
- return mem;
-e_free:
- memblock_free(mem, size);
- return 0;
-}
-
-static void __init kvm_memblock_free(phys_addr_t addr, phys_addr_t size)
-{
- if (sev_active())
- early_set_memory_encrypted((unsigned long)__va(addr), size);
-
- memblock_free(addr, size);
-}
-
void __init kvmclock_init(void)
{
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned long mem, mem_wall_clock;
- int size, cpu, wall_clock_size;
+ int cpu;
u8 flags;

- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
if (!kvm_para_available())
return;

@@ -291,28 +267,11 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;

- wall_clock_size = PAGE_ALIGN(sizeof(struct pvclock_wall_clock));
- mem_wall_clock = kvm_memblock_alloc(wall_clock_size, PAGE_SIZE);
- if (!mem_wall_clock)
- return;
-
- wall_clock = __va(mem_wall_clock);
- memset(wall_clock, 0, wall_clock_size);
-
- mem = kvm_memblock_alloc(size, PAGE_SIZE);
- if (!mem) {
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
- wall_clock = NULL;
- return;
- }
-
- hv_clock = __va(mem);
- memset(hv_clock, 0, size);
+ wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
+ hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;

if (kvm_register_clock("primary cpu clock")) {
hv_clock = NULL;
- kvm_memblock_free(mem, size);
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
wall_clock = NULL;
return;
}
@@ -357,13 +316,10 @@ int __init kvm_setup_vsyscall_timeinfo(void)
int cpu;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned int size;

if (!hv_clock)
return 0;

- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
cpu = get_cpu();

vcpu_time = &hv_clock[cpu].pvti;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..da1dbd99cb6e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1197,10 +1197,6 @@ void __init setup_arch(char **cmdline_p)

memblock_find_dma_reserve();

-#ifdef CONFIG_KVM_GUEST
- kvmclock_init();
-#endif
-
tsc_early_delay_calibrate();
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();
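
The core idea of this change, replacing an early allocation with page-aligned static storage, can be sketched in userspace C roughly as follows. All DEMO_* names, the 4096-byte page size and the CPU count of 64 are assumptions made for the example; only the 64-byte per-CPU element size comes from the commit message.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE	4096u	/* assumed page size */
#define DEMO_NR_CPUS	64u	/* assumed CPU limit */

/* Round x up to the next multiple of the (power-of-two) page size. */
#define DEMO_PAGE_ALIGN(x) \
	(((x) + DEMO_PAGE_SIZE - 1) & ~((size_t)DEMO_PAGE_SIZE - 1))

/* Stand-in for the 64-byte per-CPU time info structure. */
struct demo_time_info {
	uint8_t bytes[64];
};

#define DEMO_CLOCK_SIZE (sizeof(struct demo_time_info) * DEMO_NR_CPUS)

/* Static storage: zero-initialized, so no allocator or memset() is needed. */
static alignas(DEMO_PAGE_SIZE) uint8_t demo_clock_mem[DEMO_PAGE_ALIGN(DEMO_CLOCK_SIZE)];

static struct demo_time_info *demo_clock;

static void demo_clock_init(void)
{
	/* The buffer starts on a page boundary by construction. */
	assert(((uintptr_t)demo_clock_mem % DEMO_PAGE_SIZE) == 0);
	demo_clock = (struct demo_time_info *)demo_clock_mem;
}
```

Because the buffer exists from the moment the image is loaded, demo_clock_init() can run before any allocator is available, which is the property that lets kvmclock_init() move to the init_platform hook. The cost is that the storage is reserved even when it is never used.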

Subject: [tip:x86/timers] x86/kvmclock: Remove page size requirement from wall_clock

Commit-ID: 7ef363a39514ed8a6f2333fbae1875ac0953715a
Gitweb: https://git.kernel.org/tip/7ef363a39514ed8a6f2333fbae1875ac0953715a
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:21 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:36 +0200

x86/kvmclock: Remove page size requirement from wall_clock

There is no requirement for wall_clock data to be page aligned or page
sized.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/kernel/kvmclock.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 1f6ac5aaa904..a995d7d7164c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -46,14 +46,12 @@ early_param("no-kvmclock", parse_no_kvmclock);

/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
-#define WALL_CLOCK_SIZE (sizeof(struct pvclock_wall_clock))

static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
-static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);

/* The hypervisor will put information about time periodically here */
static struct pvclock_vsyscall_time_info *hv_clock;
-static struct pvclock_wall_clock *wall_clock;
+static struct pvclock_wall_clock wall_clock;

/*
* The wallclock is the time of day when we booted. Since then, some time may
@@ -66,15 +64,15 @@ static void kvm_get_wallclock(struct timespec64 *now)
int low, high;
int cpu;

- low = (int)slow_virt_to_phys(wall_clock);
- high = ((u64)slow_virt_to_phys(wall_clock) >> 32);
+ low = (int)slow_virt_to_phys(&wall_clock);
+ high = ((u64)slow_virt_to_phys(&wall_clock) >> 32);

native_write_msr(msr_kvm_wall_clock, low, high);

cpu = get_cpu();

vcpu_time = &hv_clock[cpu].pvti;
- pvclock_read_wallclock(wall_clock, vcpu_time, now);
+ pvclock_read_wallclock(&wall_clock, vcpu_time, now);

put_cpu();
}
@@ -267,12 +265,10 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;

- wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;

if (kvm_register_clock("primary cpu clock")) {
hv_clock = NULL;
- wall_clock = NULL;
return;
}


Subject: [tip:x86/timers] x86/kvmclock: Decrapify kvm_register_clock()

Commit-ID: 7a5ddc8fe0ea9518cd7fb6a929cac7d864c6f300
Gitweb: https://git.kernel.org/tip/7a5ddc8fe0ea9518cd7fb6a929cac7d864c6f300
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:22 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:36 +0200

x86/kvmclock: Decrapify kvm_register_clock()

The return value is pointless because the wrmsr cannot fail if
KVM_FEATURE_CLOCKSOURCE or KVM_FEATURE_CLOCKSOURCE2 are set.

kvm_register_clock() is only called locally so wants to be static.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/include/asm/kvm_para.h | 1 -
arch/x86/kernel/kvmclock.c | 33 ++++++++++-----------------------
2 files changed, 10 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 3aea2658323a..4c723632c036 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -7,7 +7,6 @@
#include <uapi/asm/kvm_para.h>

extern void kvmclock_init(void);
-extern int kvm_register_clock(char *txt);

#ifdef CONFIG_KVM_GUEST
bool kvm_check_and_clear_guest_paused(void);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index a995d7d7164c..f0a0aef5e9fa 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -187,23 +187,19 @@ struct clocksource kvm_clock = {
};
EXPORT_SYMBOL_GPL(kvm_clock);

-int kvm_register_clock(char *txt)
+static void kvm_register_clock(char *txt)
{
- int cpu = smp_processor_id();
- int low, high, ret;
struct pvclock_vcpu_time_info *src;
+ int cpu = smp_processor_id();
+ u64 pa;

if (!hv_clock)
- return 0;
+ return;

src = &hv_clock[cpu].pvti;
- low = (int)slow_virt_to_phys(src) | 1;
- high = ((u64)slow_virt_to_phys(src) >> 32);
- ret = native_write_msr_safe(msr_kvm_system_time, low, high);
- printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
- cpu, high, low, txt);
-
- return ret;
+ pa = slow_virt_to_phys(src) | 0x01ULL;
+ wrmsrl(msr_kvm_system_time, pa);
+ pr_info("kvm-clock: cpu %d, msr %llx, %s\n", cpu, pa, txt);
}

static void kvm_save_sched_clock_state(void)
@@ -218,11 +214,7 @@ static void kvm_restore_sched_clock_state(void)
#ifdef CONFIG_X86_LOCAL_APIC
static void kvm_setup_secondary_clock(void)
{
- /*
- * Now that the first cpu already had this clocksource initialized,
- * we shouldn't fail.
- */
- WARN_ON(kvm_register_clock("secondary cpu clock"));
+ kvm_register_clock("secondary cpu clock");
}
#endif

@@ -265,16 +257,11 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;

- hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
-
- if (kvm_register_clock("primary cpu clock")) {
- hv_clock = NULL;
- return;
- }
-
printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);

+ hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
+ kvm_register_clock("primary cpu clock");
pvclock_set_pvti_cpu0_va(hv_clock);

if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))

Subject: [tip:x86/timers] x86/kvmclock: Cleanup the code

Commit-ID: 146c394d0c3c8e88df433a179c2b0b85fd8cf247
Gitweb: https://git.kernel.org/tip/146c394d0c3c8e88df433a179c2b0b85fd8cf247
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:23 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:37 +0200

x86/kvmclock: Cleanup the code

- Clean up the MSR write for the wall clock. The type casts to (int) are sloppy
because the wrmsr parameters are u32, and aside from that, wrmsrl() already
provides the high/low split for free.

- Remove the pointless get_cpu()/put_cpu() dance from various
functions. Either they are called during early init, where the CPU is
guaranteed to be 0, or they are already called from non-preemptible
context, where smp_processor_id() can be used safely.

- Simplify the convoluted check for kvmclock in the init function.

- Mark the parameter parsing function __init. No point in keeping it
around.

- Convert to pr_info()

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/kernel/kvmclock.c | 72 ++++++++++++++--------------------------------
1 file changed, 22 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f0a0aef5e9fa..4afb03e49a4f 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -37,7 +37,7 @@ static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
static u64 kvm_sched_clock_offset;

-static int parse_no_kvmclock(char *arg)
+static int __init parse_no_kvmclock(char *arg)
{
kvmclock = 0;
return 0;
@@ -61,13 +61,9 @@ static struct pvclock_wall_clock wall_clock;
static void kvm_get_wallclock(struct timespec64 *now)
{
struct pvclock_vcpu_time_info *vcpu_time;
- int low, high;
int cpu;

- low = (int)slow_virt_to_phys(&wall_clock);
- high = ((u64)slow_virt_to_phys(&wall_clock) >> 32);
-
- native_write_msr(msr_kvm_wall_clock, low, high);
+ wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));

cpu = get_cpu();

@@ -117,11 +113,11 @@ static inline void kvm_sched_clock_init(bool stable)
kvm_sched_clock_offset = kvm_clock_read();
pv_time_ops.sched_clock = kvm_sched_clock_read;

- printk(KERN_INFO "kvm-clock: using sched offset of %llu cycles\n",
- kvm_sched_clock_offset);
+ pr_info("kvm-clock: using sched offset of %llu cycles",
+ kvm_sched_clock_offset);

BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
- sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
+ sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
}

/*
@@ -135,16 +131,8 @@ static inline void kvm_sched_clock_init(bool stable)
*/
static unsigned long kvm_get_tsc_khz(void)
{
- struct pvclock_vcpu_time_info *src;
- int cpu;
- unsigned long tsc_khz;
-
- cpu = get_cpu();
- src = &hv_clock[cpu].pvti;
- tsc_khz = pvclock_tsc_khz(src);
- put_cpu();
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
- return tsc_khz;
+ return pvclock_tsc_khz(&hv_clock[0].pvti);
}

static void kvm_get_preset_lpj(void)
@@ -161,29 +149,27 @@ static void kvm_get_preset_lpj(void)

bool kvm_check_and_clear_guest_paused(void)
{
- bool ret = false;
struct pvclock_vcpu_time_info *src;
- int cpu = smp_processor_id();
+ bool ret = false;

if (!hv_clock)
return ret;

- src = &hv_clock[cpu].pvti;
+ src = &hv_clock[smp_processor_id()].pvti;
if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
src->flags &= ~PVCLOCK_GUEST_STOPPED;
pvclock_touch_watchdogs();
ret = true;
}
-
return ret;
}

struct clocksource kvm_clock = {
- .name = "kvm-clock",
- .read = kvm_clock_get_cycles,
- .rating = 400,
- .mask = CLOCKSOURCE_MASK(64),
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .name = "kvm-clock",
+ .read = kvm_clock_get_cycles,
+ .rating = 400,
+ .mask = CLOCKSOURCE_MASK(64),
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS,
};
EXPORT_SYMBOL_GPL(kvm_clock);

@@ -199,7 +185,7 @@ static void kvm_register_clock(char *txt)
src = &hv_clock[cpu].pvti;
pa = slow_virt_to_phys(src) | 0x01ULL;
wrmsrl(msr_kvm_system_time, pa);
- pr_info("kvm-clock: cpu %d, msr %llx, %s\n", cpu, pa, txt);
+ pr_info("kvm-clock: cpu %d, msr %llx, %s", cpu, pa, txt);
}

static void kvm_save_sched_clock_state(void)
@@ -244,20 +230,19 @@ static void kvm_shutdown(void)

void __init kvmclock_init(void)
{
- struct pvclock_vcpu_time_info *vcpu_time;
- int cpu;
u8 flags;

- if (!kvm_para_available())
+ if (!kvm_para_available() || !kvmclock)
return;

- if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
+ if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
- } else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
+ } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
return;
+ }

- printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+ pr_info("kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);

hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
@@ -267,20 +252,15 @@ void __init kvmclock_init(void)
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);

- cpu = get_cpu();
- vcpu_time = &hv_clock[cpu].pvti;
- flags = pvclock_read_flags(vcpu_time);
-
+ flags = pvclock_read_flags(&hv_clock[0].pvti);
kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
- put_cpu();

x86_platform.calibrate_tsc = kvm_get_tsc_khz;
x86_platform.calibrate_cpu = kvm_get_tsc_khz;
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
- x86_cpuinit.early_percpu_clock_init =
- kvm_setup_secondary_clock;
+ x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
#endif
x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
@@ -296,20 +276,12 @@ void __init kvmclock_init(void)
int __init kvm_setup_vsyscall_timeinfo(void)
{
#ifdef CONFIG_X86_64
- int cpu;
u8 flags;
- struct pvclock_vcpu_time_info *vcpu_time;

if (!hv_clock)
return 0;

- cpu = get_cpu();
-
- vcpu_time = &hv_clock[cpu].pvti;
- flags = pvclock_read_flags(vcpu_time);
-
- put_cpu();
-
+ flags = pvclock_read_flags(&hv_clock[0].pvti);
if (!(flags & PVCLOCK_TSC_STABLE_BIT))
return 1;
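
The simplified kvmclock check in kvmclock_init() is meant to behave exactly like the convoluted one it replaces. A small userspace model can confirm this over all input combinations. The enum and function names below are hypothetical; the four booleans stand for kvm_para_available(), the kvmclock parameter, KVM_FEATURE_CLOCKSOURCE2 and KVM_FEATURE_CLOCKSOURCE.

```c
#include <assert.h>
#include <stdbool.h>

enum clock_choice { CLOCK_DISABLED, CLOCK_LEGACY_MSRS, CLOCK_NEW_MSRS };

/* Control flow as it was before the cleanup. */
static enum clock_choice choose_old(bool para, bool enabled, bool src2, bool src1)
{
	if (!para)
		return CLOCK_DISABLED;

	if (enabled && src2)
		return CLOCK_NEW_MSRS;
	else if (!(enabled && src1))
		return CLOCK_DISABLED;

	return CLOCK_LEGACY_MSRS;
}

/* Control flow after the cleanup: the parameter is tested once, up front. */
static enum clock_choice choose_new(bool para, bool enabled, bool src2, bool src1)
{
	if (!para || !enabled)
		return CLOCK_DISABLED;

	if (src2)
		return CLOCK_NEW_MSRS;
	else if (!src1)
		return CLOCK_DISABLED;

	return CLOCK_LEGACY_MSRS;
}

/* Walks all 16 input combinations and asserts that both versions agree. */
static void assert_checks_agree(void)
{
	for (unsigned int m = 0; m < 16; m++) {
		bool para = m & 1, enabled = m & 2, src2 = m & 4, src1 = m & 8;

		assert(choose_old(para, enabled, src2, src1) ==
		       choose_new(para, enabled, src2, src1));
	}
}
```

Running assert_checks_agree() exercises every combination, so the hoisted test of the parameter changes readability only and not behavior.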


Subject: [tip:x86/timers] x86/kvmclock: Mark variables __initdata and __ro_after_init

Commit-ID: 42f8df935efefba51d0c5321b1325436523e3377
Gitweb: https://git.kernel.org/tip/42f8df935efefba51d0c5321b1325436523e3377
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:24 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:37 +0200

x86/kvmclock: Mark variables __initdata and __ro_after_init

The kvmclock parameter is init data and the other variables are not
modified after init.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/kernel/kvmclock.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 4afb03e49a4f..78aec160f5e0 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -32,10 +32,10 @@
#include <asm/reboot.h>
#include <asm/kvmclock.h>

-static int kvmclock __ro_after_init = 1;
-static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
-static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
-static u64 kvm_sched_clock_offset;
+static int kvmclock __initdata = 1;
+static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
+static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
+static u64 kvm_sched_clock_offset __ro_after_init;

static int __init parse_no_kvmclock(char *arg)
{
@@ -50,7 +50,7 @@ early_param("no-kvmclock", parse_no_kvmclock);
static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);

/* The hypervisor will put information about time periodically here */
-static struct pvclock_vsyscall_time_info *hv_clock;
+static struct pvclock_vsyscall_time_info *hv_clock __ro_after_init;
static struct pvclock_wall_clock wall_clock;

/*

Subject: [tip:x86/timers] x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock

Commit-ID: e499a9b6dc488aff7f284bee51936f510ab7ad15
Gitweb: https://git.kernel.org/tip/e499a9b6dc488aff7f284bee51936f510ab7ad15
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:25 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:37 +0200

x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock

There is no point in having this in the kvm code itself and calling it from
there. This can be called from an initcall and the parameter is cleared
when the hypervisor is not KVM.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/include/asm/kvm_guest.h | 7 -------
arch/x86/kernel/kvm.c | 13 ------------
arch/x86/kernel/kvmclock.c | 44 ++++++++++++++++++++++++----------------
3 files changed, 27 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_guest.h b/arch/x86/include/asm/kvm_guest.h
deleted file mode 100644
index 46185263d9c2..000000000000
--- a/arch/x86/include/asm/kvm_guest.h
+++ /dev/null
@@ -1,7 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_X86_KVM_GUEST_H
-#define _ASM_X86_KVM_GUEST_H
-
-int kvm_setup_vsyscall_timeinfo(void);
-
-#endif /* _ASM_X86_KVM_GUEST_H */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c65c232d3ddd..a560750cc76f 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -45,7 +45,6 @@
#include <asm/apic.h>
#include <asm/apicdef.h>
#include <asm/hypervisor.h>
-#include <asm/kvm_guest.h>

static int kvmapf = 1;

@@ -66,15 +65,6 @@ static int __init parse_no_stealacc(char *arg)

early_param("no-steal-acc", parse_no_stealacc);

-static int kvmclock_vsyscall = 1;
-static int __init parse_no_kvmclock_vsyscall(char *arg)
-{
- kvmclock_vsyscall = 0;
- return 0;
-}
-
-early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
-
static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
static DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64);
static int has_steal_clock = 0;
@@ -560,9 +550,6 @@ static void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);

- if (kvmclock_vsyscall)
- kvm_setup_vsyscall_timeinfo();
-
#ifdef CONFIG_SMP
smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 78aec160f5e0..7d690d2238f8 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -27,12 +27,14 @@
#include <linux/sched/clock.h>
#include <linux/mm.h>

+#include <asm/hypervisor.h>
#include <asm/mem_encrypt.h>
#include <asm/x86_init.h>
#include <asm/reboot.h>
#include <asm/kvmclock.h>

static int kvmclock __initdata = 1;
+static int kvmclock_vsyscall __initdata = 1;
static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
static u64 kvm_sched_clock_offset __ro_after_init;
@@ -44,6 +46,13 @@ static int __init parse_no_kvmclock(char *arg)
}
early_param("no-kvmclock", parse_no_kvmclock);

+static int __init parse_no_kvmclock_vsyscall(char *arg)
+{
+ kvmclock_vsyscall = 0;
+ return 0;
+}
+early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
+
/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)

@@ -228,6 +237,24 @@ static void kvm_shutdown(void)
native_machine_shutdown();
}

+static int __init kvm_setup_vsyscall_timeinfo(void)
+{
+#ifdef CONFIG_X86_64
+ u8 flags;
+
+ if (!hv_clock || !kvmclock_vsyscall)
+ return 0;
+
+ flags = pvclock_read_flags(&hv_clock[0].pvti);
+ if (!(flags & PVCLOCK_TSC_STABLE_BIT))
+ return 1;
+
+ kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
+#endif
+ return 0;
+}
+early_initcall(kvm_setup_vsyscall_timeinfo);
+
void __init kvmclock_init(void)
{
u8 flags;
@@ -272,20 +299,3 @@ void __init kvmclock_init(void)
clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
pv_info.name = "KVM";
}
-
-int __init kvm_setup_vsyscall_timeinfo(void)
-{
-#ifdef CONFIG_X86_64
- u8 flags;
-
- if (!hv_clock)
- return 0;
-
- flags = pvclock_read_flags(&hv_clock[0].pvti);
- if (!(flags & PVCLOCK_TSC_STABLE_BIT))
- return 1;
-
- kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
-#endif
- return 0;
-}

Subject: [tip:x86/timers] x86/kvmclock: Switch kvmclock data to a PER_CPU variable

Commit-ID: 95a3d4454bb1cf5bfd666c27fdd2dc188e17c14d
Gitweb: https://git.kernel.org/tip/95a3d4454bb1cf5bfd666c27fdd2dc188e17c14d
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:26 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:38 +0200

x86/kvmclock: Switch kvmclock data to a PER_CPU variable

The previous removal of the memblock dependency from kvmclock introduced a
static data array sized 64bytes * CONFIG_NR_CPUS. That's wasteful on large
systems when kvmclock is not used.

Replace it with:

- A static page sized array of pvclock data. It's page sized because the
pvclock data of the boot CPU is mapped into the vDSO; otherwise, random
other data would be exposed to the vDSO.

- A PER_CPU variable of pvclock data pointers. This is used to access the
pvclock data storage on each CPU.

The setup is done in two stages:

- Early boot stores the pointer to the static page for the boot CPU in
the per cpu data.

- In the preparatory stage of CPU hotplug assign either an element of
the static array (when the CPU number is in that range) or allocate
memory and initialize the per cpu pointer.
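
The two stages above can be sketched in plain user-space C. This is a
minimal illustration, not the kernel code: every sketch_* name, the 4096
byte page size, the 128 CPU limit and the use of calloc() in place of
kzalloc() are assumptions made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the sizes are chosen only for illustration. */
#define SKETCH_PAGE_SIZE 4096
#define SKETCH_NR_CPUS   128

struct sketch_time_info {
	unsigned char data[64];	/* assumed 64-byte slot, as in the changelog */
};

#define SKETCH_BOOT_SLOTS (SKETCH_PAGE_SIZE / sizeof(struct sketch_time_info))

/* One page worth of static slots plus a table of per-CPU pointers. */
static struct sketch_time_info sketch_boot[SKETCH_BOOT_SLOTS];
static struct sketch_time_info *sketch_per_cpu[SKETCH_NR_CPUS];

/* Stage 1: early boot points CPU 0 at the first slot of the static page. */
void sketch_early_init(void)
{
	sketch_per_cpu[0] = &sketch_boot[0];
}

/* Stage 2: hotplug prepare step. Returns 0 on success, -1 on failure. */
int sketch_setup_percpu(unsigned int cpu)
{
	struct sketch_time_info *p;

	if (cpu >= SKETCH_NR_CPUS)
		return -1;
	if (cpu == 0 || sketch_per_cpu[cpu])
		return 0;	/* CPU 0 is set up early; others only once */

	/* Use the static page for the first CPUs, allocate otherwise. */
	if (cpu < SKETCH_BOOT_SLOTS)
		p = &sketch_boot[cpu];
	else
		p = calloc(1, sizeof(*p));	/* never freed in this sketch */

	sketch_per_cpu[cpu] = p;
	return p ? 0 : -1;
}
```

With these assumed sizes the static page holds 64 slots, so only CPUs 64
and above cause an allocation.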

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/kernel/kvmclock.c | 99 +++++++++++++++++++++++++++++-----------------
1 file changed, 62 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 7d690d2238f8..91b94c0ae4e3 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,6 +23,7 @@
#include <asm/apic.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
+#include <linux/cpuhotplug.h>
#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/mm.h>
@@ -55,12 +56,23 @@ early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);

/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define HVC_BOOT_ARRAY_SIZE \
+ (PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))

-static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
-
-/* The hypervisor will put information about time periodically here */
-static struct pvclock_vsyscall_time_info *hv_clock __ro_after_init;
+static struct pvclock_vsyscall_time_info
+ hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE);
static struct pvclock_wall_clock wall_clock;
+static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
+
+static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
+{
+ return &this_cpu_read(hv_clock_per_cpu)->pvti;
+}
+
+static inline struct pvclock_vsyscall_time_info *this_cpu_hvclock(void)
+{
+ return this_cpu_read(hv_clock_per_cpu);
+}

/*
* The wallclock is the time of day when we booted. Since then, some time may
@@ -69,17 +81,10 @@ static struct pvclock_wall_clock wall_clock;
*/
static void kvm_get_wallclock(struct timespec64 *now)
{
- struct pvclock_vcpu_time_info *vcpu_time;
- int cpu;
-
wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));
-
- cpu = get_cpu();
-
- vcpu_time = &hv_clock[cpu].pvti;
- pvclock_read_wallclock(&wall_clock, vcpu_time, now);
-
- put_cpu();
+ preempt_disable();
+ pvclock_read_wallclock(&wall_clock, this_cpu_pvti(), now);
+ preempt_enable();
}

static int kvm_set_wallclock(const struct timespec64 *now)
@@ -89,14 +94,10 @@ static int kvm_set_wallclock(const struct timespec64 *now)

static u64 kvm_clock_read(void)
{
- struct pvclock_vcpu_time_info *src;
u64 ret;
- int cpu;

preempt_disable_notrace();
- cpu = smp_processor_id();
- src = &hv_clock[cpu].pvti;
- ret = pvclock_clocksource_read(src);
+ ret = pvclock_clocksource_read(this_cpu_pvti());
preempt_enable_notrace();
return ret;
}
@@ -141,7 +142,7 @@ static inline void kvm_sched_clock_init(bool stable)
static unsigned long kvm_get_tsc_khz(void)
{
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
- return pvclock_tsc_khz(&hv_clock[0].pvti);
+ return pvclock_tsc_khz(this_cpu_pvti());
}

static void kvm_get_preset_lpj(void)
@@ -158,15 +159,14 @@ static void kvm_get_preset_lpj(void)

bool kvm_check_and_clear_guest_paused(void)
{
- struct pvclock_vcpu_time_info *src;
+ struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
bool ret = false;

- if (!hv_clock)
+ if (!src)
return ret;

- src = &hv_clock[smp_processor_id()].pvti;
- if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
- src->flags &= ~PVCLOCK_GUEST_STOPPED;
+ if ((src->pvti.flags & PVCLOCK_GUEST_STOPPED) != 0) {
+ src->pvti.flags &= ~PVCLOCK_GUEST_STOPPED;
pvclock_touch_watchdogs();
ret = true;
}
@@ -184,17 +184,15 @@ EXPORT_SYMBOL_GPL(kvm_clock);

static void kvm_register_clock(char *txt)
{
- struct pvclock_vcpu_time_info *src;
- int cpu = smp_processor_id();
+ struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
u64 pa;

- if (!hv_clock)
+ if (!src)
return;

- src = &hv_clock[cpu].pvti;
- pa = slow_virt_to_phys(src) | 0x01ULL;
+ pa = slow_virt_to_phys(&src->pvti) | 0x01ULL;
wrmsrl(msr_kvm_system_time, pa);
- pr_info("kvm-clock: cpu %d, msr %llx, %s", cpu, pa, txt);
+ pr_info("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
}

static void kvm_save_sched_clock_state(void)
@@ -242,12 +240,12 @@ static int __init kvm_setup_vsyscall_timeinfo(void)
#ifdef CONFIG_X86_64
u8 flags;

- if (!hv_clock || !kvmclock_vsyscall)
+ if (!per_cpu(hv_clock_per_cpu, 0) || !kvmclock_vsyscall)
return 0;

- flags = pvclock_read_flags(&hv_clock[0].pvti);
+ flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
if (!(flags & PVCLOCK_TSC_STABLE_BIT))
- return 1;
+ return 0;

kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
#endif
@@ -255,6 +253,28 @@ static int __init kvm_setup_vsyscall_timeinfo(void)
}
early_initcall(kvm_setup_vsyscall_timeinfo);

+static int kvmclock_setup_percpu(unsigned int cpu)
+{
+ struct pvclock_vsyscall_time_info *p = per_cpu(hv_clock_per_cpu, cpu);
+
+ /*
+ * The per cpu area setup replicates CPU0 data to all cpu
+ * pointers. So carefully check. CPU0 has been set up in init
+ * already.
+ */
+ if (!cpu || (p && p != per_cpu(hv_clock_per_cpu, 0)))
+ return 0;
+
+ /* Use the static page for the first CPUs, allocate otherwise */
+ if (cpu < HVC_BOOT_ARRAY_SIZE)
+ p = &hv_clock_boot[cpu];
+ else
+ p = kzalloc(sizeof(*p), GFP_KERNEL);
+
+ per_cpu(hv_clock_per_cpu, cpu) = p;
+ return p ? 0 : -ENOMEM;
+}
+
void __init kvmclock_init(void)
{
u8 flags;
@@ -269,17 +289,22 @@ void __init kvmclock_init(void)
return;
}

+ if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
+ kvmclock_setup_percpu, NULL) < 0) {
+ return;
+ }
+
pr_info("kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);

- hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
+ this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
kvm_register_clock("primary cpu clock");
- pvclock_set_pvti_cpu0_va(hv_clock);
+ pvclock_set_pvti_cpu0_va(hv_clock_boot);

if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);

- flags = pvclock_read_flags(&hv_clock[0].pvti);
+ flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);

x86_platform.calibrate_tsc = kvm_get_tsc_khz;

Subject: [tip:x86/timers] x86/alternatives, jumplabel: Use text_poke_early() before mm_init()

Commit-ID: 6fffacb30349e0903602d664f7ab6fc87e85162e
Gitweb: https://git.kernel.org/tip/6fffacb30349e0903602d664f7ab6fc87e85162e
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:27 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:38 +0200

x86/alternatives, jumplabel: Use text_poke_early() before mm_init()

It is supposed to be safe to modify static branches after jump_label_init().
But, because the static key modifying code eventually calls text_poke(), it
can end up accessing a struct page which has not been initialized yet.

Here is how to quickly reproduce the problem. Insert code like this
into init/main.c:

| +static DEFINE_STATIC_KEY_FALSE(__test);
| asmlinkage __visible void __init start_kernel(void)
| {
| char *command_line;
|@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
| vfs_caches_init_early();
| sort_main_extable();
| trap_init();
|+ {
|+ static_branch_enable(&__test);
|+ WARN_ON(!static_branch_likely(&__test));
|+ }
| mm_init();

The following warnings show up:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
RIP: 0010:text_poke+0x20d/0x230
Call Trace:
? text_poke_bp+0x50/0xda
? arch_jump_label_transform+0x89/0xe0
? __jump_label_update+0x78/0xb0
? static_key_enable_cpuslocked+0x4d/0x80
? static_key_enable+0x11/0x20
? start_kernel+0x23e/0x4c8
? secondary_startup_64+0xa5/0xb0

---[ end trace abdc99c031b8a90a ]---

If the code above is moved after mm_init(), no warning is shown, as struct
pages are initialized during handover from memblock.

Use text_poke_early() in static branching until early boot IRQs are enabled
and from there switch to text_poke(). Also, ensure text_poke() is never
invoked when uninitialized memory access may happen by adding a
!after_bootmem assertion.
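
The selection logic can be illustrated with a small user-space sketch. It
is not the kernel implementation: the sketch_* names and flags are
hypothetical stand-ins, memcpy() replaces the real patching, and the late
variant returns NULL where the kernel patch uses BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's boot-state flags. */
static bool sketch_early_boot_irqs_disabled = true;
static bool sketch_after_bootmem;

typedef void *(*sketch_poker_t)(void *addr, const void *opcode, size_t len);

/* Early variant: a plain copy that needs no struct page. */
void *sketch_poke_early(void *addr, const void *opcode, size_t len)
{
	return memcpy(addr, opcode, len);
}

/* Late variant: refuses to run before struct pages are initialized. */
void *sketch_poke(void *addr, const void *opcode, size_t len)
{
	if (!sketch_after_bootmem)
		return NULL;	/* the kernel patch uses BUG_ON() here */
	return memcpy(addr, opcode, len);
}

/* Override the requested poker while early boot IRQs are still disabled. */
sketch_poker_t sketch_pick_poker(sketch_poker_t requested)
{
	if (sketch_early_boot_irqs_disabled)
		return sketch_poke_early;
	return requested;
}
```

Callers keep passing the late poker; the override makes the early case
transparent to them, which is why no jump_label core changes are needed.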

Signed-off-by: Pavel Tatashin <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 7 +++++++
arch/x86/kernel/jump_label.c | 11 +++++++----
3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 2ecd34e2d46c..e85ff65c43c3 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -37,5 +37,6 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern int after_bootmem;

#endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index a481763a3776..014f214da581 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -668,6 +668,7 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
local_irq_save(flags);
memcpy(addr, opcode, len);
local_irq_restore(flags);
+ sync_core();
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */
return addr;
@@ -693,6 +694,12 @@ void *text_poke(void *addr, const void *opcode, size_t len)
struct page *pages[2];
int i;

+ /*
+ * While boot memory allocator is runnig we cannot use struct
+ * pages as they are not yet initialized.
+ */
+ BUG_ON(!after_bootmem);
+
if (!core_kernel_text((unsigned long)addr)) {
pages[0] = vmalloc_to_page(addr);
pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e56c95be2808..eeea935e9bb5 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -37,15 +37,18 @@ static void bug_at(unsigned char *ip, int line)
BUG();
}

-static void __jump_label_transform(struct jump_entry *entry,
- enum jump_label_type type,
- void *(*poker)(void *, const void *, size_t),
- int init)
+static void __ref __jump_label_transform(struct jump_entry *entry,
+ enum jump_label_type type,
+ void *(*poker)(void *, const void *, size_t),
+ int init)
{
union jump_code_union code;
const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];

+ if (early_boot_irqs_disabled)
+ poker = text_poke_early;
+
if (type == JUMP_LABEL_JMP) {
if (init) {
/*

Subject: [tip:x86/timers] x86/jump_label: Initialize static branching early

Commit-ID: 8990cac6e5ea7fa57607736019fe8dca961b998f
Gitweb: https://git.kernel.org/tip/8990cac6e5ea7fa57607736019fe8dca961b998f
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:28 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:38 +0200

x86/jump_label: Initialize static branching early

Static branching is useful for runtime patching of branches that are used
in hot paths but are infrequently changed.

The x86 clock framework is one example that uses static branches to set up
the best clock during boot and never changes it again.

It is desired to enable the TSC based sched clock early to allow fine
grained boot time analysis early on. That requires the static branching
functionality to be functional early as well.

Static branching requires patching nop instructions, thus,
arch_init_ideal_nops() must be called prior to jump_label_init().

Do all the necessary steps to call arch_init_ideal_nops() right after
early_cpu_init(), which also allows inserting a call to jump_label_init()
right after that. jump_label_init() will be called again from the generic
init code, but the code is protected against reinitialization already.
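
Why the second call is harmless can be shown with a guarded init function.
This is a generic sketch of the pattern under assumed names (sketch_*),
not the jump label code itself.

```c
#include <assert.h>
#include <stdbool.h>

static bool sketch_initialized;
static int sketch_init_runs;

/* Safe to call from both the arch setup and the generic init code. */
void sketch_jump_label_init(void)
{
	if (sketch_initialized)
		return;		/* second and later calls do nothing */

	sketch_init_runs++;	/* stands in for the real table setup */
	sketch_initialized = true;
}
```

Calling sketch_jump_label_init() twice performs the setup work once.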

[ tglx: Massaged changelog ]

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/kernel/cpu/amd.c | 13 ++++++++-----
arch/x86/kernel/cpu/common.c | 38 ++++++++++++++++++++------------------
arch/x86/kernel/setup.c | 4 ++--
3 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 38915fbfae73..b732438c1a1e 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -232,8 +232,6 @@ static void init_amd_k7(struct cpuinfo_x86 *c)
}
}

- set_cpu_cap(c, X86_FEATURE_K7);
-
/* calling is from identify_secondary_cpu() ? */
if (!c->cpu_index)
return;
@@ -617,6 +615,14 @@ static void early_init_amd(struct cpuinfo_x86 *c)

early_init_amd_mc(c);

+#ifdef CONFIG_X86_32
+ if (c->x86 == 6)
+ set_cpu_cap(c, X86_FEATURE_K7);
+#endif
+
+ if (c->x86 >= 0xf)
+ set_cpu_cap(c, X86_FEATURE_K8);
+
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);

/*
@@ -863,9 +869,6 @@ static void init_amd(struct cpuinfo_x86 *c)

init_amd_cacheinfo(c);

- if (c->x86 >= 0xf)
- set_cpu_cap(c, X86_FEATURE_K8);
-
if (cpu_has(c, X86_FEATURE_XMM2)) {
unsigned long long val;
int ret;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index eb4cb3efd20e..71281ac43b15 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1015,6 +1015,24 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
}

+/*
+ * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
+ * unfortunately, that's not true in practice because of early VIA
+ * chips and (more importantly) broken virtualizers that are not easy
+ * to detect. In the latter case it doesn't even *fail* reliably, so
+ * probing for it doesn't even work. Disable it completely on 32-bit
+ * unless we can find a reliable way to detect all the broken cases.
+ * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
+ */
+static void detect_nopl(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_32
+ clear_cpu_cap(c, X86_FEATURE_NOPL);
+#else
+ set_cpu_cap(c, X86_FEATURE_NOPL);
+#endif
+}
+
/*
* Do minimum CPU detection early.
* Fields really needed: vendor, cpuid_level, family, model, mask,
@@ -1089,6 +1107,8 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
*/
if (!pgtable_l5_enabled())
setup_clear_cpu_cap(X86_FEATURE_LA57);
+
+ detect_nopl(c);
}

void __init early_cpu_init(void)
@@ -1124,24 +1144,6 @@ void __init early_cpu_init(void)
early_identify_cpu(&boot_cpu_data);
}

-/*
- * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
- * unfortunately, that's not true in practice because of early VIA
- * chips and (more importantly) broken virtualizers that are not easy
- * to detect. In the latter case it doesn't even *fail* reliably, so
- * probing for it doesn't even work. Disable it completely on 32-bit
- * unless we can find a reliable way to detect all the broken cases.
- * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
- */
-static void detect_nopl(struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_X86_32
- clear_cpu_cap(c, X86_FEATURE_NOPL);
-#else
- set_cpu_cap(c, X86_FEATURE_NOPL);
-#endif
-}
-
static void detect_null_seg_behavior(struct cpuinfo_x86 *c)
{
#ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index da1dbd99cb6e..7490de925a81 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -866,6 +866,8 @@ void __init setup_arch(char **cmdline_p)

idt_setup_early_traps();
early_cpu_init();
+ arch_init_ideal_nops();
+ jump_label_init();
early_ioremap_init();

setup_olpc_ofw_pgd();
@@ -1268,8 +1270,6 @@ void __init setup_arch(char **cmdline_p)

mcheck_init();

- arch_init_ideal_nops();
-
register_refined_jiffies(CLOCK_TICK_RATE);

#ifdef CONFIG_EFI

Subject: [tip:x86/timers] x86/CPU: Call detect_nopl() only on the BSP

Commit-ID: 9b3661cd7e5400689ed168a7275e75af333177e6
Gitweb: https://git.kernel.org/tip/9b3661cd7e5400689ed168a7275e75af333177e6
Author: Borislav Petkov <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:29 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:39 +0200

x86/CPU: Call detect_nopl() only on the BSP

Make it use the setup_* variants and have it be called only on the BSP and
drop the call in generic_identify() - X86_FEATURE_NOPL will be replicated
to the APs through the forced caps. Helps to keep the mess at a manageable
level.
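
How a forced capability set on the BSP reaches the APs can be sketched as
follows. This is a simplified model under assumptions: a single 32-bit
word instead of the kernel's capability arrays, and hypothetical sketch_*
names and bit position.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FEATURE_NOPL (1u << 3)	/* hypothetical bit position */

struct sketch_cpuinfo {
	uint32_t caps;
};

/* Global masks, written once on the boot CPU. */
static uint32_t sketch_caps_set;
static uint32_t sketch_caps_cleared;

void sketch_setup_force_cap(uint32_t bit)
{
	sketch_caps_set |= bit;
	sketch_caps_cleared &= ~bit;
}

void sketch_setup_clear_cap(uint32_t bit)
{
	sketch_caps_cleared |= bit;
	sketch_caps_set &= ~bit;
}

/* Run for every CPU during identification, including the APs. */
void sketch_apply_forced_caps(struct sketch_cpuinfo *c)
{
	c->caps &= ~sketch_caps_cleared;
	c->caps |= sketch_caps_set;
}
```

Because the masks are global, one call on the BSP is enough; each AP picks
the bit up when the masks are applied to its own capability word.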

Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/kernel/cpu/common.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 71281ac43b15..46408a8cdf62 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1024,12 +1024,12 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
* unless we can find a reliable way to detect all the broken cases.
* Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
*/
-static void detect_nopl(struct cpuinfo_x86 *c)
+static void detect_nopl(void)
{
#ifdef CONFIG_X86_32
- clear_cpu_cap(c, X86_FEATURE_NOPL);
+ setup_clear_cpu_cap(X86_FEATURE_NOPL);
#else
- set_cpu_cap(c, X86_FEATURE_NOPL);
+ setup_force_cpu_cap(X86_FEATURE_NOPL);
#endif
}

@@ -1108,7 +1108,7 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
if (!pgtable_l5_enabled())
setup_clear_cpu_cap(X86_FEATURE_LA57);

- detect_nopl(c);
+ detect_nopl();
}

void __init early_cpu_init(void)
@@ -1206,8 +1206,6 @@ static void generic_identify(struct cpuinfo_x86 *c)

get_model_name(c); /* Default name */

- detect_nopl(c);
-
detect_null_seg_behavior(c);

/*

Subject: [tip:x86/timers] x86/tsc: Redefine notsc to behave as tsc=unstable

Commit-ID: fe9af81e524e8a86bdd59c0cc0d9e2b0ccaf840f
Gitweb: https://git.kernel.org/tip/fe9af81e524e8a86bdd59c0cc0d9e2b0ccaf840f
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:30 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:39 +0200

x86/tsc: Redefine notsc to behave as tsc=unstable

Currently, the notsc kernel parameter disables the use of the TSC by
sched_clock(). However, this parameter does not prevent the kernel from
accessing the TSC in other places.

The only rationale to boot with notsc is to avoid timing discrepancies on
multi-socket systems where the TSCs are not properly synchronized, and thus
exclude the TSC from being used for timekeeping. But that prevents using
the TSC as sched_clock() as well, which is not necessary as the core
sched_clock() implementation can handle non-synchronized TSC based sched
clocks just fine.

However, there is another method to solve the above problem: booting with
tsc=unstable parameter. This parameter allows sched_clock() to use TSC and
just excludes it from timekeeping.

So there is no real reason to keep notsc, but for compatibility reasons the
parameter has to stay. Make it behave like 'tsc=unstable' instead.
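
The resulting behaviour can be modelled in a few lines of user-space C. It
is only a sketch of the semantics described above; the sketch_* names and
the string-based parser are assumptions and do not mirror the kernel's
parameter handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool sketch_tsc_unstable;

void sketch_mark_tsc_unstable(void)
{
	sketch_tsc_unstable = true;
}

/* Returns 1 when the parameter was recognized, 0 otherwise. */
int sketch_parse_param(const char *param)
{
	/* notsc is kept for compatibility and maps to tsc=unstable. */
	if (!strcmp(param, "notsc") || !strcmp(param, "tsc=unstable")) {
		sketch_mark_tsc_unstable();
		return 1;
	}
	return 0;
}

/* An unstable TSC is excluded from timekeeping ... */
bool sketch_tsc_ok_for_timekeeping(void)
{
	return !sketch_tsc_unstable;
}

/* ... but remains usable as the base of sched_clock(). */
bool sketch_tsc_ok_for_sched_clock(void)
{
	return true;
}
```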

[ tglx: Massaged changelog ]

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Dou Liyang <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
Documentation/admin-guide/kernel-parameters.txt | 2 --
Documentation/x86/x86_64/boot-options.txt | 4 +---
arch/x86/kernel/tsc.c | 18 +++---------------
3 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 533ff5c68970..5aed30cd0350 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2835,8 +2835,6 @@

nosync [HW,M68K] Disables sync negotiation for all devices.

- notsc [BUGS=X86-32] Disable Time Stamp Counter
-
nowatchdog [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 8d109ef67ab6..66114ab4f9fe 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -92,9 +92,7 @@ APICs
Timing

notsc
- Don't use the CPU time stamp counter to read the wall time.
- This can be used to work around timing problems on multiprocessor systems
- with not properly synchronized CPUs.
+ Deprecated, use tsc=unstable instead.

nohpet
Don't use the HPET timer.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 74392d9d51e0..186395041725 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -38,11 +38,6 @@ EXPORT_SYMBOL(tsc_khz);
*/
static int __read_mostly tsc_unstable;

-/* native_sched_clock() is called before tsc_init(), so
- we must start with the TSC soft disabled to prevent
- erroneous rdtsc usage on !boot_cpu_has(X86_FEATURE_TSC) processors */
-static int __read_mostly tsc_disabled = -1;
-
static DEFINE_STATIC_KEY_FALSE(__use_tsc);

int tsc_clocksource_reliable;
@@ -248,8 +243,7 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
#ifdef CONFIG_X86_TSC
int __init notsc_setup(char *str)
{
- pr_warn("Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely\n");
- tsc_disabled = 1;
+ mark_tsc_unstable("boot parameter notsc");
return 1;
}
#else
@@ -1307,7 +1301,7 @@ unreg:

static int __init init_tsc_clocksource(void)
{
- if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_disabled > 0 || !tsc_khz)
+ if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
return 0;

if (tsc_unstable)
@@ -1414,12 +1408,6 @@ void __init tsc_init(void)
set_cyc2ns_scale(tsc_khz, cpu, cyc);
}

- if (tsc_disabled > 0)
- return;
-
- /* now allow native_sched_clock() to use rdtsc */
-
- tsc_disabled = 0;
static_branch_enable(&__use_tsc);

if (!no_sched_irq_time)
@@ -1455,7 +1443,7 @@ unsigned long calibrate_delay_is_known(void)
int constant_tsc = cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC);
const struct cpumask *mask = topology_core_cpumask(cpu);

- if (tsc_disabled || !constant_tsc || !mask)
+ if (!constant_tsc || !mask)
return 0;

sibling = cpumask_any_but(mask, cpu);

Subject: [tip:x86/timers] x86/xen/time: Output xen sched_clock time from 0

Commit-ID: 38669ba205d178d2d38bfd194a196d65a44d5af2
Gitweb: https://git.kernel.org/tip/38669ba205d178d2d38bfd194a196d65a44d5af2
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:32 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:40 +0200

x86/xen/time: Output xen sched_clock time from 0

sched_clock() is expected to count from 0 when the system boots.

Add an offset, xen_sched_clock_offset (similar to how it is done for other
hypervisors, e.g. kvm_sched_clock_offset), so that sched_clock() counts
from 0 from the point when time is first initialized.
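
The offset technique amounts to sampling the raw clock once at init and
subtracting that sample on every read. A minimal user-space sketch, with a
fake raw clock and hypothetical sketch_* names:

```c
#include <assert.h>
#include <stdint.h>

/* Fake hypervisor clock: already at 5 seconds when the guest starts. */
static uint64_t sketch_raw_ns = 5000000000ULL;
static uint64_t sketch_sched_clock_offset;

uint64_t sketch_clocksource_read(void)
{
	return sketch_raw_ns;
}

/* Sample the raw clock once, when time ops are first initialized. */
void sketch_init_time_ops(void)
{
	sketch_sched_clock_offset = sketch_clocksource_read();
}

/* The sched clock reports time elapsed since initialization. */
uint64_t sketch_sched_clock(void)
{
	return sketch_clocksource_read() - sketch_sched_clock_offset;
}
```

Right after sketch_init_time_ops() the sched clock reads 0, regardless of
how long the underlying clock had been running.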

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/xen/time.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 53bb7a8d10b5..c84f1e039d84 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -31,6 +31,8 @@
/* Xen may fire a timer up to this many ns early */
#define TIMER_SLOP 100000

+static u64 xen_sched_clock_offset __read_mostly;
+
/* Get the TSC speed from Xen */
static unsigned long xen_tsc_khz(void)
{
@@ -57,6 +59,11 @@ static u64 xen_clocksource_get_cycles(struct clocksource *cs)
return xen_clocksource_read();
}

+static u64 xen_sched_clock(void)
+{
+ return xen_clocksource_read() - xen_sched_clock_offset;
+}
+
static void xen_read_wallclock(struct timespec64 *ts)
{
struct shared_info *s = HYPERVISOR_shared_info;
@@ -367,7 +374,7 @@ void xen_timer_resume(void)
}

static const struct pv_time_ops xen_time_ops __initconst = {
- .sched_clock = xen_clocksource_read,
+ .sched_clock = xen_sched_clock,
.steal_clock = xen_steal_clock,
};

@@ -505,6 +512,7 @@ static void __init xen_time_init(void)

void __init xen_init_time_ops(void)
{
+ xen_sched_clock_offset = xen_clocksource_read();
pv_time_ops = xen_time_ops;

x86_init.timers.timer_init = xen_time_init;
@@ -546,6 +554,7 @@ void __init xen_hvm_init_time_ops(void)
return;
}

+ xen_sched_clock_offset = xen_clocksource_read();
pv_time_ops = xen_time_ops;
x86_init.timers.setup_percpu_clockev = xen_time_init;
x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;

Subject: [tip:x86/timers] x86/xen/time: Initialize pv xen time in init_hypervisor_platform()

Commit-ID: 7b25b9cb0dad8395b5cf5a02196d0e88ccda67d5
Gitweb: https://git.kernel.org/tip/7b25b9cb0dad8395b5cf5a02196d0e88ccda67d5
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:31 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:39 +0200

x86/xen/time: Initialize pv xen time in init_hypervisor_platform()

In every hypervisor except for Xen PV, time ops are initialized in
init_hypervisor_platform().

Xen PV domains initialize time ops in x86_init.paging.pagetable_init()
by calling xen_setup_shared_info(), which is a poor design, as time is
needed prior to the memory allocator.

xen_setup_shared_info() is called from two places: during boot, and
after suspend. Split the content of xen_setup_shared_info() into
three places:

1. add the clock relevant data into new xen pv init_platform vector, and
set clock ops in there.

2. move xen_setup_vcpu_info_placement() to new xen_pv_guest_late_init()
call.

3. copy the parts that re-initialize shared info to xen_pv_post_suspend()
to be symmetric to xen_pv_pre_suspend()

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/xen/enlighten_pv.c | 51 +++++++++++++++++++++------------------------
arch/x86/xen/mmu_pv.c | 6 ++----
arch/x86/xen/suspend_pv.c | 5 +++--
arch/x86/xen/time.c | 7 +++----
arch/x86/xen/xen-ops.h | 6 ++----
5 files changed, 34 insertions(+), 41 deletions(-)

diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 439a94bf89ad..105a57d73701 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -119,6 +119,27 @@ static void __init xen_banner(void)
version >> 16, version & 0xffff, extra.extraversion,
xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
}
+
+static void __init xen_pv_init_platform(void)
+{
+ set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+ HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+
+ /* xen clock uses per-cpu vcpu_info, need to init it for boot cpu */
+ xen_vcpu_info_reset(0);
+
+ /* pvclock is in shared info area */
+ xen_init_time_ops();
+}
+
+static void __init xen_pv_guest_late_init(void)
+{
+#ifndef CONFIG_SMP
+ /* Setup shared vcpu info for non-smp configurations */
+ xen_setup_vcpu_info_placement();
+#endif
+}
+
/* Check if running on Xen version (major, minor) or later */
bool
xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -947,34 +968,8 @@ static void xen_write_msr(unsigned int msr, unsigned low, unsigned high)
xen_write_msr_safe(msr, low, high);
}

-void xen_setup_shared_info(void)
-{
- set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
-
- HYPERVISOR_shared_info =
- (struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
-
- xen_setup_mfn_list_list();
-
- if (system_state == SYSTEM_BOOTING) {
-#ifndef CONFIG_SMP
- /*
- * In UP this is as good a place as any to set up shared info.
- * Limit this to boot only, at restore vcpu setup is done via
- * xen_vcpu_restore().
- */
- xen_setup_vcpu_info_placement();
-#endif
- /*
- * Now that shared info is set up we can start using routines
- * that point to pvclock area.
- */
- xen_init_time_ops();
- }
-}
-
/* This is called once we have the cpu_possible_mask */
-void __ref xen_setup_vcpu_info_placement(void)
+void __init xen_setup_vcpu_info_placement(void)
{
int cpu;

@@ -1228,6 +1223,8 @@ asmlinkage __visible void __init xen_start_kernel(void)
x86_init.irqs.intr_mode_init = x86_init_noop;
x86_init.oem.arch_setup = xen_arch_setup;
x86_init.oem.banner = xen_banner;
+ x86_init.hyper.init_platform = xen_pv_init_platform;
+ x86_init.hyper.guest_late_init = xen_pv_guest_late_init;

/*
* Set up some pagetable state before starting to set any ptes.
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2c30cabfda90..52206ad81e4b 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1230,8 +1230,7 @@ static void __init xen_pagetable_p2m_free(void)
* We roundup to the PMD, which means that if anybody at this stage is
* using the __ka address of xen_start_info or
* xen_start_info->shared_info they are in going to crash. Fortunatly
- * we have already revectored in xen_setup_kernel_pagetable and in
- * xen_setup_shared_info.
+ * we have already revectored in xen_setup_kernel_pagetable.
*/
size = roundup(size, PMD_SIZE);

@@ -1292,8 +1291,7 @@ static void __init xen_pagetable_init(void)

/* Remap memory freed due to conflicts with E820 map */
xen_remap_memory();
-
- xen_setup_shared_info();
+ xen_setup_mfn_list_list();
}
static void xen_write_cr2(unsigned long cr2)
{
diff --git a/arch/x86/xen/suspend_pv.c b/arch/x86/xen/suspend_pv.c
index a2e0f110af56..8303b58c79a9 100644
--- a/arch/x86/xen/suspend_pv.c
+++ b/arch/x86/xen/suspend_pv.c
@@ -27,8 +27,9 @@ void xen_pv_pre_suspend(void)
void xen_pv_post_suspend(int suspend_cancelled)
{
xen_build_mfn_list_list();
-
- xen_setup_shared_info();
+ set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+ HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+ xen_setup_mfn_list_list();

if (suspend_cancelled) {
xen_start_info->store_mfn =
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index e0f1bcf01d63..53bb7a8d10b5 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -40,7 +40,7 @@ static unsigned long xen_tsc_khz(void)
return pvclock_tsc_khz(info);
}

-u64 xen_clocksource_read(void)
+static u64 xen_clocksource_read(void)
{
struct pvclock_vcpu_time_info *src;
u64 ret;
@@ -503,7 +503,7 @@ static void __init xen_time_init(void)
pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
}

-void __ref xen_init_time_ops(void)
+void __init xen_init_time_ops(void)
{
pv_time_ops = xen_time_ops;

@@ -542,8 +542,7 @@ void __init xen_hvm_init_time_ops(void)
return;

if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
- printk(KERN_INFO "Xen doesn't support pvclock on HVM,"
- "disable pv timer\n");
+ pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
return;
}

diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 3b34745d0a52..e78684597f57 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -31,7 +31,6 @@ extern struct shared_info xen_dummy_shared_info;
extern struct shared_info *HYPERVISOR_shared_info;

void xen_setup_mfn_list_list(void);
-void xen_setup_shared_info(void);
void xen_build_mfn_list_list(void);
void xen_setup_machphys_mapping(void);
void xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn);
@@ -68,12 +67,11 @@ void xen_init_irq_ops(void);
void xen_setup_timer(int cpu);
void xen_setup_runstate_info(int cpu);
void xen_teardown_timer(int cpu);
-u64 xen_clocksource_read(void);
void xen_setup_cpu_clockevents(void);
void xen_save_time_memory_area(void);
void xen_restore_time_memory_area(void);
-void __ref xen_init_time_ops(void);
-void __init xen_hvm_init_time_ops(void);
+void xen_init_time_ops(void);
+void xen_hvm_init_time_ops(void);

irqreturn_t xen_debug_interrupt(int irq, void *dev_id);


Subject: [tip:x86/timers] timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()

Commit-ID: 3eca993740b8eb40f514b90b1877a4dbcf0a6710
Gitweb: https://git.kernel.org/tip/3eca993740b8eb40f514b90b1877a4dbcf0a6710
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:34 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:40 +0200

timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()

If an architecture does not support an exact boot time, it is challenging to
estimate the boot time without a reference to the current persistent
clock value. Yet the architecture code cannot read the persistent clock again,
because this may lead to math discrepancies with the caller of
read_boot_clock64(), which has read the persistent clock at a different time.

This is why it is better to provide two values simultaneously: the
persistent clock value, and the boot time.

Replace read_boot_clock64() with:
read_persistent_wall_and_boot_offset(wall_time, boot_offset)

Where wall_time is returned by read_persistent_clock64() and boot_offset is
wall_time - boot time, which defaults to 0.
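
The relation between the two returned values and wall_to_mono can be
illustrated with a small userspace model. This is only a sketch and not
kernel code: struct ts64 and the ts64_*() and model_*() helpers are
hypothetical stand-ins for struct timespec64 and its helpers, and the
clamping of boot_offset mirrors what timekeeping_init() does in the diff
below.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical userspace stand-in for the kernel's struct timespec64. */
struct ts64 {
    int64_t tv_sec;
    long tv_nsec;
};

/* Normalize so that 0 <= tv_nsec < NSEC_PER_SEC. */
static struct ts64 ts64_normalize(int64_t sec, int64_t nsec)
{
    struct ts64 r;

    while (nsec >= NSEC_PER_SEC) {
        nsec -= NSEC_PER_SEC;
        sec++;
    }
    while (nsec < 0) {
        nsec += NSEC_PER_SEC;
        sec--;
    }
    r.tv_sec = sec;
    r.tv_nsec = (long)nsec;
    return r;
}

static struct ts64 ts64_add(struct ts64 a, struct ts64 b)
{
    return ts64_normalize(a.tv_sec + b.tv_sec,
                          (int64_t)a.tv_nsec + b.tv_nsec);
}

static struct ts64 ts64_sub(struct ts64 a, struct ts64 b)
{
    return ts64_normalize(a.tv_sec - b.tv_sec,
                          (int64_t)a.tv_nsec - b.tv_nsec);
}

/* Returns <0, 0 or >0 when a is earlier than, equal to or later than b. */
static int ts64_compare(struct ts64 a, struct ts64 b)
{
    if (a.tv_sec != b.tv_sec)
        return a.tv_sec < b.tv_sec ? -1 : 1;
    if (a.tv_nsec != b.tv_nsec)
        return a.tv_nsec < b.tv_nsec ? -1 : 1;
    return 0;
}

/*
 * Model of the wall_to_mono computation: an offset larger than the wall
 * time cannot be right, so it is reset to zero before subtracting.
 */
static struct ts64 model_wall_to_mono(struct ts64 wall_time,
                                      struct ts64 boot_offset)
{
    assert(wall_time.tv_nsec >= 0 && wall_time.tv_nsec < NSEC_PER_SEC);

    if (ts64_compare(wall_time, boot_offset) < 0)
        boot_offset = (struct ts64){0, 0};
    return ts64_sub(boot_offset, wall_time);
}
```

With these definitions, adding wall_time to the result of
model_wall_to_mono() gives back boot_offset, that is, the time elapsed
since boot at the moment the persistent clock was read.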

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
include/linux/timekeeping.h | 3 ++-
kernel/time/timekeeping.c | 59 +++++++++++++++++++++++----------------------
2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 86bc2026efce..686bc27acef0 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -243,7 +243,8 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
extern int persistent_clock_is_local;

extern void read_persistent_clock64(struct timespec64 *ts);
-extern void read_boot_clock64(struct timespec64 *ts);
+void read_persistent_clock_and_boot_offset(struct timespec64 *wall_clock,
+ struct timespec64 *boot_offset);
extern int update_persistent_clock64(struct timespec64 now);

/*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4786df904c22..cb738f825c12 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -17,6 +17,7 @@
#include <linux/nmi.h>
#include <linux/sched.h>
#include <linux/sched/loadavg.h>
+#include <linux/sched/clock.h>
#include <linux/syscore_ops.h>
#include <linux/clocksource.h>
#include <linux/jiffies.h>
@@ -1496,18 +1497,20 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
}

/**
- * read_boot_clock64 - Return time of the system start.
+ * read_persistent_wall_and_boot_offset - Read persistent clock, and also offset
+ * from the boot.
*
* Weak dummy function for arches that do not yet support it.
- * Function to read the exact time the system has been started.
- * Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
- *
- * XXX - Do be sure to remove it once all arches implement it.
+ * wall_time - current time as returned by persistent clock
+ * boot_offset - offset that is defined as wall_time - boot_time
+ * default to 0.
*/
-void __weak read_boot_clock64(struct timespec64 *ts)
+void __weak __init
+read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+ struct timespec64 *boot_offset)
{
- ts->tv_sec = 0;
- ts->tv_nsec = 0;
+ read_persistent_clock64(wall_time);
+ *boot_offset = (struct timespec64){0};
}

/* Flag for if timekeeping_resume() has injected sleeptime */
@@ -1521,28 +1524,29 @@ static bool persistent_clock_exists;
*/
void __init timekeeping_init(void)
{
+ struct timespec64 wall_time, boot_offset, wall_to_mono;
struct timekeeper *tk = &tk_core.timekeeper;
struct clocksource *clock;
unsigned long flags;
- struct timespec64 now, boot, tmp;
-
- read_persistent_clock64(&now);
- if (!timespec64_valid_strict(&now)) {
- pr_warn("WARNING: Persistent clock returned invalid value!\n"
- " Check your CMOS/BIOS settings.\n");
- now.tv_sec = 0;
- now.tv_nsec = 0;
- } else if (now.tv_sec || now.tv_nsec)
- persistent_clock_exists = true;

- read_boot_clock64(&boot);
- if (!timespec64_valid_strict(&boot)) {
- pr_warn("WARNING: Boot clock returned invalid value!\n"
- " Check your CMOS/BIOS settings.\n");
- boot.tv_sec = 0;
- boot.tv_nsec = 0;
+ read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
+ if (timespec64_valid_strict(&wall_time) &&
+ timespec64_to_ns(&wall_time) > 0) {
+ persistent_clock_exists = true;
+ } else {
+ pr_warn("Persistent clock returned invalid value");
+ wall_time = (struct timespec64){0};
}

+ if (timespec64_compare(&wall_time, &boot_offset) < 0)
+ boot_offset = (struct timespec64){0};
+
+ /*
+ * We want set wall_to_mono, so the following is true:
+ * wall time + wall_to_mono = boot time
+ */
+ wall_to_mono = timespec64_sub(boot_offset, wall_time);
+
raw_spin_lock_irqsave(&timekeeper_lock, flags);
write_seqcount_begin(&tk_core.seq);
ntp_init();
@@ -1552,13 +1556,10 @@ void __init timekeeping_init(void)
clock->enable(clock);
tk_setup_internals(tk, clock);

- tk_set_xtime(tk, &now);
+ tk_set_xtime(tk, &wall_time);
tk->raw_sec = 0;
- if (boot.tv_sec == 0 && boot.tv_nsec == 0)
- boot = tk_xtime(tk);

- set_normalized_timespec64(&tmp, -boot.tv_sec, -boot.tv_nsec);
- tk_set_wall_to_mono(tk, tmp);
+ tk_set_wall_to_mono(tk, wall_to_mono);

timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);


Subject: [tip:x86/timers] s390/time: Add read_persistent_wall_and_boot_offset()

Commit-ID: be2e0e4257678408b0ab00ea9e743b9094e393e8
Gitweb: https://git.kernel.org/tip/be2e0e4257678408b0ab00ea9e743b9094e393e8
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:33 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:40 +0200

s390/time: Add read_persistent_wall_and_boot_offset()

read_persistent_wall_and_boot_offset() will replace read_boot_clock64()
because on some architectures it is more convenient to read both sources
together, as one may depend on the other. For s390, the implementation is the
same as read_boot_clock64(), but it also calls read_persistent_clock64() and
returns its value.

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Martin Schwidefsky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
arch/s390/kernel/time.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index cf561160ea88..d1f5447d5687 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -221,6 +221,24 @@ void read_persistent_clock64(struct timespec64 *ts)
ext_to_timespec64(clk, ts);
}

+void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+ struct timespec64 *boot_offset)
+{
+ unsigned char clk[STORE_CLOCK_EXT_SIZE];
+ struct timespec64 boot_time;
+ __u64 delta;
+
+ delta = initial_leap_seconds + TOD_UNIX_EPOCH;
+ memcpy(clk, tod_clock_base, STORE_CLOCK_EXT_SIZE);
+ *(__u64 *)&clk[1] -= delta;
+ if (*(__u64 *)&clk[1] > delta)
+ clk[0]--;
+ ext_to_timespec64(clk, &boot_time);
+
+ read_persistent_clock64(wall_time);
+ *boot_offset = timespec64_sub(*wall_time, boot_time);
+}
+
void read_boot_clock64(struct timespec64 *ts)
{
unsigned char clk[STORE_CLOCK_EXT_SIZE];

Subject: [tip:x86/timers] timekeeping: Default boot time offset to local_clock()

Commit-ID: 4b1b7f8054896cee25669f6cea7cb6dd17f508f7
Gitweb: https://git.kernel.org/tip/4b1b7f8054896cee25669f6cea7cb6dd17f508f7
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:35 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:41 +0200

timekeeping: Default boot time offset to local_clock()

read_persistent_wall_and_boot_offset() is called during boot to read
the persistent clock and to return the offset between the boot time
and the value of the persistent clock.

Change the default boot_offset from zero to local_clock() so that
architectures which do not have a dedicated boot clock but have an early
sched_clock(), such as SPARCv9, x86, and possibly more, benefit from this
change by getting a better and more consistent estimate of the boot time
without the need for an arch specific implementation.
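
As a rough illustration of the new default, the following userspace sketch
converts a nanosecond clock reading into an offset and derives the estimated
boot time from it. It is an assumption-laden model, not the kernel
implementation: struct ts64 and the model_*() helpers are hypothetical, and
the clock value is passed in as a plain number instead of being read from
local_clock().

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical userspace stand-in for the kernel's struct timespec64. */
struct ts64 {
    int64_t tv_sec;
    long tv_nsec;
};

/* Split a non-negative nanosecond count into seconds and nanoseconds. */
static struct ts64 model_ns_to_ts64(uint64_t ns)
{
    struct ts64 r;

    r.tv_sec = (int64_t)(ns / NSEC_PER_SEC);
    r.tv_nsec = (long)(ns % NSEC_PER_SEC);
    return r;
}

/* Default offset: the time since boot as reported by an early clock. */
static struct ts64 model_default_boot_offset(uint64_t local_clock_ns)
{
    return model_ns_to_ts64(local_clock_ns);
}

/* boot_offset is defined as wall_time - boot_time, so invert it here. */
static struct ts64 model_boot_time(struct ts64 wall_time,
                                   struct ts64 boot_offset)
{
    struct ts64 r;

    assert(boot_offset.tv_nsec >= 0 &&
           boot_offset.tv_nsec < (long)NSEC_PER_SEC);

    r.tv_sec = wall_time.tv_sec - boot_offset.tv_sec;
    r.tv_nsec = wall_time.tv_nsec - boot_offset.tv_nsec;
    if (r.tv_nsec < 0) {
        /* Borrow one second to keep tv_nsec in range. */
        r.tv_nsec += (long)NSEC_PER_SEC;
        r.tv_sec--;
    }
    return r;
}
```

For example, a wall time of 1000.2 s combined with a clock reading of
3.5 s since boot yields an estimated boot time of 996.7 s in this model.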

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
kernel/time/timekeeping.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index cb738f825c12..30d7f64ffc87 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1503,14 +1503,17 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
* Weak dummy function for arches that do not yet support it.
* wall_time - current time as returned by persistent clock
* boot_offset - offset that is defined as wall_time - boot_time
- * default to 0.
+ * The default function calculates offset based on the current value of
+ * local_clock(). This way architectures that support sched_clock() but don't
+ * support dedicated boot time clock will provide the best estimate of the
+ * boot time.
*/
void __weak __init
read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
struct timespec64 *boot_offset)
{
read_persistent_clock64(wall_time);
- *boot_offset = (struct timespec64){0};
+ *boot_offset = ns_to_timespec64(local_clock());
}

/* Flag for if timekeeping_resume() has injected sleeptime */

Subject: [tip:x86/timers] ARM/time: Remove read_boot_clock64()

Commit-ID: 227e3958a780499b3ec41c36d4752ac4f4962874
Gitweb: https://git.kernel.org/tip/227e3958a780499b3ec41c36d4752ac4f4962874
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:37 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:41 +0200

ARM/time: Remove read_boot_clock64()

read_boot_clock64() is deleted, and replaced with
read_persistent_wall_and_boot_offset().

The default implementation of read_persistent_wall_and_boot_offset()
provides a better fallback than the current stubs for read_boot_clock64()
that arm has with no users, so remove the old code.

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
arch/arm/include/asm/mach/time.h | 3 +--
arch/arm/kernel/time.c | 15 ++-------------
arch/arm/plat-omap/counter_32k.c | 2 +-
drivers/clocksource/tegra20_timer.c | 2 +-
4 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mach/time.h b/arch/arm/include/asm/mach/time.h
index 0f79e4dec7f9..4ac3a019a46f 100644
--- a/arch/arm/include/asm/mach/time.h
+++ b/arch/arm/include/asm/mach/time.h
@@ -13,7 +13,6 @@
extern void timer_tick(void);

typedef void (*clock_access_fn)(struct timespec64 *);
-extern int register_persistent_clock(clock_access_fn read_boot,
- clock_access_fn read_persistent);
+extern int register_persistent_clock(clock_access_fn read_persistent);

#endif
diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c
index cf2701cb0de8..078b259ead4e 100644
--- a/arch/arm/kernel/time.c
+++ b/arch/arm/kernel/time.c
@@ -83,29 +83,18 @@ static void dummy_clock_access(struct timespec64 *ts)
}

static clock_access_fn __read_persistent_clock = dummy_clock_access;
-static clock_access_fn __read_boot_clock = dummy_clock_access;

void read_persistent_clock64(struct timespec64 *ts)
{
__read_persistent_clock(ts);
}

-void read_boot_clock64(struct timespec64 *ts)
-{
- __read_boot_clock(ts);
-}
-
-int __init register_persistent_clock(clock_access_fn read_boot,
- clock_access_fn read_persistent)
+int __init register_persistent_clock(clock_access_fn read_persistent)
{
/* Only allow the clockaccess functions to be registered once */
- if (__read_persistent_clock == dummy_clock_access &&
- __read_boot_clock == dummy_clock_access) {
- if (read_boot)
- __read_boot_clock = read_boot;
+ if (__read_persistent_clock == dummy_clock_access) {
if (read_persistent)
__read_persistent_clock = read_persistent;
-
return 0;
}

diff --git a/arch/arm/plat-omap/counter_32k.c b/arch/arm/plat-omap/counter_32k.c
index 2438b96004c1..fcc5bfec8bd1 100644
--- a/arch/arm/plat-omap/counter_32k.c
+++ b/arch/arm/plat-omap/counter_32k.c
@@ -110,7 +110,7 @@ int __init omap_init_clocksource_32k(void __iomem *vbase)
}

sched_clock_register(omap_32k_read_sched_clock, 32, 32768);
- register_persistent_clock(NULL, omap_read_persistent_clock64);
+ register_persistent_clock(omap_read_persistent_clock64);
pr_info("OMAP clocksource: 32k_counter at 32768 Hz\n");

return 0;
diff --git a/drivers/clocksource/tegra20_timer.c b/drivers/clocksource/tegra20_timer.c
index c337a8100a7b..2242a36fc5b0 100644
--- a/drivers/clocksource/tegra20_timer.c
+++ b/drivers/clocksource/tegra20_timer.c
@@ -259,6 +259,6 @@ static int __init tegra20_init_rtc(struct device_node *np)
else
clk_prepare_enable(clk);

- return register_persistent_clock(NULL, tegra_read_persistent_clock64);
+ return register_persistent_clock(tegra_read_persistent_clock64);
}
TIMER_OF_DECLARE(tegra20_rtc, "nvidia,tegra20-rtc", tegra20_init_rtc);

Subject: [tip:x86/timers] x86/tsc: Calibrate tsc only once

Commit-ID: cf7a63ef4e0203f6f33284c69e8188d91422de83
Gitweb: https://git.kernel.org/tip/cf7a63ef4e0203f6f33284c69e8188d91422de83
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:38 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:42 +0200

x86/tsc: Calibrate tsc only once

During boot the TSC is calibrated twice: once in tsc_early_delay_calibrate(),
and a second time in tsc_init().

Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
that the calibration is done only early, and make tsc_init() use the values
already determined in tsc_early_init().

Sometimes it is not possible to determine the TSC frequency early, because
the subsystem that is required is not yet initialized. In such a case, try
again later in tsc_init().
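
The "calibrate early, retry late only on failure" flow can be sketched in
plain C as follows. This is a simplified model under stated assumptions: the
struct tsc_model type, the model_*() functions, the MODEL_HZ tick rate and
both calibrator callbacks are hypothetical and only mimic the control flow
of tsc_early_init() and tsc_init() in the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KHZ 1000
#define MODEL_HZ 250 /* hypothetical tick rate, stands in for HZ */

struct tsc_model {
    unsigned int tsc_khz;
    unsigned int calibrations; /* how often calibration was attempted */
    unsigned int (*calibrate)(struct tsc_model *m);
    uint64_t loops_per_jiffy;
    uint64_t lpj_fine;
    bool unstable;
};

/* One calibration attempt; returns false if no frequency was found. */
static bool model_determine_freq(struct tsc_model *m)
{
    assert(m->calibrate);

    m->calibrations++;
    m->tsc_khz = m->calibrate(m);
    return m->tsc_khz != 0;
}

static uint64_t model_loops_per_jiffy(const struct tsc_model *m)
{
    return (uint64_t)m->tsc_khz * KHZ / MODEL_HZ;
}

static void model_tsc_early_init(struct tsc_model *m)
{
    if (!model_determine_freq(m))
        return;
    m->loops_per_jiffy = model_loops_per_jiffy(m);
}

static void model_tsc_init(struct tsc_model *m)
{
    if (!m->tsc_khz) {
        /* The early attempt failed, try once more. */
        if (!model_determine_freq(m)) {
            m->unstable = true;
            return;
        }
    }
    m->lpj_fine = model_loops_per_jiffy(m);
}

/* Hypothetical calibrator that always succeeds with 2.4 GHz. */
static unsigned int model_calibrate_ok(struct tsc_model *m)
{
    (void)m;
    return 2400000;
}

/* Hypothetical calibrator that only succeeds on the second attempt. */
static unsigned int model_calibrate_late(struct tsc_model *m)
{
    return m->calibrations >= 2 ? 2400000 : 0;
}
```

In this model a successful early calibration leaves the counter at one
attempt after both init calls, while a failed early attempt leads to exactly
one retry in the late init.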

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/include/asm/tsc.h | 2 +-
arch/x86/kernel/setup.c | 2 +-
arch/x86/kernel/tsc.c | 87 +++++++++++++++++++++++++---------------------
3 files changed, 49 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 2701d221583a..c4368ff73652 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -33,7 +33,7 @@ static inline cycles_t get_cycles(void)
extern struct system_counterval_t convert_art_to_tsc(u64 art);
extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);

-extern void tsc_early_delay_calibrate(void);
+extern void tsc_early_init(void);
extern void tsc_init(void);
extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 7490de925a81..5d32c55aeb8b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1014,6 +1014,7 @@ void __init setup_arch(char **cmdline_p)
*/
init_hypervisor_platform();

+ tsc_early_init();
x86_init.resources.probe_roms();

/* after parse_early_param, so could debug it */
@@ -1199,7 +1200,6 @@ void __init setup_arch(char **cmdline_p)

memblock_find_dma_reserve();

- tsc_early_delay_calibrate();
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 186395041725..4cab2236169e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -33,6 +33,8 @@ EXPORT_SYMBOL(cpu_khz);
unsigned int __read_mostly tsc_khz;
EXPORT_SYMBOL(tsc_khz);

+#define KHZ 1000
+
/*
* TSC can be unstable due to cpufreq or due to unsynced TSCs
*/
@@ -1335,34 +1337,10 @@ unreg:
*/
device_initcall(init_tsc_clocksource);

-void __init tsc_early_delay_calibrate(void)
-{
- unsigned long lpj;
-
- if (!boot_cpu_has(X86_FEATURE_TSC))
- return;
-
- cpu_khz = x86_platform.calibrate_cpu();
- tsc_khz = x86_platform.calibrate_tsc();
-
- tsc_khz = tsc_khz ? : cpu_khz;
- if (!tsc_khz)
- return;
-
- lpj = tsc_khz * 1000;
- do_div(lpj, HZ);
- loops_per_jiffy = lpj;
-}
-
-void __init tsc_init(void)
+static bool __init determine_cpu_tsc_frequencies(void)
{
- u64 lpj, cyc;
- int cpu;
-
- if (!boot_cpu_has(X86_FEATURE_TSC)) {
- setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
- return;
- }
+ /* Make sure that cpu and tsc are not already calibrated */
+ WARN_ON(cpu_khz || tsc_khz);

cpu_khz = x86_platform.calibrate_cpu();
tsc_khz = x86_platform.calibrate_tsc();
@@ -1377,20 +1355,52 @@ void __init tsc_init(void)
else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
cpu_khz = tsc_khz;

- if (!tsc_khz) {
- mark_tsc_unstable("could not calculate TSC khz");
- setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
- return;
- }
+ if (tsc_khz == 0)
+ return false;

pr_info("Detected %lu.%03lu MHz processor\n",
- (unsigned long)cpu_khz / 1000,
- (unsigned long)cpu_khz % 1000);
+ (unsigned long)cpu_khz / KHZ,
+ (unsigned long)cpu_khz % KHZ);

if (cpu_khz != tsc_khz) {
pr_info("Detected %lu.%03lu MHz TSC",
- (unsigned long)tsc_khz / 1000,
- (unsigned long)tsc_khz % 1000);
+ (unsigned long)tsc_khz / KHZ,
+ (unsigned long)tsc_khz % KHZ);
+ }
+ return true;
+}
+
+static unsigned long __init get_loops_per_jiffy(void)
+{
+ unsigned long lpj = tsc_khz * KHZ;
+
+ do_div(lpj, HZ);
+ return lpj;
+}
+
+void __init tsc_early_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TSC))
+ return;
+ if (!determine_cpu_tsc_frequencies())
+ return;
+ loops_per_jiffy = get_loops_per_jiffy();
+}
+
+void __init tsc_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TSC)) {
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+ return;
+ }
+
+ if (!tsc_khz) {
+ /* We failed to determine frequencies earlier, try again */
+ if (!determine_cpu_tsc_frequencies()) {
+ mark_tsc_unstable("could not calculate TSC khz");
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+ return;
+ }
}

/* Sanitize TSC ADJUST before cyc2ns gets initialized */
@@ -1413,10 +1423,7 @@ void __init tsc_init(void)
if (!no_sched_irq_time)
enable_sched_clock_irqtime();

- lpj = ((u64)tsc_khz * 1000);
- do_div(lpj, HZ);
- lpj_fine = lpj;
-
+ lpj_fine = get_loops_per_jiffy();
use_tsc_delay();

check_system_tsc_reliable();

Subject: [tip:x86/timers] s390/time: Remove read_boot_clock64()

Commit-ID: 00067a6db2e95f3b9d9a017b3be3c715d54cc0de
Gitweb: https://git.kernel.org/tip/00067a6db2e95f3b9d9a017b3be3c715d54cc0de
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:36 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:41 +0200

s390/time: Remove read_boot_clock64()

read_boot_clock64() was replaced by read_persistent_wall_and_boot_offset()
so remove it.

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
arch/s390/kernel/time.c | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index d1f5447d5687..e8766beee5ad 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -239,19 +239,6 @@ void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
*boot_offset = timespec64_sub(*wall_time, boot_time);
}

-void read_boot_clock64(struct timespec64 *ts)
-{
- unsigned char clk[STORE_CLOCK_EXT_SIZE];
- __u64 delta;
-
- delta = initial_leap_seconds + TOD_UNIX_EPOCH;
- memcpy(clk, tod_clock_base, 16);
- *(__u64 *) &clk[1] -= delta;
- if (*(__u64 *) &clk[1] > delta)
- clk[0]--;
- ext_to_timespec64(clk, ts);
-}
-
static u64 read_tod_clock(struct clocksource *cs)
{
unsigned long long now, adj;

Subject: [tip:x86/timers] x86/tsc: Initialize cyc2ns when tsc frequency is determined

Commit-ID: e2a9ca29b5edc89da2fddeae30e1070b272395c5
Gitweb: https://git.kernel.org/tip/e2a9ca29b5edc89da2fddeae30e1070b272395c5
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:39 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:42 +0200

x86/tsc: Initialize cyc2ns when tsc frequency is determined

cyc2ns converts TSC cycles to nanoseconds, and it is handled in a per-cpu data
structure.

Currently, the setup code for the cyc2ns data of every possible CPU goes
through the same sequence of calculations as for the boot CPU, but it is based
on the same TSC frequency as the boot CPU, and thus this is not necessary.

Initialize the boot CPU when the TSC frequency is determined. Copy the
calculated data from the boot CPU to the other CPUs in tsc_init().

In addition do the following:

- Remove unnecessary zeroing of cyc2ns data by removing cyc2ns_data_init()

- Split set_cyc2ns_scale() into two functions: set_cyc2ns_scale() can be
called when the system is up, and wraps around __set_cyc2ns_scale(), which
can be called directly while the system is booting and avoids saving and
restoring IRQs and the idle sleep and wakeup events.
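
For readers unfamiliar with the conversion itself, here is a minimal
userspace sketch of a multiply-and-shift cycles-to-nanoseconds scale. It is
a simplification under stated assumptions: the shift is fixed at 10, whereas
the kernel computes the multiplier and shift dynamically, and struct
c2n_model and the model_*() functions are hypothetical names. The offset is
chosen so that the converted time stays continuous when the frequency
changes.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SHIFT 10 /* assumption: fixed shift for this sketch */
#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-in for the per-cpu cyc2ns data. */
struct c2n_model {
    uint32_t mul;
    uint32_t shift;
    uint64_t offset;
};

/*
 * ns = offset + (cyc * mul) >> shift. The 64-bit product can overflow for
 * very large cycle counts; this sketch ignores that case.
 */
static uint64_t model_cycles_2_ns(const struct c2n_model *d, uint64_t cyc)
{
    return d->offset + ((cyc * d->mul) >> d->shift);
}

/* Set a new scale for frequency khz, effective at cycle count tsc_now. */
static void model_set_scale(struct c2n_model *d, uint32_t khz,
                            uint64_t tsc_now)
{
    uint64_t ns_now;

    assert(khz != 0);

    /* Time according to the old scale, used to keep time continuous. */
    ns_now = model_cycles_2_ns(d, tsc_now);

    d->mul = (uint32_t)((NSEC_PER_MSEC << MODEL_SHIFT) / khz);
    d->shift = MODEL_SHIFT;
    /* Unsigned wraparound is intended: offset acts modulo 2^64. */
    d->offset = ns_now - ((tsc_now * d->mul) >> d->shift);
}
```

With a 2 GHz clock the multiplier becomes 512, so 2,000,000,000 cycles map
to one second. Copying such a structure from the boot CPU to the other CPUs
is enough as long as all CPUs are assumed to run at the same frequency.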

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/kernel/tsc.c | 94 +++++++++++++++++++++++++++++----------------------
1 file changed, 53 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 4cab2236169e..7ea0718a4c75 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -103,23 +103,6 @@ void cyc2ns_read_end(void)
* [email protected] "math is hard, lets go shopping!"
*/

-static void cyc2ns_data_init(struct cyc2ns_data *data)
-{
- data->cyc2ns_mul = 0;
- data->cyc2ns_shift = 0;
- data->cyc2ns_offset = 0;
-}
-
-static void __init cyc2ns_init(int cpu)
-{
- struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
-
- cyc2ns_data_init(&c2n->data[0]);
- cyc2ns_data_init(&c2n->data[1]);
-
- seqcount_init(&c2n->seq);
-}
-
static inline unsigned long long cycles_2_ns(unsigned long long cyc)
{
struct cyc2ns_data data;
@@ -135,18 +118,11 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
return ns;
}

-static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
{
unsigned long long ns_now;
struct cyc2ns_data data;
struct cyc2ns *c2n;
- unsigned long flags;
-
- local_irq_save(flags);
- sched_clock_idle_sleep_event();
-
- if (!khz)
- goto done;

ns_now = cycles_2_ns(tsc_now);

@@ -178,12 +154,55 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
c2n->data[0] = data;
raw_write_seqcount_latch(&c2n->seq);
c2n->data[1] = data;
+}
+
+static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ if (khz)
+ __set_cyc2ns_scale(khz, cpu, tsc_now);

-done:
sched_clock_idle_wakeup_event();
local_irq_restore(flags);
}

+/*
+ * Initialize cyc2ns for boot cpu
+ */
+static void __init cyc2ns_init_boot_cpu(void)
+{
+ struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+
+ seqcount_init(&c2n->seq);
+ __set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
+}
+
+/*
+ * Secondary CPUs do not run through cyc2ns_init(), so set up
+ * all the scale factors for all CPUs, assuming the same
+ * speed as the bootup CPU. (cpufreq notifiers will fix this
+ * up if their speed diverges)
+ */
+static void __init cyc2ns_init_secondary_cpus(void)
+{
+ unsigned int cpu, this_cpu = smp_processor_id();
+ struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+ struct cyc2ns_data *data = c2n->data;
+
+ for_each_possible_cpu(cpu) {
+ if (cpu != this_cpu) {
+ seqcount_init(&c2n->seq);
+ c2n = per_cpu_ptr(&cyc2ns, cpu);
+ c2n->data[0] = data[0];
+ c2n->data[1] = data[1];
+ }
+ }
+}
+
/*
* Scheduler clock - returns current time in nanosec units.
*/
@@ -1385,6 +1404,10 @@ void __init tsc_early_init(void)
if (!determine_cpu_tsc_frequencies())
return;
loops_per_jiffy = get_loops_per_jiffy();
+
+ /* Sanitize TSC ADJUST before cyc2ns gets initialized */
+ tsc_store_and_check_tsc_adjust(true);
+ cyc2ns_init_boot_cpu();
}

void __init tsc_init(void)
@@ -1401,23 +1424,12 @@ void __init tsc_init(void)
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
return;
}
+ /* Sanitize TSC ADJUST before cyc2ns gets initialized */
+ tsc_store_and_check_tsc_adjust(true);
+ cyc2ns_init_boot_cpu();
}

- /* Sanitize TSC ADJUST before cyc2ns gets initialized */
- tsc_store_and_check_tsc_adjust(true);
-
- /*
- * Secondary CPUs do not run through tsc_init(), so set up
- * all the scale factors for all CPUs, assuming the same
- * speed as the bootup CPU. (cpufreq notifiers will fix this
- * up if their speed diverges)
- */
- cyc = rdtsc();
- for_each_possible_cpu(cpu) {
- cyc2ns_init(cpu);
- set_cyc2ns_scale(tsc_khz, cpu, cyc);
- }
-
+ cyc2ns_init_secondary_cpus();
static_branch_enable(&__use_tsc);

if (!no_sched_irq_time)
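
For readers following the cyc2ns changes above, here is a minimal userspace sketch of the scaled-math conversion that the per-CPU cyc2ns data describes. It is not the kernel code: the fixed shift of 10, the struct name and both helper names are assumptions made for illustration, and the kernel picks its multiplier and shift dynamically and uses an overflow-safe multiply helper.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL
#define SKETCH_SHIFT  10 /* illustrative fixed shift, not what the kernel computes */

/* Hypothetical stand-in for one slot of the per-CPU cyc2ns data. */
struct cyc2ns_sketch {
	uint32_t mul;
	uint32_t shift;
	uint64_t offset;
};

/*
 * Build the scale for a TSC running at khz, anchored so that the
 * cycle count tsc_now maps to the nanosecond value ns_now.
 */
static struct cyc2ns_sketch cyc2ns_sketch_init(uint32_t khz, uint64_t tsc_now,
					       uint64_t ns_now)
{
	struct cyc2ns_sketch c;

	assert(khz > 0);
	c.shift = SKETCH_SHIFT;
	/* ns = cycles * (10^6 / khz), with the ratio pre-scaled by 2^shift */
	c.mul = (uint32_t)((NSEC_PER_MSEC << c.shift) / khz);
	c.offset = ns_now - ((tsc_now * c.mul) >> c.shift);
	return c;
}

/* Plain 64-bit multiply: overflows for very large cycle counts. */
static uint64_t cycles_to_ns(const struct cyc2ns_sketch *c, uint64_t cyc)
{
	return ((cyc * c->mul) >> c->shift) + c->offset;
}
```

With a 2 GHz clock (khz = 2000000) the multiplier comes out as 512, so 2000 cycles convert to 1000 ns. Copying data[0] and data[1] to every other CPU, as cyc2ns_init_secondary_cpus() does, amounts to handing each CPU the same mul/shift/offset triple.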

Subject: [tip:x86/timers] x86/tsc: Use TSC as sched clock early

Commit-ID: 4763f03d3d186ce8a1125844790152d76804ad60
Gitweb: https://git.kernel.org/tip/4763f03d3d186ce8a1125844790152d76804ad60
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:40 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:42 +0200

x86/tsc: Use TSC as sched clock early

All prerequisites for enabling TSC as sched clock early in the boot
process are available now:

- Early attempt of TSC calibration

- Early availability of static branch patching

If TSC frequency can be established in the early calibration, enable the
static key which switches sched clock to use TSC.

[ tglx: Massaged changelog ]

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/kernel/tsc.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 7ea0718a4c75..9277ae9b68b3 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1408,6 +1408,7 @@ void __init tsc_early_init(void)
/* Sanitize TSC ADJUST before cyc2ns gets initialized */
tsc_store_and_check_tsc_adjust(true);
cyc2ns_init_boot_cpu();
+ static_branch_enable(&__use_tsc);
}

void __init tsc_init(void)

Subject: [tip:x86/timers] sched/clock: Move sched clock initialization and merge with generic clock

Commit-ID: 5d2a4e91a541cb04d20d11602f0f9340291322ac
Gitweb: https://git.kernel.org/tip/5d2a4e91a541cb04d20d11602f0f9340291322ac
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:41 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:43 +0200

sched/clock: Move sched clock initialization and merge with generic clock

sched_clock_postinit() initializes a generic clock on systems where no
other clock is provided. This function may be called only after
timekeeping_init().

Rename sched_clock_postinit() to generic_sched_clock_init() and call it
from sched_clock_init(). Move the call to sched_clock_init() to after
time_init().

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
include/linux/sched_clock.h | 5 ++---
init/main.c | 4 ++--
kernel/sched/clock.c | 27 +++++++++++++++++----------
kernel/sched/core.c | 1 -
kernel/time/sched_clock.c | 2 +-
5 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched_clock.h b/include/linux/sched_clock.h
index 411b52e424e1..abe28d5cb3f4 100644
--- a/include/linux/sched_clock.h
+++ b/include/linux/sched_clock.h
@@ -9,17 +9,16 @@
#define LINUX_SCHED_CLOCK

#ifdef CONFIG_GENERIC_SCHED_CLOCK
-extern void sched_clock_postinit(void);
+extern void generic_sched_clock_init(void);

extern void sched_clock_register(u64 (*read)(void), int bits,
unsigned long rate);
#else
-static inline void sched_clock_postinit(void) { }
+static inline void generic_sched_clock_init(void) { }

static inline void sched_clock_register(u64 (*read)(void), int bits,
unsigned long rate)
{
- ;
}
#endif

diff --git a/init/main.c b/init/main.c
index 3b4ada11ed52..162d931c9511 100644
--- a/init/main.c
+++ b/init/main.c
@@ -79,7 +79,7 @@
#include <linux/pti.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>
-#include <linux/sched_clock.h>
+#include <linux/sched/clock.h>
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/context_tracking.h>
@@ -642,7 +642,7 @@ asmlinkage __visible void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
- sched_clock_postinit();
+ sched_clock_init();
printk_safe_init();
perf_event_init();
profile_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 10c83e73837a..0e9dbb2d9aea 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -53,6 +53,7 @@
*
*/
#include "sched.h"
+#include <linux/sched_clock.h>

/*
* Scheduler clock - returns current time in nanosec units.
@@ -68,11 +69,6 @@ EXPORT_SYMBOL_GPL(sched_clock);

__read_mostly int sched_clock_running;

-void sched_clock_init(void)
-{
- sched_clock_running = 1;
-}
-
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
/*
* We must start with !__sched_clock_stable because the unstable -> stable
@@ -199,6 +195,15 @@ void clear_sched_clock_stable(void)
__clear_sched_clock_stable();
}

+static void __sched_clock_gtod_offset(void)
+{
+ __gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();
+}
+
+void __init sched_clock_init(void)
+{
+ sched_clock_running = 1;
+}
/*
* We run this as late_initcall() such that it runs after all built-in drivers,
* notably: acpi_processor and intel_idle, which can mark the TSC as unstable.
@@ -385,8 +390,6 @@ void sched_clock_tick(void)

void sched_clock_tick_stable(void)
{
- u64 gtod, clock;
-
if (!sched_clock_stable())
return;

@@ -398,9 +401,7 @@ void sched_clock_tick_stable(void)
* TSC to be unstable, any computation will be computing crap.
*/
local_irq_disable();
- gtod = ktime_get_ns();
- clock = sched_clock();
- __gtod_offset = (clock + __sched_clock_offset) - gtod;
+ __sched_clock_gtod_offset();
local_irq_enable();
}

@@ -434,6 +435,12 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);

#else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */

+void __init sched_clock_init(void)
+{
+ sched_clock_running = 1;
+ generic_sched_clock_init();
+}
+
u64 sched_clock_cpu(int cpu)
{
if (unlikely(!sched_clock_running))
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..552406e9713b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5954,7 +5954,6 @@ void __init sched_init(void)
int i, j;
unsigned long alloc_size = 0, ptr;

- sched_clock_init();
wait_bit_init();

#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 2d8f05aad442..cbc72c2c1fca 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -237,7 +237,7 @@ sched_clock_register(u64 (*read)(void), int bits, unsigned long rate)
pr_debug("Registered %pF as sched_clock source\n", read);
}

-void __init sched_clock_postinit(void)
+void __init generic_sched_clock_init(void)
{
/*
* If no sched_clock() function has been provided at that point,

Subject: [tip:x86/timers] sched/clock: Enable sched clock early

Commit-ID: 857baa87b6422bcfb84ed3631d6839920cb5b09d
Gitweb: https://git.kernel.org/tip/857baa87b6422bcfb84ed3631d6839920cb5b09d
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:42 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:43 +0200

sched/clock: Enable sched clock early

Allow sched_clock() to be used before sched_clock_init() is called. This
provides a way to get early boot timestamps on machines with unstable
clocks.

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
init/main.c | 2 +-
kernel/sched/clock.c | 20 +++++++++++++++++++-
2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index 162d931c9511..ff0a24170b95 100644
--- a/init/main.c
+++ b/init/main.c
@@ -642,7 +642,6 @@ asmlinkage __visible void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
- sched_clock_init();
printk_safe_init();
perf_event_init();
profile_init();
@@ -697,6 +696,7 @@ asmlinkage __visible void __init start_kernel(void)
acpi_early_init();
if (late_time_init)
late_time_init();
+ sched_clock_init();
calibrate_delay();
pid_idr_init();
anon_vma_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 0e9dbb2d9aea..422cd63f8f17 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -202,7 +202,25 @@ static void __sched_clock_gtod_offset(void)

void __init sched_clock_init(void)
{
+ unsigned long flags;
+
+ /*
+ * Set __gtod_offset such that once we mark sched_clock_running,
+ * sched_clock_tick() continues where sched_clock() left off.
+ *
+ * Even if TSC is buggered, we're still UP at this point so it
+ * can't really be out of sync.
+ */
+ local_irq_save(flags);
+ __sched_clock_gtod_offset();
+ local_irq_restore(flags);
+
sched_clock_running = 1;
+
+ /* Now that sched_clock_running is set adjust scd */
+ local_irq_save(flags);
+ sched_clock_tick();
+ local_irq_restore(flags);
}
/*
* We run this as late_initcall() such that it runs after all built-in drivers,
@@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
return sched_clock() + __sched_clock_offset;

if (unlikely(!sched_clock_running))
- return 0ull;
+ return sched_clock();

preempt_disable_notrace();
scd = cpu_sdc(cpu);

Subject: [tip:x86/timers] x86/tsc: Split native_calibrate_cpu() into early and late parts

Commit-ID: 03821f451d2d2d7599061244734245be139014ea
Gitweb: https://git.kernel.org/tip/03821f451d2d2d7599061244734245be139014ea
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:44 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:44 +0200

x86/tsc: Split native_calibrate_cpu() into early and late parts

During early boot TSC and CPU frequency can be calibrated using MSR, CPUID,
and quick PIT calibration methods. The other methods PIT/HPET/PMTIMER are
available only after ACPI is initialized.

Split native_calibrate_cpu() into early and late parts so they can be
called separately during early and late tsc calibration.

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/tsc.c | 54 ++++++++++++++++++++++++++++++----------------
2 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c4368ff73652..88140e4f2292 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -40,6 +40,7 @@ extern int unsynchronized_tsc(void);
extern int check_tsc_unstable(void);
extern void mark_tsc_async_resets(char *reason);
extern unsigned long native_calibrate_cpu(void);
+extern unsigned long native_calibrate_cpu_early(void);
extern unsigned long native_calibrate_tsc(void);
extern unsigned long long native_sched_clock_from_tsc(u64 tsc);

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 9277ae9b68b3..60586779b02c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -680,30 +680,17 @@ static unsigned long cpu_khz_from_cpuid(void)
return eax_base_mhz * 1000;
}

-/**
- * native_calibrate_cpu - calibrate the cpu on boot
+/*
+ * calibrate cpu using pit, hpet, and ptimer methods. They are available
+ * later in boot after acpi is initialized.
*/
-unsigned long native_calibrate_cpu(void)
+static unsigned long pit_hpet_ptimer_calibrate_cpu(void)
{
u64 tsc1, tsc2, delta, ref1, ref2;
unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
- unsigned long flags, latch, ms, fast_calibrate;
+ unsigned long flags, latch, ms;
int hpet = is_hpet_enabled(), i, loopmin;

- fast_calibrate = cpu_khz_from_cpuid();
- if (fast_calibrate)
- return fast_calibrate;
-
- fast_calibrate = cpu_khz_from_msr();
- if (fast_calibrate)
- return fast_calibrate;
-
- local_irq_save(flags);
- fast_calibrate = quick_pit_calibrate();
- local_irq_restore(flags);
- if (fast_calibrate)
- return fast_calibrate;
-
/*
* Run 5 calibration loops to get the lowest frequency value
* (the best estimate). We use two different calibration modes
@@ -846,6 +833,37 @@ unsigned long native_calibrate_cpu(void)
return tsc_pit_min;
}

+/**
+ * native_calibrate_cpu_early - can calibrate the cpu early in boot
+ */
+unsigned long native_calibrate_cpu_early(void)
+{
+ unsigned long flags, fast_calibrate = cpu_khz_from_cpuid();
+
+ if (!fast_calibrate)
+ fast_calibrate = cpu_khz_from_msr();
+ if (!fast_calibrate) {
+ local_irq_save(flags);
+ fast_calibrate = quick_pit_calibrate();
+ local_irq_restore(flags);
+ }
+ return fast_calibrate;
+}
+
+
+/**
+ * native_calibrate_cpu - calibrate the cpu
+ */
+unsigned long native_calibrate_cpu(void)
+{
+ unsigned long tsc_freq = native_calibrate_cpu_early();
+
+ if (!tsc_freq)
+ tsc_freq = pit_hpet_ptimer_calibrate_cpu();
+
+ return tsc_freq;
+}
+
void recalibrate_cpu_khz(void)
{
#ifndef CONFIG_SMP
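
The early/late split above boils down to "try the cheap methods first, fall back to the slow ones only if all of them return zero". The following userspace sketch shows that control flow only. Every function in it is a hypothetical stand-in with a hard-coded return value; none of it touches real CPUID, MSR, PIT, HPET or PM timer hardware.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long (*calibrate_fn)(void);

/* Return the first non-zero frequency (in kHz) reported, or 0 if none. */
static unsigned long first_nonzero(const calibrate_fn *methods, size_t n)
{
	assert(methods != NULL);
	for (size_t i = 0; i < n; i++) {
		unsigned long khz = methods[i]();

		if (khz)
			return khz;
	}
	return 0;
}

/* Hypothetical stubs: here every early method fails, the late one works. */
static unsigned long from_cpuid(void)     { return 0; }
static unsigned long from_msr(void)       { return 0; }
static unsigned long from_quick_pit(void) { return 0; }
static unsigned long from_slow_pit_hpet_pmtimer(void) { return 2400000; }

/* Early part: only the methods that are usable before ACPI is up. */
static unsigned long calibrate_early(void)
{
	static const calibrate_fn early[] = {
		from_cpuid, from_msr, from_quick_pit,
	};

	return first_nonzero(early, sizeof(early) / sizeof(early[0]));
}

/* Full calibration: early methods first, slow fallback second. */
static unsigned long calibrate_full(void)
{
	unsigned long khz = calibrate_early();

	if (!khz)
		khz = from_slow_pit_hpet_pmtimer();
	return khz;
}
```

The table of function pointers is a choice made for this sketch; the patch itself chains the three early methods with plain if statements.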

Subject: [tip:x86/timers] sched/clock: Use static key for sched_clock_running

Commit-ID: 46457ea464f5341d1f9dad8dd213805d45f7f117
Gitweb: https://git.kernel.org/tip/46457ea464f5341d1f9dad8dd213805d45f7f117
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:43 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:43 +0200

sched/clock: Use static key for sched_clock_running

sched_clock_running may be read every time sched_clock_cpu() is called.
Yet, this variable is updated only twice during boot, and never changes
again, therefore it is better to make it a static key.

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
kernel/sched/clock.c | 16 ++++++++--------
kernel/sched/debug.c | 2 --
2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 422cd63f8f17..c5c47ad3f386 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -67,7 +67,7 @@ unsigned long long __weak sched_clock(void)
}
EXPORT_SYMBOL_GPL(sched_clock);

-__read_mostly int sched_clock_running;
+static DEFINE_STATIC_KEY_FALSE(sched_clock_running);

#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
/*
@@ -191,7 +191,7 @@ void clear_sched_clock_stable(void)

smp_mb(); /* matches sched_clock_init_late() */

- if (sched_clock_running == 2)
+ if (static_key_count(&sched_clock_running.key) == 2)
__clear_sched_clock_stable();
}

@@ -215,7 +215,7 @@ void __init sched_clock_init(void)
__sched_clock_gtod_offset();
local_irq_restore(flags);

- sched_clock_running = 1;
+ static_branch_inc(&sched_clock_running);

/* Now that sched_clock_running is set adjust scd */
local_irq_save(flags);
@@ -228,7 +228,7 @@ void __init sched_clock_init(void)
*/
static int __init sched_clock_init_late(void)
{
- sched_clock_running = 2;
+ static_branch_inc(&sched_clock_running);
/*
* Ensure that it is impossible to not do a static_key update.
*
@@ -373,7 +373,7 @@ u64 sched_clock_cpu(int cpu)
if (sched_clock_stable())
return sched_clock() + __sched_clock_offset;

- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return sched_clock();

preempt_disable_notrace();
@@ -396,7 +396,7 @@ void sched_clock_tick(void)
if (sched_clock_stable())
return;

- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return;

lockdep_assert_irqs_disabled();
@@ -455,13 +455,13 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);

void __init sched_clock_init(void)
{
- sched_clock_running = 1;
+ static_branch_inc(&sched_clock_running);
generic_sched_clock_init();
}

u64 sched_clock_cpu(int cpu)
{
- if (unlikely(!sched_clock_running))
+ if (!static_branch_unlikely(&sched_clock_running))
return 0;

return sched_clock();
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e593b4118578..b0212f489a33 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -623,8 +623,6 @@ void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
#undef PU
}

-extern __read_mostly int sched_clock_running;
-
static void print_cpu(struct seq_file *m, int cpu)
{
struct rq *rq = cpu_rq(cpu);
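
The patch above turns sched_clock_running from an integer with the values 0, 1 and 2 into a static key that is incremented twice. A plain-C model of that counting scheme, with no jump labels involved, might look like the sketch below. All names are made up for illustration, and the int counter only stands in for what static_key_count() would report.

```c
#include <assert.h>
#include <stdbool.h>

/* 0 at boot, 1 after the init step, 2 after the late init step. */
static int running_count;

/* Models static_branch_inc() in sched_clock_init(). */
static void sched_clock_init_sketch(void)
{
	assert(running_count == 0);
	running_count++;
}

/* Models static_branch_inc() in sched_clock_init_late(). */
static void sched_clock_init_late_sketch(void)
{
	assert(running_count == 1);
	running_count++;
}

/* Models the static_branch_unlikely() test: true once the count is non-zero. */
static bool clock_is_running(void)
{
	return running_count > 0;
}

/* Models the "count == 2" check in clear_sched_clock_stable(). */
static bool late_init_done(void)
{
	return running_count == 2;
}
```

The point of the real static key is that the hot-path check becomes a patched branch instead of a memory load; this model only captures the state machine, not that optimisation.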

2018-07-19 22:36:35

by Thomas Gleixner

Subject: Re: [PATCH v15 00/26] Early boot time stamps

Pavel,

On Thu, 19 Jul 2018, Pavel Tatashin wrote:

> changelog
> ---------
> v15 - v14

I've applied the series to tip:x86/timers and pushed it out.

Thanks for the patience and for going the extra miles to reach your initial
goal of early TSC timestamps. The overall result looks very reasonable and
is not only a functional improvement: Quite some old ballast and duct tape
has been cleaned up on the way.

Thanks,

tglx

Subject: [tip:x86/timers] x86/tsc: Make use of tsc_calibrate_cpu_early()

Commit-ID: 8dbe438589f373544a1af8b4a859e4da853c0f90
Gitweb: https://git.kernel.org/tip/8dbe438589f373544a1af8b4a859e4da853c0f90
Author: Pavel Tatashin <[email protected]>
AuthorDate: Thu, 19 Jul 2018 16:55:45 -0400
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 00:02:44 +0200

x86/tsc: Make use of tsc_calibrate_cpu_early()

During early boot enable native_calibrate_cpu_early() and switch to
native_calibrate_cpu() only later. Do this unconditionally, because it is
unknown what methods other cpus will use to calibrate once they are
onlined.

If by the time tsc_init() is called the tsc frequency is still unknown, do
only pit_hpet_ptimer_calibrate_cpu() to calibrate, as this function contains
the only methods which have not been called and tried earlier.

Signed-off-by: Pavel Tatashin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/include/asm/tsc.h | 1 -
arch/x86/kernel/tsc.c | 25 +++++++++++++++++++------
arch/x86/kernel/x86_init.c | 2 +-
3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 88140e4f2292..eb5bbfeccb66 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -39,7 +39,6 @@ extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
extern int check_tsc_unstable(void);
extern void mark_tsc_async_resets(char *reason);
-extern unsigned long native_calibrate_cpu(void);
extern unsigned long native_calibrate_cpu_early(void);
extern unsigned long native_calibrate_tsc(void);
extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 60586779b02c..02e416b87ac1 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -854,7 +854,7 @@ unsigned long native_calibrate_cpu_early(void)
/**
* native_calibrate_cpu - calibrate the cpu
*/
-unsigned long native_calibrate_cpu(void)
+static unsigned long native_calibrate_cpu(void)
{
unsigned long tsc_freq = native_calibrate_cpu_early();

@@ -1374,13 +1374,19 @@ unreg:
*/
device_initcall(init_tsc_clocksource);

-static bool __init determine_cpu_tsc_frequencies(void)
+static bool __init determine_cpu_tsc_frequencies(bool early)
{
/* Make sure that cpu and tsc are not already calibrated */
WARN_ON(cpu_khz || tsc_khz);

- cpu_khz = x86_platform.calibrate_cpu();
- tsc_khz = x86_platform.calibrate_tsc();
+ if (early) {
+ cpu_khz = x86_platform.calibrate_cpu();
+ tsc_khz = x86_platform.calibrate_tsc();
+ } else {
+ /* We should not be here with non-native cpu calibration */
+ WARN_ON(x86_platform.calibrate_cpu != native_calibrate_cpu);
+ cpu_khz = pit_hpet_ptimer_calibrate_cpu();
+ }

/*
* Trust non-zero tsc_khz as authorative,
@@ -1419,7 +1425,7 @@ void __init tsc_early_init(void)
{
if (!boot_cpu_has(X86_FEATURE_TSC))
return;
- if (!determine_cpu_tsc_frequencies())
+ if (!determine_cpu_tsc_frequencies(true))
return;
loops_per_jiffy = get_loops_per_jiffy();

@@ -1431,6 +1437,13 @@ void __init tsc_early_init(void)

void __init tsc_init(void)
{
+ /*
+ * native_calibrate_cpu_early can only calibrate using methods that are
+ * available early in boot.
+ */
+ if (x86_platform.calibrate_cpu == native_calibrate_cpu_early)
+ x86_platform.calibrate_cpu = native_calibrate_cpu;
+
if (!boot_cpu_has(X86_FEATURE_TSC)) {
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
return;
@@ -1438,7 +1451,7 @@ void __init tsc_init(void)

if (!tsc_khz) {
/* We failed to determine frequencies earlier, try again */
- if (!determine_cpu_tsc_frequencies()) {
+ if (!determine_cpu_tsc_frequencies(false)) {
mark_tsc_unstable("could not calculate TSC khz");
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
return;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 3ab867603e81..2792b5573818 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -109,7 +109,7 @@ struct x86_cpuinit_ops x86_cpuinit = {
static void default_nmi_init(void) { };

struct x86_platform_ops x86_platform __ro_after_init = {
- .calibrate_cpu = native_calibrate_cpu,
+ .calibrate_cpu = native_calibrate_cpu_early,
.calibrate_tsc = native_calibrate_tsc,
.get_wallclock = mach_get_cmos_time,
.set_wallclock = mach_set_rtc_mmss,

2018-07-20 08:10:22

by Peter Zijlstra

Subject: Re: [PATCH v15 23/26] sched: early boot clock

On Thu, Jul 19, 2018 at 04:55:42PM -0400, Pavel Tatashin wrote:
> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> index 0e9dbb2d9aea..422cd63f8f17 100644
> --- a/kernel/sched/clock.c
> +++ b/kernel/sched/clock.c
> @@ -202,7 +202,25 @@ static void __sched_clock_gtod_offset(void)
>
> void __init sched_clock_init(void)
> {
> + unsigned long flags;
> +
> + /*
> + * Set __gtod_offset such that once we mark sched_clock_running,
> + * sched_clock_tick() continues where sched_clock() left off.
> + *
> + * Even if TSC is buggered, we're still UP at this point so it
> + * can't really be out of sync.
> + */
> + local_irq_save(flags);
> + __sched_clock_gtod_offset();
> + local_irq_restore(flags);
> +
> sched_clock_running = 1;
> +
> + /* Now that sched_clock_running is set adjust scd */
> + local_irq_save(flags);
> + sched_clock_tick();
> + local_irq_restore(flags);
> }

Sorry, that's still wrong. Because the moment you enable
sched_clock_running we need to have everything set-up for it to run.

The above looks double weird because you could've just done that =1
under the same IRQ-disable section and it would've mostly been OK
(except for NMIs). But the reason it's weird like that is because you're
going to change it into a static key later on.

The below cures things.

---
Subject: sched/clock: Close a hole in sched_clock_init()

All data required for the 'unstable' sched_clock must be set-up _before_
enabling it -- setting sched_clock_running. This includes the
__gtod_offset but also a recent scd stamp.

Make the gtod-offset update also set the scd stamp -- it requires the
same two clock reads _anyway_. This doesn't hurt in the
sched_clock_tick_stable() case and ensures sched_clock_init() gets
everything set-up before use.

Also switch to unconditional IRQ-disable/enable because the static key
stuff already requires that this is not run with IRQs disabled.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/clock.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index c5c47ad3f386..811a39aca1ce 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -197,13 +197,14 @@ void clear_sched_clock_stable(void)

static void __sched_clock_gtod_offset(void)
{
- __gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();
+ struct sched_clock_data *scd = this_scd();
+
+ __scd_stamp(scd);
+ __gtod_offset = (scd->tick_raw + __sched_clock_offset) - scd->tick_gtod;
}

void __init sched_clock_init(void)
{
- unsigned long flags;
-
/*
* Set __gtod_offset such that once we mark sched_clock_running,
* sched_clock_tick() continues where sched_clock() left off.
@@ -211,16 +212,11 @@ void __init sched_clock_init(void)
* Even if TSC is buggered, we're still UP at this point so it
* can't really be out of sync.
*/
- local_irq_save(flags);
+ local_irq_disable();
__sched_clock_gtod_offset();
- local_irq_restore(flags);
+ local_irq_enable();

static_branch_inc(&sched_clock_running);
-
- /* Now that sched_clock_running is set adjust scd */
- local_irq_save(flags);
- sched_clock_tick();
- local_irq_restore(flags);
}
/*
* We run this as late_initcall() such that it runs after all built-in drivers,


Subject: [tip:x86/timers] sched/clock: Close a hole in sched_clock_init()

Commit-ID: 9407f5a7ee77c631d1e100436132437cf6237e45
Gitweb: https://git.kernel.org/tip/9407f5a7ee77c631d1e100436132437cf6237e45
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 20 Jul 2018 10:09:11 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Jul 2018 11:58:00 +0200

sched/clock: Close a hole in sched_clock_init()

All data required for the 'unstable' sched_clock must be set-up _before_
enabling it -- setting sched_clock_running. This includes the
__gtod_offset but also a recent scd stamp.

Make the gtod-offset update also set the scd stamp -- it requires the
same two clock reads _anyway_. This doesn't hurt in the
sched_clock_tick_stable() case and ensures sched_clock_init() gets
everything set-up before use.

Also switch to unconditional IRQ-disable/enable because the static key
stuff already requires that this is not run with IRQs disabled.

Fixes: 857baa87b642 ("sched/clock: Enable sched clock early")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/clock.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index c5c47ad3f386..811a39aca1ce 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -197,13 +197,14 @@ void clear_sched_clock_stable(void)

static void __sched_clock_gtod_offset(void)
{
- __gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();
+ struct sched_clock_data *scd = this_scd();
+
+ __scd_stamp(scd);
+ __gtod_offset = (scd->tick_raw + __sched_clock_offset) - scd->tick_gtod;
}

void __init sched_clock_init(void)
{
- unsigned long flags;
-
/*
* Set __gtod_offset such that once we mark sched_clock_running,
* sched_clock_tick() continues where sched_clock() left off.
@@ -211,16 +212,11 @@ void __init sched_clock_init(void)
* Even if TSC is buggered, we're still UP at this point so it
* can't really be out of sync.
*/
- local_irq_save(flags);
+ local_irq_disable();
__sched_clock_gtod_offset();
- local_irq_restore(flags);
+ local_irq_enable();

static_branch_inc(&sched_clock_running);
-
- /* Now that sched_clock_running is set adjust scd */
- local_irq_save(flags);
- sched_clock_tick();
- local_irq_restore(flags);
}
/*
* We run this as late_initcall() such that it runs after all built-in drivers,
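
The ordering rule in this fix (populate everything the unstable clock reads, then enable it) can be modelled in a few lines of userspace C. The sketch below is an illustration only: the two fake_* variables stand in for sched_clock() and ktime_get_ns(), the struct is a cut-down stand-in for the per-CPU sched_clock_data, and a bool replaces the static key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cut-down stand-in for the per-CPU sched_clock_data. */
struct scd_sketch {
	uint64_t tick_raw;
	uint64_t tick_gtod;
};

static struct scd_sketch scd;
static uint64_t gtod_offset;
static uint64_t clock_offset;	/* models __sched_clock_offset, 0 here */
static bool running;		/* models the sched_clock_running key */

static uint64_t fake_raw_ns;	/* hypothetical sched_clock() reading */
static uint64_t fake_gtod_ns;	/* hypothetical ktime_get_ns() reading */

/* Take both clock readings once and remember them. */
static void scd_stamp(struct scd_sketch *s)
{
	s->tick_gtod = fake_gtod_ns;
	s->tick_raw = fake_raw_ns;
}

/* Derive the offset from the stamp instead of reading the clocks again. */
static void gtod_offset_update(void)
{
	scd_stamp(&scd);
	gtod_offset = (scd.tick_raw + clock_offset) - scd.tick_gtod;
}

static void clock_init_sketch(void)
{
	assert(!running);	/* init runs once, before the clock is live */
	gtod_offset_update();	/* stamp and offset are valid from here on */
	running = true;		/* only now may readers use the scd data */
}
```

In the earlier version the stamp was refreshed by a sched_clock_tick() call made after the flag was set, which left a window in which a reader could see the flag set but a stale stamp. Folding the stamp into the offset update closes that window.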

2018-07-24 19:54:06

by Guenter Roeck

Subject: Re: [tip:x86/timers] sched/clock: Enable sched clock early

Hi,

On Thu, Jul 19, 2018 at 03:33:21PM -0700, tip-bot for Pavel Tatashin wrote:
> Commit-ID: 857baa87b6422bcfb84ed3631d6839920cb5b09d
> Gitweb: https://git.kernel.org/tip/857baa87b6422bcfb84ed3631d6839920cb5b09d
> Author: Pavel Tatashin <[email protected]>
> AuthorDate: Thu, 19 Jul 2018 16:55:42 -0400
> Committer: Thomas Gleixner <[email protected]>
> CommitDate: Fri, 20 Jul 2018 00:02:43 +0200
>
> sched/clock: Enable sched clock early
>
> Allow sched_clock() to be used before sched_clock_init() is called. This
> provides a way to get early boot timestamps on machines with unstable
> clocks.
>

This patch causes a regression when running a qemu emulation with
arm:integratorcp.

...
Console: colour dummy device 80x30
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180
sched_clock_register+0x44/0x278
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc6-next-20180724 #1
Hardware name: ARM Integrator/CP (Device Tree)
[<c0010cb4>] (unwind_backtrace) from [<c000dc24>] (show_stack+0x10/0x18)
[<c000dc24>] (show_stack) from [<c03ffb94>] (dump_stack+0x18/0x24)
[<c03ffb94>] (dump_stack) from [<c001a000>] (__warn+0xc8/0xf0)
[<c001a000>] (__warn) from [<c001a13c>] (warn_slowpath_null+0x3c/0x4c)
[<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
[<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
[<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
[<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
[<c0519c18>] (start_kernel) from [<00000000>] ( (null))
---[ end trace 08080eb81afa002c ]---
sched_clock: 32 bits at 100 Hz, resolution 10000000ns, wraps every 21474836475000000ns
...

A complete boot log is available at
http://kerneltests.org/builders/qemu-arm-next/builds/979/steps/qemubuildcommand/logs/stdio

Unfortunately, reverting the patch results in conflicts, so I am unable
to confirm that it is the only culprit.

From the context and from looking into the patch, it appears that this
can happen in any system if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is not
enabled.

Bisect log is attached.

Guenter

---
# bad: [3946cd385042069ec57d3f04240def53b4eed7e5] Add linux-next specific files for 20180724
# good: [d72e90f33aa4709ebecc5005562f52335e106a60] Linux 4.18-rc6
git bisect start 'HEAD' 'v4.18-rc6'
# good: [f5fa891e325acf096c0f79e1d1b922002e251e5a] Merge remote-tracking branch 'crypto/master'
git bisect good f5fa891e325acf096c0f79e1d1b922002e251e5a
# good: [cb6471f6bcfdacbeef9c23ba9dac00e67bd3c3a4] Merge remote-tracking branch 'spi/for-next'
git bisect good cb6471f6bcfdacbeef9c23ba9dac00e67bd3c3a4
# bad: [6b5bfa57bf4553d051be65d85d021465041406d8] Merge remote-tracking branch 'char-misc/char-misc-next'
git bisect bad 6b5bfa57bf4553d051be65d85d021465041406d8
# bad: [675a67e9ef3c041999f412cb75418d2b0def3854] Merge remote-tracking branch 'rcu/rcu/next'
git bisect bad 675a67e9ef3c041999f412cb75418d2b0def3854
# good: [e78b01a51131f25fc2d881bc43001575c129069c] Merge branch 'perf/core'
git bisect good e78b01a51131f25fc2d881bc43001575c129069c
# good: [4e581bce514f4107ce84525f0f75f89c92b4140e] Merge branch 'x86/cpu'
git bisect good 4e581bce514f4107ce84525f0f75f89c92b4140e
# good: [20fa22e90e54e2d21cace7ba083598531670f7cf] Merge branch 'x86/pti'
git bisect good 20fa22e90e54e2d21cace7ba083598531670f7cf
# good: [4763f03d3d186ce8a1125844790152d76804ad60] x86/tsc: Use TSC as sched clock early
git bisect good 4763f03d3d186ce8a1125844790152d76804ad60
# good: [5f9ef44c7d1c59d0eda1d86e31d981bdffe2a725] tools/memory-model: Rename litmus tests to comply to norm7
git bisect good 5f9ef44c7d1c59d0eda1d86e31d981bdffe2a725
# bad: [fc3d25e1c8f6a9232530db02a1072033e22e0fe3] Merge branch 'x86/timers'
git bisect bad fc3d25e1c8f6a9232530db02a1072033e22e0fe3
# bad: [46457ea464f5341d1f9dad8dd213805d45f7f117] sched/clock: Use static key for sched_clock_running
git bisect bad 46457ea464f5341d1f9dad8dd213805d45f7f117
# bad: [857baa87b6422bcfb84ed3631d6839920cb5b09d] sched/clock: Enable sched clock early
git bisect bad 857baa87b6422bcfb84ed3631d6839920cb5b09d
# good: [5d2a4e91a541cb04d20d11602f0f9340291322ac] sched/clock: Move sched clock initialization and merge with generic clock
git bisect good 5d2a4e91a541cb04d20d11602f0f9340291322ac
# first bad commit: [857baa87b6422bcfb84ed3631d6839920cb5b09d] sched/clock: Enable sched clock early

2018-07-24 20:25:38

by Pavel Tatashin

Subject: Re: [tip:x86/timers] sched/clock: Enable sched clock early

On Tue, Jul 24, 2018 at 3:54 PM Guenter Roeck <[email protected]> wrote:
>
> Hi,
>
> On Thu, Jul 19, 2018 at 03:33:21PM -0700, tip-bot for Pavel Tatashin wrote:
> > Commit-ID: 857baa87b6422bcfb84ed3631d6839920cb5b09d
> > Gitweb: https://git.kernel.org/tip/857baa87b6422bcfb84ed3631d6839920cb5b09d
> > Author: Pavel Tatashin <[email protected]>
> > AuthorDate: Thu, 19 Jul 2018 16:55:42 -0400
> > Committer: Thomas Gleixner <[email protected]>
> > CommitDate: Fri, 20 Jul 2018 00:02:43 +0200
> >
> > sched/clock: Enable sched clock early
> >
> > Allow sched_clock() to be used before sched_clock_init() is called. This
> > provides a way to get early boot timestamps on machines with unstable
> > clocks.
> >
>
> This patch causes a regression when running a qemu emulation with
> arm:integratorcp.

Thank you for the report. I will study it.

>
> ...
> Console: colour dummy device 80x30
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180
> sched_clock_register+0x44/0x278
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc6-next-20180724 #1
> Hardware name: ARM Integrator/CP (Device Tree)
> [<c0010cb4>] (unwind_backtrace) from [<c000dc24>] (show_stack+0x10/0x18)
> [<c000dc24>] (show_stack) from [<c03ffb94>] (dump_stack+0x18/0x24)
> [<c03ffb94>] (dump_stack) from [<c001a000>] (__warn+0xc8/0xf0)
> [<c001a000>] (__warn) from [<c001a13c>] (warn_slowpath_null+0x3c/0x4c)
> [<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
> [<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
> [<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
> [<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
> [<c0519c18>] (start_kernel) from [<00000000>] ( (null))
> ---[ end trace 08080eb81afa002c ]---
> sched_clock: 32 bits at 100 Hz, resolution 10000000ns, wraps every 21474836475000000ns
> ...
>
> A complete boot log is available at
> http://kerneltests.org/builders/qemu-arm-next/builds/979/steps/qemubuildcommand/logs/stdio
>
> Unfortunately, reverting the patch results in conflicts, so I am unable
> to confirm that it is the only culprit.
>
> From the context and from looking into the patch, it appears that this
> can happen in any system if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is not
> enabled.
>
> Bisect log is attached.
>
> Guenter
>
> ---
> # bad: [3946cd385042069ec57d3f04240def53b4eed7e5] Add linux-next specific files for 20180724
> # good: [d72e90f33aa4709ebecc5005562f52335e106a60] Linux 4.18-rc6
> git bisect start 'HEAD' 'v4.18-rc6'
> # good: [f5fa891e325acf096c0f79e1d1b922002e251e5a] Merge remote-tracking branch 'crypto/master'
> git bisect good f5fa891e325acf096c0f79e1d1b922002e251e5a
> # good: [cb6471f6bcfdacbeef9c23ba9dac00e67bd3c3a4] Merge remote-tracking branch 'spi/for-next'
> git bisect good cb6471f6bcfdacbeef9c23ba9dac00e67bd3c3a4
> # bad: [6b5bfa57bf4553d051be65d85d021465041406d8] Merge remote-tracking branch 'char-misc/char-misc-next'
> git bisect bad 6b5bfa57bf4553d051be65d85d021465041406d8
> # bad: [675a67e9ef3c041999f412cb75418d2b0def3854] Merge remote-tracking branch 'rcu/rcu/next'
> git bisect bad 675a67e9ef3c041999f412cb75418d2b0def3854
> # good: [e78b01a51131f25fc2d881bc43001575c129069c] Merge branch 'perf/core'
> git bisect good e78b01a51131f25fc2d881bc43001575c129069c
> # good: [4e581bce514f4107ce84525f0f75f89c92b4140e] Merge branch 'x86/cpu'
> git bisect good 4e581bce514f4107ce84525f0f75f89c92b4140e
> # good: [20fa22e90e54e2d21cace7ba083598531670f7cf] Merge branch 'x86/pti'
> git bisect good 20fa22e90e54e2d21cace7ba083598531670f7cf
> # good: [4763f03d3d186ce8a1125844790152d76804ad60] x86/tsc: Use TSC as sched clock early
> git bisect good 4763f03d3d186ce8a1125844790152d76804ad60
> # good: [5f9ef44c7d1c59d0eda1d86e31d981bdffe2a725] tools/memory-model: Rename litmus tests to comply to norm7
> git bisect good 5f9ef44c7d1c59d0eda1d86e31d981bdffe2a725
> # bad: [fc3d25e1c8f6a9232530db02a1072033e22e0fe3] Merge branch 'x86/timers'
> git bisect bad fc3d25e1c8f6a9232530db02a1072033e22e0fe3
> # bad: [46457ea464f5341d1f9dad8dd213805d45f7f117] sched/clock: Use static key for sched_clock_running
> git bisect bad 46457ea464f5341d1f9dad8dd213805d45f7f117
> # bad: [857baa87b6422bcfb84ed3631d6839920cb5b09d] sched/clock: Enable sched clock early
> git bisect bad 857baa87b6422bcfb84ed3631d6839920cb5b09d
> # good: [5d2a4e91a541cb04d20d11602f0f9340291322ac] sched/clock: Move sched clock initialization and merge with generic clock
> git bisect good 5d2a4e91a541cb04d20d11602f0f9340291322ac
> # first bad commit: [857baa87b6422bcfb84ed3631d6839920cb5b09d] sched/clock: Enable sched clock early

2018-07-25 00:38:05

by Pavel Tatashin

Subject: Re: [tip:x86/timers] sched/clock: Enable sched clock early

On Tue, Jul 24, 2018 at 4:22 PM Pavel Tatashin
<[email protected]> wrote:
>
> On Tue, Jul 24, 2018 at 3:54 PM Guenter Roeck <[email protected]> wrote:
> >
> > Hi,
> >
> > On Thu, Jul 19, 2018 at 03:33:21PM -0700, tip-bot for Pavel Tatashin wrote:
> > > Commit-ID: 857baa87b6422bcfb84ed3631d6839920cb5b09d
> > > Gitweb: https://git.kernel.org/tip/857baa87b6422bcfb84ed3631d6839920cb5b09d
> > > Author: Pavel Tatashin <[email protected]>
> > > AuthorDate: Thu, 19 Jul 2018 16:55:42 -0400
> > > Committer: Thomas Gleixner <[email protected]>
> > > CommitDate: Fri, 20 Jul 2018 00:02:43 +0200
> > >
> > > sched/clock: Enable sched clock early
> > >
> > > Allow sched_clock() to be used before sched_clock_init() is called. This
> > > provides a way to get early boot timestamps on machines with unstable
> > > clocks.
> > >
> >
> > This patch causes a regression when running a qemu emulation with
> > arm:integratorcp.
>
> Thank you for the report. I will study it.
>
> >
> > ...
> > Console: colour dummy device 80x30
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180
> > sched_clock_register+0x44/0x278
> > Modules linked in:
> > CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc6-next-20180724 #1
> > Hardware name: ARM Integrator/CP (Device Tree)
> > [<c0010cb4>] (unwind_backtrace) from [<c000dc24>] (show_stack+0x10/0x18)
> > [<c000dc24>] (show_stack) from [<c03ffb94>] (dump_stack+0x18/0x24)
> > [<c03ffb94>] (dump_stack) from [<c001a000>] (__warn+0xc8/0xf0)
> > [<c001a000>] (__warn) from [<c001a13c>] (warn_slowpath_null+0x3c/0x4c)
> > [<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
> > [<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
> > [<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
> > [<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
> > [<c0519c18>] (start_kernel) from [<00000000>] ( (null))
> > ---[ end trace 08080eb81afa002c ]---
> > sched_clock: 32 bits at 100 Hz, resolution 10000000ns, wraps every 21474836475000000ns
> > ...
> >
> > A complete boot log is available at
> > http://kerneltests.org/builders/qemu-arm-next/builds/979/steps/qemubuildcommand/logs/stdio
> >
> > Unfortunately, reverting the patch results in conflicts, so I am unable
> > to confirm that it is the only culprit.
> >
> > From the context and from looking into the patch, it appears that this
> > can happen in any system if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is not
> > enabled.

Do you have a complete config, and also the qemu args that were used? I
tried the arm defconfig and ran it in qemu, but could not reproduce the
problem.

Thank you,
Pavel

> >
> > Bisect log is attached.
> >
> > Guenter
> >
> > ---
> > # bad: [3946cd385042069ec57d3f04240def53b4eed7e5] Add linux-next specific files for 20180724
> > # good: [d72e90f33aa4709ebecc5005562f52335e106a60] Linux 4.18-rc6
> > git bisect start 'HEAD' 'v4.18-rc6'
> > # good: [f5fa891e325acf096c0f79e1d1b922002e251e5a] Merge remote-tracking branch 'crypto/master'
> > git bisect good f5fa891e325acf096c0f79e1d1b922002e251e5a
> > # good: [cb6471f6bcfdacbeef9c23ba9dac00e67bd3c3a4] Merge remote-tracking branch 'spi/for-next'
> > git bisect good cb6471f6bcfdacbeef9c23ba9dac00e67bd3c3a4
> > # bad: [6b5bfa57bf4553d051be65d85d021465041406d8] Merge remote-tracking branch 'char-misc/char-misc-next'
> > git bisect bad 6b5bfa57bf4553d051be65d85d021465041406d8
> > # bad: [675a67e9ef3c041999f412cb75418d2b0def3854] Merge remote-tracking branch 'rcu/rcu/next'
> > git bisect bad 675a67e9ef3c041999f412cb75418d2b0def3854
> > # good: [e78b01a51131f25fc2d881bc43001575c129069c] Merge branch 'perf/core'
> > git bisect good e78b01a51131f25fc2d881bc43001575c129069c
> > # good: [4e581bce514f4107ce84525f0f75f89c92b4140e] Merge branch 'x86/cpu'
> > git bisect good 4e581bce514f4107ce84525f0f75f89c92b4140e
> > # good: [20fa22e90e54e2d21cace7ba083598531670f7cf] Merge branch 'x86/pti'
> > git bisect good 20fa22e90e54e2d21cace7ba083598531670f7cf
> > # good: [4763f03d3d186ce8a1125844790152d76804ad60] x86/tsc: Use TSC as sched clock early
> > git bisect good 4763f03d3d186ce8a1125844790152d76804ad60
> > # good: [5f9ef44c7d1c59d0eda1d86e31d981bdffe2a725] tools/memory-model: Rename litmus tests to comply to norm7
> > git bisect good 5f9ef44c7d1c59d0eda1d86e31d981bdffe2a725
> > # bad: [fc3d25e1c8f6a9232530db02a1072033e22e0fe3] Merge branch 'x86/timers'
> > git bisect bad fc3d25e1c8f6a9232530db02a1072033e22e0fe3
> > # bad: [46457ea464f5341d1f9dad8dd213805d45f7f117] sched/clock: Use static key for sched_clock_running
> > git bisect bad 46457ea464f5341d1f9dad8dd213805d45f7f117
> > # bad: [857baa87b6422bcfb84ed3631d6839920cb5b09d] sched/clock: Enable sched clock early
> > git bisect bad 857baa87b6422bcfb84ed3631d6839920cb5b09d
> > # good: [5d2a4e91a541cb04d20d11602f0f9340291322ac] sched/clock: Move sched clock initialization and merge with generic clock
> > git bisect good 5d2a4e91a541cb04d20d11602f0f9340291322ac
> > # first bad commit: [857baa87b6422bcfb84ed3631d6839920cb5b09d] sched/clock: Enable sched clock early

2018-07-25 01:25:18

by Guenter Roeck

Subject: Re: [tip:x86/timers] sched/clock: Enable sched clock early

On 07/24/2018 05:36 PM, Pavel Tatashin wrote:
> On Tue, Jul 24, 2018 at 4:22 PM Pavel Tatashin
> <[email protected]> wrote:
>>
>> On Tue, Jul 24, 2018 at 3:54 PM Guenter Roeck <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> On Thu, Jul 19, 2018 at 03:33:21PM -0700, tip-bot for Pavel Tatashin wrote:
>>>> Commit-ID: 857baa87b6422bcfb84ed3631d6839920cb5b09d
>>>> Gitweb: https://git.kernel.org/tip/857baa87b6422bcfb84ed3631d6839920cb5b09d
>>>> Author: Pavel Tatashin <[email protected]>
>>>> AuthorDate: Thu, 19 Jul 2018 16:55:42 -0400
>>>> Committer: Thomas Gleixner <[email protected]>
>>>> CommitDate: Fri, 20 Jul 2018 00:02:43 +0200
>>>>
>>>> sched/clock: Enable sched clock early
>>>>
>>>> Allow sched_clock() to be used before sched_clock_init() is called. This
>>>> provides a way to get early boot timestamps on machines with unstable
>>>> clocks.
>>>>
>>>
>>> This patch causes a regression when running a qemu emulation with
>>> arm:integratorcp.
>>
>> Thank you for the report. I will study it.
>>
>>>
>>> ...
>>> Console: colour dummy device 80x30
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180
>>> sched_clock_register+0x44/0x278
>>> Modules linked in:
>>> CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc6-next-20180724 #1
>>> Hardware name: ARM Integrator/CP (Device Tree)
>>> [<c0010cb4>] (unwind_backtrace) from [<c000dc24>] (show_stack+0x10/0x18)
>>> [<c000dc24>] (show_stack) from [<c03ffb94>] (dump_stack+0x18/0x24)
>>> [<c03ffb94>] (dump_stack) from [<c001a000>] (__warn+0xc8/0xf0)
>>> [<c001a000>] (__warn) from [<c001a13c>] (warn_slowpath_null+0x3c/0x4c)
>>> [<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
>>> [<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
>>> [<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
>>> [<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
>>> [<c0519c18>] (start_kernel) from [<00000000>] ( (null))
>>> ---[ end trace 08080eb81afa002c ]---
>>> sched_clock: 32 bits at 100 Hz, resolution 10000000ns, wraps every 21474836475000000ns
>>> ...
>>>
>>> A complete boot log is available at
>>> http://kerneltests.org/builders/qemu-arm-next/builds/979/steps/qemubuildcommand/logs/stdio
>>>
>>> Unfortunately, reverting the patch results in conflicts, so I am unable
>>> to confirm that it is the only culprit.
>>>
>>> From the context and from looking into the patch, it appears that this
>>> can happen in any system if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is not
>>> enabled.
>
> Do you have a complete config, and also qemu args that were used? I
> have tried defconfig arm, and run in qemu, could not reproduce the
> problem.
>

integrator_defconfig+CONFIG_DEVTMPFS=y+CONFIG_DEVTMPFS_MOUNT=y

Qemu command line is
qemu-system-arm -M integratorcp -m 128 \
-kernel arch/arm/boot/zImage -no-reboot \
-initrd busybox-armv4.cpio \
--append "rdinit=/sbin/init console=ttyAMA0,115200" \
-serial stdio -monitor null -nographic \
-dtb arch/arm/boot/dts/integratorcp.dtb

The scripts and files used are available from [email protected]:groeck/linux-build-test.git.
qemu is version 2.12.

Guenter

> Thank you,
> Pavel
>
>>>
>>> Bisect log is attached.
>>>
>>> Guenter
>>>
>>> ---
>>> # bad: [3946cd385042069ec57d3f04240def53b4eed7e5] Add linux-next specific files for 20180724
>>> # good: [d72e90f33aa4709ebecc5005562f52335e106a60] Linux 4.18-rc6
>>> git bisect start 'HEAD' 'v4.18-rc6'
>>> # good: [f5fa891e325acf096c0f79e1d1b922002e251e5a] Merge remote-tracking branch 'crypto/master'
>>> git bisect good f5fa891e325acf096c0f79e1d1b922002e251e5a
>>> # good: [cb6471f6bcfdacbeef9c23ba9dac00e67bd3c3a4] Merge remote-tracking branch 'spi/for-next'
>>> git bisect good cb6471f6bcfdacbeef9c23ba9dac00e67bd3c3a4
>>> # bad: [6b5bfa57bf4553d051be65d85d021465041406d8] Merge remote-tracking branch 'char-misc/char-misc-next'
>>> git bisect bad 6b5bfa57bf4553d051be65d85d021465041406d8
>>> # bad: [675a67e9ef3c041999f412cb75418d2b0def3854] Merge remote-tracking branch 'rcu/rcu/next'
>>> git bisect bad 675a67e9ef3c041999f412cb75418d2b0def3854
>>> # good: [e78b01a51131f25fc2d881bc43001575c129069c] Merge branch 'perf/core'
>>> git bisect good e78b01a51131f25fc2d881bc43001575c129069c
>>> # good: [4e581bce514f4107ce84525f0f75f89c92b4140e] Merge branch 'x86/cpu'
>>> git bisect good 4e581bce514f4107ce84525f0f75f89c92b4140e
>>> # good: [20fa22e90e54e2d21cace7ba083598531670f7cf] Merge branch 'x86/pti'
>>> git bisect good 20fa22e90e54e2d21cace7ba083598531670f7cf
>>> # good: [4763f03d3d186ce8a1125844790152d76804ad60] x86/tsc: Use TSC as sched clock early
>>> git bisect good 4763f03d3d186ce8a1125844790152d76804ad60
>>> # good: [5f9ef44c7d1c59d0eda1d86e31d981bdffe2a725] tools/memory-model: Rename litmus tests to comply to norm7
>>> git bisect good 5f9ef44c7d1c59d0eda1d86e31d981bdffe2a725
>>> # bad: [fc3d25e1c8f6a9232530db02a1072033e22e0fe3] Merge branch 'x86/timers'
>>> git bisect bad fc3d25e1c8f6a9232530db02a1072033e22e0fe3
>>> # bad: [46457ea464f5341d1f9dad8dd213805d45f7f117] sched/clock: Use static key for sched_clock_running
>>> git bisect bad 46457ea464f5341d1f9dad8dd213805d45f7f117
>>> # bad: [857baa87b6422bcfb84ed3631d6839920cb5b09d] sched/clock: Enable sched clock early
>>> git bisect bad 857baa87b6422bcfb84ed3631d6839920cb5b09d
>>> # good: [5d2a4e91a541cb04d20d11602f0f9340291322ac] sched/clock: Move sched clock initialization and merge with generic clock
>>> git bisect good 5d2a4e91a541cb04d20d11602f0f9340291322ac
>>> # first bad commit: [857baa87b6422bcfb84ed3631d6839920cb5b09d] sched/clock: Enable sched clock early
>


2018-07-25 03:13:40

by Pavel Tatashin

Subject: Re: [tip:x86/timers] sched/clock: Enable sched clock early

Peter,

The problem is in this stack

start_kernel
  local_irq_enable
  late_time_init
  sched_clock_init
    generic_sched_clock_init
      sched_clock_register
        WARN_ON(!irqs_disabled());

Before this work, sched_clock_init() was called before interrupts were
enabled; now it is called after. So we hit the WARN_ON() in
sched_clock_register().

The question is why we need this warning in sched_clock_register(). I
guess it is because we want this section of the code to be atomic:

	new_epoch = read();	<- from here
	cyc = cd.actual_read_sched_clock();
	ns = rd.epoch_ns + cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask,
				     rd.mult, rd.shift);
	cd.actual_read_sched_clock = read;

	rd.read_sched_clock = read;
	rd.sched_clock_mask = new_mask;
	rd.mult = new_mult;
	rd.shift = new_shift;
	rd.epoch_cyc = new_epoch;
	rd.epoch_ns = ns;

	update_clock_read_data(&rd);	<- to here

If we need it, we can surround the sched_clock_register() call with
local_irq_disable()/local_irq_enable():

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index cbc72c2c1fca..5015b165b55b 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -243,8 +243,11 @@ void __init generic_sched_clock_init(void)
* If no sched_clock() function has been provided at that point,
* make it the final one one.
*/
- if (cd.actual_read_sched_clock == jiffy_sched_clock_read)
+ if (cd.actual_read_sched_clock == jiffy_sched_clock_read) {
+ local_irq_disable();
sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
+ local_irq_enable();
+ }

update_sched_clock();

Thank you,
Pavel

2018-07-25 04:07:51

by Pavel Tatashin

Subject: Re: [tip:x86/timers] sched/clock: Enable sched clock early

> integrator_defconfig+CONFIG_DEVTMPFS=y+CONFIG_DEVTMPFS_MOUNT=y
>
> Qemu command line is
> qemu-system-arm -M integratorcp -m 128 \
> -kernel arch/arm/boot/zImage -no-reboot \
> -initrd busybox-armv4.cpio \
> --append "rdinit=/sbin/init console=ttyAMA0,115200" \
> -serial stdio -monitor null -nographic \
> -dtb arch/arm/boot/dts/integratorcp.dtb
>
> The scripts and files used are available from [email protected]:groeck/linux-build-test.git.
> qemu is version 2.12.

Reproduced. Thank you

2018-07-30 12:37:49

by Peter Zijlstra

Subject: Re: [tip:x86/timers] sched/clock: Enable sched clock early

On Tue, Jul 24, 2018 at 10:41:19PM -0400, Pavel Tatashin wrote:

> If we need it, we can surround the sched_clock_register() with
> local_irq_disable/local_irq_enable:
>
> diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
> index cbc72c2c1fca..5015b165b55b 100644
> --- a/kernel/time/sched_clock.c
> +++ b/kernel/time/sched_clock.c
> @@ -243,8 +243,11 @@ void __init generic_sched_clock_init(void)
> * If no sched_clock() function has been provided at that point,
> * make it the final one one.
> */
> - if (cd.actual_read_sched_clock == jiffy_sched_clock_read)
> + if (cd.actual_read_sched_clock == jiffy_sched_clock_read) {
> + local_irq_disable();
> sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
> + local_irq_enable();
> + }
>
> update_sched_clock();

I'm thinking maybe disable IRQs for that entire function, instead of
just the register call.
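
The suggestion above can be sketched in userspace C. This is only a model
under stated assumptions, not the kernel patch: every `model_*` name is a
hypothetical stand-in for the kernel primitive it mimics, and the interrupt
state is a plain boolean. It shows the whole function body running with
"interrupts" off, so the WARN_ON(!irqs_disabled()) check in the register
step cannot fire, and the previous state restored on exit.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these are real kernel symbols. */
static bool irqs_enabled = true;	/* models the CPU interrupt flag */
static int warn_count;			/* counts modelled WARN_ON() hits */

/* Models local_irq_save(): remember the old state, then disable. */
static void model_local_irq_save(bool *flags)
{
	*flags = irqs_enabled;
	irqs_enabled = false;
}

/* Models local_irq_restore(): put the saved state back. */
static void model_local_irq_restore(bool flags)
{
	irqs_enabled = flags;
}

/* Models sched_clock_register() and its WARN_ON(!irqs_disabled()). */
static void model_sched_clock_register(void)
{
	if (irqs_enabled)
		warn_count++;
}

/* Models update_sched_clock(); nothing to do in this sketch. */
static void model_update_sched_clock(void)
{
}

/* Interrupts stay off for the entire function, not just the register call. */
static void model_generic_sched_clock_init(void)
{
	bool flags;

	model_local_irq_save(&flags);
	model_sched_clock_register();
	model_update_sched_clock();
	model_local_irq_restore(flags);
}
```

Calling model_sched_clock_register() directly while irqs_enabled is true
increments warn_count, which corresponds to the reported warning; calling it
through model_generic_sched_clock_init() does not.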

2018-07-30 13:47:12

by Pavel Tatashin

Subject: Re: [tip:x86/timers] sched/clock: Enable sched clock early

> > - if (cd.actual_read_sched_clock == jiffy_sched_clock_read)
> > + if (cd.actual_read_sched_clock == jiffy_sched_clock_read) {
> > + local_irq_disable();
> > sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
> > + local_irq_enable();
> > + }
> >
> > update_sched_clock();
>
> I'm thinking maybe disable IRQs for that entire function, instead of
> just the register call.

Sure, I will send a patch.

Thank you,
Pavel

2018-09-05 09:10:25

by Chuan Hua, Lei

Subject: Re: [PATCH v15 19/26] x86/tsc: calibrate tsc only once

> static unsigned long __init get_loops_per_jiffy(void)
> {
> unsigned long lpj = tsc_khz * KHZ;
>
> do_div(lpj, HZ);
> return lpj;
> }
Just tried this with 4.19-rc2 on x86 (32-bit). lpj is returned as zero, which is not expected.
After disassembling the code:
0xc1239a9e <+199>: imul $0x3e8,0xc12296e4,%edx
0xc1239aa8 <+209>: xor %ecx,%ecx
0xc1239aaa <+211>: test %edx,%edx
0xc1239aac <+213>: mov %eax,%ebx
0xc1239aae <+215>: je 0xc1239abd <tsc_init+230>
0xc1239ab0 <+217>: mov $0x64,%ecx
0xc1239ab5 <+222>: mov %edx,%eax
0xc1239ab7 <+224>: xor %edx,%edx
0xc1239ab9 <+226>: div %ecx
0xc1239abb <+228>: mov %eax,%ecx
0xc1239abd <+230>: mov %ebx,%eax
0xc1239abf <+232>: mov $0x64,%ebx
0xc1239ac4 <+237>: div %ebx
0xc1239ac6 <+239>: mov %ecx,%edx
imul loads the result into %edx, but %edx is supposed to hold the high 32 bits, so the high word is not zero.
It should be zero in this case. Both lpj and tsc_khz should be u64 for this to work properly.
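
A minimal userspace sketch of the 64-bit computation asked for above,
assuming KHZ is 1000 as in the quoted function. The names lpj_narrow() and
lpj_wide() are hypothetical and plain division stands in for do_div(), so
this does not reproduce the register mix-up seen in the disassembly. It only
shows the widened arithmetic, plus the separate truncation that a 32-bit
product suffers once tsc_khz * KHZ no longer fits in 32 bits.

```c
#include <stdint.h>

#define KHZ 1000u

/* 32-bit arithmetic: the product wraps modulo 2^32 before the division. */
static uint32_t lpj_narrow(uint32_t tsc_khz, uint32_t hz)
{
	uint32_t lpj = (uint32_t)(tsc_khz * KHZ);

	return lpj / hz;
}

/* 64-bit arithmetic: widen before multiplying so no bits are lost. */
static uint64_t lpj_wide(uint32_t tsc_khz, uint32_t hz)
{
	uint64_t lpj = (uint64_t)tsc_khz * KHZ;

	return lpj / hz;
}
```

For tsc_khz = 2592000 and HZ = 100 both variants give 25920000. For a
hypothetical tsc_khz = 5000000 the wide variant gives 50000000, while the
narrow one wraps and gives 7050327.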


2018-09-05 10:44:44

by Thomas Gleixner

Subject: Re: [PATCH v15 19/26] x86/tsc: calibrate tsc only once

On Wed, 5 Sep 2018, Chuan Hua, Lei wrote:

> > static unsigned long __init get_loops_per_jiffy(void)
> > {
> > unsigned long lpj = tsc_khz * KHZ;
> >
> > do_div(lpj, HZ);
> > return lpj;
> > }
> Just tried this with 4.19-rc2 on x86(32bit). lpj return as zero which is not
> expected
> After disassembling the code,
> 0xc1239a9e <+199>: imul $0x3e8,0xc12296e4,%edx
> 0xc1239aa8 <+209>: xor %ecx,%ecx
> 0xc1239aaa <+211>: test %edx,%edx
> 0xc1239aac <+213>: mov %eax,%ebx
> 0xc1239aae <+215>: je 0xc1239abd <tsc_init+230>
> 0xc1239ab0 <+217>: mov $0x64,%ecx
> 0xc1239ab5 <+222>: mov %edx,%eax
> 0xc1239ab7 <+224>: xor %edx,%edx
> 0xc1239ab9 <+226>: div %ecx
> 0xc1239abb <+228>: mov %eax,%ecx
> 0xc1239abd <+230>: mov %ebx,%eax
> 0xc1239abf <+232>: mov $0x64,%ebx
> 0xc1239ac4 <+237>: div %ebx
> 0xc1239ac6 <+239>: mov %ecx,%edx
> imul will load the result into %edx, %edx supposed to be high 32 bit which is
> not zero,
> It should be zero in this case. both lpj and tsc_khz should be u64 to work
> properly.

Good catch! Care to send a patch?

Thanks,

tglx

2018-11-06 05:43:06

by Dominique Martinet

Subject: Re: [PATCH v15 23/26] sched: early boot clock

(added various kvm/virtualization lists in Cc as well as qemu as I don't
know who's "wrong" here)

Pavel Tatashin wrote on Thu, Jul 19, 2018:
> Allow sched_clock() to be used before sched_clock_init() is called.
> This provides a way to get early boot timestamps on machines with
> unstable clocks.

This isn't something I understand, but bisect tells me this patch
(landed as 857baa87b64 ("sched/clock: Enable sched clock early")) makes
the uptime/printk timestamps of a VM running with kvmclock jump early in
the boot sequence, as illustrated below. The size of the jump seems to be
related to the amount of time the host was suspended while qemu was
running before the reboot.

$ dmesg
...
[ 0.000000] SMBIOS 2.8 present.
[ 0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[283120.529821] kvm-clock: cpu 0, msr 321a8001, primary cpu clock
[283120.529822] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[283120.529824] tsc: Detected 2592.000 MHz processor
...

(The VM is x86_64 on x86_64, I can provide my .config on request but
don't think it's related)


It's rather annoying for me as I often reboot VMs and rely on the
'uptime' command to check if I did just reboot or not as I have the
attention span of a goldfish; I'd rather not have to find something else
to check if I did just reboot or not.

Note that if the qemu process is restarted, there is no offset anymore.

I unfortunately just did that, so I cannot say with confidence (putting my
laptop to sleep for 30s only led to a 2s offset and I do not want to
wait longer right now), but it looks like the clock is still mostly
correct after a reboot with my VM's ntp client disabled. I will correct
this tomorrow if I turn out to be wrong.


Happy to try to help fix this in any way; as written above the quote,
I'm not even sure who is wrong here.

Thanks!



(As a side, mostly unrelated note, insert swearing here about cf7a63ef4
not compiling earlier in this series; some variable declarations were
removed before their use. This was fixed in the next patch, but I didn't
notice the kernel didn't fully rebuild and wasted time with my bisect
heading the wrong way...)

> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> init/main.c | 2 +-
> kernel/sched/clock.c | 20 +++++++++++++++++++-
> 2 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/init/main.c b/init/main.c
> index 162d931c9511..ff0a24170b95 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -642,7 +642,6 @@ asmlinkage __visible void __init start_kernel(void)
> softirq_init();
> timekeeping_init();
> time_init();
> - sched_clock_init();
> printk_safe_init();
> perf_event_init();
> profile_init();
> @@ -697,6 +696,7 @@ asmlinkage __visible void __init start_kernel(void)
> acpi_early_init();
> if (late_time_init)
> late_time_init();
> + sched_clock_init();
> calibrate_delay();
> pid_idr_init();
> anon_vma_init();
> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> index 0e9dbb2d9aea..422cd63f8f17 100644
> --- a/kernel/sched/clock.c
> +++ b/kernel/sched/clock.c
> @@ -202,7 +202,25 @@ static void __sched_clock_gtod_offset(void)
>
> void __init sched_clock_init(void)
> {
> + unsigned long flags;
> +
> + /*
> + * Set __gtod_offset such that once we mark sched_clock_running,
> + * sched_clock_tick() continues where sched_clock() left off.
> + *
> + * Even if TSC is buggered, we're still UP at this point so it
> + * can't really be out of sync.
> + */
> + local_irq_save(flags);
> + __sched_clock_gtod_offset();
> + local_irq_restore(flags);
> +
> sched_clock_running = 1;
> +
> + /* Now that sched_clock_running is set adjust scd */
> + local_irq_save(flags);
> + sched_clock_tick();
> + local_irq_restore(flags);
> }
> /*
> * We run this as late_initcall() such that it runs after all built-in drivers,
> @@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
> return sched_clock() + __sched_clock_offset;
>
> if (unlikely(!sched_clock_running))
> - return 0ull;
> + return sched_clock();
>
> preempt_disable_notrace();
> scd = cpu_sdc(cpu);
--
Dominique Martinet | Asmadeus

2018-11-06 11:37:44

by Steven Sistare

Subject: Re: [PATCH v15 23/26] sched: early boot clock

Pavel has a new email address, cc'd - steve

On 11/6/2018 12:42 AM, Dominique Martinet wrote:
> (added various kvm/virtualization lists in Cc as well as qemu as I don't
> know who's "wrong" here)
>
> Pavel Tatashin wrote on Thu, Jul 19, 2018:
>> Allow sched_clock() to be used before sched_clock_init() is called.
>> This provides a way to get early boot timestamps on machines with
>> unstable clocks.
>
> This isn't something I understand, but bisect tells me this patch
> (landed as 857baa87b64 ("sched/clock: Enable sched clock early")) makes
> a VM running with kvmclock take a step in uptime/printk timer early in
> boot sequence as illustrated below. The step seems to be related to the
> amount of time the host was suspended while qemu was running before the
> reboot.
>
> $ dmesg
> ...
> [ 0.000000] SMBIOS 2.8 present.
> [ 0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
> [ 0.000000] Hypervisor detected: KVM
> [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [283120.529821] kvm-clock: cpu 0, msr 321a8001, primary cpu clock
> [283120.529822] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [283120.529824] tsc: Detected 2592.000 MHz processor
> ...
>
> (The VM is x86_64 on x86_64, I can provide my .config on request but
> don't think it's related)
>
>
> It's rather annoying for me as I often reboot VMs and rely on the
> 'uptime' command to check if I did just reboot or not as I have the
> attention span of a goldfish; I'd rather not have to find something else
> to check if I did just reboot or not.
>
> Note that if the qemu process is restarted, there is no offset anymore.
>
> I unfortunately just did that so cannot say with confidence (putting my
> laptop to sleep for 30s only led to a 2s offset and I do not want to
> wait longer right now), but it looks like the clock is still mostly
> correct after reboot after disabling my VM's ntp client. Will infirm
> that tomorrow if I was wrong.
>
>
> Happy to try to help fixing this in any way, as written above the quote
> I'm not even actually sure who is wrong here.
>
> Thanks!
>
>
>
> (As a side, mostly unrelated note, insert swearing here about cf7a63ef4
> not compiling earlier in this serie; some variable declaration got
> removed before their use. Was fixed in the next patch but I didn't
> notice the kernel didn't fully rebuild and wasted time in my bisect
> heading the wrong way...)
>
>> Signed-off-by: Pavel Tatashin <[email protected]>
>> ---
>>  init/main.c          |  2 +-
>>  kernel/sched/clock.c | 20 +++++++++++++++++++-
>>  2 files changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/init/main.c b/init/main.c
>> index 162d931c9511..ff0a24170b95 100644
>> --- a/init/main.c
>> +++ b/init/main.c
>> @@ -642,7 +642,6 @@ asmlinkage __visible void __init start_kernel(void)
>>  	softirq_init();
>>  	timekeeping_init();
>>  	time_init();
>> -	sched_clock_init();
>>  	printk_safe_init();
>>  	perf_event_init();
>>  	profile_init();
>> @@ -697,6 +696,7 @@ asmlinkage __visible void __init start_kernel(void)
>>  	acpi_early_init();
>>  	if (late_time_init)
>>  		late_time_init();
>> +	sched_clock_init();
>>  	calibrate_delay();
>>  	pid_idr_init();
>>  	anon_vma_init();
>> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
>> index 0e9dbb2d9aea..422cd63f8f17 100644
>> --- a/kernel/sched/clock.c
>> +++ b/kernel/sched/clock.c
>> @@ -202,7 +202,25 @@ static void __sched_clock_gtod_offset(void)
>>
>>  void __init sched_clock_init(void)
>>  {
>> +	unsigned long flags;
>> +
>> +	/*
>> +	 * Set __gtod_offset such that once we mark sched_clock_running,
>> +	 * sched_clock_tick() continues where sched_clock() left off.
>> +	 *
>> +	 * Even if TSC is buggered, we're still UP at this point so it
>> +	 * can't really be out of sync.
>> +	 */
>> +	local_irq_save(flags);
>> +	__sched_clock_gtod_offset();
>> +	local_irq_restore(flags);
>> +
>>  	sched_clock_running = 1;
>> +
>> +	/* Now that sched_clock_running is set adjust scd */
>> +	local_irq_save(flags);
>> +	sched_clock_tick();
>> +	local_irq_restore(flags);
>>  }
>>  /*
>>   * We run this as late_initcall() such that it runs after all built-in drivers,
>> @@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
>>  		return sched_clock() + __sched_clock_offset;
>>
>>  	if (unlikely(!sched_clock_running))
>> -		return 0ull;
>> +		return sched_clock();
>>
>>  	preempt_disable_notrace();
>>  	scd = cpu_sdc(cpu);

2019-01-02 22:55:56

by Salvatore Bonaccorso

Subject: Re: [PATCH v15 23/26] sched: early boot clock

Hi,

On Tue, Nov 06, 2018 at 06:35:36AM -0500, Steven Sistare wrote:
> Pavel has a new email address, cc'd - steve
>
> On 11/6/2018 12:42 AM, Dominique Martinet wrote:
> > (added various kvm/virtualization lists in Cc as well as qemu as I don't
> > know who's "wrong" here)
> >
> > Pavel Tatashin wrote on Thu, Jul 19, 2018:
> > >> Allow sched_clock() to be used before sched_clock_init() is called.
> > >> This provides a way to get early boot timestamps on machines with
> >> unstable clocks.
> >
> > This isn't something I understand, but bisect tells me this patch
> > (landed as 857baa87b64 ("sched/clock: Enable sched clock early")) makes
> > a VM running with kvmclock take a step in uptime/printk timer early in
> > boot sequence as illustrated below. The step seems to be related to the
> > amount of time the host was suspended while qemu was running before the
> > reboot.
> >
> > $ dmesg
> > ...
> > [ 0.000000] SMBIOS 2.8 present.
> > [ 0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
> > [ 0.000000] Hypervisor detected: KVM
> > [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> > [283120.529821] kvm-clock: cpu 0, msr 321a8001, primary cpu clock
> > [283120.529822] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> > [283120.529824] tsc: Detected 2592.000 MHz processor
> > ...
> >
> > (The VM is x86_64 on x86_64, I can provide my .config on request but
> > don't think it's related)
> >
> >
> > It's rather annoying for me as I often reboot VMs and rely on the
> > 'uptime' command to check if I did just reboot or not as I have the
> > attention span of a goldfish; I'd rather not have to find something else
> > to check if I did just reboot or not.
> >
> > Note that if the qemu process is restarted, there is no offset anymore.
> >
> > I unfortunately just did that so cannot say with confidence (putting my
> > laptop to sleep for 30s only led to a 2s offset and I do not want to
> > wait longer right now), but it looks like the clock is still mostly
> > > correct after reboot after disabling my VM's ntp client. Will correct
> > that tomorrow if I was wrong.
> >
> >
> > Happy to try to help fixing this in any way, as written above the quote
> > I'm not even actually sure who is wrong here.

A user in Debian reported the same/similar issue (with 4.19.13):

https://bugs.debian.org/918036

Regards,
Salvatore

2019-01-04 03:20:02

by Pasha Tatashin

Subject: Re: [PATCH v15 23/26] sched: early boot clock

Could you please send the config file and qemu arguments that were
used to reproduce this problem?

Thank you,
Pasha

On Wed, Jan 2, 2019 at 3:20 PM Salvatore Bonaccorso <[email protected]> wrote:
>
> Hi,
>
> On Tue, Nov 06, 2018 at 06:35:36AM -0500, Steven Sistare wrote:
> > Pavel has a new email address, cc'd - steve
> >
> > On 11/6/2018 12:42 AM, Dominique Martinet wrote:
> > > (added various kvm/virtualization lists in Cc as well as qemu as I don't
> > > know who's "wrong" here)
> > >
> > > Pavel Tatashin wrote on Thu, Jul 19, 2018:
> > > >> Allow sched_clock() to be used before sched_clock_init() is called.
> > > >> This provides a way to get early boot timestamps on machines with
> > >> unstable clocks.
> > >
> > > This isn't something I understand, but bisect tells me this patch
> > > (landed as 857baa87b64 ("sched/clock: Enable sched clock early")) makes
> > > a VM running with kvmclock take a step in uptime/printk timer early in
> > > boot sequence as illustrated below. The step seems to be related to the
> > > amount of time the host was suspended while qemu was running before the
> > > reboot.
> > >
> > > $ dmesg
> > > ...
> > > [ 0.000000] SMBIOS 2.8 present.
> > > [ 0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
> > > [ 0.000000] Hypervisor detected: KVM
> > > [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> > > [283120.529821] kvm-clock: cpu 0, msr 321a8001, primary cpu clock
> > > [283120.529822] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> > > [283120.529824] tsc: Detected 2592.000 MHz processor
> > > ...
> > >
> > > (The VM is x86_64 on x86_64, I can provide my .config on request but
> > > don't think it's related)
> > >
> > >
> > > It's rather annoying for me as I often reboot VMs and rely on the
> > > 'uptime' command to check if I did just reboot or not as I have the
> > > attention span of a goldfish; I'd rather not have to find something else
> > > to check if I did just reboot or not.
> > >
> > > Note that if the qemu process is restarted, there is no offset anymore.
> > >
> > > I unfortunately just did that so cannot say with confidence (putting my
> > > laptop to sleep for 30s only led to a 2s offset and I do not want to
> > > wait longer right now), but it looks like the clock is still mostly
> > > > correct after reboot after disabling my VM's ntp client. Will correct
> > > that tomorrow if I was wrong.
> > >
> > >
> > > Happy to try to help fixing this in any way, as written above the quote
> > > I'm not even actually sure who is wrong here.
>
> A user in Debian reported the same/similar issue (with 4.19.13):
>
> https://bugs.debian.org/918036
>
> Regards,
> Salvatore

2019-01-04 05:29:25

by Dominique Martinet

Subject: Re: [PATCH v15 23/26] sched: early boot clock

Pavel Tatashin wrote on Thu, Jan 03, 2019:
> Could you please send the config file and qemu arguments that were
> used to reproduce this problem?

Running qemu by hand, nothing fancy e.g. this works:

# qemu-system-x86_64 -m 1G -smp 4 -drive file=/root/kvm-wrapper/disks/f2.img,if=virtio -serial mon:stdio --enable-kvm -cpu Haswell -device virtio-rng-pci -nographic

(used a specific cpu just in case but normally running with cpu host on
a Skylake machine; can probably go older)


qemu is fedora 29 blend as is:
$ qemu-system-x86_64 --version
QEMU emulator version 3.0.0 (qemu-3.0.0-3.fc29)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers


compressed .config attached to the mail, this can likely be trimmed down
some as well but that takes more time for me..
I didn't rebuild the kernel so not 100% sure (comes from
/proc/config.gz) but it should work on a 4.20-rc2 kernel as written in
the first few lines; 857baa87b64 I referred to in another mail was
merged in 4.19-rc1 so anything past that is probably OK to reproduce...


Re-checked today with these exact options (fresh VM start; then suspend
laptop for a bit, then reboot VM):
[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 2477.907447] kvm-clock: cpu 0, msr 153a4001, primary cpu clock
[ 2477.907448] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 2477.907450] tsc: Detected 2592.000 MHz processor
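
The step in the excerpt above can be spotted mechanically. A minimal sketch, assuming the usual printk "[seconds.microseconds]" prefix with a six-digit fraction; the function names and the `threshold_usec` parameter are made up for illustration and are not part of any kernel or qemu interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Parses the "[seconds.microseconds]" prefix of a printk line.
 * Returns true and stores the timestamp in microseconds on success.
 * Assumes the fractional part always has six digits.
 */
bool parse_printk_timestamp(const char *line, unsigned long long *usec)
{
	unsigned long long sec, frac;

	assert(line != NULL && usec != NULL);
	if (sscanf(line, "[ %llu.%6llu]", &sec, &frac) != 2)
		return false;
	*usec = sec * 1000000ULL + frac;
	return true;
}

/*
 * Returns true if next_line is stamped more than threshold_usec
 * after prev_line. Lines without a parsable prefix never count
 * as a jump.
 */
bool is_timestamp_jump(const char *prev_line, const char *next_line,
		       unsigned long long threshold_usec)
{
	unsigned long long prev, next;

	if (!parse_printk_timestamp(prev_line, &prev) ||
	    !parse_printk_timestamp(next_line, &next))
		return false;
	return next > prev && next - prev > threshold_usec;
}
```

Feeding consecutive dmesg lines to `is_timestamp_jump()` with a threshold of one second would flag the transition between the two kvm-clock lines above.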


As offered previously, happy to help in any way.

Thanks,
--
Dominique


Attachments:
(No filename) (1.58 kB)
config.xz (19.02 kB)

2019-01-07 18:36:55

by Pasha Tatashin

Subject: Re: [PATCH v15 23/26] sched: early boot clock

On Thu, Jan 3, 2019 at 6:43 PM Dominique Martinet
<[email protected]> wrote:
>
> Pavel Tatashin wrote on Thu, Jan 03, 2019:
> > Could you please send the config file and qemu arguments that were
> > used to reproduce this problem?
>
> Running qemu by hand, nothing fancy e.g. this works:
>
> # qemu-system-x86_64 -m 1G -smp 4 -drive file=/root/kvm-wrapper/disks/f2.img,if=virtio -serial mon:stdio --enable-kvm -cpu Haswell -device virtio-rng-pci -nographic
>
> (used a specific cpu just in case but normally running with cpu host on
> a Skylake machine; can probably go older)
>
>
> qemu is fedora 29 blend as is:
> $ qemu-system-x86_64 --version
> QEMU emulator version 3.0.0 (qemu-3.0.0-3.fc29)
> Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
>
>
> compressed .config attached to the mail, this can likely be trimmed down
> some as well but that takes more time for me..
> I didn't rebuild the kernel so not 100% sure (comes from
> /proc/config.gz) but it should work on a 4.20-rc2 kernel as written in
> the first few lines; 857baa87b64 I referred to in another mail was
> merged in 4.19-rc1 so anything past that is probably OK to reproduce...
>
>
> Re-checked today with these exact options (fresh VM start; then suspend
> laptop for a bit, then reboot VM):
> [ 0.000000] Hypervisor detected: KVM
> [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [ 2477.907447] kvm-clock: cpu 0, msr 153a4001, primary cpu clock
> [ 2477.907448] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [ 2477.907450] tsc: Detected 2592.000 MHz processor

I could not reproduce the problem. Did you suspend to memory between
wake ups? Does this time jump happen every time, even if your laptop
sleeps for a minute?

I have tried with qemu 2.6 and 3.1 on Ubuntu, testing 4.20-rc2.

Pasha

2019-01-07 23:50:26

by Dominique Martinet

Subject: Re: [PATCH v15 23/26] sched: early boot clock

Pavel Tatashin wrote on Mon, Jan 07, 2019:
> I could not reproduce the problem. Did you suspend to memory between
> wake ups? Does this time jump happen every time, even if your laptop
> sleeps for a minute?

I'm not sure I understand "suspend to memory between the wake ups".
The full sequence is:
- start a VM (just in case, I let it boot till the end)
- suspend to memory (aka systemctl suspend) the host
- after resuming the host, soft reboot the VM (login through
serial/ssh/whatever and reboot or in the qemu console 'system_reset')

I've just slept exactly one minute and reproduced again with the fedora
stock kernel now (4.19.13-300.fc29.x86_64) in the VM.

Interestingly I'm not getting the same offset between multiple reboots
now despite not suspending again; but if I don't suspend I cannot seem
to get it to give an offset at all (only tried for a few minutes; this
might not be true); OTOH I pushed my luck further and even with a five
seconds sleep I'm getting a noticeable offset on first VM reboot after
resume:

[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 179.362163] kvm-clock: cpu 0, msr 13c01001, primary cpu clock
[ 179.362163] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns

Honestly not sure what more information I could give, I'll try on some
other hardware than my laptop (if I can get a server to resume after
suspend through ipmi or wake on lan); but I don't have anything I could
install ubuntu on to try their qemu's version... although I really don't
want to believe that's the difference...

Thanks,
--
Dominique

2019-01-08 01:06:31

by Pasha Tatashin

Subject: Re: [PATCH v15 23/26] sched: early boot clock

I did exactly the same sequence on a Kaby Lake CPU and could not
reproduce it. What is your host CPU?

Thank you,
Pasha

On Mon, Jan 7, 2019 at 6:48 PM Dominique Martinet
<[email protected]> wrote:
>
> Pavel Tatashin wrote on Mon, Jan 07, 2019:
> > I could not reproduce the problem. Did you suspend to memory between
> > wake ups? Does this time jump happen every time, even if your laptop
> > sleeps for a minute?
>
> I'm not sure I understand "suspend to memory between the wake ups".
> The full sequence is:
> - start a VM (just in case, I let it boot till the end)
> - suspend to memory (aka systemctl suspend) the host
> - after resuming the host, soft reboot the VM (login through
> serial/ssh/whatever and reboot or in the qemu console 'system_reset')
>
> I've just slept exactly one minute and reproduced again with the fedora
> stock kernel now (4.19.13-300.fc29.x86_64) in the VM.
>
> Interestingly I'm not getting the same offset between multiple reboots
> now despite not suspending again; but if I don't suspend I cannot seem
> to get it to give an offset at all (only tried for a few minutes; this
> might not be true); OTOH I pushed my luck further and even with a five
> seconds sleep I'm getting a noticeable offset on first VM reboot after
> resume:
>
> [ 0.000000] Hypervisor detected: KVM
> [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [ 179.362163] kvm-clock: cpu 0, msr 13c01001, primary cpu clock
> [ 179.362163] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
>
> Honestly not sure what more information I could give, I'll try on some
> other hardware than my laptop (if I can get a server to resume after
> suspend through ipmi or wake on lan); but I don't have anything I could
> install ubuntu on to try their qemu's version... although I really don't
> want to believe that's the difference...
>
> Thanks,
> --
> Dominique

2019-01-08 01:10:46

by Dominique Martinet

Subject: Re: [PATCH v15 23/26] sched: early boot clock

Pavel Tatashin wrote on Mon, Jan 07, 2019:
> I did exactly the same sequence on a Kaby Lake CPU and could not
> reproduce it. What is your host CPU?

Skylake consumer laptop CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz

I don't have any kaby lake around; I have access to older servers though...
--
Dominique

2019-01-26 02:15:40

by Jon DeVree

Subject: Re: [PATCH v15 23/26] sched: early boot clock

On Mon, Jan 07, 2019 at 20:04:41 -0500, Pavel Tatashin wrote:
> I did exactly the same sequence on a Kaby Lake CPU and could not
> reproduce it. What is your host CPU?
>

I have some machines which display this bug and others that don't, so I
was able to figure out the difference between their configurations.

TL;DR: the bug appears to depend on whether or not
/sys/devices/system/clocksource/clocksource0/current_clocksource is set
to TSC in the hypervisor.
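
A minimal sketch of checking that setting on the host, assuming the sysfs path quoted above; the helper names `clocksource_is_tsc` and `read_current_clocksource` are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* sysfs path quoted in the message above. */
#define CLOCKSOURCE_PATH \
	"/sys/devices/system/clocksource/clocksource0/current_clocksource"

/* Returns true if the sysfs value names the TSC; a trailing newline is ignored. */
bool clocksource_is_tsc(const char *value)
{
	size_t len;

	assert(value != NULL);
	len = strcspn(value, "\n");
	return len == 3 && strncmp(value, "tsc", 3) == 0;
}

/*
 * Reads the current clocksource name into buf.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
int read_current_clocksource(char *buf, size_t size)
{
	FILE *f = fopen(CLOCKSOURCE_PATH, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)size, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}
```

Run on the hypervisor, a result other than "tsc" (for example "hpet") would match the configuration in which I see the bug.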

This is the log from a machine with the bug:

[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 12 and 11
[1162908.013830] kvm-clock: cpu 0, msr 3e0fea001, primary cpu clock
[1162908.013830] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[1162908.013834] tsc: Detected 1899.888 MHz processor

This is the log from a machine without the bug:

[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 149fea001, primary cpu clock
[ 0.000000] kvm-clock: using sched offset of 1558436482528906 cycles
[ 0.000002] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000004] tsc: Detected 2097.570 MHz processor

Note the additional line of output on the machine without the bug:

[ 0.000000] kvm-clock: using sched offset of 1558436482528906 cycles

This is printed from kvm_sched_clock_init() in
arch/x86/kernel/kvmclock.c based on whether or not the clock is stable.
For the clock to be stable both KVM_FEATURE_CLOCKSOURCE_STABLE_BIT and
PVCLOCK_TSC_STABLE_BIT have to be set. Both of these are controlled by
the hypervisor kernel.

* KVM_FEATURE_CLOCKSOURCE_STABLE_BIT is always set by the hypervisor
starting with Linux v2.6.35 - 371bcf646d17 ("KVM: x86: Tell the guest
we'll warn it about tsc stability")
* PVCLOCK_TSC_STABLE_BIT is set starting in Linux v3.8 but only if the
clocksource is the TSC - d828199e8444 ("KVM: x86: implement
PVCLOCK_TSC_STABLE_BIT pvclock flag")
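
The effect of the two bits can be modelled in userspace. This is only a sketch under assumptions: the `MODEL_*` flag values, the struct and the function names are invented for illustration and are not the kernel's definitions, and it assumes the stable path subtracts the offset printed at init while the unstable path returns the raw reading unchanged:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag values for this model only. */
#define MODEL_FEATURE_STABLE (1u << 0) /* stands in for KVM_FEATURE_CLOCKSOURCE_STABLE_BIT */
#define MODEL_PVCLOCK_STABLE (1u << 1) /* stands in for PVCLOCK_TSC_STABLE_BIT */

struct model_clock {
	uint64_t raw_ns;       /* raw kvm-clock reading */
	uint64_t sched_offset; /* subtracted from raw_ns on the stable path */
	bool stable;
};

/* The clock counts as stable only when both bits are set. */
bool model_clock_is_stable(unsigned int flags)
{
	unsigned int both = MODEL_FEATURE_STABLE | MODEL_PVCLOCK_STABLE;

	return (flags & both) == both;
}

/* Records the current raw reading as the offset, but only if stable. */
void model_sched_clock_init(struct model_clock *c, unsigned int flags)
{
	assert(c != NULL);
	c->stable = model_clock_is_stable(flags);
	c->sched_offset = c->stable ? c->raw_ns : 0;
}

/* Stable: time since init. Unstable: the raw reading, however large. */
uint64_t model_sched_clock(const struct model_clock *c)
{
	assert(c != NULL);
	return c->raw_ns - c->sched_offset;
}
```

With only one of the two bits set, `model_sched_clock()` starts from whatever the raw reading already is, which is consistent with the large first timestamp in the log from the machine with the bug.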

I changed the clocksource of a hypervisor that wasn't having issues from
TSC to HPET and when I started up a guest VM the bug suddenly appeared.
I shut down the guest, set the hypervisor's clocksource back to TSC,
started up the guest and the bug went away again.

You don't actually have to reboot the guest before the bug is visible
either, just letting the guest sit at the GRUB menu for a minute or two
before loading Linux is enough to make the bug plainly visible in the
printk timestamps.

I don't know enough to actually fix the bug, but hopefully this is
enough to allow everyone else to reproduce it and come up with a fix.

--
Jon
X(7): A program for managing terminal windows. See also screen(1) and tmux(1).

2019-01-26 16:12:00

by Pasha Tatashin

Subject: Re: [PATCH v15 23/26] sched: early boot clock

On 19-01-25 21:04:10, Jon DeVree wrote:
>
> * KVM_FEATURE_CLOCKSOURCE_STABLE_BIT is always set by the hypervisor
> starting with Linux v2.6.35 - 371bcf646d17 ("KVM: x86: Tell the guest
> we'll warn it about tsc stability")
> * PVCLOCK_TSC_STABLE_BIT is set starting in Linux v3.8 but only if the
> clocksource is the TSC - d828199e8444 ("KVM: x86: implement
> PVCLOCK_TSC_STABLE_BIT pvclock flag")
>
> I changed the clocksource of a hypervisor that wasn't having issues from
> TSC to HPET and when I started up a guest VM the bug suddenly appeared.
> I shut down the guest, set the hypervisor's clocksource back to TSC,
> started up the guest and the bug went away again.
>
> You don't actually have to reboot the guest before the bug is visible
> either, just letting the guest sit at the GRUB menu for a minute or two
> before loading Linux is enough to make the bug plainly visible in the
> printk timestamps.
>
> I don't know enough to actually fix the bug, but hopefully this is
> enough to allow everyone else to reproduce it and come up with a fix.

Thank you very much for your analysis. I am now able to reproduce the
problem by setting the clocksource on my machine to hpet. I will soon submit a
patch with a fix.

Pasha

>
> --
> Jon
> X(7): A program for managing terminal windows. See also screen(1) and tmux(1).