changelog
---------
v12 - v11
- Split "time: replace read_boot_clock64() with
read_persistent_wall_and_boot_offset()" into four patches
- Added two patches: one fixes an existing bug with text_poke(),
the other enables static branches early. Note: because I found
and fixed the text_poke() bug, enabling static branching became
straightforward, as no changes to jump_label* are needed.
- Modified "x86/tsc: use tsc early" to use static branches early,
so native_sched_clock() is not changed at all.
v11 - v10
- Addressed all the comments from Thomas Gleixner.
- I added one more patch:
"x86/tsc: prepare for early sched_clock" which fixes a problem
that I discovered while testing. I am not particularly happy with
the fix, as it adds a new argument that is used only in one
place, but if you have a suggestion for a different approach on
how to address this problem please let me know.
v10 - v9
- Added another patch to this series that removes dependency
between KVM clock, and memblock allocator. The benefit is that
all clocks can now be initialized even earlier.
v9 - v8
- Addressed more comments from Dou Liyang
v8 - v7
- Addressed comments from Dou Liyang:
- Moved tsc_early_init() and tsc_early_fini() to be all inside
tsc.c, and changed them to be static.
- Removed warning when notsc parameter is used.
- Merged with:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
v7 - v6
- Removed tsc_disabled flag; now notsc is equivalent to
tsc=unstable
- Simplified changes to sched/clock.c by removing
sched_clock_early() and friends, as requested by Peter Zijlstra.
We now always use sched_clock()
- Modified x86 sched_clock() to return either early boot time or
regular time.
- Added another example of why early boot time is important
v6 - v5
- Added a new patch:
time: sync read_boot_clock64() with persistent clock
which fixes a missing __init macro and enables the time
discrepancy fix noted by Thomas Gleixner
- Split "x86/time: read_boot_clock64() implementation" into a
separate patch
v5 - v4
- Fix compiler warnings on systems with stable clocks.
v4 - v3
- Fixed tsc_early_fini() call to be in the 2nd patch as reported
by Dou Liyang
- Improved comment before __use_sched_clock_early to explain why
we need both booleans.
- Simplified valid_clock logic in read_boot_clock64().
v3 - v2
- Addressed comment from Thomas Gleixner
- Timestamps are available a little later in boot but still much
earlier than in mainline. This significantly simplified this
work.
v2 - v1
In patch "x86/tsc: tsc early":
- added tsc_adjusted_early()
- fixed a 32-bit compile error by using do_div()
The early boot time stamps were discussed recently in these threads:
http://lkml.kernel.org/r/[email protected]
http://lkml.kernel.org/r/[email protected]
I updated my series to the latest mainline and am sending it again.
Peter mentioned he did not like patches 6 and 7, and we can discuss a
better way to do that, but I think patches 1-5 can be accepted
separately, since they already enable early timestamps on platforms
where sched_clock() is available early, such as KVM.
This series adds early boot time stamp support for x86 machines.
SPARC patches for early boot time stamps are already integrated into
mainline Linux.
Sample output
-------------
Before:
https://paste.ubuntu.com/26133428/
After:
https://paste.ubuntu.com/26133523/
For examples of how early time stamps are used, see the following work:
Example 1:
https://lwn.net/Articles/734374/
- Without early boot time stamps we would not know about the extra time
that is spent zeroing struct pages early in boot, even when deferred
page initialization is used.
Example 2:
https://patchwork.kernel.org/patch/10021247/
- If early boot timestamps were available, the engineer who introduced
this bug would have noticed the extra time that is spent early in boot.
Pavel Tatashin (7):
x86/tsc: remove tsc_disabled flag
time: sync read_boot_clock64() with persistent clock
x86/time: read_boot_clock64() implementation
sched: early boot clock
kvm/x86: remove kvm memblock dependency
x86/paravirt: add active_sched_clock to pv_time_ops
x86/tsc: use tsc early
Example 3:
http://lkml.kernel.org/r/[email protected]
- Needed early time stamps to show improvement
Pavel Tatashin (11):
x86: text_poke() may access uninitialized struct pages
x86: initialize static branching early
x86/tsc: redefine notsc to behave as tsc=unstable
kvm/x86: remove kvm memblock dependency
s390/time: add read_persistent_wall_and_boot_offset()
time: replace read_boot_clock64() with
read_persistent_wall_and_boot_offset()
s390/time: remove read_boot_clock64()
ARM/time: remove read_boot_clock64()
x86/tsc: prepare for early sched_clock
sched: early boot clock
x86/tsc: use tsc early
.../admin-guide/kernel-parameters.txt | 2 -
Documentation/x86/x86_64/boot-options.txt | 4 +-
arch/arm/kernel/time.c | 12 +---
arch/s390/kernel/time.c | 15 +++--
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 10 ++-
arch/x86/kernel/cpu/amd.c | 13 ++--
arch/x86/kernel/cpu/common.c | 38 ++++++-----
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/kvmclock.c | 64 +++---------------
arch/x86/kernel/setup.c | 11 ++--
arch/x86/kernel/tsc.c | 65 ++++++++++---------
include/linux/timekeeping.h | 3 +-
kernel/sched/clock.c | 10 ++-
kernel/time/timekeeping.c | 61 ++++++++---------
15 files changed, 140 insertions(+), 170 deletions(-)
--
2.17.1
We want to get timestamps and a high resolution clock available to us as
early as possible in boot. But, native_sched_clock() outputs time based
either on tsc after tsc_init() is called later in boot, or on jiffies once
clock interrupts are enabled, which also happens later in boot.
On the other hand, we know the tsc frequency from as early as when
tsc_early_delay_calibrate() is called. So, we use the early tsc calibration
to output timestamps early. Later in boot, when tsc_init() is called, we
calibrate tsc again using more precise methods, and start using that.
Since sched_clock() is in a hot path, we want to make sure that no
regressions are introduced to this function; with the current approach the
sched_clock() path is not modified at all.
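To make the conversion concrete, below is a minimal, standalone C sketch of
the fixed-point cycles-to-nanoseconds math that a TSC-based sched_clock()
relies on. The mult/shift values and helper names are made up for
illustration (the 3192137 kHz frequency matches the sample log later in
this series); the kernel's real cyc2ns data also carries an offset and is
updated under a seqcount latch.

#include <stdio.h>
#include <stdint.h>

/*
 * Illustrative fixed-point conversion: ns = (cycles * mult) >> shift,
 * where mult/shift approximate 1e6 / tsc_khz.
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}

int main(void)
{
	uint64_t tsc_khz = 3192137;		/* ~3.19 GHz, example value */
	uint32_t shift = 22;
	uint32_t mult = (uint32_t)(((uint64_t)1000000 << shift) / tsc_khz);
	uint64_t cycles = 3192137000ULL;	/* roughly one second of cycles */

	printf("%llu cycles ~= %llu ns\n",
	       (unsigned long long)cycles,
	       (unsigned long long)cycles_to_ns(cycles, mult, shift));
	return 0;
}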
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/tsc.c | 40 ++++++++++++++++++++++++++--------------
1 file changed, 26 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 654a01cc0358..b8a893d8f84a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -133,22 +133,13 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
return ns;
}
-static void set_cyc2ns_scale(unsigned long khz, int cpu,
- unsigned long long tsc_now,
- unsigned long long sched_now)
+static void __set_cyc2ns_scale(unsigned long khz, int cpu,
+ unsigned long long tsc_now,
+ unsigned long long sched_now)
{
- unsigned long long ns_now;
+ unsigned long long ns_now = cycles_2_ns(tsc_now) + sched_now;
struct cyc2ns_data data;
struct cyc2ns *c2n;
- unsigned long flags;
-
- local_irq_save(flags);
- sched_clock_idle_sleep_event();
-
- if (!khz)
- goto done;
-
- ns_now = cycles_2_ns(tsc_now) + sched_now;
/*
* Compute a new multiplier as per the above comment and ensure our
@@ -178,12 +169,31 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu,
c2n->data[0] = data;
raw_write_seqcount_latch(&c2n->seq);
c2n->data[1] = data;
+}
+
+static void set_cyc2ns_scale(unsigned long khz, int cpu,
+ unsigned long long tsc_now,
+ unsigned long long sched_now)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ if (khz)
+ __set_cyc2ns_scale(khz, cpu, tsc_now, sched_now);
-done:
sched_clock_idle_wakeup_event();
local_irq_restore(flags);
}
+static void __init sched_clock_early_init(unsigned int khz)
+{
+ cyc2ns_init(smp_processor_id());
+ __set_cyc2ns_scale(khz, smp_processor_id(), rdtsc(), 0);
+ static_branch_enable(&__use_tsc);
+}
+
/*
* Scheduler clock - returns current time in nanosec units.
*/
@@ -1354,6 +1364,7 @@ void __init tsc_early_delay_calibrate(void)
lpj = tsc_khz * 1000;
do_div(lpj, HZ);
loops_per_jiffy = lpj;
+ sched_clock_early_init(tsc_khz);
}
void __init tsc_init(void)
@@ -1382,6 +1393,7 @@ void __init tsc_init(void)
if (!tsc_khz) {
mark_tsc_unstable("could not calculate TSC khz");
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+ static_branch_disable(&__use_tsc);
return;
}
--
2.17.1
Allow sched_clock() to be used before sched_clock_init() and
sched_clock_init_late() are called. This provides a way to get
early boot timestamps on machines with unstable clocks.
Signed-off-by: Pavel Tatashin <[email protected]>
---
kernel/sched/clock.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 10c83e73837a..f034392b0f6c 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -205,6 +205,11 @@ void clear_sched_clock_stable(void)
*/
static int __init sched_clock_init_late(void)
{
+ /* Transition to unstable clock from early clock */
+ local_irq_disable();
+ __gtod_offset = sched_clock() + __sched_clock_offset - ktime_get_ns();
+ local_irq_enable();
+
sched_clock_running = 2;
/*
* Ensure that it is impossible to not do a static_key update.
@@ -350,8 +355,9 @@ u64 sched_clock_cpu(int cpu)
if (sched_clock_stable())
return sched_clock() + __sched_clock_offset;
- if (unlikely(!sched_clock_running))
- return 0ull;
+ /* Use early clock until sched_clock_init_late() */
+ if (unlikely(sched_clock_running < 2))
+ return sched_clock() + __sched_clock_offset;
preempt_disable_notrace();
scd = cpu_sdc(cpu);
--
2.17.1
It is supposed to be safe to modify static branches after jump_label_init().
But because static key modifying code eventually calls text_poke(), we
may end up accessing struct pages that have not been initialized.
Here is how to quickly reproduce the problem. Insert code like this
into init/main.c:
| +static DEFINE_STATIC_KEY_FALSE(__test);
| asmlinkage __visible void __init start_kernel(void)
| {
| char *command_line;
|@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
| vfs_caches_init_early();
| sort_main_extable();
| trap_init();
|+ {
|+ static_branch_enable(&__test);
|+ WARN_ON(!static_branch_likely(&__test));
|+ }
| mm_init();
The following warning shows up:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc1_pt_t1 #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.11.0-20171110_100015-anatol 04/01/2014
RIP: 0010:text_poke+0x20d/0x230
Code: 0f 0b 4c 89 e2 4c 89 ee 4c 89 f7 e8 7d 4b 9b 00 31 d2 31 f6 bf 86 02
00 00 48 8b 05 95 8e 24 01 e8 78 18 d8 00 e9 55 ff ff ff <0f> 0b e9 54 fe
ff ff 48 8b 05 75 a8 38 01 e9 64 fe ff ff 48 8b 1d
RSP: 0000:ffffffff94e03e30 EFLAGS: 00010046
RAX: 0100000000000000 RBX: fffff7b2c011f300 RCX: ffffffff94fcccf4
RDX: 0000000000000001 RSI: ffffffff94e03e77 RDI: ffffffff94fcccef
RBP: ffffffff94fcccef R08: 00000000fffffe00 R09: 00000000000000a0
R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000001
R13: ffffffff94e03e77 R14: ffffffff94fcdcef R15: fffff7b2c0000000
FS: 0000000000000000(0000) GS:ffff9adc87c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff9adc8499d000 CR3: 000000000460a001 CR4: 00000000000606b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? start_kernel+0x23e/0x4c8
? start_kernel+0x23f/0x4c8
? text_poke_bp+0x50/0xda
? arch_jump_label_transform+0x89/0xe0
? __jump_label_update+0x78/0xb0
? static_key_enable_cpuslocked+0x4d/0x80
? static_key_enable+0x11/0x20
? start_kernel+0x23e/0x4c8
? secondary_startup_64+0xa5/0xb0
---[ end trace abdc99c031b8a90a ]---
If the code above is moved after mm_init(), no warning is shown, as struct
pages are initialized during handover from memblock.
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 10 +++++++++-
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 2ecd34e2d46c..e85ff65c43c3 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -37,5 +37,6 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern int after_bootmem;
#endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index a481763a3776..0230dbc3c599 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -686,13 +686,21 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
*
* Note: Must be called under text_mutex.
*/
-void *text_poke(void *addr, const void *opcode, size_t len)
+void __ref *text_poke(void *addr, const void *opcode, size_t len)
{
unsigned long flags;
char *vaddr;
struct page *pages[2];
int i;
+ /* While boot memory allocator is runnig we cannot use struct
+ * pages as they are not yet initialized. However, we also know
+ * that this is early in boot, and it is safe to fallback to
+ * text_poke_early.
+ */
+ if (unlikely(!after_bootmem))
+ return text_poke_early(addr, opcode, len);
+
if (!core_kernel_text((unsigned long)addr)) {
pages[0] = vmalloc_to_page(addr);
pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
--
2.17.1
KVM clock is initialized later than on other hypervisors because it has a
dependency on the memblock allocator.
Let's bring it in line with other hypervisors by removing this dependency
and using memory from BSS instead of allocating it.
The benefits:
- remove ifdef from common code
- earlier availability of TSC.
- remove dependency on memblock, and reduce code
- earlier kvm sched_clock()
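For reference, the allocation pattern the patch switches to can be shown in
a small standalone sketch; the struct, sizes, and names below are stand-ins
for illustration only (the real change is in the diff that follows).

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE	4096
#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & ~((uintptr_t)PAGE_SIZE - 1))

/* Stand-in for the per-CPU pvclock data; size and CPU count are made up. */
struct fake_time_info { uint8_t data[64]; };
#define NR_CPUS		64

/* Reserve page-aligned space in BSS at link time instead of allocating it
 * from memblock at runtime -- the same idea the patch applies to kvmclock. */
static uint8_t hv_clock_mem[PAGE_ALIGN(sizeof(struct fake_time_info) * NR_CPUS)]
	__attribute__((aligned(PAGE_SIZE)));

int main(void)
{
	printf("hv_clock_mem: %p, %zu bytes\n",
	       (void *)hv_clock_mem, sizeof(hv_clock_mem));
	return 0;
}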
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/kvmclock.c | 64 ++++++--------------------------------
arch/x86/kernel/setup.c | 7 ++---
3 files changed, 12 insertions(+), 60 deletions(-)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5b2300b818af..c65c232d3ddd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -628,6 +628,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.name = "KVM",
.detect = kvm_detect,
.type = X86_HYPER_KVM,
+ .init.init_platform = kvmclock_init,
.init.guest_late_init = kvm_guest_init,
.init.x2apic_available = kvm_para_available,
};
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index bf8d1eb7fca3..01558d41ec2c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,9 +23,9 @@
#include <asm/apic.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
-#include <linux/memblock.h>
#include <linux/sched.h>
#include <linux/sched/clock.h>
+#include <linux/mm.h>
#include <asm/mem_encrypt.h>
#include <asm/x86_init.h>
@@ -44,6 +44,11 @@ static int parse_no_kvmclock(char *arg)
}
early_param("no-kvmclock", parse_no_kvmclock);
+/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
+#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define WALL_CLOCK_SIZE (sizeof(struct pvclock_wall_clock))
+static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
/* The hypervisor will put information about time periodically here */
static struct pvclock_vsyscall_time_info *hv_clock;
static struct pvclock_wall_clock *wall_clock;
@@ -244,43 +249,12 @@ static void kvm_shutdown(void)
native_machine_shutdown();
}
-static phys_addr_t __init kvm_memblock_alloc(phys_addr_t size,
- phys_addr_t align)
-{
- phys_addr_t mem;
-
- mem = memblock_alloc(size, align);
- if (!mem)
- return 0;
-
- if (sev_active()) {
- if (early_set_memory_decrypted((unsigned long)__va(mem), size))
- goto e_free;
- }
-
- return mem;
-e_free:
- memblock_free(mem, size);
- return 0;
-}
-
-static void __init kvm_memblock_free(phys_addr_t addr, phys_addr_t size)
-{
- if (sev_active())
- early_set_memory_encrypted((unsigned long)__va(addr), size);
-
- memblock_free(addr, size);
-}
-
void __init kvmclock_init(void)
{
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned long mem, mem_wall_clock;
- int size, cpu, wall_clock_size;
+ int cpu;
u8 flags;
- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
if (!kvm_para_available())
return;
@@ -290,28 +264,11 @@ void __init kvmclock_init(void)
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;
- wall_clock_size = PAGE_ALIGN(sizeof(struct pvclock_wall_clock));
- mem_wall_clock = kvm_memblock_alloc(wall_clock_size, PAGE_SIZE);
- if (!mem_wall_clock)
- return;
-
- wall_clock = __va(mem_wall_clock);
- memset(wall_clock, 0, wall_clock_size);
-
- mem = kvm_memblock_alloc(size, PAGE_SIZE);
- if (!mem) {
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
- wall_clock = NULL;
- return;
- }
-
- hv_clock = __va(mem);
- memset(hv_clock, 0, size);
+ wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
+ hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
if (kvm_register_clock("primary cpu clock")) {
hv_clock = NULL;
- kvm_memblock_free(mem, size);
- kvm_memblock_free(mem_wall_clock, wall_clock_size);
wall_clock = NULL;
return;
}
@@ -354,13 +311,10 @@ int __init kvm_setup_vsyscall_timeinfo(void)
int cpu;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
- unsigned int size;
if (!hv_clock)
return 0;
- size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
cpu = get_cpu();
vcpu_time = &hv_clock[cpu].pvti;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 403b2d2c31d2..01fcc8bf7c8f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1014,6 +1014,8 @@ void __init setup_arch(char **cmdline_p)
*/
init_hypervisor_platform();
+ tsc_early_delay_calibrate();
+
x86_init.resources.probe_roms();
/* after parse_early_param, so could debug it */
@@ -1199,11 +1201,6 @@ void __init setup_arch(char **cmdline_p)
memblock_find_dma_reserve();
-#ifdef CONFIG_KVM_GUEST
- kvmclock_init();
-#endif
-
- tsc_early_delay_calibrate();
if (!early_xdbc_setup_hardware())
early_xdbc_register_console();
--
2.17.1
read_boot_clock64() is deleted, and replaced with
read_persistent_wall_and_boot_offset().
The default implementation of read_persistent_wall_and_boot_offset()
provides a better fallback than the current stubs for read_boot_clock64()
that arm has, so remove the old code.
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm/kernel/time.c | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)
diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c
index cf2701cb0de8..0a6a457b13c7 100644
--- a/arch/arm/kernel/time.c
+++ b/arch/arm/kernel/time.c
@@ -83,29 +83,19 @@ static void dummy_clock_access(struct timespec64 *ts)
}
static clock_access_fn __read_persistent_clock = dummy_clock_access;
-static clock_access_fn __read_boot_clock = dummy_clock_access;
void read_persistent_clock64(struct timespec64 *ts)
{
__read_persistent_clock(ts);
}
-void read_boot_clock64(struct timespec64 *ts)
-{
- __read_boot_clock(ts);
-}
-
int __init register_persistent_clock(clock_access_fn read_boot,
clock_access_fn read_persistent)
{
/* Only allow the clockaccess functions to be registered once */
- if (__read_persistent_clock == dummy_clock_access &&
- __read_boot_clock == dummy_clock_access) {
- if (read_boot)
- __read_boot_clock = read_boot;
+ if (__read_persistent_clock == dummy_clock_access) {
if (read_persistent)
__read_persistent_clock = read_persistent;
-
return 0;
}
--
2.17.1
read_persistent_wall_and_boot_offset() will replace read_boot_clock64()
because on some architectures it is more convenient to read both sources
together, as one may depend on the other. For s390, the implementation is
the same as read_boot_clock64(), but it also calls read_persistent_clock64()
and returns its value.
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/s390/kernel/time.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index cf561160ea88..d1f5447d5687 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -221,6 +221,24 @@ void read_persistent_clock64(struct timespec64 *ts)
ext_to_timespec64(clk, ts);
}
+void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+ struct timespec64 *boot_offset)
+{
+ unsigned char clk[STORE_CLOCK_EXT_SIZE];
+ struct timespec64 boot_time;
+ __u64 delta;
+
+ delta = initial_leap_seconds + TOD_UNIX_EPOCH;
+ memcpy(clk, tod_clock_base, STORE_CLOCK_EXT_SIZE);
+ *(__u64 *)&clk[1] -= delta;
+ if (*(__u64 *)&clk[1] > delta)
+ clk[0]--;
+ ext_to_timespec64(clk, &boot_time);
+
+ read_persistent_clock64(wall_time);
+ *boot_offset = timespec64_sub(*wall_time, boot_time);
+}
+
void read_boot_clock64(struct timespec64 *ts)
{
unsigned char clk[STORE_CLOCK_EXT_SIZE];
--
2.17.1
We will change sched_clock() to be called early. But, during boot
sched_clock() changes its output without notifying us about the change of
clock source.
This happens in tsc_init(), when static_branch_enable(&__use_tsc) is
called.
native_sched_clock() changes from outputting jiffies to reading tsc, but
sched is not notified in any way. So, to preserve the continuity in this
place we add the offset of sched_clock() to the calculation of cyc2ns.
Without this change, the output would look like this:
[ 0.004000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.009000] tsc: Fast TSC calibration using PIT
[ 0.010000] tsc: Detected 3192.137 MHz processor
[ 0.011000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e03465ceb2, max_idle_ns: 440795259855 ns
static_branch_enable(__use_tsc) is called, and timestamps become precise
but jump backwards:
[ 0.002233] Calibrating delay loop (skipped), value calculated using timer frequency.. 6384.27 BogoMIPS (lpj=3192137)
[ 0.002516] pid_max: default: 32768 minimum: 301
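As a worked illustration with the numbers above: sched_clock() has reached
roughly 11 ms when __use_tsc is enabled, so initializing the cyc2ns data
with that value as an offset (ns_now = cycles_2_ns(tsc_now) + sched_now)
makes the first TSC-based timestamp continue from about 0.011000 instead of
dropping back to about 0.002233.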
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/tsc.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 186395041725..654a01cc0358 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -133,7 +133,9 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
return ns;
}
-static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+static void set_cyc2ns_scale(unsigned long khz, int cpu,
+ unsigned long long tsc_now,
+ unsigned long long sched_now)
{
unsigned long long ns_now;
struct cyc2ns_data data;
@@ -146,7 +148,7 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
if (!khz)
goto done;
- ns_now = cycles_2_ns(tsc_now);
+ ns_now = cycles_2_ns(tsc_now) + sched_now;
/*
* Compute a new multiplier as per the above comment and ensure our
@@ -936,7 +938,7 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
if (!(freq->flags & CPUFREQ_CONST_LOOPS))
mark_tsc_unstable("cpufreq changes");
- set_cyc2ns_scale(tsc_khz, freq->cpu, rdtsc());
+ set_cyc2ns_scale(tsc_khz, freq->cpu, rdtsc(), 0);
}
return 0;
@@ -1285,7 +1287,7 @@ static void tsc_refine_calibration_work(struct work_struct *work)
/* Update the sched_clock() rate to match the clocksource one */
for_each_possible_cpu(cpu)
- set_cyc2ns_scale(tsc_khz, cpu, tsc_stop);
+ set_cyc2ns_scale(tsc_khz, cpu, tsc_stop, 0);
out:
if (tsc_unstable)
@@ -1356,7 +1358,7 @@ void __init tsc_early_delay_calibrate(void)
void __init tsc_init(void)
{
- u64 lpj, cyc;
+ u64 lpj, cyc, sch;
int cpu;
if (!boot_cpu_has(X86_FEATURE_TSC)) {
@@ -1403,9 +1405,10 @@ void __init tsc_init(void)
* up if their speed diverges)
*/
cyc = rdtsc();
+ sch = local_clock();
for_each_possible_cpu(cpu) {
cyc2ns_init(cpu);
- set_cyc2ns_scale(tsc_khz, cpu, cyc);
+ set_cyc2ns_scale(tsc_khz, cpu, cyc, sch);
}
static_branch_enable(&__use_tsc);
--
2.17.1
Currently, the notsc kernel parameter disables the use of the tsc register
by sched_clock(). However, this parameter does not prevent Linux from
accessing tsc in other places in the kernel.
The only rationale for booting with notsc is to avoid timing discrepancies
on multi-socket systems where different tsc frequencies may be present, and
thus to fall back to jiffies as the clock source.
However, there is another method to solve the above problem: boot with the
tsc=unstable parameter. This parameter allows sched_clock() to use tsc, but
in case tsc is outside of the expected interval it is corrected back to a
sane value.
This is why there is no reason to keep notsc, and it can be removed. But,
for compatibility reasons we will keep this parameter and change its
definition to be the same as tsc=unstable.
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Dou Liyang <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 2 --
Documentation/x86/x86_64/boot-options.txt | 4 +---
arch/x86/kernel/tsc.c | 18 +++---------------
3 files changed, 4 insertions(+), 20 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index efc7aa7a0670..f7123d28f318 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2835,8 +2835,6 @@
nosync [HW,M68K] Disables sync negotiation for all devices.
- notsc [BUGS=X86-32] Disable Time Stamp Counter
-
nowatchdog [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).
diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 8d109ef67ab6..66114ab4f9fe 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -92,9 +92,7 @@ APICs
Timing
notsc
- Don't use the CPU time stamp counter to read the wall time.
- This can be used to work around timing problems on multiprocessor systems
- with not properly synchronized CPUs.
+ Deprecated, use tsc=unstable instead.
nohpet
Don't use the HPET timer.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 74392d9d51e0..186395041725 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -38,11 +38,6 @@ EXPORT_SYMBOL(tsc_khz);
*/
static int __read_mostly tsc_unstable;
-/* native_sched_clock() is called before tsc_init(), so
- we must start with the TSC soft disabled to prevent
- erroneous rdtsc usage on !boot_cpu_has(X86_FEATURE_TSC) processors */
-static int __read_mostly tsc_disabled = -1;
-
static DEFINE_STATIC_KEY_FALSE(__use_tsc);
int tsc_clocksource_reliable;
@@ -248,8 +243,7 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
#ifdef CONFIG_X86_TSC
int __init notsc_setup(char *str)
{
- pr_warn("Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely\n");
- tsc_disabled = 1;
+ mark_tsc_unstable("boot parameter notsc");
return 1;
}
#else
@@ -1307,7 +1301,7 @@ static void tsc_refine_calibration_work(struct work_struct *work)
static int __init init_tsc_clocksource(void)
{
- if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_disabled > 0 || !tsc_khz)
+ if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
return 0;
if (tsc_unstable)
@@ -1414,12 +1408,6 @@ void __init tsc_init(void)
set_cyc2ns_scale(tsc_khz, cpu, cyc);
}
- if (tsc_disabled > 0)
- return;
-
- /* now allow native_sched_clock() to use rdtsc */
-
- tsc_disabled = 0;
static_branch_enable(&__use_tsc);
if (!no_sched_irq_time)
@@ -1455,7 +1443,7 @@ unsigned long calibrate_delay_is_known(void)
int constant_tsc = cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC);
const struct cpumask *mask = topology_core_cpumask(cpu);
- if (tsc_disabled || !constant_tsc || !mask)
+ if (!constant_tsc || !mask)
return 0;
sibling = cpumask_any_but(mask, cpu);
--
2.17.1
Static branching is useful to hot-patch branches that are used in hot
paths, but are infrequently changed.
The x86 clock framework is one example that uses static branches to set up
the best clock during boot and never change it again.
Since we plan to enable the clock early, we need static branching
functionality early as well.
Static branching requires patching nop instructions, thus we need
arch_init_ideal_nops() to be called prior to jump_label_init().
Here we do all the necessary steps to call arch_init_ideal_nops()
after early_cpu_init().
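For context, here is a minimal sketch of the jump-label API whose patching
depends on the ideal nops selected by arch_init_ideal_nops(). The key and
function names are hypothetical, and the snippet assumes a kernel build
context rather than being a standalone program.

#include <linux/jump_label.h>
#include <linux/types.h>

/* Hypothetical key, default false, used only for illustration. */
static DEFINE_STATIC_KEY_FALSE(__my_feature);

static u64 my_clock_read(void)
{
	/* Compiles to a nop/jmp that gets patched at enable time. */
	if (static_branch_likely(&__my_feature))
		return 1;	/* fast path, e.g. TSC-based */
	return 0;		/* fallback path, e.g. jiffies-based */
}

static void my_feature_enable(void)
{
	/* Patches the branch site; relies on text_poke() being safe. */
	static_branch_enable(&__my_feature);
}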
Signed-off-by: Pavel Tatashin <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 13 +++++++-----
arch/x86/kernel/cpu/common.c | 38 +++++++++++++++++++-----------------
arch/x86/kernel/setup.c | 4 ++--
3 files changed, 30 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 082d7875cef8..355105aebc4e 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -232,8 +232,6 @@ static void init_amd_k7(struct cpuinfo_x86 *c)
}
}
- set_cpu_cap(c, X86_FEATURE_K7);
-
/* calling is from identify_secondary_cpu() ? */
if (!c->cpu_index)
return;
@@ -615,6 +613,14 @@ static void early_init_amd(struct cpuinfo_x86 *c)
early_init_amd_mc(c);
+#ifdef CONFIG_X86_32
+ if (c->x86 == 6)
+ set_cpu_cap(c, X86_FEATURE_K7);
+#endif
+
+ if (c->x86 >= 0xf)
+ set_cpu_cap(c, X86_FEATURE_K8);
+
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
/*
@@ -861,9 +867,6 @@ static void init_amd(struct cpuinfo_x86 *c)
init_amd_cacheinfo(c);
- if (c->x86 >= 0xf)
- set_cpu_cap(c, X86_FEATURE_K8);
-
if (cpu_has(c, X86_FEATURE_XMM2)) {
unsigned long long val;
int ret;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 0df7151cfef4..952d31f75821 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1012,6 +1012,24 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
}
+/*
+ * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
+ * unfortunately, that's not true in practice because of early VIA
+ * chips and (more importantly) broken virtualizers that are not easy
+ * to detect. In the latter case it doesn't even *fail* reliably, so
+ * probing for it doesn't even work. Disable it completely on 32-bit
+ * unless we can find a reliable way to detect all the broken cases.
+ * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
+ */
+static void detect_nopl(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_32
+ clear_cpu_cap(c, X86_FEATURE_NOPL);
+#else
+ set_cpu_cap(c, X86_FEATURE_NOPL);
+#endif
+}
+
/*
* Do minimum CPU detection early.
* Fields really needed: vendor, cpuid_level, family, model, mask,
@@ -1086,6 +1104,8 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
*/
if (!pgtable_l5_enabled())
setup_clear_cpu_cap(X86_FEATURE_LA57);
+
+ detect_nopl(c);
}
void __init early_cpu_init(void)
@@ -1121,24 +1141,6 @@ void __init early_cpu_init(void)
early_identify_cpu(&boot_cpu_data);
}
-/*
- * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
- * unfortunately, that's not true in practice because of early VIA
- * chips and (more importantly) broken virtualizers that are not easy
- * to detect. In the latter case it doesn't even *fail* reliably, so
- * probing for it doesn't even work. Disable it completely on 32-bit
- * unless we can find a reliable way to detect all the broken cases.
- * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
- */
-static void detect_nopl(struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_X86_32
- clear_cpu_cap(c, X86_FEATURE_NOPL);
-#else
- set_cpu_cap(c, X86_FEATURE_NOPL);
-#endif
-}
-
static void detect_null_seg_behavior(struct cpuinfo_x86 *c)
{
#ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..403b2d2c31d2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -866,6 +866,8 @@ void __init setup_arch(char **cmdline_p)
idt_setup_early_traps();
early_cpu_init();
+ arch_init_ideal_nops();
+ jump_label_init();
early_ioremap_init();
setup_olpc_ofw_pgd();
@@ -1272,8 +1274,6 @@ void __init setup_arch(char **cmdline_p)
mcheck_init();
- arch_init_ideal_nops();
-
register_refined_jiffies(CLOCK_TICK_RATE);
#ifdef CONFIG_EFI
--
2.17.1
If the architecture does not support an exact boot time, it is challenging
to estimate boot time without having a reference to the current persistent
clock value. Yet, we cannot read the persistent clock time again, because
this may lead to math discrepancies with the caller of read_boot_clock64(),
which has read the persistent clock at a different time.
This is why it is better to provide two values simultaneously: the
persistent clock value, and the boot time.
Thus, we replace read_boot_clock64() with:
read_persistent_wall_and_boot_offset(wall_time, boot_offset)
where wall_time is returned by read_persistent_clock()
and boot_offset is wall_time - boot time.
We calculate boot_offset using the current value of local_clock(), so
architectures that do not have a dedicated boot clock but have an early
sched_clock(), such as SPARCv9, x86, and possibly more, will benefit from
this change by getting a better and more consistent estimate of the boot
time without the need for an arch specific implementation.
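As a worked example with made-up numbers: if read_persistent_clock64()
returns 1000.0 s at timekeeping_init() and local_clock() reports 2.5 s
since boot, then:

  wall_time    = 1000.0 s   (persistent clock)
  boot_offset  =    2.5 s   (local_clock(), time since boot)
  boot time    =  997.5 s   (wall_time - boot_offset)
  wall_to_mono = boot_offset - wall_time = -997.5 s

so wall_time + wall_to_mono = 2.5 s, the time elapsed since boot.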
Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/timekeeping.h | 3 +-
kernel/time/timekeeping.c | 61 +++++++++++++++++++------------------
2 files changed, 34 insertions(+), 30 deletions(-)
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 86bc2026efce..686bc27acef0 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -243,7 +243,8 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
extern int persistent_clock_is_local;
extern void read_persistent_clock64(struct timespec64 *ts);
-extern void read_boot_clock64(struct timespec64 *ts);
+void read_persistent_clock_and_boot_offset(struct timespec64 *wall_clock,
+ struct timespec64 *boot_offset);
extern int update_persistent_clock64(struct timespec64 now);
/*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4786df904c22..aface5c13e7d 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -17,6 +17,7 @@
#include <linux/nmi.h>
#include <linux/sched.h>
#include <linux/sched/loadavg.h>
+#include <linux/sched/clock.h>
#include <linux/syscore_ops.h>
#include <linux/clocksource.h>
#include <linux/jiffies.h>
@@ -1496,18 +1497,23 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
}
/**
- * read_boot_clock64 - Return time of the system start.
+ * read_persistent_wall_and_boot_offset - Read persistent clock, and also offset
+ * from the boot.
*
* Weak dummy function for arches that do not yet support it.
- * Function to read the exact time the system has been started.
- * Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
- *
- * XXX - Do be sure to remove it once all arches implement it.
+ * wall_time - current time as returned by persistent clock
+ * boot_offset - offset that is defined as wall_time - boot_time
+ * The default function calculates offset based on the current value of
+ * local_clock(). This way architectures that support sched_clock() but don't
+ * support dedicated boot time clock will provide the best estimate of the
+ * boot time.
*/
-void __weak read_boot_clock64(struct timespec64 *ts)
+void __weak __init
+read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+ struct timespec64 *boot_offset)
{
- ts->tv_sec = 0;
- ts->tv_nsec = 0;
+ read_persistent_clock64(wall_time);
+ *boot_offset = ns_to_timespec64(local_clock());
}
/* Flag for if timekeeping_resume() has injected sleeptime */
@@ -1521,28 +1527,28 @@ static bool persistent_clock_exists;
*/
void __init timekeeping_init(void)
{
+ struct timespec64 wall_time, boot_offset, wall_to_mono;
struct timekeeper *tk = &tk_core.timekeeper;
struct clocksource *clock;
unsigned long flags;
- struct timespec64 now, boot, tmp;
-
- read_persistent_clock64(&now);
- if (!timespec64_valid_strict(&now)) {
- pr_warn("WARNING: Persistent clock returned invalid value!\n"
- " Check your CMOS/BIOS settings.\n");
- now.tv_sec = 0;
- now.tv_nsec = 0;
- } else if (now.tv_sec || now.tv_nsec)
- persistent_clock_exists = true;
- read_boot_clock64(&boot);
- if (!timespec64_valid_strict(&boot)) {
- pr_warn("WARNING: Boot clock returned invalid value!\n"
- " Check your CMOS/BIOS settings.\n");
- boot.tv_sec = 0;
- boot.tv_nsec = 0;
+ read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
+ if (timespec64_valid_strict(&wall_time) &&
+ timespec64_to_ns(&wall_time)) {
+ persistent_clock_exists = true;
+ } else {
+ pr_warn("Persistent clock returned invalid value");
+ wall_time = (struct timespec64){0};
}
+ if (timespec64_compare(&wall_time, &boot_offset) < 0)
+ boot_offset = (struct timespec64){0};
+
+ /* We want set wall_to_mono, so the following is true:
+ * wall time + wall_to_mono = boot time
+ */
+ wall_to_mono = timespec64_sub(boot_offset, wall_time);
+
raw_spin_lock_irqsave(&timekeeper_lock, flags);
write_seqcount_begin(&tk_core.seq);
ntp_init();
@@ -1552,13 +1558,10 @@ void __init timekeeping_init(void)
clock->enable(clock);
tk_setup_internals(tk, clock);
- tk_set_xtime(tk, &now);
+ tk_set_xtime(tk, &wall_time);
tk->raw_sec = 0;
- if (boot.tv_sec == 0 && boot.tv_nsec == 0)
- boot = tk_xtime(tk);
- set_normalized_timespec64(&tmp, -boot.tv_sec, -boot.tv_nsec);
- tk_set_wall_to_mono(tk, tmp);
+ tk_set_wall_to_mono(tk, wall_to_mono);
timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
--
2.17.1
read_boot_clock64() was replaced by read_persistent_wall_and_boot_offset()
so remove it.
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/s390/kernel/time.c | 13 -------------
1 file changed, 13 deletions(-)
diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index d1f5447d5687..e8766beee5ad 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -239,19 +239,6 @@ void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
*boot_offset = timespec64_sub(*wall_time, boot_time);
}
-void read_boot_clock64(struct timespec64 *ts)
-{
- unsigned char clk[STORE_CLOCK_EXT_SIZE];
- __u64 delta;
-
- delta = initial_leap_seconds + TOD_UNIX_EPOCH;
- memcpy(clk, tod_clock_base, 16);
- *(__u64 *) &clk[1] -= delta;
- if (*(__u64 *) &clk[1] > delta)
- clk[0]--;
- ext_to_timespec64(clk, ts);
-}
-
static u64 read_tod_clock(struct clocksource *cs)
{
unsigned long long now, adj;
--
2.17.1
On 06/21/2018 02:25 PM, Pavel Tatashin wrote:
> It supposed to be safe to modify static branches after jump_label_init().
> But, because static key modifying code eventually calls text_poke() we
> may end up with accessing struct page that have not been initialized.
>
> Here is how to quickly reproduce the problem. Insert code like this
> into init/main.c:
>
> | +static DEFINE_STATIC_KEY_FALSE(__test);
> | asmlinkage __visible void __init start_kernel(void)
> | {
> | char *command_line;
> |@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
> | vfs_caches_init_early();
> | sort_main_extable();
> | trap_init();
> |+ {
> |+ static_branch_enable(&__test);
> |+ WARN_ON(!static_branch_likely(&__test));
> |+ }
> | mm_init();
>
> The following warnings show-up:
> WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc1_pt_t1 #30
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.11.0-20171110_100015-anatol 04/01/2014
> RIP: 0010:text_poke+0x20d/0x230
> Code: 0f 0b 4c 89 e2 4c 89 ee 4c 89 f7 e8 7d 4b 9b 00 31 d2 31 f6 bf 86 02
> 00 00 48 8b 05 95 8e 24 01 e8 78 18 d8 00 e9 55 ff ff ff <0f> 0b e9 54 fe
> ff ff 48 8b 05 75 a8 38 01 e9 64 fe ff ff 48 8b 1d
> RSP: 0000:ffffffff94e03e30 EFLAGS: 00010046
> RAX: 0100000000000000 RBX: fffff7b2c011f300 RCX: ffffffff94fcccf4
> RDX: 0000000000000001 RSI: ffffffff94e03e77 RDI: ffffffff94fcccef
> RBP: ffffffff94fcccef R08: 00000000fffffe00 R09: 00000000000000a0
> R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000001
> R13: ffffffff94e03e77 R14: ffffffff94fcdcef R15: fffff7b2c0000000
> FS: 0000000000000000(0000) GS:ffff9adc87c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff9adc8499d000 CR3: 000000000460a001 CR4: 00000000000606b0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> ? start_kernel+0x23e/0x4c8
> ? start_kernel+0x23f/0x4c8
> ? text_poke_bp+0x50/0xda
> ? arch_jump_label_transform+0x89/0xe0
> ? __jump_label_update+0x78/0xb0
> ? static_key_enable_cpuslocked+0x4d/0x80
> ? static_key_enable+0x11/0x20
> ? start_kernel+0x23e/0x4c8
> ? secondary_startup_64+0xa5/0xb0
> ---[ end trace abdc99c031b8a90a ]---
>
> If the code above is moved after mm_init(), no warning is shown, as struct
> pages are initialized during handover from memblock.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> arch/x86/include/asm/text-patching.h | 1 +
> arch/x86/kernel/alternative.c | 10 +++++++++-
> 2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index 2ecd34e2d46c..e85ff65c43c3 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -37,5 +37,6 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern int after_bootmem;
>
> #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index a481763a3776..0230dbc3c599 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -686,13 +686,21 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
> *
> * Note: Must be called under text_mutex.
> */
> -void *text_poke(void *addr, const void *opcode, size_t len)
> +void __ref *text_poke(void *addr, const void *opcode, size_t len)
> {
> unsigned long flags;
> char *vaddr;
> struct page *pages[2];
> int i;
>
> + /* While boot memory allocator is runnig we cannot use struct
coding style:
/*
* While boot memory ....
But more importantly, does this patch need to be backported for stable?
> + * pages as they are not yet initialized. However, we also know
> + * that this is early in boot, and it is safe to fallback to
> + * text_poke_early.
> + */
> + if (unlikely(!after_bootmem))
> + return text_poke_early(addr, opcode, len);
> +
> if (!core_kernel_text((unsigned long)addr)) {
> pages[0] = vmalloc_to_page(addr);
> pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
>
thanks,
--
~Randy
On Thu, Jun 21, 2018 at 05:25:09PM -0400, Pavel Tatashin wrote:
> static branching is useful to hot-patch branches that are used in hot
> path, but are infrequently changed.
>
> x86 clock framework is one example that uses static branches to setup
> the best clock during boot and never change it again.
>
> Since we plan to enable clock early, we need static branching
> functionality early as well.
>
> static branching requires patching nop instructions, thus, we need
> arch_init_ideal_nops() to be called prior to jump_label_init()
>
> Here we do all the necessary steps to call arch_init_ideal_nops
> after early_cpu_init().
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> Suggested-by: Peter Zijlstra <[email protected]>
> ---
> arch/x86/kernel/cpu/amd.c | 13 +++++++-----
> arch/x86/kernel/cpu/common.c | 38 +++++++++++++++++++-----------------
> arch/x86/kernel/setup.c | 4 ++--
> 3 files changed, 30 insertions(+), 25 deletions(-)
Looks ok to me.
Reviewed-by: Borislav Petkov <[email protected]>
Also, please take the patch below into your queue and keep it a separate
patch in case we have to revert it later. It should help in keeping the
mess manageable and not let it go completely out of control before we've
done the cleanup.
Thx.
---
From: Borislav Petkov <[email protected]>
Date: Sat, 23 Jun 2018 11:04:47 +0200
Subject: [PATCH] x86/CPU: Call detect_nopl() only on the BSP
Make it use the setup_* variants and have it be called only on the BSP
and drop the call in generic_identify() - X86_FEATURE_NOPL will be
replicated to the APs through the forced caps. Helps keep the mess at a
manageable level.
Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/cpu/common.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 952d31f75821..1b5edbd8f6db 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1021,12 +1021,12 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
* unless we can find a reliable way to detect all the broken cases.
* Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
*/
-static void detect_nopl(struct cpuinfo_x86 *c)
+static void detect_nopl(void)
{
#ifdef CONFIG_X86_32
- clear_cpu_cap(c, X86_FEATURE_NOPL);
+ setup_clear_cpu_cap(X86_FEATURE_NOPL);
#else
- set_cpu_cap(c, X86_FEATURE_NOPL);
+ setup_force_cpu_cap(X86_FEATURE_NOPL);
#endif
}
@@ -1105,7 +1105,7 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
if (!pgtable_l5_enabled())
setup_clear_cpu_cap(X86_FEATURE_LA57);
- detect_nopl(c);
+ detect_nopl();
}
void __init early_cpu_init(void)
@@ -1203,8 +1203,6 @@ static void generic_identify(struct cpuinfo_x86 *c)
get_model_name(c); /* Default name */
- detect_nopl(c);
-
detect_null_seg_behavior(c);
/*
--
2.17.0.582.gccdcbd54c
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
Hi Borislav,
>
> Reviewed-by: Borislav Petkov <[email protected]>
Thank you.
>
> Also, please take the patch below into your queue and keep it a separate
> patch in case we have to revert it later. It should help in keeping the
> mess manageable and not let it go completely out of control before we've
> done the cleanup.
OK. I will include it in the next iteration, once I get some more
feedback from Thomas and Peter.
Pavel
On Thu, 21 Jun 2018, Pavel Tatashin wrote:
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 5b2300b818af..c65c232d3ddd 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -628,6 +628,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
> .name = "KVM",
> .detect = kvm_detect,
> .type = X86_HYPER_KVM,
> + .init.init_platform = kvmclock_init,
Please rename that function to kvm_platform_init().
Other than that:
Reviewed-by: Thomas Gleixner <[email protected]>
On Thu, 21 Jun 2018, Pavel Tatashin wrote:
> Currently, notsc kernel parameter disables the use of tsc register by
> sched_clock(). However, this parameter does not prevent linux from
> accessing tsc in other places in kernel.
>
> The only rational to boot with notsc is to avoid timing discrepancies on
> multi-socket systems where different tsc frequencies may present, and thus
> fallback to jiffies for clock source.
>
> However, there is another method to solve the above problem, it is to boot
> with tsc=unstable parameter. This parameter allows sched_clock() to use tsc
> but in case tsc is outside of expected interval it is corrected back to a
> sane value.
>
> This is why there is no reason to keep notsc, and it can be removed. But,
> for compatibility reasons we will keep this parameter but change its
> definition to be the same as tsc=unstable.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> Reviewed-by: Dou Liyang <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
On Thu, 21 Jun 2018, Pavel Tatashin wrote:
> If architecture does not support exact boot time, it is challenging to
> estimate boot time without having a reference to the current persistent
> clock value. Yet, we cannot read the persistent clock time again, because
> this may lead to math discrepancies with the caller of read_boot_clock64()
> who have read the persistent clock at a different time.
>
> This is why it is better to provide two values simultaneously: the
> persistent clock value, and the boot time.
>
> Thus, we replace read_boot_clock64() with:
> read_persistent_wall_and_boot_offset(wall_time, boot_offset)
>
> Where wall_time is returned by read_persistent_clock()
> And boot_offset is wall_time - boot time
>
> We calculate boot_offset using the current value of local_clock() so
> architectures, that do not have a dedicated boot_clock but have early
> sched_clock(), such as SPARCv9, x86, and possibly more will benefit from
> this change by getting a better and more consistent estimate of the boot
> time without need for an arch specific implementation.
Again. You are doing 5 things in one patch.
> /**
> - * read_boot_clock64 - Return time of the system start.
> + * read_persistent_wall_and_boot_offset - Read persistent clock, and also offset
> + * from the boot.
> *
> * Weak dummy function for arches that do not yet support it.
> - * Function to read the exact time the system has been started.
> - * Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
> - *
> - * XXX - Do be sure to remove it once all arches implement it.
> + * wall_time - current time as returned by persistent clock
> + * boot_offset - offset that is defined as wall_time - boot_time
> + * The default function calculates offset based on the current value of
> + * local_clock(). This way architectures that support sched_clock() but don't
> + * support dedicated boot time clock will provide the best estimate of the
> + * boot time.
> */
> -void __weak read_boot_clock64(struct timespec64 *ts)
> +void __weak __init
> +read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
> + struct timespec64 *boot_offset)
> {
> - ts->tv_sec = 0;
> - ts->tv_nsec = 0;
> + read_persistent_clock64(wall_time);
> + *boot_offset = ns_to_timespec64(local_clock());
Introduce the default function with
*boot_offset = 0;
And then in a follow up patch, add the local_clock() magic to it. This 'oh
lets do all in one just because we can' is just wrong. It's harder to
review and its bad for bisection because you cannot differentiate whether
the wreckage comes from the timekeeping_init() conversion or the magic new
functionality in read_persistent_wall_and_boot_offset().
That said, I like the local_clock() idea itself, but it might wreckage some
architecture(s), so we need to watchout for that.
> /* Flag for if timekeeping_resume() has injected sleeptime */
> @@ -1521,28 +1527,28 @@ static bool persistent_clock_exists;
> */
> void __init timekeeping_init(void)
> {
> + struct timespec64 wall_time, boot_offset, wall_to_mono;
> struct timekeeper *tk = &tk_core.timekeeper;
> struct clocksource *clock;
> unsigned long flags;
> - struct timespec64 now, boot, tmp;
> -
> - read_persistent_clock64(&now);
> - if (!timespec64_valid_strict(&now)) {
> - pr_warn("WARNING: Persistent clock returned invalid value!\n"
> - " Check your CMOS/BIOS settings.\n");
> - now.tv_sec = 0;
> - now.tv_nsec = 0;
> - } else if (now.tv_sec || now.tv_nsec)
> - persistent_clock_exists = true;
>
> - read_boot_clock64(&boot);
> - if (!timespec64_valid_strict(&boot)) {
> - pr_warn("WARNING: Boot clock returned invalid value!\n"
> - " Check your CMOS/BIOS settings.\n");
> - boot.tv_sec = 0;
> - boot.tv_nsec = 0;
> + read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
> + if (timespec64_valid_strict(&wall_time) &&
> + timespec64_to_ns(&wall_time)) {
Can you please have an explicit ts_to_ns(wall) > 0 there? I know it's
implied in timespec64_valid_strict(), but it makes this code more obvious.
> + persistent_clock_exists = true;
> + } else {
> + pr_warn("Persistent clock returned invalid value");
> + wall_time = (struct timespec64){0};
> }
>
> + if (timespec64_compare(&wall_time, &boot_offset) < 0)
> + boot_offset = (struct timespec64){0};
> +
> + /* We want set wall_to_mono, so the following is true:
See: https://lkml.kernel.org/r/CA+55aFyQYJerovMsSoSKS7PessZBr4vNp-3QUUwhqk4A4_jcbg@mail.gmail.com
> + * wall time + wall_to_mono = boot time
> + */
> + wall_to_mono = timespec64_sub(boot_offset, wall_time);
> +
> raw_spin_lock_irqsave(&timekeeper_lock, flags);
Thanks,
tglx
On Thu, 21 Jun 2018, Pavel Tatashin wrote:
> read_boot_clock64() is deleted, and replaced with
> read_persistent_wall_and_boot_offset().
>
> The default implementation of read_persistent_wall_and_boot_offset()
> provides a better fallback than the current stubs for read_boot_clock64()
> that arm has, so remove the old code.
You forgot to say, that this function has no users, so the removal is not
changing anything,
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> arch/arm/kernel/time.c | 12 +-----------
> 1 file changed, 1 insertion(+), 11 deletions(-)
>
> diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c
> index cf2701cb0de8..0a6a457b13c7 100644
> --- a/arch/arm/kernel/time.c
> +++ b/arch/arm/kernel/time.c
> @@ -83,29 +83,19 @@ static void dummy_clock_access(struct timespec64 *ts)
> }
>
> static clock_access_fn __read_persistent_clock = dummy_clock_access;
> -static clock_access_fn __read_boot_clock = dummy_clock_access;
>
> void read_persistent_clock64(struct timespec64 *ts)
> {
> __read_persistent_clock(ts);
> }
>
> -void read_boot_clock64(struct timespec64 *ts)
> -{
> - __read_boot_clock(ts);
> -}
> -
> int __init register_persistent_clock(clock_access_fn read_boot,
> clock_access_fn read_persistent)
> {
> /* Only allow the clockaccess functions to be registered once */
> - if (__read_persistent_clock == dummy_clock_access &&
> - __read_boot_clock == dummy_clock_access) {
> - if (read_boot)
> - __read_boot_clock = read_boot;
That leaves the fn_read_boot argument unused. So this wants to be cleaned
up as well.
Thanks,
tglx
On Thu, 21 Jun 2018, Pavel Tatashin wrote:
> We will change sched_clock() to be called early.
Why is this relevant? Does the issue only appear with that change?
> But, during boot sched_clock() changes its output without notifying us
> about the change of clock source.
Why would the sched clock change notify _US_?
Can you please try to write your changelog in factual technical terms
without impersonating the system/code?
> This happens in tsc_init(), when static_branch_enable(&__use_tsc) is
> called.
>
> native_sched_clock() changes from outputing jiffies to reading tsc, but
> sched is not notified in anyway. So, to preserve the continoutity in this
sched? -EMAKESNOSENSE
There is no notification mechanism and it's not required.
> place we add the offset of sched_clock() to the calculation of cyc2ns.
s/we//
See Documentation/process/submitting-patches.rst
Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy to
do frotz", as if you are giving orders to the codebase to change its
behaviour.
> Without this change, the output would look like this:
Would? It looks exactly like this or why would you try to fix it?
>
> [ 0.004000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 0.009000] tsc: Fast TSC calibration using PIT
> [ 0.010000] tsc: Detected 3192.137 MHz processor
> [ 0.011000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e03465ceb2, max_idle_ns: 440795259855 ns
>
> static_branch_enable(__use_tsc) is called, and timestamps became precise
> but reduced:
>
> [ 0.002233] Calibrating delay loop (skipped), value calculated using timer frequency.. 6384.27 BogoMIPS (lpj=3192137)
> [ 0.002516] pid_max: default: 32768 minimum: 301
Let me give you an example:
When tsc_init() enables the usage of TSC for sched_clock() the
initialization of the tsc sched clock conversion data starts from zero
and not from the current jiffies based sched_clock() value. This makes
the timestamps jump backwards:
[ 0.010000] tsc: Detected 3192.137 MHz processor
[ 0.011000] clocksource: tsc-early: mask: ...
[ 0.002233] Calibrating delay loop (skipped), ....
To address this, extend set_cyc2ns_scale() with an argument to base the
cyc2ns offset on the current sched_clock() value. During run time this
offset is 0 as the cyc2ns offset is based on the TSC sched_clock()
itself.
See? Precise and pure technical. No we/us/would/ and no irrelevant
information.
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> arch/x86/kernel/tsc.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 186395041725..654a01cc0358 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -133,7 +133,9 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
> return ns;
> }
>
> -static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
> +static void set_cyc2ns_scale(unsigned long khz, int cpu,
> + unsigned long long tsc_now,
> + unsigned long long sched_now)
sched_now is not a really good name for this as it's only used at
initialization time. So the argument name should reflect this, otherwise you
wonder yourself when looking at that code 6 months from now why it's 0 on
all the run time call sites. init_offset or some other sensible name which
makes the purpose entirely clear.
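To spell it out, the end result would read something like this (fragment
only, to illustrate the naming):

static void set_cyc2ns_scale(unsigned long khz, int cpu,
			     unsigned long long tsc_now,
			     unsigned long long init_offset)

with the run time callers passing 0 and tsc_init() passing the sampled
local_clock() value.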
> void __init tsc_init(void)
> {
> - u64 lpj, cyc;
> + u64 lpj, cyc, sch;
sch? what's wrong with sched_now or now? It's not that there is a 3 letter
limit.
> int cpu;
>
> if (!boot_cpu_has(X86_FEATURE_TSC)) {
> @@ -1403,9 +1405,10 @@ void __init tsc_init(void)
> * up if their speed diverges)
> */
> cyc = rdtsc();
> + sch = local_clock();
> for_each_possible_cpu(cpu) {
> cyc2ns_init(cpu);
> - set_cyc2ns_scale(tsc_khz, cpu, cyc);
> + set_cyc2ns_scale(tsc_khz, cpu, cyc, sch);
> }
Thanks,
tglx
On Thu, 21 Jun 2018, Pavel Tatashin wrote:
> /*
> * Scheduler clock - returns current time in nanosec units.
> */
> @@ -1354,6 +1364,7 @@ void __init tsc_early_delay_calibrate(void)
> lpj = tsc_khz * 1000;
> do_div(lpj, HZ);
> loops_per_jiffy = lpj;
> + sched_clock_early_init(tsc_khz);
> }
>
> void __init tsc_init(void)
> @@ -1382,6 +1393,7 @@ void __init tsc_init(void)
> if (!tsc_khz) {
> mark_tsc_unstable("could not calculate TSC khz");
> setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
> + static_branch_disable(&__use_tsc);
This wants a proper comment.
> return;
> }
Thanks,
tglx
> Let me give you an example:
>
> When tsc_init() enables the usage of TSC for sched_clock() the
> initialization of the tsc sched clock conversion data starts from zero
> and not from the current jiffies based sched_clock() value. This makes
> the timestamps jump backwards:
>
> [ 0.010000] tsc: Detected 3192.137 MHz processor
> [ 0.011000] clocksource: tsc-early: mask: ...
> [ 0.002233] Calibrating delay loop (skipped), ....
>
> To address this, extend set_cyc2ns_scale() with an argument to base the
> cyc2ns offset on the current sched_clock() value. During run time this
> offset is 0 as the cyc2ns offset is based on the TSC sched_clock()
> itself.
>
> See? Precise and pure technical. No we/us/would/ and no irrelevant
> information.
Yes, thank you Thomas. I will update the changelog based on your suggestions, and I will no longer impersonate the system in my commit messages.
>
>> Signed-off-by: Pavel Tatashin <[email protected]>
>> ---
>> arch/x86/kernel/tsc.c | 15 +++++++++------
>> 1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>> index 186395041725..654a01cc0358 100644
>> --- a/arch/x86/kernel/tsc.c
>> +++ b/arch/x86/kernel/tsc.c
>> @@ -133,7 +133,9 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
>> return ns;
>> }
>>
>> -static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
>> +static void set_cyc2ns_scale(unsigned long khz, int cpu,
>> + unsigned long long tsc_now,
>> + unsigned long long sched_now)
>
> sched_now is not a real good name for this as it's only used at
> initialization time. So the argument name should reflect this otherwise you
> wonder yourself when looking at that code 6 month from now, why it's 0 on
> all the run time call sites. init_offset or some other sensible name which
> makes the purpose entirely clear.
>
>> void __init tsc_init(void)
>> {
>> - u64 lpj, cyc;
>> + u64 lpj, cyc, sch;
>
> sch? what's wrong with sched_now or now? It's not that there is a 3 letter
> limit.
Sometimes I get caught up in following the local style too much:
void __init tsc_init(void)
{
u64 lpj, cyc;
int cpu;
Hm, all the above are 3-letter variables, let's add another 3-letter one :)
I will change it to init_offset as you suggested above for set_cyc2ns_scale().
Also, I will address all the other comments that you provided in this series.
Thank you,
Pavel
On Sat, 23 Jun 2018, Thomas Gleixner wrote:
> On Thu, 21 Jun 2018, Pavel Tatashin wrote:
> > We will change sched_clock() to be called early.
>
> Why is this relevant? Does the issue only appear with that change?
So you forgot to answer this question. I did not find a system yet, which
actually exposes this behaviour on mainline.
Is this an artifact of your early sched clock thing?
Thanks,
tglx
> So you forgot to answer this question. I did not find a system yet, which
> actually exposes this behaviour on mainline.
>
> Is this an artifact of your early sched clock thing?
>
Yes, it is. Let me explain how it happens.
So, the problem is introduced in patch "sched: early boot clock" by this change:
- if (unlikely(!sched_clock_running))
- return 0ull;
+ /* Use early clock until sched_clock_init_late() */
+ if (unlikely(sched_clock_running < 2))
+ return sched_clock() + __sched_clock_offset;
As soon as sched_clock() starts outputting non-zero values, we start
outputting time without correcting the output as is done in
sched_clock_local(), where unstable TSC and backward motion are
detected. But since interrupts are disabled early in boot, we cannot
really correct invalid time, and therefore must rely on sched_clock()
to provide us with a contiguous and sane time. In earlier versions of
this project, I solved this problem by adjusting
__sched_clock_offset every time sched_clock()'s continuity changed,
using a callback function into sched/clock.c. But Peter was
against that approach.
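Just to illustrate (this is a made-up sketch, not the code that was posted):
the idea was a callback along these lines, invoked whenever the raw clock
re-bases, so that the composed value stays continuous:

/* hypothetical sketch of the rejected callback approach */
static void sched_clock_rebase(u64 old_now, u64 new_now)
{
	/* keep sched_clock() + __sched_clock_offset continuous across the switch */
	__sched_clock_offset += old_now - new_now;
}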
On 06/23/2018 12:56 PM, Thomas Gleixner wrote:
> On Thu, 21 Jun 2018, Pavel Tatashin wrote:
>> /*
>> * Scheduler clock - returns current time in nanosec units.
>> */
>> @@ -1354,6 +1364,7 @@ void __init tsc_early_delay_calibrate(void)
>> lpj = tsc_khz * 1000;
>> do_div(lpj, HZ);
>> loops_per_jiffy = lpj;
>> + sched_clock_early_init(tsc_khz);
>> }
>>
>> void __init tsc_init(void)
>> @@ -1382,6 +1393,7 @@ void __init tsc_init(void)
>> if (!tsc_khz) {
>> mark_tsc_unstable("could not calculate TSC khz");
>> setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
>> + static_branch_disable(&__use_tsc);
>
> This wants a proper comment.
Yes, I will add a comment. Basically, this change is for a rare scenario where we were able to get the TSC frequency early in boot but failed to do so later, when we tried to calculate it more precisely. I am not sure whether that is actually possible, but if it is, we run into the same continuity problem as the one solved in "x86/tsc: prepare for early sched_clock". So I guess we would have to modify either jiffies_64 or __sched_clock_offset if we want to take care of this corner case.
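Something along these lines, perhaps (just a sketch of the comment and its
placement, wording to be finalized):

	if (!tsc_khz) {
		mark_tsc_unstable("could not calculate TSC khz");
		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
		/*
		 * The early calibration delivered a TSC frequency, but the
		 * precise calibration here failed. Stop using TSC for
		 * sched_clock() so the timestamps do not jump.
		 */
		static_branch_disable(&__use_tsc);
		return;
	}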
Thank you,
Pavel
On Sat, 23 Jun 2018, Pavel Tatashin wrote:
> > So you forgot to answer this question. I did not find a system yet, which
> > actually exposes this behaviour on mainline.
> >
> > Is this an artifact of your early sched clock thing?
> >
>
> Yes, it is. Let me explain how it happens.
>
> So, the problem is introduced in patch "sched: early boot clock" by this change:
>
> - if (unlikely(!sched_clock_running))
> - return 0ull;
> + /* Use early clock until sched_clock_init_late() */
> + if (unlikely(sched_clock_running < 2))
> + return sched_clock() + __sched_clock_offset;
>
> As soon as sched_clock() starts output non-zero values, we start
> output time without correcting the output as it is done in
> sched_clock_local() where unstable TSC and backward motion are
> detected. But, since early in boot interrupts are disabled, we cannot
> really correct invalid time, and therefore must rely on sched_clock()
> to provide us with a contiguous and sane time.
In early boot the TSC frequency is not changing at all so the whole
unstable thing is completely irrelevant. And that also applies to backward
motion. There is no such thing during early boot.
> In earlier versions of this project, I was solving this problem by
> adjusting __sched_clock_offset every time sched_clock()'s continuity was
> changed by using a callback function into sched/clock.c. But, Peter was
> against that approach.
So your changelog is completely misleading and utterly wrong. What the heck
has this to do with jiffies and the use_tsc jump label? Exactly nothing.
> [ 0.004000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 0.009000] tsc: Fast TSC calibration using PIT
> [ 0.010000] tsc: Detected 3192.137 MHz processor
> [ 0.011000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e03465ceb2, max_idle_ns: 440795259855 ns
> static_branch_enable(__use_tsc) is called, and timestamps became precise
> but reduced:
> [ 0.002233] Calibrating delay loop (skipped), value calculated using timer frequency.. 6384.27 BogoMIPS (lpj=3192137)
This is crap, because this is not what the current implementation does. The
current implementation enables the static branch when the early TSC sched
clock is enabled. So even the timestamps are crap because with the early
sched clock the time stamps are precise _before_ the timer interrupt is
initialized. That's the whole purpose of the exercise, right?
This made me assume that it's an existing problem. Heck, changelogs are
about facts and not fairy tales. And it's completely uninteresting that
you observed this problem with some random variant of your patches. What's
interesting is the factual problem which makes the change necessary.
So the problem is not the transition from jiffies to early TSC because at
the point where you enable the early TSC sched clock the jiffies sched
clock value is exactly ZERO simply because the timer interrupt is not
running yet.
The problem happens when you switch from the early TSC thing to the real
one in tsc_init(). And that happens because you reinitialize the cyc2ns
data of the boot CPU which was already initialized. That sets the offset to
the current TSC value and voila time goes backwards.
This whole nonsense can be completely avoided.
If you look at the existing tsc_init() then you'll notice that the loop
which initializes cyc2ns data for each possible cpu is bonkers. It does the
same stupid thing over and over for every possible CPU, i.e.
- Set already zeroed data to zero
- Calculate the cyc2ns conversion values from the same input
- Invoke sched_clock_idle_sleep/wakeup_event() which executes the
same code over and over on the boot cpu and the boot cpu sched
clock data.
That's pointless and wants to be fixed in a preparatory step.
static void __init cyc2ns_init(void)
{
unsigned int cpu, this_cpu = smp_processor_id();
struct cyc2ns_data data;
struct cyc2ns *c2n;
c2n = this_cpu_ptr(&cyc2ns);
seqcount_init(&c2n->seq);
set_cyc2ns_scale(tsc_khz, this_cpu, rdtsc());
data = c2n->data[0];
for_each_possible_cpu(cpu) {
if (cpu == this_cpu)
continue;
seqcount_init(&c2n->seq);
c2n = per_cpu_ptr(&cyc2ns, cpu);
c2n->data[0] = c2n->data[1] = data;
}
}
This is safe and correct because the whole latch mechanics are completely
irrelevant for the non-boot CPUs as they are far away from being up. That
replaces the loop in tsc_init() with a single call to cyc2ns_init().
The next bogosity is the whole calibration logic.
void __init tsc_early_delay_calibrate(void)
{
cpu_khz = x86_platform.calibrate_cpu();
tsc_khz = x86_platform.calibrate_tsc();
tsc_khz = tsc_khz ? : cpu_khz;
if (!tsc_khz)
return;
lpj = tsc_khz * 1000;
do_div(lpj, HZ);
loops_per_jiffy = lpj;
}
and
void __init tsc_init(void)
{
cpu_khz = x86_platform.calibrate_cpu();
tsc_khz = x86_platform.calibrate_tsc();
/*
* Trust non-zero tsc_khz as authorative,
* and use it to sanity check cpu_khz,
* which will be off if system timer is off.
*/
if (tsc_khz == 0)
tsc_khz = cpu_khz;
else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
cpu_khz = tsc_khz;
if (!tsc_khz) {
.....
So this is exactly the same code sequence just that tsc_init() has the
extra logic of sanitizing cpu_khz for some bogus cpuid/msr values.
So we have yet another completely pointless exercise in tsc_init(). It just
can go away and the tsc_khz value will not be more precise in the second
invocation of that stuff than in the first. Ergo the extra cpu_khz
sanitizing can be moved to early calibrate and the duplicate nonsense in
tsc_init() can be discarded. tsc_init() then needs only to check for
tsc_khz == 0 and be done with it.
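IOW, something in this direction (just a sketch, details to be sorted out in
the actual patch):

void __init tsc_init(void)
{
	/*
	 * tsc_khz was already established and cpu_khz sanitized in the
	 * early calibration. Nothing to recalibrate here.
	 */
	if (!tsc_khz) {
		mark_tsc_unstable("could not calculate TSC khz");
		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
		return;
	}
	....
}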
Now back to your early sched clock thing. It does in the init code:
tsc_sched_clock_init()
{
c2n = this_cpu_ptr(&cyc2ns);
seqcount_init(&c2n->seq);
__set_cyc2ns_scale(tsc_khz, cpu, rdtsc());
}
i.e it avoids the sched_clock_idle_sleep/wakeup_event() calls because they
make no sense, but it initializes cyc2ns for the boot CPU with the correct
values.
And this early init sequence also needs to pull over the tsc adjust
magic. So tsc_early_delay_calibrate() which should btw. be renamed to
tsc_early_init() should have:
{
cpu_khz = x86_platform.calibrate_cpu();
tsc_khz = x86_platform.calibrate_tsc();
tsc_khz = tsc_khz ? : cpu_khz;
if (!tsc_khz)
return;
/* Sanitize TSC ADJUST before cyc2ns gets initialized */
tsc_store_and_check_tsc_adjust(true);
calc_lpj(tsc_khz);
tsc_sched_clock_init();
}
After that there is absolutely _ZERO_ reason to reinitialize the cyc2ns
value for the boot CPU which also completely removes the requirement for
this 'adjust offset' change. It's just going to work.
The only change you have to make is:
static void __init cyc2ns_init(void)
{
unsigned int cpu, this_cpu = smp_processor_id();
struct cyc2ns_data data;
struct cyc2ns *c2n;
c2n = this_cpu_ptr(&cyc2ns);
- seqcount_init(&c2n->seq);
- set_cyc2ns_scale(tsc_khz, cpu, rdtsc());
data = c2n->data[0];
for_each_possible_cpu(cpu) {
if (cpu == this_cpu)
continue;
seqcount_init(&c2n->seq);
c2n = per_cpu_ptr(&cyc2ns, cpu);
c2n->data[0] = c2n->data[1] = data;
}
}
<rant>
TBH. I'm utterly disappointed how all this was approached.
I absolutely detest the approach of duct taping something new into existing
code without a proper analysis of the existing infrastructure in the first
place. This is just an utter waste of time. I don't care about your wasted
time at all, but I care about the fact that I had to sit down and do the
analysis myself, and about the time wasted in reviewing half-baked
stuff. Sure, I was able to do the analysis rather fast because I'm familiar
with the code, but I still had to sit down and figure out all the
details. You might have needed a week for that, but that would have saved
you several weeks of tinkering and the frustration of getting your patches
rejected over and over.
The mere fact that dmesg has this:
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.020000] tsc: Fast TSC calibration using PIT
should have made you look into exactly what I was looking at just now. It's
really not rocket science to figure out that both calibrations do
absolutely the same thing and this is even more hilarious as you are doing
this to do boot time analysis in order to make the boot faster. Oh well.
</rant>
So I did your homework and expect as compensation a squeaky clean patch set
with proper changelogs. I surely might have missed a few details, but I'm
also sure you find and fix them if you avoid trying to repost the whole
thing tomorrow. Take your time and do it right and ask questions upfront if
anything is unclear instead of implementing hacks based on weird
assumptions.
That'll make up for both your and my frustration over this whole exercise.
Thanks,
tglx
Hi Thomas,
Thank you for your feedback. My comments below:
> > As soon as sched_clock() starts output non-zero values, we start
> > output time without correcting the output as it is done in
> > sched_clock_local() where unstable TSC and backward motion are
> > detected. But, since early in boot interrupts are disabled, we cannot
> > really correct invalid time, and therefore must rely on sched_clock()
> > to provide us with a contiguous and sane time.
>
> In early boot the TSC frequency is not changing at all so the whole
> unstable thing is completely irrelevant. And that also applies to backward
> motion. There is no such thing during early boot.
Yes, however, the frequency changes slightly during re-calibration. I see
it every time in my no-kvm qemu VM, and I think even on a physical machine.
Now, as you suggest below, I can remove the second calibration entirely and
keep only the early calibration; I am not sure what the historical reason
is for Linux having two of them in the first place. But I assumed the later
one was more precise because of some outside reason, such as a different
cpu p-state, or something else.
>
> > In earlier versions of this project, I was solving this problem by
> > adjusting __sched_clock_offset every time sched_clock()'s continuity was
> > changed by using a callback function into sched/clock.c. But, Peter was
> > against that approach.
>
> So your changelog is completely misleading and utterly wrong. What the heck
> has this to do with jiffies and the use_tsc jump label? Exactly nothing.
This output is what happens after "sched: early sched_clock" but before
"x86/tsc: use tsc early", hence the title of the patch: "x86/tsc:
prepare for early sched_clock".
So, before "x86/tsc: prepare for early sched_clock" we go from jiffies to
real tsc, and that's where the problem happens. I assume that,
theoretically, the same problem could happen if we cannot calibrate TSC
early but succeed later. This is, however, a moot point, as you suggest
removing the second calibration.
After thinking about this some more, in the future revision I will simply
switch the order so that "x86/tsc: use tsc early" goes before "sched:
early sched_clock", since the transitions jiffies -> tsc and tsc ->
jiffies won't be possible with the changes you suggest below.
>
> > [ 0.004000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> > [ 0.009000] tsc: Fast TSC calibration using PIT
> > [ 0.010000] tsc: Detected 3192.137 MHz processor
> > [ 0.011000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e03465ceb2, max_idle_ns: 440795259855 ns
>
> > static_branch_enable(__use_tsc) is called, and timestamps became precise
> > but reduced:
>
> > [ 0.002233] Calibrating delay loop (skipped), value calculated using timer frequency.. 6384.27 BogoMIPS (lpj=3192137)
>
> This is crap, because this is not what the current implementation does. The
> current implementation enables the static branch when the early TSC sched
> clock is enabled. So even the timestamps are crap because with the early
> sched clock the time stamps are precise _before_ the timer interrupt is
> initialized. That's the whole purpose of the exercise, right?
>
> This made me assume that its an existing problem. Heck, changelogs are
> about facts and not fairy tales. And it's completely non interesting that
> you observed this problem with some random variant of your patches. What's
> interesting is the factual problem which makes the change necessary.
The changelog was about a fact, as stated above: the output is from what
happens after "sched: early sched_clock" but before "x86/tsc: use tsc
early" (i.e. "x86/tsc: use tsc early" might be reverted or bisected).
>
> So the problem is not the transition from jiffies to early TSC because at
> the point where you enable the early TSC sched clock the jiffies sched
> clock value is exactly ZERO simply because the timer interrupt is not
> running yet.
>
> The problem happens when you switch from the early TSC thing to the real
> one in tsc_init(). And that happens because you reinitialize the cyc2ns
> data of the boot CPU which was already initialized. That sets the offset to
> the current TSC value and voila time goes backwards.
>
> This whole nonsense can be completely avoided.
>
> If you look at the existing tsc_init() then you'll notice that the loop
> which initializes cyc2ns data for each possible cpu is bonkers. It does the
> same stupid thing over and over for every possible CPU, i.e.
>
> - Set already zeroed data to zero
>
> - Calculate the cyc2ns conversion values from the same input
>
> - Invoke sched_clock_idle_sleep/wakeup_event() which executes the
> same code over and over on the boot cpu and the boot cpu sched
> clock data.
>
> That's pointless and wants to be fixed in a preparatory step.
I will change the code as you suggest below: I like calibrating only
in one place.
> <rant>
>
> TBH. I'm utterly disappointed how all this was approached.
>
> I absolutely detest the approach of duct taping something new into existing
> code without a proper analysis of the existing infrastructure in the first
> place.
For this particular patch, I politely asked for suggestions in the cover letter:
v10-v11
- I added one more patch:
"x86/tsc: prepare for early sched_clock" which fixes a problem
that I discovered while testing. I am not particularly happy with
the fix, as it adds a new argument that is used only in one
place, but if you have a suggestion for a different approach on
how to address this problem please let me know.
The confusion was that I should have been more clear about where exactly
the problem happens, which is before "x86/tsc: use tsc early" but after
"sched: early sched_clock".
> This is just utter waste of time. I don't care about your wasted
> time at all, but I care about the fact, that I had to sit down and do the
> analysis myself and about the time wasted in reviewing half baken
> stuff. Sure, I was able do the analysis rather fast because I'm familiar
> with the code, but I still had to sit down and figure out all the
> details. You might have needed a week for that, but that would have saved
> you several weeks of tinkering and the frustration of getting your patches
> rejected over and over.
>
> Alone the fact that dmesg has this:
>
> [ 0.000000] tsc: Fast TSC calibration using PIT
> [ 0.020000] tsc: Fast TSC calibration using PIT
I know that the TSC is calibrated the same way, but the frequency is
slightly different; I was not sure what caused the difference.
>
> should have made you look into exactly what I was looking at just now. It's
> really not rocket science to figure out that both calibrations do
> absolutely the same thing and this is even more hilarious as you are doing
> this to do boot time analysis in order to make the boot faster. Oh well.
>
> </rant>
>
> So I made your homework and expect as compensation a sqeaky clean patch set
> with proper changelogs. I surely might have missed a few details, but I'm
> also sure you find and fix them if you avoid trying to repost the whole
> thing tomorrow. Take your time and do it right and ask questions upfront if
> anything is unclear instead of implementing hacks based on weird
> assumptions.
Again, thank you for your help and review. I will address all the
comments and send out a new series when it is ready.
Pavel
On Sat, 23 Jun 2018, Pavel Tatashin wrote:
> > > As soon as sched_clock() starts output non-zero values, we start
> > > output time without correcting the output as it is done in
> > > sched_clock_local() where unstable TSC and backward motion are
> > > detected. But, since early in boot interrupts are disabled, we cannot
> > > really correct invalid time, and therefore must rely on sched_clock()
> > > to provide us with a contiguous and sane time.
> >
> > In early boot the TSC frequency is not changing at all so the whole
> > unstable thing is completely irrelevant. And that also applies to backward
> > motion. There is no such thing during early boot.
>
> Yes, however, the frequency changes slightly during re-calibration. I
> see it every time in my no-kvm qemu VM, and I think even on physical
> machine. Now, as you suggest below, I can remove the second
> calibration entirely, and keep only the early calibration, I am not
> sure what the historical reason to why linux has two of them in the
git log/blame exist for a reason and in case that does not work asking
around might give answers as well.
> first place. But, I assumed the later one was more precise because of
> some outside reasons, such as different cpu p-state, or something
> else.
cpu p->states are exactly the same. This is still early boot. Nothing has
fiddled with any of this.
> > > In earlier versions of this project, I was solving this problem by
> > > adjusting __sched_clock_offset every time sched_clock()'s continuity was
> > > changed by using a callback function into sched/clock.c. But, Peter was
> > > against that approach.
> >
> > So your changelog is completely misleading and utterly wrong. What the heck
> > has this to do with jiffies and the use_tsc jump label? Exactly nothing.
>
> This output is what happens after: "sched: early sched_clock" but
> before"x86/tsc: use tsc early", hence the title of patch: "x86/tsc:
> prepare for early sched_clock"
> So, before "x86/tsc: prepare for early sched_clock" we go from
> jiffies to real tsc, and thats where the problem happens. I assume
> theoretically, the same problem could happen if we can't calibrate TSC
> early, but succeed later. Now, this is, however, a moot point, as you
> suggest to remove the second calibration.
Please stop assuming things. Figure the root cause out, really.
> After thinking about this some more, in the future revision I will
> simply switch order for "x86/tsc: use tsc early" to go before "sched:
> early sched_clock", since transitions jiffies -> tsc and tsc ->
> jiffies won't be possible with the changes you suggest below.
>
> >
> > > [ 0.004000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> > > [ 0.009000] tsc: Fast TSC calibration using PIT
> > > [ 0.010000] tsc: Detected 3192.137 MHz processor
> > > [ 0.011000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e03465ceb2, max_idle_ns: 440795259855 ns
> >
> > > static_branch_enable(__use_tsc) is called, and timestamps became precise
> > > but reduced:
> >
> > > [ 0.002233] Calibrating delay loop (skipped), value calculated using timer frequency.. 6384.27 BogoMIPS (lpj=3192137)
> >
> > This is crap, because this is not what the current implementation does. The
> > current implementation enables the static branch when the early TSC sched
> > clock is enabled. So even the timestamps are crap because with the early
> > sched clock the time stamps are precise _before_ the timer interrupt is
> > initialized. That's the whole purpose of the exercise, right?
> >
> > This made me assume that its an existing problem. Heck, changelogs are
> > about facts and not fairy tales. And it's completely non interesting that
> > you observed this problem with some random variant of your patches. What's
> > interesting is the factual problem which makes the change necessary.
>
> The changelog was about a fact, as stated above: output is from what
> happens after "sched: early sched_clock" but before "x86/tsc: use tsc
> early", (i.e. "x86/tsc: use tsc early" might be reverted or
> bisected).
I can see the intention now, but this is exactly why precise wording and
problem description and root cause analysis matter. It just confused me
completely.
> For this particular patch, I politely asked for suggestions in cover letter:
>
> v10-v11
> - I added one more patch:
> "x86/tsc: prepare for early sched_clock" which fixes a problem
> that I discovered while testing. I am not particularly happy with
> the fix, as it adds a new argument that is used only in one
> place, but if you have a suggestion for a different approach on
> how to address this problem please let me know.
Fair enough. I missed that. It would have been more obvious to mark the
patch RFC and add exactly this into the changelog.
Thanks,
tglx
On Thu, 21 Jun 2018 17:25:12 -0400
Pavel Tatashin <[email protected]> wrote:
> read_persistent_wall_and_boot_offset() will replace read_boot_clock64()
> because on some architectures it is more convenient to read both sources
> as one may depend on the other. For s390, implementation is the same
> as read_boot_clock64() but also calling and returning value of
> read_persistent_clock64()
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> arch/s390/kernel/time.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
> index cf561160ea88..d1f5447d5687 100644
> --- a/arch/s390/kernel/time.c
> +++ b/arch/s390/kernel/time.c
> @@ -221,6 +221,24 @@ void read_persistent_clock64(struct timespec64 *ts)
> ext_to_timespec64(clk, ts);
> }
>
> +void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
> + struct timespec64 *boot_offset)
> +{
> + unsigned char clk[STORE_CLOCK_EXT_SIZE];
> + struct timespec64 boot_time;
> + __u64 delta;
> +
> + delta = initial_leap_seconds + TOD_UNIX_EPOCH;
> + memcpy(clk, tod_clock_base, STORE_CLOCK_EXT_SIZE);
> + *(__u64 *)&clk[1] -= delta;
> + if (*(__u64 *)&clk[1] > delta)
> + clk[0]--;
> + ext_to_timespec64(clk, &boot_time);
> +
> + read_persistent_clock64(wall_time);
> + *boot_offset = timespec64_sub(*wall_time, boot_time);
> +}
> +
> void read_boot_clock64(struct timespec64 *ts)
> {
> unsigned char clk[STORE_CLOCK_EXT_SIZE];
From a s390 standpoint this looks reasonable.
Reviewed-by: Martin Schwidefsky <[email protected]>
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Thu, Jun 21, 2018 at 05:25:08PM -0400, Pavel Tatashin wrote:
> ---
> arch/x86/include/asm/text-patching.h | 1 +
> arch/x86/kernel/alternative.c | 10 +++++++++-
> 2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index 2ecd34e2d46c..e85ff65c43c3 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -37,5 +37,6 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern int after_bootmem;
>
> #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index a481763a3776..0230dbc3c599 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -686,13 +686,21 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
> *
> * Note: Must be called under text_mutex.
> */
> -void *text_poke(void *addr, const void *opcode, size_t len)
> +void __ref *text_poke(void *addr, const void *opcode, size_t len)
> {
> unsigned long flags;
> char *vaddr;
> struct page *pages[2];
> int i;
>
> + /* While boot memory allocator is runnig we cannot use struct
Broken comment style..
> + * pages as they are not yet initialized. However, we also know
> + * that this is early in boot, and it is safe to fallback to
> + * text_poke_early.
> + */
> + if (unlikely(!after_bootmem))
> + return text_poke_early(addr, opcode, len);
I'm not entirely sure this is right.. Because not only do we need the
whole fixmap stuff working, we also need #DB and the IPI handlers set-up
and working.
Also, why do this here and not override @poker in
__jump_label_transform() ?
And I added a sync_core() in text_poke_early(), which I think we need
for this.
> if (!core_kernel_text((unsigned long)addr)) {
> pages[0] = vmalloc_to_page(addr);
> pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
> --
> 2.17.1
>
On Mon, 25 Jun 2018, Peter Zijlstra wrote:
> On Thu, Jun 21, 2018 at 05:25:08PM -0400, Pavel Tatashin wrote:
> > -void *text_poke(void *addr, const void *opcode, size_t len)
> > +void __ref *text_poke(void *addr, const void *opcode, size_t len)
> > {
> > unsigned long flags;
> > char *vaddr;
> > struct page *pages[2];
> > int i;
> >
> > + /* While boot memory allocator is runnig we cannot use struct
>
> Broken comment style..
>
> > + * pages as they are not yet initialized. However, we also know
> > + * that this is early in boot, and it is safe to fallback to
> > + * text_poke_early.
> > + */
> > + if (unlikely(!after_bootmem))
> > + return text_poke_early(addr, opcode, len);
>
> I'm not entirely sure this is right.. Because not only do we need the
> whole fixmap stuff working, we also need #DB and the IPI handlers set-up
> and working.
IPI? That's early UP boot why would you need an IPI?
Thanks,
tglx
On Thu, Jun 21, 2018 at 05:25:17PM -0400, Pavel Tatashin wrote:
> Allow sched_clock() to be used before schec_clock_init() and
> sched_clock_init_late() are called. This provides us with a way to get
> early boot timestamps on machines with unstable clocks.
There are !x86 architectures that use this code and might not expect to
have their sched_clock() called quite that early. Please verify.
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> kernel/sched/clock.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> index 10c83e73837a..f034392b0f6c 100644
> --- a/kernel/sched/clock.c
> +++ b/kernel/sched/clock.c
> @@ -205,6 +205,11 @@ void clear_sched_clock_stable(void)
> */
> static int __init sched_clock_init_late(void)
> {
> + /* Transition to unstable clock from early clock */
This is wrong... or at least it smells horribly.
This is not the point where we transition from early to unstable, that
is in fact in sched_clock_init.
This function, sched_clock_init_late(), is where we attempt to
transition from unstable to stable. And this is _waaaay_ after SMP init.
> + local_irq_disable();
> + __gtod_offset = sched_clock() + __sched_clock_offset - ktime_get_ns();
> + local_irq_enable();
This might work in sched_clock_init(), which is pre-SMP.
> sched_clock_running = 2;
> /*
> * Ensure that it is impossible to not do a static_key update.
> @@ -350,8 +355,9 @@ u64 sched_clock_cpu(int cpu)
> if (sched_clock_stable())
> return sched_clock() + __sched_clock_offset;
>
> - if (unlikely(!sched_clock_running))
> - return 0ull;
> + /* Use early clock until sched_clock_init_late() */
> + if (unlikely(sched_clock_running < 2))
> + return sched_clock() + __sched_clock_offset;
And then this remains !sched_clock_running, except instead of 0, you
then return sched_clock() + __sched_clock_offset;
> preempt_disable_notrace();
> scd = cpu_sdc(cpu);
On Mon, Jun 25, 2018 at 10:39:24AM +0200, Thomas Gleixner wrote:
> On Mon, 25 Jun 2018, Peter Zijlstra wrote:
> > On Thu, Jun 21, 2018 at 05:25:08PM -0400, Pavel Tatashin wrote:
> > > -void *text_poke(void *addr, const void *opcode, size_t len)
> > > +void __ref *text_poke(void *addr, const void *opcode, size_t len)
> > > {
> > > unsigned long flags;
> > > char *vaddr;
> > > struct page *pages[2];
> > > int i;
> > >
> > > + /* While boot memory allocator is runnig we cannot use struct
> >
> > Broken comment style..
> >
> > > + * pages as they are not yet initialized. However, we also know
> > > + * that this is early in boot, and it is safe to fallback to
> > > + * text_poke_early.
> > > + */
> > > + if (unlikely(!after_bootmem))
> > > + return text_poke_early(addr, opcode, len);
> >
> > I'm not entirely sure this is right.. Because not only do we need the
> > whole fixmap stuff working, we also need #DB and the IPI handlers set-up
> > and working.
>
> IPI? That's early UP boot why would you need an IPI?
Because the way this is called is from __jump_label_transform() ->
text_poke_bp() -> text_poke() -> text_poke_early().
And if you look at text_poke_bp(), you'll note it relies on #DB and
on_each_cpu() IPIs.
Which is of course exactly the reason I avoided text_poke_bp() entirely
in the first place.
On Mon, 25 Jun 2018, Peter Zijlstra wrote:
> On Mon, Jun 25, 2018 at 10:39:24AM +0200, Thomas Gleixner wrote:
> > On Mon, 25 Jun 2018, Peter Zijlstra wrote:
> > > I'm not entirely sure this is right.. Because not only do we need the
> > > whole fixmap stuff working, we also need #DB and the IPI handlers set-up
> > > and working.
> >
> > IPI? That's early UP boot why would you need an IPI?
>
> Because the way this is called is from __jump_label_transform() ->
> text_poke_bp() -> text_poke() -> text_poke_early().
>
> And if you look at text_poke_bp(), you'll note it relies on #DB and
> on_each_cpu() IPIs.
on_each_cpu() resolves to a direct call on the current CPU and as there is
no other CPU it does not matter. #DB might be a different story, haven't
looked yet.
> Which is of course exactly the reason I avoided text_poke_bp() entirely
> in the first place.
Fair enough.
Thanks,
tglx
On Mon, Jun 25, 2018 at 11:18:02AM +0200, Thomas Gleixner wrote:
> On Mon, 25 Jun 2018, Peter Zijlstra wrote:
> > On Mon, Jun 25, 2018 at 10:39:24AM +0200, Thomas Gleixner wrote:
> > > On Mon, 25 Jun 2018, Peter Zijlstra wrote:
> > > > I'm not entirely sure this is right.. Because not only do we need the
> > > > whole fixmap stuff working, we also need #DB and the IPI handlers set-up
> > > > and working.
> > >
> > > IPI? That's early UP boot why would you need an IPI?
> >
> > Because the way this is called is from __jump_label_transform() ->
> > text_poke_bp() -> text_poke() -> text_poke_early().
> >
> > And if you look at text_poke_bp(), you'll note it relies on #DB and
> > on_each_cpu() IPIs.
>
> on_each_cpu() resolves to a direct call on the current CPU and as there is
> no other CPU it does not matter. #DB might be a different story, haven't
> looked yet.
It _should_ all work.. but scary, who knows where this early stuff ends
up being used.
Hi Peter,
> It _should_ all work.. but scary, who knows where this early stuff ends
> up being used.
I have tested this patch and the following patch, which moves the
jump label init early, and it works as Thomas describes:
on_each_cpu() ends up calling only the current CPU.
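(Roughly speaking, with only the boot CPU online, on_each_cpu() boils down
to the following; this is a simplified view, not the literal kernel code:

	local_irq_save(flags);
	func(info);		/* do_sync_core() runs on the boot CPU */
	local_irq_restore(flags);

so no cross-CPU IPIs are involved.)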
Also, you mentioned:
> And I added a sync_core() in text_poke_early(), which I think we need
> for this.
text_poke_bp() ends up calling on_each_cpu(do_sync_core, NULL, 1),
which runs on the boot CPU, and thus sync_core() is called. If we keep
this patch we can remove the sync_core() change from the next patch.
However, the other way to fix this bug is to change:
arch/x86/kernel/jump_label.c
-void arch_jump_label_transform(struct jump_entry *entry,
+void __ref arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type)
{
+ void *(*poker)(void *, const void *, size_t) = NULL;
+
+ if (unlikely(!after_bootmem))
+ poker = text_poke_early;
+
mutex_lock(&text_mutex);
- __jump_label_transform(entry, type, NULL, 0);
+ __jump_label_transform(entry, type, poker, 0);
mutex_unlock(&text_mutex);
}
Also, modify text_poke_early to call sync_core().
Of course, this won't prevent some other code from calling text_poke()
during boot in the future, where uninitialized memory access is possible.
If text_poke() is called sufficiently early, it will fail because
virt_to_page() will fail, but there is an interval where virt_to_page()
succeeds (after paging_init()) but pages are not yet initialized (before
mem_init()). To safeguard against this we could add:
BUG_ON(!after_bootmem);
to text_poke().
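I.e. roughly this at the top of text_poke() (sketch only):

	/*
	 * struct pages are not initialized before mem_init(); catch any
	 * caller that sneaks in before that.
	 */
	BUG_ON(!after_bootmem);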
Thank you,
Pavel
On Mon, Jun 25, 2018 at 4:56 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Jun 21, 2018 at 05:25:17PM -0400, Pavel Tatashin wrote:
> > Allow sched_clock() to be used before schec_clock_init() and
> > sched_clock_init_late() are called. This provides us with a way to get
> > early boot timestamps on machines with unstable clocks.
>
> There are !x86 architectures that use this code and might not expect to
> have their sched_clock() called quite that early. Please verify.
Correct. From what I could gather so far, with this change the other
arches with unstable clocks either pick up and start outputting time
earlier or output 0. I could not test every specific configuration,
of course.
>
> > Signed-off-by: Pavel Tatashin <[email protected]>
> > ---
> > kernel/sched/clock.c | 10 ++++++++--
> > 1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> > index 10c83e73837a..f034392b0f6c 100644
> > --- a/kernel/sched/clock.c
> > +++ b/kernel/sched/clock.c
> > @@ -205,6 +205,11 @@ void clear_sched_clock_stable(void)
> > */
> > static int __init sched_clock_init_late(void)
> > {
> > + /* Transition to unstable clock from early clock */
>
> This is wrong... or at least it smells horribly.
>
> This is not the point where we transition from early to unstable, that
> is in fact in sched_clock_init.
>
> This function, sched_clock_init_late(), is where we attempt to
> transition from unstable to stable. And this is _waaaay_ after SMP init.
>
> > + local_irq_disable();
> > + __gtod_offset = sched_clock() + __sched_clock_offset - ktime_get_ns();
> > + local_irq_enable();
>
> This might work in sched_clock_init(), which is pre-SMP.
>
> > sched_clock_running = 2;
> > /*
> > * Ensure that it is impossible to not do a static_key update.
> > @@ -350,8 +355,9 @@ u64 sched_clock_cpu(int cpu)
> > if (sched_clock_stable())
> > return sched_clock() + __sched_clock_offset;
> >
> > - if (unlikely(!sched_clock_running))
> > - return 0ull;
> > + /* Use early clock until sched_clock_init_late() */
> > + if (unlikely(sched_clock_running < 2))
> > + return sched_clock() + __sched_clock_offset;
>
> And then this remains !sched_clock_running, except instead of 0, you
> then return sched_clock() + __sched_clock_offset;
>
> > preempt_disable_notrace();
> > scd = cpu_sdc(cpu);
>
All the above is because of:
> > + if (unlikely(sched_clock_running < 2))
> > + return sched_clock() + __sched_clock_offset;
which keeps the early boot clock until sched_clock_init_late(). This can
be changed to:
+ if (unlikely(!sched_clock_running))
+ return sched_clock() + __sched_clock_offset;
and all the necessary changes can be done in sched_clock_init().
The reason why I had "< 2" in the first place is that we did not have
early static branching before, and ended up with sched_clock() outputting
early TSC, then jiffies, then real TSC. The imprecise jiffies
timestamps during boot were confusing.
However, now with the early branching, and with the upcoming change that
Thomas suggested which guarantees that if the early TSC clock is enabled we
keep it forever, the "< 2" is not needed. I will update this patch
accordingly.
Thank you,
Pavel
On Mon, Jun 25, 2018 at 3:09 AM Martin Schwidefsky
<[email protected]> wrote:
>
> From a s390 standpoint this looks reasonable.
>
> Reviewed-by: Martin Schwidefsky <[email protected]>
>
Thank you Martin!
Pavel
On Mon, Jun 25, 2018 at 08:32:41AM -0400, Pavel Tatashin wrote:
> > It _should_ all work.. but scary, who knows where this early stuff ends
> > up being used.
>
> I have tested this patch
Sure,.. but call me paranoid.
> However, the other way to fix this bug is to change:
> arch/x86/kernel/jump_label.c
>
> -void arch_jump_label_transform(struct jump_entry *entry,
> +void __ref arch_jump_label_transform(struct jump_entry *entry,
> enum jump_label_type type)
> {
> + void *(*poker)(void *, const void *, size_t) = NULL;
> +
> + if (unlikely(!after_bootmem))
> + poker = text_poke_early;
> +
> mutex_lock(&text_mutex);
> - __jump_label_transform(entry, type, NULL, 0);
> + __jump_label_transform(entry, type, poker, 0);
> mutex_unlock(&text_mutex);
> }
No need to elide that mutex I think, by going through the normal
static_key_enable() stuff you already take a whole bunch of them, and
they should work early just fine (no contention etc.).
Also, I think the better condition is @early_boot_irqs_disabled: until
we enable IRQs for the first time, text_poke_early() should be fine. And
once we enable interrupts, all that other crud should really be working.
This gives:
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e56c95be2808..425ba6102828 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -46,6 +46,9 @@ static void __jump_label_transform(struct jump_entry *entry,
const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
+ if (early_boot_irqs_disabled)
+ poker = text_poke_early;
+
if (type == JUMP_LABEL_JMP) {
if (init) {
/*
> Also, modify text_poke_early to call sync_core().
and that.
> Also, I think the better condition is @early_boot_irqs_disabled, until
> we enable IRQs for the first time, text_poke_early() should be fine. And
> once we enable interrupts, all that other crud should really be working.
Sure, I will use the early_boot_irqs_disabled flag. I think we still want
to have BUG_ON(!after_bootmem); in text_poke(). I could do
BUG_ON(early_boot_irqs_disabled), but I am worried that there are call
sites that might be using text_poke() between mem_init() and
local_irq_enable() in start_kernel().
>
> This gives:
>
> diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
> index e56c95be2808..425ba6102828 100644
> --- a/arch/x86/kernel/jump_label.c
> +++ b/arch/x86/kernel/jump_label.c
> @@ -46,6 +46,9 @@ static void __jump_label_transform(struct jump_entry *entry,
> const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
> const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
>
> + if (early_boot_irqs_disabled)
> + poker = text_poke_early;
> +
> if (type == JUMP_LABEL_JMP) {
> if (init) {
> /*
>
> > Also, modify text_poke_early to call sync_core().
Also, we still need to add __ref to __jump_label_transform(), as
text_poke_early() is __init.
Hi Peter,
I have revisited this patch after modifying x86 sched_clock() to
contiguously output TSC-based time once it is set up early in boot, based
on the latest suggestions from Thomas.
On 18-06-25 10:55:43, Peter Zijlstra wrote:
> On Thu, Jun 21, 2018 at 05:25:17PM -0400, Pavel Tatashin wrote:
> > Allow sched_clock() to be used before schec_clock_init() and
> > sched_clock_init_late() are called. This provides us with a way to get
> > early boot timestamps on machines with unstable clocks.
>
> There are !x86 architectures that use this code and might not expect to
> have their sched_clock() called quite that early. Please verify.
>
> > + local_irq_disable();
> > + __gtod_offset = sched_clock() + __sched_clock_offset - ktime_get_ns();
> > + local_irq_enable();
>
> This might work in sched_clock_init(), which is pre-SMP.
>
> > sched_clock_running = 2;
> > /*
> > * Ensure that it is impossible to not do a static_key update.
> > @@ -350,8 +355,9 @@ u64 sched_clock_cpu(int cpu)
> > if (sched_clock_stable())
> > return sched_clock() + __sched_clock_offset;
> >
> > - if (unlikely(!sched_clock_running))
> > - return 0ull;
> > + /* Use early clock until sched_clock_init_late() */
> > + if (unlikely(sched_clock_running < 2))
> > + return sched_clock() + __sched_clock_offset;
>
> And then this remains !sched_clock_running, except instead of 0, you
> then return sched_clock() + __sched_clock_offset;
>
> > preempt_disable_notrace();
> > scd = cpu_sdc(cpu);
Unfortunately, the above suggestion won't work, and here is why.
We have a call sequence like this:
start_kernel
  sched_init()
    sched_clock_init()
      In this call sched_clock_running is set to 1, which means that
      sched_clock_cpu() starts doing the following sequence:
        scd = cpu_sdc(cpu);
        clock = sched_clock_local(scd);
      where we try to filter the output of sched_clock() based on the
      value of scd. But that won't work, because for this functionality
      we need the timer initialized, so that it wakes up and updates
      scd, and we need timekeeping initialized, so we can call
      ktime_get_ns(). Both of which happen later:
  ...
  timekeeping_init()   After this we can call ktime_get_ns().
  time_init()          Here we configure the x86_late_time_init pointer.
  ...
  late_time_init()
    x86_late_time_init()
      x86_init.timers.timer_init()
        hpet_time_init()   Only after this call do we finally start
                           getting clock interrupts and can get precise
                           output from sched_clock_local().
The way I solved the above is that I changed sched_clock() to keep
outputting time based on the early boot sched_clock() until
sched_clock_init_late(), at which point everything is configured and we
can switch to the permanent clock, even though this happens after SMP init.
If you have a better solution, please let me know.
Thank you,
Pavel
On Mon, Jun 25, 2018 at 03:23:20PM -0400, Pavel Tatashin wrote:
> Unfortunatly the above suggestion won't work. And here is why.
>
> We have a call sequence like this:
>
> start_kernel
> sched_init()
> sched_clock_init()
> In this call sched_clock_running is set to 1. Which means
> that sched_clock_cpu() starts doing the following sequence:
> scd = cpu_sdc(cpu);
> clock = sched_clock_local(scd);
> Where we try to filter the output of sched_clock() based
> on the value of scd. But, that won't work, because to get
> this functionality, we need to have: timer initialized
> that wakes up and updates scd, and we need timekeeping
> initialized, so we can call ktime_get_ns(). Both of which
> are called later.
> ...
> timekeeping_init() After this we can call ktime_get_ns()
> time_init() Here we configure x86_late_time_init pointer.
> ...
> late_time_init()
> x86_late_time_init()
> x86_init.timers.timer_init()
> hpet_time_init() Only after this call we finally start
> getting clock interrupts, and can get precise output from
> sched_clock_local().
>
> The way I solved the above, is I changed sched_clock() to keep outputing
> time based on early boot sched_clock() until sched_clock_init_late(), at
> whic point everything is configured and we can switch to the permanent
> clock, eventhough this happens after smp init.
>
> If you have a better solution, please let me know.
How's something like this? That moves sched_clock_init() to right before
we enable IRQs for the first time (which is after we've started the
whole timekeeping business).
The thing is, sched_clock_init_late() really is far too late, we need to
switch to unstable before we bring up SMP.
---
include/linux/sched_clock.h | 5 -----
init/main.c | 4 ++--
kernel/sched/clock.c | 49 +++++++++++++++++++++++++++++++++++----------
kernel/sched/core.c | 1 -
kernel/time/sched_clock.c | 2 +-
5 files changed, 41 insertions(+), 20 deletions(-)
diff --git a/include/linux/sched_clock.h b/include/linux/sched_clock.h
index 411b52e424e1..2d223677740f 100644
--- a/include/linux/sched_clock.h
+++ b/include/linux/sched_clock.h
@@ -9,17 +9,12 @@
#define LINUX_SCHED_CLOCK
#ifdef CONFIG_GENERIC_SCHED_CLOCK
-extern void sched_clock_postinit(void);
-
extern void sched_clock_register(u64 (*read)(void), int bits,
unsigned long rate);
#else
-static inline void sched_clock_postinit(void) { }
-
static inline void sched_clock_register(u64 (*read)(void), int bits,
unsigned long rate)
{
- ;
}
#endif
diff --git a/init/main.c b/init/main.c
index 3b4ada11ed52..162d931c9511 100644
--- a/init/main.c
+++ b/init/main.c
@@ -79,7 +79,7 @@
#include <linux/pti.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>
-#include <linux/sched_clock.h>
+#include <linux/sched/clock.h>
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/context_tracking.h>
@@ -642,7 +642,7 @@ asmlinkage __visible void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
- sched_clock_postinit();
+ sched_clock_init();
printk_safe_init();
perf_event_init();
profile_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 10c83e73837a..c8286b9fc593 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -68,11 +68,6 @@ EXPORT_SYMBOL_GPL(sched_clock);
__read_mostly int sched_clock_running;
-void sched_clock_init(void)
-{
- sched_clock_running = 1;
-}
-
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
/*
* We must start with !__sched_clock_stable because the unstable -> stable
@@ -199,6 +194,23 @@ void clear_sched_clock_stable(void)
__clear_sched_clock_stable();
}
+static void __sched_clock_gtod_offset(void)
+{
+ __gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();
+}
+
+void __init sched_clock_init(void)
+{
+ /*
+ * Set __gtod_offset such that once we mark sched_clock_running,
+ * sched_clock_tick() continues where sched_clock() left off.
+ *
+ * Even if TSC is buggered, we're still UP at this point so it
+ * can't really be out of sync.
+ */
+ __sched_clock_gtod_offset();
+ sched_clock_running = 1;
+}
/*
* We run this as late_initcall() such that it runs after all built-in drivers,
* notably: acpi_processor and intel_idle, which can mark the TSC as unstable.
@@ -351,7 +363,7 @@ u64 sched_clock_cpu(int cpu)
return sched_clock() + __sched_clock_offset;
if (unlikely(!sched_clock_running))
- return 0ull;
+ return sched_clock(); /* __sched_clock_offset == 0 */
preempt_disable_notrace();
scd = cpu_sdc(cpu);
@@ -385,8 +397,6 @@ void sched_clock_tick(void)
void sched_clock_tick_stable(void)
{
- u64 gtod, clock;
-
if (!sched_clock_stable())
return;
@@ -398,9 +408,7 @@ void sched_clock_tick_stable(void)
* TSC to be unstable, any computation will be computing crap.
*/
local_irq_disable();
- gtod = ktime_get_ns();
- clock = sched_clock();
- __gtod_offset = (clock + __sched_clock_offset) - gtod;
+ __sched_clock_gtod_offset();
local_irq_enable();
}
@@ -434,6 +442,24 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
#else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
+#ifdef CONFIG_GENERIC_SCHED_CLOCK
+
+/*
+ * kernel/time/sched_clock.c:sched_clock_init()
+ */
+
+u64 sched_clock_cpu(int cpu)
+{
+ return sched_clock();
+}
+
+#else /* CONFIG_GENERIC_SCHED_CLOCK */
+
+void __init sched_clock_init(void)
+{
+ sched_clock_running = 1;
+}
+
u64 sched_clock_cpu(int cpu)
{
if (unlikely(!sched_clock_running))
@@ -442,6 +468,7 @@ u64 sched_clock_cpu(int cpu)
return sched_clock();
}
+#endif /* CONFIG_GENERIC_SCHED_CLOCK */
#endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a98d54cd5535..b27d034ef4a7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5953,7 +5953,6 @@ void __init sched_init(void)
int i, j;
unsigned long alloc_size = 0, ptr;
- sched_clock_init();
wait_bit_init();
#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 2d8f05aad442..b4fedf312979 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -237,7 +237,7 @@ sched_clock_register(u64 (*read)(void), int bits, unsigned long rate)
pr_debug("Registered %pF as sched_clock source\n", read);
}
-void __init sched_clock_postinit(void)
+void __init sched_clock_init(void)
{
/*
* If no sched_clock() function has been provided at that point,
>
> How's something like this? That moves sched_clock_init() to right before
> we enable IRQs for the first time (which is after we've started the
> whole timekeeping business).
>
> The thing is, sched_clock_init_late() reall is far too late, we need to
> switch to unstable before we bring up SMP.
OK, sure.
> - sched_clock_postinit();
> + sched_clock_init();
Yes, we can move sched_clock_init(). But placing it after time_init()
would work on all arches with unstable clocks except for x86.
See the comment above time_init() on x86:
arch/x86/kernel/time.c
/*
 * Initialize TSC and delay the periodic timer init to
 * late x86_late_time_init() so ioremap works.
 */
void __init time_init(void)
{
	late_time_init = x86_late_time_init;
}
Only after this:
> > late_time_init()
> > x86_late_time_init()
> > x86_init.timers.timer_init()
> > hpet_time_init() Only after this call we finally start
> > getting clock interrupts, and can get precise output from
> > sched_clock_local().
do we start getting timer interrupts. Is it acceptable to move
sched_clock_init() after late_time_init()?
Thank you,
Pavel
On Tue, Jun 26, 2018 at 7:29 AM Pavel Tatashin
<[email protected]> wrote:
>
> >
> > How's something like this? That moves sched_clock_init() to right before
> > we enable IRQs for the first time (which is after we've started the
> > whole timekeeping business).
> >
> > The thing is, sched_clock_init_late() reall is far too late, we need to
> > switch to unstable before we bring up SMP.
>
> OK, sure.
>
> > - sched_clock_postinit();
> > + sched_clock_init();
>
> Yes, we can move sched_clock_init(). But placing it after time_init() would
> work on all arches with unstable clock except for x86.
>
> See comment above time_init x86:
> arch/x86/kernel/time.c
>
> 99/*
> 100 * Initialize TSC and delay the periodic timer init to
> 101 * late x86_late_time_init() so ioremap works.
> 102 */
> 103void __init time_init(void)
> 104{
> 105 late_time_init = x86_late_time_init;
> 106}
>
> Only After this:
> > > late_time_init()
> > > x86_late_time_init()
> > > x86_init.timers.timer_init()
> > > hpet_time_init() Only after this call we finally start
> > > getting clock interrupts, and can get precise output from
> > > sched_clock_local().
>
> We start getting timer interrupts. Is it acceptable to move
> sched_clock_init() after late_time_init()?
A change like this on top of your suggested change would work:
1. Move sched_clock_init() after late_time_init().
diff --git a/init/main.c b/init/main.c
index 162d931c9511..ff0a24170b95 100644
--- a/init/main.c
+++ b/init/main.c
@@ -642,7 +642,6 @@ asmlinkage __visible void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
- sched_clock_init();
printk_safe_init();
perf_event_init();
profile_init();
@@ -697,6 +696,7 @@ asmlinkage __visible void __init start_kernel(void)
acpi_early_init();
if (late_time_init)
late_time_init();
+ sched_clock_init();
calibrate_delay();
pid_idr_init();
anon_vma_init();
2. sched_clock_tick() bails out while sched_clock_running == 0, so we
must do __sched_clock_gtod_offset() after setting sched_clock_running
to 1, but we also must have at least one tick pass for the computation
to be correct. So something like this works, and preserves
continuity:
void __init sched_clock_init(void)
{
+ unsigned long flags;
+
+ pr_err("before sched_clock_running = 1");
+ sched_clock_running = 1;
/*
* Set __gtod_offset such that once we mark sched_clock_running,
* sched_clock_tick() continues where sched_clock() left off.
@@ -208,8 +212,11 @@ void __init sched_clock_init(void)
* Even if TSC is buggered, we're still UP at this point so it
* can't really be out of sync.
*/
+ local_irq_save(flags);
+ sched_clock_tick();
+ local_irq_restore(flags);
__sched_clock_gtod_offset();
- sched_clock_running = 1;
+ pr_err("after sched_clock_running = 1");
}
[ 0.712302] APIC: Switch to symmetric I/O mode setup
[ 0.717386] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.722031] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e02bde913e, max_idle_ns: 440795290413 ns
[ 0.722415] before sched_clock_running = 1
[ 0.722544] after sched_clock_running = 1
[ 0.722680] Calibrating delay loop (skipped), value calculated using timer frequency.. 6383.98 BogoMIPS (lpj=3191993)
[ 0.723546] pid_max: default: 32768 minimum: 301
[ 0.724087] Security Framework initialized
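
For reference, this is the net effect of the diff above with the debug
pr_err() lines dropped (a minimal sketch only, assuming sched_clock_tick()
and __sched_clock_gtod_offset() behave as in kernel/sched/clock.c):

void __init sched_clock_init(void)
{
	unsigned long flags;

	sched_clock_running = 1;

	/*
	 * Take one tick with IRQs disabled so sched_clock_tick() has
	 * valid data, then set __gtod_offset such that it continues
	 * where sched_clock() left off.
	 */
	local_irq_save(flags);
	sched_clock_tick();
	local_irq_restore(flags);

	__sched_clock_gtod_offset();
}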
On Tue, Jun 26, 2018 at 07:27:05AM -0400, Pavel Tatashin wrote:
> We start getting timer interrupts. Is it acceptable to move
> sched_clock_init() after late_time_init()?
After much puzzling and cursing, yes, I suppose that'll work.
Pavel,
first of all, sorry for my last outburst. I just was in a lousy mood after
staring into too much half baked stuff and failed to make myself stay away
from the computer.
On Sun, 24 Jun 2018, Thomas Gleixner wrote:
> On Sat, 23 Jun 2018, Pavel Tatashin wrote:
> And this early init sequence also needs to pull over the tsc adjust
> magic. So tsc_early_delay_calibrate() which should btw. be renamed to
> tsc_early_init() should have:
>
> {
> cpu_khz = x86_platform.calibrate_cpu();
> tsc_khz = x86_platform.calibrate_tsc();
>
> tsc_khz = tsc_khz ? : cpu_khz;
> if (!tsc_khz)
> return;
>
> /* Sanitize TSC ADJUST before cyc2ns gets initialized */
> tsc_store_and_check_tsc_adjust(true);
>
> calc_lpj(tsc_khz);
>
> tsc_sched_clock_init();
> }
Peter made me look deeper into this and there are a few issues, which I
missed, depending on when some of the resources become available. So we
probably cannot hook all of this into tsc_early_delay_calibrate().
I have an idea how to disentangle it and we'll end up in a staged approach,
which looks like this:
1) Earliest one (not sure how early yet)
Attempt to use MSR/CPUID. If not running on a hypervisor this can
try the quick PIT calibration, but nothing else.
2) Post init_hypervisor_platform()
An attempt to use the hypervisor data can be made.
3) Post early_acpi_boot_init()
This can do PIT/HPET based calibration
4) Post x86_dtb_init()
PIT/PMTIMER based calibration
Once tsc_khz is known, no further attempts of calibration are made. I'll
look into that later tonight.
Thanks,
tglx
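
To make the staged idea concrete, a minimal sketch of the common part,
assuming a hypothetical helper tsc_try_early_calibrate() that every stage
calls and that becomes a no-op once tsc_khz is known (calibrate_tsc and
calibrate_cpu are the existing x86_platform hooks):

static void __init tsc_try_early_calibrate(void)	/* hypothetical */
{
	if (tsc_khz)
		return;		/* an earlier stage already succeeded */

	tsc_khz = x86_platform.calibrate_tsc();
	cpu_khz = x86_platform.calibrate_cpu();
	tsc_khz = tsc_khz ? : cpu_khz;
}

Each of the four points above would then simply call this helper, and
whichever stage succeeds first wins.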
On Tue, Jun 26, 2018 at 2:42 PM Pavel Tatashin
<[email protected]> wrote:
>
> Hi Thomas,
>
> On Tue, Jun 26, 2018 at 11:44 AM Thomas Gleixner <[email protected]> wrote:
> >
> > Pavel,
> >
> > first of all, sorry for my last outburst. I just was in a lousy mood after
> > staring into too much half baked stuff and failed to make myself stay away
> > from the computer.
>
> Thank you.
>
> >
> > On Sun, 24 Jun 2018, Thomas Gleixner wrote:
> > > On Sat, 23 Jun 2018, Pavel Tatashin wrote:
> > > And this early init sequence also needs to pull over the tsc adjust
> > > magic. So tsc_early_delay_calibrate() which should btw. be renamed to
> > > tsc_early_init() should have:
> > >
> > > {
> > > cpu_khz = x86_platform.calibrate_cpu();
> > > tsc_khz = x86_platform.calibrate_tsc();
> > >
> > > tsc_khz = tsc_khz ? : cpu_khz;
> > > if (!tsc_khz)
> > > return;
> > >
> > > /* Sanitize TSC ADJUST before cyc2ns gets initialized */
> > > tsc_store_and_check_tsc_adjust(true);
> > >
> > > calc_lpj(tsc_khz);
> > >
> > > tsc_sched_clock_init();
> > > }
> >
> > Peter made me look deeper into this and there are a few issues, which I
> > missed, depending on when some of the resources become available. So we
> > probably cannot hook all of this into tsc_early_delay_calibrate().
> >
> > I have an idea how to disentangle it and we'll end up in a staged approach,
> > which looks like this:
> >
> > 1) Earliest one (not sure how early yet)
> >
> > Attempt to use MSR/CPUID. If not running on a hypervisor this can
> > try the quick PIT calibration, but nothing else.
> >
> > 2) Post init_hypervisor_platform()
> >
> > An attempt to use the hypervisor data can be made.
> >
> > 3) Post early_acpi_boot_init()
> >
> > This can do PIT/HPET based calibration
> >
> > 4) Post x86_dtb_init()
> >
> > PIT/PMTIMER based calibration
> >
> > Once tsc_khz is known, no further attempts of calibration are made. I'll
> > look into that later tonight.
>
> I think there is no reason to try staged attempts. It usually gets
> harder to maintain over time. In my opinion it is best if we do it in two
> tries, as right now, but just cleaner. In the first attempt we get a
> crude result, using the lowest denominator to which the current logic
> might fall back if something else is not available that early in boot,
> i.e. the cpu calibration loop in native_calibrate_cpu(), and later get
> something better. Also, even if the early clock does not work because we
> could not get the tsc early, it is not a problem; we will still probably
> determine it later during the tsc_init() call.
Actually, never mind. I looked through the code again, and it seems that if
we get the early tsc frequency we can keep it, and otherwise just try
again later in tsc_init(). So, no need for
cyc2ns_reinit_boot(). I still think there is no need for staged attempts, but
rather to try in two different places: if tsc_init_early() works, use that
tsc frequency later and do not try again in tsc_init(); otherwise use the
new value determined in tsc_init().
In tsc_init we can have something like this:
void __init tsc_init(void)
{
if (!boot_cpu_has(X86_FEATURE_TSC))
return;
/* See if we were not able to determine tsc frequency early,
but can now */
if (!tsc_khz && determine_cpu_tsc_frequncies()) {
/* Sanitize TSC ADJUST before cyc2ns gets initialized */
tsc_store_and_check_tsc_adjust(true);
cyc2ns_init_boot_cpu();
}
....
}
Pavel
Hi Thomas,
On Tue, Jun 26, 2018 at 11:44 AM Thomas Gleixner <[email protected]> wrote:
>
> Pavel,
>
> first of all, sorry for my last outburst. I just was in a lousy mood after
> staring into too much half baked stuff and failed to make myself stay away
> from the computer.
Thank you.
>
> On Sun, 24 Jun 2018, Thomas Gleixner wrote:
> > On Sat, 23 Jun 2018, Pavel Tatashin wrote:
> > And this early init sequence also needs to pull over the tsc adjust
> > magic. So tsc_early_delay_calibrate() which should btw. be renamed to
> > tsc_early_init() should have:
> >
> > {
> > cpu_khz = x86_platform.calibrate_cpu();
> > tsc_khz = x86_platform.calibrate_tsc();
> >
> > tsc_khz = tsc_khz ? : cpu_khz;
> > if (!tsc_khz)
> > return;
> >
> > /* Sanitize TSC ADJUST before cyc2ns gets initialized */
> > tsc_store_and_check_tsc_adjust(true);
> >
> > calc_lpj(tsc_khz);
> >
> > tsc_sched_clock_init();
> > }
>
> Peter made me look deeper into this and there are a few issues, which I
> missed, depending on when some of the resources become available. So we
> probably cannot hook all of this into tsc_early_delay_calibrate().
>
> I have an idea how to disentangle it and we'll end up in a staged approach,
> which looks like this:
>
> 1) Earliest one (not sure how early yet)
>
> Attempt to use MSR/CPUID. If not running on a hypervisor this can
> try the quick PIT calibration, but nothing else.
>
> 2) Post init_hypervisor_platform()
>
> An attempt to use the hypervisor data can be made.
>
> 3) Post early_acpi_boot_init()
>
> This can do PIT/HPET based calibration
>
> 4) Post x86_dtb_init()
>
> PIT/PMTIMER based calibration
>
> Once tsc_khz is known, no further attempts of calibration are made. I'll
> look into that later tonight.
I think there is no reason to try staged attempts. It usually gets
harder to maintain over time. In my opinion it is best if we do it in two
tries, as right now, but just cleaner. In the first attempt we get a
crude result, using the lowest denominator to which the current logic
might fall back if something else is not available that early in boot,
i.e. the cpu calibration loop in native_calibrate_cpu(), and later get
something better. Also, even if the early clock does not work because we
could not get the tsc early, it is not a problem; we will still probably
determine it later during the tsc_init() call.
I have rewritten tsc_early_init()/tsc_init(); they are much simpler now:
void __init tsc_early_init(void)
{
if (!boot_cpu_has(X86_FEATURE_TSC))
return;
if (!determine_cpu_tsc_frequncies())
return;
cyc2ns_init_boot_cpu();
static_branch_enable(&__use_tsc);
loops_per_jiffy = get_loops_per_jiffy(tsc_khz);
}
void __init tsc_init(void)
{
if (!boot_cpu_has(X86_FEATURE_TSC))
return;
if (!determine_cpu_tsc_frequncies()) {
setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
return;
}
/* Sanitize TSC ADJUST before cyc2ns gets initialized */
tsc_store_and_check_tsc_adjust(true);
cyc2ns_reinit_boot_cpu();
cyc2ns_init_secondary_cpus();
static_branch_enable(&__use_tsc);
if (!no_sched_irq_time)
enable_sched_clock_irqtime();
lpj_fine = get_loops_per_jiffy(tsc_khz);
use_tsc_delay();
check_system_tsc_reliable();
if (unsynchronized_tsc()) {
mark_tsc_unstable("TSCs unsynchronized");
return;
}
clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
detect_art();
}
All the new functions are self-explanatory. I added three cyc2ns-related
functions based on your suggestions on how to clean up that
code:
static void __init cyc2ns_init_boot(void)
{
struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
seqcount_init(&c2n->seq);
__set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
}
static void __init cyc2ns_init_secondary(void)
{
unsigned int cpu, this_cpu = smp_processor_id();
struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
struct cyc2ns_data *data = c2n->data;
for_each_possible_cpu(cpu) {
if (cpu != this_cpu) {
seqcount_init(&c2n->seq);
c2n = per_cpu_ptr(&cyc2ns, cpu);
c2n->data[0] = data[0];
c2n->data[1] = data[1];
}
}
}
/* Reinitialize boot cpu c2ns, using the offset of the current
sched_clock() value */
static void __init cyc2ns_reinit_boot(void)
{
struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
unsigned long sched_now = sched_clock();
__set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
c2n->data[0].cyc2ns_offset += sched_now;
c2n->data[1].cyc2ns_offset += sched_now;
}
I know, conceptually, it is similar to what I had before, but I think
it is simple enough, easy to maintain, and more importantly safe. What
do you think?
Thank you,
Pavel
On Tue, 26 Jun 2018, Pavel Tatashin wrote:
> On Tue, Jun 26, 2018 at 11:44 AM Thomas Gleixner <[email protected]> wrote:
> > I have an idea how to disentangle it and we'll end up in a staged approach,
> > which looks like this:
> >
> > 1) Earliest one (not sure how early yet)
> >
> > Attempt to use MSR/CPUID. If not running on a hypervisor this can
> > try the quick PIT calibration, but nothing else.
> >
> > 2) Post init_hypervisor_platform()
> >
> > An attempt to use the hypervisor data can be made.
> >
> > 3) Post early_acpi_boot_init()
> >
> > This can do PIT/HPET based calibration
> >
> > 4) Post x86_dtb_init()
> >
> > PIT/PMTIMER based calibration
> >
> > Once tsc_khz is known, no further attempts of calibration are made. I'll
> > look into that later tonight.
>
> I think there is no reason to try staged attempts. It usually gets
> harder to maintain over time. In my opinion it is best if we do it in two
> tries, as right now, but just cleaner. In the first attempt we get a
> crude result, using the lowest denominator to which the current logic
> might fall back if something else is not available that early in boot,
> i.e. the cpu calibration loop in native_calibrate_cpu(), and later get
> something better. Also, even if the early clock does not work because we
> could not get the tsc early, it is not a problem; we will still probably
> determine it later during the tsc_init() call.
I still want to document the unholy mess of what is initialized and
available when. We have 5 hypervisors and 3 different points in early boot
where the calibrate_* callbacks are overwritten. The XEN PV one is actually
post tsc_init_early() for whatever reason.
That's all completely obscure and any attempt of moving tsc_early_init()
earlier than where it is now is just lottery.
The other issue is that double calibration, e.g. doing the PIT thing twice
is just consuming boot time for no value.
All of that has been duct taped over time and we really don't want yet
another thing glued to it just because we can.
> I have rewritten tsc_early_init()/tsc_init(); they are much simpler now:
>
> void __init tsc_early_init(void)
> {
> if (!boot_cpu_has(X86_FEATURE_TSC))
> return;
>
> if (!determine_cpu_tsc_frequncies())
> return;
That still wants to do the TSC_ADJUST call here.
Thanks,
tglx
On Thu, 28 Jun 2018, Peter Zijlstra wrote:
> On Thu, Jun 28, 2018 at 12:43:59PM +0200, Thomas Gleixner wrote:
> > init_hypervisor_platform()
> > vmware:
> > Retrieves frequency and store it for the
> > calibration function
> >
> > khz = vmware_get_khz_magic()
> > vmware_tsc_khz = khz
> > calibrate_cpu = vmware_get_tsc_khz
> > calibrate_tsc = vmware_get_tsc_khz
> > preset_lpj(khz)
> >
> > hyperv:
> > if special hyperv MSRs are available:
> >
> > calibrate_cpu = hv_get_tsc_khz
> > calibrate_tsc = hv_get_tsc_khz
> >
> > MSR is readable already in this function
> >
> > jailhouse:
> >
> > Frequency is available in this function and store
> > in a variable for the calibration function
> >
> > calibrate_cpu = jailhouse_get_tsc
> > calibrate_tsc = jailhouse_get_tsc
> >
> > ...
> >
> > kvmclock_init()
> >
> > if (magic_conditions)
> > calibrate_tsc = kvm_get_tsc_khz
> > calibrate_cpu = kvm_get_tsc_khz
> >
> > kvm_get_preset_lpj()
> > khz = kvm_get_tsc_khz()
> > preset_lpj(khz);
>
> Note that all these which get TSC values from a HV _should_ set
> X86_FEATURE_TSC_KNOWN_FREQ to avoid the late recalibrate.
True.
> Calibrating against a virtual PIT/HPET/PMTIMER is utterly pointless.
It kinda works :)
> > The generic initialization does everything twice, which makes no sense,
> > except for the unlikely case where no fast functions are available and the
> > quick PIT calibration fails (PMTIMER/HPET are not available in early
> > calibration). HPET
>
> Incomplete; but I suspect you want to talk about how we can make HPET
> available early by putting it in a fixmap.
As I figured out now that's only part of the picture. We also need to do
that past x86_dtb_init() because early_quirks() and the early acpi/sfi/dtb
stuff needs to be completed before it makes sense to access hpet. But at
that point we also have the PMTIMER info.
> And only if we fail, do we at a later stage try again using PMTIMER.
>
> Currently it all works by accident, since !hpet and acpi_pm_read_early()
> returns 0, but really we should not be running the fallback crap at all
> that early.
Right.
Thanks,
tglx
On Thu, Jun 28, 2018 at 12:43:59PM +0200, Thomas Gleixner wrote:
> init_hypervisor_platform()
> vmware:
> Retrieves frequency and store it for the
> calibration function
>
> khz = vmware_get_khz_magic()
> vmware_tsc_khz = khz
> calibrate_cpu = vmware_get_tsc_khz
> calibrate_tsc = vmware_get_tsc_khz
> preset_lpj(khz)
>
> hyperv:
> if special hyperv MSRs are available:
>
> calibrate_cpu = hv_get_tsc_khz
> calibrate_tsc = hv_get_tsc_khz
>
> MSR is readable already in this function
>
> jailhouse:
>
> Frequency is available in this function and store
> in a variable for the calibration function
>
> calibrate_cpu = jailhouse_get_tsc
> calibrate_tsc = jailhouse_get_tsc
>
> ...
>
> kvmclock_init()
>
> if (magic_conditions)
> calibrate_tsc = kvm_get_tsc_khz
> calibrate_cpu = kvm_get_tsc_khz
>
> kvm_get_preset_lpj()
> khz = kvm_get_tsc_khz()
> preset_lpj(khz);
Note that all these which get TSC values from a HV _should_ set
X86_FEATURE_TSC_KNOWN_FREQ to avoid the late recalibrate.
Calibrating against a virtual PIT/HPET/PMTIMER is utterly pointless.
> The generic initialization does everything twice, which makes no sense,
> except for the unlikely case where no fast functions are available and the
> quick PIT calibration fails (PMTIMER/HPET are not available in early
> calibration). HPET
Incomplete; but I suspect you want to talk about how we can make HPET
available early by putting it in a fixmap.
And only if we fail, do we at a later stage try again using PMTIMER.
Currently it all works by accident, since !hpet and acpi_pm_read_early()
returns 0, but really we should not be running the fallback crap at all
that early.
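
As a side note on X86_FEATURE_TSC_KNOWN_FREQ: a hypervisor init_platform()
that already knows the frequency could mark it along these lines (a rough
sketch only; hv_init_platform() is an illustrative name, while hv_get_tsc_khz
and setup_force_cpu_cap() are the existing hook and helper discussed here):

static void __init hv_init_platform(void)	/* illustrative */
{
	x86_platform.calibrate_tsc = hv_get_tsc_khz;
	x86_platform.calibrate_cpu = hv_get_tsc_khz;

	/* Frequency comes from the hypervisor, skip the late recalibration */
	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
}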
On Thu, 28 Jun 2018, Thomas Gleixner wrote:
> I still want to document the unholy mess of what is initialized and
> available when. We have 5 hypervisors and 3 different points in early boot
> where the calibrate_* callbacks are overwritten. The XEN PV one is actually
> post tsc_init_early() for whatever reason.
>
> That's all completely obscure and any attempt of moving tsc_early_init()
> earlier than where it is now is just lottery.
>
> The other issue is that double calibration, e.g. doing the PIT thing twice
> is just consuming boot time for no value.
>
> All of that has been duct taped over time and we really don't want yet
> another thing glued to it just because we can.
So here is the full picture of the TSC/CPU calibration maze:
Compile time setup:
native_calibrate_tsc
CPUID based frequency read out with magic fixups
for broken CPUID implementations
native_calibrate_cpu
Try the following:
1) CPUID based (different leaf than the TSC one)
2) MSR based
3) Quick PIT calibration
4) PIT/HPET/PMTIMER calibration (slow) and only
available in tsc_init(). Could be made working
post x86_dtb_init().
Boot sequence:
start_kernel()
INTEL_MID:
x86_intel_mid_early_setup()
calibrate_tsc = intel_mid_calibrate_tsc
intel_mid_calibrate_tsc() { return 0; }
setup_arch()
x86_init.oem.arch_setup();
INTEL_MID:
intel_mid_arch_setup()
PENWELL:
x86_platform.calibrate_tsc = mfld_calibrate_tsc;
MSR based magic. Value would be available right away.
TANGIER:
x86_platform.calibrate_tsc = tangier_calibrate_tsc;
Different MSR based magic. Value would be available
right away.
....
init_hypervisor_platform()
vmware:
Retrieves frequency and store it for the
calibration function
khz = vmware_get_khz_magic()
vmware_tsc_khz = khz
calibrate_cpu = vmware_get_tsc_khz
calibrate_tsc = vmware_get_tsc_khz
preset_lpj(khz)
hyperv:
if special hyperv MSRs are available:
calibrate_cpu = hv_get_tsc_khz
calibrate_tsc = hv_get_tsc_khz
MSR is readable already in this function
jailhouse:
Frequency is available in this function and store
in a variable for the calibration function
calibrate_cpu = jailhouse_get_tsc
calibrate_tsc = jailhouse_get_tsc
...
kvmclock_init()
if (magic_conditions)
calibrate_tsc = kvm_get_tsc_khz
calibrate_cpu = kvm_get_tsc_khz
kvm_get_preset_lpj()
khz = kvm_get_tsc_khz()
preset_lpj(khz);
tsc_early_delay_calibrate()
tsc_khz = calibrate_tsc()
cpu_khz = calibrate_cpu()
....
set_lpj(tsc_khz);
x86_init.paging.pagetable_init()
xen_pagetable_init()
xen_setup_shared_info()
xen_hvm_init_time_ops()
if (XENFEAT_hvm_safe_pvclock)
calibrate_tsc = xen_tsc_khz
PV clock based access
tsc_init()
tsc_khz = calibrate_tsc()
cpu_khz = calibrate_cpu()
Putting this into a table:
Platform tsc_early_delay_calibrate() tsc_init()
-----------------------------------------------------------------------
Generic native_calibrate_tsc() native_calibrate_tsc()
native_calibrate_cpu() native_calibrate_cpu()
(Cannot do HPET/PMTIMER)
-----------------------------------------------------------------------
INTEL_MID intel_mid_calibrate_tsc() intel_mid_calibrate_tsc()
Generic native_calibrate_cpu() native_calibrate_cpu()
INTEL_MID mfld_calibrate_tsc() mfld_calibrate_tsc()
PENWELL native_calibrate_cpu() native_calibrate_cpu()
INTEL_MID tangier_calibrate_tsc() tangier_calibrate_tsc()
TANGIER native_calibrate_cpu() native_calibrate_cpu()
-----------------------------------------------------------------------
VMWARE vmware_get_tsc_khz() vmware_get_tsc_khz()
vmware_get_tsc_khz() vmware_get_tsc_khz()
HYPERV hv_get_tsc_khz() hv_get_tsc_khz()
hv_get_tsc_khz() hv_get_tsc_khz()
JAILHOUSE jailhouse_get_tsc() jailhouse_get_tsc()
jailhouse_get_tsc() jailhouse_get_tsc()
KVM kvm_get_tsc_khz() kvm_get_tsc_khz()
kvm_get_tsc_khz() kvm_get_tsc_khz()
------------------------------------------------------------------------
XEN native_calibrate_tsc() xen_tsc_khz()
native_calibrate_cpu() native_calibrate_cpu()
------------------------------------------------------------------------
The only platform which cannot use the special TSC calibration routine
in the early calibration is XEN because it's initialized just _after_ the
early calibration runs.
For enhanced fun the early calibration stuff was moved from right after
init_hypervisor_platform() to the place where it is now in commit
ccb64941f375a6 ("x86/timers: Move simple_udelay_calibration() past
kvmclock_init()") to speed up KVM boot time by avoiding the PIT
calibration. I have no idea why it wasn't just moved past the XEN
initialization a few lines further down, especially as the change was done
by a XEN maintainer :) Boris?
The other HV guests all do more or less the same thing and return the same
value for cpu_khz and tsc_khz via the calibration indirection despite the
value being known in the init_platform() function already.
The generic initialization does everything twice, which makes no sense,
except for the unlikely case where no fast functions are available and the
quick PIT calibration fails (PMTIMER/HPET are not available in early
calibration). HPET
The INTEL MID stuff is weird and not really obvious. AFAIR those systems
don't have PIT or such, so they need to rely on the MSR/CPUID mechanisms to
work, but that's just working by chance and not for obvious reasons. Andy,
can you shed some light on that stuff?
So some of this just works by chance, and things are done twice and pointlessly
(XEN). This really wants to be cleaned up and well documented as to what the
requirements of each platform are; especially the Intel-MID stuff needs
that.
Thanks,
tglx
On Thu, Jun 28, 2018 at 11:23 AM Thomas Gleixner <[email protected]> wrote:
>
> On Thu, 28 Jun 2018, Thomas Gleixner wrote:
> > I still want to document the unholy mess of what is initialized and
> > available when. We have 5 hypervisors and 3 different points in early boot
> > where the calibrate_* callbacks are overwritten. The XEN PV one is actually
> > post tsc_init_early() for whatever reason.
> >
> > That's all completely obscure and any attempt of moving tsc_early_init()
> > earlier than where it is now is just lottery.
> >
> > The other issue is that double calibration, e.g. doing the PIT thing twice
> > is just consuming boot time for no value.
> >
> > All of that has been duct taped over time and we really don't want yet
> > another thing glued to it just because we can.
>
> So here is the full picture of the TSC/CPU calibration maze:
>
> Compile time setup:
> native_calibrate_tsc
> CPUID based frequency read out with magic fixups
> for broken CPUID implementations
>
> native_calibrate_cpu
> Try the following:
>
> 1) CPUID based (different leaf than the TSC one)
> 2) MSR based
> 3) Quick PIT calibration
> 4) PIT/HPET/PMTIMER calibration (slow) and only
> available in tsc_init(). Could be made working
> post x86_dtb_init().
>
>
> Boot sequence:
>
> start_kernel()
>
> INTEL_MID:
> x86_intel_mid_early_setup()
> calibrate_tsc = intel_mid_calibrate_tsc
>
> intel_mid_calibrate_tsc() { return 0; }
>
> setup_arch()
>
> x86_init.oem.arch_setup();
> INTEL_MID:
> intel_mid_arch_setup()
>
> PENWELL:
> x86_platform.calibrate_tsc = mfld_calibrate_tsc;
>
> MSR based magic. Value would be available right away.
>
> TANGIER:
> x86_platform.calibrate_tsc = tangier_calibrate_tsc;
>
> Different MSR based magic. Value would be available
> right away.
>
> ....
>
> init_hypervisor_platform()
> vmware:
> Retrieves frequency and store it for the
> calibration function
>
> khz = vmware_get_khz_magic()
> vmware_tsc_khz = khz
> calibrate_cpu = vmware_get_tsc_khz
> calibrate_tsc = vmware_get_tsc_khz
> preset_lpj(khz)
>
> hyperv:
> if special hyperv MSRs are available:
>
> calibrate_cpu = hv_get_tsc_khz
> calibrate_tsc = hv_get_tsc_khz
>
> MSR is readable already in this function
>
> jailhouse:
>
> Frequency is available in this function and store
> in a variable for the calibration function
>
> calibrate_cpu = jailhouse_get_tsc
> calibrate_tsc = jailhouse_get_tsc
>
> ...
>
> kvmclock_init()
>
> if (magic_conditions)
> calibrate_tsc = kvm_get_tsc_khz
> calibrate_cpu = kvm_get_tsc_khz
>
> kvm_get_preset_lpj()
> khz = kvm_get_tsc_khz()
> preset_lpj(khz);
>
> tsc_early_delay_calibrate()
> tsc_khz = calibrate_tsc()
> cpu_khz = calibrate_cpu()
>
> ....
> set_lpj(tsc_khz);
>
>
> x86_init.paging.pagetable_init()
> xen_pagetable_init()
> xen_setup_shared_info()
> xen_hvm_init_time_ops()
> if (XENFEAT_hvm_safe_pvclock)
> calibrate_tsc = xen_tsc_khz
>
> PV clock based access
>
> tsc_init()
> tsc_khz = calibrate_tsc()
> cpu_khz = calibrate_cpu()
>
>
> Putting this into a table:
>
> Platform tsc_early_delay_calibrate() tsc_init()
> -----------------------------------------------------------------------
>
> Generic native_calibrate_tsc() native_calibrate_tsc()
> native_calibrate_cpu() native_calibrate_cpu()
> (Cannot do HPET/PMTIMER)
>
> -----------------------------------------------------------------------
>
> INTEL_MID intel_mid_calibrate_tsc() intel_mid_calibrate_tsc()
> Generic native_calibrate_cpu() native_calibrate_cpu()
>
> INTEL_MID mfld_calibrate_tsc() mfld_calibrate_tsc()
> PENWELL native_calibrate_cpu() native_calibrate_cpu()
>
> INTEL_MID tangier_calibrate_tsc() tangier_calibrate_tsc()
> TANGIER native_calibrate_cpu() native_calibrate_cpu()
>
> -----------------------------------------------------------------------
>
> VMWARE vmware_get_tsc_khz() vmware_get_tsc_khz()
> vmware_get_tsc_khz() vmware_get_tsc_khz()
>
> HYPERV hv_get_tsc_khz() hv_get_tsc_khz()
> hv_get_tsc_khz() hv_get_tsc_khz()
>
>
> JAILHOUSE jailhouse_get_tsc() jailhouse_get_tsc()
> jailhouse_get_tsc() jailhouse_get_tsc()
>
>
> KVM kvm_get_tsc_khz() kvm_get_tsc_khz()
> kvm_get_tsc_khz() kvm_get_tsc_khz()
>
> ------------------------------------------------------------------------
>
> XEN native_calibrate_tsc() xen_tsc_khz()
> native_calibrate_cpu() native_calibrate_cpu()
>
> ------------------------------------------------------------------------
>
> The only platform which cannot use the special TSC calibration routine
> in the early calibration is XEN because it's initialized just _after_ the
> early calibration runs.
>
> For enhanced fun the early calibration stuff was moved from right after
> init_hypervisor_platform() to the place where it is now in commit
> ccb64941f375a6 ("x86/timers: Move simple_udelay_calibration() past
> kvmclock_init()") to speed up KVM boot time by avoiding the PIT
> calibration. I have no idea why it wasn't just moved past the XEN
> initialization a few lines further down, especially as the change was done
> by a XEN maintainer :) Boris?
>
> The other HV guests all do more or less the same thing and return the same
> value for cpu_khz and tsc_khz via the calibration indirection despite the
> value being known in the init_platform() function already.
>
> The generic initialization does everything twice, which makes no sense,
> except for the unlikely case where no fast functions are available and the
> quick PIT calibration fails (PMTIMER/HPET are not available in early
> calibration). HPET
>
> The INTEL MID stuff is weird and not really obvious. AFAIR those systems
> don't have PIT or such, so they need to rely on the MSR/CPUID mechanisms to
> work, but that's just working by chance and not for obvious reasons. Andy,
> can you shed some light on that stuff?
>
> So some of this just works by chance, and things are done twice and pointlessly
> (XEN). This really wants to be cleaned up and well documented as to what the
> requirements of each platform are; especially the Intel-MID stuff needs
> that.
Hi Thomas,
In addition to above, we have xen hvm:
setup_arch()
...
init_hypervisor_platform();
x86_init.hyper.init_platform();
xen_hvm_guest_init()
xen_hvm_init_time_ops();
...
tsc_early_delay_calibrate();
tsc_khz = x86_platform.calibrate_tsc(); == xen_tsc_khz()
...
Which works early.
So, what should we do with xen, which seems to be the only platform
that would provide different tsc frequency early and late, because of
different calibration method?
Thank you,
Pavel
On Thu, 28 Jun 2018, Pavel Tatashin wrote:
> On Thu, Jun 28, 2018 at 11:23 AM Thomas Gleixner <[email protected]> wrote:
> Hi Thomas,
>
> In addition to above, we have xen hvm:
>
> setup_arch()
> ...
> init_hypervisor_platform();
> x86_init.hyper.init_platform();
> xen_hvm_guest_init()
> xen_hvm_init_time_ops();
Duh. Missed that completely.
> ...
> tsc_early_delay_calibrate();
> tsc_khz = x86_platform.calibrate_tsc(); == xen_tsc_khz()
> ...
>
> Which works early.
>
> So, what should we do with xen, which seems to be the only platform
> that would provide different tsc frequency early and late, because of
> different calibration method?
Fix it? I have no idea why XEN has two variants of the scheme and I neither
have a clue why the KVM clock stuff is late.
Thanks,
tglx
On 18-06-29 09:30:10, Thomas Gleixner wrote:
> On Thu, 28 Jun 2018, Pavel Tatashin wrote:
> > On Thu, Jun 28, 2018 at 11:23 AM Thomas Gleixner <[email protected]> wrote:
> > Hi Thomas,
> >
> > In addition to above, we have xen hvm:
> >
> > setup_arch()
> > ...
> > init_hypervisor_platform();
> > x86_init.hyper.init_platform();
> > xen_hvm_guest_init()
> > xen_hvm_init_time_ops();
>
> Duh. Missed that completely.
>
> > ...
> > tsc_early_delay_calibrate();
> > tsc_khz = x86_platform.calibrate_tsc(); == xen_tsc_khz()
> > ...
> >
> > Which works early.
> >
> > So, what should we do with xen, which seems to be the only platform
> > that would provide different tsc frequency early and late, because of
> > different calibration method?
>
> Fix it? I have no idea why XEN has two variants of the scheme and I neither
> have a clue why the KVM clock stuff is late.
kvm clock I have fixed in my series.
http://lkml.kernel.org/r/[email protected]
It was late because it depended on memblock, which is initialized later,
after the e820 ranges are read.
So, after my series kvm clock will work right after
init_hypervisor_platform(), the same as with most other hypervisors.
With xen, it is a little more challenging: I do not have hardware
configured to run xen to test it.
On the other hand, most likely tsc_early_delay_calibrate() now
fails to calibrate the tsc on xen, and it is calibrated correctly later in
tsc_init(). I doubt that we can use the CPUID method to determine the tsc
frequency on a xen cpu. Anyway, that would be accidental behavior.
To make xen clock available early, we need HYPERVISOR_shared_info. So, at
least that part of xen_setup_shared_info() must be called/available from
init_hypervisor_platform():
void xen_setup_shared_info(void)
{
set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
HYPERVISOR_shared_info =
(struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
xen_setup_mfn_list_list();
...
I have no idea whether that is possible or safe to do.
For now, in my series, I would like to check in tsc_early_init() if we are
running on xen, and set tsc_khz to 0 if so. Later in tsc_init() we will get
a proper calibration. The question is how to make this check the least
intrusive.
Pavel
On Thu, 2018-06-28 at 12:43 +0200, Thomas Gleixner wrote:
> On Thu, 28 Jun 2018, Thomas Gleixner wrote:
> > I still want to document the unholy mess of what is initialized and
> > available when. We have 5 hypervisors and 3 different points in early boot
> > where the calibrate_* callbacks are overwritten. The XEN PV one is actually
> > post tsc_init_early() for whatever reason.
> >
> > That's all completely obscure and any attempt of moving tsc_early_init()
> > earlier than where it is now is just lottery.
> >
> > The other issue is that double calibration, e.g. doing the PIT thing twice
> > is just consuming boot time for no value.
> >
> > All of that has been duct taped over time and we really don't want yet
> > another thing glued to it just because we can.
Hmm... Good question about Intel MID, I would try to put my
understanding here.
> Boot sequence:
>
> start_kernel()
>
> INTEL_MID:
> x86_intel_mid_early_setup()
> calibrate_tsc = intel_mid_calibrate_tsc
>
> intel_mid_calibrate_tsc() { return 0; }
This sounds like a stub against very old calibration code since Intel
MID has no PIT, HPET, or PMTIMER to calibrate from.
>
> setup_arch()
>
> x86_init.oem.arch_setup();
> INTEL_MID:
> intel_mid_arch_setup()
>
> PENWELL:
> x86_platform.calibrate_tsc = mfld_calibrate_tsc;
>
> MSR based magic. Value would be available right away.
>
> TANGIER:
> x86_platform.calibrate_tsc = tangier_calibrate_tsc;
>
> Different MSR based magic. Value would be available
> right away.
This stuff is how we can read TSC frequency on those platforms.
The commit 7da7c1561366 ("x86, tsc: Add static (MSR) TSC calibration on
Intel Atom SoCs") introduced a common way for all those chips to get the TSC
frequency, while forgetting to remove the old code.
Surprisingly, the same guy even amended legacy code in the commit
f3a02ecebed7 ("x86/tsc: Set TSC_KNOWN_FREQ and TSC_RELIABLE flags on
Intel Atom SoCs").
> INTEL_MID intel_mid_calibrate_tsc() intel_mid_calibrate_tsc()
> Generic native_calibrate_cpu() native_calibrate_cpu()
>
> INTEL_MID mfld_calibrate_tsc() mfld_calibrate_tsc()
> PENWELL native_calibrate_cpu() native_calibrate_cpu()
>
> INTEL_MID tangier_calibrate_tsc() tangier_calibrate_tsc()
> TANGIER native_calibrate_cpu() native_calibrate_cpu()
>
Taking the above into consideration, I think we may just remove the legacy
code from mfld.c and mrfld.c and see what happens.
If you can tell me points to test, I can prepare patches to remove it and
test on both Medfield and Merrifield platforms (Penwell and Tangier SoCs
respectively).
> The INTEL MID stuff is weird and not really obvious. AFAIR those systems
> don't have PIT or such, so they need to rely on the MSR/CPUID mechanisms to
> work, but that's just working by chance and not for obvious reasons. Andy,
> can you shed some light on that stuff?
Hope above makes sense.
--
Andy Shevchenko <[email protected]>
Intel Finland Oy
On Fri, 2018-06-29 at 17:30 +0300, Andy Shevchenko wrote:
> On Thu, 2018-06-28 at 12:43 +0200, Thomas Gleixner wrote:
> > On Thu, 28 Jun 2018, Thomas Gleixner wrote:
> > >
> Taking the above into consideration, I think we may just remove the legacy
> code from mfld.c and mrfld.c and see what happens.
>
> If you can tell me points to test, I can prepare patches to remove it and
> test on both Medfield and Merrifield platforms (Penwell and Tangier SoCs
> respectively).
I almost have a patch series. Tested for now on Intel Merrifield
(Tangier SoC).
--
Andy Shevchenko <[email protected]>
Intel Finland Oy
On Fri, Jun 29, 2018 at 09:30:10AM +0200, Thomas Gleixner wrote:
> On Thu, 28 Jun 2018, Pavel Tatashin wrote:
> > On Thu, Jun 28, 2018 at 11:23 AM Thomas Gleixner <[email protected]> wrote:
> > Hi Thomas,
> >
> > In addition to above, we have xen hvm:
> >
> > setup_arch()
> > ...
> > init_hypervisor_platform();
> > x86_init.hyper.init_platform();
> > xen_hvm_guest_init()
> > xen_hvm_init_time_ops();
>
> Duh. Missed that completely.
>
> > ...
> > tsc_early_delay_calibrate();
> > tsc_khz = x86_platform.calibrate_tsc(); == xen_tsc_khz()
> > ...
> >
> > Which works early.
> >
> > So, what should we do with xen, which seems to be the only platform
> > that would provide different tsc frequency early and late, because of
> > different calibration method?
>
> Fix it? I have no idea why XEN has two variants of the scheme and I neither
> have a clue why the KVM clock stuff is late.
Added Juergen (Xen).
>
> Thanks,
>
> tglx
>
On Fri, 29 Jun 2018, Pavel Tatashin wrote:
> On 18-06-29 09:30:10, Thomas Gleixner wrote:
> For now, in my series, I would like to check in tsc_early_init() if we are
> running on xen, and set tsc_khz to 0 if so. Later in tsc_init() we will get
> a proper calibration. The question is how to make this check the least
> intrusive.
By not having it at all. Talk to the XEN folks (IIRC there are a few
coworkers of you involved with XEN).
Thanks,
tglx
On 21/06/2018 23:25, Pavel Tatashin wrote:
> KVM clock is initialized later compared to other hypervisors because it has
> a dependency on the memblock allocator.
>
> Let's bring it in line with other hypervisors by removing this dependency
> and using memory from BSS instead of allocating it.
>
> The benefits:
> - remove ifdef from common code
> - earlier availability of TSC.
> - remove dependency on memblock, and reduce code
> - earlier kvm sched_clock()
>
> Signed-off-by: Pavel Tatashin <[email protected]>
The reason for this is to avoid wasting a lot of BSS memory when KVM is
not in use. Thomas is going to send his take on this!
Paolo
> ---
> arch/x86/kernel/kvm.c | 1 +
> arch/x86/kernel/kvmclock.c | 64 ++++++--------------------------------
> arch/x86/kernel/setup.c | 7 ++---
> 3 files changed, 12 insertions(+), 60 deletions(-)
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 5b2300b818af..c65c232d3ddd 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -628,6 +628,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
> .name = "KVM",
> .detect = kvm_detect,
> .type = X86_HYPER_KVM,
> + .init.init_platform = kvmclock_init,
> .init.guest_late_init = kvm_guest_init,
> .init.x2apic_available = kvm_para_available,
> };
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index bf8d1eb7fca3..01558d41ec2c 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -23,9 +23,9 @@
> #include <asm/apic.h>
> #include <linux/percpu.h>
> #include <linux/hardirq.h>
> -#include <linux/memblock.h>
> #include <linux/sched.h>
> #include <linux/sched/clock.h>
> +#include <linux/mm.h>
>
> #include <asm/mem_encrypt.h>
> #include <asm/x86_init.h>
> @@ -44,6 +44,11 @@ static int parse_no_kvmclock(char *arg)
> }
> early_param("no-kvmclock", parse_no_kvmclock);
>
> +/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
> +#define HV_CLOCK_SIZE (sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
> +#define WALL_CLOCK_SIZE (sizeof(struct pvclock_wall_clock))
> +static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
> +static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
> /* The hypervisor will put information about time periodically here */
> static struct pvclock_vsyscall_time_info *hv_clock;
> static struct pvclock_wall_clock *wall_clock;
> @@ -244,43 +249,12 @@ static void kvm_shutdown(void)
> native_machine_shutdown();
> }
>
> -static phys_addr_t __init kvm_memblock_alloc(phys_addr_t size,
> - phys_addr_t align)
> -{
> - phys_addr_t mem;
> -
> - mem = memblock_alloc(size, align);
> - if (!mem)
> - return 0;
> -
> - if (sev_active()) {
> - if (early_set_memory_decrypted((unsigned long)__va(mem), size))
> - goto e_free;
> - }
> -
> - return mem;
> -e_free:
> - memblock_free(mem, size);
> - return 0;
> -}
> -
> -static void __init kvm_memblock_free(phys_addr_t addr, phys_addr_t size)
> -{
> - if (sev_active())
> - early_set_memory_encrypted((unsigned long)__va(addr), size);
> -
> - memblock_free(addr, size);
> -}
> -
> void __init kvmclock_init(void)
> {
> struct pvclock_vcpu_time_info *vcpu_time;
> - unsigned long mem, mem_wall_clock;
> - int size, cpu, wall_clock_size;
> + int cpu;
> u8 flags;
>
> - size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
> -
> if (!kvm_para_available())
> return;
>
> @@ -290,28 +264,11 @@ void __init kvmclock_init(void)
> } else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
> return;
>
> - wall_clock_size = PAGE_ALIGN(sizeof(struct pvclock_wall_clock));
> - mem_wall_clock = kvm_memblock_alloc(wall_clock_size, PAGE_SIZE);
> - if (!mem_wall_clock)
> - return;
> -
> - wall_clock = __va(mem_wall_clock);
> - memset(wall_clock, 0, wall_clock_size);
> -
> - mem = kvm_memblock_alloc(size, PAGE_SIZE);
> - if (!mem) {
> - kvm_memblock_free(mem_wall_clock, wall_clock_size);
> - wall_clock = NULL;
> - return;
> - }
> -
> - hv_clock = __va(mem);
> - memset(hv_clock, 0, size);
> + wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
> + hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
>
> if (kvm_register_clock("primary cpu clock")) {
> hv_clock = NULL;
> - kvm_memblock_free(mem, size);
> - kvm_memblock_free(mem_wall_clock, wall_clock_size);
> wall_clock = NULL;
> return;
> }
> @@ -354,13 +311,10 @@ int __init kvm_setup_vsyscall_timeinfo(void)
> int cpu;
> u8 flags;
> struct pvclock_vcpu_time_info *vcpu_time;
> - unsigned int size;
>
> if (!hv_clock)
> return 0;
>
> - size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
> -
> cpu = get_cpu();
>
> vcpu_time = &hv_clock[cpu].pvti;
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 403b2d2c31d2..01fcc8bf7c8f 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1014,6 +1014,8 @@ void __init setup_arch(char **cmdline_p)
> */
> init_hypervisor_platform();
>
> + tsc_early_delay_calibrate();
> +
> x86_init.resources.probe_roms();
>
> /* after parse_early_param, so could debug it */
> @@ -1199,11 +1201,6 @@ void __init setup_arch(char **cmdline_p)
>
> memblock_find_dma_reserve();
>
> -#ifdef CONFIG_KVM_GUEST
> - kvmclock_init();
> -#endif
> -
> - tsc_early_delay_calibrate();
> if (!early_xdbc_setup_hardware())
> early_xdbc_register_console();
>
>
On Thu, 5 Jul 2018, Paolo Bonzini wrote:
> On 21/06/2018 23:25, Pavel Tatashin wrote:
> > KVM clock is initialized later compared to other hypervisors because it has
> > a dependency on the memblock allocator.
> >
> > Let's bring it in line with other hypervisors by removing this dependency
> > and using memory from BSS instead of allocating it.
> >
> > The benefits:
> > - remove ifdef from common code
> > - earlier availability of TSC.
> > - remove dependency on memblock, and reduce code
> > - earlier kvm sched_clock()
> >
> > Signed-off-by: Pavel Tatashin <[email protected]>
>
> The reason for this is to avoid wasting a lot of BSS memory when KVM is
> not in use. Thomas is going to send his take on this!
Got it working with per cpu variables, but there is a different subtle
issue with that.
The pvclock data is mapped into the VDSO as well, i.e. as a full page.
Right now with the linear array, which is forced to be page sized at least
this only maps pvclock data or zeroed data (after the last CPU) into the
VDSO.
With PER CPU variables this would map arbitrary other per cpu data which
happens to be in the same page into the VDSO. Not really what we want.
That means to utilize PER CPU data this requires to allocate page sized
pvclock data space for each CPU to prevent leaking arbitrary stuff.
As this data is allocated on demand, i.e. only if kvmclock is used, this
might be tolerable, but I'm not so sure.
Thanks,
tglx
On 06/07/2018 11:24, Thomas Gleixner wrote:
>> The reason for this is to avoid wasting a lot of BSS memory when KVM is
>> not in use. Thomas is going to send his take on this!
> Got it working with per cpu variables, but there is a different subtle
> issue with that.
>
> The pvclock data is mapped into the VDSO as well, i.e. as a full page.
>
> Right now with the linear array, which is forced to be page sized at least
> this only maps pvclock data or zeroed data (after the last CPU) into the
> VDSO.
>
> > With PER CPU variables this would map arbitrary other per cpu data which
> happens to be in the same page into the VDSO. Not really what we want.
>
> That means to utilize PER CPU data this requires to allocate page sized
> pvclock data space for each CPU to prevent leaking arbitrary stuff.
>
> As this data is allocated on demand, i.e. only if kvmclock is used, this
> might be tolerable, but I'm not so sure.
One possibility is to introduce another layer of indirection: in
addition to the percpu pvclock data, add a percpu pointer to the pvclock
data and initialize it to point to a page-aligned variable in BSS. CPU0
(used by vDSO) doesn't touch the pointer and keeps using the BSS
variable, APs instead redirect the pointer to the percpu data.
Paolo
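
To make the suggested indirection a bit more concrete, a rough sketch
(all names are illustrative, not the eventual implementation): the boot CPU
keeps using a page-aligned BSS page that is safe to map into the VDSO,
while the APs flip a per-cpu pointer over to their own per-cpu copy:

/* Page-aligned BSS copy: used by CPU0 and mapped into the VDSO */
static struct pvclock_vsyscall_time_info hv_clock_boot __aligned(PAGE_SIZE);

/* Per-cpu storage for the APs plus the indirection pointer */
static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info, hv_clock_cpu);
static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_ptr) =
	&hv_clock_boot;

static void kvm_setup_clock_cpu(int cpu)	/* illustrative */
{
	if (cpu != 0)
		per_cpu(hv_clock_ptr, cpu) = &per_cpu(hv_clock_cpu, cpu);
	/* register per_cpu(hv_clock_ptr, cpu) with the hypervisor as before */
}

Readers would then go through this_cpu_read(hv_clock_ptr) instead of
indexing hv_clock[smp_processor_id()], which is the efficiency point made
below.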
On Fri, 6 Jul 2018, Paolo Bonzini wrote:
> On 06/07/2018 11:24, Thomas Gleixner wrote:
> >> The reason for this is to avoid wasting a lot of BSS memory when KVM is
> >> not in use. Thomas is going to send his take on this!
> > Got it working with per cpu variables, but there is a different subtle
> > issue with that.
> >
> > The pvclock data is mapped into the VDSO as well, i.e. as a full page.
> >
> > Right now with the linear array, which is forced to be page sized at least
> > this only maps pvclock data or zeroed data (after the last CPU) into the
> > VDSO.
> >
> > With PER CPU variables this would map arbitrary other per cpu data which
> > happens to be in the same page into the VDSO. Not really what we want.
> >
> > That means to utilize PER CPU data this requires to allocate page sized
> > pvclock data space for each CPU to prevent leaking arbitrary stuff.
> >
> > As this data is allocated on demand, i.e. only if kvmclock is used, this
> > might be tolerable, but I'm not so sure.
>
> One possibility is to introduce another layer of indirection: in
> addition to the percpu pvclock data, add a percpu pointer to the pvclock
> data and initialize it to point to a page-aligned variable in BSS. CPU0
> (used by vDSO) doesn't touch the pointer and keeps using the BSS
> variable, APs instead redirect the pointer to the percpu data.
Yeah, thought about that, but the extra indirection is ugly. Instead of
using per cpu data, I just can allocate the memory _after_ the allocators
are up and running and use a single page sized static __initdata for the
early boot.
Thanks,
tglx
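
A rough sketch of that alternative as well (again, names and allocation
calls are illustrative, not the final patch): a single page-aligned
__initdata page carries the boot CPU's pvclock data until the allocators
are up, then a properly sized array is allocated and the boot CPU's entry
copied over:

static struct pvclock_vsyscall_time_info hv_clock_early __initdata
	__aligned(PAGE_SIZE);
static struct pvclock_vsyscall_time_info *hv_clock;

static void __init kvmclock_init_early(void)	/* illustrative */
{
	hv_clock = &hv_clock_early;	/* early boot: __initdata page */
}

static int __init kvmclock_alloc(void)		/* once allocators are up */
{
	size_t size = PAGE_ALIGN(sizeof(*hv_clock) * num_possible_cpus());
	struct pvclock_vsyscall_time_info *p;

	if (!hv_clock)
		return 0;			/* kvmclock not in use */

	p = alloc_pages_exact(size, GFP_KERNEL | __GFP_ZERO);
	if (!p)
		return -ENOMEM;

	p[0] = hv_clock_early;			/* preserve the boot CPU's data */
	hv_clock = p;
	return 0;
}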
On 06/07/2018 11:45, Thomas Gleixner wrote:
>> One possibility is to introduce another layer of indirection: in
>> addition to the percpu pvclock data, add a percpu pointer to the pvclock
>> data and initialize it to point to a page-aligned variable in BSS. CPU0
>> (used by vDSO) doesn't touch the pointer and keeps using the BSS
>> variable, APs instead redirect the pointer to the percpu data.
> Yeah, thought about that, but the extra indirection is ugly. Instead of
> using per cpu data, I just can allocate the memory _after_ the allocators
> are up and running and use a single page sized static __initdata for the
> early boot.
Either works for me. Assembly-wise, the indirection should be more or
less the same as what we have now; even more efficient because it
accesses a percpu pointer instead of computing it based on
smp_processor_id().
Paolo
On Fri, 6 Jul 2018, Paolo Bonzini wrote:
> On 06/07/2018 11:45, Thomas Gleixner wrote:
> >> One possibility is to introduce another layer of indirection: in
> >> addition to the percpu pvclock data, add a percpu pointer to the pvclock
> >> data and initialize it to point to a page-aligned variable in BSS. CPU0
> >> (used by vDSO) doesn't touch the pointer and keeps using the BSS
> >> variable, APs instead redirect the pointer to the percpu data.
> > Yeah, thought about that, but the extra indirection is ugly. Instead of
> > using per cpu data, I just can allocate the memory _after_ the allocators
> > are up and running and use a single page sized static __initdata for the
> > early boot.
>
> Either works for me. Assembly-wise, the indirection should be more or
> less the same as what we have now; even more efficient because it
> accesses a percpu pointer instead of computing it based on
> smp_processor_id().
Good point. Let me try that.
Thanks,
tglx
On Fri, 6 Jul 2018, Thomas Gleixner wrote:
> On Fri, 6 Jul 2018, Paolo Bonzini wrote:
> > On 06/07/2018 11:45, Thomas Gleixner wrote:
> > >> One possibility is to introduce another layer of indirection: in
> > >> addition to the percpu pvclock data, add a percpu pointer to the pvclock
> > >> data and initialize it to point to a page-aligned variable in BSS. CPU0
> > >> (used by vDSO) doesn't touch the pointer and keeps using the BSS
> > >> variable, APs instead redirect the pointer to the percpu data.
> > > Yeah, thought about that, but the extra indirection is ugly. Instead of
> > > using per cpu data, I just can allocate the memory _after_ the allocators
> > > are up and running and use a single page sized static __initdata for the
> > > early boot.
> >
> > Either works for me. Assembly-wise, the indirection should be more or
> > less the same as what we have now; even more efficient because it
> > accesses a percpu pointer instead of computing it based on
> > smp_processor_id().
>
> Good point. Let me try that.
Duh, that either requires a static key or a static PER_CPU pointer thingy,
which then wastes 8 bytes per cpu instead of 64 byte if unused.
Thanks,
tglx
I think using __initdata during init_hypervisor_platform() +
allocating during x86_init.hyper.guest_late_init() is a good approach.
My only concern is that it would mean we need to init/uninit/init the clock for
the boot cpu. Is the clock continuity preserved during the
transition? I believe so, but it needs to be verified.
Pavel
On Fri, Jul 6, 2018 at 6:53 AM Thomas Gleixner <[email protected]> wrote:
>
> On Fri, 6 Jul 2018, Thomas Gleixner wrote:
> > On Fri, 6 Jul 2018, Paolo Bonzini wrote:
> > > On 06/07/2018 11:45, Thomas Gleixner wrote:
> > > >> One possibility is to introduce another layer of indirection: in
> > > >> addition to the percpu pvclock data, add a percpu pointer to the pvclock
> > > >> data and initialize it to point to a page-aligned variable in BSS. CPU0
> > > >> (used by vDSO) doesn't touch the pointer and keeps using the BSS
> > > >> variable, APs instead redirect the pointer to the percpu data.
> > > > Yeah, thought about that, but the extra indirection is ugly. Instead of
> > > > using per cpu data, I just can allocate the memory _after_ the allocators
> > > > are up and running and use a single page sized static __initdata for the
> > > > early boot.
> > >
> > > Either works for me. Assembly-wise, the indirection should be more or
> > > less the same as what we have now; even more efficient because it
> > > accesses a percpu pointer instead of computing it based on
> > > smp_processor_id().
> >
> > Good point. Let me try that.
>
> Duh, that either requires a static key or a static PER_CPU pointer thingy,
> which then wastes 8 bytes per cpu instead of 64 byte if unused.
>
> Thanks,
>
> tglx
On 06/07/2018 17:03, Pavel Tatashin wrote:
> I think using __initdata during init_hypervisor_platform() +
> allocating during x86_init.hyper.guest_late_init() is a good approach.
> My only concern is that it would mean we need to init/uninit/init the clock for
> the boot cpu. Is the clock continuity preserved during the
> transition? I believe so, but it needs to be verified.
Yes, it's the same as any other pvclock update.
Paolo
On 06/28/2018 06:43 AM, Thomas Gleixner wrote:
>
> For enhanced fun the early calibration stuff was moved from right after
> init_hypervisor_platform() to the place where it is now in commit
> ccb64941f375a6 ("x86/timers: Move simple_udelay_calibration() past
> kvmclock_init()") to speed up KVM boot time by avoiding the PIT
> calibration. I have no idea why it wasn't just moved past the XEN
> initialization a few lines further down, especially as the change was done
> by a XEN maintainer :) Boris?
(Sorry for late response, I was out)
No good reason for that. I don't know what (whether?) I was thinking.
-boris