changelog
---------
v5 - v6
- Added a new patch:
time: sync read_boot_clock64() with persistent clock
which fixes the missing __init macro and enables the time discrepancy
fix noted by Thomas Gleixner
- Split "x86/time: read_boot_clock64() implementation" into a
separate patch
v4 - v5
- Fixed compiler warnings on systems with stable clocks.
v3 - v4
- Fixed tsc_early_fini() call to be in the 2nd patch as reported
by Dou Liyang
- Improved comment before __use_sched_clock_early to explain why
we need both booleans.
- Simplified valid_clock logic in read_boot_clock64().
v2 - v3
- Addressed comment from Thomas Gleixner
- Timestamps are available a little later in boot but still much
earlier than in mainline. This significantly simplified this
work.
v1 - v2
In patch "x86/tsc: tsc early":
- added tsc_adjusted_early()
- fixed 32-bit compile error by using do_div()
This series adds early boot time stamp support for x86 machines.
SPARC patches for early boot time stamps are already integrated into
mainline linux.
Sample output
-------------
Before:
https://hastebin.com/jadaqukubu.scala
After:
https://hastebin.com/nubipozacu.scala
For more examples of how early time stamps are used, see this work:
https://lwn.net/Articles/732233/
As seen above, timestamps currently become available around the time when
"Security Framework" is initialized, but 26s have already passed by the
time we reach that point.
Pavel Tatashin (4):
sched/clock: interface to allow timestamps early in boot
time: sync read_boot_clock64() with persistent clock
x86/time: read_boot_clock64() implementation
x86/tsc: use tsc early
arch/arm/kernel/time.c | 2 +-
arch/s390/kernel/time.c | 2 +-
arch/x86/include/asm/tsc.h | 4 +++
arch/x86/kernel/setup.c | 10 +++++--
arch/x86/kernel/time.c | 31 ++++++++++++++++++++++
arch/x86/kernel/tsc.c | 47 +++++++++++++++++++++++++++++++++
include/linux/sched/clock.h | 4 +++
include/linux/timekeeping.h | 10 +++----
kernel/sched/clock.c | 63 ++++++++++++++++++++++++++++++++++++++++++++-
kernel/time/timekeeping.c | 8 ++++--
10 files changed, 169 insertions(+), 12 deletions(-)
--
2.14.1
In Linux printk() can output timestamps next to every line. This is very
useful for tracking regressions, and finding places that can be optimized.
However, the timestamps only become available later in boot. On smaller
machines this is an insignificant amount of time, but on larger ones it can
be many seconds or even minutes into the boot process.
This patch adds an interface for platforms with unstable sched clock to
show timestamps early in boot. In order to get this functionality a
platform must:
- Implement u64 sched_clock_early()
Clock that returns monotonic time
- Call sched_clock_early_init()
Tells sched clock that the early clock can be used
- Call sched_clock_early_fini()
Tells sched clock that the early clock is finished, and sched clock
should hand over the operation to permanent clock.
Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/sched/clock.h | 4 +++
kernel/sched/clock.c | 63 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 66 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/clock.h b/include/linux/sched/clock.h
index a55600ffdf4b..f8291fa28c0c 100644
--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -63,6 +63,10 @@ extern void sched_clock_tick_stable(void);
extern void sched_clock_idle_sleep_event(void);
extern void sched_clock_idle_wakeup_event(void);
+void sched_clock_early_init(void);
+void sched_clock_early_fini(void);
+u64 sched_clock_early(void);
+
/*
* As outlined in clock.c, provides a fast, high resolution, nanosecond
* time source that is monotonic per cpu argument and has bounded drift
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index ca0f8fc945c6..2a41791b22fa 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -80,9 +80,17 @@ EXPORT_SYMBOL_GPL(sched_clock);
__read_mostly int sched_clock_running;
+static bool __read_mostly sched_clock_early_running;
+
void sched_clock_init(void)
{
- sched_clock_running = 1;
+ /*
+ * We start the clock once the early clock is finished, or right away
+ * if the early clock was not running.
+ */
+ if (!sched_clock_early_running)
+ sched_clock_running = 1;
+
}
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
@@ -96,6 +104,16 @@ void sched_clock_init(void)
static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
static int __sched_clock_stable_early = 1;
+/*
+ * Because static branches cannot be altered before jump_label_init() is called,
+ * and early time stamps may be initialized before that, we start with the
+ * sched clock early static branch enabled and the global status disabled.
+ * Early in boot it is decided whether to enable the global status as well (by
+ * setting sched_clock_early_running to true). Later, when the early clock is
+ * no longer needed, the static branch is disabled to keep the hot path fast.
+ */
+static DEFINE_STATIC_KEY_TRUE(__use_sched_clock_early);
+
/*
* We want: ktime_get_ns() + __gtod_offset == sched_clock() + __sched_clock_offset
*/
@@ -362,6 +380,11 @@ u64 sched_clock_cpu(int cpu)
if (sched_clock_stable())
return sched_clock() + __sched_clock_offset;
+ if (static_branch_unlikely(&__use_sched_clock_early)) {
+ if (sched_clock_early_running)
+ return sched_clock_early();
+ }
+
if (unlikely(!sched_clock_running))
return 0ull;
@@ -444,6 +467,44 @@ void sched_clock_idle_wakeup_event(void)
}
EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
+u64 __weak sched_clock_early(void)
+{
+ return 0;
+}
+
+/*
+ * Called when sched_clock_early() is about to be finished: it notifies sched
+ * clock that after this call sched_clock_early() cannot be used.
+ */
+void __init sched_clock_early_fini(void)
+{
+ struct sched_clock_data *scd = this_scd();
+ u64 now_early, now_sched;
+
+ now_early = sched_clock_early();
+ now_sched = sched_clock();
+
+ __gtod_offset = now_early - scd->tick_gtod;
+ __sched_clock_offset = now_early - now_sched;
+
+ sched_clock_early_running = false;
+ static_branch_disable(&__use_sched_clock_early);
+
+ /* Now that early clock is finished, start regular sched clock */
+ sched_clock_init();
+}
+
+/*
+ * Notifies sched clock that an early boot clocksource is available, meaning
+ * that the current platform has implemented sched_clock_early().
+ *
+ * The early clock runs until sched_clock_early_fini() is called.
+ */
+void __init sched_clock_early_init(void)
+{
+ sched_clock_early_running = true;
+}
+
#else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
u64 sched_clock_cpu(int cpu)
--
2.14.1
read_boot_clock64() returns a boot start timestamp from epoch. Some arches
may need to access the persistent clock interface in order to calculate the
epoch offset. However, the resolution of the persistent clock might be low.
Therefore, in order to avoid time discrepancies a new argument 'now' is
added to the read_boot_clock64() parameters. An arch may decide to use it
instead of accessing the persistent clock again.
Also, change read_boot_clock64() to have an __init prototype, since it is
called only during boot.
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/arm/kernel/time.c | 2 +-
arch/s390/kernel/time.c | 2 +-
include/linux/timekeeping.h | 10 +++++-----
kernel/time/timekeeping.c | 8 ++++++--
4 files changed, 13 insertions(+), 9 deletions(-)
diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c
index 629f8e9981f1..5b259261a268 100644
--- a/arch/arm/kernel/time.c
+++ b/arch/arm/kernel/time.c
@@ -90,7 +90,7 @@ void read_persistent_clock64(struct timespec64 *ts)
__read_persistent_clock(ts);
}
-void read_boot_clock64(struct timespec64 *ts)
+void __init read_boot_clock64(struct timespec64 *now, struct timespec64 *ts)
{
__read_boot_clock(ts);
}
diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index 192efdfac918..fd3050e2e825 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -203,7 +203,7 @@ void read_persistent_clock64(struct timespec64 *ts)
tod_to_timeval(clock - TOD_UNIX_EPOCH, ts);
}
-void read_boot_clock64(struct timespec64 *ts)
+void __init read_boot_clock64(struct timespec64 *now, struct timespec64 *ts)
{
__u64 clock;
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index ddc229ff6d1e..ffe5705bd064 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -340,11 +340,11 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
*/
extern int persistent_clock_is_local;
-extern void read_persistent_clock(struct timespec *ts);
-extern void read_persistent_clock64(struct timespec64 *ts);
-extern void read_boot_clock64(struct timespec64 *ts);
-extern int update_persistent_clock(struct timespec now);
-extern int update_persistent_clock64(struct timespec64 now);
+void read_persistent_clock(struct timespec *ts);
+void read_persistent_clock64(struct timespec64 *ts);
+void read_boot_clock64(struct timespec64 *now, struct timespec64 *ts);
+int update_persistent_clock(struct timespec now);
+int update_persistent_clock64(struct timespec64 now);
#endif
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index cedafa008de5..a74f4c3a46a4 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1468,9 +1468,13 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
* Function to read the exact time the system has been started.
* Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
*
+ * Argument 'now' contains time from persistent clock to calculate offset from
+ * epoch. May contain zeros if the persistent clock is not available.
+ *
* XXX - Do be sure to remove it once all arches implement it.
*/
-void __weak read_boot_clock64(struct timespec64 *ts)
+void __weak __init read_boot_clock64(struct timespec64 *now,
+ struct timespec64 *ts)
{
ts->tv_sec = 0;
ts->tv_nsec = 0;
@@ -1501,7 +1505,7 @@ void __init timekeeping_init(void)
} else if (now.tv_sec || now.tv_nsec)
persistent_clock_exists = true;
- read_boot_clock64(&boot);
+ read_boot_clock64(&now, &boot);
if (!timespec64_valid_strict(&boot)) {
pr_warn("WARNING: Boot clock returned invalid value!\n"
" Check your CMOS/BIOS settings.\n");
--
2.14.1
tsc_early_init():
Determines the offset, shift and multiplier for the early clock based on the
TSC frequency, and notifies sched clock that the early clock is available by
calling sched_clock_early_init().
tsc_early_fini():
Implements the finish part of the early tsc feature: prints a message with the
offset, which can be useful to find out how much time was spent in POST and the
boot manager (if the TSC starts from 0 during boot), and also calls
sched_clock_early_fini() to let sched clock know that the early clock can no
longer be used.
sched_clock_early():
TSC-based implementation of the weak function that is defined in sched clock.
tsc_early_init() is called to initialize early boot time stamp functionality on
the supported x86 platforms, and tsc_early_fini() is called to finish this
feature after the permanent tsc has been initialized.
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/include/asm/tsc.h | 4 ++++
arch/x86/kernel/setup.c | 10 ++++++++--
arch/x86/kernel/time.c | 1 +
arch/x86/kernel/tsc.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index f5e6f1c417df..6dc9618b24e3 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -50,11 +50,15 @@ extern bool tsc_store_and_check_tsc_adjust(bool bootcpu);
extern void tsc_verify_tsc_adjust(bool resume);
extern void check_tsc_sync_source(int cpu);
extern void check_tsc_sync_target(void);
+void tsc_early_init(unsigned int khz);
+void tsc_early_fini(void);
#else
static inline bool tsc_store_and_check_tsc_adjust(bool bootcpu) { return false; }
static inline void tsc_verify_tsc_adjust(bool resume) { }
static inline void check_tsc_sync_source(int cpu) { }
static inline void check_tsc_sync_target(void) { }
+static inline void tsc_early_init(unsigned int khz) { }
+static inline void tsc_early_fini(void) { }
#endif
extern int notsc_setup(char *);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3486d0498800..413434d98a23 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -812,7 +812,11 @@ dump_kernel_offset(struct notifier_block *self, unsigned long v, void *p)
return 0;
}
-static void __init simple_udelay_calibration(void)
+/*
+ * Initialize early tsc to show early boot timestamps, and also loops_per_jiffy
+ * for udelay
+ */
+static void __init early_clock_calibration(void)
{
unsigned int tsc_khz, cpu_khz;
unsigned long lpj;
@@ -827,6 +831,8 @@ static void __init simple_udelay_calibration(void)
if (!tsc_khz)
return;
+ tsc_early_init(tsc_khz);
+
lpj = tsc_khz * 1000;
do_div(lpj, HZ);
loops_per_jiffy = lpj;
@@ -1039,7 +1045,7 @@ void __init setup_arch(char **cmdline_p)
*/
init_hypervisor_platform();
- simple_udelay_calibration();
+ early_clock_calibration();
x86_init.resources.probe_roms();
diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
index fbad8bf2fa24..44411d769b53 100644
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -86,6 +86,7 @@ static __init void x86_late_time_init(void)
{
x86_init.timers.timer_init();
tsc_init();
+ tsc_early_fini();
}
/*
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 796d96bb0821..bd44c2dd4235 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1263,6 +1263,53 @@ static int __init init_tsc_clocksource(void)
*/
device_initcall(init_tsc_clocksource);
+#ifdef CONFIG_X86_TSC
+
+static struct cyc2ns_data cyc2ns_early;
+static bool sched_clock_early_enabled;
+
+u64 sched_clock_early(void)
+{
+ u64 ns;
+
+ if (!sched_clock_early_enabled)
+ return 0;
+ ns = mul_u64_u32_shr(rdtsc(), cyc2ns_early.cyc2ns_mul,
+ cyc2ns_early.cyc2ns_shift);
+ return ns + cyc2ns_early.cyc2ns_offset;
+}
+
+/*
+ * Initialize clock for early time stamps
+ */
+void __init tsc_early_init(unsigned int khz)
+{
+ sched_clock_early_enabled = true;
+ clocks_calc_mult_shift(&cyc2ns_early.cyc2ns_mul,
+ &cyc2ns_early.cyc2ns_shift,
+ khz, NSEC_PER_MSEC, 0);
+ cyc2ns_early.cyc2ns_offset = -sched_clock_early();
+ sched_clock_early_init();
+}
+
+void __init tsc_early_fini(void)
+{
+ unsigned long long t;
+ unsigned long r;
+
+ /* We did not have an early sched clock if the multiplier is 0 */
+ if (cyc2ns_early.cyc2ns_mul == 0)
+ return;
+
+ t = -cyc2ns_early.cyc2ns_offset;
+ r = do_div(t, NSEC_PER_SEC);
+
+ sched_clock_early_fini();
+ pr_info("sched clock early is finished, offset [%lld.%09lds]\n", t, r);
+ sched_clock_early_enabled = false;
+}
+#endif /* CONFIG_X86_TSC */
+
void __init tsc_init(void)
{
u64 lpj, cyc;
--
2.14.1
read_boot_clock64() returns the time when the system started. Now that
sched_clock_early() is available on systems with unstable clocks, it is
possible to implement an x86-specific version of read_boot_clock64() that
takes advantage of this new interface.
Signed-off-by: Pavel Tatashin <[email protected]>
---
arch/x86/kernel/time.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
index e0754cdbad37..fbad8bf2fa24 100644
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -14,6 +14,7 @@
#include <linux/i8253.h>
#include <linux/time.h>
#include <linux/export.h>
+#include <linux/sched/clock.h>
#include <asm/vsyscall.h>
#include <asm/x86_init.h>
@@ -95,3 +96,32 @@ void __init time_init(void)
{
late_time_init = x86_late_time_init;
}
+
+/*
+ * Called once during boot to initialize the boot time.
+ * This function returns a timestamp in timespec format which is sec/nsec from
+ * the epoch of when boot started.
+ * We use sched_clock_early(), which gives us nanoseconds from when this clock
+ * was started; that happens quite early during the boot process. To calculate
+ * the offset from epoch we use the information provided in 'now' by the caller.
+ *
+ * If sched_clock_early() is not available, or if there is any kind of error,
+ * i.e. time from epoch is smaller than boot time, we must return zeros in ts,
+ * and the caller will take care of the error by assuming that the time when
+ * this function was called is the beginning of boot time.
+ */
+void __init read_boot_clock64(struct timespec64 *now, struct timespec64 *ts)
+{
+ u64 ns_boot = sched_clock_early();
+ bool valid_clock;
+ u64 ns_now;
+
+ ns_now = timespec64_to_ns(now);
+ valid_clock = ns_boot && timespec64_valid_strict(now) &&
+ (ns_now > ns_boot);
+
+ if (!valid_clock)
+ *ts = (struct timespec64){0, 0};
+ else
+ *ts = ns_to_timespec64(ns_now - ns_boot);
+}
--
2.14.1
On Wed, Aug 30, 2017 at 02:12:09PM -0700, Fenghua Yu wrote:
> +static struct cyc2ns_data cyc2ns_early;
> +static bool sched_clock_early_enabled;
Should these two variables be "__initdata"?
> +u64 sched_clock_early(void)
This function is only called during boot time. Should it
be a "__init" function?
Thanks.
-Fenghua
Hi Fenghua,
Thank you for looking at this. Unfortunately I can't mark either of them
__init because sched_clock_early() is called from
u64 sched_clock_cpu(int cpu),
which is around for the life of the system.
Thank you,
Pasha
On 08/30/2017 05:21 PM, Fenghua Yu wrote:
> On Wed, Aug 30, 2017 at 02:12:09PM -0700, Fenghua Yu wrote:
>> +static struct cyc2ns_data cyc2ns_early;
>> +static bool sched_clock_early_enabled;
>
> Should these two variables be "__initdata"?
>
>> +u64 sched_clock_early(void)
> This function is only called during boot time. Should it
> be a "__init" function?
>
> Thanks.
>
> -Fenghua
>
On Wed, Aug 30, 2017 at 02:03:22PM -0400, Pavel Tatashin wrote:
> In Linux printk() can output timestamps next to every line. This is very
> useful for tracking regressions, and finding places that can be optimized.
> However, the timestamps are available only later in boot. On smaller
> machines it is insignificant amount of time, but on larger it can be many
> seconds or even minutes into the boot process.
>
> This patch adds an interface for platforms with unstable sched clock to
> show timestamps early in boot. In order to get this functionality a
> platform must:
>
> - Implement u64 sched_clock_early()
> Clock that returns monotonic time
>
> - Call sched_clock_early_init()
> Tells sched clock that the early clock can be used
>
> - Call sched_clock_early_fini()
> Tells sched clock that the early clock is finished, and sched clock
> should hand over the operation to permanent clock.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
Urgh, that's horrific.
Can't we simply make sched_clock() go earlier? (we're violating "notsc"
in any case and really should kill that option).
Then we can do something like so on top...
---
include/linux/sched/clock.h | 6 +++++-
kernel/sched/clock.c | 42 +++++++++++++++++++++++++++---------------
2 files changed, 32 insertions(+), 16 deletions(-)
diff --git a/include/linux/sched/clock.h b/include/linux/sched/clock.h
index a55600ffdf4b..986d14a208e7 100644
--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -20,9 +20,12 @@ extern u64 running_clock(void);
extern u64 sched_clock_cpu(int cpu);
-extern void sched_clock_init(void);
#ifndef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
+static inline void sched_clock_init(void)
+{
+}
+
static inline void sched_clock_tick(void)
{
}
@@ -49,6 +52,7 @@ static inline u64 local_clock(void)
return sched_clock();
}
#else
+extern void sched_clock_init(void);
extern int sched_clock_stable(void);
extern void clear_sched_clock_stable(void);
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index ca0f8fc945c6..47d13d37f2f1 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -80,11 +80,6 @@ EXPORT_SYMBOL_GPL(sched_clock);
__read_mostly int sched_clock_running;
-void sched_clock_init(void)
-{
- sched_clock_running = 1;
-}
-
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
/*
* We must start with !__sched_clock_stable because the unstable -> stable
@@ -211,6 +206,31 @@ void clear_sched_clock_stable(void)
__clear_sched_clock_stable();
}
+static void __sched_clock_gtod_offset(void)
+{
+ u64 gtod, clock;
+
+ local_irq_disable();
+ gtod = ktime_get_ns();
+ clock = sched_clock();
+ __gtod_offset = (clock + __sched_clock_offset) - gtod;
+ local_irq_enable();
+}
+
+void sched_clock_init(void)
+{
+ /*
+ * Set __gtod_offset such that once we mark sched_clock_running,
+ * sched_clock_tick() continues where sched_clock() left off.
+ *
+ * Even if TSC is buggered, we're still UP at this point so it
+ * can't really be out of sync.
+ */
+ __sched_clock_gtod_offset();
+ barrier();
+ sched_clock_running = 1;
+}
+
/*
* We run this as late_initcall() such that it runs after all built-in drivers,
* notably: acpi_processor and intel_idle, which can mark the TSC as unstable.
@@ -363,7 +383,7 @@ u64 sched_clock_cpu(int cpu)
return sched_clock() + __sched_clock_offset;
if (unlikely(!sched_clock_running))
- return 0ull;
+ return sched_clock();
preempt_disable_notrace();
scd = cpu_sdc(cpu);
@@ -397,7 +417,6 @@ void sched_clock_tick(void)
void sched_clock_tick_stable(void)
{
- u64 gtod, clock;
if (!sched_clock_stable())
return;
@@ -409,11 +428,7 @@ void sched_clock_tick_stable(void)
* good moment to update our __gtod_offset. Because once we find the
* TSC to be unstable, any computation will be computing crap.
*/
- local_irq_disable();
- gtod = ktime_get_ns();
- clock = sched_clock();
- __gtod_offset = (clock + __sched_clock_offset) - gtod;
- local_irq_enable();
+ __sched_clock_gtod_offset();
}
/*
@@ -448,9 +463,6 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
u64 sched_clock_cpu(int cpu)
{
- if (unlikely(!sched_clock_running))
- return 0;
-
return sched_clock();
}
On Wed, Sep 27, 2017 at 02:58:57PM +0200, Peter Zijlstra wrote:
> (we're violating "notsc" in any case and really should kill that
> option).
Something like so; in particular simple_udelay_calibration() will issue
RDTSC _way_ early, so there is absolutely no point in then pretending we
can't use RDTSC for sched_clock.
---
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 796d96bb0821..1dd3849a42ca 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -37,13 +37,6 @@ EXPORT_SYMBOL(tsc_khz);
*/
static int __read_mostly tsc_unstable;
-/* native_sched_clock() is called before tsc_init(), so
- we must start with the TSC soft disabled to prevent
- erroneous rdtsc usage on !boot_cpu_has(X86_FEATURE_TSC) processors */
-static int __read_mostly tsc_disabled = -1;
-
-static DEFINE_STATIC_KEY_FALSE(__use_tsc);
-
int tsc_clocksource_reliable;
static u32 art_to_tsc_numerator;
@@ -191,24 +184,7 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
*/
u64 native_sched_clock(void)
{
- if (static_branch_likely(&__use_tsc)) {
- u64 tsc_now = rdtsc();
-
- /* return the value in ns */
- return cycles_2_ns(tsc_now);
- }
-
- /*
- * Fall back to jiffies if there's no TSC available:
- * ( But note that we still use it if the TSC is marked
- * unstable. We do this because unlike Time Of Day,
- * the scheduler clock tolerates small errors and it's
- * very important for it to be as fast as the platform
- * can achieve it. )
- */
-
- /* No locking but a rare wrong value is not a big deal: */
- return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
+ return cycles_2_ns(rdtsc());
}
/*
@@ -244,27 +220,6 @@ int check_tsc_unstable(void)
}
EXPORT_SYMBOL_GPL(check_tsc_unstable);
-#ifdef CONFIG_X86_TSC
-int __init notsc_setup(char *str)
-{
- pr_warn("Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely\n");
- tsc_disabled = 1;
- return 1;
-}
-#else
-/*
- * disable flag for tsc. Takes effect by clearing the TSC cpu flag
- * in cpu/common.c
- */
-int __init notsc_setup(char *str)
-{
- setup_clear_cpu_cap(X86_FEATURE_TSC);
- return 1;
-}
-#endif
-
-__setup("notsc", notsc_setup);
-
static int no_sched_irq_time;
static int __init tsc_setup(char *str)
@@ -1229,7 +1184,7 @@ static void tsc_refine_calibration_work(struct work_struct *work)
static int __init init_tsc_clocksource(void)
{
- if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_disabled > 0 || !tsc_khz)
+ if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
return 0;
if (tsc_clocksource_reliable)
@@ -1311,14 +1266,6 @@ void __init tsc_init(void)
set_cyc2ns_scale(tsc_khz, cpu, cyc);
}
- if (tsc_disabled > 0)
- return;
-
- /* now allow native_sched_clock() to use rdtsc */
-
- tsc_disabled = 0;
- static_branch_enable(&__use_tsc);
-
if (!no_sched_irq_time)
enable_sched_clock_irqtime();
@@ -1348,7 +1295,7 @@ unsigned long calibrate_delay_is_known(void)
int sibling, cpu = smp_processor_id();
struct cpumask *mask = topology_core_cpumask(cpu);
- if (!tsc_disabled && !cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC))
+ if (!cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC))
return 0;
if (!mask)
Hi Peter,
I am totally happy with removing notsc. This certainly simplifies the
sched_clock code. Are there any issues with removing existing kernel
parameters that I should be aware of?
Thank you,
Pasha
On 09/27/2017 09:10 AM, Peter Zijlstra wrote:
> On Wed, Sep 27, 2017 at 02:58:57PM +0200, Peter Zijlstra wrote:
>> (we're violating "notsc" in any case and really should kill that
>> option).
>
> Something like so; in particular simple_udelay_calibrate() will issue
> RDTSC _way_ early, so there is absolutely no point in then pretending we
> can't use RDTSC for sched_clock.
>
Hi Pasha, Peter
At 09/27/2017 09:16 PM, Pasha Tatashin wrote:
> Hi Peter,
>
> I am totally happy with removing notsc. This certainly simplifies the
> sched_clock code. Are there any issues with removing existing kernel
> parameters that I should be aware of?
>
We do not want to do that, because we use "notsc" to support Dynamic
Reconfiguration[1].
AFAIK, this feature enables hot-adding a system board which contains CPUs
and memory. But the CPUs on the new board may have TSC values that are not
consistent with the TSCs of the existing CPUs, so hot-adding a board
directly can leave the machine with inconsistent TSCs.
We make an effort to set the same TSC value as the existing one through
hardware and firmware, but it is hard. So we recommend specifying the
"notsc" option on the command line for users who want to use Dynamic
Reconfiguration.
[1]
http://www.fujitsu.com/global/products/computing/servers/mission-critical/primequest/technology/availability/dynamic-reconfiguration.html
Thanks,
dou
> Thank you,
> Pasha
>
> On 09/27/2017 09:10 AM, Peter Zijlstra wrote:
>> On Wed, Sep 27, 2017 at 02:58:57PM +0200, Peter Zijlstra wrote:
>>> (we're violating "notsc" in any case and really should kill that
>>> option).
>>
>> Something like so; in particular simple_udelay_calibrate() will issue
>> RDTSC _way_ early, so there is absolutely no point in then pretending we
>> can't use RDTSC for sched_clock.
>>
>
>
>
On Wed, Aug 30, 2017 at 02:03:22PM -0400, Pavel Tatashin wrote:
> In Linux printk() can output timestamps next to every line. This is very
> useful for tracking regressions, and finding places that can be optimized.
> However, the timestamps are available only later in boot. On smaller
> machines it is insignificant amount of time, but on larger it can be many
> seconds or even minutes into the boot process.
The sched_clock work I did for ARM could be set up really early at boot,
from setup_arch(). I tried to encourage platforms to do that, but all
my encouragement fell on deaf ears - most people set up the sched_clock
source alongside the time initialisation on ARM.
I don't think we need yet another "early" mechanism to solve this problem,
we just need people to use the existing mechanism to register their
sched_clock implementation earlier.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up
Hi Russell,
This might be so for ARM, and in fact if you look at my SPARC
implementation, I simply made the clock source initialize early, so the
regular sched_clock() is used. On SPARC, we use either the %tick or %stick
register, with the frequency determined via OpenFirmware. But on x86 there
are a dozen ways clock sources are set up, and some of them become
available quite late in boot because of various dependencies. So, my early
clock initialization for x86 (extendable to other platforms with unstable
clocks) makes the clock available as soon as the TSC is, which is
determined by already existing kernel functionality in
simple_udelay_calibration().
My goal was not to introduce any regressions to the already complex (in
terms of number of branches and loads) sched_clock_cpu(). Therefore, I
added a new function and avoided any extra branches throughout the life of
the system. I could mitigate some of that by using static branches, but
imo the current approach is better.
Pasha
Hi Dou,
This makes sense. The current sched_clock_early() approach does not
break it, because with notsc the TSC is used early in boot and stopped
later. But notsc must stay.
Peter,
So, we could either extend sched_clock() with another static branch for
the early clock, or use what I proposed. IMO, the latter is better, but
either way works for me.
Thank you,
Pasha
On 09/27/2017 09:52 AM, Dou Liyang wrote:
> Hi Pasha, Peter
>
> At 09/27/2017 09:16 PM, Pasha Tatashin wrote:
>> Hi Peter,
>>
>> I am totally happy with removing notsc. This certainly simplifies the
>> sched_clock code. Are there any issues with removing existing kernel
>> parameters that I should be aware of?
>>
>
> We do not want to do that. Because, we use "notsc" to support Dynamic
> Reconfiguration[1].
>
> AFAIK, this feature enables hot-add system board which contains CPUs
> and memories. But the CPUs in different board may have different TSCs
> which are not consistent with the TSC from the existing CPUs. If we
> hot-add a board directly, the machine may happen the inconsistency of
> TSC.
>
> We make our effort to specify the same TSC value as existing one through
> hardware and firmware, but it is hard. So we recommend to specify
> "notsc" option in command line for users who want to use Dynamic
> Reconfiguration.
>
> [1]
> http://www.fujitsu.com/global/products/computing/servers/mission-critical/primequest/technology/availability/dynamic-reconfiguration.html
>
On Wed, Sep 27, 2017 at 09:52:36PM +0800, Dou Liyang wrote:
> We do not want to do that. Because, we use "notsc" to support Dynamic
> Reconfiguration[1].
>
> AFAIK, this feature enables hot-add system board which contains CPUs
> and memories. But the CPUs in different board may have different TSCs
> which are not consistent with the TSC from the existing CPUs. If we hot-add
> a board directly, the machine may happen the inconsistency of
> TSC.
>
> We make our effort to specify the same TSC value as existing one through
> hardware and firmware, but it is hard. So we recommend to specify
> "notsc" option in command line for users who want to use Dynamic
> Reconfiguration.
Oh gawd, that's horrific. And in my book a good reason to kill that
option.
On Wed, Sep 27, 2017 at 08:05:48PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 27, 2017 at 09:52:36PM +0800, Dou Liyang wrote:
> > We do not want to do that. Because, we use "notsc" to support Dynamic
> > Reconfiguration[1].
> >
> > AFAIK, this feature enables hot-add system board which contains CPUs
> > and memories. But the CPUs in different board may have different TSCs
> > which are not consistent with the TSC from the existing CPUs. If we hot-add
> > a board directly, the machine may happen the inconsistency of
> > TSC.
> >
> > We make our effort to specify the same TSC value as existing one through
> > hardware and firmware, but it is hard. So we recommend to specify
> > "notsc" option in command line for users who want to use Dynamic
> > Reconfiguration.
>
> Oh gawd, that's horrific. And in my book a good reason to kill that
> option.
That is, even with unsynchronized TSC we're better off using RDTSC. The
whole mess in kernel/sched/clock.c is all about getting semi sensible
results out of unsynchronized TSC.
There really is no reason to artificially kill TSC usage.
On Wed, Sep 27, 2017 at 03:45:06PM +0100, Russell King - ARM Linux wrote:
> On Wed, Aug 30, 2017 at 02:03:22PM -0400, Pavel Tatashin wrote:
> > In Linux printk() can output timestamps next to every line. This is very
> > useful for tracking regressions, and finding places that can be optimized.
> > However, the timestamps are available only later in boot. On smaller
> > machines it is an insignificant amount of time, but on larger ones it can
> > be many seconds or even minutes into the boot process.
>
> The sched_clock work I did for ARM could be set up really early at boot,
> from setup_arch(). I tried to encourage platforms to do that, but all
> my encouragement fell on deaf ears - most people set up the sched_clock
> source alongside the time initialisation on ARM.
>
> I don't think we need yet another "early" mechanism to solve this problem,
> we just need people to use the existing mechanism to register their
> sched_clock implementation earlier.
x86 is a bit 'special' in the whole sched_clock department. But yes, we
should very much make the regular sched_clock() happen earlier.
Hi Peter,
At 09/28/2017 02:09 AM, Peter Zijlstra wrote:
> On Wed, Sep 27, 2017 at 08:05:48PM +0200, Peter Zijlstra wrote:
>> On Wed, Sep 27, 2017 at 09:52:36PM +0800, Dou Liyang wrote:
>>> We do not want to do that, because we use "notsc" to support Dynamic
>>> Reconfiguration[1].
>>>
>>> AFAIK, this feature enables hot-adding a system board which contains
>>> CPUs and memory. But the CPUs on the new board may have TSCs that are
>>> not consistent with the TSCs of the existing CPUs, so if we hot-add a
>>> board directly, the machine may end up with inconsistent TSCs.
>>>
>>> We make an effort to match the TSC value of the existing CPUs through
>>> hardware and firmware, but it is hard. So we recommend specifying the
>>> "notsc" option on the command line for users who want to use Dynamic
>>> Reconfiguration.
>>
>> Oh gawd, that's horrific. And in my book a good reason to kill that
>> option.
>
> That is, even with unsynchronized TSC we're better off using RDTSC. The
> whole mess in kernel/sched/clock.c is all about getting semi sensible
> results out of unsynchronized TSC.
>
It would be best if we could support TSC sync capability in x86, but it
seems that is not easy.
Thanks,
dou.
> There really is no reason to artificially kill TSC usage.
On Thu, Sep 28, 2017 at 06:03:05PM +0800, Dou Liyang wrote:
> At 09/28/2017 02:09 AM, Peter Zijlstra wrote:
> > On Wed, Sep 27, 2017 at 08:05:48PM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 27, 2017 at 09:52:36PM +0800, Dou Liyang wrote:
> > > > We do not want to do that, because we use "notsc" to support Dynamic
> > > > Reconfiguration[1].
> > > >
> > > > AFAIK, this feature enables hot-adding a system board which contains
> > > > CPUs and memory. But the CPUs on the new board may have TSCs that are
> > > > not consistent with the TSCs of the existing CPUs, so if we hot-add a
> > > > board directly, the machine may end up with inconsistent TSCs.
> > > >
> > > > We make an effort to match the TSC value of the existing CPUs through
> > > > hardware and firmware, but it is hard. So we recommend specifying the
> > > > "notsc" option on the command line for users who want to use Dynamic
> > > > Reconfiguration.
> > >
> > > Oh gawd, that's horrific. And in my book a good reason to kill that
> > > option.
> >
> > That is, even with unsynchronized TSC we're better off using RDTSC. The
> > whole mess in kernel/sched/clock.c is all about getting semi sensible
> > results out of unsynchronized TSC.
> >
>
> It would be best if we could support TSC sync capability in x86, but it
> seems that is not easy.
Sure, your hardware achieving sync would be best, but even if it does
not, we can still use TSC. Using notsc simply because you fail to sync
TSCs is quite crazy.
The thing is, we need to support unsynchronized TSCs in any case, because
older chips (pre-Nehalem) never had them synchronized, and it still
happens on recent chips if the BIOS mucks it up, which happens
surprisingly often :-(
I would suggest you try your reconfigurable setup with "tsc=unstable"
and see if that works for you. That marks the TSC unconditionally
unstable at boot and avoids any further wobbles once the TSC watchdog
notices (although that too _should_ more or less work).
I do however hope you have a custom clocksource driver placed at higher
priority than the HPET.
On Thu, 28 Sep 2017, Peter Zijlstra wrote:
> On Thu, Sep 28, 2017 at 06:03:05PM +0800, Dou Liyang wrote:
> > At 09/28/2017 02:09 AM, Peter Zijlstra wrote:
> > > On Wed, Sep 27, 2017 at 08:05:48PM +0200, Peter Zijlstra wrote:
> > > > On Wed, Sep 27, 2017 at 09:52:36PM +0800, Dou Liyang wrote:
> > > > > We do not want to do that, because we use "notsc" to support Dynamic
> > > > > Reconfiguration[1].
> > > > >
> > > > > AFAIK, this feature enables hot-adding a system board which contains
> > > > > CPUs and memory. But the CPUs on the new board may have TSCs that are
> > > > > not consistent with the TSCs of the existing CPUs, so if we hot-add a
> > > > > board directly, the machine may end up with inconsistent TSCs.
> > > > >
> > > > > We make an effort to match the TSC value of the existing CPUs through
> > > > > hardware and firmware, but it is hard. So we recommend specifying the
> > > > > "notsc" option on the command line for users who want to use Dynamic
> > > > > Reconfiguration.
> > > >
> > > > Oh gawd, that's horrific. And in my book a good reason to kill that
> > > > option.
> > >
> > > That is, even with unsynchronized TSC we're better off using RDTSC. The
> > > whole mess in kernel/sched/clock.c is all about getting semi sensible
> > > results out of unsynchronized TSC.
> > >
> >
> > It would be best if we could support TSC sync capability in x86, but it
> > seems that is not easy.
>
> Sure, your hardware achieving sync would be best, but even if it does
> not, we can still use TSC. Using notsc simply because you fail to sync
> TSCs is quite crazy.
>
> The thing is, we need to support unsynchronized TSCs in any case, because
> older chips (pre-Nehalem) never had them synchronized, and it still
> happens on recent chips if the BIOS mucks it up, which happens
> surprisingly often :-(
>
> I would suggest you try your reconfigurable setup with "tsc=unstable"
> and see if that works for you. That marks the TSC unconditionally
> unstable at boot and avoids any further wobbles once the TSC watchdog
> notices (although that too _should_ more or less work).
That should do the trick nicely and we might just end up converting notsc
to tsc=unstable silently so we can avoid the bike shed discussions about
removing it.
Thanks,
tglx
>>> It would be best if we could support TSC sync capability in x86, but it
>>> seems that is not easy.
>>
>> Sure, your hardware achieving sync would be best, but even if it does
>> not, we can still use TSC. Using notsc simply because you fail to sync
>> TSCs is quite crazy.
>>
>> The thing is, we need to support unsynchronized TSCs in any case, because
>> older chips (pre-Nehalem) never had them synchronized, and it still
>> happens on recent chips if the BIOS mucks it up, which happens
>> surprisingly often :-(
>>
>> I would suggest you try your reconfigurable setup with "tsc=unstable"
>> and see if that works for you. That marks the TSC unconditionally
>> unstable at boot and avoids any further wobbles once the TSC watchdog
>> notices (although that too _should_ more or less work).
>
> That should do the trick nicely and we might just end up converting notsc
> to tsc=unstable silently so we can avoid the bike shed discussions about
> removing it.
>
Ok, I will start working on converting notsc to unstable, and modify my
patches to do what Peter suggested earlier. In the meantime, I'd like
to hear from Dou whether this setup works with dynamic reconfig.
Thank you,
Pasha
Hi, Pasha
At 09/28/2017 09:11 PM, Pasha Tatashin wrote:
>>>> It would be best if we could support TSC sync capability in x86, but it
>>>> seems that is not easy.
>>>
>>> Sure, your hardware achieving sync would be best, but even if it does
>>> not, we can still use TSC. Using notsc simply because you fail to sync
>>> TSCs is quite crazy.
>>>
>>> The thing is, we need to support unsynchronized TSCs in any case, because
>>> older chips (pre-Nehalem) never had them synchronized, and it still
>>> happens on recent chips if the BIOS mucks it up, which happens
>>> surprisingly often :-(
>>>
>>> I would suggest you try your reconfigurable setup with "tsc=unstable"
>>> and see if that works for you. That marks the TSC unconditionally
>>> unstable at boot and avoids any further wobbles once the TSC watchdog
>>> notices (although that too _should_ more or less work).
>>
>> That should do the trick nicely and we might just end up converting notsc
>> to tsc=unstable silently so we can avoid the bike shed discussions about
>> removing it.
>>
>
> Ok, I will start working on converting notsc to unstable, and modify my
> patches to do what Peter suggested earlier. In the meantime, I'd like
> to hear from Dou whether this setup works with dynamic reconfig.
>
OK, I will do it. But October 1 is our national holiday, so I will be on
holiday, and I have just returned the test machine. :-(
I may reply to you in the middle of October.
Thanks,
dou.
> Thank you,
> Pasha