2015-08-17 20:41:11

by John Stultz

Subject: [PATCH 0/9] Time items for 4.3

Hey Ingo,
I got back from vacation last week, so this is a little later
than I wanted it to be. But here is my queue of items I've collected
for 4.3.

There's a handful of y2038-prep related items. Then some small fixes
for timespec_trunc(), timer_list output, and clocksource watchdog false
positives caused by processing stalls or delays.

Let me know if there's anything that you object to.

thanks
-john

Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Karsten Blees <[email protected]>
Cc: Wang YanQing <[email protected]>
Cc: Xunlei Pang <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Shaohua Li <[email protected]>

If you'd like to pull this set:
The following changes since commit d770e558e21961ad6cfdf0ff7df0eb5d7d4f0754:

Linux 4.2-rc1 (2015-07-05 11:01:52 -0700)

are available in the git repository at:

https://git.linaro.org/people/john.stultz/linux.git fortglx/4.3/time

for you to fetch changes up to bad86629c40a9fd8887c742ffd958e867592959a:

clocksource: Sanity check watchdog clocksource (2015-08-17 11:28:30 -0700)

Baolin Wang (3):
time: Introduce struct itimerspec64
time: Introduce current_kernel_time64()
time: Introduce timespec64_to_jiffies()/jiffies_to_timespec64()

John Stultz (1):
timer_list: Add the base offset so remaining nsecs are accurate for
non monotonic timers

Karsten Blees (1):
time: Fix nanosecond file time rounding in timespec_trunc()

Shaohua Li (2):
clocksource: Improve unstable clocksource detection
clocksource: Sanity check watchdog clocksource

Wang YanQing (1):
time: Always make sure wall_to_monotonic isn't positive

Xunlei Pang (1):
time: Add the common weak version of update_persistent_clock()

include/linux/jiffies.h | 22 +++++++++++++++++++---
include/linux/time64.h | 35 +++++++++++++++++++++++++++++++++++
include/linux/timekeeping.h | 9 ++++++++-
kernel/time/clocksource.c | 10 +++++++---
kernel/time/ntp.c | 5 +++++
kernel/time/time.c | 43 +++++++++++++++++++++----------------------
kernel/time/timekeeping.c | 19 +++++++++++++------
kernel/time/timer_list.c | 2 +-
8 files changed, 109 insertions(+), 36 deletions(-)

--
1.9.1


2015-08-17 20:43:32

by John Stultz

Subject: [PATCH 1/9] timer_list: Add the base offset so remaining nsecs are accurate for non monotonic timers

I noticed for non-monotonic timers in timer_list, some of the
output looked a little confusing.

For example:
#1: <0000000000000000>, posix_timer_fn, S:01, hrtimer_start_range_ns, leap-a-day/2360
# expires at 1434412800000000000-1434412800000000000 nsecs [in 1434410725062375469 to 1434410725062375469 nsecs]

You'll note the relative time till the expiration "[in xxx to
yyy nsecs]" is incorrect. This is because it's printing the delta
between the CLOCK_MONOTONIC time and the CLOCK_REALTIME expiration.

This patch fixes this issue by adding the clock offset to the
"now" time which we use to calculate the delta.

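To put the scale of the error in perspective (rough arithmetic, not part
of the patch): the expiry of 1434412800000000000 nsecs is an absolute
CLOCK_REALTIME timestamp (about 45 years' worth of nanoseconds since the
Epoch, i.e. mid-June 2015), while "now" here is CLOCK_MONOTONIC and only
counts time since boot. Subtracting the two therefore prints a "remaining"
time of roughly 45 years. With base->offset added, "now" is converted into
the timer base's own clock domain and the subtraction yields the actual
time remaining until expiry.
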
Cc: Prarit Bhargava <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jiri Bohac <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
kernel/time/timer_list.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index a4536e1..129c960 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -137,7 +137,7 @@ print_base(struct seq_file *m, struct hrtimer_clock_base *base, u64 now)
(unsigned long long) ktime_to_ns(base->offset));
#endif
SEQ_printf(m, "active timers:\n");
- print_active_timers(m, base, now);
+ print_active_timers(m, base, now + ktime_to_ns(base->offset));
}

static void print_cpu(struct seq_file *m, int cpu, u64 now)
--
1.9.1

2015-08-17 20:41:14

by John Stultz

Subject: [PATCH 2/9] time: Fix nanosecond file time rounding in timespec_trunc()

From: Karsten Blees <[email protected]>

timespec_trunc() avoids rounding if granularity <= nanoseconds-per-jiffie
(or TICK_NSEC). This optimization assumes that:

1. current_kernel_time().tv_nsec is already rounded to TICK_NSEC (i.e.
with HZ=1000 you'd get 1000000, 2000000, 3000000... but never 1000001).
This is no longer true (probably since hrtimers were introduced in 2.6.16).

2. TICK_NSEC is evenly divisible by all possible granularities. This may
be true for HZ=100, 250, 1000, but obviously not for HZ=300 /
TICK_NSEC=3333333 (introduced in 2.6.20).
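
As a concrete illustration (example values, not taken from the patch):
a CIFS or NTFS inode has s_time_gran = 100, which is below TICK_NSEC for
every supported HZ, so timespec_trunc() always takes the old "nothing"
fast path. An in-core tv_nsec of 123456789 is then returned unrounded,
while the correctly truncated value, 123456789 - (123456789 % 100) =
123456700, is what the 100 ns on-disk format can actually store. The
in-core and on-disk timestamps therefore differ in their last digits,
which is the same mismatch the UDF reproducer below shows at 1000 ns
granularity.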

Thus, sub-second portions of in-core file times are not rounded to on-disk
granularity. I.e. file times may change when the inode is re-read from disk
or when the file system is remounted.

This affects all file systems with file time granularities > 1 ns and < 1s,
e.g. CEPH (1000 ns), UDF (1000 ns), CIFS (100 ns), NTFS (100 ns) and FUSE
(configurable from user mode via struct fuse_init_out.time_gran).

Steps to reproduce with e.g. UDF:

$ dd if=/dev/zero of=udfdisk count=10000 && mkudffs udfdisk
$ mkdir udf && mount udfdisk udf
$ touch udf/test && stat -c %y udf/test
2015-06-09 10:22:56.130006767 +0200
$ umount udf && mount udfdisk udf
$ stat -c %y udf/test
2015-06-09 10:22:56.130006000 +0200

Remounting truncates the mtime to 1 µs.

Fix the rounding in timespec_trunc() and update the documentation.

timespec_trunc() is exclusively used to calculate inode's [acm]time (mostly
via current_fs_time()), and always with super_block.s_time_gran as second
argument. So this can safely be changed without side effects.

Note: This does _not_ fix the issue for FAT's 2 second mtime resolution,
as super_block.s_time_gran isn't prepared to handle different ctime /
mtime / atime resolutions nor resolutions > 1 second.

Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Karsten Blees <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
kernel/time/time.c | 22 ++++++++--------------
1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/kernel/time/time.c b/kernel/time/time.c
index 85d5bb1..34dbd42 100644
--- a/kernel/time/time.c
+++ b/kernel/time/time.c
@@ -287,26 +287,20 @@ EXPORT_SYMBOL(jiffies_to_usecs);
* @t: Timespec
* @gran: Granularity in ns.
*
- * Truncate a timespec to a granularity. gran must be smaller than a second.
- * Always rounds down.
- *
- * This function should be only used for timestamps returned by
- * current_kernel_time() or CURRENT_TIME, not with do_gettimeofday() because
- * it doesn't handle the better resolution of the latter.
+ * Truncate a timespec to a granularity. Always rounds down. gran must
+ * not be 0 nor greater than a second (NSEC_PER_SEC, or 10^9 ns).
*/
struct timespec timespec_trunc(struct timespec t, unsigned gran)
{
- /*
- * Division is pretty slow so avoid it for common cases.
- * Currently current_kernel_time() never returns better than
- * jiffies resolution. Exploit that.
- */
- if (gran <= jiffies_to_usecs(1) * 1000) {
+ /* Avoid division in the common cases 1 ns and 1 s. */
+ if (gran == 1) {
/* nothing */
- } else if (gran == 1000000000) {
+ } else if (gran == NSEC_PER_SEC) {
t.tv_nsec = 0;
- } else {
+ } else if (gran > 1 && gran < NSEC_PER_SEC) {
t.tv_nsec -= t.tv_nsec % gran;
+ } else {
+ WARN(1, "illegal file time granularity: %u", gran);
}
return t;
}
--
1.9.1

2015-08-17 20:43:09

by John Stultz

Subject: [PATCH 3/9] time: Always make sure wall_to_monotonic isn't positive

From: Wang YanQing <[email protected]>

Two issues were found on an IMX6 development board without an
enabled RTC device (resulting in the boot time and monotonic
time being initialized to 0).

Issue 1: 'exportfs -a' generates:
"exportfs: /opt/nfs/arm does not support NFS export"
Issue 2: 'cat /proc/stat' shows:
"btime 4294967236"

The same issues can be reproduced on x86 after running the
following code:
#include <sys/time.h>

int main(void)
{
        struct timeval val;
        int ret;

        val.tv_sec = 0;
        val.tv_usec = 0;
        ret = settimeofday(&val, NULL);
        return ret;
}

The two issues are different symptoms of the same problem: a positive
wall_to_monotonic pushes the boot time back before the Epoch, so
getboottime() returns a negative value.

In symptom 1: the negative boot time causes get_expiry() to overflow
time_t when the input expiry time is 2147483647, so cache_flush()
always clears the entries just added in ip_map_parse().
In symptom 2: show_stat() uses "unsigned long" to print the negative
btime value returned by getboottime().

This patch fixes the problem by prohibiting the time from being set to a
value which would cause a negative boot time. As a result, the
CLOCK_REALTIME time cannot be set prior to (1970 + system uptime).
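
With the check in place, the reproducer above is expected to behave
roughly like this on a system with nonzero uptime (illustrative, not
part of the patch):

        ret = settimeofday(&val, NULL); /* val = {0, 0}, i.e. the Epoch */
        /* ret == -1, errno == EINVAL: requested time < 1970 + uptime */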

Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Wang YanQing <[email protected]>
[jstultz: reworded commit message]
Signed-off-by: John Stultz <[email protected]>
---
kernel/time/timekeeping.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index bca3667..4cdb771 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -911,6 +911,7 @@ int do_settimeofday64(const struct timespec64 *ts)
struct timekeeper *tk = &tk_core.timekeeper;
struct timespec64 ts_delta, xt;
unsigned long flags;
+ int ret = 0;

if (!timespec64_valid_strict(ts))
return -EINVAL;
@@ -924,10 +925,15 @@ int do_settimeofday64(const struct timespec64 *ts)
ts_delta.tv_sec = ts->tv_sec - xt.tv_sec;
ts_delta.tv_nsec = ts->tv_nsec - xt.tv_nsec;

+ if (timespec64_compare(&tk->wall_to_monotonic, &ts_delta) > 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
tk_set_wall_to_mono(tk, timespec64_sub(tk->wall_to_monotonic, ts_delta));

tk_set_xtime(tk, ts);
-
+out:
timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&tk_core.seq);
@@ -936,7 +942,7 @@ int do_settimeofday64(const struct timespec64 *ts)
/* signal hrtimers about time change */
clock_was_set();

- return 0;
+ return ret;
}
EXPORT_SYMBOL(do_settimeofday64);

@@ -965,7 +971,8 @@ int timekeeping_inject_offset(struct timespec *ts)

/* Make sure the proposed value is valid */
tmp = timespec64_add(tk_xtime(tk), ts64);
- if (!timespec64_valid_strict(&tmp)) {
+ if (timespec64_compare(&tk->wall_to_monotonic, &ts64) > 0 ||
+ !timespec64_valid_strict(&tmp)) {
ret = -EINVAL;
goto error;
}
--
1.9.1

2015-08-17 20:42:48

by John Stultz

Subject: [PATCH 4/9] time: Add the common weak version of update_persistent_clock()

From: Xunlei Pang <[email protected]>

The weak update_persistent_clock64() calls update_persistent_clock(). If
an architecture defines its own update_persistent_clock64() to replace
and remove its update_persistent_clock() version, the linker will throw
an undefined symbol error when building the kernel; that is, any arch
that switches to update_persistent_clock64() will hit this issue.

To solve the issue, add a common weak update_persistent_clock().
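
For reference, the generic fallback that triggers the problem looks
roughly like this (simplified sketch of the existing kernel/time/ntp.c
code, see the hunk below):

        int __weak update_persistent_clock64(struct timespec64 now64)
        {
                struct timespec now;

                now = timespec64_to_timespec(now64);
                /* unresolved at link time if the arch removed it: */
                return update_persistent_clock(now);
        }

The weak update_persistent_clock() stub added here resolves that
reference with a -ENODEV default, so an architecture can provide only
the 64-bit variant.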

Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Xunlei Pang <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
kernel/time/ntp.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index fb4d98c..df68cb8 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -487,6 +487,11 @@ out:
}

#ifdef CONFIG_GENERIC_CMOS_UPDATE
+int __weak update_persistent_clock(struct timespec now)
+{
+ return -ENODEV;
+}
+
int __weak update_persistent_clock64(struct timespec64 now64)
{
struct timespec now;
--
1.9.1

2015-08-17 20:41:19

by John Stultz

Subject: [PATCH 5/9] time: Introduce struct itimerspec64

From: Baolin Wang <[email protected]>

The struct itimerspec is not year 2038 safe on 32bit systems due to
the limitation of the struct timespec members. Introduce itimerspec64
which uses struct timespec64 instead and provide conversion functions.
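
A typical intended use (illustrative sketch, not part of this patch) is
to convert at the user/kernel boundary and keep the 64-bit type for all
internal bookkeeping:

        struct itimerspec uits;         /* legacy layout, e.g. copied from userspace */
        struct itimerspec64 kits;

        kits = itimerspec_to_itimerspec64(&uits);
        /* ... kernel-internal code operates on kits, which is y2038 safe ... */
        uits = itimerspec64_to_itimerspec(&kits);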

Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/time64.h | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/include/linux/time64.h b/include/linux/time64.h
index 77b5df2..367d5af 100644
--- a/include/linux/time64.h
+++ b/include/linux/time64.h
@@ -12,11 +12,18 @@ typedef __s64 time64_t;
*/
#if __BITS_PER_LONG == 64
# define timespec64 timespec
+#define itimerspec64 itimerspec
#else
struct timespec64 {
time64_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
+
+struct itimerspec64 {
+ struct timespec64 it_interval;
+ struct timespec64 it_value;
+};
+
#endif

/* Parameters used to convert the timespec values: */
@@ -45,6 +52,16 @@ static inline struct timespec64 timespec_to_timespec64(const struct timespec ts)
return ts;
}

+static inline struct itimerspec itimerspec64_to_itimerspec(struct itimerspec64 *its64)
+{
+ return *its64;
+}
+
+static inline struct itimerspec64 itimerspec_to_itimerspec64(struct itimerspec *its)
+{
+ return *its;
+}
+
# define timespec64_equal timespec_equal
# define timespec64_compare timespec_compare
# define set_normalized_timespec64 set_normalized_timespec
@@ -77,6 +94,24 @@ static inline struct timespec64 timespec_to_timespec64(const struct timespec ts)
return ret;
}

+static inline struct itimerspec itimerspec64_to_itimerspec(struct itimerspec64 *its64)
+{
+ struct itimerspec ret;
+
+ ret.it_interval = timespec64_to_timespec(its64->it_interval);
+ ret.it_value = timespec64_to_timespec(its64->it_value);
+ return ret;
+}
+
+static inline struct itimerspec64 itimerspec_to_itimerspec64(struct itimerspec *its)
+{
+ struct itimerspec64 ret;
+
+ ret.it_interval = timespec_to_timespec64(its->it_interval);
+ ret.it_value = timespec_to_timespec64(its->it_value);
+ return ret;
+}
+
static inline int timespec64_equal(const struct timespec64 *a,
const struct timespec64 *b)
{
--
1.9.1

2015-08-17 20:41:21

by John Stultz

Subject: [PATCH 6/9] time: Introduce current_kernel_time64()

From: Baolin Wang <[email protected]>

The current_kernel_time() is not year 2038 safe on 32bit systems
since it returns a timespec value. Introduce current_kernel_time64()
which returns a timespec64 value.

Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/timekeeping.h | 9 ++++++++-
kernel/time/timekeeping.c | 6 +++---
2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 3aa72e6..657ea03 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -18,10 +18,17 @@ extern int do_sys_settimeofday(const struct timespec *tv,
* Kernel time accessors
*/
unsigned long get_seconds(void);
-struct timespec current_kernel_time(void);
+struct timespec64 current_kernel_time64(void);
/* does not take xtime_lock */
struct timespec __current_kernel_time(void);

+static inline struct timespec current_kernel_time(void)
+{
+ struct timespec64 now = current_kernel_time64();
+
+ return timespec64_to_timespec(now);
+}
+
/*
* timespec based interfaces
*/
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4cdb771..f6ee2e6 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1881,7 +1881,7 @@ struct timespec __current_kernel_time(void)
return timespec64_to_timespec(tk_xtime(tk));
}

-struct timespec current_kernel_time(void)
+struct timespec64 current_kernel_time64(void)
{
struct timekeeper *tk = &tk_core.timekeeper;
struct timespec64 now;
@@ -1893,9 +1893,9 @@ struct timespec current_kernel_time(void)
now = tk_xtime(tk);
} while (read_seqcount_retry(&tk_core.seq, seq));

- return timespec64_to_timespec(now);
+ return now;
}
-EXPORT_SYMBOL(current_kernel_time);
+EXPORT_SYMBOL(current_kernel_time64);

struct timespec64 get_monotonic_coarse64(void)
{
--
1.9.1

2015-08-17 20:42:10

by John Stultz

Subject: [PATCH 7/9] time: Introduce timespec64_to_jiffies()/jiffies_to_timespec64()

From: Baolin Wang <[email protected]>

The conversion between struct timespec and jiffies is not year 2038
safe on 32bit systems. Introduce timespec64_to_jiffies() and
jiffies_to_timespec64(), which use struct timespec64 and are therefore
year 2038 safe.

Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/jiffies.h | 22 +++++++++++++++++++---
kernel/time/time.c | 21 +++++++++++++--------
2 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index 535fd3b..bf96d9f 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -416,9 +416,25 @@ static inline unsigned long usecs_to_jiffies(const unsigned int u)
}
}

-extern unsigned long timespec_to_jiffies(const struct timespec *value);
-extern void jiffies_to_timespec(const unsigned long jiffies,
- struct timespec *value);
+extern unsigned long timespec64_to_jiffies(const struct timespec64 *value);
+extern void jiffies_to_timespec64(const unsigned long jiffies,
+ struct timespec64 *value);
+static inline unsigned long timespec_to_jiffies(const struct timespec *value)
+{
+ struct timespec64 ts = timespec_to_timespec64(*value);
+
+ return timespec64_to_jiffies(&ts);
+}
+
+static inline void jiffies_to_timespec(const unsigned long jiffies,
+ struct timespec *value)
+{
+ struct timespec64 ts;
+
+ jiffies_to_timespec64(jiffies, &ts);
+ *value = timespec64_to_timespec(ts);
+}
+
extern unsigned long timeval_to_jiffies(const struct timeval *value);
extern void jiffies_to_timeval(const unsigned long jiffies,
struct timeval *value);
diff --git a/kernel/time/time.c b/kernel/time/time.c
index 34dbd42..f18ab10 100644
--- a/kernel/time/time.c
+++ b/kernel/time/time.c
@@ -540,7 +540,7 @@ EXPORT_SYMBOL(__usecs_to_jiffies);
* value to a scaled second value.
*/
static unsigned long
-__timespec_to_jiffies(unsigned long sec, long nsec)
+__timespec64_to_jiffies(u64 sec, long nsec)
{
nsec = nsec + TICK_NSEC - 1;

@@ -548,22 +548,27 @@ __timespec_to_jiffies(unsigned long sec, long nsec)
sec = MAX_SEC_IN_JIFFIES;
nsec = 0;
}
- return (((u64)sec * SEC_CONVERSION) +
+ return ((sec * SEC_CONVERSION) +
(((u64)nsec * NSEC_CONVERSION) >>
(NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;

}

-unsigned long
-timespec_to_jiffies(const struct timespec *value)
+static unsigned long
+__timespec_to_jiffies(unsigned long sec, long nsec)
{
- return __timespec_to_jiffies(value->tv_sec, value->tv_nsec);
+ return __timespec64_to_jiffies((u64)sec, nsec);
}

-EXPORT_SYMBOL(timespec_to_jiffies);
+unsigned long
+timespec64_to_jiffies(const struct timespec64 *value)
+{
+ return __timespec64_to_jiffies(value->tv_sec, value->tv_nsec);
+}
+EXPORT_SYMBOL(timespec64_to_jiffies);

void
-jiffies_to_timespec(const unsigned long jiffies, struct timespec *value)
+jiffies_to_timespec64(const unsigned long jiffies, struct timespec64 *value)
{
/*
* Convert jiffies to nanoseconds and separate with
@@ -574,7 +579,7 @@ jiffies_to_timespec(const unsigned long jiffies, struct timespec *value)
NSEC_PER_SEC, &rem);
value->tv_nsec = rem;
}
-EXPORT_SYMBOL(jiffies_to_timespec);
+EXPORT_SYMBOL(jiffies_to_timespec64);

/*
* We could use a similar algorithm to timespec_to_jiffies (with a
--
1.9.1

2015-08-17 20:41:23

by John Stultz

Subject: [PATCH 8/9] clocksource: Improve unstable clocksource detection

From: Shaohua Li <[email protected]>

>From time to time we saw TSC is marked as unstable in our systems, while
the CPUs declare to have stable TSC. Looking at the clocksource unstable
detection, there are two problems:
- watchdog clock source wrap. HPET is the most common watchdog clock
source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
can wrap in about 5 minutes.
- threshold isn't scaled against interval. The threshold is 0.0625s in
0.5s interval. What if the actual interval is bigger than 0.5s?

The watchdog runs in a timer bh, so hard/soft irq can defer its running.
Heavy network stack softirq can hog a cpu. IPMI driver can disable
interrupt for a very long time. The first problem is mostly we are
suffering I think.

Here is a simple patch to fix the issues. If the waterdog doesn't run
for a long time, we ignore the detection. This should work for the two
problems. For the second one, we probably doen't need to scale if the
interval isn't very long.
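
For reference, the 5 minute figure follows from the counter width and
rate quoted above (rough arithmetic, not spelled out in the patch): a
32-bit counter at ~14.3 MHz wraps after 2^32 / 14318180 Hz ~= 300
seconds. Once the gap between two watchdog runs exceeds that, the HPET
delta computed over its 32-bit mask no longer reflects the real elapsed
time, so comparing it against the 0.0625s threshold can flag a perfectly
good TSC.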

Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
kernel/time/clocksource.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 841b72f..8417c83 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
static void __clocksource_change_rating(struct clocksource *cs, int rating);

/*
- * Interval: 0.5sec Threshold: 0.0625s
+ * Interval: 0.5sec MaxInterval: 1s Threshold: 0.0625s
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
+#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)

static void clocksource_watchdog_work(struct work_struct *work)
@@ -217,7 +218,9 @@ static void clocksource_watchdog(unsigned long data)
continue;

/* Check the deviation from the watchdog clocksource. */
- if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
+ if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) &&
+ cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
+ wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
pr_warn("timekeeping watchdog: Marking clocksource '%s' as unstable because the skew is too large:\n",
cs->name);
pr_warn(" '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
--
1.9.1

2015-08-17 20:41:44

by John Stultz

Subject: [PATCH 9/9] clocksource: Sanity check watchdog clocksource

From: Shaohua Li <[email protected]>

Add a sanity check to make sure watchdog clocksource doesn't wrap too
quickly.

Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
kernel/time/clocksource.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 8417c83..64e4629 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -327,7 +327,8 @@ static void clocksource_enqueue_watchdog(struct clocksource *cs)
if (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS)
cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
/* Pick the best watchdog. */
- if (!watchdog || cs->rating > watchdog->rating) {
+ if (cs->max_idle_ns > WATCHDOG_MAX_INTERVAL_NS &&
+ (!watchdog || cs->rating > watchdog->rating)) {
watchdog = cs;
/* Reset watchdog cycles */
clocksource_reset_watchdog();
--
1.9.1

2015-08-17 21:01:34

by Shuah Khan

Subject: Re: [PATCH 1/9] timer_list: Add the base offset so remaining nsecs are accurate for non monotonic timers

On 08/17/2015 02:40 PM, John Stultz wrote:
> I noticed for non-monotonic timers in timer_list, some of the
> output looked a little confusing.
>
> For example:
> #1: <0000000000000000>, posix_timer_fn, S:01, hrtimer_start_range_ns, leap-a-day/2360
> # expires at 1434412800000000000-1434412800000000000 nsecs [in 1434410725062375469 to 1434410725062375469 nsecs]
>
> You'll note the relative time till the expiration "[in xxx to
> yyy nsecs]" is incorrect. This is because it's printing the delta
> between the CLOCK_MONOTONIC time and the CLOCK_REALTIME expiration.
>
> This patch fixes this issue by adding the clock offset to the
> "now" time which we use to calculate the delta.
>
> Cc: Prarit Bhargava <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Richard Cochran <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Jiri Bohac <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Shuah Khan <[email protected]>
> Signed-off-by: John Stultz <[email protected]>
> ---

Hi John,

I see just this patch in the series and not the others. Could you
please make sure I am on the cc for all of them. I will review and
try to get these into 4.3

thanks,
-- Shuah

--
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
[email protected] | (970) 217-8978

2015-08-17 21:04:18

by Shuah Khan

Subject: Re: [PATCH 1/9] timer_list: Add the base offset so remaining nsecs are accurate for non monotonic timers

On 08/17/2015 03:01 PM, Shuah Khan wrote:
> On 08/17/2015 02:40 PM, John Stultz wrote:
>> I noticed for non-monotonic timers in timer_list, some of the
>> output looked a little confusing.
>>
>> For example:
>> #1: <0000000000000000>, posix_timer_fn, S:01, hrtimer_start_range_ns, leap-a-day/2360
>> # expires at 1434412800000000000-1434412800000000000 nsecs [in 1434410725062375469 to 1434410725062375469 nsecs]
>>
>> You'll note the relative time till the expiration "[in xxx to
>> yyy nsecs]" is incorrect. This is because it's printing the delta
>> between the CLOCK_MONOTONIC time and the CLOCK_REALTIME expiration.
>>
>> This patch fixes this issue by adding the clock offset to the
>> "now" time which we use to calculate the delta.
>>
>> Cc: Prarit Bhargava <[email protected]>
>> Cc: Daniel Bristot de Oliveira <[email protected]>
>> Cc: Richard Cochran <[email protected]>
>> Cc: Jan Kara <[email protected]>
>> Cc: Jiri Bohac <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Shuah Khan <[email protected]>
>> Signed-off-by: John Stultz <[email protected]>
>> ---
>
> Hi John,
>
> I see just this patch in the series and not the others. Could you
> please make sure I am on the cc for all of them. I will review and
> try to get these into 4.3
>

Never mind. Looks like this one at least is better suited to go through
timer git.

-- Shuah


--
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
[email protected] | (970) 217-8978

2015-08-17 21:05:10

by John Stultz

Subject: Re: [PATCH 1/9] timer_list: Add the base offset so remaining nsecs are accurate for non monotonic timers

On Mon, Aug 17, 2015 at 2:01 PM, Shuah Khan <[email protected]> wrote:
> On 08/17/2015 02:40 PM, John Stultz wrote:
>> I noticed for non-monotonic timers in timer_list, some of the
>> output looked a little confusing.
>>
>> For example:
>> #1: <0000000000000000>, posix_timer_fn, S:01, hrtimer_start_range_ns, leap-a-day/2360
>> # expires at 1434412800000000000-1434412800000000000 nsecs [in 1434410725062375469 to 1434410725062375469 nsecs]
>>
>> You'll note the relative time till the expiration "[in xxx to
>> yyy nsecs]" is incorrect. This is because it's printing the delta
>> between the CLOCK_MONOTONIC time and the CLOCK_REALTIME expiration.
>>
>> This patch fixes this issue by adding the clock offset to the
>> "now" time which we use to calculate the delta.
>>
>> Cc: Prarit Bhargava <[email protected]>
>> Cc: Daniel Bristot de Oliveira <[email protected]>
>> Cc: Richard Cochran <[email protected]>
>> Cc: Jan Kara <[email protected]>
>> Cc: Jiri Bohac <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Shuah Khan <[email protected]>
>> Signed-off-by: John Stultz <[email protected]>
>> ---
>
> Hi John,
>
> I see just this patch in the series and not the others. Could you
> please make sure I am on the cc for all of them. I will review and
> try to get these into 4.3

So sorry here. I actually had you on CC for this from a previous patch
series this patch was included in, and didn't remove you before
sending. I'm hoping Ingo queues these for 4.3, so you shouldn't have
to do anything (unless you want to review it and provide any feedback
:).

thanks
-john

2015-08-17 21:24:56

by Thomas Gleixner

Subject: Re: [PATCH 9/9] clocksource: Sanity check watchdog clocksource

On Mon, 17 Aug 2015, John Stultz wrote:

> From: Shaohua Li <[email protected]>
>
> Add a sanity check to make sure watchdog clocksource doesn't wrap too
> quickly.

Too quickly for what?

Thanks,

tglx



2015-08-17 22:03:09

by John Stultz

Subject: Re: [PATCH 9/9] clocksource: Sanity check watchdog clocksource

On Mon, Aug 17, 2015 at 2:24 PM, Thomas Gleixner <[email protected]> wrote:
> On Mon, 17 Aug 2015, John Stultz wrote:
>
>> From: Shaohua Li <[email protected]>
>>
>> Add a sanity check to make sure watchdog clocksource doesn't wrap too
>> quickly.
>
> Too quickly for what?

The maximum interval delay limit (a watchdog that wraps within it would
prevent the logic from avoiding false positives).

I can try to expand this text and resend if you want.

Aren't you on vacation? :)
-john

2015-08-17 22:04:31

by Thomas Gleixner

Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Mon, 17 Aug 2015, John Stultz wrote:

> From: Shaohua Li <[email protected]>
>
> >From time to time we saw TSC is marked as unstable in our systems, while

Stray '>'

> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> detection, there are two problems:
> - watchdog clock source wrap. HPET is the most common watchdog clock
> source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> can wrap in about 5 minutes.
> - threshold isn't scaled against interval. The threshold is 0.0625s in
> 0.5s interval. What if the actual interval is bigger than 0.5s?
>
> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> interrupt for a very long time.

And they hold off the timer softirq for more than a second? Don't you
think that's the problem which needs to be fixed?

> The first problem is mostly we are suffering I think.

So you think that's the root cause and because your patch makes it go
away it's not necessary to know for sure, right?

> Here is a simple patch to fix the issues. If the waterdog doesn't run

waterdog?

> for a long time, we ignore the detection.

What's 'long time'? Please explain the numbers chosen.

> This should work for the two

Emphasis on 'should'?

> problems. For the second one, we probably doen't need to scale if the
> interval isn't very long.

-ENOPARSE

> @@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
> static void __clocksource_change_rating(struct clocksource *cs, int rating);
>
> /*
> - * Interval: 0.5sec Threshold: 0.0625s
> + * Interval: 0.5sec MaxInterval: 1s Threshold: 0.0625s
> */
> #define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
>
> static void clocksource_watchdog_work(struct work_struct *work)
> @@ -217,7 +218,9 @@ static void clocksource_watchdog(unsigned long data)
> continue;
>
> /* Check the deviation from the watchdog clocksource. */
> - if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
> + if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) &&
> + cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> + wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {

So that adds a new opportunity for undiscovered wreckage:

clocksource_watchdog();
.... <--- SMI skews TSC
looong_irq_disabled_region();
....
clocksource_watchdog(); <--- Does not detect skew

and it will not detect it later on if that SMI was a one time event.

So 'fixing' the watchdog is the wrong approach. Fixing the stuff which
prevents the watchdog to run is the proper thing to do.

Thanks,

tglx

2015-08-17 22:08:43

by Thomas Gleixner

Subject: Re: [PATCH 9/9] clocksource: Sanity check watchdog clocksource

On Mon, 17 Aug 2015, John Stultz wrote:

> On Mon, Aug 17, 2015 at 2:24 PM, Thomas Gleixner <[email protected]> wrote:
> > On Mon, 17 Aug 2015, John Stultz wrote:
> >
> >> From: Shaohua Li <[email protected]>
> >>
> >> Add a sanity check to make sure watchdog clocksource doesn't wrap too
> >> quickly.
> >
> > Too quickly for what?
>
> The maximum interval delay limit (which would prevent the logic from
> avoiding false positives) .
>
> I can try to expand this text and resend if you want.

Yes, please. These half-baked changelogs suck.

> Aren't you on vacation? :)

Just returned :)

Thanks,

tglx

2015-08-17 22:14:32

by Thomas Gleixner

Subject: Re: [PATCH 2/9] time: Fix nanosecond file time rounding in timespec_trunc()

On Mon, 17 Aug 2015, John Stultz wrote:
> + } else {
> + WARN(1, "illegal file time granularity: %u", gran);

s/illegal/Invalid/

Thanks,

tglx

2015-08-17 22:17:30

by John Stultz

Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner <[email protected]> wrote:
> On Mon, 17 Aug 2015, John Stultz wrote:
>
>> From: Shaohua Li <[email protected]>
>>
>> >From time to time we saw TSC is marked as unstable in our systems, while
>
> Stray '>'
>
>> the CPUs declare to have stable TSC. Looking at the clocksource unstable
>> detection, there are two problems:
>> - watchdog clock source wrap. HPET is the most common watchdog clock
>> source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
>> can wrap in about 5 minutes.
>> - threshold isn't scaled against interval. The threshold is 0.0625s in
>> 0.5s interval. What if the actual interval is bigger than 0.5s?
>>
>> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
>> Heavy network stack softirq can hog a cpu. IPMI driver can disable
>> interrupt for a very long time.
>
> And they hold off the timer softirq for more than a second? Don't you
> think that's the problem which needs to be fixed?

Though this is an issue I've experienced (and tried unsuccessfully to
fix in a more complicated way) with the RT kernel, where high priority
tasks blocked the watchdog long enough that we'd disqualify the TSC.

Ideally that sort of high-priority RT busyness would be avoided, but
it's also a pain to have false positives trigger when doing things like
stress testing.


>> The first problem is mostly we are suffering I think.
>
> So you think that's the root cause and because your patch makes it go
> away it's not necessary to know for sure, right?
>
>> Here is a simple patch to fix the issues. If the waterdog doesn't run
>
> waterdog?

Allergen-free. :)


>> for a long time, we ignore the detection.
>
> What's 'long time'? Please explain the numbers chosen.
>
>> This should work for the two
>
> Emphasis on 'should'?
>
>> problems. For the second one, we probably doen't need to scale if the
>> interval isn't very long.
>
> -ENOPARSE
>
>> @@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
>> static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>
>> /*
>> - * Interval: 0.5sec Threshold: 0.0625s
>> + * Interval: 0.5sec MaxInterval: 1s Threshold: 0.0625s
>> */
>> #define WATCHDOG_INTERVAL (HZ >> 1)
>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>> #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
>>
>> static void clocksource_watchdog_work(struct work_struct *work)
>> @@ -217,7 +218,9 @@ static void clocksource_watchdog(unsigned long data)
>> continue;
>>
>> /* Check the deviation from the watchdog clocksource. */
>> - if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
>> + if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) &&
>> + cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
>> + wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
>
> So that adds a new opportunity for undiscovered wreckage:
>
> clocksource_watchdog();
> .... <--- SMI skews TSC
> looong_irq_disabled_region();
> ....
> clocksource_watchdog(); <--- Does not detect skew
>
> and it will not detect it later on if that SMI was a one time event.
>
> So 'fixing' the watchdog is the wrong approach. Fixing the stuff which
> prevents the watchdog to run is the proper thing to do.

I'm not sure here. I feel like these delay-caused false positives
(I've seen similar reports w/ VMs being stalled) are more common than
one-off SMI TSC skews.

There are hard lines in the timekeeping code, where we do say: Don't
delay us past X or we can't really handle it, but in this case, the
main clocksource is fine and the limit is being caused by the
watchdog. So I think some sort of a solution to remove this
restriction would be good. We don't want to needlessly punish fine
hardware because our checks for bad hardware add extra restrictions.

That said, I agree the "should"s and other vague qualifiers in the
commit description you point out should have more specifics to back
things up. And I'm fine delaying this (and the follow-on) patch until
those details are provided.

thanks
-john

2015-08-18 02:57:25

by Shaohua Li

Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Mon, Aug 17, 2015 at 03:17:28PM -0700, John Stultz wrote:
> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner <[email protected]> wrote:
> > On Mon, 17 Aug 2015, John Stultz wrote:
> >
> >> From: Shaohua Li <[email protected]>
> >>
> >> >From time to time we saw TSC is marked as unstable in our systems, while
> >
> > Stray '>'
> >
> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> >> detection, there are two problems:
> >> - watchdog clock source wrap. HPET is the most common watchdog clock
> >> source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> >> can wrap in about 5 minutes.
> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
> >> 0.5s interval. What if the actual interval is bigger than 0.5s?
> >>
> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> >> interrupt for a very long time.
> >
> > And they hold off the timer softirq for more than a second? Don't you
> > think that's the problem which needs to be fixed?
>
> Though this is an issue I've experienced (and tried unsuccessfully to
> fix in a more complicated way) with the RT kernel, where high priority
> tasks blocked the watchdog long enough that we'd disqualify the TSC.
>
> Ideally that sort of high-priority RT busyness would be avoided, but
> it's also a pain to have false positives trigger when doing things like
> stress testing.
>
>
> >> The first problem is mostly we are suffering I think.
> >
> > So you think that's the root cause and because your patch makes it go
> > away it's not necessary to know for sure, right?
> >
> >> Here is a simple patch to fix the issues. If the waterdog doesn't run
> >
> > waterdog?
>
> Allergen-free. :)
>
>
> >> for a long time, we ignore the detection.
> >
> > What's 'long time'? Please explain the numbers chosen.
> >
> >> This should work for the two
> >
> > Emphasis on 'should'?
> >
> >> problems. For the second one, we probably doen't need to scale if the
> >> interval isn't very long.
> >
> > -ENOPARSE
> >
> >> @@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
> >> static void __clocksource_change_rating(struct clocksource *cs, int rating);
> >>
> >> /*
> >> - * Interval: 0.5sec Threshold: 0.0625s
> >> + * Interval: 0.5sec MaxInterval: 1s Threshold: 0.0625s
> >> */
> >> #define WATCHDOG_INTERVAL (HZ >> 1)
> >> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> >> #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
> >>
> >> static void clocksource_watchdog_work(struct work_struct *work)
> >> @@ -217,7 +218,9 @@ static void clocksource_watchdog(unsigned long data)
> >> continue;
> >>
> >> /* Check the deviation from the watchdog clocksource. */
> >> - if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
> >> + if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) &&
> >> + cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> >> + wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
> >
> > So that adds a new opportunity for undiscovered wreckage:
> >
> > clocksource_watchdog();
> > .... <--- SMI skews TSC
> > looong_irq_disabled_region();
> > ....
> > clocksource_watchdog(); <--- Does not detect skew
> >
> > and it will not detect it later on if that SMI was a one time event.
> >
> > So 'fixing' the watchdog is the wrong approach. Fixing the stuff which
> > prevents the watchdog to run is the proper thing to do.
>
> I'm not sure here. I feel like these delay-caused false positives
> (I've seen similar reports w/ VMs being stalled) are more common than
> one-off SMI TSC skews.
>
> There are hard lines in the timekeeping code, where we do say: Don't
> delay us past X or we can't really handle it, but in this case, the
> main clocksource is fine and the limit is being caused by the
> watchdog. So I think some sort of a solution to remove this
> restriction would be good. We don't want to needlessly punish fine
> hardware because our checks for bad hardware add extra restrictions.
>
> That said, I agree the "should"s and other vague qualifiers in the
> commit description you point out should have more specifics to back
> things up. And I'm fine delaying this (and the follow-on) patch until
> those details are provided.

It's not something I'm guessing at. We do see the issue from time to
time. The IPMI driver accesses some IO ports in softirq and hogs the
cpu for a very long time, then the watchdog alerts. The false alert on
the other hand has a much worse effect. It forces the use of HPET as
the clocksource, which has a very big performance penalty. We can't
even manually switch back to TSC as the current interface doesn't
allow us to do it, so we can only reboot the system. I agree the
driver should be fixed, but the watchdog raises a false alert, so we
definitely should fix it.

The 1s interval is arbitrary. If you think there is a better way to fix
the issue, please let me know.

Thanks,
Shaohua

2015-08-18 03:39:22

by John Stultz

Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Mon, Aug 17, 2015 at 7:57 PM, Shaohua Li <[email protected]> wrote:
> On Mon, Aug 17, 2015 at 03:17:28PM -0700, John Stultz wrote:
>> That said, I agree the "should"s and other vague qualifiers in the
>> commit description you point out should have more specifics to back
>> things up. And I'm fine delaying this (and the follow-on) patch until
>> those details are provided.
>
> It's not something I'm guessing at. We do see the issue from time to
> time. The IPMI driver accesses some IO ports in softirq and hogs the
> cpu for a very long time, then the watchdog alerts. The false alert on
> the other hand has a much worse effect. It forces the use of HPET as
> the clocksource, which has a very big performance penalty. We can't
> even manually switch back to TSC as the current interface doesn't
> allow us to do it, so we can only reboot the system. I agree the
> driver should be fixed, but the watchdog raises a false alert, so we
> definitely should fix it.

I think Thomas is requesting that some of the vague terms be
quantified. Seeing the issue "from time to time" isn't super
informative. When the IPMI driver hogs the cpu "for a very long time",
how long does that actually take? You've provided the HPET
frequency, and the wrapping interval on your hardware. Do these
intervals all line up properly?

I sympathize that the "show-your-work" math-problem aspect of this
request might be a little remedial and irritating, esp. when the patch
fixes the problem for you. But it's important, so that later on when some
bug crops up in nearby code, folks can easily repeat your calculation
and know the problem isn't from your code.

> The 1s interval is arbitrary. If you think there is a better way to fix
> the issue, please let me know.

I don't think 1s is necessarily arbitrary. Maybe not much conscious
thought was put into it, but clearly .001 sec wasn't chosen, nor
10 minutes, for a reason.

So given the intervals you're seeing the problem with, would maybe a
larger max interval (say 30 seconds) make more or less sense? What
would the tradeoffs be? (i.e., would that exclude clocksources with
faster wraps from being used as watchdogs, with your patches?)

I'm sure a good interval could be chosen with some thought, and the
rationale explained. :)

thanks
-john

2015-08-18 08:38:45

by Thomas Gleixner

Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Mon, 17 Aug 2015, John Stultz wrote:
> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner <[email protected]> wrote:
> > On Mon, 17 Aug 2015, John Stultz wrote:
> >
> >> From: Shaohua Li <[email protected]>
> >>
> >> >From time to time we saw TSC is marked as unstable in our systems, while
> >
> > Stray '>'
> >
> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> >> detection, there are two problems:
> >> - watchdog clock source wrap. HPET is the most common watchdog clock
> >> source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> >> can wrap in about 5 minutes.
> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
> >> 0.5s interval. What if the actual interval is bigger than 0.5s?
> >>
> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> >> interrupt for a very long time.
> >
> > And they hold off the timer softirq for more than a second? Don't you
> > think that's the problem which needs to be fixed?
>
> Though this is an issue I've experienced (and tried unsuccessfully to
> fix in a more complicated way) with the RT kernel, where high priority
> tasks blocked the watchdog long enough that we'd disqualify the TSC.

Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
due to the fixed threshold being applied?

> > So 'fixing' the watchdog is the wrong approach. Fixing the stuff which
> > prevents the watchdog to run is the proper thing to do.
>
> I'm not sure here. I feel like these delay-caused false positives
> (I've seen similar reports w/ VMs being stalled) are more common than
> one-off SMI TSC skews.

Yes, they are more common, but the other issues are reality as well.

> There are hard lines in the timekeeping code, where we do say: Don't
> delay us past X or we can't really handle it, but in this case, the
> main clocksource is fine and the limit is being caused by the
> watchdog. So I think some sort of a solution to remove this
> restriction would be good. We don't want to needlessly punish fine
> hardware because our checks for bad hardware add extra restrictions.

No argument here. Though fine hardware has an escape route already to
avoid the watchdog business altogether (tsc=reliable on the command
line).

> That said, I agree the "should"s and other vague qualifiers in the
> commit description you point out should have more specifics to back
> things up. And I'm fine delaying this (and the follow-on) patch until
> those details are provided.

Fair enough.

Thanks,

tglx

2015-08-18 08:58:18

by Thomas Gleixner

Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Mon, 17 Aug 2015, Shaohua Li wrote:
> On Mon, Aug 17, 2015 at 03:17:28PM -0700, John Stultz wrote:
> > That said, I agree the "should"s and other vague qualifiers in the
> > commit description you point out should have more specifics to back
> > things up. And I'm fine delaying this (and the follow-on) patch until
> > those details are provided.
>
> It's not something I'm guessing at. We do see the issue from time to
> time. The IPMI driver accesses some IO ports in softirq and hogs the
> cpu for a very long time, then the watchdog alerts.

You still fail to provide proper numbers. 'very long time' does not
qualify as an argument at all.

> The false alert on the other hand has very worse effect. It forces
> to use HPET as clocksource, which has very big performance
> penality. We can't even manually switch back to TSC as current
> interface doesn't allow us to do it, then we can only reboot the
> system. I agree the driver should be fixed, but the watchdog has
> false alert, we definitively should fix it.

I tend to disagree. The watchdog has constraints and the driver is
violating these constraints, so the first thing which wants to be
addressed is the driver itself.

The behaviour of the watchdog in the case of constraint violations is
definitely suboptimal and can lead to false positives. I'm not against
making that more robust, but I'm not accepting your 'watchdog is
broken' argumentation at all.

Thanks,

tglx

2015-08-18 17:49:25

by John Stultz

Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner <[email protected]> wrote:
> On Mon, 17 Aug 2015, John Stultz wrote:
>> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner <[email protected]> wrote:
>> > On Mon, 17 Aug 2015, John Stultz wrote:
>> >
>> >> From: Shaohua Li <[email protected]>
>> >>
>> >> >From time to time we saw TSC is marked as unstable in our systems, while
>> >
>> > Stray '>'
>> >
>> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
>> >> detection, there are two problems:
>> >> - watchdog clock source wrap. HPET is the most common watchdog clock
>> >> source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
>> >> can wrap in about 5 minutes.
>> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
>> >> 0.5s interval. What if the actual interval is bigger than 0.5s?
>> >>
>> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
>> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
>> >> interrupt for a very long time.
>> >
>> > And they hold off the timer softirq for more than a second? Don't you
>> > think that's the problem which needs to be fixed?
>>
>> Though this is an issue I've experienced (and tried unsuccessfully to
>> fix in a more complicated way) with the RT kernel, where high priority
>> tasks blocked the watchdog long enough that we'd disqualify the TSC.
>
> Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
> due to the fixed threshold being applied?

This was years ago, but in my experience, the watchdog false positives
were due to HPET wraparounds.

thanks
-john

2015-08-18 19:28:30

by Thomas Gleixner

Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Tue, 18 Aug 2015, John Stultz wrote:
> On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner <[email protected]> wrote:
> > On Mon, 17 Aug 2015, John Stultz wrote:
> >> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner <[email protected]> wrote:
> >> > On Mon, 17 Aug 2015, John Stultz wrote:
> >> >
> >> >> From: Shaohua Li <[email protected]>
> >> >>
> >> >> >From time to time we saw TSC is marked as unstable in our systems, while
> >> >
> >> > Stray '>'
> >> >
> >> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> >> >> detection, there are two problems:
> >> >> - watchdog clock source wrap. HPET is the most common watchdog clock
> >> >> source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> >> >> can wrap in about 5 minutes.
> >> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
> >> >> 0.5s interval. What if the actual interval is bigger than 0.5s?
> >> >>
> >> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> >> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> >> >> interrupt for a very long time.
> >> >
> >> > And they hold off the timer softirq for more than a second? Don't you
> >> > think that's the problem which needs to be fixed?
> >>
> >> Though this is an issue I've experienced (and tried unsuccessfully to
> >> fix in a more complicated way) with the RT kernel, where high priority
> >> tasks blocked the watchdog long enough that we'd disqualify the TSC.
> >
> > Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
> > due to the fixed threshold being applied?
>
> This was years ago, but in my experience, the watchdog false positives
> were due to HPET wraparounds.

Blocking stuff for 5 minutes is insane ....

2015-08-18 20:11:54

by John Stultz

Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Tue, Aug 18, 2015 at 12:28 PM, Thomas Gleixner <[email protected]> wrote:
> On Tue, 18 Aug 2015, John Stultz wrote:
>> On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner <[email protected]> wrote:
>> > On Mon, 17 Aug 2015, John Stultz wrote:
>> >> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner <[email protected]> wrote:
>> >> > On Mon, 17 Aug 2015, John Stultz wrote:
>> >> >
>> >> >> From: Shaohua Li <[email protected]>
>> >> >>
>> >> >> >From time to time we saw TSC is marked as unstable in our systems, while
>> >> >
>> >> > Stray '>'
>> >> >
>> >> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
>> >> >> detection, there are two problems:
>> >> >> - watchdog clock source wrap. HPET is the most common watchdog clock
>> >> >> source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
>> >> >> can wrap in about 5 minutes.
>> >> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
>> >> >> 0.5s interval. What if the actual interval is bigger than 0.5s?
>> >> >>
>> >> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
>> >> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
>> >> >> interrupt for a very long time.
>> >> >
>> >> > And they hold off the timer softirq for more than a second? Don't you
>> >> > think that's the problem which needs to be fixed?
>> >>
>> >> Though this is an issue I've experienced (and tried unsuccessfully to
>> >> fix in a more complicated way) with the RT kernel, where high priority
>> >> tasks blocked the watchdog long enough that we'd disqualify the TSC.
>> >
>> > Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
>> > due to the fixed threshold being applied?
>>
>> This was years ago, but in my experience, the watchdog false positives
>> were due to HPET wraparounds.
>
> Blocking stuff for 5 minutes is insane ....

Yea. It was usually due to -RT stress testing, which kept the
machines busy for quite a while. But again, if you have machines being
maxed out with networking load, etc, even for long amounts of time, we
still want to avoid false positives. Because after the watchdog
disqualifies the TSC, the only clocksources left wrap around much
sooner, and we're more likely to then actually lose time during the
next load spike.

Cc'ing Clark and Steven to see if its something they still run into,
and maybe they can help validate the patch.

thanks
-john

2015-08-18 20:18:37

by Thomas Gleixner

Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Tue, 18 Aug 2015, John Stultz wrote:
> On Tue, Aug 18, 2015 at 12:28 PM, Thomas Gleixner <[email protected]> wrote:
> > On Tue, 18 Aug 2015, John Stultz wrote:
> >> On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner <[email protected]> wrote:
> >> > On Mon, 17 Aug 2015, John Stultz wrote:
> >> >> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner <[email protected]> wrote:
> >> >> > On Mon, 17 Aug 2015, John Stultz wrote:
> >> >> >
> >> >> >> From: Shaohua Li <[email protected]>
> >> >> >>
> >> >> >> >From time to time we saw TSC is marked as unstable in our systems, while
> >> >> >
> >> >> > Stray '>'
> >> >> >
> >> >> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> >> >> >> detection, there are two problems:
> >> >> >> - watchdog clock source wrap. HPET is the most common watchdog clock
> >> >> >> source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> >> >> >> can wrap in about 5 minutes.
> >> >> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
> >> >> >> 0.5s interval. What if the actual interval is bigger than 0.5s?
> >> >> >>
> >> >> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> >> >> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> >> >> >> interrupt for a very long time.
> >> >> >
> >> >> > And they hold off the timer softirq for more than a second? Don't you
> >> >> > think that's the problem which needs to be fixed?
> >> >>
> >> >> Though this is an issue I've experienced (and tried unsuccessfully to
> >> >> fix in a more complicated way) with the RT kernel, where high priority
> >> >> tasks blocked the watchdog long enough that we'd disqualify the TSC.
> >> >
> >> > Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
> >> > due to the fixed threshold being applied?
> >>
> >> This was years ago, but in my experience, the watchdog false positives
> >> were due to HPET wraparounds.
> >
> > Blocking stuff for 5 minutes is insane ....
>
> Yea. It was usually due to -RT stress testing, which keept the
> machines busy for quite awhile. But again, if you have machines being
> maxed out with networking load, etc, even for long amounts of time, we
> still want to avoid false positives. Because after the watchdog

The networking softirq does not hog the other softirqs. It has a limit
on processing loops and then goes back to let the other softirqs be
handled. So no, I doubt that heavy networking can cause this. If it
does, then we have other, far more serious problems.

I can see the issue with RT stress testing, but not with networking in
mainline.
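
For reference, a toy user-space model of the limit described above (the
300-packet budget and the roughly two-tick time cap are the usual
mainline defaults, quoted from memory rather than from this thread):

#include <stdio.h>

int main(void)
{
	int budget = 300;          /* netdev_budget-style packet budget */
	int ticks = 0, time_limit = 2;

	for (;;) {
		budget -= 64;      /* pretend one NAPI poll handled 64 packets */
		ticks += 1;        /* pretend roughly a tick of work went by */

		/* Stop early so TIMER and the other softirqs still get to
		 * run; the real code re-raises NET_RX_SOFTIRQ at this point.
		 */
		if (budget <= 0 || ticks >= time_limit) {
			printf("yield after %d ticks, budget left %d\n",
			       ticks, budget);
			break;
		}
	}
	return 0;
}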

Thanks,

tglx

2015-08-26 17:15:58

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Tue, Aug 18, 2015 at 10:18:09PM +0200, Thomas Gleixner wrote:
> On Tue, 18 Aug 2015, John Stultz wrote:
> > On Tue, Aug 18, 2015 at 12:28 PM, Thomas Gleixner <[email protected]> wrote:
> > > On Tue, 18 Aug 2015, John Stultz wrote:
> > >> On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner <[email protected]> wrote:
> > >> > On Mon, 17 Aug 2015, John Stultz wrote:
> > >> >> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner <[email protected]> wrote:
> > >> >> > On Mon, 17 Aug 2015, John Stultz wrote:
> > >> >> >
> > >> >> >> From: Shaohua Li <[email protected]>
> > >> >> >>
> > >> >> >> >From time to time we saw TSC is marked as unstable in our systems, while
> > >> >> >
> > >> >> > Stray '>'
> > >> >> >
> > >> >> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> > >> >> >> detection, there are two problems:
> > >> >> >> - watchdog clock source wrap. HPET is the most common watchdog clock
> > >> >> >> source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> > >> >> >> can wrap in about 5 minutes.
> > >> >> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
> > >> >> >> 0.5s interval. What if the actual interval is bigger than 0.5s?
> > >> >> >>
> > >> >> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> > >> >> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> > >> >> >> interrupt for a very long time.
> > >> >> >
> > >> >> > And they hold off the timer softirq for more than a second? Don't you
> > >> >> > think that's the problem which needs to be fixed?
> > >> >>
> > >> >> Though this is an issue I've experienced (and tried unsuccessfully to
> > >> >> fix in a more complicated way) with the RT kernel, where high priority
> > >> >> tasks blocked the watchdog long enough that we'd disqualify the TSC.
> > >> >
> > >> > Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
> > >> > due to the fixed threshold being applied?
> > >>
> > >> This was years ago, but in my experience, the watchdog false positives
> > >> were due to HPET wraparounds.
> > >
> > > Blocking stuff for 5 minutes is insane ....
> >
> > Yea. It was usually due to -RT stress testing, which keept the
> > machines busy for quite awhile. But again, if you have machines being
> > maxed out with networking load, etc, even for long amounts of time, we
> > still want to avoid false positives. Because after the watchdog
>
> The networking softirq does not hog the other softirqs. It has a limit
> on processing loops and then goes back to let the other softirqs be
> handled. So no, I doubt that heavy networking can cause this. If it
> does then we have some other way more serious problems.
>
> I can see the issue with RT stress testing, but not with networking in
> mainline.

Ok, the issue is triggered in my kvm guest; I guess it's easier to
trigger in kvm because the hpet there runs at 100MHz.

[ 135.930067] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 135.930095] clocksource: 'hpet' wd_now: 2bc19ea0 wd_last: 6c4e5570 mask: ffffffff
[ 135.930105] clocksource: 'tsc' cs_now: 481250b45b cs_last: 219e6efb50 mask: ffffffffffffffff
[ 135.938750] clocksource: Switched to clocksource hpet

The HPET clock is 100MHz and the CPU speed is 2200MHz; kvm is passed the
correct cpu info, so the guest's cpuinfo reports a stable TSC.

The hpet interval is ((0x2bc19ea0 - 0x6c4e5570) & 0xffffffff) / 100000000 = 32.1s.

The HPET wrap interval is 0xffffffff / 100000000 = 42.9s.

The tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s.

32.1 + 42.9 = 75

The example shows the HPET wrapped, and so the TSC was marked unstable.
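
The same arithmetic as a standalone program, using the
wd_now/wd_last/cs_now/cs_last values from the log above (the 100MHz HPET
and 2.2GHz TSC rates are the ones stated above; the 14.318MHz line is
only there to compare against a bare-metal HPET):

/* Recomputes the intervals from the log above.  The watchdog delta is
 * taken modulo the 32-bit HPET mask, as the clocksource watchdog does,
 * so a wrapped HPET reads as 32.1s while the TSC says 75s.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t wd_now = 0x2bc19ea0, wd_last = 0x6c4e5570, wd_mask = 0xffffffff;
	uint64_t cs_now = 0x481250b45bULL, cs_last = 0x219e6efb50ULL;

	double hpet_hz = 100000000.0;	/* 100MHz emulated HPET */
	double tsc_hz  = 2200000000.0;	/* 2.2GHz TSC */

	printf("watchdog interval: %.1f s\n",
	       (double)((wd_now - wd_last) & wd_mask) / hpet_hz);	/* ~32.1 */
	printf("tsc interval:      %.1f s\n",
	       (double)(cs_now - cs_last) / tsc_hz);			/* ~75.1 */
	printf("hpet wrap period:  %.1f s at 100MHz\n",
	       (double)wd_mask / hpet_hz);				/* ~42.9 */
	printf("hpet wrap period:  %.1f s at 14.318MHz\n",
	       (double)wd_mask / 14318180.0);				/* ~300 */
	return 0;
}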

Thanks,
Shaohua

2015-08-31 21:12:56

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Wed, Aug 26, 2015 at 10:15:33AM -0700, Shaohua Li wrote:
> On Tue, Aug 18, 2015 at 10:18:09PM +0200, Thomas Gleixner wrote:
> > On Tue, 18 Aug 2015, John Stultz wrote:
> > > On Tue, Aug 18, 2015 at 12:28 PM, Thomas Gleixner <[email protected]> wrote:
> > > > On Tue, 18 Aug 2015, John Stultz wrote:
> > > >> On Tue, Aug 18, 2015 at 1:38 AM, Thomas Gleixner <[email protected]> wrote:
> > > >> > On Mon, 17 Aug 2015, John Stultz wrote:
> > > >> >> On Mon, Aug 17, 2015 at 3:04 PM, Thomas Gleixner <[email protected]> wrote:
> > > >> >> > On Mon, 17 Aug 2015, John Stultz wrote:
> > > >> >> >
> > > >> >> >> From: Shaohua Li <[email protected]>
> > > >> >> >>
> > > >> >> >> >From time to time we saw TSC is marked as unstable in our systems, while
> > > >> >> >
> > > >> >> > Stray '>'
> > > >> >> >
> > > >> >> >> the CPUs declare to have stable TSC. Looking at the clocksource unstable
> > > >> >> >> detection, there are two problems:
> > > >> >> >> - watchdog clock source wrap. HPET is the most common watchdog clock
> > > >> >> >> source. It's 32-bit and runs in 14.3Mhz. That means the hpet counter
> > > >> >> >> can wrap in about 5 minutes.
> > > >> >> >> - threshold isn't scaled against interval. The threshold is 0.0625s in
> > > >> >> >> 0.5s interval. What if the actual interval is bigger than 0.5s?
> > > >> >> >>
> > > >> >> >> The watchdog runs in a timer bh, so hard/soft irq can defer its running.
> > > >> >> >> Heavy network stack softirq can hog a cpu. IPMI driver can disable
> > > >> >> >> interrupt for a very long time.
> > > >> >> >
> > > >> >> > And they hold off the timer softirq for more than a second? Don't you
> > > >> >> > think that's the problem which needs to be fixed?
> > > >> >>
> > > >> >> Though this is an issue I've experienced (and tried unsuccessfully to
> > > >> >> fix in a more complicated way) with the RT kernel, where high priority
> > > >> >> tasks blocked the watchdog long enough that we'd disqualify the TSC.
> > > >> >
> > > >> > Did it disqualify the watchdog due to HPET wraparounds (5 minutes) or
> > > >> > due to the fixed threshold being applied?
> > > >>
> > > >> This was years ago, but in my experience, the watchdog false positives
> > > >> were due to HPET wraparounds.
> > > >
> > > > Blocking stuff for 5 minutes is insane ....
> > >
> > > Yea. It was usually due to -RT stress testing, which keept the
> > > machines busy for quite awhile. But again, if you have machines being
> > > maxed out with networking load, etc, even for long amounts of time, we
> > > still want to avoid false positives. Because after the watchdog
> >
> > The networking softirq does not hog the other softirqs. It has a limit
> > on processing loops and then goes back to let the other softirqs be
> > handled. So no, I doubt that heavy networking can cause this. If it
> > does then we have some other way more serious problems.
> >
> > I can see the issue with RT stress testing, but not with networking in
> > mainline.
>
> Ok, the issue is triggerd in my kvm guest, I guess it's easier to
> trigger in kvm because hpet is 100Mhz.
>
> [ 135.930067] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large:
> [ 135.930095] clocksource: 'hpet' wd_now: 2bc19ea0 wd_last: 6c4e5570 mask: ffffffff
> [ 135.930105] clocksource: 'tsc' cs_now: 481250b45b cs_last: 219e6efb50 mask: ffffffffffffffff
> [ 135.938750] clocksource: Switched to clocksource hpet
>
> The HPET clock is 100MHz, CPU speed is 2200MHz, kvm is passed correct cpu
> info, so guest cpuinfo shows TSC is stable.
>
> hpet interval is ((0x2bc19ea0 - 0x6c4e5570) & 0xffffffff) / 100000000 = 32.1s.
>
> The HPET wraps interval is 0xffffffff / 100000000 = 42.9s
>
> tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s
>
> 32.1 + 42.9 = 75
>
> The example shows hpet wraps, while tsc is marked unstable

Thomas & John,
Is this data enough to prove that the TSC unstable issue can be triggered
by an HPET wrap? I can resend the patch with the data included.

Thanks,
Shaohua

2015-08-31 21:48:29

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Mon, 31 Aug 2015, Shaohua Li wrote:
> > The HPET wraps interval is 0xffffffff / 100000000 = 42.9s
> >
> > tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s
> >
> > 32.1 + 42.9 = 75
> >
> > The example shows hpet wraps, while tsc is marked unstable
>
> Thomas & John,
> Is this data enough to prove TSC unstable issue can be triggered by HPET
> wrap? I can resend the patch with the data included.

Well, it's enough data to prove:

- that keeping a VM off the CPU for 75 seconds is insane.

- that emulating the HPET at 100MHz shortens the HPET wraparound by
  a factor of 7 compared to real hardware. With a realistic HPET
  frequency you have about 300 seconds.

Who thought that using a 100MHz HPET frequency is a brilliant idea?

So we should add crappy heuristics to the watchdog just to work around
virt insanities? I'm not convinced.

Thanks,

tglx

2015-08-31 22:40:40

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

On Mon, Aug 31, 2015 at 11:47:52PM +0200, Thomas Gleixner wrote:
> On Mon, 31 Aug 2015, Shaohua Li wrote:
> > > The HPET wraps interval is 0xffffffff / 100000000 = 42.9s
> > >
> > > tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s
> > >
> > > 32.1 + 42.9 = 75
> > >
> > > The example shows hpet wraps, while tsc is marked unstable
> >
> > Thomas & John,
> > Is this data enough to prove TSC unstable issue can be triggered by HPET
> > wrap? I can resend the patch with the data included.
>
> Well, it's enough data to prove:
>
> - that keeping a VM off the CPU for 75 seconds is insane.

It wraps in 42.9s, and 42.9s isn't a long time to block. I don't think
it's just the VM being kept off the CPU; a softirq can hog the cpu too.

> - that emulating the HPET with 100MHz shortens the HPET wraparound by
> a factor of 7 compared to real hardware. With a realist HPET
> frequency you have about 300 seconds.
>
> Who though that using 100MHz HPET frequency is a brilliant idea?

I'm not a VM expert. My guess is that the 100MHz rate reduces
interrupts; it would be insane for the hypervisor to update the HPET
count at 14.3MHz. Switching to HPET also introduces even higher overhead
in a guest, because of the vmexit on every iomemory access.

> So we should add crappy heuristics to the watchdog just to workaround
> virt insanities? I'm not convinced.

This is a real issue which can seriously impact performance. Though the
data here was collected in a VM, we do see the issue happen on physical
machines too. The watchdog clocksource clearly shows its limitation
here, and it deserves an improvement if we can make one. I'm happy to
hear from you if there is a better solution, but we shouldn't pretend
there is no issue here.

Thanks,
Shaohua