Date: Fri, 9 Dec 2016 23:59:21 +0100 (CET)
From: Thomas Gleixner <tglx@linutronix.de>
To: Roland Scheidegger <rscheidegger_lists@hispeed.ch>
cc: Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
        x86@kernel.org, LKML <linux-kernel@vger.kernel.org>,
        Allen Hung <allen_hung@dell.com>
Subject: Re: [PATCH] x86/tsc: RFC: re-synchronize TSCs to boot cpu TSC
In-Reply-To: <50876e71-ade8-3c3b-db8b-8f5ae136157e@hispeed.ch>
Message-ID: <alpine.DEB.2.20.1612092203430.3612@nanos>
References: <1481179042-968-1-git-send-email-sroland@vmware.com> <alpine.DEB.2.20.1612080854420.3400@nanos> <6194ea73-99dd-ceca-6f8d-e945ca717c75@hispeed.ch> <alpine.DEB.2.20.1612090953570.3794@nanos> <38f3ca77-ce32-d180-106a-e7242c635394@hispeed.ch>
 <alpine.DEB.2.20.1612091806000.3423@nanos> <50876e71-ade8-3c3b-db8b-8f5ae136157e@hispeed.ch>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 7379
Lines: 223

On Fri, 9 Dec 2016, Roland Scheidegger wrote:

Cc'ed someone from Dell. 

> Am 09.12.2016 um 18:33 schrieb Thomas Gleixner:
> > Can you add the patch below to gather more information? There is a hunk in
> > there with an '#if 0' which sets the TSC ADJUST to 0 on boot, which you can
> > turn on as second step.
> 
> Ok, here's the results:
> ...
> TSC ADJUST synchronize: Reference CPU0: -2820267100 CPU1: -2822498296
> TSC target sync skipped
> smpboot: Vector locked
> smpboot: Vector setup done
> smpboot: Clock setup
> TSC source sync skipped
> smpboot: Target CPU is online

I did not expect that to happen. Now I'm puzzled and curious where the
machine gets lost after that. See below.

> With the #if 0 block activated, it boots up fine, the output was:

That does not make any sense at all, but yes, nothing in this context makes
sense.

> [    1.038892] x86: Booting SMP configuration:
> [    1.038930] .... node  #0, CPUs:        #1
> [    0.171851] TSC ADJUST: CPU1: -2830353064 218577682002
> [    1.117495] TSC source sync 0 -> 1 runs 3
> [    0.171852] TSC ADJUST differs: Reference CPU0: -2828600940 CPU1:
> -2830353064
> [    0.171853] TSC ADJUST synchronize: Reference CPU0: 0 CPU1: -2830353064
> [    1.117497] TSC target sync skip

> (And fwiw with my quick hack the lockups disappear to when I change that
> back to blast a zero into TSC_ADJ for all cpus.)

Right, That's what that hunk does as well.

Now what's interesting is that the adjustement of CPU1 in the non write to
zero case results in the following:

TSC ADJUST: CPU1: -2830353064 218577682002 <-- TSC value
TSC ADJUST differs: Reference CPU0: -2828600940 CPU1: -2830353064

We write CPU1 adjust register to -2828600940 which makes the TSC on CPU1
jump forwards by -2828600940 - -2830353064 = 1752124 cycles.

In the write to zero case the jump is forward as well, but this time it's
huge, i.e. 2830353064 cycles.

I tried to wreckage the TSC by writing similar values to the adjust MSR on
early boot, but independent of the values and independent of the write to
zero part the machine comes up happily all the time.

The only difference is that my machine has a somewhat saner BIOS. So the
thing might just die in the value add SMM crap, but who knows.

In the patch below is another bunch of debug prints which emit the state
information of CPU1 during bringup. Maybe that gives a hint where the
system gets stuck when you disable the 'write to zero' magic again.

The NMI watchdog does not catch anything, right?

> The system also came back up fine from suspend with this (well - still
> minus graphics...), however disabled tsc clocksource:
> 
> [  579.931739] Enabling non-boot CPUs ...
> [  579.943107] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [  579.943189] TSC ADJUST: CPU1: -1504429974 21601834126

Fun, yet another adjust value. Are they set by a random number generator?

> [  579.944093] CPU1 is up

> [  580.458983] clocksource: timekeeping watchdog on CPU1: Marking
> clocksource 'tsc' as unstable because the skew is too large:
> [  580.458985] clocksource:                       'hpet' wd_now: 587c1
> wd_last: 437c7 mask: ffffffff
> [  580.458986] clocksource:                       'tsc' cs_now:
> 563963cd8 cs_last: 508f5a02a mask: ffffffffffffffff

Ok, that's caused by the fact that we do not sanitize the TSC adjust
register on the boot CPU in the resume path.

> [  581.006760] [Firmware Bug]: TSC ADJUST differs: CPU0 0 -->
> -1502494750. Restoring

We only detect it later and correct it, but that's too late. The untested
patch below should cure that.

> > This BIOS seems to be rather strange ...
> Don't tell me...
> 
> No idea what it's doing, but I think it's safe to say whatever it's
> trying to do, it's doing it wrong...

Amen to that. I've seen a lot of Value Add BIOSes which broke things left
and right, but this one takes it to a new level of insanity.

Did you report that back to DELL already?

Thanks,

	tglx

8<--------------------
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -47,7 +47,7 @@ extern int tsc_clocksource_reliable;
  */
 #ifdef CONFIG_X86_TSC
 extern bool tsc_store_and_check_tsc_adjust(void);
-extern void tsc_verify_tsc_adjust(void);
+extern void tsc_verify_tsc_adjust(bool resume);
 extern void check_tsc_sync_source(int cpu);
 extern void check_tsc_sync_target(void);
 #else
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -277,7 +277,7 @@ void exit_idle(void)
 
 void arch_cpu_idle_enter(void)
 {
-	tsc_verify_tsc_adjust();
+	tsc_verify_tsc_adjust(false);
 	local_touch_nmi();
 	enter_idle();
 }
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1080,6 +1080,11 @@ static void detect_art(void)
 
 static struct clocksource clocksource_tsc;
 
+static void tsc_resume(struct clocksource *cs)
+{
+	tsc_verify_tsc_adjust(true);
+}
+
 /*
  * We used to compare the TSC to the cycle_last value in the clocksource
  * structure to avoid a nasty time-warp. This can be observed in a
@@ -1112,6 +1117,7 @@ static struct clocksource clocksource_ts
 	.flags                  = CLOCK_SOURCE_IS_CONTINUOUS |
 				  CLOCK_SOURCE_MUST_VERIFY,
 	.archdata               = { .vclock_mode = VCLOCK_TSC },
+	.resume			= tsc_resume,
 };
 
 void mark_tsc_unstable(char *reason)
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -30,7 +30,7 @@ struct tsc_adjust {
 
 static DEFINE_PER_CPU(struct tsc_adjust, tsc_adjust);
 
-void tsc_verify_tsc_adjust(void)
+void tsc_verify_tsc_adjust(bool resume)
 {
 	struct tsc_adjust *adj = this_cpu_ptr(&tsc_adjust);
 	s64 curval;
@@ -39,7 +39,7 @@ void tsc_verify_tsc_adjust(void)
 		return;
 
 	/* Rate limit the MSR check */
-	if (time_before(jiffies, adj->nextcheck))
+	if (!resume && time_before(jiffies, adj->nextcheck))
 		return;
 
 	adj->nextcheck = jiffies + HZ;
@@ -51,7 +51,7 @@ void tsc_verify_tsc_adjust(void)
 	/* Restore the original value */
 	wrmsrl(MSR_IA32_TSC_ADJUST, adj->adjusted);
 
-	if (!adj->warned) {
+	if (!adj->warned || resume) {
 		pr_warn(FW_BUG "TSC ADJUST differs: CPU%u %lld --> %lld. Restoring\n",
 			smp_processor_id(), adj->adjusted, curval);
 		adj->warned = true;
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -256,6 +256,7 @@ static void notrace __restore_processor_
 	mtrr_bp_restore();
 	perf_restore_debug_store();
 	msr_restore_context(ctxt);
+	tsc_verify_tsc_adjust(true);
 }
 
 /* Needed by apm.c */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -133,7 +133,9 @@ static int cpuhp_invoke_callback(unsigne
 		if (!cb)
 			return 0;
 		trace_cpuhp_enter(cpu, st->target, state, cb);
+		pr_info("Invoking state %d CB %pF\n", state, cb);
 		ret = cb(cpu);
+		pr_info("CB returned %d\n", ret);
 		trace_cpuhp_exit(cpu, st->state, state, ret);
 		return ret;
 	}
@@ -153,7 +155,10 @@ static int cpuhp_invoke_callback(unsigne
 	cnt = 0;
 	hlist_for_each(node, &step->list) {
 		trace_cpuhp_multi_enter(cpu, st->target, state, cbm, node);
+		pr_info("Invoking state %d CB %pF instance %p\n", state, cbm,
+			node);
 		ret = cbm(cpu, node);
+		pr_info("CB returned %d\n", ret);
 		trace_cpuhp_exit(cpu, st->state, state, ret);
 		if (ret)
 			goto err;
@@ -550,6 +555,7 @@ static void cpuhp_thread_fun(unsigned in
 
 	st->should_run = false;
 
+	pr_info("CPUHP thread running\n");
 	/* Single callback invocation for [un]install ? */
 	if (st->single) {
 		if (st->cb_state < CPUHP_AP_ONLINE) {