2017-04-21 17:09:42

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 0/9] sched_clock fixes

Hi,

These patches were inspired (and hopefully fix) two independent bug reports on
Core2 machines.

I never could quite reproduce one, but my Core2 machine no longer switches to
stable sched_clock and therefore no longer tickles the problematic stable ->
unstable transition either.

Before I dug up my Core2 machine, I tried emulating TSC wreckage by poking
random values into the TSC MSR from userspace. Behaviour in that case is
improved as well.

People have to realize that if we manage to boot with TSC 'stable' (both
sched_clock and clocksource) and we later find out we were mistaken (we observe
a TSC wobble) the clocks that are derived from it _will_ have had an observable
hickup. This is fundamentally unfixable.

If you own a machine where the BIOS tries to hide SMI latencies by rewinding
TSC (yes, this is a thing), the very best we can do is mark TSC unstable with a
boot parameter.

For example, this is me writing a stupid value into the TSC:

[ 46.745082] random: crng init done
[18443029775.010069] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[18443029775.023141] clocksource: 'hpet' wd_now: 3ebec538 wd_last: 3e486ec9 mask: ffffffff
[18443029775.034214] clocksource: 'tsc' cs_now: 5025acce9 cs_last: 24dc3bd21c88ee mask: ffffffffffffffff
[18443029775.046651] tsc: Marking TSC unstable due to clocksource watchdog
[18443029775.054211] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[18443029775.064434] sched_clock: Marking unstable (70569005835, -17833788)<-(-3714295689546517, -2965802361)
[ 70.573700] clocksource: Switched to clocksource hpet

With some trace_printk()s (not included) I could tell that the wobble
occured at 69.965474. The clock now resumes where it 'should' have been.

But an unfortunate scheduling event could have resulted in one task
having seen a runtime of ~584 years with 'obvious' effects. Similar
jumps can also be observed from userspace GTOD usage.



2017-04-25 09:32:00

by Lofstedt, Marta

[permalink] [raw]
Subject: RE: [PATCH 0/9] sched_clock fixes

Hi Peterz,

I tested your patch-set on the same Core2 machine as where we discovered the regression.
With the tsc=unstable boot param that passrate has improved significantly; 350 fails -> 15 fails.

BR,
Marta

> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Friday, April 21, 2017 5:58 PM
> To: [email protected]; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; Wysocki, Rafael J <[email protected]>;
> Lofstedt, Marta <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: [PATCH 0/9] sched_clock fixes
>
> Hi,
>
> These patches were inspired (and hopefully fix) two independent bug
> reports on
> Core2 machines.
>
> I never could quite reproduce one, but my Core2 machine no longer switches
> to stable sched_clock and therefore no longer tickles the problematic stable -
> > unstable transition either.
>
> Before I dug up my Core2 machine, I tried emulating TSC wreckage by poking
> random values into the TSC MSR from userspace. Behaviour in that case is
> improved as well.
>
> People have to realize that if we manage to boot with TSC 'stable' (both
> sched_clock and clocksource) and we later find out we were mistaken (we
> observe a TSC wobble) the clocks that are derived from it _will_ have had an
> observable hickup. This is fundamentally unfixable.
>
> If you own a machine where the BIOS tries to hide SMI latencies by
> rewinding TSC (yes, this is a thing), the very best we can do is mark TSC
> unstable with a boot parameter.
>
> For example, this is me writing a stupid value into the TSC:
>
> [ 46.745082] random: crng init done
> [18443029775.010069] clocksource: timekeeping watchdog on CPU0: Marking
> clocksource 'tsc' as unstable because the skew is too large:
> [18443029775.023141] clocksource: 'hpet' wd_now: 3ebec538
> wd_last: 3e486ec9 mask: ffffffff
> [18443029775.034214] clocksource: 'tsc' cs_now: 5025acce9 cs_last:
> 24dc3bd21c88ee mask: ffffffffffffffff
> [18443029775.046651] tsc: Marking TSC unstable due to clocksource
> watchdog [18443029775.054211] TSC found unstable after boot, most likely
> due to broken BIOS. Use 'tsc=unstable'.
> [18443029775.064434] sched_clock: Marking unstable (70569005835, -
> 17833788)<-(-3714295689546517, -2965802361)
> [ 70.573700] clocksource: Switched to clocksource hpet
>
> With some trace_printk()s (not included) I could tell that the wobble occured
> at 69.965474. The clock now resumes where it 'should' have been.
>
> But an unfortunate scheduling event could have resulted in one task having
> seen a runtime of ~584 years with 'obvious' effects. Similar jumps can also be
> observed from userspace GTOD usage.
>

2017-04-25 13:45:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched_clock fixes

On Tue, Apr 25, 2017 at 09:31:40AM +0000, Lofstedt, Marta wrote:
> Hi Peterz,
>
> I tested your patch-set on the same Core2 machine as where we discovered the regression.
> With the tsc=unstable boot param that passrate has improved significantly; 350 fails -> 15 fails.

So is that the same as before, or still worse? I don't really have a
handle on what your benchmark is here, nor what how 'good' is defined.

If its still worse than before, I'm completely confused. Because with
"tsc=unstable" the patch you fingered is a complete no-op (__gtod_offset
== 0).

2017-04-26 06:42:11

by Lofstedt, Marta

[permalink] [raw]
Subject: RE: [PATCH 0/9] sched_clock fixes

For bisecting the regression we ran 14 test for 50 repetitions.

Before the bisected regression:

commit 7b09cc5a9debc86c903c2eff8f8a1fdef773c649
Author: Pavel Tatashin <[email protected]>
Date: Wed Mar 22 16:24:17 2017 -0400

sched/clock: Fix broken stable to unstable transfer

there was ~0 failing test on the Core2 machine.
After regression ~350 failing tests.
With your patch-set ~15 failing tests.

To be honest, I must say that these test used to give unstable results on the Core2. But some time ago, the results magically stabilized at ~0 fails, by timing related fixes for other issues. Ville Syrjala now has a patch-set that we believe really solves the graphics parts of the issue. However, I believe that your patch-set still improves the situation related to the tsc instability of the Core2.

/Marta

> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Tuesday, April 25, 2017 4:45 PM
> To: Lofstedt, Marta <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; Wysocki, Rafael J
> <[email protected]>; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH 0/9] sched_clock fixes
>
> On Tue, Apr 25, 2017 at 09:31:40AM +0000, Lofstedt, Marta wrote:
> > Hi Peterz,
> >
> > I tested your patch-set on the same Core2 machine as where we
> discovered the regression.
> > With the tsc=unstable boot param that passrate has improved significantly;
> 350 fails -> 15 fails.
>
> So is that the same as before, or still worse? I don't really have a handle on
> what your benchmark is here, nor what how 'good' is defined.
>
> If its still worse than before, I'm completely confused. Because with
> "tsc=unstable" the patch you fingered is a complete no-op (__gtod_offset
> == 0).