2016-12-13 13:17:10

by Thomas Gleixner

[permalink] [raw]
Subject: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm

Roland reported interesting TSC ADJUST register wreckage on his DELL
machine, which seems to populate that MSR with a random number generator.

Deeper investagation into fixing this wreckage unearthed another special
feature which is designed by Intel: Negative TSC adjuste values cause
interrupt storms on the TSC deadline timer. Further details in patch 2/2

Thanks

tglx




2016-12-13 16:34:29

by Roland Scheidegger

[permalink] [raw]
Subject: Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm

Am 13.12.2016 um 14:14 schrieb Thomas Gleixner:
> Roland reported interesting TSC ADJUST register wreckage on his DELL
> machine, which seems to populate that MSR with a random number generator.

FWIW, I thought about the actual values some more and I don't actually
think they are all that random any more: the behavior is consistent with
the bios trying to zero the TSC of all cpus. If I understand this right,
writing a zero to TSC would cause somewhat small negative values in the
TSC_ADJ register at boot time, and larger negative values at suspend
time (at least if the TSC just stops when suspended and isn't reset) -
exactly what I'm seeing.
(And of course the different TSC_ADJ values would be because the bios is
writing TSC without any thoughts of synchronization, just one cpu after
another).

>
> Deeper investagation into fixing this wreckage unearthed another special
> feature which is designed by Intel: Negative TSC adjuste values cause
> interrupt storms on the TSC deadline timer. Further details in patch 2/2

This actually looks like quite a serious hw bug to me, shouldn't there
be an errata for such a bug?

And I still don't quite understand why the lockup doesn't happen after a
warm boot, there must be something different there...

(I didn't have the chance to test the patch yet.)

Roland


2016-12-13 16:50:26

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm

On Tue, 13 Dec 2016, Roland Scheidegger wrote:

> Am 13.12.2016 um 14:14 schrieb Thomas Gleixner:
> > Roland reported interesting TSC ADJUST register wreckage on his DELL
> > machine, which seems to populate that MSR with a random number generator.
>
> FWIW, I thought about the actual values some more and I don't actually
> think they are all that random any more: the behavior is consistent with
> the bios trying to zero the TSC of all cpus. If I understand this right,
> writing a zero to TSC would cause somewhat small negative values in the
> TSC_ADJ register at boot time, and larger negative values at suspend
> time (at least if the TSC just stops when suspended and isn't reset) -
> exactly what I'm seeing.
> (And of course the different TSC_ADJ values would be because the bios is
> writing TSC without any thoughts of synchronization, just one cpu after
> another).

Yeah, that might be. Still it looks like random nonsense and definitely the
BIOS developers did not follow the secrit boot protocol.

> > Deeper investagation into fixing this wreckage unearthed another special
> > feature which is designed by Intel: Negative TSC adjuste values cause
> > interrupt storms on the TSC deadline timer. Further details in patch 2/2
>
> This actually looks like quite a serious hw bug to me, shouldn't there
> be an errata for such a bug?
>
> And I still don't quite understand why the lockup doesn't happen after a
> warm boot, there must be something different there...

What are the adjust values after a warm boot?

Thanks,

tglx

2016-12-14 01:36:26

by Roland Scheidegger

[permalink] [raw]
Subject: Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm

Am 13.12.2016 um 17:46 schrieb Thomas Gleixner:
> On Tue, 13 Dec 2016, Roland Scheidegger wrote:
>
>> Am 13.12.2016 um 14:14 schrieb Thomas Gleixner:
>>> Roland reported interesting TSC ADJUST register wreckage on his DELL
>>> machine, which seems to populate that MSR with a random number generator.
>>
>> FWIW, I thought about the actual values some more and I don't actually
>> think they are all that random any more: the behavior is consistent with
>> the bios trying to zero the TSC of all cpus. If I understand this right,
>> writing a zero to TSC would cause somewhat small negative values in the
>> TSC_ADJ register at boot time, and larger negative values at suspend
>> time (at least if the TSC just stops when suspended and isn't reset) -
>> exactly what I'm seeing.
>> (And of course the different TSC_ADJ values would be because the bios is
>> writing TSC without any thoughts of synchronization, just one cpu after
>> another).
>
> Yeah, that might be. Still it looks like random nonsense and definitely the
> BIOS developers did not follow the secrit boot protocol.
>
>>> Deeper investagation into fixing this wreckage unearthed another special
>>> feature which is designed by Intel: Negative TSC adjuste values cause
>>> interrupt storms on the TSC deadline timer. Further details in patch 2/2
>>
>> This actually looks like quite a serious hw bug to me, shouldn't there
>> be an errata for such a bug?
>>
>> And I still don't quite understand why the lockup doesn't happen after a
>> warm boot, there must be something different there...
>
> What are the adjust values after a warm boot?
>

So, after cold boot with a kernel which doesn't adjust TSCs, then warm
boot I got:
[ 0.000000] TSC ADJUST: CPU0: -602358264300 176072418728
[ 0.000000] TSC ADJUST: Boot CPU0: -602358264300
[ 0.172245] TSC ADJUST: CPU1: -602360207584 176587932558
[ 0.172245] TSC ADJUST differs: Reference CPU0: -602358264300 CPU1:
-602360207584
[ 0.172246] TSC ADJUST synchronize: Reference CPU0: -602358264300
CPU1: -602360207584
[ 0.252663] TSC ADJUST: CPU2: -602359000822 176828627154
[ 0.252663] TSC ADJUST differs: Reference CPU0: -602358264300 CPU2:
-602359000822
[ 0.252664] TSC ADJUST synchronize: Reference CPU0: -602358264300
CPU2: -602359000822
[ 0.337014] TSC ADJUST: CPU3: -602360177680 177081093132
[ 0.337014] TSC ADJUST differs: Reference CPU0: -602358264300 CPU3:
-602360177680
[ 0.337015] TSC ADJUST synchronize: Reference CPU0: -602358264300
CPU3: -602360177680

and so on.

Albeit after another reboot (some minutes later), it actually straight
locked up again:

TSC ADJUST: CPU1: -8257481427958 165112676430
TSC ADJUST differs: Reference CPU0: -8257479484330 CPU1: -8257481427958
TSC ADJUST synchronize: Reference CPU0: -8257479484330 CPU1: -8254781427958
TSC target sync skip
...
smpboot: Target CPU is online

So, actually I thought the TSC would get reset too on warm boot, but
clearly looks like that isn't the case...
But I don't know what's the difference between first and second reboot -
the adjust values have just more magnitude, but otherwise even the
direction of the adjustments and everything looks all the same (just
like cold boot, which also looks all the same to me).

Roland

2016-12-14 08:14:55

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm

On Wed, 14 Dec 2016, Roland Scheidegger wrote:
> Am 13.12.2016 um 17:46 schrieb Thomas Gleixner:
> > What are the adjust values after a warm boot?
>
> So, after cold boot with a kernel which doesn't adjust TSCs, then warm
> boot I got:
> [ 0.000000] TSC ADJUST: CPU0: -602358264300 176072418728
> [ 0.000000] TSC ADJUST: Boot CPU0: -602358264300
> [ 0.172245] TSC ADJUST: CPU1: -602360207584 176587932558
> [ 0.172245] TSC ADJUST differs: Reference CPU0: -602358264300 CPU1:
> -602360207584
> [ 0.172246] TSC ADJUST synchronize: Reference CPU0: -602358264300
> CPU1: -602360207584
> [ 0.252663] TSC ADJUST: CPU2: -602359000822 176828627154
> [ 0.252663] TSC ADJUST differs: Reference CPU0: -602358264300 CPU2:
> -602359000822
> [ 0.252664] TSC ADJUST synchronize: Reference CPU0: -602358264300
> CPU2: -602359000822
> [ 0.337014] TSC ADJUST: CPU3: -602360177680 177081093132
> [ 0.337014] TSC ADJUST differs: Reference CPU0: -602358264300 CPU3:
> -602360177680
> [ 0.337015] TSC ADJUST synchronize: Reference CPU0: -602358264300
> CPU3: -602360177680
>
> and so on.
>
> Albeit after another reboot (some minutes later), it actually straight
> locked up again:
>
> TSC ADJUST: CPU1: -8257481427958 165112676430
> TSC ADJUST differs: Reference CPU0: -8257479484330 CPU1: -8257481427958
> TSC ADJUST synchronize: Reference CPU0: -8257479484330 CPU1: -8254781427958
> TSC target sync skip
> ...
> smpboot: Target CPU is online
>
> So, actually I thought the TSC would get reset too on warm boot, but
> clearly looks like that isn't the case...
> But I don't know what's the difference between first and second reboot -
> the adjust values have just more magnitude, but otherwise even the
> direction of the adjustments and everything looks all the same (just
> like cold boot, which also looks all the same to me).

I haven't found a pattern for the lockups yet and we have to wait for Intel
to provide useful information about that issue. All we know so far is that
negative adjust values are dangerous.

Could you test the two patches on top of tip x86/timers branch so we can
make progress with that whole disaster while waiting for Intel to come
forth with a proper explanation?

Thanks,

tglx

2016-12-14 21:02:47

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm

On Wed, 14 Dec 2016, Thomas Gleixner wrote:

> On Wed, 14 Dec 2016, Roland Scheidegger wrote:
> > Am 13.12.2016 um 17:46 schrieb Thomas Gleixner:
> > > What are the adjust values after a warm boot?
> >
> > So, after cold boot with a kernel which doesn't adjust TSCs, then warm
> > boot I got:
> > [ 0.000000] TSC ADJUST: CPU0: -602358264300 176072418728
> > [ 0.000000] TSC ADJUST: Boot CPU0: -602358264300
> > [ 0.172245] TSC ADJUST: CPU1: -602360207584 176587932558
> > [ 0.172245] TSC ADJUST differs: Reference CPU0: -602358264300 CPU1:
> > -602360207584
> > [ 0.172246] TSC ADJUST synchronize: Reference CPU0: -602358264300
> > CPU1: -602360207584
> > [ 0.252663] TSC ADJUST: CPU2: -602359000822 176828627154
> > [ 0.252663] TSC ADJUST differs: Reference CPU0: -602358264300 CPU2:
> > -602359000822
> > [ 0.252664] TSC ADJUST synchronize: Reference CPU0: -602358264300
> > CPU2: -602359000822
> > [ 0.337014] TSC ADJUST: CPU3: -602360177680 177081093132
> > [ 0.337014] TSC ADJUST differs: Reference CPU0: -602358264300 CPU3:
> > -602360177680
> > [ 0.337015] TSC ADJUST synchronize: Reference CPU0: -602358264300
> > CPU3: -602360177680
> >
> > and so on.
> >
> > Albeit after another reboot (some minutes later), it actually straight
> > locked up again:
> >
> > TSC ADJUST: CPU1: -8257481427958 165112676430
> > TSC ADJUST differs: Reference CPU0: -8257479484330 CPU1: -8257481427958
> > TSC ADJUST synchronize: Reference CPU0: -8257479484330 CPU1: -8254781427958
> > TSC target sync skip
> > ...
> > smpboot: Target CPU is online
> >
> > So, actually I thought the TSC would get reset too on warm boot, but
> > clearly looks like that isn't the case...
> > But I don't know what's the difference between first and second reboot -
> > the adjust values have just more magnitude, but otherwise even the
> > direction of the adjustments and everything looks all the same (just
> > like cold boot, which also looks all the same to me).
>
> I haven't found a pattern for the lockups yet and we have to wait for Intel
> to provide useful information about that issue. All we know so far is that
> negative adjust values are dangerous.

Did some futher investigation. The values which cause the interrupt storms
have very clear identifiable points which reliably reproduce:

Positive space, results in timer not firing anymore - at least not in a
time frame you are willing to wait for.

0x0000 0000 8000 0000

Negative space, results in an interrupt storm.

0xffff ffff 0000 0000
0xffff fffe 0000 0000
0xffff fffd 0000 0000
0xffff fffc 0000 0000
0xffff fffb 0000 0000
....

These points are independent of the underlying counter value (cold boot,
warm boot) and even reproduce after hours of power on reliably.

And looking at the values makes me wonder about 32bit vs. 64bit wreckage
combined with sign expansion done wrong. Im really impressed!

In the negative space there is something else going on which is dependent
on the counter value. Right after cold boot the space is closer to zero
than after hours of power on.

So the approach of forbidding negative values is definitely not wrong.

Thanks,

tglx

2016-12-14 21:43:22

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm

On Wed, 14 Dec 2016, Thomas Gleixner wrote:
> Positive space, results in timer not firing anymore - at least not in a
> time frame you are willing to wait for.
>
> 0x0000 0000 8000 0000
>
> Negative space, results in an interrupt storm.
>
> 0xffff ffff 0000 0000
> 0xffff fffe 0000 0000
> 0xffff fffd 0000 0000
> 0xffff fffc 0000 0000
> 0xffff fffb 0000 0000
> ....
>
> These points are independent of the underlying counter value (cold boot,
> warm boot) and even reproduce after hours of power on reliably.
>
> And looking at the values makes me wonder about 32bit vs. 64bit wreckage
> combined with sign expansion done wrong. Im really impressed!

And the whole mess stems from the fact that the deadline is not as one
would expect simply compared against the sum of the counter and the adjust
MSR.

No, they subtract the adjust value from the MSR when you write the deadline
and latch the result to compare it against the counter.

So when the following happens:

ADJUST = 0
RDTSC = 10000000
DEADLINE = 11000000

ADJUST = 1000000

INTERRUPT
RDTSC = 12000000

DEADLINE = 13000000

ADJUST = 0

INTERRUPT
RDTSC = 12000000

So depending on the direction of the adjustment the timer fires late or
early.

Combined with that math wreckage this is a complete disaster. And of course
nothing is documented anywhere and the SDM is outright wrong:

10.5.4.1 TSC-Deadline Mode

The processor generates a timer interrupt when the value of time-stamp
counter is greater than or equal to that of IA32_TSC_DEADLINE. It then
disarms the timer and clear the IA32_TSC_DEADLINE MSR. (Both the time-stamp
counter and the IA32_TSC_DEADLINE MSR are 64-bit unsigned integers.)

See the example above. 1200000 is neither equal nor greater than 1300000, at
least not in my universe.

I serioulsy doubt that Intel manages it to design at least ONE functional
non broken timer before I retire.

Thanks,

tglx

2016-12-14 22:54:38

by Roland Scheidegger

[permalink] [raw]
Subject: Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm

Am 14.12.2016 um 22:40 schrieb Thomas Gleixner:
> On Wed, 14 Dec 2016, Thomas Gleixner wrote:
>> Positive space, results in timer not firing anymore - at least not in a
>> time frame you are willing to wait for.
>>
>> 0x0000 0000 8000 0000
>>
>> Negative space, results in an interrupt storm.
>>
>> 0xffff ffff 0000 0000
>> 0xffff fffe 0000 0000
>> 0xffff fffd 0000 0000
>> 0xffff fffc 0000 0000
>> 0xffff fffb 0000 0000
>> ....
>>
>> These points are independent of the underlying counter value (cold boot,
>> warm boot) and even reproduce after hours of power on reliably.
>>
>> And looking at the values makes me wonder about 32bit vs. 64bit wreckage
>> combined with sign expansion done wrong. Im really impressed!
>
> And the whole mess stems from the fact that the deadline is not as one
> would expect simply compared against the sum of the counter and the adjust
> MSR.
Why would it be compared against the sum? As far as I can tell the adjust
value should never come into play when using deadline timer (other than
indirectly because the TSC would change).
(And I'd guess better avoid an armed deadline timer while changing
TSC_ADJ...)

In any case, I've tested the two patches on top of x86-timers and they
work just fine - all TSC_ADJ values get set to zero both on boot and
resume, no lockups, and tsc clocksource active, with some whining in the
log of course :-).
So,
Tested-by: Roland Scheidegger <[email protected]>

> No, they subtract the adjust value from the MSR when you write the deadline
> and latch the result to compare it against the counter.
>
> So when the following happens:
>
> ADJUST = 0
> RDTSC = 10000000
> DEADLINE = 11000000
>
> ADJUST = 1000000
>
> INTERRUPT
> RDTSC = 12000000
>
> DEADLINE = 13000000
>
> ADJUST = 0
>
> INTERRUPT
> RDTSC = 12000000
>
> So depending on the direction of the adjustment the timer fires late or
> early.
>
> Combined with that math wreckage this is a complete disaster. And of course
> nothing is documented anywhere and the SDM is outright wrong:
>
> 10.5.4.1 TSC-Deadline Mode
>
> The processor generates a timer interrupt when the value of time-stamp
> counter is greater than or equal to that of IA32_TSC_DEADLINE. It then
> disarms the timer and clear the IA32_TSC_DEADLINE MSR. (Both the time-stamp
> counter and the IA32_TSC_DEADLINE MSR are 64-bit unsigned integers.)
>
> See the example above. 1200000 is neither equal nor greater than 1300000, at
> least not in my universe.
>
> I serioulsy doubt that Intel manages it to design at least ONE functional
> non broken timer before I retire.
>
> Thanks,
>
> tglx
>

2016-12-15 09:34:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm

On Wed, 14 Dec 2016, Roland Scheidegger wrote:
> Am 14.12.2016 um 22:40 schrieb Thomas Gleixner:
> > And the whole mess stems from the fact that the deadline is not as one
> > would expect simply compared against the sum of the counter and the adjust
> > MSR.
> Why would it be compared against the sum? As far as I can tell the adjust
> value should never come into play when using deadline timer (other than
> indirectly because the TSC would change).

See the SDM. It suggests that the deadline is compared to the TSC value.

I don't care how it is implemented, but I very much care about it being
documented in the way it is implemented, which is obviously not the case.

And even if the adjust value is subtracted once when the timer is armed the
whole thing should not blow up due to 32bit/sign extension bugs or whatever
they decided to cobble together.

Thanks,

tglx