2017-12-13 08:34:04

by Dan Aloni

Subject: TSC x86 fixes for LTS kernel 4.9.x

Hi all,

I've tested the following changes, belonging to merge commit f7dd3b1734e,
on top of 4.9.68 after a very easy backport from 4.10, and I think it
may be worthwhile adding them to 4.9.x:

x86/tsc: Limit the adjust value further
x86/tsc: Annotate printouts as firmware bug
x86/tsc: Force TSC_ADJUST register to value >= zero
x86/tsc: Validate TSC_ADJUST after resume
x86/tsc: Validate cpumask pointer before accessing it
x86/tsc: Fix broken CONFIG_X86_TSC=n build
x86/tsc: Try to adjust TSC if sync test fails
x86/tsc: Prepare warp test for TSC adjustment
x86/tsc: Move sync cleanup to a safe place
x86/tsc: Sync test only for the first cpu in a package
x86/tsc: Verify TSC_ADJUST from idle
x86/tsc: Store and check TSC ADJUST MSR
x86/tsc: Detect random warps
x86/tsc: Use X86_FEATURE_TSC_ADJUST in detect_art()
x86/tsc: Finalize the split of the TSC_RELIABLE flag
x86/tsc: Set TSC_KNOWN_FREQ and TSC_RELIABLE flags on Intel Atom SoCs
x86/tsc: Mark Intel ATOM_GOLDMONT TSC reliable
x86/tsc: Mark TSC frequency determined by CPUID as known
x86/tsc: Add X86_FEATURE_TSC_KNOWN_FREQ flag

These changes precisely fix an issue I am having with a relatively new
8-core Intel(R) Core(TM) i7-7820X with an updated ASUS BIOS (December 2017).

Under v4.9.68, the kernel falls back to HPET as the clocksource, which
just doesn't work - there is a time drift of over 200ms that does not go
away even after repeated ntpdate sync attempts.
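
To check which clocksource a running kernel ended up with, the standard
sysfs files can be read. The following is only a minimal sketch, not part of
the tested patches: the function names are made up for illustration, and it
assumes the usual /sys/devices/system/clocksource/clocksource0 layout.

```python
from pathlib import Path

# Usual sysfs location of the clocksource controls on Linux.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(sysfs_dir=SYSFS_DIR):
    """Return (current, available) clocksources, or (None, []) if unreadable."""
    base = Path(sysfs_dir)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        # Not Linux, no sysfs, or insufficient permissions.
        return None, []
    return current, available


def fell_back_from_tsc(current):
    """Hypothetical helper: True when a clocksource other than TSC is active."""
    return current is not None and current != "tsc"


if __name__ == "__main__":
    current, available = read_clocksource()
    print("current:  ", current)
    print("available:", " ".join(available))
    if fell_back_from_tsc(current):
        print("TSC is not the active clocksource")
```

On a machine showing the problem described above, the first line would be
expected to report hpet rather than tsc.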

For further testing I've posted a branch for these changes here:

https://github.com/kernelim/linux tsc-fix-for-4.9.x

--
Dan Aloni


2017-12-13 09:03:48

by Greg Kroah-Hartman

Subject: Re: TSC x86 fixes for LTS kernel 4.9.x

On Wed, Dec 13, 2017 at 10:33:52AM +0200, Dan Aloni wrote:
> Hi all,
>
> I've tested the following changes, belonging to merge commit f7dd3b1734e,
> on top of 4.9.68 after a very easy backport from 4.10, and I think it
> may be worthwhile adding them to 4.9.x:
>
> x86/tsc: Limit the adjust value further
> x86/tsc: Annotate printouts as firmware bug
> x86/tsc: Force TSC_ADJUST register to value >= zero
> x86/tsc: Validate TSC_ADJUST after resume
> x86/tsc: Validate cpumask pointer before accessing it
> x86/tsc: Fix broken CONFIG_X86_TSC=n build
> x86/tsc: Try to adjust TSC if sync test fails
> x86/tsc: Prepare warp test for TSC adjustment
> x86/tsc: Move sync cleanup to a safe place
> x86/tsc: Sync test only for the first cpu in a package
> x86/tsc: Verify TSC_ADJUST from idle
> x86/tsc: Store and check TSC ADJUST MSR
> x86/tsc: Detect random warps
> x86/tsc: Use X86_FEATURE_TSC_ADJUST in detect_art()
> x86/tsc: Finalize the split of the TSC_RELIABLE flag
> x86/tsc: Set TSC_KNOWN_FREQ and TSC_RELIABLE flags on Intel Atom SoCs
> x86/tsc: Mark Intel ATOM_GOLDMONT TSC reliable
> x86/tsc: Mark TSC frequency determined by CPUID as known
> x86/tsc: Add X86_FEATURE_TSC_KNOWN_FREQ flag

I need git commit ids to be able to do anything :)

> These changes precisely fix an issue I am having with a relatively new
> 8-core Intel(R) Core(TM) i7-7820X with an updated ASUS BIOS (December 2017).
>
> Under v4.9.68, the kernel falls back to HPET as the clocksource, which
> just doesn't work - there is a time drift of over 200ms that does not go
> away even after repeated ntpdate sync attempts.
>
> For further testing I've posted a branch for these changes here:
>
> https://github.com/kernelim/linux tsc-fix-for-4.9.x

Why not just use 4.14 instead? That's much easier than trying to use an
old kernel like 4.9, right?

thanks,

greg k-h

2017-12-13 09:45:29

by Dan Aloni

Subject: Re: TSC x86 fixes for LTS kernel 4.9.x

On Wed, Dec 13, 2017 at 10:03:35AM +0100, Greg KH wrote:
> On Wed, Dec 13, 2017 at 10:33:52AM +0200, Dan Aloni wrote:
> > Hi all,
> >
> > I've tested the following changes, belonging to merge commit f7dd3b1734e,
> > on top of 4.9.68 after a very easy backport from 4.10, and I think it
> > may be worthwhile adding them to 4.9.x:
> >
[..]
>
> I need git commit ids to be able to do anything :)

Sure, how about:

# git log 8c9b9d87b855 --oneline -n 19 --reverse --pretty="%h # %s" | awk -F" " '{print "git cherry-pick -x " $0}'

git cherry-pick -x 47c95a46d0fa # x86/tsc: Add X86_FEATURE_TSC_KNOWN_FREQ flag
git cherry-pick -x 4ca4df0b7eb0 # x86/tsc: Mark TSC frequency determined by CPUID as known
git cherry-pick -x 4635fdc696a8 # x86/tsc: Mark Intel ATOM_GOLDMONT TSC reliable
git cherry-pick -x f3a02ecebed7 # x86/tsc: Set TSC_KNOWN_FREQ and TSC_RELIABLE flags on Intel Atom SoCs
git cherry-pick -x 984fecebda3b # x86/tsc: Finalize the split of the TSC_RELIABLE flag
git cherry-pick -x 7b3d2f6e08ed # x86/tsc: Use X86_FEATURE_TSC_ADJUST in detect_art()
git cherry-pick -x bec8520dca0d # x86/tsc: Detect random warps
git cherry-pick -x 8b223bc7abe0 # x86/tsc: Store and check TSC ADJUST MSR
git cherry-pick -x 1d0095feea59 # x86/tsc: Verify TSC_ADJUST from idle
git cherry-pick -x a36f5136814b # x86/tsc: Sync test only for the first cpu in a package
git cherry-pick -x 4c5e3c637521 # x86/tsc: Move sync cleanup to a safe place
git cherry-pick -x 76d3b8515850 # x86/tsc: Prepare warp test for TSC adjustment
git cherry-pick -x cc4db26899dc # x86/tsc: Try to adjust TSC if sync test fails
git cherry-pick -x b836554386cc # x86/tsc: Fix broken CONFIG_X86_TSC=n build
git cherry-pick -x 31f8a651fc57 # x86/tsc: Validate cpumask pointer before accessing it
git cherry-pick -x 6a369583178d # x86/tsc: Validate TSC_ADJUST after resume
git cherry-pick -x 5bae156241e0 # x86/tsc: Force TSC_ADJUST register to value >= zero
git cherry-pick -x 16588f659257 # x86/tsc: Annotate printouts as firmware bug
git cherry-pick -x 8c9b9d87b855 # x86/tsc: Limit the adjust value further

There's a conflict in only one small place, in the first few patches.
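
For anyone who prefers to regenerate the list above without awk, here is a
rough sketch. It assumes the input lines come from something like
`git log --reverse --pretty="%h %s"` (hash, one space, subject), and the
function name is hypothetical.

```python
def cherry_pick_commands(log_lines):
    """Turn 'hash subject' lines (oldest first) into cherry-pick commands."""
    commands = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the input
        # Split at the first space only, so the subject stays intact.
        commit, _, subject = line.partition(" ")
        commands.append(f"git cherry-pick -x {commit} # {subject}")
    return commands


if __name__ == "__main__":
    # Two entries taken from the list in this mail, oldest first.
    sample = [
        "47c95a46d0fa x86/tsc: Add X86_FEATURE_TSC_KNOWN_FREQ flag",
        "4ca4df0b7eb0 x86/tsc: Mark TSC frequency determined by CPUID as known",
    ]
    for command in cherry_pick_commands(sample):
        print(command)
```

Keeping the input oldest-first matters, since later patches in the series
build on the earlier ones.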

> > These changes precisely fix an issue I am having with a relatively new
> > 8-core Intel(R) Core(TM) i7-7820X with an updated ASUS BIOS (December 2017).
> >
> > Under v4.9.68, the kernel falls back to HPET as the clocksource, which
> > just doesn't work - there is a time drift of over 200ms that does not go
> > away even after repeated ntpdate sync attempts.
> >
> > For further testing I've posted a branch for these changes here:
> >
> > https://github.com/kernelim/linux tsc-fix-for-4.9.x
>
> Why not just use 4.14 instead? That's much easier than trying to use an
> old kernel like 4.9, right?

Yes, however the mileage of 4.9.x seems somewhat more appealing.

I'll give 4.14.x a try mostly to see whether it solves hard locks that
I've seen with 4.13.x (all Fedora-based stable kernels) on three of my
machines -- an unrelated issue, and the main reason why I gave one of
the LTS branches a try.

--
Dan Aloni

2017-12-13 09:57:56

by Greg Kroah-Hartman

Subject: Re: TSC x86 fixes for LTS kernel 4.9.x

On Wed, Dec 13, 2017 at 11:45:20AM +0200, Dan Aloni wrote:
> On Wed, Dec 13, 2017 at 10:03:35AM +0100, Greg KH wrote:
> > On Wed, Dec 13, 2017 at 10:33:52AM +0200, Dan Aloni wrote:
> > > Hi all,
> > >
> > > I've tested the following changes, belonging to merge commit f7dd3b1734e,
> > > on top of 4.9.68 after a very easy backport from 4.10, and I think it
> > > may be worthwhile adding them to 4.9.x:
> > >
> [..]
> >
> > I need git commit ids to be able to do anything :)
>
> Sure, how about:
>
> # git log 8c9b9d87b855 --oneline -n 19 --reverse --pretty="%h # %s" | awk -F" " '{print "git cherry-pick -x " $0}'
>
> git cherry-pick -x 47c95a46d0fa # x86/tsc: Add X86_FEATURE_TSC_KNOWN_FREQ flag
> git cherry-pick -x 4ca4df0b7eb0 # x86/tsc: Mark TSC frequency determined by CPUID as known
> git cherry-pick -x 4635fdc696a8 # x86/tsc: Mark Intel ATOM_GOLDMONT TSC reliable
> git cherry-pick -x f3a02ecebed7 # x86/tsc: Set TSC_KNOWN_FREQ and TSC_RELIABLE flags on Intel Atom SoCs
> git cherry-pick -x 984fecebda3b # x86/tsc: Finalize the split of the TSC_RELIABLE flag
> git cherry-pick -x 7b3d2f6e08ed # x86/tsc: Use X86_FEATURE_TSC_ADJUST in detect_art()
> git cherry-pick -x bec8520dca0d # x86/tsc: Detect random warps
> git cherry-pick -x 8b223bc7abe0 # x86/tsc: Store and check TSC ADJUST MSR
> git cherry-pick -x 1d0095feea59 # x86/tsc: Verify TSC_ADJUST from idle
> git cherry-pick -x a36f5136814b # x86/tsc: Sync test only for the first cpu in a package
> git cherry-pick -x 4c5e3c637521 # x86/tsc: Move sync cleanup to a safe place
> git cherry-pick -x 76d3b8515850 # x86/tsc: Prepare warp test for TSC adjustment
> git cherry-pick -x cc4db26899dc # x86/tsc: Try to adjust TSC if sync test fails
> git cherry-pick -x b836554386cc # x86/tsc: Fix broken CONFIG_X86_TSC=n build
> git cherry-pick -x 31f8a651fc57 # x86/tsc: Validate cpumask pointer before accessing it
> git cherry-pick -x 6a369583178d # x86/tsc: Validate TSC_ADJUST after resume
> git cherry-pick -x 5bae156241e0 # x86/tsc: Force TSC_ADJUST register to value >= zero
> git cherry-pick -x 16588f659257 # x86/tsc: Annotate printouts as firmware bug
> git cherry-pick -x 8c9b9d87b855 # x86/tsc: Limit the adjust value further
>
> There's a conflict in only one small place, in the first few patches.

That's a lot of changes to be backported. I'm _really_ hesitant to do
this, unless the maintainer of the code agrees it is ok...

> > > These changes precisely fix an issue I am having with a relatively new
> > > 8-core Intel(R) Core(TM) i7-7820X with an updated ASUS BIOS (December 2017).
> > >
> > > Under v4.9.68, the kernel falls back to HPET as the clocksource, which
> > > just doesn't work - there is a time drift of over 200ms that does not go
> > > away even after repeated ntpdate sync attempts.
> > >
> > > For further testing I've posted a branch for these changes here:
> > >
> > > https://github.com/kernelim/linux tsc-fix-for-4.9.x
> >
> > Why not just use 4.14 instead? That's much easier than trying to use an
> > old kernel like 4.9, right?
>
> Yes, however the mileage of 4.9.x seems somewhat more appealing.

Why? 4.14 should be much better, it's newer, has more hardware support,
more bugs fixed, and more new things left to debug :)

> I'll give 4.14.x a try mostly to see whether it solves hard locks that
> I've seen with 4.13.x (all Fedora-based stable kernels) on three of my
> machines -- an unrelated issue, and the main reason why I gave one of
> the LTS branches a try.

You really should report that. Without that, odds are it will not be
fixed.

thanks,

greg k-h

2017-12-13 10:14:38

by Dan Aloni

Subject: Re: TSC x86 fixes for LTS kernel 4.9.x

On Wed, Dec 13, 2017 at 10:57:55AM +0100, Greg KH wrote:
> On Wed, Dec 13, 2017 at 11:45:20AM +0200, Dan Aloni wrote:
> > git cherry-pick -x 16588f659257 # x86/tsc: Annotate printouts as firmware bug
> > git cherry-pick -x 8c9b9d87b855 # x86/tsc: Limit the adjust value further
> >
> > There's a conflict in only one small place, in the first few patches.
>[..]
> That's a lot of changes to be backported. I'm _really_ hesitant to do
> this, unless the maintainer of the code agrees it is ok...

I guessed so, that's why I probed. Otherwise I would have just sent
out patches.

> > > > These changes precisely fix an issue I am having with a relatively new
> > > > 8-core Intel(R) Core(TM) i7-7820X with an updated ASUS BIOS (December 2017).
> > > >
> > > > Under v4.9.68, the kernel falls back to HPET as the clocksource, which
> > > > just doesn't work - there is a time drift of over 200ms that does not go
> > > > away even after repeated ntpdate sync attempts.
> > > >
> > > > For further testing I've posted a branch for these changes here:
> > > >
> > > > https://github.com/kernelim/linux tsc-fix-for-4.9.x
> > >
> > > Why not just use 4.14 instead? That's much easier than trying to use an
> > > old kernel like 4.9, right?
> >
> > Yes, however the mileage of 4.9.x seems somewhat more appealing.
>
> Why? 4.14 should be much better, it's newer, has more hardware support,
> more bugs fixed, and more new things left to debug :)

I always enjoy debugging :)

> > I'll give 4.14.x a try mostly to see whether it solves hard locks that
> > I've seen with 4.13.x (all Fedora-based stable kernels) on three of my
> > machines -- an unrelated issue, and the main reason why I gave one of
> > the LTS branches a try.
>
> You really should report that. Without that, odds are it will not be
> fixed.

I am still collecting data, but these systems are in near-constant use,
so the downtime is problematic. They are 1) a rather new workstation,
2) an Intel NUC, and 3) an old Lenovo X1 Carbon Gen 3.

I should have also used a vanilla build because I know that on LKML
it is preferred over the Fedora-based patchset. I will try to see
if it reproduces on 4.14.x and perhaps kdump will be able to capture
it this time.

--
Dan Aloni

2017-12-13 15:07:19

by Thomas Gleixner

Subject: Re: TSC x86 fixes for LTS kernel 4.9.x

On Wed, 13 Dec 2017, Greg KH wrote:
> On Wed, Dec 13, 2017 at 11:45:20AM +0200, Dan Aloni wrote:
> > # git log 8c9b9d87b855 --oneline -n 19 --reverse --pretty="%h # %s" | awk -F" " '{print "git cherry-pick -x " $0}'
> >
> > git cherry-pick -x 47c95a46d0fa # x86/tsc: Add X86_FEATURE_TSC_KNOWN_FREQ flag
> > git cherry-pick -x 4ca4df0b7eb0 # x86/tsc: Mark TSC frequency determined by CPUID as known
> > git cherry-pick -x 4635fdc696a8 # x86/tsc: Mark Intel ATOM_GOLDMONT TSC reliable
> > git cherry-pick -x f3a02ecebed7 # x86/tsc: Set TSC_KNOWN_FREQ and TSC_RELIABLE flags on Intel Atom SoCs
> > git cherry-pick -x 984fecebda3b # x86/tsc: Finalize the split of the TSC_RELIABLE flag
> > git cherry-pick -x 7b3d2f6e08ed # x86/tsc: Use X86_FEATURE_TSC_ADJUST in detect_art()
> > git cherry-pick -x bec8520dca0d # x86/tsc: Detect random warps
> > git cherry-pick -x 8b223bc7abe0 # x86/tsc: Store and check TSC ADJUST MSR
> > git cherry-pick -x 1d0095feea59 # x86/tsc: Verify TSC_ADJUST from idle
> > git cherry-pick -x a36f5136814b # x86/tsc: Sync test only for the first cpu in a package
> > git cherry-pick -x 4c5e3c637521 # x86/tsc: Move sync cleanup to a safe place
> > git cherry-pick -x 76d3b8515850 # x86/tsc: Prepare warp test for TSC adjustment
> > git cherry-pick -x cc4db26899dc # x86/tsc: Try to adjust TSC if sync test fails
> > git cherry-pick -x b836554386cc # x86/tsc: Fix broken CONFIG_X86_TSC=n build
> > git cherry-pick -x 31f8a651fc57 # x86/tsc: Validate cpumask pointer before accessing it
> > git cherry-pick -x 6a369583178d # x86/tsc: Validate TSC_ADJUST after resume
> > git cherry-pick -x 5bae156241e0 # x86/tsc: Force TSC_ADJUST register to value >= zero
> > git cherry-pick -x 16588f659257 # x86/tsc: Annotate printouts as firmware bug
> > git cherry-pick -x 8c9b9d87b855 # x86/tsc: Limit the adjust value further
> >
> > There's a conflict in only one small place, in the first few patches.
>
> That's a lot of changes to be backported. I'm _really_ hesitant to do
> this, unless the maintainer of the code agrees it is ok...

Those TSC_ADJUST fixes are just an initial workaround. Peter has updated
that since then to the final and proper solution, which makes it dependent
on microcode version checks. If anything is backported at all, then the
whole lot wants to be backported, which is way more than the above set.

Thanks,

tglx