2008-10-21 01:44:09

by Alok Kataria

[permalink] [raw]
Subject: [PATCH 0/3] Improve TSC as a clocksource under VMware

Hi,

This patch set makes some changes to the TSC code, so that it is always
used as the default clocksource when running under VMware.
The first 2 patches do the ground work of adding code to detect if we
are running under VMware.
The third patch adds a tsc_reliable flag which tells us that TSC is
reliable on this system, so that we can skip some tests which can
potentially mark TSC as unstable. Details under individual patch
headers.

Please have a look.

Thanks,
Alok


2008-10-21 05:55:26

by Chris Snook

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

Alok Kataria wrote:
> This patch set makes some changes to the TSC code, so that it is always
> used as the default clocksource when running under VMware.
> The first 2 patches do the ground work of adding code to detect if we
> are running under VMware.
> The third patch adds a tsc_reliable flag which tells us that TSC is
> reliable on this system, so that we can skip some tests which can
> potentially mark TSC as unstable. Details under individual patch
> headers.

Do you intend for this to supplement Jeff Hansen's tsc=stable patch, or to
replace it? If you intend to supplement it, you should rework your submission
to make sure that both your autodetection and his command-line option use the
same internal mechanism, and if you intend to replace it, you should say this
explicitly, and justify it. Either way, you should definitely CC Jeff on the
submission, as I am doing now.

-- Chris

2008-10-21 09:43:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

Alok Kataria <[email protected]> writes:

> This patch set makes some changes to the TSC code, so that it is always
> used as the default clocksource when running under VMware.

Does this imply that vmware always emulates TSC reads?
Or how do you guarantee reliable TSC on a system where the underlying
TSC isn't?

It would be far nicer if VMware just emulated the "constant_tsc" bit
in the AMD CPUID leaf, instead of adding all that gunk to Linux.
Right now it's only checked for AMD CPUs, but that could be changed.

Otherwise with more and more hypervisors we mind end up with more
and more unscalable detection code.

-Andi

--
[email protected]

2008-10-21 09:50:18

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

> Hi,
>
> This patch set makes some changes to the TSC code, so that it is always
> used as the default clocksource when running under VMware.
> The first 2 patches do the ground work of adding code to detect if we
> are running under VMware.
> The third patch adds a tsc_reliable flag which tells us that TSC is
> reliable on this system, so that we can skip some tests which can
> potentially mark TSC as unstable. Details under individual patch
> headers.

How do you _know_ TSC is stable under VMWare? AFAICT, accesses to TSC
are not virtualized and system VMWare runs on may still use frequency
scaling, no?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-10-21 16:41:33

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

On Tue, 2008-10-21 at 02:43 -0700, Andi Kleen wrote:
> Alok Kataria <[email protected]> writes:
>
> > This patch set makes some changes to the TSC code, so that it is always
> > used as the default clocksource when running under VMware.
>
> Does this imply that vmware always emulates TSC reads?
> Or how do you guarantee reliable TSC on a system where the underlying
> TSC isn't?

Yes the hypervisor takes care of providing constant rate TSC to the
guest.

>
> It would be far nicer if VMware just emulated the "constant_tsc" bit
> in the AMD CPUID leaf, instead of adding all that gunk to Linux.
> Right now it's only checked for AMD CPUs, but that could be changed.

I am not sure if we might want to skip the tsc_sync code for all the
cpus which have this constant_tsc bit set, can we ?
Since i have seen that we can have cases where tsc is marked unstable as
its not found to be perfectly stable between cpus, we have to make sure
that we avoid this check when on VMware. Even with slight difference in
TSC between cpus, we know that TSC is the best available clocksource as
the hypervisor (VMware) makes sure that the drift is always marginal (if
ever there is).

Thanks,
Alok

>
> Otherwise with more and more hypervisors we mind end up with more
> and more unscalable detection code.
>
> -Andi
>
> --
> [email protected]

2008-10-21 16:48:25

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

On Tue, 2008-10-21 at 02:50 -0700, Pavel Machek wrote:
> > Hi,
> >
> > This patch set makes some changes to the TSC code, so that it is always
> > used as the default clocksource when running under VMware.
> > The first 2 patches do the ground work of adding code to detect if we
> > are running under VMware.
> > The third patch adds a tsc_reliable flag which tells us that TSC is
> > reliable on this system, so that we can skip some tests which can
> > potentially mark TSC as unstable. Details under individual patch
> > headers.
>
> How do you _know_ TSC is stable under VMWare?

Since we implement it that way :)
The hypervisor takes care of providing a constant rate TSC to the guest
on such systems which have variable frequency.

Thanks,
Alok

> AFAICT, accesses to TSC
> are not virtualized and system VMWare runs on may still use frequency
> scaling, no?
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-10-21 17:32:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

> > It would be far nicer if VMware just emulated the "constant_tsc" bit
> > in the AMD CPUID leaf, instead of adding all that gunk to Linux.
> > Right now it's only checked for AMD CPUs, but that could be changed.
>
> I am not sure if we might want to skip the tsc_sync code for all the
> cpus which have this constant_tsc bit set, can we ?

It should be ok, although normally sanity checks should be kept.

> Since i have seen that we can have cases where tsc is marked unstable as
> its not found to be perfectly stable between cpus, we have to make sure
> that we avoid this check when on VMware. Even with slight difference in

Normally tsc sync should only fail when the error is large.
It cannot detect very small derivation by design. It's more like
a sanity check "does the TSC sync look half way sane"

If it fails on VMware perhaps the margins are not big enough or
something else is fishy.

But still it could be skipped with constant_tsc. But again you
shouldn't really fail that check -- if you do fail are you
sure your clock is really synchronized enough that there
are no observable differences?

> TSC between cpus, we know that TSC is the best available clocksource as
> the hypervisor (VMware) makes sure that the drift is always marginal (if
> ever there is).

The drift has to be unobservable, otherwise you still risk
non monotonity. Also when there is drift it has to be short
term only, otherwise it would accumulate over longer times.

-Andi

--
[email protected]

2008-10-21 18:05:08

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

On Tue, 2008-10-21 at 10:40 -0700, Andi Kleen wrote:
> > > It would be far nicer if VMware just emulated the "constant_tsc" bit
> > > in the AMD CPUID leaf, instead of adding all that gunk to Linux.
> > > Right now it's only checked for AMD CPUs, but that could be changed.
> >
> > I am not sure if we might want to skip the tsc_sync code for all the
> > cpus which have this constant_tsc bit set, can we ?
>
> It should be ok, although normally sanity checks should be kept.
>
> > Since i have seen that we can have cases where tsc is marked unstable as
> > its not found to be perfectly stable between cpus, we have to make sure
> > that we avoid this check when on VMware. Even with slight difference in
>
> Normally tsc sync should only fail when the error is large.
> It cannot detect very small derivation by design. It's more like
> a sanity check "does the TSC sync look half way sane"

FWIU from the code, even if cpu A's TSC is just 1 tick behind that of
cpu B, we increment the nr_wraps value.
And the code expects that there are no wraps in TSC throughout the
20msec measurement window.
So IMO its fairly easy to fail this test.
>
> If it fails on VMware perhaps the margins are not big enough or
> something else is fishy.
>
> But still it could be skipped with constant_tsc. But again you
> shouldn't really fail that check -- if you do fail are you
> sure your clock is really synchronized enough that there
> are no observable differences?
>
> > TSC between cpus, we know that TSC is the best available clocksource as
> > the hypervisor (VMware) makes sure that the drift is always marginal (if
> > ever there is).
>
> The drift has to be unobservable, otherwise you still risk
> non monotonity. Also when there is drift it has to be short
> term only, otherwise it would accumulate over longer times.

Usually the problem is more pronounced during the boot process, the
TSC's between processors are not perfectly in sync at bootup but the
drift is not observable later, and from timekeeping perspective we are
perfectly fine.

Thanks,
Alok


>
> -Andi
>
> --
> [email protected]

2008-10-21 18:08:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

> FWIU from the code, even if cpu A's TSC is just 1 tick behind that of
> cpu B, we increment the nr_wraps value.
> And the code expects that there are no wraps in TSC throughout the
> 20msec measurement window.
> So IMO its fairly easy to fail this test.

Ok I think it would be better to enlarge the margin then to fix your
issue.

-Andi

2008-10-21 18:37:00

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

On Tue 2008-10-21 09:48:08, Alok Kataria wrote:
> On Tue, 2008-10-21 at 02:50 -0700, Pavel Machek wrote:
> > > Hi,
> > >
> > > This patch set makes some changes to the TSC code, so that it is always
> > > used as the default clocksource when running under VMware.
> > > The first 2 patches do the ground work of adding code to detect if we
> > > are running under VMware.
> > > The third patch adds a tsc_reliable flag which tells us that TSC is
> > > reliable on this system, so that we can skip some tests which can
> > > potentially mark TSC as unstable. Details under individual patch
> > > headers.
> >
> > How do you _know_ TSC is stable under VMWare?
>
> Since we implement it that way :)
> The hypervisor takes care of providing a constant rate TSC to the guest
> on such systems which have variable frequency.

Aha, so tsc accesses trap to VMWare? Ok, then...

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-10-21 19:10:47

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

On Tue, 2008-10-21 at 11:15 -0700, Andi Kleen wrote:
> > FWIU from the code, even if cpu A's TSC is just 1 tick behind that of
> > cpu B, we increment the nr_wraps value.
> > And the code expects that there are no wraps in TSC throughout the
> > 20msec measurement window.
> > So IMO its fairly easy to fail this test.
>
> Ok I think it would be better to enlarge the margin then to fix your
> issue.
>

Yeah, i thought about that, but given the different kinds of native
hardware that we have i think coming up with values that would be safe
for all the cases would be difficult. And given that nobody has
complained about this on native until now, i would prefer not to change
the status quo. Apart from that, even on VMware virtualized environment
there can different configurations where this margin keeps on varying,
under different scenarios, f.e. under overcommitment and all. So i don't
think that changing threshold values maybe safe for all cases.

I think going with ignoring this check for specific systems is the best
way to go further. I am open to either having a check for
CONSTANT_TSC or a new flag tsc_reliable (similar to my patch).

I hope you also agree with the CONSTANT_TSC check.

Thanks,
Alok


> -Andi

2008-10-21 19:12:00

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

On Mon, 2008-10-20 at 22:55 -0700, Chris Snook wrote:
> Alok Kataria wrote:
> > This patch set makes some changes to the TSC code, so that it is always
> > used as the default clocksource when running under VMware.
> > The first 2 patches do the ground work of adding code to detect if we
> > are running under VMware.
> > The third patch adds a tsc_reliable flag which tells us that TSC is
> > reliable on this system, so that we can skip some tests which can
> > potentially mark TSC as unstable. Details under individual patch
> > headers.
>
> Do you intend for this to supplement Jeff Hansen's tsc=stable patch, or to
> replace it? If you intend to supplement it, you should rework your submission
> to make sure that both your autodetection and his command-line option use the
> same internal mechanism, and if you intend to replace it, you should say this
> explicitly, and justify it. Either way, you should definitely CC Jeff on the
> submission, as I am doing now.

Hi Chris,

I had made this change to supplement Jeff's change, but I notice that
his change is not in any of the trees yet. Let me incorporate that in
the next iteration of this patch.

Thanks,
Alok
>
> -- Chris

2008-10-21 19:20:37

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

On Tue, Oct 21, 2008 at 12:10:36PM -0700, Alok Kataria wrote:
> On Tue, 2008-10-21 at 11:15 -0700, Andi Kleen wrote:
> > > FWIU from the code, even if cpu A's TSC is just 1 tick behind that of
> > > cpu B, we increment the nr_wraps value.
> > > And the code expects that there are no wraps in TSC throughout the
> > > 20msec measurement window.
> > > So IMO its fairly easy to fail this test.
> >
> > Ok I think it would be better to enlarge the margin then to fix your
> > issue.
> >
>
> Yeah, i thought about that, but given the different kinds of native
> hardware that we have i think coming up with values that would be safe
> for all the cases would be difficult. And given that nobody has
> complained about this on native until now, i would prefer not to change

It would certainly be the better solution (or trusting the CPUID bit)

Hypervisors are pretty popular these days and there are many different
flavours and if each of them adds their own detection code Linux will soon
get very crowded. It's just a non scalable bad solution.

> the status quo. Apart from that, even on VMware virtualized environment
> there can different configurations where this margin keeps on varying,
> under different scenarios, f.e. under overcommitment and all. So i don't
> think that changing threshold values maybe safe for all cases.

When the margin changes then it likely won't be good enough for
gettimeofday(). There are a couple of testers for monotonity around,
would such a system with shifting marging surive running them
for a few days?


>
> I think going with ignoring this check for specific systems is the best
> way to go further.

Disagree.


> I hope you also agree with the CONSTANT_TSC check.

Yes constant_tsc is good, although the question is how much sanity
check is still needed. I suspect even with it some sanity check
will still make sense.

And the question is if your HV even sets the bit?
Also you would need to change the Linux code to check it on Intel too.

-Andi

--
[email protected]

2008-10-21 19:47:53

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

On Tue, 2008-10-21 at 12:27 -0700, Andi Kleen wrote:
> On Tue, Oct 21, 2008 at 12:10:36PM -0700, Alok Kataria wrote:
> > On Tue, 2008-10-21 at 11:15 -0700, Andi Kleen wrote:

> > the status quo. Apart from that, even on VMware virtualized environment
> > there can different configurations where this margin keeps on varying,
> > under different scenarios, f.e. under overcommitment and all. So i don't
> > think that changing threshold values maybe safe for all cases.
>
> When the margin changes then it likely won't be good enough for
> gettimeofday(). There are a couple of testers for monotonity around,
> would such a system with shifting marging surive running them
> for a few days?

I meant, the margin changes when you are running different flavors of
the build, or running different user configurations (from overcommitment
POV). Once a system is booted the margin won't vary, apart from the
bootup stage which will always have some extreme states. In all our
internal testing that has been done, we have never noticed time drift is
withing the NTP threshold when its compared to the host system, under
varying load.

>
>
> >
> > I think going with ignoring this check for specific systems is the best
> > way to go further.
>
> Disagree.

Having a flag like tsc_reliable should be ok, right ? Systems which
provide constant rate TSC guarantees can set this bit and don't have to
pollute the code all over, this is only if we can't avoid the check for
CONSTANT_TSC systems (see below).

>
> > I hope you also agree with the CONSTANT_TSC check.
>
> Yes constant_tsc is good, although the question is how much sanity
> check is still needed. I suspect even with it some sanity check
> will still make sense.

Hmm, in one of the cases i have seen that the max_warp value was as high
as 105cycles, and that the nr_warps was as high as 1265, and this is
just one situation, given that the host can be overloaded in some cases,
this value may be high. So adding a sanity check for this is something
that i would prefer to avoid as it will always have false positives in
some cases.

>
> And the question is if your HV even sets the bit?
> Also you would need to change the Linux code to check it on Intel too.

It reflects the hardware state right now, though i am going to check
that this bit is set for future products.
Given that, i will still have to add a quirk to the kernel to enable
that capability bit for older products. This solution is acceptable
IMHO, as atleast it doesn't pollute the kernel code.
The quirk will handle the Intel case too for now, but once the bit is
added i will add code to check that leaf for intel too.

Thanks,
Alok
>
> -Andi
>
> --
> [email protected]

2008-10-21 21:42:15

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

On Tue, 2008-10-21 at 12:47 -0700, Alok Kataria wrote:
> On Tue, 2008-10-21 at 12:27 -0700, Andi Kleen wrote:
> > On Tue, Oct 21, 2008 at 12:10:36PM -0700, Alok Kataria wrote:
> > > On Tue, 2008-10-21 at 11:15 -0700, Andi Kleen wrote:

> > When the margin changes then it likely won't be good enough for
> > gettimeofday(). There are a couple of testers for monotonity around,
> > would such a system with shifting marging surive running them
> > for a few days?
>
> I meant, the margin changes when you are running different flavors of
> the build, or running different user configurations (from overcommitment
> POV). Once a system is booted the margin won't vary, apart from the
> bootup stage which will always have some extreme states. In all our
> internal testing that has been done,

> we have never noticed time drift is
> withing the NTP threshold when its compared to the host system, under
> varying load.
>

oops... i meant the time drift is _always_ within the NTP threshold.

2008-10-22 19:23:59

by Alok Kataria

[permalink] [raw]
Subject: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

Skip tsc synchronization checks if CONSTANT_TSC bit is set.

From: Alok N Kataria <[email protected]>

TSC synchronization checks between CPU's bail out even if we see
a distortion of a single cycle. This makes the TSC mostly unsuable
in a virtualized environment.

The CONSTANT_TSC bit tells us if the hardware exports a constant TSC,
we can use this bit to trust the hardware and skip the TSC sync checks
at bootup.

We also force set the CONSTANT_TSC capability bit if we are running on
VMware, since the VMware hypervisor exports a constant TSC to the guest.

Signed-off-by: Alok N Kataria <[email protected]>
Cc: Andi Kleen <[email protected]>
---

arch/x86/kernel/cpu/common.c | 1 +
arch/x86/kernel/tsc_sync.c | 8 +++++++-
2 files changed, 8 insertions(+), 1 deletions(-)


diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a48cb0e..720b583 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -452,6 +452,7 @@ void __cpuinit detect_hypervisor_vendor(struct cpuinfo_x86 *c)
{
if (vmware_platform()) {
c->x86_hyper_vendor = X86_HYPER_VENDOR_VMWARE;
+ set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
} else {
c->x86_hyper_vendor = X86_HYPER_VENDOR_NONE;
}
diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c
index 9ffb01c..c0c89b7 100644
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -108,6 +108,12 @@ void __cpuinit check_tsc_sync_source(int cpu)
if (unsynchronized_tsc())
return;

+ if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+ printk(KERN_INFO
+ "Skipping synchronization checks as TSC is constant.\n");
+ return;
+ }
+
printk(KERN_INFO "checking TSC synchronization [CPU#%d -> CPU#%d]:",
smp_processor_id(), cpu);

@@ -161,7 +167,7 @@ void __cpuinit check_tsc_sync_target(void)
{
int cpus = 2;

- if (unsynchronized_tsc())
+ if (unsynchronized_tsc() || boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
return;

/*

2008-10-22 19:26:38

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

Jeff, i have kept your Signed-off-by with this slightly modified patch,
incase you have any objections to that please let me know.

--
[X86] Skip verification by the watchdog for TSC clocksource.

From: Alok N Kataria <[email protected]>

This is achieved by resetting the CLOCKSOURCE_MUST_VERIFY flag.

We add a tsc=reliable commandline option to enable this.
This enables legacy hardware without HPET, LAPIC, or ACPI timers
to enter high-resolution timer mode.

Along with that have extended this to be used in virtualization
environment, just for VMware as yet. This is important since there
is a wrap-around problem with the acpi_pm timer.
The acpi_pm counter is just 24bits and this can overflow in 4 seconds.
With the NO_HZ kernels in virtualized environment, there can be situations
when the guest is descheduled for longer duration, as a result we may miss
the wrap of the acpi counter. When TSC is used as a clocksource and acpi_pm
timer is being used as the watchdog clocksource this error in acpi_pm
results in TSC being marked as unstable, and essentially results in time
dropping in chunks of 4 seconds whenever this wrap is missed. Since the
virtualized TSC is reliable on VMware, we should always use the TSCs
clocksource on VMware, so we skip the verfication at runtime.

Since we reset the flag for mgeode systems too, i have combined
the mgeode case with the check for VMware.

Signed-off-by: Jeff Hansen <[email protected]>
Signed-off-by: Alok N Kataria <[email protected]>
Cc: Chris Snook <[email protected]>
---

Documentation/kernel-parameters.txt | 7 +++++++
arch/x86/kernel/tsc.c | 33 +++++++++++++++++++++------------
2 files changed, 28 insertions(+), 12 deletions(-)


diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e82c5bc..c5eb028 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2230,6 +2230,13 @@ and is between 256 and 4096 characters. It is defined in the file
Format:
<io>,<irq>,<dma>,<dma2>,<sb_io>,<sb_irq>,<sb_dma>,<mpu_io>,<mpu_irq>

+ tsc= Disable clocksource-must-verify flag for TSC.
+ Format: <string>
+ [x86] reliable: mark tsc clocksource as reliable, this
+ disables clocksource verification at runtime.
+ Used to enable high-resolution timer mode on older
+ hardware, and in virtualized environment.
+
turbografx.map[2|3]= [HW,JOY]
TurboGraFX parallel port interface
Format:
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 4ae207c..24eff09 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -32,6 +32,7 @@ static int tsc_unstable;
erroneous rdtsc usage on !cpu_has_tsc processors */
static int tsc_disabled = -1;

+static int tsc_clocksource_reliable;
/*
* Scheduler clock - returns current time in nanosec units.
*/
@@ -99,6 +100,15 @@ int __init notsc_setup(char *str)

__setup("notsc", notsc_setup);

+static int __init tsc_setup(char *str)
+{
+ if (!strcmp(str, "reliable"))
+ tsc_clocksource_reliable = 1;
+ return 1;
+}
+
+__setup("tsc=", tsc_setup);
+
#define MAX_RETRIES 5
#define SMI_TRESHOLD 50000

@@ -745,24 +755,21 @@ static struct dmi_system_id __initdata bad_tsc_dmi_table[] = {
{}
};

-/*
- * Geode_LX - the OLPC CPU has a possibly a very reliable TSC
- */
+static void __init check_system_tsc_reliable(void)
+{
#ifdef CONFIG_MGEODE_LX
-/* RTSC counts during suspend */
+ /* RTSC counts during suspend */
#define RTSC_SUSP 0x100
-
-static void __init check_geode_tsc_reliable(void)
-{
unsigned long res_low, res_high;

rdmsr_safe(MSR_GEODE_BUSCONT_CONF0, &res_low, &res_high);
+ /* Geode_LX - the OLPC CPU has a possibly a very reliable TSC */
if (res_low & RTSC_SUSP)
- clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
-}
-#else
-static inline void check_geode_tsc_reliable(void) { }
+ tsc_clocksource_reliable = 1;
#endif
+ if (boot_cpu_data.x86_hyper_vendor == X86_HYPER_VENDOR_VMWARE)
+ tsc_clocksource_reliable = 1;
+}

/*
* Make an educated guess if the TSC is trustworthy and synchronized
@@ -797,6 +804,8 @@ static void __init init_tsc_clocksource(void)
{
clocksource_tsc.mult = clocksource_khz2mult(tsc_khz,
clocksource_tsc.shift);
+ if (tsc_clocksource_reliable)
+ clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
/* lower the rating if we already know its unstable: */
if (check_tsc_unstable()) {
clocksource_tsc.rating = 0;
@@ -857,7 +866,7 @@ void __init tsc_init(void)
if (unsynchronized_tsc())
mark_tsc_unstable("TSCs unsynchronized");

- check_geode_tsc_reliable();
+ check_system_tsc_reliable();
init_tsc_clocksource();
}


2008-10-22 19:26:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.


* Alok Kataria <[email protected]> wrote:

> Skip tsc synchronization checks if CONSTANT_TSC bit is set.
>
> From: Alok N Kataria <[email protected]>
>
> TSC synchronization checks between CPU's bail out even if we see a
> distortion of a single cycle. This makes the TSC mostly unsuable in a
> virtualized environment.
>
> The CONSTANT_TSC bit tells us if the hardware exports a constant TSC,
> we can use this bit to trust the hardware and skip the TSC sync checks
> at bootup.

the sync check is there to check the _offset_ between CPUs. CONSTANT_TSC
is not a guarantee that the TSC will be coherent across all CPUs.

so this patch is fundamentally wrong.

Ingo

2008-10-22 19:29:45

by Jeff Hansen

[permalink] [raw]
Subject: Re: [PATCH 0/3] Improve TSC as a clocksource under VMware

Alok,

No objections. Looks good to me.

-Jeff

Alok Kataria wrote:
> Jeff, i have kept your Signed-off-by with this slightly modified patch,
> incase you have any objections to that please let me know.
>
> --
> [X86] Skip verification by the watchdog for TSC clocksource.
>
> From: Alok N Kataria <[email protected]>
>
> This is achieved by resetting the CLOCKSOURCE_MUST_VERIFY flag.
>
> We add a tsc=reliable commandline option to enable this.
> This enables legacy hardware without HPET, LAPIC, or ACPI timers
> to enter high-resolution timer mode.
>
> Along with that have extended this to be used in virtualization
> environment, just for VMware as yet. This is important since there
> is a wrap-around problem with the acpi_pm timer.
> The acpi_pm counter is just 24bits and this can overflow in 4 seconds.
> With the NO_HZ kernels in virtualized environment, there can be situations
> when the guest is descheduled for longer duration, as a result we may miss
> the wrap of the acpi counter. When TSC is used as a clocksource and acpi_pm
> timer is being used as the watchdog clocksource this error in acpi_pm
> results in TSC being marked as unstable, and essentially results in time
> dropping in chunks of 4 seconds whenever this wrap is missed. Since the
> virtualized TSC is reliable on VMware, we should always use the TSCs
> clocksource on VMware, so we skip the verfication at runtime.
>
> Since we reset the flag for mgeode systems too, i have combined
> the mgeode case with the check for VMware.
>
> Signed-off-by: Jeff Hansen <[email protected]>
> Signed-off-by: Alok N Kataria <[email protected]>
> Cc: Chris Snook <[email protected]>
> ---
>
> Documentation/kernel-parameters.txt | 7 +++++++
> arch/x86/kernel/tsc.c | 33 +++++++++++++++++++++------------
> 2 files changed, 28 insertions(+), 12 deletions(-)
>
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index e82c5bc..c5eb028 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2230,6 +2230,13 @@ and is between 256 and 4096 characters. It is defined in the file
> Format:
> <io>,<irq>,<dma>,<dma2>,<sb_io>,<sb_irq>,<sb_dma>,<mpu_io>,<mpu_irq>
>
> + tsc= Disable clocksource-must-verify flag for TSC.
> + Format: <string>
> + [x86] reliable: mark tsc clocksource as reliable, this
> + disables clocksource verification at runtime.
> + Used to enable high-resolution timer mode on older
> + hardware, and in virtualized environment.
> +
> turbografx.map[2|3]= [HW,JOY]
> TurboGraFX parallel port interface
> Format:
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 4ae207c..24eff09 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -32,6 +32,7 @@ static int tsc_unstable;
> erroneous rdtsc usage on !cpu_has_tsc processors */
> static int tsc_disabled = -1;
>
> +static int tsc_clocksource_reliable;
> /*
> * Scheduler clock - returns current time in nanosec units.
> */
> @@ -99,6 +100,15 @@ int __init notsc_setup(char *str)
>
> __setup("notsc", notsc_setup);
>
> +static int __init tsc_setup(char *str)
> +{
> + if (!strcmp(str, "reliable"))
> + tsc_clocksource_reliable = 1;
> + return 1;
> +}
> +
> +__setup("tsc=", tsc_setup);
> +
> #define MAX_RETRIES 5
> #define SMI_TRESHOLD 50000
>
> @@ -745,24 +755,21 @@ static struct dmi_system_id __initdata bad_tsc_dmi_table[] = {
> {}
> };
>
> -/*
> - * Geode_LX - the OLPC CPU has a possibly a very reliable TSC
> - */
> +static void __init check_system_tsc_reliable(void)
> +{
> #ifdef CONFIG_MGEODE_LX
> -/* RTSC counts during suspend */
> + /* RTSC counts during suspend */
> #define RTSC_SUSP 0x100
> -
> -static void __init check_geode_tsc_reliable(void)
> -{
> unsigned long res_low, res_high;
>
> rdmsr_safe(MSR_GEODE_BUSCONT_CONF0, &res_low, &res_high);
> + /* Geode_LX - the OLPC CPU has a possibly a very reliable TSC */
> if (res_low & RTSC_SUSP)
> - clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
> -}
> -#else
> -static inline void check_geode_tsc_reliable(void) { }
> + tsc_clocksource_reliable = 1;
> #endif
> + if (boot_cpu_data.x86_hyper_vendor == X86_HYPER_VENDOR_VMWARE)
> + tsc_clocksource_reliable = 1;
> +}
>
> /*
> * Make an educated guess if the TSC is trustworthy and synchronized
> @@ -797,6 +804,8 @@ static void __init init_tsc_clocksource(void)
> {
> clocksource_tsc.mult = clocksource_khz2mult(tsc_khz,
> clocksource_tsc.shift);
> + if (tsc_clocksource_reliable)
> + clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
> /* lower the rating if we already know its unstable: */
> if (check_tsc_unstable()) {
> clocksource_tsc.rating = 0;
> @@ -857,7 +866,7 @@ void __init tsc_init(void)
> if (unsynchronized_tsc())
> mark_tsc_unstable("TSCs unsynchronized");
>
> - check_geode_tsc_reliable();
> + check_system_tsc_reliable();
> init_tsc_clocksource();
> }
>
>
>
>


--
---------------------------------------------------
"If someone's gotta do it, it might as well be me."
[email protected]

2008-10-22 19:31:00

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

On Wed, 2008-10-22 at 12:26 -0700, Ingo Molnar wrote:
> * Alok Kataria <[email protected]> wrote:
>
> > Skip tsc synchronization checks if CONSTANT_TSC bit is set.
> >
> > From: Alok N Kataria <[email protected]>
> >
> > TSC synchronization checks between CPU's bail out even if we see a
> > distortion of a single cycle. This makes the TSC mostly unsuable in a
> > virtualized environment.
> >
> > The CONSTANT_TSC bit tells us if the hardware exports a constant TSC,
> > we can use this bit to trust the hardware and skip the TSC sync checks
> > at bootup.
>
> the sync check is there to check the _offset_ between CPUs. CONSTANT_TSC
> is not a guarantee that the TSC will be coherent across all CPUs.
>
> so this patch is fundamentally wrong.

Then does adding a new flag to skip this check be acceptable ?
Something like the patch that i had sent yesterday ?

Thanks,
Alok

>
> Ingo

2008-10-22 19:51:30

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

> We also force set the CONSTANT_TSC capability bit if we are running on
> VMware, since the VMware hypervisor exports a constant TSC to the guest.

NAK.

The whole point of this is to eliminate the need for vmware detection,
which is something the kernel shouldn't need to know about.
Instead VMware should supply that bit by itself and then Linux could
do the right thing. The only change you would need to do is to always
check it for all CPU vendors, not just AMD.
Adding it with vmware detection code is not useful.

-Andi

2008-10-22 20:10:28

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

> the sync check is there to check the _offset_ between CPUs. CONSTANT_TSC
> is not a guarantee that the TSC will be coherent across all CPUs.

I cannot find a quote for it in the docs now, but iirc the AMD
definition of the bit guarantees offset, aka full synchronization
over the system.

-Andi

2008-10-22 22:00:55

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

[dropped LKML of the cc list]

Hi Andi,

On Wed, 2008-10-22 at 12:58 -0700, Andi Kleen wrote:
> > We also force set the CONSTANT_TSC capability bit if we are running on
> > VMware, since the VMware hypervisor exports a constant TSC to the guest.
>
> NAK.
>
> The whole point of this is to eliminate the need for vmware detection,
> which is something the kernel shouldn't need to know about.
> Instead VMware should supply that bit by itself and then Linux could
> do the right thing. The only change you would need to do is to always
> check it for all CPU vendors, not just AMD.

Yeah, i can add that code which checks for that bit, for all cpu
vendors. And yes we can supply that bit with our future products.

> Adding it with vmware detection code is not useful.

The VMware detection code is something that's needed anyways for other
purposes(getting tsc_freq right now). So the code which force sets this
TSC_CONSTANT bit is trivial.
Apart from that, as i said yesterday, this should be viewed as a special
case for VMware products which don't already set this bit. Changing the
behavior for already existing products is not feasible.

Thanks,
Alok

>
> -Andi

2008-10-22 22:04:19

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

On Wed, 2008-10-22 at 13:17 -0700, Andi Kleen wrote:
> > the sync check is there to check the _offset_ between CPUs. CONSTANT_TSC
> > is not a guarantee that the TSC will be coherent across all CPUs.
>
> I cannot find a quote for it in the docs now, but iirc the AMD
> definition of the bit guarantees offset, aka full synchronization
> over the system.

I found this mail,
http://lkml.org/lkml/2005/11/4/173

If you look for,

-------------------------------------------------------------
Future TSC Directions and Solutions
===================================
Future AMD processors will provide a TSC that is P-state and
C-State invariant and unaffected by STPCLK-throttling. This
will make the TSC immune to drift. Because using the TSC
for fast timer APIs is a desirable feature that helps
performance, AMD has defined a CPUID feature bit that
software can test to determine if the TSC is
invariant. Issuing a CPUID instruction with an %eax register
value of 0x8000_0007, on a processor whose base family is
0xF, returns "Advanced Power Management Information" in the
%eax, %ebx, %ecx, and %edx registers. Bit 8 of the return
%edx is the "TscInvariant" feature flag which is set when
TSC is P-state, C-state, and STPCLK-throttling invariant; it
is clear otherwise.
--------------------------------------------------------------

The third line does mention about invariant TSC being immune to drift.

Thanks,
Alok

2008-10-22 22:05:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

> > Adding it with vmware detection code is not useful.
>
> The VMware detection code is something that's needed anyways for other
> purposes(getting tsc_freq right now). So the code which force sets this
> TSC_CONSTANT bit is trivial.

It would be better to eliminate that too. Why do you need it anyways?

> Apart from that, as i said yesterday, this should be viewed as a special
> case for VMware products which don't already set this bit. Changing the
> behavior for already existing products is not feasible.

On old hypervisors Linux already runs fine without TSC, doesn't it?

-Andi

2008-10-22 22:12:15

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

On Wed, 2008-10-22 at 15:13 -0700, Andi Kleen wrote:
> > > Adding it with vmware detection code is not useful.
> >
> > The VMware detection code is something that's needed anyways for other
> > purposes(getting tsc_freq right now). So the code which force sets this
> > TSC_CONSTANT bit is trivial.
>
> It would be better to eliminate that too. Why do you need it anyways?
>
> > Apart from that, as i said yesterday, this should be viewed as a special
> > case for VMware products which don't already set this bit. Changing the
> > behavior for already existing products is not feasible.
>
> On old hypervisors Linux already runs fine without TSC, doesn't it?

Not really, there are problems with the pm timer too, the one about
missing the counter wrap and time dropping in chunks of 4 seconds.
Tried to explain it over here, http://lkml.org/lkml/2008/10/22/525

So TSC is the ideal clocksource from performance and correctness point
of view for VMware.

Thanks,
Alok

>
> -Andi

2008-10-22 22:15:31

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

On Wed, 22 Oct 2008 15:00:46 -0700
Alok Kataria <[email protected]> wrote:

> [dropped LKML of the cc list]

Sorry this discussion belongs on the LKML list. Please take it back to
the list.

2008-10-22 22:17:55

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

On Wed, 2008-10-22 at 15:15 -0700, Alan Cox wrote:
> On Wed, 22 Oct 2008 15:00:46 -0700
> Alok Kataria <[email protected]> wrote:
>
> > [dropped LKML of the cc list]
>
> Sorry this discussion belongs on the LKML list. Please take it back to
> the list.

It was never dropped, sorry for the confusion :-)

2008-10-22 22:46:53

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

> Not really, there are problems with the pm timer too, the one about
> missing the counter wrap and time dropping in chunks of 4 seconds.
> Tried to explain it over here, http://lkml.org/lkml/2008/10/22/525

Well then pit. Or are you saying time is always broken on VMware & Linux?

> So TSC is the ideal clocksource from performance and correctness point
> of view for VMware.

But you don't seem to emulate it "ideal"ly otherwise you wouldn't
need all these hacks you're adding?

I think you should either implement a TSC that matches what
real hardware does (including CPUID semantics) or implement
a real vmware PV timer and just say it's PV and not fully virtualized.
But doesn't the vmware paravirt ops have that already anyways?

But I personally think it wouldn't really scale to add detection for
more and more "nearly PV" hypervisors to the standard native kernel.

-Andi

2008-10-23 02:22:26

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

On Wed, 2008-10-22 at 15:54 -0700, Andi Kleen wrote:
> > Not really, there are problems with the pm timer too, the one about
> > missing the counter wrap and time dropping in chunks of 4 seconds.
> > Tried to explain it over here, http://lkml.org/lkml/2008/10/22/525
>
> Well then pit.

PIT is not available as a clocksource if highres or one-shot timers are
enabled. So PIT is not a possibility.

>
> Or are you saying time is always broken on VMware & Linux?

The acpi_pm timer wrap problem has come up only with the clocksource and
NO_HZ kernels, without NO_HZ there were periodic interrupts which caused
the guest to be scheduled before ACPI_PM could wrap around.

> > So TSC is the ideal clocksource from performance and correctness point
> > of view for VMware.
>
> But you don't seem to emulate it "ideal"ly otherwise you wouldn't
> need all these hacks you're adding?

"All these hacks" ? i guess you are talking about only this particular,
skipping the tsc_sync checks.
Rest of them are valid bugs as i have mentioned.

>
> I think you should either implement a TSC that matches what
> real hardware does (including CPUID semantics)

Yeah accepted, i agree this is the right approach and that's the long
term view that we are taking too.

> or implement
> a real vmware PV timer and just say it's PV and not fully virtualized.
> But doesn't the vmware paravirt ops have that already anyways?

That's for 32bit only. Apart from the tsc_sync problem i doubt we have
any other issue with the TSC as clocksource, so adding a similar
clocksource is something that i would avoid.

>
> But I personally think it wouldn't really scale to add detection for
> more and more "nearly PV" hypervisors to the standard native kernel.

I think we anyways need a way to detect if we are running on a
hypervisor. That's the only way we can move towards having a single
image which runs well on both native hardware and a virtualized
environment.

I guess, the only thing that you don't agree over here is the enabling
of CONSTANT_TSC bit when VMware is detected, right ?
I would agree that this could be viewed as a hack but that is the only
way we can have the kernel still working correctly on the already
released platforms. And, this should be viewed no different than the
case of fixing a slightly different behavior of a particular hardware.

Thanks,
Alok


>
> -Andi

2008-10-23 08:03:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

> >
> > Or are you saying time is always broken on VMware & Linux?
>
> The acpi_pm timer wrap problem has come up only with the clocksource and
> NO_HZ kernels, without NO_HZ there were periodic interrupts which caused
> the guest to be scheduled before ACPI_PM could wrap around.

ACPI_PM should be just fixed. My old independent noidlehz implementation
just always limited the sleep times to half the wrap time of the
timer. I suspect this needs to be done here too.



>
> > > So TSC is the ideal clocksource from performance and correctness point
> > > of view for VMware.
> >
> > But you don't seem to emulate it "ideal"ly otherwise you wouldn't
> > need all these hacks you're adding?
>
> "All these hacks" ? i guess you are talking about only this particular,

Everything that requires vmware detection means your hardware
emulation is not good enough.

> skipping the tsc_sync checks.
> Rest of them are valid bugs as i have mentioned.

The tsc frequency one didn't sound like a valid bug.

> > or implement
> > a real vmware PV timer and just say it's PV and not fully virtualized.
> > But doesn't the vmware paravirt ops have that already anyways?
>
> That's for 32bit only.

I though there were some efforts to make it 64bit too?
Or is there no VMI ROM on 64bit? Perhaps you could do the
timer without the ROM then.

> Apart from the tsc_sync problem i doubt we have
> any other issue with the TSC as clocksource, so adding a similar
> clocksource is something that i would avoid.
>
> >
> > But I personally think it wouldn't really scale to add detection for
> > more and more "nearly PV" hypervisors to the standard native kernel.
>
> I think we anyways need a way to detect if we are running on a
> hypervisor.

For PV sure. But not for non PV.

> That's the only way we can move towards having a single
> image which runs well on both native hardware and a virtualized
> environment.

If a hypervisor is not good enough to simulate hardware closely
enough it should just set up respective paravirt ops (or register
own clock drivers etc.), but not complicate the native code with a weird
half PV half fully emulated mix.


>
> I guess, the only thing that you don't agree over here is the enabling
> of CONSTANT_TSC bit when VMware is detected, right ?

My POV is that code supposed to drive real hardware shouldn't
have any "is hypervisor X|Y|Z" hacks. We already got a whole
lot of infrastructure for PV hypervisors.

For tsc_sync I suspect the fix is to either completely trust CONSTANT_TSC
or make the check accept more offset or possibly a combination of both.

-Andi
--
[email protected]

2008-10-23 23:39:32

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

On Thu, 2008-10-23 at 01:10 -0700, Andi Kleen wrote:
> >
> > The acpi_pm timer wrap problem has come up only with the clocksource and
> > NO_HZ kernels, without NO_HZ there were periodic interrupts which caused
> > the guest to be scheduled before ACPI_PM could wrap around.


> ACPI_PM should be just fixed. My old independent noidlehz implementation
> just always limited the sleep times to half the wrap time of the
> timer. I suspect this needs to be done here too.
>

Yeah fixing ACPI_PM is a good idea, but this too won't fix the problem
in the virtualized space as your VCPU can be descheduled for more than
4secs. Now since time was kept using interrupts before clocksources, the
counter wrap around never made a difference.

> The tsc frequency one didn't sound like a valid bug.

It is not. The calibration algorithm until 2.6.27 did work well for
virtualization.
But with the new fast path algorithm the error in calibration could now
be as high as 7600ppm, as explained on this thread.
http://kerneltrap.org/mailarchive/linux-kernel/2008/9/5/3208194

I know that this is a corner case but can be triggered in virtualization
environment and i have seen instances of it. Its not because of any
virtualization defects but because these races are magnified here.


> I though there were some efforts to make it 64bit too?
> Or is there no VMI ROM on 64bit? Perhaps you could do the
> timer without the ROM then.

Yeah i could, the point though is that the current TSC implementation is
already good enough for virtualization with these small changes, so
adding a new algorithm doesn't seem quite that attractive to me.


> >
> > I guess, the only thing that you don't agree over here is the enabling
> > of CONSTANT_TSC bit when VMware is detected, right ?
>
> My POV is that code supposed to drive real hardware shouldn't
> have any "is hypervisor X|Y|Z" hacks. We already got a whole
> lot of infrastructure for PV hypervisors.

These "is_hypervisor" checks are not in fast path.
Apart from that, with a field in the cpuinfo_structure we wont be
calling all these detection functions over and over again. The move is
already towards standardizing the detection process for any hypervisor.

>
> For tsc_sync I suspect the fix is to either completely trust CONSTANT_TSC
> or make the check accept more offset or possibly a combination of both

I am ok with the CONSTANT_TSC bit check, but if people think that its
not important to skip this for native, i think adding a new flag to skip
this should be safe enough.


Ingo, HPA your views on this whole detection and skipping thing ?

Thanks,
Alok

> .
>
> -Andi
> --
> [email protected]

2008-10-23 23:48:20

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

Alok Kataria wrote:
>
> I am ok with the CONSTANT_TSC bit check, but if people think that its
> not important to skip this for native, i think adding a new flag to skip
> this should be safe enough.
>
> Ingo, HPA your views on this whole detection and skipping thing ?
>

Okay, first of all, I'm somewhat leery (to put it mildly) of trusting a
CPUID bit to tell me a *system* property, which is that all cores in the
system are synchronized. The CPU designer will know that all the cores
in the *package* are synchronized, but if that extends system-wide is a
property beyond the CPU. Now, if I'm not completely mistaken, in the
case of AMD this bit is actually set by the BIOS via a magic MSR, but
that doesn't mean it can't be wrong.

As far as skipping the check, it makes sense for me in the case of known
virtualization platforms; a CPU feature bit, real or synthetic, is a
very clean way to do that. In general we should centralize CPU
knowledge to arch/x86/kernel/cpu and have the code outside look for
specific feature flags, and that applies to virtualization platforms, too.

-hpa

2008-10-24 00:17:49

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

> As far as skipping the check, it makes sense for me in the case of known
> virtualization platforms; a CPU feature bit, real or synthetic, is a
> very clean way to do that.

I don't think adding detection for non PV Hypervisors is anywhere clean
Even if it's only VMware today, tomorrow it will be a few more
and long term you might need to support all of the obscuro hypervisors
that are out there. Just seems like a slippery slope. Either it's
paravirtual or it's not, but it should attempt to be both. If the hypervisor
doesn't emulate TSC well enough that the native code works it's entirely
reasonable to let it use some other timer, like it has been always
done in the past.

-Andi

2008-10-24 00:46:52

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

Andi Kleen wrote:
>> As far as skipping the check, it makes sense for me in the case of known
>> virtualization platforms; a CPU feature bit, real or synthetic, is a
>> very clean way to do that.
>
> I don't think adding detection for non PV Hypervisors is anywhere clean
> Even if it's only VMware today, tomorrow it will be a few more
> and long term you might need to support all of the obscuro hypervisors
> that are out there. Just seems like a slippery slope. Either it's
> paravirtual or it's not, but it should attempt to be both. If the hypervisor
> doesn't emulate TSC well enough that the native code works it's entirely
> reasonable to let it use some other timer, like it has been always
> done in the past.

That is at least to some degree nonsense, simply because we are all well
down that particular "slippery slope": we have hardware blacklists and
whitelists, CPU-specific workarounds, and so on all over the place, and
in that sense a hypervisor really isn't different than another hardware
platform.

-hpa

2008-10-24 01:12:55

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

On Thu, 2008-10-23 at 16:47 -0700, H. Peter Anvin wrote:
> Alok Kataria wrote:
> >
> > I am ok with the CONSTANT_TSC bit check, but if people think that its
> > not important to skip this for native, i think adding a new flag to skip
> > this should be safe enough.
> >
> > Ingo, HPA your views on this whole detection and skipping thing ?
> >
>
> Okay, first of all, I'm somewhat leery (to put it mildly) of trusting a
> CPUID bit to tell me a *system* property, which is that all cores in the
> system are synchronized. The CPU designer will know that all the cores
> in the *package* are synchronized, but if that extends system-wide is a
> property beyond the CPU. Now, if I'm not completely mistaken, in the
> case of AMD this bit is actually set by the BIOS via a magic MSR, but
> that doesn't mean it can't be wrong.
>
> As far as skipping the check, it makes sense for me in the case of known
> virtualization platforms; a CPU feature bit, real or synthetic, is a
> very clean way to do that. In general we should centralize CPU
> knowledge to arch/x86/kernel/cpu and have the code outside look for
> specific feature flags, and that applies to virtualization platforms, too.

I agree with the synthetic cpu feature thing.
Do you think i should use one of the existing word like the word 3 which
is for synthesized feature bits ? Or is it better to define a new
virtualization specific word ?

Thanks,
Alok

>
> -hpa

2008-10-24 01:18:54

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

Alok Kataria wrote:
>
> I agree with the synthetic cpu feature thing.
> Do you think i should use one of the existing word like the word 3 which
> is for synthesized feature bits ? Or is it better to define a new
> virtualization specific word ?
>

I don't see any point in using anything other than word 3 until it fills up.

What we want to end up with is cleanly separated: hypervisor detection,
hypervisor features/workaround settings, and the implementation of those.

-hpa

2008-10-24 07:15:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

> That is at least to some degree nonsense, simply because we are all well
> down that particular "slippery slope": we have hardware blacklists and

The big difference is that hardware cannot be easily fixed, hypervisors
are just software and can as simply as Linux. Also there are at least
standardized simple ways to detect hardware and platforms using standard
enumeration interfaces like PCI or DMI, while each Hypervisor detection
seems to need huge amounts of custom (and likely fragile) complicated code.
iirc there was a vmware detection patch around and it was disgusting iirc.

Also I think this concept of "not PV, but then again a little"
concept that is mandated here is not a useful one.

Either do a proper PV interface and then just write a custom
clock driver and paravirt ops interface to make everything really fast,
or just use the direct hardware interface and fix the hypervisor
to do it properly.

Especially VMware has this already (although 32bit only).

Ok the tsc_sync issue is borderline, but that one seems
to be more a case of tweaking the existing algorithms
to be a little more tolerant.

-Andi
--
[email protected]

2008-10-24 15:31:01

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

Andi Kleen wrote:
>> That is at least to some degree nonsense, simply because we are all well
>> down that particular "slippery slope": we have hardware blacklists and
>
> The big difference is that hardware cannot be easily fixed, hypervisors
> are just software and can as simply as Linux. Also there are at least
> standardized simple ways to detect hardware and platforms using standard
> enumeration interfaces like PCI or DMI, while each Hypervisor detection
> seems to need huge amounts of custom (and likely fragile) complicated code.
> iirc there was a vmware detection patch around and it was disgusting iirc.
>

BIOSes are also just software, and we have to deal with bugs in them
*all the time*. The reality is that we're going to have to deal with
both vendor and user reluctance to upgrade, and therefore have to deal
with brokenness in the field. As far as detection code is concerned, I
certainly have pushed back on the most disgusting (and broken) attempts,
as well insisted that the code be properly centralized.

> Also I think this concept of "not PV, but then again a little"
> concept that is mandated here is not a useful one.
>
> Either do a proper PV interface and then just write a custom
> clock driver and paravirt ops interface to make everything really fast,
> or just use the direct hardware interface and fix the hypervisor
> to do it properly.

It's functionally equivalent to hardware workarounds. There is always a
judgement call when something should be forked off into a separate
driver, which is exactly what this amounts to.

> Especially VMware has this already (although 32bit only).
>
> Ok the tsc_sync issue is borderline, but that one seems
> to be more a case of tweaking the existing algorithms
> to be a little more tolerant.

That is a possibility, but that is a technical issue versus a
pseudo-philosophical issue. However, there are clearly a few different
problems here with the start-and-stop nature of the virtual environment
versus the needs of physical environments, and it's not clear to me that
they are inherently compatible, or whether that would entail dangerous
compromises.

-hpa

2008-10-24 19:18:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

> BIOSes are also just software, and we have to deal with bugs in them
> *all the time*. The reality is that we're going to have to deal with
> both vendor and user reluctance to upgrade, and therefore have to deal
> with brokenness in the field.

In the field they will just continue using clock=pit, like they
always did on vmware. And also they will not update the Linux kernel.

This is strictly for new installations. And I frankly don't
see why Linux needs to get white listed workarounds when the
Hypervisor couldn't as well be fixed. We have the bizarre
situation here where a HV vendor tries to add workarounds
to Linux instead of fixing it on their products.

Now making generic code a little more flexible in what
it accepts is fine though (like relaxing tsc_sync or
checking and trusting UNSTABLE_TSC). That will scale at least
and doesn't need significant new code.

-Andi

2008-10-24 19:20:40

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

Andi Kleen wrote:
>
> In the field they will just continue using clock=pit, like they
> always did on vmware. And also they will not update the Linux kernel.
>

That is a totally bogus assumption. You will typically have the host
platform (Vmware in this case) move much much slower than the guests.

> This is strictly for new installations. And I frankly don't
> see why Linux needs to get white listed workarounds when the
> Hypervisor couldn't as well be fixed. We have the bizarre
> situation here where a HV vendor tries to add workarounds
> to Linux instead of fixing it on their products.

... just like every other hardware vendor.

-hpa

2008-10-24 19:26:45

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

On Fri, Oct 24, 2008 at 12:19:41PM -0700, H. Peter Anvin wrote:
> Andi Kleen wrote:
> >
> >In the field they will just continue using clock=pit, like they
> >always did on vmware. And also they will not update the Linux kernel.
> >
>
> That is a totally bogus assumption. You will typically have the host
> platform (Vmware in this case) move much much slower than the guests.

The people running on older hypervisors will just continue using
PIT timer. They won't get nohz, but i guess without updates
they don't expect new features.

> >This is strictly for new installations. And I frankly don't
> >see why Linux needs to get white listed workarounds when the
> >Hypervisor couldn't as well be fixed. We have the bizarre
> >situation here where a HV vendor tries to add workarounds
> >to Linux instead of fixing it on their products.
>
> ... just like every other hardware vendor.

Hardware vendors can't fix it without long lead time.
Software vendors can as easily as Linux.

Also we're talking about an optimization here (enabling NOHZ), not a
correctness issue. So your analogy to hardware workarounds
is already dubious.

-Andi
--
[email protected]

2008-10-24 19:50:30

by Alok Kataria

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

On Fri, 2008-10-24 at 12:34 -0700, Andi Kleen wrote:
> On Fri, Oct 24, 2008 at 12:19:41PM -0700, H. Peter Anvin wrote:
> > Andi Kleen wrote:
> > >
> > >In the field they will just continue using clock=pit, like they
> > >always did on vmware. And also they will not update the Linux kernel.
> > >
> >
> > That is a totally bogus assumption. You will typically have the host
> > platform (Vmware in this case) move much much slower than the guests.
>
> The people running on older hypervisors will just continue using
> PIT timer. They won't get nohz, but i guess without updates
> they don't expect new features.

As hpa rightly said, people could still upgrade the kernel and not the
hypervisor and would expect things to work correctly.
So this is just about making sure that the new features in kernel which
are on by default don't break when we run on older products.
>
> > >This is strictly for new installations. And I frankly don't
> > >see why Linux needs to get white listed workarounds when the
> > >Hypervisor couldn't as well be fixed. We have the bizarre
> > >situation here where a HV vendor tries to add workarounds
> > >to Linux instead of fixing it on their products.
> >
> > ... just like every other hardware vendor.
>
> Hardware vendors can't fix it without long lead time.
> Software vendors can as easily as Linux.
>
> Also we're talking about an optimization here (enabling NOHZ), not a
> correctness issue. So your analogy to hardware workarounds
> is already dubious.

Its not like we are trying to add VMware checks all over the kernel to
enhance the performance or something of that sought. Its just about the
timer emulation which needs to be tweaked a little.

Also i think you would agree that fixing timer related issues like the
acpi_pm thing are inherently difficult to fix on virtualized hardware
without the kernel cooperating. The constraints here are different since
the physical cpu can be time shared btween vcpu's, which can break a lot
of assumptions related to timing. I like, hpa's approach of using cpuid
bits (synthetic/real) to get these things communicated to the kernel so
that things are consistent with how the OS communicates with the
hardware.

Another thing that I may like to highlight is that, with the advent of
hardware assisted virtualization, the kernel should be able to
virtualize the processor, without requiring any major changes for para
virtualizing the cpu. And thats the approach we will like to take as
well, i.e. communicate most of the things to the kernel using standard
communication methods like CPUid/MSR's. And i think this is just a step
towards that.

So in short we should view virtualization platform as just a special
hardware as far as the kernel is concerned.

Thanks,
Alok

>
> -Andi
> --
> [email protected]

2008-10-24 20:15:49

by Dan Hecht

[permalink] [raw]
Subject: Re: [PATCH] Skip tsc synchronization checks if CONSTANT_TSC bit is set.

On 10/24/2008 12:25 PM, Andi Kleen wrote:
>> BIOSes are also just software, and we have to deal with bugs in them
>> *all the time*. The reality is that we're going to have to deal with
>> both vendor and user reluctance to upgrade, and therefore have to deal
>> with brokenness in the field.
>
> In the field they will just continue using clock=pit, like they
> always did on vmware. And also they will not update the Linux kernel.
>
> This is strictly for new installations. And I frankly don't
> see why Linux needs to get white listed workarounds when the
> Hypervisor couldn't as well be fixed. We have the bizarre
> situation here where a HV vendor tries to add workarounds
> to Linux instead of fixing it on their products.
>

What exactly would you like vmware to fix? VMware fully virtualizes
x86. However, when running the kernel on virtual cpus, as compared to
running on a physical cpus, the timing characteristics are different --
virtual cpus have to time share physical cpus with each other.

So, timing assumptions that the kernel makes when running directly on
physical cpus no longer hold when running on virtual cpus.

Prior to clocksource/clockevents, the timing assumptions that the Linux
kernel made were terrible for hypervisors. Now, the assumptions are
much better. However, three *minor* assumptions that the kernel makes
are violated when running on a hypervisor:

1) The fast-path TSC calibration code makes assumptions about being able
to sample various counters in sequence in a set amount of time that are
not true when running virtualized. (Actually, it makes assumptions that
aren't really true 100% of the time on physical cpus, but in that case
the odds of violating the assumptions (by hitting an SMI at exactly the
right time and length) are really rare.

Note that accurate calibration of the TSC is extremely important in
clocksource kernels since any error will lead to long term drift of
wallclock time.

2) There is no guarantee that the acpi_pm timer will be sampled at least
every 4.68 seconds (the wrap interval), because the vcpu, in extreme
circumstances, may not have a chance to run in that time. Thus, the
acpi_pm timer is not suitable to be used as a clocksource watchdog when
running on a hypervisor.

3) Virtual TSCs can be kept nearly in sync, but because the virtual TSC
offset is set by software, it's not perfect. So, the TSC
synchronization test can fail. (Really, it can fail on native as well,
and that's why the tests for backwards TSC were added to
read_tsc()/vread_tsc()).

Clearly, #1 and #2 *cannot* be fixed in the hypervisor. These are cases
where the kernel is making assumptions that just are not true when
running on certain platforms (i.e. hypervisors). Let's fix them.

#3, as you have suggested below, can perhaps be fixed by loosening the
check a bit to allow some leeway for marginally offset TSCs.

> Now making generic code a little more flexible in what
> it accepts is fine though (like relaxing tsc_sync or
> checking and trusting UNSTABLE_TSC). That will scale at least
> and doesn't need significant new code.
>

I think everyone can agree that this is the preferred approach, in
general. And in fact it was the approach Alok first used for the TSC
frequency calibration problem (this is one reason why he merged the
32-bit and 64-bit TSC code -- to standardize on the more robust 64-bit
calibration code). But, in the end, folks wanted a "fast" TSC
calibration path, and that path makes assumptions that just won't be
true when running on a hypervisor, so we are left with skipping that
path if we are on a virtual cpu.

Also, with regards to your claim that users should continue to use
clock=pit like options on newer kernels: that is just plain *wrong*.
The reason for clock=pit (really clock=pmtmr) recommendation on
pre-clocksource kernels wasn't to avoid using the TSC, but it was simply
a workaround to avoid the kernel code that attempted to compensate for
lost ticks (but would do so incorrectly in the case of late, but not
lost, interrupts -- again, it was a kernel timing assumption that was
invalid on hypervisors).

Dan