2007-12-11 13:11:53

by Dor Laor

[permalink] [raw]
Subject: Performance overhead of get_cycles_sync

Hi Ingo, Thomas,

In the latest kernel (2.6.24-rc3) I noticed a drastic performance
decrease for KVM networking.
The reason is many vmexit (exit reason is cpuid instruction) caused by
calls to gettimeofday that uses tsc sourceclock.
read_tsc calls get_cycles_sync which might call cpuid in order to
serialize the cpu.

Can you explain why the cpu needs to be serialized for every gettime call?
Do we need to be that accurate? (It will also slightly improve physical
hosts).
I believe you have a reason and the answer is yes. In that case can you
replace the serializing instruction
with an instruction that does not trigger vmexit? Maybe use 'ltr' for
example?

Regards,
Dor.


2007-12-11 13:38:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync


* Dor Laor <[email protected]> wrote:

> Hi Ingo, Thomas,
>
> In the latest kernel (2.6.24-rc3) I noticed a drastic performance
> decrease for KVM networking. The reason is many vmexit (exit reason is
> cpuid instruction) caused by calls to gettimeofday that uses tsc
> sourceclock. read_tsc calls get_cycles_sync which might call cpuid in
> order to serialize the cpu.
>
> Can you explain why the cpu needs to be serialized for every gettime
> call? Do we need to be that accurate? (It will also slightly improve
> physical hosts). I believe you have a reason and the answer is yes. In
> that case can you replace the serializing instruction with an
> instruction that does not trigger vmexit? Maybe use 'ltr' for example?

hm, where exactly does it call CPUID?

Ingo

2007-12-11 14:14:51

by Dor Laor

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync

Ingo Molnar wrote:
>
> * Dor Laor <[email protected]> wrote:
>
> > Hi Ingo, Thomas,
> >
> > In the latest kernel (2.6.24-rc3) I noticed a drastic performance
> > decrease for KVM networking. The reason is many vmexit (exit reason is
> > cpuid instruction) caused by calls to gettimeofday that uses tsc
> > sourceclock. read_tsc calls get_cycles_sync which might call cpuid in
> > order to serialize the cpu.
> >
> > Can you explain why the cpu needs to be serialized for every gettime
> > call? Do we need to be that accurate? (It will also slightly improve
> > physical hosts). I believe you have a reason and the answer is yes. In
> > that case can you replace the serializing instruction with an
> > instruction that does not trigger vmexit? Maybe use 'ltr' for example?
>
> hm, where exactly does it call CPUID?
>
> Ingo
>
Here, commented out [include/asm-x86/tsc.h]:
/* Like get_cycles, but make sure the CPU is synchronized. */
static __always_inline cycles_t get_cycles_sync(void)
{
unsigned long long ret;
unsigned eax, edx;

/*
* Use RDTSCP if possible; it is guaranteed to be synchronous
* and doesn't cause a VMEXIT on Hypervisors
*/
alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
ASM_OUTPUT2("=a" (eax), "=d" (edx)),
"a" (0U), "d" (0U) : "ecx", "memory");
ret = (((unsigned long long)edx) << 32) | ((unsigned long long)eax);
if (ret)
return ret;

/*
* Don't do an additional sync on CPUs where we know
* RDTSC is already synchronous:
*/
// alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
// "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
rdtscll(ret);

return ret;
}

2007-12-11 14:27:24

by Andi Kleen

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync


[headers rewritten because of gmane crosspost breakage]

> In the latest kernel (2.6.24-rc3) I noticed a drastic performance
> decrease for KVM networking.

That should not have changed for quite some time.

Also it depends on the CPU of course.

> The reason is many vmexit (exit reason is cpuid instruction) caused by
> calls to gettimeofday that uses tsc sourceclock.
> read_tsc calls get_cycles_sync which might call cpuid in order to
> serialize the cpu.
>
> Can you explain why the cpu needs to be serialized for every gettime call?

Otherwise RDTSC can be speculated around and happen outside the protection
of the seqlock and that can sometimes lead to non monotonic time reporting.

Anyways after a lot of discussions it turns out there are ways to archive
this without CPUID and there is a solution implemented for this in ff
tree which I will submit for .25. It's a little complicated though
and not a quick fix.

> Do we need to be that accurate? (It will also slightly improve physical
> hosts).
> I believe you have a reason and the answer is yes. In that case can you
> replace the serializing instruction
> with an instruction that does not trigger vmexit? Maybe use 'ltr' for
> example?

ltr doesn't synchronize RDTSC.

-Andi

2007-12-11 14:27:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync


* Dor Laor <[email protected]> wrote:

> Here [include/asm-x86/tsc.h]:
>
> /* Like get_cycles, but make sure the CPU is synchronized. */
> static __always_inline cycles_t get_cycles_sync(void)
> {
> unsigned long long ret;
> unsigned eax, edx;
>
> /*
> * Use RDTSCP if possible; it is guaranteed to be synchronous
> * and doesn't cause a VMEXIT on Hypervisors
> */
> alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
> ASM_OUTPUT2("=a" (eax), "=d" (edx)),
> "a" (0U), "d" (0U) : "ecx", "memory");
> ret = (((unsigned long long)edx) << 32) | ((unsigned long long)eax);
> if (ret)
> return ret;
>
> /*
> * Don't do an additional sync on CPUs where we know
> * RDTSC is already synchronous:
> */
> // alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
> // "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
> rdtscll(ret);

The patch below should resolve this - could you please test and Ack it?
But this CPUID was present in v2.6.23 too, so why did it only show up in
2.6.24-rc for you?

Ingo

-------------->
Subject: x86: fix get_cycles_sync() overhead
From: Ingo Molnar <[email protected]>

get_cycles_sync() is causing massive overhead in KVM networking:

http://lkml.org/lkml/2007/12/11/54

remove the explicit CPUID serialization - it causes VM exits and is
pointless: we care about GTOD coherency but that goes to user-space
via a syscall, and syscalls are serialization points anyway.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/asm-x86/tsc.h | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux-x86.q/include/asm-x86/tsc.h
===================================================================
--- linux-x86.q.orig/include/asm-x86/tsc.h
+++ linux-x86.q/include/asm-x86/tsc.h
@@ -39,8 +39,8 @@ static __always_inline cycles_t get_cycl
unsigned eax, edx;

/*
- * Use RDTSCP if possible; it is guaranteed to be synchronous
- * and doesn't cause a VMEXIT on Hypervisors
+ * Use RDTSCP if possible; it is guaranteed to be synchronous
+ * and doesn't cause a VMEXIT on Hypervisors
*/
alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
ASM_OUTPUT2("=a" (eax), "=d" (edx)),
@@ -50,11 +50,11 @@ static __always_inline cycles_t get_cycl
return ret;

/*
- * Don't do an additional sync on CPUs where we know
- * RDTSC is already synchronous:
+ * Use RDTSC on other CPUs. This might not be fully synchronous,
+ * but it's not a problem: the only coherency we care about is
+ * the GTOD output to user-space, and syscalls are synchronization
+ * points anyway:
*/
- alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
- "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
rdtscll(ret);

return ret;

2007-12-11 14:57:33

by Dor Laor

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync

Andi Kleen wrote:
> [headers rewritten because of gmane crosspost breakage]
>
>
>> In the latest kernel (2.6.24-rc3) I noticed a drastic performance
>> decrease for KVM networking.
>>
>
> That should not have changed for quite some time.
>
> Also it depends on the CPU of course.
>
I didn't find the exact place of the change but using fedora 2.6.23-8
there is no problem.
3aefbe0746580a710d4392a884ac1e4aac7c728f turn X86_FEATURE_SYNC_RDTSC
off for most
intel cpus, but it was committed in May.

>
>> The reason is many vmexit (exit reason is cpuid instruction) caused by
>> calls to gettimeofday that uses tsc sourceclock.
>> read_tsc calls get_cycles_sync which might call cpuid in order to
>> serialize the cpu.
>>
>> Can you explain why the cpu needs to be serialized for every gettime call?
>>
>
> Otherwise RDTSC can be speculated around and happen outside the protection
> of the seqlock and that can sometimes lead to non monotonic time reporting.
>
What about moving the result into memory and calling mb() instead?
> Anyways after a lot of discussions it turns out there are ways to archive
> this without CPUID and there is a solution implemented for this in ff
> tree which I will submit for .25. It's a little complicated though
> and not a quick fix.
>
>
>> Do we need to be that accurate? (It will also slightly improve physical
>> hosts).
>> I believe you have a reason and the answer is yes. In that case can you
>> replace the serializing instruction
>> with an instruction that does not trigger vmexit? Maybe use 'ltr' for
>> example?
>>
>
> ltr doesn't synchronize RDTSC.
>
>
According to Intel spec it is a serializing instruction along with cpuid
and others.
> -Andi
>
>

2007-12-11 15:02:37

by Andi Kleen

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync

On Tue, Dec 11, 2007 at 04:57:16PM +0200, Dor Laor wrote:
> Andi Kleen wrote:
> >[headers rewritten because of gmane crosspost breakage]
> >
> >
> >>In the latest kernel (2.6.24-rc3) I noticed a drastic performance
> >>decrease for KVM networking.
> >>
> >
> >That should not have changed for quite some time.
> >
> >Also it depends on the CPU of course.
> >
> I didn't find the exact place of the change but using fedora 2.6.23-8
> there is no problem.
> 3aefbe0746580a710d4392a884ac1e4aac7c728f turn X86_FEATURE_SYNC_RDTSC
> off for most
> intel cpus, but it was committed in May.

Ah there was a change to enable it on Core2. On AMD it was
always there except on CPUs with RDTSCP

Anyways the very likely right way (I'm still waiting for final
confirmation from Intel) to solve this is to use LFENCE which
happens to stop RDTSC and is relatively light weight and won't
cause vmexits. Unfortunately that doesn't work on AMD, but
there MFENCE happens to work. But there are more problems
in this code which I have a (.24) patchkit to fix.

> According to Intel spec it is a serializing instruction along with cpuid
> and others.

You're right. The real roblem is that it is ring 0 only and that can
be executed in ring 3. That is why Ingo's patch was wrong too.

-Andi

2007-12-11 15:03:50

by Dor Laor

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync

Ingo Molnar wrote:
> * Dor Laor <[email protected]> wrote:
>
>
>> Here [include/asm-x86/tsc.h]:
>>
>> /* Like get_cycles, but make sure the CPU is synchronized. */
>> static __always_inline cycles_t get_cycles_sync(void)
>> {
>> unsigned long long ret;
>> unsigned eax, edx;
>>
>> /*
>> * Use RDTSCP if possible; it is guaranteed to be synchronous
>> * and doesn't cause a VMEXIT on Hypervisors
>> */
>> alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
>> ASM_OUTPUT2("=a" (eax), "=d" (edx)),
>> "a" (0U), "d" (0U) : "ecx", "memory");
>> ret = (((unsigned long long)edx) << 32) | ((unsigned long long)eax);
>> if (ret)
>> return ret;
>>
>> /*
>> * Don't do an additional sync on CPUs where we know
>> * RDTSC is already synchronous:
>> */
>> // alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
>> // "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
>> rdtscll(ret);
>>
>
> The patch below should resolve this - could you please test and Ack it?
>
It works, actually I already commented it out.

Acked-by: Dor Laor <[email protected]>

But this CPUID was present in v2.6.23 too, so why did it only show up in
> 2.6.24-rc for you?
>
>
I tried to figure out but all the code movements for i386 go in the way.
In the previous email I reported to Andi that Fedora kernel 2.6.23-8 did
not suffer from it.
Thanks for the ultra fast reply :)
Dor
> Ingo
>
> -------------->
> Subject: x86: fix get_cycles_sync() overhead
> From: Ingo Molnar <[email protected]>
>
> get_cycles_sync() is causing massive overhead in KVM networking:
>
> http://lkml.org/lkml/2007/12/11/54
>
> remove the explicit CPUID serialization - it causes VM exits and is
> pointless: we care about GTOD coherency but that goes to user-space
> via a syscall, and syscalls are serialization points anyway.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> include/asm-x86/tsc.h | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> Index: linux-x86.q/include/asm-x86/tsc.h
> ===================================================================
> --- linux-x86.q.orig/include/asm-x86/tsc.h
> +++ linux-x86.q/include/asm-x86/tsc.h
> @@ -39,8 +39,8 @@ static __always_inline cycles_t get_cycl
> unsigned eax, edx;
>
> /*
> - * Use RDTSCP if possible; it is guaranteed to be synchronous
> - * and doesn't cause a VMEXIT on Hypervisors
> + * Use RDTSCP if possible; it is guaranteed to be synchronous
> + * and doesn't cause a VMEXIT on Hypervisors
> */
> alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
> ASM_OUTPUT2("=a" (eax), "=d" (edx)),
> @@ -50,11 +50,11 @@ static __always_inline cycles_t get_cycl
> return ret;
>
> /*
> - * Don't do an additional sync on CPUs where we know
> - * RDTSC is already synchronous:
> + * Use RDTSC on other CPUs. This might not be fully synchronous,
> + * but it's not a problem: the only coherency we care about is
> + * the GTOD output to user-space, and syscalls are synchronization
> + * points anyway:
> */
> - alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
> - "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
> rdtscll(ret);
>
> return ret;
>
>

2007-12-11 16:38:18

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync

On Tue, 11 Dec 2007 15:27:17 +0100
Ingo Molnar <[email protected]> wrote:

>
> * Dor Laor <[email protected]> wrote:
>
>
> The patch below should resolve this - could you please test and Ack
> it? But this CPUID was present in v2.6.23 too, so why did it only
> show up in 2.6.24-rc for you?

isn't this probably wrong since this code is also used in the vsyscall code..

2007-12-11 17:04:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync


* Arjan van de Ven <[email protected]> wrote:

> > * Dor Laor <[email protected]> wrote:
> >
> > The patch below should resolve this - could you please test and Ack
> > it? But this CPUID was present in v2.6.23 too, so why did it only
> > show up in 2.6.24-rc for you?
>
> isn't this probably wrong since this code is also used in the vsyscall
> code..

the TSC clocksource (and hence the vsyscall code) is turned off on
systems that fail the TOD/CLOCK portion of this test:

http://people.redhat.com/mingo/time-warp-test/time-warp-test.c

i.e. on the majority of systems in place.

Ingo

2007-12-11 17:23:55

by Andi Kleen

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync

Ingo Molnar <[email protected]> writes:

> the TSC clocksource (and hence the vsyscall code) is turned off on
> systems that fail the TOD/CLOCK portion of this test:

Which is not on core2 which was the question about. And if it was
turned off it wouldn't use get_cycles_sync() at all.

-Andi

2007-12-11 20:19:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync


* Andi Kleen <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
>
> > the TSC clocksource (and hence the vsyscall code) is turned off on
> > systems that fail the TOD/CLOCK portion of this test:
>
> Which is not on core2 which was the question about. And if it was
> turned off it wouldn't use get_cycles_sync() at all.

it is turned off on core2 too:

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
acpi_pm

Ingo

2007-12-11 20:30:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: Performance overhead of get_cycles_sync


* Ingo Molnar <[email protected]> wrote:

> > Which is not on core2 which was the question about. And if it was
> > turned off it wouldn't use get_cycles_sync() at all.
>
> it is turned off on core2 too:
>
> # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> acpi_pm

but you are right that it's not turned off on all core2's, so my patch
is wrong for them.

Ingo

2007-12-11 21:30:47

by Joerg Roedel

[permalink] [raw]
Subject: Re: [kvm-devel] Performance overhead of get_cycles_sync

On Tue, Dec 11, 2007 at 03:27:17PM +0100, Ingo Molnar wrote:
>
> * Dor Laor <[email protected]> wrote:
>
> > Here [include/asm-x86/tsc.h]:
> >
> > /* Like get_cycles, but make sure the CPU is synchronized. */
> > static __always_inline cycles_t get_cycles_sync(void)
> > {
> > unsigned long long ret;
> > unsigned eax, edx;
> >
> > /*
> > * Use RDTSCP if possible; it is guaranteed to be synchronous
> > * and doesn't cause a VMEXIT on Hypervisors
> > */
> > alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
> > ASM_OUTPUT2("=a" (eax), "=d" (edx)),
> > "a" (0U), "d" (0U) : "ecx", "memory");
> > ret = (((unsigned long long)edx) << 32) | ((unsigned long long)eax);
> > if (ret)
> > return ret;
> >
> > /*
> > * Don't do an additional sync on CPUs where we know
> > * RDTSC is already synchronous:
> > */
> > // alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
> > // "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
> > rdtscll(ret);
>
> The patch below should resolve this - could you please test and Ack it?
> But this CPUID was present in v2.6.23 too, so why did it only show up in
> 2.6.24-rc for you?
>
> Ingo
>
> -------------->
> Subject: x86: fix get_cycles_sync() overhead
> From: Ingo Molnar <[email protected]>
>
> get_cycles_sync() is causing massive overhead in KVM networking:
>
> http://lkml.org/lkml/2007/12/11/54
>
> remove the explicit CPUID serialization - it causes VM exits and is
> pointless: we care about GTOD coherency but that goes to user-space
> via a syscall, and syscalls are serialization points anyway.
>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> include/asm-x86/tsc.h | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> Index: linux-x86.q/include/asm-x86/tsc.h
> ===================================================================
> --- linux-x86.q.orig/include/asm-x86/tsc.h
> +++ linux-x86.q/include/asm-x86/tsc.h
> @@ -39,8 +39,8 @@ static __always_inline cycles_t get_cycl
> unsigned eax, edx;
>
> /*
> - * Use RDTSCP if possible; it is guaranteed to be synchronous
> - * and doesn't cause a VMEXIT on Hypervisors
> + * Use RDTSCP if possible; it is guaranteed to be synchronous
> + * and doesn't cause a VMEXIT on Hypervisors
> */
> alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
> ASM_OUTPUT2("=a" (eax), "=d" (edx)),
> @@ -50,11 +50,11 @@ static __always_inline cycles_t get_cycl
> return ret;
>
> /*
> - * Don't do an additional sync on CPUs where we know
> - * RDTSC is already synchronous:
> + * Use RDTSC on other CPUs. This might not be fully synchronous,
> + * but it's not a problem: the only coherency we care about is
> + * the GTOD output to user-space, and syscalls are synchronization
> + * points anyway:
> */
> - alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
> - "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
> rdtscll(ret);
>
> return ret;

I don't think this is a good idea. I discussed exactly this item with
Andi Kleen a while ago and afair the serializing instruction was
necessary to fix a backwards walking gettimeofday() on some K8
revisions. Andi Kleen can tell more details, I added him to the CC list.

Joerg

--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy