2007-10-30 17:46:04

by Joerg Roedel

Subject: What's the purpose of get_cycles_sync()?

Hi,

I would like to ask what the specific purpose of the get_cycles_sync()
function is in the x86 architecture. In particular, I wonder why
this function has to be *sync*.

I mean, the sync should guarantee that the CPU does not execute the
RDTSC instruction out of order; that's clear. But does that really
matter? If there is a cache or TLB miss before the function returns, all
the accuracy gained by the synchronous RDTSC is lost anyway.

The problem is that this function executes CPUID if RDTSC itself
is not a synchronizing instruction, and CPUID is very often intercepted
by hypervisors (KVM intercepts it, for example). This makes the function
very expensive when the kernel runs as a guest.
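
For illustration, a CPUID-serialized TSC read of the kind described above
could be sketched as follows. This is not the kernel's get_cycles_sync()
source; the name read_tsc_serialized() and the choice of CPUID leaf 0 are
assumptions made for the example, and it requires GCC or Clang on x86.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel function): read the TSC after a
 * serializing CPUID so that earlier instructions have completed first.
 */
static inline uint64_t read_tsc_serialized(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    uint32_t lo, hi;

    /*
     * CPUID is a serializing instruction. Leaf 0 is an arbitrary choice;
     * the result is discarded. When running as a guest, this is the
     * instruction a hypervisor may intercept.
     */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;

    /* RDTSC returns the 64-bit counter split across EDX:EAX. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
```

Every call pays for one CPUID, so in a guest where CPUID traps, every
timestamp read pays for a trip into the hypervisor.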

But maybe I am missing something important here.

Joerg

--
           | AMD Saxony Limited Liability Company & Co. KG
 Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    | Register Court Dresden: HRA 4896
 Research  | General Partner authorized to represent:
 Center    | AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy



2007-10-30 20:21:25

by Andi Kleen

Subject: Re: What's the purpose of get_cycles_sync()?

"Joerg Roedel" writes:

> I would like to ask what the specific purpose of the get_cycles_sync()
> function is in the x86 architecture. In particular, I wonder why
> this function has to be *sync*.

Vojtech had one test that tested time monotonicity over CPUs
and it constantly failed until we added the CPUID on K8 C stepping.
He can give details on the test.

I suspect the reason was because the CPU reordered the RDTSCs so that
a later RDTSC could return a value before an earlier one. This can
happen because gettimeofday() is so fast that a tight loop calling it can
fit more than one iteration into the CPU's reordering window.
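
The test itself is not shown in this thread. As a rough, single-threaded
illustration of the kind of check involved, the sketch below calls a clock
in a tight loop and counts how often it steps backwards. All names here
(clock_fn, gettimeofday_us, count_backward_steps, backwards_clock) are
hypothetical, and a real cross-CPU test would also need threads pinned to
different CPUs.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Any function that returns the current time in some fixed unit. */
typedef uint64_t (*clock_fn)(void);

/* gettimeofday() flattened to microseconds since the epoch. */
static uint64_t gettimeofday_us(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (uint64_t)tv.tv_sec * 1000000u + (uint64_t)tv.tv_usec;
}

/*
 * Call the clock back to back and count every reading that is smaller
 * than the one before it. A monotonic clock should give zero.
 */
static size_t count_backward_steps(clock_fn now, size_t iterations)
{
    size_t backward = 0;
    uint64_t prev = now();

    for (size_t i = 0; i < iterations; i++) {
        uint64_t cur = now();

        if (cur < prev)
            backward++;
        prev = cur;
    }
    return backward;
}

/* Deliberately broken clock (1000, 999, 998, ...) to exercise the checker. */
static uint64_t backwards_clock(void)
{
    static uint64_t t = 1000;

    return t--;
}
```

Passing gettimeofday_us as the clock checks the real time source, while
backwards_clock only demonstrates that the checker reports violations.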

> I mean, the sync should guarantee that the CPU does not execute the
> RDTSC instruction out of order; that's clear. But does that really
> matter? If there is a cache or TLB miss before the function returns, all
> the accuracy gained by the synchronous RDTSC is lost anyway.
>
> The problem is that this function executes CPUID if RDTSC itself
> is not a synchronizing instruction, and CPUID is very often intercepted
> by hypervisors (KVM intercepts it, for example). This makes the function
> very expensive when the kernel runs as a guest.

That is why newer kernels use RDTSCP if available which doesn't need
to be intercepted and is synchronous. And since all AMD SVM systems
have RDTSCP they are fine.
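
A minimal sketch of the RDTSCP path, assuming GCC or Clang on x86, might
look like this. The helper names cpu_has_rdtscp() and read_tsc_rdtscp()
are made up for the example and are not the kernel's; the feature bit
used is CPUID leaf 0x80000001, EDX bit 27.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if CPUID reports RDTSCP support. */
static int cpu_has_rdtscp(void)
{
    uint32_t eax = 0x80000000u, ebx, ecx = 0, edx;

    /* Leaf 0x80000000 returns the highest extended leaf in EAX. */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
    if (eax < 0x80000001u)
        return 0;

    /* Leaf 0x80000001, EDX bit 27 is the RDTSCP feature flag. */
    eax = 0x80000001u;
    ecx = 0;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
    (void)ebx;
    return (int)((edx >> 27) & 1u);
}

/*
 * Hypothetical helper: read the TSC with RDTSCP, which waits for earlier
 * instructions to execute before reading the counter. The value of the
 * TSC_AUX MSR comes back in ECX and is stored through aux if non-NULL.
 * Call only when cpu_has_rdtscp() returns 1.
 */
static inline uint64_t read_tsc_rdtscp(uint32_t *aux)
{
    uint32_t lo, hi, c;

    __asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi), "=c"(c));
    if (aux != NULL)
        *aux = c;
    return ((uint64_t)hi << 32) | lo;
}
```

The CPUID calls here run once at feature-detection time, not on every
timestamp read, which is the point of preferring RDTSCP in a guest.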

On Intel Core2 without RDTSCP the CPUID can still be intercepted right
now, but the real fix there is to re-add FEATURE_SYNC_TSC for Core2 --
the RDTSC there is always monotonic per CPU and the patch that changed
that (f3d73707a1e84f0687a05144b70b660441e999c7) was bogus and must be
reverted. I didn't catch that in time unfortunately.

-Andi

2007-10-30 22:02:41

by Vojtech Pavlik

Subject: Re: What's the purpose of get_cycles_sync()?

On Tue, Oct 30, 2007 at 09:21:02PM +0100, Andi Kleen wrote:
> "Joerg Roedel" writes:
>
> > I would like to ask what the specific purpose of the get_cycles_sync()
> > function is in the x86 architecture. In particular, I wonder why
> > this function has to be *sync*.
>
> Vojtech had one test that tested time monotonicity over CPUs
> and it constantly failed until we added the CPUID on K8 C stepping.
> He can give details on the test.
>
> I suspect the reason was because the CPU reordered the RDTSCs so that
> a later RDTSC could return a value before an earlier one. This can
> happen because gettimeofday() is so fast that a tight loop calling it can
> fit more than one iteration into the CPU's reordering window.

The K8s still guarantee that subsequent RDTSCs return increasing
values, even if the processor reorders them.

What could have been happening then was that the RDTSC instruction might
have been reordered by the CPU out of the seqlock, causing trouble in
the calculation.
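
The reason this matters can be sketched with a user-space seqlock reader
written with C11 atomics. This is an illustration under stated
assumptions, not the kernel's seqlock or timekeeping code: struct
timebase, timebase_update(), timebase_read_ns() and fixed_cycles() are
hypothetical names, and one cycle is assumed to equal one nanosecond to
keep the arithmetic trivial.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical time base protected by a sequence counter. */
struct timebase {
    atomic_uint seq;               /* odd while an update is in progress */
    _Atomic uint64_t base_ns;      /* wall time at the last update */
    _Atomic uint64_t base_cycles;  /* cycle counter at the last update */
};

/* Writer: make seq odd, publish the new base values, make seq even. */
static void timebase_update(struct timebase *tb, uint64_t ns, uint64_t cycles)
{
    unsigned int s = atomic_load_explicit(&tb->seq, memory_order_relaxed);

    atomic_store_explicit(&tb->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&tb->base_ns, ns, memory_order_relaxed);
    atomic_store_explicit(&tb->base_cycles, cycles, memory_order_relaxed);
    atomic_store_explicit(&tb->seq, s + 2, memory_order_release);
}

/*
 * Reader: retry until the base values and the counter reading were all
 * taken within one stable window. The fences order memory accesses only;
 * they say nothing about when a hardware counter read such as RDTSC
 * executes. If that read slips outside the window, "now" can be paired
 * with base values from a different update.
 */
static uint64_t timebase_read_ns(struct timebase *tb,
                                 uint64_t (*read_cycles)(void))
{
    unsigned int s0, s1;
    uint64_t ns, base, now;

    do {
        s0 = atomic_load_explicit(&tb->seq, memory_order_acquire);
        ns = atomic_load_explicit(&tb->base_ns, memory_order_relaxed);
        base = atomic_load_explicit(&tb->base_cycles, memory_order_relaxed);
        now = read_cycles();
        atomic_thread_fence(memory_order_acquire);
        s1 = atomic_load_explicit(&tb->seq, memory_order_relaxed);
    } while ((s0 & 1u) || s0 != s1);

    return ns + (now - base);  /* assumes 1 cycle == 1 ns */
}

/* Stand-in counter for the example; a real reader would read the TSC. */
static uint64_t fixed_cycles(void)
{
    return 70;
}
```

With a base of 1000 ns at cycle 50 and a counter reading of 70, the reader
returns 1020. A counter value taken before an update but combined with
the base values written by that update would instead make time appear to
jump.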

Anyway, adding the CPUID didn't solve all the problems we've seen back
then, and so far none of the approaches for using TSC without acquiring
a spinlock on multi-socket AMD boxes worked 100% correctly.

> That is why newer kernels use RDTSCP if available which doesn't need
> to be intercepted and is synchronous. And since all AMD SVM systems
> have RDTSCP they are fine.
>
> > On Intel Core2 without RDTSCP the CPUID can still be intercepted right
> > now, but the real fix there is to re-add FEATURE_SYNC_TSC for Core2 --
> the RDTSC there is always monotonic per CPU and the patch that changed
> that (f3d73707a1e84f0687a05144b70b660441e999c7) was bogus and must be
> reverted. I didn't catch that in time unfortunately.

--
Vojtech Pavlik
Director SuSE Labs

2007-10-30 22:42:52

by Andi Kleen

Subject: Re: What's the purpose of get_cycles_sync()?

On Tue, Oct 30, 2007 at 11:02:09PM +0100, Vojtech Pavlik wrote:
> > He can give details on the test.
> >
> > I suspect the reason was because the CPU reordered the RDTSCs so that
> > a later RDTSC could return a value before an earlier one. This can
> > happen because gettimeofday() is so fast that a tight loop calling it can
> > fit more than one iteration into the CPU's reordering window.
>
> The K8s still guarantee that subsequent RDTSCs return increasing
> values, even if the processor reorders them.

Ah didn't realize this

>
> What could have been happening then was that the RDTSC instruction might
> have been reordered by the CPU out of the seqlock, causing trouble in
> the calculation.

Ok anyways it fixed that problem. So it cannot be taken out.
>
> Anyway, adding the CPUID didn't solve all the problems we've seen back
> then, and so far none of the approaches for using TSC without acquiring
> a spinlock on multi-socket AMD boxes worked 100% correctly.

The code is not used on multi-core anyways currently (without Jiri's
patch). It should just work correctly on single core.

-Andi

2007-10-31 10:19:04

by Joerg Roedel

Subject: Re: What's the purpose of get_cycles_sync()?

Hi Andi,

On Tue, Oct 30, 2007 at 09:21:02PM +0100, Andi Kleen wrote:
> "Joerg Roedel" writes:
>
> > I would like to ask what the specific purpose of the get_cycles_sync()
> > function is in the x86 architecture. In particular, I wonder why
> > this function has to be *sync*.
>
> Vojtech had one test that tested time monotonicity over CPUs
> and it constantly failed until we added the CPUID on K8 C stepping.
> He can give details on the test.

Interesting, I wasn't aware of that.

> I suspect the reason was because the CPU reordered the RDTSCs so that
> a later RDTSC could return a value before an earlier one. This can
> happen because gettimeofday() is so fast that a tight loop calling it can
> fit more than one iteration into the CPU's reordering window.

OK, so is that why the get_cycles_sync() function exists only on
x86_64 and not on i386, because on i386 gettimeofday() is a real
syscall?

> That is why newer kernels use RDTSCP if available which doesn't need
> to be intercepted and is synchronous. And since all AMD SVM systems
> have RDTSCP they are fine.

The problem with KVM here is that they want to migrate guests between
Intel and AMD boxes, so they don't expose RDTSCP or FEATURE_SYNC_TSC to
the guests in the CPUID calls. As a result, a 64-bit Linux guest will
execute the CPUID in that function.

Joerg

2007-10-31 10:23:56

by Joerg Roedel

Subject: Re: What's the purpose of get_cycles_sync()?

Hi Vojtech,

On Tue, Oct 30, 2007 at 11:02:09PM +0100, Vojtech Pavlik wrote:
> The K8s still guarantee that subsequent RDTSCs return increasing
> values, even if the processor reorders them.
>
> What could have been happening then was that the RDTSC instruction might
> have been reordered by the CPU out of the seqlock, causing trouble in
> the calculation.
>
> Anyway, adding the CPUID didn't solve all the problems we've seen back
> then, and so far none of the approaches for using TSC without acquiring
> a spinlock on multi-socket AMD boxes worked 100% correctly.

Can you tell me more about the problems you have seen, or give me a
pointer to a mail discussion about them? Could you also send me your
test program, please? I want to understand these problems a bit better.

Joerg