Hi,

I've been looking into a problem where Windows applications misbehave
across suspend/resume when run under Wine on x86. These applications see
time going backwards: the timestamp counter (TSC) is reset when the system
resumes. On Windows, on both Intel and AMD, the timestamp is saved and
restored when the system resumes from suspend.

These applications read the timestamp via rdtsc directly, and those reads
cannot be intercepted by Wine. The applications should be fixed to handle
this scenario correctly, but there are hundreds of applications which
cannot be fixed. So some support is required in Wine or the kernel. There
isn't anything Wine can do, as the rdtsc instruction reads the timestamp
directly. The only option is to support something in the kernel.

As more and more things are added to Wine, Windows applications can be
run pretty easily on Linux. But this rdtsc issue is a big hurdle. What
are your thoughts on solving this problem?
We are thinking of saving and restoring the timestamp counter at suspend
and resume time, respectively. In theory it can work on Intel because of
the TSC_ADJUST register. But it'll never work on AMD unless:
* AMD supports the same kind of adjust register. (AMD has said that the
adjust register cannot be implemented in their firmware. They'll have to
add it to their hardware.)
* the kernel synchronizes the TSC manually. (I know you don't like this
idea, but Windows is doing something to save/restore and sync the TSC.)
I really hope that you share some thoughts.
--
BR,
Muhammad Usama Anjum
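
For illustration, a minimal sketch of the save/restore idea on a CPU with
TSC_ADJUST (hand-written for this discussion; the persistent-clock helper
is hypothetical and this is nowhere near a worked-out patch):

/*
 * Illustrative only: keep the TSC monotonic across suspend by bumping
 * TSC_ADJUST on resume. Assumes an invariant TSC and a clock that keeps
 * counting across suspend; persistent_clock_ns() is a made-up helper.
 */
static u64 tsc_suspend_value;	/* TSC sampled before suspend */
static u64 sleep_start_ns;	/* persistent clock at suspend */

void tsc_save_before_suspend(void)
{
	tsc_suspend_value = rdtsc();
	sleep_start_ns = persistent_clock_ns();
}

void tsc_restore_on_resume(void)
{
	u64 slept_ns = persistent_clock_ns() - sleep_start_ns;
	u64 expected = tsc_suspend_value +
		       mul_u64_u32_div(slept_ns, tsc_khz, USEC_PER_SEC);
	s64 adjust;

	rdmsrl(MSR_IA32_TSC_ADJUST, adjust);
	adjust += expected - rdtsc();
	wrmsrl(MSR_IA32_TSC_ADJUST, adjust);
}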
On Thu, Jun 01, 2023 at 10:56:03AM +0200, Peter Zijlstra wrote:
> Wine could set TIF_NOTSC, which will cause it to run with CR4.TSD
> cleared and cause RDTSC to #GP, at which point you can emulate it.
The other option is to have Wine run itself in a (KVM) virtual machine
and mess with the VMM TSC offset :-)
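
For reference, a sketch of what messing with the VMM TSC offset could look
like from the VMM side, assuming a kernel new enough to expose the
KVM_VCPU_TSC_OFFSET vcpu attribute (KVM_CAP_VCPU_ATTRIBUTES):

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Bump the guest TSC offset, e.g. by however far the host TSC went
 * backwards across a suspend, to keep the guest's view monotonic. */
int vcpu_add_tsc_offset(int vcpu_fd, int64_t delta)
{
	int64_t offset;
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (uint64_t)(uintptr_t)&offset,
	};

	if (ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr))
		return -1;
	offset += delta;
	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}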
On Thu, Jun 01, 2023 at 01:45:35PM +0500, Muhammad Usama Anjum wrote:
> These applications read the timestamp via rdtsc directly, and those reads
> cannot be intercepted by Wine. The applications should be fixed to handle
> this scenario correctly, but there are hundreds of applications which
> cannot be fixed. So some support is required in Wine or the kernel. There
> isn't anything Wine can do, as the rdtsc instruction reads the timestamp
> directly. The only option is to support something in the kernel.
Wine could set TIF_NOTSC, which will cause it to run with CR4.TSD
cleared and cause RDTSC to #GP, at which point you can emulate it.
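
A rough userspace sketch of that trap-and-emulate scheme (the offset
handling is illustrative only; note the handler has to re-enable the TSC
around its own read, since CR4.TSD makes it fault too):

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <ucontext.h>
#include <unistd.h>

static uint64_t tsc_offset;	/* e.g. re-derived across suspend/resume */

static void rdtsc_segv(int sig, siginfo_t *si, void *ctx)
{
	ucontext_t *uc = ctx;
	greg_t *g = uc->uc_mcontext.gregs;
	const uint8_t *ip = (const uint8_t *)g[REG_RIP];
	uint32_t lo, hi;
	uint64_t tsc;

	if (ip[0] != 0x0f || ip[1] != 0x31)	/* only handle RDTSC */
		_exit(128 + sig);

	prctl(PR_SET_TSC, PR_TSC_ENABLE);	/* handler must not fault */
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	prctl(PR_SET_TSC, PR_TSC_SIGSEGV);

	tsc = (((uint64_t)hi << 32) | lo) + tsc_offset;
	g[REG_RAX] = (uint32_t)tsc;
	g[REG_RDX] = (uint32_t)(tsc >> 32);
	g[REG_RIP] += 2;			/* skip the two-byte opcode */
}

int install_rdtsc_trap(void)
{
	struct sigaction sa = {
		.sa_sigaction = rdtsc_segv,
		.sa_flags = SA_SIGINFO,
	};

	if (sigaction(SIGSEGV, &sa, NULL))
		return -1;
	return prctl(PR_SET_TSC, PR_TSC_SIGSEGV);
}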
On Thu, Jun 01 2023 at 10:56, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 01:45:35PM +0500, Muhammad Usama Anjum wrote:
>> We are thinking of saving and restoring the timestamp counter at suspend
>> and resume time, respectively. In theory it can work on Intel because of
>> the TSC_ADJUST register. But it'll never work on AMD unless:
>> * AMD supports the same kind of adjust register. (AMD has said that the
>> adjust register cannot be implemented in their firmware. They'll have to
>> add it to their hardware.)
>> * the kernel synchronizes the TSC manually. (I know you don't like this
>> idea, but Windows is doing something to save/restore and sync the TSC.)
>
> Wine could set TIF_NOTSC, which will cause it to run with CR4.TSD
> cleared and cause RDTSC to #GP, at which point you can emulate it.
We should ask Microsoft to do the same. That'll fix the direct RDTSC
usage quickly. :)
On Thu, Jun 01 2023 at 13:45, Muhammad Usama Anjum wrote:
> As more and more things are added to Wine, Windows applications can be
> run pretty easily on Linux. But this rdtsc issue is a big hurdle. What
> are your thoughts on solving this problem?
Who would have thought that rdtsc() in applications can be a problem.
Interfaces to query time exist for a reason and it's documented by
Microsoft:
https://learn.microsoft.com/en-us/windows/win32/dxtecharts/game-timing-and-multicore-processors
But sure, reading documentation is overrated...
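
For reference, the documented interface amounts to something like:

#include <windows.h>

/* Elapsed seconds the documented way: QPC ticks at a fixed frequency
 * and is consistent across cores, unlike raw RDTSC. */
static double elapsed_seconds(LARGE_INTEGER t0, LARGE_INTEGER t1)
{
	LARGE_INTEGER freq;

	QueryPerformanceFrequency(&freq);	/* ticks per second */
	return (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}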
> We are thinking of saving and restoring the timestamp counter at suspend
> and resume time, respectively. In theory it can work on Intel because of
> the TSC_ADJUST register. But it'll never work on AMD unless:
> * AMD supports the same kind of adjust register. (AMD has said that the
> adjust register cannot be implemented in their firmware. They'll have to
> add it to their hardware.)
> * the kernel synchronizes the TSC manually. (I know you don't like this
> idea, but Windows is doing something to save/restore and sync the TSC.)
Synchronizing TSC by writing the TSC MSR is fragile as hell. This has
been tried so often and never reliably passed all synchronization tests
on a wide range of systems.
It kinda works on single socket, but not on larger systems.
We spent an insane amount of time to make timekeeping correct and I'm
not interested at all in dealing with the fallout of such a mechanism.
I could be persuaded to make this work when TSC_ADJUST is available, but
that's it.
But even that might turn out to be just a solution for the moment,
because there is a plan under way for the TSC to grow an irreversible
lock bit, which prevents everything including SMM from fiddling with it,
and which in turn spares the TSC_ADJUST sanity checks post boot.
Thanks,
tglx
On Thu, Jun 01 2023 at 12:26, Thomas Gleixner wrote:
> On Thu, Jun 01 2023 at 13:45, Muhammad Usama Anjum wrote:
>> We are thinking of saving and restoring the timestamp counter at suspend
>> and resume time, respectively.
I assume you are talking about suspend-to-disk here, right? Suspend-to-RAM
definitely does not have the problem, at least not on any halfway
contemporary CPU.
>> In theory it can work on Intel because of
>> the TSC_ADJUST register. But it'll never work on AMD unless:
>> * AMD supports the same kind of adjust register. (AMD has said that the
>> adjust register cannot be implemented in their firmware. They'll have to
>> add it to their hardware.)
>> * the kernel synchronizes the TSC manually. (I know you don't like this
>> idea, but Windows is doing something to save/restore and sync the TSC.)
>
> Synchronizing TSC by writing the TSC MSR is fragile as hell. This has
> been tried so often and never reliably passed all synchronization tests
> on a wide range of systems.
>
> It kinda works on single socket, but not on larger systems.
Here is an example where it falls flat on its nose.
One of the early Ryzen laptops had a broken BIOS which came up with
unsynchronized TSCs. I tried to fix that up, but couldn't get it to sync
on all CPUs because for some stupid reason the TSC write got
arbitrarily delayed (presumably by SMI/SMM).
After the vendor fixed the BIOS, I tried again and the problem
persisted.
So on such a machine the 'fixup time' mechanism would simply render an
otherwise perfectly fine TSC unusable for timekeeping.
We asked both Intel and AMD to add TSC_ADJUST probably 15 years
ago. Intel added it with some HSW variants (IIRC) and since SKL all CPUs
have it. I don't know why AMD thought it wasn't required. That could have
spared a gazillion bugzilla entries vs. the early Ryzen machines.
Thanks,
tglx
On Thursday, June 1st, 2023 at 11:20 AM, Thomas Gleixner <[email protected]> wrote:
> Here is an example where it falls flat on its nose.
>
> One of the early Ryzen laptops had a broken BIOS which came up with
> unsynchronized TSCs. I tried to fix that up, but couldn't get it to sync
> on all CPUs because for some stupid reason the TSC write got
> arbitrarily delayed (presumably by SMI/SMM).
Hah, I remember that. That was actually my laptop. A Lenovo ThinkPad A485 with a Ryzen 2700U. I've seen the problem since then occasionally on newer Ryzen laptops (and even desktops). Without the awful "tsc=directsync" patch I wrote, which I've been carrying for years now in my own kernel builds, it just falls back to HPET. It's not pleasant, but at least it's a stable clock.
> After the vendor fixed the BIOS, I tried again and the problem
> persisted.
>
> So on such a machine the 'fixup time' mechanism would simply render an
> otherwise perfectly fine TSC unusable for timekeeping.
>
> We asked both Intel and AMD to add TSC_ADJUST probably 15 years
> ago. Intel added it with some HSW variants (IIRC) and since SKL all CPUs
> have it. I don't know why AMD thought it wasn't required. That could have
> spared a gazillion bugzilla entries vs. the early Ryzen machines.
>
Agreed, TSC_ADJUST is the ultimate solution for any of these kinds of issues. But last I heard from AMD, it's still several years out in silicon, and there's plenty of hardware to maintain compatibility with. Ugh.
A software solution would be preferable in the meantime, but I don't know what options are left at this point.
The trap-and-emulate via SIGSEGV approach proposed earlier in the thread is unfortunately not likely to be practical, assuming I implemented it properly.
One issue is how much overhead it has. This is an instruction that normally executes in roughly 50 clock cycles (RDTSC) to 100 clock cycles (RDTSCP) on Zen 3. Based on a proof-of-concept I wrote, the overhead of trapping and emulating with a signal handler is roughly 100x. On my Zen 3 system, it goes up to around 10000 clock cycles per trapped read of RDTSCP. Most Windows games that use this instruction directly are doing so under the assumption that the TSC is faster to read than any of the native Windows API clock sources. If it's suddenly ~100x slower than even the slowest-to-read Windows clocksource, those games would likely become entirely unplayable, depending on how frequently they do TSC reads. (And many do so quite often!)
Also, my proof-of-concept doesn't actually do the emulation part. It just traps the instruction and then executes that same instruction in the signal handler, putting the results in the right registers. So it's a pass-through approach, which is about the best you can do performance wise.
Another issue is that the implementation might be tricky. In the case of Wine, you'd need to enable PR_TSC_SIGSEGV whenever entering the Windows executable and PR_TSC_ENABLE whenever leaving it. If you don't, any of the normally well-behaved clock sources implemented using the TSC (e.g. CLOCK_MONOTONIC_RAW, etc) would also fault on the Wine side. Also, there's some Windows-specific trickery, in that the Windows registry exposes the TSC frequency in a couple of places, so those would need to be replaced with the frequency of the emulated clocksource.
- Steven
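
A sketch of the toggling Steven describes; the hook names are hypothetical
placeholders for wherever Wine transitions in and out of Windows code:

#include <sys/prctl.h>

void enter_windows_code(void)	/* hypothetical Wine hook */
{
	/* Application RDTSC now traps and can be emulated. */
	prctl(PR_SET_TSC, PR_TSC_SIGSEGV);
}

void leave_windows_code(void)	/* hypothetical Wine hook */
{
	/* TSC-based clock sources on the Wine side work natively again. */
	prctl(PR_SET_TSC, PR_TSC_ENABLE);
}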
On June 1, 2023 12:07:38 PM PDT, Steven Noonan <[email protected]> wrote:
> The trap-and-emulate via SIGSEGV approach proposed earlier in the thread
> is unfortunately not likely to be practical, assuming I implemented it
> properly.
It seems to me that this is one of several reasons that it might be desirable to wrap the Windows executable in a KVM wrapper, exactly to be able to intercept non-system-call related system differences.
I realize this is not a small change...
On Thu, Jun 01 2023 at 19:07, Steven Noonan wrote:
> On Thursday, June 1st, 2023 at 11:20 AM, Thomas Gleixner <[email protected]> wrote:
>> Here is an example where it falls flat on its nose.
>>
>> One of the early Ryzen laptops had a broken BIOS which came up with
>> unsynchronized TSCs. I tried to fix that up, but couldn't get it to sync
>> on all CPUs because for some stupid reason the TSC write got
>> arbitrarily delayed (presumably by SMI/SMM).
>
> Hah, I remember that. That was actually my laptop. A Lenovo ThinkPad
> A485 with a Ryzen 2700U. I've seen the problem since then occasionally
> on newer Ryzen laptops (and even desktops). Without the awful
> "tsc=directsync" patch I wrote, which I've been carrying for years now
> in my own kernel builds, it just falls back to HPET. It's not
> pleasant, but at least it's a stable clock.
Well, yours seem at least to sync. The silly box I tried refused due to
SMM value add magic.
> Agreed, TSC_ADJUST is the ultimate solution for any of these kinds of
> issues. But last I heard from AMD, it's still several years out in
> silicon, and there's plenty of hardware to maintain compatibility
> with. Ugh.
Yes.
> A software solution would be preferable in the meantime, but I don't
> know what options are left at this point.
Not that many.
> The trap-and-emulate via SIGSEGV approach proposed earlier in the
> thread is unfortunately not likely to be practical, assuming I
> implemented it properly.
That's why I said we need to ask Microsoft to do the same so that the
applications get fixed. :)
> Most Windows games that use this instruction directly are doing so
> under the assumption that the TSC is faster to read than any of the
> native Windows API clock sources.
The recommended interface, QueryPerformanceCounter(), is actually not much
slower and is safe. But sure, performance first; correctness is overrated.
So back to the options:

1) Kernel

   If at all, then this needs to be disabled by default and enabled by
   a command line option, along with a big fat warning that it might
   disable the TSC for timekeeping and that bug reports related to this
   are going to be ignored.

   Honestly I'm not too interested in this. It's yet another piece of
   art which needs to be maintained and kept alive for a long time.

   The fact that we need to check for synchronized TSCs in the first
   place is hilarious already. TSC_ADJUST makes the resynchronization
   attempt at least halfway sensible.

   Without it, it's just a pile of never-going-to-be-correct heuristics
   with a flood of "this fixes it for my machine (and breaks the rest)"
   patches.

2) Binary patching

   Unfortunately RDTSC is only a two-byte instruction, but there are
   enough advanced binary patching tools to deal with that.

   It might be a completely crazy idea, but I wouldn't dismiss it
   before trying.
Thanks,
tglx
On Thu, Jun 01 2023 at 22:10, Thomas Gleixner wrote:
> So back to the options:
>
> 1) Kernel
> ...
> 2) Binary patching
> ...
Duh. Hit send too early
3) Virtualization

   Obviously not trivial either, but definitely workable.
Thanks,
tglx
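
A crude sketch of the binary patching idea from option 2 above: rewrite
each RDTSC to INT3 plus a NOP and emulate from a SIGTRAP handler. This is
illustrative only; a real tool must decode instruction boundaries first,
since the raw byte scan below can hit data or the tail of another
instruction.

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static int patch_rdtsc_sites(uint8_t *code, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	uint8_t *start = (uint8_t *)((uintptr_t)code & ~(uintptr_t)(page - 1));
	size_t span = (code + len) - start;
	int patched = 0;

	if (mprotect(start, span, PROT_READ | PROT_WRITE | PROT_EXEC))
		return -1;

	for (size_t i = 0; i + 1 < len; i++) {
		if (code[i] == 0x0f && code[i + 1] == 0x31) {
			code[i] = 0xcc;		/* INT3: trap to SIGTRAP */
			code[i + 1] = 0x90;	/* NOP filler */
			patched++;
		}
	}
	mprotect(start, span, PROT_READ | PROT_EXEC);

	/* A SIGTRAP handler then stuffs the emulated TSC into EDX:EAX and
	 * advances RIP past the NOP, much like the SIGSEGV variant. */
	return patched;
}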
On Thu, Jun 01, 2023 at 07:07:38PM +0000, Steven Noonan wrote:
> One issue is how much overhead it has. This is an instruction that
> normally executes in roughly 50 clock cycles (RDTSC) to 100 clock
> cycles (RDTSCP) on Zen 3. Based on a proof-of-concept I wrote, the
> overhead of trapping and emulating with a signal handler is roughly
> 100x. On my Zen 3 system, it goes up to around 10000 clock cycles per
> trapped read of RDTSCP.
What about kernel based emulation? You could tie it into user_dispatch
and have a user_dispatch tsc offset.
So regular kernel emulation simply returns the native value (keeps the
VDSO working for one), but then from a user_dispatch range, it returns
+offset.
That is; how slow is the below?
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 58b1f208eff5..18175b45db1f 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -645,6 +645,26 @@ static bool fixup_iopl_exception(struct pt_regs *regs)
 	return true;
 }
 
+static bool fixup_rdtsc_exception(struct pt_regs *regs)
+{
+	unsigned short bytes;
+	u32 eax, edx;
+
+	if (get_user(bytes, (const unsigned short __user *)regs->ip))
+		return false;
+
+	/* RDTSC is 0x0f 0x31, read from user text as a little-endian word */
+	if (bytes != 0x310f)
+		return false;
+
+	asm volatile ("rdtsc" : "=a" (eax), "=d" (edx));
+	regs->ax = eax;
+	regs->dx = edx;
+
+	regs->ip += 2;
+	return true;
+}
+
 /*
  * The unprivileged ENQCMD instruction generates #GPs if the
  * IA32_PASID MSR has not been populated. If possible, populate
@@ -752,6 +772,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	if (fixup_iopl_exception(regs))
 		goto exit;
 
+	if (fixup_rdtsc_exception(regs))
+		goto exit;
+
 	if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
 		goto exit;
On Thursday, June 1st, 2023 at 1:31 PM, Peter Zijlstra <[email protected]> wrote:
> What about kernel based emulation? You could tie it into user_dispatch
> and have a user_dispatch tsc offset.
>
> So regular kernel emulation simply returns the native value (keeps the
> VDSO working for one), but then from a user_dispatch range, it returns
> +offset.
>
> That is; how slow is the below?
It's around 1800-1900 clock cycles on this system (modified patch attached, compile fix + rdtscp support).
It's definitely better than the userspace signal handler (20x vs 100x). Also compared to reading one of the clock_gettime() clocks when current_clocksource is 'hpet', it's about twice as fast. So that's at least in the realm of being usable.
Since faulting would still make the vDSO clocks go through this path we'd have to be careful that whatever offsets we throw into this path don't affect the correctness of the other clocks.
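
Numbers of this kind can be reproduced with a loop along these lines (a
sketch, not the actual benchmark used; it reports nanoseconds per read,
and works whether or not RDTSC is currently trapping):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static inline uint64_t read_tsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	enum { N = 1000000 };
	struct timespec a, b;
	volatile uint64_t sink;
	double ns;
	int i;

	clock_gettime(CLOCK_MONOTONIC_RAW, &a);
	for (i = 0; i < N; i++)
		sink = read_tsc();	/* traps if PR_TSC_SIGSEGV is set */
	clock_gettime(CLOCK_MONOTONIC_RAW, &b);

	ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
	printf("%.1f ns per RDTSC\n", ns / N);
	return 0;
}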
On Thu, Jun 01, 2023 at 09:41:15PM +0000, Steven Noonan wrote:
> On Thursday, June 1st, 2023 at 1:31 PM, Peter Zijlstra <[email protected]> wrote:
> > What about kernel based emulation? You could tie it into user_dispatch
> > and have a user_dispatch tsc offset.
> >
> > So regular kernel emulation simply returns the native value (keeps the
> > VDSO working for one), but then from a user_dispatch range, it returns
> > +offset.
> >
> > That is; how slow is the below?
>
> It's around 1800-1900 clock cycles on this system
Much more expensive than the actual instruction of course, but that seems
eminently usable.
> (modified patch attached, compile fix + rdtscp support).
Right, that's what I get for writing 'patches' while falling asleep :/
> Since faulting would still make the vDSO clocks go through this path
> we'd have to be careful that whatever offsets we throw into this path
> don't affect the correctness of the other clocks.
Hence the suggested tie-in with user-dispatch; only add the offset when
the IP is from the user-dispatch range.
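
Concretely, the fixup above would grow something along these lines; the
ud_* fields are hypothetical stand-ins for however the user-dispatch state
would end up being extended:

	u64 tsc = ((u64)edx << 32) | eax;

	/* Hypothetical: only faults from the user-dispatch range get the
	 * offset, so vDSO and other native users keep the real TSC. */
	if (regs->ip >= current->ud_start && regs->ip < current->ud_end)
		tsc += current->ud_tsc_offset;

	regs->ax = lower_32_bits(tsc);
	regs->dx = upper_32_bits(tsc);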
...
> Who would have thought that rdtsc() in applications can be a problem.
> Interfaces to query time exist for a reason and it's documented by
> Microsoft:
>
> https://learn.microsoft.com/en-us/windows/win32/dxtecharts/game-timing-and-multicore-processors
>
> But sure, reading documentation is overrated...
That even says:
"Multiprocessor and dual-core systems do not guarantee synchronization
of their cycle counters between cores."
> Synchronizing TSC by writing the TSC MSR is fragile as hell. This has
> been tried so often and never reliably passed all synchronization tests
> on a wide range of systems.
>
> It kinda works on single socket, but not on larger systems.
>
> We spent an insane amount of time to make timekeeping correct and I'm
> not interested at all in dealing with the fallout of such a mechanism.
I've wondered whether the TSC ought to be deliberately mis-synchronised,
so that the high-order bits are effectively the CPU number.
It has to be said that using it as a time source was fundamentally
a bad idea.
Sometimes (e.g. micro-benchmarks) you really want a TSC.
You can extract one from the performance counters, but it is hard,
root only, and the library functions have high and variable overhead.
David
On Mon, Jun 05 2023 at 10:27, David Laight wrote:
> It has to be said that using it as a time source was fundamentally
> a bad idea.
Too bad you weren't around many moons ago and educated us on that. That
would have saved us lots of trouble and work.
> Sometimes (eg micro benchmarks) you really want a TSC.
> You can extract one from the performance counters, but it is hard,
> root only, and the library functions have high and variable overhead.
Interesting view that high end databases are considered micro benchmarks
which need root access.
I'm sure you already talked to the developers of such code about how they
can eliminate their performance problems when VDSO/TSC time queries are
not available.
Alternatively you have a replacement implementation to make VDSO work
with the same performance and precision based on (potentially
non-existing) legacy time sources.
There are damned good practical reasons why we spent a lot of effort to
implement VDSO and make TSC usable at least on any modern platform.
Micro-benchmarks are definitely not one of those reasons.
Thanks,
tglx
From: Thomas Gleixner <[email protected]>
> Sent: 05 June 2023 15:44
>
> On Mon, Jun 05 2023 at 10:27, David Laight wrote:
> > It has to be said that using it as a time source was fundamentally
> > a bad idea.
>
> Too bad you weren't around many moons ago and educated us on that. That
> would have saved us lots of trouble and work.
Indeed :-)
I do remember thinking the TSC was really a good time source when
I first saw it being done about 30 years ago.
>
> > Sometimes (eg micro benchmarks) you really want a TSC.
> > You can extract one from the performance counters, but it is hard,
> > root only, and the library functions have high and variable overhead.
>
> Interesting view that high end databases are considered micro benchmarks
> which need root access.
I'm thinking of benchmarking the IP checksum code where you are
trying to find out how many bytes/clock the loop is doing.
On recent x86-64 the theoretical limit (without fighting AVX) is 16
bytes/clock; I've measured 12, and 8 is relatively easy.
(The current asm code runs at 4 on older cpu, doesn't get
much above 6 at all.)
What happens is that the cpu frequency speeds up as soon as the
test starts but the TSC frequency stays constant.
So you can only use the TSC to measure time, not execution speed.
Run enough copies of 'while :; do :; done &' to make all but one
cpu busy and the cpus all speed up giving completely different
TSC counts for short loops.
David
On 6/5/23 08:54, David Laight wrote:
> From: Thomas Gleixner <[email protected]>
>> Sent: 05 June 2023 15:44
>>
>> On Mon, Jun 05 2023 at 10:27, David Laight wrote:
>>> It has to be said that using it as a time source was fundamentally
>>> a bad idea.
>>
>> Too bad you weren't around many moons ago and educated us on that. That
>> would have saved us lots of trouble and work.
>
> Indeed :-)
> I do remember thinking the TSC was really a good time source when
> I first saw it being done about 30 years ago.
>
The TSC is certainly not perfect; partly because, ironically enough, it
was introduced just *before* out-of-order execution and power management
entered the x86 world.
It is no secret that it has been slow to catch up. It was easy to put a
counter in; it is a *lot* harder to make it work in all the possible
scenarios in the power-managed, out-of-order world.
It is one of my personal pet projects in the architecture work to push
to get that last distance; we are not yet there.
>
> I'm thinking of benchmarking the IP checksum code where you are
> trying to find out how many bytes/clock the loop is doing.
> On recent x86-64 the theoretical limit (without fighting AVX) is 16
> bytes/clock; I've measured 12, and 8 is relatively easy.
> (The current asm code runs at 4 on older cpu, doesn't get
> much above 6 at all.)
>
> What happens is that the cpu frequency speeds up as soon as the
> test starts but the TSC frequency stays constant.
> So you can only use the TSC to measure time, not execution speed.
>
> Run enough copies of 'while :; do :; done &' to make all but one
> cpu busy and the cpus all speed up giving completely different
> TSC counts for short loops.
>
That is the reason for architecturally fixed performance counters.
-hpa
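
For example, a per-thread core-cycle counter via perf_event_open(), which
normally does not need root; unlike the TSC, the cycles event follows the
actual core frequency:

#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_cycle_counter(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_kernel = 1;	/* keeps perf_event_paranoid happy */
	attr.exclude_hv = 1;

	/* this thread (pid 0), any CPU, no group, no flags */
	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Usage: read() a u64 from the fd before and after the measured loop;
 * the delta is core cycles, so bytes/clock comes out right even when
 * the core frequency ramps during the test. */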
From: H. Peter Anvin
> Sent: 05 June 2023 17:32
...
> The TSC is certainly not perfect; partly because, ironically enough, it
> was introduced just *before* out-of-order execution and power management
> entered the x86 world.
Another issue is that the crystal used for the cpu clock won't be
that accurate (in terms of ppm error rate), and will have significant
temperature drift.
OTOH the crystal in the traditional x86 motherboard 'clock' chip
is (meant to be) designed to have long term accuracy.
While reading the TSC is a lot faster, there ought to have been
some kind of PLL to continuously adjust the measured TSC frequency
to keep it synchronised with the timer chip.
(Instead kernels end up writing the drifted TSC based time back to
the timer chip during shutdown.)
> It is no secret that it has been slow to catch up. It was easy to put a
> counter in; it is a *lot* harder to make it work in all the possible
> scenarios in the power-managed, out-of-order world.
That rather depends on what you mean by 'work' :-)
> It is one of my personal pet projects in the architecture work to push
> to get that last distance; we are not yet there.
For performance measurements possibly what you want is a simple
clock counter which is dependent on an input register.
So pretty much zero overhead but is guaranteed to happen after
some other instruction without really affecting the pipeline.
IIRC the x86 performance counters aren't dependent on anything
so they tend to execute much earlier than you want.
OTOH rdtsc is likely to be synchronising and affect what follows.
ISTR using rdtsc to wait for instructions to complete and then
the performance clock counter to see how long it took.
David
On 6/6/23 01:23, David Laight wrote:
>
> IIRC the x86 performance counters aren't dependent on anything
> so they tend to execute much earlier than you want.
> OTOH rdtsc is likely to be synchronising and affect what follows.
> ISTR using rdtsc to wait for instructions to complete and then
> the performance clock counter to see how long it took.
>
RDPMC and RDTSC have the same (lack of) synchronization guarantees; you
need to fence them appropriately for your application no matter what.
-hpa