LinuxLists.cc - [RFC patch 15/15] LTTng timestamp x86

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

On Thu, 16 Oct 2008, Mathieu Desnoyers wrote:
>
> +static inline cycles_t ltt_async_tsc_read(void)

(a) this shouldn't be inline

> + rdtsc_barrier();
> + new_tsc = get_cycles();
> + rdtsc_barrier();
> + do {
> + last_tsc = ltt_last_tsc;
> + if (new_tsc < last_tsc)
> + new_tsc = last_tsc + LTT_MIN_PROBE_DURATION;
> + /*
> + * If cmpxchg fails with a value higher than the new_tsc, don't
> + * retry : the value has been incremented and the events
> + * happened almost at the same time.
> + * We must retry if cmpxchg fails with a lower value :
> + * it means that we are the CPU with highest frequency and
> + * therefore MUST update the value.
> + */
> + } while (cmpxchg64(&ltt_last_tsc, last_tsc, new_tsc) < new_tsc);

(b) This is really quite expensive.

Why do things like this? Make the timestamps be per-cpu. If you do things
like the above, then just getting the timestamp means that every single
trace event will cause a cacheline bounce, and if you do that, you might
as well just not have per-cpu tracing at all.

It really boils down to two cases:

- you do per-CPU traces

If so, you need to ONLY EVER touch per-cpu data when tracing, and the
above is a fundamental BUG. Dirtying shared cachelines makes the whole
per-cpu thing pointless.

- you do global traces

Sure, then the above works, but why bother? You'll get the ordering
from the global trace, you might as well do time stamps with local
counts.

So in neither case does it make any sense to try to do that global
ltt_last_tsc.

Perhaps more importantly - if the TSC really are out of whack, that just
means that now all your timestamps are worthless, because the value you
calculate ends up having NOTHING to do with the timestamp. So you cannot
even use it to see how long something took, because it may be that you're
running on the CPU that runs behind, and all you ever see is the value of
LTT_MIN_PROBE_DURATION.

Linus

2008-10-17 00:13:48

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

On Thu, 16 Oct 2008, Linus Torvalds wrote:
>
> Perhaps more importantly - if the TSC really are out of whack, that just
> means that now all your timestamps are worthless, because the value you
> calculate ends up having NOTHING to do with the timestamp. So you cannot
> even use it to see how long something took, because it may be that you're
> running on the CPU that runs behind, and all you ever see is the value of
> LTT_MIN_PROBE_DURATION.

If it isn't clear: the alternative is to just always use local timestamps.

At least that way the timestamps mean _something_. You can get the
difference between two events when they happen on the same CPU, and it is
about as meaningful as it can be.

Don't even _try_ to make a global clock.

Yes, to be able to compare across CPU's you'd need to have extra
synchronization information (eg offset and frequency things), but quite
frankly, the "global TSC" thing is already worse than even a totally
non-synchronized TSC for the above reasons.

Linus

2008-10-17 01:28:48

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

* Linus Torvalds ([email protected]) wrote:
>
>
> On Thu, 16 Oct 2008, Mathieu Desnoyers wrote:
> >
> > +static inline cycles_t ltt_async_tsc_read(void)
>
> (a) this shouldn't be inline
>

Ok, will fix. I will put this in a new arch/x86/kernel/ltt.c.

> > + rdtsc_barrier();
> > + new_tsc = get_cycles();
> > + rdtsc_barrier();
> > + do {
> > + last_tsc = ltt_last_tsc;
> > + if (new_tsc < last_tsc)
> > + new_tsc = last_tsc + LTT_MIN_PROBE_DURATION;
> > + /*
> > + * If cmpxchg fails with a value higher than the new_tsc, don't
> > + * retry : the value has been incremented and the events
> > + * happened almost at the same time.
> > + * We must retry if cmpxchg fails with a lower value :
> > + * it means that we are the CPU with highest frequency and
> > + * therefore MUST update the value.
> > + */
> > + } while (cmpxchg64(&ltt_last_tsc, last_tsc, new_tsc) < new_tsc);
>
> (b) This is really quite expensive.
>

Ok, let's try to figure out what the use-cases are, because we are
really facing an architectural mess (thanks to Intel and AMD). I don't
think there is a single perfect solution for all, but I'll try to
explain why I accept the cache-line bouncing behavior when
unsynchronized TSCs are detected by LTTng.

First, the most important thing in LTTng is to provide the event flow
in the correct order across CPUs. Secondary to that, getting the precise
execution time is a nice-to-have when the architecture supports it, but
the time granularity itself is not crucially important, as long as we
have a way to determine which of two events close in time happens first.
The principal use-case where I have seen such tracer in action is when
one have to understand why one or more processes are slower than
expected. The root cause can easily sit on another CPU, be a locking
delay in a particular race condition, or just a process waiting for
other processes waiting for a timeout.

> Why do things like this? Make the timestamps be per-cpu. If you do things
> like the above, then just getting the timestamp means that every single
> trace event will cause a cacheline bounce, and if you do that, you might
> as well just not have per-cpu tracing at all.
>

This cache-line bouncing global clock is a best-effort to provide
correct event order in the trace on architectures with unsync tsc. It's
actually better than a global tracing buffer because it limits the
number of cache line transfers required to one per event. Global tracing
buffers may require to transfer many cache lines across CPUs when events
are written across cache lines or larger than a cache line.

> It really boils down to two cases:
>
> - you do per-CPU traces
>
> If so, you need to ONLY EVER touch per-cpu data when tracing, and the
> above is a fundamental BUG. Dirtying shared cachelines makes the whole
> per-cpu thing pointless.

Sharing only a single cache-line is not completely pointless, as
explained above, but yes, there is a big performance hit involved.

I agree that we should maybe add a degree of flexibility in this time
infrastructure to let users select the type of time source they want :

- Global clock, potentially slow on unsynchronized CPUs.
- Local clock, fast, possibility unsynchronized across CPUs.

>
> - you do global traces
>
> Sure, then the above works, but why bother? You'll get the ordering
> from the global trace, you might as well do time stamps with local
> counts.
>

I simply don't like the global traces because of the extra cache-line
bouncing experienced by events written on multiple cache-lines.

> So in neither case does it make any sense to try to do that global
> ltt_last_tsc.
>
> Perhaps more importantly - if the TSC really are out of whack, that just
> means that now all your timestamps are worthless, because the value you
> calculate ends up having NOTHING to do with the timestamp. So you cannot
> even use it to see how long something took, because it may be that you're
> running on the CPU that runs behind, and all you ever see is the value of
> LTT_MIN_PROBE_DURATION.
>

I thought about this one. There is actually a FIXME in the code which
plans to add an IPI called at each timer interrupt to do a "read tsc" on
each CPU. This would give an HZ upper bound to the time precision, which
would give a trace with events ordered across CPUs and manage to have
the execution time at a HZ precision.

So given that global buffers are less efficient that just synchronizing
a single cache-line and that some people are willing to pay the price to
get events synchronized across CPUs and others are not, what do you
think of leaving the choice to the user about globally/locally
synchronized timestamps ?

Thanks for the feedback,

Mathieu

> Linus

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-17 02:20:07

[permalink] [raw]

Subject: RE: [RFC patch 15/15] LTTng timestamp x86

> This cache-line bouncing global clock is a best-effort to provide
> correct event order in the trace on architectures with unsync tsc. It's
> actually better than a global tracing buffer because it limits the
> number of cache line transfers required to one per event.

Even one line bouncing between cpus can be a performamce disaster.
You'll probably hit a serious wall somewhere between 8 and 16
cpus (ia64 has code that looks a lot like this in the gettimeofday()
path because it does not synchronize cpu cycle counters ... some
applications that are overly fond of timestamping internal
events using gettimeofday() end up spending significant time
doing so on large systems ... even with only a few thousands
of calls per second).

-Tony

2008-10-17 17:25:31

by Steven Rostedt

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

On Thu, Oct 16, 2008 at 07:19:48PM -0700, Luck, Tony wrote:
> > This cache-line bouncing global clock is a best-effort to provide
> > correct event order in the trace on architectures with unsync tsc. It's
> > actually better than a global tracing buffer because it limits the
> > number of cache line transfers required to one per event.
>
> Even one line bouncing between cpus can be a performamce disaster.
> You'll probably hit a serious wall somewhere between 8 and 16
> cpus (ia64 has code that looks a lot like this in the gettimeofday()
> path because it does not synchronize cpu cycle counters ... some
> applications that are overly fond of timestamping internal
> events using gettimeofday() end up spending significant time
> doing so on large systems ... even with only a few thousands
> of calls per second).
>

I agree that one cache line bouncer is devastating to performance. But
as Mathieu said, it is better than a global tracer with lots of bouncing
going on. My logdev tracer (something similar to ftrace, but used only
for debugging) use to have a single buffer. By moving it to a per cpu
buffer and using an atomic counter to sort the events, the increase of
speed was a few magnitudes.

ftrace does not have a global counter, but on some boxes with out of
sync TSCs, it could not find race conditions. I had to pull in logdev,
which found the race right away, because of this atomic counter.

logdev adds a bit of perfomance degradation, but for debugging, I don't
care, and it has helped me quite a bit.

ftrace can help in debugging most of the time, but on some boxes with
wacky time stamps, it is useless to find race problems between CPUS. But
ftrace is for production, and can not afford the performance penalty of
a global counter.

-- Steve

2008-10-17 18:08:47

[permalink] [raw]

Subject: RE: [RFC patch 15/15] LTTng timestamp x86

> I agree that one cache line bouncer is devastating to performance. But
> as Mathieu said, it is better than a global tracer with lots of bouncing
> going on.

Scale up enough, and it becomes more than just a performance problem.
When SGI first tried to boot on 512 cpus they found the kernel hung
completely because of a single global atomic counter for how many
interrupts there were. With HZ=1024 and 512 cpus the ensuing cache
line bouncing storm from each interrupt took longer to resolve than
the interval between interrupts.

With higher event rates (1KHz seems relatively low) this wall will
be a problem for smaller systems too.

> ftrace does not have a global counter, but on some boxes with out of
> sync TSCs, it could not find race conditions. I had to pull in logdev,
> which found the race right away, because of this atomic counter.

Perhaps this needs to be optional (and run-time switchable). Some
users (tracking performance issues) will want the tracer to have
the minumum possible effect on the system. Others (chasing race
conditions) will want the best possible ordering of events between
cpus[*].

-Tony

[*] I'd still be concerned that a heavyweight strict ordering might
perturb the system enough to make the race disappear when tracing
is enabled.

2008-10-17 18:42:28

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

* Luck, Tony ([email protected]) wrote:
> > I agree that one cache line bouncer is devastating to performance. But
> > as Mathieu said, it is better than a global tracer with lots of bouncing
> > going on.
>
> Scale up enough, and it becomes more than just a performance problem.
> When SGI first tried to boot on 512 cpus they found the kernel hung
> completely because of a single global atomic counter for how many
> interrupts there were. With HZ=1024 and 512 cpus the ensuing cache
> line bouncing storm from each interrupt took longer to resolve than
> the interval between interrupts.
>
> With higher event rates (1KHz seems relatively low) this wall will
> be a problem for smaller systems too.
>

Hrm, on such systems
- *large* amount of cpus
- no synchronized TSCs

What would be the best approach to order events ? Do you think we should
consider using HPET, event though it's painfully slow ? Would it be
faster than cache-line bouncing on such large boxes ? With a frequency
around 10MHz, that would give a 100ns precision, which should be enough
to order events. However, HPET is known for its poor performances, which
I doubt will do better than the cache-line bouncing alternative.

> > ftrace does not have a global counter, but on some boxes with out of
> > sync TSCs, it could not find race conditions. I had to pull in logdev,
> > which found the race right away, because of this atomic counter.
>
> Perhaps this needs to be optional (and run-time switchable). Some
> users (tracking performance issues) will want the tracer to have
> the minumum possible effect on the system. Others (chasing race
> conditions) will want the best possible ordering of events between
> cpus[*].
>

Yup, I think this solution would work. The user could specify the time
source for a specific set of buffers (a trace) through debugfs files.

> -Tony
>
> [*] I'd still be concerned that a heavyweight strict ordering might
> perturb the system enough to make the race disappear when tracing
> is enabled.
>

Yes, it's true that it may make the race disappear, but what has been
seen in the field (Steven could confirm that) is that it usually makes
the race more likely to appear due to an enlarged race window. But I
guess it all depends on where the activated instrumentation is.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-17 18:59:19

[permalink] [raw]

Subject: RE: [RFC patch 15/15] LTTng timestamp x86

> Hrm, on such systems
> - *large* amount of cpus
> - no synchronized TSCs
>
> What would be the best approach to order events ?

There isn't a perfect solution for this. My feeling is
that your best hope is with per-cpu buffers logged with
the local TSC ... together with some fancy heuristics to
post-process the logs to come up with the best approximation
to the actual ordering.

If you have a tight upper bound estimate for the
errors in converting from "per-cpu" TSC values to "global
system time" then the post processing tool will be able
to identify events for which the order is uncertain.

> Do you think we should consider using HPET, event though it's
> painfully slow ? Would it be faster than cache-line bouncing
> on such large boxes ? With a frequency around 10MHz, that
> would give a 100ns precision, which should be enough
> to order events.

This sounds like a poor choice. Makes all traces very
slow. 100ns precision isn't all that good ... we can
probably do almost as well estimating the delta between
TSC on different cpus.

-Tony

2008-10-17 19:18:15

by Steven Rostedt

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

On Fri, 17 Oct 2008, Mathieu Desnoyers wrote:

> * Luck, Tony ([email protected]) wrote:
> > > I agree that one cache line bouncer is devastating to performance. But
> > > as Mathieu said, it is better than a global tracer with lots of bouncing
> > > going on.
> >
> > Scale up enough, and it becomes more than just a performance problem.
> > When SGI first tried to boot on 512 cpus they found the kernel hung
> > completely because of a single global atomic counter for how many
> > interrupts there were. With HZ=1024 and 512 cpus the ensuing cache
> > line bouncing storm from each interrupt took longer to resolve than
> > the interval between interrupts.
> >
> > With higher event rates (1KHz seems relatively low) this wall will
> > be a problem for smaller systems too.
> >
>
> Hrm, on such systems
> - *large* amount of cpus
> - no synchronized TSCs

What about selective counting? Or have counters per nodes? If you are
dealing with a race, most cases, the race is not happening against CPUs
not sharing a node. Those not sharing will try hard not to ever use the
same cache lines.

>
> What would be the best approach to order events ? Do you think we should
> consider using HPET, event though it's painfully slow ? Would it be
> faster than cache-line bouncing on such large boxes ? With a frequency
> around 10MHz, that would give a 100ns precision, which should be enough
> to order events. However, HPET is known for its poor performances, which
> I doubt will do better than the cache-line bouncing alternative.
>
> > > ftrace does not have a global counter, but on some boxes with out of
> > > sync TSCs, it could not find race conditions. I had to pull in logdev,
> > > which found the race right away, because of this atomic counter.
> >
> > Perhaps this needs to be optional (and run-time switchable). Some
> > users (tracking performance issues) will want the tracer to have
> > the minumum possible effect on the system. Others (chasing race
> > conditions) will want the best possible ordering of events between
> > cpus[*].
> >
>
> Yup, I think this solution would work. The user could specify the time
> source for a specific set of buffers (a trace) through debugfs files.
>
> > -Tony
> >
> > [*] I'd still be concerned that a heavyweight strict ordering might
> > perturb the system enough to make the race disappear when tracing
> > is enabled.
> >
>
> Yes, it's true that it may make the race disappear, but what has been
> seen in the field (Steven could confirm that) is that it usually makes
> the race more likely to appear due to an enlarged race window. But I
> guess it all depends on where the activated instrumentation is.

I've seen both. 9 out of 10 times, the tracer helps induce the race. But
I've had that 1 out of 10 where it makes the race go away.

Actually, what happens is that I'll start adding trace markers (printk
like traces), and the race will happen quicker. Then I'll add a few more
markers and the race goes away. Those are the worst ;-)

-- Steve

2008-10-17 19:39:18

by Christoph Lameter

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

Luck, Tony wrote:
> Even one line bouncing between cpus can be a performamce disaster.
> You'll probably hit a serious wall somewhere between 8 and 16
> cpus (ia64 has code that looks a lot like this in the gettimeofday()
> path because it does not synchronize cpu cycle counters ... some

The code exist by necessity because some systems do not have synchronized ITCs
and one would not have time go backward. The cmpxchg there is usually switched
off. Its horrible in terms of scaling to large numbers of processor and also
horrible in terms of clock accuracy.

2008-10-17 20:28:29

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

* Luck, Tony ([email protected]) wrote:
> > Hrm, on such systems
> > - *large* amount of cpus
> > - no synchronized TSCs
> >
> > What would be the best approach to order events ?
>
> There isn't a perfect solution for this. My feeling is
> that your best hope is with per-cpu buffers logged with
> the local TSC ... together with some fancy heuristics to
> post-process the logs to come up with the best approximation
> to the actual ordering.
>
> If you have a tight upper bound estimate for the
> errors in converting from "per-cpu" TSC values to "global
> system time" then the post processing tool will be able
> to identify events for which the order is uncertain.
>

The only problem I see with "fancy heuristics" regarding the time base
is that when we detect that something is going wrong in the kernel or in
a userspace program, the *very last* thing we want to do is to doubt
about the reliability of the time source. When a problematic situation
is detected, it makes a huge difference whether this information can be
trusted or not. I've seen much simpler algorithms in the past (I'm
referring to the original LTT heartbeat here) which were told to be
plain simple but ended up being buggy and unreliable in rare
corner-cases (it did not take into account interrupt latency). After
fixing the main problems, I decided to start all over from scratch,
because unreliable timestamps means unreliable traces and this is not
something I am willing to provide.

> > Do you think we should consider using HPET, event though it's
> > painfully slow ? Would it be faster than cache-line bouncing
> > on such large boxes ? With a frequency around 10MHz, that
> > would give a 100ns precision, which should be enough
> > to order events.
>
> This sounds like a poor choice. Makes all traces very
> slow. 100ns precision isn't all that good ... we can
> probably do almost as well estimating the delta between
> TSC on different cpus.
>
> -Tony
>

100ns is not bad at all actually, especially given we don't plan to
require a memory barrier to be issued around the timestamp counter
reads. Memory read/writes can easily be reordered so they cause timing
skew in the order of 100ns. Also, just the TSC frequency drift and
imprecision of the TSC synchronization even when they are synchronized
(which is typically one cache line transfer delay when the TSCs are not
synchronized by the BIOS/mobo) is also in the order of 100ns. So sorry, I
disagree and think 100ns is actually the kind of precision we can expect
even from TSC reads.

Having read a lot about the subtle timestamp counter bugs one can find
in Intel and AMD boxes (gross summary of my findings here :
http://ltt.polymtl.ca/svn/trunk/lttv/doc/developer/tsc.txt), I think
there is no reliable way to give an upper bound on the timing
inaccurary, even with heroic measures trying to map the specific bugs of
each of those CPUs, when you have stuff like the southbridge temperature
throttling slowing down your CPU clock without notifying the kernel. And
as a said above, the timestamping code should be _very_ _very_ simple,
given that the first thing a kernel developer will point his finger at
when a tracer discovers a bug in his code is the tracer itself. So let's
save everyone precious time and make this code easy to review. :-)

So we are talking about performance impact of time base reads. Let's
look at some interesting numbers :

On my x86_64 box
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 6
cpu MHz : 2000.060
(dual quad-core, *not* NUMA)

Cycles to read (from 10000 loops, on a single CPU) :

get_cycles : 60
cpuid + get_cycles + cpuid (with rdtsc_barrier) : 77
ltt_get_timestamp_64() with synchronized TSC : 79
ltt_get_timestamp_64() with non-synchronized TSC (cache-hot) : 163
ltt_get_timestamp_64() with non-synchronized TSC (after clflush) : 310
(just doing the clflush takes 68 cycles, has been substracted from
the previous result)
HPET : 945

So if we have 512 processors doing timestamp reads like crazy, we can
suppose the execution to be serialized by cacheline transfer operations
from cpu to cpu. Therefore, assuming a worse-case scenario where all the
timestamp reads cause a cache line transfer, the 310 cycles (upper
bound) it takes to do the cmpxchg, with CPUs running at 2.0GHz, means
that we can do 6 451 612 timestamp reads per second on the overall
system. On 512 nodes, that means we can do 12 600 timestamp
reads/second/cpu.

Compared to this, HPET would offer a slower time base read (945 cycles
per read is fairly slow, which gives 2 116 402 time stamp read per
second), but if this mmio read is done in parallel across CPUs (I
don't see any reason why it should hold the bus exclusively for a simple
read.. ?), then it would scale much better so we could expect about 2.1M
timestamp read/second/cpu.

I guess the cache-line bouncing approach would get much worse with NUMA
systems, and in that case HPET could become increasingly interesting.

To give an order of magnitude, I expect some worse-case scenarios to be
around 8MB/s/cpu when tracing stuff like lockdep in circular per-cpu
memory buffers. With an average of, say, 16 bytes per event, including
the event header and payload, that would mean 524 288 events/second/cpu.
In that case, using the cache-line bouncing approach on a 512 nodes box
would simply kill the system, but if the HPET reads does not imply
serialization, time base reads would add a 25% performance loss in this
utterly extreme case, which is acceptable given this is a best-effort.
Compared to this, the TSC-based solution (given we have synchronized
TSCs) would add a 2% performance hit.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-17 23:52:50

[permalink] [raw]

Subject: RE: [RFC patch 15/15] LTTng timestamp x86

Complexity of dealing with all the random issues that have
plagued TSC in different cpus over the years definitely
seems to be a problem.

I have one more idea on how we might be able to use
TSC locally and still have confidence that we can
merge local cpu buffers into a consistent stream.

What if we read the HPET occasionally (once per
second?) and add a record to our per-cpu buffer
with the value of the HPET. That would give us
a periodic cross-check of each cpus TSC against
real time so that a "smart" post-processor can
sanity check the log entries at regular intervals.

It doesn't deal with the truly insane TSC behaivours
(like stopping completely in certain C states, or
varying frequency) ... but it would at least be able
to reliably detect these forms of insanity.

We need periodic entries added to the buffer anyway
to make sure we can detect rollover since we don't
want waste space in log records with a full width
TSC value.

-Tony

2008-10-18 17:01:49

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

* Luck, Tony ([email protected]) wrote:
> Complexity of dealing with all the random issues that have
> plagued TSC in different cpus over the years definitely
> seems to be a problem.
>

Yes :(

> I have one more idea on how we might be able to use
> TSC locally and still have confidence that we can
> merge local cpu buffers into a consistent stream.
>
> What if we read the HPET occasionally (once per
> second?) and add a record to our per-cpu buffer
> with the value of the HPET. That would give us
> a periodic cross-check of each cpus TSC against
> real time so that a "smart" post-processor can
> sanity check the log entries at regular intervals.
>

Hrm, that would make the timestamps much more sensitive to tracing
hiccups :

- if interrupts are disabled for a long time on the system (kernel bug
or at early boot), we cannot assume those HPET events will be logged
at the expected interval.
- if we are in buffer full condition (buffers are too small to handle
the load and we drop events on buffer full condition), we will not
only have missing events : given we depend on those HPET events to
have a consistent time-base, all the trace time-base must be
considered untrustable.
- we would also have to get this HPET timer value at each subbuffer
boundary (at each page in Steven's implementation). This is required
so we can make sense of the time-base of buffers when we only gather
the last subbuffers written, given the previous ones have been
overwritten in flight-recorder mode. However, with a relatively large
load and small subbuffers (e.g. 4kB), we would have to get this HPET
value 2048 times/second/cpu. On a 512 nodes machine, it may become a
problem. See my analysis of poor HPET scalability below.

> It doesn't deal with the truly insane TSC behaivours
> (like stopping completely in certain C states, or
> varying frequency) ... but it would at least be able
> to reliably detect these forms of insanity.
>

I also like the one done by AMD when the cycle counter goes backward
one a single CPU. :) Hrm, I thought those you say are truly insane
behaviors are exactly the ones we are trying to deal with ?

And what do we say when we detect this ? "sorry, please upgrade your
hardware to get a reliable trace" ? ;)

> We need periodic entries added to the buffer anyway
> to make sure we can detect rollover since we don't
> want waste space in log records with a full width
> TSC value.
>

Nope, this is not required. I removed the heartbeat event from LTTng two
weeks ago, implementing detection of the delta from the last timestamp
written into the trace. If we detect that the new timestamp is too far
from the previous one, we write the full 64 bits TSC in an extended
event header. Therefore, we have no dependency on interrupt latency to
get a sane time-base.

> -Tony
>

Here are some numbers showing the scalability of synchronized TSC vs
cache-line bouncing vs HPET read under tracing load. I use LTTng to take
a trace only in circular per-cpu memory buffers while tbench is running.
I look at the resulting tbench speed. This kind of load generates a lot
of tracing data especially because tbench does a lot of small
read/writes which generates a lot of system call events. Side-note:
LTTng is currently fully dynamic and parses the format string like
printk, and this is accountable for a large part of the performance
degradation. LTTng however supports to override this probe with
"specialized" probes which know exactly which types to record. I just
did not create any yet. So let's focus on timestamping :

model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 6
cpu MHz : 2000.073

tbench, x86_64 dual quad-core, 2.0GHz, 16GB ram Speed Slowdown

(8 cores up)
No tracing : 1910.50 MB/sec
Flight recorder tracing (per-cpu memory buffers)
synchronized TSC, get_cycles with cpuid : 940.20 MB/sec (50%)
unsync TSC, get_cycles + cmpxchg : 716.96 MB/sec (62%)
unsync TSC, HPET read : 586.53 MB/sec (69%)

(2 cores up)
No tracing : 488.15 MB/sec
Flight recorder tracing (per-cpu memory buffers)
synchronized TSC, get_cycles with cpuid : 241.34 MB/sec (50%)
unsync TSC, get_cycles + cmpxchg : 202.30 MB/sec (58%)
unsync TSC, HPET read : 187.04 MB/sec (61%)

(1 core up)
No tracing : 270.67 MB/sec
Flight recorder tracing (per-cpu memory buffers)
synchronized TSC, get_cycles with cpuid : 126.82 MB/sec (53.1%)
unsync TSC, get_cycles + cmpxchg : 124.54 MB/sec (53.9%)
unsync TSC, HPET read : 98.75 MB/sec (63.5%)

So, the conclusion it brings about scalability of those time sources
regarding tracing is :
- local TSC read scales very well when the number of CPU increases
(constant 50% overhead)
- Comparing the added overhead of both get_cyles+cmpxchg and HPET to
the local sync TSC :

cores get_cycles+cmpxchg HPET
1 0.8% 10%
2 8 % 11%
8 12 % 19%

So, is it me, or HPET scales even more poorly than a cache-line bouncing
cmpxchg ? I find it a bit surprising.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-18 17:37:05

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

On Sat, 18 Oct 2008, Mathieu Desnoyers wrote:
>
> So, the conclusion it brings about scalability of those time sources
> regarding tracing is :
> - local TSC read scales very well when the number of CPU increases
> (constant 50% overhead)

You should basically expect it to scale perfectly. Of course the tracing
itself adds overhead, and at some point the trace data generation may add
so much cache/memory traffic that you start getting worse scaling because
of _that_, but just a local TSC access itself will be perfect on any sane
setup.

> - Comparing the added overhead of both get_cyles+cmpxchg and HPET to
> the local sync TSC :
>
> cores get_cycles+cmpxchg HPET
> 1 0.8% 10%
> 2 8 % 11%
> 8 12 % 19%
>
> So, is it me, or HPET scales even more poorly than a cache-line bouncing
> cmpxchg ? I find it a bit surprising.

I don't think that's strictly true.

The cacheline is going to generally be faster than HPET ever will, since
caches are really important. But as you can see, the _degradation_ is
actually worse for the cacheline, since the cacheline works perfectly in
the UP case (not surprising) and starts degrading a lot more when you
start getting bouncing.

And I'm not sure what the behaviour would be for many-core, but I would
not be surprised of the cmpxchg actually ends up losing at some point. The
HPET is never fast (you can think of it as "uncached access"), and it's
going to degrade too (contention at the IO hub level), but it's actually
possible that the contention at some point becomes less than wild
bouncing.

Many cacheline bouncing issues end up being almost exponential. When you
*really* get bouncing, things degrade in a major way. I don't think you've
seen the worst of it with 8 cores ;)

And that's why I'd really like to see the "only local TSC" access, even if
I admit that the code is going to be much more subtle, and I will also
admit that especially in the presense of frequency changes *and* hw with
unsynchronized TSC's you may be in the situation where you never get
exactly what you want.

But while you may not like some of the "purely local TSC" issues, I would
like to point out that

- In _practice_, it's going to be essentially perfect on a lot of
machines, and under a lot of loads.

For example, yes it's true that frequency changes will make TSC things
less reliable on a number of machines, but people already end up
disabling dynamic cpufreq when doing various benchmark runs, simply
because people want more consistent numbers for benchmarking across
different kernels etc.

So it's entirely possible (and I'd say "likely") that most people are
simply willing to do the same thing for tracing if they are tracing
things at a level where CPU frequency changes might otherwise matter.

So maybe the "local TSC" approach isn't always perfect, but I'd expect
that quite often people who do tracing are willing to work around it.
The people doing tracing are generally not doing so without being aware
of what they are up to..

- While there is certainly a lot of hardware out there with flaky TSC's,
there's also a lot of hardware (especially upcoming) that do *not* have
flaky TSC's. We've been complaining to Intel about TSC behavior for
years, and the thing is, it actually _is_ improving. It just takes some
time.

- So considering that some of the tracing will actually be very important
on machines that have lots of cores, and considering that a lot of the
issues can generally be worked around, I really do think that it's
worth trying to spend a bit of effort on doing the "local TSC + timely
corrections"

For example, you mention that interrupts can be disabled for a time,
delaying things like regular sync events with some stable external clock
(say the HPET). That's true, and it would even be a problem if you'd use
the time of the interrupt itself as the source of the sync, but you don't
really need to depend on the timing of the interrupt - just that it
happens "reasonably often" (and now we're talking _much_ longer timeframes
than some interrupt-disabled time - we're talking tenths of seconds or
even more).

Then, rather than depend on the time of the interrupt, you just purely can
check the local TSC against the HPET (or other source), and synchronize
just _purely_ based on those. That you can do by basically doing something
like

do {
start = read_tsc();
hpet = read_hpet();
end = read_tsc();
} while (end - start > ERROR);

and now, even if you have interrupts enabled (or worry about NMI's), you
now know that you have a totally _independent_ sync point, ie you know
that your hpet read value is withing ERROR cycles of the start/end values,
so now you have a good point for doing future linear interpolation based
on those kinds of sync points.

And if you make all these linear interpolations be per-CPU (so you have
per-CPU offsets and frequencies) you never _ever_ need to touch any shared
data at all, and you know you can scale basically perfectly.

Your linear interpolations may not be _perfect_, but you'll be able to get
them pretty damn near. In fact, even if the TSC's aren't synchronized at
all, if they are at least _individually_ stable (just running at slightly
different frequencies because they are in different clock domains, and/or
at different start points), you can basically perfect the precision over
time.

Linus

2008-10-18 17:50:53

by Ingo Molnar

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

* Linus Torvalds <[email protected]> wrote:

> And if you make all these linear interpolations be per-CPU (so you
> have per-CPU offsets and frequencies) you never _ever_ need to touch
> any shared data at all, and you know you can scale basically
> perfectly.
>
> Your linear interpolations may not be _perfect_, but you'll be able to
> get them pretty damn near. In fact, even if the TSC's aren't
> synchronized at all, if they are at least _individually_ stable (just
> running at slightly different frequencies because they are in
> different clock domains, and/or at different start points), you can
> basically perfect the precision over time.

there's been code submitted by Michael Davidson recently that looked
interesting, which turns the TSC into such an entity:

http://lkml.org/lkml/2008/9/25/451

The periodic synchronization uses the hpet, but it thus allows lockless
and globally correct readouts of the TSC .

And that would match the long term goal as well: the hw should do this
all automatically. So perhaps we should have a trace_clock() after all,
independent of sched_clock(), and derived straight from RDTSC.

The approach as propoed has a couple of practical problems, but if we
could be one RDTSC+multiplication away from a pretty good timestamp that
would be rather useful, very fast and very robust ...

Ingo

2008-10-20 18:07:53

[permalink] [raw]

Subject: RE: [RFC patch 15/15] LTTng timestamp x86

> And what do we say when we detect this ? "sorry, please upgrade your
> hardware to get a reliable trace" ? ;)

My employer might be happy with that answer ;-) ... but I think
we could tell the user to:

1) adjust something in /sys/...
2) boot with some special option
3) rebuild kernel with CONFIG_INSANE_TSC=y

to switch over to a heavyweight workaround in s/w. Systems
that require this are already in the minority ... and I
think (hope!) that current and future generations of cpus
won't have these challenges.

So this is mostly a campaign for the default code path to
be based on current (sane) TSC behaviour ... with the workarounds
for past problems kept to one side.

> Nope, this is not required. I removed the heartbeat event from LTTng two
> weeks ago, implementing detection of the delta from the last timestamp
> written into the trace. If we detect that the new timestamp is too far
> from the previous one, we write the full 64 bits TSC in an extended
> event header. Therefore, we have no dependency on interrupt latency to
> get a sane time-base.

Neat. Could you grab the HPET value here too?

> (8 cores up)

Interesting results. I'm not at all sure why HPET scales so badly.
Maybe some h/w throttling/synchronizing going on???

-Tony

2008-10-20 20:11:23

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

On Fri, 17 Oct 2008, Mathieu Desnoyers wrote:
>
> Hrm, on such systems
> - *large* amount of cpus
> - no synchronized TSCs
>
> What would be the best approach to order events ?

My strong opinion has been - for a longish while now, and independently of
any timestamping code - that we should be seriously looking at basically
doing essentially a "ntp" inside the kernel to give up the whole idiotic
notion of "synchronized TSCs". Yes, TSC's are often synchronized, but even
when they are, we might as well _think_ of them as not being so.

In other words, instead of expecting internal clocks to be synchronized,
just make the clock be a clock network of independent TSC domains. The
domains could in theory be per-package (assuming TSC is synchronized at
that level), but even if we _could_ do that, we'd probably still be better
off by simply always doing it per-core. If only because then the reading
would be per-core.

I think it's a mistake for us to maintain a single clock for
gettimeofday() (well, "getnstimeofday" and the whole "clocksource_read()"
crud to be technically correct). And sure, I bet clocksource_read() can do
various per-CPU things and try to do that, but it's complex and pretty
generic code, and as far as I know none of the clocksources have even
tried. The TSC clocksource read certainly does not (it just does a very
similar horrible "at least don't go backwards" crud that the LTTng patch
suggested).

So I think we should make "xtime" be a per-CPU thing, and add support for
per-CPU clocksources. And screw that insane "mark_tsc_unstable()" thing.

And if we did it well, we migth be able to get good timestamps that way
too.

Linus

2008-10-20 21:39:55

by john stultz

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

On Mon, Oct 20, 2008 at 1:10 PM, Linus Torvalds
<[email protected]> wrote:
> I think it's a mistake for us to maintain a single clock for
> gettimeofday() (well, "getnstimeofday" and the whole "clocksource_read()"
> crud to be technically correct). And sure, I bet clocksource_read() can do
> various per-CPU things and try to do that, but it's complex and pretty
> generic code, and as far as I know none of the clocksources have even
> tried. The TSC clocksource read certainly does not (it just does a very
> similar horrible "at least don't go backwards" crud that the LTTng patch
> suggested).
>
> So I think we should make "xtime" be a per-CPU thing, and add support for
> per-CPU clocksources. And screw that insane "mark_tsc_unstable()" thing.
>
> And if we did it well, we migth be able to get good timestamps that way
> too.

Personally I'd been hoping that the experiments in the trace
timestamping code would provide a safe area of experimentation before
we adapt it to the TSC clocksource implementation for
getnstimeofday(). Earlier I know Andi and Jiri were working on such a
per-cpu TSC clocksource, but I don't know where it ended up.

I'm not quite sure I followed your per-cpu xtime thoughts. Could you
explain further your thinking as to why the entire timekeeping
subsystem should be per-cpu instead of just keeping that back in the
arch-specific clocksource implementation? In other words, why keep
things synced at the nanosecond level instead of keeping the per-cpu
TSC synched at the cycle level?

thanks
-john

2008-10-20 22:07:32

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

On Mon, 20 Oct 2008, john stultz wrote:
>
> I'm not quite sure I followed your per-cpu xtime thoughts. Could you
> explain further your thinking as to why the entire timekeeping
> subsystem should be per-cpu instead of just keeping that back in the
> arch-specific clocksource implementation? In other words, why keep
> things synced at the nanosecond level instead of keeping the per-cpu
> TSC synched at the cycle level?

I don't think you can kep them sync'ed without taking frequency drift into
account. When you have multiple boards (ie big boxes), they simply _will_
be in different clock domains. They won't have the exact same frequency.

So the "rewrite the TSC every once in a while" approach (where "after
coming out of idle" is just a special case of "once in a while" due to
many CPU's losing TSC in idle) works well in the kind of situation where
you really only have a single clock domain, and the TSC's are all
basically from the same reference clock. And that's a common case, but it
certainly isn't the _only_ case.

What about fundamnetally different frequencies (old TSC's that change with
cpufreq)? Or what about just subtle different ones (new TSC's but on
separate sockets that use separate external clocks)?

But sure, I can imagine using a global xtime, but just local TSC offsets
and frequencies, and just generating a local offset from xtime. BUT HOW DO
YOU EXPECT TO DO THAT?

Right now, the global xtime offset thing also depends on the fact that we
have a single global TSC offset! That whole "delta against xtime" logic
depends very much on this:

/* calculate the delta since the last update_wall_time: */
cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;

and that base-time setting depends on a _global_ clock source. Why?
Because it depends on setting that in sync with updating xtime.

And maybe I'm missing something. But I do not believe that it's easy to
just make the TSC be per-CPU. You need per-cpu correction factors, but you
_also_ need a per-CPU time base.

Oh, I'm sure you can do hacky things, and work around known issues, and
consider the TSC to be globally stable in a lot of common schenarios.
That's what you get by re-syncing after idle etc. And it's going to work
in a lot of situations.

But it's not going to solve the "hey, I have 512 CPU's, they are all on
different boards, and no, they are _not_ synchronized to one global
clock!".

That's why I'd suggest making _purely_ local time, and then aiming for
something NTP-like. But maybe there are better solutions out there.

Linus

2008-10-20 22:19:24

by Ingo Molnar

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

* Linus Torvalds <[email protected]> wrote:

> That's why I'd suggest making _purely_ local time, and then aiming for
> something NTP-like. But maybe there are better solutions out there.

this 'fast local time' was the rough idea we tried to implement via the
cpu_clock(cpu) interface.

cpu_clock() results are loosely coupled to xtime in every scheduler tick
via the scd->tick_gtod logic.

( That way in a sense it tracks NTP time as well, if NTP is fed back
into GTOD, such as when ntpd is running. Granted, this is not the same
quality at all as if it did native NTP-alike corrections, but it at
least has a long-term stability. )

And it only ever does cross-CPU work if we specifically ask for a remote
clock:

if (cpu != raw_smp_processor_id()) {
struct sched_clock_data *my_scd = this_scd();

lock_double_clock(scd, my_scd);

it still does this "serialization looking" even in the local case:

__raw_spin_lock(&scd->lock);
clock = __update_sched_clock(scd, now);
}

__raw_spin_unlock(&scd->lock);

... but that lock is strictly per CPU, so it only matters if there _is_
cross-CPU "interest" in that clock. Otherwise these locks are in essence
just per CPU and cause no cacheline bouncing, etc.

... but we could try to eliminate even that potential for any locking.
On 64-bit it's a real possibility i think. (we need the lock for 32-bit
mainly, the timestamps are all 64 bits)

... it also has all the tsc-stops-in-idle smarts, knows about cpufreq,
etc. Those things are needed even on UP, to not get really bad
transients in time.

That still leaves us with sched_clock() complexity, which has spread out
a bit more than it should have. So it's not all as simple as you'd like
it to be i think, but we are trying hard ...

Ideas to simplify/robustify it are welcome.

Ingo

2008-10-20 22:32:23

by H. Peter Anvin

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

Linus Torvalds wrote:
>
> But it's not going to solve the "hey, I have 512 CPU's, they are all on
> different boards, and no, they are _not_ synchronized to one global
> clock!".
>
> That's why I'd suggest making _purely_ local time, and then aiming for
> something NTP-like. But maybe there are better solutions out there.
>

At the same time, it would definitely be nice to encourage vendors of
large SMP systems to provide a common root crystal (frequency standard)
for a single SMP domain. Preferrably a really good one, TCXO or better.

-hpa

2008-10-20 23:47:47

by john stultz

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

On Mon, 2008-10-20 at 15:06 -0700, Linus Torvalds wrote:
>
> On Mon, 20 Oct 2008, john stultz wrote:
> >
> > I'm not quite sure I followed your per-cpu xtime thoughts. Could you
> > explain further your thinking as to why the entire timekeeping
> > subsystem should be per-cpu instead of just keeping that back in the
> > arch-specific clocksource implementation? In other words, why keep
> > things synced at the nanosecond level instead of keeping the per-cpu
> > TSC synched at the cycle level?
>
> I don't think you can kep them sync'ed without taking frequency drift into
> account. When you have multiple boards (ie big boxes), they simply _will_
> be in different clock domains. They won't have the exact same frequency.
>
> So the "rewrite the TSC every once in a while" approach (where "after
> coming out of idle" is just a special case of "once in a while" due to
> many CPU's losing TSC in idle) works well in the kind of situation where
> you really only have a single clock domain, and the TSC's are all
> basically from the same reference clock. And that's a common case, but it
> certainly isn't the _only_ case.
>
> What about fundamnetally different frequencies (old TSC's that change with
> cpufreq)? Or what about just subtle different ones (new TSC's but on
> separate sockets that use separate external clocks)?

Ok. Thanks, the clarification about dealing with the multiple frequency
domains helps me understand what you're looking for and why per-cpu time
bases would be needed.

I was assuming that we were just looking at the single frequency domain,
but unsynced TSCs due to idle halting (or maybe just very slight
frequency skew).

<snip>
> Oh, I'm sure you can do hacky things, and work around known issues, and
> consider the TSC to be globally stable in a lot of common schenarios.
> That's what you get by re-syncing after idle etc. And it's going to work
> in a lot of situations.

Yea, and indeed this is path we've been on, because folks have had quite
a bit of difficulty getting the single freq domain solution working. So
small hacks have been added over time, hoping to get there for just one
freq.

> But it's not going to solve the "hey, I have 512 CPU's, they are all on
> different boards, and no, they are _not_ synchronized to one global
> clock!".

Yep. And for now we dodge that by pushing to use an stable global
clocksource like HPET for these cases, at the cost of performance.

> That's why I'd suggest making _purely_ local time, and then aiming for
> something NTP-like. But maybe there are better solutions out there.

The difficulty with NTP-like, is distributed systems tend to expect
slight deltas between machines. Userland gettimeofday() users do not
expect detectable skew between cpus.

Getting that last part right without those "at least don't go backwards"
hacks is hard.

I'll keep thinking about it.

thanks
-john

2008-10-21 18:11:41

by Bjorn Helgaas

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

On Monday 20 October 2008 04:29:07 pm H. Peter Anvin wrote:
> Linus Torvalds wrote:
> >
> > But it's not going to solve the "hey, I have 512 CPU's, they are all on
> > different boards, and no, they are _not_ synchronized to one global
> > clock!".
> >
> > That's why I'd suggest making _purely_ local time, and then aiming for
> > something NTP-like. But maybe there are better solutions out there.
>
> At the same time, it would definitely be nice to encourage vendors of
> large SMP systems to provide a common root crystal (frequency standard)
> for a single SMP domain. Preferrably a really good one, TCXO or better.

A single root crystal is nice for us software guys. But it often
also turns into a single point of failure, which the hardware guys
are always trying to eliminate. So I think multiple crystals are
inevitable for the really large machines.

Bjorn

2008-10-22 15:53:42

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

* Linus Torvalds ([email protected]) wrote:
>
>
> On Sat, 18 Oct 2008, Mathieu Desnoyers wrote:
> >
> > So, the conclusion it brings about scalability of those time sources
> > regarding tracing is :
> > - local TSC read scales very well when the number of CPU increases
> > (constant 50% overhead)
>
> You should basically expect it to scale perfectly. Of course the tracing
> itself adds overhead, and at some point the trace data generation may add
> so much cache/memory traffic that you start getting worse scaling because
> of _that_, but just a local TSC access itself will be perfect on any sane
> setup.
>

Given the rest of tracing mainly consists in reading cache-hot data to
put it in per-cpu buffers, it actually scales very well.

> > - Comparing the added overhead of both get_cyles+cmpxchg and HPET to
> > the local sync TSC :
> >
> > cores get_cycles+cmpxchg HPET
> > 1 0.8% 10%
> > 2 8 % 11%
> > 8 12 % 19%
> >
> > So, is it me, or HPET scales even more poorly than a cache-line bouncing
> > cmpxchg ? I find it a bit surprising.
>
> I don't think that's strictly true.
>
> The cacheline is going to generally be faster than HPET ever will, since
> caches are really important. But as you can see, the _degradation_ is
> actually worse for the cacheline, since the cacheline works perfectly in
> the UP case (not surprising) and starts degrading a lot more when you
> start getting bouncing.
>
> And I'm not sure what the behaviour would be for many-core, but I would
> not be surprised of the cmpxchg actually ends up losing at some point. The
> HPET is never fast (you can think of it as "uncached access"), and it's
> going to degrade too (contention at the IO hub level), but it's actually
> possible that the contention at some point becomes less than wild
> bouncing.
>

Too bad I don't have enough cores to generate a meaningful figure, but
I've been surprised to see how bad the HPET evolved between 2 and 8
cores (11% impact -> 19% impact) vs cacheline (8% -> 12%). The real big
step for cacheline bouncing seems to be between 1 and 2 CPUs, but after
that it seems to increase much more slowly than HPET.

> Many cacheline bouncing issues end up being almost exponential. When you
> *really* get bouncing, things degrade in a major way. I don't think you've
> seen the worst of it with 8 cores ;)
>

I wonder how this can end up being exponential considering I could
switch this cache-line bouncing access into an uncached memory access.
This would slow down the few CPU cases, but would likely behave much
like the HPET read. And the performance impact of that would be expected
to increase linearly with the number of cores. I think what makes this
linear with the number of cores is because we are talking about a single
cache-line exchange. I standard programs, we have to bounce many
cache-lines between the CPUs, which therefore follows a time complexity
of O(nr cores * nr shared cache-lines). If the number of shared
cache-lines is big, it may look like an exponential increase.

I also wonder how many 8+ cores with non-sync TSC systems there are out
there, and if tracing is a vital requirement for them, and also I wonder
if it's worth all the effort we are putting into this. I am always
tempted to just detect this kind of behavior, keep a slower-but-solid
solution, and point to some documentation that says what workarounds can
be enabled.

> And that's why I'd really like to see the "only local TSC" access, even if
> I admit that the code is going to be much more subtle, and I will also
> admit that especially in the presense of frequency changes *and* hw with
> unsynchronized TSC's you may be in the situation where you never get
> exactly what you want.
>
> But while you may not like some of the "purely local TSC" issues, I would
> like to point out that
>
> - In _practice_, it's going to be essentially perfect on a lot of
> machines, and under a lot of loads.
>
> For example, yes it's true that frequency changes will make TSC things
> less reliable on a number of machines, but people already end up
> disabling dynamic cpufreq when doing various benchmark runs, simply
> because people want more consistent numbers for benchmarking across
> different kernels etc.
>
> So it's entirely possible (and I'd say "likely") that most people are
> simply willing to do the same thing for tracing if they are tracing
> things at a level where CPU frequency changes might otherwise matter.
>
> So maybe the "local TSC" approach isn't always perfect, but I'd expect
> that quite often people who do tracing are willing to work around it.
> The people doing tracing are generally not doing so without being aware
> of what they are up to..

The way I work around this issue in LTTng is to detect non-synchronized
TSCs, print a warning message telling that the time-base used for
tracing is not perfect, and let the user know about some documentation
which explains how to work-around the problem with specific kernel
command line arguments (e.g. idle=poll and disabling freq scaling).
There are however some people who will not be willing to reboot their
computer or change their setup, and this is the use-case where I think
providing a non-perfect solution (which does not scale as well) makes
sense.

>
> - While there is certainly a lot of hardware out there with flaky TSC's,
> there's also a lot of hardware (especially upcoming) that do *not* have
> flaky TSC's. We've been complaining to Intel about TSC behavior for
> years, and the thing is, it actually _is_ improving. It just takes some
> time.
>

Hurray ! :)

> - So considering that some of the tracing will actually be very important
> on machines that have lots of cores, and considering that a lot of the
> issues can generally be worked around, I really do think that it's
> worth trying to spend a bit of effort on doing the "local TSC + timely
> corrections"
>

As we can see in other threads, it seems to have been tried for a while
by very competent people without brilliant success. Given that the
requirements tracing have for timestamps that follows as closely as
possible the event order across CPUs, which is stronger than the
standard kernel time-sources, I think those things should be tried as a
new kernel time source which has more relax constraints, and maybe
eventually become a trace_clock() source if it is judged stable and
precise enough (e.g. we could trace it with a tracer based on a
cache-line bouncing clock to see if the event order is correct on
various loads).

> For example, you mention that interrupts can be disabled for a time,
> delaying things like regular sync events with some stable external clock
> (say the HPET). That's true, and it would even be a problem if you'd use
> the time of the interrupt itself as the source of the sync, but you don't
> really need to depend on the timing of the interrupt - just that it
> happens "reasonably often" (and now we're talking _much_ longer timeframes
> than some interrupt-disabled time - we're talking tenths of seconds or
> even more).

Even the "reasonably often" can be a problem. Some examples :

- boot time tracing, where interrupts can be off for a while.
- some races within the kernel which could disable interrupts for a
long while. Yes, this should _never_ happen, but when it does, a
tracer becomes very handy.

>
> Then, rather than depend on the time of the interrupt, you just purely can
> check the local TSC against the HPET (or other source), and synchronize
> just _purely_ based on those. That you can do by basically doing something
> like
>
> do {
> start = read_tsc();
> hpet = read_hpet();
> end = read_tsc();
> } while (end - start > ERROR);
>
> and now, even if you have interrupts enabled (or worry about NMI's), you
> now know that you have a totally _independent_ sync point, ie you know
> that your hpet read value is withing ERROR cycles of the start/end values,
> so now you have a good point for doing future linear interpolation based
> on those kinds of sync points.
>

Just hope you don't run this on uncommon (e.g. virtualized hardware) and
always have end - start > ERROR, because you would be creating an
infinite loop. This might cause some problems. But yes, that would help
reading the hpet and tsc together. We could probably use the average for
TSC value there too :
tsc = start + ((start - end) >> 1);

> And if you make all these linear interpolations be per-CPU (so you have
> per-CPU offsets and frequencies) you never _ever_ need to touch any shared
> data at all, and you know you can scale basically perfectly.
>
> Your linear interpolations may not be _perfect_, but you'll be able to get
> them pretty damn near. In fact, even if the TSC's aren't synchronized at
> all, if they are at least _individually_ stable (just running at slightly
> different frequencies because they are in different clock domains, and/or
> at different start points), you can basically perfect the precision over
> time.
>

Hrm, you seem to assume the CPU frequency is nearly constant, but I
suspect AMD systems with idle cores will actually make the overall
frequency jump drastically between high and low CPU load. Therefore, I
don't even think we can really consider those to be individually stable.

The other problem with interpolation is when a CPU starts to accelerate.
In the sched_clock code, what is currently done is to cap the tsc value
to a max within the time window between two HPET reads so it does not
appear to go backwards when the next HPET read occurs. Such case would
clearly mess the event synchronization in a noticeable way.

Mathieu

> Linus

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-22 16:20:03

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

* Ingo Molnar ([email protected]) wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > And if you make all these linear interpolations be per-CPU (so you
> > have per-CPU offsets and frequencies) you never _ever_ need to touch
> > any shared data at all, and you know you can scale basically
> > perfectly.
> >
> > Your linear interpolations may not be _perfect_, but you'll be able to
> > get them pretty damn near. In fact, even if the TSC's aren't
> > synchronized at all, if they are at least _individually_ stable (just
> > running at slightly different frequencies because they are in
> > different clock domains, and/or at different start points), you can
> > basically perfect the precision over time.
>
> there's been code submitted by Michael Davidson recently that looked
> interesting, which turns the TSC into such an entity:
>
> http://lkml.org/lkml/2008/9/25/451
>
> The periodic synchronization uses the hpet, but it thus allows lockless
> and globally correct readouts of the TSC .
>
> And that would match the long term goal as well: the hw should do this
> all automatically. So perhaps we should have a trace_clock() after all,
> independent of sched_clock(), and derived straight from RDTSC.
>
> The approach as propoed has a couple of practical problems, but if we
> could be one RDTSC+multiplication away from a pretty good timestamp that
> would be rather useful, very fast and very robust ...
>
> Ingo

Looking at this code, I wonder :

- How it would support virtualization.
- How it would scale to 512 nodes, if we consider that every idle node
is doing an HPET readl each time it exits from safe_halt() (this can
end up taking most of the HPET timer bandwidth). So in the case where
we have 256 idle nodes taking all the HPET timer bandwidth and a 256
nodes doing useful work, the time these HPET reads can take on the
useful nodes when they try to resync with the HPET could be long (they
may need to sample it periodically or at CPU frequency change, or they
may simply go idle once in a while). We might end up having difficulty
getting a CPU out of idle due to the time it takes simply to get hold
of the HPET.

Given the bad scalability numbers I've recently posted for the HPET, I
doubt this a workable solution to the scalability issue.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-22 16:56:43

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

* Luck, Tony ([email protected]) wrote:
> > And what do we say when we detect this ? "sorry, please upgrade your
> > hardware to get a reliable trace" ? ;)
>
> My employer might be happy with that answer ;-) ... but I think
> we could tell the user to:
>
> 1) adjust something in /sys/...
> 2) boot with some special option
> 3) rebuild kernel with CONFIG_INSANE_TSC=y
>
> to switch over to a heavyweight workaround in s/w. Systems
> that require this are already in the minority ... and I
> think (hope!) that current and future generations of cpus
> won't have these challenges.
>
> So this is mostly a campaign for the default code path to
> be based on current (sane) TSC behaviour ... with the workarounds
> for past problems kept to one side.
>

This is exactly what I do in this patchset actually :) The common case,
when a synchronized TSC is detected, is to do a plain TSC read. However,
if a non-synchronized TSC is detected, a warning message is written to
the console (pointing to some documentation to get precise timestamping)
and the heavyweight cmpxchg-based workaround is enabled.

> > Nope, this is not required. I removed the heartbeat event from LTTng two
> > weeks ago, implementing detection of the delta from the last timestamp
> > written into the trace. If we detect that the new timestamp is too far
> > from the previous one, we write the full 64 bits TSC in an extended
> > event header. Therefore, we have no dependency on interrupt latency to
> > get a sane time-base.
>
> Neat. Could you grab the HPET value here too?
>

Yes, I could. When I detect that the TSC value is too far apart from the
previous one, I reserve extra space for the header (this could include
an extra 64-bits for the HPET). At that moment, I could also sample the
HPET, given this happens relatively rarely.

Given the frequency is expected to go at about 1GHz, the 27 bits would
overflow 7-8 times per second. The only thing is that I only need this
extended field when there are absolutely no events in the stream for
an whole overflow period, which is the only case that makes the overflow
impossible to detect without having more TSC bits. In the common case
where there is a steady flow of event, we would never have such "large
TSC header" event.

However, I could do something slightly different from the large TSC
header detection. I could make sure there would be a HPET sampling done
"periodically", or at least periodically when there are events saved to
the buffer by saving, for each buffer, the last TSC value at which the
HPET sampling has been done. When we log following events, we do a HPET
sampling (and write an extended event header) if we are too far apart
from the previous sample.

We would probably need to sample the HPET at subbuffer switch too to
allow fast time-based seek on the trace when we read it.

We could then do a pre-processing on the trace buffers which would
calculate the linear interpolation of cycles counters between the
per-buffer HPET values. The nice thing is that we know the _next_ value
coming _after_ an event (which is not the case for a standard kernel
time-base), so we can be a bit more precise and we do not suffer from
things like "the TSC of a given cpu accelerates and times appears to go
backwards when the next HPET sample is taken".

But I am not sure this would be sufficient to insure generally correct
event order; the maximum interpolation error can become quite large on
systems with different clock speeds in halt states and which would
happen to have a non-steady flow of events.

>
> > (8 cores up)
>
> Interesting results. I'm not at all sure why HPET scales so badly.
> Maybe some h/w throttling/synchronizing going on???
>

As Linus said, there is probably some contention at the IO hub level.
But it implies that we have to be careful about the frequency at which
we sample the HPET, otherwise it wouldn't scale.

Mathieu

> -Tony
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-22 17:05:31

[permalink] [raw]

Subject: Re: [RFC patch 15/15] LTTng timestamp x86

* Linus Torvalds ([email protected]) wrote:
>
>
> On Fri, 17 Oct 2008, Mathieu Desnoyers wrote:
> >
> > Hrm, on such systems
> > - *large* amount of cpus
> > - no synchronized TSCs
> >
> > What would be the best approach to order events ?
>
> My strong opinion has been - for a longish while now, and independently of
> any timestamping code - that we should be seriously looking at basically
> doing essentially a "ntp" inside the kernel to give up the whole idiotic
> notion of "synchronized TSCs". Yes, TSC's are often synchronized, but even
> when they are, we might as well _think_ of them as not being so.
>
> In other words, instead of expecting internal clocks to be synchronized,
> just make the clock be a clock network of independent TSC domains. The
> domains could in theory be per-package (assuming TSC is synchronized at
> that level), but even if we _could_ do that, we'd probably still be better
> off by simply always doing it per-core. If only because then the reading
> would be per-core.
>
> I think it's a mistake for us to maintain a single clock for
> gettimeofday() (well, "getnstimeofday" and the whole "clocksource_read()"
> crud to be technically correct). And sure, I bet clocksource_read() can do
> various per-CPU things and try to do that, but it's complex and pretty
> generic code, and as far as I know none of the clocksources have even
> tried. The TSC clocksource read certainly does not (it just does a very
> similar horrible "at least don't go backwards" crud that the LTTng patch
> suggested).
>
> So I think we should make "xtime" be a per-CPU thing, and add support for
> per-CPU clocksources. And screw that insane "mark_tsc_unstable()" thing.
>
> And if we did it well, we migth be able to get good timestamps that way
> too.
>
> Linus

Yep, it looks like a promising area to look into. I think, however, that
it would be good to first experiment with it as a in-kernel time source
rather than as a tracing time source, so we can use a tracer to make
sure it is stable enough. :-)

Also, we have to wonder if it's worth side-stepping tracing developement
on what I consider being a "special-case for buggy hardware". If we let
development on this specific problem at the kernel level go on its own
and decide to use it for tracing when it's judged good enough, we
(tracing people) can focus on the following steps needed to get a tracer
into Linux, namely buffering, event id management, etc. Given I feel the
need for tracing is relatively urgent for the community, I'd recommend
getting a basic, non-perfect timestamping solution in first, and keep
room for improvement.

I prefer to provide tracing for 98% of the machines out there and point
to some documentation telling how to configure the other 1.95% (and feel
sorry for the people how fall in the inevitable 0.05%) than to spend
years trying to come up with a complex scheme aiming precisely at this
1.95%.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-23 15:50:03