2008-10-16 23:56:05

by Mathieu Desnoyers

Subject: [RFC patch 00/15] Tracer Timestamping

Hi,

Starting with the bottom of my LTTng patchset
(git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-lttng.git)
I am posting as RFC the timestamping infrastructure I have been using for a
while in the tracer. It integrates the get_cycles() standardization work I did
more recently, following David Miller's comments.

It also deals with 32->64-bit timestamp counter extension with an RCU-style
algorithm, which is especially useful on MIPS and SuperH architectures.

There is also a TSC synchronization test within this patchset to detect
unsynchronized TSCs. See comments in this specific patch to figure out the
difference between the current x86 tsc_sync.c and the one I propose in this
patch.

It also provides an architecture-agnostic fallback in case there is no
timestamp counter available: basically, it's
(jiffies << 13) | atomically_incremented_counter (if there are more than 8192
events per jiffy, time will still be monotonic, but will increment faster than
the actual system frequency).
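
A rough sketch of what this fallback boils down to (illustrative names, not
the actual patch code):

/*
 * Fallback trace clock sketch: (jiffies << 13) | atomic counter.
 * A real implementation would also need to fold or reset the counter
 * on each jiffy tick; this only illustrates the formula above.
 */
#include <linux/jiffies.h>
#include <linux/types.h>
#include <asm/atomic.h>

#define TRACE_CLOCK_SHIFT 13	/* room for 8192 events per jiffy */

static atomic_t trace_clock_cnt = ATOMIC_INIT(0);

static inline u64 trace_clock_fallback(void)
{
	/*
	 * Monotonic; with more than 8192 events per jiffy the counter
	 * bits spill into the jiffies bits, so time increments faster
	 * than the actual system frequency.
	 */
	return ((u64)jiffies << TRACE_CLOCK_SHIFT)
		| (u32)atomic_add_return(1, &trace_clock_cnt);
}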

Comments are welcome. Note that this is only the beginning of the patchset. I
plan to submit the event ID allocation/portable event typing (aimed at
exporting the data to userspace) and the buffering mechanism as soon as I
integrate a core version of the LTTV userspace tools into the kernel build
tree. Other than that, I currently have a tracer which fulfills most of the
requirements expressed earlier. I just fear that if I release only the kernel
part, without a foolproof binary-to-ASCII trace decoder within the kernel,
people might be a bit reluctant to fetch a separate userspace package.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2008-10-17 07:59:18

by Peter Zijlstra

Subject: Re: [RFC patch 00/15] Tracer Timestamping

On Thu, 2008-10-16 at 19:27 -0400, Mathieu Desnoyers wrote:
> Hi,
>
> Starting with the bottom of my LTTng patchset
> (git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-lttng.git)
> I am posting as RFC the timestamping infrastructure I have been using for a
> while in the tracer. It integrates the get_cycles() standardization work I did
> more recently, following David Miller's comments.
>
> It also deals with 32->64-bit timestamp counter extension with an RCU-style
> algorithm, which is especially useful on MIPS and SuperH architectures.

Have you looked at the existing 32->63 extension code in
include/linux/cnt32_to_63.h and considered unifying it?
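
For reference, the algorithm in that header is roughly the following
(paraphrased from the 2008-era include/linux/cnt32_to_63.h; see the header
itself for the authoritative version):

#include <linux/types.h>
#include <linux/compiler.h>

union cnt32_to_63 {
	struct {
#if defined(__LITTLE_ENDIAN)
		u32 lo, hi;
#elif defined(__BIG_ENDIAN)
		u32 hi, lo;
#endif
	};
	u64 val;
};

#define cnt32_to_63(cnt_lo) \
({ \
	static volatile u32 __m_cnt_hi; \
	union cnt32_to_63 __x; \
	__x.hi = __m_cnt_hi; \
	__x.lo = (cnt_lo); \
	if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
		__m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
	__x.val; /* bit 63 is transient junk; callers use the low 63 bits */ \
})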

> There is also a TSC synchronization test within this patchset to detect
> unsynchronized TSCs.

We already have such code, no? Does this code replace that one or just
add a second test?

> See comments in this specific patch to figure out the
> difference between the current x86 tsc_sync.c and the one I propose in this
> patch.

Right so you don't unify, that's a missed opportunity, no?

> It also provides an architecture-agnostic fallback in case there is no
> timestamp counter available: basically, it's
> (jiffies << 13) | atomically_incremented_counter (if there are more than 8192
> events per jiffy, time will still be monotonic, but will increment faster than
> the actual system frequency).
>
> Comments are welcome. Note that this is only the beginning of the patchset. I
> plan to submit the event ID allocation/portable event typing (aimed at
> exporting the data to userspace) and the buffering mechanism as soon as I
> integrate a core version of the LTTV userspace tools into the kernel build
> tree. Other than that, I currently have a tracer which fulfills most of the
> requirements expressed earlier. I just fear that if I release only the kernel
> part, without a foolproof binary-to-ASCII trace decoder within the kernel,
> people might be a bit reluctant to fetch a separate userspace package.

It might be good to drop all the ltt naming and pick more generic names,
esp. as ftrace could use a lot of this infrastructure as well.

2008-10-20 20:25:30

by Mathieu Desnoyers

Subject: Re: [RFC patch 00/15] Tracer Timestamping

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2008-10-16 at 19:27 -0400, Mathieu Desnoyers wrote:
> > Hi,
> >
> > Starting with the bottom of my LTTng patchset
> > (git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-lttng.git)
> > I am posting as RFC the timestamping infrastructure I have been using for a
> > while in the tracer. It integrates the get_cycles() standardization work I did
> > more recently, following David Miller's comments.
> >
> > It also deals with 32->64-bit timestamp counter extension with an RCU-style
> > algorithm, which is especially useful on MIPS and SuperH architectures.
>
> Have you looked at the existing 32->63 extension code in
> include/linux/cnt32_to_63.h and considered unifying it?
>

Yep, I felt this code was dangerous on SMP, given that it could suffer from
the following type of race due to lack of proper barriers:

CPU A                                  CPU B
read hw cnt low
read __m_cnt_hi
                                       read hw cnt low
(wrap detected)
write __m_cnt_hi (incremented)
                                       read __m_cnt_hi
                                       (wrap detected)
                                       write __m_cnt_hi (incremented)

we therefore increment the high bits twice in the given race.

On UP, the same race could happen if the code is called with preemption
enabled.

I don't think the "volatile" qualifier would necessarily make sure the
compiler and CPU perform the __m_cnt_hi read before the hw cnt low read. A
real memory barrier to order mmio reads wrt memory reads (or an instruction
sync barrier if the value is taken from the cpu registers) would be required
to ensure such order.

I also felt it would be more solid to have per-cpu structures to keep track
of the 32->64-bit TSC updates, given that the TSCs can always be slightly
out of sync:

CPU A                                  CPU B
read __m_cnt_hi
read hw cnt low (+200 cycles)
(wrap detected)
write __m_cnt_hi (incremented)
                                       read __m_cnt_hi
                                       read hw cnt low (-200 cycles)
                                       (no wrap)
                                       -> bogus value returned.




> > There is also a TSC synchronization test within this patchset to detect
> > unsynchronized TSCs.
>
> We already have such code, no? Does this code replace that one or just
> add a second test?
>

It adds a second test, which seems more solid to me than the existing
x86 tsc_sync detection code.

> > See comments in this specific patch to figure out the
> > difference between the current x86 tsc_sync.c and the one I propose in this
> > patch.
>
> Right so you don't unify, that's a missed opportunity, no?
>

Yep, if we can switch the current x86 tsc_sync code to use my
architecture-agnostic implementation, that would be a gain. We could
probably port other tsc sync detection code (ia64?) to use this
infrastructure too.


> > It also provides an architecture-agnostic fallback in case there is no
> > timestamp counter available: basically, it's
> > (jiffies << 13) | atomically_incremented_counter (if there are more than 8192
> > events per jiffy, time will still be monotonic, but will increment faster than
> > the actual system frequency).
> >
> > Comments are welcome. Note that this is only the beginning of the patchset. I
> > plan to submit the event ID allocation/portable event typing (aimed at
> > exporting the data to userspace) and the buffering mechanism as soon as I
> > integrate a core version of the LTTV userspace tools into the kernel build
> > tree. Other than that, I currently have a tracer which fulfills most of the
> > requirements expressed earlier. I just fear that if I release only the kernel
> > part, without a foolproof binary-to-ASCII trace decoder within the kernel,
> > people might be a bit reluctant to fetch a separate userspace package.
>
> It might be good to drop all the ltt naming and pick more generic names,
> esp. as ftrace could use a lot of this infrastructure as well.
>

Sure. I've done all this development as part of the LTTng project, but I
don't care about renaming stuff. trace_clock() seems like a good name for
the trace clock source. The unsynchronized TSC detection and the 32->64-bit
TSC extension would also probably require more generic names (and would
benefit from being moved to kernel/).

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-21 00:20:46

by Nicolas Pitre

Subject: Re: [RFC patch 00/15] Tracer Timestamping

On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:

> * Peter Zijlstra ([email protected]) wrote:
> > Have you looked at the existing 32->63 extension code in
> > include/linux/cnt32_to_63.h and considered unifying it?
> >
>
> Yep, I felt this code was dangerous on SMP, given that it could suffer from
> the following type of race due to lack of proper barriers:

You are wrong.

> CPU A                                  CPU B
> read hw cnt low
> read __m_cnt_hi
>                                        read hw cnt low
> (wrap detected)
> write __m_cnt_hi (incremented)
>                                        read __m_cnt_hi
>                                        (wrap detected)
>                                        write __m_cnt_hi (incremented)
>
> we therefore increment the high bits twice in the given race.

No. Either you do the _same_ incrementation twice effectively writing
the _same_ high value twice, or you don't have simultaneous wrap
detections. Simulate it on paper with real numbers if you are not
convinced.
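
Concretely, with the cnt32_to_63 update rule shown earlier: suppose both
CPUs read __m_cnt_hi == 0x80000000 just after the low word has wrapped past
zero. Each computes (0x80000000 ^ 0x80000000) + (0x80000000 >> 31) =
0x00000001, and each stores 0x00000001. The second write merely repeats the
first, so the high bits are not incremented twice.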

> On UP, the same race could happen if the code is called with preemption
> enabled.

Wrong again.

> I don't think the "volatile" qualifier would necessarily make sure the
> compiler and CPU perform the __m_cnt_hi read before the hw cnt low read. A
> real memory barrier to order mmio reads wrt memory reads (or an instruction
> sync barrier if the value is taken from the cpu registers) would be required
> to ensure such order.

That's easy enough to fix, right?

> I also felt it would be more solid to have per-cpu structures to keep track
> of the 32->64-bit TSC updates, given that the TSCs can always be slightly
> out of sync:

If the low part is a per CPU value then the high part has to be a per
CPU value too. There only needs to be a per-CPU variant of the same
algorithm where the only difference is that __m_cnt_hi would be per CPU.

If the low part is global then __m_cnt_hi has to be global, even on SMP.
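
A sketch of such a per-CPU variant, using the same update rule (the names
here are hypothetical):

#include <linux/percpu.h>
#include <linux/types.h>

static DEFINE_PER_CPU(u32, __m_cnt_hi_pcpu);

/*
 * Per-CPU variant of cnt32_to_63: identical wrap rule, but the high
 * word lives in a per-CPU variable. The caller must stay on the same
 * CPU (e.g. preemption disabled) across the call.
 */
static inline u64 percpu_cnt32_to_63(u32 cnt_lo)
{
	u32 hi = __get_cpu_var(__m_cnt_hi_pcpu);

	if ((s32)(hi ^ cnt_lo) < 0) {
		/* half-period boundary crossed: flip MSB, propagate carry */
		hi = (hi ^ 0x80000000) + (hi >> 31);
		__get_cpu_var(__m_cnt_hi_pcpu) = hi;
	}
	return ((u64)hi << 32) | cnt_lo;	/* callers keep the low 63 bits */
}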


Nicolas

2008-10-21 01:32:36

by Mathieu Desnoyers

Subject: Re: [RFC patch 00/15] Tracer Timestamping

* Nicolas Pitre ([email protected]) wrote:
> On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:
>
> > * Peter Zijlstra ([email protected]) wrote:
> > > Have you looked at the existing 32->63 extension code in
> > > include/linux/cnt32_to_63.h and considered unifying it?
> > >
> >
> > Yep, I felt this code was dangerous on SMP, given that it could suffer from
> > the following type of race due to lack of proper barriers:
>
> You are wrong.
>
> > CPU A                                  CPU B
> > read hw cnt low
> > read __m_cnt_hi
> >                                        read hw cnt low
> > (wrap detected)
> > write __m_cnt_hi (incremented)
> >                                        read __m_cnt_hi
> >                                        (wrap detected)
> >                                        write __m_cnt_hi (incremented)
> >
> > we therefore increment the high bits twice in the given race.
>
> No. Either you do the _same_ incrementation twice effectively writing
> the _same_ high value twice, or you don't have simultaneous wrap
> detections. Simulate it on paper with real numbers if you are not
> convinced.
>

Hi Nicolas,

Yup, you are right. However, the case where one CPU sees the clock source a
little bit off-sync (late) still poses a problem. Example follows:

CPU A                                     CPU B
read __m_cnt_hi (0x80000000)
read hw cnt low (0x00000001)
(wrap detected:
 (s32)(0x80000000 ^ 0x1) < 0)
write __m_cnt_hi = 0x00000001
return 0x0000000100000001
                                          read __m_cnt_hi (0x00000001)
                                          (late) read hw cnt low (0xFFFFFFFA)
                                          (wrap detected:
                                           (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
                                          write __m_cnt_hi = 0x80000001
                                          return 0x80000001FFFFFFFA
                                          (time jumps)
read __m_cnt_hi (0x80000001)
read hw cnt low (0x00000020)
(wrap detected:
 (s32)(0x80000001 ^ 0x20) < 0)
write __m_cnt_hi = 0x00000002
return 0x0000000200000020
(time jumps again)

A similar situation can be generated by out-of-order hi/low bit reads.

> > On UP, the same race could happen if the code is called with preemption
> > enabled.
>
> Wrong again.
>
> > I don't think the "volatile" qualifier would necessarily make sure the
> > compiler and CPU perform the __m_cnt_hi read before the hw cnt low read. A
> > real memory barrier to order mmio reads wrt memory reads (or an instruction
> > sync barrier if the value is taken from the cpu registers) would be required
> > to ensure such order.
>
> That's easy enough to fix, right?
>

For memory read vs cycle counter read, something like the "data_barrier(x)"
on powerpc would probably be enough when available. We would need something
that orders memory reads with the rest of the instruction execution. On
other architectures, smp_rmb() + get_cycles_barrier() (the latter being an
instruction sync) would be required.

For memory read vs mmio read, an smp_rmb() should be enough.
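
Something like the following sketch (hypothetical names; read_hw_cnt()
stands for whatever counter accessor the platform provides):

#include <linux/types.h>
#include <asm/system.h>		/* smp_rmb() */

extern u32 read_hw_cnt(void);	/* hypothetical mmio/cycle counter read */

static volatile u32 __m_cnt_hi;

static inline u64 cnt32_to_64_ordered(void)
{
	u32 hi, lo;

	hi = __m_cnt_hi;
	/*
	 * Order the __m_cnt_hi load before the counter read: smp_rmb()
	 * for an mmio counter; a register-based cycle counter would need
	 * an instruction sync barrier instead.
	 */
	smp_rmb();
	lo = read_hw_cnt();

	if ((s32)(hi ^ lo) < 0) {	/* same wrap rule as cnt32_to_63 */
		hi = (hi ^ 0x80000000) + (hi >> 31);
		__m_cnt_hi = hi;
	}
	return ((u64)hi << 32) | lo;	/* callers keep the low 63 bits */
}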

> > I also felt it would be more solid to have per-cpu structures to keep track
> > of the 32->64-bit TSC updates, given that the TSCs can always be slightly
> > out of sync:
>
> If the low part is a per CPU value then the high part has to be a per
> CPU value too. There only needs to be a per-CPU variant of the same
> algorithm where the only difference is that __m_cnt_hi would be per CPU.
>
> If the low part is global then __m_cnt_hi has to be global, even on SMP.
>

Hrm, given the performance impact of the barriers that would need to be
added to ensure correct read order of the hi and low bits, and also
considering the risk that a "synchronized" timestamp counter off by a few
cycles poses in terms of time jumping forward, I don't see why we would
ever want to use a global __m_cnt_hi? I would therefore recommend always
using a per-cpu __m_cnt_hi.

Also, a small nit: to make sure cnt32_to_63 never gets scheduled out with a
stale __m_cnt_hi value, its comments should probably say that it always has
to be used with preemption off.

Mathieu

>
> Nicolas

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-21 02:32:57

by Nicolas Pitre

Subject: Re: [RFC patch 00/15] Tracer Timestamping

On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:

> * Nicolas Pitre ([email protected]) wrote:
> > On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:
> >
> > > * Peter Zijlstra ([email protected]) wrote:
> > > > Have you looked at the existing 32->63 extension code in
> > > > include/linux/cnt32_to_63.h and considered unifying it?
> > > >
> > >
> > > Yep, I felt this code was dangerous on SMP, given that it could suffer from
> > > the following type of race due to lack of proper barriers:
> >
> > You are wrong.
> >
> > > CPU A                                  CPU B
> > > read hw cnt low
> > > read __m_cnt_hi
> > >                                        read hw cnt low
> > > (wrap detected)
> > > write __m_cnt_hi (incremented)
> > >                                        read __m_cnt_hi
> > >                                        (wrap detected)
> > >                                        write __m_cnt_hi (incremented)
> > >
> > > we therefore increment the high bits twice in the given race.
> >
> > No. Either you do the _same_ incrementation twice effectively writing
> > the _same_ high value twice, or you don't have simultaneous wrap
> > detections. Simulate it on paper with real numbers if you are not
> > convinced.
> >
>
> Hi Nicolas,
>
> Yup, you are right. However, the case where one CPU sees the clock source a
> little bit off-sync (late) still poses a problem. Example follows:
>
> CPU A                                     CPU B
> read __m_cnt_hi (0x80000000)
> read hw cnt low (0x00000001)
> (wrap detected:
>  (s32)(0x80000000 ^ 0x1) < 0)
> write __m_cnt_hi = 0x00000001
> return 0x0000000100000001
>                                           read __m_cnt_hi (0x00000001)
>                                           (late) read hw cnt low (0xFFFFFFFA)
>                                           (wrap detected:
>                                            (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
>                                           write __m_cnt_hi = 0x80000001
>                                           return 0x80000001FFFFFFFA
>                                           (time jumps)

If this is through a global counter then I have a big problem believing
you.

If this is a per-CPU counter then you just need a per-CPU version of
__m_cnt_hi.

> A similar situation can be generated by out-of-order hi/low bit reads.

This, of course, should and can be prevented. No big deal.

> > If the low part is a per CPU value then the high part has to be a per
> > CPU value too. There only needs to be a per-CPU variant of the same
> > algorithm where the only difference is that __m_cnt_hi would be per CPU.
> >
> > If the low part is global then __m_cnt_hi has to be global, even on SMP.
> >
>
> Hrm, given the performance impact of the barriers that would need to be
> added to ensure correct read order of the hi and low bits, and also
> considering the risk that a "synchronized" timestamp counter off by a
> few cycles poses in terms of time jumping forward, I don't see why we
> would ever want to use a global __m_cnt_hi?

I still would like to know just _how_ you could manage to have one CPU
perform two reads and one write and manage for another CPU to see that
just written value but read an even older value from the global counter
than what the first CPU saw.

> I would therefore recommend always using a per-cpu __m_cnt_hi.

No, not until you convince me of the above nonsense. Sure, it will work
with a per-CPU __m_cnt_hi of course, but then the timing constraint to keep
the algorithm warm applies to each CPU which, if the source counter is
global, shouldn't be necessary.

If the 63-bit counter is queried often enough on each CPU then it is of
course advantageous to go per-CPU to avoid excessive cache bouncing, and
if you start worrying about cache bouncing then you certainly call this
code often enough to keep everything in sync. But that need not always be
the case.

> Also, a small nit: to make sure cnt32_to_63 never gets scheduled out with a
> stale __m_cnt_hi value, its comments should probably say that it always has
> to be used with preemption off.

NO! THAT'S THE WHOLE POINT OF THIS CODE: TO NOT NEED ANY KIND OF
LOCKING! The only requirement is that you must not be preempted away for
longer than half (or a quarter if you want to play it real safe) the period
of the base counter. If you cannot guarantee that, then either your base
counter is so fast that it wraps so often as to be unusable, or you have
another and bigger problem for being preempted for so long.
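
To put rough numbers on that constraint (illustrative figures): a 32-bit
counter clocked at 500 MHz wraps every 2^32 / 500000000 ~= 8.6 seconds, so
the half-period budget is about 4.3 seconds and the quarter-period one
about 2.1 seconds. A preemption would have to last longer than that for a
wrap to be missed.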


Nicolas

2008-10-21 04:05:56

by Mathieu Desnoyers

Subject: Re: [RFC patch 00/15] Tracer Timestamping

* Nicolas Pitre ([email protected]) wrote:
> On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:
>
> > * Nicolas Pitre ([email protected]) wrote:
> > > On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:
> > >
> > > > * Peter Zijlstra ([email protected]) wrote:
> > > > > Have you looked at the existing 32->63 extension code in
> > > > > include/linux/cnt32_to_63.h and considered unifying it?
> > > > >
> > > >
> > > > Yep, I felt this code was dangerous on SMP, given that it could suffer from
> > > > the following type of race due to lack of proper barriers:
> > >
> > > You are wrong.
> > >
> > > > CPU A                                  CPU B
> > > > read hw cnt low
> > > > read __m_cnt_hi
> > > >                                        read hw cnt low
> > > > (wrap detected)
> > > > write __m_cnt_hi (incremented)
> > > >                                        read __m_cnt_hi
> > > >                                        (wrap detected)
> > > >                                        write __m_cnt_hi (incremented)
> > > >
> > > > we therefore increment the high bits twice in the given race.
> > >
> > > No. Either you do the _same_ incrementation twice effectively writing
> > > the _same_ high value twice, or you don't have simultaneous wrap
> > > detections. Simulate it on paper with real numbers if you are not
> > > convinced.
> > >
> >
> > Hi Nicolas,
> >
> > Yup, you are right. However, the case where one CPU sees the clock source a
> > little bit off-sync (late) still poses a problem. Example follows:
> >
> > CPU A                                     CPU B
> > read __m_cnt_hi (0x80000000)
> > read hw cnt low (0x00000001)
> > (wrap detected:
> >  (s32)(0x80000000 ^ 0x1) < 0)
> > write __m_cnt_hi = 0x00000001
> > return 0x0000000100000001
> >                                           read __m_cnt_hi (0x00000001)
> >                                           (late) read hw cnt low (0xFFFFFFFA)
> >                                           (wrap detected:
> >                                            (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> >                                           write __m_cnt_hi = 0x80000001
> >                                           return 0x80000001FFFFFFFA
> >                                           (time jumps)
>
> If this is through a global counter then I have a big problem believing
> you.
>

If the global counter read through mmio ensures that the values
monotonically increase, and if the proper memory barriers are there, it
should be OK: two consecutive reads done on different CPUs should never
go backward.

> If this is a per-CPU counter then you just need a per-CPU version of
> __m_cnt_hi.
>

Counters like the MIPS 32-bit timestamp counter are supposedly synchronized
across cores, and could thus be thought of as global, but they could suffer
from such small drift due to inexact counter synchronization.

> > A similar situation can be generated by out-of-order hi/low bit reads.
>
> This, of course, should and can be prevented. No big deal.
>
> > > If the low part is a per CPU value then the high part has to be a per
> > > CPU value too. There only needs to be a per-CPU variant of the same
> > > algorithm where the only difference is that __m_cnt_hi would be per CPU.
> > >
> > > If the low part is global then __m_cnt_hi has to be global, even on SMP.
> > >
> >
> > Hrm, given the performance impact of the barriers that would need to be
> > added to ensure correct read order of the hi and low bits, and also
> > considering the risk that a "synchronized" timestamp counter off by a
> > few cycles poses in terms of time jumping forward, I don't see why we
> > would ever want to use a global __m_cnt_hi?
>
> I still would like to know just _how_ you could manage to have one CPU
> perform two reads and one write and manage for another CPU to see that
> just written value but read an even older value from the global counter
> than what the first CPU saw.
>

As I said above, it would be OK with a global mmio counter, provided the
proper barriers are used.

> > I would therefore recommend always using a per-cpu __m_cnt_hi.
>
> No, not until you convince me of the above nonsense. Sure, it will work
> with a per-CPU __m_cnt_hi of course, but then the timing constraint to keep
> the algorithm warm applies to each CPU which, if the source counter is
> global, shouldn't be necessary.
>

True, with an mmio time source, only a single counter seems necessary.
Only when there are slightly-off time sources (such as an "almost
synchronized" TSC) would the per-cpu counter be needed.

> If the 63-bit counter is queried often enough on each CPU then it is of
> course advantageous to go per-CPU to avoid excessive cache bouncing, and
> if you start worrying about cache bouncing then you certainly call this
> code often enough to keep everything in sync. But that need not always be
> the case.

This code should only cache-bounce about once or twice per overflow period.
The rest of the time, it should only read the __m_cnt_hi value for
comparison. Therefore I don't think cache-line bouncing is such an issue
here. OTOH, added barriers can degrade performance a lot.

>
> > Also, a small nit: to make sure cnt32_to_63 never gets scheduled out with a
> > stale __m_cnt_hi value, its comments should probably say that it always has
> > to be used with preemption off.
>
> NO! THAT'S THE WHOLE POINT OF THIS CODE: TO NOT NEED ANY KIND OF
> LOCKING! The only requirement is that you must not be preempted away for
> longer than half (or a quarter if you want to play it real safe) the period
> of the base counter. If you cannot guarantee that, then either your base
> counter is so fast that it wraps so often as to be unusable, or you have
> another and bigger problem for being preempted for so long.
>

Or maybe you are just unlucky and a user hits CTRL-Z at the wrong time? I
could be wrong, and maybe there is no way to "stop" a thread while it's in
kernel space, but unless there are restrictions on where a thread can be
stopped, I assume such a scenario could cause a very long preemption if
preemption is not disabled.

Even if such a thread stop is not possible, assuming the scheduler will
never leave _any_ thread off the runqueue for a few seconds seems
questionable.

Mathieu

>
> Nicolas

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68